xmame-0.83.1

ROUND 64.... FIGHT!

  • Linux 2.6.7

    7-1-2004

    GAME  32-bit performance  64-bit performance  %speedup with 64-bit  64-bit with gcc 3.4.0  %speedup with gcc 3.4.0

    xmame-0.81.1

    5-26-2004

    The purpose of this experiment is to quantify the performance difference between 32-bit and 64-bit compiles of xmame on a native 64-bit X86-64 (aka AMD64) Linux OS. This is in response to the frequently updated mame32 benchmarks and the X86-64 xmame mailing list thread.

    Background: Provided one has a native amd64 OS, a set of 64-bit and 32-bit libraries, and a compiler that supports both 32-bit and 64-bit compilation, it is possible to run 64-bit programs on the same machine as 32-bit programs. 32-bit assembly code cannot be mixed into a 64-bit program, so the assembly CPU emulators and the MIPS dynamic recompiler are not tested here. For this experiment I chose Linux as the 64-bit OS because its 64-bit support is currently more mature than Windows', and I can't stand Windows anyway.
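    Whether a process runs as 32-bit or 64-bit code shows up directly in the sizes of C's basic types: 32-bit (-m32) builds use the ILP32 model (4-byte int, long and pointer), while native x86-64 Linux uses LP64 (4-byte int, 8-byte long and pointer). The following is a minimal Python sketch, assuming a Python interpreter with ctypes is installed; it is not part of xmame, and the helper names are hypothetical. Running it under a 64-bit and a 32-bit interpreter on the same machine should report the two different models.

```python
import ctypes
import platform

# Hypothetical helpers for illustration; not part of xmame or the author's tooling.


def data_model():
    """Sizes in bytes of C int, long and pointer for the running interpreter."""
    return {
        "int": ctypes.sizeof(ctypes.c_int),
        "long": ctypes.sizeof(ctypes.c_long),
        "pointer": ctypes.sizeof(ctypes.c_void_p),
    }


def model_name(sizes):
    """Name the data model: ILP32 (-m32 builds) or LP64 (native x86-64 Linux)."""
    if sizes["int"] == sizes["long"] == sizes["pointer"] == 4:
        return "ILP32"
    if sizes["int"] == 4 and sizes["long"] == 8 and sizes["pointer"] == 8:
        return "LP64"
    return "other"  # e.g. LLP64 on 64-bit Windows


if __name__ == "__main__":
    sizes = data_model()
    # platform.machine() reports the kernel's architecture, which is normally
    # x86_64 here even when the process itself is 32-bit.
    print(platform.machine(), sizes, model_name(sizes))
```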

    Hardware

  • Athlon64 FX-53 (2.4 GHz)
  • ASUS SK8V
  • 2 GB registered ECC RAM (4x512 MB, 3-3-3-8, PC3200) [memtested]
  • ATI All-in-Wonder Radeon [see below for driver notes]
  • desktop resolution 1024x768x24

    Software

  • Linux 2.6.6, stock kernel, 64-bit
  • Gentoo Linux distribution (amd64)
  • gcc-3.3.3
  • glibc-2.3.2-r9
  • Whole system conservatively compiled (-O2 -pipe) for safety
  • xorg-x11-6.7.0 with open-source radeon driver
  • NOTE: The All-in-Wonder Radeon (R100) was used instead of a Radeon 9700 Pro (R300) because the open-source (non-ATI) radeon driver IS TERRIBLE. 2D is always slow on the R300 regardless of mode. 2D is fast on the R100 except in xmame DGA mode. And X loves to lock up regardless of card.

    Test Methodology

  • I vehemently disagree with the mame32 assertion that -ftr 500 is sufficient: 500 frames isn't even enough to get past the diagnostics in V-unit games. Since I'm benchmarking fewer games, I use -ftr 10000.
  • I used a very small Perl script to manage the benchmarks. The overhead should be small and consistent across all tests.
  • A variety of games were chosen: some to match the mame32 choices, some to match my own old benchmarks, and others for variety. The goal was to stress several CPU types, vector/tile/bitmap graphics, sound, programming styles, scaling, small and large games, old and new games, etc. Consequently there are a few Neo Geo games, which is somewhat redundant.
  • xmame-0.81.1 was chosen instead of 0.82.1 due to some known brokenness in 0.82.1.
  • A regular windowed X11 display mode was used instead of DGA for speed reasons. pacman at 140 FPS in DGA mode is a joke compared to 1500+ FPS in windowed mode, which again points to driver problems.
  • xscreensaver was disabled, of course!
  • The ASM 68000 CPU and MIPS_DRC were not used in any test to make the comparison fair.
  • xmame was compiled manually.
  • 64-bit settings: CFLAGS = -O2 -Wall -Wno-unused -pipe -fomit-frame-pointer -fstrict-aliasing -fstrength-reduce -ffast-math
  • 32-bit settings (linked with 32-bit libraries): CFLAGS = -O2 -Wall -Wno-unused -pipe -fomit-frame-pointer -fstrict-aliasing -fstrength-reduce -ffast-math -m32
  • Tested with -rompath [path] -samplepath [path] -b 32 -arbheight 0 -heightscale 1 -widthscale 1 -x11-mode 0 -effect 0 -noautodouble -noscanlines -frameskipper 0 -nothrottle -nosleepidle -noautoframeskip -frameskip 0 -noartwork -nobezel -nooverlay -geometry 1024x768 -xsync -noprivatecmap -noxil -skip_disclaimer -skip_gameinfo -noloadconfig -ftr 10000 -nop -sf 44100
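    The author's harness was a small Perl script that is not shown here. As a rough Python equivalent, the sketch below builds the command line listed above and extracts an FPS figure from each run. It assumes a binary name such as xmame.x11 and, in particular, the wording of xmame's end-of-run FPS report; FPS_PATTERN is a hypothetical regex to adjust to your build's actual output. The emulated_seconds helper shows why -ftr 500 is short: assuming roughly 60 Hz refresh (it varies per game), 500 frames cover only about 8.3 seconds of game time, versus about 167 seconds for 10000.

```python
import re
import subprocess

# Hypothetical harness sketch; not the author's Perl script.
# Options from the test command line above, minus -rompath/-samplepath/-ftr,
# which build_command() fills in.
BENCH_FLAGS = (
    "-b 32 -arbheight 0 -heightscale 1 -widthscale 1 -x11-mode 0 -effect 0 "
    "-noautodouble -noscanlines -frameskipper 0 -nothrottle -nosleepidle "
    "-noautoframeskip -frameskip 0 -noartwork -nobezel -nooverlay "
    "-geometry 1024x768 -xsync -noprivatecmap -noxil -skip_disclaimer "
    "-skip_gameinfo -noloadconfig -nop -sf 44100"
).split()

# ASSUMPTION: hypothetical pattern for xmame's end-of-run FPS report; check
# what your build actually prints after -ftr and adjust accordingly.
FPS_PATTERN = re.compile(r"average FPS:?\s*([0-9]+(?:\.[0-9]+)?)", re.IGNORECASE)


def build_command(binary, game, rompath, samplepath, frames=10000):
    """Argument list for one timed run of `game` (binary name is an assumption)."""
    return ([binary, "-rompath", rompath, "-samplepath", samplepath]
            + BENCH_FLAGS + ["-ftr", str(frames), game])


def emulated_seconds(frames, refresh_hz=60.0):
    """Game time covered by `frames`, ASSUMING a ~60 Hz refresh (varies per game)."""
    return frames / refresh_hz


def parse_fps(output):
    """FPS figure found in xmame's output, or None (e.g. the run crashed)."""
    match = FPS_PATTERN.search(output)
    return float(match.group(1)) if match else None


def run_game(binary, game, rompath, samplepath, frames=10000):
    """Run one benchmark and return its FPS, or None if no figure was printed."""
    proc = subprocess.run(build_command(binary, game, rompath, samplepath, frames),
                          capture_output=True, text=True)
    return parse_fps(proc.stdout + proc.stderr)
```

    A call such as run_game("xmame.x11", "pacman", "/path/to/roms", "/path/to/samples") would then yield one table entry, with None standing in for the "crash" cells.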
    GAME        32-bit (fps)  64-bit (fps)  64-bit speedup
    crusnusa       49.654994     56.156599         13.09 %
    dkong        1265.035100   1306.420889          3.27 %
    ga2           257.016928    230.427070        -10.35 %
    kinst2          5.872204      6.341450          7.99 %
    kof2000       368.906005    382.610154          3.71 %
    mk2           144.338047    144.263163         -0.05 %
    mk            241.290272    246.329293          2.09 %
    mslugx        350.864648    358.762169          2.25 %
    pacman       1488.463367   1541.355730          3.55 %
    pitfight      368.660985    395.256995          7.21 %
    punchout      590.176837    613.401611          3.94 %
    rastan        657.029360    698.528849          6.32 %
    samsho        390.677070    403.090300          3.18 %
    soldivid      350.878510         crash             n/a
    souledgb       49.573759     59.177503         19.37 %
    ssf2t         365.694318    379.838434          3.87 %
    stunrun       181.899974         crash             n/a
    tempest       319.210055    324.382548          1.62 %
    umk3          143.769653    143.247326         -0.36 %
    wargods        52.316578     59.153097         13.07 %
    xmen          409.296365    447.504120          9.33 %
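    The speedup column is the ratio of the two frame rates minus one, expressed as a percentage; crashed runs have no defined value. A minimal Python sketch (speedup_percent is a hypothetical helper) that reproduces a few rows of the table above:

```python
# Hypothetical helper for illustration; not part of xmame.


def speedup_percent(fps_32, fps_64):
    """Percent speedup of the 64-bit build over the 32-bit build.

    Returns None when either run crashed (the table's "n/a" entries).
    """
    if fps_32 is None or fps_64 is None:
        return None
    return (fps_64 / fps_32 - 1.0) * 100.0


# A few rows from the table above: (32-bit fps, 64-bit fps); None marks a crash.
ROWS = {
    "crusnusa": (49.654994, 56.156599),
    "ga2": (257.016928, 230.427070),
    "xmen": (409.296365, 447.504120),
    "stunrun": (181.899974, None),
}

if __name__ == "__main__":
    for game, (fps_32, fps_64) in ROWS.items():
        pct = speedup_percent(fps_32, fps_64)
        print(f"{game:10s} " + ("n/a" if pct is None else f"{pct:.2f} %"))
```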

    Performance Conclusions

    The best aspect of X86-64 is that the extra registers offset the bloat of the 64-bit extensions, and these numbers show that only a few games are consistently slower under 64-bit xmame. This makes X86-64 one of the few (only?) 64-bit ISAs where 32-bit code is generally slower. (On MIPS, 32-bit is preferred for speed.) I consistently see mk2 and umk3 run slower on 64-bit xmame by a small margin, and ga2 is consistently about 10% slower. It would be interesting to see what these drivers do that makes them slower, and, of course, to avoid writing code that way.

    Older games generally benefit least from X86-64, with differences under 10%. Newer games benefit more, generally 8-20%. It's almost like having a "free overclock" relative to a traditional 32-bit PC.
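    To put a single number on the "free overclock", one option (my choice here, not the author's method) is the geometric mean of the per-game fps ratios from the -O2 table, scaled by the FX-53's 2.4 GHz clock. This sketch assumes emulation speed scales linearly with clock frequency, which is only approximately true because memory latency does not scale with the core clock; the helper names are hypothetical.

```python
import math

# Hypothetical helpers for illustration; not part of the original benchmarks.
# "64-bit speedup" column of the gcc 3.3.3 -O2 table; crashed games omitted.
SPEEDUPS_O2 = [13.09, 3.27, -10.35, 7.99, 3.71, -0.05, 2.09, 2.25, 3.55, 7.21,
               3.94, 6.32, 3.18, 19.37, 3.87, 1.62, -0.36, 13.07, 9.33]


def geometric_mean_ratio(percentages):
    """Geometric mean of the fps ratios implied by percent speedups."""
    logs = [math.log1p(p / 100.0) for p in percentages]
    return math.exp(sum(logs) / len(logs))


def equivalent_clock_ghz(base_ghz, ratio):
    """Clock a 32-bit build would need to match `ratio`, ASSUMING fps scales
    linearly with core clock (only approximately true in practice)."""
    return base_ghz * ratio


if __name__ == "__main__":
    ratio = geometric_mean_ratio(SPEEDUPS_O2)
    clock = equivalent_clock_ghz(2.4, ratio)
    print(f"mean ratio {ratio:.4f}, roughly a {clock:.2f} GHz 32-bit FX-53")
```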

    gcc-3.3.3 does not have good knowledge of the K8 pipeline, nor does it have newer features like gcc-3.4.0's -funit-at-a-time (implied by -O2?) or -fweb (implied by -O3?). gcc 3.4.0 will probably raise the above scores by a noticeable percentage; some have speculated that it produces executables 10-15% faster. This is worth testing!

    MAME Problems Encountered

  • 64-bit compiles produce many warnings about pointers (64-bit) being cast to integers (32-bit) of a different size. It would be good to clean those up.
  • soldivid crashes in 64-bit.
  • stunrun crashes in 64-bit. Still broken in 0.82.1; please confirm whether it works in 0.82u3.
  • biofreak hangs on a black screen in 64-bit. Seems OK in 0.82.1.
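    The pointer-cast warnings matter because x86-64 Linux uses 64-bit pointers but 32-bit ints. GCC documents that casting a pointer to a narrower integer discards the most significant bits, and that converting a signed int back to a wider pointer sign-extends it, so any address at or above 2**31 is corrupted by a round trip through int. The Python sketch below models that arithmetic with ctypes fixed-width integers; the helper names and the sample address are made up for illustration. In the C source the usual cleanup is to store such values in intptr_t or uintptr_t from <stdint.h>, which are as wide as a pointer.

```python
import ctypes

# Hypothetical model of C casts on LP64; not xmame code. The sample address
# below is made up but typical of 64-bit Linux mmap/heap addresses.


def int_cast(address):
    """Model gcc's (int)ptr on LP64: keep the low 32 bits as a signed int.

    ctypes integer types do no overflow checking, so c_int32 truncates.
    """
    return ctypes.c_int32(address).value


def pointer_from_int(value, pointer_bits=64):
    """Model (void *)int_value: sign-extend the int to pointer width."""
    return value & ((1 << pointer_bits) - 1)


def survives_int_roundtrip(address, pointer_bits=64):
    """True if storing a pointer in an int and casting it back loses nothing."""
    return pointer_from_int(int_cast(address), pointer_bits) == address


if __name__ == "__main__":
    for addr in (0x1234, 0x80000000, 0x7F3A12345678):
        print(hex(addr), survives_int_roundtrip(addr))
```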

    gcc 3.3.3 with -O3 -fomit-frame-pointer

    GAME        32-bit (fps)  64-bit (fps)  64-bit speedup
    crusnusa       49.086755     58.936718         20.07 %
    dkong        1140.764264   1415.447801         24.08 %
    ga2           257.936726    234.239977         -9.19 %
    kinst2          5.839336      6.366226          9.02 %
    kof2000       366.505529    406.741893         10.98 %
    mk2           146.685502    145.081501         -1.09 %
    mk            242.281070    248.512936          2.57 %
    mslugx        345.958596    371.871160          7.49 %
    pacman       1451.273492   1629.660340         12.29 %
    pitfight      370.377723    396.107374          6.95 %
    punchout      606.254791    643.604846          6.16 %
    rastan        664.283351    733.623237         10.44 %
    samsho        390.650759    420.713386          7.70 %
    soldivid      344.246640         crash             n/a
    souledgb       51.166532     61.042728         19.30 %
    ssf2t         361.827581    390.738864          7.99 %
    stunrun       181.875532         crash             n/a
    tempest       272.897737    340.184756         24.66 %
    umk3          145.880191    143.936515         -1.33 %
    wargods        52.638714     60.428721         14.80 %
    xmen          420.078615    472.652894         12.52 %

    Conclusions

    -O3 is not universally better than my -O2 + options settings in 32-bit xmame, but it is a universal win in 64-bit xmame. The result is a noticeably larger 64-bit speedup in some games, such as kof2000 (3.71 % at -O2 versus 10.98 % at -O3), which varies less between successive runs than pacman and dkong do.
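    The -O2 versus -O3 claim can be checked mechanically against the two gcc 3.3.3 tables above. A minimal Python sketch (o3_regressions is a hypothetical helper) that lists, for each build, the games where -O3 ran slower than the -O2 + options settings:

```python
# Hypothetical helper for illustration; not part of the original benchmarks.
# game: (32-bit -O2, 64-bit -O2, 32-bit -O3, 64-bit -O3) fps from the two
# gcc 3.3.3 tables above; crashed games (soldivid, stunrun) omitted.
RESULTS = {
    "crusnusa": (49.654994, 56.156599, 49.086755, 58.936718),
    "dkong": (1265.035100, 1306.420889, 1140.764264, 1415.447801),
    "ga2": (257.016928, 230.427070, 257.936726, 234.239977),
    "kinst2": (5.872204, 6.341450, 5.839336, 6.366226),
    "kof2000": (368.906005, 382.610154, 366.505529, 406.741893),
    "mk2": (144.338047, 144.263163, 146.685502, 145.081501),
    "mk": (241.290272, 246.329293, 242.281070, 248.512936),
    "mslugx": (350.864648, 358.762169, 345.958596, 371.871160),
    "pacman": (1488.463367, 1541.355730, 1451.273492, 1629.660340),
    "pitfight": (368.660985, 395.256995, 370.377723, 396.107374),
    "punchout": (590.176837, 613.401611, 606.254791, 643.604846),
    "rastan": (657.029360, 698.528849, 664.283351, 733.623237),
    "samsho": (390.677070, 403.090300, 390.650759, 420.713386),
    "souledgb": (49.573759, 59.177503, 51.166532, 61.042728),
    "ssf2t": (365.694318, 379.838434, 361.827581, 390.738864),
    "tempest": (319.210055, 324.382548, 272.897737, 340.184756),
    "umk3": (143.769653, 143.247326, 145.880191, 143.936515),
    "wargods": (52.316578, 59.153097, 52.638714, 60.428721),
    "xmen": (409.296365, 447.504120, 420.078615, 472.652894),
}


def o3_regressions(results, bits):
    """Sorted games where -O3 ran slower than -O2 for a 32- or 64-bit build."""
    if bits == 32:
        o2, o3 = 0, 2
    elif bits == 64:
        o2, o3 = 1, 3
    else:
        raise ValueError("bits must be 32 or 64")
    return sorted(game for game, fps in results.items() if fps[o3] < fps[o2])


if __name__ == "__main__":
    print("32-bit:", o3_regressions(RESULTS, 32))
    print("64-bit:", o3_regressions(RESULTS, 64))
```

    With these figures the 64-bit list is empty, while 9 of the 19 working games (including tempest and dkong) are slower at -O3 in the 32-bit build.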


    gcc 3.4.0 compared with gcc 3.3.3

    This tests gcc 3.4.0's alleged 10-15% speedup with -march=k8 and its other enhancements. As you can see below, that goal is never achieved.

  • gcc 3.3.3 CFLAGS = -O2 -Wall -Wno-unused -pipe -fomit-frame-pointer -fstrict-aliasing -fstrength-reduce -ffast-math
  • gcc 3.4.0 CFLAGS = -O2 -march=k8 -Wall -Wno-unused -pipe -fomit-frame-pointer -fstrict-aliasing -fstrength-reduce -ffast-math
  • gcc 3.3.3 CFLAGS = -O3 -Wall -Wno-unused -pipe -fomit-frame-pointer
  • gcc 3.4.0 CFLAGS = -O3 -march=k8 -Wall -Wno-unused -pipe -fomit-frame-pointer
    GAME        gcc 3.3.3 -O2  gcc 3.4.0 -O2     %speedup  gcc 3.3.3 -O3  gcc 3.4.0 -O3     %speedup
                 64-bit (fps)   64-bit (fps)   with 3.4.0   64-bit (fps)   64-bit (fps)   with 3.4.0
    crusnusa        56.156599      60.270855       7.33 %      58.936718      61.050343       3.59 %
    dkong         1306.420889    1295.098234      -0.87 %    1415.447801    1345.679675      -4.93 %
    ga2            230.427070     235.620888       2.25 %     234.239977     235.603967       0.58 %
    kinst2           6.341450       6.408674       1.06 %       6.366226       6.887890       8.19 %
    kof2000        382.610154     408.004194       6.64 %     406.741893     391.631754      -3.71 %
    mk2            144.263163     145.184173       0.64 %     145.081501     148.429203       2.31 %
    mk             246.329293     246.219163      -0.04 %     248.512936     250.966075       0.99 %
    mslugx         358.762169     369.534169       3.00 %     371.871160     381.372748       2.56 %
    pacman        1541.355730    1556.940991       1.01 %    1629.660340    1753.823203       7.62 %
    pitfight       395.256995     404.457804       2.33 %     396.107374     419.641312       5.94 %
    punchout       613.401611     618.810579       0.88 %     643.604846     647.053232       0.54 %
    rastan         698.528849     716.890742       2.63 %     733.623237     740.113673       0.88 %
    samsho         403.090300     421.498746       4.57 %     420.713386     432.719346       2.85 %
    soldivid            crash          crash          n/a          crash          crash          n/a
    souledgb        59.177503      59.680260       0.85 %      61.042728      59.501684      -2.52 %
    ssf2t          379.838434     388.248751       2.21 %     390.738864     392.914541       0.56 %
    stunrun             crash          crash          n/a          crash          crash          n/a
    tempest        324.382548     346.557169       6.84 %     340.184756     338.134105      -0.60 %
    umk3           143.247326     144.629490       0.96 %     143.936515     147.008495       2.13 %
    wargods         59.153097      61.810082       4.49 %      60.428721      62.926341       4.13 %
    xmen           447.504120     457.206814       2.17 %     472.652894     475.177150       0.53 %

    Conclusions

    With gcc 3.4.0 I never see more than an 8.19% improvement in any game, and generally less than 5%. This deflates the idea of gaining 10-15% from -march=k8 and the new options (-funit-at-a-time should be included in -O2 by default). However, 3.4.0 does give a general improvement, and for most games -O3 is a win. The surprise is souledgb, which runs fastest with gcc 3.3.3 -O3. dkong behaves similarly, but that game has a large margin of error between successive runs.
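    The "never more than 8.19%, generally less than 5%" summary can be checked directly against the two %speedup with 3.4.0 columns. A small Python sketch, with the percentages copied from the table above and the crashed games omitted; summarize is a hypothetical helper:

```python
# Hypothetical helper for illustration; not part of the original benchmarks.
# "%speedup with 3.4.0" columns from the table above (crashed games omitted),
# in table order: crusnusa, dkong, ga2, kinst2, kof2000, mk2, mk, mslugx,
# pacman, pitfight, punchout, rastan, samsho, souledgb, ssf2t, tempest, umk3,
# wargods, xmen.
GAINS_O2 = [7.33, -0.87, 2.25, 1.06, 6.64, 0.64, -0.04, 3.00, 1.01, 2.33,
            0.88, 2.63, 4.57, 0.85, 2.21, 6.84, 0.96, 4.49, 2.17]
GAINS_O3 = [3.59, -4.93, 0.58, 8.19, -3.71, 2.31, 0.99, 2.56, 7.62, 5.94,
            0.54, 0.88, 2.85, -2.52, 0.56, -0.60, 2.13, 4.13, 0.53]


def summarize(gains, threshold=5.0):
    """(largest gain, games gaining less than `threshold` %, total games)."""
    return max(gains), sum(1 for g in gains if g < threshold), len(gains)


if __name__ == "__main__":
    for label, gains in (("-O2", GAINS_O2), ("-O3", GAINS_O3)):
        best, below, total = summarize(gains)
        print(f"{label}: best {best:.2f} %, {below}/{total} games below 5 %")
```

    For both -O2 and -O3, 16 of the 19 working games gain less than 5% from gcc 3.4.0, and the largest gain is kinst2's 8.19% at -O3.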

