xmame-0.81.1

ROUND 3.... FIGHT!

xmame-0.81.1 on the K6-3+, compiled with gcc-3.3.3. Here we compare three builds: the preset flags with -march=i586, plain -O2 with -march=i586, and the preset flags with -march=k6. The -march=k6 setting used to cause an internal compiler error (ICE) with gcc-3.2.1, but appears to work OK now. All results are in frames per second (fps).

GAME      PRESET      -O2         -O2 with k6 arch
pacman    82.570656   82.151757   87.929299
tempest   16.707578   16.489034   16.886475
samsho    38.472931   37.568450   40.814139
ssf2t     35.134646   34.969959   37.506007
xmen      43.463369   42.061477   46.157284
mslugx    36.002283   34.936075   38.027312
mk2       18.114097   18.099119   19.483827

Here we see a gain from using the preset flags (listed below) over -O2, but mating the preset flags with the K6-specific architecture optimization makes another big improvement in scores. Clearly optimizing for a specific pipeline versus a generic "i586" is important when you want every ounce of speed from gcc-3.3.3.
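To put numbers on "another big improvement", the gains can be worked out directly from the table. The following is only a sketch: the figures are copied from the gcc-3.3.3 table above, and the names RESULTS and percent_gain are made up for this illustration.

```python
# Frame rates (fps) copied from the gcc-3.3.3 table above.
# Tuple order: (PRESET with i586, -O2 with i586, k6 arch column).
RESULTS = {
    "pacman":  (82.570656, 82.151757, 87.929299),
    "tempest": (16.707578, 16.489034, 16.886475),
    "samsho":  (38.472931, 37.568450, 40.814139),
    "ssf2t":   (35.134646, 34.969959, 37.506007),
    "xmen":    (43.463369, 42.061477, 46.157284),
    "mslugx":  (36.002283, 34.936075, 38.027312),
    "mk2":     (18.114097, 18.099119, 19.483827),
}


def percent_gain(new, old):
    """Return the relative gain of `new` over `old`, in percent."""
    return (new / old - 1.0) * 100.0


for game, (preset, o2, k6) in RESULTS.items():
    print(f"{game:8s} PRESET vs -O2: {percent_gain(preset, o2):+5.1f}%   "
          f"k6 vs PRESET: {percent_gain(k6, preset):+5.1f}%")
```

In every row the k6 column beats PRESET, and PRESET beats -O2, which is the ordering described above.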


xmame-0.70.1



TEST SYSTEM OVERVIEWS


Systems used for testing xmame.

SYSTEM #1 SGI Octane w/gcc

  • CPU: dual R10000 @ 225 MHz, dual 32KB I/D L1 cache, 1 MB L2 cache
  • 1792 MB RAM, dual SI graphics with texture option, IRIX 6.5.20m
  • gcc 3.2.2

    SYSTEM #2 SGI Octane w/MIPSPro compiled version

    SYSTEM #3 SGI O2 w/gcc

  • CPU: RM5200 @ 300 MHz (R5K), dual 32KB I/D L1 cache, 1 MB L2 cache
  • 512 MB RAM, CRM graphics, IRIX 6.5.20m
  • gcc 3.2.2

    SYSTEM #4 SGI O2 w/MIPSPro compiled version

    SYSTEM #5 AMD K6-3+

  • CPU: AMD K6-3+ @ 450 MHz, 256 KB L2 cache on-chip
  • 120 MB RAM, Aladdin 7 chipset with dual-channel SDRAM, but used with fbdev in 32bpp
  • Linux 2.2.x kernel, glibc 2.1.3, XFree86 3.3.6 (no DGA, Xv)
  • gcc 2.95.3

    SYSTEM #6 AMD K6-2

  • CPU: AMD K6-2 @ 500 MHz
  • 256 MB RAM, generic VIA chipset, PCI voodoo3
  • Linux 2.2.x kernel, glibc 2.1.3, XFree86 3.3.6 (no DGA, Xv), accelerated 2D
  • gcc 2.95.3

    SGI gcc flags used

    CFLAGS = -O2 -Wall -Wno-unused -march=r5k -mabi=n32 -fomit-frame-pointer -fstrict-aliasing -fstrength-reduce -ffast-math
    

    Linux gcc flags used

    CFLAGS = -O2 -Wall -Wno-unused -march=i586 -fomit-frame-pointer -fstrict-aliasing -fstrength-reduce -ffast-math -pipe
    

    Testing methodology

  • Tests were done with minimal daemons running, at most a few xterms on-screen, and the screensaver was never invoked.
  • Vector AA was left turned on (default) for tempest.
  • Sound was forced to 44100 Hz.
  • Xv, DGA were not available. Used standard X11 display without any scaling or effects. Did use MIT Shared memory extensions.
  • ./xmame.x11.70 -rompath ../yawn/ -samplepath ../yawn -b 32 -arbheight 0 -heightscale 1 -widthscale 1 -effect 0 -noautodouble -noscanlines -frameskipper 0 -nothrottle -nosleepidle -noautoframeskip -frameskip 0 -noartwork -nobezel -nooverlay -geometry 1024x768 -xsync -noprivatecmap -noxil -skip_disclaimer -skip_gameinfo -noloadconfig -ftr 10000 -nop -sf 44100 $1
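The command line above takes the game name as its last argument ($1). As a rough sketch of how the same invocation could be generated for every game in the tables, the snippet below only builds and prints the commands; it does not run xmame. The option string is copied verbatim from the command above, while BASE_ARGS, GAMES, and build_command are hypothetical names chosen for this example.

```python
import shlex

# Options copied verbatim from the benchmark command line above.
BASE_ARGS = shlex.split(
    "./xmame.x11.70 -rompath ../yawn/ -samplepath ../yawn -b 32 "
    "-arbheight 0 -heightscale 1 -widthscale 1 -effect 0 -noautodouble "
    "-noscanlines -frameskipper 0 -nothrottle -nosleepidle -noautoframeskip "
    "-frameskip 0 -noartwork -nobezel -nooverlay -geometry 1024x768 -xsync "
    "-noprivatecmap -noxil -skip_disclaimer -skip_gameinfo -noloadconfig "
    "-ftr 10000 -nop -sf 44100"
)

# Games that appear in the result tables.
GAMES = ["pacman", "tempest", "samsho", "ssf2t", "xmen",
         "mslugx", "mk2", "crusnusa", "kinst2"]


def build_command(game):
    """Return the argument list for one benchmark run of `game`."""
    return BASE_ARGS + [game]


for game in GAMES:
    # Print the command instead of executing it.
    print(shlex.join(build_command(game)))
```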

    BUGS and other observations

  • xmame-0.70.1 compiled fine on the O2, but tms34010.c did not compile correctly on the Octane. It caused a link problem with an unresolved symbol in the STATE_SIZE computation. Since this is only used for a memcpy, I hacked STATE_SIZE to use a large number and hoped it was good enough for mk2. mk2 ran fine afterwards. It is strange that the symbol was not resolved correctly: the OS, libraries, and compiler are the same on both systems, just with a lot more other packages installed on the Octane.
  • The vector graphics in tempest are wrong on SGIs. The grid is red instead of blue, and the main ship is cyan instead of yellow.
  • The SGIs use private color map even when you tell them not to!
  • The SGIs sometimes return negative fps values on slow games.
  • The gcc build coredumps on kinst2 on the Octane and returns a negative fps value on the O2.

    GAME       R10K 225 mipspro   R10K 225 gcc    R5K 300 mipspro   R5K 300 gcc    K6-3+ 450   K6-2 500
    pacman     137.203003         117.940765      67.869871         64.536163      84.598798   47.205285
    tempest    39.997060          30.914372       17.670171         15.100367      16.651686   11.597895
    samsho     36.453655          27.079296       21.709560         18.922046      40.909001   29.152977
    ssf2t      34.426368          29.350826       18.807077         17.036615      36.107175   24.225178
    xmen       41.291096          32.940774       24.565539         22.719772      43.611827   30.895375
    mslugx     31.018900          22.625514       18.109691         16.165538      37.525928   26.111407
    mk2        36.961467          30.826279       22.559168         19.232059      35.976701   24.091125
    crusnusa   ??-4.932447??      ??-6.897904??   ????              ????           7.152422    5.583496
    kinst2     6.805552           COREDUMP!       5.547025          ??-5.04973??   11.892078   11.128569

    observations

  • Old games seem cache-limited. We may be seeing this when comparing Athlon64 3000+ to 3200+ as well. With 1 MB of L2 cache, the 225 MHz R10000 (MIPSpro build) beats the K6-3+ 450 by over 50% in pacman (137.2 vs. 84.6 fps).
  • Even though the K6-3+ was hampered by 10% less MHz than the K6-2, shared video memory, and VESA framebuffer, it is always faster than the K6-2 500. The biggest difference between these two chips is the L2 cache is now on-chip. Also I think there is better write-combining. The speedup is large in older games, moderate in larger games, and minimal on newer games.
  • MIPSpro demonstrably produces faster code than gcc 3.2.2
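The K6-3+ versus K6-2 observation can be checked against the table by computing the per-game speedup. This is a minimal sketch: the fps values are copied from the table above, and K6_FPS and speedup are names invented for the example.

```python
# fps copied from the table above: (K6-3+ 450, K6-2 500).
K6_FPS = {
    "pacman":   (84.598798, 47.205285),
    "tempest":  (16.651686, 11.597895),
    "samsho":   (40.909001, 29.152977),
    "ssf2t":    (36.107175, 24.225178),
    "xmen":     (43.611827, 30.895375),
    "mslugx":   (37.525928, 26.111407),
    "mk2":      (35.976701, 24.091125),
    "crusnusa": (7.152422, 5.583496),
    "kinst2":   (11.892078, 11.128569),
}


def speedup(k6_3, k6_2):
    """Return how many times faster the K6-3+ result is than the K6-2 result."""
    return k6_3 / k6_2


# List the games from largest to smallest speedup.
ranked = sorted(K6_FPS, key=lambda g: speedup(*K6_FPS[g]), reverse=True)
for game in ranked:
    print(f"{game:9s} {speedup(*K6_FPS[game]):.2f}x")
```

The ranking puts pacman at the top and kinst2 at the bottom, in line with the note that the speedup shrinks on newer games.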

    questions

  • L2 cache helps pacman tremendously but cannot salvage performance in newer games. Is it possible that L2 cache is irrelevant for MAME until the entire game emulation image can fit inside? Seems that adding enough L2 cache to keep the core fed helps games from the early '90s and '80s but beyond a certain point only increasing the core efficiency/speed matters. The K6-2 was always somewhat starved and the K6-3 alleviated this. But having 4 times the L2 cache on the SGIs doesn't stop them from being half as fast on large games.
  • Can any MIPS-based SGI hit 30 fps in kinst2? Since even the R16000A is a R10K descendant we would expect an 800 MHz version to scale--at most--linearly in xmame. The best SGI MIPS CPUs are 800 MHz with a few MB L2 cache.
  • Would moving the L2 cache on-chip boost the SGIs' performance as it did on the K6-3+?

    ROUND 2.... FIGHT!

    The comparison above was illustrative of the differences between a MIPSpro compile and a gcc 3.2.2 compile, and compared to a slightly newer gcc compile on Linux. Now let's test things more fairly.


    Which CFLAGS are best?

    First we hold gcc versions constant and try different CFLAGS.

    Configurations

    Tests performed on the K6-3+
    PRESET:

    CFLAGS = -O2 -Wall -Wno-unused -march=i586 -fomit-frame-pointer -fstrict-aliasing -fstrength-reduce -ffast-math -pipe
    

    -O2
    CFLAGS = -O2 -Wall -Wno-unused -march=i586
    

    -O3
    CFLAGS = -O3 -Wall -Wno-unused -march=i586
    

    gcc 2.95.3
    GAME       PRESET      -O2         -O3
    pacman     84.598798   80.678981   80.058366
    tempest    16.651686   16.600993   16.581389
    samsho     40.909001   39.387096   39.637885
    ssf2t      36.107175   35.793585   36.161937
    xmen       43.611827   43.051415   42.871190
    mslugx     37.525928   36.294775   36.812401
    mk2        35.976701   34.635191   34.669343
    crusnusa   7.152422    6.881578    6.067334
    kinst2     11.892078   11.017038   11.604298

    For gcc 2.95.3 the preset flags in makefile.unix are generally better than -O2 or -O3 alone. The only exception was ssf2t, where -O3 was about 0.05 fps better. -O3 helps newer games somewhat and hurts older games slightly, with early '90s games somewhere in between.

    gcc 3.2.2
    GAME       PRESET      -O2         -O3
    pacman     83.793083   84.312677   84.895467
    tempest    16.763369   17.252698   16.642994
    samsho     39.368409   39.609946   39.928090
    ssf2t      35.510030   35.624199   35.970064
    xmen       42.589371   43.566596   43.614736
    mslugx     35.831199   36.133992   36.989401
    mk2        36.452126   36.239598   36.224637
    crusnusa   7.305772    6.981159    7.239103
    kinst2     11.418256   10.902094   10.231836

    Strange behavior! Here we see newer games again benefiting from the preset CFLAGS, but generally -O3 has the best performance by a slim margin. However, -O3 miscompiles on the Octane (see below), so there clearly is danger in using it on alternative architectures. This makes it hard to recommend a single set of CFLAGS for gcc 3.2.2. -O2 is a good compromise except for newer games.
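One way to summarize the two K6-3+ tables is to count, for each gcc version, how often each flag set gives the highest fps. The sketch below does only that; the numbers are copied from the tables above, and FLAGS, GCC_2953, GCC_322, best_flag, and tally are names chosen for this illustration. It ignores how small the margins are, so it complements the tables rather than replacing them.

```python
from collections import Counter

FLAGS = ("PRESET", "-O2", "-O3")

# fps on the K6-3+, copied from the tables above, in FLAGS order.
GCC_2953 = {
    "pacman":   (84.598798, 80.678981, 80.058366),
    "tempest":  (16.651686, 16.600993, 16.581389),
    "samsho":   (40.909001, 39.387096, 39.637885),
    "ssf2t":    (36.107175, 35.793585, 36.161937),
    "xmen":     (43.611827, 43.051415, 42.871190),
    "mslugx":   (37.525928, 36.294775, 36.812401),
    "mk2":      (35.976701, 34.635191, 34.669343),
    "crusnusa": (7.152422, 6.881578, 6.067334),
    "kinst2":   (11.892078, 11.017038, 11.604298),
}
GCC_322 = {
    "pacman":   (83.793083, 84.312677, 84.895467),
    "tempest":  (16.763369, 17.252698, 16.642994),
    "samsho":   (39.368409, 39.609946, 39.928090),
    "ssf2t":    (35.510030, 35.624199, 35.970064),
    "xmen":     (42.589371, 43.566596, 43.614736),
    "mslugx":   (35.831199, 36.133992, 36.989401),
    "mk2":      (36.452126, 36.239598, 36.224637),
    "crusnusa": (7.305772, 6.981159, 7.239103),
    "kinst2":   (11.418256, 10.902094, 10.231836),
}


def best_flag(scores):
    """Return the flag set with the highest fps for one game."""
    return FLAGS[scores.index(max(scores))]


def tally(results):
    """Count how many games each flag set wins."""
    return Counter(best_flag(scores) for scores in results.values())


for name, results in (("gcc 2.95.3", GCC_2953), ("gcc 3.2.2", GCC_322)):
    print(name, dict(tally(results)))
```

With these figures PRESET wins eight of nine games under gcc 2.95.3, while under gcc 3.2.2 the wins are split between -O3 (five), PRESET (three), and -O2 (one).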


    Tests performed on the Octane
    PRESET:

    CFLAGS = -O2 -Wall -Wno-unused -march=r5k -mabi=n32 -fomit-frame-pointer -fstrict-aliasing -fstrength-reduce -ffast-math
    

    -O2
    CFLAGS = -O2 -Wall -Wno-unused -march=r5k -mabi=n32
    

    -O3
    CFLAGS = -O3 -Wall -Wno-unused -march=r5k -mabi=n32
    

    gcc 3.2.2
    GAME       PRESET            -O2           -O3
    pacman     117.940765        118.288844    hang on black screen
    tempest    30.914372         30.968461     miscompile--vertex coords all wrong
    samsho     27.079296         26.900544     hang on BIOS mess
    ssf2t      29.350826         29.360782     hang on black
    xmen       32.940774         32.970164     hang on "bad" ROM screen
    mslugx     22.625514         22.922950     hang on BIOS mess
    mk2        30.826279         30.384780     hang with corrupt screen
    crusnusa   negative number   ??-7.15??     -6.80???
    kinst2     coredump          coredump!     hang

    First let me note that kinst2 hangs on a black screen even with -O and no architecture flags! I have not gotten kinst2 to run correctly with gcc 3.2.2 on the Octane at all. Additionally, I get an internal compiler error using N64 mode with the preset CFLAGS, so as yet I cannot compare 64-bit performance on the Octane. (The libraries are installed.) Obviously -O3 kills us here. There isn't much difference between the preset CFLAGS and -O2.

