Add Pentium Pro processor instructions #300

Open
rodrigoandrigo opened this issue Jul 8, 2022 · 24 comments
Labels
enhancement New feature or request wontfix This will not be worked on

Comments

@rodrigoandrigo

rodrigoandrigo commented Jul 8, 2022

It would be nice to add the PAE (Physical Address Extension), CMOV (Conditional Move) and CX8 (CMPXCHG8B) features for better compatibility.

keep up the good work.

@schellingb
Owner

These are all features first introduced in the Pentium Pro processor (predecessor to the Xeon processors) which weren't really intended for games of that era. Do you know many games that would profit from having these instructions?

In my opinion, DOSBox Pure is focused on running games up to around, let's say, 1997-ish. Anything newer than that is cool if it runs playable for people, but it might ask a bit too much of good old DOSBox. More specialized virtual machine solutions or something like Wine might make more sense at some point.

@xttx

xttx commented Jul 9, 2022

In my opinion, DOSBox Pure is focused on running games up to around, let's say,1997-ish. Anything newer than that is cool if it runs playable for people, but it might ask a bit too much of good old DOSBox.

You underestimate the power of your creation :)
From 1999 on, the Voodoo 1/2 is considered obsolete. Not many 3D-accelerated games from 2000+ will work fast enough on this hardware (if at all). But some do, and if you set the core to dynamic and disable "force normal core", most of those games run smoothly at 100% emulation speed when set to "Pentium 2 300 MHz, 200000 cycles". Examples: Croc, Star Wars Racer, Tomb Raider 4

And if we take 2D games, we can go even further. All of my 2D games up to 2001 (Civilization III) run flawlessly. Unfortunately some of them do not work at all, but if a game works, it runs fast enough to be fully playable!

@PoloniumRain
Contributor

PoloniumRain commented Jul 10, 2022

In my opinion, DOSBox Pure is focused on running games up to around, let's say,1997-ish. Anything newer than that is cool if it runs playable for people, but it might ask a bit too much of good old DOSBox.

Have you seen the Digital Foundry video on DOSBox-Pure? It was a nice surprise seeing this! They have DBP running on an Xbox Series X. There are some interesting benchmarks too, which replicate my experience on PC where using 3dfx acceleration is actually slower than software rendering, which is the opposite of real hardware of course. But it makes sense, because the Voodoo 1 has to be emulated too, which stresses the CPU more. With a Threadripper 3960X i can run Quake 2 at 1024x760 at 60fps in software rendering mode, but only 640x480 at 45 - 60fps with acceleration.

If you're interested i can post some benchmarks with AMD Zen 4/Ryzen 7000 or the Intel 13th gen Raptor Lake CPUs when these come out in a few months. Performance wise i'm expecting them to run 3D games from 2000 at very playable frame rates. The biggest problem will likely be the small 12MB VRAM of the Voodoo 1 at that point.

@Torinde
Contributor

Torinde commented Dec 18, 2023

@PoloniumRain, Voodoo benchmarks on Zen4/Raptor Lake will be very interesting, yes! Please post those.

It seems at one point even if you emulate Voodoo 5 6000 (via multiple host cores), the emulated CPU performance can be a bottleneck as it relies on a single host thread (thus #370).

Although that depends on the workload - these benchmarks show that games will benefit from CPU emulation speed matching a Pentium II 450, so host requirements shouldn't be outrageous.

schellingb changed the title from "Resources that would be nice to have" to "Add Pentium Pro processor instructions" on Dec 18, 2023
schellingb added the "enhancement" (New feature or request) and "wontfix" (This will not be worked on) labels on Dec 18, 2023
@PoloniumRain
Contributor

Voodoo benchmarks on Zen4/Raptor Lake will be very interesting, yes! Please post those.

I didn't get either of those CPUs in the end, but i'll definitely get Zen 5/9950X (or whatever it will be called) whenever the X3D version is released. I'll benchmark that :)

@Torinde
Contributor

Torinde commented Mar 10, 2024

There is a DOSBox SVN patch for PPro and a DOSBox-X implementation of it (there are further additions to bring that to PII/III level).

@Torinde
Contributor

Torinde commented Apr 10, 2024

With a Threadripper 3960X i can run Quake 2 at 1024x760 at 60fps in software rendering mode, but only 640x480 at 45 - 60fps with acceleration.

What do you mean by "software rendering" vs "with acceleration" - the renderer selected in Quake2?

What was the CPU utilization on the host while doing those tests? I assume one thread was at maximum in both cases, while in the case of emulating the Voodoo ("with acceleration"?) - how many threads were utilized?

For reference - Quake2 OpenGL 1024x768 (Voodoo2 SLI - 2 cards):

  • 44fps on K6-2 333MHz without 3DNow! driver
  • 67fps on K6-2 333MHz with 3DNow! driver
  • 69fps on P2 333MHz

So, I'm also wondering if emulating 3DNow! will help, since:

  • on real hardware it brings 50% improvement
  • enabling MMX emulation in Staging improved the speed multiple times! (for some tests that support MMX)

@PoloniumRain
Contributor

I was running this excellent DOS port of Quake II and using the game's 3dfx OpenGL renderer, so it's using Voodoo. That's what i meant by acceleration.

This is just a very quick (and bad) test with 3dfx OpenGL but surprisingly a decent amount of threads are used...

[Image: Q2perf]

I have about 300 browser tabs open though... But roughly 85% of the CPU usage is from Q2. I also expected one thread to be at 100%, but instead 10 or so threads show increased usage with 6 of them increasing more than others. Task Manager is poor for monitoring this stuff though so i wouldn't pay that much attention to it.

And those MMX results are interesting! But i can't wait for new CPUs like Zen 5 to arrive so i can finally stop caring about these things. I mean i'm sure these will be able to brute force almost any emulator/settings for the next several years, maybe even PCem with Voodoo 3 + the fastest supported Pentium II, which is 450MHz. P2 233MHz is the limit for my 3960X.

@schellingb
Owner

DOSBox Pure does software emulation of Voodoo 3dfx, always with 4 threads. This probably should be changed to have the code figure out how many cores the host CPU has and use more threads if there are more available. I'll put this on the TODO list.

@Torinde
Contributor

Torinde commented Apr 15, 2024

Thanks, @PoloniumRain for the further info!
OK, so you are saying that this Quake2 port is faster in its "software rendering running on CPU emulated by Pure 1 thread" mode than its "OpenGL rendering mode running on Voodoo emulated by Pure 4 threads + CPU emulated by Pure 1 thread"? While having plenty of physical cores on the host...
Does that mean the graphics quality is different? E.g. "software rendering mode" has much less fidelity than "OpenGL mode"? Otherwise I don't get it...

@schellingb, great to hear that, looking forward to seeing what that will deliver on Threadripper (OK, and on most regular modern CPUs, which go well above 4 threads).

@PoloniumRain
Contributor

It's simple, both the Digital Foundry video and myself are only saying that it makes sense that running a game with Voodoo acceleration would produce a lower frame rate than software rendering. It's not specific to Quake II, that's just an example. It happens with 100% of games and it's because emulating a 3dfx Voodoo GPU, or literally any GPU for that matter, will require more processing power from the host machine to emulate. Just like running a PS3 emulator will be slower than running a PS2 emulator because the PS3 is a far more powerful console that requires much faster host hardware for emulation.

So while a physical GPU in a 1990s PC will make games run at higher frame rates, with emulation any emulated GPU has the exact opposite effect, even in cases where a game may have identical graphics and resolution to software rendering (but Q2 isn't one of those games, it looks very different with the Voodoo. Arguably much worse aesthetically but technically better lol).

But if many more threads can be used then i'd expect the frame rate gap will disappear...

@Torinde
Contributor

Torinde commented Apr 16, 2024

OK, so basically a game developer can make 'direct' software rendering more efficient than what the Glide emulation developer can achieve (due to the lack of API overhead). Of course, if the result looks different it's another thing to consider.

Let's see how many host threads are needed to bridge that gap.

Bochs extends the DOSBox Voodoo emulation to Banshee and Voodoo3 with further improvements, an LGPL BIOS, etc., and there is also one attempt to add Voodoo3 VGA to DOSBox-X.

schellingb added a commit that referenced this issue Jun 1, 2024
…or triangle rendering based on the number of cores available on the host CPU (#300)
@schellingb
Owner

I made the voodoo multi threaded renderer now scale the number of threads based on the number of cores available on the host CPU. I'm not sure this is always an improvement. A very simple test on my 8-core android phone showed a very slight decrease in performance, but it might just be a bad test. Hopefully it gives more significant improvements on other types of hardware. Would be interesting to know how it fares on a threadripper CPU :-)

@PoloniumRain
Contributor

Tried to test it but DOSBox-Pure won't compile with Visual Studio 2022, even though just a few hours earlier it was working lol. Looks like the Voodoo multi-threaded changes broke something:

[Image: Errors]

schellingb added a commit that referenced this issue Jun 2, 2024
…or triangle rendering based on the number of cores available on the host CPU (#300)
@schellingb
Owner

Sorry I missed a file in the commit, I added it now. It should compile now.

@PoloniumRain
Contributor

The improved multi-threading has some interesting results, especially with one specific thing...

QUAKE II (640x480)

| Map | New MT renderer | Average frame rate | On-screen transparency |
|---|---|---|---|
| Outer Base | No | 25 fps | Very low |
| Outer Base | Yes | 24 fps | Very low |
| Outer Base (2) | No | 36 fps | Very high |
| Outer Base (2) | Yes | 51 fps | Very high |
| Ammo Depot | No | 25 fps | None |
| Ammo Depot | Yes | 27 fps | None |
| Ammo Depot (2) | No | 23 fps | None |
| Ammo Depot (2) | Yes | 20 fps | None |
| Guard House | No | 18 fps | Low |
| Guard House | Yes | 18 fps | Low |
| Guard House (2) | No | 32 fps | Very high |
| Guard House (2) | Yes | 54 fps | Very high |
| Drilling Area | No | 19 fps | None |
| Drilling Area | Yes | 19 fps | None |
| Lower Mines | No | 26 fps | None |
| Lower Mines | Yes | 25 fps | None |
| Processing Plant | No | 26 fps | None |
| Processing Plant | Yes | 25 fps | None |
| Toxic Waste Dump | No | 33 fps | Moderate |
| Toxic Waste Dump | Yes | 36 fps | Moderate |
| Research Lab | No | 28 fps | High |
| Research Lab | Yes | 35 fps | High |

Q2's frame rate is roughly 1 - 2 fps lower with the new multi-threaded renderer unless transparency is being displayed. The more transparency on the screen, the higher the frame rate. When there's a ton of transparency (like looking through a window that's right in your face) the frame rate can increase by up to ~80%. Overall i'd say it's better performance, but barely, because quite often there's either none or a low amount of transparency on the screen in this game. So in those cases the frame rate is the same or slightly lower.

But... when there's a lot of transparency like this:

[Image: Q2]
[Image: Q2improved]

...Huge improvement in these cases with drawn frames (and the Q2 frame rate counter in the bottom right also confirms this).

Screamer 2 (640x480)

| Track | New MT renderer | Average frame rate |
|---|---|---|
| England | No | 46 fps |
| England | Yes | 51 fps |
| Egypt | No | 50 fps |
| Egypt | Yes | 51 fps |
| California | No | 41 fps |
| California | Yes | 46 fps |

Screamer Rally (640x480)

| Track | New MT renderer | Average frame rate |
|---|---|---|
| China | No | 45 fps |
| China | Yes | 49 fps |
| Canada | No | 44 fps |
| Canada | Yes | 44 fps |
| Italy | No | 52 fps |
| Italy | Yes | 59 fps |

Both Screamer games are very consistent with this change. Both games typically have slightly higher frame rates with the improved multi-threaded rendering.

Carmageddon (640x480)

| Track | New MT renderer | Average frame rate |
|---|---|---|
| Maim Street | No | 36 fps |
| Maim Street | Yes | 35 fps |
| Coastal Carnage | No | 47 fps |
| Coastal Carnage | Yes | 45 fps |

Carmageddon is the opposite of the Screamer games, where this time i get slightly lower frame rates.

Fatal Racing/Whiplash

Practically identical results on the 3 tracks tested.

Tomb Raider

Game is locked at 30fps. It's always hitting 30fps so it can't be tested and i only get 4% CPU utilisation because of this.

Grand Theft Auto, GTA: London 1961 and GTA: London 1969

Again these games are annoyingly locked at a low frame rate (~24fps for the GTA games when using 3dfx). So they also can't be tested (3.5% CPU utilisation btw lol).

Thread usage is definitely higher than before, but it's still extremely low at around 8 - 9% total CPU usage for RetroArch (previously around 5 - 6.5%):

Previously it looked like this:

[Image: Cores_old]

And now with the improved multi-threading:

[Image: Cores]

I don't know if anything else can be improved with multi-threading but if it could reach a point where all games perform either the same or better, then you could just have it enabled by default. Otherwise a toggle option might be best so it can be set for each game depending on how that game performs :)

@schellingb
Owner

Wow amazing tests and data, thank you so much. It's... ok I guess? Not quite the boost we were hoping for but I'm not sure how we could tweak things from here.

By the way the implementation works like this:

  • Get the number of cores
  • If there are 16 or more cores, use 15 threads
  • Otherwise spawn 1 thread less than there are CPU cores

The reason for the "1 less thread" is that the main thread participates in the calculations in addition to the spawned threads. So that should make it use all the cores (up to 16) during rendering.
It would be very interesting to see what results you get if you were to change MAX_TRIANGLE_THREADS = 15 on line 273 of /src/hardware/voodoo.cpp from 15 to 7, 12 or 23. If you have some more time for this, maybe choose one game you can easily start up and see what differences we get?
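For readers following along, the three steps listed above can be sketched roughly as below. This is only an illustration, not the code in voodoo.cpp: `pick_worker_threads`, `detect_worker_threads` and `kMaxWorkerThreads` are made-up names, and `std::thread::hardware_concurrency()` is assumed as a stand-in for however the core actually queries the host (it reports logical processors and may return 0 when the count is unknown).

```cpp
#include <thread>

// Cap taken from the description above (15 worker threads at that time).
// The name is hypothetical and only used in this sketch.
constexpr unsigned kMaxWorkerThreads = 15;

// Number of worker threads to spawn next to the main thread.
// The main thread renders too, so "cores - 1" workers keep every core busy.
constexpr unsigned pick_worker_threads(unsigned cores,
                                       unsigned cap = kMaxWorkerThreads) {
    return cores <= 1 ? 0u                 // one core or unknown count: no workers
         : cores - 1 < cap ? cores - 1     // fewer cores than the cap allows
         : cap;                            // many cores: clamp to the cap
}

// Assumption: the standard library query stands in for the core's own
// core-count function, which is not shown in this part of the thread.
inline unsigned detect_worker_threads() {
    return pick_worker_threads(std::thread::hardware_concurrency());
}

// Compile-time checks of the rule as described.
static_assert(pick_worker_threads(16) == 15, "16 cores: 15 workers + main thread");
static_assert(pick_worker_threads(8) == 7, "8 cores: 7 workers + main thread");
static_assert(pick_worker_threads(48) == 15, "more than 16 cores: clamped to 15");
```

Passing a different second argument (for example `pick_worker_threads(48, 7)`) corresponds to the experiment of changing the cap to 7, 12 or 23.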

Thanks again for your time and detailed testing!

@PoloniumRain
Contributor

PoloniumRain commented Jun 2, 2024

Here ya go...

QUAKE II (640x480)

| Map | 4 threads FPS | 7 threads FPS | 10 threads FPS | 12 threads FPS | 15 threads FPS | 23 threads FPS | Transparency shown |
|---|---|---|---|---|---|---|---|
| Outer Base | 25 | 28 | 26 | 25 | 24 | 19 | Very low |
| Outer Base (2) | 36 | 47 | 51 | 52 | 51 | 43 | Very high |
| Ammo Depot | 25 | 30 | 29 | 28 | 27 | 20 | None |
| Ammo Depot (2) | 23 | 24 | 22 | 21 | 20 | 15 | None |
| Guard House | 18 | 20 | 19 | 19 | 18 | 14 | Low |
| Guard House (2) | 32 | 45 | 50 | 50 | 51 | 45 | Very high |
| Drilling Area | 19 | 21 | 20 | 20 | 19 | 16 | None |
| Lower Mines | 26 | 28 | 27 | 26 | 25 | 19 | None |
| Processing Plant | 18 | 20 | 19 | 18 | 16 | 13 | None |
| Toxic Waste Dump | 33 | 38 | 37 | 36 | 36 | 30 | Moderate |
| Research Lab | 28 | 36 | 36 | 35 | 35 | 29 | High |

Screamer 2 (640x480)

| Track | 4 threads FPS | 7 threads FPS | 10 threads FPS | 12 threads FPS | 15 threads FPS | 23 threads FPS |
|---|---|---|---|---|---|---|
| England | 46 | 51 | 50 | 50 | 50 | 42 |
| Egypt | 50 | 51 | 51 | 51 | 50 | 48 |
| California | 41 | 48 | 48 | 48 | 46 | 38 |

The Screamer 2 results also apply to Screamer Rally, it's basically the same game with a new skin.

Edit: More tests...

Shadow Warrior (640x480)

| Map | 4 threads FPS | 7 threads FPS | 10 threads FPS | 12 threads FPS | 15 threads FPS | 23 threads FPS |
|---|---|---|---|---|---|---|
| Seppuku Station | 44 | 43 | 40 | 39 | 37 | 33 |
| Master Leep's Temple | 36 | 36 | 35 | 35 | 34 | 33 |
| Killing Fields | 46 | 44 | 39 | 38 | 36 | 31 |

Shadow Warrior is the one game where 4 threads is slightly faster than 7. But games created with the Build Engine look poor with 3dfx/GPU acceleration, because all it does is blur everything and limit you to a low resolution. It looks and runs much better in software renderer mode. So these results are a bit pointless.

Descent II (640x480)

| Map | 4 threads FPS | 7 threads FPS | 10 threads FPS | 12 threads FPS | 15 threads FPS | 23 threads FPS |
|---|---|---|---|---|---|---|
| Ahayweh Gate | 138 | 194 | 209 | 212 | 209 | 189 |
| Seaspring Gorge | 154 | 220 | 243 | 249 | 252 | 235 |

Unlike other games, Descent II scales best with 12 or 15 threads, but the game only outputs 60fps and seems to internally render far higher frame rates according to the in-game FPS counter. 4 threads easily has the worst performance.

Carmageddon (640x480)

| Track | 4 threads FPS | 7 threads FPS | 10 threads FPS | 12 threads FPS | 15 threads FPS | 23 threads FPS |
|---|---|---|---|---|---|---|
| Fridge Racer | 33 | 34 | 33 | 32 | 31 | 28 |

Battle Arena Toshinden (640x480)
Locked at 60fps on every stage with any thread count.

Tomb Raider (640x480)
Locked at 30fps with any thread count due to this game's 30fps cap.

So overall i'd say 7 threads are best. 7 is always faster than the previous 4 threads in any games i try apart from Shadow Warrior. It also has the best average performance in the most demanding games like Q2!

I also tried 42 threads just for a laugh... and err performance was shitty to say the least. Totally unplayable frame rates with lots of sound crackling. Frame rates usually start to tank when going over 15 threads.

10, 12 and 15 threads are the best if there's a LOT of transparency on the screen, but this many threads isn't realistically helpful, because 99% of the time there won't be anywhere near that much transparency in a scene. To give these thread counts an advantage over 7 threads in QUAKE II i needed to go right up to a transparent window so that it took up ~70% or more of the entire screen (as shown with "Very high" in the transparency column on the chart). In all other instances 7 threads is faster.
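As a rough cross-check of the "7 threads are best" conclusion, the QUAKE II numbers posted above can be summed per thread count. This is just a sketch over the figures from that table; the array and function names are made up for the illustration, and a plain sum is only one of several reasonable ways to compare the columns.

```cpp
#include <cstddef>

// Thread counts tested, in the column order of the QUAKE II table above.
constexpr int kThreadCounts[6] = {4, 7, 10, 12, 15, 23};

// Frame rates copied from the QUAKE II (640x480) table, one row per map.
constexpr int kQuake2Fps[11][6] = {
    {25, 28, 26, 25, 24, 19},  // Outer Base
    {36, 47, 51, 52, 51, 43},  // Outer Base (2)
    {25, 30, 29, 28, 27, 20},  // Ammo Depot
    {23, 24, 22, 21, 20, 15},  // Ammo Depot (2)
    {18, 20, 19, 19, 18, 14},  // Guard House
    {32, 45, 50, 50, 51, 45},  // Guard House (2)
    {19, 21, 20, 20, 19, 16},  // Drilling Area
    {26, 28, 27, 26, 25, 19},  // Lower Mines
    {18, 20, 19, 18, 16, 13},  // Processing Plant
    {33, 38, 37, 36, 36, 30},  // Toxic Waste Dump
    {28, 36, 36, 35, 35, 29},  // Research Lab
};

// Sum of the frame rates in one column (one thread count) over all maps.
constexpr int column_total(std::size_t col) {
    int total = 0;
    for (std::size_t row = 0; row < 11; ++row) total += kQuake2Fps[row][col];
    return total;
}

// Thread count whose column has the highest total frame rate.
constexpr int best_thread_count() {
    std::size_t best = 0;
    for (std::size_t col = 1; col < 6; ++col)
        if (column_total(col) > column_total(best)) best = col;
    return kThreadCounts[best];
}

static_assert(column_total(0) == 283, "total for 4 threads");
static_assert(column_total(1) == 337, "total for 7 threads");
static_assert(best_thread_count() == 7, "7 threads has the highest total");
```

The totals come out as 283, 337, 336, 330, 322 and 263 fps for 4, 7, 10, 12, 15 and 23 threads. 7 threads only narrowly beats 10 here because the two "Very high" transparency rows favour the higher counts; without those two rows the totals are 245 for 7 threads against 235 for 10.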

@schellingb
Owner

Thank you again for these exhaustive test results!

I ended up switching to 7, it seems a good middle ground. I also thought of maybe doing something like a percentage (use 50% of cores, use 75% of cores) but it seems to heavily depend on the hardware. On my Android phone which has 8 cores it seems using half of the cores for voodoo rendering (3 threads + main thread) gives the best performance, but using all cores results in a slight (8%) performance decrease. On an older 4 core Intel desktop CPU on the other hand using even one less of the cores results in a more significant 15% performance loss. Making it configurable is still an option, but it would be really nice if the core could just be as smart as possible.
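The percentage idea mentioned above could look something like the following sketch. `workers_for_share` is a hypothetical helper, not something in the DOSBox Pure source; it assumes, as described earlier in the thread, that the main thread takes part in rendering and therefore counts as one of the participating threads.

```cpp
// Hypothetical helper: worker threads to spawn when a given share of the
// host cores should take part in Voodoo rendering. The main thread counts
// as one participant, so it is subtracted from the total.
constexpr unsigned workers_for_share(unsigned cores, unsigned percent) {
    unsigned participants = cores * percent / 100;  // integer division rounds down
    if (participants < 1) participants = 1;         // the main thread always renders
    return participants - 1;
}

// The two machines mentioned above, expressed with this helper.
static_assert(workers_for_share(8, 50) == 3, "8 cores at 50%: 3 threads + main thread");
static_assert(workers_for_share(4, 100) == 3, "4 cores at 100%: all cores in use");
static_assert(workers_for_share(4, 75) == 2, "4 cores at 75%: one core left unused");
```

As the two data points suggest, no single percentage fits both machines (50% was best on the phone, 100% on the 4-core desktop), which is the difficulty with this approach.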

@PoloniumRain
Contributor

I've edited my above post and added even more tests... just to make absolutely sure, but long story short - 7 is still the best.

On an older 4 core Intel desktop CPU on the other hand using even one less of the cores results in a more significant 15% performance loss.

Does it have Intel Hyperthreading? So 4 cores with 8 threads. I suspect that 7 threads might be the best for x86 CPUs if the CPU has at least 8 threads, or maybe at least 12. Android/mobile SoCs typically have 1 to 4 performance cores plus much slower energy saving cores. Maybe that's why 3 threads is the best there. The AMD Ryzen 9950X was announced today, so when that's available in July (if it's not just a paper launch) i'll be able to test that. Not just for thread performance but also to see just what that thing can do :D

Making it configurable is still an option, but it would be really nice if the core could just be as smart as possible.

For x86 CPUs maybe have it use a maximum of 7 threads, even on CPUs with tons of threads, seeing as the performance often decreases when going over 7, but if a CPU supports less than 8 threads then maybe revert back to 4 threads like it was using previously. I think more testing is needed with different CPUs but 7 is probably fine for now. I'll try to test my ancient Intel 2 core/4 thread Windows laptop in the next few days.
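Written out, the rule suggested above is tiny. The function name is made up, and the sketch takes "threads" to mean the same thread-count setting used in the benchmark tables; it is only the proposal from this comment, not what the core does.

```cpp
// Hypothetical rule from the comment above: 7 threads when the host CPU
// supports at least 8 hardware threads, otherwise the previous value of 4.
constexpr unsigned suggested_triangle_threads(unsigned hardware_threads) {
    return hardware_threads >= 8 ? 7u : 4u;
}

static_assert(suggested_triangle_threads(48) == 7, "24 cores / 48 threads");
static_assert(suggested_triangle_threads(8) == 7, "4 cores / 8 threads");
static_assert(suggested_triangle_threads(4) == 4, "2 cores / 4 threads");
```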

@PoloniumRain
Contributor

Just tested my crusty old laptop...

Specs:
Intel i7 3517U @ 2.4GHz
2 cores 4 threads
Windows 11

Descent II (640x480)
41fps for ALL threads, from 3, 4, 7, 12, 15, 23 and even 42 threads it made no difference.
So i tried 2 threads and it dropped to 33fps.

Screamer 2 (640x480)
3 - 42 threads: 15fps
2 threads: 13fps

QUAKE II (640x480)
4 - 42 threads: 9fps
3 threads: 8fps
2 threads: 7fps

So there are no issues with 7 threads on these old Intel 2 core CPUs.

@schellingb
Owner

Amazing, the more data the better. Thanks again!

Keep in mind, if you're just changing MAX_TRIANGLE_THREADS, it will not actually affect anything unless the system has more cores than MAX_TRIANGLE_THREADS + 1:

	unsigned cores = cpu_features_get_core_amount();
	v->tworker.triangle_threads = ((voodoo_pci_sstdevice.perf & 1) ? (cores <= (MAX_TRIANGLE_THREADS+1) ? (UINT8)(cores - 1) : MAX_TRIANGLE_THREADS) : 0);

So on a system where cpu_features_get_core_amount() returns 2, only MAX_TRIANGLE_THREADS = 0 would actually have an impact - but that would result in a compile error I think, 0 is not valid. So with just 2 cores I don't think there is any option except disabling multi threading. For a test, you could change that one line to something like

	unsigned cores = cpu_features_get_core_amount() * 2;

so it thinks there are twice as many cores available as there actually are, but I've only ever seen negative effects in doing so (using more threads than there are physical host CPU cores).
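To make the clamping easier to see, here is the expression from the snippet above pulled out into a small standalone function. The function and parameter names are stand-ins invented for this sketch (`multithreading_enabled` models the `voodoo_pci_sstdevice.perf & 1` check), and the `(UINT8)` cast is left out, so treat it as an illustration of the arithmetic only.

```cpp
// Stand-in for the thread-count expression quoted above.
// cores                  models the value of cpu_features_get_core_amount()
// max_triangle_threads   models MAX_TRIANGLE_THREADS
// multithreading_enabled models (voodoo_pci_sstdevice.perf & 1)
constexpr unsigned triangle_threads(unsigned cores,
                                    unsigned max_triangle_threads,
                                    bool multithreading_enabled = true) {
    return multithreading_enabled
        ? (cores <= max_triangle_threads + 1 ? cores - 1 : max_triangle_threads)
        : 0u;
}

// With a reported core count of 4, every cap of 3 or more gives the same result.
static_assert(triangle_threads(4, 7) == 3, "cap 7 on 4 cores");
static_assert(triangle_threads(4, 15) == 3, "cap 15 on 4 cores");
static_assert(triangle_threads(4, 23) == 3, "cap 23 on 4 cores");

// With a reported core count of 2, only a cap of 0 changes anything.
static_assert(triangle_threads(2, 7) == 1, "cap 7 on 2 cores");
static_assert(triangle_threads(2, 0) == 0, "cap 0 on 2 cores");

// The cap only matters once the host has more than cap + 1 cores.
static_assert(triangle_threads(24, 7) == 7, "cap 7 on 24 cores");
```

This matches the explanation above: on a small host the result is always one less than the reported core count, no matter how large the cap is set.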

@PoloniumRain
Contributor

but that would result in a compile error I think

It actually compiled ok but frame rates were almost cut in half, so it performed the worst so far.

unsigned cores = cpu_features_get_core_amount() * 2;

so it thinks there are twice as many cores available as there actually are, but I've only ever seen negative effects in doing so (using more threads than there are physical host CPU cores).

Decreased performance again. It had very similar results to using 2 threads in my last post... Oh well, it was worth a try!

@Torinde
Contributor

Torinde commented Jun 8, 2024

Amazing data!

Can the number of threads be made a user/conf option?
