Add Pentium Pro processor instructions #300
Comments
These are all features first introduced in the Pentium Pro processor (predecessor to the Xeon processors) which weren't really intended for games of that era. Do you know many games that would profit from having these instructions? In my opinion, DOSBox Pure is focused on running games up to around, let's say, 1997-ish. Anything newer than that is cool if it runs playably for people, but it might ask a bit too much of good old DOSBox. More specialized virtual machine solutions or something like Wine might make more sense at some point.
You underestimate the power of your creation :) And if we take 2D games, we can go even further. All of my 2D games up to 2001 (Civilization III) run flawlessly. Unfortunately some of them are not working at all, but if the game is working, it works fast enough to be fully playable!
Have you seen the Digital Foundry video on DOSBox-Pure? It was a nice surprise seeing this! They have DBP running on an Xbox Series X. There are some interesting benchmarks too, which replicate my experience on PC where using 3dfx acceleration is actually slower than software rendering, which is the opposite of real hardware of course. But it makes sense because the Voodoo 1 has to be emulated, which stresses the CPU more. With a Threadripper 3960X i can run Quake 2 at 1024x760 at 60fps in software rendering mode, but only 640x480 at 45 - 60fps with acceleration. If you're interested i can post some benchmarks with AMD Zen 4/Ryzen 7000 or the Intel 13th gen Raptor Lake CPUs when these come out in a few months. Performance-wise i'm expecting them to run 3D games from 2000 at very playable frame rates. The biggest problem will likely be the small 12MB VRAM of the Voodoo 1 at that point.
@PoloniumRain, Voodoo benchmarks on Zen4/Raptor Lake will be very interesting, yes! Please post those. It seems that at some point, even if you emulate a Voodoo 5 6000 (via multiple host cores), the emulated CPU performance can become a bottleneck as it relies on a single host thread (thus #370). Although that depends on the workload: these benchmarks show that games will benefit from CPU emulation speed matching a Pentium II 450, so host requirements shouldn't be outrageous.
I didn't get either of those CPUs in the end, but i'll definitely get Zen 5/9950X (or whatever it will be called) whenever the X3D version is released. I'll benchmark that :)
DOSBox SVN patch for PPro and its X implementation (there are further additions to bring that to PII/III level). |
What do you mean by "software rendering" vs "with acceleration" - the renderer selected in Quake2? What was the CPU utilization on the host while doing those tests? I assume one thread was at maximum in both cases, while in the case of emulating the Voodoo ("with acceleration"?) - how many threads were utilized? For reference - Quake2 OpenGL 1024x768 (Voodoo2 SLI - 2 cards):
So, I'm also wondering if emulating 3DNow! will help, since:
I was running this excellent DOS port of Quake II and using the game's 3dfx OpenGL renderer, so it's using Voodoo. That's what i meant by acceleration. This is just a very quick (and bad) test with 3dfx OpenGL but surprisingly a decent number of threads are used... I have about 300 browser tabs open though... But roughly 85% of the CPU usage is from Q2. I also expected one thread to be at 100%, but instead 10 or so threads show increased usage with 6 of them increasing more than others. Task Manager is poor for monitoring this stuff though so i wouldn't pay that much attention to it. And those MMX results are interesting! But i can't wait for new CPUs like Zen 5 to arrive so i can finally stop caring about these things. I mean i'm sure these will be able to brute force almost any emulator/settings for the next several years, maybe even PCem with Voodoo 3 + the fastest supported Pentium II, which is 450MHz. P2 233MHz is the limit for my 3960X.
DOSBox Pure does software emulation of Voodoo 3dfx, always with 4 threads. This probably should be changed to have the code figure out how many cores the host CPU has and use more threads if there are more available. I'll put this on the TODO list. |
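As a rough illustration of "figure out how many cores the host CPU has", here is a minimal portable C++ sketch. It is not DOSBox Pure's code: `detect_host_threads` and the fallback value of 4 are hypothetical, picked here only to mirror the fixed 4 threads mentioned above. Note that `std::thread::hardware_concurrency()` counts logical threads (SMT siblings included) and is allowed to return 0 when the count cannot be determined.

```cpp
#include <cassert>
#include <thread>

// Hypothetical helper (not from the DOSBox Pure source): report how many
// hardware threads the host offers, falling back to a fixed value when
// the standard library cannot tell.
unsigned detect_host_threads(unsigned fallback = 4) {
    assert(fallback > 0);  // a fallback of 0 would be useless to callers
    // hardware_concurrency() returns logical threads, or 0 if unknown.
    unsigned n = std::thread::hardware_concurrency();
    return n != 0 ? n : fallback;
}
```

A renderer could call this once at startup and size its worker pool from the result instead of hard-coding 4.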
Thanks, @PoloniumRain, for the further info! @schellingb, great to hear that, looking forward to seeing what that will deliver on Threadripper (OK, and on most regular modern CPUs, which go well above 4 threads).
It's simple, both the Digital Foundry video and myself are only saying that it makes sense that running a game with Voodoo acceleration would produce a lower frame rate than software rendering. It's not specific to Quake II, that's just an example. It happens with 100% of games and it's because emulating a 3dfx Voodoo GPU, or literally any GPU for that matter, will require more processing power from the host machine to emulate. Just like running a PS3 emulator will be slower than running a PS2 emulator because the PS3 is a far more powerful console that requires much faster host hardware for emulation. So while a physical GPU in a 1990's PC will make games run at higher frame rates, with emulation any emulated GPU has the exact opposite effect, even in cases where a game may have identical graphics and resolution to software rendering (but Q2 isn't one of the games, it looks very different with the Voodoo. Arguably much worse aesthetically but technically better lol). But if many more threads can be used then i'd expect the frame rate gap will disappear... |
OK, so basically a game developer can make 'direct' software rendering more efficient than the Glide emulation developer can (due to the lack of API overhead). Of course, if the result looks different it's another thing to consider. Let's see how many host threads are needed to bridge that gap. Bochs extends the DOSBox Voodoo emulation to Banshee and Voodoo3 with further improvements, an LGPL BIOS, etc., and there is also one attempt to add Voodoo3 VGA to DOSBox-X.
…or triangle rendering based on the number of cores available on the host CPU (#300)
I made the Voodoo multi-threaded renderer scale the number of threads based on the number of cores available on the host CPU. I'm not sure this is always an improvement. A very simple test on my 8-core Android phone showed a very slight decrease in performance, but it might just be a bad test. Hopefully it gives more significant improvements on other types of hardware. Would be interesting to know how it fares on a Threadripper CPU :-)
…or triangle rendering based on the number of cores available on the host CPU (#300)
Sorry I missed a file in the commit, I added it now. It should compile now. |
The improved multi-threading has some interesting results, especially with one specific thing...

QUAKE II (640x480)
Q2's frame rate is roughly 1 - 2 fps lower with the new multi-threading renderer unless transparency is being displayed. The more transparency on the screen, the higher the frame rate. When there's a ton of transparency (like looking through a window that's right in your face) the frame rate can increase by up to ~80%. Overall i'd say it's better performance, but barely, because quite often there's either none or a low amount of transparency on the screen in this game. So in those cases the frame rate is the same or slightly lower. But when there's a lot of transparency, there's a huge improvement in drawn frames (and the Q2 frame rate counter in the bottom right also confirms this).

Screamer 2 (640x480)
Screamer Rally (640x480)
Both Screamer games are very consistent with this change. Both games typically have slightly higher frame rates with the improved multi-threaded rendering.

Carmageddon (640x480)
Carmageddon is the opposite of the Screamer games, where this time i get slightly lower frame rates.

Fatal Racing/Whiplash
Practically identical results on the 3 tracks tested.

Tomb Raider
Game is locked at 30fps. It's always hitting 30fps so it can't be tested and i only get 4% CPU utilisation because of this.

Grand Theft Auto, GTA: London 1961 and GTA: London 1969
Again these games are annoyingly locked at a low frame rate (~24fps for the GTA games when using 3dfx). So they also can't be tested (3.5% CPU utilisation btw lol).

Thread usage is definitely higher than before, but it's still extremely low at around 8 - 9% total CPU usage for RetroArch (previously around 5 - 6.5%).

I don't know if anything else can be improved with multi-threading but if it could reach a point where all games perform either the same or better, then you could just have it enabled by default. Otherwise a toggle option might be best so it can be set for each game depending on how that game performs :)
Wow amazing tests and data, thank you so much. It's... ok I guess? Not quite the boost we were hoping for but I'm not sure how we could tweak things from here. By the way the implementation works like this:
The reason for the "1 less thread" is because the main thread participates in the calculations in addition to the spawned threads. So that should make it use all the cores (up to 16) during rendering. Thanks again for your time and detailed testing! |
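The "1 less thread" rule can be sketched roughly as follows. This is an illustration under assumptions, not the committed code: `worker_thread_count` is a hypothetical name, and the cap is passed in as a parameter because the comment above does not say whether "up to 16" includes the main thread.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>

// Hypothetical helper: number of extra worker threads to spawn for
// triangle rendering. The main thread renders too, so one core is
// reserved for it, which is where the "1 less thread" comes from.
std::uint8_t worker_thread_count(unsigned host_cores,
                                 unsigned max_workers,
                                 bool multithreading_enabled) {
    assert(max_workers <= 255);  // result has to fit in 8 bits
    if (!multithreading_enabled || host_cores <= 1) {
        return 0;  // single core or feature off: main thread does it all
    }
    // One worker per remaining core, clamped to the configured maximum.
    return static_cast<std::uint8_t>(std::min(host_cores - 1, max_workers));
}
```

With 8 host cores and a cap of 15 workers this yields 7 workers, so 8 threads render in total once the main thread joins in.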
Here ya go...

QUAKE II (640x480)

Screamer 2 (640x480)
The Screamer 2 results also apply to Screamer Rally, it's basically the same game with a new skin.

Edit: More tests...

Shadow Warrior (640x480)
Shadow Warrior is the one game where 4 threads is slightly faster than 7. But games created with the Build Engine look poor with 3dfx/GPU acceleration, because all it does is blur everything and limit you to a low resolution. It looks and runs much better in software renderer mode. So these results are a bit pointless.

Descent II (640x480)
Unlike other games, Descent II scales best with 12 or 15 threads, but the game only outputs 60fps and seems to internally render far higher frame rates according to the in-game FPS counter. 4 threads easily has the worst performance.

Carmageddon (640x480)

Battle Arena Toshinden (640x480)

Tomb Raider (640x480)

So overall i'd say 7 threads are best. 7 is always faster than the previous 4 threads in any games i try apart from Shadow Warrior. It also has the best average performance in the most demanding games like Q2! I also tried 42 threads just for a laugh... and err performance was shitty to say the least. Totally unplayable frame rates with lots of sound crackling. Frame rates usually start to tank when going over 15 threads. 10, 12 and 15 threads are the best if there's a LOT of transparency on the screen, but this many threads isn't realistically helpful, because 99% of the time there won't be anywhere near that much transparency in a scene. To give these thread counts an advantage over 7 threads in QUAKE II i needed to go right up to a transparent window so that it took up ~70% or more of the entire screen (as shown with "Very high" in the transparency column on the chart). In all other instances 7 threads is faster.
Thank you again for these exhaustive test results! I ended up switching to 7, it seems a good middle ground. I also thought of maybe doing something like a percentage (use 50% of cores, use 75% of cores) but it seems to heavily depend on the hardware. On my Android phone which has 8 cores it seems using half of the cores for voodoo rendering (3 threads + main thread) gives the best performance, but using all cores results in a slight (8%) performance decrease. On an older 4 core Intel desktop CPU on the other hand using even one less of the cores results in a more significant 15% performance loss. Making it configurable is still an option, but it would be really nice if the core could just be as smart as possible. |
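The percentage idea mentioned above could look something like this. It is a hypothetical sketch only (`workers_for_core_share` is not a function in DOSBox Pure), and it assumes the share is rounded down and that the main thread counts as one of the rendering threads.

```cpp
#include <cassert>

// Hypothetical helper: worker threads to spawn when only `percent` of the
// host cores should take part in Voodoo rendering. The main thread is
// counted as one of the rendering threads, so it is subtracted at the end.
unsigned workers_for_core_share(unsigned host_cores, unsigned percent) {
    assert(percent <= 100);
    unsigned rendering_threads = host_cores * percent / 100;  // rounds down
    return rendering_threads > 0 ? rendering_threads - 1 : 0;
}
```

For the 8-core phone at 50% this gives 3 workers plus the main thread, the combination reported as fastest there. As the comment notes, the best share seems to depend heavily on the hardware, so a fixed percentage is unlikely to suit every host.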
I've edited my above post and added even more tests... just to make absolutely sure, but long story short - 7 is still the best.
Does it have Intel Hyperthreading? So 4 cores with 8 threads. I suspect that 7 threads might be the best for x86 CPUs if the CPU has at least 8 threads, or maybe at least 12. Android/mobile SoCs typically have 1 to 4 performance cores plus much slower energy saving cores. Maybe that's why 3 threads is the best there. The AMD Ryzen 9950X was announced today, so when that's available in July (if it's not just a paper launch) i'll be able to test that. Not just for thread performance but also to see just what that thing can do :D
For x86 CPUs maybe have it use a maximum of 7 threads, even on CPUs with tons of threads, seeing as the performance often decreases when going over 7. But if a CPU supports fewer than 8 threads then maybe revert back to 4 threads like it was using previously. I think more testing is needed with different CPUs but 7 is probably fine for now. I'll try to test my ancient Intel 2 core/4 thread Windows laptop in the next few days.
Just tested my crusty old laptop...
Specs:
Descent II (640x480)
Screamer 2 (640x480)
QUAKE II (640x480)
So there are no issues with 7 threads on these old Intel 2 core CPUs.
Amazing, the more data the better. Thanks again! Keep in mind, if you're just changing

    unsigned cores = cpu_features_get_core_amount();
    v->tworker.triangle_threads = ((voodoo_pci_sstdevice.perf & 1) ? (cores <= (MAX_TRIANGLE_THREADS+1) ? (UINT8)(cores - 1) : MAX_TRIANGLE_THREADS) : 0);

So on a system where

    unsigned cores = cpu_features_get_core_amount() * 2;

so it thinks there are twice as many cores available as there actually are, but I've only ever seen negative effects in doing so (using more threads than there are physical host CPU cores).
It actually compiled ok but frame rates were almost cut in half, so it performed the worst so far.
Decreased performance again. It had very similar results to using 2 threads in my last post... Oh well, it was worth a try! |
Amazing data! Can the number of threads be made a user/conf option? |
Features: it would be nice to add PAE (Physical Address Extension), CMOV (Conditional Move) and CX8 (CMPXCHG8B) instructions for better compatibility.
Keep up the good work.