Standardize the benchmarks I use across multiple boards #2
Just to quickly clarify: what are numbers worth if they're too low because too much background activity or throttling happened, or CPU frequency capping kicked in (RPi with input voltage dropping below ~4.65V), or CPU cores were even shut down due to overheating? Which conclusions can be drawn about the efficiency of a certain SoC or ARM/RISC-V core when we blindly trust clockspeeds reported via sysfs while in reality they differ: with some Amlogic SoCs we still have a mismatch of -40% comparing 'advertised' vs. real clockspeed.
Same goes for Raspberries with inappropriate powering, and here it's even worse: the main OS (ThreadX on the VC) silently drops ARM clockspeeds to 600 MHz while the Linux kernel running on the ARM core still happily reports running at 1200/1500/1800 MHz. IMO stuff like this needs to be closely monitored so you're able to throw away results that are invalid (and since benchmarking server/storage systems is part of my day job I can confirm that throwing away results instead of relying on them is the 'norm', or what we mostly do, since something went wrong that we didn't know about upfront, so we need to adjust setup/methodology and repeat). As for the few individual benchmarks chosen: two are more load generators / stressors (cpuminer, stockfish) and there is actually just one that really generates numbers to compare, and that's 7-zip. But while 7-zip's focus is pretty limited it has one huge advantage: it generates scores that are consistent across OS and compiler releases (at least true for the 16.02 version that has shipped with Debian/Ubuntu for years). Most other benchmarks generate numbers that will change once you or your OS's maintainers update the OS or compiler version. Sometimes it goes in the opposite direction and with more recent GCC the benchmark scores drop (cpuminer for example on Debian 11). That's the huge advantage of those 7-zip scores even if they represent performance metrics the average SBC user can't make much use of: they represent 'hardware performance' far more precisely in situations where software changed significantly, and as such it makes sense to compare scores made in different years (not true for the majority of other passive benchmarking tools where numbers fluctuate over time 'by design'). This ability to compare/validate results made in different years matters. It's important to know why numbers differ and it's also important to realise when numbers made in different years are still valid or not (and that's something ignored in this SBC world almost everywhere - the basic rules of passive benchmarking simply do not apply since it's a true hobbyist's world)
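For reference, the scores in question come from 7-zip's built-in benchmark; a minimal sketch of running it yourself (Debian/Ubuntu package and binary names assumed):

```
# 7-zip's internal benchmark prints MIPS ratings; the "Tot:" line is what gets compared
sudo apt-get install -y p7zip      # provides the 7zr binary
7zr b                              # multi-threaded run using all cores
taskset -c 0 7zr b -mmt1           # optionally pin a single-threaded run to one core
```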
You're exactly right and it's actually worse than you're saying. The results aren't just invalidated by years. They can be invalidated within days given how fast these apt packages can change. They can be invalidated on the same day if the machines are running slightly different operating systems or firmware versions. Having a lot of tests certainly helps get closer to the truth though. Pi Benchmarks is essentially crowdsourcing that for the storage aspect across the entire community. On Pi Benchmarks we can actually go back and figure out when they changed gcc and the benchmarks started dropping, for example. That's because we have tens of thousands of tests to work with (probably a dozen or so per day on average, for years, across various boards). That means it's not really a bug that these scores can change over time when we're crowdsourcing the benchmark. It's a feature. Now we can go back and look really carefully at the dates/scores/system configurations and figure out what changed and why. You're right though that if you are taking individual benchmarks in a bubble and trying to compare them with one another, that is a totally invalid/useless comparison. Once we have hundreds or thousands of benchmarks though, this turns from a weakness into a strength. Now we have enough data to answer some of those more important questions: what changed, when, and why. You're right about everything you're saying though. Crowdsourcing is a way to turn this from a weakness when comparing two individual results into a strength where you're comparing a result against hundreds or thousands of results across all different times / configurations.
My main goal with this repository is to document my benchmarking results for the times I test these boards for a particular purpose (comparison, a review, a new project)—and almost every time I do a comparison, I will re-benchmark using the latest OS/updates since that does make a huge difference a lot of times. But the secondary purpose is to put up all my numbers (and any surprising notes, like the RK3588S on the Orange Pi being noticeably faster than the RK3588 on the Rock 5 B, both using the vendor Debian images), so others who benchmark or just have an interest in the boards can either confirm or deny the results, or give any feedback if desired (e.g. if some of my methodology is lacking... which it is, sometimes!). So I always welcome any and all feedback!
Beautiful system, that's a great way to investigate the root causes of some of these issues and differences as well. I will sometimes do head-to-head testing between different boards and I always do fresh tests every time. Even though I've run my same CPU benchmarks on the RPI 4 a dozen times, I don't just reuse the old numbers each time. I want to make sure they have all the apt updates and firmware updates available, as they would if a user bought the board and set it up for the first time. I think it's a great methodology. These are all great tools for us to help diagnose performance related issues. Different tools are appropriate for different jobs as well. I love having lots of tools available and that you shared / linked / documented a bunch of different ways to do it!
Simple reason: Rockchip's Dynamic Memory Interface (DMC) is 'at fault'. Xunlong hasn't discovered this feature/mechanism yet (so DMC is inactive in their kernel config, just as it was with Radxa last July). Radxa in the meantime uses a newer/different kernel branch where DMC is enabled and the governor defaults to dmc_ondemand. This default results in a drop in idle consumption of about 600mW since DRAM is now clocked at just 528 MHz almost all the time instead of 2112 MHz as before! But the rather conservative upthreshold default means DRAM also stays clocked low under many real loads. Once you adjust this, the trade-off shifts back towards performance. SoCs like RK3588/RK3588s are so complex that we have to deal with trade-offs in various places, though that's not entirely new since DMC and other such governors have been around since 2015/RK3399. The DMC governor's impact on RK3588 benchmarks was the main reason why I checked its influence on a variety of benchmarks so carefully (more on that below). And the results output lists all information about the DMC governor available (the below is from an OPi 5 running with Armbian; the Armbian guys use Radxa's kernel branch for the OPi 5 so DMC is available/adjustable but of course they don't give a sh*t about settings like these – see the sysfs sketch below)
The DRAM clockspeed transition table is also listed when available:
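As a hedged sketch (the sysfs node names are an assumption and differ between Rockchip BSP kernel versions), the governor, current DRAM clock and transition table can be inspected like this:

```
# Inspect the DMC devfreq device on a Rockchip BSP kernel (node name may differ)
cat /sys/class/devfreq/dmc/governor               # e.g. dmc_ondemand or performance
cat /sys/class/devfreq/dmc/cur_freq               # current DRAM clock in Hz
cat /sys/class/devfreq/dmc/available_frequencies  # e.g. 528000000 ... 2112000000
cat /sys/class/devfreq/dmc/trans_stat             # time spent at each DRAM clock
cat /sys/class/devfreq/dmc/upthreshold            # load % above which DRAM clocks up
                                                  # (tunable location varies by kernel)
```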
Once you set it to performance, DRAM stays at the maximum clock. I had the impression within the last months that Rockchip seems to put all the 'strong' silicon into RK3588 while the 'weak' silicon ends up in RK3588s SoCs, where the A76 cores then clock significantly lower than 2400 MHz. That's at least what I thought I observed after checking a bunch of results. But on the aforementioned OPi 5 the PVTM values show a rather 'strong' silicon variant (allowing the A76 cores to clock at ~2315 MHz):
While this OPi 5 shows weaker PVTM values (it's not only silicon variation but also temperature, and Xunlong never thinking about appropriate cooling adds to the problem):
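If you want to check PVTM on your own board, the Rockchip BSP kernel prints the values at boot; a minimal way to pull them out (assuming such a BSP kernel):

```
# PVTM (Process-Voltage-Temperature Monitor) readings logged by the Rockchip BSP kernel;
# they determine how high the big cores are allowed to clock on this particular chip
dmesg | grep -i pvtm
```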
I hope this painfully lengthy explanation backs my point of collecting all 'environmental data' known and available while benchmarking, since only that data provides the answer to 'Why do the numbers look different?!'
@TheRemote please correct me if I'm wrong. Never mind, I read through it.
Why would Orange Pi not implementing the DMC controller driver in the kernel make it faster than the Rock 5B? Why does this technology exist then? Shouldn't we want to just disable it for the Rock 5B if it makes the device perform poorly? Or are you saying the default setting is just so bad that it's actually worse than not having it at all? You don't think the disabling of the Spectre mitigations / unprivileged eBPF is even a factor after that write-up? Does sbc-bench test for this? You should start testing for this. That's known to have a huge performance impact, as I'm sure you're well aware, because I am well aware of it and you are way more in the weeds with this than I am. People still do that all the time on Linux/Windows to get a big performance gain at the cost of security. Why not test both? The best thing for Jeff to do would be to update his kernel and retest so all of the security mitigations match. Next it would probably be to change upthreshold to 25 and retest. I suspect the answer is going to be that both are impacting it. I'll also predict the security settings have a much larger impact, as they are well known to do.
When DMC is not enabled in the kernel config DRAM is clocked at 2112 MHz by default (or whatever highest value is defined in the boot BLOBs; with recent RK3328 BLOBs it's 1056 MHz and with RK3399 it depends on the type of RAM: DDR for whatever reason being clocked slightly higher than LPDDR). Once DMC is enabled the same could be achieved by setting the DMC governor to performance. But the default is dmc_ondemand with a rather conservative upthreshold, so DRAM spends most of its time at the lowest clock. As a result of my testing I came up first with an upthreshold of 25 (a sketch of both adjustments follows below). As for Spectre mitigations... isn't that stuff that requires at least 5.17 or something? RK3588 running with a mainline kernel isn't an option today since most of the HW IP blocks aren't supported yet. And with an insecure 5.10 BSP kernel a few vulnerabilities more or less shouldn't matter anyway...
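A minimal sketch of the two adjustments discussed above, assuming the same /sys/class/devfreq/dmc node as before (the value 25 is taken from this discussion, not a vendor default):

```
# Option 1: pin DRAM at its highest clock for benchmarking
echo performance | sudo tee /sys/class/devfreq/dmc/governor

# Option 2: keep dmc_ondemand but make it ramp DRAM up earlier
# (upthreshold is the memory-load percentage above which the clock increases)
echo 25 | sudo tee /sys/class/devfreq/dmc/upthreshold
```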
Thanks for the explanation on DMC. I can absolutely see why this could happen from your explanation. It does indeed sound like they implemented it so badly it was actually worse than nothing! As for the Spectre mitigations, this is actually user-controllable to an extent. I tried playing with this a little bit. The specific setting that doesn't seem to match between the two (until you update the kernels) is:
This will not exist on the original Rock 5B kernel. It's too old. It will exist on the Orange Pi 5. You can actually toggle this until a reboot with:
Before:
After:
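The exact snippets aren't reproduced above; as an assumption based on the unprivileged-eBPF description, the knob in question would be kernel.unprivileged_bpf_disabled, checked and flipped roughly like this:

```
# Hypothetical reconstruction: unprivileged eBPF control (not confirmed from the original post)
sysctl kernel.unprivileged_bpf_disabled            # 0 = unprivileged eBPF allowed
sudo sysctl -w kernel.unprivileged_bpf_disabled=1  # disable it; a value of 1 sticks until reboot
# Newer kernels also accept 2: disabled as well, but changeable again at runtime
```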
There is actually also a "2" setting in newer kernels but this one is too old to have it. It wouldn't surprise me if that is the 5.17 version that is popping into your head. In these old kernels it's either off or on. In newer kernels there's a 2 which lets certain trusted system processes use it but nothing else. When I used to CPU mine years ago for Monero we would always disable the CPU security mitigations. You could get a good 15-20% boost out of it sometimes for certain types of CPU tasks since it enables a bunch of predictive branch code that, while very fast, has known security risks. It's interesting that this is a user-controllable one (until reboot) that doesn't match. Theoretically you could test disabling this on the Orange Pi 5 to see if just toggling this specific setting would do it (or have some kind of impact). If people want to "cheat" on the benchmarks this is definitely what it would look like. Some of these security mitigations can have dramatic impacts. I'm not sure very many benchmarks check for it. It's possible to disable these in Windows as well; there are patches that shut a bunch of these CPU-expensive protections off there too. There are a lot more settings related to this within Linux as well. I should mention that I don't think this one single Spectre-related setting will give us 15-20%, and that's not how big a gap we have to close to begin with. If you disabled all of them though you could easily hit that. There's a whole bunch of settings. I've definitely done this before on mining rigs to get that performance back when mining used to make sense. Simply booting with 'mitigations=off' on the kernel cmdline will shut off most of them and give you a huge advantage on the benchmarks, as a whole bunch of unsafe tech will be enabled that is shut off for a very good reason. In other words they weren't trying to cheat. This is just pure incompetence. Someone trying to cheat would disable all of them. It's just like you said earlier: it's like they don't even know the technology is in there and they're slowly discovering it and figuring it out throughout these various kernel updates (probably as people open issues and complain). My guess is if I fired one of these up with mitigations=off I'd likely curb stomp the top scores on Geekbench. I don't see any outliers that are outlying enough to have tried this. The effects of disabling this are very real though and have been for a long time. People still do this on all operating systems (inadvisable usually, but maybe they're airgapped machines and the people know what they're doing). It's an option that's available to professionals and I'd think that's exactly where you'd use something like this legitimately: some kind of airgapped system of computers where you know you don't care that they're more vulnerable. It shows just how challenging benchmarking really is. No matter who I hear talking about it or how qualified they are, the reality is always worse. The only reason you don't see a lot more corrupted/cheated benchmarks is because there's not really a financial incentive to do so (and where there is, in places like extreme overclocking, they do get caught cheating sometimes). The normal benchmarks have basically zero protection against any of this though. They aren't even trying and it's not clear to me they even know it's possible. I'm not as close as you are to believing it's actually possible to trust these benchmarks. I think to really do this you would have to be the operating system itself.
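For context, the current mitigation state is easy to inspect, and mitigations=off is a standard kernel parameter; where exactly the cmdline lives differs per board/image (assumption):

```
# Show which CPU vulnerability mitigations the running kernel applies
grep -r . /sys/devices/system/cpu/vulnerabilities/

# Disabling most of them means adding 'mitigations=off' to the kernel command line
# (e.g. the extlinux.conf or vendor boot config used by the image) and rebooting.
cat /proc/cmdline    # verify after reboot whether mitigations=off is active
```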
sbc-bench would need to be a bootable image, but it would be unrealistic to support that for all the different boards. Right now you have to trust that the OS, the user and the manufacturer of the board are not playing you, and none of those three are to be trusted. With a bootable image you'd eliminate two of those three (assuming the user can't do anything inside the OS) and only have the board manufacturer to deal with, and with the OS in your control you would be able to beat them at this game (at least for a few years until they evolved different techniques to cheat, as they always do). I don't have the answer on how to solve this though at all. Pi Benchmarks has a trust element and it could completely break down if people started to try to cheat on it and I would have to fight it. I'm definitely not saying my system is better. I think I'm saying we're all a long way from where we'd like to be on benchmarking accuracy! It's also a cat and mouse game with the manufacturers. The tricks they used on us 10 years ago don't work anymore but they have a whole new set of much more complicated tricks. Even if we catch up it's only temporary. They have a lot more resources and a lot more incentives to keep cutting their own costs and saving money even if it means misleading or dodging the benchmarking methods the community is using at any given time. I predict we're both 100% right about this issue. It's both; I don't know what the proportions will be. It sounds like the clock speeds are very far off here. I will take back my guarantee that the security mitigations will make a larger difference than the DMC issue. I'm legitimately not sure which one will be bigger now that you've explained the DMC issue! It's very possible the security mitigation might only be a few percentage points and the DMC issue / clock speeds may end up explaining twice as much of the difference. That's entirely possible from what you're saying / showing here! Apparently BHB mitigations can have up to a 20% impact on ARM CPUs. This is a hot topic. Linux 6.1 received a switch to disable it. It definitely matters that this setting is being messed with across these images though, big time it looks like. They literally state that the security mitigations for this can specifically impact benchmarks by 20% (also specifically on high-performance ARM CPUs).
I checked out sbc-bench and I actually really like it. I'll try shutting off the security mitigations and testing with your sbc-bench tool a little bit later. It would actually integrate really well with PiBenchmarks (at least some of the tests you've done here). Some of your tests could just be popped right into PiBenchmarks and submitted as an additional field with curl. I think Pi Benchmarks shares a lot of your same philosophy, in that it gathers a ton of data on each machine: lshw, inxi, udev, everything to capture all of this. The script works almost exactly the same way, in that it's a bash script that takes care of everything including the dependencies. The only difference is you seem to do more processing of brands and stuff like that in the script and I do more of that in the backend. The backend isn't released but there's nothing really any different from what you're doing in the bash script happening. It's just parsing a whole bunch of brands / models / manufacturers. If people really wanted to see it I could embarrass myself and release the crappy self-made PHP code that parses it. I guess the valuable code probably is the processing code, but it's basically just a big list. Even then, if people want to see it I would release it, because in reality the valuable part is maintaining it. People upload new ones every day. The site wouldn't work if I let all of those turn into broken images and placeholders like most benchmarking sites. This was exactly how I was planning to expand Pi Benchmarks eventually into other types of tests, but when I did it I wanted to do it the right way. I wanted to do it the way you're doing here: crazy amounts of information so that we can go through and see these differences, especially the weird ones. If you aren't opposed to it I might pick your brain at some point to see about properly integrating some of this kind of testing into Pi Benchmarks (or I might just look at how you did it and credit you and your repository/work for sure). You're definitely doing some good work here in an area I don't really cover right now!
I decided to implement a dedicated mode for this in sbc-bench.
I guess the help text will read then:
More information and an early implementation will follow soon!
@ThomasKaiser - Haha I'm flattered :D Maybe in the future a Red Shirt Jeff mode could be added which attempts to turn off all thermal safeguards and gets the SoC to cook itself!
Here we go. This commit contains early functionality:
Makes use of colors in the terminal and is hopefully GitHub-issues friendly. And it will soon be extended: I want to implement a mode where all of this info is quickly gathered, then all governors are switched to performance.
With the latest commit, detecting other benchmarks that have been fired up works now as well.
To elaborate on why the collected info matters and why adjusting all governors is important... let's start with the latter. On today's CPUs (even MCUs) there are conflicting goals to meet: max performance and min consumption, and obviously you don't get both at the same time. There's stuff like DVFS (dynamic voltage frequency scaling --> ramping up the chip's supply voltages only when needed) or 'race to idle' (e.g. sending storage to deep sleep, waking the storage components every now and then to quickly write stuff, then sending them to deep sleep again) and many more. This stuff needs to be configured, and SoC vendors ship with settings that meet their customers' criteria: a high-performing device with excellent battery life (low overall consumption). These settings are obviously designed around Android use cases, since Linux doesn't exist as a market for the SoCs found on today's SBCs (the exceptions are TI, Renesas, NXP, maybe RPi/Broadcom and a few others). But what works well for a TV box or a tablet where someone watches movies, surfs the Internet or plays a game from time to time is almost never a good fit for a general purpose Linux system. That's where settings matter: are those SoC vendor defaults made for Android use cases a proper fit for benchmarking? Especially if the goal is 'benchmarking the hardware' and not 'benchmarking the SoC vendor's compromises to achieve low power consumption with Android'. Unfortunately the latter is what you're really benchmarking if you stay with the defaults. The following is all Geekbench 5.5 (a software that IMO has many flaws) running on the same Rock 5B with the same software stack:
This is plain defaults vs. all governors set to performance. With the 'latest and greatest' mainline kernel it might get even worse, since at the SoC vendor hardware and software experts worked together, while with mainline it's pure luck which individual out there upstreams which portion of these adjustable mechanisms like DMC. BTW: those governors at defaults do not only affect CPU benchmarks but storage and even network as well; see again this Rock 5B example and there the Phoronix comparison at the bottom (the SQLite test is almost pure random I/O). A minimal sketch of forcing the governors by hand follows below. @TheRemote maybe I missed it but I haven't found any governor settings when looking through the info collected by your script. Default governors do not affect storage benchmarks that much on x86 and with Raspberry Pi OS, but on other SBCs, depending on the kernel version, it can make a difference like night and day.
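As a minimal sketch of pinning everything to performance before a run, assuming the usual cpufreq and devfreq sysfs layout (devfreq devices that don't offer a performance governor will simply reject the write):

```
# Pin all CPU clusters to their highest clock
for g in /sys/devices/system/cpu/cpufreq/policy*/scaling_governor; do
    echo performance | sudo tee "$g"
done

# Do the same for devfreq devices (DMC/DRAM, GPU, NPU ...) where supported
for g in /sys/class/devfreq/*/governor; do
    echo performance | sudo tee "$g" 2>/dev/null
done
```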
@ThomasKaiser inxi captures all of this information (and the information I talked about in my other posts, which was fixed in the kernel by Radxa):
inxi really is very good. I have a lot of other tests that show this as well that aren't shown by default, but they don't need to be since inxi encapsulates all of it and that entire report is shown. The issues I was concerned about have already been confirmed and patched into the kernel by Radxa in another one of Jeff's threads (his Rock 5B issues thread). Not sure about the specific issue you found related to the clock speeds. I don't know if they know about that one, and I haven't confirmed it myself, so I don't really know much about it and didn't push to get that fixed.
@TheRemote unfortunately it does not. It just displays the governors currently in use. And my point was and still is that it's not only about reporting governors but adjusting them prior to benchmarking. At least if the goal is benchmarking the hardware and not just SoC vendor settings made for Android use cases. As explained in my last comment: this here is two times Geekbench on an otherwise identical system. The only difference is DMC settings (even cpufreq governors were already set to performance in both runs). Memory clock is important and ignoring this just introduces a random number generator into 'measurements'. That's why when adding the Stockfish load generator to sbc-bench a while ago I took my time to check the influence of DRAM clock on RK3588 on a variety of benchmarks (benchmarking the benchmark): ThomasKaiser/sbc-bench#55 (comment) And wrt storage benchmarks... as mentioned in the last comment already: not setting at least all cpufreq governors to performance prior to benchmarking skews those numbers as well.
@geerlingguy I think after the latest commit the new mode is usable. The basic idea (for now – later Ansible integration is a different topic) is to execute a benchmark or a set of benchmarks on a freshly booted SBC, save the results, then run sbc-bench in its new mode.
Now the aforementioned benchmarks are executed again (this time generating higher/real scores) while sbc-bench shows potential background and swapping activity in real time. The amount of collected info is now as follows (see also the link with full info):
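As a rough usage sketch (flag semantics may differ by version; -m is sbc-bench's standalone monitoring mode, which the workflow above builds on):

```
# Fetch sbc-bench and run its monitoring mode alongside your own benchmark run
git clone https://github.com/ThomasKaiser/sbc-bench
cd sbc-bench
sudo ./sbc-bench.sh -m     # prints load, clockspeeds, temperature and throttling info live
```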
I added support for NetIO switchable powermeters to sbc-bench a while ago. In case you have something from here https://www.gridconnect.com/collections/netio (the 'PowerCable 101S' is already sufficient) then you could already enjoy precise power monitoring. Those NetIO thingies measure power in two different ways at the same time ('actual' as well as cumulative consumption), and with fluctuating loads like e.g. Geekbench with its way too short execution times only the cumulative mode is able to provide meaningful numbers. There's some more info on this mode, but I don't know whether it makes more sense...
@ThomasKaiser Oh, well that was only one test that captures it. I also do lscpu and cat /proc/cpuinfo:
Then there is also cat /proc/cpuinfo:
Those are just off the top of my head. Some of my other tests likely gather this information. I don't see any changes that need to be made to pibenchmarks.com whatsoever here. We have all of this data if it's needed, but I'm a lot more interested in fixing the boards and working with the manufacturers. I don't know if handing them that mountain of data will really get anything done. I mean the stuff I'm worried about is already fixed. Good luck with this one. Maybe we're just trying to do different things here. You're really, really in the weeds and pibenchmarks.com is more about making things clearer. There's nothing clear to me about what you're doing here. Still. It seems you need to talk to the engineers about this issue. Have you done that yet? What is the point of your investigations if it isn't to resolve these issues? I don't think pibenchmarks is trying to do the same thing as you (and I don't think I want to). You've basically gotten nowhere here. The entire time you've been doing this (I disabled notifications but you keep directly replying so I keep getting them) I was doing my thing and getting these fixed. That's why I do this. Why do you? I'm trying to tell you that this is already a solved issue for me via the benchmarking, and that is why I do this. If none of these changes are going to help you get this issue fixed via Radxa then yes, we are working toward very different goals. If you are trying to make some kind of engineering tool, go for it. PiBenchmarks is a tool for everyone to make comparing these things easier. I'd suggest to you that many tools for this type of work exist already. If the benchmark isn't for ordinary people, or ordinary people can't follow what data you're trying to convey, get fixed, or show there's some kind of issue with, I'd suggest that this is not the way to go about it. From my perspective, if you can't use this tool to make your case to the Radxa engineers (and even regular users) that there is an issue, then it has very little value. Even if you have good data being gathered here it doesn't seem to be condensed/refined enough for people to draw meaningful conclusions from it. I don't feel like I'm speculating here. Here we have two issues that supposedly impact performance on Radxa boards. Mine is fixed. Yours is... nowhere, I believe. You are still trying to argue with me! You haven't even gotten to trying to fix this yet, best I can tell, and mine has been fixed for at least a week now. This board is a closed case for me and just another notch in my belt. I've moved on to other challenges. I really suggest wrapping this one up and moving on to your next challenge as well. It seems like you described the problem weeks ago and that you're stuck. Maybe the tool does need some further refinement though, if it's not making the case that this issue exists and then getting it fixed. It's probably all explained by me not having the same goals as you or understanding what you're trying to do. If your goal is to fix this then everything I've said here makes sense. If your goal is purely technical (which is the impression I'm getting) then there's not really much for us to talk about. My goals are more practical. I've already explained that the more technical you get with the benchmarks the less sense they make. You start to get to problems like every apt patch will invalidate the benchmark. Every firmware update will invalidate the benchmark. These are released all the time. It seems to me you haven't gone all the way down this road in your mind yet.
You haven't looked at where the road you're going down leads, and you've lost sight of the fact that the benchmark is supposed to give data that is meaningful and useful to the user running the benchmark. I'm capable of following this issue and all of the information, but to be honest with you, when I look at it I'm just like, goodness, if we're going to use engineer tools let's use engineer tools, or write a tool that can break this down in a more meaningful/digestible way. I'd imagine it's the same for the Radxa kernel developers. You need to make it more digestible. I did not find your explanations very convincing at all initially because it's just a wall of data. I'd still be interested to find out what they say about it if you ever do try to get this actually fixed. I kind of want to just ask one of them "does this guy know what he's doing"? The answer almost doesn't matter though if you aren't able to use that information to convince other people and eventually get this fixed. Thanks for the suggestions though!
So let's stop here. What I'm currently doing is preparing an sbc-bench update.
Nothing of this has any special relationship with Radxa, Rockchip or the Rock 5B. The Rock 5B is just here to illustrate why 'sane settings' matter, since I went through all the details months ago. And the point I tried to make (obviously to no avail if you think nothing needs to be changed with pibenchmarks.com) is that setting all performance-related governors to performance prior to benchmarking matters. This example illustrates it: https://openbenchmarking.org/result/2208036-NE-2208032NE57
Cpufreq governors matter even with storage benchmarks, and on platforms with dynamic DRAM clocking these other settings of course matter as well. They are the reason for benchmark score differences of up to 200%, same with real-world tasks. That's the whole point: adjust settings first, since otherwise benchmark numbers are partially meaningless as long as the claim is 'benchmarking hardware'. BTW: if your point of posting irrelevant lscpu / /proc/cpuinfo output was to show that governors are captured: they are not in there.
For the time being, I've tacked on some phoronix tests (which are easily accessible to anyone who's tried PTS before, and don't take too long to complete). Still haven't settled on some simple benchmarks for GPU.
Just a few notes/thoughts on the infamous Geekbench, especially wrt RISC-V. These are Geekbench 6 results for the Sophon SG2042 (64 C910 cores): https://browser.geekbench.com/search?utf8=✓&q=rv64imafdcv Single-core results are all pretty much the same, but multi-core the four results at the bottom look even more weird than the rest since they sit between 521 and 760 (or in other words: 3.5 to 5 times the single-core score). What do these insanely low scores at the bottom all have in common? Made with Geekbench 6.2.0 and 6.2.1, while the 'better' scores are all made with 6.2.2, so obviously the Geekbench guy slowly improves his RISC-V coding skills. And even these 'better' multi-core scores are a bad joke since the 'best' is at 1458 or only 10 times the single-core score. This is still a 64-core SoC, and with other benchmarks the single-core to multi-core ratio is for example 1:25 (Geekbench 5) or even 1:37 (7-zip).
Some of the more standard benchmarks I've run include:
I've looked at some other tests, too, like @ThomasKaiser's sbc-bench scripts.
I'd like to have a nice standard suite of benchmarks on a basic level for the main things (like CPU, GPU (maybe), RAM, Disk, and Network). Then it would be nice to have a deeper set if I'm trying to get a better understanding of edge cases.