Enable icount mode for failure-prone platforms #22904
Conversation
All checks are passing now. Tip: The bot edits this comment instead of posting a new one, so you can check the comment's history to see earlier messages.
Can you make this change to qemu_cortex_r5 as well? We see a number of hangs there too.
Nice. Did this suddenly start working on recent qemu versions? When struggling with this ~1.5 years back, it was routine that some tests would get into a state where interrupts would stop arriving.
@@ -2,7 +2,6 @@ CONFIG_ZTEST=y
 CONFIG_PRINTK=y
 CONFIG_LOG=y
 CONFIG_POLL=y
-CONFIG_QEMU_TICKLESS_WORKAROUND=y
Does this remove this workaround for all tests that use it? If not, it should, and we should remove the whole feature. It only exists because of timer nondeterminism.
If not, and some things still seem to need it, it implies we have some more work to do somewhere.
@andyross In this PR I remove CONFIG_QEMU_TICKLESS_WORKAROUND from all test cases, but only three QEMU platforms have icount enabled so far, so other QEMU platforms may still need it.
OK, then if it needs to stay we probably want to force it off at the kconfig level somehow so tests can say "give me the qemu tick workaround if the platform needs it" but the platform layer somewhere can say "this configuration is running under qemu with -icount, don't hack the timer driver".
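Something along these lines could work; a rough Kconfig sketch (QEMU_ICOUNT is the symbol this PR introduces, but the dependency wiring here is only a guess):

    config QEMU_TICKLESS_WORKAROUND
        bool "Work around QEMU timer nondeterminism"
        depends on !QEMU_ICOUNT
        help
          Hack the timer driver to tolerate QEMU's nondeterministic
          timing. Automatically unavailable when the platform runs
          QEMU with -icount, where guest time is deterministic.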
Will change the patch, thanks
Thanks for sending this, I will be running some tests today. If this causes tests to take considerably more time to execute, we might want to consider tagging the time-sensitive tests and only running those tests with icount, everything else with the current scheme.
Maybe there has been some fix for that; I started this work last summer and didn't see that kind of issue. At the beginning I tried every icount shift value (0-10) for qemu_x86 and mps2_an385, but most of them still had timing issues in the Zephyr test cases. So after a basic review of the QEMU code (the icount part) and some gdb debugging, I think the icount shift value should be selected based on CPU clock frequency: the virtual CPU and the real CPU (e.g. on mps2_an385) should have the closest possible frequencies.
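To make that concrete (my own illustration, not code from the patch): icount retires one instruction per 2^shift ns of virtual time, so the shift that best matches a platform is roughly log2 of its real instruction period in nanoseconds:

    #include <math.h>

    /* Illustration: pick the icount shift whose 2^shift ns instruction
     * period is closest to the target CPU's real period. For example, a
     * 25 MHz CPU has a 40 ns period; the nearest power of two is
     * 32 ns = 2^5, giving shift = 5.
     */
    unsigned int icount_shift(double cpu_hz)
    {
        return (unsigned int)lround(log2(1e9 / cpu_hz));
    }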
Not every case will take more time to execute, but if a Zephyr test case needs to wait out a time interval (e.g. spin_for_ms), the execution time (wall time) will increase because of QEMU icount mode, TCG mode, and some extra code execution.
Will try.
rdtsc just reads the CPU timestamp counter, placing its result in EDX:EAX. https://c9x.me/x86/html/file_module_x86_id_278.html It seems the problem here is that, for whatever reason, we're not getting any entropy. I'm taking a closer look today.
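For reference, reading the TSC from C looks like this (the standard x86 pattern, not necessarily the exact Zephyr driver code):

    #include <stdint.h>

    static inline uint64_t rdtsc(void)
    {
        uint32_t lo, hi;

        /* RDTSC places the 64-bit timestamp counter in EDX:EAX */
        __asm__ volatile("rdtsc" : "=a"(lo), "=d"(hi));
        return ((uint64_t)hi << 32) | lo;
    }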
Oooh, that reminds me: in fact one of the problems I was seeing when mucking with this long ago was that the timing sources were wrong in icount mode: the ratio between the HPET timer (used by the timer driver and k_cycle_get_32()) and the TSC was different depending on whether the app was running in icount or host timing mode. So apps that touch the TSC on their own are likely to see different behavior. In this particular case it would just be a different random seed, but it's something to check anyway. I never did much to investigate; this wasn't as big a deal as the missing interrupts.
We need a better way to get random numbers for QEMU tests. Here's an assortment of "random" values produced by rand32_timestamp.c:
There is a pretty coarse granularity to the TSC values reported: note that the low 4 bits are always zero, and odd values are missing in bits 5...8. This gets compounded in thread.c's adjust_stack_size(), which uses a modulo operation (in this case, by 64); the computed fuzz values tend to all come out the same. There's actually a bug open on this: #6493.

In other words, the "random" values reported tend to have the low 5 bits zeroed, and we're taking the result modulo 64... the results aren't good. I'm wondering if it would be better for testing scenarios to use a xoroshiro generator seeded with some predetermined value; at least we'd get a better distribution of values. Or fix the linked issue and do something besides modulo operations.

However, as a workaround, we can make this pass reliably by increasing CONFIG_STACK_POINTER_RANDOM from the current 64 to 256 in prj.conf: that gives a lot more bits to work with, and it works reliably for me.
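For what it's worth, the seeded-generator idea is tiny. A minimal sketch using xoroshiro128+ with a fixed seed (illustration only, not Zephyr code; the function name and seed constants are made up):

    #include <stdint.h>

    /* Fixed, arbitrary nonzero seed so test runs are reproducible. */
    static uint64_t s[2] = { 0x9E3779B97F4A7C15ULL, 0xBF58476D1CE4E5B9ULL };

    static inline uint64_t rotl(uint64_t x, int k)
    {
        return (x << k) | (x >> (64 - k));
    }

    /* One xoroshiro128+ step; return the high half, whose bits are
     * better distributed than the low bits -- and far better than a
     * TSC value whose low bits are all zero.
     */
    uint32_t test_rand32(void)
    {
        uint64_t s0 = s[0], s1 = s[1];
        uint64_t result = s0 + s1;

        s1 ^= s0;
        s[0] = rotl(s0, 24) ^ s1 ^ (s1 << 16);
        s[1] = rotl(s1, 37);
        return (uint32_t)(result >> 32);
    }

Any fixed nonzero seed works; the point is just reproducible, well-mixed bits rather than timestamp-derived ones.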
With this trivial patch things are working reliably for me with qemu_x86:
However, the build for qemu_x86_64 fails on nine tests. The result is thankfully reproducible, but the root causes are not yet clear:
If I switch to rand32_timer.c instead of rand32_timestamp.c I get what intuitively looks like a much better distribution:
I don't see why rand32_timestamp.c even exists. It's basically a carbon copy of rand32_timer.c anyway, and it works much worse. I'm going to simply delete it: #22951
I found some issues in tests/kernel/interrupt which I hopefully resolved here: #22958. The qemu_x86_64 build with this PR passes if I set CONFIG_MP_NUM_CPUS=1, so whatever's causing the failures is at least tangentially SMP-related. I'm continuing to investigate. One data point: enabling icount disables multi-threaded TCG in QEMU, so instead of separate host threads emulating each CPU, it does it all in one thread, running the CPUs round-robin.
Thanks for the analysis and fix!
091340e to 4a8aff3
LGTM
As a future enhancement, I'd like the "shift" value to also be settable in Kconfig.
Thanks, sure, will do.
Remove the existing QEMU icount configuration, because icount mode will be controlled by the Kconfig option QEMU_ICOUNT, so that unsuitable cases (especially networking cases) can exclude the icount configuration. Signed-off-by: Wentong Wu <wentong.wu@intel.com>
This CPU-bound test is very slow on the qemu_riscv32 platform when QEMU icount mode is enabled, taking upwards of several minutes. There's little value here: this is a unit test of library code, and we have coverage of the 32-bit RISC-V arch via hifive1. Signed-off-by: Wentong Wu <wentong.wu@intel.com>
QEMU icount mode is now enabled, so remove QEMU_TICKLESS_WORKAROUND. Signed-off-by: Wentong Wu <wentong.wu@intel.com>
In QEMU icount mode, busy waiting burns a lot of wall time and it's very easy to hit the sanitycheck timeout (this case passes if sanitycheck is given a large enough timeout value), so reduce the test interval to save execution time. Signed-off-by: Wentong Wu <wentong.wu@intel.com>
Sometimes QEMU doesn't handle the SIGTERM signal correctly and just keeps running as normal; in that case, kill -9 the QEMU process directly and let sanitycheck judge the test result from the console output. For failures caused by other reasons, terminate all related processes so that CI doesn't wait 2 hours and report a timeout. Signed-off-by: Wentong Wu <wentong.wu@intel.com>
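The escalation described here is the usual terminate-then-kill pattern. A minimal sketch of the logic (sanitycheck itself is Python; this helper is hypothetical):

    #include <signal.h>
    #include <sys/types.h>
    #include <sys/wait.h>
    #include <unistd.h>

    /* Ask QEMU to exit via SIGTERM; if it's still around after a short
     * grace period, fall back to SIGKILL (the "kill -9" case above) and
     * let the harness judge the result from console output.
     */
    static void stop_qemu(pid_t pid)
    {
        kill(pid, SIGTERM);
        for (int i = 0; i < 10; i++) {
            if (waitpid(pid, NULL, WNOHANG) == pid)
                return;                 /* exited cleanly */
            usleep(100 * 1000);         /* ~1 s total grace period */
        }
        kill(pid, SIGKILL);
        waitpid(pid, NULL, 0);
    }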
Add CPU time handling to QEMUHandler: guest virtual time in QEMU icount mode isn't host time (it's maintained by counting guest instructions), so we use the QEMU process's CPU time to approximate the time of the guest OS. Signed-off-by: Wentong Wu <wentong.wu@intel.com>
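One plausible way to sample a live process's CPU time on Linux is /proc/<pid>/stat; a sketch under that assumption (again, sanitycheck is Python and this helper is hypothetical):

    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>

    /* utime and stime (in clock ticks) are the 14th and 15th fields of
     * /proc/<pid>/stat; scanning restarts after the last ')' so a comm
     * field containing spaces can't shift the fields.
     */
    double process_cpu_seconds(pid_t pid)
    {
        char path[64], line[512];
        unsigned long utime, stime;

        snprintf(path, sizeof(path), "/proc/%d/stat", (int)pid);
        FILE *f = fopen(path, "r");
        if (!f)
            return -1.0;
        if (!fgets(line, sizeof(line), f)) {
            fclose(f);
            return -1.0;
        }
        fclose(f);

        const char *p = strrchr(line, ')');
        if (!p || sscanf(p + 2,
                         "%*c %*d %*d %*d %*d %*d %*u %*u %*u %*u %*u %lu %lu",
                         &utime, &stime) != 2)
            return -1.0;
        return (double)(utime + stime) / sysconf(_SC_CLK_TCK);
    }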
Enable icount mode for the qemu_x86 platform. The icount shift value is selected based on the CPU clock frequency of this platform; the virtual CPU will execute one instruction every 2^shift ns of virtual time. Signed-off-by: Wentong Wu <wentong.wu@intel.com>
Enable icount mode for the qemu_cortex_m0 platform. The icount shift value is selected based on the CPU clock frequency of this platform; the virtual CPU will execute one instruction every 2^shift ns of virtual time. Signed-off-by: Wentong Wu <wentong.wu@intel.com>
Enable icount mode for the qemu_cortex_m3 platform. The icount shift value is selected based on the CPU clock frequency of this platform; the virtual CPU will execute one instruction every 2^shift ns of virtual time. Signed-off-by: Wentong Wu <wentong.wu@intel.com>
Enable icount mode for the qemu_cortex_a53 platform. The icount shift value is selected based on the CPU clock frequency of this platform; the virtual CPU will execute one instruction every 2^shift ns of virtual time. Signed-off-by: Wentong Wu <wentong.wu@intel.com>
Enable icount mode for the qemu_riscv32 platform. The icount shift value is selected based on the CPU clock frequency of this platform; the virtual CPU will execute one instruction every 2^shift ns of virtual time. Signed-off-by: Wentong Wu <wentong.wu@intel.com>
Enable icount mode for the qemu_riscv64 platform. The icount shift value is selected based on the CPU clock frequency of this platform; the virtual CPU will execute one instruction every 2^shift ns of virtual time. Signed-off-by: Wentong Wu <wentong.wu@intel.com>
Enable icount mode for the hifive1 platform. The icount shift value is selected based on the CPU clock frequency of this platform; the virtual CPU will execute one instruction every 2^shift ns of virtual time. Signed-off-by: Wentong Wu <wentong.wu@intel.com>
Enable icount mode for the qemu_xtensa platform. The icount shift value is selected based on the CPU clock frequency of this platform; the virtual CPU will execute one instruction every 2^shift ns of virtual time. Signed-off-by: Wentong Wu <wentong.wu@intel.com>
Could it be that
@nordic-krch There is actually an issue for that: #26227.
changes:
- The icount shift value should be selected based on the CPU clock frequency of every platform. The virtual CPU will execute one instruction every 2^shift ns of virtual time.
- In QEMU icount mode, busy waiting makes it very easy to hit the sanitycheck timeout (such cases pass if sanitycheck is given a large enough timeout value), so reduce the test interval to save execution time.
have done:
need address:
Fixes #14173