
Enable icount mode for failure-prone platforms #22904

Merged
14 commits, May 14, 2020

Conversation

wentongwu
Contributor

@wentongwu commented Feb 18, 2020

changes:

  1. Add CPU time to the sanitycheck handling for the QEMU thread.
  2. Enable icount mode for mps2_an385, mps2_an521 and qemu_x86; the icount
    shift value should be selected based on the CPU clock frequency of each
    platform. The virtual CPU will execute one instruction every 2^shift ns
    of virtual time (see the sketch after this list).
  3. Remove CONFIG_QEMU_TICKLESS_WORKAROUND from some timer test cases.
  4. In QEMU icount mode, busy waiting consumes a lot of wall time and it's
    very easy to hit the sanitycheck timeout (such cases pass if sanitycheck
    is given a large enough timeout value), so reduce the test intervals to
    save execution time.
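As a rough illustration of point 2 (my own sketch, not code from this PR): QEMU's -icount shift=N option makes the virtual CPU retire one instruction every 2^N ns of virtual time, so a candidate shift can be derived from a platform's clock frequency. The 25 MHz figure below is an assumption for illustration; compile with -lm.

```c
#include <math.h>
#include <stdio.h>

/* Hypothetical helper, not code from this PR: pick the shift whose
 * virtual instruction period (2^shift ns) best matches the real
 * CPU's clock period (1e9 / freq ns).
 */
static int pick_icount_shift(double cpu_freq_hz)
{
	double instr_period_ns = 1e9 / cpu_freq_hz;

	return (int)lround(log2(instr_period_ns));
}

int main(void)
{
	/* Assumed example: a 25 MHz core clock has a 40 ns period,
	 * and log2(40) rounds to a shift of 5.
	 */
	printf("shift = %d\n", pick_icount_shift(25.0e6));
	return 0;
}
```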

have done:

  • Already tested this PR on Ubuntu 18.04 with sanitycheck running at the same time in 10 different folders, some with the patch applied and others without. The folders without icount enabled showed many failed cases, while the icount-enabled folders showed very few timeouts (and I'm sure those would pass given a large timeout value in the sanitycheck script).

still to address:

  • tests/kernel/mem_protect/stack_random/kernel.memory_protection.stack_random always fails on the qemu_x86 platform. I took a look at rand32_timestamp.c, the file used by the x86 platform, but I'm not familiar with the rdtsc instruction (asm volatile("rdtsc" : "=a" (rv) : : "%edx");). I may need to read the x86 docs and review the QEMU code to see how QEMU handles this on x86; I'd appreciate any help.
  • Seven test cases fail on the qemu_x86_64 platform with the latest commit.
  • A few timeout cases still occur with icount enabled; that may be because CPU time isn't accurate on a heavily loaded machine. We could give sanitycheck a larger timeout value.
  • tests/kernel/context/kernel.common always fails on the qemu_cortex_r5 platform.
  • tests/kernel/mem_slab/mslab_threadsafe/kernel.memory_slabs.threadsafe always times out on qemu_cortex_m0.
  • qemu_nios2 is the only platform that doesn't work well in icount mode.
  • Will test this PR with the same method on Fedora.

Fixes #14173

@zephyrbot added the area: Boards, area: Tests, area: Kernel and area: Sanitycheck labels on Feb 18, 2020
@zephyrbot
Collaborator

zephyrbot commented Feb 18, 2020

All checks are passing now.

Tip: The bot edits this comment instead of posting a new one, so you can check the comment's history to see earlier messages.

@galak
Collaborator

galak commented Feb 18, 2020

Can you make this change to qemu_cortex_r5 as well? We're seeing a number of hangs there too.

@andyross
Contributor

Nice. Did this suddenly start working on recent qemu versions? When struggling with this ~1.5 years back, it was routine that some tests would get into a state where interrupts would stop arriving.

@@ -2,7 +2,6 @@ CONFIG_ZTEST=y
 CONFIG_PRINTK=y
 CONFIG_LOG=y
 CONFIG_POLL=y
-CONFIG_QEMU_TICKLESS_WORKAROUND=y
Contributor

Does this remove this workaround for all tests that use it? If not, it should, and then we should remove the whole feature; it only exists because of timer nondeterminism.

If some things still seem to need it, that implies we have more work to do somewhere.

Contributor Author

@wentongwu Feb 19, 2020

@andyross In this PR I remove CONFIG_QEMU_TICKLESS_WORKAROUND from all test cases, but only three QEMU platforms have icount enabled so far, so other QEMU platforms may still need it.

Contributor

OK, then if it needs to stay we probably want to force it off at the kconfig level somehow so tests can say "give me the qemu tick workaround if the platform needs it" but the platform layer somewhere can say "this configuration is running under qemu with -icount, don't hack the timer driver".

Contributor Author

Will change the patch, thanks.

@andrewboie
Contributor

Thanks for sending this, I will be running some tests today.

If this causes tests to take considerably more time to execute, we might want to consider tagging the time-sensitive tests, and only running those tests with icount, everything else with the current scheme.

@wentongwu
Contributor Author

wentongwu commented Feb 19, 2020

Nice. Did this suddenly start working on recent qemu versions? When struggling with this ~1.5 years back, it was routine that some tests would get into a state where interrupts would stop arriving.

Maybe there have been some fixes for that; since starting this work last summer I haven't seen that kind of issue. At the beginning I tried every icount shift value (0-10) for qemu_x86 and mps2_an385, but most of them still had timing issues with the Zephyr test cases. After a basic review of the QEMU icount code and some gdb debugging, it seems the icount shift value should be selected based on the CPU clock frequency: the virtual CPU and the real CPU (e.g. on mps2_an385) should have the closest possible frequencies.

@wentongwu
Contributor Author

wentongwu commented Feb 19, 2020

Thanks for sending this, I will be running some tests today.

If this causes tests to take considerably more time to execute, we might want to consider tagging the time-sensitive tests, and only running those tests with icount, everything else with the current scheme.

Not every case takes more time to execute, but if a Zephyr test case needs to wait for some time interval (e.g. spin_for_ms), the execution (wall) time will increase because of QEMU icount mode, TCG mode, and some extra code execution.
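For context, a minimal sketch of that distinction (my own illustration, assuming Zephyr's k_busy_wait() and k_sleep() APIs; the helper names are hypothetical):

```c
#include <zephyr.h>

/* Under icount, a busy wait emulates every instruction of the spin
 * loop, so its wall-time cost grows with the interval; a sleep only
 * advances virtual time and stays cheap.
 */
static void spin_for_ms(int ms)
{
	k_busy_wait(ms * USEC_PER_MSEC); /* expensive under icount */
}

static void sleep_for_ms(int ms)
{
	k_sleep(K_MSEC(ms)); /* cheap under icount */
}
```

This is why the PR shrinks the busy-wait intervals rather than the sleep intervals.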
And how is the testing going on your side? @andrewboie

@wentongwu
Contributor Author

Can you make this change to qemu_cortex_r5 as well? We're seeing a number of hangs there too.

will try

@andrewboie
Contributor

tests/kernel/mem_protect/stack_random/kernel.memory_protection.stack_random always fails on the qemu_x86 platform. I took a look at rand32_timestamp.c, the file used by the x86 platform, but I'm not familiar with the rdtsc instruction (asm volatile("rdtsc" : "=a" (rv) : : "%edx");). I may need to read the x86 docs and review the QEMU code to see how QEMU handles this on x86; I'd appreciate any help.

rdtsc just reads the CPU's timestamp counter, placing its result in EDX:EAX. https://c9x.me/x86/html/file_module_x86_id_278.html

It seems the problem here is that for whatever reason we're not getting any entropy. I'm taking a closer look today.
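For reference, a minimal sketch of reading the full 64-bit counter; the one-liner quoted above keeps only EAX (the low 32 bits) and clobbers EDX:

```c
#include <stdint.h>

/* RDTSC returns the low 32 bits of the timestamp counter in EAX
 * and the high 32 bits in EDX; combine them into one value.
 */
static inline uint64_t rdtsc64(void)
{
	uint32_t lo, hi;

	__asm__ volatile("rdtsc" : "=a"(lo), "=d"(hi));
	return ((uint64_t)hi << 32) | lo;
}
```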

@andyross
Contributor

Oooh, that reminds me: in fact one of the problems I was seeing when mucking with this long ago was that the timing sources were wrong in icount mode: the ratio between the HPET timer (used by the timer driver and k_cycle_get_32()) and the TSC was different depending on whether the app was running in icount or host timing mode. So apps that touch the TSC on their own are likely to see different behavior. In this particular case it would just be a different random seed, but it's something to check anyway.

I never did much to investigate; this wasn't as big a deal as the missing interrupts.

@andrewboie
Contributor

andrewboie commented Feb 19, 2020

We need a better way to get random numbers for QEMU tests.

Here's an assortment of "random" values produced by rand32_timestamp.c:

rand? bfe63a0
rand? bff84e0
rand? c00a680
rand? c01c8a0
rand? c02ea80
rand? c040c60
rand? c052e80
rand? c065060
rand? c077280
rand? c089460
rand? c09b640
rand? c0ad820
rand? c0bfa00
rand? c0d1c20
rand? c0e3e00
rand? c0f6020
rand? c108240
rand? c11a420
rand? c12c5c0
rand? c13e760
rand? c150900
rand? c162b20
rand? c174cc0
rand? c186e60
rand? c199000
rand? c1ab220
rand? c1bd3c0
rand? c1cf560
rand? c1e1700
rand? c1f38e0
rand? c205a80
rand? c217c60
rand? c229e00
rand? c23bfe0
rand? c24e180
rand? c260320
rand? c272500
rand? c2846e0
rand? c296880
rand? c2a8a20
rand? c2babc0
rand? c2ccd60
rand? c2def00
rand? c2f10e0
rand? c3032c0
rand? c3154a0
rand? c327640
rand? c3397e0
rand? c34b980
rand? c35db20
rand? c36fcc0
rand? c381e60
rand? c394000
rand? c3a6220
rand? c3b83c0
rand? c3ca560
rand? c3dc700
rand? c3ee8e0
rand? c400a80
rand? c412ca0
rand? c424e40
rand? c436fe0
rand? c449180
rand? c45b320

There is a pretty coarse granularity to the TSC values reported: note that the low 4 bits are always zero and that odd values are absent in bits 5...8. This gets compounded in thread.c's adjust_stack_size(), which uses a modulo operation (in this case, by 64), so the computed fuzz values tend to be uniform. There's actually a bug open on this: #6493.

In other words, the "random" values reported tend to have the low 5 bits zeroed, and we're taking that result modulo 64... the results aren't good.

I'm wondering if it would be better for testing scenarios to use a xoroshiro generator seeded with some predetermined value (see the sketch below); at least we'd get a better distribution of values. Or fix the linked issue and do something besides modulo operations.
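For illustration, a minimal fixed-seed xoroshiro128+ sketch (my own, not Zephyr code; the seed constants are arbitrary nonzero values):

```c
#include <stdint.h>

/* Fixed seed so QEMU runs are reproducible yet well distributed,
 * even when the underlying TSC has coarse granularity.
 */
static uint64_t s[2] = { 0x9E3779B97F4A7C15ULL, 0xBF58476D1CE4E5B9ULL };

static inline uint64_t rotl(uint64_t x, int k)
{
	return (x << k) | (x >> (64 - k));
}

static uint64_t xoroshiro128p_next(void)
{
	uint64_t s0 = s[0];
	uint64_t s1 = s[1];
	uint64_t result = s0 + s1;

	s1 ^= s0;
	s[0] = rotl(s0, 24) ^ s1 ^ (s1 << 16);
	s[1] = rotl(s1, 37);
	return result;
}
```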

However, as a workaround, we can make this pass reliably by increasing CONFIG_STACK_POINTER_RANDOM from the current 64 to 256 in prj.conf; that gives us a lot more bits to work with, and it works reliably for me in that case.

@andrewboie
Contributor

With this trivial patch things are working reliably for me with qemu_x86:

diff --git a/tests/kernel/mem_protect/stack_random/prj.conf b/tests/kernel/mem_protect/stack_random/prj.conf
index ff20000218..98f14d8446 100644
--- a/tests/kernel/mem_protect/stack_random/prj.conf
+++ b/tests/kernel/mem_protect/stack_random/prj.conf
@@ -1,5 +1,5 @@
 CONFIG_ZTEST=y
-CONFIG_STACK_POINTER_RANDOM=64
+CONFIG_STACK_POINTER_RANDOM=256
 CONFIG_ENTROPY_GENERATOR=y
 CONFIG_TEST_RANDOM_GENERATOR=y
 

However, the build for qemu_x86_64 fails on nine tests. The result is thankfully reproducible, but the root cause(s) are not yet clear:

INFO    - JOBS: 48
INFO    - Building initial testcase list...
INFO    - 167 test configurations selected, 578 configurations discarded due to filters.
INFO    - Adding tasks to the queue...
INFO    - Total complete:   42/ 167  25%  skipped:   10, failed:    0
ERROR   - qemu_x86_64               tests/kernel/mem_pool/mem_pool_threadsafe/kernel.memory_pool.threadsafe FAILED: Exited with 2
ERROR   - see: /dev/shm/sanity-out/qemu_x86_64/tests/kernel/mem_pool/mem_pool_threadsafe/kernel.memory_pool.threadsafe/handler.log
INFO    - Total complete:   43/ 167  25%  skipped:   10, failed:    1
ERROR   - qemu_x86_64               tests/kernel/mbox/mbox_usage/kernel.mailbox.usage  FAILED: Failed
ERROR   - see: /dev/shm/sanity-out/qemu_x86_64/tests/kernel/mbox/mbox_usage/kernel.mailbox.usage/handler.log
INFO    - Total complete:   48/ 167  28%  skipped:   11, failed:    2
ERROR   - qemu_x86_64               tests/kernel/interrupt/arch.interrupt              FAILED: Failed
ERROR   - see: /dev/shm/sanity-out/qemu_x86_64/tests/kernel/interrupt/arch.interrupt/handler.log
INFO    - Total complete:   62/ 167  37%  skipped:   11, failed:    3
ERROR   - qemu_x86_64               tests/kernel/lifo/lifo_usage/kernel.lifo.usage     FAILED: Failed
ERROR   - see: /dev/shm/sanity-out/qemu_x86_64/tests/kernel/lifo/lifo_usage/kernel.lifo.usage/handler.log
INFO    - Total complete:   80/ 167  47%  skipped:   13, failed:    4
ERROR   - qemu_x86_64               tests/kernel/mem_protect/userspace/kernel.memory_protection.userspace FAILED: Exited with 2
ERROR   - see: /dev/shm/sanity-out/qemu_x86_64/tests/kernel/mem_protect/userspace/kernel.memory_protection.userspace/handler.log
INFO    - Total complete:   99/ 167  59%  skipped:   17, failed:    5
ERROR   - qemu_x86_64               tests/kernel/spinlock/kernel.multiprocessing.spinlock FAILED: Timeout
ERROR   - see: /dev/shm/sanity-out/qemu_x86_64/tests/kernel/spinlock/kernel.multiprocessing.spinlock/handler.log
INFO    - Total complete:  100/ 167  59%  skipped:   17, failed:    6
ERROR   - qemu_x86_64               samples/portability/cmsis_rtos_v1/timer_synchronization/sample.portability.cmsis_rtos_v1.timer_synchronization FAILED: Timeout
ERROR   - see: /dev/shm/sanity-out/qemu_x86_64/samples/portability/cmsis_rtos_v1/timer_synchronization/sample.portability.cmsis_rtos_v1.timer_synchronization/handler.log
INFO    - Total complete:  101/ 167  60%  skipped:   17, failed:    7
ERROR   - qemu_x86_64               samples/portability/cmsis_rtos_v2/timer_synchronization/sample.portability.cmsis_rtos_v2.timer_synchronization FAILED: Timeout
ERROR   - see: /dev/shm/sanity-out/qemu_x86_64/samples/portability/cmsis_rtos_v2/timer_synchronization/sample.portability.cmsis_rtos_v2.timer_synchronization/handler.log
INFO    - Total complete:  110/ 167  65%  skipped:   18, failed:    8
ERROR   - qemu_x86_64               tests/kernel/smp/kernel.multiprocessing.smp        FAILED: Failed
ERROR   - see: /dev/shm/sanity-out/qemu_x86_64/tests/kernel/smp/kernel.multiprocessing.smp/handler.log
INFO    - Total complete:  167/ 167  100%  skipped:   18, failed:    9
INFO    - 140 of 149 tests passed (93.96%), 9 failed, 18 skipped with 0 warnings in 97.41 seconds
INFO    - In total 756 test cases were executed on 1 out of total 229 platforms (0.44%)

@andrewboie
Contributor

andrewboie commented Feb 19, 2020

If I switch to rand32_timer.c instead of rand32_timestamp.c I get what intuitively looks like a much better distribution:

rand? b2d13aa0
rand? 2a06ec6d
rand? a13c9e44
rand? 18725015
rand? 8fa801eb
rand? 6ddb3c2
rand? 7e13650f
rand? f54916e6
rand? 6c7ec8b7
rand? e3b47a87
rand? 5aea2c57
rand? d21fde28
rand? 49558ff8
rand? c08b41c9
rand? 37c0f39f
rand? aef6a576
rand? 262c5747
rand? 9d620917
rand? 1497baee
rand? 8bcd6cbe
rand? 3031e8f
rand? 7a38cfe2
rand? f16e81b3
rand? 68a43383
rand? dfd9e553
rand? 570f9724
rand? ce4548fb
rand? 457afacb
rand? bcb0ac9b
rand? 33e65e72
rand? ab1c1043
rand? 2251c219
rand? 998773ea
rand? 10bd25ba
rand? 87f2d791
rand? ff288961
rand? 765e3b32

I don't see why rand32_timestamp.c even exists. It's basically a carbon-copy of rand32_timer.c anyway. It works much worse. I'm going to simply delete it. #22951

@andrewboie
Contributor

andrewboie commented Feb 19, 2020

I found some issues in tests/kernel/interrupt which I hopefully resolved here: #22958

The qemu_x86_64 build with this PR passes if I set CONFIG_MP_NUM_CPUS=1, so whatever's causing the failures is at least tangentially SMP-related. I'm continuing to investigate.

One data point: enabling icount disables multi-threaded TCG in QEMU. So instead of separate host threads to emulate each CPU, it does it all in one thread running CPUs round-robin style.

@wentongwu
Contributor Author

wentongwu commented Feb 20, 2020

I don't see why rand32_timestamp.c even exists. It's basically a carbon-copy of rand32_timer.c anyway. It works much worse. I'm going to simply delete it. #22951

Thanks for the analysis and fix!

Contributor

@andrewboie left a comment

LGTM

As a future enhancement, I'd like the "shift" value to also be settable in Kconfig.

boards/arm/qemu_cortex_r5/board.cmake (outdated; resolved)
@wentongwu
Contributor Author

As a future enhancement, I'd like the "shift" value to also be settable in Kconfig.

Thanks, sure, will do.

.shippable.yml (outdated; resolved)
boards/Kconfig (resolved)
wentongwu added 14 commits May 14, 2020 07:56
Remove the existing QEMU icount configuration because icount mode
will be controlled by the Kconfig option QEMU_ICOUNT, so that
unsuitable cases (especially networking cases) can exclude the
icount configuration.

Signed-off-by: Wentong Wu <wentong.wu@intel.com>
This CPU-bound test is very slow on the qemu_riscv32 platform with
QEMU icount mode enabled, taking upwards of several minutes.
There's little value here: this is a unit test of library code,
and we have coverage of the 32-bit RISC-V arch via hifive1.

Signed-off-by: Wentong Wu <wentong.wu@intel.com>
With QEMU icount mode enabled, remove QEMU_TICKLESS_WORKAROUND.

Signed-off-by: Wentong Wu <wentong.wu@intel.com>
In QEMU icount mode, busy waiting consumes a lot of wall time and
it's very easy to hit the sanitycheck timeout (such cases pass if
sanitycheck is given a large enough timeout value), so reduce the
test intervals to save execution time.

Signed-off-by: Wentong Wu <wentong.wu@intel.com>
Sometimes QEMU doesn't handle the SIGTERM signal correctly and just
keeps running as normal; in that case, kill -9 the QEMU process
directly and let sanitycheck judge the test result from the console
output.

For failures caused by other reasons, terminate all related
processes to avoid CI waiting 2 hours and reporting a timeout.

Signed-off-by: Wentong Wu <wentong.wu@intel.com>
Add CPU time tracking to QEMUHandler: the guest virtual time in
QEMU icount mode isn't host time, since it's maintained by counting
guest instructions, so we use the QEMU process's CPU time to
approximate the guest OS time.

Signed-off-by: Wentong Wu <wentong.wu@intel.com>
Enable icount mode for the qemu_x86 platform. The icount shift value
is selected based on the CPU clock frequency of this platform. The
virtual CPU will execute one instruction every 2^shift ns of virtual
time.

Signed-off-by: Wentong Wu <wentong.wu@intel.com>
Enable icount mode for the qemu_cortex_m0 platform. The icount shift
value is selected based on the CPU clock frequency of this platform.
The virtual CPU will execute one instruction every 2^shift ns of
virtual time.

Signed-off-by: Wentong Wu <wentong.wu@intel.com>
Enable icount mode for the qemu_cortex_m3 platform. The icount shift
value is selected based on the CPU clock frequency of this platform.
The virtual CPU will execute one instruction every 2^shift ns of
virtual time.

Signed-off-by: Wentong Wu <wentong.wu@intel.com>
Enable icount mode for the qemu_cortex_a53 platform. The icount
shift value is selected based on the CPU clock frequency of this
platform. The virtual CPU will execute one instruction every
2^shift ns of virtual time.

Signed-off-by: Wentong Wu <wentong.wu@intel.com>
Enable icount mode for the qemu_riscv32 platform. The icount shift
value is selected based on the CPU clock frequency of this platform.
The virtual CPU will execute one instruction every 2^shift ns of
virtual time.

Signed-off-by: Wentong Wu <wentong.wu@intel.com>
Enable icount mode for the qemu_riscv64 platform. The icount shift
value is selected based on the CPU clock frequency of this platform.
The virtual CPU will execute one instruction every 2^shift ns of
virtual time.

Signed-off-by: Wentong Wu <wentong.wu@intel.com>
Enable icount mode for the hifive1 platform. The icount shift value
is selected based on the CPU clock frequency of this platform. The
virtual CPU will execute one instruction every 2^shift ns of virtual
time.

Signed-off-by: Wentong Wu <wentong.wu@intel.com>
Enable icount mode for the qemu_xtensa platform. The icount shift
value is selected based on the CPU clock frequency of this platform.
The virtual CPU will execute one instruction every 2^shift ns of
virtual time.

Signed-off-by: Wentong Wu <wentong.wu@intel.com>
@nashif nashif added this to the v2.3.0 milestone May 14, 2020
@carlescufi carlescufi merged commit 845abb0 into zephyrproject-rtos:master May 14, 2020
@nordic-krch
Contributor

Could it be that qemu_arc_hs and qemu_arc_em need the same fix? I see some tests failing on those platforms due to timing issues.

@stephanosio
Member

Could it be that qemu_arc_hs and qemu_arc_em need the same fix? I see some tests failing on those platforms due to timing issues.

@nordic-krch There is actually an issue for that: #26227.
