Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

tests: kernel tests hardfault on nucleo_l073rz #37119

Closed
ABOSTM opened this issue Jul 21, 2021 · 12 comments · Fixed by #38681 or #38703
Closed

tests: kernel tests hardfault on nucleo_l073rz #37119

ABOSTM opened this issue Jul 21, 2021 · 12 comments · Fixed by #38681 or #38703
Assignees
Labels
area: ARM ARM (32-bit) Architecture area: Kernel bug The issue is a bug, or the PR is fixing a bug platform: STM32 ST Micro STM32 priority: medium Medium impact/importance bug

Comments

@ABOSTM
Copy link
Collaborator

ABOSTM commented Jul 21, 2021

Describe the bug
Hardfault occurs on nucleo_l073rz, while executing some tests on automatic test bench.
Mainly kernel tests, but not exclusively.
HardFault is reproducible easily under some circumstances (see analysis below).
List of faulty tests

tests/arch/arm/arm_runtime_nmi/arch.interrupt.arm.nmi, nucleo_l073rz
tests/benchmarks/latency_measure/benchmark.kernel.latency.stm32, nucleo_l073rz
tests/kernel/context/kernel.common, nucleo_l073rz
tests/kernel/fifo/fifo_api/kernel.fifo, nucleo_l073rz
tests/kernel/fifo/fifo_timeout/kernel.fifo.timeout, nucleo_l073rz
tests/kernel/interrupt/arch.interrupt, nucleo_l073rz
tests/kernel/lifo/lifo_usage/kernel.lifo.usage, nucleo_l073rz
tests/kernel/mbox/mbox_api/kernel.mailbox.api, nucleo_l073rz
tests/kernel/mem_heap/k_heap_api/kernel.k_heap_api, nucleo_l073rz
tests/kernel/mem_protect/sys_sem/kernel.memory_protection.sys_sem.nouser, nucleo_l073rz
tests/kernel/mem_protect/userspace/kernel.memory_protection.userspace, nucleo_l073rz
tests/kernel/mem_slab/mslab_api/kernel.memory_slabs.api, nucleo_l073rz
tests/kernel/mem_slab/mslab_concept/kernel.memory_slabs.concept, nucleo_l073rz
tests/kernel/msgq/msgq_api/kernel.message_queue, nucleo_l073rz
tests/kernel/msgq/msgq_usage/kernel.message_queue_usage, nucleo_l073rz
tests/kernel/mutex/mutex_api/kernel.mutex, nucleo_l073rz
tests/kernel/mutex/sys_mutex/system.mutex.nouser, nucleo_l073rz
tests/kernel/pending/kernel.objects, nucleo_l073rz
tests/kernel/pipe/pipe/kernel.pipe, nucleo_l073rz
tests/kernel/pipe/pipe_api/kernel.pipe.api, nucleo_l073rz
tests/kernel/profiling/profiling_api/kernel.common.profiling, nucleo_l073rz
tests/kernel/sched/deadline/kernel.scheduler.deadline, nucleo_l073rz
tests/kernel/sleep/kernel.common.timing, nucleo_l073rz
tests/kernel/threads/dynamic_thread/kernel.threads.dynamic, nucleo_l073rz
tests/kernel/threads/thread_init/kernel.threads.init, nucleo_l073rz
tests/kernel/threads/tls/kernel.threads.tls, nucleo_l073rz
tests/kernel/threads/tls/kernel.threads.tls.userspace, nucleo_l073rz
tests/kernel/timer/timer_monotonic/kernel.timer.monotonic, nucleo_l073rz
tests/kernel/workq/work/kernel.work.api, nucleo_l073rz
tests/kernel/workq/work_queue/kernel.workqueue, nucleo_l073rz
tests/lib/time/libraries.libc.time, nucleo_l073rz
tests/subsys/logging/log_core_additional/logging.add.async, nucleo_l073rz
tests/subsys/pm/power_mgmt/subsys.pm.device_pm, nucleo_l073rz

To Reproduce
Steps to reproduce the behavior:

  1. Enable debug and PowerManagement in prj.conf: CONFIG_DEBUG=y and CONFIG_PM=y (to force the bug systematically)
  2. west build -p auto -b nucleo_l073rz tests/kernel/mutex/mutex_api/
  3. west flash
  4. See error

Logs and console output

*** Booting Zephyr OS build zephyr-v2.6.0-1131-ge287639479d7  ***
Running test suite mutex_api
===================================================================
START - test_mutex_lock_unlock
 PASS - test_mutex_lock_unlock in 0.2 seconds
===================================================================
START - test_mutex_reent_lock_forever
access resource from main thread
access resource from main thread
E: ***** HARD FAULT *****
E: r0/a1:  0x2000077c  r1/a2:  0x00000000  r2/a3:  0x00000000
E: r3/a4:  0x00000000 r12/ip:  0x00000000 r14/lr:  0x08007575
E:  xpsr:  0x21000000
E: Faulting instruction address (r15/pc): 0x08003c32
E: >>> ZEPHYR FATAL ERROR 0: CPU exception on CPU 0
E: Current thread: 0x200003e0 (idle 00)
E: Halting system

Environment (please complete the following information):

  • OS: Linux and Windows
  • Toolchain Zephyr SDK
  • Commit SHA 3501148
@ABOSTM ABOSTM added bug The issue is a bug, or the PR is fixing a bug platform: STM32 ST Micro STM32 labels Jul 21, 2021
@ABOSTM
Copy link
Collaborator Author

ABOSTM commented Jul 21, 2021

Analysis
After some (difficult) analysis, I came to the conclusion that this Hardfault comes with a conjunction of circumstances:
ALL the following conditions need to be true to reproduce this issue:

  1. at least one bit of the DBGMCU_CR is set (DBG_STANDBY, DBG_STOP or DBG_SLEEP).
    This could happens when flashing with OpenOCD or when enabling both CONFIG_DEBUG and CONFIG_PM (in stm32_power_init(), call to LL_DBGMCU_EnableDBGStopMode() )
    It is also possible to directly write those bitfields in soc init for test purpose. Note: Power Management is not a requirement to reproduce this issue.
    Those bits prevent to disable HCLK and FCLK when MCU is going to Standby, Stop or sleep. This is useful to use a debugger while using lowpower.
    When those bits are forces to 0, the problem vanished.

  2. Following single commit should be merged: "kernel/idle: Replace stolen IRQ lock"
    sha1 39a8f3b4f957ed0e50c848414891ad5fab4500bb (from PR #32848)
    Thanks to git bisect, I found that this issue appears after merge of this commit.
    When reverting this commit on main branch, problem vanished.

  3. CONFIG_ZTEST=y
    I am not sure this is absolutely necessary, but it has direct or indirect impact:
    If I test sample/basic/blinky, I don't reproduce the issue,
    but if I transform this blinky test with CONFIG_ZTEST=y (with thread, stack, test, ...) then I reproduce the HardFault

Hardfault analysis:
Unwinding Hardfault call stack, I found that Program Counter pc=0x08003c32 (same address also provided by console log) is not aligned on an instruction, but in the middle of an instruction ... causing the HardFault. But I could not found why this pc is no aligned (corrputed stack, )
It is to be noticed that, this is always the middle of the same instruction whatever the test executed.

  	/* Enter low power state */
  	wfi
   8003c2c:	bf30      	wfi

  	/*
  	 * Clear PRIMASK and flush instruction buffer to immediately service
  	 * the wake-up interrupt.
  	 */
  	cpsie	i
   8003c2e:	b662      	cpsie	i
  	isb
   8003c30:	f3bf 8f6f 	isb	sy

  	bx	lr
   8003c34:	4770      	bx	lr
   8003c36:	46c0      	nop			; (mov r8, r8)

It is asm function "arch_cpu_idle" (arch/arm/core/aarch32/cpu_idel.s)
Note that pc address is very close to "cpsie i" instruction which will enable Interrupts.

I also found that adding asm instruction: cpsid i
just after this comment (despite it is said not necessary), problem vanished.

	/*
	 * For all the other ARM architectures that do not implement BASEPRI,
	 * PRIMASK is used as the interrupt locking mechanism, and it is not
	 * necessary to set PRIMASK here, as PRIMASK would have already been
	 * set by the caller as part of interrupt locking if necessary
	 * (i.e. if the caller sets _kernel.idle).
	 */
	 cpsid	i

Note when debugging step by step, I could not reproduce the Hardfault, so it is very difficult to get to the root cause of the issue. (maybe due to something linked to interrupt enabling ??)

@andyross,
do you have any idea how this issue could be linked to your commit 39a8f3b ?
Your commit add an a instruction "cpsie i" (ARMV6-M), which clear PRIMASK and enable interrupts
so I currently found 2 workaround, one is to remove cpsid (revert your commit), the other is to add cpsid (disabling interrupt). Thus either we are not disabling interrupt (revert your patch), or we are disabling interrupts, both are around disabling/enabling interrupts.
Would that be possible that disabling/enabling interrupts should come by pair, but there is a path in which this is not respected, causing hardfault ?

@ioannisg,
your Cortex M expertise is welcome too.

@tagunil
Copy link
Collaborator

tagunil commented Jul 21, 2021

All of that reminds me of #22078, which was fixed in #23511 for ARMv7-M, but I can't see where exactly 39a8f3b enables interrupts. Is that the right commit?

@tagunil
Copy link
Collaborator

tagunil commented Jul 21, 2021

Looks like there is no arch_irq_lock() in some place where it should be present.

@ABOSTM
Copy link
Collaborator Author

ABOSTM commented Jul 22, 2021

@tagunil,

I can't see where exactly 39a8f3b enables interrupts. Is that the right commit?.

My bad, arch_irq_lock will disable interrupts (set primask). I updated my description

All of that reminds me of #22078, which was fixed in #23511 for ARMv7-M,

Yes #22078 looks very similar, thanks for point this. The fix #23511
lead to the comment I already mentioned for ARMV6 arch:

	/*
	 * For all the other ARM architectures that do not implement BASEPRI,
	 * PRIMASK is used as the interrupt locking mechanism, and it is not
	 * necessary to set PRIMASK here, as PRIMASK would have already been
	 * set by the caller as part of interrupt locking if necessary
	 * (i.e. if the caller sets _kernel.idle).
	 */

which I don't understand (I don't have enough zephyr kernel knowledge)

@erwango erwango added area: ARM ARM (32-bit) Architecture priority: medium Medium impact/importance bug labels Jul 22, 2021
@erwango
Copy link
Member

erwango commented Jul 22, 2021

@ioannisg I've set the issue to medium, don't hesitate to raise to high if requested

@tagunil
Copy link
Collaborator

tagunil commented Jul 22, 2021

@ABOSTM What I can't understand is why your bisection points to the commit that disables interrupts, while your experiment shows that disabling interrupts by adding "cpsid i" helps.

@tagunil
Copy link
Collaborator

tagunil commented Jul 22, 2021

Also it could be related with idle API fragility discussed in #24255.

@ABOSTM
Copy link
Collaborator Author

ABOSTM commented Jul 22, 2021

^^ @stephanosio

@ABOSTM ABOSTM closed this as completed Jul 22, 2021
@ABOSTM ABOSTM reopened this Jul 22, 2021
@ABOSTM
Copy link
Collaborator Author

ABOSTM commented Jul 22, 2021

closed by mistake

@erwango erwango assigned ABOSTM and unassigned erwango Jul 28, 2021
@erwango
Copy link
Member

erwango commented Jul 28, 2021

@ioannisg, @andyross would you have time answering questions in this comment #37119 (comment) ?

@erwango erwango assigned andyross and unassigned ABOSTM Aug 23, 2021
@FRASTM
Copy link
Collaborator

FRASTM commented Aug 26, 2021

Since commit e0bed3b, a similar hardfault occurs when testing the stm32g071rb nucleo board with "test suite timer_api" :

*** Booting Zephyr OS build zephyr-v2.6.0-2072-ge0bed3b989ef  ***           
Running test suite timer_api                                                
===================================================================         
START - test_time_conversions                                               
 PASS - test_time_conversions in 0.189 seconds                              
===================================================================         
START - test_timer_duration_period                                          
E: ***** HARD FAULT ***** 

This hardfault is definitely linked to the USERSPACE and that PR "Cortex-R MPU support" #28231 applied on a cortex M0+ with MPU devices like stm32g071 or stm32l073
especially the first commit " arch: arm: cortex_r: Add MPU and USERSPACE support "
When CONFIG_TEST_USERSPACE=n the testcase tests/kernel/timer/timer_api can run to its end.

@ABOSTM
Copy link
Collaborator Author

ABOSTM commented Sep 9, 2021

@FRASTM, Hardfault on stm32g071rb nucleo board is not link to the current issue (see issue 38421)

ABOSTM added a commit to ABOSTM/zephyr that referenced this issue Sep 20, 2021
Enabling DBGMCU bits Sleep/Stop/Standby on STM32L0
causes Hardfault. See zephyrproject-rtos#37119
As a workaround, force those bits to 0

Signed-off-by: Alexandre Bourdiol <alexandre.bourdiol@st.com>
ABOSTM added a commit to ABOSTM/zephyr that referenced this issue Sep 21, 2021
On STM32L0, there are some hardfault when DBGMCU bit Sleep, Stop
or Standby are enabled. See zephyrproject-rtos#37119
For unclear reason, enabling DMA clock fixes this issue.
(similarly than zephyrproject-rtos#38561, DMA clock comes with DBGMCU bits)

Signed-off-by: Alexandre Bourdiol <alexandre.bourdiol@st.com>
cfriedt pushed a commit that referenced this issue Sep 21, 2021
On STM32L0, there are some hardfault when DBGMCU bit Sleep, Stop
or Standby are enabled. See #37119
For unclear reason, enabling DMA clock fixes this issue.
(similarly than #38561, DMA clock comes with DBGMCU bits)

Signed-off-by: Alexandre Bourdiol <alexandre.bourdiol@st.com>
Rushybrook pushed a commit to Rushybrook/zephyr that referenced this issue Oct 21, 2021
On STM32L0, there are some hardfault when DBGMCU bit Sleep, Stop
or Standby are enabled. See zephyrproject-rtos#37119
For unclear reason, enabling DMA clock fixes this issue.
(similarly than zephyrproject-rtos#38561, DMA clock comes with DBGMCU bits)

Signed-off-by: Alexandre Bourdiol <alexandre.bourdiol@st.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area: ARM ARM (32-bit) Architecture area: Kernel bug The issue is a bug, or the PR is fixing a bug platform: STM32 ST Micro STM32 priority: medium Medium impact/importance bug
Projects
None yet
5 participants