tests: kernel tests hardfault on nucleo_l073rz #37119

ABOSTM · 2021-07-21T16:23:16Z

Describe the bug
A Hardfault occurs on nucleo_l073rz while executing some tests on an automatic test bench.
Mainly kernel tests, but not exclusively.
The HardFault is easily reproducible under some circumstances (see analysis below).
List of faulty tests

tests/arch/arm/arm_runtime_nmi/arch.interrupt.arm.nmi, nucleo_l073rz
tests/benchmarks/latency_measure/benchmark.kernel.latency.stm32, nucleo_l073rz
tests/kernel/context/kernel.common, nucleo_l073rz
tests/kernel/fifo/fifo_api/kernel.fifo, nucleo_l073rz
tests/kernel/fifo/fifo_timeout/kernel.fifo.timeout, nucleo_l073rz
tests/kernel/interrupt/arch.interrupt, nucleo_l073rz
tests/kernel/lifo/lifo_usage/kernel.lifo.usage, nucleo_l073rz
tests/kernel/mbox/mbox_api/kernel.mailbox.api, nucleo_l073rz
tests/kernel/mem_heap/k_heap_api/kernel.k_heap_api, nucleo_l073rz
tests/kernel/mem_protect/sys_sem/kernel.memory_protection.sys_sem.nouser, nucleo_l073rz
tests/kernel/mem_protect/userspace/kernel.memory_protection.userspace, nucleo_l073rz
tests/kernel/mem_slab/mslab_api/kernel.memory_slabs.api, nucleo_l073rz
tests/kernel/mem_slab/mslab_concept/kernel.memory_slabs.concept, nucleo_l073rz
tests/kernel/msgq/msgq_api/kernel.message_queue, nucleo_l073rz
tests/kernel/msgq/msgq_usage/kernel.message_queue_usage, nucleo_l073rz
tests/kernel/mutex/mutex_api/kernel.mutex, nucleo_l073rz
tests/kernel/mutex/sys_mutex/system.mutex.nouser, nucleo_l073rz
tests/kernel/pending/kernel.objects, nucleo_l073rz
tests/kernel/pipe/pipe/kernel.pipe, nucleo_l073rz
tests/kernel/pipe/pipe_api/kernel.pipe.api, nucleo_l073rz
tests/kernel/profiling/profiling_api/kernel.common.profiling, nucleo_l073rz
tests/kernel/sched/deadline/kernel.scheduler.deadline, nucleo_l073rz
tests/kernel/sleep/kernel.common.timing, nucleo_l073rz
tests/kernel/threads/dynamic_thread/kernel.threads.dynamic, nucleo_l073rz
tests/kernel/threads/thread_init/kernel.threads.init, nucleo_l073rz
tests/kernel/threads/tls/kernel.threads.tls, nucleo_l073rz
tests/kernel/threads/tls/kernel.threads.tls.userspace, nucleo_l073rz
tests/kernel/timer/timer_monotonic/kernel.timer.monotonic, nucleo_l073rz
tests/kernel/workq/work/kernel.work.api, nucleo_l073rz
tests/kernel/workq/work_queue/kernel.workqueue, nucleo_l073rz
tests/lib/time/libraries.libc.time, nucleo_l073rz
tests/subsys/logging/log_core_additional/logging.add.async, nucleo_l073rz
tests/subsys/pm/power_mgmt/subsys.pm.device_pm, nucleo_l073rz

To Reproduce
Steps to reproduce the behavior:

Enable debug and PowerManagement in prj.conf: CONFIG_DEBUG=y and CONFIG_PM=y (to force the bug systematically)
west build -p auto -b nucleo_l073rz tests/kernel/mutex/mutex_api/
west flash
See error

Logs and console output

*** Booting Zephyr OS build zephyr-v2.6.0-1131-ge287639479d7  ***
Running test suite mutex_api
===================================================================
START - test_mutex_lock_unlock
 PASS - test_mutex_lock_unlock in 0.2 seconds
===================================================================
START - test_mutex_reent_lock_forever
access resource from main thread
access resource from main thread
E: ***** HARD FAULT *****
E: r0/a1:  0x2000077c  r1/a2:  0x00000000  r2/a3:  0x00000000
E: r3/a4:  0x00000000 r12/ip:  0x00000000 r14/lr:  0x08007575
E:  xpsr:  0x21000000
E: Faulting instruction address (r15/pc): 0x08003c32
E: >>> ZEPHYR FATAL ERROR 0: CPU exception on CPU 0
E: Current thread: 0x200003e0 (idle 00)
E: Halting system
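
The register dump above can be decoded a little further on the host. The following is a minimal sketch, not Zephyr code: the helper names are hypothetical, and the field layout (T bit at bit 24, exception number in the low 6 bits) is the ARMv6-M xPSR layout assumed here, so it should be checked against the architecture manual.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumed ARMv6-M xPSR layout; verify against the architecture manual. */
#define XPSR_T_BIT     (1u << 24) /* Thumb state bit */
#define XPSR_IPSR_MASK 0x3Fu      /* active exception number field */

/* Hypothetical helper: true if the stacked xPSR shows Thumb state. */
bool xpsr_thumb_state(uint32_t xpsr)
{
    return (xpsr & XPSR_T_BIT) != 0u;
}

/* Hypothetical helper: exception number at stacking time; 0 means thread mode. */
uint32_t xpsr_exception_number(uint32_t xpsr)
{
    return xpsr & XPSR_IPSR_MASK;
}
```

For the logged value 0x21000000, xpsr_thumb_state() returns true and xpsr_exception_number() returns 0, meaning the frame was stacked from thread mode, which is consistent with the reported current thread (idle).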

Environment (please complete the following information):

OS: Linux and Windows
Toolchain: Zephyr SDK
Commit SHA: 3501148


ABOSTM · 2021-07-21T16:51:14Z

Analysis
After some (difficult) analysis, I came to the conclusion that this Hardfault results from a combination of circumstances.
ALL of the following conditions must be true to reproduce this issue:

At least one bit of DBGMCU_CR is set (DBG_STANDBY, DBG_STOP or DBG_SLEEP).
This can happen when flashing with OpenOCD, or when enabling both CONFIG_DEBUG and CONFIG_PM (stm32_power_init() then calls LL_DBGMCU_EnableDBGStopMode()).
It is also possible to write those bits directly in the SoC init for test purposes. Note: Power Management is not a requirement to reproduce this issue.
Those bits prevent HCLK and FCLK from being disabled when the MCU enters Standby, Stop or Sleep mode. This makes it possible to use a debugger in low-power modes.
When those bits are forced to 0, the problem vanishes.
The following single commit is merged: "kernel/idle: Replace stolen IRQ lock"
sha1 39a8f3b4f957ed0e50c848414891ad5fab4500bb (from PR #32848)
Thanks to git bisect, I found that this issue appears after this commit was merged.
When this commit is reverted on the main branch, the problem vanishes.
CONFIG_ZTEST=y
I am not sure this is strictly necessary, but it has a direct or indirect impact:
If I test samples/basic/blinky, I don't reproduce the issue,
but if I convert this blinky sample to use CONFIG_ZTEST=y (with thread, stack, test, ...) then I reproduce the HardFault.
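
To check the first condition, the raw DBGMCU_CR value read through a debugger can be decoded on the host. This is a minimal sketch rather than Zephyr or HAL code: the helper names are hypothetical, and the bit positions (DBG_SLEEP bit 0, DBG_STOP bit 1, DBG_STANDBY bit 2) are assumed from the STM32L0 reference manual and should be verified there.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumed DBGMCU_CR bit positions on STM32L0; verify against the reference manual. */
#define DBG_SLEEP_BIT     (1u << 0)
#define DBG_STOP_BIT      (1u << 1)
#define DBG_STANDBY_BIT   (1u << 2)
#define DBG_LOWPOWER_MASK (DBG_SLEEP_BIT | DBG_STOP_BIT | DBG_STANDBY_BIT)

/* Hypothetical helper: true if any low-power debug bit is set in a raw register value. */
bool dbgmcu_lowpower_debug_enabled(uint32_t dbgmcu_cr)
{
    return (dbgmcu_cr & DBG_LOWPOWER_MASK) != 0u;
}

/* Hypothetical helper: the same value with the three bits forced to 0. */
uint32_t dbgmcu_clear_lowpower_debug(uint32_t dbgmcu_cr)
{
    return dbgmcu_cr & ~DBG_LOWPOWER_MASK;
}
```

Under these assumptions, a register value with only DBG_STOP set (0x2) satisfies the first condition, while a value of 0 does not.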

Hardfault analysis:
Unwinding the Hardfault call stack, I found that the program counter pc=0x08003c32 (the same address reported in the console log) does not point to the start of an instruction but to the middle of one, which causes the HardFault. I could not find out why this pc is misaligned (corrupted stack, ...).
Note that it is always the middle of the same instruction, whatever test is executed.

  	/* Enter low power state */
  	wfi
   8003c2c:	bf30      	wfi

  	/*
  	 * Clear PRIMASK and flush instruction buffer to immediately service
  	 * the wake-up interrupt.
  	 */
  	cpsie	i
   8003c2e:	b662      	cpsie	i
  	isb
   8003c30:	f3bf 8f6f 	isb	sy

  	bx	lr
   8003c34:	4770      	bx	lr
   8003c36:	46c0      	nop			; (mov r8, r8)

This is the asm function "arch_cpu_idle" (arch/arm/core/aarch32/cpu_idle.S).
Note that the pc address is very close to the "cpsie i" instruction, which enables interrupts.
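
The "middle of an instruction" observation can be checked mechanically against the disassembly above. This is a minimal host-side sketch: the table is copied from the listing (addresses and encoding sizes), while the struct and function names are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

/* One entry per instruction in the quoted tail of arch_cpu_idle. */
struct insn {
    uint32_t addr;    /* address of the first halfword */
    uint32_t size;    /* encoding size in bytes (2 or 4 for Thumb) */
    const char *text; /* mnemonic, for reporting only */
};

static const struct insn idle_tail[] = {
    { 0x08003c2cu, 2u, "wfi" },
    { 0x08003c2eu, 2u, "cpsie i" },
    { 0x08003c30u, 4u, "isb sy" },
    { 0x08003c34u, 2u, "bx lr" },
    { 0x08003c36u, 2u, "nop" },
};

/*
 * Return the instruction containing pc, or NULL if pc is outside the table.
 * *offset receives the byte offset of pc inside that instruction;
 * 0 means pc sits on an instruction boundary.
 */
const struct insn *find_insn(uint32_t pc, uint32_t *offset)
{
    for (size_t i = 0; i < sizeof(idle_tail) / sizeof(idle_tail[0]); i++) {
        const struct insn *in = &idle_tail[i];

        if (pc >= in->addr && pc < in->addr + in->size) {
            *offset = pc - in->addr;
            return in;
        }
    }
    return NULL;
}
```

With the faulting address from the log, find_insn(0x08003c32, &offset) returns the "isb sy" entry with offset 2, i.e. the second halfword of the 4-byte encoding that follows "cpsie i".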

I also found that when the asm instruction cpsid i is added
just after the following comment (even though the comment says it is not necessary), the problem vanishes.

	/*
	 * For all the other ARM architectures that do not implement BASEPRI,
	 * PRIMASK is used as the interrupt locking mechanism, and it is not
	 * necessary to set PRIMASK here, as PRIMASK would have already been
	 * set by the caller as part of interrupt locking if necessary
	 * (i.e. if the caller sets _kernel.idle).
	 */
	 cpsid	i

Note that when debugging step by step, I could not reproduce the Hardfault, so it is very difficult to get to the root cause of the issue (maybe due to something linked to interrupt enabling?).

@andyross,
do you have any idea how this issue could be linked to your commit 39a8f3b?
Your commit adds an instruction "cpsie i" (ARMv6-M), which clears PRIMASK and enables interrupts,
so I have currently found 2 workarounds: one is to remove cpsid (revert your commit), the other is to add cpsid (disabling interrupts). Thus either we do not disable interrupts (reverting your patch) or we do disable them; both workarounds revolve around disabling/enabling interrupts.
Could it be that disabling and enabling interrupts must come in pairs, but there is a path in which this is not respected, causing the Hardfault?
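
To make the pairing question concrete, here is a minimal host-side model of PRIMASK handling. It is only a sketch under stated assumptions: all names are hypothetical, the key is a plain bool, and it is not Zephyr's arch_irq_lock()/arch_irq_unlock() implementation. It only illustrates that a key-based lock/unlock pair restores the previous state, whereas the "cpsie i" at the end of arch_cpu_idle unmasks unconditionally.

```c
#include <stdbool.h>

/* Simulated PRIMASK: true means interrupts are masked. Not real hardware state. */
static bool primask;

/* Model of a key-based lock: mask interrupts and return the previous state. */
bool model_irq_lock(void)
{
    bool key = primask;

    primask = true; /* corresponds to "cpsid i" */
    return key;
}

/* Model of the matching unlock: restore the state saved in key. */
void model_irq_unlock(bool key)
{
    primask = key;
}

/* Model of the tail of arch_cpu_idle: "cpsie i" unmasks whatever the nesting. */
void model_cpu_idle(void)
{
    /* wfi would sit here on hardware */
    primask = false; /* corresponds to "cpsie i" */
}

/* Query the simulated mask state. */
bool model_irqs_masked(void)
{
    return primask;
}
```

In this model, nested lock/unlock pairs leave interrupts masked until the outermost unlock, while a lock followed by model_cpu_idle() ends with interrupts unmasked without any matching unlock.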

@ioannisg,
your Cortex M expertise is welcome too.

tagunil · 2021-07-21T19:59:22Z

All of that reminds me of #22078, which was fixed in #23511 for ARMv7-M, but I can't see where exactly 39a8f3b enables interrupts. Is that the right commit?

tagunil · 2021-07-21T20:07:51Z

Looks like there is no arch_irq_lock() in some place where it should be present.

ABOSTM · 2021-07-22T07:44:00Z

@tagunil,

I can't see where exactly 39a8f3b enables interrupts. Is that the right commit?

My bad, arch_irq_lock will disable interrupts (set PRIMASK). I updated my description.

All of that reminds me of #22078, which was fixed in #23511 for ARMv7-M,

Yes, #22078 looks very similar, thanks for pointing this out. The fix #23511
led to the comment I already mentioned for the ARMv6 arch:

	/*
	 * For all the other ARM architectures that do not implement BASEPRI,
	 * PRIMASK is used as the interrupt locking mechanism, and it is not
	 * necessary to set PRIMASK here, as PRIMASK would have already been
	 * set by the caller as part of interrupt locking if necessary
	 * (i.e. if the caller sets _kernel.idle).
	 */

which I don't understand (I don't have enough Zephyr kernel knowledge).

erwango · 2021-07-22T09:23:19Z

@ioannisg I've set the issue to medium, don't hesitate to raise to high if requested

tagunil · 2021-07-22T11:10:11Z

@ABOSTM What I can't understand is why your bisection points to the commit that disables interrupts, while your experiment shows that disabling interrupts by adding "cpsid i" helps.

tagunil · 2021-07-22T12:54:37Z

It could also be related to the idle API fragility discussed in #24255.

ABOSTM · 2021-07-22T13:46:30Z

^^ @stephanosio

ABOSTM · 2021-07-22T13:52:29Z

closed by mistake

erwango · 2021-07-28T08:08:30Z

@ioannisg, @andyross would you have time to answer the questions in this comment: #37119 (comment)?

FRASTM · 2021-08-26T08:50:57Z

Since commit e0bed3b, a similar hardfault occurs when testing the stm32g071rb nucleo board with "test suite timer_api" :

*** Booting Zephyr OS build zephyr-v2.6.0-2072-ge0bed3b989ef  ***           
Running test suite timer_api                                                
===================================================================         
START - test_time_conversions                                               
 PASS - test_time_conversions in 0.189 seconds                              
===================================================================         
START - test_timer_duration_period                                          
E: ***** HARD FAULT *****

This hardfault is definitely linked to USERSPACE and to PR #28231 "Cortex-R MPU support" when applied to Cortex-M0+ devices with an MPU, such as stm32g071 or stm32l073,
especially the first commit "arch: arm: cortex_r: Add MPU and USERSPACE support".
When CONFIG_TEST_USERSPACE=n, the test case tests/kernel/timer/timer_api runs to completion.

ABOSTM · 2021-09-09T15:17:29Z

@FRASTM, the Hardfault on the stm32g071rb nucleo board is not linked to the current issue (see issue #38421).

Enabling DBGMCU bits Sleep/Stop/Standby on STM32L0 causes Hardfault. See zephyrproject-rtos#37119. As a workaround, force those bits to 0. Signed-off-by: Alexandre Bourdiol <alexandre.bourdiol@st.com>

On STM32L0, some hardfaults occur when DBGMCU bits Sleep, Stop or Standby are enabled. See zephyrproject-rtos#37119. For an unclear reason, enabling the DMA clock fixes this issue (similarly to zephyrproject-rtos#38561, the DMA clock comes with DBGMCU bits). Signed-off-by: Alexandre Bourdiol <alexandre.bourdiol@st.com>

ABOSTM added the bug and platform: STM32 labels Jul 21, 2021

erwango added the area: ARM and priority: medium labels Jul 22, 2021

ABOSTM closed this as completed Jul 22, 2021

ABOSTM reopened this Jul 22, 2021

cfriedt assigned erwango Jul 27, 2021

erwango assigned ABOSTM and unassigned erwango Jul 28, 2021

erwango added the area: Kernel label Jul 30, 2021

erwango assigned andyross and unassigned ABOSTM Aug 23, 2021

ABOSTM mentioned this issue Sep 20, 2021

soc: stm32l0: force DBGMCU Sleep/Stop/Standby bits to 0 #38677

Closed

ABOSTM mentioned this issue Sep 20, 2021

soc: stm32l0: enable DMA clock to fix Hardfault linked to DBGMCU bits #38681

Merged

cfriedt closed this as completed in #38681 Sep 21, 2021

ABOSTM mentioned this issue Sep 21, 2021

soc: stl32l0: Enable DMA clock instead of DBGMCU clock #38703

Merged
