Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

problem with CONFIG_STACK_SENTINEL #32261

Closed
StefJar opened this issue Feb 12, 2021 · 30 comments
Closed

problem with CONFIG_STACK_SENTINEL #32261

StefJar opened this issue Feb 12, 2021 · 30 comments
Assignees
Labels
bug The issue is a bug, or the PR is fixing a bug priority: low Low impact/importance bug Waiting for response Waiting for author's response

Comments

@StefJar
Copy link

StefJar commented Feb 12, 2021

After pulling the latest zephyr commit my firmware breaks when a task switch/isr is done - somehow reliably

[00:00:05.859,000] <dbg> BATTERY.battery_thread: start battery sensing
[00:00:05.8[00:00:05.867,000] <err> os: r0/a1:  0x00000002  r1/a2:  0x20004a80  r2/a3:  0xf0f0f0f0
[00:00:05.875,000] <err> os: r3/a4:  0x20004c80 r12/ip:  0x00000020 r14/lr:  0x08019f73
[00:00:05.884,000] <err> os:  xpsr:  0x41000000
[00:00:05.889,000] <err> os: s[ 0]:  0x00000000  s[ 1]:  0x00000000  s[ 2]:  0x00000000  s[ 3]:  0x00000000
[00:00:05.900,000] <err> os: s[ 4]:  0x00000000  s[ 5]:  0x00000000  s[ 6]:  0x00000000  s[ 7]:  0x00000000
[00:00:05.910,000] <err> os: s[ 8]:  0x00000000  s[ 9]:  0x00000000  s[10]:  0x00000000  s[11]:  0x00000000
[00:00:05.920,000] <err> os: s[12]:  0x00000000  s[13]:  0x00000000  s[14]:  0x00000000  s[15]:  0x00000000
[00:00:05.931,000] <err> os: fpscr:  0x00000000
[00:00:05.936,000] <err> os: EXC_RETURN: 0x0
[00:00:05.941,000] <err> os: Faulting instruction address (r15/pc): 0x0801a38a
[00:00:05.949,000] <err> os: >>> ZEPHYR FATAL ERROR 2: Stack overflow on CPU 0
[00:00:05.957,000] <err> os: Current thread: 0x20001678 (batteryThread_id)

checking the map file:

.text.z_impl_k_thread_name_set
                0x000000000801a33c       0x24 kernel/libkernel.a(thread.c.obj)
                0x000000000801a33c                z_impl_k_thread_name_set
 .text.z_check_stack_sentinel
                0x000000000801a360       0x30 kernel/libkernel.a(thread.c.obj)
                0x000000000801a360                z_check_stack_sentinel
 .text.schedule_new_thread
                0x000000000801a390       0x1c kernel/libkernel.a(thread.c.obj)```

reveals that the z_check_stack_sentinel function fails.

Inside the battery thread an ADC(DMA driven) conversion is triggered.

// adc dma isr handler
ISR_DIRECT_DECLARE(DMA2_Stream0_IRQHandler) {
	ISR_DIRECT_HEADER();
	// Check transfer-complete interrupt
	if (LL_DMA_IsEnabledIT_TC(ADC1_DMA, ADC1_DMA_STREAM) && LL_DMA_IsActiveFlag_TC0(ADC1_DMA)) {
		LL_DMA_ClearFlag_TC0(ADC1_DMA);             // Clear half-transfer complete flag
		LL_ADC_Disable(ADC1); // disable adc
		// indicate that we are done
		k_sem_give(&adc1_dma.adcDone);
	}
	ISR_DIRECT_FOOTER(1);
	ISR_DIRECT_PM(); // PM done after servicing interrupt for best latency
	return 1; // We should check if scheduling decision should be made
}

A semaphore is set at the isr. The battery thread is waiting for that semaphore. Guess that mechanism breaks the sentinel.

When disabling CONFIG_STACK_SENTINEL my firmware works perfectly.

Anyone that has the same problem?

My arch STM32F412 + custom board + custom shield

@StefJar StefJar added the bug The issue is a bug, or the PR is fixing a bug label Feb 12, 2021
@erwango erwango added the platform: STM32 ST Micro STM32 label Feb 12, 2021
@erwango erwango self-assigned this Feb 12, 2021
@erwango
Copy link
Member

erwango commented Feb 12, 2021

@StefJar I'd rather move this to question since this rather looks like a question on STACK_SENTINEL behavior for now.

@nashif I let you check how can have a check on this as it rather looks like a kernel question (STM32 code extract is not upstream code)

@erwango erwango removed the platform: STM32 ST Micro STM32 label Feb 12, 2021
@erwango erwango assigned nashif and unassigned erwango Feb 12, 2021
@carlescufi
Copy link
Member

carlescufi commented Feb 12, 2021

@StefJar isn’t this just a stack overflow? Have you tried increasing stack sizes? In particular the battery_thread stack

@StefJar
Copy link
Author

StefJar commented Feb 12, 2021

@StefJar isn’t this just a stack overflow? Have you tried increasing stack sizes? In particular the battery_thread stack

That is what the error is saying. So I increased the stack for the thread from 1024B to 2048B. But the exception still comes up
Is this might related to logging (I have synchronous logging enabled)?

@StefJar
Copy link
Author

StefJar commented Feb 15, 2021

hey ho
looks like threads have different functions
I am a bit confused
there are thread samples with no parameters (https://github.com/zephyrproject-rtos/zephyr/tree/master/samples/basic/threads)
and
there is the API telling about 3 parameters (https://docs.zephyrproject.org/latest/reference/kernel/threads/index.html#c.k_thread_create)

my code works like this

#define MAIN_THREADS_STACK_SIZE (1024)
#define MAIN_THREADS_OPTIONS (K_FP_REGS )
#define THREAD_PRIO_BATTERY 6

void battery_thread(void *dummy1, void *dummy2, void *dummy3);

K_THREAD_DEFINE(batteryThread_id, MAIN_THREADS_STACK_SIZE, battery_thread, NULL, NULL, NULL, THREAD_PRIO_BATTERY, MAIN_THREADS_OPTIONS, 0);

Guess something gets messed up at the thread creation

@StefJar
Copy link
Author

StefJar commented Feb 15, 2021

Looks like this is related to ISRs and thread switching.

in this case
the main thread runs while "BC127" thread starts. The battery voltage is measured so the ADC1 streams the data via DMA2 to the memory.
After the 2nd ISR call the kernel breaks via stack overflow.

[00:00:00.186,000] <[00:00:00.188,000] <dbg> SINK.sink_init: init data sink
[00:00:00.232,000] <dbg> SINK.sink_init: data sink ringbuffer 128kByte, a sink item size 16Bytes, max sink items 8192
[00:00:00.243,000] <dbg> BATTERY.battery_init: battery init
[00:00:00.249,000] <dbg> MAIN.main: usb inserted. enable fast charging
dbg> [00:00:00.256,000] <dbg> ADC1DMA.DMA2_Stream0_IRQHandler_body: adc dma isr
BC127.[00:00:00.264,000] <dbg> ADC1DMA.DMA2_Stream0_IRQHandler_body: adc dma isr
[00:00:00.271,000] <err> os: r0/a1:  0x00000002  r1/a2:  0x00000001  r2/a3:  0xf0f0f0f0
[00:00:00.280,000] <err> os: r3/a4:  0x20006880 r12/ip:  0x00000020 r14/lr:  0x0801a3bb
[00:00:00.289,000] <err> os:  xpsr:  0x41000000
[00:00:00.294,000] <err> os: s[ 0]:  0x00000000  s[ 1]:  0x00000000  s[ 2]:  0x00000000  s[ 3]:  0x00000000
[00:00:00.304,000] <err> os: s[ 4]:  0x00000000  s[ 5]:  0x00000000  s[ 6]:  0x00000000  s[ 7]:  0x00000000
[00:00:00.315,000] <err> os: s[ 8]:  0x00000000  s[ 9]:  0x00000000  s[10]:  0x00000000  s[11]:  0x00000000
[00:00:00.325,000] <err> os: s[12]:  0x00000000  s[13]:  0x00000000  s[14]:  0x00000000  s[15]:  0x00000000
[00:00:00.335,000] <err> os: fpscr:  0x00000000
[00:00:00.341,000] <err> os: EXC_RETURN: 0x0
[00:00:00.345,000] <err> os: Faulting instruction address (r15/pc): 0x0801a812
[00:00:00.353,000] <err> os: >>> ZEPHYR FATAL ERROR 2: Stack overflow on CPU 0
[00:00:00.361,000] <err> os: Current thread: 0x20001e90 (main)

I altered the adc isr functions and the thread functions. The thread function now takes 3 arguments and the adc isr is coded like the Zephyr documentation says:

// adc dma isr handler
ISR_DIRECT_DECLARE(DMA2_Stream0_IRQHandler) {
	// Check transfer-complete interrupt
	if (LL_DMA_IsEnabledIT_TC(ADC1_DMA, ADC1_DMA_STREAM) && LL_DMA_IsActiveFlag_TC0(ADC1_DMA)) {
		LL_DMA_ClearFlag_TC0(ADC1_DMA);             // Clear half-transfer complete flag
		LL_ADC_Disable(ADC1); // disable adc
		// indicate that we are done
		k_sem_give(&adc1_dma.adcDone);
	}
	LOG_DBG("adc dma isr");
	ISR_DIRECT_PM(); // PM done after servicing interrupt for best latency
	return 1; // We should check if scheduling decision should be made
}

@StefJar
Copy link
Author

StefJar commented Feb 28, 2021

I removed all own driver related ISRs and replaced them with the current Zephyr APIs.

I am getting now a "HARD FAULT". This happens when the main thread switches to an other thread. Altering the stack size is not solving that problem.

[00:00:00.232,000] <err> os: ***** HARD FAULT *****

[00:00:00.237,000] <err> os: r0/a1:  0x00000000  r1/a2:  0x00000000  r2/a3:  0x00002408

[00:00:00.246,000] <err> os: r3/a4:  0x00000000 r12/ip:  0x00000000 r14/lr:  0x080034eb

[00:00:00.254,000] <err> os:  xpsr:  0x01000000

[00:00:00.260,000] <err> os: s[ 0]:  0x00000000  s[ 1]:  0x00000000  s[ 2]:  0x00000000  s[ 3]:  0xffffffff

[00:00:00.270,000] <err> os: s[ 4]:  0x00000000  s[ 5]:  0x00000000  s[ 6]:  0x00000000  s[ 7]:  0x00000000

[00:00:00.280,000] <err> os: s[ 8]:  0x00000000  s[ 9]:  0x00000000  s[10]:  0x00000000  s[11]:  0x00000000

[00:00:00.291,000] <err> os: s[12]:  0x00000000  s[13]:  0x00000000  s[14]:  0x00000000  s[15]:  0x00000000

[00:00:00.301,000] <err> os: fpscr:  0x00000000

[00:00:00.306,000] <err> os: r4/v1:  0x00000010  r5/v2:  0x00000000  r6/v3:  0x00000000

[00:00:00.315,000] <err> os: r7/v4:  0x00000000  r8/v5:  0x00000000  r9/v6:  0x00000000

[00:00:00.324,000] <err> os: r10/v7: 0x00000000  r11/v8: 0x00000000    psp:  0x20007830

[00:00:00.332,000] <err> os: EXC_RETURN: 0xfffffffd

[00:00:00.338,000] <err> os: Faulting instruction address (r15/pc): 0x080109d6

[00:00:00.346,000] <err> os: >>> ZEPHYR FATAL ERROR 0: CPU exception on CPU 0

[00:00:00.354,000] <err> os: Current thread: 0x20002688 (main)

@StefJar StefJar mentioned this issue Feb 28, 2021
@nashif nashif added the priority: medium Medium impact/importance bug label Mar 2, 2021
@StefJar
Copy link
Author

StefJar commented Mar 8, 2021

After pulling the last commit 1ce264e my system does start up directly with

[00:00:00.000,000] <err> os: r0/a1:  0x00000002  r1/a2:  0x00000000  r2/a3:  0xf0f0f0f0
[00:00:00.009,000] <err> os: r3/a4:  0x20007180 r12/ip:  0xa0000000 r14/lr:  0x0801a9af
[00:00:00.018,000] <err> os:  xpsr:  0x41000000
[00:00:00.023,000] <err> os: s[ 0]:  0x00000000  s[ 1]:  0x00000000  s[ 2]:  0x00000000  s[ 3]:  0x00000000
[00:00:00.033,000] <err> os: s[ 4]:  0x00000000  s[ 5]:  0x00000000  s[ 6]:  0x00000000  s[ 7]:  0x00000000
[00:00:00.044,000] <err> os: s[ 8]:  0x00000000  s[ 9]:  0x00000000  s[10]:  0x00000000  s[11]:  0x00000000
[00:00:00.054,000] <err> os: s[12]:  0x00000000  s[13]:  0x00000000  s[14]:  0x00000000  s[15]:  0x00000000
[00:00:00.064,000] <err> os: fpscr:  0x00000000
[00:00:00.070,000] <err> os: EXC_RETURN: 0x0
[00:00:00.075,000] <err> os: Faulting instruction address (r15/pc): 0x0801b3be
[00:00:00.082,000] <err> os: >>> ZEPHYR FATAL ERROR 2: Stack overflow on CPU 0
[00:00:00.090,000] <err> os: Current thread: 0x20002698 (main)

I put a break point directly at the start of the main function. When starting the debugger and reseting the MCU this breakpoint is not hit. Means that the kernel somehow starts but when changing to the main thread it fails with an stack overflow exception.

@StefJar
Copy link
Author

StefJar commented Mar 9, 2021

I set up some break points

The hard fault comes at

https://github.com/zephyrproject-rtos/zephyr/blob/master/kernel/init.c#L410

#ifdef CONFIG_MULTITHREADING
	switch_to_main_thread(prepare_multithreading());
#else

I give a screenshot before it fails
Screenshot from 2021-03-09 16-57-47

@nashif nashif assigned dcpleung and unassigned nashif Mar 9, 2021
@nashif
Copy link
Member

nashif commented Mar 9, 2021

@dcpleung can you please take a look? Not sure why it was assigned to me.

@dcpleung
Copy link
Member

dcpleung commented Mar 9, 2021

I don't have a STM32 board with me right now so I cannot debug on hardware.

Here are the stack size kconfigs after build hello_world with board stm32f412g_disco:

CONFIG_MAIN_STACK_SIZE=1024
CONFIG_IDLE_STACK_SIZE=320
CONFIG_ISR_STACK_SIZE=2048
CONFIG_SYSTEM_WORKQUEUE_STACK_SIZE=1024
CONFIG_PRIVILEGED_STACK_SIZE=1024

Could you try doubling these? It's how I usually try to see if the problem is stack overflow.

Also, do you remember which commit on master where your code worked? If not, a good starting point would be the v2.5 release. It helps to narrow down the range of changes.

@StefJar
Copy link
Author

StefJar commented Mar 9, 2021

doubled the stack sizes but still getting the error

Think I found the place where the error happens. Its happen when the system work queue is init.

https://github.com/zephyrproject-rtos/zephyr/blob/master/kernel/system_work_q.c#L30

When the work queue thread is created it get started. At z_swap z_check_stack_sentinel is called and that fails.

I attach a screenshot showing the call stack

Screenshot from 2021-03-09 21-59-58

@dcpleung hope that helps

@dcpleung
Copy link
Member

I got the nucleo_f429zi board but I have not been able to reproduce it with the following apps and CONFIG_STACK_SENTINEL=y:

  • samples/hello_world
  • tests/kernel/sched/schedule_api
  • tests/kernel/threads/thread_apis
  • tests/kernel/workq/work_queue
  • tests/kernel/fatal/exception

Is this issue reproducible with any samples or tests apps?

@StefJar
Copy link
Author

StefJar commented Mar 13, 2021

didn't expected any fails with the CI tests

Maybe the combination of

CONFIG_STACK_USAGE=y
CONFIG_STACK_SENTINEL=y
CONFIG_INIT_STACKS=y
CONFIG_SCHED_MULTIQ=y
CONFIG_WAITQ_SCALABLE=y

causes the problems.

Just in case I am quickly attaching my board defconfig file

# arch
CONFIG_SOC_STM32F412ZG=y
CONFIG_SOC_SERIES_STM32F4X=y
CONFIG_BOARD_KRHD1_V1_4A=y

# fp support
CONFIG_FPU=y
CONFIG_FPU_SHARING=y

# MMU
CONFIG_ARM_MPU=y
# MMU for threads
CONFIG_MPU_STACK_GUARD=y
# thread stack protection done by hw
CONFIG_HW_STACK_PROTECTION=y
CONFIG_MPU_ALLOW_FLASH_WRITE=y

# interrupts
CONFIG_DYNAMIC_INTERRUPTS=y
CONFIG_ZERO_LATENCY_IRQS=y
CONFIG_SW_VECTOR_RELAY=y
CONFIG_CORTEX_M_DWT=y
# CONFIG_MULTI_LEVEL_INTERRUPTS=y
CONFIG_EXTRA_EXCEPTION_INFO=y

# reset stm32f412 regs at startup
CONFIG_INIT_ARCH_HW_AT_BOOT=y

# code offset defined inside the dts
# CONFIG_USE_DT_CODE_PARTITION=y

# kernel
CONFIG_STACK_USAGE=y
CONFIG_STACK_SENTINEL=y
CONFIG_INIT_STACKS=y
CONFIG_BOOT_BANNER=n
CONFIG_SCHED_MULTIQ=y
CONFIG_WAITQ_SCALABLE=y
# CONFIG_KERNEL_LOG_LEVEL_DBG=y

# enable pinmux
CONFIG_PINMUX=y

# DMA
CONFIG_DMA=y

# uart
CONFIG_UART_STM32=y
CONFIG_UART_INTERRUPT_DRIVEN=y
CONFIG_UART_DRV_CMD=y
CONFIG_UART_ASYNC_API=y
CONFIG_UART_LINE_CTRL=y

# rtc
CONFIG_COUNTER=y
CONFIG_COUNTER_RTC_STM32_CLOCK_LSE=y
CONFIG_COUNTER_RTC_STM32_BACKUP_DOMAIN_RESET=n

# pwm
CONFIG_PWM=y
CONFIG_PWM_STM32=y

# GPIO
CONFIG_GPIO=y

# I2C
CONFIG_I2C=y
CONFIG_I2C_STM32=y
CONFIG_I2C_1=y
CONFIG_I2C_STM32_INTERRUPT=y

# SPI
CONFIG_SPI=y
#CONFIG_SPI_1=y
#CONFIG_SPI_1_OP_MODES=1
#CONFIG_SPI_3=y
#CONFIG_SPI_3_OP_MODES=1
CONFIG_SPI_STM32_INTERRUPT=y

# adc
# use own driver
CONFIG_ADC=y

# watchdog
# CONFIG_WATCHDOG=y
# CONFIG_WDT_DISABLE_AT_BOOT=y
# CONFIG_WWDG_STM32=y
# CONFIG_IWDG_STM32_TIMEOUT=1000

# usb
CONFIG_USB=y
CONFIG_USB_DC_STM32=y
CONFIG_USB_DEVICE_STACK=y
CONFIG_USB_DEVICE_VID=0x2FE4
CONFIG_USB_DEVICE_PID=0x1
CONFIG_USB_DEVICE_MANUFACTURER="Kokoon Technology Limited"
CONFIG_USB_DEVICE_PRODUCT="Kokoon Relax2"
CONFIG_USB_DEVICE_SN="KR-HD1-00001"
CONFIG_USB_MAX_POWER=250
CONFIG_USB_CDC_ACM=y
# no vbus signaling - we us it as input
# CONFIG_USB_VBUS_GPIO=y
# CONFIG_USB_VBUS_GPIO_DEV_NAME="GPIOA"
# CONFIG_USB_VBUS_GPIO_PIN_NUM=9
CONFIG_USB_CDC_ACM_LOG_LEVEL_ERR=y
CONFIG_USB_DEVICE_LOG_LEVEL_ERR=y

# led
CONFIG_LED=y

# power saving
# CONFIG_SYS_POWER_MANAGEMENT=y
# save power when idleing
# CONFIG_TICKLESS_IDLE=y
# sleep states
# CONFIG_SYS_POWER_SLEEP_STATES=y
# CONFIG_SYS_POWER_DEEP_SLEEP_STATES=y
# device sleep policy 
# CONFIG_DEVICE_POWER_MANAGEMENT=y

#flash
CONFIG_FLASH=y
CONFIG_SOC_FLASH_STM32=y
# CONFIG_SPI_NOR=y
# CONFIG_SPI_NOR_PAGE_SIZE=256
# CONFIG_SPI_NOR_SECTOR_SIZE=4096

# clock configuration
CONFIG_CLOCK_CONTROL=y
CONFIG_CLOCK_CONTROL_LOG_LEVEL_DBG=y

# Clock configuration for Cube Clock control driver
CONFIG_CLOCK_STM32_HSE_CLOCK=24000000
CONFIG_CLOCK_STM32_SYSCLK_SRC_PLL=y

# use HSE as PLL input
CONFIG_CLOCK_STM32_PLL_SRC_HSE=y

# the board have an external oscillator
CONFIG_CLOCK_STM32_HSE_BYPASS=n

# produce 96MHz clock at PLL output
CONFIG_CLOCK_STM32_PLL_M_DIVISOR=24
CONFIG_CLOCK_STM32_PLL_N_MULTIPLIER=144
CONFIG_CLOCK_STM32_PLL_P_DIVISOR=6
# 48MHz for usb 
CONFIG_CLOCK_STM32_PLL_Q_DIVISOR=3

CONFIG_CLOCK_STM32_AHB_PRESCALER=1
CONFIG_CLOCK_STM32_APB1_PRESCALER=2
CONFIG_CLOCK_STM32_APB2_PRESCALER=1

# take care that we enable coretex clk as sysclk 
CONFIG_CORTEX_M_SYSTICK=y
CONFIG_SYS_CLOCK_HW_CYCLES_PER_SEC=24000000

# serial
CONFIG_SERIAL=y

# console
CONFIG_CONSOLE=y
CONFIG_UART_CONSOLE=y
CONFIG_STDOUT_CONSOLE=y
# CONFIG_UART_CONSOLE_DEBUG_SERVER_HOOKS=y

# log
CONFIG_LOG=y
CONFIG_LOG_BACKEND_SHOW_COLOR=n
# CONFIG_LOG_PROCESS_THREAD=y
# CONFIG_LOG_INPLACE_PROCESS=y
CONFIG_LOG_MODE_IMMEDIATE=y

# CONFIG_LOG_BUFFER_SIZE=4096
# CONFIG_LOG_PROCESS_TRIGGER_THRESHOLD=1

CONFIG_MAIN_STACK_SIZE=2048
# CONFIG_MAIN_STACK_SIZE=4096


CONFIG_RING_BUFFER=y

# enable system reboot functions
CONFIG_REBOOT=y

# CONFIG_REQUIRES_FULL_LIBC=y
CONFIG_NEWLIB_LIBC=y

@dcpleung
Copy link
Member

Building with west build -p -b nucleo_f429zi tests/kernel/threads/thread_apis/ -- -DCONFIG_INIT_STACKS=y -DCONFIG_INIT_ARCH_HW_AT_BOOT=y -DCONFIG_STACK_SENTINEL=y -DCONFIG_STACK_USAGE=y -DCONFIG_FPU=y -DCONFIG_FPU_SHARING=y -DCONFIG_SCHED_MULTIQ=y -DCONFIG_WAITQ_SCALABLE=y and it's not crashing on my board.

One thing I can think of is that, there are enabled interrupts on your board and they are firing before the switch to main thread is completed, thus corrupting the interrupt stack. However, without a way to reproduce it using the available apps in the tree, it is impossible to figure out what is wrong.

@StefJar
Copy link
Author

StefJar commented Mar 22, 2021

Thx @dcpleung for testing.
I am currently on parental leave. I will try to recheck at the beginning of April.

@dcpleung dcpleung added the Waiting for response Waiting for author's response label Mar 24, 2021
@galak
Copy link
Collaborator

galak commented Apr 13, 2021

@StefJar any updates on this.

@StefJar
Copy link
Author

StefJar commented Apr 13, 2021

not yet
I just started to rebuild my project from the scratch to figure out what config crashes. Means I take the 2.5.xx API + rebuild my board dts + proj conf + minimal main.c
From my debugging so far, I can say that the stack sentinel kicks in before the main thread is executed. So I guess it must be some driver ?ISR? that breaks the stack at kernel init.
I keep you posted

@StefJar
Copy link
Author

StefJar commented Apr 21, 2021

update:
I rebuild my project set up with the latest zephyr drivers. Didn't managed to break that.

BUT I managed to get that stack overflow when using a constant string at a LOG_xxx without log_strdup.
Rechecked my code for forgotten log_strdup but didn't found some.

My intension is now shifting towards the logger. In my project I have an own logging back end.
Looks like the logging backend has changed 4 months ago (f1bb20f)
Therefore was no "process" function. So I might need to init that function.
I will check and report back

@StefJar
Copy link
Author

StefJar commented Apr 21, 2021

Looks like it's not logger related. I can now trigger a reliable stack overflow by just setting
CONFIG_INIT_STACKS=y
Looks like the initial stack set up gets corrupted.
My current test set up is custom board (STM32F412) + dts + simple main that logs something. The stack overflow is detected by the sentinel before the main function is entered.

@galak
Copy link
Collaborator

galak commented May 20, 2021

@dcpleung any updates on this?

@dcpleung
Copy link
Member

Tried again but still cannot reproduce the issue:

  • ./scripts/twister -p nucleo_f429zi -T tests/subsys/logging/ --device-testing --device-serial /dev/ttyACM0 -x=CONFIG_INIT_STACKS=y -x=CONFIG_STACK_SENTINEL=y
  • ./scripts/twister -p nucleo_f429zi -T tests/kernel/ --device-testing --device-serial /dev/ttyACM0 -x=CONFIG_INIT_STACKS=y -x=CONFIG_STACK_SENTINEL=y

@nashif nashif added priority: low Low impact/importance bug and removed priority: medium Medium impact/importance bug labels May 25, 2021
@nashif
Copy link
Member

nashif commented May 25, 2021

lowering priority given that it can't be reproduced

@StefJar
Copy link
Author

StefJar commented May 27, 2021

I like to give a short update on that issue
I done some work on my project. So I am currently using for everything the Zephyr drivers. So no own ISRs etc. If I am doing a hard reset VDD off -> on, my firmware runs 2x through a reset because of the stack sentinel. After it goes smooth. As well as a reset from my programmer usually does not triggers this error.
I have problems resolving that message.

[00:00:00.000,266] <err> os: r0/a1:  0x00000002  r1/a2:  0x00000001  r2/a3:  0xf0f0f0f0
[00:00:00.008,789] <err> os: r3/a4:  0x2000e700 r12/ip:  0x00000003 r14/lr:  0x0802229f
[00:00:00.017,454] <err> os:  xpsr:  0x41000000
[00:00:00.022,658] <err> os: s[ 0]:  0x00000000  s[ 1]:  0x00000000  s[ 2]:  0x00000000  s[ 3]:  0x00000000
[00:00:00.033,058] <err> os: s[ 4]:  0x00000000  s[ 5]:  0x00000000  s[ 6]:  0x00000000  s[ 7]:  0x00000000
[00:00:00.043,457] <err> os: s[ 8]:  0x00000000  s[ 9]:  0x00000000  s[10]:  0x00000000  s[11]:  0x00000000
[00:00:00.053,858] <err> os: s[12]:  0x00000000  s[13]:  0x00000000  s[14]:  0x00000000  s[15]:  0x00000000
[00:00:00.064,254] <err> os: fpscr:  0x00000000
[00:00:00.069,454] <err> os: Faulting instruction address (r15/pc): 0x08021b54
[00:00:00.077,404] <err> os: >>> ZEPHYR FATAL ERROR 2: Stack overflow on CPU 0
[00:00:00.085,272] <err> os: Current thread: 0x200035f8 (unknown)

This happens before my main function is called. Any idea how I can more informations on who is crashing the stack sentinel?

@dcpleung
Copy link
Member

zephyr.lst can usually point you to the instruction that's causing the fault with the program counter address (in your case, (r15/pc): 0x08021b54.

If the toolchain emits symbol for a particular thread, you can find it through zephyr.map to see which thread is causing issue. (in your case, Current thread: 0x200035f8).

@ioannisg
Copy link
Member

In theory, stack sentinel should not be used together with HW_STACK_PROTECTION on Cortex-M. It's not needed. I've not checked, though, if these can co-exist smoothly.

@ioannisg
Copy link
Member

Yeap, the line of code could help here.

@ioannisg
Copy link
Member

Btw, I could not reproduce this either, using the same twister call as @dcpleung on nRF52840.
Running this, I noticed that sentinel cannot work with single-threading so I opened #35764

I noticed @StefJar that you're using FPU - that might bring additional stack requirements. I wonder if what you see is an expected stack overflow.

@StefJar
Copy link
Author

StefJar commented Jun 2, 2021

@ioannisg thx for testing and explaining.

My exception are pointing to the stack sentinel code. That is why I ended up open this issue.

For me its good to know that HW_STACK_PROTECTION and CONFIG_STACK_SENTINEL are conflicting. Maybe it would be beneficial to disable CONFIG_STACK_SENTINEL if HW_STACK_PROTECTION is enabled.

the FPU stuff is super confusing. On app level you can give the K_FP_REGS to the thread creation. This macro becomes alive only if CONFIG_FPU_SHARING is set. Detecting a stack overflow at the thread switch is now tricky. Because the sentinel/HW stack protection needs to know the thread config. Currently I am going with the std. setup on that.

When disabling CONFIG_STACK_SENTINEL my code runs my long run test(12h with permanent sensor i/o and bt data streaming).

With the comment from @ioannisg I got a good explanation . I will leave this issue open for comments for 2 days and close it afterwards.

@StefJar
Copy link
Author

StefJar commented Jun 4, 2021

I am closing that issue, because it can not reproduced.
Important lesson learned is that
HW_STACK_PROTECTION
and
STACK_SENTINEL
are not necessary compatible.
Thx to @dcpleung and @ioannisg

@oryjkov
Copy link
Contributor

oryjkov commented Feb 23, 2023

I am closing that issue, because it can not reproduced. Important lesson learned is that HW_STACK_PROTECTION and STACK_SENTINEL are not necessary compatible. Thx to @dcpleung and @ioannisg

I'm able to pretty consistenly reproduce it with one of the Zephyr Bluetooths samples on an NRF52840 dk. See https://devzone.nordicsemi.com/f/nordic-q-a/97002/enabling-config_fpu-causes-stack-overflow

Would have been nice is KConfig disallowed CONFIG_STACK_SENTINEL=y and CONFIG_FPU=y in the presense of hardware stack protection.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug The issue is a bug, or the PR is fixing a bug priority: low Low impact/importance bug Waiting for response Waiting for author's response
Projects
None yet
Development

No branches or pull requests

8 participants