eth: stm32h747i_disco: sem timeout and hang on debug build #29915
Comments
Hey @emillindq, thank you for the great job finding this bug, and another big thanks for figuring out the solution. I can see same behavior on my In my embedded development experience I have already run into the same situation with the stm32h7 SoC series: it was just saving a byte of data into some registers that are not erased on start, which were defined by the system design. The problem was that writing a byte did not trigger the actual write to the corresponding registers until __DSB() was executed. Digging deeper showed that the CPU writes this memory only once it has accumulated 4 bytes (i.e. 32 bits, the bus width of the M7 core). Thus __DSB() helped a lot at that time. I can only approve your solution. Moreover, it appears with It is also known that ST's drivers can behave buggy under special circumstances (@erwango 👀 ). But ST seems to do a good job fixing it continuously. I would not recommend adding Edit: Also putting P.S.: @emillindq If you like, you can read, using the D-Cache as an example, why such things happen. |
Thank you for the encouragement 😃
At first sight it seems like I ran four tests (hint: case 3 & 4 interesting)
With nothing after So now we have a solution (3 or 4) that doesn't involve messing around in the HAL; however it just feels a bit ghetto. I don't know why it suddenly works with a sync barrier before and after function call. Is this good enough? |
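For readers following along, the "sync barrier before and after the function call" variant mentioned above might look roughly like the sketch below. It is an illustration under assumptions, not the code used in the tests: `data_sync_barrier`, `eth_tx_fn`, `eth_tx_with_barriers` and `fake_tx` are hypothetical names, and the transmit call is passed in as a function pointer so the sketch also builds on a host. On the target, the barrier corresponds to the CMSIS `__DSB()` intrinsic.

```c
#include <stdint.h>

/* Hypothetical helper. On ARMv7-M/ARMv7E-M this emits the same instruction
 * as the CMSIS __DSB() intrinsic; elsewhere it falls back to a compiler
 * builtin full fence so the sketch still compiles on a host. */
static inline void data_sync_barrier(void)
{
#if defined(__ARM_ARCH_7M__) || defined(__ARM_ARCH_7EM__)
    __asm__ volatile ("dsb 0xF" ::: "memory");
#else
    __sync_synchronize(); /* host stand-in only, not a real DSB */
#endif
}

/* Hypothetical signature standing in for the HAL transmit call. */
typedef int (*eth_tx_fn)(void *ctx);

/* Run the transmit call with a data synchronization barrier on each side. */
static int eth_tx_with_barriers(eth_tx_fn tx, void *ctx)
{
    int ret;

    data_sync_barrier(); /* complete earlier descriptor/buffer writes */
    ret = tx(ctx);       /* e.g. a thin wrapper around HAL_ETH_Transmit_IT() */
    data_sync_barrier(); /* complete writes made inside the call */

    return ret;
}

/* Host-only stand-in for the transmit call: counts how often it ran. */
static int fake_tx(void *ctx)
{
    *(int *)ctx += 1;
    return 0;
}
```

In a real driver the function pointer would be replaced by the actual HAL call. Whether both barriers are needed is exactly the open question raised in the comment above.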
@emillindq Good idea with the tests. @erwango Would we be able to integrate these tests into Zephyr's CI? Even if solutions 3 and 4 are not optimal, they are better than changing the HAL layer itself. For now, the solution will be good enough. Of course you can write your own functions instead of the ones in ST's HAL (see it here). But, again, you will probably find other bugs, and solutions 3 and 4 are good enough. P.S.: It would also be interesting to declare the HAL_ETH_Transmit_IT() section as a thread-critical region. I currently can't find Zephyr's critical section declaration from my smartphone. |
I ran the |
To answer your last thought, I tried to wrap |
Thank you @emillindq and @Nukersson for figuring all of this out ! Seems like you found out that if the cache data is not in sync with RAM when calling DMA (which is accessed using registers this time, so no caching is done right ?), this can lead to mismatch in configuration data and in the end, Ethernet IP is not set to fire the TX complete interrupt. I never thought that this would be a real problem until now so thanks for the discovery 👍 Also great thanks for the fix ! Will try it out as soon as possible. |
Incomplete memory write causes issues with ethernet transmission on stm32h747_disco_m7 board. TxCpltCallback is never called, causing waiting for tx_int_sem to timeout, and occasionally hangs ethernet communication. Tested with 700 000 connections on dumb http server, as well as 5min continuous MCU reset every 7s with continuous GET requests. Fixes issue zephyrproject-rtos#29915 Signed-off-by: Emil Lindqvist <emil@lindq.gr>
As mentioned in the PR, still issues with |
@emillindq, does the initial fix in ST HAL that you mentioned earlier actually fix the issue in all cases? If there's an actual bug in the HAL driver, fixing it there directly is not excluded. |
@erwango yes adding |
Ok, thanks for confirming.
Well, before merging the PR here, we need to be clear on the strategy vs HAL. So please address the HAL official issue/PR first. |
@emillindq Have you already created a PR in the stm32_hal repo? I can confirm that the __DSB() approach (no matter if in the HAL or in the driver without CONFIG_DEBUG) increases stability with my board and samples. I have observed it with my latest CivetWeb update (currently as PR). |
STMicroelectronics/STM32CubeH7#95 |
Hi! I also ran into this problem and I noticed that the DMA descriptors and RX buffers in |
Absolutely agree with you. Thank you for this notice. If you try to run @reloZid Did your described solution help you? @emillindq Could you change
and run your tests without |
I haven't been through the proposed solution in detail, but we should consider that Zephyr already makes a strong use of MPU, This being said, this is indeed an interesting lead, but it requires a careful study before being applied. |
@Nukersson It doesn't boot with said settings when building for dumb_http_server ( I am on |
This issue has been marked as stale because it has been open (more than) 60 days with no activity. Remove the stale label or add a comment saying that you would like to have the label removed otherwise this issue will automatically be closed in 14 days. Note, that you can always re-open a closed issue at any time. |
I faced this issue today with |
@erwango Can you agree with that? |
I can agree with this. Building dumb http server sample and running |
I just saw this happen to me again with my fix removed and with latest stm32 HAL version, so it seems like it's a lot better but not completely fixed. |
Thanks for the test. The initial description was "we very fast get a semaphore timeout on waiting for transmission complete callback from ST HAL layer". Do I understand correctly that this is no more "very fast" ? Can you give an idea about the new occurrence rate ? |
Fixes zephyrproject-rtos#29915. Implements the memory layout and MPU configuration for Ethernet buffers for STM32H7 controllers as recommended by ST. 16 KB of SRAM3 are reserved for this. The first 256 B are for the RX/TX descriptors and configured as strongly ordered, shareable memory. The rest is for RX/TX buffers and configured as non cacheable memory. This configuration is automatically applied for H7 chips if the SRAM3 memory is enabled in the device tree. Signed-off-by: Mario Jaun <mario.jaun@gmail.com>
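The sizes in this commit message can be checked with a little arithmetic. The sketch below is only an illustration. All macro and function names are hypothetical and not taken from the actual PR, and the base address is left out on purpose because it depends on the SoC variant. The helper encodes the ARMv7-M MPU rule that a region covers 2^(SIZE+1) bytes, with a minimum of 32 bytes.

```c
#include <stdint.h>

/* Figures taken from the commit message; names are hypothetical. */
#define ETH_SRAM_REGION_SIZE   (16u * 1024u) /* 16 KB reserved in SRAM3 */
#define ETH_DESC_REGION_SIZE   256u          /* RX/TX descriptors */
#define ETH_BUF_REGION_SIZE    (ETH_SRAM_REGION_SIZE - ETH_DESC_REGION_SIZE)

/* Offsets relative to the start of the reserved block. */
#define ETH_DESC_REGION_OFFSET 0u
#define ETH_BUF_REGION_OFFSET  ETH_DESC_REGION_SIZE

/* ARMv7-M MPU: a region spans 2^(SIZE+1) bytes, minimum 32 bytes.
 * Returns the SIZE field for a power-of-two byte count, or 0 if the
 * count is not a valid region size. */
static uint32_t mpu_size_field(uint32_t bytes)
{
    uint32_t shift = 0u;

    if (bytes < 32u || (bytes & (bytes - 1u)) != 0u) {
        return 0u;
    }
    while ((1u << shift) < bytes) {
        shift++;
    }
    return shift - 1u;
}
```

With these numbers, 16128 bytes remain for the RX/TX buffers. Because that is not a power of two, it cannot be a single MPU region on its own. One plausible arrangement (an assumption, not confirmed by the commit message) is a 16 KB non-cacheable region with a 256 B strongly ordered region overlaid on its start.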
I must apologize, my tests were not too academic 😇 I must correct myself. I have two hardwares:
h747 h743
With my fix, no problems at all. Once again, this only happens when CONFIG_NO_OPTIMIZATIONS is enabled. With CONFIG_SIZE_OPTIMIZATIONS, no problems. Running Zephyr 05ccdd7 |
suffer from zephyrproject-rtos#29915. All H7 variants have sram2, so lets use that instead. Signed-off-by: Björn Stenberg <bjorn@haxx.se>
PR zephyrproject-rtos#30403 implemented nocache regions for ethernet DMA buffers in sram3. Unfortunately, chip variants STM32H742xx do not have any sram, so they still suffer from zephyrproject-rtos#29915. All H7 variants have sram2 though, so lets use that instead. Signed-off-by: Björn Stenberg <bjorn@haxx.se>
PR zephyrproject-rtos#30403 implemented nocache regions for ethernet DMA buffers in sram3 to fix issue zephyrproject-rtos#29915. Unfortunately, chip variants STM32H742xx do not have any sram, so they still suffer from zephyrproject-rtos#29915. All H7 variants have sram2 though, so lets use that instead. Signed-off-by: Björn Stenberg <bjorn@haxx.se>
PR zephyrproject-rtos#30403 implemented nocache regions for ethernet DMA buffers in sram3 to fix issue zephyrproject-rtos#29915. Unfortunately, chip variants STM32H742xx do not have any sram3, so they still suffer from zephyrproject-rtos#29915. All H7 variants have sram2 though, so lets use that instead. Signed-off-by: Björn Stenberg <bjorn@haxx.se>
PR zephyrproject-rtos#30403 implemented nocache regions for ethernet DMA buffers in sram3 to fix issue zephyrproject-rtos#29915. Unfortunately, some STM32H7 variants do not have any sram3 so they still suffer from zephyrproject-rtos#29915. All H7 variants have sram2 though, so use that for targets without sram3. Signed-off-by: Björn Stenberg <bjorn@haxx.se>
PR #30403 implemented nocache regions for ethernet DMA buffers in sram3 to fix issue #29915. Unfortunately, some STM32H7 variants do not have any sram3 so they still suffer from #29915. All H7 variants have sram2 though, so use that for targets without sram3. Signed-off-by: Björn Stenberg <bjorn@haxx.se>
Describe the bug
When building the dumb http server with CONFIG_DEBUG enabled for the stm32h747i_disco_m7 board, we very quickly get a semaphore timeout while waiting for the transmission complete callback from the ST HAL layer. We can see
Often it hangs and doesn't recover. Build without CONFIG_DEBUG and it works flawlessly. Increasing the semaphore timeout doesn't make any difference.
What have you tried to diagnose or workaround this issue?
With instruction cache disabled, it works flawlessly with CONFIG_DEBUG enabled. I managed to track it down to modules/hal/stm32/stm32cube/stm32h7xx/drivers/src/stm32h7xx_hal_eth.c: 2979. If we enable instruction cache before this line, we get the timeout. If we enable it after, it works. If we insert a data barrier after, it works:
I guess this is a fix, but it's really not that nice to be messing around in ST's HAL, and I'm also wondering if this fix actually fixes a problem we are causing in the driver. We might be doing something wrong in our stm32h7 driver? Comparing with ST samples for STM32H743 it looks correct. I didn't find anybody else having this issue with ST's HAL on stm32h747i MCU.
I've seen some people from ST contributing here, maybe somebody can take a look at this?
Messing with buffer alignments doesn't have any effect either; I tried alignment 256 bytes, with confirmation it was aligned.
Please note that this issue was seen when making this driver as well (#27188 (comment))
To Reproduce
Steps to reproduce the behavior:
/boards/arm/stm32h747i_disco/stm32h747i_disco.dtsi
samples/net/sockets/dumb_http_server/prj.conf
west build -b stm32h747i_disco_m7 zephyr/samples/net/sockets/dumb_http_server/
ab -n 100 -c 1 http://192.0.2.1:8080/
Expected behavior
Ethernet tx complete semaphore should not timeout. Temporary fix proves this is possible.
Impact
Showstopper if it hangs, which it appears to do. We realize it only happens when building for debug, but it's not sustainable to not be able to debug properly. Something's wrong.
Logs and console output
Please note I am using different IPs than the example.
Environment (please complete the following information):
Additional context
Ethernet cable connected straight to computer