-
Notifications
You must be signed in to change notification settings - Fork 6.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
ethernet: stm32h7: fix tx semaphore timeout bug #29944
ethernet: stm32h7: fix tx semaphore timeout bug #29944
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the fix ! 👍
I could verify that this fixes the behavior of my NUCLEO_H743ZI:
Using dumb_http_server with Debug enabled:
CONFIG_NET_L2_ETHERNET=y
CONFIG_ETH_STM32_HAL=y
CONFIG_DEBUG=y
But I had to increase my Nucleo sysclk because default 96MHz weren't enough to have the shell thread scheduled. So it always hangs.
Clock config:
CONFIG_CLOCK_STM32_PLL_SRC_HSI=y
CONFIG_SYS_CLOCK_HW_CYCLES_PER_SEC=480000000
CONFIG_CLOCK_STM32_SYSCLK_SRC_PLL=y
CONFIG_CLOCK_STM32_PLL_M_DIVISOR=4
CONFIG_CLOCK_STM32_PLL_N_MULTIPLIER=60
CONFIG_CLOCK_STM32_PLL_P_DIVISOR=2
CONFIG_CLOCK_STM32_PLL_Q_DIVISOR=2
CONFIG_CLOCK_STM32_PLL_R_DIVISOR=2
CONFIG_CLOCK_STM32_D1CPRE=1
CONFIG_CLOCK_STM32_HPRE=2
CONFIG_CLOCK_STM32_D2PPRE1=2
CONFIG_CLOCK_STM32_D2PPRE2=2
CONFIG_CLOCK_STM32_D1PPRE=2
CONFIG_CLOCK_STM32_D3PPRE=2
Benchmarked using:
ab -n 1000 -c 1 http://192.0.2.1:8080/
Without this PR:
*** Booting Zephyr OS build zephyr-v2.4.0-1315-g4c4671a2a21a ***
[00:00:00.006,000] <inf> net_config: Initializing network
[00:00:00.006,000] <inf> net_config: Waiting interface 1 (0x24004c3c) to be up...
uart:~$ Single-threaded dumb HTTP server waits for a connection on port 8080...
[00:00:01.489,000] <err> eth_stm32_hal: Failed to enqueue frame into RX queue: -115
[00:00:01.489,000] <inf> net_config: Interface 1 (0x24004c3c) coming up
[00:00:01.489,000] <inf> net_config: IPv4 address: 192.0.2.1
[00:00:10.467,000] <err> eth_stm32_hal: HAL_ETH_TransmitIT tx_int_sem take timeout
[00:00:10.467,000] <err> eth_stm32_hal: eth packet timeout
4c cc 6a 7f 9d ff 02 80 e1 a6 81 68 08 06 00 01 |L.j..... ...h....
08 00 06 04 00 02 02 80 e1 a6 81 68 c0 00 02 01 |........ ...h....
4c cc 6a 7f 9d ff c0 00 02 02 |L.j..... ..
[00:00:11.079,000] <err> eth_stm32_hal: HAL_ETH_TransmitIT tx_int_sem take timeout
[00:00:11.079,000] <err> eth_stm32_hal: eth packet timeout
4c cc 6a 7f 9d ff 02 80 e1 a6 81 68 08 06 00 01 |L.j..... ...h....
08 00 06 04 00 02 02 80 e1 a6 81 68 c0 00 02 01 |........ ...h....
4c cc 6a 7f 9d ff c0 00 02 02 |L.j..... ..
uart:~$ Connection #0 from 192.0.2.2
Connection from 192.0.2.2 closed
--> Hang after a few more connections, no recovery from the board, and nothing in the shell, and no answer on ethernet.
With this PR, all is good ! I could complete more than 1000 requests with no hang nor timeouts ! Great job 😄
But still, it doesn't work with the default 96MHz clock settings of Nucleo_H743ZI. This is not a problem from this PR though so it doesn't count for approval.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just shorten the comments please. Otherwise LGTM and good testing work!
Incomplete memory write causes issues with ethernet transmission on stm32h747_disco_m7 board. TxCpltCallback is never called, causing waiting for tx_int_sem to timeout, and occationally hangs ethernet communication. Tested with 700 000 connections on dumb http server, as well as 5min continuous MCU reset every 7s with continuous GET requests. Fixes issue zephyrproject-rtos#29915 Signed-off-by: Emil Lindqvist <emil@lindq.gr>
4c4671a
to
8ed12ae
Compare
I was continuing development on this change now and found out there is still problems when building with |
Then I would wait with merge. Does dame semaphore timeout happens? |
Yes it is the same timeout. Should we put it in the HAL or can you suggest another way forward? |
/* tx_buffer is allocated on function stack, we need */ | ||
/* to wait for the transfer to complete */ | ||
/* So it is not freed before the interrupt happens */ | ||
hal_ret = HAL_ETH_Transmit_IT(heth, &tx_config); | ||
__DSB(); | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
__ISB(); |
Can you please put __ISB()
here and try last test wit CONFIG_NO_OPTIMIZATIONS=y
? According to 2.3.4
Software ordering of memory accesses:
DSB
The Data Synchronization Barrier (DSB) instruction ensures that outstanding memory transactions complete before subsequent
instructions execute. See DSB on page 177.
ISB
The Instruction Synchronization Barrier (ISB) ensures that the effect of all completed memory transactions is recognizable by subsequent
instructions. See ISB on page 178.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I tried this and I still get the semaphore timeout.I've tried all combinations of ordering and presence.
My opinion then, there is no need to put it in HAL. I think @erwango would agree. |
As mentioned here yesterday, if it happens to be a real issue in HAL and the only good fix should be done in the HAL, it has to be fixed in the HAL. |
Alright I've tried #29944 (comment) but still no luck. However, as it turns out, enabling So as the issue is resolved for And also, unfortunately, I don't have time to dig further into investigating and creating an issue at STM32CubeH7. |
Maybe @jukkar would be the best to answer this (for current subsystem) |
It looks like if you disable optimizations, the stack usage increases and various network tests/samples start to fail. The stack usage in net would need to be fixed but probably not in this PR as it is a separate issue. @emillindq, could you create a GH issue for this so it is not lost. |
This phenomenon is actually mentioned in several issues, for instance #19340 (comment) and #30028. Sadly the first one has been marked as low-priority. In reality, I've had horrible debugging experience with Honestly I feel like this shouldn't be merged. I'm generally not in favor of ghetto solutions that works in some of the cases with obscure connection to the issue at hand. I don't know why this fix works. I think it's better to keep the codebase clean and address the source of the issue instead (which obviously this PR doesn't do). I'll see if I can port some sample ST ethernet code from H743 to H747 and create an issue/PR in STM32 HAL repo instead. But should I create a PR involving the |
@emillindq Have you already tried __DSB() in stm32h7xx_hal_eth.c: 2979 approach with this PR #29676? |
Being able to reproduce on ST Ethernet code is a good thing but not a prerequisite to raise issues on STM32 HAL repo.
Yes, you can create this PR on hal_stm32. Though, it won't be merged before we have a first feedback on the HAL issue. |
|
Moving to D2 domain SRAM for
and compiling with:
and running:
without __DSB() approach (this PR) produces:
Anyway a positive effect can be seen: The webpage becomes more responsive! |
any update on this? |
I think this is the most feasible solution #30403 I'm acting as a tech lead in my company, and we have simply cloned the HAL, made the HAL change mentioned and then moved on. |
Incomplete memory write causes issues with ethernet
transmission on stm32h747_disco_m7 board.
TxCpltCallback is never called, causing waiting for
tx_int_sem to timeout, and occationally hangs
ethernet communication.
Tested with 700 000 connections on dumb http server,
as well as 5min continuous MCU reset every 7s with
continuous GET requests.
Fixes issue #29915