ethernet: stm32h7: fix tx semaphore timeout bug #29944

emillindq · 2020-11-11T14:45:23Z

Incomplete memory write causes issues with ethernet
transmission on stm32h747_disco_m7 board.
TxCpltCallback is never called, causing waiting for
tx_int_sem to timeout, and occationally hangs
ethernet communication.

Tested with 700 000 connections on dumb http server,
as well as 5min continuous MCU reset every 7s with
continuous GET requests.

Fixes issue #29915

lochej

Thanks for the fix ! 👍

I could verify that this fixes the behavior of my NUCLEO_H743ZI:

Using dumb_http_server with Debug enabled:

CONFIG_NET_L2_ETHERNET=y
CONFIG_ETH_STM32_HAL=y
CONFIG_DEBUG=y

But I had to increase my Nucleo sysclk because default 96MHz weren't enough to have the shell thread scheduled. So it always hangs.
Clock config:

CONFIG_CLOCK_STM32_PLL_SRC_HSI=y
CONFIG_SYS_CLOCK_HW_CYCLES_PER_SEC=480000000
CONFIG_CLOCK_STM32_SYSCLK_SRC_PLL=y
CONFIG_CLOCK_STM32_PLL_M_DIVISOR=4
CONFIG_CLOCK_STM32_PLL_N_MULTIPLIER=60
CONFIG_CLOCK_STM32_PLL_P_DIVISOR=2
CONFIG_CLOCK_STM32_PLL_Q_DIVISOR=2
CONFIG_CLOCK_STM32_PLL_R_DIVISOR=2
CONFIG_CLOCK_STM32_D1CPRE=1
CONFIG_CLOCK_STM32_HPRE=2
CONFIG_CLOCK_STM32_D2PPRE1=2
CONFIG_CLOCK_STM32_D2PPRE2=2
CONFIG_CLOCK_STM32_D1PPRE=2
CONFIG_CLOCK_STM32_D3PPRE=2

Benchmarked using:

ab -n 1000 -c 1 http://192.0.2.1:8080/

Without this PR:

*** Booting Zephyr OS build zephyr-v2.4.0-1315-g4c4671a2a21a  ***
[00:00:00.006,000] <inf> net_config: Initializing network
[00:00:00.006,000] <inf> net_config: Waiting interface 1 (0x24004c3c) to be up...
uart:~$ Single-threaded dumb HTTP server waits for a connection on port 8080...
[00:00:01.489,000] <err> eth_stm32_hal: Failed to enqueue frame into RX queue: -115
[00:00:01.489,000] <inf> net_config: Interface 1 (0x24004c3c) coming up
[00:00:01.489,000] <inf> net_config: IPv4 address: 192.0.2.1
[00:00:10.467,000] <err> eth_stm32_hal: HAL_ETH_TransmitIT tx_int_sem take timeout
[00:00:10.467,000] <err> eth_stm32_hal: eth packet timeout
4c cc 6a 7f 9d ff 02 80  e1 a6 81 68 08 06 00 01 |L.j..... ...h....
08 00 06 04 00 02 02 80  e1 a6 81 68 c0 00 02 01 |........ ...h....
4c cc 6a 7f 9d ff c0 00  02 02                   |L.j..... ..      
[00:00:11.079,000] <err> eth_stm32_hal: HAL_ETH_TransmitIT tx_int_sem take timeout
[00:00:11.079,000] <err> eth_stm32_hal: eth packet timeout
4c cc 6a 7f 9d ff 02 80  e1 a6 81 68 08 06 00 01 |L.j..... ...h....
08 00 06 04 00 02 02 80  e1 a6 81 68 c0 00 02 01 |........ ...h....
4c cc 6a 7f 9d ff c0 00  02 02                   |L.j..... ..      
uart:~$ Connection #0 from 192.0.2.2
Connection from 192.0.2.2 closed
--> Hang after a few more connections, no recovery from the board, and nothing in the shell, and no answer on ethernet.

With this PR, all is good ! I could complete more than 1000 requests with no hang nor timeouts ! Great job 😄
But still, it doesn't work with the default 96MHz clock settings of Nucleo_H743ZI. This is not a problem from this PR though so it doesn't count for approval.

KozhinovAlexander

Just shorten the comments please. Otherwise LGTM and good testing work!

drivers/ethernet/eth_stm32_hal.c

Incomplete memory write causes issues with ethernet transmission on stm32h747_disco_m7 board. TxCpltCallback is never called, causing waiting for tx_int_sem to timeout, and occationally hangs ethernet communication. Tested with 700 000 connections on dumb http server, as well as 5min continuous MCU reset every 7s with continuous GET requests. Fixes issue zephyrproject-rtos#29915 Signed-off-by: Emil Lindqvist <emil@lindq.gr>

emillindq · 2020-11-12T08:30:45Z

I was continuing development on this change now and found out there is still problems when building with CONFIG_NO_OPTIMIZATIONS which we are doing in our project since CONFIG_DEBUG causes inconsistent debugging. As mentioned in #29944 the fix in HAL is the only solution that appears to work in all cases. We can still merge this, but I don't think it's good enough.

KozhinovAlexander · 2020-11-12T10:03:25Z

I was continuing development on this change now and found out there is still problems when building with CONFIG_NO_OPTIMIZATIONS which we are doing in our project since CONFIG_DEBUG causes inconsistent debugging. As mentioned in #29944 the fix in HAL is the only solution that appears to work in all cases. We can still merge this, but I don't think it's good enough.

Then I would wait with merge. Does dame semaphore timeout happens?

emillindq · 2020-11-12T11:04:54Z

I was continuing development on this change now and found out there is still problems when building with CONFIG_NO_OPTIMIZATIONS which we are doing in our project since CONFIG_DEBUG causes inconsistent debugging. As mentioned in #29944 the fix in HAL is the only solution that appears to work in all cases. We can still merge this, but I don't think it's good enough.

Then I would wait with merge. Does dame semaphore timeout happens?

Yes it is the same timeout. Should we put it in the HAL or can you suggest another way forward?

KozhinovAlexander · 2020-11-12T18:33:44Z

drivers/ethernet/eth_stm32_hal.c

 	/* tx_buffer is allocated on function stack, we need */
 	/* to wait for the transfer to complete */
 	/* So it is not freed before the interrupt happens */
 	hal_ret = HAL_ETH_Transmit_IT(heth, &tx_config);
+	__DSB();



Suggested change

__ISB();

Can you please put __ISB() here and try last test wit CONFIG_NO_OPTIMIZATIONS=y? According to 2.3.4
Software ordering of memory accesses:

DSB
The Data Synchronization Barrier (DSB) instruction ensures that outstanding memory transactions complete before subsequent
instructions execute. See DSB on page 177.

ISB
The Instruction Synchronization Barrier (ISB) ensures that the effect of all completed memory transactions is recognizable by subsequent
instructions. See ISB on page 178.

I tried this and I still get the semaphore timeout.I've tried all combinations of ordering and presence.

KozhinovAlexander · 2020-11-12T18:36:28Z

I was continuing development on this change now and found out there is still problems when building with CONFIG_NO_OPTIMIZATIONS which we are doing in our project since CONFIG_DEBUG causes inconsistent debugging. As mentioned in #29944 the fix in HAL is the only solution that appears to work in all cases. We can still merge this, but I don't think it's good enough.

Then I would wait with merge. Does dame semaphore timeout happens?

Yes it is the same timeout. Should we put it in the HAL or can you suggest another way forward?

My opinion then, there is no need to put it in HAL. I think @erwango would agree.

erwango · 2020-11-13T07:56:51Z

As mentioned in #29944 the fix in HAL is the only solution that appears to work in all cases. We can still merge this, but I don't think it's good enough.

Then I would wait with merge. Does dame semaphore timeout happens?

Yes it is the same timeout. Should we put it in the HAL or can you suggest another way forward?

My opinion then, there is no need to put it in HAL. I think @erwango would agree.

As mentioned here yesterday, if it happens to be a real issue in HAL and the only good fix should be done in the HAL, it has to be fixed in the HAL.
Issues and PR could be submitted directly here: https://github.com/STMicroelectronics/STM32CubeH7.
If issue is acknowledged, we can live with a temp patch in hal_stm32 module waiting the official fix to be delivered.

emillindq · 2020-11-17T08:33:53Z

Alright I've tried #29944 (comment) but still no luck. However, as it turns out, enabling CONFIG_NO_OPTIMIZATIONS gives all different kinds of errors when starting to use more of Zephyr's features, most stack overflows as most of the default values are set based on CONFIG_DEBUG. So this is the least of somebody's problem who is trying to use CONFIG_NO_OPTIMIZATIONS. In our project we have decided to just use CONFIG_DEBUG.

So as the issue is resolved for CONFIG_DEBUG, I think it's okay to merge it. Looking at most other merge requests and default values, very few are even tested with CONFIG_NO_OPTIMIZATIONS. What is the general consensus about this?

And also, unfortunately, I don't have time to dig further into investigating and creating an issue at STM32CubeH7.

erwango · 2020-11-17T08:50:37Z

So as the issue is resolved for CONFIG_DEBUG, I think it's okay to merge it. Looking at most other merge requests and default values, very few are even tested with CONFIG_NO_OPTIMIZATIONS. What is the general consensus about this?

Maybe @jukkar would be the best to answer this (for current subsystem)

jukkar · 2020-11-17T09:08:26Z

very few are even tested with CONFIG_NO_OPTIMIZATIONS. What is the general consensus about this?

It looks like if you disable optimizations, the stack usage increases and various network tests/samples start to fail. The stack usage in net would need to be fixed but probably not in this PR as it is a separate issue. @emillindq, could you create a GH issue for this so it is not lost.

emillindq · 2020-11-17T20:56:19Z

very few are even tested with CONFIG_NO_OPTIMIZATIONS. What is the general consensus about this?

It looks like if you disable optimizations, the stack usage increases and various network tests/samples start to fail. The stack usage in net would need to be fixed but probably not in this PR as it is a separate issue. @emillindq, could you create a GH issue for this so it is not lost.

This phenomenon is actually mentioned in several issues, for instance #19340 (comment) and #30028. Sadly the first one has been marked as low-priority. In reality, I've had horrible debugging experience with CONFIG_DEBUG; just today pc was jumping around like a frog when stepping making debugging impossible. I don't know if it's like that on other boards, but for stm32h747i_disco CONFIG_NO_OPTIMIZATIONS is required for debugging. I'll see if I can create a separate issue about that.

Honestly I feel like this shouldn't be merged. I'm generally not in favor of ghetto solutions that works in some of the cases with obscure connection to the issue at hand. I don't know why this fix works. I think it's better to keep the codebase clean and address the source of the issue instead (which obviously this PR doesn't do). I'll see if I can port some sample ST ethernet code from H743 to H747 and create an issue/PR in STM32 HAL repo instead.

But should I create a PR involving the __DSB() in the zephyrproject-rtos/hal_stm32 in the meantime? That fix appears to go to the root of the problem as it works for all cases. It seems like a much better choice than merging this. What do you think @Nukersson @erwango @lochej ?

KozhinovAlexander · 2020-11-17T21:06:42Z

@emillindq Have you already tried __DSB() in stm32h7xx_hal_eth.c: 2979 approach with this PR #29676?
Otherwise I would test it on my board at weekend also. If it is not helping too, we need to trigger HAL based fix.

erwango · 2020-11-18T07:49:11Z

@emillindq

I think it's better to keep the codebase clean and address the source of the issue instead (which obviously this PR doesn't do). I'll see if I can port some sample ST ethernet code from H743 to H747 and create an issue/PR in STM32 HAL repo instead.

Being able to reproduce on ST Ethernet code is a good thing but not a prerequisite to raise issues on STM32 HAL repo.
ST ethernet code is proposed as one working example. As long as the bug you're facing is visible while you're using the HAL correctly, it is valid.

But should I create a PR involving the __DSB() in the zephyrproject-rtos/hal_stm32 in the meantime? That fix appears to go to the root of the problem as it works for all cases. It seems like a much better choice than merging this. What do you think @Nukersson @erwango @lochej ?

Yes, you can create this PR on hal_stm32. Though, it won't be merged before we have a first feedback on the HAL issue.

emillindq · 2020-11-23T08:48:29Z

@emillindq

I think it's better to keep the codebase clean and address the source of the issue instead (which obviously this PR doesn't do). I'll see if I can port some sample ST ethernet code from H743 to H747 and create an issue/PR in STM32 HAL repo instead.

Being able to reproduce on ST Ethernet code is a good thing but not a prerequisite to raise issues on STM32 HAL repo.
ST ethernet code is proposed as one working example. As long as the bug you're facing is visible while you're using the HAL correctly, it is valid.

But should I create a PR involving the __DSB() in the zephyrproject-rtos/hal_stm32 in the meantime? That fix appears to go to the root of the problem as it works for all cases. It seems like a much better choice than merging this. What do you think @Nukersson @erwango @lochej ?

Yes, you can create this PR on hal_stm32. Though, it won't be merged before we have a first feedback on the HAL issue.

STMicroelectronics/STM32CubeH7#95
Thanks!

KozhinovAlexander · 2020-11-29T17:25:56Z

Moving to D2 domain SRAM for dumb_http_server sample:

/* System data RAM accessible over AHB bus: SRAM1 in D2 domain */
sram1: memory@30000000 {
reg = <0x30000000 DT_SIZE_K(288)>;
compatible = "mmio-sram";
};

and compiling with:

CONFIG_NET_L2_ETHERNET=y
CONFIG_ETH_STM32_HAL=y
CONFIG_DEBUG=y

and running:

ab -n 1000 -c 1 http://192.0.2.1:8080/

without __DSB() approach (this PR) produces:

Connection #502 from 192.0.2.38
Connection from 192.0.2.38 closed
[00:00:25.773,000] eth_stm32_hal: HAL_ETH_TransmitIT tx_int_sem take timeout
[00:00:25.773,000] eth_stm32_hal: eth packet timeout
a8 a1 59 2a 76 b6 02 80 e1 5e 68 0a 08 00 45 00 |..Y*v... .^h...E.
00 28 00 00 00 00 40 06 f6 a8 c0 00 02 01 c0 00 |.(....@. ........
02 26 1f 90 c3 88 b0 86 2e 2f b5 5c 68 97 50 11 |.&...... ./.\h.P.
05 00 46 e9 00 00

Anyway a positive effect can be seen: The webpage becomes more responsive!

nashif · 2021-02-04T18:16:23Z

any update on this?

emillindq · 2021-02-04T18:29:40Z

any update on this?

I think this is the most feasible solution #30403
I'm closing this pull request. As I said before, this solution is quite ghetto and doesn't go to the root of the problem so I feel like it shouldn't be merged anyways.

I'm acting as a tech lead in my company, and we have simply cloned the HAL, made the HAL change mentioned and then moved on.

emillindq requested review from lochej and KozhinovAlexander as code owners November 11, 2020 14:45

github-actions bot added area: Networking platform: STM32 ST Micro STM32 labels Nov 11, 2020

lochej approved these changes Nov 11, 2020

View reviewed changes

KozhinovAlexander suggested changes Nov 11, 2020

View reviewed changes

drivers/ethernet/eth_stm32_hal.c Outdated Show resolved Hide resolved

drivers/ethernet/eth_stm32_hal.c Outdated Show resolved Hide resolved

emillindq force-pushed the stm32h7_eth_timeout_fix branch from 4c4671a to 8ed12ae Compare November 12, 2020 07:24

KozhinovAlexander suggested changes Nov 12, 2020

View reviewed changes

emillindq mentioned this pull request Nov 17, 2020

> very few are even tested with CONFIG_NO_OPTIMIZATIONS. What is the general consensus about this? #30098

Closed

emillindq mentioned this pull request Nov 23, 2020

STM32H7: Ethernet TX callback complete not called intermittently STMicroelectronics/STM32CubeH7#95

Closed

KozhinovAlexander linked an issue Nov 29, 2020 that may be closed by this pull request

eth: stm32h747i_disco: sem timeout and hang on debug build #29915

Closed

mjaun mentioned this pull request Dec 2, 2020

eth: stm32h7: use SRAM3 / MPU configuration #30403

Merged

zephyrbot added the area: Ethernet label Jan 8, 2021

zephyrbot requested review from erwango, jukkar and pfalcon January 8, 2021 21:26

zephyrbot requested a review from tbursztyka January 8, 2021 21:26

zephyrbot assigned erwango Jan 8, 2021

emillindq closed this Feb 4, 2021

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

ethernet: stm32h7: fix tx semaphore timeout bug #29944

ethernet: stm32h7: fix tx semaphore timeout bug #29944

emillindq commented Nov 11, 2020

lochej left a comment

KozhinovAlexander left a comment •

edited

Loading

emillindq commented Nov 12, 2020

KozhinovAlexander commented Nov 12, 2020

emillindq commented Nov 12, 2020

KozhinovAlexander Nov 12, 2020

emillindq Nov 17, 2020

KozhinovAlexander commented Nov 12, 2020

erwango commented Nov 13, 2020

emillindq commented Nov 17, 2020

erwango commented Nov 17, 2020

jukkar commented Nov 17, 2020

emillindq commented Nov 17, 2020

KozhinovAlexander commented Nov 17, 2020 •

edited

Loading

erwango commented Nov 18, 2020

emillindq commented Nov 23, 2020

KozhinovAlexander commented Nov 29, 2020 •

edited

Loading

nashif commented Feb 4, 2021

emillindq commented Feb 4, 2021

ethernet: stm32h7: fix tx semaphore timeout bug #29944

ethernet: stm32h7: fix tx semaphore timeout bug #29944

Conversation

emillindq commented Nov 11, 2020

lochej left a comment

Choose a reason for hiding this comment

KozhinovAlexander left a comment • edited Loading

Choose a reason for hiding this comment

emillindq commented Nov 12, 2020

KozhinovAlexander commented Nov 12, 2020

emillindq commented Nov 12, 2020

KozhinovAlexander Nov 12, 2020

Choose a reason for hiding this comment

emillindq Nov 17, 2020

Choose a reason for hiding this comment

KozhinovAlexander commented Nov 12, 2020

erwango commented Nov 13, 2020

emillindq commented Nov 17, 2020

erwango commented Nov 17, 2020

jukkar commented Nov 17, 2020

emillindq commented Nov 17, 2020

KozhinovAlexander commented Nov 17, 2020 • edited Loading

erwango commented Nov 18, 2020

emillindq commented Nov 23, 2020

KozhinovAlexander commented Nov 29, 2020 • edited Loading

nashif commented Feb 4, 2021

emillindq commented Feb 4, 2021

KozhinovAlexander left a comment •

edited

Loading

KozhinovAlexander commented Nov 17, 2020 •

edited

Loading

KozhinovAlexander commented Nov 29, 2020 •

edited

Loading