Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ethernet: stm32h7: fix tx semaphore timeout bug #29944

Closed

Conversation

emillindq
Copy link
Contributor

Incomplete memory write causes issues with ethernet
transmission on stm32h747_disco_m7 board.
TxCpltCallback is never called, causing waiting for
tx_int_sem to timeout, and occationally hangs
ethernet communication.

Tested with 700 000 connections on dumb http server,
as well as 5min continuous MCU reset every 7s with
continuous GET requests.

Fixes issue #29915

Copy link
Collaborator

@lochej lochej left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the fix ! 👍

I could verify that this fixes the behavior of my NUCLEO_H743ZI:

Using dumb_http_server with Debug enabled:

CONFIG_NET_L2_ETHERNET=y
CONFIG_ETH_STM32_HAL=y
CONFIG_DEBUG=y

But I had to increase my Nucleo sysclk because default 96MHz weren't enough to have the shell thread scheduled. So it always hangs.
Clock config:

CONFIG_CLOCK_STM32_PLL_SRC_HSI=y
CONFIG_SYS_CLOCK_HW_CYCLES_PER_SEC=480000000
CONFIG_CLOCK_STM32_SYSCLK_SRC_PLL=y
CONFIG_CLOCK_STM32_PLL_M_DIVISOR=4
CONFIG_CLOCK_STM32_PLL_N_MULTIPLIER=60
CONFIG_CLOCK_STM32_PLL_P_DIVISOR=2
CONFIG_CLOCK_STM32_PLL_Q_DIVISOR=2
CONFIG_CLOCK_STM32_PLL_R_DIVISOR=2
CONFIG_CLOCK_STM32_D1CPRE=1
CONFIG_CLOCK_STM32_HPRE=2
CONFIG_CLOCK_STM32_D2PPRE1=2
CONFIG_CLOCK_STM32_D2PPRE2=2
CONFIG_CLOCK_STM32_D1PPRE=2
CONFIG_CLOCK_STM32_D3PPRE=2

Benchmarked using:

ab -n 1000 -c 1 http://192.0.2.1:8080/

Without this PR:

*** Booting Zephyr OS build zephyr-v2.4.0-1315-g4c4671a2a21a  ***
[00:00:00.006,000] <inf> net_config: Initializing network
[00:00:00.006,000] <inf> net_config: Waiting interface 1 (0x24004c3c) to be up...
uart:~$ Single-threaded dumb HTTP server waits for a connection on port 8080...
[00:00:01.489,000] <err> eth_stm32_hal: Failed to enqueue frame into RX queue: -115
[00:00:01.489,000] <inf> net_config: Interface 1 (0x24004c3c) coming up
[00:00:01.489,000] <inf> net_config: IPv4 address: 192.0.2.1
[00:00:10.467,000] <err> eth_stm32_hal: HAL_ETH_TransmitIT tx_int_sem take timeout
[00:00:10.467,000] <err> eth_stm32_hal: eth packet timeout
4c cc 6a 7f 9d ff 02 80  e1 a6 81 68 08 06 00 01 |L.j..... ...h....
08 00 06 04 00 02 02 80  e1 a6 81 68 c0 00 02 01 |........ ...h....
4c cc 6a 7f 9d ff c0 00  02 02                   |L.j..... ..      
[00:00:11.079,000] <err> eth_stm32_hal: HAL_ETH_TransmitIT tx_int_sem take timeout
[00:00:11.079,000] <err> eth_stm32_hal: eth packet timeout
4c cc 6a 7f 9d ff 02 80  e1 a6 81 68 08 06 00 01 |L.j..... ...h....
08 00 06 04 00 02 02 80  e1 a6 81 68 c0 00 02 01 |........ ...h....
4c cc 6a 7f 9d ff c0 00  02 02                   |L.j..... ..      
uart:~$ Connection #0 from 192.0.2.2
Connection from 192.0.2.2 closed
--> Hang after a few more connections, no recovery from the board, and nothing in the shell, and no answer on ethernet.

With this PR, all is good ! I could complete more than 1000 requests with no hang nor timeouts ! Great job 😄
But still, it doesn't work with the default 96MHz clock settings of Nucleo_H743ZI. This is not a problem from this PR though so it doesn't count for approval.

Copy link
Collaborator

@KozhinovAlexander KozhinovAlexander left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just shorten the comments please. Otherwise LGTM and good testing work!

drivers/ethernet/eth_stm32_hal.c Outdated Show resolved Hide resolved
drivers/ethernet/eth_stm32_hal.c Outdated Show resolved Hide resolved
Incomplete memory write causes issues with ethernet
transmission on stm32h747_disco_m7 board.
TxCpltCallback is never called, causing waiting for
tx_int_sem to timeout, and occationally hangs
ethernet communication.

Tested with 700 000 connections on dumb http server,
as well as 5min continuous MCU reset every 7s with
continuous GET requests.

Fixes issue zephyrproject-rtos#29915

Signed-off-by: Emil Lindqvist <emil@lindq.gr>
@emillindq emillindq force-pushed the stm32h7_eth_timeout_fix branch from 4c4671a to 8ed12ae Compare November 12, 2020 07:24
@emillindq
Copy link
Contributor Author

I was continuing development on this change now and found out there is still problems when building with CONFIG_NO_OPTIMIZATIONS which we are doing in our project since CONFIG_DEBUG causes inconsistent debugging. As mentioned in #29944 the fix in HAL is the only solution that appears to work in all cases. We can still merge this, but I don't think it's good enough.

@KozhinovAlexander
Copy link
Collaborator

I was continuing development on this change now and found out there is still problems when building with CONFIG_NO_OPTIMIZATIONS which we are doing in our project since CONFIG_DEBUG causes inconsistent debugging. As mentioned in #29944 the fix in HAL is the only solution that appears to work in all cases. We can still merge this, but I don't think it's good enough.

Then I would wait with merge. Does dame semaphore timeout happens?

@emillindq
Copy link
Contributor Author

I was continuing development on this change now and found out there is still problems when building with CONFIG_NO_OPTIMIZATIONS which we are doing in our project since CONFIG_DEBUG causes inconsistent debugging. As mentioned in #29944 the fix in HAL is the only solution that appears to work in all cases. We can still merge this, but I don't think it's good enough.

Then I would wait with merge. Does dame semaphore timeout happens?

Yes it is the same timeout. Should we put it in the HAL or can you suggest another way forward?

/* tx_buffer is allocated on function stack, we need */
/* to wait for the transfer to complete */
/* So it is not freed before the interrupt happens */
hal_ret = HAL_ETH_Transmit_IT(heth, &tx_config);
__DSB();

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
__ISB();

Can you please put __ISB() here and try last test wit CONFIG_NO_OPTIMIZATIONS=y? According to 2.3.4
Software ordering of memory accesses
:

DSB
The Data Synchronization Barrier (DSB) instruction ensures that outstanding memory transactions complete before subsequent
instructions execute. See DSB on page 177.

ISB
The Instruction Synchronization Barrier (ISB) ensures that the effect of all completed memory transactions is recognizable by subsequent
instructions. See ISB on page 178.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I tried this and I still get the semaphore timeout.I've tried all combinations of ordering and presence.

@KozhinovAlexander
Copy link
Collaborator

I was continuing development on this change now and found out there is still problems when building with CONFIG_NO_OPTIMIZATIONS which we are doing in our project since CONFIG_DEBUG causes inconsistent debugging. As mentioned in #29944 the fix in HAL is the only solution that appears to work in all cases. We can still merge this, but I don't think it's good enough.

Then I would wait with merge. Does dame semaphore timeout happens?

Yes it is the same timeout. Should we put it in the HAL or can you suggest another way forward?

I was continuing development on this change now and found out there is still problems when building with CONFIG_NO_OPTIMIZATIONS which we are doing in our project since CONFIG_DEBUG causes inconsistent debugging. As mentioned in #29944 the fix in HAL is the only solution that appears to work in all cases. We can still merge this, but I don't think it's good enough.

Then I would wait with merge. Does dame semaphore timeout happens?

Yes it is the same timeout. Should we put it in the HAL or can you suggest another way forward?

My opinion then, there is no need to put it in HAL. I think @erwango would agree.

@erwango
Copy link
Member

erwango commented Nov 13, 2020

As mentioned in #29944 the fix in HAL is the only solution that appears to work in all cases. We can still merge this, but I don't think it's good enough.

Then I would wait with merge. Does dame semaphore timeout happens?

Yes it is the same timeout. Should we put it in the HAL or can you suggest another way forward?

My opinion then, there is no need to put it in HAL. I think @erwango would agree.

As mentioned here yesterday, if it happens to be a real issue in HAL and the only good fix should be done in the HAL, it has to be fixed in the HAL.
Issues and PR could be submitted directly here: https://github.com/STMicroelectronics/STM32CubeH7.
If issue is acknowledged, we can live with a temp patch in hal_stm32 module waiting the official fix to be delivered.

@emillindq
Copy link
Contributor Author

Alright I've tried #29944 (comment) but still no luck. However, as it turns out, enabling CONFIG_NO_OPTIMIZATIONS gives all different kinds of errors when starting to use more of Zephyr's features, most stack overflows as most of the default values are set based on CONFIG_DEBUG. So this is the least of somebody's problem who is trying to use CONFIG_NO_OPTIMIZATIONS. In our project we have decided to just use CONFIG_DEBUG.

So as the issue is resolved for CONFIG_DEBUG, I think it's okay to merge it. Looking at most other merge requests and default values, very few are even tested with CONFIG_NO_OPTIMIZATIONS. What is the general consensus about this?

And also, unfortunately, I don't have time to dig further into investigating and creating an issue at STM32CubeH7.

@erwango
Copy link
Member

erwango commented Nov 17, 2020

So as the issue is resolved for CONFIG_DEBUG, I think it's okay to merge it. Looking at most other merge requests and default values, very few are even tested with CONFIG_NO_OPTIMIZATIONS. What is the general consensus about this?

Maybe @jukkar would be the best to answer this (for current subsystem)

@jukkar
Copy link
Member

jukkar commented Nov 17, 2020

very few are even tested with CONFIG_NO_OPTIMIZATIONS. What is the general consensus about this?

It looks like if you disable optimizations, the stack usage increases and various network tests/samples start to fail. The stack usage in net would need to be fixed but probably not in this PR as it is a separate issue. @emillindq, could you create a GH issue for this so it is not lost.

@emillindq
Copy link
Contributor Author

very few are even tested with CONFIG_NO_OPTIMIZATIONS. What is the general consensus about this?

It looks like if you disable optimizations, the stack usage increases and various network tests/samples start to fail. The stack usage in net would need to be fixed but probably not in this PR as it is a separate issue. @emillindq, could you create a GH issue for this so it is not lost.

This phenomenon is actually mentioned in several issues, for instance #19340 (comment) and #30028. Sadly the first one has been marked as low-priority. In reality, I've had horrible debugging experience with CONFIG_DEBUG; just today pc was jumping around like a frog when stepping making debugging impossible. I don't know if it's like that on other boards, but for stm32h747i_disco CONFIG_NO_OPTIMIZATIONS is required for debugging. I'll see if I can create a separate issue about that.

Honestly I feel like this shouldn't be merged. I'm generally not in favor of ghetto solutions that works in some of the cases with obscure connection to the issue at hand. I don't know why this fix works. I think it's better to keep the codebase clean and address the source of the issue instead (which obviously this PR doesn't do). I'll see if I can port some sample ST ethernet code from H743 to H747 and create an issue/PR in STM32 HAL repo instead.

But should I create a PR involving the __DSB() in the zephyrproject-rtos/hal_stm32 in the meantime? That fix appears to go to the root of the problem as it works for all cases. It seems like a much better choice than merging this. What do you think @Nukersson @erwango @lochej ?

@KozhinovAlexander
Copy link
Collaborator

KozhinovAlexander commented Nov 17, 2020

@emillindq Have you already tried __DSB() in stm32h7xx_hal_eth.c: 2979 approach with this PR #29676?
Otherwise I would test it on my board at weekend also. If it is not helping too, we need to trigger HAL based fix.

@erwango
Copy link
Member

erwango commented Nov 18, 2020

@emillindq

I think it's better to keep the codebase clean and address the source of the issue instead (which obviously this PR doesn't do). I'll see if I can port some sample ST ethernet code from H743 to H747 and create an issue/PR in STM32 HAL repo instead.

Being able to reproduce on ST Ethernet code is a good thing but not a prerequisite to raise issues on STM32 HAL repo.
ST ethernet code is proposed as one working example. As long as the bug you're facing is visible while you're using the HAL correctly, it is valid.

But should I create a PR involving the __DSB() in the zephyrproject-rtos/hal_stm32 in the meantime? That fix appears to go to the root of the problem as it works for all cases. It seems like a much better choice than merging this. What do you think @Nukersson @erwango @lochej ?

Yes, you can create this PR on hal_stm32. Though, it won't be merged before we have a first feedback on the HAL issue.

@emillindq
Copy link
Contributor Author

@emillindq

I think it's better to keep the codebase clean and address the source of the issue instead (which obviously this PR doesn't do). I'll see if I can port some sample ST ethernet code from H743 to H747 and create an issue/PR in STM32 HAL repo instead.

Being able to reproduce on ST Ethernet code is a good thing but not a prerequisite to raise issues on STM32 HAL repo.
ST ethernet code is proposed as one working example. As long as the bug you're facing is visible while you're using the HAL correctly, it is valid.

But should I create a PR involving the __DSB() in the zephyrproject-rtos/hal_stm32 in the meantime? That fix appears to go to the root of the problem as it works for all cases. It seems like a much better choice than merging this. What do you think @Nukersson @erwango @lochej ?

Yes, you can create this PR on hal_stm32. Though, it won't be merged before we have a first feedback on the HAL issue.

STMicroelectronics/STM32CubeH7#95
Thanks!

@KozhinovAlexander KozhinovAlexander linked an issue Nov 29, 2020 that may be closed by this pull request
@KozhinovAlexander
Copy link
Collaborator

KozhinovAlexander commented Nov 29, 2020

Moving to D2 domain SRAM for dumb_http_server sample:

/* System data RAM accessible over AHB bus: SRAM1 in D2 domain */
sram1: memory@30000000 {
reg = <0x30000000 DT_SIZE_K(288)>;
compatible = "mmio-sram";
};

and compiling with:

CONFIG_NET_L2_ETHERNET=y
CONFIG_ETH_STM32_HAL=y
CONFIG_DEBUG=y

and running:

ab -n 1000 -c 1 http://192.0.2.1:8080/

without __DSB() approach (this PR) produces:

Connection #502 from 192.0.2.38
Connection from 192.0.2.38 closed
[00:00:25.773,000] eth_stm32_hal: HAL_ETH_TransmitIT tx_int_sem take timeout
[00:00:25.773,000] eth_stm32_hal: eth packet timeout
a8 a1 59 2a 76 b6 02 80 e1 5e 68 0a 08 00 45 00 |..Y*v... .^h...E.
00 28 00 00 00 00 40 06 f6 a8 c0 00 02 01 c0 00 |.(....@. ........
02 26 1f 90 c3 88 b0 86 2e 2f b5 5c 68 97 50 11 |.&...... ./.\h.P.
05 00 46 e9 00 00

Anyway a positive effect can be seen: The webpage becomes more responsive!

@nashif
Copy link
Member

nashif commented Feb 4, 2021

any update on this?

@emillindq
Copy link
Contributor Author

any update on this?

I think this is the most feasible solution #30403
I'm closing this pull request. As I said before, this solution is quite ghetto and doesn't go to the root of the problem so I feel like it shouldn't be merged anyways.

I'm acting as a tech lead in my company, and we have simply cloned the HAL, made the HAL change mentioned and then moved on.

@emillindq emillindq closed this Feb 4, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

eth: stm32h747i_disco: sem timeout and hang on debug build
7 participants