subsys: shell_telent: don't close connection on EAGAIN #66287

talih0 · 2023-12-07T16:48:53Z

An EAGAIN error can occur if the tcp window size is full. Instead of closing the connection the application should sleep and retry sendin the packet.

rlubos

Looks good, but I've left one more suggestion to make this even more robust.

rlubos · 2023-12-08T08:17:09Z

subsys/shell/backends/shell_telnet.c

+	for (;;) {
+		int err;
+
+		err = net_context_send(sh_telnet->client_ctx, msg, len, telnet_sent_cb,


There's one more thing that's missing to make this fully robust. net_context_send() returns the number of bytes sent, and it could return less than was requested (for example if window was almost full).
So, intead of having an endless loop with break, the loop could monitor how much bytes are left to send, see https://github.com/zephyrproject-rtos/zephyr/blob/main/subsys/net/lib/http/http_client.c#L32 for example.

thanks @rlubos, I updated the PR.

When I was testing this PR, I noticed that there can be a deadlock on arp_mutex/conn->lock when using the "net arp" command with telnet shell
https://github.com/zephyrproject-rtos/zephyr/blob/main/subsys/net/lib/shell/arp.c#L63

Thread 1:
The "net arp" command will lock
arp_mutex followed by conn->lock

Thread 2:
In tcp_in() the locks are taken in the other order:
conn->lock followed by arp_mutex (in ethernet_ll_prepare_on_ipv4())

In some race conditions, we can get stuck in Thread 1 waiting for conn->lock, because Thread 2 has conn->lock and is waiting for arp_mutex.

We probably should not take arp_mutex for a long time via "net arp" shell command. We can instead copy all the arp entries, and send the copied entries to telnet shell.

I'm kind of puzzled here, I can't see where (and why in the first place) would net arp command lock the TCP connection mutex?

@rlubos, sorry, was a bit busy last couple of days. Here's a backtrace:

0: net_tcp_queue () at /home/andriy/zephyrproject/zephyr/subsys/net/ip/tcp.c:3285
1: 0x0c016b2c in context_sendto () at /home/andriy/zephyrproject/zephyr/subsys/net/ip/net_context.c:2184
2: 0x0c016e3e in net_context_send () at /home/andriy/zephyrproject/zephyr/subsys/net/ip/net_context.c:2287
3: 0x0c0015a6 in telnet_send () at /home/andriy/zephyrproject/zephyr/subsys/shell/backends/shell_telnet.c:188
4: 0x0c001678 in write () at /home/andriy/zephyrproject/zephyr/subsys/shell/backends/shell_telnet.c:493
5: 0x0c014378 in z_shell_write () at /home/andriy/zephyrproject/zephyr/subsys/shell/shell_ops.c:434
6: 0x0c013fe0 in z_shell_fprintf_buffer_flush () at /home/andriy/zephyrproject/zephyr/subsys/shell/shell_fprintf.c:46
7: 0x0c0026b4 in z_shell_fprintf_fmt () at /home/andriy/zephyrproject/zephyr/subsys/shell/shell_fprintf.c:39
8: 0x0c014404 in z_shell_vfprintf () at /home/andriy/zephyrproject/zephyr/subsys/shell/shell_ops.c:525
9: 0x0c013ef2 in shell_vfprintf () at /home/andriy/zephyrproject/zephyr/subsys/shell/shell.c:1526
10: 0x0c013f32 in shell_fprintf () at /home/andriy/zephyrproject/zephyr/subsys/shell/shell.c:1543
11: 0x0c0042ca in arp_cb () at /home/andriy/zephyrproject/zephyr/subsys/net/lib/shell/arp.c:25
12: 0x0c00728c in net_arp_foreach () at /home/andriy/zephyrproject/zephyr/subsys/net/l2/ethernet/arp.c:794
13: 0x0c004294 in cmd_net_arp () at /home/andriy/zephyrproject/zephyr/subsys/net/lib/shell/arp.c:63
14: cmd_net_arp () at /home/andriy/zephyrproject/zephyr/subsys/net/lib/shell/arp.c:46

Thanks @talih0 for the details, I somehow missed the information that you use a telnet backend. I see where's the problem now.

If I'm not mistaken, I think that the deadlock could be avoided if CONFIG_NET_TC_TX_COUNT was set to 1 (it defaults to 0 unless userspace is enabled). That way the Ethernet L2 is processed by a separate thread, so TCP/ARP mutexes won't be locked from the same thread. Would you be able to check that? If I'm correct, we could likely enforce this configuration if one of the networking shell backends are enabled.

Otherwise, I don't see other way around than postponing shell operations until we're out of net_arp_foreach() function.

thanks, @rlubos. Yes, setting CONFIG_NET_TC_TX_COUNT=1 fixes the problem.

EAGAIN error is returned if the tcp window size is full. Retry sending the packet instead of closing the connection if this error occurs. Also the full payload may not be sent in a single call to net_context_send(). Keep track of the number of bytes remaining and try to send the full payload. Signed-off-by: Andriy Gelman <andriy.gelman@gmail.com>

zephyrbot added the area: Shell Shell subsystem label Dec 7, 2023

zephyrbot requested review from carlescufi and jakub-uC December 7, 2023 16:49

zephyrbot assigned jakub-uC Dec 7, 2023

talih0 force-pushed the telnet_fix branch 2 times, most recently from 0e15df1 to a4f17d1 Compare December 7, 2023 18:30

talih0 requested a review from rlubos December 7, 2023 18:30

rlubos reviewed Dec 8, 2023

View reviewed changes

rlubos requested a review from jukkar December 8, 2023 08:17

talih0 force-pushed the telnet_fix branch from a4f17d1 to d09b151 Compare December 10, 2023 04:58

rlubos approved these changes Dec 11, 2023

View reviewed changes

jukkar approved these changes Dec 11, 2023

View reviewed changes

carlescufi merged commit 8232ddd into zephyrproject-rtos:main Dec 12, 2023
16 checks passed

rlubos mentioned this pull request Dec 14, 2023

net: shell: Prevent deadlock with net arp command #66518

Merged

mkaranki mentioned this pull request Jan 15, 2024

shell: telnet: Don't close the connection on ENOBUFS error #67596

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

subsys: shell_telent: don't close connection on EAGAIN #66287

subsys: shell_telent: don't close connection on EAGAIN #66287

talih0 commented Dec 7, 2023

rlubos left a comment

rlubos Dec 8, 2023

talih0 Dec 10, 2023

rlubos Dec 11, 2023

talih0 Dec 13, 2023 •

edited

Loading

rlubos Dec 13, 2023

talih0 Dec 13, 2023

subsys: shell_telent: don't close connection on EAGAIN #66287

subsys: shell_telent: don't close connection on EAGAIN #66287

Conversation

talih0 commented Dec 7, 2023

rlubos left a comment

Choose a reason for hiding this comment

rlubos Dec 8, 2023

Choose a reason for hiding this comment

talih0 Dec 10, 2023

Choose a reason for hiding this comment

rlubos Dec 11, 2023

Choose a reason for hiding this comment

talih0 Dec 13, 2023 • edited Loading

Choose a reason for hiding this comment

rlubos Dec 13, 2023

Choose a reason for hiding this comment

talih0 Dec 13, 2023

Choose a reason for hiding this comment

talih0 Dec 13, 2023 •

edited

Loading