-
Notifications
You must be signed in to change notification settings - Fork 6.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
subsys: shell_telent: don't close connection on EAGAIN #66287
Conversation
0e15df1
to
a4f17d1
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good, but I've left one more suggestion to make this even more robust.
subsys/shell/backends/shell_telnet.c
Outdated
for (;;) { | ||
int err; | ||
|
||
err = net_context_send(sh_telnet->client_ctx, msg, len, telnet_sent_cb, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There's one more thing that's missing to make this fully robust. net_context_send()
returns the number of bytes sent, and it could return less than was requested (for example if window was almost full).
So, intead of having an endless loop with break
, the loop could monitor how much bytes are left to send, see https://github.com/zephyrproject-rtos/zephyr/blob/main/subsys/net/lib/http/http_client.c#L32 for example.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
thanks @rlubos, I updated the PR.
When I was testing this PR, I noticed that there can be a deadlock on arp_mutex/conn->lock when using the "net arp" command with telnet shell
https://github.com/zephyrproject-rtos/zephyr/blob/main/subsys/net/lib/shell/arp.c#L63
Thread 1:
The "net arp" command will lock
arp_mutex followed by conn->lock
Thread 2:
In tcp_in() the locks are taken in the other order:
conn->lock followed by arp_mutex (in ethernet_ll_prepare_on_ipv4())
In some race conditions, we can get stuck in Thread 1 waiting for conn->lock, because Thread 2 has conn->lock and is waiting for arp_mutex.
We probably should not take arp_mutex for a long time via "net arp" shell command. We can instead copy all the arp entries, and send the copied entries to telnet shell.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm kind of puzzled here, I can't see where (and why in the first place) would net arp
command lock the TCP connection mutex?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@rlubos, sorry, was a bit busy last couple of days. Here's a backtrace:
0: net_tcp_queue () at /home/andriy/zephyrproject/zephyr/subsys/net/ip/tcp.c:3285
1: 0x0c016b2c in context_sendto () at /home/andriy/zephyrproject/zephyr/subsys/net/ip/net_context.c:2184
2: 0x0c016e3e in net_context_send () at /home/andriy/zephyrproject/zephyr/subsys/net/ip/net_context.c:2287
3: 0x0c0015a6 in telnet_send () at /home/andriy/zephyrproject/zephyr/subsys/shell/backends/shell_telnet.c:188
4: 0x0c001678 in write () at /home/andriy/zephyrproject/zephyr/subsys/shell/backends/shell_telnet.c:493
5: 0x0c014378 in z_shell_write () at /home/andriy/zephyrproject/zephyr/subsys/shell/shell_ops.c:434
6: 0x0c013fe0 in z_shell_fprintf_buffer_flush () at /home/andriy/zephyrproject/zephyr/subsys/shell/shell_fprintf.c:46
7: 0x0c0026b4 in z_shell_fprintf_fmt () at /home/andriy/zephyrproject/zephyr/subsys/shell/shell_fprintf.c:39
8: 0x0c014404 in z_shell_vfprintf () at /home/andriy/zephyrproject/zephyr/subsys/shell/shell_ops.c:525
9: 0x0c013ef2 in shell_vfprintf () at /home/andriy/zephyrproject/zephyr/subsys/shell/shell.c:1526
10: 0x0c013f32 in shell_fprintf () at /home/andriy/zephyrproject/zephyr/subsys/shell/shell.c:1543
11: 0x0c0042ca in arp_cb () at /home/andriy/zephyrproject/zephyr/subsys/net/lib/shell/arp.c:25
12: 0x0c00728c in net_arp_foreach () at /home/andriy/zephyrproject/zephyr/subsys/net/l2/ethernet/arp.c:794
13: 0x0c004294 in cmd_net_arp () at /home/andriy/zephyrproject/zephyr/subsys/net/lib/shell/arp.c:63
14: cmd_net_arp () at /home/andriy/zephyrproject/zephyr/subsys/net/lib/shell/arp.c:46
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @talih0 for the details, I somehow missed the information that you use a telnet backend. I see where's the problem now.
If I'm not mistaken, I think that the deadlock could be avoided if CONFIG_NET_TC_TX_COUNT
was set to 1 (it defaults to 0 unless userspace is enabled). That way the Ethernet L2 is processed by a separate thread, so TCP/ARP mutexes won't be locked from the same thread. Would you be able to check that? If I'm correct, we could likely enforce this configuration if one of the networking shell backends are enabled.
Otherwise, I don't see other way around than postponing shell operations until we're out of net_arp_foreach()
function.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
thanks, @rlubos. Yes, setting CONFIG_NET_TC_TX_COUNT=1 fixes the problem.
EAGAIN error is returned if the tcp window size is full. Retry sending the packet instead of closing the connection if this error occurs. Also the full payload may not be sent in a single call to net_context_send(). Keep track of the number of bytes remaining and try to send the full payload. Signed-off-by: Andriy Gelman <andriy.gelman@gmail.com>
An EAGAIN error can occur if the tcp window size is full. Instead of closing the connection the application should sleep and retry sendin the packet.