Poll on DTLS socket returns -EAGAIN if bind & receive any data. #33330

emillindq · 2021-03-15T07:34:37Z

Describe the bug
I am running into a problem where poll() is called with timeout but returning 0 without waiting. This causes a busy-wait loop which is blocking lower-priority tasks (and essentially all tasks if time-slicing is not enabled). I am doing the following:

A DTLS socket is created
A call to bind() is done to open up a port.
It then loops poll() with a timeout to see if any data has been received on the socket.
If data is received, it will call recvfrom and handle the data.

The application is a CoAP server/client where a client call sendto starts the DTLS handshake. In other words, my end is a DTLS client.

It seems the problem is that bind() call will open up incoming data to be given to the socket FIFO without being given to mbedTLS.

zephyr/subsys/net/lib/sockets/sockets.c

Lines 364 to 367 in 4626a57

    
           if (net_context_get_type(ctx) == SOCK_DGRAM) { 
        
           	SET_ERRNO(net_context_recv(ctx, zsock_received_cb, K_NO_WAIT, 
        
           				   ctx->user_data)); 
        
           }

It seems like data is only given to mbedTLS when recv(from) or send(to) is called. So if any data comes in after bind() then this causes data to exist in the FIFO but it hasn't reached mbedTLS. Then from what I can see sockets_tls returns -EAGAIN in the update_pollin function because

Data is available to read
Handshake not completed
..but due to mentioned reasons, handshake hasn't even been started yet.

zephyr/subsys/net/lib/sockets/sockets_tls.c

Lines 1875 to 1877 in 4626a57

    
           if (!is_handshake_complete(ctx)) { 
        
           	goto again; 
        
           }

It will stay this way until the end of time since no handshake is started and will neither timeout nor ever finish.

I don't know if this is usage fault from my side, probably, but from what I can see even if my DTLS socket were to have the server role then if bind() is called and data is received then handshake would not be started and poll would still act this way because it seems like it's firstly when recvfrom is called that mbedTLS is fed the data and any data that comes in after bind() is called will just be put in the queue and not be given to mbedTLS. Please correct me if I'm wrong. Anyways in my opinion this kind of behavior is wrong and the case should be handled. Please comment and advice, and I can implement a fix if it's deemed necessary.

To Reproduce
Steps to reproduce the behavior:

Create DTLS socket
bind() to some port
From extern send any data UDP data to said port
--> Poll will return 0 without waiting even if timeout is given.

Expected behavior

If bind() is called and DTLS is set to be client, then if data is received and DTLS session is not established then the data should be discarded
If bind() is called and DTLS is set to be server, then if data is received then poll() should say data is available and recvfrom start DTLS handshaking.

Impact
Showstopper, kinda, since the external library is using bind()/poll() combination. I could change it to recvfrom with some non-blocking flag and then check if data was received and then k_msleep(1) but it seems like a better solution to fix the root problem.

Logs and console output
N/A

Environment (please complete the following information):

OS: Ubuntu 20.04
Zephyr SDK
89a9096

Additional context
Add any other context about the problem here.

The text was updated successfully, but these errors were encountered:

rlubos · 2021-03-16T16:01:51Z

Thanks for reporting, indeed this seems to be a valid issue. I just wanted to clarify that we're on the same page first. In the first paragraph you wrote:

I am running into a problem where poll() is called with timeout but returning 0 without waiting. This causes a busy-wait loop

and the further in the analysis:

Then from what I can see sockets_tls returns -EAGAIN in the update_pollin function ... It will stay this way until the end of time since no handshake is started and will neither timeout nor ever finish.

Are you sure about the former statement? Because what I've observed was indeed a busy loop, but all inside a single poll() call. Setting a timeout for the poll() and sending some dummy data caused the poll() function to return 0 in a timely manner, as indicated by the timeout value, yet the poll() internals kept busy-looping for all that time. So for instance calling the poll() timeout to 1 second caused the poll() to return 0 after that 1 second, but immediately. But poll() kept busy-looping during that second, preventing other lower priority threads from running, just as you indicated.

Actually, we used to to handle this case properly (please have a look at some older revision:

zephyr/subsys/net/lib/sockets/sockets_tls.c

Line 1729 in 7d77307

if (net_context_get_type(ctx) == SOCK_DGRAM &&

), but a regression slipped in during the major TLS sockets rework (net_context decoupling). Nonetheless, I think it still could be handled in a similar way now.

So in my opinion, how poll() should work for DTLS:

In case of DTLS client:

If the handshake is not complete, ztls_poll_prepare_ctx() funcion shall no call it's counterpart for the underlying socket, but should rather configure the k_poll_event to monitor the tls_established semaphore. This way, the poll() will wait either for timeout or for the handshake to complete w/o busy looping, in case the handshake takes place in some other thread.
Once the handshake completes, and the timeout still hasn't expired, we should reconfigure the k_poll_event to monitor the underyling socket fifo instead in the ztls_poll_update_ctx() and return EAGAIN. Any data that arrives at this point will be fed to the poll()'s internal zsock_recv() and either consumed by mbedTLS or decrypted.

In case of DTLS server it's a bit different though:

In this case the server shall be ready to handle the handshake from within the poll() function, since it's not the server who initiates this. This could be achieved in a relatively simple manner, by ignoring the tls_established value and monitoring the underlying socket at all times. In case any data arrives, it should be fed to the internal zsock_recv() which will handle the hanshake, or report an error in case some rubbish arrives.
Once the handshake is complete and there's still no application data, return EAGAIN to monitor the fifo until the timeout.

What do you think, does the above make sense? I can work on a fix if we agree it's a sane approach.

emillindq · 2021-03-16T19:17:18Z

Thank you for looking into this @rlubos !

Are you sure about the former statement?

You are right, upon further investigation I concur with you, it busy waits within poll() but stays true to the timeout!

In case of DTLS client:

Yes this sounds like a good plan. Although it would seem easier to just poll mbedtls_ssl_get_bytes_avail somehow but of course that's not possible, and in addition we need to keep mbedtls module alive by calling zsock_recv() in update_pollin.

In case of DTLS server it's a bit different though:

I have nothing to add, it sounds like the cleanest solution available.

It would be great if you implement this, it feels a bit over my head at the moment. I have already two merge requests to update in my backlog as well 😄

Thanks!

rlubos · 2021-03-17T08:07:23Z

I have nothing to add, it sounds like the cleanest solution available.

It would be great if you implement this, it feels a bit over my head at the moment.

Great, I'll work on the patch then.

emillindq added the bug The issue is a bug, or the PR is fixing a bug label Mar 15, 2021

carlescufi added the area: Networking label Mar 15, 2021

jukkar assigned rlubos Mar 16, 2021

galak added the priority: low Low impact/importance bug label Mar 16, 2021

rlubos mentioned this issue Mar 18, 2021

net: sockets: tls: Fix poll() handling during the handshake #33474

Merged

nashif closed this as completed in #33474 Mar 31, 2021

emillindq mentioned this issue May 28, 2021

sockets_tls: receiving data on offloaded socket before handshake causes pollin | pollerr and failed recvfrom (SARA-R4) #35789

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Poll on DTLS socket returns -EAGAIN if bind & receive any data. #33330

Poll on DTLS socket returns -EAGAIN if bind & receive any data. #33330

emillindq commented Mar 15, 2021 •

edited

Loading

rlubos commented Mar 16, 2021

emillindq commented Mar 16, 2021

rlubos commented Mar 17, 2021

Poll on DTLS socket returns -EAGAIN if bind & receive any data. #33330

Poll on DTLS socket returns -EAGAIN if bind & receive any data. #33330

Comments

emillindq commented Mar 15, 2021 • edited Loading

rlubos commented Mar 16, 2021

emillindq commented Mar 16, 2021

rlubos commented Mar 17, 2021

emillindq commented Mar 15, 2021 •

edited

Loading