Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[LibOS,PAL] Gramine doesn't support EINPROGRESS (non-blocking sockets connect) #1641

Closed
dimakuv opened this issue Nov 6, 2023 · 8 comments · Fixed by #1643
Closed

[LibOS,PAL] Gramine doesn't support EINPROGRESS (non-blocking sockets connect) #1641

dimakuv opened this issue Nov 6, 2023 · 8 comments · Fixed by #1643

Comments

@dimakuv
Copy link

dimakuv commented Nov 6, 2023

Description of the problem

Gramine doesn't support non-blocking sockets connect, see these snippets in our code:

  • gramine/libos/src/net/ip.c

    Lines 224 to 229 in 24a581c

    /* XXX: this connect is always blocking (regardless of actual setting of nonblockingness on
    * `sock->pal_handle`. See also the comment in tcp connect implementation in Linux PAL. */
    ret = PalSocketConnect(sock->pal_handle, &pal_remote_addr, &pal_local_addr);
    if (ret < 0) {
    return ret == -PAL_ERROR_CONNFAILED ? -ECONNREFUSED : pal_to_unix_errno(ret);
    }
  • int ret = DO_SYSCALL(connect, handle->sock.fd, &sa_storage, (int)linux_addrlen);
    if (ret < 0) {
    /* XXX: Non blocking socket. Currently there is no way of notifying LibOS of successful or
    * failed connection, so we have to block and wait. */
    if (ret != -EINPROGRESS) {
    return unix_to_pal_error(ret);
    }
    struct pollfd pfd = {
    .fd = handle->sock.fd,
    .events = POLLOUT,
    };
    ret = DO_SYSCALL(poll, &pfd, 1, /*timeout=*/-1);
    if (ret != 1 || pfd.revents == 0) {
    return ret < 0 ? unix_to_pal_error(ret) : -PAL_ERROR_INVAL;
    }
    int val = 0;
    unsigned int len = sizeof(val);
    ret = DO_SYSCALL(getsockopt, handle->sock.fd, SOL_SOCKET, SO_ERROR, &val, &len);
    if (ret < 0 || val < 0) {
    return ret < 0 ? unix_to_pal_error(ret) : -PAL_ERROR_INVAL;
    }
    if (val) {
    return unix_to_pal_error(-val);
    }
    /* Connect succeeded. */
    }
  • int ret = DO_SYSCALL_INTERRUPTIBLE(connect, ocall_connect_args->fd, ocall_connect_args->addr,
    (int)ocall_connect_args->addrlen);
    if (ret < 0) {
    /* XXX: Non blocking socket. Currently there is no way of notifying LibOS of successful or
    * failed connection, so we have to block and wait. */
    if (ret != -EINPROGRESS) {
    return ret;
    }
    struct pollfd pfd = {
    .fd = ocall_connect_args->fd,
    .events = POLLOUT,
    };
    ret = DO_SYSCALL(poll, &pfd, 1, /*timeout=*/-1);
    if (ret != 1 || pfd.revents == 0) {
    return ret < 0 ? ret : -EINVAL;
    }
    int val = 0;
    unsigned int len = sizeof(val);
    ret = DO_SYSCALL(getsockopt, ocall_connect_args->fd, SOL_SOCKET, SO_ERROR, &val, &len);
    if (ret < 0 || val < 0) {
    return ret < 0 ? ret : -EINVAL;
    }
    if (val) {
    return -val;
    }
    /* Connect succeeded. */
    }

Non-blocking sockets connect is a curious corner case. From man 2 connect:

   EINPROGRESS
          The socket is nonblocking and the connection cannot be
          completed immediately.  (UNIX domain sockets failed with
          EAGAIN instead.)  It is possible to select(2) or poll(2)
          for completion by selecting the socket for writing.  After
          select(2) indicates writability, use getsockopt(2) to read
          the SO_ERROR option at level SOL_SOCKET to determine
          whether connect() completed successfully (SO_ERROR is
          zero) or unsuccessfully (SO_ERROR is one of the usual
          error codes listed here, explaining the reason for the
          failure).

We can boil it down to two things:

  1. In Linux, there is an intermediate socket state: in addition to SOCKET_NEW and SOCKET_CONNECTED, non-blocking sockets can be in the SOCKET_CONNECTING state. In this state, no operation except for select/poll/epoll is possible.
  2. In Gramine, there is no such intermediate socket state. So to hide the absence of SOCKET_CONNECTING, Gramine does the hacks above -- by blocking on waiting until the socket is fully connected.

The problem with Gramine's current approach: a non-blocking socket effectively becomes blocking, in this specific case of the connect() syscall.

The real-world consequence: if a remote peer is unresponsive, the connect operation blocks for a long time before it times out. On my Linux box, this timeout is 60 seconds.

This problem was independently reported on two workloads.

Steps to reproduce

There is no simple way to emulate EINPROGRESS. For this, basically an SYN TCP packet must be delayed or dropped during the TCP/IP handshake.

So I emulated using IP tables modification, which works fine on my Ubuntu 20.04:

  • Always drop the SYN packet on local port 12345 (this emulates unresponsive peer):
    sudo iptables -A OUTPUT -p tcp --tcp-flags SYN,ACK,FIN,RST SYN -d 127.0.0.1 --dport 12345 -j DROP
    
  • Restore normal behavior on local port 12345 (this emulates responsive peer):
    sudo iptables -D OUTPUT -p tcp --tcp-flags SYN,ACK,FIN,RST SYN -d 127.0.0.1 --dport 12345 -j DROP
    
  • Check the current IP Tables rules:
    sudo iptables -S
    

And here's my testing code. I just modified CI-Examples/helloworld C file:

#define _GNU_SOURCE
#include <arpa/inet.h>
#include <err.h>
#include <errno.h>
#include <fcntl.h>
#include <poll.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/types.h>
#include <sys/socket.h>
#include <unistd.h>

#include "common.h"

#define ERR(msg, args...) \
    errx(1, "%d: " msg, __LINE__, ##args)

#define SRV_IP "127.0.0.1"
#define PORT   12345

int main(void) {
    int ret;

    int s = CHECK(socket(AF_INET, SOCK_STREAM, 0));

#if 1 /* switch on/switch off non-blocking mode */
    int flags = CHECK(fcntl(s, F_GETFL, 0));
    CHECK(fcntl(s, F_SETFL, flags | O_NONBLOCK));
#endif

    struct sockaddr_in sa = {
        .sin_family = AF_INET,
        .sin_port = htons(PORT),
    };
    if (inet_aton(SRV_IP, &sa.sin_addr) != 1) {
        ERR("inet_aton failed");
    }
    ret = connect(s, (void*)&sa, sizeof(sa));

    if (!ret) {
        ERR("connection established immediately but expected connection refused");
    } else if (ret < 0 && errno != EINPROGRESS) {
        if (errno != ECONNREFUSED)
            ERR("connect didn't fail with EINPROGRESS but with %s", strerror(errno));
        puts("TEST OK: connection was refused immediately");
    } else {
        struct sockaddr_in sa;
        int linux_addrlen_int = sizeof(sa);
        ret = getsockname(s, &sa, &linux_addrlen_int);
        if (ret < 0)
            ERR("getsockname after EINPROGRESS failed with %s", strerror(errno));
        printf("Local IP address is: %s\n", inet_ntoa(sa.sin_addr));
        printf("Local port is: %d\n", (int)ntohs(sa.sin_port));

        ret = getpeername(s, &sa, &linux_addrlen_int);
        if (ret != -1 || errno != ENOTCONN)
            ERR("getpeername after EINPROGRESS didn't fail with ENOTCONN but with %s", strerror(errno));

        char buf[3] = "Hi";
        ssize_t bytes = send(s, buf, sizeof(buf), /*flags=*/0);
        if (bytes != -1 || errno != EAGAIN)
            ERR("send after EINPROGRESS didn't fail with ENOTCONN but with %s", strerror(errno));

        bytes = recv(s, buf, sizeof(buf), /*flags=*/0);
        if (bytes != -1 || errno != EAGAIN)
            ERR("recv after EINPROGRESS didn't fail with ENOTCONN but with %s", strerror(errno));

        struct pollfd infds[] = {
            {.fd = s, .events = POLLOUT},
        };
        ret = CHECK(poll(infds, 1, /*timout_ms=*/10000));
        if (ret == 0) {
            puts("TEST OK: connection timed out");
        } else {
            if (!(infds[0].revents & POLLOUT)) {
                ERR("polling returned something wrong");
            }
            int so_error;
            socklen_t optlen = sizeof(so_error);
            CHECK(getsockopt(s, SOL_SOCKET, SO_ERROR, &so_error, &optlen));
            if (optlen != sizeof(so_error) || so_error != ECONNREFUSED) {
                ERR("connection established after initial EINPROGRESS but expected connection refused");
            }
            puts("TEST OK: connection refused after initial EINPROGRESS");
        }
    }

    CHECK(close(s));
    return 0;
}

Now we can test the following cases:

  1. Classic blocking socket -- just change to #if 0. The connect() syscall will immediately fail with ECONNREFUSED.
  2. Non-blocking socket, non-existing peer. Don't do anything with IP Tables. The connect() syscall will first fail with EINPROGRESS, and then poll will immediately succeed, and the socket will have ECONNREFUSED.
  3. Non-blocking socket, unresponsive peer. Change IP Tables to drop the SYN packet. The connect() syscall will first fail with EINPROGRESS, and then poll will time out after 10 seconds.
  4. Non-blocking socket, slow but responsive peer. Change IP Tables to drop the SYN packet. Start the app, wait for 3 seconds. Restore IP tables. Now the connect() syscall will first fail with EINPROGRESS, and then poll will wait for these 3 seconds, and then the socket will have ECONNREFUSED.

Gramine commit hash

24a581c

@dimakuv
Copy link
Author

dimakuv commented Nov 6, 2023

I see two ways of fixing it.

Implement proper SOCK_CONNECTING state

Gramine doesn't have an intermediate state that would describe "the socket is currently connecting". Gramine implements a simplified state diagram:

/*
* Sockets can be in 4 states: NEW, BOUND, LISTENING and CONNECTED.
*
* +------------------+
* | |
* | |
* bind() listen() V accept() old socket
* +--> NEW --------------------> BOUND -------------> LISTEN --------------+
* | | | ^ new socket
* | | | | |
* | | | | |
* | | connect() | | disconnect() |
* | | | | (if it was bound) |
* | | connect() | | |
* | | | | |
* | | V | |
* | +---------------------> CONNECTED <--------------------------------+
* | |
* | disconnect() |
* | (if it was not bound) |
* +------------------------------+
*
*/

This diagram worked very well for us until now. This diagram also puts limitations that make introducing SOCK_CONNECTING quite non-trivial.

One problem is that Gramine doesn't have a proper place where to move from SOCK_CONNECTING to SOCK_CONNECTED. The Linux host doesn't inform us on such a state transition. Moreover, there seems to be no "definite" system call at which we could verify the state of the connection (select/poll/epoll? but it doesn't seem to be 100%, more like a recommendation).

Another problem is that there are some socket metadata updates on a successful connect(), which only happen at the connect() point:

gramine/libos/src/net/ip.c

Lines 233 to 237 in 24a581c

if (sock->state != SOCK_BOUND) {
assert(sock->state == SOCK_NEW);
assert(!sock->was_bound);
pal_to_linux_sockaddr(&pal_local_addr, &sock->local_addr, &sock->local_addrlen);
}

If we imagine that we split the connection logic into "CONNECTING actions" and "CONNECTED actions", we need to do actual connect() host syscall in the former set of actions, and we need to update the sock->local_addr metadata in the latter set of actions. But there is no separate API to get local_addr. I especially don't like to introduce it in the SGX case, because it creates a new OCALL and a new attack vector.

Propagate EINPROGRESS but consider the connection established

Basically, Gramine should return to the app that EINPROGRESS happens. But internally, Gramine continues as if the connection is established (i.e. moves to SOCK_CONNECTED state and updates socket metadata).

We don't need to do anything with select/poll/epoll -- the POLLOUT action will be reported by the host when the socket is finally connected (or the connection is broken).

One problem is that I am not sure if a EINPROGRESS-failing connect can still update the local_addr. Well, if not, we can patch the getsockname() syscall to retrieve the local address, but it's the same as the second problem in the previous section.

Another problem is that getsockopt(s, SOL_SOCKET, SO_ERROR, ...) will return 0 (success), because we don't have any way of checking for EINPROGRESS/ECONNREFUSED status of the connection.


I feel like both ways require non-trivial changes in our sockets code:

  • libos/src/sys/libos_socket.c
  • libos/src/net/ip.c
  • pal/src/host/linux/pal_sockets.c
  • pal/src/host/linux-sgx/host_ocalls.c

@kailun-qin
Copy link
Contributor

First, I'm more in favor of opt#1 -- Implement proper SOCK_CONNECTING state.

Moreover, there seems to be no "definite" system call at which we could verify the state of the connection (select/poll/epoll? but it doesn't seem to be 100%, more like a recommendation).

Yeah, I understand that there may not have "definite" syscalls, but in practice I think it's still good enough for most of the reasonable workloads -- if we put the non-blocking sockets into SOCK_CONNECTING state on connect() (when we get EINPROGRESS) and verify/proceed w/ the exact state of the connection on select()/poll()/epoll()? And functionality wise, w/ such approach, I think we could at least support "non-blocking socket + unresponsive peer" w/o losing anything else.

But there is no separate API to get local_addr. I especially don't like to introduce it in the SGX case, because it creates a new OCALL and a new attack vector.

Sorry I don't get this -- why do we need a separate API (and which API) to get local_addr? Can't we update the metadata at the same time when we verify/proceed w/ the state of the connection on select()/poll()/epoll() (if applicable)?

@dimakuv
Copy link
Author

dimakuv commented Nov 7, 2023

But there is no separate API to get local_addr. I especially don't like to introduce it in the SGX case, because it creates a new OCALL and a new attack vector.

Sorry I don't get this -- why do we need a separate API (and which API) to get local_addr? Can't we update the metadata at the same time when we verify/proceed w/ the state of the connection on select()/poll()/epoll() (if applicable)?

"Can't we update the metadata" -- but how will you do it? What PAL API will you call? There is no API to "get local IP address" in PAL. This "getting local IP address" is only achieved currently through the use of the "connect" API (PalSocketConnect()).

That's what I meant -- we don't have a separate PAL API to get this info.

@kailun-qin
Copy link
Contributor

That's what I meant -- we don't have a separate PAL API to get this info.

Thanks for the explanations! Now I understand -- I agree that we'd better reuse/extend some existing PAL APIs (if possible and applicable), like PalSocketConnect() as you proposed, so that no need to introduce a new PAL API.

@dimakuv
Copy link
Author

dimakuv commented Nov 7, 2023

Update: The good news is that Linux binds the local address during connect(), even if connect() returns EINPROGRESS.

This means that we can get the local_addr at the first PalSocketConnect() and memorize it, even if the API itself fails with EINPROGRESS. Since this was the only problematic metadata, this solves my Problem 2 from above.

Note: I updated the example C program above, to reflect the getsockname() success even on EINPROGRESS connections.

@dimakuv
Copy link
Author

dimakuv commented Nov 7, 2023

Update: getpeername() returns ENOTCONN in case of EINPROGRESS connection.

@dimakuv
Copy link
Author

dimakuv commented Nov 7, 2023

Update:

  • send() returns EAGAIN in case of EINPROGRESS connection.
  • recv() returns EAGAIN in case of EINPROGRESS connection.

I updated the test C program to check for these errors.

@dimakuv
Copy link
Author

dimakuv commented Nov 7, 2023

The PR #1643 is ready for review. It should fix this problem.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants