Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

start_connection raises an IndexError when socket creation fails with uvloop #93

Closed
Moosapoor opened this issue Sep 19, 2024 · 39 comments
Closed
Labels
bug Something isn't working problem in external library Problem is in an external library

Comments

@Moosapoor
Copy link

Describe the bug
If creating the socket fails with an Exception other than an OSError, the start_connection method will raise an IndexError when it's trying to get the first exception.

To Reproduce

import aiohttp
import pytest
from unittest import mock


@pytest.mark.asyncio
async def test_session_will_raise_index_error():
    with mock.patch("socket.socket") as mock_socket:
        mock_socket.side_effect = Exception()
        session = aiohttp.ClientSession()
        await session.get("http://sample.com")
    await session.close()

Logs

.virtualenvs/api-R653k7wR-py3.11/lib/python3.11/site-packages/aiohttp/client.py:657: in _request
    conn = await self._connector.connect(
.virtualenvs/api-R653k7wR-py3.11/lib/python3.11/site-packages/aiohttp/connector.py:564: in connect
    proto = await self._create_connection(req, traces, timeout)
.virtualenvs/api-R653k7wR-py3.11/lib/python3.11/site-packages/aiohttp/connector.py:975: in _create_connection
    _, proto = await self._create_direct_connection(req, traces, timeout)
.virtualenvs/api-R653k7wR-py3.11/lib/python3.11/site-packages/aiohttp/connector.py:1319: in _create_direct_connection
    transp, proto = await self._wrap_create_connection(
.virtualenvs/api-R653k7wR-py3.11/lib/python3.11/site-packages/aiohttp/connector.py:1073: in _wrap_create_connection
    sock = await aiohappyeyeballs.start_connection(
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

addr_infos = [(<AddressFamily.AF_INET: 2>, <SocketKind.SOCK_STREAM: 1>, 6, '', ('5.22.145.16', 80)), (<AddressFamily.AF_INET: 2>, <SocketKind.SOCK_STREAM: 1>, 6, '', ('5.22.145.121', 80))]

    async def start_connection(
        addr_infos: Sequence[AddrInfoType],
        *,
        local_addr_infos: Optional[Sequence[AddrInfoType]] = None,
        happy_eyeballs_delay: Optional[float] = None,
        interleave: Optional[int] = None,
        loop: Optional[asyncio.AbstractEventLoop] = None,
    ) -> socket.socket:
        """
        Connect to a TCP server.
    
        Create a socket connection to a specified destination.  The
        destination is specified as a list of AddrInfoType tuples as
        returned from getaddrinfo().
    
        The arguments are, in order:
    
        * ``family``: the address family, e.g. ``socket.AF_INET`` or
            ``socket.AF_INET6``.
        * ``type``: the socket type, e.g. ``socket.SOCK_STREAM`` or
            ``socket.SOCK_DGRAM``.
        * ``proto``: the protocol, e.g. ``socket.IPPROTO_TCP`` or
            ``socket.IPPROTO_UDP``.
        * ``canonname``: the canonical name of the address, e.g.
            ``"www.python.org"``.
        * ``sockaddr``: the socket address
    
        This method is a coroutine which will try to establish the connection
        in the background. When successful, the coroutine returns a
        socket.
    
        The expected use case is to use this method in conjunction with
        loop.create_connection() to establish a connection to a server::
    
                socket = await start_connection(addr_infos)
                transport, protocol = await loop.create_connection(
                    MyProtocol, sock=socket, ...)
        """
        if not (current_loop := loop):
            current_loop = asyncio.get_running_loop()
    
        single_addr_info = len(addr_infos) == 1
    
        if happy_eyeballs_delay is not None and interleave is None:
            # If using happy eyeballs, default to interleave addresses by family
            interleave = 1
    
        if interleave and not single_addr_info:
            addr_infos = _interleave_addrinfos(addr_infos, interleave)
    
        sock: Optional[socket.socket] = None
        exceptions: List[List[OSError]] = []
        if happy_eyeballs_delay is None or single_addr_info:
            # not using happy eyeballs
            for addrinfo in addr_infos:
                try:
                    sock = await _connect_sock(
                        current_loop, exceptions, addrinfo, local_addr_infos
                    )
                    break
                except OSError:
                    continue
        else:  # using happy eyeballs
            sock, _, _ = await staggered.staggered_race(
                (
                    functools.partial(
                        _connect_sock, current_loop, exceptions, addrinfo, local_addr_infos
                    )
                    for addrinfo in addr_infos
                ),
                happy_eyeballs_delay,
                loop=current_loop,
            )
    
        if sock is None:
            all_exceptions = [exc for sub in exceptions for exc in sub]
            try:
>               first_exception = all_exceptions[0]
E               IndexError: list index out of range

.virtualenvs/api-R653k7wR-py3.11/lib/python3.11/site-packages/aiohappyeyeballs/impl.py:102: IndexError

Python version

$ python --version  
Python 3.11.3

aiohttp version

$ python -m pip show aiohttp
Name: aiohttp
Version: 3.10.5
Summary: Async http client/server framework (asyncio)
Home-page: https://github.com/aio-libs/aiohttp
Author: 
Author-email: 
License: Apache 2

OS
macOs

Additional context
No additional context.

@Moosapoor Moosapoor added the bug Something isn't working label Sep 19, 2024
@bdraco
Copy link
Member

bdraco commented Sep 19, 2024

This code comes from stdlib. Is it possible that creating a socket can raise something other than OSError?

@Moosapoor
Copy link
Author

Is it possible that creating a socket can raise something other than OSError?

It happened in one of my projects and if I'm not wrong it was a RuntimeError.

@bdraco
Copy link
Member

bdraco commented Sep 19, 2024

Can you post steps to recreate the issue without manually patching the exception, and post the exception trace you received in production?

@Moosapoor
Copy link
Author

I don't know how to reproduce the issue in the real world tbh. It happened couple of times and I couldn't find any specific situation which could result the issue.
This is the stack trace:

WouldBlock: null
  File "anyio/streams/memory.py", line 98, in receive
    return self.receive_nowait()
  File "anyio/streams/memory.py", line 93, in receive_nowait
    raise WouldBlock
EndOfStream: null
  File "starlette/middleware/base.py", line 159, in call_next
    message = await recv_stream.receive()
  File "anyio/streams/memory.py", line 118, in receive
    raise EndOfStream
IndexError: list index out of range
  ...
  File "starlette/middleware/base.py", line 165, in call_next
    raise app_exc
  File "starlette/middleware/base.py", line 151, in coro
    await self.app(scope, receive_or_disconnect, send_no_error)
  File "starlette/middleware/exceptions.py", line 65, in __call__
    await wrap_app_handling_exceptions(self.app, conn)(scope, receive, send)
  File "starlette/_exception_handler.py", line 64, in wrapped_app
    raise exc
  File "starlette/_exception_handler.py", line 53, in wrapped_app
    await app(scope, receive, sender)
  File "starlette/routing.py", line 756, in __call__
    await self.middleware_stack(scope, receive, send)
  File "starlette/routing.py", line 776, in app
    await route.handle(scope, receive, send)
  File "starlette/routing.py", line 297, in handle
    await self.app(scope, receive, send)
  File "starlette/routing.py", line 77, in app
    await wrap_app_handling_exceptions(app, request)(scope, receive, send)
  File "starlette/_exception_handler.py", line 64, in wrapped_app
    raise exc
  File "starlette/_exception_handler.py", line 53, in wrapped_app
    await app(scope, receive, sender)
  File "starlette/routing.py", line 72, in app
    response = await func(request)
  File "fastapi/routing.py", line 297, in app
    raw_response = await run_endpoint_function(
  File "fastapi/routing.py", line 210, in run_endpoint_function
    return await dependant.call(**values)
  ...
  File "contextlib.py", line 210, in __aenter__
    return await anext(self.gen)
  ...
  File "contextlib.py", line 210, in __aenter__
    return await anext(self.gen)
  ...
  File "tenacity/_asyncio.py", line 71, in __anext__
    do = self.iter(retry_state=self._retry_state)
  File "__init__.py", line 314, in iter
    return fut.result()
  File "concurrent/futures/_base.py", line 449, in result
    return self.__get_result()
  File "concurrent/futures/_base.py", line 401, in __get_result
    raise self._exception
  ...
  File "contextlib.py", line 650, in enter_async_context
    result = await _enter(cm)
  File "contextlib.py", line 210, in __aenter__
    return await anext(self.gen)
  ...
  File "aiohttp/client.py", line 1353, in __aenter__
    self._resp = await self._coro
  File "aiohttp/client.py", line 657, in _request
    conn = await self._connector.connect(
  File "aiohttp/connector.py", line 564, in connect
    proto = await self._create_connection(req, traces, timeout)
  File "aiohttp/connector.py", line 975, in _create_connection
    _, proto = await self._create_direct_connection(req, traces, timeout)
  File "aiohttp/connector.py", line 1319, in _create_direct_connection
    transp, proto = await self._wrap_create_connection(
  File "aiohttp/connector.py", line 1073, in _wrap_create_connection
    sock = await aiohappyeyeballs.start_connection(
  File "aiohappyeyeballs/impl.py", line 102, in start_connection
    first_exception = all_exceptions[0]

@Dreamsorcerer
Copy link
Member

That traceback seems weirdly formatted, but I'm assuming the original exception comes from anyio, so maybe this is really a bug in anyio which should be raising some kind of OSError...?

@gryevns
Copy link

gryevns commented Sep 24, 2024

To give some more context, the exceptions for _ here are:

[
    RuntimeError('File descriptor 27 is used by transport <TCPTransport closed=False reading=True 0x2a4ce36edd00>'),
    RuntimeError('File descriptor 27 is used by transport <TCPTransport closed=False reading=True 0x2a4ce36edd00>')
]

@tkukushkin
Copy link

Experienced same problems too. We use aiohttp and after 3.10 sometimes ClientSession stops working at all, all requests end with timeout error. Before this, we see such errors in our logs:

Traceback (most recent call last):
  File "/usr/local/lib/python3.12/site-packages/cian_http/client.py", line 112, in request
    async with self._get_session().request(
  File "/usr/local/lib/python3.12/site-packages/aiohttp/client.py", line 1355, in __aenter__
    self._resp: _RetType = await self._coro
                           ^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/site-packages/aiohttp/client.py", line 659, in _request
    conn = await self._connector.connect(
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/site-packages/aiohttp/connector.py", line 557, in connect
    proto = await self._create_connection(req, traces, timeout)
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/site-packages/aiohttp/connector.py", line 1002, in _create_connection
    _, proto = await self._create_direct_connection(req, traces, timeout)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/site-packages/aiohttp/connector.py", line 1305, in _create_direct_connection
    transp, proto = await self._wrap_create_connection(
                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/site-packages/aiohttp/connector.py", line 1061, in _wrap_create_connection
    sock = await aiohappyeyeballs.start_connection(
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/site-packages/aiohappyeyeballs/impl.py", line 81, in start_connection
    sock = await _connect_sock(
           ^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/site-packages/aiohappyeyeballs/impl.py", line 166, in _connect_sock
    await loop.sock_connect(sock, address)
  File "uvloop/loop.pyx", line 2613, in sock_connect
  File "uvloop/loop.pyx", line 1087, in uvloop.loop.Loop._sock_connect
  File "uvloop/loop.pyx", line 835, in uvloop.loop.Loop._add_writer
  File "uvloop/loop.pyx", line 762, in uvloop.loop.Loop._ensure_fd_no_transport
RuntimeError: File descriptor 59 is used by transport <TCPTransport closed=False reading=True 0x55a6ec5c9460>

and

Traceback (most recent call last):
  File "/usr/local/lib/python3.12/site-packages/cian_http/client.py", line 112, in request
    async with self._get_session().request(
  File "/usr/local/lib/python3.12/site-packages/aiohttp/client.py", line 1353, in __aenter__
    self._resp = await self._coro
                 ^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/site-packages/aiohttp/client.py", line 657, in _request
    conn = await self._connector.connect(
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/site-packages/aiohttp/connector.py", line 564, in connect
    proto = await self._create_connection(req, traces, timeout)
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/site-packages/aiohttp/connector.py", line 975, in _create_connection
    _, proto = await self._create_direct_connection(req, traces, timeout)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/site-packages/aiohttp/connector.py", line 1319, in _create_direct_connection
    transp, proto = await self._wrap_create_connection(
                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/site-packages/aiohttp/connector.py", line 1073, in _wrap_create_connection
    sock = await aiohappyeyeballs.start_connection(
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/site-packages/aiohappyeyeballs/impl.py", line 102, in start_connection
    first_exception = all_exceptions[0]
                      ~~~~~~~~~~~~~~^^^
IndexError: list index out of range

Our configuration:

aiohttp.ClientSession(
    cookie_jar=aiohttp.DummyCookieJar(),
    trace_configs[...],
    connector=aiohttp.TCPConnector(limit=200),
)

We don't know how to reproduce it, it happens rarely but impact is too high, so we had to pin aiohttp 3.9.5 without aiohappyeyeballs.

@bdraco
Copy link
Member

bdraco commented Sep 30, 2024

```python
RuntimeError: File descriptor 59 is used by transport <TCPTransport closed=False reading=True 0x55a6ec5c9460>

Its possible there is a race in the asyncio staggered code where there are two winners like #100

@tkukushkin
Copy link

We haven't seen any AssertionError in our logs and we don't use eager task factory if it's important.

Is there any workaround? As far as I understand from my traceback, staggered_race wasn't used because of single_addr_info.

@bdraco
Copy link
Member

bdraco commented Sep 30, 2024

FWIW, this is the exact same code from asyncio.base_events

https://github.com/python/cpython/blob/35541c410d894d4fa18002f7fdbebfe522f8320e/Lib/asyncio/base_events.py#L1150

I haven't found any cpython issues similar to this one

@bdraco
Copy link
Member

bdraco commented Sep 30, 2024

Is everyone with the issue using uvloop?

@bdraco
Copy link
Member

bdraco commented Sep 30, 2024

It looks like uvloop can raise RuntimeError instead of OSError when loop.sock_connect fails.

@tkukushkin
Copy link

But can asyncio raise RuntimeError in such situations too?

@bdraco
Copy link
Member

bdraco commented Sep 30, 2024

If it can it means the code in asyncio.base_events doesn't handle it

@bdraco
Copy link
Member

bdraco commented Sep 30, 2024

Also please answer my question about using uvloop. I can't fix the issue if I don't know where to look for the source.

@gryevns
Copy link

gryevns commented Sep 30, 2024

For the original issue, yes we are using uvloop via uvicorn.

@tkukushkin
Copy link

Yes, we use uvloop as you can see from my traceback.

@bdraco
Copy link
Member

bdraco commented Sep 30, 2024

selector_events can raise OSError, RuntimeError, and ValueError.. so cpython doesn't handle this right either

@bdraco
Copy link
Member

bdraco commented Sep 30, 2024

ValueError should never be possible for our case

@bdraco
Copy link
Member

bdraco commented Sep 30, 2024

RuntimeError shouldn't happen unless there is a bug in the loop implementation which is allowing the socket to get reused

@tkukushkin
Copy link

But it's still unclear why there is no such problem when aiohappyeyeballs isn't used. We've seen such tracebacks only with aiohttp 3.10.

@bdraco
Copy link
Member

bdraco commented Sep 30, 2024

But it's still unclear why there is no such problem when aiohappyeyeballs isn't used. We've seen such tracebacks only with aiohttp 3.10.

I expect loop.sock_connect isn't cancellation safe in uvloop

@bdraco bdraco changed the title start_connection raises an IndexError when socket creation fails start_connection raises an IndexError when socket creation fails with uvloop Sep 30, 2024
@bdraco
Copy link
Member

bdraco commented Sep 30, 2024

We could start catching the RuntimeError: File descriptor 59 is used by transport <TCPTransport closed=False reading=True 0x55a6ec5c9460> which will give a better error message when it fails.

That won't solve the problem with uvloop reusing file descriptors though.

@tkukushkin
Copy link

So it means it's not so difficult to make MRE?

@bdraco bdraco added the problem in external library Problem is in an external library label Sep 30, 2024
@tkukushkin
Copy link

tkukushkin commented Sep 30, 2024

Tell me if I'm wrong, but it seems like sock_connect can't be cancelled at all.

import asyncio
import socket
import time

import anyio
import uvloop


async def main():
    while True:
        sock = socket.socket()
        with anyio.fail_after(0.05):
            s = time.time()
            await asyncio.get_running_loop().sock_connect(sock, ('54.159.225.35', 443))  # httpbin
            print(time.time() - s)


uvloop.run(main())

I see not timeouts and get only this output:

0.15363287925720215
0.15378308296203613
0.16184616088867188
...

UPD: I forgot about making socket non-blocking

@bdraco
Copy link
Member

bdraco commented Sep 30, 2024

@bdraco
Copy link
Member

bdraco commented Sep 30, 2024

I tried this and no luck so maybe a race somewhere else that interacts poorly with `sock_connect

import asyncio
import socket

import uvloop
import random


async def main():
    loop = asyncio.get_running_loop()
    while True:
        tasks = []
        for _ in range(10):
            sock = socket.socket()
            tasks.append(
                asyncio.Task(
                    loop.sock_connect(sock, ("54.159.225.35", 443)),
                    eager_start=True,
                    loop=loop,
                )
            )
        sleep = random.randint(0, 1)
        if sleep:
            await asyncio.sleep(sleep)

        for task in tasks:
            task.cancel()
            try:
                await task
            except asyncio.CancelledError:
                pass


uvloop.run(main())

@bdraco
Copy link
Member

bdraco commented Sep 30, 2024

I am not a uvloop expert by any means. We need someone who has experience with uvloop to figure this one out.

@tkukushkin
Copy link

I will also clarify, we use aiohttp on server side with handler_cancellation=True and it's not rare when our web handlers are cancelled and consequently requests with aiohttp to other services are also not rarely cancelled, we don't shield anything. Also 99.9% of our http requests are made to some services in kubernetes, which have only one IPv4 without IPv6. So I still don't understand how it works without aiohappyeyeballs with uvloop.

@bdraco
Copy link
Member

bdraco commented Sep 30, 2024

cancellation is just a guess where the problem is as thats so hard to get right it usually ends up being where the problem is.
It could be some other race in uvloop that is causing the leak.

Its also possible that the new staggered_race implementation in 2.4.3 doesn't hit the problem by pure luck.

@bdraco
Copy link
Member

bdraco commented Sep 30, 2024

I've tried a few more things but haven't had any luck getting uvloop to throw RuntimeError. Hopefully someone will come up with a reproducer so we can open an issue with uvloop

@bdraco
Copy link
Member

bdraco commented Oct 2, 2024

Can you still reproduce with 2.4.3?

@Moosapoor
Copy link
Author

I haven't bumped the version to 2.4.3 yet. Let me do that and I'll let you know if it happened again. Thanks for resolving the issue.

@Moosapoor
Copy link
Author

@bdraco I can confirm that the IndexError bug is fixed in 2.4.3 and now we're receiving the actual underlying exception (i.e. RuntimeError related to the "File descriptor"). Thanks.

@todddialpad
Copy link
Contributor

I've experienced the RuntimeError too and had some success in reproducing it statistically. The setup for me is an app that runs both an async server and aiohttp clients.
After instrumenting aiohttp, aiohappyeyeballs, uvloop and libuv with debug code, the sequence of events I am seeing are:

  • A client connection starts, causing the manual socket creation and connection in aiohappyeyeballs
  • A timeout happens on the client connection
  • An incoming server request is accepted, assigned the same file descriptor as the just abandoned client connection
  • The server request appears to proceed, at least to the point that the socket is read
  • At some point here a read of /proc/self/fd/ no longer shows the file descriptor as open, but the event loop doesn't generate any events for the transport
  • A new client connection starts and the call in aiohappyeyeballs to create a new socket results in a socket with the same file descriptor number as the preceding client connection and server request
  • Once the new client connection gets to the point of registering the new transport with uvloop, the RuntimeError is raised because uvloop still believes the server request transport is using it (the transport is neither _closing nor _closed).

@todddialpad
Copy link
Contributor

todddialpad commented Nov 20, 2024

I suspect that without aiohappyeyeballs, the socket creation is done inside libuv (aiohttp calls loop.create_connection() which creates a transport which then calls libuv.uv_tcp_connect which will eventually result in a callback with a connected socket), and in this process cleans up or avoids the file descriptor number reuse problem.

@todddialpad
Copy link
Contributor

I made a small change to uvloop and ran my tests. The failures disappeared. The patch I made was

diff --git a/uvloop/loop.pyx b/uvloop/loop.pyx
index f9a5a23..9797684 100644
--- a/uvloop/loop.pyx
+++ b/uvloop/loop.pyx
@@ -1877,6 +1877,7 @@ cdef class Loop:
             AddrInfo ai_local = None
             AddrInfo ai_remote
             TCPTransport tr
+            int sockfd
 
             system.addrinfo *rai = NULL
             system.addrinfo *lai = NULL
@@ -2060,8 +2061,9 @@ cdef class Loop:
             waiter = self._new_future()
             tr = TCPTransport.new(self, protocol, None, waiter, context)
             try:
+                sockfd = sock.detach()
                 # libuv will make socket non-blocking
-                tr._open(sock.fileno())
+                tr._open(sockfd)
                 tr._init_protocol()
                 await waiter
             except (KeyboardInterrupt, SystemExit):
@@ -2075,8 +2077,6 @@ cdef class Loop:
                 tr._close()
                 raise
 
-            tr._attach_fileobj(sock)
-
         if ssl:
             app_transport = protocol._get_app_transport(context)
             try:

So basically, the call to loop.create_connection takes ownership of the file descriptor and invalidates the socket object.
I'm not sure this is the correct fix, but at least it does suggest the underlying root cause is contention over the ownership of the file descriptor.

@bdraco
Copy link
Member

bdraco commented Nov 30, 2024

fixed via #114

@bdraco bdraco closed this as completed Nov 30, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working problem in external library Problem is in an external library
Projects
None yet
Development

No branches or pull requests

6 participants