Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Race Condition After Async Cancellations Breaks Connection Pool #961

Open
1 task done
TheTechromancer opened this issue Oct 7, 2024 · 0 comments
Open
1 task done

Comments

@TheTechromancer
Copy link

This bug occurs after certain async cancellations, and causes the HTTP pool to get stuck in a degraded state, unable to issue requests. It looks like a resurfacing of this issue.

httpcore version: 1.0.5
python version: 3.12.4

Code to reproduce:

import random
import asyncio
import httpcore
from collections import defaultdict

# Global dictionary to store status code counts
status_counts = defaultdict(int)


async def make_request(pool):
    try:
        response = await pool.request(
            method="GET",
            url=("https://localhost:8000/"),
        )
        status_counts[response.status] += 1
    except BaseException:
        status_counts[0] += 1


async def spawn_and_cancel_tasks(pool):
    while True:
        tasks = [asyncio.create_task(make_request(pool)) for _ in range(100)]
        await asyncio.sleep(random.random())  # Allow some time for the tasks to start
        for task in tasks:
            task.cancel()
        await asyncio.gather(*tasks, return_exceptions=True)


async def monitor_stats(pool):
    while True:
        status_str = ", ".join(f"{code}: {count}" for code, count in sorted(status_counts.items()))
        print(f"Status codes: {status_str} | Pool {pool}")
        status_counts.clear()
        await asyncio.sleep(1)


async def main():

    async with httpcore.AsyncConnectionPool() as pool:
        stats_task = asyncio.create_task(monitor_stats(pool))
        spawn_task = asyncio.create_task(spawn_and_cancel_tasks(pool))

        try:
            await asyncio.gather(stats_task, spawn_task)
        except KeyboardInterrupt:
            print("Stopping...")

if __name__ == "__main__":
    asyncio.run(main())

Output:

Status codes:  | Pool <AsyncConnectionPool [Requests: 0 active, 0 queued | Connections: 0 active, 0 idle]>
Status codes: 200: 200 | Pool <AsyncConnectionPool [Requests: 0 active, 0 queued | Connections: 0 active, 10 idle]>
Status codes: 0: 202, 200: 139 | Pool <AsyncConnectionPool [Requests: 62 active, 0 queued | Connections: 8 active, 2 idle]>
Status codes: 0: 142, 200: 117 | Pool <AsyncConnectionPool [Requests: 5 active, 0 queued | Connections: 5 active, 0 idle]>
Status codes: 0: 35, 200: 165 | Pool <AsyncConnectionPool [Requests: 5 active, 0 queued | Connections: 5 active, 5 idle]>
Status codes: 200: 100 | Pool <AsyncConnectionPool [Requests: 5 active, 0 queued | Connections: 5 active, 5 idle]>
Status codes: 0: 155, 200: 45 | Pool <AsyncConnectionPool [Requests: 5 active, 100 queued | Connections: 10 active, 0 idle]>
Status codes: 0: 300 | Pool <AsyncConnectionPool [Requests: 5 active, 0 queued | Connections: 10 active, 0 idle]>
Status codes:  | Pool <AsyncConnectionPool [Requests: 5 active, 100 queued | Connections: 10 active, 0 idle]>
Status codes: 0: 200 | Pool <AsyncConnectionPool [Requests: 5 active, 0 queued | Connections: 10 active, 0 idle]> <- stops responding here
Status codes: 0: 100 | Pool <AsyncConnectionPool [Requests: 5 active, 100 queued | Connections: 10 active, 0 idle]>
Status codes: 0: 200 | Pool <AsyncConnectionPool [Requests: 5 active, 0 queued | Connections: 10 active, 0 idle]>
Status codes: 0: 100 | Pool <AsyncConnectionPool [Requests: 5 active, 100 queued | Connections: 10 active, 0 idle]>
Status codes: 0: 100 | Pool <AsyncConnectionPool [Requests: 5 active, 100 queued | Connections: 10 active, 0 idle]>
Status codes: 0: 200 | Pool <AsyncConnectionPool [Requests: 5 active, 100 queued | Connections: 10 active, 0 idle]>
Status codes: 0: 200 | Pool <AsyncConnectionPool [Requests: 5 active, 100 queued | Connections: 10 active, 0 idle]>
Status codes: 0: 300 | Pool <AsyncConnectionPool [Requests: 5 active, 100 queued | Connections: 10 active, 0 idle]>

You can see from the above output the point where the pool stops issuing new requests. This is reproducible using the default python http server:

echo 'a' > index.html
python -m http.server

The bug applies to both asyncio and trio. However in trio it's much narrower. With asyncio it triggers almost immediately, while trio takes about 30 seconds.

Code to reproduce in trio:

import random
import trio
import httpcore
from collections import defaultdict

# Global dictionary to store status code counts
status_counts = defaultdict(int)


async def make_request(pool):
    try:
        response = await pool.request(
            method="GET",
            url="http://localhost:8000/",
        )
        status_counts[response.status] += 1
    except BaseException:
        status_counts[0] += 1


async def spawn_and_cancel_tasks(pool):
    while True:
        with trio.CancelScope() as cancel_scope:
            async with trio.open_nursery() as nursery:
                for _ in range(100):
                    nursery.start_soon(make_request, pool)
                await trio.sleep(random.random())  # Allow some time for the tasks to start
                cancel_scope.cancel()


async def monitor_stats(pool):
    while True:
        status_str = ", ".join(f"{code}: {count}" for code, count in sorted(status_counts.items(), reverse=True))
        print(f"Status codes: {status_str} | Pool {pool}")
        # Clear the counts for the next iteration
        status_counts.clear()
        await trio.sleep(1)


async def main():

    async with httpcore.AsyncConnectionPool() as pool:
        async with trio.open_nursery() as nursery:
            nursery.start_soon(monitor_stats, pool)
            await spawn_and_cancel_tasks(pool)


if __name__ == "__main__":
    try:
        trio.run(main)
    except KeyboardInterrupt:
        print("Stopping...")

Sorry to bring this up again; I know these types of issues suck to troubleshoot. But it's been causing issues for us in production. Please let me know if there's anything else I can do to help.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

1 participant