
Fix race condition that leads Quart to hang with uvicorn #848

Merged
euri10 merged 25 commits into encode:master from 847_quart_race on Nov 18, 2020

Conversation

@euri10 euri10 commented Nov 12, 2020

EDIT: ready for review. Long story short, there is a race condition that is (hopefully) fixed; details below on why it happens.

fixes #847
fixes #748

Just posting this as a draft for potential ideas/discussion to solve/understand #847; I've been getting headaches trying to think about it clearly!

I've added some trace logs on a branch,

then ran with:

docker run -v `pwd`:/data  --net=host --rm williamyeh/wrk -c 1 -d 1s -t 1 -s payload.lua http://127.0.0.1:8000

so that wrk sends 1 concurrent request for 1s.

This is enough to get only 1 req/s on Quart and 140ish on Starlette.

The diff of the two logs is shown below; it is also available on the branch as quart.log and starlette.log:

[screenshot: side-by-side diff of the quart.log and starlette.log trace output]

What we see is that the two logs are identical up to line 19:
TRACE: 127.0.0.1:59398 - ASGI [2] Receive {'type': 'http.request', 'body': '<199 bytes>', 'more_body': False}

At that point they diverge: with Quart, uvicorn's RequestResponseCycle receive coroutine is entered again, while with Starlette the send messages come through.

The send messages are slightly different too:

quart

TRACE:    127.0.0.1:59138 - ASGI [2] Send {'type': 'http.response.start', 'status': 200, 'headers': [(b'content-type', b'text/html; charset=utf-8'), (b'content-length', b'199')]}
TRACE:    127.0.0.1:59138 - ASGI [2] Send {'type': 'http.response.body', 'body': '<199 bytes>', 'more_body': True}
TRACE:    127.0.0.1:59138 - ASGI [2] Send {'type': 'http.response.body', 'body': '<0 bytes>', 'more_body': False}

starlette

TRACE:    127.0.0.1:59398 - ASGI [2] Send {'type': 'http.response.start', 'status': 200, 'headers': [(b'content-length', b'199')]}
TRACE:    127.0.0.1:59398 - ASGI [2] Send {'type': 'http.response.body', 'body': '<199 bytes>'}

Note the more_body: True followed by more_body: False with Quart, whereas Starlette sends a single body message without more_body.

I suspect there is a case where, in uvicorn, we don't set the response_complete flag where we should, but I really fail to see where...
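To illustrate the kind of lost wake-up a single shared event makes possible, here is a toy asyncio sketch (not uvicorn code, just the bare mechanism): if a stale receive from a previous request/response cycle is still waiting on the shared event, it can consume and clear the wake-up that was meant for the current cycle, which then waits forever.

import asyncio

# Toy sketch of the suspected lost wake-up, not uvicorn code.
async def main():
    event = asyncio.Event()  # shared, protocol-level event

    async def stale_receive():
        # receive() left over from the previous request/response cycle
        await event.wait()
        event.clear()  # consumes the wake-up meant for the next cycle

    stale = asyncio.create_task(stale_receive())
    await asyncio.sleep(0)  # let the stale receive reach event.wait()

    event.set()             # data for the *next* request arrives
    await stale             # the stale waiter wakes up and clears the flag

    # The current cycle only starts waiting now, after the wake-up has
    # already been set and cleared: its receive() blocks until a timeout
    # or disconnect, which matches the hang observed above.
    try:
        await asyncio.wait_for(event.wait(), timeout=1)
    except asyncio.TimeoutError:
        print("current cycle's receive() never woke up")

asyncio.run(main())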

euri10 commented Nov 12, 2020

With the change, running the various apps contained in 846_quart_race.py, it at least no longer hangs after the first "batch" of wrk requests (i.e. the number of requests specified with the -c flag). A sketch of what such an app could look like follows the benchmark output below.

pure asgi app

❯ docker run -v `pwd`:/data  --net=host --rm williamyeh/wrk -c 512 -d 1m -t 16 -s payload.lua http://127.0.0.1:8000
Running 1m test @ http://127.0.0.1:8000
  16 threads and 512 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency     1.19s   105.60ms   1.87s    89.42%
    Req/Sec    29.13     25.37   310.00     95.89%
  25598 requests in 1.00m, 7.15MB read
Requests/sec:    425.95
Transfer/sec:    121.88KB

quart app

❯ docker run -v `pwd`:/data  --net=host --rm williamyeh/wrk -c 512 -d 1m -t 16 -s payload.lua http://127.0.0.1:8000
Running 1m test @ http://127.0.0.1:8000
  16 threads and 512 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency     1.67s   361.46ms   1.99s    80.95%
    Req/Sec    19.79     22.00   230.00     93.63%
  14196 requests in 1.00m, 4.51MB read
  Socket errors: connect 0, read 0, write 0, timeout 13461
Requests/sec:    236.21
Transfer/sec:     76.81KB

starlette app

❯ docker run -v `pwd`:/data  --net=host --rm williamyeh/wrk -c 512 -d 1m -t 16 -s payload.lua http://127.0.0.1:8000
Running 1m test @ http://127.0.0.1:8000
  16 threads and 512 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency   540.39ms   38.98ms   1.18s    95.28%
    Req/Sec    63.85     34.12   313.00     92.74%
  56598 requests in 1.00m, 15.81MB read
Requests/sec:    942.03
Transfer/sec:    269.54KB
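For readers who want to rerun these numbers: 846_quart_race.py lives only on the PR branch and is not reproduced here, so the following is just a hedged sketch of what the "pure asgi app" variant could look like (the traces above suggest payload.lua sends a ~199-byte body that gets echoed back; the real repro file may differ).

# Hypothetical stand-in for the "pure asgi app" benchmarked above;
# the actual 846_quart_race.py on the PR branch may differ.
async def app(scope, receive, send):
    assert scope["type"] == "http"

    # Drain the request body sent by wrk's payload.lua.
    body = b""
    more_body = True
    while more_body:
        message = await receive()
        body += message.get("body", b"")
        more_body = message.get("more_body", False)

    # Echo it back, mirroring the ~199-byte responses seen in the traces.
    await send({
        "type": "http.response.start",
        "status": 200,
        "headers": [
            (b"content-type", b"text/plain"),
            (b"content-length", str(len(body)).encode()),
        ],
    })
    await send({"type": "http.response.body", "body": body})

Serve it with uvicorn and point the same wrk command at it.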

euri10 commented Nov 12, 2020

I would appreciate some feedback before jumping to making the same change in h11, but it looks like creating a new asyncio Event per RequestResponseCycle, instead of keeping it at the HttpToolsProtocol level, does the job.
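To make the shape of the change concrete, here is a heavily simplified sketch; only the attribute names (message_event, cycle) follow the PR, everything else is stripped down and should not be read as the actual uvicorn classes.

import asyncio

class RequestResponseCycle:
    def __init__(self):
        # After the fix: each cycle owns its own event, so a wake-up can
        # only be consumed by the cycle it was meant for.
        self.message_event = asyncio.Event()

    async def receive(self):
        # Simplified: wait for more request body / disconnect for *this* cycle.
        await self.message_event.wait()
        self.message_event.clear()


class HttpToolsProtocol:
    def __init__(self):
        self.cycle = None
        # Before the fix the event lived here instead, shared by every
        # cycle on a keep-alive connection:
        #     self.message_event = asyncio.Event()

    def on_message_complete(self):
        # Protocol callbacks now poke the current cycle's event rather
        # than a connection-wide one.
        if self.cycle is not None:
            self.cycle.message_event.set()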

@spacedragon

This comment has been minimized.

euri10 commented Nov 12, 2020

> I find there is a deadlock on this line https://github.com/encode/uvicorn/blob/master/uvicorn/protocols/http/httptools_impl.py#L549 until it times out or disconnects

Yep, you're absolutely correct; I think having a message_event per cycle solves it.

@euri10 euri10 left a comment

846_quart_race.py and payload.lua will of course be removed; they are only here to give something reproducible if someone feels like digging!

euri10 commented Nov 12, 2020

This should also fix #748, if you have some time to check, @itayperl.

@itayperl

@euri10 this is similar to how I solved #748 in my fork. I suggest also adding a self.message_event.set() after line 532 (self.response_complete = True) to interrupt any ongoing receives.
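In other words (a simplified sketch of that suggestion, not a verbatim copy of uvicorn/protocols/http/httptools_impl.py): once the final response body chunk has been sent, set the event so that any receive() still blocked on it can return instead of waiting for request data that will never come.

import asyncio

class RequestResponseCycle:
    # Simplified sketch of the suggestion; not the real uvicorn class.
    def __init__(self):
        self.message_event = asyncio.Event()
        self.response_complete = False

    async def send(self, message):
        if message["type"] == "http.response.body":
            more_body = message.get("more_body", False)
            # ... write the chunk to the transport here ...
            if not more_body:
                self.response_complete = True
                # Wake up any receive() still parked on message_event.wait()
                # so it can observe response_complete instead of hanging.
                self.message_event.set()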

euri10 commented Nov 12, 2020

I just looked at how you did it; it's slightly different: I removed the protocol-level self.message_event entirely, while you keep it there and reassign it when constructing the RequestResponseCycle.

I have no preference and am not sure which way is best.

That said, to build some confidence around the fix, some tests would help. The snippet in #748 looks like a good basis, but I have been unable to reproduce the hanging behaviour in Python alone; doing the equivalent of curl localhost/foo localhost/bar in Python is apparently not that simple.

@itayperl

this worked for me:

import requests
with requests.Session() as s:
    s.get('http://localhost:8000/foo')
    s.get('http://localhost:8000/bar')
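And for completeness, a minimal sketch of the kind of two-route Quart app this could be run against; the actual snippet lives in #748 and may differ. Before the fix, the second request on the reused keep-alive connection is the one that hangs.

# Hypothetical minimal app for the repro above; the exact snippet is in #748.
from quart import Quart

app = Quart(__name__)

@app.route("/foo")
async def foo():
    return "foo"

@app.route("/bar")
async def bar():
    return "bar"

# Serve with an ASGI server, e.g.: uvicorn repro:app
# (assuming this file is saved as repro.py), then run the
# requests.Session() snippet above against it.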

@delong-coder

Hey @euri10, I agree with @itayperl about adding a self.message_event.set() after self.response_complete = True. That one line increased QPS from 9000 to 9400 with my simple Quart app on a 32-core machine.
From my understanding, it lets the pending asyncio.Event().wait() calls in the ongoing receives complete and their resources be reclaimed. Do you think that is correct?

@euri10 euri10 changed the title Quart app hanging with uvicorn Fix race condition that leads Quart to hang with uvicorn Nov 13, 2020
@euri10 euri10 marked this pull request as ready for review November 13, 2020 07:49
euri10 commented Nov 13, 2020

> Hey @euri10, I agree with @itayperl about adding a self.message_event.set() after self.response_complete = True. That one line increased QPS from 9000 to 9400 with my simple Quart app on a 32-core machine.
> From my understanding, it lets the pending asyncio.Event().wait() calls in the ongoing receives complete and their resources be reclaimed. Do you think that is correct?

Yes, it seems to be the case; it's always hard to figure out what the "order" should be in that interleaved spaghetti of coroutines...

@euri10 euri10 requested a review from a team November 17, 2020 08:13
@delong-coder

@tomchristie @florimondmanca Could you please help review this PR?
Thanks!

@florimondmanca florimondmanca left a comment

Only found one small nit. Otherwise, given the feedback that people have been running a similar fix in production for a while, LGTM. :-) Nice one!

@@ -146,7 +145,7 @@ def connection_lost(self, exc):
                 # Premature client disconnect
                 pass

-            self.message_event.set()
+            self.cycle.message_event.set()
@florimondmanca

Do we need…?

Suggested change:
-            self.cycle.message_event.set()
+            if self.cycle is not None:
+                self.cycle.message_event.set()

(I see we check this higher up in this method, and we do in the HttpTools implementation.)

@euri10

I remember that I added the if self.cycle is not None: check in the httptools implementation because one test failed without it. The fact that the tests pass without it in the h11 implementation is probably (don't quote me on that) because the failing test only runs with a default config, which by default picks the httptools implementation.

@euri10

I applied your suggestion to both the httptools and h11 implementations, @florimondmanca.

@euri10 euri10 merged commit de21361 into encode:master Nov 18, 2020
@euri10 euri10 deleted the 847_quart_race branch November 18, 2020 19:56
euri10 commented Nov 18, 2020

Thanks for the review, @florimondmanca!
Kudos to @itayperl for the highly accurate description of the bug and the fix 👍
Thanks to @delong-coder too for the wrk tests; I wish I could run wrk on a 32-core machine!

@DavidLangworthy

Thanks all!
