
Fix race condition that leads Quart to hang with uvicorn #848

Merged
euri10 merged 25 commits into encode:master from 847_quart_race on Nov 18, 2020

Conversation

@euri10 euri10 commented Nov 12, 2020

EDIT: ready for review. Long story short, there is a race condition that is (hopefully) fixed; details below on why it happens.

fixes #847
fixes #748

Just posting this as a draft for potential ideas/discussion to solve/understand #847; I've been getting headaches trying to think about it clearly!

I've added some trace logs on a branch,

then ran with:

docker run -v `pwd`:/data  --net=host --rm williamyeh/wrk -c 1 -d 1s -t 1 -s payload.lua http://127.0.0.1:8000

so that wrk sends 1 concurrent request for 1s.

This is enough to get only 1 req/s on Quart and 140ish on Starlette.

The diff of the two logs is shown below; it is also available on the branch as quart.log and starlette.log:

[screenshot: side-by-side diff of the quart.log and starlette.log trace output]

What we see is that the two logs are identical up to line 19:
TRACE: 127.0.0.1:59398 - ASGI [2] Receive {'type': 'http.request', 'body': '<199 bytes>', 'more_body': False}

At that point they diverge: with Quart, uvicorn's RequestResponseCycle receive coroutine is entered again, while with Starlette the send messages come through.

The send messages are slightly different too:

quart

TRACE:    127.0.0.1:59138 - ASGI [2] Send {'type': 'http.response.start', 'status': 200, 'headers': [(b'content-type', b'text/html; charset=utf-8'), (b'content-length', b'199')]}
TRACE:    127.0.0.1:59138 - ASGI [2] Send {'type': 'http.response.body', 'body': '<199 bytes>', 'more_body': True}
TRACE:    127.0.0.1:59138 - ASGI [2] Send {'type': 'http.response.body', 'body': '<0 bytes>', 'more_body': False}

starlette

TRACE:    127.0.0.1:59398 - ASGI [2] Send {'type': 'http.response.start', 'status': 200, 'headers': [(b'content-length', b'199')]}
TRACE:    127.0.0.1:59398 - ASGI [2] Send {'type': 'http.response.body', 'body': '<199 bytes>'}

Note the more_body: True followed by more_body: False with Quart, whereas Starlette sends a single body message without more_body.

I suspect there is a case where, in uvicorn, we don't set the response_complete flag where we should, but I really fail to see where...
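To illustrate the kind of lost wake-up a single shared event makes possible, here is a toy asyncio sketch (not uvicorn code, just the bare mechanism): if a stale receive from a previous request/response cycle is still waiting on the shared event, it can consume and clear the wake-up that was meant for the current cycle, which then waits forever.

import asyncio

# Toy sketch of the suspected lost wake-up, not uvicorn code.
async def main():
    event = asyncio.Event()  # shared, protocol-level event

    async def stale_receive():
        # receive() left over from the previous request/response cycle
        await event.wait()
        event.clear()  # consumes the wake-up meant for the next cycle

    stale = asyncio.create_task(stale_receive())
    await asyncio.sleep(0)  # let the stale receive reach event.wait()

    event.set()             # data for the *next* request arrives
    await stale             # the stale waiter wakes up and clears the flag

    # The current cycle only starts waiting now, after the wake-up has
    # already been set and cleared: its receive() blocks until a timeout
    # or disconnect, which matches the hang observed above.
    try:
        await asyncio.wait_for(event.wait(), timeout=1)
    except asyncio.TimeoutError:
        print("current cycle's receive() never woke up")

asyncio.run(main())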

euri10 commented Nov 12, 2020

With the change, running the various apps contained in 846_quart_race.py, it at least no longer hangs after the first "batch" of wrk requests (i.e. the number of requests specified with the -c flag). A sketch of what such an app could look like follows the benchmark output below.

pure asgi app

❯ docker run -v `pwd`:/data  --net=host --rm williamyeh/wrk -c 512 -d 1m -t 16 -s payload.lua http://127.0.0.1:8000
Running 1m test @ http://127.0.0.1:8000
  16 threads and 512 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency     1.19s   105.60ms   1.87s    89.42%
    Req/Sec    29.13     25.37   310.00     95.89%
  25598 requests in 1.00m, 7.15MB read
Requests/sec:    425.95
Transfer/sec:    121.88KB

quart app

❯ docker run -v `pwd`:/data  --net=host --rm williamyeh/wrk -c 512 -d 1m -t 16 -s payload.lua http://127.0.0.1:8000
Running 1m test @ http://127.0.0.1:8000
  16 threads and 512 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency     1.67s   361.46ms   1.99s    80.95%
    Req/Sec    19.79     22.00   230.00     93.63%
  14196 requests in 1.00m, 4.51MB read
  Socket errors: connect 0, read 0, write 0, timeout 13461
Requests/sec:    236.21
Transfer/sec:     76.81KB

starlette app

❯ docker run -v `pwd`:/data  --net=host --rm williamyeh/wrk -c 512 -d 1m -t 16 -s payload.lua http://127.0.0.1:8000
Running 1m test @ http://127.0.0.1:8000
  16 threads and 512 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency   540.39ms   38.98ms   1.18s    95.28%
    Req/Sec    63.85     34.12   313.00     92.74%
  56598 requests in 1.00m, 15.81MB read
Requests/sec:    942.03
Transfer/sec:    269.54KB
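For readers who want to rerun these numbers: 846_quart_race.py lives only on the PR branch and is not reproduced here, so the following is just a hedged sketch of what the "pure asgi app" variant could look like (the traces above suggest payload.lua sends a ~199-byte body that gets echoed back; the real repro file may differ).

# Hypothetical stand-in for the "pure asgi app" benchmarked above;
# the actual 846_quart_race.py on the PR branch may differ.
async def app(scope, receive, send):
    assert scope["type"] == "http"

    # Drain the request body sent by wrk's payload.lua.
    body = b""
    more_body = True
    while more_body:
        message = await receive()
        body += message.get("body", b"")
        more_body = message.get("more_body", False)

    # Echo it back, mirroring the ~199-byte responses seen in the traces.
    await send({
        "type": "http.response.start",
        "status": 200,
        "headers": [
            (b"content-type", b"text/plain"),
            (b"content-length", str(len(body)).encode()),
        ],
    })
    await send({"type": "http.response.body", "body": body})

Serve it with uvicorn and point the same wrk command at it.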

euri10 commented Nov 12, 2020

I would appreciate some feedback before jumping to making the same change in h11, but it looks like creating a new asyncio Event per RequestResponseCycle, instead of keeping it at the HttpToolsProtocol level, does the job.
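To make the shape of the change concrete, here is a heavily simplified sketch; only the attribute names (message_event, cycle) follow the PR, everything else is stripped down and should not be read as the actual uvicorn classes.

import asyncio

class RequestResponseCycle:
    def __init__(self):
        # After the fix: each cycle owns its own event, so a wake-up can
        # only be consumed by the cycle it was meant for.
        self.message_event = asyncio.Event()

    async def receive(self):
        # Simplified: wait for more request body / disconnect for *this* cycle.
        await self.message_event.wait()
        self.message_event.clear()


class HttpToolsProtocol:
    def __init__(self):
        self.cycle = None
        # Before the fix the event lived here instead, shared by every
        # cycle on a keep-alive connection:
        #     self.message_event = asyncio.Event()

    def on_message_complete(self):
        # Protocol callbacks now poke the current cycle's event rather
        # than a connection-wide one.
        if self.cycle is not None:
            self.cycle.message_event.set()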

@spacedragon

This comment has been minimized.

euri10 commented Nov 12, 2020

> I find there is a deadlock on this line https://github.com/encode/uvicorn/blob/master/uvicorn/protocols/http/httptools_impl.py#L549 until it times out or disconnects

Yep, you're absolutely correct; I think having a message_event per cycle solves it.

@euri10 euri10 left a comment

846_quart_race.py and payload.lua will of course be removed; they are only here to give something reproducible if someone feels like digging!

euri10 commented Nov 12, 2020

This should also fix #748, if you have some time to check, @itayperl.

@itayperl

@euri10 this is similar to how I solved #748 in my fork. I suggest also adding a self.message_event.set() after line 532 (self.response_complete = True) to interrupt any ongoing receives.
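In other words (a simplified sketch of that suggestion, not a verbatim copy of uvicorn/protocols/http/httptools_impl.py): once the final response body chunk has been sent, set the event so that any receive() still blocked on it can return instead of waiting for request data that will never come.

import asyncio

class RequestResponseCycle:
    # Simplified sketch of the suggestion; not the real uvicorn class.
    def __init__(self):
        self.message_event = asyncio.Event()
        self.response_complete = False

    async def send(self, message):
        if message["type"] == "http.response.body":
            more_body = message.get("more_body", False)
            # ... write the chunk to the transport here ...
            if not more_body:
                self.response_complete = True
                # Wake up any receive() still parked on message_event.wait()
                # so it can observe response_complete instead of hanging.
                self.message_event.set()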

euri10 commented Nov 12, 2020

I just looked at how you did it; it's slightly different: I removed the protocol-level self.message_event entirely, while you keep it there and reassign it when constructing the RequestResponseCycle.

I have no preference and am not sure which way is best.

That said, to build some confidence around the fix, some tests would help. The snippet in #748 looks like a good basis, but I have been unable to reproduce the hanging behaviour in Python alone; doing the equivalent of curl localhost/foo localhost/bar in Python is apparently not that simple.

@itayperl

this worked for me:

import requests
with requests.Session() as s:
    s.get('http://localhost:8000/foo')
    s.get('http://localhost:8000/bar')
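And for completeness, a minimal sketch of the kind of two-route Quart app this could be run against; the actual snippet lives in #748 and may differ. Before the fix, the second request on the reused keep-alive connection is the one that hangs.

# Hypothetical minimal app for the repro above; the exact snippet is in #748.
from quart import Quart

app = Quart(__name__)

@app.route("/foo")
async def foo():
    return "foo"

@app.route("/bar")
async def bar():
    return "bar"

# Serve with an ASGI server, e.g.: uvicorn repro:app
# (assuming this file is saved as repro.py), then run the
# requests.Session() snippet above against it.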

@delong-coder

Hey @euri10, I agree with @itayperl about adding a self.message_event.set() after self.response_complete = True. That one line increased QPS from 9000 to 9400 with my simple Quart app on a 32-core machine.
From my understanding, it lets the pending asyncio.Event().wait() calls in the ongoing receives complete and their resources be reclaimed. Do you think that is correct?

@euri10 euri10 changed the title Quart app hanging with uvicorn Fix race condition that leads Quart to hang with uvicorn Nov 13, 2020
@euri10 euri10 marked this pull request as ready for review November 13, 2020 07:49
euri10 commented Nov 13, 2020

> Hey @euri10, I agree with @itayperl about adding a self.message_event.set() after self.response_complete = True. That one line increased QPS from 9000 to 9400 with my simple Quart app on a 32-core machine.
> From my understanding, it lets the pending asyncio.Event().wait() calls in the ongoing receives complete and their resources be reclaimed. Do you think that is correct?

Yes, it seems to be the case; it's always hard to figure out what the "order" should be in that interleaved spaghetti of coroutines...

@euri10 euri10 requested a review from a team November 17, 2020 08:13
@delong-coder

@tomchristie @florimondmanca Could you please help review this PR?
Thanks!

@florimondmanca florimondmanca left a comment

Only found one small nit. Otherwise, given the feedback that people have been running a similar fix in production for a while, LGTM. :-) Nice one!

@@ -146,7 +145,7 @@ def connection_lost(self, exc):
                 # Premature client disconnect
                 pass

-            self.message_event.set()
+            self.cycle.message_event.set()
@florimondmanca

Do we need…?

Suggested change:
-            self.cycle.message_event.set()
+            if self.cycle is not None:
+                self.cycle.message_event.set()

(I see we check this higher up in this method, and we do in the HttpTools implementation.)

@euri10

I remember that I added the if self.cycle is not None: check in the httptools implementation because one test failed without it. The fact that the tests pass without it in the h11 implementation is probably (don't quote me on that) because the failing test only runs with a default config, which by default picks the httptools implementation.

@euri10

I applied your suggestion to both the httptools and h11 implementations, @florimondmanca.

@euri10 euri10 merged commit de21361 into encode:master Nov 18, 2020
@euri10 euri10 deleted the 847_quart_race branch November 18, 2020 19:56
euri10 commented Nov 18, 2020

Thanks for the review, @florimondmanca!
Kudos to @itayperl for the highly accurate description of the bug and the fix 👍
Thanks to @delong-coder too for the wrk tests; I wish I could run wrk on a 32-core machine!

@DavidLangworthy

Thanks all!
