-
Notifications
You must be signed in to change notification settings - Fork 1.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Pass queued requests to thread pool on server shutdown #2122
Conversation
b379b8a
to
17af6cf
Compare
@@ -293,8 +293,6 @@ def run(background=true) | |||
return run_lopez_mode(background) | |||
end | |||
|
|||
queue_requests = @queue_requests |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This removes a change from cc4ad17, since we need to change behavior based on the updated value of the @queue_requests
variable after the reactor is shutdown. I assume this was a performance optimization, but I don't know if it actually makes any measurable difference.
lib/puma/client.rb
Outdated
# middle of transmitting a response, unless a network or client failure | ||
# is suspected. | ||
def idle? | ||
@requests_served > 0 && !in_data_phase |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why would we not be idle if @requests_served
is 0?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I interpreted the RFC statement "Servers SHOULD always respond to at least one request per connection, if at all possible" to mean that the server shouldn't consider unused (@requests_served == 0
) connections to be 'idle', so shouldn't force-close them during a graceful shutdown. This makes sense to me- if a client opens a connection it usually makes at least one request pretty soon after.
Maybe #idle?
isn't the clearest term for this method? I used it here because it the same RFC section uses it to refer to a connection 'the server has decided to close':
A client, server, or proxy MAY close the transport connection at any
time. For example, a client might have started to send a new request
at the same time that the server has decided to close the "idle"
connection. From the server's point of view, the connection is being
closed while it was idle, but from the client's point of view, a
request is in progress.
I also considered #can_close?
, #should_close?
or #closeable?
if any of those seem more fitting.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Riiight, so if we haven't served any requests, we can just shut down. Got it.
Any of the close
methods you proposed I think would be more fitting.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think @requests_served
is a good variable name here. It is (from what I understand) a count of the number of requests that have been received from the client, and not the number of requests served from the server. I think a better name might be @client_request_count
.
Given that tidbit, I don't think the logic here is entirely correct. I do want to take a step back from the specific implementation though and make sure we're on the same page at a high level.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Given that this on client.rb, I'm OK with the variable name.
@@ -422,11 +420,12 @@ def handle_servers | |||
|
|||
@events.fire :state, @status | |||
|
|||
graceful_shutdown if @status == :stop || @status == :restart | |||
if queue_requests |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I presume this needs to be @queue_requests given your change. But I'd also rather you not change this to an ivar read and leave it as a local var.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There were actually two separate queue_requests
local vars- one within Server#run
, and one within Server#handle_servers
.
The local var in #run
had to be removed in order for this PR to work at all- see my comment at #2122 (review). However, this one in #handle_servers
actually doesn't need to be changed as well, so I left it as-is to minimize the diff.
That said, since this local var in #handle_servers
is only ever read on server shutdown and probably has no performance impact, I would lean towards removing it for consistency/simplicity if that's your preference.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This explanation works for me.
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks a ton for your PR. It was really hard to review, mostly due to the complexity of the subject matter as well as complexity of Puma internals. I can only imagine that it must have taken you much longer to write this PR.
I want to take a step back from the implementation here and talk about some high level questions and goals
Is this a regression?
My high level question is: Did this used to work? Is this a regression or is this something new we're trying to find/fix? It looks like this might be related to #2200 but...I don't see where something changed in the reactor. Did something else get changed somewhere? If so, can we git bisect and understand what broke the behavior?
If you go back to 4.0.0 and run the same test does it pass or fail?
Side note: I would love if this process of seeing when a given test would have worked for failed was easier to do somehow. Committing it into git means we can't
git bisect
Regardless of if this was broken or not, another question is: What do we want to happen? While digging in the code I found a few situations, and two of them have semi-undefined behavior:
Scenario A) Client 2 is fully buffered
- Client Request 1 comes in and is fully buffered
- Client Request 2 comes in and is partially buffered
- Server gets shutdown request
- Client Request 2 sends rest of its request before shutdown timeout
- Server generates and delivers full response of Request 1, sends response
- Desired: Reactor passes the request to the thread pool before shutting down for a response
I think we can all agree on this.
Scenario B) Client 2 CAN be (but is not yet) fully buffered
- Client Request 1 comes in and is fully buffered
- Client Request 2 comes in and is partially buffered
- Server gets shutdown request
- Client Request 2 sends rest of its request before shutdown timeout
- Server generates and delivers full response of Request 1, sends response
- Desired: Reactor keeps trying to buffer the request to see if it can pass it to the thread pool to be processed before shutting down?
Is this true? Do we want to do that? The downside is we have no way of knowing if the client will ever send the rest of the request in time. How do we detect the difference between scenario B to scenario C where we need to serve need a default response
Scenario C) Client 2 cannot fully buffered
- Client Request 1 comes in and is fully buffered
- Client Request 2 comes in and is partially buffered
- Server gets shutdown request
- Client Request 2 DOES NOT send rest of its request
- Server waits for full response of Request 1, sends response
- Desired: ???
If we pass a partial response to the threadpool for processing, there's no guarantee that it can do anything with it, it might be a half formed json body, that cannot be decoded. While RFC 2616 says we "should" send at least one response, what should we send? If we pass it to the thread pool in this state it's going to be a 500. I think that's not accurate. A 503 would perhaps be more appropriate.
Questions
- Is this a regression?
- If you pass a non-fully buffered client request to the thread pool what happens? Does it try to buffer it? cc/ @nateberkopec @evanphx
- What do we want to do when the client can be fully buffered? (scenario B)
- What do we want to do when the client cannot be fully buffered? (scenario C)
lib/puma/client.rb
Outdated
# middle of transmitting a response, unless a network or client failure | ||
# is suspected. | ||
def idle? | ||
@requests_served > 0 && !in_data_phase |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think @requests_served
is a good variable name here. It is (from what I understand) a count of the number of requests that have been received from the client, and not the number of requests served from the server. I think a better name might be @client_request_count
.
Given that tidbit, I don't think the logic here is entirely correct. I do want to take a step back from the specific implementation though and make sure we're on the same page at a high level.
e76799b
to
9a0e207
Compare
9a0e207
to
941262e
Compare
Cover various scenario combinations related to timeout settings, queue_requests configuration, and post/get request type.
941262e
to
d8b2680
Compare
As far as I can tell, this never used to work since the reactor implementation has existed. For example, on bb2aeb8 (first commit where the I only encountered this issue as an intermittent unit-test failure while working on #2123, and worked backwards from there to create a minimal reproducible failing test and fix. In any case it seems possible that #2200 is caused by this underlying issue, though I can't be sure.
I wasn't sure how Scenario 'A' differed from 'B' (identical except for the 'desired' part?), so I'll just address B and C.
To clarify the behavior across all combinations of scenarios, I added a few extra test cases to this PR. Let me know if these are clear enough and if the expected behavior is reasonable, or if there are any additional scenarios you think should be covered.
Yes, this is what happens when |
Update method documentation based on feedback.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the hard work. You arrived at a very understandable place, looks good!
👏 |
🎊 Is there a timeframe for the next release of puma? |
Basically the two PRs listed in the milestone have to get merged: https://github.com/puma/puma/milestone/7 |
Ah, major release is next. Okay, thanks. |
Description
Fixes a concurrency bug during server shutdown when
queue_requests
is enabled (the default), which can can be minimally reproduced by the following sequence:Expected: s2 receives a successful response
Actual: s2 receives a 500 response
Here is the test implementation, currently failing on
master
:puma/test/test_puma_server.rb
Lines 771 to 788 in 17af6cf
The underlying issue is that in a 'graceful' shutdown, the server stops accepting connections and the thread pool stops accepting new work while it waits up to
force_shutdown_after
seconds to finish processing currently active requests. However, a client request queued in the reactor will be rejected if it gets received after the thread pool stops accepting new work.To fix, this PR makes a small change to shut down the reactor before the thread pool, and to disable
queue_requests
and pass along any remaining non-'idle' client connections to the thread pool when the reactor shuts down. This allows graceful shutdown to behave properly without prematurely aborting any requests it shouldn't.The logic for determining 'idle' clients that can be closed immediately follows guidance from RFC 2616 section 8.1.4:
The added test depends on the small fix in #2121 to pass properly on Windows, so this branch currently includes that commit as well.
Your checklist for this pull request
[changelog skip]
the pull request title.[ci skip]
to the title of the PR.#issue
" to the PR description or my commit messages.