Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Shield HTTP request handlers from async cancellations. #4314

Merged
merged 8 commits into from
Jun 2, 2023

Conversation

hlinnaka
Copy link
Contributor

@hlinnaka hlinnaka commented May 23, 2023

We now spawn a new task for every HTTP request, and wait on the JoinHandle. If Hyper drops the Future, the spawned task will keep running. This protects the rest of the pageserver code from unexpected async cancellations.

This creates a CancellationToken for each request and passes it to the handler function. If the HTTP request is dropped by the client, the CancellationToken is signaled. None of the handler functions make use for the CancellationToken currently, but they now they could.

The CancellationToken arguments also work like documentation. When you're looking at a function signature and you see that it takes a CancellationToken as argument, it's a nice hint that the function might run for a long time, and won't be async cancelled. The default assumption in the pageserver is now that async functions are not cancellation-safe anyway, unless explictly marked as such, but this is a nice extra reminder.

Spawning a task for each request is OK from a performance point of view because spawning is very cheap in Tokio, and none of our HTTP requests are very performance critical anyway.

Fixes issue #3478

@github-actions
Copy link

github-actions bot commented May 23, 2023

1004 tests run: 964 passed, 0 failed, 40 skipped (full report)


Flaky tests (1)

Postgres 15

  • test_remote_storage_upload_queue_retries[local_fs]: ✅ debug
The comment gets automatically updated with the latest test results
b613988 at 2023-06-01T05:37:59.704Z :recycle:

@hlinnaka hlinnaka force-pushed the requestspan-refactor branch 2 times, most recently from 68d605b to 90431c8 Compare May 26, 2023 12:01
Base automatically changed from requestspan-refactor to main May 27, 2023 08:47
hlinnaka added a commit that referenced this pull request May 27, 2023
If the timeline is already being deleted, return an error. We used to
notice the duplicate request and error out in
persist_index_part_with_deleted_flag(), but it's better to detect it
earlier. Add an explicit lock for the deletion.

Note: This doesn't do anything about the async cancellation problem
(github issue #3478): if the original HTTP request dropped, because the
client disconnected, the timeline deletion stops half-way through the
operation. That needs to be fixed, too, but that's a separate story.

(This is a simpler replacement for PR #4194. I'm also working on the
cancellation shielding, see PR #4314.)
@hlinnaka
Copy link
Contributor Author

All the prerequisite PRs have now been merged, and I have rebased this over 'main'. This is now ready to be reviewed and merged.

@hlinnaka hlinnaka marked this pull request as ready for review May 27, 2023 13:04
@hlinnaka hlinnaka requested review from a team as code owners May 27, 2023 13:04
@hlinnaka hlinnaka requested review from lubennikovaav and removed request for a team May 27, 2023 13:04
@hlinnaka hlinnaka force-pushed the http-request-spawn branch 2 times, most recently from d13a3ad to c9b8eaf Compare May 29, 2023 19:28
@koivunej
Copy link
Member

Unsure what are the compilation errors, maybe in relation to features = "testing"?

@hlinnaka
Copy link
Contributor Author

Unsure what are the compilation errors, maybe in relation to features = "testing"?

Some of them at least, yes. I don't undertstand the test_delete_timeline_client_hangup failures though. It passes on my laptop. Perhaps those test runs were run before I put back the warning when a request is dropped. That would explain it. But I though I had pushed that and that the CI runs included that already.

Re-triggered.

We now spawn a new task for every HTTP request, and wait on the
JoinHandle. If Hyper drops the Future, the spawned task will keep
running. This protects the rest of the pageserver code from unexpected
async cancellations.

This creates a CancellationToken for each request and passes it to the
handler function. If the HTTP request is dropped by the client, the
CancellationToken is signaled. None of the handler functions make use
for the CancellationToken currently, but they now they could.

The CancellationToken arguments also work like documentation. When
you're looking at a function signature and you see that it takes a
CancellationToken as argument, it's a nice hint that the function
might run for a long time, and won't be async cancelled. The default
assumption in the pageserver is now that async functions are not
cancellation-safe anyway, unless explictly marked as such, but this is
a nice extra reminder.

Spawning a task for each request is OK from a performance point of
view because spawning is very cheap in Tokio, and none of our HTTP
requests are very performance critical anyway.

Fixes issue #3478
Per discussion, this is intentionally enabled even when compiled without
the testing feature.
Because the handler will now continue running, that's not a series issue
anymore. But a line in the log might be still useful.
With the refactoring of testing_api_handler, the 'testing' handlers
are compiled even if 'testing' feature is not enabled. They are
unreachable, but the compiler doesn't realize that.
For the cancellation guard to work, we need to do the tokio::spawn()
inside the request_span(), not the other way round.

Fix test case, now that the first timeline_delete() continues to run
in the background, even if the HTTP connection is closed.
@hlinnaka hlinnaka merged commit 9787227 into main Jun 2, 2023
@hlinnaka hlinnaka deleted the http-request-spawn branch June 2, 2023 12:28
@hlinnaka
Copy link
Contributor Author

hlinnaka commented Jun 2, 2023

Thanks for the review!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants