Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Don't allow two timeline_delete operations to run concurrently. #4313

Merged
merged 3 commits into from
May 27, 2023

Conversation

hlinnaka
Copy link
Contributor

If the timeline is already being deleted, return an error. We used to notice the duplicate request and error out in
persist_index_part_with_deleted_flag(), but it's better to detect it earlier. Add an explicit flag for the deletion.

Note: This doesn't do anything about the async cancellation problem (github issue #3478): if the original HTTP request dropped, because the client disconnected, the timeline deletion stops half-way through the operation. That needs to be fixed, too, but that's a separate story.

(This is a simpler replacement for PR #4194. I'm also working on the cancellation shielding, I'll open separate PR for that.)

@hlinnaka hlinnaka requested review from a team as code owners May 23, 2023 14:35
@hlinnaka hlinnaka requested review from knizhnik, skyzh, problame, koivunej and a team and removed request for a team, knizhnik and skyzh May 23, 2023 14:35
@github-actions
Copy link

github-actions bot commented May 23, 2023

1004 tests run: 962 passed, 0 failed, 42 skipped (full report)


Flaky tests (1)

Postgres 15

The comment gets automatically updated with the latest test results
f54eb39 at 2023-05-27T10:01:43.370Z :recycle:

Copy link
Member

@koivunej koivunej left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is fine, because the tests asserted the response status code but I'll let Christian approve.

@hlinnaka hlinnaka force-pushed the simple-timeline_delete-coalesce branch 3 times, most recently from d664114 to 900f1af Compare May 26, 2023 09:23
Copy link
Contributor

@problame problame left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some comments in delete_timeline are now stale, please go through them and adjust

Example

        // NB: If you call delete_timeline multiple times concurrently, they will
        // all go through the motions here. Make sure the code here is idempotent,
        // and don't error out if some of the shutdown tasks have already been
        // completed!

Other than that, I don't see why we need a mutex instead of an AtomicBool for this.

But I'm Ok with using the tokio mutex.

We should really fix the #3478 issue, though.

Giving approval because I'm on vacation next week, but please fix those comments.

@hlinnaka hlinnaka force-pushed the simple-timeline_delete-coalesce branch from 900f1af to e4a7624 Compare May 26, 2023 22:51
@hlinnaka
Copy link
Contributor Author

Some comments in delete_timeline are now stale, please go through them and adjust

Example

        // NB: If you call delete_timeline multiple times concurrently, they will
        // all go through the motions here. Make sure the code here is idempotent,
        // and don't error out if some of the shutdown tasks have already been
        // completed!

Updated that and a few other comments. Also, this race cannot happen any more:

-        let removed_timeline = timelines.remove(&timeline_id);
-        if removed_timeline.is_none() {
-            // This can legitimately happen if there's a concurrent call to this function.
-            //   T1                                             T2
-            //   lock
-            //   unlock
-            //                                                  lock
-            //                                                  unlock
-            //                                                  remove files
-            //                                                  lock
-            //                                                  remove from map
-            //                                                  unlock
-            //                                                  return
-            //   remove files
-            //   lock
-            //   remove from map observes empty map
-            //   unlock
-            //   return
-            debug!("concurrent call to this function won the race");

so I turned that into a panic instead.

Other than that, I don't see why we need a mutex instead of an AtomicBool for this.

But I'm Ok with using the tokio mutex.

The Mutex allows you to detect if another task is currently performing the deletion. With an AtomicBool, the boolean would need to indicate "is deletion in progress", and you'd need to be careful to set it back to false if the deletion fails half-way through. I think you'd need another bool for "did it already finish", or make it an atomic with three states:

  1. deletion not in progress
  2. deletion in progress
  3. deletion completed

Currently, this uses try_lock to check if another deletion is in progress, but originally I actually actually wanted to wait for it to finish, like #4194 did. That would require pulling in the test changes from that PR, however, and because we still don't handle async cancellations gracefully, those tests would need some further tweaking to make them pass. After we fix #3478, we might want to do that, but for now I wanted to keep the behaviour and tests unchanged.

We should really fix the #3478 issue, though.

Yes, I'm doing that in #4314.

@hlinnaka hlinnaka force-pushed the simple-timeline_delete-coalesce branch from e4a7624 to f584db3 Compare May 26, 2023 22:56
If the timeline is already being deleted, return an error. We used to
notice the duplicate request and error out in
persist_index_part_with_deleted_flag(), but it's better to detect it
earlier. Add an explicit flag for the deletion.

Note: This doesn't do anything about the async cancellation problem
(github issue #3478): if the original HTTP request dropped, because
the client disconnected, the timeline deletion stops half-way through
the operation. That needs to be fixed, too, but that's a separate
story.
Also, we now really should find the timeline we're deleting in
'timelines' map at the end of deletion, so turn that into a panic.
@hlinnaka hlinnaka force-pushed the simple-timeline_delete-coalesce branch from f584db3 to f54eb39 Compare May 27, 2023 08:49
@hlinnaka hlinnaka merged commit 2d6a022 into main May 27, 2023
@hlinnaka hlinnaka deleted the simple-timeline_delete-coalesce branch May 27, 2023 12:55
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants