Introduce requeue flag #286
Conversation
Codecov Report

Patch coverage:

Additional details and impacted files:

```
@@            Coverage Diff             @@
##             main     #286      +/-   ##
==========================================
- Coverage   91.74%   91.42%   -0.32%
==========================================
  Files          64       64
  Lines        4626     4644      +18
==========================================
+ Hits         4244     4246       +2
- Misses        382      398      +16
```

☔ View full report in Codecov by Sentry.
This is amazing work. Sorry I only have the ability to respond in emoji reactions right now.

I think this is ready for review, we should just merge the other branch before and then rebase 👍🏻

I just read through your great description of the new logic, it all makes sense, very elegant. Just to check I get it: the requeue flag is flipped to

Gonna review now 👍
This looks all good to me, thanks for tracking down this tricky bug.
I never managed to reproduce this particular race condition on my (slow??) laptop, I'm assuming it fixed what you did manage to observe.
Regarding the still failing test, I'd say we merge and look at that in a new PR.
I'll go look at the other branch now and get that merged first.
Force-pushed from 75bb1f4 to 3573aab
Ah, I actually had a change request. I think it would be good to add debug logging for when an operation is requeued. I can add this, let me know if you have any logging-related thoughts @adzialocha.
I think you got it, just maybe clarifying that "concurrent work on the same document will occur" is not true. There is no concurrent work allowed on the same document, otherwise we tap into undefined behaviour because of the database reads and writes on that document. We also never queued up a new task for every operation. We queue up a task for every document, one could say; the operation is the trigger for that. If a second operation for the same document comes in while we already have a task, a "requeue" is flagged.
You mean the
Sure, please go ahead!

Thank you for rebasing!

Sorry, that was a typo on my part, should have read "no concurrent work ...."
No, I mean this queue bug: I couldn't reproduce it locally, it was only observable on your fast computer.
* main:
  Update breaking API calls for new `p2panda-rs` version (#293)
  Update Cargo.lock as well
  Use released p2panda-rs version 0.7.0
  Migrate CLI from `structopt` to `clap` (#289)
  Increase timeout for failing materializer test
  Introduce requeue flag (#286)
  Do transactions correctly (#285)
  Add libp2p service and configuration (#282)
Introduces a `requeue` boolean flag in the input index of our `Factory` implementation (task worker queue) which allows tasks to get re-scheduled after completion. This fixes a nasty bug which can only be reproduced on very fast machines: #281

This branches off Sam's work on #285 which surely made the bug easier to track down.
Problem
Let's say we have operations arriving in the following order at the node: `C1 [D1]`, `U2 [D1]`, `C1 [D2]`, `U3 [D1]` and `U2 [D2]`. That's a CREATE, an UPDATE and another UPDATE operation for a "Document 1", and another CREATE and UPDATE operation for a "Document 2". They arrive at the same time at the GraphQL API and get put in this order on the service bus where they get broadcasted.

The current factory implementation would now look at each of these operations arriving through the service bus and queue them up IF we don't handle that document yet. In the "dispatcher" this would look like that:
- `C1 [D1]` arrives. Do we already work on `D1`? No, let's queue it!
- `U2 [D1]` arrives. Do we already work on `D1`? Yes. Ignore.
- `C1 [D2]` arrives. Do we already work on `D2`? No, let's queue it!
- `U3 [D1]` arrives. Do we already work on `D1`? Yes. Ignore.
- `U2 [D2]` arrives. Do we already work on `D2`? Yes. Ignore.

... concurrently we're already running a whole worker pool eagerly waiting to look at the queue, taking the tasks off it. So depending on how the "dispatcher" races with the "workers" the outcome might always look a little bit different. For example, if we finished working on `C1 [D1]` before the dispatcher looks at `U3 [D1]`, it might actually get queued as well.

That design of "looking at duplicates" had one purpose: avoid working on the same document twice, aka "we're already working on it, why do it a second time". This is a nice optimization, but it also assumes that we have all possible operations in the database already in the moment before `C1 [D1]` or a later D1-related task kicks in.
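To make the shape of that check concrete, here is a minimal sketch of such a dispatcher (hypothetical names, not the actual `Factory` code in `aquadoggo`):

```rust
use std::collections::HashSet;

/// Sketch of the old behaviour: a set of document ids acts as the input
/// index, any operation for a document already in the set is dropped.
struct NaiveDispatcher {
    busy: HashSet<String>, // documents with a task queued or running
    queue: Vec<String>,    // tasks waiting for a free worker, by document id
}

impl NaiveDispatcher {
    fn on_operation(&mut self, document_id: &str) {
        // "Do we already work on this document?"
        if self.busy.contains(document_id) {
            // Yes: ignore. This is where operations get lost if they arrive
            // while a worker is still busy with the same document.
            return;
        }

        // No: remember the document and queue a task for it.
        self.busy.insert(document_id.to_owned());
        self.queue.push(document_id.to_owned());
    }

    fn on_task_completed(&mut self, document_id: &str) {
        // Work is done, forget about the document again.
        self.busy.remove(document_id);
    }
}
```

The check is cheap and avoids duplicate work, but it silently assumes that nothing new arrives between the worker's database read and its completion, which is exactly what goes wrong below.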
Funnily, this worked quite well so far, we never noticed a problem. Most of the time the requests were slow enough for the factory to queue every operation and ignore nothing:

- `C1 [D1]` arrives. Do we already work on `D1`? No, let's queue it! ... (the worker happily crunches that operation in the background, seeing C1 in the database)
- ... some short fraction of time later ..
- `U2 [D1]` arrives. Do we already work on `D1`? No, let's queue it! ... (the worker happily crunches that operation in the background, seeing U2 in the database)
- ... and so on ..
Since our requests get faster now (due to client-side caching in `shirokuma`) we observed the following problem:

- `C1 [D1]` arrives. Do we already work on `D1`? No, let's queue it! ... (the worker happily crunches that operation in the background, seeing C1 in the database)
- ... some short fraction of time later ..
- `U2 [D1]` arrives! Woah. FAST! Surprise!! Do we already work on `D1`? Yes. Ignore.
- ... and so on ..
The new operations `U2` etc. arrived in the database after the worker had already read from it but before it finished working on `C1`! Thus, we lost `U2` ..
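Laid out as a timeline, the race looks roughly like this (a sketch, not an actual log from the node):

```
Dispatcher                                  Worker (task for D1)
----------                                  --------------------
C1 [D1] arrives -> queue task for D1
                                            reads operations for D1 from db: {C1}
U2 [D1] arrives -> "already working
                    on D1", ignored
                                            materializes D1 from {C1}, finishes
No task is ever queued again for D1, so U2 is never materialized.
```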
Solution
After some analysis I realized that the system still makes sense in its basic assumption: We don't need to work on documents more than we have to! `aquadoggo` is an event-sourcing system where "events" kick in work, independent of the amount of data arriving. Here is an example:
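For a single document such a stream could look roughly like this, with made-up operations and timing for illustration:

```
Incoming operations over time:   C1  U2  U3      U4  U5        U6  ...
Work units picked up by tasks:   [ C1 U2 U3 ]    [ U4 U5 ]     [ U6 ] ...
```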
Note how the brackets `[...]` "group" the operations in work units. They simply get loaded from the database as they already were there before the work started.

There is an ongoing stream of incoming UPDATE operations and every time we kick in the worker for that document it takes the "fresh" ones and materializes them. We don't need to have a worker for each operation, we just have to make sure that we consider them all exactly once - and this is where the previous system failed, as operations just got lost ..
In this PR a "re-queue" flag got introduced which restarts a task for a document D when operations have been observed which came in while the worker was already running on D. It works like this:

- `C1 [D1]` arrives. Do we already work on `D1`? No, let's queue it!
- `U2 [D1]` arrives. Do we already work on `D1`? Yes. Set requeue flag for `D1` to `true`.
- `C1 [D2]` arrives. Do we already work on `D2`? No, let's queue it!
- `U3 [D1]` arrives. Do we already work on `D1`? Yes, and the requeue flag is set. Ignore.
- `U2 [D2]` arrives. Do we already work on `D2`? Yes. Set requeue flag for `D2` to `true`.
This scenario assumes that the worker for `D1` and the other worker for `D2` are still running while all the other operations get dispatched; that's the "worst-case scenario" in terms of maximum pressure on the dispatcher which made the previous implementation fail.

What this does now is quite nice: As soon as the work on `C1` finished and the requeue flag was detected, it will just add a new task to the queue to continue working on `D1`, making sure to also account for all these operations which arrived a little bit later (`U2` and `U3`).
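A rough sketch of how such a requeue flag can be wired up (again with hypothetical names, not the actual `Factory` implementation):

```rust
use std::collections::HashMap;

/// Sketch of the requeue idea: the input index stores a `requeue` flag per
/// document instead of a plain "is busy" marker.
struct RequeueDispatcher {
    busy: HashMap<String, bool>, // document id -> requeue flag
    queue: Vec<String>,          // tasks waiting for a free worker, by document id
}

impl RequeueDispatcher {
    fn on_operation(&mut self, document_id: &str) {
        if self.busy.contains_key(document_id) {
            // Already working on this document: remember to schedule one
            // more run after the current task has finished.
            self.busy.insert(document_id.to_owned(), true);
        } else {
            // New document: queue a task, the requeue flag starts as `false`.
            self.busy.insert(document_id.to_owned(), false);
            self.queue.push(document_id.to_owned());
        }
    }

    fn on_task_completed(&mut self, document_id: &str) {
        // Did new operations for this document arrive while the worker ran?
        if self.busy.get(document_id).copied().unwrap_or(false) {
            // Yes: reset the flag and queue the document again so the next
            // run picks up everything that is in the database by now.
            self.busy.insert(document_id.to_owned(), false);
            self.queue.push(document_id.to_owned());
        } else {
            // No: we are done with this document for now.
            self.busy.remove(document_id);
        }
    }
}
```

Every operation is still only a trigger: the worker always loads whatever is in the database for that document, so a single requeued run is enough to cover any number of operations that arrived in the meantime.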
What we get from this is:
📋 Checklist
- `CHANGELOG.md`