Introduce requeue flag #286
Conversation
Codecov Report

Patch coverage:

Additional details and impacted files:

```
@@            Coverage Diff             @@
##             main     #286      +/-   ##
==========================================
- Coverage   91.74%   91.42%   -0.32%
==========================================
  Files          64       64
  Lines        4626     4644      +18
==========================================
+ Hits         4244     4246       +2
- Misses        382      398      +16
```

☔ View full report in Codecov by Sentry.
This is amazing work. Sorry I only have the ability to respond in emoji reactions right now.

I think this is ready for review, we should just merge the other branch before and then rebase 👍🏻

I just read through your great description of the new logic, it all makes sense, very elegant. Just to check I get it: the requeue flag is flipped to

Gonna review now 👍
This looks all good to me, thanks for tracking down this tricky bug.
I never managed to reproduce this particular race condition on my (slow??) laptop, I'm assuming it fixed what you did manage to observe.
Regarding the still failing test, I'd say we merge and look at that in a new PR.
I'll go look at the other branch now and get that merged first.
Force-pushed from 75bb1f4 to 3573aab
Ah, I actually had a change request. I think it would be good to add debug logging for when an operation is requeued. I can add this, let me know if you have any logging-related thoughts @adzialocha.
I think you got it, just maybe clarifying that "concurrent work on the same document will occur" is not true. There is no concurrent work allowed on the same document, otherwise we tap into undefined behaviour because of the database reads and writes on that document. We also never queued up a new task for every operation. We queue up a task for every document, one could say; the operation is the trigger for that. If a second operation for the same document comes in while we already have a task, a "requeue" is flagged.
You mean the
Sure, please go ahead!

Thank you for rebasing!

Sorry, that was a typo on my part, should have read "no concurrent work ...."
No, I mean this queue bug: I couldn't reproduce it locally, it was only observable on your fast computer.
* main:
  Update breaking API calls for new `p2panda-rs` version (#293)
  Update Cargo.lock as well
  Use released p2panda-rs version 0.7.0
  Migrate CLI from `structopt` to `clap` (#289)
  Increase timeout for failing materializer test
  Introduce requeue flag (#286)
  Do transactions correctly (#285)
  Add libp2p service and configuration (#282)
Introduces a `requeue` boolean flag in the input index of our `Factory` implementation (task worker queue) which allows tasks to get re-scheduled after completion. This fixes a nasty bug which can only be reproduced on very fast machines: #281

This branches off Sam's work on #285 which surely made the bug easier to track down.
Problem
Let's say we have operations arriving in the following order at the node: `C1 [D1]`, `U2 [D1]`, `C1 [D2]`, `U3 [D1]` and `U2 [D2]`. That's a CREATE, an UPDATE and another UPDATE operation for a "Document 1", and another CREATE and UPDATE operation for a "Document 2". They arrive at the same time at the GraphQL API and get put in this order on the service bus where they get broadcasted.

The current factory implementation would now look at each of these operations arriving through the service bus and queue them up IF we don't handle that document yet. In the "dispatcher" this would look like that:
- `C1 [D1]` arrives. Do we already work on `D1`? No, let's queue it!
- `U2 [D1]` arrives. Do we already work on `D1`? Yes. Ignore.
- `C1 [D2]` arrives. Do we already work on `D2`? No, let's queue it!
- `U3 [D1]` arrives. Do we already work on `D1`? Yes. Ignore.
- `U2 [D2]` arrives. Do we already work on `D2`? Yes. Ignore.

... concurrently we're already running a whole worker pool eagerly waiting to look at the queue, taking the tasks off it. So depending on how the "dispatcher" races with the "workers" the outcome might always look a little bit different. For example, if we finished working on `C1 [D1]` before the dispatcher looks at `U3 [D1]`, it might actually get queued as well.

That design of "looking at duplicates" had one purpose: avoid working on the same document twice, aka "we're already working on it, why do it a second time". This is a nice optimization, but it also assumes that we have all possible operations in the database already in the moment before `C1 [D1]` or a later D1-related task kicks in.
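To make the shape of that check concrete, here is a minimal sketch of such a dispatcher (hypothetical names, not the actual `Factory` code in `aquadoggo`):

```rust
use std::collections::HashSet;

/// Sketch of the old behaviour: a set of document ids acts as the input
/// index, any operation for a document already in the set is dropped.
struct NaiveDispatcher {
    busy: HashSet<String>, // documents with a task queued or running
    queue: Vec<String>,    // tasks waiting for a free worker, by document id
}

impl NaiveDispatcher {
    fn on_operation(&mut self, document_id: &str) {
        // "Do we already work on this document?"
        if self.busy.contains(document_id) {
            // Yes: ignore. This is where operations get lost if they arrive
            // while a worker is still busy with the same document.
            return;
        }

        // No: remember the document and queue a task for it.
        self.busy.insert(document_id.to_owned());
        self.queue.push(document_id.to_owned());
    }

    fn on_task_completed(&mut self, document_id: &str) {
        // Work is done, forget about the document again.
        self.busy.remove(document_id);
    }
}
```

The check is cheap and avoids duplicate work, but it silently assumes that nothing new arrives between the worker's database read and its completion, which is exactly what goes wrong below.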
Funnily, this worked quite well so far, we never noticed a problem. Most of the time the requests were slow enough for the factory to queue every operation and ignore nothing:

- `C1 [D1]` arrives. Do we already work on `D1`? No, let's queue it! ... (the worker happily crunches that operation in the background, seeing C1 in the database)
- ... some short fraction of time later ..
- `U2 [D1]` arrives. Do we already work on `D1`? No, let's queue it! ... (the worker happily crunches that operation in the background, seeing U2 in the database)
- ... and so on ..
Since our requests get faster now (due to client-side caching in `shirokuma`) we observed the following problem:

- `C1 [D1]` arrives. Do we already work on `D1`? No, let's queue it! ... (the worker happily crunches that operation in the background, seeing C1 in the database)
- ... some short fraction of time later ..
- `U2 [D1]` arrives! Woah. FAST! Surprise!! Do we already work on `D1`? Yes. Ignore.
- ... and so on ..
The new operations `U2` etc. arrived in the database after the worker had already read from it but before it finished working on `C1`! Thus, we lost `U2` ..
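Laid out as a timeline, the race looks roughly like this (a sketch, not an actual log from the node):

```
Dispatcher                                  Worker (task for D1)
----------                                  --------------------
C1 [D1] arrives -> queue task for D1
                                            reads operations for D1 from db: {C1}
U2 [D1] arrives -> "already working
                    on D1", ignored
                                            materializes D1 from {C1}, finishes
No task is ever queued again for D1, so U2 is never materialized.
```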
Solution
After some analysis I realized that the system still makes sense in its basic assumption: We don't need to work on documents more than we have to! `aquadoggo` is an event-sourcing system where "events" kick in work, independent of the amount of data arriving. Here is an example:
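For a single document such a stream could look roughly like this, with made-up operations and timing for illustration:

```
Incoming operations over time:   C1  U2  U3      U4  U5        U6  ...
Work units picked up by tasks:   [ C1 U2 U3 ]    [ U4 U5 ]     [ U6 ] ...
```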
Note how the brackets `[...]` "group" the operations in work units. They simply get loaded from the database as they already were there before the work started.

There is an ongoing stream of incoming UPDATE operations and every time we kick in the worker for that document it takes the "fresh" ones and materializes them. We don't need to have a worker for each operation, we just have to make sure that we consider them all exactly once - and this is where the previous system failed, as operations just got lost ..
In this PR a "re-queue" flag got introduced which restarts a task for a document D when operations have been observed which came in while the worker was already running on D. It works like this:

- `C1 [D1]` arrives. Do we already work on `D1`? No, let's queue it!
- `U2 [D1]` arrives. Do we already work on `D1`? Yes. Set requeue flag for `D1` to `true`.
- `C1 [D2]` arrives. Do we already work on `D2`? No, let's queue it!
- `U3 [D1]` arrives. Do we already work on `D1`? Yes, and the requeue flag is set. Ignore.
- `U2 [D2]` arrives. Do we already work on `D2`? Yes. Set requeue flag for `D2` to `true`.
This scenario assumes that the worker for `D1` and the other worker for `D2` are still running while all the other operations get dispatched; that's the "worst-case scenario" in terms of maximum pressure on the dispatcher which made the previous implementation fail.

What this does now is quite nice: As soon as the work on `C1` finished and the requeue flag was detected, it will just add a new task to the queue to continue working on `D1`, making sure to also account for all these operations which arrived a little bit later (`U2` and `U3`).
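A rough sketch of how such a requeue flag can be wired up (again with hypothetical names, not the actual `Factory` implementation):

```rust
use std::collections::HashMap;

/// Sketch of the requeue idea: the input index stores a `requeue` flag per
/// document instead of a plain "is busy" marker.
struct RequeueDispatcher {
    busy: HashMap<String, bool>, // document id -> requeue flag
    queue: Vec<String>,          // tasks waiting for a free worker, by document id
}

impl RequeueDispatcher {
    fn on_operation(&mut self, document_id: &str) {
        if self.busy.contains_key(document_id) {
            // Already working on this document: remember to schedule one
            // more run after the current task has finished.
            self.busy.insert(document_id.to_owned(), true);
        } else {
            // New document: queue a task, the requeue flag starts as `false`.
            self.busy.insert(document_id.to_owned(), false);
            self.queue.push(document_id.to_owned());
        }
    }

    fn on_task_completed(&mut self, document_id: &str) {
        // Did new operations for this document arrive while the worker ran?
        if self.busy.get(document_id).copied().unwrap_or(false) {
            // Yes: reset the flag and queue the document again so the next
            // run picks up everything that is in the database by now.
            self.busy.insert(document_id.to_owned(), false);
            self.queue.push(document_id.to_owned());
        } else {
            // No: we are done with this document for now.
            self.busy.remove(document_id);
        }
    }
}
```

Every operation is still only a trigger: the worker always loads whatever is in the database for that document, so a single requeued run is enough to cover any number of operations that arrived in the meantime.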
What we get from this is:
📋 Checklist
- `CHANGELOG.md`