[Bifrost] RepairTail task for replicated loglet #2046
Conversation
Great work @AhmedSoliman 🚀 The changes look really good and the logic seems sound to me. +1 for merging :-)
// We run stores as tasks because we'll wait only for the necessary write-quorum but the
// rest of the stores can continue in the background as best-effort replication (if the
// spread selector strategy picked extra nodes)
Unrelated: Can it become a problem if we accumulate network send tasks that are awaiting a response which will never come because the other node has died? I don't expect this to happen often, but over time it could result in a memory leak.
The task will give up once it exhausts our RPC retry policy (which is finite by default), but the risk exists if someone changes this to infinite retries.
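To make the pattern under discussion concrete, here is a minimal sketch using a tokio `JoinSet` and hypothetical types (not Bifrost's actual API): spawn one store task per node in the spread, await acknowledgements only until write-quorum is reached, and detach the remaining tasks so they continue as best-effort replication. The comment about finite retries is reflected in the `store_on_node` stub.

```rust
use tokio::task::JoinSet;

#[derive(Clone, Copy, Debug)]
struct NodeId(u32);

// Hypothetical store RPC. In the real code this carries a retry policy; with
// infinite retries a task talking to a dead node would never terminate, which
// is the leak scenario discussed above.
async fn store_on_node(node: NodeId) -> Result<NodeId, NodeId> {
    Ok(node)
}

async fn replicate(spread: Vec<NodeId>, write_quorum: usize) -> bool {
    let mut tasks = JoinSet::new();
    for node in spread {
        tasks.spawn(store_on_node(node));
    }
    let mut acks = 0;
    while let Some(res) = tasks.join_next().await {
        if matches!(res, Ok(Ok(_))) {
            acks += 1;
            if acks >= write_quorum {
                // Write-quorum reached: detach the remaining store tasks so
                // they finish in the background as best-effort replication.
                tasks.detach_all();
                return true;
            }
        }
    }
    false // all tasks finished without reaching write-quorum
}

#[tokio::main]
async fn main() {
    let ok = replicate(vec![NodeId(1), NodeId(2), NodeId(3)], 2).await;
    assert!(ok);
}
```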
/// 1. Log-servers persisting the last known_global_tail periodically/async and using this value as known_global_tail on startup.
/// 2. Sequencer-driven seal. If the sequencer is alive, it can send a special value with the seal message to
///    indicate what is the ultimate known-global-tail that nodes should repair to instead of relying on the observed max-tail.
/// 3. Limit `from_offset` to repair from to max(min(local_tails), max(known_global_tails), known_archived, trim_point)
Why could we take the min of local tails? Wouldn't this run the risk of losing previously committed records?
This assumes that those tails come from an f-majority of nodes, i.e. that we have responses from an f-majority of the log-servers.
Maybe you can explain this optimization to me once we have a bit more time after the demo. I can't wrap my head around it yet. The only thing I can think of is that we can skip those entries for which we can reliably say that, even with the missing nodes, there can't be a write-quorum.
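For illustration, a sketch of how the bound in option 3 above could be computed (hypothetical helper and types, not the actual implementation). The safety rationale under discussion is that `local_tails` come from an f-majority of log-servers; the repair start is the maximum of several watermarks that are each already safe on their own.

```rust
type LogletOffset = u64;

/// Hypothetical helper illustrating option 3 above: bound the repair range
/// from below by the max of several already-safe watermarks.
fn repair_from_offset(
    local_tails: &[LogletOffset],        // tails reported by an f-majority of log-servers
    known_global_tails: &[LogletOffset], // known_global_tail values seen in responses
    known_archived: LogletOffset,
    trim_point: LogletOffset,
) -> LogletOffset {
    let min_local = local_tails.iter().copied().min().unwrap_or(0);
    let max_known_global = known_global_tails.iter().copied().max().unwrap_or(0);
    min_local
        .max(max_known_global)
        .max(known_archived)
        .max(trim_point)
}
```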
This puts together the design and implementation of the tail repair procedure that's required when FindTail cannot establish a consistent durable tail from log-servers. The details are described as comments in code.
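For readers outside the PR, a high-level sketch of the shape of such a procedure (hypothetical names and result type; the actual design is documented in the comments in tasks/repair_tail.rs and tasks/digests.rs):

```rust
type LogletOffset = u64;

/// Hypothetical result type, for illustration only.
enum RepairTailResult {
    Completed { safe_tail: LogletOffset },
    DigestFailed,      // couldn't collect digests from an f-majority
    ReplicationFailed, // couldn't restore write-quorum for some record
}

async fn repair_tail(from: LogletOffset, to: LogletOffset) -> RepairTailResult {
    // 1. Ask log-servers which offsets in [from, to) they hold (a digest);
    //    without responses from an f-majority we cannot reason about the range.
    // 2. For every offset that exists on some node but lacks write-quorum,
    //    read a copy and re-store it until write-quorum is reached.
    // 3. Once the whole range is write-quorum replicated, `to` is a durable
    //    tail that FindTail can report consistently.
    let _ = from;
    RepairTailResult::Completed { safe_tail: to }
}
```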
Stack created with Sapling. Best reviewed with ReviewStack.