Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Phil/recoverylog pruning #347

Merged
merged 2 commits into from
Aug 23, 2023
Merged

Phil/recoverylog pruning #347

merged 2 commits into from
Aug 23, 2023

Conversation

psFried
Copy link
Contributor

@psFried psFried commented Aug 22, 2023

Fixes a couple of issues in gazctl subcommands regarding missing FSM hints for recovery logs.
See individual commit messages.

I tested these both as best I could manually, but there's still no automated tests for them AFAIK. Here's the scenarios I tested, and please feel free to suggest any others you can think of:

Recover:

  • Run shards recover on a few active Flow shards from production

Recover 2:

  • Setup a Flow capture shard and ingest some documents
  • Disable the shard
  • Run shards recover --dir a
  • Delete the primary hints from etcd
  • Run shards recover --dir b and diff to confirm it's the same as a (previously, this would have errored)
  • Delete the backup hints from etcd
  • Run shards recover --dir c and diff to confirm it's the same as a

Prune:

  • Ran a few --dry-run commands against Flow's combustible-cronut environment
  • Spot checked some of the 600+ journals that were skipped due to missing hints to confirm that they were indeed missing live nodes or hints
  • Spot checked some of the journals that had been pruned to ensure that their shards all had hints
  • Ran --dry-run -l estuary.dev/task-type=derivation and got:
    • bytesKept=4606810948793 bytesPruned=26377008674102 bytesTotal=30983819622895 fragmentsKept=17359 fragmentsPruned=102500 fragmentsTotal=119859 shardsTotal=114 skippedJournals=2
  • Ran --dry-run -l estuary.dev/task-type=materialization and got:
    • bytesKept=37440051630 bytesPruned=536165477654 bytesTotal=573605529284 fragmentsKept=3503 fragmentsPruned=9307 fragmentsTotal=12810 shardsTotal=291 skippedJournals=61
  • Ran --dry-run -l estuary.dev/task-type=capture and got:
    • bytesKept=1529386200947 bytesPruned=14368880429210 bytesTotal=15898266630157 fragmentsKept=7896 fragmentsPruned=77030 fragmentsTotal=84926 shardsTotal=512 skippedJournals=56

This change is Reviewable

Updates the `gazctl shards recover` subcommand to tolerate missing FSM Hints.
Previously, this command would fail if it could not find primary hints for the
given shard.  This was not necessary, since recovery is designed to work
without any hints.  This changes the `shards recover` subcommand to use the
same logic that's used in `recovery.go` for resolving a final set of hints,
which may be empty except for the journal name.  Now, you can run the command
on a shard with only backup hints, or no hints at all, and it will still
recover an identical state.
The `shards prune` subcommand was failing with an error if any selected shard
was missing backup hints. This updates the subcommand to instead skip pruning
any journals where a shard that uses it is missing hints.

Put another way, if any selected shard is missing hints for a given recovery
log journal, we do not prune that journal.
Copy link
Contributor

@jgraettinger jgraettinger left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM! thank you

@psFried psFried merged commit 3fe4c4a into master Aug 23, 2023
1 check passed
@psFried psFried deleted the phil/recoverylog-pruning branch August 23, 2023 13:25
psFried added a commit to estuary/flow that referenced this pull request Aug 23, 2023
This is to pull in a few fixes from gazette/core#347
psFried added a commit to estuary/flow that referenced this pull request Aug 23, 2023
This is to pull in a few fixes from gazette/core#347
psFried added a commit to estuary/flow that referenced this pull request Aug 23, 2023
This is to pull in a few fixes from gazette/core#347
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants