Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Explicitly store last prepare steps for replay. #1253

Merged
merged 4 commits into from
Apr 17, 2023
Merged

Conversation

branlwyd
Copy link
Member

I did this because it turned out to be (basically) necessary for both the ping-pong implementation, as well as the race-condition fixes.

This also fixes some buggy behavior in the corners of replay. For example, if a round which ended up dropping a report was replayed, the Helper would respond differently on replay: no prepare step at all the first time, a failed prepare step with error ReportDropped on replay. [There may be more; I didn't do a close reading of every code path.]

I did this because it turned out to be (basically) necessary for both
the ping-pong implementation, as well as the race-condition fixes.

This also fixes some buggy behavior in the corners of replay. For
example, if a round which ended up dropping a report was replayed, the
Helper would respond differently on replay: no prepare step at all the
first time, a failed prepare step with error ReportDropped on replay.
[There may be more; I didn't do a close reading of every code path.]
@branlwyd branlwyd requested a review from a team as a code owner April 17, 2023 04:27
Copy link
Contributor

@tgeoghegan tgeoghegan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the only blocker is creating a database migration for the last_prep_step column, and whatever the test failure on stable is.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should this change be a schema migration? Or perhaps the relevant question is: do we want to deploy this change into staging-dap-04? If so, then I think we should do a migration script because otherwise we'd have to re-deploy that environment's database from scratch.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure -- I think we could technically get away with just flattening/redeploying the DB since we have not yet actually used it. But it's good to test our migration implementation, too, and I suppose since we've released we should follow the DB migration strategy.

But I couldn't find any documentation on how to write a migration. Presumably the goal is to create a pair of up/down files in the proper lexicographic order with the other migration files; but do we use sqlx to create these files somehow, or create them manually? is "filename lexicographic order" the correct way to determine order of migrations? how do we determine which DB versions work with which Janus versions? IMO, we should write a few sentences documenting this when time permits.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I created a migration for this. One more thing it would be good to define: what determines the expected ordering of the schema update with the Janus software update (or any other relevant deployment operations)?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Discussion of creating new migrations is here: https://github.com/divviup/janus/blob/main/docs/DEPLOYING.md#database

And discussion of how to do this operationally (i.e. sequencing of code deploy vs. schema migration) is being drafted in https://github.com/divviup/janus-ops/pull/663

I'm going to implement Janus checking whether it supports the current schema version in #1241

@@ -4453,12 +4470,26 @@ pub mod models {
self.ord
}

/// Returns the last preparation step returned by the Helper, if any.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is "by the Helper" always accurate? Wouldn't a helper implementation use this to store the leader's step?

Copy link
Member Author

@branlwyd branlwyd Apr 17, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The Helper always uses this field to store the last preparation step it returned to the Leader, to allow replay to occur. (during processing of an aggregation init/continue message, the Helper updates its in-memory "last" prep step to be what it will eventually return, then builds its response message based on the "last" prep steps -- the intent is to ensure as strongly as possible that what is returned on the initial response will match exactly what is returned on a replay. this implementation strategy also makes it somewhat easier to write code that repeatedly processes each report aggregation while allowing each processing step to update the returned preparation step) The Helper doesn't need to store the Leader's last step.

(and, slightly off-topic from the question, the Leader always stores None in this field.)

aggregator/src/aggregator/aggregation_job_driver.rs Outdated Show resolved Hide resolved
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants