
refactor(robot-server): Store Pydantic objects as JSON instead of pickles, take 2 #14355

Merged · 4 commits · Jan 30, 2024

Conversation

@SyntaxColoring (Contributor) commented Jan 26, 2024

Overview

Closes RSS-98. Takes over for #11561.

Test Plan

I'm going to spend the day putting this on various robots and smoke-testing them to make sure it doesn't break any basic flows or cause any horrific performance regressions.

Other than that, we have okay automated test coverage. But see the risk assessment below.

Changelog

  • For RunStore, convert our storage of Command and StateSummary objects from pickled, dict-ified Pydantic models to JSON-serialized Pydantic models.
  • For CompletedAnalysisStore, we already did that conversion in perf(app,robot-server): Download analyses as raw JSON documents #13425, in an experimental parallel column. Promote that column to be non-experimental, and delete the old one that had pickled dicts.
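To make the difference between the two schemes concrete, here is a minimal sketch using an illustrative stand-in model (not the real robot-server StateSummary or its helpers):

```python
# A minimal sketch of the old vs. new storage schemes, using an
# illustrative stand-in model (not the real robot-server StateSummary).
import pickle

from pydantic import BaseModel


class StateSummary(BaseModel):  # stand-in for the real model
    status: str
    errors: list = []


summary = StateSummary(status="succeeded")

# Old scheme: pickle a dict-ified copy of the model.
old_blob = pickle.dumps(summary.dict())
restored_old = StateSummary.parse_obj(pickle.loads(old_blob))

# New scheme: store the JSON that Pydantic itself produces.
new_blob = summary.json().encode("utf-8")
restored_new = StateSummary.parse_raw(new_blob)

assert restored_old == restored_new == summary
```

The JSON blob is readable by any client without Python or pickle, which is what makes the raw-document endpoints from #13425 possible.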

Review requests

None in particular.

Risk assessment

Medium.

Correctness risks

Automated tests currently cover these:

  • Freshly-created records can survive a round-trip in and out of the database.
  • The migration doesn't outright break any endpoints—i.e., HTTP requests won't start returning HTTP 500 errors.

But they do not cover:

  • That HTTP requests return exactly the same JSON data as before. In particular, dealing with JSON null/omitted fields is always a bit hazardous with Pydantic, and this migration introduces points where something can go wrong there.

This omission is tracked as RSS-213.
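One concrete way the null/omitted hazard can bite, sketched with a hypothetical model (not a real robot-server schema): an optional field can round-trip as an explicit null or be omitted entirely, depending on serialization options.

```python
# Hypothetical model illustrating the null-vs-omitted hazard.
import json
from typing import Optional

from pydantic import BaseModel


class Command(BaseModel):
    id: str
    error: Optional[str] = None


cmd = Command(id="abc")

# Default serialization writes the unset field as an explicit null...
assert json.loads(cmd.json()) == {"id": "abc", "error": None}
# ...while exclude_none omits it. If the old and new storage paths
# disagree on this, clients see different JSON for the same record.
assert json.loads(cmd.json(exclude_none=True)) == {"id": "abc"}
```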

Mitigating this risk is the fact that we've been running this scheme in production, experimentally, since #13425, and it apparently hasn't caused any problems there.

Performance risks

Prior testing has shown parsing JSON to take about 1.7x as long as parsing our existing pickled dicts. So retrieving runs and commands may get slower. (Retrieving analyses is unaffected because of #13425.)
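For reference, a rough sketch of how such a comparison could be measured. This is not the original benchmark, and Cmd is an illustrative stand-in model; absolute numbers will vary by machine and Pydantic version.

```python
# Rough micro-benchmark sketch: pickled-dict parsing vs. JSON parsing.
# Cmd is an illustrative stand-in, not a real robot-server model.
import pickle
import timeit

from pydantic import BaseModel


class Cmd(BaseModel):
    id: str
    commandType: str
    params: dict


doc = Cmd(id="cmd-1", commandType="comment", params={"message": "hello " * 50})
pickled = pickle.dumps(doc.dict())
raw_json = doc.json()

t_pickle = timeit.timeit(lambda: Cmd.parse_obj(pickle.loads(pickled)), number=10_000)
t_json = timeit.timeit(lambda: Cmd.parse_raw(raw_json), number=10_000)
print(f"pickled dict: {t_pickle:.3f}s   json: {t_json:.3f}s")
```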

Manual testing will have to bear out whether this is a problem in practice. I expect #14286 and #14348 to go a long way. If we need a nuclear option, we can do something like #13425 again.

The migration can take a long time. I think this is unavoidable to some extent and we just need to communicate it. I've done what I can here to mitigate this by combining this migration with #14348. RSS-217 makes this worse, but we can fix that separately.

Risk tolerance

We should take seriously the risks listed above, and mitigate them as much as we have time for, but we also need to weigh them against the risk of not merging this.

  • RSS-98 is really not a good thing to have around, and it gets harder to solve the longer we let it sit.
  • We were previously mitigating RSS-98 by giving it ongoing manual attention (e.g. RSS-157, RSS-198). We haven't done that in a while (RSS-371) because we've been preoccupied with the Flex and with deck configuration.
  • Because we don't have Datadog configured for the Flex yet, RSS-98 can be setting up Flex-specific traps that we don't know about.

@SyntaxColoring changed the title to refactor(robot-server): Store Pydantic objects as JSON instead of pickles, take 2 (Jan 26, 2024)
@SyntaxColoring SyntaxColoring requested a review from a team January 26, 2024 03:39
@SyntaxColoring SyntaxColoring marked this pull request as ready for review January 26, 2024 03:39
@SyntaxColoring SyntaxColoring requested a review from a team as a code owner January 26, 2024 03:39
Comment on lines +114 to +121
old_state_summary = old_run_row.state_summary
new_state_summary = (
    None
    if old_run_row.state_summary is None
    else pydantic_to_json(
        pydantic.parse_obj_as(StateSummary, old_state_summary)
    )
)
@SyntaxColoring (Contributor, Author) commented Jan 26, 2024

We could add a try: ... except: pass around each run's migration to drop it if something goes wrong, instead of making the whole server unavailable.
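That idea could be sketched like this (migrate_one and the logging are illustrative, not actual robot-server helpers): isolate each run's migration so one corrupt record is skipped instead of failing the whole boot.

```python
# Sketch: per-run error isolation during migration.
# `migrate_one` is an illustrative callable, not a robot-server API.
import logging

log = logging.getLogger(__name__)


def migrate_all(old_run_rows, migrate_one):
    migrated = []
    for row in old_run_rows:
        try:
            migrated.append(migrate_one(row))
        except Exception:
            # Silently dropping a run is bad, but better than an
            # unbootable server; at least leave a trace in the logs.
            log.exception("Skipping run %r: migration failed", row)
    return migrated
```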

@SyntaxColoring (Contributor, Author) replied:

Yeah I've decided I should do this. Dropping runs silently is bad, but it's significantly better than failing the boot.

@SyntaxColoring (Contributor, Author) replied:

I haven't gotten to this yet. We should do it in a follow-up.

@SyntaxColoring (Contributor, Author) commented Jan 26, 2024


Some background on this file:

This supports endpoints like GET /protocols/{id}/analyses/{id}.

Originally, we stored these records as pickled dicts, and fully validated them before returning them over HTTP.

In #13425, we introduced a parallel DB column and parallel endpoint: they store the records as JSON, and return them without validating them. This of course has downsides, but it seems to be working well.

This PR consolidates the parallel DB columns, but it leaves the parallel endpoints alone. So we're always reading JSON now. The difference is just whether we validate it before returning it to the client, which depends on which endpoint the client uses.

We can probably consolidate the parallel endpoints now, but that's beyond the scope of this PR.

@SyntaxColoring (Contributor, Author) commented Jan 30, 2024

Test results so far:

  • The atomicity implemented by prior PRs seems to work, so that's cool. I haven't had problems with rebooting or power-cycling in the middle of the migration.
  • I ran a few protocols with 1000 comment commands to simulate there being a lot of data to migrate. The migration took ~15 seconds. Extrapolating to a worst case based on 20 10000-command runs (stress test cases from here), that's like 25 minutes. For comparison, our ABR protocols seem to have ~50-200 commands.
    • I have a proof of concept branch that uses multiprocessing to migrate the runs in parallel, and it seems like it can cut that in ~half, so we might do that in a follow-up PR.
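The shape of that proof of concept might look something like this (illustrative, not the actual branch): fan the independent per-run conversions out over a process pool.

```python
# Sketch: migrate runs in parallel with a process pool.
# `migrate_one_run` is a placeholder for the real pickle -> JSON conversion.
import multiprocessing


def migrate_one_run(row: dict) -> dict:
    # Placeholder for converting one run's pickled dicts to JSON.
    return {"id": row["id"], "migrated": True}


def migrate_in_parallel(rows: list) -> list:
    # Each run migrates independently, so this parallelizes cleanly;
    # writing results back to SQLite would still be serialized.
    with multiprocessing.Pool() as pool:
        return pool.map(migrate_one_run, rows)
```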

codecov bot commented Jan 30, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Comparison is base (ffcc548) 68.29% compared to head (4df1b1d) 68.29%.
Report is 2 commits behind head on edge.

❗ Current head 4df1b1d differs from pull request most recent head 66121ad. Consider uploading reports for the commit 66121ad to get more accurate results

Additional details and impacted files


@@           Coverage Diff           @@
##             edge   #14355   +/-   ##
=======================================
  Coverage   68.29%   68.29%           
=======================================
  Files        1623     1623           
  Lines       54858    54858           
  Branches     4115     4115           
=======================================
+ Hits        37466    37468    +2     
+ Misses      16705    16704    -1     
+ Partials      687      686    -1     
Flag Coverage Δ
app 34.89% <ø> (+0.05%) ⬆️

Flags with carried forward coverage won't be shown.

Files Coverage Δ
robot-server/robot_server/runs/run_store.py 100.00% <ø> (ø)

... and 1 file with indirect coverage changes

@jbleon95 (Contributor) left a comment:

I think this looks good for moving this whole database migration forward. We should definitely keep an eye on the potential performance issues you noted, but at this point we should have this out on our internal robots and respond from there.

@CaseyBatten (Contributor) left a comment:

A few thoughts on the risks section:

  • To what extent has the new migration time estimate been communicated to other parties within Opentrons? Is there a plan in place to implement user/public-facing prompts regarding the migration process to prepare users for this?
  • If the solutions within perf(robot-server): Discriminate command unions #14286 and perf(robot-server): Store one command per row #14348 do not provide the desired performance enhancement, we should be prepared to implement an as-yet-undesigned third option so as not to further slow communications on the system. I'm not exactly a fan of the nuclear option.

Outside of that, the changes here look appropriate and we can feel safe to move forward into more advanced testing based on the initially reported test results.

@SyntaxColoring (Contributor, Author) replied:
To what extent has the new migration time estimate been communicated to other parties within Opentrons? Is there a plan in place to implement user/public-facing prompts regarding the migration process to prepare users for this?

I'll make a post internally, and for the public, we'll have to add release notes.

Base automatically changed from db_run_commands_as_rows to edge January 30, 2024 22:48