
refactor(robot-server): Store Pydantic objects as JSON instead of pickles, take 2 #14355

Merged · 4 commits · Jan 30, 2024

Conversation

@SyntaxColoring (Contributor) commented Jan 26, 2024

Overview

Closes RSS-98. Takes over for #11561.

Test Plan

I'm going to spend the day putting this on various robots and smoke-testing them to make sure it doesn't break any basic flows or cause any horrific performance regressions.

Other than that, we have okay automated test coverage. But see the risk assessment below.

Changelog

  • For RunStore, convert our storage of Command and StateSummary objects from pickled, dict-ified Pydantic models to JSON-serialized Pydantic models.
  • For CompletedAnalysisStore, we already did that conversion in perf(app,robot-server): Download analyses as raw JSON documents #13425, in an experimental parallel column. Promote that column to be non-experimental, and delete the old one that had pickled dicts.
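To make the difference between the two schemes concrete, here is a minimal sketch using an illustrative stand-in model (not the real robot-server StateSummary or its helpers):

```python
# A minimal sketch of the old vs. new storage schemes, using an
# illustrative stand-in model (not the real robot-server StateSummary).
import pickle

from pydantic import BaseModel


class StateSummary(BaseModel):  # stand-in for the real model
    status: str
    errors: list = []


summary = StateSummary(status="succeeded")

# Old scheme: pickle a dict-ified copy of the model.
old_blob = pickle.dumps(summary.dict())
restored_old = StateSummary.parse_obj(pickle.loads(old_blob))

# New scheme: store the JSON that Pydantic itself produces.
new_blob = summary.json().encode("utf-8")
restored_new = StateSummary.parse_raw(new_blob)

assert restored_old == restored_new == summary
```

The JSON blob is readable by any client without Python or pickle, which is what makes the raw-document endpoints from #13425 possible.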

Review requests

None in particular.

Risk assessment

Medium.

Correctness risks

Automated tests currently cover these:

  • Freshly-created records can survive a round-trip in and out of the database.
  • The migration doesn't outright break any endpoints—i.e., HTTP requests won't start returning HTTP 500 errors.

But they do not cover:

  • That HTTP requests return exactly the same JSON data as before. In particular, dealing with JSON null/omitted fields is always a bit hazardous with Pydantic, and this migration introduces points where something can go wrong there.

This omission is tracked as RSS-213.
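One concrete way the null/omitted hazard can bite, sketched with a hypothetical model (not a real robot-server schema): an optional field can round-trip as an explicit null or be omitted entirely, depending on serialization options.

```python
# Hypothetical model illustrating the null-vs-omitted hazard.
import json
from typing import Optional

from pydantic import BaseModel


class Command(BaseModel):
    id: str
    error: Optional[str] = None


cmd = Command(id="abc")

# Default serialization writes the unset field as an explicit null...
assert json.loads(cmd.json()) == {"id": "abc", "error": None}
# ...while exclude_none omits it. If the old and new storage paths
# disagree on this, clients see different JSON for the same record.
assert json.loads(cmd.json(exclude_none=True)) == {"id": "abc"}
```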

Mitigating this risk is the fact that we've been running this scheme in production, experimentally, since #13425, and it apparently hasn't caused any problems there.

Performance risks

Prior testing has shown parsing JSON to take about 1.7x as long as parsing our existing pickled dicts. So retrieving runs and commands may get slower. (Retrieving analyses is unaffected because of #13425.)
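For reference, a rough sketch of how such a comparison could be measured. This is not the original benchmark, and Cmd is an illustrative stand-in model; absolute numbers will vary by machine and Pydantic version.

```python
# Rough micro-benchmark sketch: pickled-dict parsing vs. JSON parsing.
# Cmd is an illustrative stand-in, not a real robot-server model.
import pickle
import timeit

from pydantic import BaseModel


class Cmd(BaseModel):
    id: str
    commandType: str
    params: dict


doc = Cmd(id="cmd-1", commandType="comment", params={"message": "hello " * 50})
pickled = pickle.dumps(doc.dict())
raw_json = doc.json()

t_pickle = timeit.timeit(lambda: Cmd.parse_obj(pickle.loads(pickled)), number=10_000)
t_json = timeit.timeit(lambda: Cmd.parse_raw(raw_json), number=10_000)
print(f"pickled dict: {t_pickle:.3f}s   json: {t_json:.3f}s")
```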

Manual testing will have to bear out whether this is a problem in practice. I expect #14286 and #14348 to go a long way. If we need a nuclear option, we can do something like #13425 again.

The migration can take a long time. I think this is unavoidable to some extent and we just need to communicate it. I've done what I can here to mitigate this by combining this migration with #14348. RSS-217 makes this worse, but we can fix that separately.

Risk tolerance

We should take seriously the risks listed above, and mitigate them as much as we have time for, but we also need to weigh them against the risk of not merging this.

  • RSS-98 is really not a good thing to have around, and it gets harder to solve the longer we let it sit.
  • We were previously mitigating RSS-98 by giving it ongoing manual attention (e.g. RSS-157, RSS-198). We haven't done that in a while (RSS-371) because we've been preoccupied with the Flex and with deck configuration.
  • Because we don't have Datadog configured for the Flex yet, RSS-98 can be setting up Flex-specific traps that we don't know about.

@SyntaxColoring changed the title to refactor(robot-server): Store Pydantic objects as JSON instead of pickles, take 2 (Jan 26, 2024)
@SyntaxColoring SyntaxColoring requested a review from a team January 26, 2024 03:39
@SyntaxColoring SyntaxColoring marked this pull request as ready for review January 26, 2024 03:39
@SyntaxColoring SyntaxColoring requested a review from a team as a code owner January 26, 2024 03:39
Comment on lines +114 to +121
old_state_summary = old_run_row.state_summary
new_state_summary = (
    None
    if old_run_row.state_summary is None
    else pydantic_to_json(
        pydantic.parse_obj_as(StateSummary, old_state_summary)
    )
)
@SyntaxColoring (Contributor, Author) commented Jan 26, 2024

We could add a try: ... except: pass around each run's migration to drop it if something goes wrong, instead of making the whole server unavailable.
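That idea could be sketched like this (migrate_one and the logging are illustrative, not actual robot-server helpers): isolate each run's migration so one corrupt record is skipped instead of failing the whole boot.

```python
# Sketch: per-run error isolation during migration.
# `migrate_one` is an illustrative callable, not a robot-server API.
import logging

log = logging.getLogger(__name__)


def migrate_all(old_run_rows, migrate_one):
    migrated = []
    for row in old_run_rows:
        try:
            migrated.append(migrate_one(row))
        except Exception:
            # Silently dropping a run is bad, but better than an
            # unbootable server; at least leave a trace in the logs.
            log.exception("Skipping run %r: migration failed", row)
    return migrated
```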

@SyntaxColoring (Contributor, Author) replied:

Yeah I've decided I should do this. Dropping runs silently is bad, but it's significantly better than failing the boot.

@SyntaxColoring (Contributor, Author) replied:

I haven't gotten to this yet. We should do it in a follow-up.

@SyntaxColoring (Contributor, Author) commented Jan 26, 2024


Some background on this file:

This supports endpoints like GET /protocols/{id}/analyses/{id}.

Originally, we stored these records as pickled dicts, and fully validated them before returning them over HTTP.

In #13425, we introduced a parallel DB column and parallel endpoint: they store the records as JSON, and return them without validating them. This of course has downsides, but it seems to be working well.

This PR consolidates the parallel DB columns, but it leaves the parallel endpoints alone. So we're always reading JSON now. The difference is just whether we validate it before returning it to the client, which depends on which endpoint the client uses.

We can probably consolidate the parallel endpoints now, but that's beyond the scope of this PR.

@SyntaxColoring (Contributor, Author) commented Jan 30, 2024

Test results so far:

  • The atomicity implemented by prior PRs seems to work, so that's cool. I haven't had problems with rebooting or power-cycling in the middle of the migration.
  • I ran a few protocols with 1000 comment commands to simulate there being a lot of data to migrate. The migration took ~15 seconds. Extrapolating to a worst case based on 20 10000-command runs (stress test cases from here), that's like 25 minutes. For comparison, our ABR protocols seem to have ~50-200 commands.
    • I have a proof of concept branch that uses multiprocessing to migrate the runs in parallel, and it seems like it can cut that in ~half, so we might do that in a follow-up PR.
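The shape of that proof of concept might look something like this (illustrative, not the actual branch): fan the independent per-run conversions out over a process pool.

```python
# Sketch: migrate runs in parallel with a process pool.
# `migrate_one_run` is a placeholder for the real pickle -> JSON conversion.
import multiprocessing


def migrate_one_run(row: dict) -> dict:
    # Placeholder for converting one run's pickled dicts to JSON.
    return {"id": row["id"], "migrated": True}


def migrate_in_parallel(rows: list) -> list:
    # Each run migrates independently, so this parallelizes cleanly;
    # writing results back to SQLite would still be serialized.
    with multiprocessing.Pool() as pool:
        return pool.map(migrate_one_run, rows)
```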

codecov bot commented Jan 30, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Comparison is base (ffcc548) 68.29% compared to head (4df1b1d) 68.29%.
Report is 2 commits behind head on edge.

❗ Current head 4df1b1d differs from pull request most recent head 66121ad. Consider uploading reports for the commit 66121ad to get more accurate results

Additional details and impacted files


@@           Coverage Diff           @@
##             edge   #14355   +/-   ##
=======================================
  Coverage   68.29%   68.29%           
=======================================
  Files        1623     1623           
  Lines       54858    54858           
  Branches     4115     4115           
=======================================
+ Hits        37466    37468    +2     
+ Misses      16705    16704    -1     
+ Partials      687      686    -1     
Flag Coverage Δ
app 34.89% <ø> (+0.05%) ⬆️

Flags with carried forward coverage won't be shown.

Files Coverage Δ
robot-server/robot_server/runs/run_store.py 100.00% <ø> (ø)

... and 1 file with indirect coverage changes

@jbleon95 (Contributor) left a comment:

I think this looks good for moving this whole database migration forward. We should definitely keep an eye on the potential performance issues you noted, but at this point we should have this out on our internal robots and respond from there.

@CaseyBatten (Contributor) left a comment:

A few thoughts on the risks section:

  • To what extent has the new migration time estimate been communicated to other parties within Opentrons? Is there a plan in place to implement user/public-facing prompts regarding the migration process to prepare users for this?
  • If the solutions within perf(robot-server): Discriminate command unions #14286 and perf(robot-server): Store one command per row #14348 do not provide the desired performance enhancement, we should be prepared to implement an as-yet-undesigned third option so as not to further slow communications on the system. I'm not exactly a fan of the nuclear option.

Outside of that, the changes here look appropriate and we can feel safe to move forward into more advanced testing based on the initially reported test results.

@SyntaxColoring (Contributor, Author) replied:
To what extent has the new migration time estimate been communicated to other parties within Opentrons? Is there a plan in place to implement user/public-facing prompts regarding the migration process to prepare users for this?

I'll make a post internally, and for the public, we'll have to add release notes.

Base automatically changed from db_run_commands_as_rows to edge January 30, 2024 22:48