augur curate I/O: validate records have same fields #1518
Conversation
Codecov Report

All modified and coverable lines are covered by tests ✅

Additional details and impacted files

@@            Coverage Diff             @@
##           master    #1518      +/-   ##
==========================================
+ Coverage   69.63%   69.69%   +0.05%
==========================================
  Files          74       74
  Lines        7812     7827      +15
  Branches     1910     1914       +4
==========================================
+ Hits         5440     5455      +15
  Misses       2086     2086
  Partials      286      286

☔ View full report in Codecov by Sentry.
Passing the records through multiple augur curate commands should raise the same error when it encounters the record with mismatched fields.
[not a request for changes]

Do you have any good resources for understanding how pipes work in situations like this? I.e. `augur curate X` is printing lines one-by-one, so does an individual NDJSON line flow through all curate commands before the next one starts flowing through? That's what this output makes it seem like. But when Python flushes the print buffer must come into the equation, right? And Unix pipes presumably have some concept of backpressure / buffering?
From my understanding, the shell starts all the commands at the same time, with file descriptors arranged so that the STDOUT of the first command is the STDIN of the second, and so on down the line.
There is a buffer for the pipe, managed by the kernel, but if it's full, it blocks on the write side (and if it's empty, it blocks on the read side). This SO answer may be helpful.
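The kernel-managed buffer and its write-side blocking can be observed directly. This is a minimal sketch, not augur-specific: it makes the write end of a pipe non-blocking so that, instead of stalling when the buffer fills (as a blocking writer in a pipeline would), the write raises and we can see the capacity.

```python
import os

# Create a pipe and make its write end non-blocking, so a full buffer
# raises BlockingIOError instead of suspending the process.
r, w = os.pipe()
os.set_blocking(w, False)

written = 0
try:
    while True:
        written += os.write(w, b"x" * 4096)
except BlockingIOError:
    pass  # buffer is full -- a blocking writer would be suspended here

print(f"pipe buffer filled after ~{written} bytes")  # typically 65536 on Linux
os.close(r)
os.close(w)
```

A blocking writer hitting that limit simply sleeps until the reader drains some data, which is the backpressure mechanism asked about above.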
My understanding is as @genehack described ☝️ (I've only skimmed the pipe man page and have a limited understanding here)
so does an individual NDJSON line flow through all curate commands before the next one starts flowing through?
This should depend on the buffer size, where multiple records can flow through a single command to fill up the buffer before being passed to the next command.
In the case where the first command runs into an error, it should close its "write end" so the subsequent commands will receive some end-of-file signal and terminate after writing their outputs as well. (Or exit immediately under `set -eo pipefail`.)
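The end-of-file behaviour described above is easy to reproduce. A minimal sketch (using `sh` and `cat` as stand-ins for curate commands): the first process writes some output and exits non-zero, and the reader still just sees EOF, drains its buffer, and exits 0.

```python
import subprocess

# First process: writes two lines, then exits with an error code.
p1 = subprocess.Popen(["sh", "-c", "printf 'a\\nb\\n'; exit 2"],
                      stdout=subprocess.PIPE)
# Second process reads the first's stdout, like the next command in a pipeline.
p2 = subprocess.Popen(["cat"], stdin=p1.stdout, stdout=subprocess.PIPE)
p1.stdout.close()  # p2 now holds the only reference to the read end

out, _ = p2.communicate()
print(out.decode(), end="")       # a\nb\n -- data written before the error
print("p1 exit:", p1.wait())      # 2
print("p2 exit:", p2.returncode)  # 0 -- EOF is not an error for the reader
```

`pipefail` doesn't change any of this; it only makes the shell report the pipeline's status as the rightmost non-zero exit code after everything has finished.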
Thanks both! Reading those resources, this is my understanding of what's happening (using `c1` for the first curate command, etc.):

1. `c1` writes the first record to the appropriate fd, and there's no buffering on the Python side (since we are using `print()` with the default newline ending). `c2` reads this more-or-less immediately and in parallel with `c1` continuing to run. It writes output to its fd, `c3` reads it, and so on.
2. This happens for the first three (valid) records: they make it through all three curate commands and are written to stdout.
3. At some point `c1` reads the invalid record, prints to stderr, and exits with code 2. This is, AFAICT, seen by `c2` no differently to an end-of-file, and so `c2` exits (code 0) once it's consumed all the data in its input buffer.
4. The `pipefail` causes the entire pipeline to have exit code 2 because `c1` had exit code 2, but this is done after the pipeline has finished. I.e. it doesn't actually change any behaviour of the pipeline: after `c1` has exited, `c2` will continue to run while it has data to read on its input buffer, and so on.
5. The order of steps (3) and (4) seems like it's a race condition, but the fact that the pipeline's output has the error message (stderr) before the records (stdout) indicates that (4) comes before the first record has made it through the entire pipeline. Stdout seems to be line-buffered, so I don't think that's important here.
James asked me to check his understanding, and generally yeah, it's about right AFAICT.
One small inaccuracy is that `sys.stdout` is block-buffered, not line-buffered, when not connected to a TTY. (And it can be unbuffered entirely if desired.)
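This is straightforward to check empirically (assuming CPython): spawn a child Python whose stdout is a pipe and ask its text wrapper how it is buffered.

```python
import subprocess
import sys

# The child's stdout is a pipe (capture_output), not a TTY, so CPython
# block-buffers it rather than line-buffering it.
code = "import sys; print(sys.stdout.line_buffering, sys.stdout.isatty())"
result = subprocess.run([sys.executable, "-c", code],
                        capture_output=True, text=True)
print(result.stdout.strip())  # False False -- block-buffered, not a TTY
```

Run the same one-liner directly in a terminal and it reports `True True`, which is why interactive behaviour can differ from pipeline behaviour.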
If you wanted to empirically test this understanding, you could `strace` the pipeline (e.g. the `bash` process and its children) and inspect the order of read/write/exit operations for each process in the pipeline. (`strace` is a Linux tool; there are equivalents for macOS, but I'm not as adept with them.)
Additional things to consider, if you want to dig further into the order of (3) vs. (4): how exactly stdout and stderr are interleaved by cram to match against the test file, and the buffering mode of Python's `sys.stderr` (which is version-dependent).
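The version dependence of `sys.stderr`'s buffering can be probed the same way as stdout's. Since Python 3.9, `sys.stderr` is line-buffered even when redirected to a pipe; earlier 3.x versions block-buffer it in that case.

```python
import subprocess
import sys

# Ask a child Python (whose stderr is a pipe here) how stderr is buffered.
code = "import sys; sys.stdout.write(str(sys.stderr.line_buffering))"
result = subprocess.run([sys.executable, "-c", code],
                        capture_output=True, text=True)
print(result.stdout)  # True on Python 3.9+, False on older 3.x
```

Line-buffered stderr means error messages tend to appear promptly even mid-pipeline, which is consistent with the error showing up before the records in the test output above.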
Validate records have the same fields before and after the subcommand modifies the records. Includes a functional test to check for the input error, but there's no good way to check for the output error. Hopefully we just never see it!
Force-pushed from d84a0f4 to 9811491
Rebased onto master to ensure no merge conflicts in the changelog, and updated the changelog.
Follow up to #1514
Description of proposed changes
Validate records have the same fields before and after the subcommand modifies the records.
Includes a functional test to check for the input error, but there's no good way to check for the output error. Hopefully we just never see it!
Related issue(s)
Resolves #1510
Checklist