augur curate I/O: validate records have same fields #1518
Conversation
Codecov Report

All modified and coverable lines are covered by tests ✅

Additional details and impacted files

@@            Coverage Diff             @@
##           master    #1518      +/-   ##
==========================================
+ Coverage   69.63%   69.69%   +0.05%
==========================================
  Files          74       74
  Lines        7812     7827      +15
  Branches     1910     1914       +4
==========================================
+ Hits         5440     5455      +15
  Misses       2086     2086
  Partials      286      286

☔ View full report in Codecov by Sentry.
Passing the records through multiple augur curate commands should raise the same error when it encounters the record with mismatched fields.
[not a request for changes]

Do you have any good resources for understanding how pipes work in situations like this? I.e. `augur curate X` is printing lines one-by-one, so does an individual NDJSON line flow through all curate commands before the next one starts flowing through? That's what this output makes it seem like. But when Python flushes the print buffer must come into the equation, right? And Unix pipes presumably have some concept of backpressure / buffering?
From my understanding, the shell starts all the commands at the same time, with file descriptors arranged so that the STDOUT of the first command is the STDIN of the second, and so on down the line.
There is a buffer for the pipe, managed by the kernel, but if it's full, it blocks on the write side (and if it's empty, it blocks on the read side). This SO answer may be helpful.
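The kernel-managed buffer and its write-side blocking can be observed directly. This is a minimal sketch, not augur-specific: it makes the write end of a pipe non-blocking so that, instead of stalling when the buffer fills (as a blocking writer in a pipeline would), the write raises and we can see the capacity.

```python
import os

# Create a pipe and make its write end non-blocking, so a full buffer
# raises BlockingIOError instead of suspending the process.
r, w = os.pipe()
os.set_blocking(w, False)

written = 0
try:
    while True:
        written += os.write(w, b"x" * 4096)
except BlockingIOError:
    pass  # buffer is full -- a blocking writer would be suspended here

print(f"pipe buffer filled after ~{written} bytes")  # typically 65536 on Linux
os.close(r)
os.close(w)
```

A blocking writer hitting that limit simply sleeps until the reader drains some data, which is the backpressure mechanism asked about above.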
My understanding is as @genehack described ☝️ (I've only skimmed the pipe man page and have a limited understanding here)
so does an individual NDJSON line flow through all curate commands before the next one starts flowing through?
This should depend on the buffer size, where multiple records can flow through a single command to fill up the buffer before being passed to the next command.
In the case where the first command runs into an error, it should close its "write end" so the subsequent commands will receive some end-of-file signal and terminate after writing their outputs as well. (Or exit immediately under `set -eo pipefail`.)
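The end-of-file behaviour described above is easy to reproduce. A minimal sketch (using `sh` and `cat` as stand-ins for curate commands): the first process writes some output and exits non-zero, and the reader still just sees EOF, drains its buffer, and exits 0.

```python
import subprocess

# First process: writes two lines, then exits with an error code.
p1 = subprocess.Popen(["sh", "-c", "printf 'a\\nb\\n'; exit 2"],
                      stdout=subprocess.PIPE)
# Second process reads the first's stdout, like the next command in a pipeline.
p2 = subprocess.Popen(["cat"], stdin=p1.stdout, stdout=subprocess.PIPE)
p1.stdout.close()  # p2 now holds the only reference to the read end

out, _ = p2.communicate()
print(out.decode(), end="")       # a\nb\n -- data written before the error
print("p1 exit:", p1.wait())      # 2
print("p2 exit:", p2.returncode)  # 0 -- EOF is not an error for the reader
```

`pipefail` doesn't change any of this; it only makes the shell report the pipeline's status as the rightmost non-zero exit code after everything has finished.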
Thanks both! Reading those resources, this is my understanding of what's happening (using `c1` for the first curate command, etc.):

1. `c1` writes the first record to the appropriate fd, and there's no buffering on the Python side (since we are using `print()` with the default newline ending). `c2` reads this more-or-less immediately and in parallel with `c1` continuing to run. It writes output to its fd, `c3` reads it, and so on.
2. This happens for the first three (valid) records: they make it through all three curate commands and are written to stdout.
3. At some point `c1` reads the invalid record, prints to stderr, and exits with code 2. This is, AFAICT, seen by `c2` no differently to an end-of-file, and so `c2` exits (code 0) once it's consumed all the data in its input buffer.
4. The `pipefail` causes the entire pipeline to have exit code 2 because `c1` had exit code 2, but this is done after the pipeline has finished. I.e. it doesn't actually change any behaviour of the pipeline: after `c1` has exited, `c2` will continue to run while it has data to read on its input buffer, and so on.
5. The order of steps (3) and (4) seems like it's a race condition, but the fact that the pipeline's output has the error message (stderr) before the records (stdout) indicates that (4) comes before the first record has made it through the entire pipeline. Stdout seems to be line-buffered, so I don't think that's important here.
James asked me to check his understanding, and generally yeah, it's about right AFAICT.
One small inaccuracy is that `sys.stdout` is block-buffered, not line-buffered, when not connected to a TTY. (And it can be unbuffered entirely if desired.)
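This is straightforward to check empirically (assuming CPython): spawn a child Python whose stdout is a pipe and ask its text wrapper how it is buffered.

```python
import subprocess
import sys

# The child's stdout is a pipe (capture_output), not a TTY, so CPython
# block-buffers it rather than line-buffering it.
code = "import sys; print(sys.stdout.line_buffering, sys.stdout.isatty())"
result = subprocess.run([sys.executable, "-c", code],
                        capture_output=True, text=True)
print(result.stdout.strip())  # False False -- block-buffered, not a TTY
```

Run the same one-liner directly in a terminal and it reports `True True`, which is why interactive behaviour can differ from pipeline behaviour.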
If you wanted to empirically test this understanding, you could `strace` the pipeline (e.g. the `bash` process and its children) and inspect the order of read/write/exit operations for each process in the pipeline. (`strace` is a Linux tool; there are equivalents for macOS, but I'm not as adept with them.)
Additional things to consider, if you want to dig further into the order of (3) vs. (4): how exactly stdout and stderr are interleaved by cram to match against the test file, and the buffering mode of Python's `sys.stderr` (which is version-dependent).
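The version dependence of `sys.stderr`'s buffering can be probed the same way as stdout's. Since Python 3.9, `sys.stderr` is line-buffered even when redirected to a pipe; earlier 3.x versions block-buffer it in that case.

```python
import subprocess
import sys

# Ask a child Python (whose stderr is a pipe here) how stderr is buffered.
code = "import sys; sys.stdout.write(str(sys.stderr.line_buffering))"
result = subprocess.run([sys.executable, "-c", code],
                        capture_output=True, text=True)
print(result.stdout)  # True on Python 3.9+, False on older 3.x
```

Line-buffered stderr means error messages tend to appear promptly even mid-pipeline, which is consistent with the error showing up before the records in the test output above.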
Validate records have the same fields before and after the subcommand modifies the records. Includes a functional test to check for the input error, but there's no good way to check for the output error. Hopefully we just never see it!
Force-pushed from d84a0f4 to 9811491
Rebased onto master to ensure no merge conflicts in the changelog, and updated the changelog.
Follow up to #1514
Description of proposed changes
Validate records have the same fields before and after the subcommand modifies the records.
Includes a functional test to check for the input error, but there's no good way to check for the output error. Hopefully we just never see it!
Related issue(s)
Resolves #1510
Checklist