Pick a meaning of TSV and use it consistently #1566

tsibley · 2024-08-02T21:43:00Z

We should pick between either IANA TSVs or RFC 4180 CSV-like TSVs and then be consistent about that choice everywhere. Problems arise when mixing the two, e.g. expecting one but getting the other.

IANA TSVs prohibit tabs and newlines in field values. Parsing is much simpler because of this, but generators need to ensure tabs and newlines are stripped/replaced or cause an error, lest they risk generating corrupt output. The programs in tsv-utils work with IANA TSVs.

CSV-like TSVs can represent all values without limitation. Parsing is more complicated because of quoting and escaping (but libraries exist). The programs in csvtk work with CSV-like TSVs. The Python csv module works with CSV-like TSVs by default, but can be configured to work with IANA-like TSVs instead. (IANA-like, because it will error on some outputs despite them having valid IANA TSVs forms. Some inputs may also misbehave, despite being valid.)

Augur mostly uses CSV-like TSV with the csv module configured in QUOTE_MINIMAL quoting mode (the default). The exception is augur curate commands which use IANA-like TSV with the csv module configured in QUOTE_NONE mode, meaning it will error if field values contain a tab or newline.

Nextclade outputs CSV-like TSV.

Related to:

The text was updated successfully, but these errors were encountered:

genehack · 2024-08-02T21:48:50Z

Making sure I understand: IANA TSVs can be used as if they're CSV-like TSVs — in other words, anything that accepts CSV-like TSVs will accept IANA TSVs without complaint.

The reverse is not true: something expecting an IANA TSV will not accept a CSV-like TSV that contains a field with an embedded tab or newline.

If both of those statements are accurate, settling on CSV-like TSVs feels like the most compatible option — why would we not want to do that?

tsibley · 2024-08-02T22:03:52Z

~~Correct. And I agree.~~ Actually, not correct.

While this issue is about consistency within Augur itself, it relates to consistency within the broader Nextstrain ecosystem when using TSVs. We use tsv-utils (IANA TSVs) a lot (e.g. in our workflows, in ad-hoc data processing) but don't habitually pass inputs thru csv2tsv --csv-delim $'\t' to convert from CSV-like TSVs to IANA TSVs first. We might need to start doing more of that if we keep Augur producing CSV-like TSVs.

corneliusroemer · 2024-08-03T20:18:17Z

Thanks @tsibley for raising this issue (and teaching me something new)! I was not fully conscious of there being 2 incompatible ways libraries/tools parse and generate TSVs.

Technically, RFC 4180 does not mention TSVs or any separators other than comma at all. So if we wanted to do what the most authoritative spec says, it would clearly be IANA TSV we should be using. However, it is true that many tools seem to treat TSVs as RFC 4180 CSVs with the only difference being the separator used.

Making sure I understand: IANA TSVs can be used as if they're CSV-like TSVs — in other words, anything that accepts CSV-like TSVs will accept IANA TSVs without complaint.

That's not correct. For each type, there are files that are valid in exactly one of them: a valid single column IANA TSV line like " is not valid per RFC CSV specs, so there's undefined behaviour, it might be read correctly, but it might also error. Also roundtripping through RFC-CSV might not be idempotent. E.g. Implementing tools one single quote is perfectly valdouble quotes ("""") will be wrongly (and silently wrongly) interpreted by a library that expects CSV-TSVs as only containing a single double quote ".

Example: Python's csv package if configured with delimiter="\t" misinterprets the valid IANA-TSV file A\n" (one column A, two lines, non-header line contains ") as having empty string on the second line. Round tripping causes a change, the output has 2 double quotes, instead of 1.

>>> import csv; d=list(csv.reader(['A','"'],delimiter="\t")); print(d);
[['A'], ['']]
>>> import io; output=io.StringIO(); csv.writer(output, delimiter="\t").writerows(d); print(output.getvalue())
A
""

The nice thing about the Python csv package is that it can be configured in both ways: compliant with IANA or compliant with RFC-CSV-TSV. So we can implement both in Augur with minimal (though breaking) changes. Whatever we decide, we break some augur command's handling of TSVs.

IANA TSVs are a good fit if your input fields never contain tabs nor newlines. Arguably that's the case in bioinformatics - i'm not sure I've ever encountered a tab or newline character in any metadata field from GISAID or NCBI. In fact, I would speculate that the reason TSVs are so commonly used in bioinformatics is precisely because one doesn't need to escape commas and double quotes. What's the point of using TSVs if you end up having to escape common characters (double quotes)? I'm pretty sure that double quotes are more commonly encountered in common augur input data than tabs or newlines.

Let me list all the options we have here, I think there's actually more than picking one or the other:

A) Interpret all TSVs according to IANA. If \t (and \n within fields) are encountered, throw error, telling the user their data is not in line with IANA TSV spec and that they should use csv instead. If double quotes are encountered, treat them literally
B) Interpret all TSVs according to RFC spec, using Python's minimal quoting dialect, which only adds double quote escapes in case of fields containing ", \t or \n - potentially
C) Support both, exposing the type of TSV to use as input/output as a command line option, choose either one as default. Input and output modes could possibly be different. That would mean we'd have two options: one for input type, one for output type.

If we were to support both, one could also add an "auto" input mode, that dynamically decides the mode when encountering the first line that positively identifies the file as being according to IANA or RFC spec (i.e. when it encounters a line that's invalid as IANA but valid as RFC, use RFC mode and vice versa, sometimes both might be valid but have semantic differences, e.g. a "" field - in that case auto mode could either fail or use a default mode). Possibly, input "auto" mode could also be used to set the output mode automatically.

I think it would really help us if we analyzed input TSV files in our repos and see whether they are a) valid for both, or b) valid only for IANA, c) valid only for RFC. With some data, it's easier to make a good decision here.

Sorry for the essay here - this is quite a fascinating topic :)

We use tsv-utils (IANA TSVs) a lot (e.g. in our workflows, in ad-hoc data processing) but don't habitually pass inputs thru csv2tsv --csv-delim $'\t' to convert from CSV-like TSVs to IANA TSVs first. We might need to start doing more of that if we keep Augur producing CSV-like TSVs.

It's good to know that some RFC-TSVs can be converted to IANA as long as only " but not \t or \n are used in fields. But we should only do this if all of the downstream usage assumes IANA tsvs, otherwise, one would have to unnecessarily re-escape ".

I wonder if it might make sense to start using non-ambiguous extensions, e.g. .itsv (IANA) and .ctsv (RFC-CSV), to be explicit about which spec the file follows.

tsibley · 2024-08-13T21:16:37Z

Making sure I understand: IANA TSVs can be used as if they're CSV-like TSVs — in other words, anything that accepts CSV-like TSVs will accept IANA TSVs without complaint.

That's not correct. For each type, there are files that are valid in exactly one of them: a valid single column IANA TSV line like " is not valid per RFC CSV specs, so there's undefined behaviour, it might be read correctly, but it might also error.

Ah, yes, you're right. Thanks for this catch!

Example: Python's csv package if configured with delimiter="\t" misinterprets the valid IANA-TSV file A\n" (one column A, two lines, non-header line contains ") as having empty string on the second line. Round tripping causes a change, the output has 2 double quotes, instead of 1.
>>> import csv; d=list(csv.reader(['A','"'],delimiter="\t")); print(d);
[['A'], ['']]
>>> import io; output=io.StringIO(); csv.writer(output, delimiter="\t").writerows(d); print(output.getvalue())
A
""
The nice thing about the Python csv package is that it can be configured in both ways: compliant with IANA or compliant with RFC-CSV-TSV

In that example, the single double quote, ", isn't being doubled per se. It's being parsed to an empty string and then when writing that empty string, it's represented in quotes per RFC standards.

You can get the output you expect when writing a value of " if you do more than set delimiter, for example:

>>> csv.writer(sys.stdout, delimiter="\t", quoting=csv.QUOTE_NONE, doublequote=False, quotechar="\t", escapechar=None).writerows([['A'],['"']]);
A
"

but that's abusing quotechar and will break for values with a literal tab in them (which is ~expected, but it breaks for the wrong reason (no escape char instead of no quote char)).

>>> csv.writer(sys.stdout, delimiter="\t", quoting=csv.QUOTE_NONE, doublequote=False, quotechar="\t", escapechar=None).writerows([['A'],['\t']]);
A
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
_csv.Error: need to escape, but no escapechar set

But even the dialect options I set above don't handle this case below of a single empty field:

>>> csv.writer(sys.stdout, delimiter="\t", quoting=csv.QUOTE_NONE, doublequote=False, quotechar="\t", escapechar=None).writerows([['A'],['']]);
A
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
_csv.Error: single empty field record must be quoted

despite that having a valid IANA representation.

The csv module is really meant for RFC CSVs and related formats and it's non-trivial to make it produce true IANA TSVs. Maybe even impossible.

IANA TSVs are a good fit if your input fields never contain tabs nor newlines. Arguably that's the case in bioinformatics - i'm not sure I've ever encountered a tab or newline character in any metadata field from GISAID or NCBI.

I've seen plenty of tabs and newlines and other stuff in metadata before. If we're going to emit IANA TSVs for data we read from other formats, we need to be consistent about removing/replacing those chars before serialization. Using RFC TSVs means we don't have to worry about it: the csv library will take care of it for us.

Let me list all the options we have here, I think there's actually more than picking one or the other:

There are other options, like the ones you've listed, yes, but I think it would be much much simpler to reason about data and files and write documentation, help folks out, etc. if we could pick one format for Augur (and really, the broader Nextstrain ecosystem more generally).

tsibley · 2024-08-13T21:30:07Z

(I've updated the issue description to differentiate between IANA TSVs and IANA-like TSVs, i.e. what we can do with Python's csv.)

tsibley · 2024-09-03T21:52:20Z

I think it would be much much simpler to reason about data and files and write documentation, help folks out, etc. if we could pick one format for Augur (and really, the broader Nextstrain ecosystem more generally).

This is my main point with this issue. Currently to properly handle various Augur outputs requires knowing a priori which particular variant of TSVs it produces. It is not straightforward or clear, much less documented. We should pick one and stick to it and document it (and patterns for working with them).

From my viewpoint, since Python's csv module is so handy and used broadly in Augur but (AFAICT) can't actually produce IANA TSVs, the easiest option is to standardize our TSVs as RFC 4180 CSV-like TSVs.

genehack · 2024-09-04T18:09:54Z

From my viewpoint, since Python's csv module is so handy and used broadly in Augur but (AFAICT) can't actually produce IANA TSVs, the easiest option is to standardize our TSVs as RFC 4180 CSV-like TSVs.

+1

This construction reads much clearer and cleaner. Moves the Nextclade field map directly and more conveniently into the YAML config instead of referencing a separate TSV file. Putting the field map into a separate file seemed to be only for the sake of the --kv-file (-k) interface provided by `cvstk rename2`, which we're no longer using here. For backwards compatibility, configs that reference a TSV file are still supported and will be handled on-the-fly. Note that `augur curate` commands currently emit CSV-like TSVs that are limited to be IANA-like¹ such that parsing them with tsv-utils is most appropriate, hence the switch from `csvtk cut` to `tsv-select`. ¹ See <nextstrain/augur#1566>.

This construction reads a bit clearer and cleaner. It's also a good example of how to use `augur merge`. The limitation on non-seekable streams means the rule now uses additional transient disk space, but it typically shouldn't be an issue. The way Augur's slow start up time impacts `augur merge` also contributes to a longer rule execution time, but it should be negligible in the context of the larger workflow and presumably we'll fix the slow start up eventually.¹ The output is semantically identical but has some syntactic changes re: quoting. It's worth noting that the pre-existing TSV format was _not_ IANA TSV, despite it (still) being treated as such in a few places, but was (and remains) a CSV-like TSV with some quoted fields (and some mangled quotes², e.g. the "institution" column for accession KJ556895). We really need to sort out our TSV formats³, but that's for a larger project. ¹ <nextstrain/augur#1628> ² <nextstrain/augur#1565> ³ <nextstrain/augur#1566>

This construction reads a bit clearer and cleaner. It's also a good example of how to use `augur merge`. The limitation on non-seekable streams means the workflow now uses additional transient disk space, but it typically shouldn't be an issue. The way Augur's slow start up time impacts `augur merge` also contributes to a longer rule execution time, but it should be negligible in the context of the larger workflow and presumably we'll fix the slow start up eventually.¹ The output is semantically identical but has some syntactic changes re: quoting. It's worth noting that the pre-existing TSV format was _not_ IANA TSV, despite it (still) being treated as such in a few places, but was (and remains) a CSV-like TSV with some quoted fields (and some mangled quotes², e.g. the "institution" column for accession KJ556895). We really need to sort out our TSV formats³, but that's for a larger project. ¹ <nextstrain/augur#1628> ² <nextstrain/augur#1565> ³ <nextstrain/augur#1566>

joverlee521 · 2024-09-11T20:18:28Z

Discussion in nextstrain/measles#52 (comment), prompted me to look into how Nextclade writes TSVs. I assume it's producing CSV-like TSVs since it's using the csv::WriteBuilder and only specifies the delimiter without changing any of the other defaults such as quoting.

tsibley · 2024-09-12T20:12:33Z

@joverlee521 Ah! Yes, I concur. I'd looked into that a week or two ago and forgot to write it down/forgot I did! orz

But I came to the same conclusion as you. I also validated it by finding CSV quoting in the ncov/open nextclade.tsv.

tsibley · 2024-09-17T19:00:34Z

This seems settled on emitting RFC 4180 CSV-like TSVs.

We should

Ensure all Augur TSV output is compliant
Update workflows and other places that we manipulate TSVs to properly handle them

To properly handle CSV-like TSVs, I believe our rules should be

Use csvtk. The --lazy (-l) option and fix-quotes/del-quotes commands should not be necessary and should be avoided.
Use tsv-utils, but only after first passing input thru csv2tsv --csv-delim $'\t' (to convert from CSV-like TSV to IANA TSV, possibly lossily) and last passing output thru csvtk fix-quotes --tabs (to convert back from IANA TSV to CSV-like TSV).

This construction reads much clearer and cleaner. Moves the Nextclade field map directly and more conveniently into the YAML config instead of referencing a separate TSV file. Putting the field map into a separate file seemed to be only for the sake of the --kv-file (-k) interface provided by `cvstk rename2`, which we're no longer using here. Backwards compatibility with configs that name a TSV file is not preserved since this pathogen-repo-guide is expected to be used to stamp out new repos, and we don't have any particular process/plan for how to update previously stamped out repos. Note that `augur curate` commands currently emit CSV-like TSVs that are limited to be IANA-like¹ such that parsing them with tsv-utils is most appropriate, hence the switch from `csvtk cut` to `tsv-select`. ¹ See <nextstrain/augur#1566>. Ported-from: <nextstrain/measles@faebd64> Related-to: <nextstrain/measles#52> Related-to: <#65>

This construction reads a bit clearer and cleaner. It's also a good example of how to use `augur merge`. The limitation on non-seekable streams means the workflow now uses additional transient disk space, but it typically shouldn't be an issue. The way Augur's slow start up time impacts `augur merge` also contributes to a longer rule execution time, but it should be negligible in the context of the larger workflow and presumably we'll fix the slow start up eventually.¹ The output is semantically identical but has some syntactic changes re: quoting. It's worth noting that the pre-existing TSV format was _not_ IANA TSV, despite it (still) being treated as such in a few places, but was (and remains) a CSV-like TSV with some quoted fields. We really need to sort out our TSV formats³, but that's for a larger project. ¹ <nextstrain/augur#1628> ² <nextstrain/augur#1565> ³ <nextstrain/augur#1566> Ported-from: <nextstrain/measles@4d73b7f> Related-to: <nextstrain/measles#52> Related-to: <#65>

tsibley mentioned this issue Aug 2, 2024

Fix curate internal quotes take 2 #1565

Open

4 tasks

tsibley mentioned this issue Aug 2, 2024

augur merge #1563

Merged

7 tasks

tsibley mentioned this issue Sep 16, 2024

Update Nextclade metadata merge to use augur curate rename and augur merge nextstrain/measles#52

Merged

1 task

joverlee521 mentioned this issue Oct 2, 2024

ingest: Properly handle CSV-like TSVs nextstrain/pathogen-repo-guide#66

Open

joverlee521 mentioned this issue Oct 4, 2024

Group segments by strains nextstrain/oropouche#18

Merged

3 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Pick a meaning of TSV and use it consistently #1566

Pick a meaning of TSV and use it consistently #1566

tsibley commented Aug 2, 2024 •

edited

Loading

genehack commented Aug 2, 2024

tsibley commented Aug 2, 2024 •

edited

Loading

corneliusroemer commented Aug 3, 2024 •

edited

Loading

tsibley commented Aug 13, 2024

tsibley commented Aug 13, 2024

tsibley commented Sep 3, 2024

genehack commented Sep 4, 2024

joverlee521 commented Sep 11, 2024

tsibley commented Sep 12, 2024

tsibley commented Sep 17, 2024 •

edited

Loading

Pick a meaning of TSV and use it consistently #1566

Pick a meaning of TSV and use it consistently #1566

Comments

tsibley commented Aug 2, 2024 • edited Loading

genehack commented Aug 2, 2024

tsibley commented Aug 2, 2024 • edited Loading

corneliusroemer commented Aug 3, 2024 • edited Loading

tsibley commented Aug 13, 2024

tsibley commented Aug 13, 2024

tsibley commented Sep 3, 2024

genehack commented Sep 4, 2024

joverlee521 commented Sep 11, 2024

tsibley commented Sep 12, 2024

tsibley commented Sep 17, 2024 • edited Loading

tsibley commented Aug 2, 2024 •

edited

Loading

tsibley commented Aug 2, 2024 •

edited

Loading

corneliusroemer commented Aug 3, 2024 •

edited

Loading

tsibley commented Sep 17, 2024 •

edited

Loading