-
Notifications
You must be signed in to change notification settings - Fork 128
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add a parameter to augur parse to specify a different record ID for output sequences FASTA #1403
Conversation
85bf1bf
to
114c901
Compare
114c901
to
d1aef81
Compare
0e07091
to
7b45b2f
Compare
Codecov ReportAll modified and coverable lines are covered by tests ✅
Additional details and impacted files@@ Coverage Diff @@
## master #1403 +/- ##
=======================================
Coverage 67.46% 67.47%
=======================================
Files 69 69
Lines 7484 7474 -10
Branches 1840 1840
=======================================
- Hits 5049 5043 -6
+ Misses 2161 2159 -2
+ Partials 274 272 -2 ☔ View full report in Codecov by Sentry. |
augur/parse.py
Outdated
strain_key = 'strain' | ||
else: | ||
strain_key = None | ||
for possible_id in [args.output_id_field, DEFAULT_ID_COLUMNS]: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I like @victorlin's suggested alternative to this block, but I wanted to mention that this code on line 167 doesn't do what you want it to do, @j23414. Instead of looping through a list of the user-provided output id field and the default ids, this will loop through the user-provided field and the tuple of defaults:
>>> DEFAULT_ID_COLUMNS = ("strain", "name")
>>> output_id_field = "accession"
>>> for possible_id in [output_id_field, DEFAULT_ID_COLUMNS]:
... print(possible_id)
...
accession
('strain', 'name')
You could make this work as expected a couple of ways including iterable unpacking with the *
operator or list concatenation.
# Iterable unpacking with *
>>> for possible_id in [output_id_field, *DEFAULT_ID_COLUMNS]:
... print(possible_id)
...
accession
strain
name
# List concatenation
>>> for possible_id in [output_id_field] + list(DEFAULT_ID_COLUMNS):
... print(possible_id)
...
accession
strain
name
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
gah, thanks! 😅
7b45b2f
to
8367e59
Compare
fc6c6cc
to
f587119
Compare
40e9a00
to
9810855
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good pending some final touch ups
e41dbb5
to
d313f58
Compare
Originally augur parse first checks for 'name' then 'strain' and then the first field to figure out the sequence id to use. This test checks the default id column order.
During augur parse, instead of hardcoding a check for a 'name' field and then a 'strain' field to use as a sequence ID, reuse and check against the io.metadata DEFAULT_ID_COLUMNS list to be more consistent with the rest of augur. This will result in a breaking behavior change in that parse is now checking for 'strain' before 'name' as the default ID column. However, the argument to be more consistent with the rest of augur is that the DEFAULT_ID_COLUMNS list is already used in other parts of augur and augur parse should be no different.
Include a `--output-id-field` parameter in augur parse to indicate the ID field for the output sequences file, such as using "accession" instead of "strain". It is important to note that the `--output-id-field` parameter is not required, and augur parse will fall back to the DEFAULT_ID_COLUMNS (e.g. ('strain','name')) if it is not present. If none of the DEFAULT_ID_COLUMNS are present in the fields, fall back to using the first field. The `--output-id-field` parameter is designed to accept a single field name, not multiple. User are required to provide a `--fields col1 col2 strain accession` argument in the same invocation. It seems reasonable to expect the user to choose a specific field name for the ID. To prevent unintended behaviors, if `--output-id-field` is not present in `--fields` (e.g., due to a typo), augur parse will raise an error instead of falling back to DEFAULT_ID_COLUMNS.
… future. Instead of a refactor and breaking change from 8dc0694, this commit adds a deprecation warning to the `parse` step that the default `id` field used will be reordered from ('name', 'strain') to ('strain', 'name'). This will give users time to update their scripts and workflows with an `--output-id-field 'name'` before the change is made. co-authored-by: Victor Lin <13424970+victorlin@users.noreply.github.com>
da2a23c
to
9d96cad
Compare
Description of proposed changes
As discussed in #1402, include a parameter in augur parse to indicate the ID field for the output sequences file, such as using "accession" instead of "strain" or "name."
Open to other suggestions for parameter names or let me know if I missed adding a test somewhere.
Related issue(s)
--metadata-id-columns
param to augur parse? #1402Checklist