Add a parameter to augur parse to specify a different record ID for output sequences FASTA #1403

j23414 · 2024-01-29T23:14:03Z

Description of proposed changes

As discussed in #1402, include a parameter in augur parse to indicate the ID field for the output sequences file, such as using "accession" instead of "strain" or "name."

Open to other suggestions for parameter names or let me know if I missed adding a test somewhere.

Related issue(s)

Feat request: add --metadata-id-columns param to augur parse? #1402

Checklist

Checks pass
If making user-facing changes, add a message in CHANGES.md summarizing the changes in this PR

augur/parse.py

codecov · 2024-01-30T00:17:37Z

Codecov Report

All modified and coverable lines are covered by tests ✅

Comparison is base (499f0e9) 67.46% compared to head (d313f58) 67.47%.

❗ Current head d313f58 differs from pull request most recent head 9d96cad. Consider uploading reports for the commit 9d96cad to get more accurate results

Additional details and impacted files

@@           Coverage Diff           @@
##           master    #1403   +/-   ##
=======================================
  Coverage   67.46%   67.47%           
=======================================
  Files          69       69           
  Lines        7484     7474   -10     
  Branches     1840     1840           
=======================================
- Hits         5049     5043    -6     
+ Misses       2161     2159    -2     
+ Partials      274      272    -2

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

huddlej · 2024-01-31T00:14:38Z

augur/parse.py

-        strain_key = 'strain'
-    else:
+    strain_key = None
+    for possible_id in [args.output_id_field, DEFAULT_ID_COLUMNS]:


I like @victorlin's suggested alternative to this block, but I wanted to mention that this code on line 167 doesn't do what you want it to do, @j23414. Instead of looping through a list of the user-provided output id field and the default ids, this will loop through the user-provided field and the tuple of defaults:

>>> DEFAULT_ID_COLUMNS = ("strain", "name") >>> output_id_field = "accession" >>> for possible_id in [output_id_field, DEFAULT_ID_COLUMNS]: ... print(possible_id) ... accession ('strain', 'name')

You could make this work as expected a couple of ways including iterable unpacking with the * operator or list concatenation.

# Iterable unpacking with * >>> for possible_id in [output_id_field, *DEFAULT_ID_COLUMNS]: ... print(possible_id) ... accession strain name # List concatenation >>> for possible_id in [output_id_field] + list(DEFAULT_ID_COLUMNS): ... print(possible_id) ... accession strain name

gah, thanks! 😅

augur/parse.py

victorlin

Looks good pending some final touch ups

DEPRECATED.md

augur/parse.py

CHANGES.md

Originally augur parse first checks for 'name' then 'strain' and then the first field to figure out the sequence id to use. This test checks the default id column order.

During augur parse, instead of hardcoding a check for a 'name' field and then a 'strain' field to use as a sequence ID, reuse and check against the io.metadata DEFAULT_ID_COLUMNS list to be more consistent with the rest of augur. This will result in a breaking behavior change in that parse is now checking for 'strain' before 'name' as the default ID column. However, the argument to be more consistent with the rest of augur is that the DEFAULT_ID_COLUMNS list is already used in other parts of augur and augur parse should be no different.

Include a `--output-id-field` parameter in augur parse to indicate the ID field for the output sequences file, such as using "accession" instead of "strain". It is important to note that the `--output-id-field` parameter is not required, and augur parse will fall back to the DEFAULT_ID_COLUMNS (e.g. ('strain','name')) if it is not present. If none of the DEFAULT_ID_COLUMNS are present in the fields, fall back to using the first field. The `--output-id-field` parameter is designed to accept a single field name, not multiple. User are required to provide a `--fields col1 col2 strain accession` argument in the same invocation. It seems reasonable to expect the user to choose a specific field name for the ID. To prevent unintended behaviors, if `--output-id-field` is not present in `--fields` (e.g., due to a typo), augur parse will raise an error instead of falling back to DEFAULT_ID_COLUMNS.

… future. Instead of a refactor and breaking change from 8dc0694, this commit adds a deprecation warning to the `parse` step that the default `id` field used will be reordered from ('name', 'strain') to ('strain', 'name'). This will give users time to update their scripts and workflows with an `--output-id-field 'name'` before the change is made. co-authored-by: Victor Lin <13424970+victorlin@users.noreply.github.com>

j23414 force-pushed the parse-output-metadata-id-columns branch from 85bf1bf to 114c901 Compare January 29, 2024 23:21

j23414 changed the title ~~Add output-metadata-id-field to augur parse~~ Add a parameter to augur parse to specify a different record ID for output sequences FASTA Jan 29, 2024

j23414 force-pushed the parse-output-metadata-id-columns branch from 114c901 to d1aef81 Compare January 29, 2024 23:26

j23414 marked this pull request as ready for review January 29, 2024 23:27

victorlin requested changes Jan 29, 2024

View reviewed changes

augur/parse.py Outdated Show resolved Hide resolved

j23414 force-pushed the parse-output-metadata-id-columns branch from 0e07091 to 7b45b2f Compare January 30, 2024 00:06

huddlej reviewed Jan 31, 2024

View reviewed changes

j23414 force-pushed the parse-output-metadata-id-columns branch from 7b45b2f to 8367e59 Compare January 31, 2024 18:46

j23414 linked an issue Jan 31, 2024 that may be closed by this pull request

Feat request: add --metadata-id-columns param to augur parse? #1402

Closed

victorlin requested changes Feb 2, 2024

View reviewed changes

augur/parse.py Outdated Show resolved Hide resolved

j23414 force-pushed the parse-output-metadata-id-columns branch 2 times, most recently from fc6c6cc to f587119 Compare February 2, 2024 23:12

victorlin requested changes Feb 7, 2024

View reviewed changes

augur/parse.py Outdated Show resolved Hide resolved

j23414 marked this pull request as draft February 7, 2024 21:03

j23414 marked this pull request as ready for review February 7, 2024 23:31

j23414 force-pushed the parse-output-metadata-id-columns branch from 40e9a00 to 9810855 Compare February 7, 2024 23:40

victorlin approved these changes Feb 7, 2024

View reviewed changes

DEPRECATED.md Outdated Show resolved Hide resolved

augur/parse.py Show resolved Hide resolved

victorlin reviewed Feb 8, 2024

View reviewed changes

CHANGES.md Outdated Show resolved Hide resolved

j23414 force-pushed the parse-output-metadata-id-columns branch 3 times, most recently from e41dbb5 to d313f58 Compare February 9, 2024 10:02

j23414 and others added 5 commits February 12, 2024 13:38

Add test for default id columns

4a12746

Originally augur parse first checks for 'name' then 'strain' and then the first field to figure out the sequence id to use. This test checks the default id column order.

Update CHANGELOG

9d96cad

j23414 force-pushed the parse-output-metadata-id-columns branch from da2a23c to 9d96cad Compare February 12, 2024 18:40

j23414 merged commit dd8a1cb into master Feb 12, 2024
19 checks passed

j23414 deleted the parse-output-metadata-id-columns branch February 12, 2024 18:43

victorlin mentioned this pull request Sep 10, 2024

parse: Prefer strain over name as sequence ID field #1629

Merged

4 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add a parameter to augur parse to specify a different record ID for output sequences FASTA #1403

Add a parameter to augur parse to specify a different record ID for output sequences FASTA #1403

j23414 commented Jan 29, 2024 •

edited

Loading

codecov bot commented Jan 30, 2024 •

edited

Loading

huddlej Jan 31, 2024

j23414 Jan 31, 2024

victorlin left a comment

Add a parameter to augur parse to specify a different record ID for output sequences FASTA #1403

Add a parameter to augur parse to specify a different record ID for output sequences FASTA #1403

Conversation

j23414 commented Jan 29, 2024 • edited Loading

Description of proposed changes

Related issue(s)

Checklist

codecov bot commented Jan 30, 2024 • edited Loading

Codecov Report

huddlej Jan 31, 2024

Choose a reason for hiding this comment

j23414 Jan 31, 2024

Choose a reason for hiding this comment

victorlin left a comment

Choose a reason for hiding this comment

j23414 commented Jan 29, 2024 •

edited

Loading

codecov bot commented Jan 30, 2024 •

edited

Loading