-
Notifications
You must be signed in to change notification settings - Fork 11
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Use "accession" column as ID column #12
Conversation
Move uncompressed sequence and metadata files to the `data` folder instead of the `results` folder. This change aligns with the monkeypox and zika pipelines, smoothly allowing the example_data input files to be in either compressed or uncompressed format during automated checks.
5090ae4
to
7e34cb6
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Code looks good (and works!).
I don't think strain_id_field
is the best name, but this follows what we do in mpx so I think consistency is the best approach here and so I'm happy for it to stay.
During testing of this it's clear that the "strain" column in the Dengue metadata needs improvement, and on the face of it I'd suggest keeping accession as the exported node name for auspice. However I'm not experienced enough in Dengue to know what to do here, so I'll leave the decision here to your judgement. Perhaps this is a data curation problem and can be fixed by future PRs? (Perhaps it's already fixed in new_ingest
?) As examples:
- We have "strain" names such as
2.36E+11
,00697/11
,DBS1
. - Capatilisation is irregular, e.g. DENV2 & denv2
- Sometimes
/
is used as the word-separator, sometimes|
- A lot of strain names are duplicates. When the dataset is loaded into Auspice you'll see error messages in the console such as
Tree node detected with a duplicate name. Changing 'New_Guinea_C' to 'New_Guinea_C_670d9b' and continuing...
, but the tree will still be displayed and almost all functionality will remain.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
non-blocking
Noting that this is the third copy of this script (after monkeypox and rsv).
Seems like with GenBank data, this will be a common pattern. Maybe we should push on nextstrain/augur#1264 or nextstrain/auspice#1668 to solve this across the board.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Huh, agree that pushing the linked issue and PR would be a more streamlined solution for the future.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @jameshadfield for the review!
Thanks for the context on how strain names affect Auspice, along with the console messages! Completely agree, yes, I'm hoping to discuss and address the "strain" column transformations in future PRs. |
After a quick check-in with @joverlee521, I'm going to merge this for the time being. I'll plan for subsequent issues and PRs to better address outstanding discussion points. |
Description of proposed changes
The main purpose of this commit is to ID records by "accession" to directly match changes in nextstrain/mpox@927ad6c
Additionally, the uncompressed sequence and metadata files are moved to the
data
folder instead of theresults
folder to align with the monkeypox and zika pipelines.Related issue(s)
Checklist