Use "accession" column as ID column #12

j23414 · 2023-10-11T19:55:16Z

Description of proposed changes

The main purpose of this commit is to ID records by "accession" to directly match changes in nextstrain/mpox@927ad6c

Additionally, the uncompressed sequence and metadata files are moved to the data folder instead of the results folder to align with the monkeypox and zika pipelines.

Related issue(s)

This change is a prerequisite to Generalize Ingest #6

Checklist

Checks pass
Can run a manual check

git clone https://github.com/nextstrain/dengue.git
cd dengue
git checkout id_by_accession

nextstrain build \
  --aws-batch \
  --aws-batch-cpus 4 \
  --aws-batch-memory 7200 . --jobs 4

Move uncompressed sequence and metadata files to the `data` folder instead of the `results` folder. This change aligns with the monkeypox and zika pipelines and allows the example_data input files to be in either compressed or uncompressed format during automated checks.

Snakefile

jameshadfield

Code looks good (and works!).

I don't think strain_id_field is the best name, but it follows what we do in mpx. Consistency is the best approach here, so I'm happy for it to stay.

During testing of this it's clear that the "strain" column in the Dengue metadata needs improvement, and on the face of it I'd suggest keeping accession as the exported node name for auspice. However I'm not experienced enough in Dengue to know what to do here, so I'll leave the decision here to your judgement. Perhaps this is a data curation problem and can be fixed by future PRs? (Perhaps it's already fixed in new_ingest?) As examples:

We have "strain" names such as 2.36E+11, 00697/11, DBS1.
Capitalisation is irregular, e.g. DENV2 & denv2
Sometimes / is used as the word-separator, sometimes |
A lot of strain names are duplicates. When the dataset is loaded into Auspice you'll see error messages in the console such as `Tree node detected with a duplicate name. Changing 'New_Guinea_C' to 'New_Guinea_C_670d9b' and continuing...`, but the tree will still be displayed and almost all functionality will remain.
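
The problems listed above could be surfaced with a quick audit of the "strain" column. The following is a minimal sketch under stated assumptions and is not part of this PR: the function name `audit_strain_names` and the sample values are hypothetical, and the checks cover only case-insensitive duplicates, mixed separators, and names that look like scientific notation.

```python
import re
from collections import Counter

# Matches values such as "2.36E+11", which usually mean a spreadsheet
# converted an identifier into scientific notation.
SCI_NOTATION = re.compile(r"^\d+(\.\d+)?[eE][+-]?\d+$")


def audit_strain_names(names):
    """Return a dict describing common problems in a list of strain names.

    Hypothetical helper for illustration; not part of the dengue workflow.
    """
    # Count names case-insensitively so "DENV2" and "denv2" collide.
    counts = Counter(name.lower() for name in names)
    return {
        "duplicates": sorted(n for n, c in counts.items() if c > 1),
        "scientific_notation": [n for n in names if SCI_NOTATION.match(n)],
        "uses_slash": [n for n in names if "/" in n],
        "uses_pipe": [n for n in names if "|" in n],
    }


if __name__ == "__main__":
    # Sample values are illustrative, loosely based on the examples above.
    sample = ["2.36E+11", "00697/11", "DBS1", "New_Guinea_C", "New_Guinea_C"]
    for problem, hits in audit_strain_names(sample).items():
        print(problem, hits)
```

A report like this only flags candidates for curation; deciding how to fix each name would still need Dengue-specific judgement.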

joverlee521 · 2023-10-12T17:37:20Z

bin/set_final_strain_name.py

non-blocking

Noting that this is the third copy of this script (after monkeypox and rsv).

Seems like with GenBank data, this will be a common pattern. Maybe we should push on nextstrain/augur#1264 or nextstrain/auspice#1668 to solve this across the board.

j23414 · 2023-10-12

Huh, agree that pushing the linked issue and PR would be a more streamlined solution for the future.

victorlin · 2024-08-14

https://github.com/nextstrain/private/issues/131
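
For readers unfamiliar with the pattern discussed in this thread: when records are identified by accession, one way to restore readable names is to walk the exported Auspice JSON tree and swap each node name for a display name looked up in the metadata. The sketch below illustrates that idea only. It is not the contents of bin/set_final_strain_name.py, and the function name `rename_nodes`, the `name_by_accession` mapping, and the sample accessions are hypothetical.

```python
def rename_nodes(node, name_by_accession):
    """Recursively replace node names in an Auspice-style tree, in place.

    Nodes whose name is missing from the mapping (for example internal
    nodes) keep their existing name. Hypothetical helper for illustration.
    """
    node["name"] = name_by_accession.get(node["name"], node["name"])
    for child in node.get("children", []):
        rename_nodes(child, name_by_accession)
    return node


if __name__ == "__main__":
    # Made-up tree and accessions; a real tree would come from the
    # "tree" key of an Auspice JSON file.
    tree = {
        "name": "NODE_0000001",
        "children": [{"name": "ACC0001"}, {"name": "ACC0002"}],
    }
    display_names = {"ACC0001": "New_Guinea_C", "ACC0002": "DBS1"}
    print(rename_nodes(tree, display_names))
```

Because the lookup is keyed on accession, duplicate display names in the metadata would still produce duplicate node names after renaming, which is the situation described in the review above.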

j23414 · 2023-10-12T20:01:51Z

Thanks @jameshadfield for the review!

the "strain" column in the Dengue metadata needs improvement

Thanks for the context on how strain names affect Auspice, along with the console messages! Completely agree, yes, I'm hoping to discuss and address the "strain" column transformations in future PRs.

j23414 · 2023-10-12T20:08:10Z

After a quick check-in with @joverlee521, I'm going to merge this for the time being. I'll plan for subsequent issues and PRs to better address outstanding discussion points.

ID strain by accession

8ab810f

j23414 marked this pull request as draft October 11, 2023 20:00

j23414 force-pushed the id_by_accession branch from 5090ae4 to 7e34cb6 Compare October 11, 2023 20:54

j23414 marked this pull request as ready for review October 11, 2023 20:57

j23414 requested a review from a team October 11, 2023 20:58

jameshadfield self-assigned this Oct 11, 2023

victorlin self-assigned this Oct 11, 2023

victorlin reviewed Oct 11, 2023

Snakefile (resolved)

j23414 changed the title ~~Use "accession" column as ID column directly~~ Use "accession" column as ID column Oct 11, 2023

jameshadfield approved these changes Oct 12, 2023

joverlee521 reviewed Oct 12, 2023

j23414 merged commit 75beaf1 into main Oct 12, 2023
6 checks passed

j23414 deleted the id_by_accession branch October 12, 2023 20:20

joverlee521 mentioned this pull request Jan 16, 2024

Persephone nextstrain/zika#28

Merged

1 task

victorlin mentioned this pull request Feb 7, 2024

Bug: Update dropped strains file to list accession instead of strain #24

Closed

victorlin mentioned this pull request Nov 7, 2024

Replace set_final_strain_name.py nextstrain/public#5

Open

7 tasks
