Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Use "accession" column as ID column #12

Merged
merged 2 commits into from
Oct 12, 2023
Merged

Use "accession" column as ID column #12

merged 2 commits into from
Oct 12, 2023

Conversation

j23414
Copy link
Contributor

@j23414 j23414 commented Oct 11, 2023

Description of proposed changes

The main purpose of this commit is to ID records by "accession" to directly match changes in nextstrain/mpox@927ad6c

Additionally, the uncompressed sequence and metadata files are moved to the data folder instead of the results folder to align with the monkeypox and zika pipelines.

Related issue(s)

Checklist

  • Checks pass
  • Can run a manual check
git clone https://github.com/nextstrain/dengue.git
cd dengue
git checkout id_by_accession

nextstrain build \
  --aws-batch \
  --aws-batch-cpus 4 \
  --aws-batch-memory 7200 . --jobs 4

@j23414 j23414 marked this pull request as draft October 11, 2023 20:00
Move uncompressed sequence and metadata files to the `data` folder
instead of the `results` folder. This change aligns with the monkeypox
and zika pipelines, smoothly allowing the example_data input files to be
in either compressed or uncompressed format during automated checks.
@j23414 j23414 marked this pull request as ready for review October 11, 2023 20:57
@j23414 j23414 requested a review from a team October 11, 2023 20:58
@jameshadfield jameshadfield self-assigned this Oct 11, 2023
@victorlin victorlin self-assigned this Oct 11, 2023
Snakefile Show resolved Hide resolved
@j23414 j23414 changed the title Use "accession" column as ID column directly Use "accession" column as ID column Oct 11, 2023
Copy link
Member

@jameshadfield jameshadfield left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code looks good (and works!).

I don't think strain_id_field is the best name, but this follows what we do in mpx so I think consistency is the best approach here and so I'm happy for it to stay.

During testing of this it's clear that the "strain" column in the Dengue metadata needs improvement, and on the face of it I'd suggest keeping accession as the exported node name for auspice. However I'm not experienced enough in Dengue to know what to do here, so I'll leave the decision here to your judgement. Perhaps this is a data curation problem and can be fixed by future PRs? (Perhaps it's already fixed in new_ingest?) As examples:

  • We have "strain" names such as 2.36E+11, 00697/11, DBS1.
  • Capatilisation is irregular, e.g. DENV2 & denv2
  • Sometimes / is used as the word-separator, sometimes |
  • A lot of strain names are duplicates. When the dataset is loaded into Auspice you'll see error messages in the console such as Tree node detected with a duplicate name. Changing 'New_Guinea_C' to 'New_Guinea_C_670d9b' and continuing..., but the tree will still be displayed and almost all functionality will remain.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

non-blocking

Noting that this is the third copy of this script (after monkeypox and rsv).

Seems like with GenBank data, this will be a common pattern. Maybe we should push on nextstrain/augur#1264 or nextstrain/auspice#1668 to solve this across the board.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Huh, agree that pushing the linked issue and PR would be a more streamlined solution for the future.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@j23414
Copy link
Contributor Author

j23414 commented Oct 12, 2023

Thanks @jameshadfield for the review!

the "strain" column in the Dengue metadata needs improvement

Thanks for the context on how strain names affect Auspice, along with the console messages! Completely agree, yes, I'm hoping to discuss and address the "strain" column transformations in future PRs.

@j23414
Copy link
Contributor Author

j23414 commented Oct 12, 2023

After a quick check-in with @joverlee521, I'm going to merge this for the time being. I'll plan for subsequent issues and PRs to better address outstanding discussion points.

@j23414 j23414 merged commit 75beaf1 into main Oct 12, 2023
6 checks passed
@j23414 j23414 deleted the id_by_accession branch October 12, 2023 20:20
@joverlee521 joverlee521 mentioned this pull request Jan 16, 2024
1 task
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants