[on hold] dedup ncbi segments #93

Draft: wants to merge 3 commits into base: master
Conversation

jameshadfield
Member

WIP - here for discussion with @joverlee521

The only "disagreements" (which I haven't yet resolved) are a handful of strains which have multiple sequences for (all) segments. So that's reassuring!

The phylo workflow hasn't yet been updated to use the new metadata format.

The DAG is a bit simpler (before: above, after: below):

[image: workflow DAG, before and after]

@jameshadfield
Member Author

jameshadfield commented Oct 7, 2024

Here are the 3 (yes, only 3) strains which were dropped:

Strain 'A/redheadduck/NorthCarolina/W24-83A/2024' had multiple accessions for segment pb2. Accessions: PP761255, PP761574. Skipping this segment.
Strain 'A/redheadduck/NorthCarolina/W24-83A/2024' had multiple accessions for segment pb1. Accessions: PP761260, PP761572. Skipping this segment.
Strain 'A/redheadduck/NorthCarolina/W24-83A/2024' had multiple accessions for segment pa. Accessions: PP761262, PP761577. Skipping this segment.
Strain 'A/redheadduck/NorthCarolina/W24-83A/2024' had multiple accessions for segment ha. Accessions: PP761257, PP761548, PP761557, PP761576. Skipping this segment.
Strain 'A/redheadduck/NorthCarolina/W24-83A/2024' had multiple accessions for segment np. Accessions: PP761261, PP761550, PP761553, PP761571. Skipping this segment.
Strain 'A/redheadduck/NorthCarolina/W24-83A/2024' had multiple accessions for segment na. Accessions: PP761256, PP761552, PP761555, PP761578. Skipping this segment.
Strain 'A/redheadduck/NorthCarolina/W24-83A/2024' had multiple accessions for segment mp. Accessions: PP761259, PP761551, PP761554, PP761573. Skipping this segment.
Strain 'A/redheadduck/NorthCarolina/W24-83A/2024' had multiple accessions for segment ns. Accessions: PP761258, PP761549, PP761556, PP761575. Skipping this segment.
Strain 'A/redheadduck/NorthCarolina/W24-83A/2024' had zero or multiple accessions for all segments. Dropping this entire strain.

Strain 'A/Canadagoose/NorthCarolina/W24-90A/2024' had multiple accessions for segment pb2. Accessions: PP761569, PP766982. Skipping this segment.
Strain 'A/Canadagoose/NorthCarolina/W24-90A/2024' had multiple accessions for segment pb1. Accessions: PP761570, PP766984. Skipping this segment.
Strain 'A/Canadagoose/NorthCarolina/W24-90A/2024' had multiple accessions for segment pa. Accessions: PP761563, PP766987. Skipping this segment.
Strain 'A/Canadagoose/NorthCarolina/W24-90A/2024' had multiple accessions for segment ha. Accessions: PP761566, PP766985. Skipping this segment.
Strain 'A/Canadagoose/NorthCarolina/W24-90A/2024' had multiple accessions for segment np. Accessions: PP761567, PP766983. Skipping this segment.
Strain 'A/Canadagoose/NorthCarolina/W24-90A/2024' had multiple accessions for segment na. Accessions: PP761568, PP766981. Skipping this segment.
Strain 'A/Canadagoose/NorthCarolina/W24-90A/2024' had multiple accessions for segment mp. Accessions: PP761564, PP766980. Skipping this segment.
Strain 'A/Canadagoose/NorthCarolina/W24-90A/2024' had multiple accessions for segment ns. Accessions: PP761565, PP766986. Skipping this segment.
Strain 'A/Canadagoose/NorthCarolina/W24-90A/2024' had zero or multiple accessions for all segments. Dropping this entire strain.

Strain 'A/sanderling/Virginia/W24-190K/2024' had multiple accessions for segment pb2. Accessions: PP862906, PQ367318. Skipping this segment.
Strain 'A/sanderling/Virginia/W24-190K/2024' had multiple accessions for segment pb1. Accessions: PP862905, PQ367313. Skipping this segment.
Strain 'A/sanderling/Virginia/W24-190K/2024' had multiple accessions for segment pa. Accessions: PP862901, PQ367316. Skipping this segment.
Strain 'A/sanderling/Virginia/W24-190K/2024' had multiple accessions for segment ha. Accessions: PP862902, PQ367314. Skipping this segment.
Strain 'A/sanderling/Virginia/W24-190K/2024' had multiple accessions for segment np. Accessions: PP862907, PQ367312. Skipping this segment.
Strain 'A/sanderling/Virginia/W24-190K/2024' had multiple accessions for segment na. Accessions: PP862903, PQ367315. Skipping this segment.
Strain 'A/sanderling/Virginia/W24-190K/2024' had multiple accessions for segment mp. Accessions: PP862904, PQ367317. Skipping this segment.
Strain 'A/sanderling/Virginia/W24-190K/2024' had multiple accessions for segment ns. Accessions: PP862908, PQ367311. Skipping this segment.
Strain 'A/sanderling/Virginia/W24-190K/2024' had zero or multiple accessions for all segments. Dropping this entire strain.
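For anyone reading along, the rule these log lines imply can be sketched roughly as below. This is a minimal illustration and not the script in this PR: the function name `dedup_segments`, the `(strain, segment, accession)` record shape, and the `SEGMENTS` list are all assumptions made for the example. A segment with exactly one accession is kept, a segment with several is skipped, and a strain left with no usable segment is dropped.

```python
from collections import defaultdict

# Assumed segment order, taken from the order of the log lines above.
SEGMENTS = ["pb2", "pb1", "pa", "ha", "np", "na", "mp", "ns"]


def dedup_segments(records, segments=SEGMENTS):
    """Hypothetical helper: group accessions by strain and segment.

    `records` is an iterable of (strain, segment, accession) tuples.
    Returns (kept, messages), where `kept` maps strain -> {segment: accession}
    and `messages` mimics the log lines shown in this comment.
    """
    by_strain = defaultdict(lambda: defaultdict(list))
    for strain, segment, accession in records:
        by_strain[strain][segment].append(accession)

    kept, messages = {}, []
    for strain, seg_map in by_strain.items():
        unique = {}
        for segment in segments:
            accessions = sorted(set(seg_map.get(segment, [])))
            if len(accessions) == 1:
                # Exactly one accession: unambiguous, keep it.
                unique[segment] = accessions[0]
            elif len(accessions) > 1:
                # Ambiguous: skip this segment rather than guess.
                messages.append(
                    f"Strain '{strain}' had multiple accessions for segment "
                    f"{segment}. Accessions: {', '.join(accessions)}. "
                    "Skipping this segment."
                )
        if unique:
            kept[strain] = unique
        else:
            # No segment had exactly one accession.
            messages.append(
                f"Strain '{strain}' had zero or multiple accessions for all "
                "segments. Dropping this entire strain."
            )
    return kept, messages
```

Called with two `pb2` records for 'A/sanderling/Virginia/W24-190K/2024' (PP862906 and PQ367318) and nothing else for that strain, this sketch emits the same "multiple accessions" and "Dropping this entire strain" messages as above. How the real script picks between duplicates, if it ever does, is not covered here.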

@joverlee521 and I discussed this today. We're going to leave this PR open for the moment and explore NCBI's new API in #82, which promises to group segments together, and then compare those results to ours from this PR.

@jameshadfield changed the title from "James/dedup ncbi segments" to "[on hold] dedup ncbi segments" on Oct 20, 2024