Add representative samples from early clades to "broad" H1N1pdm HA Nextclade dataset #172

Merged
huddlej merged 2 commits into master from expand-early-h1n1pdm-ha-broad-nextclade
Jun 25, 2024

Conversation

huddlej
Contributor

@huddlej huddlej commented Jun 21, 2024

Description of proposed changes

Extends the min date for the "broad" H1N1pdm HA dataset from 2014 to 2009 and adds manually curated representative strain names for early clades 2, 3, 4, 7, and 8 to the "force-include" list for the Nextclade workflow. These changes allow the broad Nextclade dataset to represent most early clades (except clade 1) such that early sequences can be properly assigned to those clades.

Forcing inclusion of representative strains works around the workflow's filter that drops sequences with QC=bad, where the QC is computed against the more recent Nextclade dataset. Since that dataset lacks early clades, early sequences from those clades map to the newer tree with too many private mutations and get flagged with bad QC. A better approach could be to run Nextclade with the "broad" dataset for each lineage to minimize the number of false-positive bad QC labels, but that is for a future discussion/PR.
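To illustrate the interaction between the QC filter and the force-include list, here is a minimal Python sketch. It is not the workflow's actual code: the `Record` type, the `select_for_dataset` function, the strain names, and the assumption that force-included strains bypass both the date and QC filters are all hypothetical, chosen only to show why a representative strain flagged QC=bad can still end up in the dataset.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class Record:
    # Hypothetical record type; field names are illustrative only.
    strain: str
    collection_date: date
    qc_status: str  # e.g. "good", "mediocre", or "bad"


def select_for_dataset(records, min_date, force_include):
    """Keep records that pass the date and QC filters, plus any strain
    named in force_include (assumed here to bypass both filters)."""
    selected = []
    for record in records:
        if record.strain in force_include:
            # Force-included strains are kept even when QC is "bad".
            selected.append(record)
        elif record.collection_date >= min_date and record.qc_status != "bad":
            selected.append(record)
    return selected


# Toy data with made-up strain names.
records = [
    Record("early_clade_representative", date(2009, 6, 1), "bad"),
    Record("early_unlisted", date(2010, 1, 15), "bad"),
    Record("recent_sample", date(2022, 3, 10), "good"),
]
kept = select_for_dataset(
    records,
    min_date=date(2009, 1, 1),
    force_include={"early_clade_representative"},
)
kept_names = [record.strain for record in kept]
```

In this toy example the early representative strain survives despite its bad QC label, while the unlisted early strain with the same label is dropped. That is the false-positive problem the description mentions: extending the min date alone would not bring early clades into the tree.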

The following image shows the updated tree with clades 2, 3, 4, 6C, 7, and 8 represented by multiple sequences:

image

After adding clade 1 to the H1 HA definitions, I added representative clade 1 samples to be force-included in the broad H1 HA Nextclade dataset and rebuilt the tree. The updated tree looks like this with clade 1 as the MRCA instead of clade 2:

image

Related issue(s)

Closes #171

Checklist

Extends the min date for the "broad" H1N1pdm HA dataset from 2014 to
2009 and adds manually curated representative strain names for early
clades 2, 3, 4, 7, and 8 to the "force-include" list for the Nextclade
workflow. These changes allow the broad Nextclade dataset to represent
most early clades (except clade 1 [1]) such that early sequences can be
properly assigned to those clades.

This approach of forcing inclusion of representative strains works
around the workflow's filter of QC=bad sequences where the QC is based
on the more recent Nextclade dataset. Since that dataset lacks early
clades, early sequences from those clades map to the newer tree with too
many private mutations and get flagged with bad QC. A better approach
could be to run Nextclade with the "broad" dataset for each lineage, to
minimize the number of false positive bad QC labels, but that is for a
future discussion/PR.

[1] influenza-clade-nomenclature/seasonal_A-H1N1pdm_HA#2
@huddlej huddlej force-pushed the expand-early-h1n1pdm-ha-broad-nextclade branch from e74a5fd to c0fa6e3 Compare June 21, 2024 20:03
@huddlej huddlej requested a review from rneher June 22, 2024 00:16
huddlej added a commit to nextstrain/nextclade_data that referenced this pull request Jun 22, 2024
Add representative samples from early clades to "broad" H1N1pdm HA
Nextclade dataset. Improves clade label annotations for older sequences
by including clades 2, 3, 4, 6C, 7, and 8 in the dataset.

Related to nextstrain/seasonal-flu#172
Now that clade 1 has an official entry in the clade nomenclature, we can
add representative samples for this clade to be force-included to the
Nextclade tree for the broad H1N1pdm HA dataset. Including these samples
produces a tree where the MRCA is clade 1 and all other clades descend
from this clade.
@huddlej huddlej merged commit e2af9e6 into master Jun 25, 2024
3 checks passed
@huddlej huddlej deleted the expand-early-h1n1pdm-ha-broad-nextclade branch June 25, 2024 18:45
Development

Successfully merging this pull request may close these issues.

Nextclade dataset for "broad" H1N1pdm HA misassigns early clade labels