Add representative samples from early clades to "broad" H1N1pdm HA Nextclade dataset #172
Description of proposed changes
Extends the minimum date for the "broad" H1N1pdm HA dataset from 2014 to 2009 and adds manually curated representative strain names for early clades 2, 3, 4, 7, and 8 to the "force-include" list for the Nextclade workflow. With these changes, the broad Nextclade dataset represents most early clades (all except clade 1), so early sequences can be properly assigned to those clades.
Forcing inclusion of representative strains works around the workflow's filter that drops QC=bad sequences, where QC is computed against the more recent Nextclade dataset. Because that dataset lacks early clades, early sequences from those clades map to the newer tree with too many private mutations and get flagged with bad QC. A better approach could be to run Nextclade with the "broad" dataset for each lineage to minimize false-positive bad QC labels, but that is for a future discussion/PR.
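For readers unfamiliar with the pattern, here is a minimal Python sketch of how a force-include list can bypass a QC and date filter. It is not the workflow's actual code: the function `select_for_dataset`, the record fields, and the strain names are all hypothetical, and it assumes that force-included strains skip every other filter.

```python
from datetime import date


def select_for_dataset(records, min_date, force_include):
    """Return the strain names kept for the tree (illustrative only)."""
    kept = []
    for rec in records:
        # Assumption: force-included strains bypass all other filters.
        if rec["strain"] in force_include:
            kept.append(rec["strain"])
            continue
        # Everything else must pass both the date and the QC filter.
        if rec["date"] >= min_date and rec["qc"] != "bad":
            kept.append(rec["strain"])
    return kept


# Hypothetical records; early sequences get qc="bad" because QC was
# computed against a dataset that lacks their clades.
records = [
    {"strain": "early_clade_rep", "date": date(2009, 6, 1), "qc": "bad"},
    {"strain": "early_other", "date": date(2010, 1, 15), "qc": "bad"},
    {"strain": "recent_ok", "date": date(2019, 3, 2), "qc": "good"},
]

kept = select_for_dataset(records, date(2009, 1, 1), {"early_clade_rep"})
print(kept)
```

Without the force-include entry, both early strains would be dropped despite falling inside the extended date range, which is why lowering the minimum date alone would not be enough.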
The following image shows the updated tree with clades 2, 3, 4, 6C, 7, and 8 represented by multiple sequences:
After adding clade 1 to the H1 HA definitions, I added representative clade 1 samples to be force-included in the broad H1 HA Nextclade dataset and rebuilt the tree. The updated tree looks like this with clade 1 as the MRCA instead of clade 2:
Related issue(s)
Closes #171
Checklist