augur filter not identifying duplicates, causing pandas error #915
Comments
Thanks for sending this! I can reproduce this on my end, noting that it's more predictable when passing

echo -e 'strain\tdate
a\t2010-10-10
a\t2010-10-10
b\t2010-10-10
c\t2010-10-10
d\t2010-10-10
' > meta.tsv
augur filter \
--metadata meta.tsv \
--group-by year \
--sequences-per-group 2 \
--subsample-seed 0 \
--output-metadata meta2.tsv

Version 12.0.0 was the last to provide a meaningful output:
And the versions after that come with varying warnings, lack of warnings, and the error
This is due to the change in 12.1.0 where
I agree that this uncaught exception should not be exposed to the user. Two things that can be done here (not mutually exclusive):
Another tricky thing here is that a collection of names must be maintained outside of the metadata chunk iteration. Notice that there is no error when using

augur filter \
--metadata meta.tsv \
--group-by year \
--sequences-per-group 2 \
--subsample-seed 0 \
--metadata-chunk-size 1 \
--output-metadata meta2.tsv
# 3 strains were dropped during filtering
# 3 of these were dropped because of subsampling criteria
# 1 strains passed all filters

I believe we already have a similar collection being maintained, so this should not be difficult to fix.
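
The point about keeping a collection of names outside the chunk iteration can be illustrated with a minimal sketch. This is not Augur's implementation: `DuplicateStrainError`, `iter_chunks`, and `check_unique_strains` are hypothetical names, and the standard `csv` module stands in for chunked metadata reading. What it shows is that a set created before the chunk loop still catches a duplicate when the two rows land in different chunks, as they do with a chunk size of 1.

```python
import csv
import io
from itertools import islice


class DuplicateStrainError(Exception):
    """Hypothetical error type for a strain name that appears more than once."""


def iter_chunks(rows, chunk_size):
    """Yield lists of at most chunk_size rows, mimicking chunked metadata reading."""
    rows = iter(rows)
    while True:
        chunk = list(islice(rows, chunk_size))
        if not chunk:
            return
        yield chunk


def check_unique_strains(handle, chunk_size=2, id_column="strain"):
    """Return the number of rows read, or raise on the first duplicate strain."""
    seen = set()  # created outside the chunk loop, so it spans every chunk
    total = 0
    reader = csv.DictReader(handle, delimiter="\t")
    for chunk in iter_chunks(reader, chunk_size):
        for row in chunk:
            name = row[id_column]
            if name in seen:
                raise DuplicateStrainError(f"Duplicate strain in metadata: {name!r}")
            seen.add(name)
            total += 1
    return total


# Same shape as the meta.tsv in this thread, shortened for the example.
META = "strain\tdate\na\t2010-10-10\na\t2010-10-10\nb\t2010-10-10\n"

try:
    # With chunk_size=1 the two "a" rows are in different chunks.
    check_unique_strains(io.StringIO(META), chunk_size=1)
except DuplicateStrainError as error:
    print(error)
```

If the set were rebuilt inside the chunk loop instead, each chunk would be checked in isolation and the duplicate above would go unnoticed.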
Thanks for the super quick response, Victor! Any chance of a quick and dirty fix as a stopgap? This bug combined with #616 means duplicates are a hassle to deal with in the current version of
@ammaraziz Regarding this exact issue, the "quick and dirty fix" would still result in an error to the user.
However, I've started planning the new subcommand for removing duplicates, which should solve the broader issue of duplicates being a hassle to deal with in Augur: #919
I can't think of anything that can be done quickly to alleviate the pains here, besides work on the de-duplication feature. Please let me know if I missed anything.
Very good point. I look forward to the new subcommand!
@ammaraziz In the short term, you may want to check out our discussion of how to resolve duplicates in metadata and sequences using other standard bioinformatics tools.
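
As a rough illustration of a short-term workaround for the metadata side, the sketch below drops repeated strain names from a TSV before it is passed to augur filter. It is not taken from the linked discussion: `drop_duplicate_strains` is a hypothetical helper, and keeping the first row per strain is an assumption that may not suit every dataset. Matching sequences would need the same treatment with a separate tool.

```python
import csv
import io


def drop_duplicate_strains(in_handle, out_handle, id_column="strain"):
    """Copy TSV rows to out_handle, keeping only the first row per strain.

    Returns the strain names that were dropped as duplicates.
    Keeping the first occurrence is an assumption made for this sketch.
    """
    reader = csv.DictReader(in_handle, delimiter="\t")
    writer = csv.DictWriter(
        out_handle,
        fieldnames=reader.fieldnames,
        delimiter="\t",
        lineterminator="\n",
    )
    writer.writeheader()
    seen = set()
    dropped = []
    for row in reader:
        name = row[id_column]
        if name in seen:
            dropped.append(name)  # report it rather than silently discarding
            continue
        seen.add(name)
        writer.writerow(row)
    return dropped


source = io.StringIO("strain\tdate\na\t2010-10-10\na\t2010-10-10\nb\t2010-10-10\n")
cleaned = io.StringIO()
dropped = drop_duplicate_strains(source, cleaned)
print(dropped)
print(cleaned.getvalue())
```

Returning the dropped names makes it possible to review which records were removed before relying on the cleaned file.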
Current Behavior
When filtering a metadata file that contains duplicates in the strain field, this error is produced:
Expected behavior
The previous behavior was that the filter subcommand produced a helpful error (see #751 for an example), catching the problem before processing data.
How to reproduce
Given a meta.tsv with contents:
Run this command:
It will produce this error and trace:
The output of record when there is no duplicate:
The same with duplicates:
Your environment: if running Nextstrain locally
Additional context
The bug is hard to detect when subsampling with a grouping option. It only results in an error when a duplicate strain name is in the queue. This means data is written to the output until a duplicate reaches the queue, at which point writing stops.