augur filter not identifying duplicates, causing pandas error #915

ammaraziz · 2022-05-06T09:06:29Z

Current Behavior

When filtering a metadata file that contains duplicates in the strain field, this error is produced:

AttributeError: 'DataFrame' object has no attribute 'name'

Expected behavior

Previous behavior was that the filter subcommand produces a helpful error (see #751 for an example), catching the error before processing data.

How to reproduce

Given a meta.tsv with contents:

strain	date
a	2010-10-10
a	2010-10-10
b	2011-10-10
c	2012-10-10
d	2013-10-10

Run this command:

augur filter --metadata meta.tsv --group-by year --sequences-per-group 2 --output-metadata meta2.tsv

It will produce this error and trace:

Traceback (most recent call last):
  File "/home/amar/miniconda3/bin/augur", line 10, in <module>
    sys.exit(main())
  File "/home/amar/miniconda3/lib/python3.8/site-packages/augur/__main__.py", line 10, in main
    return augur.run( argv[1:] )
  File "/home/amar/miniconda3/lib/python3.8/site-packages/augur/__init__.py", line 75, in run
    return args.__command__.run(args)
  File "/home/amar/miniconda3/lib/python3.8/site-packages/augur/filter.py", line 1544, in run
    subsampled_strains.add(record.name)
  File "/home/amar/miniconda3/lib/python3.8/site-packages/pandas/core/generic.py", line 5465, in __getattr__
    return object.__getattribute__(self, name)
AttributeError: 'DataFrame' object has no attribute 'name'

The output of record when there is no duplicate:

date    2011-10-10
Name: b, dtype: object
date    2012-10-10
Name: c, dtype: object
date    2013-10-10
Name: d, dtype: object

The same with duplicates:

              date
strain            
a       2010-10-10
a       2010-10-10

Your environment: if running Nextstrain locally

Operating system: linux mint 20 + bash + conda
Test on Augur versions: 15.0.1 (latest), 14.0, 13.1.2, 12.1.1. Last version which worked was 11.3

Additional context

The bug is hard to detect when subsampling with grouping option. It will only result in an error when a duplicate strain name is in the queue. This means data is being written to the output but stops when duplicate is in the queue.

The text was updated successfully, but these errors were encountered:

victorlin · 2022-05-06T18:16:49Z

Thanks for sending this! I can reproduce this on my end, noting that it's more predictable when passing --subsample-seed. This is what I ran to reproduce:

echo -e 'strain\tdate
a\t2010-10-10
a\t2010-10-10
b\t2010-10-10
c\t2010-10-10
d\t2010-10-10
' > meta.tsv

augur filter \
  --metadata meta.tsv \
  --group-by year \
  --sequences-per-group 2 \
  --subsample-seed 0 \
  --output-metadata meta2.tsv

Version 12.0.0 was the last to provide a meaningful output:

ERROR: Problem reading in meta.tsv:
Duplicated strain in metadata: a

And the versions after that come with varying warnings, lack of warnings, and the error AttributeError: 'DataFrame' object has no attribute 'name'.

This is due to the change in 12.1.0 where augur filter reads metadata in chunks - during the process of that refactor, which came with nice improvements, the feature to detect duplicates in metadata was silently lost.

I agree that this uncaught exception should not be exposed to the user. Two things that can be done here (not mutually exclusive):

Show a proper error message on duplicates in augur filter, as previously done.
Implement a new subcommand as outlined in Add augur curate subcommands #860, which include duplicate checking.

victorlin · 2022-05-06T18:21:07Z

Another tricky thing here is that a collection of names must be maintained outside of the metadata chunk iteration. Notice that there is no error when using --metadata-chunk-size 1 since each chunk does not have duplicates:

augur filter \
  --metadata meta.tsv \
  --group-by year \
  --sequences-per-group 2 \
  --subsample-seed 0 \
  --metadata-chunk-size 1 \
  --output-metadata meta2.tsv
# 3 strains were dropped during filtering
# 	3 of these were dropped because of subsampling criteria
# 1 strains passed all filters

I believe we already have a similar collection being maintained, so this should not be difficult to fix.

ammaraziz · 2022-05-09T03:40:45Z

Thanks for the super quick response Victor!

Any chance of a quick and dirty fix as a stop gap? This bug combined with #616 means duplicates are a hassle to deal with in the current version of augur.

victorlin · 2022-05-10T04:56:26Z

@ammaraziz Regarding this exact issue, the "quick and dirty fix" would still result in an error to the user (ERROR: Duplicated strain in metadata vs. a cryptic AttributeError). So it would only be a cosmetic change to user experience, not a functional improvement.

However, I've started planning the new subcommand for removing duplicates, which should solve broader issue of duplicates being a hassle to deal with in Augur: #919

I can't think of anything that can be done quickly to alleviate the pains here, besides work on the de-duplication feature. Please let me know if I missed anything.

ammaraziz · 2022-05-10T05:07:26Z

Very good point. I look forward to the new subcommand!

huddlej · 2022-05-11T19:21:18Z

@ammaraziz In the short term, you may want to check out our discussion of how to resolve duplicates in metadata and sequences using other standard bioinformatics tools.

ammaraziz added the bug Something isn't working label May 6, 2022

victorlin self-assigned this May 6, 2022

victorlin mentioned this issue May 10, 2022

Subcommand for data de-duplication #919

Open

victorlin mentioned this issue May 11, 2022

Properly error on duplicates in parse and filter, handle AugurError globally #918

Merged

victorlin closed this as completed in #918 May 12, 2022

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

augur filter not identifying duplicates, causing pandas error #915

augur filter not identifying duplicates, causing pandas error #915

ammaraziz commented May 6, 2022

victorlin commented May 6, 2022 •

edited

Loading

victorlin commented May 6, 2022

ammaraziz commented May 9, 2022

victorlin commented May 10, 2022

ammaraziz commented May 10, 2022

huddlej commented May 11, 2022

augur filter not identifying duplicates, causing pandas error #915

augur filter not identifying duplicates, causing pandas error #915

Comments

ammaraziz commented May 6, 2022

Current Behavior

Expected behavior

How to reproduce

Your environment: if running Nextstrain locally

Additional context

victorlin commented May 6, 2022 • edited Loading

victorlin commented May 6, 2022

ammaraziz commented May 9, 2022

victorlin commented May 10, 2022

ammaraziz commented May 10, 2022

huddlej commented May 11, 2022

victorlin commented May 6, 2022 •

edited

Loading