Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Incomplete dates are not interpreted as ambiguous by augur filter #747

Open
corneliusroemer opened this issue Jul 12, 2021 · 8 comments · Fixed by #841
Open

Incomplete dates are not interpreted as ambiguous by augur filter #747

corneliusroemer opened this issue Jul 12, 2021 · 8 comments · Fixed by #841
Assignees
Labels
bug Something isn't working

Comments

@corneliusroemer
Copy link
Member

Current Behavior
I was surprised to find that despite there being mention of ambiguous date handling in the documentation of augur filter, incomplete dates (e.g. 2019 or 2019-04 as opposed to 2019-04-13) are counter-intuitively not supported by augur filter. In a query with augur filter --min-date 2018 samples with datestring 2019 or 2019-04 are currently excluded by augur, despite clearly satisfying the min-date condition.

Rather than complaining that the dates are incomplete, the samples with incomplete dates are silently excluded.

Fludb provides incomplete dates, and so it would be very convenient, if they were accepted as such by augur filter.

I tried adding --exclude-ambiguous-dates-by year as a workaround, but it change behaviour.

This is related to #662

@victorlin
Copy link
Member

victorlin commented Aug 12, 2021

@corneliusroemer this is what is shown by augur filter --help:

  --min-date MIN_DATE   minimal cutoff for date, the cutoff date is inclusive; may be specified as an Augur-style numeric date (with the year as the integer part) or YYYY-MM-DD
                        (default: None)
  --max-date MAX_DATE   maximal cutoff for date, the cutoff date is inclusive; may be specified as an Augur-style numeric date (with the year as the integer part) or YYYY-MM-DD
                        (default: None)

I don't see any reference of ambiguity so technically it is correct, but regardless I've created a PR to add support for this.

Note that ambiguous dates still follow the format YYYY-MM-DD so instead of augur filter --min-date 2018 it should be augur filter --min-date 2018-XX-XX.

@victorlin
Copy link
Member

Thinking a bit more, this feature probably isn't very useful unless there is support for incomplete date strings (#662 (comment)). --min-date 2021 would be a nice shorthand but 2021-XX-XX is just as verbose as its effective value 2021-01-01. The ambiguity syntax is useful for the metadata, but these cutoffs are user-defined.

@huddlej
Copy link
Contributor

huddlej commented Aug 18, 2021

I'm just now looking at this issue in preparation for reviewing @victorlin's PR #756. The most recent version of Augur includes some more realistic metadata in the functional test suite, so a minimal way to confirm the issue here is with the following command:

augur filter \
  --metadata filter/metadata.tsv \
  --min-date 2015 \
  --max-date 2017-01-01 \
  --output-strains filtered_strains.txt \
  --output-log filtered_log.tsv

The strain SG_018 has an incomplete date of 2016, but it should still pass the filters above. Instead it appears in the log output along with the strain COL/FLR_00024/2015 which has no date value:

strain	filter	kwargs
SG_018	filter_by_date	"[[""date_column"", ""date""], [""max_date"", 2017.0], [""min_date"", 2015.0]]"
COL/FLR_00024/2015	filter_by_date	"[[""date_column"", ""date""], [""max_date"", 2017.0], [""min_date"", 2015.0]]"

A potential functional test in tests/functional/filter.t would run the above command and confirm that the contents of the log is only one strain.

@victorlin
Copy link
Member

victorlin commented Aug 19, 2021

@huddlej the incomplete date on SG_018 is a separate but related issue - here we are talking about incomplete date on --min-date/--max-date not the values in metadata.tsv.

With current implementation, date values in metadata.tsv are parsed strictly with format %Y-%m-%d. I suspect utils.get_numerical_dates will need some rewriting to accept a variety of of date representations.

Thanks for the example! It helps better understand how this component is used. I'll look into adding some functional testing to my PR.

@victorlin
Copy link
Member

@huddlej I implemented a fix here: victorlin@94efe60

Decided not to add this to the current PR since it's a pretty invasive change removing the need for fmt parameters altogether. I also fixed a few incorrect functional tests and added the example you provided above.

@victorlin
Copy link
Member

Ah... when I was working on this earlier, I failed to realize:

  1. Ambiguous dates can either include XX or be incomplete (all these are valid: 2018 == 2018-XX-XX, 2018-03 == 2018-03-XX)
  2. @corneliusroemer was talking about incomplete dates in metadata file, but I interpreted it as an issue with --min-date. Both are valid issues.

Adding another example of the general issue on date format handing, this test fails when it shouldn't: victorlin@8b24536

@huddlej
Copy link
Contributor

huddlej commented Jan 11, 2022

I just ran into this problem, too, trying to build a tree for the country of Chad which has 9 sequences all of which have an ambiguous date of 2021. All 9 sequences get thrown out by augur filter's min/max date filter because the 2021 isn't interpreted properly as an ambiguous date.

@victorlin victorlin self-assigned this Jan 19, 2022
victorlin added a commit that referenced this issue Feb 4, 2022
Currently, date format handling is inaccurate (#747) as numeric dates are thrown out:

https://github.com/nextstrain/augur/blob/a85194c243db8d85e6fc06ea2d614e0b6095a0c4/augur/utils.py#L115-L119

This change ensures numeric dates are processed, and that non-negative integers are evaluated as year-only ambiguous dates.

Also including a few refactors:

- Remove `raise_error` parameter. The intent is unclear and tests still pass without it.
- Use `return` instead of an intermediate variable.

Testing:

- Add broken tests and verify new changes pass.
- Fix inaccurate existing tests.
@victorlin
Copy link
Member

Re-opening since the example of 2019-04 still does not work:

cat > metadata.tsv << ~~
strain	date
SEQ1	2019
SEQ2	2019-04
SEQ3	2019-04-13
~~

augur filter \
--metadata metadata.tsv \
--min-date 2018 \
--output-strains filtered.txt

cat filtered.txt
# SEQ3
# SEQ1

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
No open projects
Status: In Progress
3 participants