Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Robust transform-date-fields #45

Merged
merged 2 commits into from
Jun 10, 2022
Merged

Robust transform-date-fields #45

merged 2 commits into from
Jun 10, 2022

Conversation

joverlee521
Copy link
Contributor

@joverlee521 joverlee521 commented Jun 9, 2022

My attempt to make the ingest transform-date-fields more robust.
See commits for details.


Manually ran ingest on branch, resulting files will be available at s3://nextstrain-data/files/workflows/monkeypox/branch/robust-format-date/

Previous implementation of masking incomplete dates was dependent on
the date string using '-' to separate date fields. Providing date
strings with any other separator would result in a masked date.

This commit updates `format_date` to check directives within the date
format to verify if year, month, and day are included in the date format.

If date format does not contain any directives for year, then return a
completely masked date because a date without the year is completely
useless to us.

If the date format does not contain any directives for month,
then mask both month and day fields with 'XX' because just year and day
is completely useless to us.
There are potential date strings that can match multiple formats.
For example, '2020-01' can match both '%Y-%m' and '%Y-%d'. I'm not sure
how we can robustly detect these clashing formats. So the next best
thing is to respect the order in which the `expected-date-formats`
are provided and use the first matching format in the list.
@joverlee521 joverlee521 requested a review from a team June 9, 2022 01:55
@emmahodcroft
Copy link
Member

The code is beyond my reviewing capability, but I really like the ideas here - seems like a really robust solution Jover! This would be a great thing to integrate into auspice as part of (or attached to) transform, too. This is such a common issue.

@rneher rneher merged commit 1d6be8e into master Jun 10, 2022
@rneher rneher deleted the robust-format-date branch June 10, 2022 05:49
Comment on lines +70 to +74
(
(isinstance(directive, str) and directive in date_format) or
(isinstance(directive, tuple) and all(sub_directive in date_format for sub_directive in directive))
)
for directive in potential_directives
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

With polymorphic values like this (str or tuple), it can often be simplifying to just pick one or the other. In this case, since we need tuples, we could either do that dynamically with something like:

any(
    all(
        sub_directive in date_format
        for sub_directive in (directive if isinstance(directive, tuple) else (directive,))
    )
    for directive in potential_directives
)

but I'd suggest defining them all as tuples instead, e.g.

year_directives = {('%y',), ('%Y',)}

and leaving this function's code to only deal with tuples.

Comment on lines +50 to +56
all_field_directives = {'%c', '%x',
('%G', '%V', '%A'), ('%G', '%V', '%a'), ('%G', '%V', '%w'), ('%G', '%V', '%u')
}
month_and_day_directives = {'%j',
('%U', '%A'), ('%U', '%a'), ('%U', '%w'), ('%U', '%u'),
('%W', '%A'), ('%W', '%a'), ('%W', '%w'), ('%W', '%u')
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, I'm not sure it's valid to combine ISO week-based numbering (%G, %u, %V) with non-ISO week-based numbering (%U, %W, %a, %A, %w), as the two systems are "indexed" differently. So we may want to drop ('%U', '%u') and ('%W', '%u') from month_and_day_directives and ('%G', '%V', '%A'), ('%G', '%V', '%a'), ('%G', '%V', '%w') from all_field_directives.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, I included them since they were valid combinations for datetime.strptime based on the error message when trying to parse an incomplete ISO directive:

ValueError: ISO year directive '%G' must be used with the ISO week directive '%V' and a weekday directive ('%A', '%a', '%w', or '%u').

I did see that the Year/week directives cannot be interchanged between ISO/non-ISO, I think that matches what you mean by "indexed" differently? I would think the weekday directives can be used across ISO/non-ISO directives.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, hmm. I guess they're technically valid, in the sense that datetimes can be formatted with them and re-parsed back successfully. Ok to keep them then.

I guess I was more thinking about if they're valid for reasonable use, as you can't tell unambiguously from just the formatted datetime how it should be parsed, so it's a bit "dangerous" to mix them. For example, 2022 01 Sun (year, week, weekday) means different days depending on how it was produced:

>>> datetime.strptime("2022 01 Sun", "%Y %U %a")
datetime.datetime(2022, 1, 2, 0, 0)
>>> datetime.strptime("2022 01 Sun", "%G %V %a")
datetime.datetime(2022, 1, 9, 0, 0)

By "indexed" differently, I mean the starting points for counting weeks in a year and the weekday of weeks.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, I guess we would be leaving it up to the user to understand the datetime directives and provide the correct combinations. The same problem exists even within non-ISO:

>>> datetime.strptime("2022 01 Sun", "%Y %U %a")
datetime.datetime(2022, 1, 2, 0, 0)
>>> datetime.strptime("2022 01 Sun", "%Y %W %a")
datetime.datetime(2022, 1, 9, 0, 0)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yep! Ok to leave as is.

Comment on lines +71 to +72
(isinstance(directive, str) and directive in date_format) or
(isinstance(directive, tuple) and all(sub_directive in date_format for sub_directive in directive))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

While the in operator is super nice for readability here, it doesn't handle the edge case of a format string containing an escaped directive (e.g. %%Y means literal %Y not "a 4+ digit year").

Regexps provide a nice way to handle this with negative lookbehind assertions, e.g. (?<!%)%Y, generalized to something like:

def included(directive, format):
    return bool(re.search(f"(?<!%){re.escape(directive)}", format))

Comment on lines +98 to +116
# If not all fields directives are included, then check year
# directive was included in date format
elif(directive_is_included(year_directives, date_format)):
year_string = parsed_year_string

# Only check for month and day directives if year is included
# Check if directive for BOTH month and year is included in date format
if (directive_is_included(month_and_day_directives, date_format)):
month_string = parsed_month_string
day_string = parsed_day_string

# If not directives for BOTH month and day are included, then check
# month directive was included in date format
elif(directive_is_included(month_directives, date_format)):
month_string = parsed_month_string

# Only check for day directives if month is included
if(directive_is_included(day_directives, date_format)):
day_string = parsed_day_string
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This cascade of checks to greater specificity makes sense as an implementation approach. It also serves an important functional role because some directives are dependent on others to parse correctly, e.g. %U requires %Y (or %y), and we enforce that thru this cascade.

However, I wonder if the sets of directives above and this code would be easier to reason about if we explicitly listed the combinations for each level instead of cascading. It'd be more verbose, but I think not overly so.

For example, what if instead of cascading down these sets

all_field_directives
month_and_day_directives
year_directives
month_directives
day_directives

we used a flat if/elif/elif chain against these sets:

year_month_day_directives
year_month_directives
year_directives

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the current way makes it easier to add new directives. However, that may not be a worry since the directives rarely get updated 🤔

Copy link
Member

@tsibley tsibley Jun 16, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, the directives are likely to change approximately never.¹ ;-) And in general, I think the default should be to optimize the code for reading, not writing, as code will be read many many more times by many more people than it is written.

(But this doesn't mean you have to take my suggestion here!)

¹ These in particular been the same my entire programming career, because they were defined as part of C89, as in the year 1989.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, I might not put in the effort here in monkeypox since I expect it to be replaced by the new augur command, but will definitely go for readability in the augur version of date transformations.

@tsibley
Copy link
Member

tsibley commented Jun 16, 2022

Hmm. Realizing that there are also probably platform-specific directives that we should include here, as they can be successfully used if someone's only running augur on platforms that support more than what C89 does. The Python docs mention this:

The full set of format codes supported varies across platforms, because Python calls the platform C library’s strftime() function, and platform variations are common. To see the full set of format codes supported on your platform, consult the strftime(3) documentation. There are also differences between platforms in handling of unsupported format specifiers.

@victorlin
Copy link
Member

This is neat! Much broader support than Augur's dates.py, but also noting that incomplete dates such as 2020, 2020-01 which are addressed here, are also addressed in current Augur date support (though it isn't perfect: nextstrain/augur#747).

Once this makes its way into the Augur subcommand, we can probably scope downstream Augur date support to just:

  1. Exact ISO-8601 dates
  2. Ambiguous ISO-8601-formatted dates
  3. Numeric dates
  4. Relative dates

Related: nextstrain/augur#928

joverlee521 added a commit to nextstrain/augur that referenced this pull request Jan 19, 2023
Changes the cascade of directive checks to explicit sets of tuples of
directives so that it is easier to read/understand as suggested by
@tsibley in a previous review.¹ This also allows us to simplify the
`directive_is_included` function to only expect tuples as potential
directives.²

¹ nextstrain/mpox#45 (comment)
² nextstrain/mpox#45 (comment)
joverlee521 added a commit to nextstrain/augur that referenced this pull request Jan 19, 2023
'%%' represents an escaped directive that is a literal '%' character,
so it needs to be excluded in the date format check. Using the regex
negative lookbehind assertion suggested by @tsibley in a
previous review.¹

¹ nextstrain/mpox#45 (comment)
joverlee521 added a commit to nextstrain/augur that referenced this pull request Jan 21, 2023
Changes the cascade of directive checks to explicit sets of tuples of
directives so that it is easier to read/understand as suggested by
@tsibley in a previous review.¹ This also allows us to simplify the
`directive_is_included` function to only expect tuples as potential
directives.²

¹ nextstrain/mpox#45 (comment)
² nextstrain/mpox#45 (comment)
joverlee521 added a commit to nextstrain/augur that referenced this pull request Jan 21, 2023
'%%' represents an escaped directive that is a literal '%' character,
so it needs to be excluded in the date format check. Using the regex
negative lookbehind assertion suggested by @tsibley in a
previous review.¹

¹ nextstrain/mpox#45 (comment)
joverlee521 added a commit to nextstrain/augur that referenced this pull request Jan 28, 2023
Changes the cascade of directive checks to explicit sets of tuples of
directives so that it is easier to read/understand as suggested by
@tsibley in a previous review.¹ This also allows us to simplify the
`directive_is_included` function to only expect tuples as potential
directives.²

¹ nextstrain/mpox#45 (comment)
² nextstrain/mpox#45 (comment)
joverlee521 added a commit to nextstrain/augur that referenced this pull request Jan 28, 2023
'%%' represents an escaped directive that is a literal '%' character,
so it needs to be excluded in the date format check. Using the regex
negative lookbehind assertion suggested by @tsibley in a
previous review.¹

¹ nextstrain/mpox#45 (comment)
joverlee521 added a commit to nextstrain/augur that referenced this pull request Jun 12, 2023
Changes the cascade of directive checks to explicit sets of tuples of
directives so that it is easier to read/understand as suggested by
@tsibley in a previous review.¹ This also allows us to simplify the
`directive_is_included` function to only expect tuples as potential
directives.²

¹ nextstrain/mpox#45 (comment)
² nextstrain/mpox#45 (comment)
joverlee521 added a commit to nextstrain/augur that referenced this pull request Jun 12, 2023
'%%' represents an escaped directive that is a literal '%' character,
so it needs to be excluded in the date format check. Using the regex
negative lookbehind assertion suggested by @tsibley in a
previous review.¹

¹ nextstrain/mpox#45 (comment)
joverlee521 added a commit to nextstrain/augur that referenced this pull request Jul 7, 2023
Changes the cascade of directive checks to explicit sets of tuples of
directives so that it is easier to read/understand as suggested by
@tsibley in a previous review.¹ This also allows us to simplify the
`directive_is_included` function to only expect tuples as potential
directives.²

¹ nextstrain/mpox#45 (comment)
² nextstrain/mpox#45 (comment)
joverlee521 added a commit to nextstrain/augur that referenced this pull request Jul 7, 2023
'%%' represents an escaped directive that is a literal '%' character,
so it needs to be excluded in the date format check. Using the regex
negative lookbehind assertion suggested by @tsibley in a
previous review.¹

¹ nextstrain/mpox#45 (comment)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
No open projects
Development

Successfully merging this pull request may close these issues.

5 participants