-
Notifications
You must be signed in to change notification settings - Fork 19
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Robust transform-date-fields #45
Conversation
Previous implementation of masking incomplete dates was dependent on the date string using '-' to separate date fields. Providing date strings with any other separator would result in a masked date. This commit updates `format_date` to check directives within the date format to verify if year, month, and day are included in the date format. If date format does not contain any directives for year, then return a completely masked date because a date without the year is completely useless to us. If the date format does not contain any directives for month, then mask both month and day fields with 'XX' because just year and day is completely useless to us.
There are potential date strings that can match multiple formats. For example, '2020-01' can match both '%Y-%m' and '%Y-%d'. I'm not sure how we can robustly detect these clashing formats. So the next best thing is to respect the order in which the `expected-date-formats` are provided and use the first matching format in the list.
The code is beyond my reviewing capability, but I really like the ideas here - seems like a really robust solution Jover! This would be a great thing to integrate into |
( | ||
(isinstance(directive, str) and directive in date_format) or | ||
(isinstance(directive, tuple) and all(sub_directive in date_format for sub_directive in directive)) | ||
) | ||
for directive in potential_directives |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
With polymorphic values like this (str or tuple), it can often be simplifying to just pick one or the other. In this case, since we need tuples, we could either do that dynamically with something like:
any(
all(
sub_directive in date_format
for sub_directive in (directive if isinstance(directive, tuple) else (directive,))
)
for directive in potential_directives
)
but I'd suggest defining them all as tuples instead, e.g.
year_directives = {('%y',), ('%Y',)}
and leaving this function's code to only deal with tuples.
all_field_directives = {'%c', '%x', | ||
('%G', '%V', '%A'), ('%G', '%V', '%a'), ('%G', '%V', '%w'), ('%G', '%V', '%u') | ||
} | ||
month_and_day_directives = {'%j', | ||
('%U', '%A'), ('%U', '%a'), ('%U', '%w'), ('%U', '%u'), | ||
('%W', '%A'), ('%W', '%a'), ('%W', '%w'), ('%W', '%u') | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm, I'm not sure it's valid to combine ISO week-based numbering (%G
, %u
, %V
) with non-ISO week-based numbering (%U
, %W
, %a
, %A
, %w
), as the two systems are "indexed" differently. So we may want to drop ('%U', '%u')
and ('%W', '%u')
from month_and_day_directives
and ('%G', '%V', '%A'), ('%G', '%V', '%a'), ('%G', '%V', '%w')
from all_field_directives
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm, I included them since they were valid combinations for datetime.strptime
based on the error message when trying to parse an incomplete ISO directive:
ValueError: ISO year directive '%G' must be used with the ISO week directive '%V' and a weekday directive ('%A', '%a', '%w', or '%u').
I did see that the Year/week directives cannot be interchanged between ISO/non-ISO, I think that matches what you mean by "indexed" differently? I would think the weekday directives can be used across ISO/non-ISO directives.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, hmm. I guess they're technically valid, in the sense that datetimes can be formatted with them and re-parsed back successfully. Ok to keep them then.
I guess I was more thinking about if they're valid for reasonable use, as you can't tell unambiguously from just the formatted datetime how it should be parsed, so it's a bit "dangerous" to mix them. For example, 2022 01 Sun
(year, week, weekday) means different days depending on how it was produced:
>>> datetime.strptime("2022 01 Sun", "%Y %U %a")
datetime.datetime(2022, 1, 2, 0, 0)
>>> datetime.strptime("2022 01 Sun", "%G %V %a")
datetime.datetime(2022, 1, 9, 0, 0)
By "indexed" differently, I mean the starting points for counting weeks in a year and the weekday of weeks.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, I guess we would be leaving it up to the user to understand the datetime directives and provide the correct combinations. The same problem exists even within non-ISO:
>>> datetime.strptime("2022 01 Sun", "%Y %U %a")
datetime.datetime(2022, 1, 2, 0, 0)
>>> datetime.strptime("2022 01 Sun", "%Y %W %a")
datetime.datetime(2022, 1, 9, 0, 0)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yep! Ok to leave as is.
(isinstance(directive, str) and directive in date_format) or | ||
(isinstance(directive, tuple) and all(sub_directive in date_format for sub_directive in directive)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
While the in
operator is super nice for readability here, it doesn't handle the edge case of a format string containing an escaped directive (e.g. %%Y
means literal %Y
not "a 4+ digit year").
Regexps provide a nice way to handle this with negative lookbehind assertions, e.g. (?<!%)%Y
, generalized to something like:
def included(directive, format):
return bool(re.search(f"(?<!%){re.escape(directive)}", format))
# If not all fields directives are included, then check year | ||
# directive was included in date format | ||
elif(directive_is_included(year_directives, date_format)): | ||
year_string = parsed_year_string | ||
|
||
# Only check for month and day directives if year is included | ||
# Check if directive for BOTH month and year is included in date format | ||
if (directive_is_included(month_and_day_directives, date_format)): | ||
month_string = parsed_month_string | ||
day_string = parsed_day_string | ||
|
||
# If not directives for BOTH month and day are included, then check | ||
# month directive was included in date format | ||
elif(directive_is_included(month_directives, date_format)): | ||
month_string = parsed_month_string | ||
|
||
# Only check for day directives if month is included | ||
if(directive_is_included(day_directives, date_format)): | ||
day_string = parsed_day_string |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This cascade of checks to greater specificity makes sense as an implementation approach. It also serves an important functional role because some directives are dependent on others to parse correctly, e.g. %U
requires %Y
(or %y
), and we enforce that thru this cascade.
However, I wonder if the sets of directives above and this code would be easier to reason about if we explicitly listed the combinations for each level instead of cascading. It'd be more verbose, but I think not overly so.
For example, what if instead of cascading down these sets
all_field_directives
month_and_day_directives
year_directives
month_directives
day_directives
we used a flat if/elif/elif chain against these sets:
year_month_day_directives
year_month_directives
year_directives
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the current way makes it easier to add new directives. However, that may not be a worry since the directives rarely get updated 🤔
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, the directives are likely to change approximately never.¹ ;-) And in general, I think the default should be to optimize the code for reading, not writing, as code will be read many many more times by many more people than it is written.
(But this doesn't mean you have to take my suggestion here!)
¹ These in particular been the same my entire programming career, because they were defined as part of C89, as in the year 1989.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, I might not put in the effort here in monkeypox since I expect it to be replaced by the new augur command, but will definitely go for readability in the augur version of date transformations.
Hmm. Realizing that there are also probably platform-specific directives that we should include here, as they can be successfully used if someone's only running
|
This is neat! Much broader support than Augur's dates.py, but also noting that incomplete dates such as Once this makes its way into the Augur subcommand, we can probably scope downstream Augur date support to just:
Related: nextstrain/augur#928 |
Changes the cascade of directive checks to explicit sets of tuples of directives so that it is easier to read/understand as suggested by @tsibley in a previous review.¹ This also allows us to simplify the `directive_is_included` function to only expect tuples as potential directives.² ¹ nextstrain/mpox#45 (comment) ² nextstrain/mpox#45 (comment)
'%%' represents an escaped directive that is a literal '%' character, so it needs to be excluded in the date format check. Using the regex negative lookbehind assertion suggested by @tsibley in a previous review.¹ ¹ nextstrain/mpox#45 (comment)
Changes the cascade of directive checks to explicit sets of tuples of directives so that it is easier to read/understand as suggested by @tsibley in a previous review.¹ This also allows us to simplify the `directive_is_included` function to only expect tuples as potential directives.² ¹ nextstrain/mpox#45 (comment) ² nextstrain/mpox#45 (comment)
'%%' represents an escaped directive that is a literal '%' character, so it needs to be excluded in the date format check. Using the regex negative lookbehind assertion suggested by @tsibley in a previous review.¹ ¹ nextstrain/mpox#45 (comment)
Changes the cascade of directive checks to explicit sets of tuples of directives so that it is easier to read/understand as suggested by @tsibley in a previous review.¹ This also allows us to simplify the `directive_is_included` function to only expect tuples as potential directives.² ¹ nextstrain/mpox#45 (comment) ² nextstrain/mpox#45 (comment)
'%%' represents an escaped directive that is a literal '%' character, so it needs to be excluded in the date format check. Using the regex negative lookbehind assertion suggested by @tsibley in a previous review.¹ ¹ nextstrain/mpox#45 (comment)
Changes the cascade of directive checks to explicit sets of tuples of directives so that it is easier to read/understand as suggested by @tsibley in a previous review.¹ This also allows us to simplify the `directive_is_included` function to only expect tuples as potential directives.² ¹ nextstrain/mpox#45 (comment) ² nextstrain/mpox#45 (comment)
'%%' represents an escaped directive that is a literal '%' character, so it needs to be excluded in the date format check. Using the regex negative lookbehind assertion suggested by @tsibley in a previous review.¹ ¹ nextstrain/mpox#45 (comment)
Changes the cascade of directive checks to explicit sets of tuples of directives so that it is easier to read/understand as suggested by @tsibley in a previous review.¹ This also allows us to simplify the `directive_is_included` function to only expect tuples as potential directives.² ¹ nextstrain/mpox#45 (comment) ² nextstrain/mpox#45 (comment)
'%%' represents an escaped directive that is a literal '%' character, so it needs to be excluded in the date format check. Using the regex negative lookbehind assertion suggested by @tsibley in a previous review.¹ ¹ nextstrain/mpox#45 (comment)
My attempt to make the ingest
transform-date-fields
more robust.See commits for details.
Manually ran ingest on branch, resulting files will be available at
s3://nextstrain-data/files/workflows/monkeypox/branch/robust-format-date/