Robust transform-date-fields #45

joverlee521 · 2022-06-09T01:55:49Z

My attempt to make the ingest transform-date-fields more robust.
See commits for details.

Manually ran ingest on branch, resulting files will be available at s3://nextstrain-data/files/workflows/monkeypox/branch/robust-format-date/

Previous implementation of masking incomplete dates was dependent on the date string using '-' to separate date fields. Providing date strings with any other separator would result in a masked date. This commit updates `format_date` to check directives within the date format to verify if year, month, and day are included in the date format. If date format does not contain any directives for year, then return a completely masked date because a date without the year is completely useless to us. If the date format does not contain any directives for month, then mask both month and day fields with 'XX' because just year and day is completely useless to us.

There are potential date strings that can match multiple formats. For example, '2020-01' can match both '%Y-%m' and '%Y-%d'. I'm not sure how we can robustly detect these clashing formats. So the next best thing is to respect the order in which the `expected-date-formats` are provided and use the first matching format in the list.

emmahodcroft · 2022-06-09T07:20:28Z

The code is beyond my reviewing capability, but I really like the ideas here - seems like a really robust solution Jover! This would be a great thing to integrate into auspice as part of (or attached to) transform, too. This is such a common issue.

tsibley · 2022-06-15T21:44:22Z

ingest/bin/transform-date-fields

+            (
+                (isinstance(directive, str) and directive in date_format) or
+                (isinstance(directive, tuple) and all(sub_directive in date_format for sub_directive in directive))
+            )
+            for directive in potential_directives


With polymorphic values like this (str or tuple), it can often be simplifying to just pick one or the other. In this case, since we need tuples, we could either do that dynamically with something like:

any( all( sub_directive in date_format for sub_directive in (directive if isinstance(directive, tuple) else (directive,)) ) for directive in potential_directives )

but I'd suggest defining them all as tuples instead, e.g.

year_directives = {('%y',), ('%Y',)}

and leaving this function's code to only deal with tuples.

tsibley · 2022-06-15T22:10:38Z

ingest/bin/transform-date-fields

+    all_field_directives = {'%c', '%x',
+        ('%G', '%V', '%A'), ('%G', '%V', '%a'), ('%G', '%V', '%w'), ('%G', '%V', '%u')
+    }
+    month_and_day_directives = {'%j',
+        ('%U', '%A'), ('%U', '%a'), ('%U', '%w'), ('%U', '%u'),
+        ('%W', '%A'), ('%W', '%a'), ('%W', '%w'), ('%W', '%u')
+    }


Hmm, I'm not sure it's valid to combine ISO week-based numbering (%G, %u, %V) with non-ISO week-based numbering (%U, %W, %a, %A, %w), as the two systems are "indexed" differently. So we may want to drop ('%U', '%u') and ('%W', '%u') from month_and_day_directives and ('%G', '%V', '%A'), ('%G', '%V', '%a'), ('%G', '%V', '%w') from all_field_directives.

Hmm, I included them since they were valid combinations for datetime.strptime based on the error message when trying to parse an incomplete ISO directive:

ValueError: ISO year directive '%G' must be used with the ISO week directive '%V' and a weekday directive ('%A', '%a', '%w', or '%u').

I did see that the Year/week directives cannot be interchanged between ISO/non-ISO, I think that matches what you mean by "indexed" differently? I would think the weekday directives can be used across ISO/non-ISO directives.

Ah, hmm. I guess they're technically valid, in the sense that datetimes can be formatted with them and re-parsed back successfully. Ok to keep them then.

I guess I was more thinking about if they're valid for reasonable use, as you can't tell unambiguously from just the formatted datetime how it should be parsed, so it's a bit "dangerous" to mix them. For example, 2022 01 Sun (year, week, weekday) means different days depending on how it was produced:

>>> datetime.strptime("2022 01 Sun", "%Y %U %a") datetime.datetime(2022, 1, 2, 0, 0) >>> datetime.strptime("2022 01 Sun", "%G %V %a") datetime.datetime(2022, 1, 9, 0, 0)

By "indexed" differently, I mean the starting points for counting weeks in a year and the weekday of weeks.

Yeah, I guess we would be leaving it up to the user to understand the datetime directives and provide the correct combinations. The same problem exists even within non-ISO:

>>> datetime.strptime("2022 01 Sun", "%Y %U %a") datetime.datetime(2022, 1, 2, 0, 0) >>> datetime.strptime("2022 01 Sun", "%Y %W %a") datetime.datetime(2022, 1, 9, 0, 0)

Yep! Ok to leave as is.

tsibley · 2022-06-15T22:18:44Z

ingest/bin/transform-date-fields

+                (isinstance(directive, str) and directive in date_format) or
+                (isinstance(directive, tuple) and all(sub_directive in date_format for sub_directive in directive))


While the in operator is super nice for readability here, it doesn't handle the edge case of a format string containing an escaped directive (e.g. %%Y means literal %Y not "a 4+ digit year").

Regexps provide a nice way to handle this with negative lookbehind assertions, e.g. (?<!%)%Y, generalized to something like:

def included(directive, format): return bool(re.search(f"(?<!%){re.escape(directive)}", format))

tsibley · 2022-06-15T22:26:00Z

ingest/bin/transform-date-fields

+        # If not all fields directives are included, then check year
+        # directive was included in date format
+        elif(directive_is_included(year_directives, date_format)):
+            year_string = parsed_year_string
+
+            # Only check for month and day directives if year is included
+            # Check if directive for BOTH month and year is included in date format
+            if (directive_is_included(month_and_day_directives, date_format)):
+                month_string = parsed_month_string
+                day_string = parsed_day_string
+
+            # If not directives for BOTH month and day are included, then check
+            # month directive was included in date format
+            elif(directive_is_included(month_directives, date_format)):
+                month_string = parsed_month_string
+
+                # Only check for day directives if month is included
+                if(directive_is_included(day_directives, date_format)):
+                    day_string = parsed_day_string


This cascade of checks to greater specificity makes sense as an implementation approach. It also serves an important functional role because some directives are dependent on others to parse correctly, e.g. %U requires %Y (or %y), and we enforce that thru this cascade.

However, I wonder if the sets of directives above and this code would be easier to reason about if we explicitly listed the combinations for each level instead of cascading. It'd be more verbose, but I think not overly so.

For example, what if instead of cascading down these sets

all_field_directives month_and_day_directives year_directives month_directives day_directives

we used a flat if/elif/elif chain against these sets:

year_month_day_directives year_month_directives year_directives

I think the current way makes it easier to add new directives. However, that may not be a worry since the directives rarely get updated 🤔

Yeah, the directives are likely to change approximately never.¹ ;-) And in general, I think the default should be to optimize the code for reading, not writing, as code will be read many many more times by many more people than it is written.

(But this doesn't mean you have to take my suggestion here!)

¹ These in particular been the same my entire programming career, because they were defined as part of C89, as in the year 1989.

Yeah, I might not put in the effort here in monkeypox since I expect it to be replaced by the new augur command, but will definitely go for readability in the augur version of date transformations.

tsibley · 2022-06-16T19:10:10Z

Hmm. Realizing that there are also probably platform-specific directives that we should include here, as they can be successfully used if someone's only running augur on platforms that support more than what C89 does. The Python docs mention this:

The full set of format codes supported varies across platforms, because Python calls the platform C library’s strftime() function, and platform variations are common. To see the full set of format codes supported on your platform, consult the strftime(3) documentation. There are also differences between platforms in handling of unsupported format specifiers.

victorlin · 2022-06-18T00:12:30Z

This is neat! Much broader support than Augur's dates.py, but also noting that incomplete dates such as 2020, 2020-01 which are addressed here, are also addressed in current Augur date support (though it isn't perfect: nextstrain/augur#747).

Once this makes its way into the Augur subcommand, we can probably scope downstream Augur date support to just:

Exact ISO-8601 dates
Ambiguous ISO-8601-formatted dates
Numeric dates
Relative dates

Related: nextstrain/augur#928

@tsibley

Changes the cascade of directive checks to explicit sets of tuples of directives so that it is easier to read/understand as suggested by @tsibley in a previous review.¹ This also allows us to simplify the `directive_is_included` function to only expect tuples as potential directives.² ¹ nextstrain/mpox#45 (comment) ² nextstrain/mpox#45 (comment)

@tsibley