-
Notifications
You must be signed in to change notification settings - Fork 19
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
ingest/transform: add apply-geolocation-rules
#41
Conversation
Applies geolocation rules provided in the geolocation rules TSV file to the geolocation fields (region, country, division, location). This allows us to create general rules for geolocations that are not tied to specific record accession (i.e. not user annotations). Currently fetches the geolocation rules TSV from ncov-ingest, but in the future this can point to a more centralized location for general rules. Monkeypox data specific rules in ingest/source-data/geolocation-rules.tsv will be concatenated to the general rules. Since these local rules are added at the end, they can overwrite existing general rules.
This looks very good to me! |
# Find the index of the first of the consecutive '*' from the | ||
# end of the rule_traversal | ||
# [A, *, B, *] => first_consecutive_general_rule_index = 3 | ||
# [A, *, *, *] => first_consecutive_general_rule_index = 1 | ||
first_consecutive_general_rule_index = 0 | ||
for index, field_value in reversed(list(enumerate(rule_traversal))): | ||
if field_value == '*': | ||
first_consecutive_general_rule_index = index | ||
else: | ||
break | ||
|
||
# Edit the raw value of the field directly before the first consecutive '*' | ||
# in the rule traversal in hopes that by moving to a general rule on | ||
# a higher level, we can find a matching rule. | ||
rule_traversal[first_consecutive_general_rule_index - 1] = '*' |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@joverlee521 I'm a bit confused here, could you provide an example of what this is trying to do?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, this might be hard to explain. I'll walk through an example.
Example rules:
rule # | raw | annotated |
---|---|---|
1 | a/b/c/d | 1/2/3/4 |
2 | a/*/*/* | 1/*/*/* |
A example case: ( a, b, c, e )
Here is how rule_traversal
will be edited per functional call:
- [a]
- [a, b]
- [a, b, c] <- Up to this point, we are matching rule # 1
- [a, b, c, e] <- This doesn't match any available rules, so we try to backfill with
*
- [a, b, c, * ] <- Still no matching rule, continue
*
backfill - [a, b, *, *] <- Still no matching rule, continue
*
backfill - [a, *, *, * ] <- This now matches rule # 2!
And of course as I explain this, I'm realizing there are cases where this logic would not work...I'll create a PR to update this tomorrow
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
See #55 for the updated logic 🙏
Previously, we would backfill `rule_traversal` with "*" general rules if we could not find a matching rule for a particular `raw_geolocation`. This would NOT work for cases where the "*" general rules preceded actual geolocations such as "*/*/*/D". I realized the flaw in this logic while responding to @victorlin's post-merge review: #41 (comment) This commit updates the logic to reset `rule_traversal` to the last index that is currently not a "*" general rule, then change this value to a "*". This allows the recursive function to try different iterations of `rule_traversal` with different combinations of raw values and "*"s.
Previously, we would backfill `rule_traversal` with "*" general rules if we could not find a matching rule for a particular `raw_geolocation`. This would NOT work for cases where the "*" general rules preceded actual geolocations such as "*/*/*/D". I realized the flaw in this logic while responding to @victorlin's post-merge review: #41 (comment) This commit updates the logic to reset `rule_traversal` to the last index that is currently not a "*" general rule, then change this value to a "*". This allows the recursive function to try different iterations of `rule_traversal` with different combinations of raw values and "*"s.
Previously, we would backfill `rule_traversal` with wildcards ("*") if we could not find a matching rule for a particular `raw_geolocation`. This would NOT work for cases where there are partial rule matches AND wildcard rules that match. I realized the flaw in this logic while responding to @victorlin's post-merge review: #41 (comment) This commit updates the logic for when there are no matching rules. The `rule_traversal` is reset to the last index that is currently not a wildcard rule, and then change this value to a wildcard. This allows the recursive function to try different iterations of `rule_traversal` with different combinations of raw values and wildcards.
Applies geolocation rules provided in the geolocation rules TSV file to
the geolocation fields (region, country, division, location).
This allows us to create general rules for geolocations that are not
tied to specific record accession (i.e. not user annotations).
Currently fetches the geolocation rules TSV from ncov-ingest, but in the
future this can point to a more centralized location for general rules.
Monkeypox data specific rules in ingest/source-data/geolocation-rules.tsv
will be concatenated to the general rules. Since these local rules are
added at the end, they can overwrite existing general rules.
Closes #34