Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ingest/transform: add apply-geolocation-rules #41

Merged
merged 1 commit into from
Jun 10, 2022
Merged

Conversation

joverlee521
Copy link
Contributor

Applies geolocation rules provided in the geolocation rules TSV file to
the geolocation fields (region, country, division, location).
This allows us to create general rules for geolocations that are not
tied to specific record accession (i.e. not user annotations).

Currently fetches the geolocation rules TSV from ncov-ingest, but in the
future this can point to a more centralized location for general rules.

Monkeypox data specific rules in ingest/source-data/geolocation-rules.tsv
will be concatenated to the general rules. Since these local rules are
added at the end, they can overwrite existing general rules.

Closes #34

Applies geolocation rules provided in the geolocation rules TSV file to
the geolocation fields (region, country, division, location).
This allows us to create general rules for geolocations that are not
tied to specific record accession (i.e. not user annotations).

Currently fetches the geolocation rules TSV from ncov-ingest, but in the
future this can point to a more centralized location for general rules.

Monkeypox data specific rules in ingest/source-data/geolocation-rules.tsv
will be concatenated to the general rules. Since these local rules are
added at the end, they can overwrite existing general rules.
@MoiraZuber
Copy link

This looks very good to me!

@rneher rneher merged commit 16f1c38 into master Jun 10, 2022
@rneher rneher deleted the ingest-geolocation branch June 10, 2022 05:53
Comment on lines +119 to +133
# Find the index of the first of the consecutive '*' from the
# end of the rule_traversal
# [A, *, B, *] => first_consecutive_general_rule_index = 3
# [A, *, *, *] => first_consecutive_general_rule_index = 1
first_consecutive_general_rule_index = 0
for index, field_value in reversed(list(enumerate(rule_traversal))):
if field_value == '*':
first_consecutive_general_rule_index = index
else:
break

# Edit the raw value of the field directly before the first consecutive '*'
# in the rule traversal in hopes that by moving to a general rule on
# a higher level, we can find a matching rule.
rule_traversal[first_consecutive_general_rule_index - 1] = '*'
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@joverlee521 I'm a bit confused here, could you provide an example of what this is trying to do?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, this might be hard to explain. I'll walk through an example.

Example rules:

rule # raw annotated
1 a/b/c/d 1/2/3/4
2 a/*/*/* 1/*/*/*

A example case: ( a, b, c, e )

Here is how rule_traversal will be edited per functional call:

  1. [a]
  2. [a, b]
  3. [a, b, c] <- Up to this point, we are matching rule # 1
  4. [a, b, c, e] <- This doesn't match any available rules, so we try to backfill with *
  5. [a, b, c, * ] <- Still no matching rule, continue * backfill
  6. [a, b, *, *] <- Still no matching rule, continue * backfill
  7. [a, *, *, * ] <- This now matches rule # 2!

And of course as I explain this, I'm realizing there are cases where this logic would not work...I'll create a PR to update this tomorrow

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See #55 for the updated logic 🙏

joverlee521 added a commit that referenced this pull request Jun 14, 2022
Previously, we would backfill `rule_traversal` with "*" general rules
if we could not find a matching rule for a particular `raw_geolocation`.
This would NOT work for cases where the "*" general rules preceded
actual geolocations such as "*/*/*/D". I realized the flaw in this logic
while responding to @victorlin's post-merge review:
#41 (comment)

This commit updates the logic to reset `rule_traversal` to the last
index that is currently not a "*" general rule, then change this value
to a "*". This allows the recursive function to try different iterations
of `rule_traversal` with different combinations of raw values and "*"s.
joverlee521 added a commit that referenced this pull request Jun 14, 2022
Previously, we would backfill `rule_traversal` with "*" general rules
if we could not find a matching rule for a particular `raw_geolocation`.
This would NOT work for cases where the "*" general rules preceded
actual geolocations such as "*/*/*/D". I realized the flaw in this logic
while responding to @victorlin's post-merge review:
#41 (comment)

This commit updates the logic to reset `rule_traversal` to the last
index that is currently not a "*" general rule, then change this value
to a "*". This allows the recursive function to try different iterations
of `rule_traversal` with different combinations of raw values and "*"s.
joverlee521 added a commit that referenced this pull request Jun 15, 2022
Previously, we would backfill `rule_traversal` with wildcards ("*")
if we could not find a matching rule for a particular `raw_geolocation`.
This would NOT work for cases where there are partial rule matches AND
wildcard rules that match. I realized the flaw in this logic
while responding to @victorlin's post-merge review:
#41 (comment)

This commit updates the logic for when there are no matching rules. The
`rule_traversal` is reset to the last index that is currently not a
wildcard rule, and then change this value to a wildcard. This allows the
recursive function to try different iterations of `rule_traversal` with
different combinations of raw values and wildcards.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
No open projects
Development

Successfully merging this pull request may close these issues.

ingest: adopt geolocation rules
4 participants