Optimizations to the dictionary comparison strategy #51

Merged
ESultanik merged 12 commits into master from dict-comparison-optimizations
May 16, 2022

ESultanik
Collaborator

Take these two JSON files as an example:

$ cat f1.json
{
    "foo": [1, 2, 3],
    "oof": [1, "two", 3]
}
$ cat f2.json
{
    "bar": [1, 2, 3],
    "foo": [1, "two", 3] 
}

By default, Graphtage used to try all possible matchings between dictionary key/value pairs. Running graphtage f1.json f2.json would result in the "foo" key being replaced by "bar" and the "f" in the "oof" key being moved to the front of the string.

This sort of matching is polynomial time in the size of the input, but it is often still intractable for large files. Therefore, Graphtage had an option, --no-key-edits or -k, that would prevent two dictionary key/value pairs from being compared to each other unless their keys were identical. graphtage -k f1.json f2.json would have resulted in the 2 being replaced by "two", the entire "oof" key/value pair being removed, and the entire "bar" key/value pair being added.
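As a rough illustration of the --no-key-edits behavior described above, the following Python sketch partitions the keys of the two example files into those compared in place, those removed, and those added. The function name split_keys_without_key_edits is hypothetical and this is not Graphtage's implementation; it only mirrors the rule that pairs are compared when their keys are identical.

```python
import json


def split_keys_without_key_edits(before: dict, after: dict):
    """Hypothetical helper: partition keys the way an identical-keys-only
    comparison would see them. Not taken from Graphtage's source."""
    # Keys present in both files: only these pairs have their values compared.
    shared = [k for k in before if k in after]
    # Keys only in the first file: the whole key/value pair is removed.
    removed = [k for k in before if k not in after]
    # Keys only in the second file: the whole key/value pair is added.
    added = [k for k in after if k not in before]
    return shared, removed, added


f1 = json.loads('{"foo": [1, 2, 3], "oof": [1, "two", 3]}')
f2 = json.loads('{"bar": [1, 2, 3], "foo": [1, "two", 3]}')

shared, removed, added = split_keys_without_key_edits(f1, f2)
print(shared, removed, added)  # ['foo'] ['oof'] ['bar']
```

Only the "foo" values are compared element by element, which is where the 2 to "two" replacement comes from.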

This PR…

  1. generalizes these two options with a new --dict-strategy/-ds option, which sets the strategy: match for the old default behavior and none for the old --no-key-edits behavior. The --no-key-edits option still exists, but is now an alias for --dict-strategy none.
  2. adds a new strategy, --dict-strategy auto, which is now the default. It behaves exactly the same as the match strategy, except that any two key/value pairs with the exact same key are automatically matched to each other.

graphtage --dict-strategy auto f1.json f2.json will now result in 2 being replaced with "two", oof being replaced by bar, and "two" being replaced by 2.
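To make the difference between the three strategies concrete, here is a minimal Python sketch that lists which key pairs each strategy would consider comparing for the example files. The function candidate_pairs is hypothetical and is not part of Graphtage; it assumes that auto first pairs identical keys and then leaves only the remaining keys to the general matching, as described above. It does not compute edit costs or the final matching.

```python
from itertools import product


def candidate_pairs(before: dict, after: dict, strategy: str = "auto"):
    """Hypothetical sketch of which key pairs each strategy considers.
    Not Graphtage's implementation; no edit costs are computed here."""
    if strategy == "none":
        # Only identical keys are ever compared.
        return [(k, k) for k in before if k in after]
    if strategy == "match":
        # Every key in the first dict may be matched with every key in the second.
        return list(product(before, after))
    if strategy == "auto":
        # Identical keys are matched up front ...
        exact = [(k, k) for k in before if k in after]
        # ... and only the leftover keys go through the all-pairs matching.
        left = [k for k in before if k not in after]
        right = [k for k in after if k not in before]
        return exact + list(product(left, right))
    raise ValueError(f"unknown strategy: {strategy}")


f1 = {"foo": [1, 2, 3], "oof": [1, "two", 3]}
f2 = {"bar": [1, 2, 3], "foo": [1, "two", 3]}

for name in ("match", "none", "auto"):
    print(name, candidate_pairs(f1, f2, name))
```

For these files, match considers four pairs, none considers only ("foo", "foo"), and auto considers ("foo", "foo") plus ("oof", "bar"), which lines up with the edits listed for --dict-strategy auto.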

@ESultanik ESultanik self-assigned this May 12, 2022
@ESultanik ESultanik added the enhancement New feature or request label May 12, 2022
@ESultanik ESultanik merged commit 73639f6 into master May 16, 2022
@ESultanik ESultanik deleted the dict-comparison-optimizations branch May 16, 2022 16:32