Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

refactor dedupe middleware #1222

Merged
merged 6 commits into from
Oct 30, 2018
Merged

refactor dedupe middleware #1222

merged 6 commits into from
Oct 30, 2018

Conversation

missinglink
Copy link
Member

@missinglink missinglink commented Oct 29, 2018

ready to merge

this PR refactors the dedupe middleware for readability, the old code had some unusual conventions such as throwing instead of returning values and using _.some purely to break iterator execution.

the new code is hopefully much easier to read and edit.

I have made a couple of changes to the way the deduper works with custom data:

  • all custom layers are treated as analogous to a venue
  • when selecting a preferred match between a canonical source and a non-canonical source, the non-canonical one will be selected.

@missinglink missinglink force-pushed the dedupe branch 2 times, most recently from 6599280 to 464e71c Compare October 29, 2018 19:04
@@ -1,6 +1,9 @@
const _ = require('lodash');
const elasticsearch = require('elasticsearch');

// a list of the canonical sources included in the default Pelias configuration
const CANONICAL_SOURCES = ['whosonfirst', 'openstreetmap', 'openaddresses', 'geonames'];
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's move this to pelias.json to avoid more hardcoded lists of sources and layers (see #1161)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm.. the problem with moving it to somewhere it can be edited is that it's no longer canonical.

The whole idea with this is to be able to know which things differ from what we expect vs. what's configured.

Copy link
Member

@orangejulius orangejulius Oct 29, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's a good point.

Ideally there would be a single source of truth for this, but opening it up to edits would be weird. Maybe we can write some code in pelias-config to essentially enable "constant" config values. It could just be a function that returns something like this.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Resolved in pelias/config#103

@orangejulius
Copy link
Member

orangejulius commented Oct 30, 2018

This is really great. I found some cases we should think about, that perhaps might fit in this PR, or might make sense for further improvements:

  • if record one has only a postal code, and no other address fields, and record two has has no address fields at all, they won't be considered dupes. Example test case here: https://github.com/pelias/api/tree/dedupe-failing-test-case
  • Likewise, a record with an address and a record without an address aren't considered dupes. This is a bigger change but is currently happening for the Nike case, and is probably something users will want. Otherwise, they can't "augment" an existing record by adding an address

Copy link
Member Author

@missinglink missinglink left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@orangejulius I don't think you committed that fixture file?

git cherry-pick aa28db0bd38a153c167decc427c22e20210ec532

internal/modules/cjs/loader.js:583
    throw err;
    ^

Error: Cannot find module '../fixture/dedupe_only_postalcode_differs'
    at Function.Module._resolveFilename (internal/modules/cjs/loader.js:581:15)
    at Function.Module._load (internal/modules/cjs/loader.js:507:25)
    at Module.require (internal/modules/cjs/loader.js:637:17)
    at require (internal/modules/cjs/helpers.js:20:18)
    at Object.<anonymous> (/code/pelias/api/test/unit/middleware/dedupe.js:4:29)
    at Module._compile (internal/modules/cjs/loader.js:689:30)
    at Object.Module._extensions..js (internal/modules/cjs/loader.js:700:10)
    at Module.load (internal/modules/cjs/loader.js:599:32)
    at tryModuleLoad (internal/modules/cjs/loader.js:538:12)
    at Function.Module._load (internal/modules/cjs/loader.js:530:3)

if( _.isUndefined(res) || !_.isArray(res.data) || _.isEmpty(res.data) ){ return next(); }

// loop through data items and only copy unique items to unique
// note: the first reqults must always be unique!
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Typo

@@ -1,6 +1,9 @@
const _ = require('lodash');
const elasticsearch = require('elasticsearch');

// a list of the canonical sources included in the default Pelias configuration
const CANONICAL_SOURCES = ['whosonfirst', 'openstreetmap', 'openaddresses', 'geonames'];
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Resolved in pelias/config#103

@orangejulius
Copy link
Member

Just gave this a whirl on a full planet build. All acceptance tests pass.

Do we know of any examples where this might change the deduping results in a full planet build, for example from the case where only one record had an address_parts object with postal code? It would be nice to add an acceptance test.

We have had several postal code deduping issues reported over the years, such as pelias/pelias#541, but I don't currently know of a good test case.

@orangejulius orangejulius merged commit 3535f9e into master Oct 30, 2018
@orangejulius orangejulius deleted the dedupe branch October 30, 2018 18:09
@missinglink missinglink restored the dedupe branch October 30, 2018 18:19
@orangejulius orangejulius deleted the dedupe branch October 30, 2018 18:26
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants