-
-
Notifications
You must be signed in to change notification settings - Fork 164
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
refactor dedupe middleware #1222
Conversation
6599280
to
464e71c
Compare
helper/TypeMapping.js
Outdated
@@ -1,6 +1,9 @@ | |||
const _ = require('lodash'); | |||
const elasticsearch = require('elasticsearch'); | |||
|
|||
// a list of the canonical sources included in the default Pelias configuration | |||
const CANONICAL_SOURCES = ['whosonfirst', 'openstreetmap', 'openaddresses', 'geonames']; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's move this to pelias.json
to avoid more hardcoded lists of sources and layers (see #1161)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm.. the problem with moving it to somewhere it can be edited is that it's no longer canonical.
The whole idea with this is to be able to know which things differ from what we expect vs. what's configured.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's a good point.
Ideally there would be a single source of truth for this, but opening it up to edits would be weird. Maybe we can write some code in pelias-config
to essentially enable "constant" config values. It could just be a function that returns something like this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Resolved in pelias/config#103
This is really great. I found some cases we should think about, that perhaps might fit in this PR, or might make sense for further improvements:
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@orangejulius I don't think you committed that fixture file?
git cherry-pick aa28db0bd38a153c167decc427c22e20210ec532
internal/modules/cjs/loader.js:583
throw err;
^
Error: Cannot find module '../fixture/dedupe_only_postalcode_differs'
at Function.Module._resolveFilename (internal/modules/cjs/loader.js:581:15)
at Function.Module._load (internal/modules/cjs/loader.js:507:25)
at Module.require (internal/modules/cjs/loader.js:637:17)
at require (internal/modules/cjs/helpers.js:20:18)
at Object.<anonymous> (/code/pelias/api/test/unit/middleware/dedupe.js:4:29)
at Module._compile (internal/modules/cjs/loader.js:689:30)
at Object.Module._extensions..js (internal/modules/cjs/loader.js:700:10)
at Module.load (internal/modules/cjs/loader.js:599:32)
at tryModuleLoad (internal/modules/cjs/loader.js:538:12)
at Function.Module._load (internal/modules/cjs/loader.js:530:3)
middleware/dedupe.js
Outdated
if( _.isUndefined(res) || !_.isArray(res.data) || _.isEmpty(res.data) ){ return next(); } | ||
|
||
// loop through data items and only copy unique items to unique | ||
// note: the first reqults must always be unique! |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Typo
helper/TypeMapping.js
Outdated
@@ -1,6 +1,9 @@ | |||
const _ = require('lodash'); | |||
const elasticsearch = require('elasticsearch'); | |||
|
|||
// a list of the canonical sources included in the default Pelias configuration | |||
const CANONICAL_SOURCES = ['whosonfirst', 'openstreetmap', 'openaddresses', 'geonames']; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Resolved in pelias/config#103
Just gave this a whirl on a full planet build. All acceptance tests pass. Do we know of any examples where this might change the deduping results in a full planet build, for example from the case where only one record had an We have had several postal code deduping issues reported over the years, such as pelias/pelias#541, but I don't currently know of a good test case. |
…, prefer non-canonical records
…ddress_parts" propertiees are not set
ready to merge
this PR refactors the dedupe middleware for readability, the old code had some unusual conventions such as throwing instead of returning values and using
_.some
purely to break iterator execution.the new code is hopefully much easier to read and edit.
I have made a couple of changes to the way the deduper works with custom data: