Import Who's on First venues #94
I was poking around in the venue data recently and noticed that there are some Manhattan records with multiple hierarchies that are also placed in New Jersey.
Are there enough that reporting them and fixing them manually(-ish) would be difficult?
I found 4090 just in that area but am working on a script to check elsewhere.
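A minimal sketch of the kind of check described above, assuming the records are standard Who's on First GeoJSON features where `wof:hierarchy` is an array of hierarchy objects with `*_id` entries; the function name and surrounding plumbing are illustrative, not the actual script:

```javascript
// Flag WOF records whose hierarchies disagree on which region they belong to,
// e.g. a Manhattan venue that is also placed in New Jersey.
// Assumes the standard WOF GeoJSON layout: properties['wof:hierarchy'] is an
// array of objects such as { country_id, region_id, county_id, ... }.
function hierarchiesDisagreeOnRegion(wofRecord) {
  const hierarchies = wofRecord.properties['wof:hierarchy'] || [];
  const regionIds = new Set(
    hierarchies
      .map(h => h.region_id)
      .filter(id => id !== undefined && id !== -1)
  );
  // more than one distinct region id means the record is placed in two regions
  return regionIds.size > 1;
}

module.exports = hierarchiesDisagreeOnRegion;
```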
Previously, the WOF importer loaded all records into memory in one stream, and then processed and indexed the records in Elasticsearch in a second stream after the first stream was done. This has several problems:

* It requires that all data fit into memory. While this is not _so_ bad for WOF admin data, where a reasonably new machine can handle things just fine, it's horrible for venue data, where there are already tens of millions of records.
* It's slower: by separating the disk and network I/O sections, they can't be interleaved to speed things up.
* It doesn't give good feedback that something is happening when running the importer: the importer sits for several minutes loading records before the dbclient progress logs start displaying.

This change fixes all those issues by processing all records in a single stream, starting at the highest hierarchy level and finishing at the lowest, so that all records always have the admin data they need to be processed.

Fixes #101
Connects #7
Connects #94
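For illustration, a minimal sketch of the single-stream ordering described in that change, using Node's built-in stream pipeline; `createReadStream`, `createProcessStream` and `createIndexStream` are hypothetical stand-ins for the importer's real components, and the placetype list is abbreviated:

```javascript
const { pipeline } = require('stream/promises');

// highest hierarchy level first, venues last, so every record's parents are
// already indexed by the time the record itself is processed
const placetypes = ['country', 'region', 'county', 'locality', 'neighbourhood', 'venue'];

async function importAll(createReadStream, createProcessStream, createIndexStream) {
  for (const placetype of placetypes) {
    // reading from disk, processing, and indexing into Elasticsearch all run
    // in one pipeline, so the I/O interleaves, progress logs appear right away,
    // and no placetype is ever held fully in memory
    await pipeline(
      createReadStream(placetype),
      createProcessStream(placetype),
      createIndexStream()
    );
  }
}

module.exports = importAll;
```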
Taking a look at the acceptance tests, there are 5 different issues happening. You can compare against dev2 as of this writing (October 13, 2016) to see the difference.

**Daly City** I believe this is a variant on the issue where we almost never return admin areas for autocomplete queries with a focus. There were already venues being returned ahead of Daly City, now there are just more.

**4th and King** There's a new entry for the 4th and King transit station in SF. This one is probably ok.

**Newfoundland and Labrador** The scores for the venues that start with "Newfoundland and Labrador" are actually identical to the region. Perhaps we should apply a small boost to all admin areas? Even a 1.1x boost here would be enough. I'll investigate later (a rough sketch of such a boost follows this comment).

**Maui, Hawaii** This actually has nothing to do with the duplicate Maui; it appears that it's simply because "Maui Maui" is shorter than "Maui County", and so the relevance score is higher. Other "Maui XXXX" results show up with a tied score. Here the score is significantly higher for "Maui Maui", so I don't think we can boost our way out of it. One solution might be to add "Maui" as an alt name for the county, but this would mean we can't fix it until next quarter.

**New South Wales** We already return the Geonames record for New South Wales first, but it has the name "State of New South Wales". It's boosted by the population, but the WOF record (name: "New South Wales") has no population info. I think this one is ok, and additionally we can and should add the population data to WOF.

**Summary** Other than Maui Maui, most of these are easily fixable.
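A rough sketch of the kind of admin-area boost floated for the Newfoundland and Labrador case, written as a plain Elasticsearch `function_score` query; the `name.default` and `layer` field names and the list of admin layers are assumptions based on how Pelias documents are commonly structured, not the actual query the API builds:

```javascript
// Apply a small constant boost (1.1x) to administrative-area documents so a
// region edges out venues with otherwise identical text scores.
const adminLayers = ['country', 'region', 'county', 'locality', 'neighbourhood'];

const query = {
  function_score: {
    query: { match: { 'name.default': 'newfoundland and labrador' } },
    functions: [
      // only documents in an admin layer get the extra weight
      { filter: { terms: { layer: adminLayers } }, weight: 1.1 }
    ],
    score_mode: 'multiply',
    boost_mode: 'multiply'
  }
};

module.exports = query;
```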
This includes an edge case for Hawaii to handle islands which are mostly stored as counties in our data currently. See pelias/whosonfirst#94.
This will not be able to pass until we have alt-names, or better data for islands. See pelias/whosonfirst#94
Our previous regular expression for filtering venue bundles was too strict, and would filter out bundles that are specific to a single region within a country (most bundles are for an entire country). In particular, venue bundles are split up for each US state, so we were missing quite a few important venues. Connects #94
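To illustrate that fix with simplified, hypothetical bundle names (the importer's actual filenames and regex may differ): a pattern that only accepts country-level venue bundles drops the per-state US bundles, so the region component needs to be optional.

```javascript
// country-only pattern: matches whosonfirst-data-venue-us-latest but not the
// per-state bundle whosonfirst-data-venue-us-ny-latest
const tooStrict = /^whosonfirst-data-venue-[a-z]{2}-latest/;

// relaxed pattern: the optional "-<region>" group also admits per-state bundles
const relaxed = /^whosonfirst-data-venue-[a-z]{2}(-[a-z0-9]+)?-latest/;

['whosonfirst-data-venue-us-latest', 'whosonfirst-data-venue-us-ny-latest']
  .forEach(name => console.log(name, tooStrict.test(name), relaxed.test(name)));
// -> whosonfirst-data-venue-us-latest true true
// -> whosonfirst-data-venue-us-ny-latest false true
```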
I have received word from the WOF team that WOF Venues are pretty low priority for them, as there's lots of other work to be done. At this time enabling venue imports should still be as easy as toggling a config flag (`importVenues`).
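For reference, a sketch of how such a flag is typically consumed; `importVenues` is the option named in this thread, while the exact config path (`imports.whosonfirst.importVenues`) is an assumption based on how other Pelias importers read `pelias.json`:

```javascript
const peliasConfig = require('pelias-config');

// generate() merges the local pelias.json with the library defaults
const config = peliasConfig.generate();

// only download and stream venue bundles when the flag is explicitly enabled
const importVenues = config.imports.whosonfirst.importVenues === true;

if (importVenues) {
  // venue bundles would be fetched and imported in addition to admin data
}
```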
After some recent discussion it sounds like we have no plans to continue supporting WOF venue downloads going forward. The new data hosting for Who's on First sponsored by Geocode Earth is not going to publish them, and we expect to remove support for this functionality in this importer.
BREAKING CHANGE: Because we do not expect Who's on First to update venue data, and we will likely never publish that data to the new data host from Geocode Earth, the `importVenues` option is now deprecated. Connects #94
Who's on First now includes many venues. The data is split across several hundred repos in the whosonfirst-data GitHub organization, so a big challenge will simply be gathering all the data. Several of the repositories use git-lfs as well.
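As an illustration of the gathering problem, a hedged sketch that enumerates venue repositories in the whosonfirst-data GitHub organization through the public REST API; the filtering on repository name is a guess at how venue repos are distinguished, and cloning/LFS handling is left out:

```javascript
// List clone URLs for repos in the whosonfirst-data org whose name mentions
// "venue", paging through the GitHub REST API 100 repos at a time.
// Requires Node 18+ for the global fetch.
async function listVenueRepos() {
  const repos = [];
  for (let page = 1; ; page++) {
    const res = await fetch(
      `https://api.github.com/orgs/whosonfirst-data/repos?per_page=100&page=${page}`
    );
    const batch = await res.json();
    if (!Array.isArray(batch) || batch.length === 0) break;
    repos.push(...batch.filter(r => r.name.includes('venue')).map(r => r.clone_url));
  }
  return repos;
}

listVenueRepos().then(urls => console.log(`${urls.length} venue repos found`));
```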
On the importer side, we are currently able to squeeze all the WOF administrative area records into memory, which obviously won't work with millions of venues.
Has to be done to allow for dev work:
* create script to set up test data directory with example venue data (no longer needed, because the venue bundles are published and can be downloaded directly)

Has to be done before production readiness:

Can be done as follow up improvements: