Import Who's on First venues #94

Closed
9 of 14 tasks
orangejulius opened this issue Jun 3, 2016 · 6 comments
Comments

@orangejulius
Member

orangejulius commented Jun 3, 2016

Who's on First now includes many venues. The data is split across several hundred repos in the whosonfirst-data GitHub organization, so a big challenge will simply be gathering all the data. Several of the repositories use git-lfs as well.

On the importer side, we are currently able to squeeze all the WOF administrative area records into memory, which obviously won't work with millions of venues.

Has to be done to allow for dev work:

Has to be done before production readiness:

  • improve WOF venue generation code to build more venue bundles (preferably we get one giant venue bundle) (this has been done by the Who's on First team!)
  • change list of WOF layers in API so venue and address records in WOF can be queried (similar to Update allowed list of layers for Geonames api#569) (fixed in Add venues to list of wof layers api#645)
  • ensure admin lookup code doesn't try to load all the venues if it's pointed at a directory containing WOF admin and venue data (update: it doesn't, because like the wof importer, wof-pip-service uses the meta files to know what to load)
  • (Mapzen Search specific) update chef scripts to download venue bundles OR update chef scripts to use the JavaScript downloader, and update that to include venue data (preferred). For now, this can just be a big hardcoded list
  • Update whosonfirst repo readme with venue information
  • Introduce configuration option to disable importing venues (Read venue import and download settings from config #142)
  • review any acceptance test failures with a full planet build and resolve them to our satisfaction
  • Ensure performance is reasonable, since this may add 10-15 million new records!
  • Update the installation docs and data sources docs

Can be done as follow-up improvements:

  • Rewrite downloader script (again) to be faster and smarter about downloading venue data (Re-rewrite downloader #135)
  • Write code that allows us to handle street addresses in WOF records. This can mirror the address duplicating code in the OSM importer
  • Import category tags and normalize them to the common taxonomy (this still needs some definition). The category info will live in https://github.com/pelias/categories soon
@trescube
Contributor

trescube commented Jun 3, 2016

I was poking around in the venue data recently and noticed that there are some Manhattan records with multiple hierarchies that are also placed in New Jersey.

@orangejulius
Member Author

Are there enough that reporting them and fixing them manually(-ish) would be difficult?

@trescube
Contributor

trescube commented Jun 3, 2016

I found 4090 just in that area but am working on a script to check elsewhere.

@dianashk dianashk added this to the WOF Venues milestone Jul 29, 2016
orangejulius added a commit that referenced this issue Aug 3, 2016
Previously, the WOF importer loaded all records into memory in one
stream, and then processed and indexed the records in Elasticsearch in a
second stream after the first stream was done.

This has several problems:
* It requires that all data can fit into memory. While this is not
  _so_ bad for WOF admin data, where a reasonably new machine can handle
  things just fine, it's horrible for venue data, where there are already
  10s of millions of records.
* It's slower: by separating the disk and network I/O sections, they
  can't be interleaved to speed things up.
* It doesn't give good feedback when running the importer that something
  is happening: the importer sits for several minutes loading records
  before the dbclient progress logs start displaying.

This change fixes all those issues, by processing all records in a
single stream, starting at the highest hierarchy level, and finishing at
the lowest, so that all records always have the admin data they need to
be processed.

Fixes #101
Fixes #7
Connects #94
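The single-stream, hierarchy-ordered approach this commit describes can be sketched as follows. This is a simplification with assumed placetype names, not the importer's actual code; a real importer would chain object-mode streams rather than iterate arrays.

```javascript
// Process WOF records in one pass, ordered from the highest placetype
// down to the lowest, so each record's parent admin records have
// already been processed by the time the record itself is handled.
const PLACETYPE_ORDER = [
  'continent', 'country', 'region', 'county',
  'locality', 'neighbourhood', 'venue'
];

function importAll(recordsByPlacetype, processRecord) {
  PLACETYPE_ORDER.forEach(placetype => {
    (recordsByPlacetype[placetype] || []).forEach(processRecord);
  });
}
```

Because processing is interleaved record-by-record, disk and network I/O overlap and progress is visible immediately, instead of only after a long up-front load.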
orangejulius added a commit that referenced this issue Aug 3, 2016
orangejulius added a commit that referenced this issue Aug 3, 2016
orangejulius added a commit that referenced this issue Aug 4, 2016
@orangejulius orangejulius mentioned this issue Aug 30, 2016
@orangejulius
Member Author

orangejulius commented Oct 13, 2016

Taking a look at the acceptance tests, there are 5 different issues happening. You can compare against dev2 as of this writing (October 13, 2016) to see the difference.

Daly City

I believe this is a variant of the issue where we almost never return admin areas for autocomplete queries with a focus. There were already venues being returned ahead of Daly City; now there are just more.

4th and King

There's a new entry for the 4th and King transit station in SF. This one is probably ok.

Newfoundland and Labrador

[screenshot from 2016-10-13 19:21:26]

The scores for the venues that start with "Newfoundland and Labrador" are actually identical to the score of the region itself. Perhaps we should apply a small boost to all admin areas? Even a 1.1x boost would be enough here. I'll investigate later.
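If we went the boost route, it might look something like the following Elasticsearch `function_score` query. This is a sketch only: the field names, layer list, and query shape are assumptions for illustration, not our actual production query.

```javascript
// Multiply the relevance score of admin-layer records by 1.1 so that,
// on an otherwise tied score, a region outranks same-named venues.
const adminBoostQuery = {
  function_score: {
    query: { match: { 'name.default': 'newfoundland and labrador' } },
    functions: [{
      filter: { terms: { layer: ['country', 'region', 'county', 'locality'] } },
      weight: 1.1
    }],
    boost_mode: 'multiply'
  }
};
```

A constant multiplier like this only breaks ties; it would not help in cases where a venue's score is already significantly higher, as in the Maui example below.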

Maui, Hawaii

[screenshot from 2016-10-13 19:31:27]

This actually has nothing to do with the duplicate Maui; it appears it's simply because "Maui Maui" is shorter than "Maui County", so the relevance score is higher. Other "Maui XXXX" results show up with a tied score. Here the score is significantly higher for "Maui Maui", so I don't think we can boost our way out of it. One solution might be to add "Maui" as an alt name for the county, but that would mean we can't fix it until next quarter.

New South Wales

We already return the Geonames record for New South Wales first, but it has the name "State of New South Wales". It's boosted by its population, while the WOF record (name: "New South Wales") has no population info. I think this one is ok; additionally, we can and should add the population data to WOF.

Summary

Other than Maui Maui, most of these are easily fixable.

orangejulius added a commit to pelias/api that referenced this issue Nov 4, 2016
This includes an edge case for Hawaii to handle islands which are mostly
stored as counties in our data currently. See pelias/whosonfirst#94.
orangejulius added a commit to pelias/acceptance-tests that referenced this issue Nov 22, 2016
This will not be able to pass until we have alt-names, or better data
for islands. See pelias/whosonfirst#94
@orangejulius orangejulius removed their assignment Jul 27, 2017
orangejulius added a commit that referenced this issue Nov 9, 2017
Our previous regular expression for filtering venue bundles was too
strict, and would filter out bundles that are specific to a single
region within a country (most bundles are for an entire country).

In particular, venue bundles are split up for each US state, so we were
missing quite a few important venues.

Connects #94
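The difference this commit describes can be illustrated with hypothetical patterns. These regexes and bundle file names are illustrations only, not the importer's actual filtering code:

```javascript
// A strict pattern matches only country-level venue bundles, silently
// dropping per-region bundles (e.g. the per-US-state venue bundles).
const strict = /^wof-venue-[a-z]{2}-latest/;
// A relaxed pattern allows an optional region segment after the
// country code, so region-level bundles are kept too.
const relaxed = /^wof-venue-[a-z]{2}(-[a-z0-9]+)?-latest/;

strict.test('wof-venue-us-latest');     // matches: country bundle
strict.test('wof-venue-us-ny-latest');  // no match: state bundle dropped
relaxed.test('wof-venue-us-ny-latest'); // matches: state bundle kept
```
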
@ghost ghost assigned orangejulius Nov 9, 2017
@ghost ghost added in progress and removed processed labels Nov 9, 2017
@orangejulius orangejulius removed their assignment Aug 3, 2018
@orangejulius
Member Author

I have received word from the WOF team that WOF venues are a pretty low priority for them, as there's lots of other work to be done. At this time, enabling venue imports should still be as easy as toggling a config flag (`importVenues` in `pelias.json`). We welcome reports of how well this works out for people, but we don't intend to support it as a production-ready configuration any time soon.
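For reference, a minimal sketch of what that flag looks like in `pelias.json` (the surrounding keys shown here may differ from your setup):

```json
{
  "imports": {
    "whosonfirst": {
      "datapath": "/data/whosonfirst",
      "importVenues": true
    }
  }
}
```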

@orangejulius
Member Author

After some recent discussion, it sounds like we have no plans to continue supporting WOF venue downloads going forward. The new data hosting for Who's on First, sponsored by Geocode Earth, is not going to publish them, and we expect to remove support for this functionality from this importer.

orangejulius added a commit that referenced this issue Apr 23, 2020
BREAKING CHANGE: Because we do not expect Who's on First to update venue data, and we
will likely never publish that data to the new data host from Geocode
Earth, the `importVenues` option is now deprecated.

Connects #94
orangejulius added a commit that referenced this issue Jun 9, 2020