Import Who's on First venues #94
I was poking around in the venue data recently and noticed that there are some Manhattan records with multiple hierarchies that are also placed in New Jersey.
Are there enough that reporting them and fixing them manually(-ish) would be difficult?
I found 4090 just in that area but am working on a script to check elsewhere.
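A minimal sketch of the kind of check described above, assuming the records are standard Who's on First GeoJSON features where `wof:hierarchy` is an array of hierarchy objects with `*_id` entries; the function name and surrounding plumbing are illustrative, not the actual script:

```javascript
// Flag WOF records whose hierarchies disagree on which region they belong to,
// e.g. a Manhattan venue that is also placed in New Jersey.
// Assumes the standard WOF GeoJSON layout: properties['wof:hierarchy'] is an
// array of objects such as { country_id, region_id, county_id, ... }.
function hierarchiesDisagreeOnRegion(wofRecord) {
  const hierarchies = wofRecord.properties['wof:hierarchy'] || [];
  const regionIds = new Set(
    hierarchies
      .map(h => h.region_id)
      .filter(id => id !== undefined && id !== -1)
  );
  // more than one distinct region id means the record is placed in two regions
  return regionIds.size > 1;
}

module.exports = hierarchiesDisagreeOnRegion;
```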
Previously, the WOF importer loaded all records into memory in one stream, and then processed and indexed the records in Elasticsearch in a second stream after the first stream was done. This has several problems:

* It requires that all data fit into memory. While this is not _so_ bad for WOF admin data, where a reasonably new machine can handle things just fine, it's horrible for venue data, where there are already tens of millions of records.
* It's slower: by separating the disk and network I/O sections, they can't be interleaved to speed things up.
* It doesn't give good feedback that something is happening when running the importer: the importer sits for several minutes loading records before the dbclient progress logs start displaying.

This change fixes all those issues by processing all records in a single stream, starting at the highest hierarchy level and finishing at the lowest, so that all records always have the admin data they need to be processed.

Fixes #101
Connects #7
Connects #94
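For illustration, a minimal sketch of the single-stream ordering described in that change, using Node's built-in stream pipeline; `createReadStream`, `createProcessStream` and `createIndexStream` are hypothetical stand-ins for the importer's real components, and the placetype list is abbreviated:

```javascript
const { pipeline } = require('stream/promises');

// highest hierarchy level first, venues last, so every record's parents are
// already indexed by the time the record itself is processed
const placetypes = ['country', 'region', 'county', 'locality', 'neighbourhood', 'venue'];

async function importAll(createReadStream, createProcessStream, createIndexStream) {
  for (const placetype of placetypes) {
    // reading from disk, processing, and indexing into Elasticsearch all run
    // in one pipeline, so the I/O interleaves, progress logs appear right away,
    // and no placetype is ever held fully in memory
    await pipeline(
      createReadStream(placetype),
      createProcessStream(placetype),
      createIndexStream()
    );
  }
}

module.exports = importAll;
```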
Taking a look at the acceptance tests, there are 5 different issues happening. You can compare against dev2 as of this writing (October 13, 2016) to see the difference.

**Daly City** I believe this is a variant on the issue where we almost never return admin areas for autocomplete queries with a focus. There were already venues being returned ahead of Daly City, now there are just more.

**4th and King** There's a new entry for the 4th and King transit station in SF. This one is probably ok.

**Newfoundland and Labrador** The scores for the venues that start with "Newfoundland and Labrador" are actually identical to the region. Perhaps we should apply a small boost to all admin areas? Even a 1.1x boost here would be enough. I'll investigate later (a rough sketch of such a boost follows this comment).

**Maui, Hawaii** This actually has nothing to do with the duplicate Maui; it appears that it's simply because "Maui Maui" is shorter than "Maui County", and so the relevance score is higher. Other "Maui XXXX" results show up with a tied score. Here the score is significantly higher for "Maui Maui", so I don't think we can boost our way out of it. One solution might be to add "Maui" as an alt name for the county, but this would mean we can't fix it until next quarter.

**New South Wales** We already return the Geonames record for New South Wales first, but it has the name "State of New South Wales". It's boosted by the population, but the WOF record (name: "New South Wales") has no population info. I think this one is ok, and additionally we can and should add the population data to WOF.

**Summary** Other than Maui Maui, most of these are easily fixable.
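A rough sketch of the kind of admin-area boost floated for the Newfoundland and Labrador case, written as a plain Elasticsearch `function_score` query; the `name.default` and `layer` field names and the list of admin layers are assumptions based on how Pelias documents are commonly structured, not the actual query the API builds:

```javascript
// Apply a small constant boost (1.1x) to administrative-area documents so a
// region edges out venues with otherwise identical text scores.
const adminLayers = ['country', 'region', 'county', 'locality', 'neighbourhood'];

const query = {
  function_score: {
    query: { match: { 'name.default': 'newfoundland and labrador' } },
    functions: [
      // only documents in an admin layer get the extra weight
      { filter: { terms: { layer: adminLayers } }, weight: 1.1 }
    ],
    score_mode: 'multiply',
    boost_mode: 'multiply'
  }
};

module.exports = query;
```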
This includes an edge case for Hawaii to handle islands which are mostly stored as counties in our data currently. See pelias/whosonfirst#94.
This will not be able to pass until we have alt-names, or better data for islands. See pelias/whosonfirst#94
Our previous regular expression for filtering venue bundles was too strict, and would filter out bundles that are specific to a single region within a country (most bundles are for an entire country). In particular, venue bundles are split up for each US state, so we were missing quite a few important venues. Connects #94
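To illustrate that fix with simplified, hypothetical bundle names (the importer's actual filenames and regex may differ): a pattern that only accepts country-level venue bundles drops the per-state US bundles, so the region component needs to be optional.

```javascript
// country-only pattern: matches whosonfirst-data-venue-us-latest but not the
// per-state bundle whosonfirst-data-venue-us-ny-latest
const tooStrict = /^whosonfirst-data-venue-[a-z]{2}-latest/;

// relaxed pattern: the optional "-<region>" group also admits per-state bundles
const relaxed = /^whosonfirst-data-venue-[a-z]{2}(-[a-z0-9]+)?-latest/;

['whosonfirst-data-venue-us-latest', 'whosonfirst-data-venue-us-ny-latest']
  .forEach(name => console.log(name, tooStrict.test(name), relaxed.test(name)));
// -> whosonfirst-data-venue-us-latest true true
// -> whosonfirst-data-venue-us-ny-latest false true
```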
I have received word from the WOF team that WOF Venues are pretty low priority for them, as there's lots of other work to be done. At this time enabling venue imports should still be as easy as toggling a config flag (`importVenues`).
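For reference, a sketch of how such a flag is typically consumed; `importVenues` is the option named in this thread, while the exact config path (`imports.whosonfirst.importVenues`) is an assumption based on how other Pelias importers read `pelias.json`:

```javascript
const peliasConfig = require('pelias-config');

// generate() merges the local pelias.json with the library defaults
const config = peliasConfig.generate();

// only download and stream venue bundles when the flag is explicitly enabled
const importVenues = config.imports.whosonfirst.importVenues === true;

if (importVenues) {
  // venue bundles would be fetched and imported in addition to admin data
}
```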
After some recent discussion it sounds like we have no plans to continue supporting WOF venue downloads going forward. The new data hosting for Who's on First sponsored by Geocode Earth is not going to publish them, and we expect to remove support for this functionality in this importer.
BREAKING CHANGE: Because we do not expect Who's on First to update venue data, and we will likely never publish that data to the new data host from Geocode Earth, the `importVenues` option is now deprecated. Connects #94
Who's on First now includes many venues. The data is split across several hundred repos in the whosonfirst-data GitHub organization, so a big challenge will simply be gathering all the data. Several of the repositories use git-lfs as well.
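As an illustration of the gathering problem, a hedged sketch that enumerates venue repositories in the whosonfirst-data GitHub organization through the public REST API; the filtering on repository name is a guess at how venue repos are distinguished, and cloning/LFS handling is left out:

```javascript
// List clone URLs for repos in the whosonfirst-data org whose name mentions
// "venue", paging through the GitHub REST API 100 repos at a time.
// Requires Node 18+ for the global fetch.
async function listVenueRepos() {
  const repos = [];
  for (let page = 1; ; page++) {
    const res = await fetch(
      `https://api.github.com/orgs/whosonfirst-data/repos?per_page=100&page=${page}`
    );
    const batch = await res.json();
    if (!Array.isArray(batch) || batch.length === 0) break;
    repos.push(...batch.filter(r => r.name.includes('venue')).map(r => r.clone_url));
  }
  return repos;
}

listVenueRepos().then(urls => console.log(`${urls.length} venue repos found`));
```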
On the importer side, we are currently able to squeeze all the WOF administrative area records into memory, which obviously won't work with millions of venues.
Has to be done to allow for dev work:
* create script to set up test data directory with example venue data (no longer needed, because the venue bundles are published and can be downloaded directly)

Has to be done before production readiness:

Can be done as follow up improvements: