-
Notifications
You must be signed in to change notification settings - Fork 14.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[SIP-28] Proposal for Geocoding #8544
Comments
Issue-Label Bot is automatically applying the label Links: app homepage, dashboard and code for this bot. |
Hi @Sascha-Gschwind - thanks for submitting the SIP. Do you plan on implementing the actual geocoding workflow as a Celery job? |
https://www.mapbox.com/legal/tos/#[GAGA] Are you assuming doing atomic updates against the database or processing a block of data in-memory? I'm not sure whether data munging / enrichment is in scope for Superset. We spoke about a geo service in the past, but enrichment wasn't in scope. |
Thanks @willbarrett and @mistercrunch for your feedback! We have to admit we did not check the mapbox ToS. We are now looking into other geocoding services that we could use where we do not infringe any ToS. One possible geocoding service would be Maptiler: https://www.maptiler.com/blog/2019/09/announcing-maptiler-cloud-geocoding-api.html. Our bachelor thesis examinator has a very good contact to the people in charge of Maptiler. Maptiler Geocoding is not yet implemented in GeoPy though, but could definetely be achieved meaning in a first step there would be an implementation without GeoPy which later can be changed. Regarding implementing the actual geocoding workflow as a Celery job: Yes it is planned that the geocoding workflow is implemented this way in the final version. First we want to create a proof of concept and since we are not very familiar with Celery yet we planned to implement that later in the process. |
Hi @Sascha-Gschwind. Thanks for the response. It sounds to me like this isn't a great fit for inclusion in Superset at this point. As Max brought up, data enrichment isn't part of the scope of this project. You may want to check out one of our related projects, Apache Airflow. Airflow is designed for building data pipelines, and you could implement what you need for data enrichment on top of that platform without needing to extend the core. After enriching your data, Superset can be used to visualize and explore the data. Were I tasked with a similar problem in my day to day work, that's how I would implement it. I hope this recommendation helps! |
@willbarrett Thanks for clarifying! We really thought this is a useful feature for Superset since many other BI tools support this kind of functionality, but can totally see where you guys are coming from since there seems to be an already well established workflow using Apache tools! |
@willbarrett: (I'm the advisor of Sascha-Geschwind and his two student colleagis). I respect the principle to keep a narrow scope of this nice BI tool. But I'm unsure why you insist calling Geocoding an "enrichment", since "Geocoding is the computational process of transforming a physical address description to a location on the Earth's surface (spatial representation in numerical coordinates)." Without geocoding you miss many opportunities of location intelligence to visualize spatial data. Obviously, behind an adress Geocoding API there's a data set of building addresses. But involving other data is also the case for many other geocoding systems, like plus code (see https://en.wikipedia.org/wiki/Geocode ). And the result of this "transformation" is being stored just for computational reasons - very much like aggregations are also often materialized. Are there other reasons of this kind of geocoding which bother you? |
Data Enrichment is the process of supplementing incomplete raw data with additional data from an authoritative source. Geocoding is a classic example. Data enrichment isn't something that Superset currently does at all, so it's a big proposal and should be thought about in the context of other possible data enrichment adds. I am generally of the opinion that whatever format people's data is in, Superset should make an attempt to "just work". While the changes proposed above are probably not a great fit, there might be another approach that would fit better. Maybe adding a Geocode option to the deck.gl charts could work? This wouldn't include the optimizations around things like requests per second or caching, but it would be a simpler implementation, and would stay within the MapBox TOS. |
I agree that adding some form of geocoding to the front-end would be more acceptable than a back-end solution. I concur with @suddjian's definition of data enrichment, and concur that geocoding a list of addresses in a database is a classic example of data enrichment. It is also an activity that is against the terms of service of the majority of map vendors, or those vendors provide rules for the allowed duration of retention. I have a few concerns:
|
Geocoding using GeoPy and the public Nominatim (OpenStreetMaps geocoding service) should work without the TOS limitations of Mapbox or other commercial providers. That said, I agree with @willbarrett that it seems outside of Superset's core features and scope. I've prepared a PR which adds a geocoder component to some of the Deck.gl chart types but it is not exactly what you're looking for. It simply takes in user text input and geocodes that input to the current map but does not persist this data anywhere. The existing PR also needs to be refactored to the new plugins approach before it can be merged in. You can find that here |
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions. For admin, please label this issue |
@Sascha-Gschwind adding support for this in the regular viz UI flow is still some ways off, but geocoding functionality was recently added to the backend as a chart data request post processing operation: #9661 . Currently only gehoash decoding/encoding and parsing of geodesic point strings is supported, but we could potentially add geocoding of addresses to the flow once this feature matures. |
@Sascha-Gschwind is there any intention to carry this proposal forward in discussion/voting or implementation? If not, shall we close it out? |
@rusackas We don't plan to carry on with the proposal since we all moved on to different companies etc. where we are working. You can definitely close it out. |
[SIP] Proposal for Geocoding
Motivation
A Superset user wants to use address-based data to generate charts like
deck.gl Scatterplot
. They then first need to convert their address based data using an external source so it has the requiredlatitude/longitude
columns.Many other BI tools can convert addresses automatically.
Proposed Change
We want to implement a feature using the
GeoPy
package with theMapbox Geocoding API
that can convert addresses tolatitude/longitude
and save those values as additional columns or overwrite certain columns in the same table.To make the API calls we plan to use the same API-Key that is already used for the background maps (
Mapbox API Key
).The feature will be available under the menu "Sources" > "Geocode Addresses" and will be implemented asynchronously. Only one geocoding can be in progress at once though. There are multiple reasons for this decision:
If the geocoding is in process and the user navigates to the "Geocode Addresses" URL he will see a progressbar and will have the ability to cancel the process. If no process is ongoing the geocoding form will be shown.
The user can decide what happens if anything goes wrong (for ex. call limit reached, connection issues, etc.) or the process is interrupted. He can choose to save the already converted data or discard it.
New or Changed Public Interfaces
New dependencies
Migration Plan and Compatibility
The documentation will likely need to be added which describes the usage of this new feature once this is merged into master
Rejected Alternatives
deck.gl Scatterplot
but rejected the idea since geocoding itself is an expensive operation and should only be done once on a specific dataset.The text was updated successfully, but these errors were encountered: