dedupe is too strong #65
Comments
Just a few extra examples of where deduping is working and not working. In one case, two buildings have the same properties and only the closest shows up in the results, likely because they don't have IDs and we are using the properties to dedupe. In another, two parks/baseball pitches have exactly the same properties but are treated as different features, likely because their IDs are unique. This makes me think we should only compare properties of features across tiles, not properties of features in the same tile. That would still solve tile-boundary duplicates, but it doesn't cover the case where two distinct buildings in different tiles have the same properties and would be considered duplicates.
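To make the "only compare across tiles" idea concrete, here is a minimal sketch in Python. It is not the project's implementation: the feature layout (`tile`, `layer`, `properties`, `distance`) and the function name `dedupe_across_tiles` are hypothetical, chosen only for illustration.

```python
from typing import Any

Feature = dict[str, Any]


def dedupe_across_tiles(features: list[Feature]) -> list[Feature]:
    """Drop a feature only if a matching feature from a DIFFERENT tile was already kept.

    Assumed (hypothetical) feature layout:
      tile:       (z, x, y) tuple identifying the source tile
      layer:      source layer name
      properties: dict of feature properties
      distance:   distance from the query point
    """
    kept: list[Feature] = []
    # Visit the closest features first so the closest copy is the one that survives.
    for feature in sorted(features, key=lambda f: f["distance"]):
        is_duplicate = any(
            other["tile"] != feature["tile"]
            and other["layer"] == feature["layer"]
            and other["properties"] == feature["properties"]
            for other in kept
        )
        if not is_duplicate:
            kept.append(feature)
    return kept


features = [
    {"tile": (14, 100, 200), "layer": "building", "properties": {"type": "house"}, "distance": 5.0},
    {"tile": (14, 100, 200), "layer": "building", "properties": {"type": "house"}, "distance": 9.0},
    {"tile": (14, 101, 200), "layer": "building", "properties": {"type": "house"}, "distance": 12.0},
]
result = dedupe_across_tiles(features)
```

In this sketch the two same-tile buildings are both kept, while the third is dropped because it matches a kept feature from another tile. That third feature could just as well be a genuinely different building, which is exactly the unresolved case described above.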
I am not sure there is a clear answer to the "right" way to do deduping. I think part of this is that it really depends on the type of data that exists:
If you wanted to find the one closest building in OSM right now, it would be ideal to dedupe. If you wanted to find all the closest buildings, I feel that deduping might not be correct. The problems you have seen with multiple tiles do in fact make the results appear strange, and I think it is something we should seriously consider. The problem comes down to the wide variety of data that we can have in vector tiles.
If you are attempting to find a specific rubber ducky that is closest to you, it can be quite complex. You could have a standard-sized rubber ducky that fits quite well into a single tile, and it may be the only rubber ducky around. However, you might also have a jumbo rubber ducky that spreads across multiple tiles and has false edges on it from the other tiles you query. In this case deduping is very good. Additionally, there might be a set of rubber duckies in your tile and you want to know all the rubber duckies in your area. In this case deduping might be too aggressive, because it would think all rubber duckies are the same since their properties are the same. If all our rubber duckies have unique IDs on them, then we do want to dedupe. However, if they do not, then we might be overwhelmed by the number of rubber duckies if we do not enable deduping. Very simply put, it is not always smooth sailing when you are looking for rubber duckies.
Therefore, I suggest that we allow users to decide if they want to dedupe or not. We could even set a flag for what type of deduping occurs.
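A user-selectable flag along these lines could be sketched as follows. This is an illustration under assumptions, not an existing API: the mode names (`"none"`, `"id"`, `"id_or_properties"`), the `dedupe` function, and the feature fields are all hypothetical, and property values are assumed to be hashable scalars (strings, numbers, booleans).

```python
from typing import Any, Callable, Hashable, Optional

Feature = dict[str, Any]
KeyFn = Callable[[Feature], Optional[Hashable]]


def _no_key(feature: Feature) -> Optional[Hashable]:
    # No key: nothing is ever treated as a duplicate.
    return None


def _id_key(feature: Feature) -> Optional[Hashable]:
    # Features without an ID get no key and are therefore always kept.
    fid = feature.get("id")
    return ("id", fid) if fid is not None else None


def _id_or_properties_key(feature: Feature) -> Optional[Hashable]:
    # Fall back to the properties when there is no ID.
    id_key = _id_key(feature)
    if id_key is not None:
        return id_key
    return ("props", tuple(sorted(feature["properties"].items())))


# Hypothetical mode names; the thread does not fix a list of options.
KEY_FUNCTIONS: dict[str, KeyFn] = {
    "none": _no_key,
    "id": _id_key,
    "id_or_properties": _id_or_properties_key,
}


def dedupe(features: list[Feature], mode: str = "id") -> list[Feature]:
    """Keep the closest feature for each dedupe key; features with no key are kept."""
    if mode not in KEY_FUNCTIONS:
        raise ValueError(f"unknown dedupe mode: {mode!r}")
    key_fn = KEY_FUNCTIONS[mode]
    seen: set[Hashable] = set()
    kept: list[Feature] = []
    for feature in sorted(features, key=lambda f: f["distance"]):
        key = key_fn(feature)
        if key is not None:
            if key in seen:
                continue
            seen.add(key)
        kept.append(feature)
    return kept


duckies = [
    {"id": None, "properties": {"kind": "rubber ducky"}, "distance": 1.0},
    {"id": None, "properties": {"kind": "rubber ducky"}, "distance": 2.0},
    {"id": 7, "properties": {"kind": "jumbo ducky"}, "distance": 3.0},  # same duck, tile A
    {"id": 7, "properties": {"kind": "jumbo ducky"}, "distance": 4.0},  # same duck, tile B
]
counts = {mode: len(dedupe(duckies, mode)) for mode in KEY_FUNCTIONS}
```

With this sample data, `"none"` returns every ducky, `"id"` merges only the jumbo ducky that spans two tiles, and `"id_or_properties"` also collapses the two ID-less duckies into one, which is the "too aggressive" case described above.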
@flippmoke 🦆 ❤️ Totally agree. There's no perfect solution (unless we start unioning geometries, which isn't out of the question, but is out of scope of this issue). I like the idea of providing options to the user, and think we can do a good job at keeping it simple in the code base, especially since we have the logic written already. Examples of dedupe options (not saying we have to implement them all):
Maybe another way of breaking it out is:
Deduplication of features is too strong right now. Consider two buildings (polygons) that are distinct features but have few or no properties and no ID. For all intents and purposes these should be treated as two unique features, so that we avoid removing important data.
Perhaps it's best to only dedupe based on IDs for now, while we think about other ways to best dedupe with properties.
cc @flippmoke