Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

TZDB shapefile is out of date #3817

Closed
ianthetechie opened this issue Nov 3, 2022 · 6 comments · Fixed by #4446
Closed

TZDB shapefile is out of date #3817

ianthetechie opened this issue Nov 3, 2022 · 6 comments · Fixed by #4446

Comments

@ianthetechie
Copy link
Contributor

Hey guys,

I noticed that the TZDB shapefile is now pretty far out of date. I thought updating this would be straightforward, but looking at the script I'm seeing what appears to be a set of manually defined aliases. This raised a few questions.

  1. Could we document why / how these are used?
  2. I noticed as well that there is a USE_SYSTEM_TZ_DB CMAKE option as well that's disabled by default, but wasn't sure how/if it related to this. Obviously the point of the sqlite tzdb is to determine which timezone a point is located in. I would assume system functions handle the rest, since this is basically just a shapefile and not the whole tzdb ruleset.

TL;DR any gotchas just doing an update to 2022f?

@kevinkreiser
Copy link
Member

kevinkreiser commented Nov 3, 2022

@ianthetechie this is a great question!

  1. yes we could and we should document how this works
  2. yep so many os's bundle timezone information. our date_time library from (howard hinnant) can make use of that bundled data or bring its own. the problem with using the platform dependent (os bundled data) is that we build timezone information into the tiles. every node in the graph stores an integer which is a simple index into a list of timezones built into the library. so the library has a static list of timezones and the tiles have integer indices into that list. the timezone database needs to have entries which match those found in the date_time timezone list so that we can say oh which timezone polygon does this node fall within, now which timezone index is that for the date_time class. so that is why we disable using the system tz db. it is also why we must be careful when updating valhalla_build_timezones to the latest data

long story short, i am not opposed to updating to newer data but we need to do the work of making sure the same indices are used for the same semantic timezones. eg. maybe there are new ones, these would have to go on the end. maybe names have changed we would then have to update the aliases. i havent worked out all the details and maybe its less tedious than i am imagining but we need to proceed cautiously to avoid breaking backward/forward compatibility. in other words, when we make a chance we either have to do one of two things:

  1. make sure that both: old data paired with new code AND new data paired with old code will work after the change, OR
  2. update the major version of valhalla

I guess you probably realized we are trying very very hard not to do the latter until we have sufficiently many important breaking changes as to warrant it.

@nilsnolde
Copy link
Member

This has gotten some new traction, see #4363. That's pretty bad, apart from the changes in geometry (not too little in the US, I visually compared the two releases), DST is probably the worst affected.

I looked into it for a bit and this is my summary of the process:

  1. We get the shapefile for the geometries and build our sqlite from that, containing the IANA timezone IDs (e.g. Europe/Berlin), but also various aliases for a bit of normalization (e.g. Europe/Vaduz -> Europe/Zurich). This is ONLY the timezone ID with geometries though, no rules.
  2. The actual IANA timezone rules are in ./date_time with ASCII files and they need to match the shapefile release, e.g. here. They're compiled to simple headers so we can include the info in baldr/tz_alt.cpp
  3. tz_alt.cpp creates a date::tzdb object from 2., where internally, the timezones will be sorted by their name
  4. When we want to use any tz info while building the graph, we use indices to that sorted timezone name vector of tzdb from 3. right here and below with from_index for the service.

As @kevinkreiser said, it's not super straight forward. While I think the aliases are fairly easy to deal with or will be dealt with implicitly, coming up with (ideally):

  • a stable way in the code to refer to timezones, so it's an easy process to update IANA tzdata every year which seems really important
  • while doing that, as kevin said, making sure we stay both forward-/backward-compatible with old data/new client and new data/old client

@nilsnolde
Copy link
Member

nilsnolde commented Oct 28, 2023

Not a very thought-through idea:

  1. to preserve old data/new client:
    1. instead of sorting the tzdb object's tz names, we need a stable structure that gives us numerical IDs. The ID -> tz_name from today's data has to be taken. That should be it.
  2. to preserve new data/old client:
    1. the aliases have to be reviewed to update maybe changed names (unless the tz names we alias to have changed, i.e. the ones we end up using, see 2.iii I guess)
    2. if there's new timezone names which qualify to alias to a more common name, they need to be added to aliases
    3. if aliased to timezone names changed, we'll have to add that as alias as well and not add it to the structure of 1., otherwise a new client wouldn't work with old data
    4. if there are entirely new timezones which can't be aliased to another one (e.g. half-hour timezones are sometimes added), they need to go to the back of the ID space of 1..
    5. if timezones were completely removed, we'll have to keep them in 1. as legacy with the same ID

So 2. is a bit of work & diving into the diff of 2017 tz's & the latest release. A bit tedious, but still pretty manageable. Most of that will have to be repeated every time we update the tzdata release, it'll never be a single line change. We'll have to live with that until a major version update, so best this is neatly documented by whoever ends up doing this chore. I'll update here if that should be me:)

Please tell me if there's any misunderstandings or so..

@nilsnolde
Copy link
Member

Can someone explain what the motivation was behind the aliases? It can't be mismatch between the shapefile and IANA data I'd hope, so maybe it's just an opinionated list for deduplication? Or how did you decide which timezone to alias to others?

It seems a lot of the names which should be aliased don't even exist in the shapefile (e.g. all the UTC ones), so likely the aliases come from somewhere else?

@nilsnolde
Copy link
Member

In fact, the vast majority of the aliases doesn't exist in the shapefile. Unless I'm missing some other part where those are used (doubt it though), that could be cleaned as part of the effort as well.

@nilsnolde
Copy link
Member

I'll keep this as a log of thoughts that occur. I'll aim for a more fleshed out implementation suggestion before starting the job, so this is more a log to keep things in one place.

before even updating any data, we could try to remove our hard-coded list of aliases and instead use what date_time already provides which would:

  • make the logic quite a bit simpler
  • lessen the maintenance & manual hassle when updating tzdb data
  • one more obstacle removed to try and move to c++20 date time lib at some point

the IANA data has LINKs which are their way of aliasing and hinnant's library already deals with these links as a fallback, and it seems our own list is identical, but will need to verify that, also if it's even feasible considering compat.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants