TZDB shapefile is out of date #3817

ianthetechie · 2022-11-03T07:16:05Z

Hey guys,

I noticed that the TZDB shapefile is now pretty far out of date. I thought updating this would be straightforward, but looking at the script I'm seeing what appears to be a set of manually defined aliases. This raised a few questions.

Could we document why / how these are used?
I noticed as well that there is a USE_SYSTEM_TZ_DB CMAKE option as well that's disabled by default, but wasn't sure how/if it related to this. Obviously the point of the sqlite tzdb is to determine which timezone a point is located in. I would assume system functions handle the rest, since this is basically just a shapefile and not the whole tzdb ruleset.

TL;DR any gotchas just doing an update to 2022f?

The text was updated successfully, but these errors were encountered:

kevinkreiser · 2022-11-03T14:03:52Z

@ianthetechie this is a great question!

yes we could and we should document how this works
yep so many os's bundle timezone information. our date_time library from (howard hinnant) can make use of that bundled data or bring its own. the problem with using the platform dependent (os bundled data) is that we build timezone information into the tiles. every node in the graph stores an integer which is a simple index into a list of timezones built into the library. so the library has a static list of timezones and the tiles have integer indices into that list. the timezone database needs to have entries which match those found in the date_time timezone list so that we can say oh which timezone polygon does this node fall within, now which timezone index is that for the date_time class. so that is why we disable using the system tz db. it is also why we must be careful when updating valhalla_build_timezones to the latest data

long story short, i am not opposed to updating to newer data but we need to do the work of making sure the same indices are used for the same semantic timezones. eg. maybe there are new ones, these would have to go on the end. maybe names have changed we would then have to update the aliases. i havent worked out all the details and maybe its less tedious than i am imagining but we need to proceed cautiously to avoid breaking backward/forward compatibility. in other words, when we make a chance we either have to do one of two things:

make sure that both: old data paired with new code AND new data paired with old code will work after the change, OR
update the major version of valhalla

I guess you probably realized we are trying very very hard not to do the latter until we have sufficiently many important breaking changes as to warrant it.

nilsnolde · 2023-10-28T12:33:47Z

This has gotten some new traction, see #4363. That's pretty bad, apart from the changes in geometry (not too little in the US, I visually compared the two releases), DST is probably the worst affected.

I looked into it for a bit and this is my summary of the process:

We get the shapefile for the geometries and build our sqlite from that, containing the IANA timezone IDs (e.g. Europe/Berlin), but also various aliases for a bit of normalization (e.g. Europe/Vaduz -> Europe/Zurich). This is ONLY the timezone ID with geometries though, no rules.
The actual IANA timezone rules are in ./date_time with ASCII files and they need to match the shapefile release, e.g. here. They're compiled to simple headers so we can include the info in baldr/tz_alt.cpp
tz_alt.cpp creates a date::tzdb object from 2., where internally, the timezones will be sorted by their name
When we want to use any tz info while building the graph, we use indices to that sorted timezone name vector of tzdb from 3. right here and below with from_index for the service.

As @kevinkreiser said, it's not super straight forward. While I think the aliases are fairly easy to deal with or will be dealt with implicitly, coming up with (ideally):

a stable way in the code to refer to timezones, so it's an easy process to update IANA tzdata every year which seems really important
while doing that, as kevin said, making sure we stay both forward-/backward-compatible with old data/new client and new data/old client

nilsnolde · 2023-10-28T13:24:28Z

Not a very thought-through idea:

to preserve old data/new client:
1. instead of sorting the tzdb object's tz names, we need a stable structure that gives us numerical IDs. The ID -> tz_name from today's data has to be taken. That should be it.
to preserve new data/old client:
1. the aliases have to be reviewed to update maybe changed names (unless the tz names we alias to have changed, i.e. the ones we end up using, see 2.iii I guess)
2. if there's new timezone names which qualify to alias to a more common name, they need to be added to aliases
3. if aliased to timezone names changed, we'll have to add that as alias as well and not add it to the structure of 1., otherwise a new client wouldn't work with old data
4. if there are entirely new timezones which can't be aliased to another one (e.g. half-hour timezones are sometimes added), they need to go to the back of the ID space of 1..
5. if timezones were completely removed, we'll have to keep them in 1. as legacy with the same ID

So 2. is a bit of work & diving into the diff of 2017 tz's & the latest release. A bit tedious, but still pretty manageable. Most of that will have to be repeated every time we update the tzdata release, it'll never be a single line change. We'll have to live with that until a major version update, so best this is neatly documented by whoever ends up doing this chore. I'll update here if that should be me:)

Please tell me if there's any misunderstandings or so..

nilsnolde · 2023-10-28T13:47:33Z

Can someone explain what the motivation was behind the aliases? It can't be mismatch between the shapefile and IANA data I'd hope, so maybe it's just an opinionated list for deduplication? Or how did you decide which timezone to alias to others?

It seems a lot of the names which should be aliased don't even exist in the shapefile (e.g. all the UTC ones), so likely the aliases come from somewhere else?

nilsnolde · 2023-10-28T13:57:50Z

In fact, the vast majority of the aliases doesn't exist in the shapefile. Unless I'm missing some other part where those are used (doubt it though), that could be cleaned as part of the effort as well.

nilsnolde · 2023-11-03T17:51:52Z

I'll keep this as a log of thoughts that occur. I'll aim for a more fleshed out implementation suggestion before starting the job, so this is more a log to keep things in one place.

before even updating any data, we could try to remove our hard-coded list of aliases and instead use what date_time already provides which would:

make the logic quite a bit simpler
lessen the maintenance & manual hassle when updating tzdb data
one more obstacle removed to try and move to c++20 date time lib at some point

the IANA data has LINKs which are their way of aliasing and hinnant's library already deals with these links as a fallback, and it seems our own list is identical, but will need to verify that, also if it's even feasible considering compat.

kevinkreiser added the question label Nov 3, 2022

nilsnolde added the enhancement label Oct 30, 2023

nilsnolde mentioned this issue Nov 5, 2023

remove timezone aliases from sql script #4382

Merged

nilsnolde self-assigned this Nov 26, 2023

nilsnolde mentioned this issue Dec 11, 2023

update timezone database #4446

Merged

nilsnolde closed this as completed in #4446 Dec 14, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

TZDB shapefile is out of date #3817

TZDB shapefile is out of date #3817

ianthetechie commented Nov 3, 2022

kevinkreiser commented Nov 3, 2022 •

edited

Loading

nilsnolde commented Oct 28, 2023

nilsnolde commented Oct 28, 2023 •

edited

Loading

nilsnolde commented Oct 28, 2023

nilsnolde commented Oct 28, 2023

nilsnolde commented Nov 3, 2023

TZDB shapefile is out of date #3817

TZDB shapefile is out of date #3817

Comments

ianthetechie commented Nov 3, 2022

kevinkreiser commented Nov 3, 2022 • edited Loading

nilsnolde commented Oct 28, 2023

nilsnolde commented Oct 28, 2023 • edited Loading

nilsnolde commented Oct 28, 2023

nilsnolde commented Oct 28, 2023

nilsnolde commented Nov 3, 2023

kevinkreiser commented Nov 3, 2022 •

edited

Loading

nilsnolde commented Oct 28, 2023 •

edited

Loading