Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Higher performance / less allocating ZonedDateTimes & TimeZones #182

Open
kcajf opened this issue Jan 27, 2019 · 20 comments
Open

Higher performance / less allocating ZonedDateTimes & TimeZones #182

kcajf opened this issue Jan 27, 2019 · 20 comments

Comments

@kcajf
Copy link

kcajf commented Jan 27, 2019

I have a lot of code that deals with DateTimes. Occasionally I do stuff with funny timezones, but a vast majority of the time I represent things in UTC. Operations on ZoneDateTimes are far more expensive than the same operations naive DateTimes (for good reasons: timezones have to be compared + ZonedDateTimes are not bits type, so operations require some indirection & allocation). For performance reasons, I would rather store and manipulate all my UTC ZonedDateTimes as naive DateTimes with the timezone implicitly understood by the programmer, but for clarity, safety, ease of reasoning & programming, etc. I would prefer to have the timezone be explicit.

This got me thinking: it would be neat to have some sort of ZonedDateTime whose UTC-ness is expressed and enforced at the type level, and that is stored as a DateTime, such that UTC-to-UTC comparisons & arithmetic can take a fast path and just be do direct integer operations like DateTimes, with more general ZonedDateTime operations going via the slower ZonedDateTime paths.

This is clearly do-able, and I was more trying to work out how easily it would slot in to TimeZones.jl & where it could go in the type hierarchy etc. in order to avoid as much duplication of code as possible.
One possible means of achieving this (and possibly other interesting specializations) would be if a ZonedDateTime's timezone type were exposed as a type parameter. I'm writing this in a hurry so haven't given it full thought though..

@omus
Copy link
Member

omus commented Jan 28, 2019

Thanks for the feedback! I think there is definitely room for some performance improvements around allocations with ZonedDateTimes. I'll note that the latest master of TimeZones drops support for Julia 0.6 and some of the internal changes should slightly help with performance.

it would be neat to have some sort of ZonedDateTime whose UTC-ness is expressed and enforced at the type level, and that is stored as a DateTime, such that UTC-to-UTC comparisons & arithmetic can take a fast path and just be do direct integer operations like DateTimes, with more general ZonedDateTime operations going via the slower ZonedDateTime paths.

Internally ZonedDateTime uses UTC instant representations for performing comparisons. For equality checks and most arithmetic you should see performance equivalent to DateTime. Could you provide an example of the code which has performance issues for you? That would be very helpful in targeting your specific performance issues.

@kcajf
Copy link
Author

kcajf commented Jan 28, 2019

Thanks for the explanation - that makes sense, and I have probably been misattributing performance to the timezone logic when in fact is more likely due to allocations (e.g. adding Minute(1) element-wise to a vector of ZonedDateTime), having not spent much time in the TimeZones code.

Are there any obvious things that can be done to reduce allocations?

@omus
Copy link
Member

omus commented Jan 31, 2019

The benchmark you mentioned:

julia> using Dates, TimeZones, BenchmarkTools

julia> @btime (DateTime(2019):Day(1):DateTime(2020)) .+ Minute(1);
  309.351 ns (1 allocation: 3.00 KiB)

julia> @btime (ZonedDateTime(2019,tz"UTC"):Day(1):ZonedDateTime(2020,tz"UTC")) .+ Minute(1);
  2.300 μs (372 allocations: 14.59 KiB)

There are definitely some improvements to be made here. I'll mention first that if you're working primarily in UTC then the only part of that time zone which isn't a bits-type is the name which is a String. I think we can make some easy gains on this by making ZonedDateTime parametric. I'm hoping to do some additional TimeZones.jl work next week so I'll be sure to do some experimentation regarding performance.

@kcajf
Copy link
Author

kcajf commented Jan 31, 2019

Nice, thank you. I did spend a little more time reading the code and it got me wondering if the current common Timezones (Fixed & Variable) could be entirely bitstypes as well. Given that there is a predefined set of timezones, these could be encoded as primitves (Enums or Ints), and have the name (+ any other heavyweight / derivative attributes) be stored or computed externally, e.g. lookup tables or function dispatches. This would all be code-generated, ideally only once when first loading the timezone database. A ZonedDateTime could then just be a DateTime (Int64) + a Timezone encoded as an Int primitive (Int16?) and would be super lightweight for most operations other than displaying & localising I think.

If I understand correctly, this would still require ZonedDateTime to be parametric in order for the full ZonedDateTime struct to become bitstype though, so hoisting the type out is probably worth doing anyway.

@kcajf
Copy link
Author

kcajf commented Feb 1, 2019

By the way: do you know of any real scenarios where users of the library are defining their own TimeZones that aren't in the IANA database? e.g. creating a FixedTimeZone with a UTC offset and a name that isn't of the form "UTC+N", or a VariableTimeZone with some custom transitions? I can't think of any applications myself, but would be curious.

@omus
Copy link
Member

omus commented Feb 4, 2019

An example would be:

julia> FixedTimeZone("CST", -6 * 3600)  # Central Standard Time (North America)
CST (UTC-6)

I'll also note that internally VariableTimeZones use FixedTimeZone for storing the transition abbreviation (e.g. CST/CDT) along with the offset.

@kcajf
Copy link
Author

kcajf commented Jun 21, 2019

Thanks, I hadn't realised that something like CST wasn't actually in the IANA database.

It would be nice to have a majority of common timezones be entirely bitstype for performance, but if that isn't so easy, perhaps the timezone name could be optional (i.e. nothing by default). If absent, it could be constructed on the fly when needed for presentation, using the default format of UTC+/-N. Alternatively, these strings could be interned since they're very short and very repeated (using e.g. InternedStrings)?

@athre0z
Copy link

athre0z commented Aug 6, 2019

I really like the idea of having the timezone as a type parameter -- I was about to suggest the same thing before I saw this issue. I'm currently writing a database client for a time series DB where I have to deal with rather large arrays of datetime objects. While converting my Vector{UInt64} to Vector{DateTime} is practically a memcpy / reinterpret cast, converting them to an array of ZonedDateTime involves 1 allocation as well as 16 byte of additional memory consumption per item. Also, with the type parameter approach, changing the timezone of a big array of zoned date times could be easily vectorized by just calculating the delta once and adding it to all objects in the array etc..

@kcajf
Copy link
Author

kcajf commented Aug 23, 2019

Yeah, having the timezone strings duplicated everywhere is a bit of a pain. In my code, I prefer for every DateTime to be tz-aware, since it is the correct thing to do (they usually represent an unambiguous instant in time). In almost all cases the timezone attached to them is UTC (other timezones are really a "presentation layer" thing). The only place where I use non-UTC is when ingesting datasets in other timezones, which are always normalised to UTC. Being able to represent at least UTC, but ideally a wider set of common timezones, without too much (or any!) overhead would be fantastic. Currently I am considering using naive DateTimes everywhere I would have a UTC ZonedDateTime, purely for performance reasons.

Some interesting stuff is happening in the C++ world wrt datetimes / timezones. https://github.com/HowardHinnant/date/tree/master/include/date is basically the implementation that has been accepted in the c++ 20 standard. I wonder if some inspiration could be taken from there around zero / low cost abstraction handling of this stuff. The C++ community are usually pretty solid in that respect.

@athre0z
Copy link

athre0z commented Aug 23, 2019

I also went with naive DateTime objects for now. While I would strongly prefer timezoned objects, it was just too much overhead.

An example of a lib that already does timestamps as type arguments would be chrono for Rust.

@kcajf
Copy link
Author

kcajf commented Aug 23, 2019

Yes, that's a nice example - special casing UTC and FixedOffset.

The tricky part is how to make this non-(performance)-breaking. If ZonedDateTime were redefined to ZoneDateTime{TZ}, and someone had previously defined a struct like:

struct Foo
    a::ZonedDateTime
end

operations on a will now require a dynamic dispatch I think.

@omus
Copy link
Member

omus commented Aug 23, 2019

In almost all cases the timezone attached to them is UTC (other timezones are really a "presentation layer" thing)

I'll note with TimeZones.jl this isn't true as the time zone information is relevant in arithmetic with DatePeriods (a Period of a Day or larger):

julia> using TimeZones, Dates

julia> zdt = ZonedDateTime(2015,11,1,tz"America/Winnipeg")
2015-11-01T00:00:00-05:00

julia> zdt + Day(1)
2015-11-02T00:00:00-06:00

julia> zdt + Hour(24)
2015-11-01T23:00:00-06:00

julia> astimezone(astimezone(zdt, tz"UTC") + Day(1), tz"America/Winnipeg")
2015-11-01T23:00:00-06:00

@kcajf
Copy link
Author

kcajf commented Aug 23, 2019 via email

@kcajf
Copy link
Author

kcajf commented Aug 25, 2019

@omus do you have any opinions on if / how to make a large breaking change like this?

@omus
Copy link
Member

omus commented Aug 25, 2019

I would say just go for it and don't worry about how breaking the change is to start with. We need a prototype that demonstrates a performance improvement over the original code first through benchmarks (BenchmarkTools.jl). Then we discuss how breaking the change is and possible ways transition cleanly with deprecations. In the worst case scenario we can just use a new type.

Unfortunately, I probably won't have time to experiment with this myself so if you want to do the legwork I'll do my best to support.

@kcajf
Copy link
Author

kcajf commented Sep 13, 2019

Just reading more into the rust chrono implementation.

There is the main chrono linked above, which implements Utc and FixedOffset types. Utc is a singleton, and FixedOffset just contains an int32 (i.e. enough for 24 * 60 * 60 * 2 = 172800 seconds) - it doesn't support a custom name / string like your "FixedTimeZone("CST", -6 * 3600) example above. As a result, Utc chrono::DateTimes are zero-overhead, and FixedOffset chrono::DateTimes are pretty lightweight.

Then there is chrono-tz which adds the IANA database - basically doing the job of VariableTimeZone in TimeZones.jl. My rust isn't great, but I think it defines an enum of type Tz and implements the Timezone interface for this enum, doing a match (switch) on the enum value within each method. Looks a bit like:

#[derive(Clone, Copy, PartialEq, Eq, Hash)]
pub enum Tz {
    Africa__Abidjan,
    Africa__Accra,
    Africa__Addis_Ababa,
    Africa__Algiers,
    Africa__Asmara,
    Africa__Asmera,
    Africa__Bamako,
    Africa__Bangui,
    Africa__Banjul,
...


impl Tz {
    pub fn name(self: &Tz) -> &'static str {
        match *self {
            Tz::Africa__Abidjan => "Africa/Abidjan",
            Tz::Africa__Accra => "Africa/Accra",
            Tz::Africa__Addis_Ababa => "Africa/Addis_Ababa",
            Tz::Africa__Algiers => "Africa/Algiers",
            Tz::Africa__Asmara => "Africa/Asmara",
            Tz::Africa__Asmera => "Africa/Asmera",
...


impl TimeSpans for Tz {
    fn timespans(&self) -> FixedTimespanSet {
        match *self {
            Tz::Africa__Abidjan => {
                const REST: &'static [(i64, FixedTimespan)] = &[
                    (-1830383032, FixedTimespan { utc_offset: 0, dst_offset: 0, name: "GMT" }),
                ];
                FixedTimespanSet {
                    first: FixedTimespan {
                        utc_offset: -968,
                        dst_offset: 0,
                        name: "LMT",
                    },
                    rest: REST
                }
            },

            Tz::Africa__Accra => {
                const REST: &'static [(i64, FixedTimespan)] = &[
                    (-1640995148, FixedTimespan { utc_offset: 0, dst_offset: 0, name: "GMT" }),
                    (-1556841600, FixedTimespan { utc_offset: 0, dst_offset: 1200, name: "+0020" }),
                    (-1546388400, FixedTimespan { utc_offset: 0, dst_offset: 0, name: "GMT" }),
                    (-1525305600, FixedTimespan { utc_offset: 0, dst_offset: 1200, name: "+0020" }),
                    (-1514852400, FixedTimespan { utc_offset: 0, dst_offset: 0, name: "GMT" }),
                    (-1493769600, FixedTimespan { utc_offset: 0, dst_offset: 1200, name: "+0020" }),
                    (-1483316400, FixedTimespan { utc_offset: 0, dst_offset: 0, name: "GMT" }),
                    (-1462233600, FixedTimespan { utc_offset: 0, dst_offset: 1200, name: "+0020" }),
                    (-1451780400, FixedTimespan { utc_offset: 0, dst_offset: 0, name: "GMT" }),
...

(This code is all generated during build time obviously). This design is also very efficient: a chrono::DateTime with a Tz timezone type only carries an int16 since there are ~600 timezones.

I think this design would be great to mirror in Julia. I notice that Dates in the standard library exposes a UTC singleton which I think we should use (currently tz"UTC" is just constructing a FixedTimeZone with zero offset, but UTC datetimes are so common that I think they should be special-cased).

I'm not sure where stuff like your CST example above fits into all of this. Given that CST just appears to be an well-known (but 'unofficial') alias for UTC-6, perhaps it's not really in the domain of a timezone library? Neither chrono-tz nor python pytz seem to know about it, for example. (Of course, people could still subtype TimeZone if they wanted to do something funky like a FixedTimeZone with a custom name.)

@omus @athre0z do either of you have strong opinions on what I've described above? @athre0z also please correct me if I've misunderstood the Rust code - have not written rust in about 5 years...

@kcajf kcajf changed the title High performance UTC DateTimes Higher performance / less allocating ZonedDateTimes & TimeZones Sep 14, 2019
@kcajf
Copy link
Author

kcajf commented Sep 15, 2019

Instead of the enum / bytes based switching, it would also be possible do this using Julia’s method dispatch machinery - i.e. have a VariableTimeZone subtype for every IANA zone. This would be third-party extensible, and faster (zero logic) when specific timezones are known at compile time, but type based dynamic dispatch would be much more expensive than a switch statement otherwise.

The nice thing, I suppose, is that these different TimeZone subtypes can be experimented with alongside the current ones, and compared. The issue of lighter weight TimeZone subtypes is pretty orthogonal to the issue of adding a type parameter to ZonedDateTime.

@oxinabox
Copy link
Contributor

It seems bad that the ZonedDateTime has a abstractly typed TimeZone field

timezone::TimeZone

@omus
Copy link
Member

omus commented Feb 27, 2020

I've been doing some experimentation around having FixedTimeZone without the name field and the results are promising so far. The design is rather rough so far but I'll continue to work on it.

Just wanted to share a very short update so people following this issue know this is being worked on.

@kcajf
Copy link
Author

kcajf commented May 11, 2020

Thinking about this again. @omus one thing I'm confused about: does ZonedDateTime have the zone field, if the utc_datetime field is meant to be UTC? EDIT: I see, it's not the zone of the utc_datetime, just a cached lookup of the active zone at that time.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants