Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

move timezone support to extension #482

Open
wants to merge 2 commits into
base: main
Choose a base branch
from

Conversation

palday
Copy link
Contributor

@palday palday commented Aug 14, 2023

When looking at @time_imports on a transitive reverse dependency, I saw that TimeZones adds a substantial time to a package that doesn't need that part of Arrow. So I took a stab at moving time zone support to an extension. I've done this in a way that should still be Julia pre 1.9 compatible. There is one slightly hacky thing: the extension mechanism is based primarily on the idea of adding additional methods to existing functions and not for defining additional types. As such, it's a little bit weird (setglobal! in 1.9) to get types defined in an extension available at the top-level module again.

This is almost assuredly a breaking change because deserialization behavior will change when TimeZones isn't loaded either directly or through some other package. For serialization, I feel like it's less of an issue because if you've writing ZonedDateTime to disk, then you've already got TimeZones loaded. Still, I thought this was a nice experiment to show a potential speed up. I'm opening this PR as the starting point of a discussion, not because I expect it to be merged immediately.

current main

julia> @time @time_imports using Arrow
      1.7 ms  LoggingExtras
      2.7 ms  DataAPI
      1.3 ms  DataValueInterfaces
      1.1 ms  IteratorInterfaceExtensions
      1.1 ms  TableTraits
     39.3 ms  Tables
     22.9 ms  SentinelArrays
     12.5 ms  PooledArrays
      1.0 ms  Lz4_jll
      2.5 ms  TranscodingStreams
      3.2 ms  CodecLz4
      0.7 ms  Zstd_jll
      1.8 ms  CEnum
      3.7 ms  CodecZstd
      0.6 ms  Scratch
      0.4 ms  PrecompileTools
      8.8 ms  RecipesBase
     19.8 ms  Parsers
      5.6 ms  InlineStrings
      0.7 ms  Compat
      0.5 ms  Compat  CompatLinearAlgebraExt
      0.4 ms  ExprTools
      0.8 ms  Mocking
    346.8 ms  TimeZones 83.55% compilation time (40% recompilation)
     11.2 ms  BitIntegers
      2.7 ms  ConcurrentUtilities
      0.8 ms  EnumX
      3.3 ms  ArrowTypes
     22.5 ms  Arrow
  0.606927 seconds (1.62 M allocations: 100.383 MiB, 4.84% gc time, 56.60% compilation time: 39% of which was recompilation)

as an extension

julia> @time @time_imports using Arrow
      1.5 ms  LoggingExtras
      0.8 ms  DataAPI
      0.4 ms  DataValueInterfaces
      0.4 ms  IteratorInterfaceExtensions
      0.4 ms  TableTraits
     20.7 ms  Tables
     17.5 ms  SentinelArrays
     12.2 ms  PooledArrays
      0.8 ms  Lz4_jll
      2.0 ms  TranscodingStreams
      3.3 ms  CodecLz4
      1.0 ms  Zstd_jll
      3.0 ms  CEnum
      3.1 ms  CodecZstd
     12.2 ms  BitIntegers
      2.1 ms  ConcurrentUtilities
      0.6 ms  EnumX
      2.8 ms  ArrowTypes
     20.7 ms  Arrow
  0.130832 seconds (201.42 k allocations: 13.516 MiB, 6.01% compilation time)

julia> @time @time_imports using TimeZones
      0.6 ms  Scratch
      0.4 ms  PrecompileTools
      9.2 ms  RecipesBase
     19.8 ms  Parsers
      5.2 ms  InlineStrings
      0.9 ms  Compat
      1.2 ms  Compat  CompatLinearAlgebraExt
      0.5 ms  ExprTools
      1.2 ms  Mocking
    341.1 ms  TimeZones 82.49% compilation time (44% recompilation)
    156.7 ms  Arrow  ArrowTimeZonesExt
  0.612256 seconds (1.42 M allocations: 87.153 MiB, 2.65% gc time, 54.20% compilation time: 43% of which was recompilation)

to do if this change is desired

  • update documentation to reflect the change in behavior re TimeZones

@ericphanson
Copy link
Member

This is almost assuredly a breaking change because deserialization behavior will change when TimeZones isn't loaded either directly or through some other package

Regardless of breakingness, I'm not sure this is a good idea, since the Arrow format itself has a notion of timezone, I think it makes sense that Arrow.jl itself can deserialize a table with such types (e.g. coming from another language), without needing to load another package.

IMO it would be better to try to fix invalidation / other sources of re-compilation in TimeZones itself, or otherwise improve it's load time, rather than making it optional.

@codecov-commenter
Copy link

codecov-commenter commented Aug 16, 2023

Codecov Report

Merging #482 (899bddb) into main (95efe95) will increase coverage by 0.01%.
The diff coverage is 88.88%.

@@            Coverage Diff             @@
##             main     #482      +/-   ##
==========================================
+ Coverage   87.45%   87.46%   +0.01%     
==========================================
  Files          26       27       +1     
  Lines        3283     3286       +3     
==========================================
+ Hits         2871     2874       +3     
  Misses        412      412              
Files Changed Coverage Δ
src/Arrow.jl 97.43% <ø> (ø)
src/eltypes.jl 85.37% <ø> (-0.39%) ⬇️
ext/ArrowTimeZonesExt.jl 88.88% <88.88%> (ø)

📣 We’re building smart automated test selection to slash your CI/CD build times. Learn more

@palday
Copy link
Contributor Author

palday commented Aug 16, 2023

I also wasn't sure that this was the best path -- I really did it just to see what would happen. And I also agree that the (re)compilation overhead in TimeZones is a good target for optimization.

That said, there is a different path forward (which I'm not saying is a good idea): define a very restricted Arrow.ArrowTimeZone type and and make it easy to convert to ZonedDateTime when TimeZones is loaded. If you don't need to manipulate the columns with timezones (or have none), then you can skip a dependency, while still being able to deserialize a file.

@quinnj
Copy link
Member

quinnj commented Aug 16, 2023

Ugh, we just really need a cleaned up/nice TimeZones.jl package where people don't feel like they need to work-around/avoid it. That's possibly the problem of the Dates stdlib making it easier for TimeZones.jl to integrate cleanly.

@visr
Copy link
Contributor

visr commented Apr 26, 2024

Perhaps this is no longer needed when these TimeZones load time improvements are merged: JuliaTime/TimeZones.jl#457.

@omus
Copy link
Contributor

omus commented May 21, 2024

With the release of TimeZones.jl 1.16 I decide to re-run the timings

Julia 1.9.4 with TimeZones 1.15:

(Arrow) pkg> st
Project Arrow v2.7.2
Status `~/.julia/dev/Arrow/Project.toml`
  [31f734f8] ArrowTypes v2.3.0 `src/ArrowTypes`
  [c3b6d118] BitIntegers v0.3.1
  [5ba52731] CodecLz4 v0.4.3
  [6b39b394] CodecZstd v0.8.2
  [f0e56b4a] ConcurrentUtilities v2.4.1
  [9a962f9c] DataAPI v1.16.0
  [4e289a0a] EnumX v1.0.4
  [e6f89c97] LoggingExtras v1.0.3
  [2dfb63ee] PooledArrays v1.4.3
  [91c51154] SentinelArrays v1.4.3
  [bd369af6] Tables v1.11.1
⌃ [f269a46b] TimeZones v1.15.0
  [3bb67fe8] TranscodingStreams v0.10.8
  [ade2ca70] Dates
  [a63ad114] Mmap
  [cf7118a7] UUIDs
Info Packages marked with ⌃ have new versions available and may be upgradable.

julia> @time @time_imports using Arrow
      1.6 ms  LoggingExtras
      0.8 ms  DataAPI
      0.4 ms  DataValueInterfaces
      0.4 ms  IteratorInterfaceExtensions
      0.3 ms  TableTraits
      4.2 ms  OrderedCollections
     17.0 ms  Tables
     17.1 ms  SentinelArrays
     10.9 ms  PooledArrays
      7.7 ms  Preferences
      0.5 ms  JLLWrappers
     27.5 ms  Lz4_jll 93.79% compilation time
      1.4 ms  TranscodingStreams
      0.4 ms  TranscodingStreams  TestExt
      3.9 ms  CodecLz4
      0.8 ms  Zstd_jll
      2.6 ms  CodecZstd
      0.4 ms  Scratch
      0.3 ms  PrecompileTools
     16.6 ms  Parsers
      4.3 ms  InlineStrings
      0.4 ms  TZJData
      0.7 ms  Compat
      0.4 ms  Compat  CompatLinearAlgebraExt
      0.5 ms  ExprTools
      0.8 ms  Mocking
    253.5 ms  TimeZones 55.54% compilation time (79% recompilation)
     10.9 ms  BitIntegers
      1.8 ms  ConcurrentUtilities
      0.5 ms  EnumX
      2.8 ms  ArrowTypes
    152.1 ms  Arrow
  0.621405 seconds (1.35 M allocations: 83.477 MiB, 4.91% gc time, 31.84% compilation time: 56% of which was recompilation)

Julia 1.9.4 and TimeZones 1.16:

(Arrow) pkg> st
Project Arrow v2.7.2
Status `~/.julia/dev/Arrow/Project.toml`
  [31f734f8] ArrowTypes v2.3.0 `src/ArrowTypes`
  [c3b6d118] BitIntegers v0.3.1
  [5ba52731] CodecLz4 v0.4.3
  [6b39b394] CodecZstd v0.8.2
  [f0e56b4a] ConcurrentUtilities v2.4.1
  [9a962f9c] DataAPI v1.16.0
  [4e289a0a] EnumX v1.0.4
  [e6f89c97] LoggingExtras v1.0.3
  [2dfb63ee] PooledArrays v1.4.3
  [91c51154] SentinelArrays v1.4.3
  [bd369af6] Tables v1.11.1
  [f269a46b] TimeZones v1.16.0
  [3bb67fe8] TranscodingStreams v0.10.8
  [ade2ca70] Dates
  [a63ad114] Mmap
  [cf7118a7] UUIDs

julia> @time @time_imports using Arrow
      1.7 ms  LoggingExtras
      0.7 ms  DataAPI
      0.4 ms  DataValueInterfaces
      0.3 ms  IteratorInterfaceExtensions
      0.3 ms  TableTraits
      4.3 ms  OrderedCollections
     16.6 ms  Tables
     17.3 ms  SentinelArrays
     11.0 ms  PooledArrays
      7.4 ms  Preferences
      0.4 ms  JLLWrappers
     27.8 ms  Lz4_jll 92.37% compilation time
      1.4 ms  TranscodingStreams
      0.4 ms  TranscodingStreams  TestExt
      3.6 ms  CodecLz4
      0.8 ms  Zstd_jll
      2.7 ms  CodecZstd
      0.4 ms  Scratch
      0.3 ms  PrecompileTools
     16.2 ms  Parsers
      4.3 ms  InlineStrings
      0.5 ms  TZJData
      0.8 ms  Compat
      1.8 ms  Compat  CompatLinearAlgebraExt
      0.5 ms  ExprTools
      1.1 ms  Mocking
     28.9 ms  TimeZones
     25.9 ms  BitIntegers
      1.7 ms  ConcurrentUtilities
      0.6 ms  EnumX
      2.9 ms  ArrowTypes
    259.1 ms  Arrow
  0.523419 seconds (574.55 k allocations: 40.298 MiB, 2.85% gc time, 11.04% compilation time)

The increase in import time for Arrow is most likely to is due to having Arrow.jl code triggering the loading of time zone data.

@omus
Copy link
Contributor

omus commented May 21, 2024

It looks like the use of @tz_str is the issue at the moment. I updated to use TimeZone(...) and got this:

julia> @time @time_imports using Arrow
...
     25.9 ms  TimeZones
...
      2.8 ms  ArrowTypes
     19.3 ms  Arrow
  0.284661 seconds (574.62 k allocations: 41.319 MiB, 14.86% gc time, 20.92% compilation time)

I'll attempt to address this in TimeZones.jl

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants