Unexpected allocations #496

Open
JoaoAparicio opened this issue Jan 22, 2024 · 2 comments

JoaoAparicio commented Jan 22, 2024

I've noticed that this allocates and I'm surprised.

]activate --temp
]add Arrow
using Arrow
using Arrow: ArrowTypes
struct IntWrapper
    data::Int64
end

const INTWRAPPER_NAME = Symbol("JuliaLang.IntWrapper")
ArrowTypes.ArrowKind(::Type{IntWrapper}) = ArrowTypes.PrimitiveKind()
ArrowTypes.ArrowType(::Type{IntWrapper}) = Int64
ArrowTypes.toarrow(x::IntWrapper) = x.data
ArrowTypes.arrowname(::Type{IntWrapper}) = INTWRAPPER_NAME
ArrowTypes.JuliaType(::Val{INTWRAPPER_NAME}, ::Type{Int64}) = IntWrapper
ArrowTypes.fromarrow(::Type{IntWrapper}, x::Int64) = reinterpret(IntWrapper, x)

x = [IntWrapper(1) for _ in 1:8_000_000];
@time Arrow.write("/tmp/temp.arrow", (x=x,))

I get (after running it once to compile):

 0.401526 seconds (8.00 M allocations: 184.254 MiB, 7.14% gc time)

Basically one allocation per element of the vector.

Compare this with the cost of saving just ints without the wrapper:

x = ones(Int,8_000_000);
@time Arrow.write("/tmp/temp.arrow", (x=x,))
0.056106 seconds (140 allocations: 11.461 KiB)

Am I doing something wrong?

I've reproduced this for:

julia 1.10.0 + Arrow 2.7.0
julia 1.9.0 + Arrow 2.6.2
julia 1.8.5 + Arrow 2.6.2
julia 1.8.5 + Arrow 2.5.0
julia 1.7.3 + Arrow 2.2.0
julia> versioninfo()
Julia Version 1.10.0
Commit 3120989f39b (2023-12-25 18:01 UTC)
Build Info:
  Official https://julialang.org/ release
Platform Info:
  OS: Linux (x86_64-linux-gnu)
  CPU: 48 × Intel(R) Xeon(R) Gold 6136 CPU @ 3.00GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-15.0.7 (ORCJIT, skylake-avx512)
  Threads: 1 on 48 virtual cores
Environment:
  JULIA_EDITOR = code
  JULIA_NUM_THREADS = 4

Manifest

st -m
Status `/tmp/jl_mwcRrJ/Manifest.toml`
  [69666777] Arrow v2.7.0
  [31f734f8] ArrowTypes v2.3.0
  [c3b6d118] BitIntegers v0.3.1
  [5ba52731] CodecLz4 v0.4.1
  [6b39b394] CodecZstd v0.8.2
  [34da2185] Compat v4.12.0
  [f0e56b4a] ConcurrentUtilities v2.3.0
  [9a962f9c] DataAPI v1.15.0
  [e2d170a0] DataValueInterfaces v1.0.0
  [4e289a0a] EnumX v1.0.4
  [e2ba6199] ExprTools v0.1.10
  [842dd82b] InlineStrings v1.4.0
  [82899510] IteratorInterfaceExtensions v1.0.0
  [692b3bcd] JLLWrappers v1.5.0
  [e6f89c97] LoggingExtras v1.0.3
  [78c3b35d] Mocking v0.7.7
  [bac558e1] OrderedCollections v1.6.3
  [69de0a69] Parsers v2.8.1
  [2dfb63ee] PooledArrays v1.4.3
  [aea7be01] PrecompileTools v1.2.0
  [21216c6a] Preferences v1.4.1
  [6c6a2e73] Scratch v1.2.1
  [91c51154] SentinelArrays v1.4.1
  [dc5dba14] TZJData v1.0.0+2023c
  [3783bdb8] TableTraits v1.0.1
  [bd369af6] Tables v1.11.1
  [f269a46b] TimeZones v1.13.0
  [3bb67fe8] TranscodingStreams v0.10.2
  [5ced341a] Lz4_jll v1.9.4+0
  [3161d3a3] Zstd_jll v1.5.5+0
  [0dad84c5] ArgTools v1.1.1
  [56f22d72] Artifacts
  [2a0f44e3] Base64
  [ade2ca70] Dates
  [f43a241f] Downloads v1.6.0
  [7b1f6079] FileWatching
  [9fa8497b] Future
  [b77e0a4c] InteractiveUtils
  [4af54fe1] LazyArtifacts
  [b27032c2] LibCURL v0.6.4
  [76f85450] LibGit2
  [8f399da3] Libdl
  [37e2e46d] LinearAlgebra
  [56ddb016] Logging
  [d6f4376e] Markdown
  [a63ad114] Mmap
  [ca575930] NetworkOptions v1.2.0
  [44cfe95a] Pkg v1.10.0
  [de0858da] Printf
  [3fa0cd96] REPL
  [9a3f8284] Random
  [ea8e919c] SHA v0.7.0
  [9e88b42a] Serialization
  [6462fe0b] Sockets
  [fa267f1f] TOML v1.0.3
  [a4e569a6] Tar v1.10.0
  [cf7118a7] UUIDs
  [4ec0a83e] Unicode
  [e66e0078] CompilerSupportLibraries_jll v1.0.5+1
  [deac9b47] LibCURL_jll v8.4.0+0
  [e37daf67] LibGit2_jll v1.6.4+0
  [29816b5a] LibSSH2_jll v1.11.0+1
  [c8ffd9c3] MbedTLS_jll v2.28.2+1
  [14a3606d] MozillaCACerts_jll v2023.1.10
  [4536629a] OpenBLAS_jll v0.3.23+2
  [83775a58] Zlib_jll v1.2.13+1
  [8e850b90] libblastrampoline_jll v5.8.0+1
  [8e850ede] nghttp2_jll v1.52.0+1
  [3f19e933] p7zip_jll v17.4.0+2

JoaoAparicio commented Jan 22, 2024

One difference I've noticed between the Vector{Int64} and Vector{IntWrapper} cases shows up on entering this function:

arrow-julia/src/utils.jl

Lines 34 to 57 in 3712291

function writearray(io::IO, ::Type{T}, col) where {T}
    if col isa Vector{T}
        n = Base.write(io, col)
    elseif isbitstype(T) && (
        col isa Vector{Union{T,Missing}} || col isa SentinelVector{T,T,Missing,Vector{T}}
    )
        # need to write the non-selector bytes of isbits Union Arrays
        n = Base.unsafe_write(io, pointer(col), sizeof(T) * length(col))
    elseif col isa ChainedVector
        n = 0
        for A in col.arrays
            n += writearray(io, T, A)
        end
    else
        n = 0
        data = Vector{UInt8}(undef, sizeof(col))
        buf = IOBuffer(data; write=true)
        for x in col
            n += Base.write(buf, coalesce(x, ArrowTypes.default(T)))
        end
        n = Base.write(io, take!(buf))
    end
    return n
end
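
To make the difference between the first and last branches concrete, here is a minimal sketch using only Base. It is not Arrow's code path: `col`, `buf`, and `io` are illustrative names, and it only assumes a plain `Vector{Int64}`. It writes the same data once element by element into a temporary buffer (as the `else` branch does) and once in a single bulk call (as the first branch does), and checks that both produce the same bytes.

```julia
col = Int64[1, 2, 3, 4]

# Fallback-style path: write every element separately into a temporary buffer.
buf = IOBuffer()
n_elementwise = sum(x -> Base.write(buf, x), col)
bytes_elementwise = take!(buf)

# Fast path: hand the whole vector to a single write call.
io = IOBuffer()
n_bulk = Base.write(io, col)
bytes_bulk = take!(io)

println((n_elementwise, n_bulk, bytes_elementwise == bytes_bulk))
```

If the bytes are the same for plain elements, the two branches should differ only in cost, which is consistent with the timings reported above.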

In the first case, col is a Vector{Int64} and matches the first if branch; in the second, col is an ArrowTypes.ToArrow{Int64,Vector{IntWrapper}} and falls through to the final else branch. That branch allocates because
data = Vector{UInt8}(undef, sizeof(col))
won't know the size to be allocated at compile time.

However, I believe something has already gone wrong by this stage. By inserting prints I can ask for sizeof(col): in the first case it is the size of the whole vector, but in the second case it's just 8, which I guess is the number of bytes for a single Int64.
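
One possible reading of that 8, sketched below under assumptions: if the wrapper is a struct whose only field is the wrapped vector, then `sizeof` on the wrapper would report the size of that one reference field (8 bytes on a 64-bit machine), not the size of the data. `LazyWrapper` here is a hypothetical stand-in, not `ArrowTypes.ToArrow` itself, and I haven't verified that `ToArrow` is laid out this way.

```julia
# Hypothetical stand-in for a lazy wrapper type; not ArrowTypes.ToArrow itself.
struct LazyWrapper{T,A} <: AbstractVector{T}
    data::A
end
Base.size(w::LazyWrapper) = size(w.data)
Base.getindex(w::LazyWrapper{T}, i::Int) where {T} = convert(T, w.data[i])

v = ones(Int64, 1_000)
w = LazyWrapper{Int64,typeof(v)}(v)

bytes_vector  = sizeof(v)                      # size of the vector's data
bytes_wrapper = sizeof(w)                      # size of the struct: one reference field
bytes_payload = sizeof(eltype(w)) * length(w)  # bytes the elements would occupy

println((bytes_vector, bytes_wrapper, bytes_payload))
```

If that is what happens in the else branch, `sizeof(eltype(col)) * length(col)` would be the quantity that matches the number of bytes actually written.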


baumgold commented Feb 8, 2024

It looks like @quinnj added the logic to write to a temporary vector prior to writing the vector to the IO. I don't understand why this is required. @quinnj - do you remember?

https://github.com/apache/arrow-julia/pull/57/files#diff-47c27891e951c8cd946b850dc2df31082624afdf57446c21cb6992f5f4b74aa2R47-R52
