
Serialization #60

Open · alyst opened this issue Mar 7, 2017 · 13 comments
@alyst (Contributor) commented Mar 7, 2017

Has the performance/storage efficiency of [Nullable]CategoricalArray serialization been checked?
Would it make sense to override serialize()/deserialize() for CategoricalPool (the invindex and especially valindex fields could be reconstructed from index)?
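A minimal sketch of such an override, using the modern Serialization stdlib hooks (on Julia 0.5 these methods lived in Base). `MiniPool` is a hypothetical stand-in for `CategoricalPool` with only the fields discussed here; the real type's layout may differ:

```julia
using Serialization

# MiniPool is a hypothetical stand-in for CategoricalPool (assumption:
# the real type carries more fields, but the index/invindex pair is the same idea).
struct MiniPool{T}
    index::Vector{T}          # ref code => level value
    invindex::Dict{T,UInt32}  # level value => ref code, derivable from index
end

MiniPool(index::Vector{T}) where {T} =
    MiniPool{T}(index, Dict{T,UInt32}(v => UInt32(i) for (i, v) in enumerate(index)))

# Write only `index`; the redundant `invindex` is rebuilt on load.
function Serialization.serialize(s::AbstractSerializer, p::MiniPool)
    Serialization.serialize_type(s, typeof(p))
    serialize(s, p.index)
end

Serialization.deserialize(s::AbstractSerializer, ::Type{MiniPool{T}}) where {T} =
    MiniPool(deserialize(s)::Vector{T})

# Round-trip: the pool survives with invindex reconstructed, not stored.
io = IOBuffer()
serialize(io, MiniPool(["A", "B", "C"]))
seekstart(io)
p = deserialize(io)
```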

@alyst (Contributor Author) commented Mar 7, 2017

Here's a little benchmark using Julia 0.5.0, NullableArrays 0.1.0, and CategoricalArrays 0.1.2:

using StatsBase, CategoricalArrays, NullableArrays

arr = sample(["A", "B", "C", "D", "E", "F", "G", "H", Nullable()], 10^7);
open("arr.jlser", "w") do io serialize(io, arr) end
open("null_arr.jlser", "w") do io serialize(io, NullableArray(arr)) end
open("null_cat_arr.jlser", "w") do io serialize(io, categorical(arr)) end

map(filesize, ["arr.jlser", "null_arr.jlser", "null_cat_arr.jlser"])
# 3-element Array{Int64,1}:
# 266664515
#  38641606
#  40000939

It looks strange that the categorical array is that big.

@alyst (Contributor Author) commented Mar 7, 2017

Here are the deserialization results. Unfortunately, I wasn't able to install BenchmarkTools in the official Julia Docker container, so the exact numbers might differ, but the trend is clear:

@time open("arr.jlser", "r") do io deserialize(io) end
# 58.859399 seconds (211.15 M allocations: 8.504 GB, 21.28% gc time)
@time open("arr.jlser", "r") do io deserialize(io) end;
# 82.845217 seconds (211.11 M allocations: 8.502 GB, 43.56% gc time)

@time open("null_arr.jlser", "r") do io deserialize(io) end;
# 39.956336 seconds (111.14 M allocations: 4.936 GB, 55.29% gc time)
@time open("null_arr.jlser", "r") do io deserialize(io) end;
# 23.724424 seconds (111.11 M allocations: 4.935 GB, 25.49% gc time)

@time open("null_cat_arr.jlser", "r") do io deserialize(io) end;
#  0.225701 seconds (106.56 k allocations: 42.491 MB)
@time open("null_cat_arr.jlser", "r") do io deserialize(io) end;
#  0.092174 seconds (1.97 k allocations: 38.244 MB)

The Array{Nullable{String}} times are bad, but not very surprising.
The poor NullableArray{String} performance, however, is disappointing.

@nalimilan (Member) commented:

Interesting. I would also compare with Array{String}, which is really the natural reference for NullableArray{String} and (Nullable)CategoricalArray{String}.

The file size for the categorical array (40000939) is exactly what I would expect: 4 bytes per element plus some negligible overhead. It's hard to be more efficient than that. I'm not sure how NullableArray manages to go below that limit: even just storing a pointer to a pooled string would take more space. Are you sure the array is correctly restored after deserialization?
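That back-of-the-envelope arithmetic is easy to check: 10^7 ref codes at 4 bytes each should dominate the file (the interpretation of the 939-byte remainder is an assumption):

```julia
n = 10^7                    # number of elements in the benchmark
refs = n * sizeof(UInt32)   # 4 bytes per ref code
overhead = 40000939 - refs  # observed file size minus the ref codes
println(refs)      # 40000000
println(overhead)  # 939, presumably the levels, type tags, etc.
```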

As for possible optimizations, we could rebuild invindex and valindex from index and only save the latter. That would make little difference when the number of levels is small compared with the number of elements, though. Anyway we will need custom serialization rules if we store level hashes (#61), since IIRC hashes may include a random component changing across sessions in the future. If we do that, we may as well implement all possible optimizations.

@alyst (Contributor Author) commented Mar 7, 2017

Ah, right, categorical arrays use UInt32 by default. Could it be worth using a more compact integer type for serialization?
I guess NullableArray occupies less space because the serialization of #undef is compact.

Here's the results for Vector{String}:

arr2 = sample(["A", "B", "C", "D", "E", "F", "G", "H", "G"], 10^7);
open("arr2.jlser", "w") do io serialize(io, arr2) end
map(filesize, ["arr2.jlser", "arr.jlser", "null_arr.jlser", "null_cat_arr.jlser"])
# 4-element Array{Int64,1}:
#  40000012
# 266663942
#  38640383
#  40000939

@time open("arr2.jlser", "r") do io deserialize(io) end
# 39.064836 seconds (110.00 M allocations: 5.290 GB, 47.89% gc time)
@time open("arr2.jlser", "r") do io deserialize(io) end
# 36.619610 seconds (110.00 M allocations: 5.290 GB, 43.79% gc time)

I guess there is room for improvement in Base. I tried to use the serialization mechanisms to load/save moderately sized data frames (~5.5GB gzipped), and loading is terribly slow.

@nalimilan (Member) commented:

> Ah, right, categorical arrays use UInt32 by default. Could it be worth using a more compact integer type for serialization?

Yeah, I guess we could use the same logic as in compress to choose the smallest possible type. We would also need to save the original type so that we can restore the same array, though (else adding new levels could fail). Julia probably saves the type of the object somewhere already.
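A sketch of that choice, assuming compress simply picks the narrowest unsigned type that can hold the number of levels (the actual rule in CategoricalArrays may differ):

```julia
# Pick the narrowest unsigned ref type able to hold `nlevels` distinct codes.
function smallest_reftype(nlevels::Integer)
    nlevels <= typemax(UInt8)  ? UInt8  :
    nlevels <= typemax(UInt16) ? UInt16 :
    nlevels <= typemax(UInt32) ? UInt32 : UInt64
end

smallest_reftype(8)    # UInt8: the 8 levels above would fit in 1 byte per element
smallest_reftype(300)  # UInt16
```

For the benchmark above (8 levels), serializing UInt8 refs instead of UInt32 would shrink the ref payload from ~40 MB to ~10 MB.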

> I guess NullableArray occupies less space because the serialization of #undef is compact.

There should be only about 11% of nulls in the vector, so that can explain the difference compared with Array{String}, but not the fact that only 4 bytes are used per element. That could indicate that Base does something clever to avoid storing duplicated strings, but looking at the code, that does not seem to be the case...

Also it's really weird that deserializing Array{String} is slower (on second run) than for NullableArray{String}...

@alyst (Contributor Author) commented Mar 7, 2017

> Could it be worth using a more compact integer type for serialization?

OTOH, since serialization tries to restore the references, it's better to keep the serialization format as close to the type as possible. Gzipping should take care of the size automatically:

using GZip  # gzopen comes from the GZip.jl package

gzopen("arr.jlser.gz", "w") do io serialize(io, arr) end
gzopen("arr2.jlser.gz", "w") do io serialize(io, arr2) end
gzopen("null_arr.jlser.gz", "w") do io serialize(io, NullableArray(arr)) end
gzopen("null_cat_arr.jlser.gz", "w") do io serialize(io, categorical(arr)) end

map(filesize, ["arr2.jlser.gz", "arr.jlser.gz", "null_arr.jlser.gz", "null_cat_arr.jlser.gz"])
# 4-element Array{Int64,1}:
# 5801136
# 8686152
# 7389933
# 6254812

> There should be only about 11% of nulls in the vector, so that can explain the difference compared with Array{String}, but not the fact that only 4 bytes are used per element.

The contents of null_arr.jlser look like this:

E&�|B&�|B&�|E&�|A&�|G&�|A$&�|D&�|D&�|F&�|F&�|A&�|B$$&�|D&�|F&�|B&�|F&�|E&�|F&�|E$&�|F&�|B&�|D&�|B$&�|A&�|E&�|G&�|E&�|E&�|C&�|H&�|E&�|F$&�|C$&�|C&�|C$&�|A&�|H&�|A&�|H&�|E$&�|H&�|E&�|C&�|C&�|H&�|C&�|C&�|H&�|G&�|B&�|D&�|G$$&�|F&�|E&�|H&�|H&�|B&�|A&�|D&�|B&�|H&�|E&�|F&�|C&�|G&�|B&�|C&�|B&�|D&�|C$&�|B&�

In hex:

45 26 15 7C 41 26 15 7C 46 26 15 7C 43 26 15 7C 48 26 15 7C 41 24 26 15 7C 41 26 15 7C 46 26 15 7C 42 26 15 7C 47 26 15 7C 42 26 15 7C 47 24 26 15 7C 48 26 15 7C 46 24 26 15 7C 46 26 15 7C 44 26 15 7C 43 26 15 7C 43 26 15 7C 41 26 15 7C 42 26 15 7C 46 26 15 7C 47 26 15 7C 45 26 15 7C 46 26 15

@nalimilan (Member) commented:

Of course, that's because your strings are only one character long, so 1 byte each, plus apparently 3 bytes of overhead per element. With longer level names, the story would be quite different!

@alyst (Contributor Author) commented Mar 7, 2017

Relevant upstream issue JuliaLang/julia#18633

@nalimilan (Member) commented:

The representation of String has changed in 0.6; maybe it improved (de)serialization performance?

@alyst (Contributor Author) commented Mar 7, 2017

I haven't tested v0.6 yet, but Vector{String} is improved in v0.5.1:

using StatsBase, CategoricalArrays, NullableArrays

arr = sample(["A", "B", "C", "D", "E", "F", "G", "H", Nullable()], 10^7);
open("arr.jlser", "w") do io serialize(io, arr) end;
open("null_arr.jlser", "w") do io serialize(io, NullableArray(arr)) end;
open("null_cat_arr.jlser", "w") do io serialize(io, categorical(arr)) end;
arr2 = sample(["A", "B", "C", "D", "E", "F", "G", "H", "G"], 10^7);
open("arr2.jlser", "w") do io serialize(io, arr2) end;

map(filesize, ["arr2.jlser", "arr.jlser", "null_arr.jlser", "null_cat_arr.jlser"])
# 4-element Array{Int64,1}:
#  40000012
# 266670491
#  38643340
#  40000939

@time open("arr.jlser", "r") do io deserialize(io) end;
# 83.365617 seconds (127.84 M allocations: 4.879 GB, 7.27% gc time)
@time open("arr.jlser", "r") do io deserialize(io) end;
# 94.405233 seconds (127.78 M allocations: 4.876 GB, 16.81% gc time)

@time open("null_arr.jlser", "r") do io deserialize(io) end;
# 16.028068 seconds (57.82 M allocations: 1.758 GB, 39.62% gc time)
@time open("null_arr.jlser", "r") do io deserialize(io) end;
# 12.801466 seconds (57.78 M allocations: 1.756 GB, 26.86% gc time)

@time open("null_cat_arr.jlser", "r") do io deserialize(io) end;
#  0.271359 seconds (106.03 k allocations: 42.476 MB)
@time open("null_cat_arr.jlser", "r") do io deserialize(io) end;
#  0.034357 seconds (1.73 k allocations: 38.231 MB)

@time open("arr2.jlser", "r") do io deserialize(io) end;
# 11.650630 seconds (50.00 M allocations: 1.714 GB, 14.06% gc time)
@time open("arr2.jlser", "r") do io deserialize(io) end;
# 12.030219 seconds (50.00 M allocations: 1.714 GB, 14.30% gc time)

@alyst (Contributor Author) commented Mar 8, 2017

Finally, v0.6 using the official nightly. Everything is improved, although, with the exception of categorical arrays, there is still considerable time/memory overhead.

arr = sample(["A", "B", "C", "D", "E", "F", "G", "H", Nullable()], 10^7);
open("arr.jlser", "w") do io serialize(io, arr) end;
open("null_arr.jlser", "w") do io serialize(io, NullableArray(arr)) end;
open("null_cat_arr.jlser", "w") do io serialize(io, categorical(arr)) end;
arr2 = sample(["A", "B", "C", "D", "E", "F", "G", "H", "G"], 10^7);
open("arr2.jlser", "w") do io serialize(io, arr2) end;

map(filesize, ["arr2.jlser", "arr.jlser", "null_arr.jlser", "null_cat_arr.jlser"])
# 4-element Array{Int64,1}:
# 100000012
# 320003354
#  91977889
#  40001083

@time open("arr.jlser", "r") do io deserialize(io) end;
# 21.946035 seconds (89.01 M allocations: 3.474 GiB, 15.27% gc time)
@time open("arr.jlser", "r") do io deserialize(io) end;
# 26.444456 seconds (88.89 M allocations: 3.469 GiB, 26.98% gc time)

@time open("null_arr.jlser", "r") do io deserialize(io) end;
#  7.818007 seconds (48.89 M allocations: 1.094 GiB, 35.74% gc time)
@time open("null_arr.jlser", "r") do io deserialize(io) end;
#  5.232938 seconds (48.89 M allocations: 1.094 GiB, 13.68% gc time)

@time open("null_cat_arr.jlser", "r") do io deserialize(io) end;
#  0.148508 seconds (33.76 k allocations: 39.701 MiB)
@time open("null_cat_arr.jlser", "r") do io deserialize(io) end;
#  0.031035 seconds (2.04 k allocations: 38.254 MiB)

@time open("arr2.jlser", "r") do io deserialize(io) end;
#  5.172547 seconds (40.00 M allocations: 991.886 MiB, 15.52% gc time)
@time open("arr2.jlser", "r") do io deserialize(io) end;
#  4.753201 seconds (40.00 M allocations: 991.886 MiB, 12.20% gc time)

@nalimilan (Member) commented:

Much better indeed. It could be useful to post the relevant timings on the Julia issue too.

I'm not sure what other software could be used to compare serialization performance. Maybe Feather and HDF5?

@alyst (Contributor Author) commented Mar 8, 2017

Also R's load()/save() serialization. It's written in C, so it should be quite fast.
