
slow String deserialization performance in v0.5 #18633

Closed
ExpandingMan opened this issue Sep 22, 2016 · 8 comments
Labels
performance (Must go faster), regression (Regression in behavior compared to a previous version)

Comments

@ExpandingMan
Contributor

There are a couple of existing threads on this, but they are old enough that I assume the cause here is new.

The performance of deserialization seems to have taken a big hit in v0.5. I remember v0.4 being much faster, but unfortunately I no longer have my v0.4 setup, so I can't give direct comparisons; still, I think these numbers (in the case of strings) are indicative of a problem regardless.

I tested serialization and deserialization on 8×10^6-element dataframes; here are the results (I include comparisons to Python's pickle, protocol 3).
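A minimal sketch of this kind of disk round-trip benchmark, using a plain vector rather than the actual dataframes (the `bench_roundtrip` helper is illustrative only, not from the test code):

```julia
using Serialization  # stdlib on Julia >= 0.7; on 0.5 these functions lived in Base

# Illustrative helper: time writing a column to disk and reading it back.
function bench_roundtrip(path, data)
    open(path, "w") do io
        @time serialize(io, data)
    end
    open(path, "r") do io
        @time deserialize(io)
    end
end

col = fill("0123456789abcdef", 8 * 10^6)   # the same length-16 string repeated
restored = bench_roundtrip(tempname(), col)
@assert restored == col                     # sanity check: round trip is lossless
```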

Float64

Floats look OK: serialization is faster than pickle, deserialization is slower.

INFO: Serializing...
  1.632409 seconds (8.33 M allocations: 135.588 MB, 9.12% gc time)
INFO: Pickling...
  2.461387 seconds (67.31 k allocations: 2.670 MB)
INFO: Deserializing...
  2.401419 seconds (40.17 M allocations: 1.498 GB, 16.37% gc time)
INFO: Unpickling...
  1.223878 seconds (6.12 k allocations: 281.328 KB)

Int64

Integers also look fine: again, serialization is faster than pickle and deserialization slower.

INFO: Serializing...
  0.055541 seconds (68 allocations: 2.391 KB)
INFO: Pickling...
  0.139713 seconds (5.11 k allocations: 232.461 KB)
INFO: Deserializing...
  0.094004 seconds (284 allocations: 62.002 MB, 30.81% gc time)
INFO: Unpickling...
  0.062518 seconds (5.13 k allocations: 233.820 KB)

!! String !!

Strings are worryingly slow. Note that in this case the elements are all the same (length-16) string. Performance seems to get significantly worse when the strings are all different, so I guess there must be some kind of compression going on. As you can see, deserialization is roughly 20 times slower than pickle.

INFO: Serializing...
  2.596744 seconds (8.09 M allocations: 125.543 MB, 5.78% gc time)
INFO: Pickling...
  2.408540 seconds (5.11 k allocations: 232.461 KB)
INFO: Deserializing...
 23.581884 seconds (88.03 M allocations: 4.353 GB, 35.62% gc time)
INFO: Unpickling...
  1.195566 seconds (5.13 k allocations: 233.820 KB)

Symbol

Symbols are better, but in my demo they were all the same symbol, so I expected them to be lightning fast, and they weren't really (it's quite possible my understanding of how that works is incorrect).

INFO: Serializing...
  1.276334 seconds (8.01 M allocations: 122.531 MB)
INFO: Pickling...
  2.561902 seconds (5.11 k allocations: 232.461 KB)
INFO: Deserializing...
  3.758706 seconds (40.01 M allocations: 1.492 GB, 42.79% gc time)
INFO: Unpickling...
  1.215305 seconds (5.13 k allocations: 233.820 KB)

In case anyone is curious, serialized Julia dataframes and pickled pandas dataframes are almost exactly the same size for numeric data, but interestingly, for strings and symbols the Julia dataframes are only about 0.7 times as big.

All these serializations and deserializations (including the pickles) were done to and from disk (a SATA3 SSD), so the exact numbers are obviously very system-dependent. The largest (string) dataframes were still under 200 MB. The pickle times do not include any time spent converting data between Python and Julia, though pickle itself is called through PyCall.

@ViralBShah ViralBShah added the performance Must go faster label Sep 22, 2016
@ViralBShah
Member

Can you post the code in a gist?

@stevengj stevengj added the regression Regression in behavior compared to a previous version label Sep 22, 2016
@stevengj
Member

I also get a slowdown in a quick benchmark. Running

io = IOBuffer(); s = [randstring(32) for i=1:10^6]; @time serialize(io, s); seek(io, 0); @time deserialize(io);

in Julia 0.4.6 gives:

  0.496930 seconds (1.21 M allocations: 25.728 MB, 3.41% gc time)
  0.988836 seconds (5.24 M allocations: 231.618 MB, 24.29% gc time)

while in Julia 0.5.0 I get:

  0.781738 seconds (1.17 M allocations: 24.074 MB, 2.42% gc time)
  3.609652 seconds (11.11 M allocations: 576.955 MB, 26.23% gc time)
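For reference, on Julia ≥ 0.7 the same in-memory benchmark needs the Serialization and Random stdlibs (in 0.4/0.5 these functions were available from Base):

```julia
using Serialization, Random  # stdlibs on Julia >= 0.7; Base exports in 0.4/0.5

io = IOBuffer()
s = [randstring(32) for i in 1:10^6]
@time serialize(io, s)
seekstart(io)                # same as seek(io, 0)
@time t = deserialize(io)
@assert t == s               # sanity check: round trip is lossless
```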

@ExpandingMan
Contributor Author

ExpandingMan commented Sep 22, 2016

Here's the code I used to test it. I have a module with a whole bunch of utility functions; I think I have included everything I used from it as functions here, but in case I did not, they are all from ExpandingMan/DatasToolbox in src/dfutils.jl.

Gist

I was not aware of the randstring function when I wrote this, so I'm afraid I used the same string everywhere, which may affect things if there is any kind of compression.

By the way, someone should do rand(dtype::Type{String}, n) = randstring(n) in random.jl.

@stevengj
Member

rand(String, n) would make a length-n array of String, not a String of length n.

@ExpandingMan
Contributor Author

ExpandingMan commented Sep 22, 2016

Ah, of course, that makes sense; never mind then. However, rand(String, n) probably should be defined to do that; currently it's undefined. Anyway, getting off-topic...
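For illustration, the kind of definition being discussed might look like this (purely hypothetical, not in Base; the per-string length is an arbitrary choice, which is part of why this isn't an obvious API):

```julia
import Base: rand
using Random  # randstring is in the Random stdlib on Julia >= 0.7

# Hypothetical method: rand(String, n) as a length-n Vector of random
# strings. Each string's length (8 here) is an arbitrary assumption.
rand(::Type{String}, n::Integer) = String[randstring(8) for _ in 1:n]

v = rand(String, 5)   # 5-element Vector{String}
```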

@vtjnash
Member

vtjnash commented Sep 22, 2016

partially fixed by #18583

@stevengj
Member

stevengj commented Oct 7, 2016

I added a performance benchmark of this to nanosoldier, so we can track this going forward.

@KristofferC
Member

Julia 0.6 is about 3x faster than 0.4 for string deserialization using Steven's benchmark. Please comment if there is still a regression here.
