
slow String deserialization performance in v0.5 #18633

Closed
ExpandingMan opened this issue Sep 22, 2016 · 8 comments
Labels
performance (Must go faster), regression (Regression in behavior compared to a previous version)

Comments

@ExpandingMan
Contributor

There are a couple of existing threads on this, but they are old enough that I assume the cause here is new.

The performance of deserialization seems to have taken a big hit in v0.5. I remember v0.4 being much faster, but unfortunately I no longer have my v0.4 setup, so I can't give direct comparisons; still, I think these numbers (in the case of strings) are indicative of a problem regardless.

I tested serialization and deserialization on 8×10^6-element dataframes; here are the results (I include comparisons to Python's pickle, protocol 3).
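A minimal sketch of this kind of disk round-trip benchmark, using a plain vector rather than the actual dataframes (the `bench_roundtrip` helper is illustrative only, not from the test code):

```julia
using Serialization  # stdlib on Julia >= 0.7; on 0.5 these functions lived in Base

# Illustrative helper: time writing a column to disk and reading it back.
function bench_roundtrip(path, data)
    open(path, "w") do io
        @time serialize(io, data)
    end
    open(path, "r") do io
        @time deserialize(io)
    end
end

col = fill("0123456789abcdef", 8 * 10^6)   # the same length-16 string repeated
restored = bench_roundtrip(tempname(), col)
@assert restored == col                     # sanity check: round trip is lossless
```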

Float64

Floats look OK: serialization is faster than pickle, deserialization is slower.

INFO: Serializing...
  1.632409 seconds (8.33 M allocations: 135.588 MB, 9.12% gc time)
INFO: Pickling...
  2.461387 seconds (67.31 k allocations: 2.670 MB)
INFO: Deserializing...
  2.401419 seconds (40.17 M allocations: 1.498 GB, 16.37% gc time)
INFO: Unpickling...
  1.223878 seconds (6.12 k allocations: 281.328 KB)

Int64

Integers also look fine: again, serialization is faster than pickle and deserialization slower.

INFO: Serializing...
  0.055541 seconds (68 allocations: 2.391 KB)
INFO: Pickling...
  0.139713 seconds (5.11 k allocations: 232.461 KB)
INFO: Deserializing...
  0.094004 seconds (284 allocations: 62.002 MB, 30.81% gc time)
INFO: Unpickling...
  0.062518 seconds (5.13 k allocations: 233.820 KB)

!! String !!

Strings are worryingly slow. Note that in this case the elements are all the same (length-16) string. Performance seems to get significantly worse when the strings are all different, so I guess there must be some kind of compression going on. As you can see, deserialization is roughly 20 times slower than pickle.

INFO: Serializing...
  2.596744 seconds (8.09 M allocations: 125.543 MB, 5.78% gc time)
INFO: Pickling...
  2.408540 seconds (5.11 k allocations: 232.461 KB)
INFO: Deserializing...
 23.581884 seconds (88.03 M allocations: 4.353 GB, 35.62% gc time)
INFO: Unpickling...
  1.195566 seconds (5.13 k allocations: 233.820 KB)

Symbol

Symbols are better, but in my demo they were all the same symbol, so I expected them to be lightning fast, and they weren't really (it's quite possible my understanding of how that works is incorrect).

INFO: Serializing...
  1.276334 seconds (8.01 M allocations: 122.531 MB)
INFO: Pickling...
  2.561902 seconds (5.11 k allocations: 232.461 KB)
INFO: Deserializing...
  3.758706 seconds (40.01 M allocations: 1.492 GB, 42.79% gc time)
INFO: Unpickling...
  1.215305 seconds (5.13 k allocations: 233.820 KB)

In case anyone is curious, serialized Julia dataframes and pickled pandas dataframes are almost exactly the same size for numeric data, but interestingly, for strings and symbols the Julia dataframes are only about 0.7 times as big.

All these serializations and deserializations (including the pickles) were done to and from disk (a SATA3 SSD), so the exact numbers are obviously very system-dependent. The largest (string) dataframes were still under 200 MB. The pickle times do not include any time spent converting data between Python and Julia, though pickle itself is called through PyCall.

@ViralBShah ViralBShah added the performance Must go faster label Sep 22, 2016
@ViralBShah
Member

Can you post the code in a gist?

@stevengj stevengj added the regression Regression in behavior compared to a previous version label Sep 22, 2016
@stevengj
Member

I also get a slowdown in a quick benchmark. Running

io = IOBuffer(); s = [randstring(32) for i=1:10^6]; @time serialize(io, s); seek(io, 0); @time deserialize(io);

in Julia 0.4.6 gives:

  0.496930 seconds (1.21 M allocations: 25.728 MB, 3.41% gc time)
  0.988836 seconds (5.24 M allocations: 231.618 MB, 24.29% gc time)

while in Julia 0.5.0 I get:

  0.781738 seconds (1.17 M allocations: 24.074 MB, 2.42% gc time)
  3.609652 seconds (11.11 M allocations: 576.955 MB, 26.23% gc time)
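For reference, on Julia ≥ 0.7 the same in-memory benchmark needs the Serialization and Random stdlibs (in 0.4/0.5 these functions were available from Base):

```julia
using Serialization, Random  # stdlibs on Julia >= 0.7; Base exports in 0.4/0.5

io = IOBuffer()
s = [randstring(32) for i in 1:10^6]
@time serialize(io, s)
seekstart(io)                # same as seek(io, 0)
@time t = deserialize(io)
@assert t == s               # sanity check: round trip is lossless
```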

@ExpandingMan
Contributor Author

ExpandingMan commented Sep 22, 2016

Here's the code I used to test it. I have a module with a whole bunch of utility functions; I think I have included everything I used from it as functions here, but in case I did not, they are all from ExpandingMan/DatasToolbox in src/dfutils.jl.

Gist

I was not aware of the randstring function when I wrote this, so I'm afraid I used the same string everywhere, which may affect things if there is any kind of compression.

By the way, someone should do rand(dtype::Type{String}, n) = randstring(n) in random.jl.

@stevengj
Member

rand(String, n) would make a length-n array of String, not a String of length n.

@ExpandingMan
Contributor Author

ExpandingMan commented Sep 22, 2016

Ah, of course, that makes sense; never mind then. However, rand(String, n) probably should be defined to do that; currently it's undefined. Anyway, getting off-topic...
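For illustration, the kind of definition being discussed might look like this (purely hypothetical, not in Base; the per-string length is an arbitrary choice, which is part of why this isn't an obvious API):

```julia
import Base: rand
using Random  # randstring is in the Random stdlib on Julia >= 0.7

# Hypothetical method: rand(String, n) as a length-n Vector of random
# strings. Each string's length (8 here) is an arbitrary assumption.
rand(::Type{String}, n::Integer) = String[randstring(8) for _ in 1:n]

v = rand(String, 5)   # 5-element Vector{String}
```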

@vtjnash
Member

vtjnash commented Sep 22, 2016

partially fixed by #18583

@stevengj
Member

stevengj commented Oct 7, 2016

I added a performance benchmark of this to nanosoldier, so we can track this going forward.

@KristofferC
Member

Julia 0.6 is about 3x faster than 0.4 for string deserialization using Steven's benchmark. Please comment if there is still a regression here.
