slow String deserialization performance in v0.5 #18633
Can you post the code in a gist?
I also get a slowdown in a quick benchmark. Running `io = IOBuffer(); s = [randstring(32) for i=1:10^6]; @time serialize(io, s); seek(io, 0); @time deserialize(io);` in Julia 0.4.6 gives:
while in Julia 0.5.0 I get:
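For reference, the one-liner above written out as a runnable sketch (on Julia 0.7+ it additionally needs `using Random, Serialization`; on 0.4/0.5 these functions live in Base):

```julia
# Benchmark string serialization/deserialization round-trip in memory.
io = IOBuffer()
s = [randstring(32) for i in 1:10^6]  # 10^6 random 32-character strings
@time serialize(io, s)                # time the serialization
seek(io, 0)                           # rewind the buffer
@time deserialize(io)                 # time the deserialization
```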
Here's the code I used to test it. I have a module with a whole bunch of utility functions; I think I have included everything I used from it as functions here, but in case I did not, they are all from ExpandingMan/DatasToolbox. I was not aware of the [...]. By the way, someone should do [...]
Ah, of course that makes sense, never mind then. However [...]
Partially fixed by #18583.
I added a performance benchmark of this to nanosoldier, so we can track this going forward. |
Julia 0.6 is about 3x faster than 0.4 for string deserialization using Steven's benchmark. Please comment if there is still a regression here.
There are a couple of existing threads on this, but they are so old that I am assuming the cause here is new.
The performance of deserialization seems to have taken a big hit in v0.5. I remember v0.4 being much faster, but unfortunately I no longer have my v0.4 setup, so I can't give comparisons. However, I think these numbers (in the case of strings) are indicative of a problem regardless.
I tested serialization and deserialization on 8*10^6-element dataframes; here are the results (I include comparisons to Python's pickle, protocol 3).
Float64
Floats look OK: serialization is faster than pickle, deserialization is slower.
Int64
Integers also look fine; again, serialization is faster than pickle, deserialization slower.
!! String !!
Strings are worryingly slow. Note that in this case these are all the same (length-16) string. Performance seems to get significantly worse when the strings are all different, so I guess there must be some kind of compression going on. As you can see, deserialization is roughly 20 times slower than pickle.
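A minimal way to check the identical-vs-distinct observation is the following sketch (a hypothetical harness, not the original test code; on Julia 0.7+ add `using Random, Serialization`):

```julia
# Round-trip a vector through an in-memory buffer, timing only deserialization.
function roundtrip(v)
    io = IOBuffer()
    serialize(io, v)
    seek(io, 0)
    @time deserialize(io)
end

same     = fill("abcdefgh12345678", 10^6)    # one length-16 string, repeated
distinct = [randstring(16) for i in 1:10^6]  # 10^6 different length-16 strings
roundtrip(same)
roundtrip(distinct)
```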
Symbol
Symbols are better, but in my demo they were all the same symbol, so I expected them to be lightning fast, and they weren't really (it's quite possible my understanding of how that works is incorrect).
In case anyone is curious, serialized Julia dataframes and pickled pandas dataframes are almost exactly the same size for numeric data, but interestingly, for strings and symbols the Julia dataframes are only about 0.7x as big.
All these serializations and deserializations (including the pickles) were done to and from disk (a SATA3 SSD), so obviously the exact numbers are very system-dependent. The largest (string) dataframes were still < 200MB. The pickle times do not include any time it takes to convert data between Python and Julia, though pickle itself is called through PyCall.
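The disk-based timing on the Julia side can be sketched as follows (the filename is hypothetical, and the pickle comparison through PyCall is omitted; on Julia 0.7+ add `using Random, Serialization`):

```julia
# Time serialization and deserialization to and from disk.
data = [randstring(16) for i in 1:8*10^6]  # 8*10^6 length-16 strings

open("strings.jls", "w") do f
    @time serialize(f, data)    # write to disk
end

open("strings.jls", "r") do f
    @time deserialize(f)        # read back from disk
end
```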