Update Julia benchmarks #51
Conversation
@jangorecki this PR updates the Julia `by` benchmarks to the DataFrames.jl package v0.15.0 syntax. @nalimilan - can you please double check if I have updated everything correctly?
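(For context, here is a minimal sketch of the kind of grouped aggregation the `by` benchmarks exercise, on toy data and using the generic anonymous-function form that works both before and after DataFrames.jl 0.15.0; the exact 0.15.0 spellings adopted by this PR are in the diff and not reproduced here. Column names and data below are illustrative only.)

```julia
using DataFrames

# Toy stand-in for the benchmark table; names and values are made up.
df = DataFrame(id1 = repeat(["id001", "id002", "id003"], inner = 2), v1 = 1:6)

# Grouped sum via the generic function form of `by` (valid in the 0.15 era):
res = by(df, :id1, d -> DataFrame(v1 = sum(d[:v1])))
```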
Thanks! Looking forward to the new results!
Note, though, that the benchmark includes compilation time, so especially the first `by` call will probably be much slower than it could be. Not sure whether running them twice would be fair to the other languages.
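(Aside: a tiny illustration of the compilation effect being discussed, assuming a DataFrames.jl 0.15-era session; the data and names below are made up, not from the benchmark.)

```julia
using DataFrames

df = DataFrame(id1 = rand(["a", "b", "c"], 10^6), v1 = rand(10^6))

# Define the aggregation once so both calls hit the same method specialization.
f(d) = sum(d[:v1])

@time by(df, :id1, f)  # first call: includes JIT compilation
@time by(df, :id1, f)  # second call: mostly the actual grouping work
```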
Actually, each benchmark is run twice.
@bkamins Thanks for the updated code. On the benchmark report, in the top right corner of the plot, you can see the legend entries "first time" and "second time". We plot both of those timings. The value next to the bars is …
Thank you - now I see it 😄. As for data loading, I understand that only the 50GB case fails (for 0.5GB and 5GB I can see the results), right? Also, can you please let us know when you plan to re-run the benchmarks, so that we can see if CSV.jl still fails? Finally, if it still fails, could you please report an issue on https://github.com/JuliaData/CSV.jl indicating which file failed to load? I am sure that @quinnj will be willing to fix it.
Thanks. It's good to show both the first and second run indeed. BTW, could you call this "DataFrames.jl" instead of "juliadf"? We don't use the latter name anywhere.
@bkamins It was not failing but hanging. The first script was processed fine, but the next one, after closing the Julia process, resetting the env var to point to another dataset, and starting the same Julia script again, was hanging on loading the CSV, even when the size was the same. I need to investigate this. AFAIR it started to happen when I switched to …
Thanks. The hang is a known problem with …
@nalimilan: so can you add a PR to https://github.com/h2oai/db-benchmark and set …
@nalimilan Please fix this; because of that issue, Julia is at the moment disabled from the runs. Is it possible to easily use development versions of Julia packages? I now run …
If there is no easy way to do this, I will proceed with …
Yes. You can make Julia track the master branch of a package by writing:
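(The snippet from the original comment is not preserved here; below is a minimal sketch using the Julia 1.x Pkg API, with CSV.jl purely as an example package name.)

```julia
using Pkg

# Track the development (master) branch of a package, e.g. CSV.jl:
Pkg.add(PackageSpec(name = "CSV", rev = "master"))

# Equivalent in the Pkg REPL (press `]` at the julia> prompt):
#   pkg> add CSV#master

# To return to the latest registered release later:
Pkg.free("CSV")
```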
However, I would say that this is not needed, as all the packages are on a fast release cycle since they are still in the development phase. Also, in our case the categorical type is defined in the CategoricalArrays.jl package, and the fix we are discussing will probably be made there by @nalimilan; but normally you do not install this package directly, and checking out the master of CSV.jl or DataFrames.jl would not fix it anyway.
I've filed #53, since the fix might take a bit to prepare, merge and release. Anyway, using categorical columns doesn't really change performance currently when there are only 100 rows per group. When we're able to deal with this better, I'll make another PR.