
Update Julia benchmarks #51

Merged
merged 1 commit into from Dec 4, 2018
Conversation

Contributor

@bkamins commented Dec 3, 2018

@jangorecki this PR updates the Julia `by` benchmarks to the DataFrames.jl package v0.15.0 syntax.

@nalimilan - can you please double check if I have updated everything correctly?
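A minimal sketch of the kind of `by` change involved (toy data, assuming the pair syntax introduced around DataFrames.jl v0.15; the actual benchmark scripts differ):

using DataFrames

df = DataFrame(id = ["a", "a", "b"], v1 = [1, 2, 3])

# older style: pass a function operating on each sub-DataFrame
by(df, :id, d -> DataFrame(v1 = sum(d[:v1])))

# v0.15 pair syntax: name the column and the reducing function directly
by(df, :id, :v1 => sum)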
Contributor

@nalimilan left a comment

Thanks! Looking forward to the new results!

Though the benchmark includes the compilation time, and especially the first `by` call will probably be much slower than it could be. Not sure whether running them twice would be fair to the other languages.
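A minimal sketch (not the benchmark harness itself, using made-up toy data) of why the first and second runs differ in Julia:

using DataFrames

df = DataFrame(id = rand(1:100, 10^6), v1 = rand(10^6))

t1 = @elapsed by(df, :id, :v1 => sum)   # "first time": includes JIT compilation
t2 = @elapsed by(df, :id, :v1 => sum)   # "second time": compiled code only
println("first: $(t1) s, second: $(t2) s")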

Contributor Author

@bkamins commented Dec 4, 2018

Actually each benchmark is run twice.
@jangorecki: which timing do you take for comparisons?

Contributor

@jangorecki commented Dec 4, 2018

@bkamins Thanks for the updated code. On the benchmark report, in the top right corner of the plot, you can see the legend entries "first time" and "second time". We plot both of those timings. The value printed next to the bars is the max of the two, so it does not overlap any of the bars and is still located in the expected position on the X axis.
Currently juliadf is not displayed in the report because the julia scripts were hanging on loading CSV data (when run from the shell in a loop). I don't yet have a reproducible example to report. I was hoping the feather format would help to overcome that issue, but it seems to segfault for bigger data (haven't yet checked Feather.jl, but in R/python it doesn't work reliably).

@jangorecki jangorecki merged commit eae1746 into h2oai:master Dec 4, 2018
@bkamins bkamins deleted the patch-1 branch December 4, 2018 16:18
Contributor Author

@bkamins commented Dec 4, 2018

Thank you - now I see it 😄.

As for data loading, I understand that only the 50GB case fails (for 0.5GB and 5GB I can see the results) - right?

Also, can you please let us know when you plan to re-run the benchmarks, so that we can see whether CSV.jl still fails?

Finally, if it still fails, could you please report an issue on https://github.com/JuliaData/CSV.jl indicating which file failed to load? I am sure that @quinnj will be willing to fix it.

@nalimilan
Contributor

Thanks. It's good to show both the first and second run indeed.

BTW, could you call this "DataFrames.jl" instead of "juliadf"? We don't use the latter name anywhere.

Contributor

@jangorecki commented Dec 5, 2018

@bkamins it was not failing but hanging. The first script was processed fine, but the next one (after closing the julia process, resetting the env var to point to another dataset, and starting the same julia script again) was hanging on loading the csv, even when the size was the same. Need to investigate this. AFAIR it started happening when I switched to categorical=true, but I'm not sure whether that is actually related, as the first script was finishing successfully and all the next ones were hanging. Looks more like an environment issue.
@nalimilan done in 7e43d0e

@nalimilan
Contributor

Thanks.

The hang is a known problem with categorical=true: we re-sort the levels for each new level, which is terrible when there are 10M levels. I'd recommend using categorical=false for that case; categorical arrays are not designed to be more efficient with that many levels anyway. (But I'll fix that, it's easy.)
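A minimal sketch of the workaround being suggested, assuming the benchmark keeps loading with CSV.read as discussed above; "data.csv" is a placeholder for the real dataset path:

using CSV, DataFrames

# skip pooling into categorical levels; high-cardinality string columns stay plain
df = CSV.read("data.csv", categorical = false)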

Contributor Author

@bkamins commented Dec 5, 2018

@nalimilan: so can you add a PR to https://github.com/h2oai/db-benchmark that sets categorical to true or false as appropriate, or should we leave it as is here and fix categorical with many levels?

Contributor

@jangorecki commented Dec 5, 2018

@nalimilan please fix; because of that issue julia is at the moment disabled from the runs.

Is it possible to use development versions of julia packages easily? I now run `using Pkg; Pkg.update();`, so it updates to the stable released version. Our preferred way is to run the benchmark on development versions, so users and devs can see progress when pushing improvements. It will be even more valuable once #46 is resolved. We currently do it for:

  • data.table - we publish a devel version of the pkg after every commit, so install.packages is enough
  • dplyr - devtools to install from GitHub
  • (py)datatable - we build from source; soon we will use "Latest release links for S3"

If there is no easy way to do this, I will proceed with categorical=false for now.

Contributor Author

@bkamins commented Dec 5, 2018

Is it possible to use development versions of julia packages easily?

Yes. You can make Julia track the master branch of a package by writing:

using Pkg
Pkg.add(PackageSpec(name="PackageName", rev="master"))

However, I would say that this is not needed, as all the packages are on a fast release cycle since they are all still in the development phase.

Also, in our case, the categorical type, for example, is defined in the CategoricalArrays.jl package and the fix we discuss will probably be made there by @nalimilan, but normally you do not install that package directly, and checking out master of CSV.jl or DataFrames.jl would not fix it anyway.
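If the fix does land on CategoricalArrays.jl master before a tagged release, the same PackageSpec pattern could, hypothetically, track that package directly:

using Pkg

# hypothetical: track CategoricalArrays.jl master until the fix is released
Pkg.add(PackageSpec(name="CategoricalArrays", rev="master"))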

@nalimilan
Contributor

I've filed #53, since the fix might take a bit to prepare, merge and release. Anyway, using categorical columns doesn't really change performance currently when there are only 100 rows per group. When we're able to deal with this better, I'll make another PR.

Tmonster added a commit to Tmonster/db-benchmark that referenced this pull request Oct 30, 2023
…ai#51)

This PR contains the following fixes.

- Updates the regression test script to test all solutions together. If any two solutions have a different answer, the 'all' regression will throw an error.
- Bump memory sizes for solutions that set a fixed memory size.
- Add a script for mounting NVMe drives.
- Updates the duckdb-latest query run on group by query 8.
- Add a script to inspect the time.csv file to find solutions/queries that have incorrect results compared to other solutions/queries.