
Update Julia benchmarks #51

Merged
merged 1 commit into from Dec 4, 2018
Conversation

Contributor

@bkamins commented Dec 3, 2018

@jangorecki this PR updates the Julia `by` benchmarks to the DataFrames.jl package v0.15.0 syntax.

@nalimilan - can you please double check if I have updated everything correctly?
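A minimal sketch of the kind of `by` change involved (toy data, assuming the pair syntax introduced around DataFrames.jl v0.15; the actual benchmark scripts differ):

using DataFrames

df = DataFrame(id = ["a", "a", "b"], v1 = [1, 2, 3])

# older style: pass a function operating on each sub-DataFrame
by(df, :id, d -> DataFrame(v1 = sum(d[:v1])))

# v0.15 pair syntax: name the column and the reducing function directly
by(df, :id, :v1 => sum)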
Contributor

@nalimilan left a comment

Thanks! Looking forward to the new results!

Though the benchmark includes the compilation time, and especially the first `by` call will probably be much slower than it could be. Not sure whether running them twice would be fair to the other languages.
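A minimal sketch (not the benchmark harness itself, using made-up toy data) of why the first and second runs differ in Julia:

using DataFrames

df = DataFrame(id = rand(1:100, 10^6), v1 = rand(10^6))

t1 = @elapsed by(df, :id, :v1 => sum)   # "first time": includes JIT compilation
t2 = @elapsed by(df, :id, :v1 => sum)   # "second time": compiled code only
println("first: $(t1) s, second: $(t2) s")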

Contributor Author

@bkamins commented Dec 4, 2018

Actually each benchmark is run twice.
@jangorecki: which timing do you take for comparisons?

Contributor

@jangorecki commented Dec 4, 2018

@bkamins Thanks for the updated code. On the benchmark report, in the top right corner of the plot, you can see the legend entries "first time" and "second time". We plot both of those timings. The value printed next to the bars is the max of the two, so it does not overlap any of the bars and is still located in the expected position on the X axis.
Currently juliadf is not displayed in the report because the julia scripts were hanging on loading CSV data (when run from the shell in a loop). I don't yet have a reproducible example to report. I was hoping the feather format would help to overcome that issue, but it seems to segfault for bigger data (haven't yet checked Feather.jl, but in R/python it doesn't work reliably).

@jangorecki jangorecki merged commit eae1746 into h2oai:master Dec 4, 2018
@bkamins bkamins deleted the patch-1 branch December 4, 2018 16:18
Contributor Author

@bkamins commented Dec 4, 2018

Thank you - now I see it 😄.

As for data loading, I understand that only the 50GB case fails (for 0.5GB and 5GB I can see the results) - right?

Also, can you please let us know when you plan to re-run the benchmarks, so that we can see whether CSV.jl still fails?

Finally, if it still fails, could you please report an issue on https://github.com/JuliaData/CSV.jl indicating which file failed to load? I am sure that @quinnj will be willing to fix it.

@nalimilan
Contributor

Thanks. It's good to show both the first and second run indeed.

BTW, could you call this "DataFrames.jl" instead of "juliadf"? We don't use the latter name anywhere.

Contributor

@jangorecki commented Dec 5, 2018

@bkamins it was not failing but hanging. The first script was processed fine, but the next one (after closing the julia process, resetting the env var to point to another dataset, and starting the same julia script again) was hanging on loading the csv, even when the size was the same. Need to investigate this. AFAIR it started happening when I switched to categorical=true, but I'm not sure whether that is actually related, as the first script was finishing successfully and all the next ones were hanging. Looks more like an environment issue.
@nalimilan done in 7e43d0e

@nalimilan
Contributor

Thanks.

The hang is a known problem with categorical=true: we re-sort the levels for each new level, which is terrible when there are 10M levels. I'd recommend using categorical=false for that case; categorical arrays are not designed to be more efficient with that many levels anyway. (But I'll fix that, it's easy.)
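A minimal sketch of the workaround being suggested, assuming the benchmark keeps loading with CSV.read as discussed above; "data.csv" is a placeholder for the real dataset path:

using CSV, DataFrames

# skip pooling into categorical levels; high-cardinality string columns stay plain
df = CSV.read("data.csv", categorical = false)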

Contributor Author

@bkamins commented Dec 5, 2018

@nalimilan: so can you add a PR to https://github.com/h2oai/db-benchmark that sets categorical to true or false as appropriate, or should we leave it as is here and fix categorical with many levels?

Contributor

@jangorecki commented Dec 5, 2018

@nalimilan please fix; because of that issue julia is at the moment disabled from the runs.

Is it possible to use development versions of julia packages easily? I now run `using Pkg; Pkg.update();`, so it updates to the stable released version. Our preferred way is to run the benchmark on development versions, so users and devs can see progress when pushing improvements. It will be even more valuable once #46 is resolved. We currently do it for:

  • data.table - we publish a devel version of the pkg after every commit, so install.packages is enough
  • dplyr - devtools to install from GitHub
  • (py)datatable - we build from source; soon we will use "Latest release links for S3"

If there is no easy way to do this, I will proceed with categorical=false for now.

Contributor Author

@bkamins commented Dec 5, 2018

Is it possible to use development versions of julia packages easily?

Yes. You can make Julia track the master branch of a package by writing:

using Pkg
Pkg.add(PackageSpec(name="PackageName", rev="master"))

However, I would say that this is not needed, as all the packages are on a fast release cycle since they are all still in the development phase.

Also, in our case, the categorical type, for example, is defined in the CategoricalArrays.jl package and the fix we discuss will probably be made there by @nalimilan, but normally you do not install that package directly, and checking out master of CSV.jl or DataFrames.jl would not fix it anyway.
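If the fix does land on CategoricalArrays.jl master before a tagged release, the same PackageSpec pattern could, hypothetically, track that package directly:

using Pkg

# hypothetical: track CategoricalArrays.jl master until the fix is released
Pkg.add(PackageSpec(name="CategoricalArrays", rev="master"))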

@nalimilan
Contributor

I've filed #53, since the fix might take a bit to prepare, merge and release. Anyway, using categorical columns doesn't really change performance currently when there are only 100 rows per group. When we're able to deal with this better, I'll make another PR.

Tmonster added a commit to Tmonster/db-benchmark that referenced this pull request Oct 30, 2023
…ai#51)

This PR contains the following fixes.

- Updates the regression test script to test all solutions together. If any two solutions have a different answer, the 'all' regression will throw an error.
- Bump memory sizes for solutions that set a fixed memory size.
- Add a script for mounting NVMe drives.
- Updates the duckdb-latest query run on group by query 8.
- Add a script to inspect the time.csv file to find solutions/queries that have incorrect results compared to other solutions/queries.