Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[BREAKING] remove median and nunique from describe by default #2339

Merged
merged 4 commits into from
Aug 2, 2020

Conversation

bkamins
Copy link
Member

@bkamins bkamins commented Jul 31, 2020

Fixes #2269

As decided there we just drop computing :median and :nunique by default.
This is simplest to change, and if someone wants them it is easy to opt-in.

Just to get a relative impact on performance of dropping this consider:

julia> x = rand(10^8);

julia> @time mean(x)
  0.055976 seconds (1 allocation: 16 bytes)
0.5000065333980973

julia> @time median(x)
  2.046998 seconds (3 allocations: 762.940 MiB, 8.01% gc time)
0.5000076592923275

julia> @time length(unique(x))
 24.142629 seconds (112 allocations: 4.903 GiB, 0.49% gc time)
100000000

which when you have even several dozens of variables only starts to be prohibitive.

@bkamins bkamins added this to the 1.0 milestone Jul 31, 2020
@bkamins bkamins added breaking The proposed change is breaking. performance labels Jul 31, 2020
@bkamins
Copy link
Member Author

bkamins commented Jul 31, 2020

(in particular note that even if we start adding support to threading in DataFrames.jl - which is a plan according to the responses in https://discourse.julialang.org/t/dataframes-jl-development-survey/44022) still :median and :nunique will be problematic because of their memory usage (but this is another issue).

@bkamins
Copy link
Member Author

bkamins commented Jul 31, 2020

I have also switched unique to Set, which is faster, but still prohibitive:

julia> @time length(Set(x));
 13.259684 seconds (14 allocations: 3.375 GiB, 1.79% gc time)

@nalimilan
Copy link
Member

Dropping the number of unique values sounds fine, but I'm a bit more reluctant to drop the median. It's a more robust indicator than the mean and it would be too bad not to report it by default just because in some cases it will be slow. I assume in most cases it should be fast enough.

FWIW, the carefully designed skimr R package has an interesting approach:

n_missing complete_rate  mean    sd    p0   p25   p50   p75  p100 hist 

Minimum and maximum are reported as being the 0% and 100% percentiles, so the median is between these two. Additionally, the 25% and 75% quantiles are reported. (The rate of complete observations is somewhat redundant IMO.) What do you think?

@bkamins
Copy link
Member Author

bkamins commented Jul 31, 2020

The problem is (for the same size of data):

julia> @time quantile(x, 0:0.25:1);
 20.659685 seconds (4 allocations: 762.940 MiB, 0.42% gc time)

so I would not include it for performance reasons.

I will revert the median as you propose, however it will mean that for 10^8 rows and 100 columns (probably the maximum size normally we should consider) it adds 5 minutes to computing time of the results.

@nalimilan
Copy link
Member

0% and 100% quantiles can be computed using minimum and maximum as currently, so I wouldn't include them in the timing. Also, it turns out that quantile(x, [0.25, 0.5, 0.75]) is twice slower than quantile(x, 0.25) + quantile(x, 0.5) + quantile(x, 0.75), which is completely absurd and needs to be investigated. But yes, computing two more quantiles would be somewhat slower.

@bkamins
Copy link
Member Author

bkamins commented Jul 31, 2020

Actually I thought that computing quantile(x, 0:0.25:1) would be fastest and then you would not compute min and max as you would get it for free. It definitely should be investigated why this slows down.

Anyway in this PR I would focus on dropping things (and I understand we agree to drop nunique and keep median by default). Let us come back to adding things when we understand better the issues with quantile in Statistics. OK?

@bkamins
Copy link
Member Author

bkamins commented Jul 31, 2020

re-introduced :median

@bkamins bkamins merged commit 20f1125 into JuliaData:master Aug 2, 2020
@bkamins bkamins deleted the faster_describe branch August 2, 2020 07:24
@bkamins
Copy link
Member Author

bkamins commented Aug 2, 2020

Thank you!

@bkamins bkamins changed the title remove median and nunique from describe by default [BREAKING] remove median and nunique from describe by default Aug 7, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
breaking The proposed change is breaking. performance
Projects
None yet
Development

Successfully merging this pull request may close these issues.

describe on columns with e.g. Integers
2 participants