Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Faster computation of quantiles in describe #2909

Merged
merged 2 commits into from
Oct 16, 2021
Merged

Faster computation of quantiles in describe #2909

merged 2 commits into from
Oct 16, 2021

Conversation

nalimilan
Copy link
Member

@nalimilan nalimilan commented Oct 15, 2021

Computing all quantiles when we only need the median is signficantly slower (and it's even worse than it should due to https://github.com/JuliaLang/Statistics.jl/issues/84).
Also avoid trying to compute quantiles for string columns since the failure only happens after sorting the vector, which is almost all of the work.

julia> df = DataFrame(x=string.(rand('a':'k', 10_000)));

# current main
julia> @btime describe(df);
  844.866 μs (147 allocations: 88.03 KiB)

# computing only the median
julia> @btime describe(df);
  480.801 μs (146 allocations: 87.92 KiB)

# not computing the median at all (PR)
julia> @btime describe(df);
  198.739 μs (142 allocations: 9.67 KiB)

julia> df = DataFrame(x=rand(1:10, 10_000));

# current main
julia> @btime describe(df);
  181.266 μs (108 allocations: 85.55 KiB)

# PR
julia> @btime describe(df);
  79.170 μs (100 allocations: 85.19 KiB)

Computing all quantiles when we only need the median is signficantly slower.
Also avoid trying to compute quantiles for string columns since the failure
only happens after sorting the vector, which is almost all of the work.
Copy link
Member

@bkamins bkamins left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good. We could add a correctness test.
Also note that this assumes that some custom subtype of AbstractString does not define arithmetic operations on it, but I assume it is OK to make this assumption in practice.

@bkamins bkamins added this to the patch milestone Oct 15, 2021
@nalimilan
Copy link
Member Author

Looks good. We could add a correctness test.

We already cover this AFAICT.

Also note that this assumes that some custom subtype of AbstractString does not define arithmetic operations on it, but I assume it is OK to make this assumption in practice.

Yes, in theory that's possible, though I'm not sure what sense it would make... An alternative would be to call middle on the first element when the eltype is concrete to get the error immediately. That would be more general so maybe a better solution actually?

@pdeffebach
Copy link
Contributor

Can we be even smarter about this and use sorted = true based on some flag we make as we try different quantile! stuff? It's not clear what "partially sorted" means in the documentation.

@nalimilan
Copy link
Member Author

nalimilan commented Oct 15, 2021

Can we be even smarter about this and use sorted = true based on some flag we make as we try different quantile! stuff? It's not clear what "partially sorted" means in the documentation.

"Partially sorted" refers to partialsort. Basically, the element(s) which are at the quantile's position have to be sorted, others can be anywhere. But I'm not sure how passing sorted=true would help?

@pdeffebach
Copy link
Contributor

Ah I see. Nevermind, this looks good.

@nalimilan
Copy link
Member Author

I've pushed a commit implementing the more general approach, let me know what you think.

It's funny to note that on Julia >= 1.7, thanks to https://github.com/JuliaLang/Statistics.jl/pull/72, a test on Dates is now able to compute the median, which fails with current DataFrames main because computing the third quartile fails (but the median works as it's an integer value...).

@nalimilan nalimilan merged commit 85fa306 into main Oct 16, 2021
@nalimilan nalimilan deleted the nl/describe2 branch October 16, 2021 17:07
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants