describe on columns with e.g. Integers #2269

Closed
KristofferC opened this issue May 26, 2020 · 10 comments · Fixed by #2339
Labels
breaking (the proposed change is breaking), decision

@KristofferC
Contributor

KristofferC commented May 26, 2020

Right now, describe is documented with:

If a column's base type derives from Real, :nunique will return nothings.

The worry is that, in general, computing the number of distinct elements requires memory proportional to the length of the column.
There are perhaps a few ways we could tweak things:

  • Compute the exact number of unique values only up to some threshold.
  • Above that threshold, use a technique like HyperLogLog to get an approximate number.
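
A minimal sketch of the first option, written in Python purely for illustration (this is not DataFrames.jl code; the name `nunique_capped` and its `threshold` parameter are hypothetical). It counts distinct values exactly but gives up once the count exceeds the threshold, so memory use is bounded by the threshold rather than by the column length:

```python
from typing import Hashable, Iterable, Optional


def nunique_capped(values: Iterable[Hashable], threshold: int) -> Optional[int]:
    """Return the exact number of distinct values, or None once it exceeds `threshold`.

    Hypothetical helper: the set never holds more than threshold + 1 elements.
    """
    seen = set()
    for v in values:
        seen.add(v)
        if len(seen) > threshold:
            # Too many distinct values: stop early instead of growing the set.
            return None
    return len(seen)


if __name__ == "__main__":
    print(nunique_capped([1, 2, 2, 3], threshold=10))   # low cardinality: exact count
    print(nunique_capped(range(1000), threshold=100))   # high cardinality: gives up
```

Returning a "no answer" value above the threshold is exactly the behavior that could surprise users when a column grows, which is the drawback discussed next.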

The drawback of thresholds is that it could be surprising to write code that works for a given dataframe size and have it suddenly stop working when you load a bigger dataframe. For approximate counting techniques (HyperLogLog), the drawback is that a user might think the count is always exact and allocate data structures of a certain size based on it (which might then fail when one tries to put in all the distinct values).

Ref #1435

@nalimilan
Member

Another advantage of not special-casing Real is that the high cardinality problem can also happen with other types (e.g. dates and times) so it would be more robust. I think we could start with the simple solution of stopping after some threshold, and later we could implement HyperLogLog (see the old HyperLogLog.jl).

@bkamins
Member

bkamins commented May 26, 2020

My recommendation would be:

  • handle everything in the same way when computing :nunique; I do not think presenting an approximate count is a good idea - we should do an exact calculation;
  • disable computing of :median and :nunique by default (leave them as opt-in)

This will dramatically improve usability of describe. Currently for non-toy data sets describe is in practice very slow because of computing :median and :nunique. If someone really wants to see these statistics they can be opt-in.

@bkamins added the breaking and decision labels May 26, 2020
@bkamins added this to the 1.0 milestone May 26, 2020
@piever

piever commented May 26, 2020

Maybe relevant: HyperLogLog is implemented in OnlineStats.
In the same vein, OnlineStats also has the P-square algorithm for approximate quantiles.

@StefanKarpinski
Member

StefanKarpinski commented May 26, 2020

I have an implementation of HyperLogLog with the best new corrections to the estimator that I can contribute. I should probably put it up in a new package. Here are some additional details: joshday/OnlineStats.jl#177. Probably best to integrate the improvements into OnlineStats though.

@bkamins
Member

bkamins commented May 26, 2020

My personal preference would be to leave that out of DataFrames.jl to keep the list of dependencies short. describe allows passing any function for aggregation, so in a sense "this is already available". The question, in my opinion, is only about what we show by default (and, as I have commented, I would prefer to limit this to the most basic statistics so that describe responds fast).

@nalimilan
Member

I'm fine with not reporting the number of uniques by default.

Though even if we do that it would still be nice to be able to use it 1) without it taking ages even if you have a numeric column with many unique values and 2) if you want to know the number of unique values of e.g. an integer column with few unique values. That said, I don't have a perfect solution to that.

@bkamins
Member

bkamins commented May 26, 2020

The solution I think is describe(df, :nunique => fun, other statistics ...) where fun is a function of your choice (exact count or approximate, from whatever package you like).

@pdeffebach
Contributor

Stata's summarize, the inspiration for this command, doesn't include it. So I'm fine dropping it, as we aren't losing feature parity. Plus, I've found describe a bit too wide sometimes. This will help with that.

@bkamins
Member

bkamins commented Jun 29, 2020

Unless there is some other comment I will open a PR implementing the recommendation:

disable computing of :median and :nunique by default (leave them as opt-in)

(the issue has popped up again on Slack and newcomers can be really surprised by the slow response time for large data sets)

@bkamins
Member

bkamins commented Jul 31, 2020

See #2339


6 participants