describe on columns with e.g. Integers #2269

KristofferC · 2020-05-26T07:49:19Z

Right now, describe is documented with:

If a column's base type derives from Real, :nunique will return nothings.

The worry is that, in general, computing the number of distinct elements requires memory proportional to the length of the column.
There are perhaps a few ways we could tweak things:

Compute the number of unique values below some threshold
Above some threshold, use a technique like HyperLogLog to get an approximate number.

The drawback of thresholds is that it could be surprising to write code that works for a given data frame size and have it suddenly stop working when you load a bigger data frame. For approximate counting techniques (HyperLogLog), the drawback is that a user might think the count is always exact and allocate data structures of a certain size based on it (which might then fail when one tries to put in all the distinct values).
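To make the approximate-count concern concrete, here is a minimal HyperLogLog sketch. It is written in Python purely as a language-neutral illustration; it is not the DataFrames.jl, HyperLogLog.jl, or OnlineStats implementation. The function name `hll_estimate`, the choice of SHA-1 as the hash, and the default `p=12` are all assumptions made for this example, and it uses the original estimator with a linear-counting correction for small cardinalities rather than any newer corrections.

```python
import hashlib
import math


def hll_estimate(values, p=12):
    """Approximate the number of distinct values using 2**p registers.

    Hypothetical helper for illustration; assumes p >= 7 so that the
    bias constant below is applicable.
    """
    m = 1 << p
    registers = [0] * m
    for v in values:
        # 64-bit hash of the value's repr (an illustrative choice of hash).
        h = int.from_bytes(hashlib.sha1(repr(v).encode()).digest()[:8], "big")
        idx = h >> (64 - p)                # the first p bits pick a register
        rest = h & ((1 << (64 - p)) - 1)   # the remaining 64 - p bits
        # rank = 1-based position of the leftmost 1-bit in `rest`
        rank = (64 - p) - rest.bit_length() + 1
        registers[idx] = max(registers[idx], rank)

    alpha = 0.7213 / (1 + 1.079 / m)       # bias constant for m >= 128
    raw = alpha * m * m / sum(2.0 ** -r for r in registers)

    zeros = registers.count(0)
    if raw <= 2.5 * m and zeros:
        # Small-range correction: fall back to linear counting.
        return m * math.log(m / zeros)
    return raw
```

The memory used is fixed by `p` regardless of column length, but the result is a floating-point estimate with a relative error of roughly 1.04 / sqrt(2**p), so it should not be used to size a container that must hold every distinct value.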

Ref #1435

nalimilan · 2020-05-26T07:54:07Z

Another advantage of not special-casing Real is that the high cardinality problem can also happen with other types (e.g. dates and times) so it would be more robust. I think we could start with the simple solution of stopping after some threshold, and later we could implement HyperLogLog (see the old HyperLogLog.jl).
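The "stop after some threshold" idea can be sketched as follows. This is an illustrative Python sketch, not DataFrames.jl code: the name `nunique_capped`, the `limit` parameter, and the choice to return `None` (standing in for Julia's `nothing`) once the limit is exceeded are assumptions made for this example.

```python
def nunique_capped(column, limit):
    """Return the exact distinct count, or None once it exceeds `limit`.

    Hypothetical helper: memory is bounded by limit + 1 stored values,
    independent of the column's type or length.
    """
    seen = set()
    for x in column:
        seen.add(x)
        if len(seen) > limit:
            # Give up early instead of storing every distinct value.
            return None
    return len(seen)


# Usage: a low-cardinality column is counted exactly, a high-cardinality one is not.
few = nunique_capped([1, 2, 2, 3, 3, 3], limit=100)
many = nunique_capped(range(1_000_000), limit=100)
```

Because the check does not depend on the element type, the same code path covers integers, floats, dates, and times. The cost is the behavior change at the threshold: the result switches from a number to a missing value as the data grows.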

bkamins · 2020-05-26T08:15:40Z

My recommendation would be:

handle everything in the same way when computing :nunique; I do not think presenting an approximate count is a good idea - we should do an exact calculation;
disable computing of :median and :nunique by default (leave them as opt-in)

This will dramatically improve usability of describe. Currently, for non-toy data sets, describe is in practice very slow because of computing :median and :nunique. If someone really wants to see these statistics, they can opt in.

piever · 2020-05-26T08:52:42Z

Maybe relevant: HyperLogLog is implemented in OnlineStats.
In the same vein, OnlineStats also has the P-square algorithm for approximate quantiles.

StefanKarpinski · 2020-05-26T12:47:44Z

I have an implementation of HyperLogLog with the best new corrections to the estimator that I can contribute. I should probably put it up in a new package. Here are some additional details: joshday/OnlineStats.jl#177. Probably best to integrate the improvements into OnlineStats though.

bkamins · 2020-05-26T13:38:58Z

My personal preference would be to leave that out of DataFrames.jl to keep the list of dependencies short. describe allows passing any function for aggregation, so in a sense "this is already available". The question is, in my opinion, only about what we show by default (and, as I have commented, I would prefer to limit this to the most basic statistics so that describe responds fast).

nalimilan · 2020-05-26T15:16:48Z

I'm fine with not reporting the number of uniques by default.

Though even if we do that it would still be nice to be able to use it 1) without it taking ages even if you have a numeric column with many unique values and 2) if you want to know the number of unique values of e.g. an integer column with few unique values. That said, I don't have a perfect solution to that.

bkamins · 2020-05-26T15:29:17Z

The solution I think is describe(df, :nunique => fun, other statistics ...) where fun is a function of your choice (exact count or approximate, from whatever package you like).

pdeffebach · 2020-05-26T18:42:00Z

Stata's summarize, the inspiration for this command, doesn't include it. So I'm fine dropping it as we aren't losing feature parity. Plus I've found describe a bit too wide sometimes. This will help with that.

bkamins · 2020-06-29T18:12:12Z

Unless there is some other comment I will open a PR implementing the recommendation:

disable computing of :median and :nunique by default (leave them as opt-in)

(the issue has popped up again on Slack and newcomers can be really surprised by the slow response time for large data sets)

bkamins · 2020-07-31T11:29:04Z

See #2339

bkamins added the breaking and decision labels May 26, 2020

bkamins added this to the 1.0 milestone May 26, 2020

bkamins mentioned this issue Jul 31, 2020

[BREAKING] remove median and nunique from describe by default #2339

Merged

bkamins closed this as completed in #2339 Aug 2, 2020

nalimilan mentioned this issue Aug 26, 2020

Provide nunique for <:Real columns in describe #2384

Closed
