describe on columns with e.g. Integers #2269
Comments
Another advantage of not special-casing |
My recommendation would be:
This will dramatically improve usability of |
Maybe relevant: HyperLogLog is implemented in OnlineStats here. |
I have an implementation of HyperLogLog with the best new corrections to the estimator that I can contribute. I should probably put it up in a new package. Here are some additional details: joshday/OnlineStats.jl#177. Probably best to integrate the improvements into OnlineStats though. |
My personal preference would be to leave that out of DataFrames.jl to keep the list of the dependencies low. |
I'm fine with not reporting the number of uniques by default. Though even if we do that it would still be nice to be able to use it 1) without it taking ages even if you have a numeric column with many unique values and 2) if you want to know the number of unique values of e.g. an integer column with few unique values. That said, I don't have a perfect solution to that. |
The solution I think is |
Stata's summarize, the inspiration for this command, doesn't include it. So I'm fine dropping it, as we aren't losing feature parity. Plus I've found |
Unless there is some other comment I will open a PR implementing the recommendation:
(the issue has popped up again on Slack and newcomers can be really surprised by the slow response time for large data sets) |
See #2339 |
Right now, describe is documented with:
The worry is that, in general, computing the number of distinct elements requires memory proportional to the length of the column.
There are perhaps a few ways we could tweak things:
The drawback of thresholds is that it could be surprising to write code that works for a given dataframe size and have it suddenly stop working when you load a bigger dataframe. For techniques like approximate counting (HyperLogLog), the drawback is that users might think the count is always exact and allocate data structures of a certain size based on it (which might then fail when one tries to put in all the distinct values).
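To make the threshold idea concrete, here is a minimal sketch in plain Julia. The function name `capped_nunique` and the `limit` keyword are hypothetical and are not part of DataFrames.jl; the sketch only assumes that giving up early and returning `nothing` is an acceptable way to signal "too many distinct values".

```julia
# Hypothetical helper, not part of DataFrames.jl: count distinct values in a
# column, but give up once more than `limit` distinct values have been seen.
function capped_nunique(col; limit::Integer = 1_000)
    seen = Set{eltype(col)}()
    for x in col
        push!(seen, x)
        # The set never holds more than `limit + 1` entries, so memory use is
        # bounded by the threshold rather than by the length of the column.
        length(seen) > limit && return nothing
    end
    return length(seen)
end

capped_nunique([1, 2, 2, 3, 3, 3])       # exact count for a small column
capped_nunique(1:10_000; limit = 100)    # returns `nothing`: over the threshold
```

Returning `nothing` rather than a truncated number keeps the failure visible, which partly addresses the surprise described above: code that relied on the count has to handle the over-threshold case explicitly instead of silently getting a wrong value.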
Ref #1435