Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

make nunique option in describe work for integers #1435

Closed
wants to merge 1 commit into from

Conversation

ExpandingMan
Copy link
Contributor

I'm absolutely loving the new describe function, thanks to all those who worked on this!

Anyway, it was really bothering me that the nunique option was not working on integers, this makes it so that this option works on Integer types, but no other Real types.

I had considered making it work for everything except AbstractFloat, but there are also Rational and Irrational, so I thought making this work only on integers seemed sensible. Any thoughts on this? Should we extend it to Rational? (In practice even Irrational would probably be fine.)

@ExpandingMan
Copy link
Contributor Author

Sorry, I didn't even realize there were tests for this. Let's see if there's a consensus on what types this should be done for and then I'll fix the tests.

@nalimilan
Copy link
Member

This was made on purpose because computing the number of unique values needs a lot of memory when it's large. We've mentioned at #1409 that the best approach would probably to have a special unique function which bails out when too many values have been encountered (1000? 10000?).

CC: @pdeffebach

@pdeffebach
Copy link
Contributor

I think it seems like a good idea. We can't expect users to always use categorical arrays even though they are supposed to.

Wrapping nunique is a good solution, though. How do we do this? Choose a big number, like 10,000, or have a time limit on the computation? I'm not sure how hard the second option is.

@nalimilan
Copy link
Member

Choosing a number is probably the best approach.

@ExpandingMan
Copy link
Contributor Author

Yeah, probably simplest to limit it. Ideally such a function would live somewhere else other than DataFrames. I may work on this at some point.

@pdeffebach
Copy link
Contributor

Fun fact, length(Set(x)) is generally faster than length(Unique(x)) so it might make sense to change describe to use that instead.

julia> x = randn(10_000)

julia> @btime foo_unique($x) # length(unique(x))
  492.946 μs (36 allocations: 450.59 KiB)
10000

julia> @btime foo_set($x)
  219.970 μs (8 allocations: 144.68 KiB) # length(Set(x))
10000

@ExpandingMan
Copy link
Contributor Author

I'm not too sure why the difference should be so large. My guess is that the unique(::AbstractArray) code is sub-optimal.

Regardless, I don't think I'll ever take back up this PR. What is really needed is a more flexible describe function. Whether or not it's me that does that (no immediate plans), I think I'll close this PR now.

@pdeffebach
Copy link
Contributor

On slack people were saying is has something to do with the way Set stops after unique values. but that wouldn't explain rand(N)

Yeah for the record I think #1436 is a great idea and will hope to work on that soon.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants