make nunique option in describe work for integers #1435

ExpandingMan · 2018-06-20T19:29:36Z

I'm absolutely loving the new describe function, thanks to all those who worked on this!

Anyway, it was really bothering me that the nunique option was not working on integers. This PR makes the option work on Integer types, but on no other Real types.

I had considered making it work for everything except AbstractFloat, but there are also Rational and Irrational, so I thought making this work only on integers seemed sensible. Any thoughts on this? Should we extend it to Rational? (In practice even Irrational would probably be fine.)
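
To make the question concrete, here is a minimal sketch of the dispatch rule being proposed. The helper name `nunique_or_nothing` is hypothetical and is not the code in this PR or in DataFrames; it only illustrates that restricting to `Integer` element types leaves `AbstractFloat`, `Rational` and `Irrational` columns without a count.

```julia
# Hypothetical helper for illustration only, not the DataFrames implementation.
# Integer-valued columns get a unique count...
nunique_or_nothing(x::AbstractVector{<:Integer}) = length(unique(x))

# ...and every other element type (floats, rationals, strings, ...) gets `nothing`.
nunique_or_nothing(x::AbstractVector) = nothing
```

Extending this to `Rational` would only mean widening the first signature, for example to `AbstractVector{<:Union{Integer, Rational}}`.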

ExpandingMan · 2018-06-20T19:58:39Z

Sorry, I didn't even realize there were tests for this. Let's see if there's a consensus on what types this should be done for and then I'll fix the tests.

nalimilan · 2018-06-21T14:19:44Z

This was done on purpose because computing the number of unique values needs a lot of memory when that number is large. We've mentioned at #1409 that the best approach would probably be to have a special unique function which bails out when too many values have been encountered (1000? 10000?).

CC: @pdeffebach

pdeffebach · 2018-06-21T15:00:50Z

I think it seems like a good idea. We can't expect users to always use categorical arrays even though they are supposed to.

Wrapping nunique is a good solution, though. How do we do this? Choose a big number, like 10,000, or have a time limit on the computation? I'm not sure how hard the second option is.

nalimilan · 2018-06-21T15:18:21Z

Choosing a number is probably the best approach.
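
A rough sketch of what such a capped count could look like, assuming a hypothetical function name `bounded_nunique` and a default limit of 10,000 (neither exists in DataFrames or Base):

```julia
# Hypothetical helper: count distinct values, giving up once more than
# `limit` distinct values have been seen. The name and default are assumptions.
function bounded_nunique(x; limit::Integer = 10_000)
    seen = Set{eltype(x)}()
    for v in x
        push!(seen, v)
        # Bail out early so the set never holds more than `limit + 1` entries.
        length(seen) > limit && return nothing
    end
    return length(seen)
end
```

With this, `bounded_nunique(col)` returns the count for low-cardinality columns and `nothing` once the cap is exceeded, so memory use is bounded by the limit rather than by the column length.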

ExpandingMan · 2018-06-21T15:22:16Z

Yeah, it's probably simplest to limit it. Ideally such a function would live somewhere other than DataFrames. I may work on this at some point.

pdeffebach · 2018-08-18T01:42:31Z

Fun fact: length(Set(x)) is generally faster than length(unique(x)), so it might make sense to change describe to use that instead.

julia> using BenchmarkTools

julia> foo_unique(x) = length(unique(x));

julia> foo_set(x) = length(Set(x));

julia> x = randn(10_000);

julia> @btime foo_unique($x)  # length(unique(x))
  492.946 μs (36 allocations: 450.59 KiB)
10000

julia> @btime foo_set($x)  # length(Set(x))
  219.970 μs (8 allocations: 144.68 KiB)
10000

ExpandingMan · 2018-08-18T01:51:08Z

I'm not too sure why the difference should be so large. My guess is that the unique(::AbstractArray) code is sub-optimal.

Regardless, I don't think I'll ever pick this PR back up. What is really needed is a more flexible describe function. Whether or not it's me that does that (no immediate plans), I think I'll close this PR now.

pdeffebach · 2018-08-18T04:17:18Z

On Slack, people were saying it has something to do with the way Set stops after unique values, but that wouldn't explain rand(N).

Yeah, for the record, I think #1436 is a great idea and I hope to work on it soon.

make nunique option in describe work for integers

2e95c1a

ExpandingMan closed this Aug 18, 2018

KristofferC mentioned this pull request May 26, 2020

describe on columns with e.g. Integers #2269

Closed
