allow `describe` function to take arbitrary functions #1436

ExpandingMan · 2018-06-20T19:33:33Z

I'm absolutely loving the new describe function, but I'm wondering if it should be allowed to take arbitrary functions as arguments.

For example,

describe(df, stats=[:mean, :min, :custom1=>f])

Any thoughts on this? I can do the PR if we decide on what form this should take.


nalimilan · 2018-06-21T14:16:53Z

Yes, but it's probably better to just pass the function and call `string` on it to get its name, rather than passing a pair.

ExpandingMan · 2018-06-21T14:23:28Z

Well, I was wondering about that. I find this aspect of aggregate a bit annoying as often I pass anonymous functions and they wind up with ugly strings that are annoying to manipulate. Is there perhaps a solution that's both better than calling string and what I suggested here?
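
To make the naming trade-off concrete, here is a minimal sketch of the two options discussed so far. `stat_label` is a hypothetical helper written for this illustration; it is not part of DataFrames.jl.

```julia
# Hypothetical helper (not DataFrames.jl API): derive a label for a statistic.
stat_label(f::Function) = Symbol(string(f))         # name taken from the function itself
stat_label(p::Pair{Symbol,<:Function}) = first(p)   # name supplied explicitly by the user

# A named function gives a readable label.
named = stat_label(sum)

# An anonymous function gets a compiler-generated name containing '#',
# which is awkward to use as a column or row label.
anon = stat_label(x -> maximum(x) - minimum(x))

# A pair lets the user choose the label regardless of how the function was defined.
paired = stat_label(:range => (x -> maximum(x) - minimum(x)))

println((named, anon, paired))
```

The exact generated name for the anonymous function varies between Julia versions and sessions, which is part of why it is hard to work with afterwards.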

pdeffebach · 2018-06-21T15:50:14Z

I've thought about this too. The issue is that the function has to be pretty robust, in that it shouldn't throw an error for any of the column types in the data frame. Most datasets have at least a non-numeric ID column.

Adding this feature puts a lot of trust on the user.

ExpandingMan · 2018-06-21T16:29:59Z

It's already catching errors and returning nothing in those cases, so we could do the same thing for custom functions. Normally I'm not a fan of catching errors, but I think this is a great use case for it because, as you said, usually there are a couple of columns on which some functions can't run.

The downside of this is that it would make it difficult for the user to debug their own code.
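
A rough sketch of that idea, assuming a hypothetical wrapper `safe_stat` and a hypothetical `strict` keyword (neither is existing `describe` behavior). The keyword rethrows the error, which is one possible way to soften the debugging problem mentioned above.

```julia
# Hypothetical wrapper: apply `f` to a column and return `nothing` if it throws.
# With `strict = true` the original error is rethrown so users can debug `f`.
function safe_stat(f, col; strict::Bool = false)
    try
        return f(col)
    catch
        strict && rethrow()
        return nothing
    end
end

# A stand-in for a data frame: a NamedTuple of columns, including a string ID.
cols = (id = ["a", "b", "c"], x = [1.0, 2.0, 3.0])

# A custom statistic that only makes sense for numeric columns.
span(v) = maximum(v) - minimum(v)

# `map` over a NamedTuple keeps the column names.
results = map(c -> safe_stat(span, c), cols)
println(results)
```

Calling `safe_stat(span, cols.id; strict = true)` surfaces the underlying `MethodError` instead of hiding it.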

pdeffebach · 2018-06-21T20:12:39Z

Oh, yeah I hadn't thought about that.

We would just have to separate the stats vector into symbols and tuples. Then, if it's a tuple, add it to the dict of stats.

And this is useful because our function can take care of the try...catch for the user. The alternative is like

desc = describe(df)
desc[:q99] = [try mean(x) end for x in columns(df)]

which is a bit ugly.
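
The splitting step could look roughly like the following. This is only a sketch: `BUILTIN_STATS` and `resolve_stats` are hypothetical names, the entries are assumed to be `Symbol => Function` pairs as in the original proposal, and none of this reflects the actual internals of `describe`.

```julia
using Statistics  # for `mean`

# Hypothetical table of built-in statistics, keyed by symbol.
const BUILTIN_STATS = Dict{Symbol,Function}(
    :mean => mean,
    :min  => minimum,
    :max  => maximum,
)

# Turn a mixed vector of symbols and `name => function` pairs into a
# uniform list of `name => function` pairs.
function resolve_stats(stats::AbstractVector)
    resolved = Pair{Symbol,Function}[]
    for s in stats
        if s isa Symbol
            push!(resolved, s => BUILTIN_STATS[s])   # look up the built-in
        elseif s isa Pair{Symbol,<:Function}
            push!(resolved, s)                       # user-supplied custom function
        else
            throw(ArgumentError("unsupported stats entry: $s"))
        end
    end
    return resolved
end

stats = [:mean, :min, :custom1 => (x -> maximum(x) - minimum(x))]
resolved = resolve_stats(stats)
println(first.(resolved))
```

Once everything is a `name => function` pair, the built-in and custom statistics can go through the same try/catch path.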

pdeffebach · 2018-06-24T12:45:29Z

I'm a bit confused about type stability in this case.

If we have stats be a vector of both symbols and symbol-function pairs, is that bad for performance in some way? Is there some sort of issue with having multiple types in that vector? Does it not matter because all the actual computation is in these functions themselves?

nalimilan · 2018-06-24T21:03:22Z

It shouldn't make a significant difference given that most of the time is indeed spent in the functions themselves.

pdeffebach mentioned this issue Aug 18, 2018

make nunique option in describe work for integers #1435

Closed

pdeffebach mentioned this issue Jan 3, 2019

Add custom functions to describe #1664

Merged

bkamins mentioned this issue Jan 15, 2019

DataFrames.jl roadmap #1678

Closed

31 tasks

nalimilan closed this as completed in #1664 Mar 6, 2019
