Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

allow describe function to take arbitrary functions #1436

Closed
ExpandingMan opened this issue Jun 20, 2018 · 7 comments · Fixed by #1664
Closed

allow describe function to take arbitrary functions #1436

ExpandingMan opened this issue Jun 20, 2018 · 7 comments · Fixed by #1664

Comments

@ExpandingMan
Copy link
Contributor

I'm absolutely loving the new describe function, but I'm wondering if it should be allowed to take arbitrary functions as arguments.

For example,

describe(df, stats=[:mean, :min, :custom1=>f])

Any thoughts on this? I can do the PR if we decide on what for this should take.

@nalimilan
Copy link
Member

Yes, but probably better just pass the function and call string on it to get its name than passing a pair.

@ExpandingMan
Copy link
Contributor Author

Well, I was wondering about that. I find this aspect of aggregate a bit annoying as often I pass anonymous functions and they wind up with ugly strings that are annoying to manipulate. Is there perhaps a solution that's both better than calling string and what I suggested here?

@pdeffebach
Copy link
Contributor

pdeffebach commented Jun 21, 2018

I've thought about this too. The issue is that the function has to be pretty detailed, in that it shouldn't throw an error for any of the types for the functions. Most datasets at the least have a non-numeric ID.

Adding this feature puts a lot of trust on the user.

@ExpandingMan
Copy link
Contributor Author

It's already catching errors and returning nothing in those cases, so we could do the same thing for custom functions. Normally I'm not a fan of catching errors, but I think this is a great use case for it because, as you said, usually there are a couple of columns on which some functions can't run.

The downside of this is that it would make it difficult for the user to debug their own code.

@pdeffebach
Copy link
Contributor

Oh, yeah I hadn't thought about that.

we would just have to separate the stats vector into symbols and tuples. Then, if its a tuple, add to the dict of stats.

And this is useful because our function can take care of the try...catch for the user. The alternative is like

desc = describe(df)
desc[:q99] = [try mean(x) end for x in columns(df)]

which is a bit ugly.

@pdeffebach
Copy link
Contributor

I'm a bit confused about type stability in this case.

If we have stats be a vector of both symbols and symbol-function pairs, is that bad for performance in some way? Is there some sort of issue with having multiple types in that vector? Does it not matter because all the actual computation is in these functions themselves?

@nalimilan
Copy link
Member

It shouldn't make a significant difference given that most of the time is indeed spent in the functions themselves.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants