Provide an equivalent to dplyrs summarise function #84

Open
davidanthoff opened this issue Feb 2, 2017 · 14 comments
@davidanthoff
Member

No description provided.

@floswald
Contributor

floswald commented Jun 7, 2017

hi @davidanthoff are you looking for some help with that? the cost for you is of course that you would lose some time managing and explaining what I do. of course assuming that this does not require a fundamental change of your setup.
cheers

@davidanthoff
Member Author

I'm mainly still struggling with the design for this... But I kind of have an idea now, and it would be great to hear your feedback on it. And yes, I could generally use help with the whole package/ecosystem, so that would be most welcome and I would not mind at all explaining things. Are you coming to JuliaCon this year? That might be an efficient way.

Ok, here is my idea. For the case where you want to summarize a grouped result, you can today use the following syntax:

@from i in df begin
    @group i by i.state into g
    @select {age=mean(map(j->j.age,g)), oldest=maximum(map(j->j.age,g))}
    @collect DataFrame
end

Some of these reduction functions in base allow you to pass a function that transforms things before the reduction happens, e.g. there is mean(f::Function, v), so one could rewrite the @select statement as

    @select {age=mean(j->j.age,g), oldest=maximum(map(j->j.age,g))}

That is a bit better, but many of the reduction functions in base don't support this, and I find it still clunky.

I think there are two ways out of this:

  1. @JeffBezanson mentioned in JuliaLang/julia#21875 ("Enable function composition for . fusion operator") that .a is shorthand for x->x.a in some languages. That in combination with a systematic attempt to add methods that take a transformation function to all reduction functions in base (like the mean function) would generally allow us to write things like @select {age=mean(.age, g), oldest=maximum(.age, g)}. I think that would be nice. But .a doesn't parse right now, so this would require a change in base.
  2. Another idea (that I think I saw @JeffBezanson make somewhere else) would be that x..a is shorthand for map(i->i.a, x). So that would allow us to write things like @select {age=mean(g..age), oldest=maximum(g..age)}. I think I like that syntax best, actually. Benefit of this one is that a..b parses currently, so we could implement that transformation as part of the @from macro, i.e. we could do this right now. And maybe someday that syntax will make its way into base, which would be great, of course.
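To make option 2 concrete, here is a minimal sketch of the kind of expression rewrite it implies. It is not the Query.jl implementation: `rewrite_dotdot` and `@dotdot` are hypothetical names, and the sketch assumes Julia 1.x with built-in named tuples. It relies only on the fact that `a..b` already parses as a call to the `..` operator.

```julia
using Statistics

# Hypothetical helper (not part of Query.jl): walk an expression and
# rewrite every `x..field` into `map(i -> getproperty(i, :field), x)`.
rewrite_dotdot(ex) = ex  # symbols, literals, line numbers: leave untouched
function rewrite_dotdot(ex::Expr)
    if ex.head == :call && length(ex.args) == 3 &&
       ex.args[1] == Symbol("..") && ex.args[3] isa Symbol
        src = rewrite_dotdot(ex.args[2])
        field = QuoteNode(ex.args[3])
        return :(map(i -> getproperty(i, $field), $src))
    end
    return Expr(ex.head, map(rewrite_dotdot, ex.args)...)
end

# Hypothetical macro that applies the rewrite to a single expression.
macro dotdot(ex)
    esc(rewrite_dotdot(ex))
end

# A stand-in for one group `g`: a vector of named tuples.
g = [(name = "a", age = 30), (name = "b", age = 50)]

avg_age = @dotdot mean(g..age)
oldest  = @dotdot maximum(g..age)
```

In this sketch the rewrite happens before the expression is evaluated, which is why no definition of `..` is needed at runtime.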

I thought for a while that the story for summarizing a whole query is more tricky. I could add a @summarize statement that can be used inside a query, but in general I'm not happy with how I'm terminating queries these days; this would be yet another statement that terminates a query, and I'm just not super happy with the whole design there. But I just merged an initial version of a piping syntax (the goal is to eventually add another full dplyr-like user API to the package). With this piping syntax I think one could move the summarize functionality outside of the query itself and have something like this:

df |> @query(i, begin
        @select i
    end) |>
    @summarize(age=mean(age), oldest=maximum(age))

This whole piping syntax works already on master, the only thing missing is the @summarize macro here. The caveat would be that @summarize only works with table sources. My thinking right now is that in general I'll make the dplyr interface to only work with tables, and only the existing LINQ style interface would support all the other, non table sources and targets that it supports right now.

One general question is what @summarize returns. I was thinking right now to just return a named tuple. But that is different from dplyr, where it returns a table with one row. The equivalent in my system would be that @summarize would return an iterator with one row that returns a named tuple. My gut feeling is that returning just a named tuple is easier, but I'm not sure...
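The two candidate return shapes can be compared in plain Julia. This is only an illustration under the assumption of Julia 1.x named tuples; it does not use `@summarize`, which does not exist yet, and the variable names are made up for the example.

```julia
using Statistics

rows = [(name = "a", age = 30), (name = "b", age = 50)]
ages = map(r -> r.age, rows)

# Option 1: the summary is a single named tuple.
summary_tuple = (age = mean(ages), oldest = maximum(ages))

# Option 2 (closer to dplyr): a one-row table, i.e. an iterable
# that yields exactly one named tuple.
summary_table = [summary_tuple]

mean_age_direct   = summary_tuple.age          # direct field access
mean_age_from_row = first(summary_table).age   # the row has to be pulled out first
```

The named tuple is easier to consume directly, while the one-row table composes with anything that expects a table source.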

@davidanthoff
Member Author

For the grouped summary story, see #121.

@floswald
Contributor

hi @davidanthoff so I finally got round to looking at this. On the upside: I'm able to run the tests. On the downside, I don't even know where to start with the code. :-( It's very advanced with metaprogramming, maybe a bit too much for me. I'd like to learn, but I'm not sure it's worth your time, as I said. (Not at JuliaCon, unfortunately.)

So I find both the piping and your solution number 2 above appealing. number 2 seems the right thing for summaries within a query. So just to get the main setup right:

  • you first construct an expression with a macro. for example @from.
  • then you call translate_query on the expression body so constructed. I suspect this is where you unpick the expression and figure out what to do?
  • so what you did in #121 ("Add a..b syntax") is to add a..b to that translation phase. I'm sure there's a good reason why there are 7 phases.
  • is that enough? I mean in terms of making this work, is that all that needs to be done? (amazing!)

@bramtayl
Contributor

I wonder whether it would be possible to turn a vector of named tuples into a named tuple of vectors.

@bramtayl
Contributor

I think I've got a solution here

@bramtayl
Contributor

Figuring out some story about ungroup would be useful too.

@davidanthoff
Member Author

I wonder whether it would be possible to turn a vector of named tuples into a named tuple of vectors.

Hm, that would imply yet another allocation, right? Unless it would be a named tuple of vector views...
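To show where that allocation would come from, here is a rough sketch of the conversion in plain Julia 1.x. `columns` is a hypothetical helper, not something Query.jl provides.

```julia
# Hypothetical helper: turn a vector of named tuples (rows)
# into a named tuple of vectors (columns).
function columns(rows::AbstractVector{<:NamedTuple})
    colnames = keys(first(rows))
    # Each inner `map` allocates one new Vector per column.
    cols = map(name -> map(r -> getproperty(r, name), rows), colnames)
    return NamedTuple{colnames}(cols)
end

rows = [(a = 1, b = 2.0), (a = 3, b = 4.0)]
cols = columns(rows)
```

Avoiding the copies would need some lazy or view-like column type in place of the inner `map`, which this sketch does not attempt.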

Figuring out some story about ungroup would be useful too.

That should be easily done via a nested @from clause that flattens groups.

@bramtayl
Contributor

I guess what you really might want is generators (row.name for row in i).
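As a small illustration of the generator idea (plain Julia 1.x, with a hand-written vector of named tuples standing in for a group):

```julia
# One group, represented as a vector of named-tuple rows.
g = [(a = 1, b = 1, c = 4), (a = 1, b = 2, c = 3)]

# The generator feeds the reduction directly; no intermediate
# column vector is materialized for `b`.
total_b = sum(row.b for row in g)

# A column is only allocated when it is explicitly collected.
shares = collect(row.b / total_b for row in g)
```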

@bramtayl
Contributor

bramtayl commented Jul 13, 2017

I tried out the generators in LazyQuery, and they seem to be working fine on master. I was hoping you could help me out with the ungroup. Say for example I use LazyQuery to do something like this:

@chain @evaluate begin
    DataFrame(
        a = [1, 1, 2, 2],
        b = [1, 2, 3, 4],
        c = [4, 3, 2, 1]
    )
    query(it)
    @group it a
    @make_from it a d = collect(b) / sum(b) e = collect(c) / sum(c)
    collect(it, DataFrame)
end

I end up with nested vectors in d and e. How would I ungroup them? If you want you can send back query syntax and I can macroexpand my way through it.

@davidanthoff
Member Author

Something like this:

@from i in df begin
    @group i by i.a into g
    @select {g.key, some_avg = mean(j->j.b, g), group = g} into i
    @from j in i.group
    @select {i.key, i.some_avg, j.b, j.c}
    @collect DataFrame
end

Not a perfect match, but it shows the general idea.

One problematic aspect here is that this won't work if you have more than one vector in the group that you want to unroll. I.e. in my example, only group is a vector that I want to unroll (but it is a vector of named tuples). In your example you have two vectors you want to unroll (d and e), and that doesn't work with the machinery we have right now.

@bramtayl
Contributor

Right, so then the solution would be to take non-grouping columns, zip them back up into a vector of named tuples, unnest, then unzip them out again?
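Spelled out in plain Julia 1.x, that zip / unnest / unzip round trip would look roughly like this. The grouped rows are written by hand to mirror the d and e columns from the example above; nothing here uses Query.jl or LazyQuery.

```julia
# Grouped result with two nested vectors per group, as in the example
# with a = [1, 1, 2, 2], b = [1, 2, 3, 4], c = [4, 3, 2, 1].
grouped = [
    (a = 1, d = [1/3, 2/3], e = [4/7, 3/7]),
    (a = 2, d = [3/7, 4/7], e = [2/3, 1/3]),
]

# Zip the nested columns back into rows and unnest: one row per element.
flat = [(a = g.a, d = d, e = e) for g in grouped for (d, e) in zip(g.d, g.e)]

# Unzip the flat rows into columns again.
result = (a = [r.a for r in flat],
          d = [r.d for r in flat],
          e = [r.e for r in flat])
```

This makes the cost visible: every nested column is traversed once to zip and the flat rows are traversed again per column to unzip.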

@davidanthoff
Member Author

Yeah... Not ideal...

@bramtayl
Contributor

Ok, well, I've decided to add an additional DataFrames backend for LazyQuery to fully support grouped operations. It seems to me that the named-tuple row approach isn't really compatible with grouping/ungrouping.
