Provide an equivalent to dplyr's summarise function #84

davidanthoff · 2017-02-02T17:34:30Z

No description provided.

floswald · 2017-06-07T15:25:26Z

Hi @davidanthoff, are you looking for some help with that? The cost for you, of course, is that you would lose some time managing and explaining what I do. This assumes that it does not require a fundamental change to your setup.
Cheers

davidanthoff · 2017-06-07T16:40:58Z

I'm mainly still struggling with the design for this... But I kind of have an idea now, and it would be great to hear your feedback on it. And yes, I could generally use help with the whole package/ecosystem, so that would be most welcome and I would not mind at all explaining things. Are you coming to JuliaCon this year? That might be an efficient way.

Ok, here is my idea. For the case where you want to summarize a grouped result, you can today use the following syntax:

@from i in df begin
    @group i by i.state into g
    @select {age=mean(map(j->j.age,g)), oldest=maximum(map(j->j.age,g))}
    @collect DataFrame
end

Some of these reduction functions in base allow you to pass a function that transforms things before the reduction happens, e.g. there is mean(f::Function, v), so one could rewrite the @select statement as

    @select {age=mean(j->j.age,g), oldest=maximum(map(j->j.age,g))}

That is a bit better, but many of the reduction functions in base don't support this, and I find it still clunky.
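To make the difference concrete, here is a minimal sketch in plain Julia, outside of any query. The vector `g` is made-up sample data standing in for one group, and the code assumes a current Julia version, where `mean` lives in the `Statistics` standard library and `maximum` also accepts a transformation function.

```julia
using Statistics  # `mean` lives here in Julia 1.x (it was in Base in 2017)

# Assumed sample data: one group represented as a vector of named tuples.
g = [(state="CA", age=30), (state="CA", age=50), (state="CA", age=40)]

# Map first, then reduce: allocates an intermediate vector of ages.
via_map = mean(map(j -> j.age, g))

# Pass the transformation to the reduction: no intermediate vector.
via_f = mean(j -> j.age, g)

# In current Julia, `maximum` accepts a transformation function as well.
oldest = maximum(j -> j.age, g)
```

Both forms give the same result; the second only avoids the temporary array.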

I think there are two ways out of this:

1. @JeffBezanson mentioned in JuliaLang/julia#21875 ("Enable function composition for . fusion operator") that .a is shorthand for x->x.a in some languages. That, in combination with a systematic attempt to add methods that take a transformation function to all reduction functions in base (like the mean function), would generally allow us to write things like @select {age=mean(.age, g), oldest=maximum(.age, g)}. I think that would be nice. But .a doesn't parse right now, so this would require a change in base.
2. Another idea (that I think I saw @JeffBezanson make somewhere else) would be that x..a is shorthand for map(i->i.a, x). That would allow us to write things like @select {age=mean(g..age), oldest=maximum(g..age)}. I think I like that syntax best, actually. The benefit of this one is that a..b parses currently, so we could implement that transformation as part of the @from macro, i.e. we could do this right now. And maybe someday that syntax will make its way into base, which would be great, of course.
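The `x..a` rewrite described above can be sketched as a small standalone macro. This is only an illustration of the idea, not how Query.jl implements it: `rewrite_dotdot` and `@dotdot` are hypothetical names, and the sketch relies on `a..b` parsing as a call to the `..` operator.

```julia
using Statistics

# Hypothetical helper: anything that is not an Expr (symbols, literals,
# line-number nodes) passes through unchanged.
rewrite_dotdot(ex) = ex

# Recursively rewrite `x..a` into `map(i -> getproperty(i, :a), x)`.
function rewrite_dotdot(ex::Expr)
    if ex.head == :call && length(ex.args) == 3 &&
       ex.args[1] == Symbol("..") && ex.args[3] isa Symbol
        source = rewrite_dotdot(ex.args[2])
        field = QuoteNode(ex.args[3])
        return :(map(i -> getproperty(i, $field), $source))
    end
    return Expr(ex.head, map(rewrite_dotdot, ex.args)...)
end

# Hypothetical macro that applies the rewrite to a single expression.
macro dotdot(ex)
    esc(rewrite_dotdot(ex))
end

# Assumed sample data standing in for a group.
g = [(state="CA", age=30), (state="CA", age=50)]

avg = @dotdot mean(g..age)
oldest = @dotdot maximum(g..age)
```

Because the rewrite happens at macro-expansion time, no definition of `..` is needed at run time.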

I thought for a while that the story for summarizing a whole query is more tricky. I could add a @summarize statement that can be used inside a query, but in general I'm not happy with how I'm terminating queries these days, and this would be yet another statement that terminates a query. But I just merged an initial version of a piping syntax (the goal is to eventually add another full dplyr-like user API to the package). With this piping syntax I think one could move the summarize functionality outside of the query itself and have something like this:

df |> @query(i, begin
        @select i
    end) |>
    @summarize(age=mean(age), oldest=maximum(age))

This whole piping syntax works already on master, the only thing missing is the @summarize macro here. The caveat would be that @summarize only works with table sources. My thinking right now is that in general I'll make the dplyr interface to only work with tables, and only the existing LINQ style interface would support all the other, non table sources and targets that it supports right now.

One general question is what @summarize returns. I was thinking right now to just return a named tuple. But that is different from dplyr, where it returns a table with one row. The equivalent in my system would be that @summarize would return an iterator with one row that returns a named tuple. My gut feeling is that returning just a named tuple is easier, but I'm not sure...
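The two return shapes under discussion can be compared with a small sketch in plain Julia. The `summarize` function here is hypothetical and is not the proposed `@summarize` macro; it only shows a bare named tuple next to a one-row collection wrapping the same named tuple.

```julia
using Statistics

# Assumed sample rows, as a query might yield them.
rows = [(name="a", age=30), (name="b", age=50)]

# Hypothetical function: apply each keyword reducer to the whole row set
# and return the results as a named tuple with the same keys.
summarize(rows; reducers...) = map(f -> f(rows), (; reducers...))

# Option 1: return just a named tuple.
s = summarize(rows;
              age = r -> mean(x.age for x in r),
              oldest = r -> maximum(x.age for x in r))

# Option 2 (dplyr-like): a one-row table, i.e. an iterable whose single
# element is that named tuple.
one_row = [s]
```

With option 1 a caller reads `s.age` directly; with option 2 the result can be piped into anything that consumes rows, at the cost of one extra `first(...)` to get at the values.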

davidanthoff · 2017-06-07T17:41:22Z

For the grouped summary story, see #121.

floswald · 2017-06-14T09:26:05Z

Hi @davidanthoff, so I finally got round to looking at this. On the upside: I'm able to run the tests. On the downside, I don't even know where to start with the code. :-( It's very advanced metaprogramming, maybe a bit too much for me. I'd like to learn, but I'm not sure it's worth your time, as I said. (Not at JuliaCon, unfortunately.)

So I find both the piping and your solution number 2 above appealing. number 2 seems the right thing for summaries within a query. So just to get the main setup right:

1. You first construct an expression with a macro, for example @from.
2. Then you call translate_query on the expression body so constructed. I suspect this is where you unpick the expression and figure out what to do?
3. So what you did in #121 ("Add a..b syntax") is to add a..b to that translation phase. I'm sure there's a good reason why there are 7 phases.
4. Is that enough? I mean, in terms of making this work, is that all that needs to be done? (Amazing!)

bramtayl · 2017-07-11T00:40:26Z

I wonder whether it would be possible to turn a vector of named tuples into a named tuple of vectors.

bramtayl · 2017-07-13T14:13:49Z

I think I've got a solution here

bramtayl · 2017-07-13T14:14:28Z

Figuring out some story about ungroup would be useful too.

davidanthoff · 2017-07-13T14:22:08Z

> I wonder whether it would be possible to turn a vector of named tuples into a named tuple of vectors.

Hm, that would imply yet another allocation, right? Unless it would be a named tuple of vector views...
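For reference, the eager conversion and the allocation it implies look roughly like this in plain Julia. The data is made up, and the sketch assumes every row has the same field names.

```julia
# Assumed sample data: a vector of named tuples (row-oriented).
rows = [(a=1, b=1.0), (a=2, b=2.0), (a=3, b=3.0)]

# Field names taken from the first row; all rows are assumed to match.
names = keys(first(rows))

# Build one freshly allocated vector per field (column-oriented).
cols = NamedTuple{names}(map(n -> [r[n] for r in rows], names))
```

Each column here is a new array, which is the extra allocation in question; a lazy variant would have to wrap `rows` instead of copying out of it.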

> Figuring out some story about ungroup would be useful too.

That should be easily done via a nested @from clause that flattens groups.

bramtayl · 2017-07-13T14:28:44Z

I guess what you really might want is generators (row.name for row in i).
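A minimal sketch of the generator idea in plain Julia, with made-up rows standing in for one group: the generator projects a field lazily, and a reduction such as `sum` consumes it without building a column first.

```julia
# Assumed sample data standing in for one group of rows.
rows = [(a=1, b=1), (a=1, b=2)]

# Lazy projection of field `b`; nothing is allocated at this point.
bs = (row.b for row in rows)

# Reductions can consume the generator directly.
total = sum(bs)

# Materialize only when an actual vector is needed.
shares = collect(bs) ./ total
```

A generator over an array can be iterated more than once, which is why `bs` is usable in both `sum` and `collect` here.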

bramtayl · 2017-07-13T15:04:40Z

I tried out the generators in LazyQuery, and they seem to be working fine on master. I was hoping you could help me out with the ungroup. Say, for example, I use LazyQuery to do something like this:

@chain @evaluate begin
    DataFrame(
        a = [1, 1, 2, 2],
        b = [1, 2, 3, 4],
        c = [4, 3, 2, 1]
    )
    query(it)
    @group it a
    @make_from it a d = collect(b) / sum(b) e = collect(c) / sum(c)
    collect(it, DataFrame)
end

I end up with nested vectors in d and e. How would I ungroup them? If you want you can send back query syntax and I can macroexpand my way through it.

davidanthoff · 2017-07-13T15:28:43Z

Something like this:

@from i in df begin
    @group i by i.a into g
    @select {g.key, some_avg = mean(j->j.b, g), group = g} into i
    @from j in i.group
    @select {i.key, i.some_avg, j.b, j.c}
    @collect DataFrame
end

Not a perfect match, but it shows the general idea.

One problematic aspect here is that this won't work if you have more than one vector in the group that you want to unroll. In my example, only group is a vector that I want to unroll (and it is a vector of named tuples). In your example you have two vectors you want to unroll (d and e), and that doesn't work with the machinery we have right now.

bramtayl · 2017-07-13T15:48:55Z

Right, so then the solution would be to take non-grouping columns, zip them back up into a vector of named tuples, unnest, then unzip them out again?
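The zip, unnest, unzip sequence can be sketched in plain Julia, without Query.jl or LazyQuery. The `grouped` value is hand-written to match what the grouped example above would produce for `d` and `e`; the variable names are only illustrative.

```julia
# Assumed grouped result: one row per group, with nested vectors d and e.
grouped = [
    (a=1, d=[1/3, 2/3], e=[4/7, 3/7]),
    (a=2, d=[3/7, 4/7], e=[2/3, 1/3]),
]

# Zip d and e element-wise within each group, then unnest into flat rows
# that repeat the grouping key.
flat = [(a=g.a, d=d, e=e) for g in grouped for (d, e) in zip(g.d, g.e)]

# Unzip back into columns if a column-oriented result is wanted.
a_col = [r.a for r in flat]
d_col = [r.d for r in flat]
e_col = [r.e for r in flat]
```

This works because `d` and `e` have the same length within each group; `zip` would silently truncate to the shorter one otherwise.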

davidanthoff · 2017-07-14T08:22:50Z

Yeah... Not ideal...

bramtayl · 2017-07-15T16:14:48Z

Ok, well, I've decided to add an additional DataFrames backend for LazyQuery to fully support grouped operations. It seems to me that the named-tuple row approach isn't really compatible with grouping/ungrouping.

davidanthoff added the enhancement label Feb 2, 2017

davidanthoff added this to the Backlog milestone Feb 2, 2017

davidanthoff mentioned this issue Feb 2, 2017

can i summarize a datasource? #83

Closed
