WIP Add @byrow #239

Closed
wants to merge 4 commits into from

pdeffebach
Collaborator

@pdeffebach pdeffebach commented Apr 18, 2021

I thought this would be a relatively trivial PR, but it looks like things are a bit complicated.

What I want:

julia> df = DataFrame(a = [1, 2], b = [3, 4]);

julia> @orderby(df, @byrow :a, @byrow :b)
2×2 DataFrame
 Row │ a      b     
     │ Int64  Int64 
─────┼──────────────
   1 │     1      3
   2 │     2      4

But that doesn't work. It looks like we can't use macro syntax as flags.

julia> :(@byrow :a, @byrow :b) |> MacroTools.prettify
:(@byrow (:a, @byrow(:b)))

That is, the first @byrow eats up all the later expressions in the block. Of course, we could require

@orderby(df, (@byrow :a), (@byrow :b))

but that's a lot of parentheses and I already feel bad enough having people type @byrow in the first place.
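
The parsing behavior described above can be checked without defining `@byrow` at all, since `Meta.parse` only builds the expression tree and never expands the macro. A minimal sketch using only Base Julia (the function name `f` in the second call is just a placeholder):

```julia
# Parsing does not expand macros, so `@byrow` does not need to exist.
ex = Meta.parse("@byrow :a, @byrow :b")

# The result is one macro call whose single argument is a tuple...
@assert ex.head == :macrocall
@assert ex.args[1] == Symbol("@byrow")
inner = ex.args[3]                      # args[2] is the LineNumberNode
@assert inner.head == :tuple

# ...and the second `@byrow` sits inside that tuple.
@assert inner.args[1] isa QuoteNode && inner.args[1].value == :a
@assert inner.args[2].head == :macrocall

# With explicit parentheses, each macro call is a separate argument.
call = Meta.parse("f(df, @byrow(:a), @byrow(:b))")
@assert call.head == :call
@assert count(a -> a isa Expr && a.head == :macrocall, call.args) == 2
```

The same greedy behavior applies to any macro written without parentheses, not just `@byrow`.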

I'm not sure what other packages have done with regards to these "flags". I'll take a look at the ecosystem and see if there are any more elegant (non-lispy) solutions.

@pdeffebach
Collaborator Author

cc @jkrumbiegel if you have any ideas for how to make this work.

@pdeffebach
Collaborator Author

pdeffebach commented Apr 18, 2021

Here is a possible fix. It basically unnests the @byrow calls.

julia> t = :(@byrow :a, :b)
:(#= REPL[127]:1 =# @byrow (:a, :b))

julia> fix_byrows(t)
2-element Vector{Any}:
 :(#= REPL[118]:3 =# @byrow :a)
 :(:b)
function fix_byrows(ex, v = Any[])
    if ex isa Expr && ex.head == :macrocall && ex.args[1] == Symbol("@byrow")
        push!(v, :(@byrow $(ex.args[3].args[1])))
        fix_byrows(ex.args[3].args[2], v)
    else
        push!(v, ex)
    end
    return v
end

function orderby_helper(x, args...)
    args = mapreduce(fix_byrows, vcat, args)
    t = (fun_to_vec(arg; nolhs = true, gensym_names = true) for arg in args)
    quote
        $DataFramesMeta.orderby($x, $(t...))
    end
end

This might be a step in the right direction.

EDIT: But this won't work with y = @byrow f(:x) style calls.
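
For illustration only, here is a slightly more defensive variant of the unnesting idea. It is a sketch, not the PR's code: the names `is_byrow` and `unnest_byrow` are hypothetical, and it shares the limitation noted in the EDIT above. It also cannot tell `@byrow :a, :b` apart from a deliberate tuple argument `@byrow (:a, :b)`, since both parse to the same expression.

```julia
# Hypothetical helpers, assuming the flag macro is spelled `@byrow`.
is_byrow(ex) = ex isa Expr && ex.head == :macrocall &&
               ex.args[1] == Symbol("@byrow") && length(ex.args) == 3

function unnest_byrow(ex, out = Any[])
    if is_byrow(ex) && ex.args[3] isa Expr && ex.args[3].head == :tuple
        items = ex.args[3].args
        # The macro is re-attached to the first tuple element only.
        push!(out, Expr(:macrocall, ex.args[1], ex.args[2], items[1]))
        # Later elements are processed on their own, so a nested
        # `@byrow` that wraps a single expression is kept as-is.
        for item in items[2:end]
            unnest_byrow(item, out)
        end
    else
        push!(out, ex)
    end
    return out
end

args = unnest_byrow(Meta.parse("@byrow :a, @byrow :b"))
@assert length(args) == 2
@assert all(is_byrow, args)
```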

@pdeffebach pdeffebach changed the title from "Add @byrow" to "WIP Add @byrow" Apr 18, 2021
@bkamins
Member

bkamins commented Apr 18, 2021

Does @orderby(df, @byrow(:a), @byrow(:b)) work? I would think it should. I know that still there is a lot of (, but at least in natural places.

@jkrumbiegel
Contributor

jkrumbiegel commented Apr 18, 2021

If I have to do x = @byrow(:a + :b) that might as well be x = ByRow(:a + :b), right? Then we would stay closer to the original syntax.

I think the contortions to make @byrow :a + :b work are not really worth it, it seems brittle to me and it actually kind of breaks the mental model of how macros work for the user. And it could make for really bad errors if you for example forget an operator in the middle and the macro parsing ends there.

I thought we would switch to :x = ... anyway, no? In this case, I still like my previous suggestion of :x .= :a + :b. I know it doesn't mean assign elementwise to :x, but 1. who cares inside of a macro, and 2. it's really short and I like short for these things that repeat over and over.

I guess a drawback is that you can't use .= on its own, without a left-hand-side variable.

@pdeffebach
Collaborator Author

An update:

> I thought we would switch to :x = ... anyway, no? In this case, I still like my previous suggestion of :x .= :a + :b. I know it doesn't mean assign elementwise to :x, but 1. who cares inside of a macro, and 2. it's really short and I like short for these things that repeat over and over.

Okay yes, I'm on board with this now. I imagine that like 99% of the time users will want @byrow and this syntax will make it as easy as possible for them. Note: this can be done with or without :x instead of x on the LHS.

> I guess a drawback is that you can't use .= on its own, without a left-hand-side variable.

Yeah, so I guess we've "solved" the problem when it comes to making new variables, but we still have to worry about @where and @orderby.

I just realized that this is a problem with any macro

julia> @transform(df, x = @. :a + :b, y = :b)
ERROR: UndefVarError: y not defined
Stacktrace:

or any function for that matter

julia> function foo(a, b)
           1
       end
foo (generic function with 1 method)

julia> x = [1, 2]; y = [3, 4];

julia> foo(@. x * y, y)
ERROR: MethodError: no method matching foo(::Tuple{Vector{Int64}, Vector{Int64}})
Closest candidates are:
  foo(::Any, ::Any) at REPL[152]:1

so users will still have to wrap parentheses with macro calls anyways..

I think I would prefer @byrow over ByRow because there is still magic happening regardless.

Also, this issue only comes up when there is more than one argument. So the user can still avoid parentheses when using one argument, which is probably the most frequent use-case.

@pdeffebach
Collaborator Author

So rules as of this PR are:

  1. if you are inside transform or select, use @transform(df, y .= :x * :y) to get element-wise multiplication. You can also do @transform(df, y = @byrow :x * :y). With multiple arguments you can do @transform(df, y .= :x * :y, z .= :x * :y), but if you do @byrow you need parentheses @transform(df, y = @byrow(:x * :y), z = @byrow(:x * :y))
  2. If you are using @where or @orderby, you have to use @byrow all the time.

@pdeffebach
Collaborator Author

pdeffebach commented Apr 18, 2021

Here is another option that I like a lot

function orderby_helper(x, args...)
    if length(args) == 1
        f = first(args)
        if f isa Expr && f.head == :block
            args = Base.remove_linenums!(f).args
        end
    end
    t = (fun_to_vec(arg; nolhs = true, gensym_names = true) for arg in args)
    quote
        $DataFramesMeta.orderby($x, $(t...))
    end
end

It works like this

julia> @orderby df begin 
           @byrow :a + 1
           @byrow :b + 1
       end
2×2 DataFrame
 Row │ a      b     
     │ Int64  Int64 
─────┼──────────────
   1 │     1      3
   2 │     2      4

EDIT: I think this is the way forward. I feel like switching all documentation to this format for expressions is safest. People can use macros as they please without any unintuitive behavior.
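
To see why the block form sidesteps the problem, it helps to look at what the parser hands to the macro. A minimal sketch using only Base Julia; filtering out `LineNumberNode`s here plays the role that `Base.remove_linenums!` plays in the helper above:

```julia
src = """
begin
    @byrow :a + 1
    @byrow :b + 1
end"""
blk = Meta.parse(src)
@assert blk.head == :block

# Drop the LineNumberNodes that the parser interleaves with statements.
stmts = filter(x -> !(x isa LineNumberNode), blk.args)
@assert length(stmts) == 2

# Each line is its own macro call: the newline ends the macro's arguments,
# so the first `@byrow` cannot swallow the second one.
@assert all(s -> s.head == :macrocall && s.args[1] == Symbol("@byrow"), stmts)
@assert stmts[1].args[3].head == :call && stmts[1].args[3].args[1] == :+
```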

cc @EconometricsBySimulation on this.

@bkamins
Member

bkamins commented Apr 18, 2021

@pdeffebach - I have a lot of other PRs to check in DataFrames.jl now so I cannot closely follow all discussions here. Can you please ping me with this PR when some decision is reached. Thank you!

@EconometricsBySimulation
Contributor

Hi all, thanks for all your hard work on this. I am not sure if I understand the motivation for @byrow( ) exactly.

Is it so that a user can do something with respect to each row without having to broadcast each function/operator? Trying to make the syntax a little less complex? If that were the case, wouldn't a keyword or row-specific macro be easier for users to learn/remember?

> @orderby df begin 
           @byrow :a + 1
           @byrow :b + 1
       end

Could be:

> @orderbyrow df begin 
           :a + 1
           :b + 1
       end

Or

> @orderby( df, :ByRow, :a + 1, :b + 1)

But since @orderby only works on rows anyway, pretty much all of this syntax seems redundant and unnecessary. I guess the only time you would need it not to be row-wise would be in the unusual event in which you want to do some kind of rescaling based on a function applied to the entire column followed by a non-monotonic transformation, such as
@orderby(df, abs.(:a .- mean(:a)), .-:b)

Coming up with an entirely new macro standard to handle what I would think to be the most common use seems unnecessarily verbose when the following would work just fine even if a row-wise orderby is the default:

@orderby(df, abs(:a - mean(df.a)), -:b)

I guess I need to be thoughtful about GroupedDataFrames, but that can also be accomplished by first @transform gdf abar .= mean(:a) followed by
@orderby(df, abs(:a - :abar), -:b)

In general though I don't see myself writing

> @orderby df begin 
           @byrow :a + 1
           @byrow :b + 1
       end

When

> @orderby df begin 
           :a .+ 1
           :b .+ 1
       end

Would work just fine. Sorry if my comments aren't useful. I think there is more of a case for @transform but I think @rowtransform or @transform(df, :byrow,...) would be both more concise and less (human) error prone.

@EconometricsBySimulation
Contributor

Wrote this on my phone so tons of errors. Sorry about that. Hopefully it is readable.

@pdeffebach
Collaborator Author

pdeffebach commented Apr 19, 2021

Thanks, @EconometricsBySimulation , for your feedback.

I think the main motivation for @byrow handled at the argument level is because that's how transform does it. i.e. there is no keyword argument for ByRow in DataFrames.transform, rather, users are required to write src => ByRow(fun) => dest.

DataFramesMeta is, at its core, a way of creating src => fun => dest calls. In order to keep a consistent mental model for users of both DataFramesMeta and DataFrames, it makes sense for @byrow to be at the argument level just as ByRow.

> Is it so that a user can do something with respect to each row without having to broadcast each function/operator? Trying to make the syntax a little less complex? If that were the case, wouldn't a keyword or row-specific macro be easier for users to learn/remember?

It's more than that, consider

@transform df y = @. :x == 1 ? 100 : 200

This will currently fail because the ? : syntax does not have broadcasting. To do this you need the clunky ifelse.(...) syntax.

On this PR, we have

@transform df y = @byrow :x == 1 ? 100 : 200

which would work because it creates the expression

f(x) = x == 1 ? 100 : 200
transform(df, :x => ByRow(f) => :y)
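
The difference between the two approaches can be tried directly with DataFrames.jl, without any macro. This is a minimal sketch assuming DataFrames.jl is installed; the toy data frame is made up for illustration:

```julia
using DataFrames

df = DataFrame(x = [1, 2])

# Column-wise: the function receives the whole vector, so the ternary
# has to be replaced by a broadcasted `ifelse`.
col = transform(df, :x => (x -> ifelse.(x .== 1, 100, 200)) => :y)

# Row-wise: `ByRow` applies the scalar function to each element,
# so `? :` (and likewise `&&` or `||`) can be used as usual.
row = transform(df, :x => ByRow(x -> x == 1 ? 100 : 200) => :y)

@assert col.y == [100, 200]
@assert row.y == [100, 200]
```

The second call is the form that the `@byrow` expression above is described as producing.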

> In general though I don't see myself writing

> @orderby df begin 
           @byrow :a + 1
           @byrow :b + 1
       end

So yeah, in that example, @byrow :a + 1 is redundant, but you would definitely need to write @byrow if you have anything with || or ? :.

> But since @orderby only works on rows anyway, pretty much all of this syntax seems redundant and unnecessary. I guess the only time you would need it not to be row-wise would be in the unusual event in which you want to do some kind of rescaling based on a function applied to the entire column followed by a non-monotonic transformation, such as
> @orderby(df, abs.(:a .- mean(:a)), .-:b)

We are restricted to making things column-wise by default because that's what transform does. In DataFrames, we recently had filter which operated row-wise, and this stuck out like enough of a sore thumb that we created subset which acts on columns.

There is definitely a tradeoff between flexibility and consistency. If we have @orderby act on rows (or @transform), then we still need to have a way to act on columns, otherwise @orderby(gd::GroupedDataFrame, ...) would not be very meaningful.

So making row-wise the default would be a bit awkward because we would need to explain the difference to new users.

> Would work just fine. Sorry if my comments aren't useful. I think there is more of a case for @transform but I think @rowtransform

Yes, this is possible, but at the same time there doesn't exist a DataFrames.rowtransform. We've done a good job so far at keeping (almost) 1:1 consistency with the base DataFrames API. Maybe in the future we can have a @rowtransform, but right now I would argue the goal is filling all the missing pieces in the DataFrames API, which right now is a ByRow function.

So that's the background for why I think @byrow is a necessary addition to the API. I agree that sometimes it can get clunky to type a lot. But hopefully if we add .= then 99% of the clunkiness goes away because, as you said, @orderby where you really need columns is rare, but it is common in @transform.

EDIT: Here is a similar discussion about transformation by groups vs. transformations by rows.

@EconometricsBySimulation
Contributor

Thanks for the kick ass and thorough response @pdeffebach! Totally on board now. Being able to use standard logical operators &&, ||, and ? : for row operations would be great!

@matthieugomez

matthieugomez commented Apr 20, 2021

I personally think that multiplying macros leads to confusion. I also think that .= feels too much like a pun (and it is not a solution that works without lhs anyway).

I would prefer something like this

@orderby(df, rows(:a + 1))
@transform(df, rows(:y = :x^2))

For multiple arguments, either or both

@orderby(df, rows(:a + 1, :b + 1))
@orderby(df, rows(:a + 1), rows(:b + 1))

@EconometricsBySimulation
Contributor

EconometricsBySimulation commented Apr 20, 2021

I like the

@orderby(df, rows(:a + 1))
@transform(df, rows(:y = :x^2))

Though I think it would be less jargony to do

@orderrows(df, :a + 1)
@transformrows(df, y = :x^2)

and you could imagine a less frequently used symmetric set of macros for use with DataFrames that look more like single type matrices:

@ordercolumns(df, :1 + 1)
@transformcolumns(df, :3 = :2^2)

@EconometricsBySimulation
Contributor

I could imagine doing

@orderrows(df, mean(col(:a)) - :a)
@transformrows(df, y = col(:x)^2)

Though something like the following would be even more readable

@orderrows(df, mean(col:a) - :a)
@transformrows(df, y = maximum(col:x)^2)

Or

@orderrows(df, mean(col.a) - :a)
@transformrows(df, y = maximum(col.x)^2)

Seems feasible.

@matthieugomez

matthieugomez commented Apr 20, 2021

Yes, defining new macros would be a great solution too (clear and not punny). transformrows is a bit of a mouthful though. Maybe an r prefix?

@rtransform(df, :a = :b^2)
@rselect(df, :a = :b^2)
@rwhere(df, :a > 0)

@jkrumbiegel
Contributor

I think rtransform or similar is also ok, but you lose the ability to switch between byrow and not when doing multiple transformations in one call.

@pdeffebach
Collaborator Author

@orderby(df, rows(:a + 1))
@transform(df, rows(:y = :x^2))

I am against this because it's not clear that rows is metaprogramming magic. I think @transform(df, @rows(:y = :x^2)) is better because the @rows tells the user that evaluation is non-standard.

> Yes, defining new macros would be a great solution too (clear and not punny). transformrows is a bit of a mouthful though. Maybe an r prefix?

@rtransform(df, :a = :b^2)
@rselect(df, :a = :b^2)
@rwhere(df, :a > 0)

I hear the support for new macro names and think we should consider it more in the future, but I'm hesitant to double the size of the API at the moment.

For now, I would really rather the focus be on 1.0 being "Support the complete DataFrames.transform syntax" before adding lots of new macro names.

For that reason, I would prefer shoving more magic into @transform than add new macros. Let me make another proposal. We could have @rows be either at the argument level or at the start of a block. I am assuming that people are on board with syntax of the form

@transform df begin 
    y = f(:x)
    z = g(:x)
end

(This solves the more general macro problem, i.e. @.)

We could also allow

@transform df @rows begin 
    y = f(:x)
    z = g(:x)
end

which would be a synonym for

@transform df begin 
    y = @rows f(:x)
    z = @rows g(:x)
end

Though I am still partial to .= as well. But I see Matthieu's point that it is too magic, especially since I am complaining about rows(:y = f(:x)).

How do people feel about this proposal?

@bkamins
Member

bkamins commented Apr 20, 2021

@pdeffebach - could you summarize your current thinking of the best design in a single post (so that I am clear what is proposed, as in the above discussion - which is very valuable - there are mixed proposals). In particular it would be great to see the syntax for:
a) one-line, b) multi line
operations

@pdeffebach
Collaborator Author

Here is a more comprehensive and clear proposal

To clarify, here's the proposal as it stands now.

  • Single transform, same as before
@transform df y = f(:x)
  • Multiple columns: Can do with commas or a big block
@transform df begin 
	y = f(:x)
	z = g(:x)
end

@transform(df,
	y = f(:x),
	z = g(:x)
)

Benefits of the big block: macros work easier. Say you want to use @somemacro and multiple arguments. You need

@transform df begin 
	y = @somemacro f(:x)
	z = g(:x)
end

@transform(df,
	y = @somemacro(f(:x)),
	z = g(:x)
)

The Stata user in me strongly prefers the :block syntax because it doesn't need commas or parentheses.

  • Row-wise transforms. We can either do @byrow at the argument-level or the block level.
@transform df begin 
	y = @byrow h(:x)
	z = g(:x)
end

@transform(df, 
	y = @byrow(h(:x)),
	z = g(:x)
)

when done at the block level, all transformations get applied by row

@transform df @byrow begin 
	y = h(:x)
	z = j(:x)
end

this also works when there is a single transform. Which leads to the slightly redundant syntax

@transform df @byrow y = h(:x)
@transform df y = @byrow h(:x)

Maybe we can disallow @byrow as the first argument when it's not a :block. This is an edge case that imo isn't too important.
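
As a rough illustration of how a block-level flag could be recognized, here is a sketch of the expression handling only. It is not the implementation from this PR or its successor: the helper name `split_byrow_flag` is hypothetical, and it only inspects the parsed expression without touching any data frame.

```julia
# Hypothetical helper: split the expression that follows `df` into a
# row-wise flag and a list of individual transformations.
function split_byrow_flag(ex)
    byrow = false
    # A leading `@byrow` applies to everything that follows it.
    if ex isa Expr && ex.head == :macrocall &&
       ex.args[1] == Symbol("@byrow") && length(ex.args) == 3
        byrow = true
        ex = ex.args[3]
    end
    # A `begin ... end` block contributes one argument per statement.
    if ex isa Expr && ex.head == :block
        args = filter(x -> !(x isa LineNumberNode), ex.args)
    else
        args = Any[ex]
    end
    return byrow, args
end

flag, args = split_byrow_flag(Meta.parse("""
    @byrow begin
        y = h(:x)
        z = j(:x)
    end"""))
@assert flag && length(args) == 2

# An argument-level `@byrow` is not a block-level flag; it is left in
# place for per-argument processing.
flag, args = split_byrow_flag(Meta.parse("y = @byrow h(:x)"))
@assert !flag && length(args) == 1
```

Under this sketch the redundant single-transform form `@byrow y = h(:x)` would also set the flag; disallowing it, as suggested above, would need an extra check that the flagged expression is a block.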

@bkamins
Member

bkamins commented Apr 20, 2021

OK and I understand that then also:

@transform(df, y = @byrow(f(:x)), z = @byrow(g(:x)))

If yes then I am OK with this proposal.

@pdeffebach
Collaborator Author

I know the above proposal isn't what everyone wanted originally, but I think it's a nice compromise.

If people agree on at least the :block part, I will close this PR and start on that.

@bkamins
Member

bkamins commented Apr 21, 2021

Having pressed 👍 I am OK with your proposal (from my experience with DataFrames.jl I think you should lead the development here - collect the feedback and judge it, but in the end the recommendation of the decision should be on your side 😄)

@EconometricsBySimulation
Contributor

I have a feeling that for the 95% of cases in which transformations are going to be on a row-wise basis people will end up writing the macro rtransform(df, args...); :(@transform $df @byrow $(args...) ) end or something similar. But that is easy enough to do if you have the @transform df @byrow syntax already working.

@pdeffebach
Collaborator Author

Remember that @eachrow! with better new column creation is also on the to-do list. So there will be other ways of working row-wise.

@pdeffebach
Collaborator Author

pdeffebach commented Apr 21, 2021

And if it looks like a lot of people are writing rtransform after this change, we can definitely add it post 1.0. But I feel like pre-1.0 we should avoid adding new functions like that.

@matthieugomez

matthieugomez commented Apr 22, 2021

My main concern is that adding multiple macros within the same call is confusing. Especially because @byrow is, by itself, meaningless. Is there any other example in Julia of something like this (i.e. a macro that only works within a pre-specified set of macros)?

So, to re-iterate, my view is that it'd be cleaner to have row-wise versions of @transform, @select, etc. There are multiple ways of doing so:

  • defining macros with r as a prefix, i.e. @rselect, @rtransform
  • defining macros with . as a suffix, i.e. @transform., @select.. This would require changing Julia's parser to transform @transform. into @transform__dot__, similarly to the way @. is transformed into @__dot__, a one line modification
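
The existing `@.` to `@__dot__` rewrite mentioned in the last bullet can be observed in the parsed expression. A minimal check using only Base Julia:

```julia
# The parser stores the macro name `@.` as `@__dot__`.
ex = Meta.parse("@. x + y")
@assert ex.head == :macrocall
@assert ex.args[1] == Symbol("@__dot__")
```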

Or maybe there are even better syntaxes!

Of course, that's just my opinion, so feel free to ignore me — it's not like I can't do my own package if I want something different.

@jkrumbiegel
Contributor

jkrumbiegel commented Apr 22, 2021

This is quite the bikeshedding issue, it seems there are many different personal preferences. I have to think more about it. I feel like always having to write @cols or @rows would annoy me quickly. I go back and forth between liking the non-symbol syntax and thinking it's too much trouble to be worth it. I think transform and select should be row-based by default one day, and to stick with the default col-based from DataFrames the other. Currently I'm tending towards making a block-based macro, because I usually do not use these functions on their own, and I could save all the @s this way.

I'm still trying to assess what the most common workflow is, so that we can shave the most redundancy off of that. Maybe we'd need some quantitative analysis on people's data wrangling code..

Now I'm also starting to like the version without parentheses that @pdeffebach is advocating. And this actually can't be done without the @ in transform, because it doesn't parse otherwise.

@matthieugomez

matthieugomez commented Apr 22, 2021

> I think transform and select should be row-based by default one day, and to stick with the default col-based from DataFrames the other.

Even if we were to change the default, we'd still need a syntax to change them from row-based to column-based when needed (for what it's worth, I'd argue @combine is the only one that needs to be col-based by default)

@pdeffebach
Collaborator Author

superseded by #250

@pdeffebach pdeffebach closed this May 28, 2021
5 participants