-
Notifications
You must be signed in to change notification settings - Fork 55
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
WIP Add @byrow #239
WIP Add @byrow #239
Conversation
cc @jkrumbiegel if you have any ideas for how to make this work. |
Here is a possible fix. It basically unnests the julia> t = :(@byrow :a, :b)
:(#= REPL[127]:1 =# @byrow (:a, :b))
julia> fix_byrows(t)
2-element Vector{Any}:
:(#= REPL[118]:3 =# @byrow :a)
:(:b) function fix_byrows(ex, v = Any[])
if ex isa Expr && ex.head == :macrocall && ex.args[1] == Symbol("@byrow")
push!(v, :(@byrow $(ex.args[3].args[1])))
fix_byrows(ex.args[3].args[2], v)
else
push!(v, ex)
end
return v
end
function orderby_helper(x, args...)
args = mapreduce(fix_byrows, vcat, args)
t = (fun_to_vec(arg; nolhs = true, gensym_names = true) for arg in args)
quote
$DataFramesMeta.orderby($x, $(t...))
end
end This might be a step in the right direction. EDIT: But this won't work with |
Does |
If I have to do I think the contortions to make I thought we would switch to I guess a drawback is that you can't use |
An update:
Okay yes, I'm on board with this now. I imagine that like 99% of the time users will want
Yeah, so I guess we've "solved" the problem when it comes to making new variables, but we still have to worry about I just realized that this is a problem with any macro julia> @transform(df, x = @. :a + :b, y = :b)
ERROR: UndefVarError: y not defined
Stacktrace: or any function for that matter julia> function foo(a, b)
1
end
foo (generic function with 1 method)
julia> x = [1, 2]; y = [3, 4];
julia> foo(@. x * y, y)
ERROR: MethodError: no method matching foo(::Tuple{Vector{Int64}, Vector{Int64}})
Closest candidates are:
foo(::Any, ::Any) at REPL[152]:1 so users will still have to wrap parentheses with macro calls anyways.. I think I would prefer Also, this issue only comes up when there is more than one argument. So the user can still avoid parentheses when using one argument, which is probably the most frequent use-case. |
So rules as of this PR are:
|
Here is another option that I like a lot
It works like this
EDIT: I think this is the way forward. I feel like switching all documentation to this format for expressions is safest. People can use macros as they please without any unintuitive behavior. cc @EconometricsBySimulation on this. |
@pdeffebach - I have a lot of other PRs to check in DataFrames.jl now so I cannot closely follow all discussions here. Can you please ping me with this PR when some decision is reached. Thank you! |
Hi all, thanks for all your hard work on this. I am not sure if I understand the motivation for @byrow( ) exactly. If it so that a user can do something with respect to reach row without having to broadcast each function/operator? Trying to make the syntax a little less complex? If that were the case wouldn't a keyword or row specific macro be easier for users to length/remember?
Could be:
Or
But since @orderby only works on rows anyways pretty much all syntax send redundant and unnecessary. I guess the only time you would need it not to be not row wise would be in the unusual event in which you want to so some kind of rescaling based on a function applied to the entire entire column followed by a non-monotonic transformation: such as Coming up with an entirely new macro stand to handle what I would think to be the most common use seems unnecessary verbose when the following would work just fine even if a row wise orderby is the default:
I guess I need to be thoughtful about the groupeddataframes but that can also be accomplished by first In general though I don't seem myself writing
When
Would work just fine. Sorry if my comments aren't useful. I think there is more of a case for @Transform but I think |
Wrote this on my phone so tons of errors. Sorry about that. Hopefully it is readable. |
Thanks, @EconometricsBySimulation , for your feedback. I think the main motivation for DataFramesMeta is, at it's core, a way of creating
It's more than that, consider
This will currently fail because the On this PR, we have
which would work because it creates the expression
So yeah, in that example,
We are restricted to making things column-wise by default because that's what There is definitely a tradeoff between flexibility and consistency. If we have So making row-wise the default would be a bit awkward because we would need to explain to new users.
Yes, this is possible, but at the same time there doesn't exist a So that's the background for why I think EDIT: Here is a similar discussion about transformation by groups vs. transformations by rows. |
Thanks for the kick ass and thorough response @pdeffebach! Totally on board now. Being able to use standard logical operators |
I personally think that multiplying macros lead to confusions. I also think that I would prefer something like this @orderby(df, rows(:a + 1))
@transform(df, rows(:y = :x^2)) For multiple arguments, either or both @orderby(df, rows(:a + 1, :b + 1))
@orderby(df, rows(:a + 1), rows(:b + 1)) |
I like the
Though I think it would be less jargony to do
and you could imagine a less frequently used symmetric set of macros for use with DataFrames that look more like single type matrices:
|
I could imagine doing
Though something like the following would be even more readable
Or
Seems feasible. |
Yes, defining new macros would be a great solution too (clear and not punny). @rtransform(df, :a = :b^2)
@rselect(df, :a = :b^2)
@rwhere(df, :a > 0) |
I think rtransform or similar is also ok, but you lose the ability to switch between byrow and not when doing multiple transformations in one call. |
I am against this because it's not clear that
I hear the support for new macro names and think we should consider it more in the future, but I'm hesitant to double the size of the API at the moment. For now, I would really rather the focus be on 1.0 being "Support the complete For that reason, I would prefer shoving more magic into
(This solves the more general macro problem, i.e. We could also allow
which would be a synonym for
Though I am still partial to How do people feel about this proposal? |
@pdeffebach - could you summarize your current thinking of the best design in a single post (so that I am clear what is proposed, as in the above discussion - which is very valuable - there are mixed proposals). In particular it would be great to see the syntax for: |
Here is a more comprehensive and clear proposal To clarify, here's a proposal as is stands now.
@transform df y = f(:x)
@tranform df begin
y = f(:x)
z = f(:g)
end
@transform(df,
y = f(:x),
z =g(:x)
) Benefits of the big block: macros work easier. Say you want to use @transform df begin
y = @somemacro f(:x)
z = g(:x)
end
@transform(df,
y = @somemacro(f(:x)),
z =g(:x)
)
The Stata use in me strongly prefers the
@transform df begin
y = @byrow h(:x)
z = g(:x)
end
@transform(df,
y = @byrow(h(:x)),
z = g(:x)
) when done at the block level, all transformations get applied by row @transform df @byrow begin
y = h(:x)
z = j(:x)
end this also works when there is a single transform. Which leads to the slightly redundant syntax @transform df @byrow y = h(:x)
@transform df y = @byrow h(:x) Maybe we can disallow |
OK and I understand that then also:
If yes then I am OK with this proposal. |
I know the above proposal isn't what everyone wanted originally, but I think it's a nice compromise. If people agree on at least the |
Having pressed 👍 I am OK with your proposal (from my experience with DataFrames.jl I think you should lead the development here - collect the feedback and judge it, but in the end the recommendation of the decision should be on your side 😄) |
I have a feeling that for the 95% of cases in which transformations are going to be on a row-wise basis people will end up writing the |
Remember that |
And if it looks like a lot of people are writing |
My main concern is that adding multiple macros within the same call is confusing. Especially because So, to re-iterate, my view is that it'd be cleaner to have row-wise versions of
Or maybe there are even better syntaxes! Of course, that's just my opinion, so feel free to ignore me — it's not like I can't do my own package if I want something different. |
This is quite the bikeshedding issue, it seems there are many different personal preferences. I have to think more about it. I feel like always having to write I'm still trying to assess what the most common workflow is, so that we can shave the most redundancy off of that. Maybe we'd need some quantitative analysis on people's data wrangling code.. Now I'm also starting to like the version without parentheses that @pdeffebach is advocating. And this actually can't be done without the |
Even if we were to change the default, we'd still need a syntax to change them from row-based to column-based when needed (for what it's worth, I'd argue |
superceded by #250 |
I thought this would be a relatively trivial PR, but it looks like things are a bit complicated.
What I want:
But that doesn't work. it looks like we can't use macro syntax as flags.
That is, the first
@byrow
eats up all the later expressions in the block. Of course, we could requirebut that's a lot of parentheses and I already feel bad enough having people type
@byrow
in the first place.I'm not sure what other packages have done with regards to these "flags". I'll take a look at the ecosystem and see if there are any more elegant (non-lispy) solutions.