
Row-wise vs. whole vector functions #1952

Closed
nalimilan opened this issue Sep 9, 2019 · 42 comments

@nalimilan
Member

nalimilan commented Sep 9, 2019

This is a continuation of a discussion started by @piever at #1256 (comment), about whether functions should operate row-wise or take and return full vectors.

As @piever noted, JuliaDB has taken the row-wise approach since that allows distributing operations over multiple cores. This is a good idea in general even for DataFrames, where we could use multiple threads. groupreduce is also a good example of this: I indeed added code to detect common reductions in by/combine to transform such operations into what is essentially a groupreduce operation (which could be added to the API at some point). I think it is useful to allow both, as people are used to thinking in terms of sum rather than "reduction using +" (and indeed Base provides sum(x) in addition to reduce(+, x)).
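For reference, the equivalence mentioned above in plain Base Julia - the named reduction is just the more familiar spelling:

```julia
x = [1, 2, 3, 4]
sum(x) == reduce(+, x)  # true: both spell the same reduction
```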

So the general question is, when should a DataFrames function be row-wise and when should it take a full vector? Unfortunately, I think there are advantages to both. Operating row-wise is simpler to write (no dots), distributable and possibly more efficient (no intermediate allocations). But operating over whole vectors allows doing things like normalize(x), x .- mean(x) or diff(x), which are quite common (either on the whole data frame, or by group); another operation which is sometimes useful is to create a new variable containing for each row the mean of the group it corresponds to (in which case recycling is needed, just like dplyr's mutate does). We discussed similar issues previously in the context of JuliaDBMeta macros at JuliaData/JuliaDBMeta.jl#29. In theory, these things can be performed as a row-wise operation after computing the needed summary statistic, but that's not very user-friendly unless we can find a very simple macro syntax which also works for window functions like lag (let's discuss that at JuliaData/JuliaDBMeta.jl#29).
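A minimal sketch of the whole-vector operations in question, on made-up data (the group-mean recycling is spelled out by hand here rather than through any DataFrames API, to show what "recycling" means):

```julia
using Statistics

x = [1.0, 2.0, 6.0]
x .- mean(x)   # centering needs the whole vector: [-2.0, -1.0, 3.0]
diff(x)        # window-style operation, length n-1: [1.0, 4.0]

# dplyr-mutate-style recycling: each row gets the mean of its group
g = ["a", "b", "a"]
groupmeans = Dict(k => mean(x[g .== k]) for k in unique(g))
[groupmeans[k] for k in g]   # [3.5, 2.0, 3.5]
```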

The simplest solution would be to have both kinds of functions, provided we can find a clear rule to distinguish them. For example, we already have row-wise filter, unique and sort, so it would be somewhat consistent to also have a row-wise map (or map(f, eachrow(df)) if we're unsure) as an equivalent to JuliaDB's select.
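The `map(f, eachrow(df))` variant already works in the current DataFrames.jl API, since `eachrow` yields one `DataFrameRow` per row:

```julia
using DataFrames

df = DataFrame(x = 1:3, y = 4:6)
# row-wise map: f receives a DataFrameRow, result is a plain vector
map(r -> r.x + r.y, eachrow(df))   # [5, 7, 9]
```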

Then we could provide separate functions using a dplyr-like terminology, which operate on whole vectors and are somewhat more user-friendly. But select belongs to the latter family, so by that rule it would have to operate over whole vectors. One solution to that would be to only allow selecting (not transforming) columns, as in dplyr. Then we could also introduce mutate (I prefer the name transform, but JuliaDB already uses it for row-wise operation...) to create columns by passing a function that operates on whole vectors. We would still need an equivalent of select or dplyr's transmute to operate on whole vectors. mutate(df, ..., keep=true) could work, but it should probably recycle scalars so that the result has the same number of rows as the input: that wouldn't allow replacing aggregate (#1256).

@piever

piever commented Sep 9, 2019

it would be somewhat consistent to also have a row-wise map (or map(f, eachrow(df)) if we're unsure) as an equivalent to JuliaDB's select.

Just a small comment, as this overlap was pointed out in #1256 already. I think one has to be a bit careful with performance here, in that select(t, :a => :b => log) is much more performant than map(row -> (; a = log(row.b)), t), as one does not need to iterate over the whole row. But maybe for DataFrames, map can allow both syntaxes (just like by went through a similar optimization).

@pdeffebach
Contributor

Figuring out which functions are special-cased as row-wise and which ones take the whole column is a continual pain point for me with dplyr. I would prefer all functions be applied to the whole column and require the use of anonymous functions for element-wise operations.

@nalimilan
Member Author

nalimilan commented Sep 27, 2019

Which functions are applied by row in dplyr? I thought they were all vectorized.

I would prefer all functions be applied to the whole column and require the use of anonymous functions for element-wise operations.

As I noted above, the problem is that some operations could be made faster if we know they are applied independently to each row: operations can be parallelized, temporary vectors can be avoided, etc.

@pdeffebach
Contributor

It was the string packages that frustrated me. stringr functions were broadcasted but paste wasn't. I ended up writing wrapper functions that took in vectors.

Another solution might be a RowWise struct that tells DataFrames to compute things row-wise.

@nalimilan
Member Author

It was the string packages that frustrated me. stringr functions were broadcasted but paste wasn't. I ended up writing wrapper functions that took in vectors.

Wait, paste is broadcasted:

> paste(1:2, 3:4)
[1] "1 3" "2 4"

Another solution might be a RowWise struct that tells DataFrames to compute things row-wise.

Yeah, I was thinking something like that could do the trick. Though it's kind of backwards...

@pdeffebach
Contributor

It might have been paste0 or some variant.

I see the point. I'm surprised there is no way for the compiler to know whether an anonymous function is purely broadcast, but that seems to be the case, so this is a tough problem.

@nalimilan nalimilan added this to the 1.0 milestone Dec 1, 2019
@bkamins
Member

bkamins commented Dec 11, 2019

Adding #2048 as a part of the decision here I guess (I am not sure what is left to decide in the main thread, but this issue seems to fall into the category of this thread 😄).

@bkamins
Member

bkamins commented Dec 15, 2019

@nalimilan regarding:

But operating over whole vectors allows doing things like normalize(x), x .- mean(x) or diff(x), which are quite common

This comment of yours suggests it would be good if by allowed an option to keep the order of rows unchanged (apart from the "sort" and "undefined" options we have now). This is not super crucial, as one can always join the result after subtracting, but being able to just hcat instead of having to join would be nice.

EDIT - I have thought this over. It is not that crucial in the end, in my opinion; join should be good enough if needed.

@bkamins
Member

bkamins commented Dec 15, 2019

Here is a summary of what I have on this issue.

Currently present:

  • map(fun, ::GroupedDataFrame) -> groupwise, produces GroupedDataFrame
  • filter - rowwise on AbstractDataFrame
  • sort/sort! - rowwise on AbstractDataFrame
  • combine -> groupwise on GroupedDataFrame, produces DataFrame
  • aggregate -> groupwise on GroupedDataFrame, produces DataFrame (we should deprecate it and replace it with combine - something @nalimilan is now working towards)
  • mapcols -> whole column on AbstractDataFrame

To be added:

  • select -> whole column on AbstractDataFrame
  • map(fun, ::AbstractDataFrame) -> rowwise, produces DataFrame
  • shuffle/shuffle!/sample (proposals only for now) - rowwise on AbstractDataFrame if we decide to add them (sample is less probable as it is not in Base)

So I would say what we have now is consistent. The TODO is more or less:

  • remove aggregate (@nalimilan is thinking about this AFAICT)
  • add map for AbstractDataFrame (this is not super needed now, but I can add it if we agree we like it, it is OK to be post 1.0 functionality)
  • make select allow whole column transformations (I will implement it)
  • shuffle/shuffle!/sample - are potential additions that are non-problematic and can be done post 1.0

If there are no negative comments towards that I will go forward with this plan.

@bkamins
Member

bkamins commented Dec 15, 2019

@nalimilan - I have started documenting & implementing the target select. I initially wrote above that select should operate on whole columns. But as I started writing tests for this, the functionality turned out to be really inconvenient. Essentially it seems to me that select makes more sense providing row-wise operations (I do not see a strong use case for select with whole-column operations).

@nalimilan
Member Author

Can you elaborate? I think we probably need both. The difficulty is in what form we can provide them...

@bkamins
Member

bkamins commented Dec 15, 2019

I was writing in parallel in #2053 (comment) what the rationale is.

My question to @nalimilan and @pdeffebach - can you give me examples of typical use cases where "whole column" select is useful? The issue is that in select we expect that we do not reduce a column to a single value but rather produce a column that has as many elements as there are rows in the source data frame.

@bkamins
Member

bkamins commented Dec 15, 2019

Also if we want "whole column" operations on a data frame many typical cases (like standardization of many numeric columns) can be achieved by mapcols.
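For example, standardizing all columns in one shot with the existing mapcols (a sketch on made-up all-numeric data; with mixed column types the function would need a type check):

```julia
using DataFrames, Statistics

df = DataFrame(a = [1.0, 2.0, 3.0], b = [10.0, 20.0, 30.0])
# apply the same whole-column function to every column
mapcols(x -> (x .- mean(x)) ./ std(x), df)   # each column becomes [-1.0, 0.0, 1.0]
```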

@nalimilan
Member Author

nalimilan commented Dec 15, 2019

Can you copy your comment here instead? That sounds more appropriate and it will avoid splitting discussions.

The most common cases I can think of are normalize(x), x .- mean(x) or lag(x). These are actually very specific operations, so it could make sense for select to be row-wise and not support them. But I think we need a convenient way to apply them. Maybe just with the same :oldcol => f => :newcol syntax with mapcols/mapcols!? Not sure.

EDIT: another interesting use case is when you want to modify or reorder levels of a categorical array. That's quite natural in dplyr with mutate as it's vectorized.

@pdeffebach
Contributor

Yes, I agree with Milan. normalize and lag are important and commonly used operations. mapcols is not a replacement because it affects every column in a DataFrame. Imagine you have a string ID variable but want to normalize people's incomes. Tbh it's hard to imagine mapcols being very useful for data cleaning with a survey dataset.

Another point to make is that it's very easy to simulate row-wise operations with vector notation - just add a . to get:

select(df, (:x, :y) => (x, y) -> (x .+ y) => :z)

On the other hand, if we impose row-wise operations it's very hard to go the other direction. Because of this asymmetry I support column-based operations.

@bkamins
Member

bkamins commented Dec 15, 2019

I split the issue off when I thought we had a simple case. So let me first add here the comment I put there.

Here is the comment from #2053 issue:

However, later I came to the conclusion that for select working on AbstractDataFrame it is almost useless to have it operate on whole columns (I could not think of any really strong use case). So if we decide to make select work row-wise, then map for AbstractDataFrame is not needed and we can just drop this issue (as it would do essentially the same thing select does).

The issue is that for GroupedDataFrame a whole-column-within-each-subgroup approach is natural both for combine and for map, as it is a collection of data frames. The only difference is the return value - DataFrame for combine and GroupedDataFrame for map.

For AbstractDataFrame we do not have this dual need - it is natural to want a DataFrame as a result of select or map on it. On the other hand we treat a data frame as a collection of rows, so this is what I think we should stick to.

Just to sum up: e.g. the use case of standardizing a column of a data frame is not so appealing, as it is very easy to achieve via setproperty! or setindex!, so select does not have to cover it - or it can cover it like:

select(df, :oldcol => (x -> (x - mean(df.oldcol)) / std(df.oldcol)) => :newcol)

This is a bit verbose but not super bad, as opposed to e.g. having to write:

select(df, :oldcol => (x -> abs.(x)) => :newcol)

when you want to take the abs of a column, while in the row-wise approach it is a clean:

select(df, :oldcol => abs => :newcol)

(I think row-wise operations on data frame - like in JuliaDB.jl will be needed more often by users than whole column operations)

Some more thoughts

Actually, we already have a whole-column operation available: it is by. When you write:

by(df, [], put_your_transformations_here)

you get exactly what we are talking about. The only thing needed is to extend the syntax of by to match select, but we discussed that this is needed anyway.

I am not sure this is the right direction of thought, but at least it is possible.

@nalimilan
Member Author

mapcols is not a replacement because it affects every column in a DataFrame.

@pdeffebach Sorry if that wasn't clear: what I was suggesting is mapcols!(:oldcol => normalize => :newcol, df) to operate on a single column.

Another point to make is that it's very easy to simulate row-wise operations with vector notation - just add a . to get:

select(df, (:x, :y) => (x, y) -> (x .+ y) => :z)

On the other hand, if we impose row-wise operations it's very hard to go the other direction. Because of this asymmetry I support column-based operations.

@pdeffebach Yes, however it's clearly quite verbose (I mean, if you showed that example to any R user they would laugh at us, as x and y are repeated three times; though DataFramesMeta could help I guess), and it doesn't allow applying nice optimizations like multithreading (or multicore processing for JuliaDB).

Actually, we already have a whole-column operation available: it is by. When you write:

by(df, [], put_your_transformations_here)

you get exactly what we are talking about. The only thing needed is to extend the syntax of by to match select, but we discussed that this is needed anyway.

I am not sure this is the right direction of thought, but at least it is possible.

@bkamins Right. Since that behavior is due to combine(gd, put_your_transformations_here) which in turn is a more efficient shorthand for combine(map(put_your_transformations_here, gd)), I think this raises the question of what map should do again. We need a function which would apply vectorized operations to both columns of GroupedDataFrame and AbstractDataFrame. If that's not map (as we want data frames to be collections of rows), it could be mapcols or transform/mutate.
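For reference, the combine shorthand mentioned here, in the DataFrames.jl API as it exists (the output column name follows the default source_function convention):

```julia
using DataFrames, Statistics

df = DataFrame(g = [1, 1, 2], x = [1.0, 3.0, 5.0])
gd = groupby(df, :g)
# whole-column operation applied within each group, one row per group
combine(gd, :x => mean)   # columns :g and :x_mean
```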

@piever

piever commented Dec 16, 2019

In terms of consistency across packages, what @bkamins is proposing is exactly the JuliaDB approach, where select is row-wise, ungrouped column-wise operations can be achieved by simply extracting the relevant columns (no need for special API) or as a simple case of JuliaDB.groupby (i.e. DataFrames.by) where the set of grouping variables is empty.

I agree that it'd be interesting to figure out a good name for a colwise transformation, to complete the analogy reduce, groupreduce versus mapcols, groupmapcols (i.e. by).

@bkamins
Member

bkamins commented Dec 16, 2019

Well, as commented in #2053, we do not really need a row-wise map for AbstractDataFrame if we have it in select, so we are free to define map in whatever way we like if we really need to.

Just to explore the possibilities another idea that came to my mind was to allow select to work either by rows or by whole columns. Two possible ideas are:

  • add a kwarg that would indicate if we want it to work colwise or rowwise (this was my first thought)
  • add a function wrapper that would signal select that the function should be applied for whole columns. My initial idea was to name it Col. (and I like it better as it allows mixing row-wise and whole col operations)

Then you could write something like:

select(df, :, :x => Col(normalize), (:a, :b, :c) => + => :v)

What is the rationale behind it? I expect that apart from the decision of row-wise vs whole-column operations, people will also want column selection and renaming functionality in both options, so having two separate functions would duplicate that functionality (plus, with the Col proposal - of course a different name might be proposed - you can mix row-wise and whole-column operations in one call).

PS. we use normalize example which is an unfortunate name as it is taken by Unicode :). Probably we should write standardize.

@piever

piever commented Dec 16, 2019

add a function wrapper that would signal select that the function should be applied for whole columns. My initial idea was to name it Col. (and I like it better as it allows mixing row-wise and whole col operations)

This seems to be much better than a keyword argument. In particular it is pretty much in line with special selectors, like Not, All, etc... Here the idea that select(t, ::Pair{Selection, <:Col}) applies the wrapped function column-wise to the selection is very similar to rule 3 linked above.

@bkamins
Member

bkamins commented Dec 16, 2019

Exactly - and that is why I thought that it might be also then integrated into JuliaDB.jl so that we have consistency (it will be harder as JuliaDB.jl is distributed, but maybe there would be some efficient way to do it - at least in some cases).

@pdeffebach
Contributor

Regarding the verbosity of

select(df, (:x, :y) => (x, y) -> (x .+ y) => :z)

Row-wise operations won't change that verbosity very much: any time you have two columns you will need an anonymous function, as will any time you want to pass a keyword argument to a scalar-valued function.

Ideally currying would solve all these problems, enabling us to write

select(df, (:x, :y) => _1 .+ _2)

or something.

I think Col is a good idea, though, and I would like to see how it would work.

I still think there would be major inconsistencies in the API if by took in vectors but not select.

@bkamins
Member

bkamins commented Dec 16, 2019

any time you have two columns you will need an anonymous function

I think you will not in most common cases, e.g. in my example above you will be able to write select(df, (:x, :y) => + => :z) without needing an anonymous function, as opposed to select(df, (:x, :y) => (x, y) -> (x .+ y) => :z). The same goes for all binary operators, which are the most common cases (that is why they are defined this way I guess 😄).

Also note that select(df, (:x, :y) => (x, y) -> (x .+ y) => :z) is actually invalid; it should be select(df, (:x, :y) => ((x, y) -> (x .+ y)) => :z), which is tricky, and that is why I would prefer to keep such forms to a minimum. Finally, this problem would be nonexistent with Col, as you would write select(df, Col((:x, :y) => (x, y) -> (x .+ y)) => :z) and the right precedence would automatically be enforced by wrapping it in Col.
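The precedence pitfall can be checked directly with Meta.parse: -> binds more loosely than =>, so the trailing pair is swallowed into the anonymous function's body unless the function is parenthesized explicitly:

```julia
# without parentheses, the `=> :z` pair lands inside the function body:
Meta.parse(":x => x -> x => :z")    # :(:x => (x -> x => :z))
# parenthesizing the function restores the intended three-part form
# (note => is right-associative, so this nests as :x => (fun => :z)):
Meta.parse(":x => (x -> x) => :z")
```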

Ideally currying would solve all these problems, enabling us to write

I would like to keep DataFrames.jl functionality plain and simple using only Base. All magic should go to DataFramesMeta.jl, which I plan to work on when we release DataFrames.jl 1.0 when we have a stabilized API here.

I still think there would be major inconsistencies in the API if by took in vectors but not select.

by takes vectors because GroupedDataFrame is a collection of SubDataFrames, while select takes rows because AbstractDataFrame is a collection of rows (this is the interpretation @nalimilan fixed some time ago). So this would not be inconsistent - at least that is what I think.


But as always - let us discuss. I think this is the last major decision before 1.0 so we should have a clear roadmap for it.

@bkamins
Member

bkamins commented Dec 16, 2019

Just to sum up my current thoughts that were scattered around several issues (sorry for that, but I am in creative mode, hopefully will switch to implementing soon, so there will be less noise).

My thinking is that we need two functions select and by:

  • select works row-wise by default, but you can opt in to whole-column processing with the Col wrapper; additionally, select can be equipped with two keyword arguments: group (which will perform operations in groups) and where (which will filter rows before processing); select supports column selection, renaming and transforming; it assumes (and checks) that the result has as many rows, in the same order, as the source (with the additional detail that where could select only some rows);
  • by works on whole columns; we should switch its positional argument for grouping columns to a keyword argument group (and deprecate the form using a positional argument); by default group has the value [], which means we do whole-column operations; by in general supports only whole-column transformations, and for 1.0 it does not support column selection (except for col => identity and cols .=> identity, which essentially do this) or renaming (except for oldcol => identity => newcol); this restriction comes from the fact that we want to deprecate the second positional argument in favor of the group keyword argument, and we would have an ambiguity here otherwise; it does not make any assumptions about how many rows are produced per group; the order of rows is determined as it is currently (the sort option); optionally we could also provide a where keyword argument like in select; for 1.0 I would deprecate the newcol = :oldcol => fun syntax (as it should be replaced by :oldcol => fun => :newcol - to keep consistency, and in this way keyword arguments are reserved for passing options to by only). We still support the single-function form of by, which passes a SubDataFrame to the function (as this is occasionally convenient). We also add a keepgroup (or something like this) keyword argument indicating whether we want to keep grouping columns in the result or not.

In this way we will have two functions:

  • select, which is a combo with optional grouping and filtering that works row-wise by default (with the Col option for whole-column processing as opt-in), and which allows column selection, renaming and transforming.
  • by, which is a combo for whole-column processing, and which does not explicitly support selection and renaming for now (these are less useful in whole-column processing, which possibly changes the number of rows; it could potentially support them in the future)

@nalimilan
Member Author

Interesting. Though using by for this sounds weird to me, as it really implies grouping to me. Also it wouldn't play well with the existence of groupby and combine, would it? I'd rather introduce another function.

Adding keyword arguments to support filtering and grouping is also interesting, but essentially orthogonal AFAICT: this can be achieved in several steps anyway using filter/getindex/view and groupby.

@bkamins
Member

bkamins commented Dec 16, 2019

Here are my thoughts:

Though using by for this sounds weird to me

I agree it is weird.
The problem is that if we introduce a function like transform (a tentative name) that works on whole columns it will be essentially by anyway, but just under a new name (of course we could artificially limit the functionality of by to make it different, but what would be the benefit of it?).

Also it wouldn't play well with the existence of groupby and combine

The fact that by is now implemented as a wrapper for combine called on the groupby result is an implementation detail, I think. In general it would not have to be so (it could even improve performance in some cases to avoid having these two separate steps).
groupby + combine/map can exist independently as there are use cases, when you actually need a GroupedDataFrame object (e.g. splitting data into train/validation/test subsets).

Adding keyword arguments to support filtering and grouping is also interesting, but essentially orthogonal

I think it is not orthogonal because of two reasons:

  • design: the key point is that we do not have to implement all this now, but if we have this design goal in the long term it influences the current decisions (namely: never use kwargs for transformations, because they are reserved for other usage, all transformations should be positional arguments)
  • performance: possibly having this in one function can allow optimizing (i.e. if we know the query is composed of selecting, filtering, grouping, aggregating etc. we can optimize it - like databases do)

Of course all this makes sense under "powerful select" approach. If we want select to only select then we can have two functions for adding columns to a data frame - one for row-wise transforms and one for whole column transforms, and by would just focus on what it does currently.

@nalimilan
Member Author

The problem is that if we introduce a function like transform (a tentative name) that works on whole columns it will be essentially by anyway, but just under a new name (of course we could artificially limit the functionality of by to make it different, but what would be the benefit of it?).

I'd say calling this general column-wise function by confuses the discussion (because of its name and because of the already existing function with that name). If it was called transform or a generic name like that, it would be more obvious that you could apply it either to a data frame without grouping, to a data frame with a group keyword argument (if we want to add that), or to an already computed GroupedDataFrame.

The fact that by is now implemented as a wrapper for combine called on the groupby result is an implementation detail, I think. In general it would not have to be so (it could even improve performance in some cases to avoid having these two separate steps).

I don't think doing groupby and combine in a single step can improve performance. With #2047 we only compute group indices, which are needed anyway.

Maybe that's the case for other operations, but AFAIK for data frames the most efficient way to do a filtering, grouping and combining operation is really to do these three steps in sequence (as long as you take care of creating a view to filter). I'm not very familiar with query planners, but my understanding is that major optimizations can be obtained when joins are involved (which keyword arguments wouldn't support as we're talking about single-table operations). I guess Jacob and David know better.

But yes, you're right that if we want to support keyword arguments in 1.x we have to prevent using them to specify column names.

Of course all this makes sense under "powerful select" approach. If we want select to only select then we can have two functions for adding columns to a data frame - one for row-wise transforms and one for whole column transforms, and by would just focus on what it does currently.

I'm not sure that's really a requirement. We could have a powerful select and a powerful transform, the only difference being vectorization.

@bkamins
Member

bkamins commented Dec 17, 2019

I'm not sure that's really a requirement. We could have a powerful select and a powerful transform, the only difference being vectorization.

OK - we can go this way I think. So the question is do we like the names select (row-wise) and transform (whole column)?

Another dimension in e.g. dplyr is that it distinguishes whether old columns are kept or not. I think we do not strictly need this distinction, as you will be able to add : as a first argument to keep old columns.

We can start select and transform without group and filter arguments, as they can be added later.

Another question - which again can be decided later - is if we want to add Col and Row wrappers so that you can write Col(fun) in select if most of your operations are row-wise and only some are whole column and similarly Row(fun) in transform if you have an opposite case.

@piever

piever commented Dec 17, 2019

In terms of naming, I don't think it's great to have transform column-wise and select row-wise, as nothing in the name suggests this distinction. I thought the plan was to have transform be select but also keeping the previous columns, whereas select only returns the new ones?

As the default seems to be row-wise, maybe it'd be better that the column-wise version has this explicitly in the name, e.g. mapcols, or select(df, :x => Col(f)).

@bkamins
Member

bkamins commented Dec 17, 2019

I am not sure about naming either - hopefully @nalimilan can come up with something smart as usual 😄. Maybe just selectcols - but I am not sure (as I have said above - essentially by does exactly this, but I agree with @nalimilan that the name is not very fortunate).

I thought the plan was to have transform be select but also keeping the previous columns, whereas select only returns the new ones?

As I have said above - this is possible, but it is really easy to keep old columns by adding : as a first argument to select; I think the fewer functions people need to learn the better.

select(df, :x => Col(f))

I understand that @nalimilan was afraid that this will be too complex if we had only this option for the newcomers.

Regarding mapcols - it is possibly a good name, but the map part suggests iterating function application over all columns (and this is what it currently does).

@pdeffebach
Contributor

I think the fewer functions people need to learn the better.

I can't agree with this. I think a *nix philosophy of aptly named simple functions is preferable. select(df, :, :a => abs => :b) has this weird : operator and a name that doesn't suggest transformation.

I think if anything we should actually take seriously _selectrenametransform(df, args...; kwargs...) and have select, rename, and transform be simple wrappers of that with various defaults.

@bkamins
Member

bkamins commented Dec 17, 2019

Thank you all for the comments. I have a feeling that I see what _selectrenametransform(df, args...; kwargs...) should be (whether we implement it as a single function that is called by wrappers or not is an implementation detail).

The advice that I think would be now most valuable is suggestions of public API.

We have the following dimensions (I give a full list, not all of them have to be covered in one shot, but we need a longer term plan):

  1. allow selection of columns (with two possible defaults: all and no columns)
  2. allow column renaming
  3. allow column transformations (with two possible defaults: row-wise and whole column)

For now - following the comment by @nalimilan I leave out SQL where, group by, having and order by options for separate functions.

So which combinations of these options should be exposed by which functions? Probably there are mixed opinions here (which is OK - I think it is better to voice them now so that we can properly weigh them); the crucial thing - if I may ask - is to provide a complete proposal that covers the whole list of possible options, as a key thing here is to be able to verify the consistency and intuitiveness of names in the proposals. Thank you!

@nalimilan
Member Author

nalimilan commented Dec 17, 2019

I don't have great ideas for naming unfortunately. I agree that having select be row-wise and transform vectorized wouldn't be obvious -- though that's what JuliaDB does @piever, right? selectcols would also be problematic as select actually selects columns currently.

The Col(f) wrapper approach isn't too bad either, but I'm just afraid it will look quite complex to new users: e.g. select(df, :x => Col(x -> levels!(x, ["c", "b", "a"]))) doesn't look great. Though it's not super nice without Col either, so maybe that's just a matter of providing a nice solution in DataFramesMeta. But even there we will need a nice name for vectorized operations, so better think about that right now for consistency (currently that operation is called... @transform). And maybe that could help us to find a solution?

@piever
Copy link

piever commented Dec 17, 2019

I don't have great ideas for naming unfortunately. I agree that having select be row-wise and transform vectorized wouldn't be obvious -- though that's what JuliaDB does @piever, right?

No, actually they are both row-wise, the difference is that transform keeps the old columns whereas select discards them (even though both accept things like :col => v, where v is a vector computed elsewhere).

At this moment my personal preference is probably for the Col wrapper, which is explicit, pushes users towards the efficient option (row-wise), and doesn't require many other function names.

We had discussed in JuliaData/JuliaDBMeta.jl#29 the possibility to allow interpolating a column in JuliaDBMeta (and I guess DataFramesMeta) row-wise macros. For example in:

@transform df normcol = (:col1 + :col2) / $(mean(:col3))

the dollar expression gets computed before calling the row-wise macro and then the rest is computed row-wise. Again, this would mean that most (all?) macros are row-wise, but there is an escape-hatch for column-wise operations.
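To illustrate, the lowering could conceptually look like this (a plain-Julia sketch, not what JuliaDBMeta actually generates): the interpolated expression is evaluated once over the whole column, and the rest runs row-wise:

```julia
# Plain vectors stand in for :col1, :col2, :col3.
col1 = [1.0, 2.0, 3.0]
col2 = [4.0, 5.0, 6.0]
col3 = [2.0, 2.0, 2.0]

m = sum(col3) / length(col3)                      # the $ part, computed once
normcol = map((a, b) -> (a + b) / m, col1, col2)  # the row-wise remainder
```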

@nalimilan
Copy link
Member Author

Funny, I don't know how I imagined transform was vectorized in JuliaDB. Maybe because it accepts a vector.

JuliaData/JuliaDBMeta.jl#29 is indeed relevant, thanks for raising it again. I still like the $ escaping (or equivalent) idea. Even though I proposed the @where iris :SepalLength > $(mean(:SepalLength)) idea, I now wonder whether your initial proposal @where iris :SepalLength > mean($SepalLength) or something like that wouldn't be even better. Both could be allowed actually.

I guess that kind of approach would fit well with the Col wrapper in DataFrames. BTW, note that it could be select(df, :x => Col(f) => :y) but also possibly select(df, Col(:x) => f => :y).

@bkamins
Copy link
Member

bkamins commented Dec 18, 2019

select(df, Col(:x) => f => :y) was also supported in #1727 (comment).

I think it is also OK to have select(df, Col(:x) => f => :y) if most prefer it.
The only benefit of select(df, :x => Col(f) => :y) is that if f is an anonymous function you have to wrap it in parentheses anyway, and with Col you are sure not to forget them.

What I mean is that:

select(df, :x => Col(x -> levels!(x, ["c", "b", "a"])) => :y)

would become

select(df, Col(:x) => (x -> levels!(x, ["c", "b", "a"])) => :y)

as it still requires ( and ) around x -> ....

@nalimilan
Copy link
Member Author

Right. But wrapping the column name sounds slightly more correct to me, as it's really the kind of argument passed to the function which is changed by Col, not the function itself. In other words, no transformation of the function can make it work on whole vectors if it's only called repeatedly on each entry. (On the contrary, that would be appropriate if the default was vectorized, and we used a wrapper to automatically call the function on each entry of the passed vector.)

Now, what would be the best name for that wrapper? Is Col explicit enough?

@bkamins
Copy link
Member

bkamins commented Dec 18, 2019

Col seemed OK for me as it is also short.

Just to add (as it might affect the decision) that we would also write things like Col(:) or Col(Not(r"x")).

An additional decision we should make is what oldcols => f => newcol will pass to f. In by we pass a NamedTuple.
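For reference, a sketch of the NamedTuple convention in plain Julia (hypothetical, with no DataFrames machinery): the transformation function receives a NamedTuple of the selected columns rather than positional vectors:

```julia
# f receives a NamedTuple of columns, mirroring what `by` passed.
f(nt) = nt.a .+ nt.b

nt = (a = [1, 2], b = [3, 4])
f(nt)  # elementwise sum of the two selected columns
```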

@pdeffebach
Copy link
Contributor

Col isn't great because we use cols to escape symbols in DataFramesMeta, which is a convention shared by JuliaDBMeta and StatsPlots as well.

@dgkf
Copy link

dgkf commented Jan 17, 2020

Weighing in on a number of topics that cropped up in this thread:

Row-wise by default

I experimented with this a bit in my fork of DataFramesMeta (although admittedly it sort of fell off my radar and I wouldn't be surprised if a lot of it breaks at the moment). By default, all operations were columnwise, but could be interpreted as rowwise by first converting a DataFrame to DataFrameRows, e.g.

# using dgkf/DataFramesMeta.jl#dev/symbol_contexts
df = DataFrame(x = 1:4, y = repeat([1, 2], 2), z = 'a':'d')

df |> @transform(a = :x .+ :y)  # columnwise
df |> eachrow |> @transform(a = :x + :y)  # rowwise

# rowwise with column results
# :. is interpreted as the input object
df |> eachrow |> @transform(a = (:x + :y) / mean(parent(:.)[!,^(:x)]))  

What I like about this is that it can use the type dispatch to alternate between modes of operation based on the input datatype. The way it's implemented, it's pretty trivial because of how the symbols are evaluated using dgkf/SymbolContexts.jl. The last example isn't pretty, but referencing columns is possible.
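The type-based mode switch can be sketched without any DataFrames machinery (transform_spec is a hypothetical helper, not the actual fork's implementation):

```julia
# The same kind of spec runs columnwise on a table of columns
# and row-wise on a vector of rows, selected purely by dispatch.
transform_spec(cols::NamedTuple, f) = f(cols)                         # columnwise
transform_spec(rows::Vector{<:NamedTuple}, f) = [f(r) for r in rows]  # row-wise

cols = (x = [1, 2], y = [3, 4])
rows = [(x = 1, y = 3), (x = 2, y = 4)]

transform_spec(cols, t -> t.x .+ t.y)  # broadcast over whole columns
transform_spec(rows, t -> t.x + t.y)   # scalar arithmetic per row
```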

How to escape columns

What's not implemented above, and is reflected in how nasty that last line looks, is how to escape a column. I think that this could be cleaned up in the above example, perhaps to something like:

@transform(a = (:x + :y) / mean(:.[!,:x]))  # or
@transform(a = (:x + :y) / mean($x))        # as suggested above

I'm not crazy about having a bunch of added syntax to denote row cells (:x) and now columns as well ($x), but I'm struggling to come up with a syntax that succinctly differentiates the two as symbols, short of using :. to reference the whole (parent) DataFrame.

Adapting case_when

As a small aside, I also implemented a rudimentary case_when that allows row-predicated transformations via (optionally column-predicated) columnwise functions. In the dplyr world (which has generally shaped my outlook on such tasks), this is the preferred way to predicate rowwise operations.

# using dgkf/DataFramesMeta.jl#dev/symbol_contexts
df = DataFrame(x = 1:4, y = repeat([1, 2], 2), z = 'a':'d')

df |> @transform(a = case_when(
    :x .>= 2 => :y,
    true     => nothing
))
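A minimal self-contained sketch of such a case_when (the name matches the example above, but the implementation and semantics here are hypothetical, and it assumes at least one predicate is a vector): pairs of predicate => value are evaluated per row, and the first match wins:

```julia
# Hypothetical implementation; predicates and values may be vectors
# (indexed per row) or scalars (reused for every row).
function case_when(rules::Pair...)
    n = maximum(length(r.first) for r in rules if r.first isa AbstractVector)
    map(1:n) do i
        for (cond, val) in rules
            c = cond isa AbstractVector ? cond[i] : cond
            if c
                return val isa AbstractVector ? val[i] : val
            end
        end
        nothing
    end
end

x = [1, 2, 3, 4]; y = [1, 2, 1, 2]
case_when(x .>= 2 => y, true => nothing)  # rows failing x >= 2 get nothing
```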

select and by

I wrote up a long opinion piece about select handling transformations over in #2080, so I won't belabor those points. Long story short, I think this type of syntactic sugar does not belong in a package that, at least to my naive eye, exists to share a common data structure. I like the idea of select and by as interesting syntactic experiments into the language of data manipulation and want to see what it matures into, but I don't think it should be bundled with such a foundational data structure.

In my opinion, regardless of where they end up, both select and by could benefit from renaming. Neither one is intuitive. Even after reading the description quite a few times, I can't understand how these behaviors are reflected in either function name, and calls to them just don't read as intuitive in any way. As rough as selectrenametransform is (even as a user-facing function), I prefer it to select or by because at least it's clear what it does.

Brainstorming some alternative naming conventions

Given the breadth of data transformations you can perform, maybe something as generic as mutate would work here? I generally dislike dplyr's use of such a generic word because it only performs column-wise transformations, but here we're really talking about pretty much every transformation under the sun (column transforms, column selections, adding and removing rows... I think that's about everything you can do to modify a DataFrame), so maybe it warrants the generality.

Alternatively, I think some language from ETL processes might be more appropriate given that you're extracting a subset of columns, transforming them and then "loading" them back into the data (okay, a bit of a stretch). I think etl could be cool, especially if you eventually are able to continue the chain to then dump data into another object or database connection.

etl(df, 
  Cols(:x, :y) .=> 
  col -> col .* 2 .=> 
  Cols(:a, :b) =>
  DataBaseConnection("mytable"))

@bkamins
Copy link
Member

bkamins commented Feb 12, 2020

@nalimilan - is there anything left to be decided in this issue? (I know there are several specific things left, but they have separate issues.)

Is there any remaining "grand decision" to be made in this issue?

@bkamins
Copy link
Member

bkamins commented Apr 6, 2020

I am closing this as I do not see any open grand discussions here after a review. We have settled on a general design. Please open separate issues if there are future requests for "individual" decisions/functionalities.
