Allow data frame and DataFrameRow to take part in broadcasting #1806

bkamins · 2019-05-11T21:55:37Z

Currently data frame is not allowed in broadcasting and DataFrameRow is collected to a vector.

Following the discussion in #1804 we should probably implement custom broadcasting rules for them that would preserve column names if possible.

It is not entirely clear for me what should be the output but I guess this is an initially reasonable approach:

f.(data_frame) returns a DataFrame transforming each element using f but retaining column names; the question is what should happen if we combine data frames in broadcasting with other data frames or data frame rows or three or higher dimensional objects;
I am not sure what f.(::DataFrameRow) should produce - probably a NamedTuple, but the problem is that NamedTuples are not allowed in broadcasting later

Please comment if you have any opinion on this.

The text was updated successfully, but these errors were encountered:

nalimilan · 2019-05-17T20:28:56Z

I agree f.(::DataFrame) should return a DataFrame. Broadcasting data frames with other data frames should probably return data frames, provided that column names match; same for matrices, provided that the sizes match. Broadcasting data frames with DataFrameRow or single-row matrix should return a data frame, recycling the rows to match those of the data frame. Broadcasting data frames with vectors or single-column matrices should return a data frame, recycling the columns. Higher-dimensional objects don't have to be supported (for now at least), but the rules applied to matrices could be extended I guess.

Having f.(::DataFrameRow) return a NamedTuple makes sense. We could wait until broadcasting over named tuples is implemented in Base (JuliaLang/julia#30836).

bkamins · 2019-05-17T20:53:53Z

This gets complex 😄. Especially as the question is how we want to provide broadcast fusion. I will have to think about it.

However, there is one general comment:

Broadcasting data frames with DataFrameRow or single-row matrix should return a data frame

This is a delicate point (and I would leave out the decision what to do with DataFrameRow for later). The reason is that what you write is sensible, but currently we do something else so this would be breaking and, moreover, this is potentially inconsistent with how we treat DataFrameRow when we assign to it and treatment of NamedTuples in Base.

bkamins · 2019-05-19T07:28:23Z

A note to remember. When we decided to treat DataFrame as a collection of rows then first was defined to return a DataFrameRow. Some functionalities in Base (e.g. @default_eltype) rely on this.

Probably still we should treat DataFrame as two-dimensional in broadcasting as treating it as a single-dimensional collection of rows would not be very convenient.

nalimilan · 2019-05-20T07:59:04Z

Broadcasting data frames with DataFrameRow or single-row matrix should return a data frame

This is a delicate point (and I would leave out the decision what to do with DataFrameRow for later). The reason is that what you write is sensible, but currently we do something else so this would be breaking and, moreover, this is potentially inconsistent with how we treat DataFrameRow when we assign to it and treatment of NamedTuples in Base.

Right. Given what we said at #1804 (comment), it wouldn't be consistent to treat DataFrameRow differently from NamedTuple or Vector. That's a bit weird, but hopefully in practice it won't matter a lot (and anyway the same applies to matrix slices a.k.a. vectors).

A note to remember. When we decided to treat DataFrame as a collection of rows then first was defined to return a DataFrameRow. Some functionalities in Base (e.g. @default_eltype) rely on this.

Probably still we should treat DataFrame as two-dimensional in broadcasting as treating it as a single-dimensional collection of rows would not be very convenient.

Indeed, that's another difficult point. Strictly speaking I guess f.(df) should apply f to each row of df and return a data frame. But in practice that's not likely to be very useful (notably df .= 1 wouldn't work) and we already have map for that. I guess we can take the position that data frames are collections of rows when treated as one-dimensional, but are matrix-like collections of cells when treating as two-dimensional (notably with indexing and therefore broadcasting). But we should check whether this can create weird conflicts between these two views in some cases.

bkamins · 2019-05-20T09:11:44Z

I guess we can take the position that data frames are collections of rows when treated as one-dimensional, but are matrix-like collections of cells when treating as two-dimensional (notably with indexing and therefore broadcasting).

This is exactly the problem. And yes - treating a DataFrame as a collection of rows in broadcasting would not be very convenient as Julia does not do nested broadcasting. What I mean is that this fails:

julia> [[1,2], [3,4]] .+ 1
ERROR: MethodError: no method matching +(::Array{Int64,1}, ::Int64)

I think we will not have a problem with this as long as we do not define data frames as iterable. Also currently data frame does not support map. If we introduce it we have a conceptual problem because, if I understand @mbauman design idea correctly, map(f, x) and f.(x) in theory should be the same, so here we would break this contract. But I think that is the reason we have eachrow and eachcol to drop the dimensionality of a data frame to one.

bkamins · 2019-05-20T19:00:15Z

We have the following ways to go (the decision influences #1804, so that PR is on hold now).

df means a data frame, dfr means a DataFrameRow, tup is Tuple, nt is NamedTuple.

Note 1 Whatever is written here does not broadcasting over a single column (which fortunately). This kind of broadcasting is a major use case. If we cannot decide what to do for multiple-column broadcasting the "zero" is option 1. below.

Note 2 Everywhere I omit discussion of row selection as this is irrelevant (the behavior is not affected if we take all rows or only a subset of rows, except a single row when a dfr is created and this case is covered).

Disallow broadcasting of df and dfr on RHS and LHS except for the case when a single column is selected - then operations on both sides are allowed an supported (this is the case for every option below so I do not repeat it).
Treat df as two-dimensional and dfr as single dimensional; consequences
- standard broadcasting operations with scalars, vectors and matrices work nicely for df and dfr
- df .= dfr makes little sense
- df .= Ref(dfr) makes little sense
- df .= Ref(nt) makes little sense
- df .= tup makes little sense
- df .= Ref(tup) makes little sense
- df .= df2 works as expected
- possibly in the future we will have conflicts with treating data frame as a collection of rows (e.g. now in filter, in the future in map)
- we are consistent with size and ndims
- we are consistent with indexing rules
Treat df as a collection of rows and dfr as non-broadcastable, optionally allowing "implicit broadcasting" of scalars
- broadcasting assignment df requires on RHS a collection that is treated as a row (problem to broadcast a column vector vertically, but this is not a common operation)
- is we add "implicit broadcasting" of scalars in dfr then broadcasting assignment df also accepts a scalar
- df .= dfr is not allowed
- df .= Ref(dfr) makes sense
- df .= Ref(nt) makes sense
- df .= tup makes sense (tup has to contain collections assigned to rows)
- df .= Ref(tup) makes sense
- df .= df2 works as expected
- we have a consistency in treating data frame as a collection of rows (e.g. now in filter, in the future in map)
- we are inconsistent with size and ndims (probably we should redefine them then)
- we are inconsistent with indexing rules (and we have to live with this)
Treat df as a collection of rows and dfr as broadcastable - very similar to option above, but without implicit broadcasting of scalars over a dfr in assignment; in practice it will only add dfr .= 1 possibility while disallowing df .= 1 possibility from the option above if we allow implicit broadcasting.

@nalimilan - this is a summary of the options we discussed. Feel free to edit this post if you wish, so that we do not have a spaghetti discussion on this but keep things in one place.

EDIT In summary the major decision is how we treat df as a "matrix" or as a "vector of vectors". The rest of things will follow the "implicit broadcasting" of dfr can be later viewed as "nested broadcasting" over an enclosed container. This will be a slight abuse, but we might decide to allow or disallow it. This is relevant only in places when you are using scalars and instead of df .= scalar you have to write e.g. df .= Ref(fill(scalar, ncol(df))) or something similar which is not extremely bad I think (I would assume that broadcasting over columns is not the most frequent thing one wants to do). An alternative would be write map(row -> row .= scalar, df) (if we define map on df) which is again not that bad and currently foreach(row -> row .= scalar, eachrow(df)) which is a bit long but at least clear. Both assume that we would allow dfr to be LHS in broadcasting.

nalimilan · 2019-05-20T20:09:59Z

Given that broadcasting is supposed to be tied with indexing, treating data frames as two-dimensional in broadcasting sounds the best option (and the most useful one in practice). So option 2 is probably the way to go, even if it's a bit weird for map, and to a less extent for filter and co. Anyway as you note we are already inconsistent and we certainly don't want to fix that, as it would e.g. require length(df) and size(df) to return the number of rows, which would then be inconsistent with indexing.

Though I think we should disallow broadcasting DataFrameRow and NamedTuple for now, since 1) broadcasting NamedTuple is disallowed in Base currently, 2) it makes little sense to treat them like column vectors (as you note), and 3) that will leave us more time to decide.

bkamins · 2019-05-21T08:55:49Z

OK. I will add then the suupport for CartesianIndex indexing for DataFrame and DataFrameRow then also. I will come back with a specification here (or the PR if things are simple) in a few days.

bkamins mentioned this issue May 11, 2019

Broadcasting in DataFrames #1804

Merged

bkamins mentioned this issue May 20, 2019

Surprising setindex/setproperty behaviour #1815

Closed

bkamins mentioned this issue Jun 8, 2019

Add broadcasting of AbstractDataFrame #1840

Merged

bkamins closed this as completed in #1840 Jun 23, 2019

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Allow data frame and DataFrameRow to take part in broadcasting #1806

Allow data frame and DataFrameRow to take part in broadcasting #1806

bkamins commented May 11, 2019

nalimilan commented May 17, 2019

bkamins commented May 17, 2019

bkamins commented May 19, 2019

nalimilan commented May 20, 2019

bkamins commented May 20, 2019

bkamins commented May 20, 2019 •

edited

Loading

nalimilan commented May 20, 2019

bkamins commented May 21, 2019

Allow data frame and DataFrameRow to take part in broadcasting #1806

Allow data frame and DataFrameRow to take part in broadcasting #1806

Comments

bkamins commented May 11, 2019

nalimilan commented May 17, 2019

bkamins commented May 17, 2019

bkamins commented May 19, 2019

nalimilan commented May 20, 2019

bkamins commented May 20, 2019

bkamins commented May 20, 2019 • edited Loading

nalimilan commented May 20, 2019

bkamins commented May 21, 2019

bkamins commented May 20, 2019 •

edited

Loading