Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Allow data frame and DataFrameRow to take part in broadcasting #1806

Closed
bkamins opened this issue May 11, 2019 · 8 comments · Fixed by #1840
Closed

Allow data frame and DataFrameRow to take part in broadcasting #1806

bkamins opened this issue May 11, 2019 · 8 comments · Fixed by #1840

Comments

@bkamins
Copy link
Member

bkamins commented May 11, 2019

Currently data frame is not allowed in broadcasting and DataFrameRow is collected to a vector.

Following the discussion in #1804 we should probably implement custom broadcasting rules for them that would preserve column names if possible.

It is not entirely clear for me what should be the output but I guess this is an initially reasonable approach:

  • f.(data_frame) returns a DataFrame transforming each element using f but retaining column names; the question is what should happen if we combine data frames in broadcasting with other data frames or data frame rows or three or higher dimensional objects;
  • I am not sure what f.(::DataFrameRow) should produce - probably a NamedTuple, but the problem is that NamedTuples are not allowed in broadcasting later

Please comment if you have any opinion on this.

@nalimilan
Copy link
Member

I agree f.(::DataFrame) should return a DataFrame. Broadcasting data frames with other data frames should probably return data frames, provided that column names match; same for matrices, provided that the sizes match. Broadcasting data frames with DataFrameRow or single-row matrix should return a data frame, recycling the rows to match those of the data frame. Broadcasting data frames with vectors or single-column matrices should return a data frame, recycling the columns. Higher-dimensional objects don't have to be supported (for now at least), but the rules applied to matrices could be extended I guess.

Having f.(::DataFrameRow) return a NamedTuple makes sense. We could wait until broadcasting over named tuples is implemented in Base (JuliaLang/julia#30836).

@bkamins
Copy link
Member Author

bkamins commented May 17, 2019

This gets complex 😄. Especially as the question is how we want to provide broadcast fusion. I will have to think about it.

However, there is one general comment:

Broadcasting data frames with DataFrameRow or single-row matrix should return a data frame

This is a delicate point (and I would leave out the decision what to do with DataFrameRow for later). The reason is that what you write is sensible, but currently we do something else so this would be breaking and, moreover, this is potentially inconsistent with how we treat DataFrameRow when we assign to it and treatment of NamedTuples in Base.

@bkamins
Copy link
Member Author

bkamins commented May 19, 2019

A note to remember. When we decided to treat DataFrame as a collection of rows then first was defined to return a DataFrameRow. Some functionalities in Base (e.g. @default_eltype) rely on this.

Probably still we should treat DataFrame as two-dimensional in broadcasting as treating it as a single-dimensional collection of rows would not be very convenient.

@nalimilan
Copy link
Member

Broadcasting data frames with DataFrameRow or single-row matrix should return a data frame

This is a delicate point (and I would leave out the decision what to do with DataFrameRow for later). The reason is that what you write is sensible, but currently we do something else so this would be breaking and, moreover, this is potentially inconsistent with how we treat DataFrameRow when we assign to it and treatment of NamedTuples in Base.

Right. Given what we said at #1804 (comment), it wouldn't be consistent to treat DataFrameRow differently from NamedTuple or Vector. That's a bit weird, but hopefully in practice it won't matter a lot (and anyway the same applies to matrix slices a.k.a. vectors).

A note to remember. When we decided to treat DataFrame as a collection of rows then first was defined to return a DataFrameRow. Some functionalities in Base (e.g. @default_eltype) rely on this.

Probably still we should treat DataFrame as two-dimensional in broadcasting as treating it as a single-dimensional collection of rows would not be very convenient.

Indeed, that's another difficult point. Strictly speaking I guess f.(df) should apply f to each row of df and return a data frame. But in practice that's not likely to be very useful (notably df .= 1 wouldn't work) and we already have map for that. I guess we can take the position that data frames are collections of rows when treated as one-dimensional, but are matrix-like collections of cells when treating as two-dimensional (notably with indexing and therefore broadcasting). But we should check whether this can create weird conflicts between these two views in some cases.

@bkamins
Copy link
Member Author

bkamins commented May 20, 2019

I guess we can take the position that data frames are collections of rows when treated as one-dimensional, but are matrix-like collections of cells when treating as two-dimensional (notably with indexing and therefore broadcasting).

This is exactly the problem. And yes - treating a DataFrame as a collection of rows in broadcasting would not be very convenient as Julia does not do nested broadcasting. What I mean is that this fails:

julia> [[1,2], [3,4]] .+ 1
ERROR: MethodError: no method matching +(::Array{Int64,1}, ::Int64)

I think we will not have a problem with this as long as we do not define data frames as iterable. Also currently data frame does not support map. If we introduce it we have a conceptual problem because, if I understand @mbauman design idea correctly, map(f, x) and f.(x) in theory should be the same, so here we would break this contract. But I think that is the reason we have eachrow and eachcol to drop the dimensionality of a data frame to one.

@bkamins
Copy link
Member Author

bkamins commented May 20, 2019

We have the following ways to go (the decision influences #1804, so that PR is on hold now).

df means a data frame, dfr means a DataFrameRow, tup is Tuple, nt is NamedTuple.

Note 1 Whatever is written here does not broadcasting over a single column (which fortunately). This kind of broadcasting is a major use case. If we cannot decide what to do for multiple-column broadcasting the "zero" is option 1. below.

Note 2 Everywhere I omit discussion of row selection as this is irrelevant (the behavior is not affected if we take all rows or only a subset of rows, except a single row when a dfr is created and this case is covered).

  1. Disallow broadcasting of df and dfr on RHS and LHS except for the case when a single column is selected - then operations on both sides are allowed an supported (this is the case for every option below so I do not repeat it).

  2. Treat df as two-dimensional and dfr as single dimensional; consequences

    • standard broadcasting operations with scalars, vectors and matrices work nicely for df and dfr
    • df .= dfr makes little sense
    • df .= Ref(dfr) makes little sense
    • df .= Ref(nt) makes little sense
    • df .= tup makes little sense
    • df .= Ref(tup) makes little sense
    • df .= df2 works as expected
    • possibly in the future we will have conflicts with treating data frame as a collection of rows (e.g. now in filter, in the future in map)
    • we are consistent with size and ndims
    • we are consistent with indexing rules
  3. Treat df as a collection of rows and dfr as non-broadcastable, optionally allowing "implicit broadcasting" of scalars

    • broadcasting assignment df requires on RHS a collection that is treated as a row (problem to broadcast a column vector vertically, but this is not a common operation)
    • is we add "implicit broadcasting" of scalars in dfr then broadcasting assignment df also accepts a scalar
    • df .= dfr is not allowed
    • df .= Ref(dfr) makes sense
    • df .= Ref(nt) makes sense
    • df .= tup makes sense (tup has to contain collections assigned to rows)
    • df .= Ref(tup) makes sense
    • df .= df2 works as expected
    • we have a consistency in treating data frame as a collection of rows (e.g. now in filter, in the future in map)
    • we are inconsistent with size and ndims (probably we should redefine them then)
    • we are inconsistent with indexing rules (and we have to live with this)
  4. Treat df as a collection of rows and dfr as broadcastable - very similar to option above, but without implicit broadcasting of scalars over a dfr in assignment; in practice it will only add dfr .= 1 possibility while disallowing df .= 1 possibility from the option above if we allow implicit broadcasting.

@nalimilan - this is a summary of the options we discussed. Feel free to edit this post if you wish, so that we do not have a spaghetti discussion on this but keep things in one place.

EDIT In summary the major decision is how we treat df as a "matrix" or as a "vector of vectors". The rest of things will follow the "implicit broadcasting" of dfr can be later viewed as "nested broadcasting" over an enclosed container. This will be a slight abuse, but we might decide to allow or disallow it. This is relevant only in places when you are using scalars and instead of df .= scalar you have to write e.g. df .= Ref(fill(scalar, ncol(df))) or something similar which is not extremely bad I think (I would assume that broadcasting over columns is not the most frequent thing one wants to do). An alternative would be write map(row -> row .= scalar, df) (if we define map on df) which is again not that bad and currently foreach(row -> row .= scalar, eachrow(df)) which is a bit long but at least clear. Both assume that we would allow dfr to be LHS in broadcasting.

@nalimilan
Copy link
Member

Given that broadcasting is supposed to be tied with indexing, treating data frames as two-dimensional in broadcasting sounds the best option (and the most useful one in practice). So option 2 is probably the way to go, even if it's a bit weird for map, and to a less extent for filter and co. Anyway as you note we are already inconsistent and we certainly don't want to fix that, as it would e.g. require length(df) and size(df) to return the number of rows, which would then be inconsistent with indexing.

Though I think we should disallow broadcasting DataFrameRow and NamedTuple for now, since 1) broadcasting NamedTuple is disallowed in Base currently, 2) it makes little sense to treat them like column vectors (as you note), and 3) that will leave us more time to decide.

@bkamins
Copy link
Member Author

bkamins commented May 21, 2019

OK. I will add then the suupport for CartesianIndex indexing for DataFrame and DataFrameRow then also. I will come back with a specification here (or the PR if things are simple) in a few days.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants