Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

broadcast in DataFrames.jl #1643

Closed
wants to merge 1 commit into from

Conversation

bkamins
Copy link
Member

@bkamins bkamins commented Dec 14, 2018

This is a small PR, but a big decision towards making DataFrames.jl broadcasting-ready. What I recommend is:

  • treat an AbstractDataFrame as an AbstractVector of DataFrameRows when broadcasting;
  • treat a DataFrameRow as a one-dimensional generic collection when broadcasting.

The major consequence is that the output of broadcasting aa function over these types will be a vector in both cases.

@bkamins
Copy link
Member Author

bkamins commented Dec 14, 2018

This PR will pass tests after #1637 is merged.

Copy link
Member

@nalimilan nalimilan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks. I have a few hesitations about this.

First, I wonder whether returning a vector when broadcasting over a DataFrame is the most useful option. We could implement custom methods which return a DataFrame, and allow returning a named tuple to create columns. This would be essentially transmute in the dplyr terminology. Or a possible intermediate approach would be to return a vector if the function returns a single value, but a data frame if it returns a named tuple.

The other annoying point is that this syntax will necessarily be slow, and we don't provide a fast alternative (like we do for by). Providing a syntax which is convenient but slow could be worse than not providing it at all.

@@ -0,0 +1,9 @@
module TestDataFrame
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you put this in an existing file?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK - I will move it to indexing.jl.

@@ -471,3 +471,70 @@ CSV.write(output, df)
```

The behavior of CSV functions can be adapted via keyword arguments. For more information, see `?CSV.read` and `?CSV.write`, or checkout the online [CSV.jl documentation](https://juliadata.github.io/CSV.jl/stable/).

## Broadcasting
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure we should mention this here given that this syntax is inefficient. We should first show how to achieve this efficiently. Broadcasting over columns is also very useful.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My point with this PR is exactly to decide what kind of broadcasting we want to provide by default. We essentially have four options for broadcasting purposes:

  • make DataFrame be treated as scalar, then we do Ref(df);
  • make DataFrame be treated as a collection of rows, then we do eachrow(df);
  • make DataFrame be treated as a collection of rows, then we do eachcol(df, false);
  • make DataFrame be treated as a two dimensional object, like array; an inefficient way to do this would be Matrix(df), probably a more efficient implementation would be needed if we decided on it.

We have to select one. The other three have to be keyed-in by the user manually when broadcasting.

So the question is - how do we want to treat an AbstractDataFrame in broadcasting by default (when it is not wrapped by some other object).

"1"
"3"

julia> (row -> string.(row)).(df)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think this syntax is recommended (and it's quite obscure for newcomers).

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK - I will remove it.

@bkamins
Copy link
Member Author

bkamins commented Dec 15, 2018

Given (I have just noticed it as it got covered by inline comments):

The other annoying point is that this syntax will necessarily be slow, and we don't provide a fast alternative

I am OK to leave it undefined for now, but at least let us try to decide what should be the dimensionality of AbstractDataFrame when broadcasted over (a question I have asked above).

@pdeffebach
Copy link
Contributor

One thing I've been thinking about is if we might want to have our own map function that works similar to the by optimizations.

Having

map(df, [:x1, :x2]) do x 

could be nice. I mean, maybe not, but it's worth wondering if we should save row-iteration for something behind a few more function calls to make it performant.

@bkamins
Copy link
Member Author

bkamins commented Dec 15, 2018

Here is what my current thinking is (it would solve many challenges we face in one shot):

  • add typed field to DataFrame structure and typeddirty::Bool field
  • this would be either a a) NamedTuple of AbstractVectors or b) TypedTable (this would introduce a dependency, but we would gain many things for free); any of those options would reuse the stored columns
  • the typed field would be materialized lazily only on read; it would be materialized only when typedirty is true and after materialization set it to false;
  • we set typeddirty to true on any opertion that changes column types (column addition/deletion/disallowmissing etc. or setindex! on column that changes its type) - we have full control over all methods that are able to do this modification;
  • probably some locking should be done in order to make it thread safe.

In this way we keep normal DataFrame with all its benefits but "for free" we get typed version of it when it is needed.

And for example broadcast, map, or by etc. could use this typed field. The big benefit is that we materialize typed only once if columns do not change (so we have no precompilation overhead every time we call). Of course typed field would have volatile type, so we would have to have a barrier function inside all functions that want to take advantage of this field.

Notice that then we would not need to write:

map(df, [:x1, :x2]) do x 

but simply:

map(df) do x 

would be enough and efficient.

If df had thousands of columns then you would write

map(df[[:x1,:x2]]) do x 

at the cost of generation of typed for this small data frame but saving on precompilation of function taking a possibly large NamedTuple and if you wanted to do several operations on a subset you would write df2 = df[[:x1, :x2]] and then df2 would cache its typed argument in repeated calls.

Do you see any disadvantages of this approach?

@nalimilan
Copy link
Member

Interesting. I'm not sure that would work if we only stored a field: it would have to be part of the type, with e.g. DataFrame{Nothing} for the type-unstable default. Maybe we could simply have a type parameter reflecting the type of the columns field, which could either be a named tuple (type stable) or a kind of named vector (type-unstable, with eltype AbstractVector).

But this discussion is off-topic here. Maybe continue it at #1335?

@nalimilan
Copy link
Member

nalimilan commented Dec 26, 2018

x-ref #1019, #1513

@bkamins bkamins mentioned this pull request Jan 15, 2019
31 tasks
@bkamins
Copy link
Member Author

bkamins commented Jun 8, 2019

Closing this as we have a new approach to data frame broadcasting now

@bkamins bkamins closed this Jun 8, 2019
@bkamins bkamins deleted the dataframe_broadcast branch June 8, 2019 16:55
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants