Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Unify eachcol and columns functions #1590

Merged
merged 39 commits into from
Nov 14, 2018
Merged

Conversation

bkamins
Copy link
Member

@bkamins bkamins commented Nov 8, 2018

Implementing what was discussed in #1586.

@bkamins
Copy link
Member Author

bkamins commented Nov 8, 2018

I had to add rawcolumns function defined for DataFrame only to allow access to columns field.
Also I fixed some problems with DataFrameStream that were "close" to columns issue (it was type unstable and was using .columns access which would produce an error after getproperty change in Julia 1.0).

@bkamins bkamins changed the title Unify eachcol and columns functions WIP: Unify eachcol and columns functions Nov 9, 2018
@bkamins
Copy link
Member Author

bkamins commented Nov 9, 2018

This is still WIP as I have to think if we want to make DFRowIterator and DFColumnIterator subtypes of AbstractVector and if yes what is the best way to do it.

@bkamins bkamins changed the title WIP: Unify eachcol and columns functions Unify eachcol and columns functions Nov 9, 2018
@bkamins
Copy link
Member Author

bkamins commented Nov 9, 2018

This should be good to have a look at now. I decided to leave the types only iterators (not AbstractVectors).

Two notes:

  • interestingly the code fails on Julia 0.7 and I do not understand why (it says it cannot find a method that is defined) - maybe it would be good to understand why
  • @nalimilan after Improve performance of by() using NamedTuples #1520 we have some new depwarn-s because of changes of how getindex will work; probably this is only transitory, but it is better to check (e.g. if you do not assume there some specific type, e.g. DataFrame not SubDataFrame or vector not a view - I checked this and I think it does not but you know the code better)

@bkamins
Copy link
Member Author

bkamins commented Nov 10, 2018

I have fixed Julia 0.7 issues with broadcasting in tests.

src/abstractdataframe/iteration.jl Outdated Show resolved Hide resolved
src/abstractdataframe/iteration.jl Outdated Show resolved Hide resolved
src/abstractdataframe/iteration.jl Show resolved Hide resolved
src/abstractdataframe/iteration.jl Outdated Show resolved Hide resolved
src/abstractdataframe/iteration.jl Outdated Show resolved Hide resolved
"""
struct DFColumnIterator{T <: AbstractDataFrame}
struct DFColumnIterator{T<:AbstractDataFrame, C}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Instead of C being a Boolean, it could be either AbstractVector or Tuple{Symbol,AbstractVector}. That way you could do <: AbstractArray{C} here, and you wouldn't need to define eltype and such below.

BTW, it's a good occasion to move from a tuple to a pair. That shouldn't affect most uses but it's more natural here.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That said, if we keep the special behavior of map which returns a DataFrame, maybe we'd better not make this iterator an AbstractVector. Same for the row iterator if we want to do something similar in the future.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@nalimilan Thank you for the comments. They touch on things I was not sure myself.

Actually the crucial design issue is to make them AbstractVector subtypes or not. This is tempting and then they should carry the eltype in the signature (actually - interestingly - this is the only way it is possible to implement it AFAIK) and the design would be simplified. If we went for this I was thinking about renaming the types to DFRowVector and DFColumnVector to better sell the idea what we produce.

But we have this special map behavior that produces a DataFrame from eachcol. If we went for AbstractVector subtyping I would remove this functionality. If we feel that we like its functionality I would prefer to create a new specialized method for it, e.g. having the following signature colmap(::Union{Type, Functon}, ::AbstractDataFrame and not overload map with a method that breaks its intuitive contract.

Also, I do not see much sense of doing the same functionality on eachrow (we would not know how to transform it to a DataFrame later).

In summary, having thought of it I would:

  • make both types AbstractVector subtype
  • deprecate map for eachcol
  • add colmap function that does what map for eachcol now does

When we agree if this is the way to go (or decide something else) I will fix this PR.

Copy link
Member

@nalimilan nalimilan Nov 11, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, so we need to choose between these two solutions:

  1. Define map(f, columns(df)) to return a data frame. We could add map(f, rows(df)) later if we want and that would be consistent. People have to use comprehensions to get a vector instead.
  2. Keep the standard behavior of map(f, columns(df)), i.e. have it return a vector. This is basically colwise, but with a more standard name (and we could deprecate it, the broadcasting form f.(columns(df)) is even more compact). We need another way to apply a function to each column of a data frame and get a data frame back: maybe mapcols(f, df), and mapcols!(f, df) for the in-place variant (the naming would be consistent with other column-wise functions, cf. Review row vs. column orientation of API #1514). Then map(f, df) would apply a function to each row (this is what JuliaDB supports).

Also, I do not see much sense of doing the same functionality on eachrow (we would not know how to transform it to a DataFrame later).

That would be quite simple AFAICT: require the user-provided function to return a named tuple, and fill a data frame with that like we now do in map(::Function, ::GroupedDataFrame).

Cc: @piever to discuss the consistency with JuliaDB.

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For me to understand, there are a ColumnIterator and RowIterator types, and the decision is whether one should overload map on them to return a DataFrame rather than a vector of columns or a vector of rows?

In JuliaDB map acts by default on the rows and returns a table. We also have a map_rows which basically would correspond to map_rows(f, iter...) = DataFrame(f(i) for i in iter...). It is useful when f returns NamedTuples but iter... are not a DataFrame (they could be vectors or iterators), for example map_rows((x, y) -> (s = x+y, d = x-y), rand(100), rand(100)). Could this be added to DataFrames as well? In which case I think mapcols map_rows is a strange pair. What could be a better pair of names?

For consistency with JuliaDB, option 2 seems much better (also, to keep map and filter consistent). I thought there was a colwise or similar in JuliaDB but apparently I'm misremembering. It looks like currently one has to do:

julia> t = table((a = rand(100), b = randn(100)));

julia> f(v) = partialsort(v, 1:10)
f (generic function with 1 method)

julia> table(map(f, columns(t)))
Table with 10 rows, 2 columns:
a           b
────────────────────
0.00392291  -2.52979
0.0221383   -2.01489
0.029867    -1.92054
0.0462468   -1.89799
0.0720458   -1.5951
0.0824684   -1.31414
0.105048    -1.25503
0.108803    -1.19262
0.1204      -1.13395
0.136053    -1.13125

This I think is not so bad but maybe it'd be more awkward in DataFrames because DataFrame(map(f, columns(t))) would lose the column names?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In JuliaDB map acts by default on the rows and returns a table. We also have a map_rows which basically would correspond to map_rows(f, iter...) = DataFrame(f(i) for i in iter...). It is useful when f returns NamedTuples but iter... are not a DataFrame (they could be vectors or iterators), for example map_rows((x, y) -> (s = x+y, d = x-y), rand(100), rand(100)). Could this be added to DataFrames as well? In which case I think mapcols map_rows is a strange pair. What could be a better pair of names?

I don't really see the advantage of providing a map_rows function if DataFrame(f(i) for i in iter...) does the same thing TBH. And using a generator like that sounds more Julian (it's like creating dicts).

For consistency with JuliaDB, option 2 seems much better (also, to keep map and filter consistent).

OK. That's good since that's the approach @bkamins has implemented. ;-)

But the example you show below goes a bit against with this statement AFAICT: map(f, columns(t)) returns a NamedTuple in JuliaDB, not a vector (which is why names are lost). That's logical since column(t) gives a NamedTuple, but that would be different from data frames. That said, it's not a big deal. Maybe JuliaDB could support mapcols just for consistency and convenience?

Copy link

@piever piever Nov 12, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yep, maybe JuliaDB could deprecate map_rows as "unnecessary". The easiest is probably to add a mapcols(f, t; select = All()) method to JuliaDB. As I mentioned above, I'm afraid this creates a bit of cacophony on the JuliaDB side (mapcols and map_rows have different underscore style and are not related in the obvious way). The cleanest is probably to deprecate (or at least not export) map_rows and rename setcol, popcol, pushcol... to setcols, popcols, pushcols (as they accept multiple columns at once) and add mapcols to the collections. But this PR shouldn't worry about that as all the changes to make things consistent should come from JuliaDB.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Makes sense. It's good to anticipate to what API we both want to converge in the long term.

src/abstractdataframe/iteration.jl Outdated Show resolved Hide resolved
nalimilan and others added 2 commits November 11, 2018 16:09
Co-Authored-By: bkamins <bkamins@sgh.waw.pl>
Co-Authored-By: bkamins <bkamins@sgh.waw.pl>
@bkamins
Copy link
Member Author

bkamins commented Nov 11, 2018

Having thought about it, I will go on with what I have suggested and then it lend under review.

@bkamins
Copy link
Member Author

bkamins commented Nov 11, 2018

OK - in 52e9010 you have what had to be changed to better support this intermediate deprecation.

@bkamins
Copy link
Member Author

bkamins commented Nov 12, 2018

@nalimilan Having tried what you propose I now have a basic example when iterate and getindex conflict, it is collect(eachcol(df)). This is exactly what I was afraid of - seems that collect iterates through a collection using getindex but then hits a problem that it expects the return value of type that iterate produces.

So - in my opinion - we have to be breaking here, unless you have some better recommendation.

@nalimilan
Copy link
Member

Hmm... I guess we can also override collect for now. Anyway that's temporary -- and you've already duplicated half of the code base to silence SubDataFrame deprecations. ;-)

src/abstractdataframe/iteration.jl Outdated Show resolved Hide resolved
Base.size(itr::DataFrameColumns) = (size(itr.df, 2),)
Base.IndexStyle(::Type{<:DataFrameColumns}) = Base.IndexLinear()

function Base.getindex(itr::DataFrameColumns{<:AbstractDataFrame,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@propagate_inbounds wouldn't hurt here and below.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

AFAIK it will not change anything significantly as we do boundschecking anyway when we try to access column number j from df (note that df is an AbstractDataFrame so we do not know at this point how it stores data).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes but propagating this is better anyway, and the data frame implementation could use @boundscheck at some point.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have opened a separate issue #1594 for this as this PR is complex enough and I am not sure if it will help anything (performancewise - disabling checking bounds saves us nanoseconds but then we have a dynamic dispatch on the result which is orders of magnitude more expensive).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not saying it matters, but it's cheap to add, so...

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK - I will add, but I leave the issue open for further analysis.

src/abstractdataframe/iteration.jl Outdated Show resolved Hide resolved
test/grouping.jl Outdated Show resolved Hide resolved
test/iteration.jl Outdated Show resolved Hide resolved
nalimilan and others added 2 commits November 12, 2018 10:42
Co-Authored-By: bkamins <bkamins@sgh.waw.pl>
Co-Authored-By: bkamins <bkamins@sgh.waw.pl>
@bkamins
Copy link
Member Author

bkamins commented Nov 12, 2018

I will override collect, but please note that some other function will have the same problem.

In summary: after we merge this, what other PRs would you want to finish before tagging a release. After you tag a release I will immediately submit a PRs (probably it is better to have several for different types of deprecations) cleaning up all this mess.

@pdeffebach
Copy link
Contributor

But the problem is that then the following will produce different results:
[f(col) for col in eachcol(df)]
and
[f(eachcol(df)[i]) for i in 1:ncol(df)]

And now a question: what will f.(eachcol(df)) produce? Unless you are an expert in how broadcasting is internally implemented you have to guess.

Note that [f(eachcol(df)[i]) for i in 1:ncol(df)] is very unlikely to be used very much because f has to take in a (name => column) pair. Because of that, I think that the divergence between Base.broadcast, Base.map, and f(col) for col in eachcol is probably fine. And I agree that is a preferred solution fo colwise.

@@ -138,11 +138,17 @@ function Base.iterate(itr::DataFrameColumns{<:AbstractDataFrame,
return (_names(itr.df)[j] => itr.df[j], j + 1)
end

# TODO: remove this after deprecation period of getindex of DataFrameColumns
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why not put this in deprecated.jl?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It will be simpler for me to clean it up later (and I assume I will do it just after the coming release).
deprecated.jl has 62kb and almost 2000 loc now and actually it is easy to forget to fix something there.

@nalimilan
Copy link
Member

I'd like to merge #1591 and finish a PR to speed up grouping on categorical arrays. Then yes, it would be nice to clean up all this mess.

test/grouping.jl Outdated
# f4(df) = [maximum(df[:c]), minimum(df[:c])]
# f5(df) = reshape([maximum(df[:c]), minimum(df[:c])], 2, 1)
# f6(df) = [maximum(df[:c]) minimum(df[:c])]
f1(df) = DataFrame(cmax = maximum(df[:c]))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Isn't this backwards?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes - but I thought this is what you wanted me to do saying that this will be handled in a separate PR. I will revert it. Hold on several minutes with reviewing please I am pushing things following your earlier comments and will let you know when I am finished

@bkamins bkamins mentioned this pull request Nov 13, 2018
@bkamins
Copy link
Member Author

bkamins commented Nov 13, 2018

@nalimilan - thank you for the patience with this PR. I believe the having a first-class views of columns and rows of a DataFrame as abstract vectors will be very useful.

I have additionally opened #1595 to discuss the future of colwise and similarly #1596 for eltypes (both things came up in the discussion and I think I would wait if people would want to comment on those issues before deciding how to move forward).

@nalimilan
Copy link
Member

You're welcome!

@bkamins
Copy link
Member Author

bkamins commented Nov 13, 2018

If there will be no additional comments to this PR I will merge it tomorrow.

@bkamins bkamins merged commit 5ebb714 into JuliaData:master Nov 14, 2018
@bkamins bkamins deleted the df_col_iteration branch November 14, 2018 10:10
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants