Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Usage of eachcol seems to be a pun? #2636

Closed
andyferris opened this issue Feb 28, 2021 · 7 comments · Fixed by #2680
Closed

Usage of eachcol seems to be a pun? #2636

andyferris opened this issue Feb 28, 2021 · 7 comments · Fixed by #2680
Labels
breaking The proposed change is breaking. ecosystem Issues in DataFrames.jl ecosystem
Milestone

Comments

@andyferris
Copy link
Member

I'm opening this in response to JuliaData/TypedTables.jl#68, with the hope of improving consistency across the ecosystem.

It seems that DataFrames.jl overloads Base.eachcol to return the column interator DataFrameColumns. To me, this seems like a bit of a pun, and the columns function from Tables.jl seems to be a much better fit for this behavior.

While this seems harmless at first, tables (in the Tables.jl sense) like Matrix{NamedTuple} or a TypedTables.Table <: AbstractMatrix are required to follow Base behavior. I see Base.eachcol as an AbstractArray interface - and one that is actively being built upon in Base (soon it will return a lazy vector-of-vectors SliceArray rather than just an iterator - see JuliaLang/julia#32310).

This leaves the data table ecosystem a little bit split. Even ignoring TypedTables, users may first learn to use a CSV.File and a Vector{<:NamedTuple} and will need to use columns there and switch to eachcol later when they learn DataFrames.jl. Do you think it would be reasonable to use the Tables.jl interface columns to return a DataFrameColumns instead?

@andyferris andyferris added breaking The proposed change is breaking. ecosystem Issues in DataFrames.jl ecosystem labels Feb 28, 2021
@bkamins bkamins added this to the 1.0 milestone Feb 28, 2021
@bkamins bkamins added the bug label Feb 28, 2021
@bkamins
Copy link
Member

bkamins commented Feb 28, 2021

There are two issues here:

  1. Changing what Tables.columns(::AbstractDataFrame) returns

I think this is clearly a bug. This is currently an identity and does not follow the API of Tables.columns. Since @quinnj implemented this behavior in affc948 three years ago let us wait what he says (and maybe the answer will be that things have changed during the 3 years and we should just fix it).

I am fully in agreement to change it.

  1. Deciding what eachcol(::AbstractDataFrame) should return

This is a harder thing. In DataFrames.jl 0.21 eachcol returned AbstractVector. The problem is that we had to change it to just iterator in 0.22 because AbstractVectors are very sctict about what pairs returns (keys must be integers) and we wanted keys to be column names.

Now AbstractDataFrame is technically a two dimensional object (as per ndims, axes etc.) so we are consistent in defining eachrow and eachcol here with base. The only inconsistency is that eachcol is not AbstractVector (eachrow is AbstractVector as we do not allow row keys). When it was designed in Julia Base eachcol was not an AbstractVector.

In summary:

  • I am convinced that defining eachcol for AbstarctDataFrame is consistent with Julia Base design
  • I think given eachcol usage is very common (especially with pairs wrapper around it) changing what we do now is undesirable (the consequence is that we cannot return AbstractVector)

But of course I am open to hear other opinions as the decision here is hard.

CC @nalimilan

@andyferris
Copy link
Member Author

Excellent. I suspect that was originally implemented back when DataFrame iterated columns, and overlooked since.

Now AbstractDataFrame is technically a two dimensional object (as per ndims, axes etc.) so we are consistent in defining eachrow and eachcol here with base. The only inconsistency is that eachcol is not AbstractVector (eachrow is AbstractVector as we do not allow row keys).

As it's not an AbstractMatrix it is not inconsistent for eachcol to return something that isn't an AbstractVector. I think the best consistency you can ask for is indexing is basically consistent (so eachcol(df)[col][row] === eachrow[row, col] or whatever).

(Of course - if AbstractDataFrame were aiming to be a 2D object consistent with other 2D objects in Base, then it would iterate, map, filter, reduce etc over cells much like an AbstractMatrix. I think that could work out fine even if not particularly useful a lot of the time; we would just use e.g. filter(row -> pred(row), eachrow(df)) for example).

@quinnj
Copy link
Member

quinnj commented Mar 1, 2021

I've commented on the TypedTables issue, but in short, there's no bug/issue w/ the Tables.jl implementation. Specifically, Tables.columns is not guaranteed to return an object that iterates columns, only that implements the required Tables.columnnames and Tables.getcolumns interface methods.

@bkamins bkamins removed the bug label Mar 1, 2021
@bkamins
Copy link
Member

bkamins commented Mar 1, 2021

@quinnj - you are right. This is not a bug - it was late yesterday, and I read the docstring of Tables.columns too quickly. Sorry for this.

Still we could consider defining:

  • DataFrameColumns <: Tables.AbstractColumns
  • Tables.columns(df::AbstractDataFrame) = eachcol(df)

Do you think there would be a benefit of doing this (the downside is that it would be a breaking change, but I assume people do not use Tables.columns on AbstractDataFrame much currently).

@nalimilan
Copy link
Member

For the history, DataFrames defined eachcol and eachrow before Base added them (at JuliaLang/julia#29749).

Still we could consider defining:

* `DataFrameColumns <: Tables.AbstractColumns`

* `Tables.columns(df::AbstractDataFrame) = eachcol(df)`

Do you think there would be a benefit of doing this (the downside is that it would be a breaking change, but I assume people do not use Tables.columns on AbstractDataFrame much currently).

This came up before at #2244. I think we can consider that the type of object we return from Tables.columns is an implementation detail. All people can rely on currently is that the returned object supports the documented interface, i.e. Tables.columnnames and Tables.getcolumns. Though the difference between Tables.columns, Tables.Columns and eachcol is a bit confusing for users. It could be more practical to make Tables.columns and eachcol consistent as you propose (I think I also suggested this somewhere) else; OTOH it could mislead people into believing that Tables.jl guarantees that you can iterate over the result, which isn't the case.

@bkamins bkamins mentioned this issue Mar 4, 2021
19 tasks
@bkamins
Copy link
Member

bkamins commented Mar 25, 2021

Still we could consider defining:

  • DataFrameColumns <: Tables.AbstractColumns
  • Tables.columns(df::AbstractDataFrame) = eachcol(df)

Do we go this way? If we were to make this change I would prefer to have it included in 1.0 release.

CC @quinnj

@quinnj
Copy link
Member

quinnj commented Mar 26, 2021

That sounds good to me.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
breaking The proposed change is breaking. ecosystem Issues in DataFrames.jl ecosystem
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants