Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Base.reduce(::typeof(vcat), ...) on DataFrames does not support init #3309

Closed
ctarn opened this issue Apr 5, 2023 · 5 comments · Fixed by #3310
Closed

Base.reduce(::typeof(vcat), ...) on DataFrames does not support init #3309

ctarn opened this issue Apr 5, 2023 · 5 comments · Fixed by #3310
Labels
Milestone

Comments

@ctarn
Copy link

ctarn commented Apr 5, 2023

Unlike other reduce methods, Base.reduce(::typeof(vcat), ...) on DataFrames don't support init. Although I noticed that passing an empty DataFrame array will also output an empty DataFrame, it does not behavior like other reduce methods, which can be confusing. For example,

julia> reduce(vcat, DataFrame[])
0×0 DataFrame

julia> reduce(vcat, Any[])
ERROR: MethodError: reducing over an empty collection is not allowed; consider supplying `init` to the reducer
...

julia> df = DataFrame(x=1:2, y=2:2:4)
2×2 DataFrame
 Row │ x      y     
     │ Int64  Int64 
─────┼──────────────
   1 │     1      2
   2 │     2      4

julia> reduce(vcat, DataFrame[]; init=df)
ERROR: MethodError: no method matching reduce(::typeof(vcat), ::Vector{DataFrame}; init::DataFrame)

Closest candidates are:
  reduce(::typeof(vcat), ::Union{Tuple{AbstractDataFrame, Vararg{AbstractDataFrame}}, AbstractVector{<:AbstractDataFrame}}; cols, source) got unsupported keyword argument "init"
   @ DataFrames ~/.julia/packages/DataFrames/LteEl/src/abstractdataframe/abstractdataframe.jl:1847
  reduce(::Any, ::AbstractArray; kw...)
   @ Base reducedim.jl:406
  reduce(::Any, ::Any; kw...)
   @ Base reduce.jl:485
  ...

julia> reduce(vcat, Any[]; init=df)
2×2 DataFrame
 Row │ x      y     
     │ Int64  Int64 
─────┼──────────────
   1 │     1      2
   2 │     2      4

julia> reduce(vcat, Any[]; init=empty(df))
0×2 DataFrame
 Row │ x      y     
     │ Int64  Int64 
─────┴──────────────

julia> reduce(vcat, Any[df, df]; init=df)
6×2 DataFrame
 Row │ x      y     
     │ Int64  Int64 
─────┼──────────────
   1 │     1      2
   2 │     2      4
   3 │     1      2
   4 │     2      4
   5 │     1      2
   6 │     2      4
@bkamins bkamins added the feature label Apr 5, 2023
@bkamins bkamins modified the milestones: patch, 1.6 Apr 5, 2023
@bkamins
Copy link
Member

bkamins commented Apr 5, 2023

It is not supported because it is not needed. See the docstring of reduce:

If provided, the initial value init must be a neutral element for op that will be returned for empty collections. It is unspecified whether init is used for non-empty collections.

The point is that reduce in DataFrames.jl is able to work out the neutral element (which is DataFrame()) without it being passed.

So the question is in what cases the current behavior is confusing (in other words init in reduce is introduced to handle empty collection case ONLY, it should not be relied on when non-empty collection is passed; and we do not need this init value in case DataFrame[] is passed as an argument because we can work it out anyway)?

(I am not saying we should not add init, I just want to understand in what cases it is needed)

@ctarn
Copy link
Author

ctarn commented Apr 5, 2023

Hi! Thanks for your response. I did not notice that init for non-empty collections is unspecified. However, it seems that people don't totally agree on that, see JuliaLang/julia#49042.

Whatever, specifically, I want to split a DataFrame df into several SubDataFrames using something like groupby, and then vcat some of the SubDataFrames, say sdfs. I expect that the combined one always shares the same set of columns with df. However, reduce(vcat, sdfs) return DataFrame() when sdfs is empty. I try reduce(vcat, sdfs; init=empty(df)), but init is not supported currently.

I think adding init is useful and harmless since the current reduce(vcat, sdfs) works just like reduce(vcat, sdfs; init=DataFrame()). Hope init can be specified.

@bkamins
Copy link
Member

bkamins commented Apr 5, 2023

So your request is to make init:

init is the initial value to use in the reductions

We could do this as I agree it is useful (another use is to make sure that one gets a proper eltype of columns).

I will add it.

@bkamins
Copy link
Member

bkamins commented Apr 6, 2023

see #3310

@ctarn
Copy link
Author

ctarn commented Apr 6, 2023

Thank you very much for your efforts! I left a comment there for a typo.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants