Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support pushing a DataFrameRow to a DataFrame #1439

Closed
wants to merge 2 commits into from

Conversation

omus
Copy link
Member

@omus omus commented Jul 5, 2018

No description provided.

Copy link
Member

@ararslan ararslan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Funny, I just tried this yesterday and was surprised it didn't work. Glad it's an easy fix.

Copy link
Member

@nalimilan nalimilan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

append! doesn't sound like the right function. DataFrameRow is a row, not a collection of rows, so we should use push! instead. We already have two methods for that.

@omus omus changed the title Support appending a DataFrameRow to a DataFrame Support pushing a DataFrameRow to a DataFrame Jul 9, 2018
@omus
Copy link
Member Author

omus commented Jul 9, 2018

Using push sounds fine to me. I originally didn't thing of push here as I tend to think of push only being applied to linear data structures.

@@ -994,6 +994,8 @@ Base.convert(::Type{DataFrame}, d::AbstractDict) = DataFrame(d)
##
##############################################################################

Base.push!(df::DataFrame, r::DataFrameRow) = append!(df, parent(r)[row(r),:])
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unfortunately this implementation isn't really efficient, which can be a problem if you repeatedly push rows in a loop. I think we'd better allow iterating over a DataFrameRow, which is consistent with NamedTuple. Then the already existing method below should work automatically.

Copy link
Member Author

@omus omus Jul 10, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

DataFrameRows are already iterable but are incompatible with the push! for iterables defined below:

julia> df = DataFrame(A = 1:2, B = 'a':'b')
2×2 DataFrames.DataFrame
│ Row │ A │ B   │
├─────┼───┼─────┤
│ 11'a' │
│ 22'b' │

julia> r = DataFrameRow(df, 2)
DataFrameRow (row 2)
A  2
B  b


julia> collect(r)
2-element Array{Tuple{Symbol,Any},1}:
 (:A, 2)  
 (:B, 'b')

julia> push!(df, collect(r))
ERROR: ArgumentError: Error adding (:A, 2) to column :A. Possible type mis-match.
Stacktrace:
 [1] push!(::DataFrames.DataFrame, ::Array{Tuple{Symbol,Any},1}) at /home/omus/.julia/v0.6/DataFrames/src/dataframe/dataframe.jl:1034

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, right. I think we should change this, so that DataFrameRow and NamedTuple can be used interchangeably (and the former is just a mutable version of the latter). It will be breaking, but hopefully not many people depend on it.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually we cannot change it without a deprecation period (#1449). So in the meantime if you want, add a temporary push! method in deprecated.jl to make this work until the deprecation is removed.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Bump.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This completely fell off my radar. I'll try to take care of it now.

@omus
Copy link
Member Author

omus commented Sep 26, 2018

I've rebased the code which required some minor changes which cleaned things up. Once we remove the collect(::DataFrameRow) deprecation everything should just work. The first commit places the new push! method in dataframe.jl while the second moves it into deprecated.jl. Feel free to squash or drop the last commit depending on what placement you prefer.

@nalimilan
Copy link
Member

nalimilan commented Sep 27, 2018

Thanks. I think there should also be tests for situations where you append a row from a different data frame, with different cases: column names match, column names are the same but in a different order, and the number of columns differs. AFAICT we should reorder values in the second situation (consistent with vcat, cf. #1347), and throw an error in the last one. For the case where the data frame is the same, having a fast path to avoid these expensive checks would be nice.

EDIT: this means we'll have to keep a special DataFrameRow method even after the deprecation is removed, so it should live in dataframe.jl.

@omus
Copy link
Member Author

omus commented Oct 9, 2018

I like those changes but they seem outside the scope of this PR. Are you okay with this PR if only the method is moved into dataframe.jl?

@nalimilan
Copy link
Member

No, that's really the same PR. With the current method, columns will be added only depending on their order, which is inconsistent with what we do in vcat.

@bkamins
Copy link
Member

bkamins commented Jan 13, 2019

Given the recent changes in `DataFrames.jl I think the rules should be:

  • API: NamedTuple and DataFrameRow should have exactly the same behavior when push!-ed to a DataFrame;
  • efficiency: we should check if DataFrameRow has the same parent as DataFrame to which it is pushed and then use more efficient column matching using Index or SubIndex depending on whether we have a subset of columns or all columns in a DataFrameRow.

Do we agree on the approach?

@bkamins
Copy link
Member

bkamins commented Jan 17, 2019

Can we close it and instead work in #1685?

@nalimilan nalimilan closed this Jan 17, 2019
@nalimilan nalimilan deleted the cv/append-dataframerow branch January 17, 2019 09:25
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants