Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

DataFrameRow conversion to DataFrame broken #1675

Closed
alyst opened this issue Jan 13, 2019 · 10 comments · Fixed by #1685
Closed

DataFrameRow conversion to DataFrame broken #1675

alyst opened this issue Jan 13, 2019 · 10 comments · Fixed by #1685

Comments

@alyst
Copy link
Contributor

alyst commented Jan 13, 2019

On Julia v1.0.3 and DataFrames 0.16

julia> using DataFrames

julia> df = DataFrame(a = [1, 2], b = [:a, :b])
2×2 DataFrame
│ Row │ a     │ b      │
│     │ Int64 │ Symbol │
├─────┼───────┼────────┤
│ 1   │ 1     │ a      │
│ 2   │ 2     │ b      │

julia> df[1, :]
DataFrameRow (row 1)
a  1
b  a

julia> DataFrame(df[1, :])
0×0 DataFrame

I think it was working (as naturally expected) in 0.15. One consequence of this is that append!(df, df[1, :]) fails with "Column names do not match".

@nalimilan
Copy link
Member

Woops, looks like the fallback Tables constructor is used. We should define a special constructor for DataFrameRow.

But returning an empty data frame is really weird. I would expect either an error or something meaningful. What do you think @quinnj?

One consequence of this is that append!(df, df[1, :]) fails with "Column names do not match".

AFAIK that never worked (it should be push! anyway): see #1439.

@bkamins
Copy link
Member

bkamins commented Jan 15, 2019

DataFrameRow should behave in exactly the same way as a NamedTuple of values. Therefore:

  • one should use push! instead of append!;
  • constructing a DataFrame from a DataFrameRow as of current should fail;

The issue is that Base.IteratorEltype returns a different value for NamedTuple and for DataFrameRow. As we probably should not change it now we should add a special handling for DataFrameRow.

We should first decide if we want DataFrame((a=1, b=2)) to work (now it fails). Maybe @quinnj could comment here (the issue is that if it would work we would perform an implicit broadcasting which I think we should not do). Depending on the answer an appropriate implementation for DataFrameRow should be provided (so my "default" is that DataFrame(::DataFrameRow) should throw an error).

An idiomatic pattern that currently works is DataFrame(pairs(dfr)...) and the same with a NamedTuple: DataFrame(pairs(nt)...). In this approach we perform an implicit broadcasting, but it does not introduce ambiguity because it calls a different constructor.

@bkamins
Copy link
Member

bkamins commented Jan 15, 2019

We could implement a more convenient replacement for DataFrame(pairs(dfr)...), e.g. that DataFrame constructor accepts a generator/iterator of Paris.

@alyst
Copy link
Contributor Author

alyst commented Jan 15, 2019

I agree that push!(df, dfr) is a much better way and I already switched to using it. But since there's

Base.append!(df::DataFrame, x) = append!(df, DataFrame(x))

in tables.jl, append!(df, dfr) should also work if DataFrameRow could be properly converted to DataFrame.

@alyst alyst closed this as completed Jan 15, 2019
@alyst alyst reopened this Jan 15, 2019
@bkamins
Copy link
Member

bkamins commented Jan 15, 2019

Exactly - that is why I think we should decide in the first place if DataFrame((a=1, b=2)) should work or not and then the rest will be a consequence of this decision.

@nalimilan
Copy link
Member

I don't see the advantage of allowing append!, since that's inconsistent with its definition as appending one collection to another. We could throw an error pointing to push!, though.

I'm not sure about allowing DataFrame((a=1, b=2)), since things like DataFrame((a=[1, 2], b=[2, 1])) would then be ambiguous: does it represent a single-row table whose entries are arrays, or a two-row table whose entries are numbers?

@alyst
Copy link
Contributor Author

alyst commented Jan 15, 2019

I don't see the advantage of allowing append!, since that's inconsistent with its definition as appending one collection to another. We could throw an error pointing to push!, though.

+1. I can imagine it would be a common mis-pattern to grow the dataframe in a loop using append!(df, dfr) resulting in abysmal performance. Then there should be append!(::DataFrame, ::DataFrameRow) that throws. For the same reason DataFrame((a=1, b=2)) should not be allowed, because then append!(df, (a=1, b=2)) would be possible (via implicit conversion to dataframe).

@nalimilan
Copy link
Member

A solution would be to require DataFrame([r]) and/or DataFrame((r,)), to indicate that a r represents a row without ambiguity. This should consistent with Tables.jl.

@bkamins
Copy link
Member

bkamins commented Jan 17, 2019

See #1685

@bkamins
Copy link
Member

bkamins commented Jan 18, 2019

We allow for creation of a DataFrame from DataFrameRow as well as push! of a DataFrameRow to a DataFrame (the second operation is strict - i.e. we require that the column names must match exactly, but can have a different order - then pushing from a Dict we allow Dict to have more fields than needed; if it is a problematic restriction we can lift it in the future).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants