Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Allow multicolumn transformations for AbstractDataFrame #2461

Merged
merged 21 commits into from
Oct 9, 2020

Conversation

bkamins
Copy link
Member

@bkamins bkamins commented Oct 1, 2020

This PR partially addresses #2410 and #2457.

It covers select etc. for AbstractDataFrame.

If we are OK with the functionality I will update the documentation.

TODO:

  • do the same for select etc. for GroupedDataFrame (this will be a separate PR to keep PRs more atomic)
  • add support for ByRow with no columns passed to filter (also a separate PR)

CC @nalimilan @pdeffebach @matthieugomez - this is a rather complex PR so independent testing (especially for corner cases) would be welcome (if you would have suggestions for types of tests to add please comment and I will add them).

@bkamins bkamins added breaking The proposed change is breaking. priority labels Oct 1, 2020
@bkamins bkamins added this to the 1.0 milestone Oct 1, 2020
Copy link
Member

@nalimilan nalimilan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, that's impressive. I haven't looked at the tests in detail yet, feel free to point me at interesting cases that I may have missed.

src/abstractdataframe/selection.jl Outdated Show resolved Hide resolved
src/abstractdataframe/selection.jl Show resolved Hide resolved
src/abstractdataframe/selection.jl Outdated Show resolved Hide resolved
src/abstractdataframe/selection.jl Outdated Show resolved Hide resolved
src/abstractdataframe/selection.jl Outdated Show resolved Hide resolved
src/abstractdataframe/selection.jl Outdated Show resolved Hide resolved
test/select.jl Outdated Show resolved Hide resolved
test/select.jl Outdated Show resolved Hide resolved
test/select.jl Show resolved Hide resolved
test/select.jl Show resolved Hide resolved
@pdeffebach
Copy link
Contributor

Could you please clarify what AsTable is? It's odd to see both AsTable([:a, :b]) and AsTable that isn't called also. Is this a common pattern? Is it odd to dispatch on a data type and not an instance? Is there a better way to make an instance?

@bkamins
Copy link
Member Author

bkamins commented Oct 2, 2020

Could you please clarify what AsTable is?

it is a type. types in Julia are also values, and we use this fact here

It's odd to see both AsTable([:a, :b]) and AsTable that isn't called also.

The first is instance of a type the second is a type (AsTable is not instantiated - so it kind-of says it is yet undecided what column names will actually be)

Is this a common pattern?

What do you mean by "pattern"? In general the transformation mini-language is DataFrames.jl specific. What is important is that AsTable does not collide with other column selectors.

Is it odd to dispatch on a data type and not an instance?

We do not dispatch on type (actually if you look at the implementation there is a problem with this - we have to dynamically check for AsTable as a value)

Is there a better way to make an instance?

We could use AsTable() or astable that would be AsTable(), but the question is if it would be more readable. Actually it was not me who invented using AsTable as the DST but I came to the conclusion that it is not a problem to use it.


In summary: it is not a common pattern, but do you have a better proposal what to use instead?

The benefit of this approach is:

  • it is easy to type (which I guess Stata users like 😄)
  • it does not introduce a new symbol to learn

(and just to stress - we do not dispatch on AsTable as DST - it is a run-time check)

This situation is kind-of similar to : which is Colon() which is a function (and actually we have to also special case it), but how many times when you write : you think of it as a function?, so when you write vector[:] if Julia designer would be fancy it could be vector[sin] (fortunately it is not 😄).

@bkamins
Copy link
Member Author

bkamins commented Oct 2, 2020

As an aferthought: we could use : instead of AsTable there - as : also has a meaning "take everything". Do you think : would be clearer there than AsTable (for sure it would be shorter). So you could even write:

(:) => (:) => :

and it would be a valid transformation specification (note that there is no need of parens for trailing : here, which is fortunate)

@pdeffebach
Copy link
Contributor

Thank you for the clarification. I'm glad I have understanding of your thought process, here.

I think AsTable() would be more readable, but I'm not sure. I'm not a fan of mixing instances and types like this, but I don't know if others share my viewpoint. It's also not clear what the empty parenthesis mean in this context, as clearly there are new variables.

@bkamins
Copy link
Member Author

bkamins commented Oct 2, 2020

@nalimilan - what do you think? I dislike AsTable() as it normally indicates zero columns in SRC.

EDIT: sorry, actually AsTable() is disallowed now, but still I am not very much convinced of allowing it.

@nalimilan
Copy link
Member

I prefer AsTable over AsTable(), as the latter could mean that you expect zero columns. : is an interesting option too, though it's less explicit and it could mean "replace all existing columns with the output".

@bkamins
Copy link
Member Author

bkamins commented Oct 2, 2020

Let us keep AsTable then.

@pdeffebach
Copy link
Contributor

Sounds good. Perhaps the best mental model is for it to be a Callable that acts as function barrier for correct types. We already have a similar logic with the fun in src => fun => dest.

Co-authored-by: Milan Bouchet-Valat <nalimilan@club.fr>
@pdeffebach
Copy link
Contributor

Is the following expected behavior?

julia> df = DataFrame(a = [1, 2], b = [3, 4]);

julia> transform(df; renamecols = true) do sdf
           sdf
       end
2×2 DataFrame
│ Row │ a     │ b     │
│     │ Int64 │ Int64 │
├─────┼───────┼───────┤
│ 1   │ 1     │ 3     │
│ 2   │ 2     │ 4     │

I would have thought with the renamecols = true we would have columns :a, :b, :a1, :b1

@pdeffebach
Copy link
Contributor

Additionally, should the following work?

julia> transform(df, foo)
2×3 DataFrame
│ Row │ a     │ b     │ x1    │
│     │ Int64 │ Int64 │ Int64 │
├─────┼───────┼───────┼───────┤
│ 1   │ 1     │ 3     │ 1     │
│ 2   │ 2     │ 4     │ 1     │
julia> df = DataFrame(a = [1, 2], b = [3, 4]);

julia> foo(df) = fill(1, nrow(df));

julia> transform(df, foo => :a)
ERROR: ArgumentError: Unrecognized column selector: foo => :a

@bkamins
Copy link
Member Author

bkamins commented Oct 3, 2020

I would have thought with the renamecols = true we would have columns :a, :b, :a1, :b1

No - this would be makeunique, but we do not support it in transformations (it would be ambiguous if you want to overwrite columns vs add new columns with de-duplicated names).

renamecols handles the way how automatic column generation works (if a suffix with function name is added or not)

julia> transform(df, foo)
2×3 DataFrame
│ Row │ a │ b │ x1 │
│ │ Int64 │ Int64 │ Int64 │
├─────┼───────┼───────┼───────┤
│ 1 │ 1 │ 3 │ 1 │
│ 2 │ 2 │ 4 │ 1 |

What is foo here (you define it below this call)? If it is foo(x) = 1 or foo(x) = [1,1] or foo(x) = ones(Int, 2, 1) then this is expected

julia> transform(df, foo => :a)
ERROR: ArgumentError: Unrecognized column selector: foo => :a

This is also expected - and follows your request to disallow FUN => DST syntax. Only FUN or SRC => FUN => DST or SRC => FUN syntaxes are allowed.

In general FUN syntax is mainly intended for SELECT(FUN, DF) style, but we support it in SELECT(DF, FUN) style as it is not a problem to support it.

@pdeffebach
Copy link
Contributor

Thanks, this is all very clear.

What is foo here (you define it below this call)? If it is foo(x) = 1 or foo(x) = [1,1] or foo(x) = ones(Int, 2, 1) then this is expected

Yes that is expected.

This is a really impressive work! Really appreciate it and the thought you've put into this.

@pdeffebach
Copy link
Contributor

pdeffebach commented Oct 3, 2020

I think a makeunique option where all the columns are not over-written by default would be useful in the future. But this is fine for now, especially since we don't support makeunique in existing transform.

src/abstractdataframe/selection.jl Outdated Show resolved Hide resolved
src/abstractdataframe/selection.jl Outdated Show resolved Hide resolved
@bkamins bkamins mentioned this pull request Oct 4, 2020
src/abstractdataframe/selection.jl Outdated Show resolved Hide resolved
test/select.jl Outdated Show resolved Hide resolved
test/select.jl Outdated Show resolved Hide resolved
Co-authored-by: Milan Bouchet-Valat <nalimilan@club.fr>
@pdeffebach
Copy link
Contributor

Why allow returning matrices at all if we are deprecating the matrix constructor? For the transform case, if you can't call DataFrame on it, then we shouldn't support it. That seems like the most consistent approach.

@bkamins
Copy link
Member Author

bkamins commented Oct 4, 2020

Why allow returning matrices at all

Only for backward compatibility reasons. Note that we will not disallow returning them. The only question is what happens with them and we have two options:

  1. leave them "as is" when there is no AsTable (i.e. repeat as many times as there are rows without copying) and throw an error if there is AsTable
  2. throw an error when there is no AsTable and expand to multiple columns when there is AsTable

I was thinking which behavior the user would prefer when returning a matrix and I thought that the second is more natural. Would you prefer the first?

In general - under current rules the only case when we throw an error is NamedTuple that mixes scalars and vectors.


Note that this is a different case from what we discuss with @nalimilan, as he has raised a case when ByRow returns a matrix and there is AsTable at the end.

@pdeffebach
Copy link
Contributor

  • leave them "as is" when there is no AsTable (i.e. repeat as many times as there are rows without copying) and throw an error if there is AsTable
  • throw an error when there is no AsTable and expand to multiple columns when there is AsTable

The first option reminds me of R too much. The silently repetition is a source of bugs. The user can always pass it as a Ref if they want it to be repeated, right? I prefer option 2.

@bkamins
Copy link
Member Author

bkamins commented Oct 4, 2020

So option 2 is what we currently have 😄.

@bkamins
Copy link
Member Author

bkamins commented Oct 5, 2020

I have updated the documentation (so essentially when we accept this this should be good to be merged).

@nalimilan - as usual - feel free to rewrite the docstrings 😄 (and sorry for mistakes, as for sure there will be some).

src/abstractdataframe/selection.jl Outdated Show resolved Hide resolved
src/abstractdataframe/selection.jl Outdated Show resolved Hide resolved
src/abstractdataframe/selection.jl Outdated Show resolved Hide resolved
src/abstractdataframe/selection.jl Outdated Show resolved Hide resolved
src/abstractdataframe/selection.jl Outdated Show resolved Hide resolved
src/abstractdataframe/selection.jl Outdated Show resolved Hide resolved
src/abstractdataframe/selection.jl Outdated Show resolved Hide resolved
src/abstractdataframe/selection.jl Outdated Show resolved Hide resolved
src/abstractdataframe/selection.jl Show resolved Hide resolved
src/abstractdataframe/selection.jl Outdated Show resolved Hide resolved
bkamins and others added 2 commits October 7, 2020 09:14
Co-authored-by: Milan Bouchet-Valat <nalimilan@club.fr>
src/abstractdataframe/selection.jl Outdated Show resolved Hide resolved
src/abstractdataframe/selection.jl Show resolved Hide resolved
Co-authored-by: Milan Bouchet-Valat <nalimilan@club.fr>
@bkamins
Copy link
Member Author

bkamins commented Oct 8, 2020

Thank you for all the comments. If there are no more issues with this proposal I will merge the PR tomorrow and follow up with a small filter and a big GroupedDataFrame PRs.

@bkamins bkamins merged commit 4ec8009 into JuliaData:master Oct 9, 2020
@bkamins bkamins deleted the flexible_transforms branch October 9, 2020 07:38
@bkamins
Copy link
Member Author

bkamins commented Oct 9, 2020

Thank you!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
breaking The proposed change is breaking. priority
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants