Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Automatically fill in select and combine for scalars? #2162

Closed
pdeffebach opened this issue Mar 22, 2020 · 11 comments · Fixed by #2165
Closed

Automatically fill in select and combine for scalars? #2162

pdeffebach opened this issue Mar 22, 2020 · 11 comments · Fixed by #2165
Labels
breaking The proposed change is breaking. decision
Milestone

Comments

@pdeffebach
Copy link
Contributor

More thoughts from this discourse thread.

julia> df = DataFrame(rand(10, 5))
julia> select(df, :, [:x1] => mean => :y)
ERROR: ArgumentError: New columns must have the same length as old columns

In Stata,

gen mean_x = mean(x)

would spread the argument for you, creating a full column of just mean(x)

I also seem to remember that DataFramesMeta used to have this behavior, but now it throws an error.

@bkamins bkamins added breaking The proposed change is breaking. decision labels Mar 22, 2020
@bkamins bkamins added this to the 1.x milestone Mar 22, 2020
@bkamins
Copy link
Member

bkamins commented Mar 22, 2020

This was intentional because combine does the same:

julia> df.y = [ones(5); zeros(5)];

julia> by(df, :y, :x3 => x -> 1, :x4 => identity)
ERROR: ArgumentError: all functions must return values of the same length

However, I have also been considering if we should not add this as a third case of automatic broadcasting that is allowed (other two are in DataFrame and in insertcols!). Since this is non-breaking I have opted to disallow it and we can later allow it in both select and combine if we want. @nalimilan probably has also some thoughts here (as he designed the combine part).

There are special cases of several types that would be excluded for select in the process even if we allowed it (AbstractDataFrame, NamedTuple, DataFrameRow, AbstractMatrix) due to compatibility with combine and future extensions, but this should not be problematic I think.

@pdeffebach
Copy link
Contributor Author

Thanks for this.

I understand more the reasoning. If you have one function returning a scalar and another returns a vector, there is ambiguiity.

I agree that AbstractDataFrame and friends should be added eventually so they should throw errors for now. But they are already special-cased to throw errors.

I'm not sure about SemVer but if this is a post-1.0 thing we can wait.

@bkamins
Copy link
Member

bkamins commented Mar 22, 2020

It is not hard to add, the question is if we want to add it.

If you have one function returning a scalar and another returns a vector, there is ambiguiity.

There is a kind of "defensive" ambiguity now (i.e. we do not implicitly broadcast). But we could without a problem, i.e. all functions would have to return either one element or vectors of the same length (just like in broadcasting). We would just have to check if it would not have significant consequences for the performance of combine (@nalimilan can tell best, but I feel that this would be doable, although current design of combine separates row and table oriented return values from passed functions).

Again just two caveats here about corner cases (so that we do not forget them when we come back to this issue):

  • we shoud then probably unwrap Ref and 0-dimensional arrays
  • we should not broadcast vector of length 1 to fit the dimensionality (broadcast does this, but we do not in DataFrame and insertcols!)

@nalimilan
Copy link
Member

Yes it would make sense to recycle scalars at least (and probably Ref and 0D arrays).

It shouldn't have a big impact on performance of existing uses since we specialize the function on the column types, so any type checks we add will be resolved statically by the compiler. But the implementation will be a bit tricky as cases mixing arrays and scalars will have to be identified and scalars promoted to arrays of the correct length before calling the method that operates on arrays (simple implementation: a smarter one would avoid allocating intermediate arrays but I'm not sure how complex it would be).

@bkamins bkamins changed the title Automatically fill in select for scalars? Automatically fill in select and combine for scalars? Mar 23, 2020
@bkamins
Copy link
Member

bkamins commented Mar 23, 2020

Just some additional notes:

  • NamedTuple of scalars should be broadcasted also and NamedTuple mixing scalars and vectors should get expanded);
  • a tricky part (not a problem in select) is if a function would start mixing scalars and vectors in its return value for a single column - the question is if this should be allowed

But I guess we can leave adding this in combine for later (first I would like to finalize the PR normalizing the API) and add this functionality to select now. I will open a PR for select

@pdeffebach
Copy link
Contributor Author

  • NamedTuple of scalars should be broadcasted also and NamedTuple mixing scalars and vectors should get expanded);

If this is the easiest way to go, then I'm for it. But broadcasting NamedTuples of scalars would lose the names, which the user probably wanted. We could disallow it if there is ambiguity about that.

@transform on a grouped data frame disallows mixing scalars and vectors. I think we could only fill in scalars if everything is non-broadcastable.

@bkamins
Copy link
Member

bkamins commented Mar 23, 2020

If this is the easiest way to go, then I'm for it.

It is not easiest, but it is a consistent way to go 😄.

But broadcasting NamedTuples of scalars would lose the names

It would keep the names and there is no ambiguity here. We have clear rules how we perform pseudo-broadcasting in DataFrames.jl

@transform on a grouped data frame disallows mixing scalars and vectors

The whole point of this issue is to start allowing this if I understand things correctly. Right?

@pdeffebach
Copy link
Contributor Author

Well, I didn't think about "mixing" per-se. Just filling in scalars for one function call.

select(df, :, :x1 => mean)

It would keep the names and there is no ambiguity here. We have clear rules how we perform pseudo-broadcasting in DataFrames.jl

You mean make a column of Symbols? I think that's a good idea. Maybe strings to avoid allocations.

@pdeffebach

This comment has been minimized.

@bkamins
Copy link
Member

bkamins commented Mar 23, 2020

You mean make a column of Symbols?

No, I mean that named tuple (a=1, b=[1,2,3]) would get expanded to (a=[1,1,1], b=[1,2,3]) before further processing.

solution of wrapping it in a NamedTuple is the easiest solution for now...

This was heavily discussed and the conclusion was that splatting by default is better because:

  • it will cause less recompilation
  • virtually no built-in functions accept a NamedTuple as an argument

Still, we want to provide that option. We just need a good name for the wrapper.

@pdeffebach
Copy link
Contributor Author

No, I mean that named tuple (a=1, b=[1,2,3]) would get expanded to (a=[1,1,1], b=[1,2,3]) before further processing.

Yes this is great.

This was heavily discussed and the conclusion was that splatting by default is better because:

Sorry! that comment means to go in #2161. Will repost.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
breaking The proposed change is breaking. decision
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants