Automatically fill in `select` and `combine` for scalars? #2162

pdeffebach · 2020-03-22T16:45:07Z

More thoughts from this discourse thread.

julia> df = DataFrame(rand(10, 5))
julia> select(df, :, [:x1] => mean => :y)
ERROR: ArgumentError: New columns must have the same length as old columns

In Stata,

gen mean_x = mean(x)

would spread the argument for you, creating a full column of just mean(x)

I also seem to remember that DataFramesMeta used to have this behavior, but now it throws an error.

The text was updated successfully, but these errors were encountered:

bkamins · 2020-03-22T21:38:52Z

This was intentional because combine does the same:

julia> df.y = [ones(5); zeros(5)];

julia> by(df, :y, :x3 => x -> 1, :x4 => identity)
ERROR: ArgumentError: all functions must return values of the same length

However, I have also been considering if we should not add this as a third case of automatic broadcasting that is allowed (other two are in DataFrame and in insertcols!). Since this is non-breaking I have opted to disallow it and we can later allow it in both select and combine if we want. @nalimilan probably has also some thoughts here (as he designed the combine part).

There are special cases of several types that would be excluded for select in the process even if we allowed it (AbstractDataFrame, NamedTuple, DataFrameRow, AbstractMatrix) due to compatibility with combine and future extensions, but this should not be problematic I think.

pdeffebach · 2020-03-22T23:01:30Z

Thanks for this.

I understand more the reasoning. If you have one function returning a scalar and another returns a vector, there is ambiguiity.

I agree that AbstractDataFrame and friends should be added eventually so they should throw errors for now. But they are already special-cased to throw errors.

I'm not sure about SemVer but if this is a post-1.0 thing we can wait.

bkamins · 2020-03-22T23:11:36Z

It is not hard to add, the question is if we want to add it.

If you have one function returning a scalar and another returns a vector, there is ambiguiity.

There is a kind of "defensive" ambiguity now (i.e. we do not implicitly broadcast). But we could without a problem, i.e. all functions would have to return either one element or vectors of the same length (just like in broadcasting). We would just have to check if it would not have significant consequences for the performance of combine (@nalimilan can tell best, but I feel that this would be doable, although current design of combine separates row and table oriented return values from passed functions).

Again just two caveats here about corner cases (so that we do not forget them when we come back to this issue):

we shoud then probably unwrap Ref and 0-dimensional arrays
we should not broadcast vector of length 1 to fit the dimensionality (broadcast does this, but we do not in DataFrame and insertcols!)

nalimilan · 2020-03-23T10:42:31Z

Yes it would make sense to recycle scalars at least (and probably Ref and 0D arrays).

It shouldn't have a big impact on performance of existing uses since we specialize the function on the column types, so any type checks we add will be resolved statically by the compiler. But the implementation will be a bit tricky as cases mixing arrays and scalars will have to be identified and scalars promoted to arrays of the correct length before calling the method that operates on arrays (simple implementation: a smarter one would avoid allocating intermediate arrays but I'm not sure how complex it would be).

bkamins · 2020-03-23T11:09:11Z

Just some additional notes:

NamedTuple of scalars should be broadcasted also and NamedTuple mixing scalars and vectors should get expanded);
a tricky part (not a problem in select) is if a function would start mixing scalars and vectors in its return value for a single column - the question is if this should be allowed

But I guess we can leave adding this in combine for later (first I would like to finalize the PR normalizing the API) and add this functionality to select now. I will open a PR for select

pdeffebach · 2020-03-23T12:29:16Z

NamedTuple of scalars should be broadcasted also and NamedTuple mixing scalars and vectors should get expanded);

If this is the easiest way to go, then I'm for it. But broadcasting NamedTuples of scalars would lose the names, which the user probably wanted. We could disallow it if there is ambiguity about that.

@transform on a grouped data frame disallows mixing scalars and vectors. I think we could only fill in scalars if everything is non-broadcastable.

bkamins · 2020-03-23T12:56:33Z

If this is the easiest way to go, then I'm for it.

It is not easiest, but it is a consistent way to go 😄.

But broadcasting NamedTuples of scalars would lose the names

It would keep the names and there is no ambiguity here. We have clear rules how we perform pseudo-broadcasting in DataFrames.jl

@transform on a grouped data frame disallows mixing scalars and vectors

The whole point of this issue is to start allowing this if I understand things correctly. Right?

pdeffebach · 2020-03-23T12:58:28Z

Well, I didn't think about "mixing" per-se. Just filling in scalars for one function call.

select(df, :, :x1 => mean)

It would keep the names and there is no ambiguity here. We have clear rules how we perform pseudo-broadcasting in DataFrames.jl

You mean make a column of Symbols? I think that's a good idea. Maybe strings to avoid allocations.

bkamins · 2020-03-23T13:06:20Z

You mean make a column of Symbols?

No, I mean that named tuple (a=1, b=[1,2,3]) would get expanded to (a=[1,1,1], b=[1,2,3]) before further processing.

solution of wrapping it in a NamedTuple is the easiest solution for now...

This was heavily discussed and the conclusion was that splatting by default is better because:

it will cause less recompilation
virtually no built-in functions accept a NamedTuple as an argument

Still, we want to provide that option. We just need a good name for the wrapper.

pdeffebach · 2020-03-23T13:07:50Z

No, I mean that named tuple (a=1, b=[1,2,3]) would get expanded to (a=[1,1,1], b=[1,2,3]) before further processing.

Yes this is great.

This was heavily discussed and the conclusion was that splatting by default is better because:

Sorry! that comment means to go in #2161. Will repost.

bkamins added breaking The proposed change is breaking. decision labels Mar 22, 2020

bkamins added this to the 1.x milestone Mar 22, 2020

bkamins changed the title ~~Automatically fill in select for scalars?~~ Automatically fill in select and combine for scalars? Mar 23, 2020

This comment has been minimized.

Sign in to view

bkamins mentioned this issue Mar 23, 2020

[BREAKING] allow pseudo-broadcasting in select #2165

Merged

bkamins linked a pull request Apr 6, 2020 that will close this issue

[BREAKING] allow pseudo-broadcasting in select #2165

Merged

bkamins closed this as completed in #2165 Apr 8, 2020

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Automatically fill in `select` and `combine` for scalars? #2162

Automatically fill in `select` and `combine` for scalars? #2162

pdeffebach commented Mar 22, 2020

bkamins commented Mar 22, 2020

pdeffebach commented Mar 22, 2020

bkamins commented Mar 22, 2020

nalimilan commented Mar 23, 2020

bkamins commented Mar 23, 2020 •

edited

Loading

pdeffebach commented Mar 23, 2020

bkamins commented Mar 23, 2020

pdeffebach commented Mar 23, 2020

This comment has been minimized.

bkamins commented Mar 23, 2020

pdeffebach commented Mar 23, 2020

Automatically fill in select and combine for scalars? #2162

Automatically fill in select and combine for scalars? #2162

Comments

pdeffebach commented Mar 22, 2020

bkamins commented Mar 22, 2020

pdeffebach commented Mar 22, 2020

bkamins commented Mar 22, 2020

nalimilan commented Mar 23, 2020

bkamins commented Mar 23, 2020 • edited Loading

pdeffebach commented Mar 23, 2020

bkamins commented Mar 23, 2020

pdeffebach commented Mar 23, 2020

This comment has been minimized.

bkamins commented Mar 23, 2020

pdeffebach commented Mar 23, 2020

Automatically fill in `select` and `combine` for scalars? #2162

Automatically fill in `select` and `combine` for scalars? #2162

bkamins commented Mar 23, 2020 •

edited

Loading