Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Emulating Stata's rowtotal #2161

Closed
pdeffebach opened this issue Mar 22, 2020 · 19 comments
Closed

Emulating Stata's rowtotal #2161

pdeffebach opened this issue Mar 22, 2020 · 19 comments
Labels

Comments

@pdeffebach
Copy link
Contributor

pdeffebach commented Mar 22, 2020

I have been playing around with the new select merged into master. And inspired by this discourse thread I have been trying to emulate Stata's rowtotal command.

Stata's rowtotal from egen looks like

egen y = rowtotal(x1 x2 x3)

Where it skips missing values and if all values are missing, sets the result to missing.

In julia, we almost have this behavior with skipmissing except the sum would be 0 instead of missing, which is fine. It's more consistent anyways and can be achieved with some keyword arguments in egen.

In the new select function, I tried to write the following and got an error.

julia> select(df, :, [:x1, :x2] => ByRow(x -> sum(x))  => :y)
ERROR: MethodError: no method matching (::var"#46#47")(::Float64, ::Float64)

The correct syntax is

julia> df = DataFrame(rand(10, 5))
julia> select(df, :, [:x1, :x2] => ByRow((x...,) -> sum(skipmissing(x)))  => :y)

The error comes from the fact that the anonymous function written above only accepts one input argument.

We can't write

julia> select(df, :, [:x1, :x2] => ByRow(+)  => :y)

because there is no way to pass skipmissings to this.

The splatting is a bit awkward and there could be a performance penalty to making a 100-length Tuple. But I don't know how the compiler is handling these cases.

cc @juliohm
cc @jmboehm

EDIT: Any discussion has to also consider the merits of a more complicated transform over something like

julia> df.y = map(eachrow(df[!,[:x1, :x2, :x3]])) do row
              sum(skipmissing(row))
          end

EDIT: There are more concerns when we have an empty collection. Note that

sum(skipmissing(Union{Int, Missing}[missing, missing])) 

works because Base.mapreduce_empty has a method for Int.

However

sum(skipmissing((missing, missing)))

will fail because the collection is empty and there is no "backup" eltype. The latter is what will be called with the working select method above. Something like

julia> select(df, :, [:x1, :x2] => ByRow((x...,) -> sum(skipmissing(collect(x))))  => :y)

will also fail because collect(x) will return Missing[missing, missing] which has no "backup" eltype.

I can't think of any obvious solutions that don't rely on a lot of machinery to make a Vector{Union{T, Missing}} of a row and yet somehow also know what T is.

@nalimilan
Copy link
Member

That's actually one situation where passing a named tuple is more useful since it stores the information about the column eltype:

julia> sum(skipmissing(NamedTuple{(:a,:b), Tuple{Union{Int,Missing}, Union{Int,Missing}}}((missing, missing))))
0

The plan is to allow this via select(df, NT([:x1, :x2]) => ByRow(x -> sum(skipmissing(x))) => newcol). But we need to find a better name for NT.

It's not very intuitive that you'd like to request names in order to be able to compute the sum -- maybe the name could be more general, to indicate that you get an object representing a row. Also it has the drawback that the function needs to be recompiled for each set of names even if you don't care about them.

@pdeffebach
Copy link
Contributor Author

That's an interesting solution. It's a shame this doesn't work for normal Tuples.

julia> t = Tuple{Union{Int, Missing}, Union{Int, Missing}}([1,2])
(1, 2)

julia> typeof(t)
Tuple{Int64,Int64}

I do feel like this is going to come up a lot and making someone make a named tuple is a bit esoteric for new data-oriented users.

@pdeffebach
Copy link
Contributor Author

I will also note that the behavior you describe doesn't work for a DataFrameRow, which should parallel NamedTuple.

You can't even get around it by collecting a DataFrameRow. You have to do convert(Vector, dfr) to get the Union{T, Missing} that is needed for reducing over an empty collection.

@bkamins
Copy link
Member

bkamins commented Mar 22, 2020

  1. splatting large number of arguments should not be a problem for a compiler (at least when they have homogeneous type, which I expect in this case)

  2. I think that rowtotal like function could be considered to be added (in this issue we can discuss the idea) - in general reduce and/or mapreduce with and option of handling as the request so somehow "aggregate" over rows of a data frame pops up quite often (although I would say that the canonical way to do it should be to do stack + by - I do not say it will be faster, but it feels to me more in the spirit of tabular data processing).

  3. The way I was assuming one would write it (disregarding NT for the moment as we plan to add it, and ... is not that bad I think, given what else you have to write):

select(df, :, [:x1, :x2] => ByRow((x...) -> reduce(+, skipmissing(x), init=0.0))  => :y)

or (but this will be slower)

df.y = map(eachrow(df[!,[:x1, :x2, :x3]])) do row
              reduce(+, skipmissing(row), init=0.0)
          end

in both cases I assume it is relatively clear what value init kwarg should take.

@pdeffebach
Copy link
Contributor Author

Thanks for this response. I was not aware of the init value.

Is the fact that Tuples can't have Union types something that can be changed in Base? I presume it is breaking.

I will file a separate issue for DataFrameRow, though.

@nalimilan
Copy link
Member

Is the fact that Tuples can't have Union types something that can be changed in Base? I presume it is breaking.

That's due to covariance of tuple types, and that would be pretty hard to change. I think it's been considered, but not before 2.0 at the very least, and that would probably introduce lots of problems.

What would a hypothetical rowtotal function do? Apart from skipping missing values, it doesn't seem it could fix the problem of finding out the right zero type.

@bkamins
Copy link
Member

bkamins commented Mar 23, 2020

I guess rowtotal would virtually (probably not explicitly as it should be possible to do it more efficiently) do:

rowtotal(fun, df) = fun.(skipmissing.(eachrow(Matrix(df)))

where conversion to Matrix ensures that promote_type is properly applied.

or maybe even just:

rowtotal(fun, df) = fun.(skipmissing.(Tables.namedtupleiterator(df)))

would be good enough?

@pdeffebach
Copy link
Contributor Author

What would a hypothetical rowtotal function do? Apart from skipping missing values, it doesn't seem it could fix the problem of finding out the right zero type.

I'm not proposing to write a function called rowtotal, though I'm open to the idea. Rather, I'm hoping that we can make an easy way for skipmissing to work. The init keyword in reduce is a good idea, but I'm still wondering if there is more we can do behind the scenes to help the user.

@pdeffebach
Copy link
Contributor Author

pdeffebach commented Mar 23, 2020

Here is a good idea that might solve this: passmissing, but which coalesces missing values to a default value.

select(df, :, [:x1, :x2, :x3] => coalescefun(+, 0) => :y)

EDIT: This would only work for rowtotal but would not be able to emulate rowmean.

@pdeffebach
Copy link
Contributor Author

pdeffebach commented Mar 23, 2020

I think @nalimilan's solution of wrapping it in a NamedTuple is the easiest solution for now...

Was there any discussion of having it be the default? And then having the user splat manually?

To which @bkamins replied

No, I mean that named tuple (a=1, b=[1,2,3]) would get expanded to (a=[1,1,1], b=[1,2,3]) before further processing.

solution of wrapping it in a NamedTuple is the easiest solution for now...

This was heavily discussed and the conclusion was that splatting by default is better because:

  • it will cause less recompilation
  • virtually no built-in functions accept a NamedTuple as an argument

Still, we want to provide that option. We just need a good name for the wrapper.

@pdeffebach
Copy link
Contributor Author

I wonder if the naming should be done as an alternative to ByRow. Something like RowNT. That would cut down on the number of combinations that could overwhelm the user.

@floswald
Copy link
Contributor

or maybe even just:

rowtotal(fun, df) = fun.(skipmissing.(Tables.namedtupleiterator(df)))

would be good enough?

I think that would a good solution, yes. I'm assuming that this would iterate over rows of df.

@bkamins
Copy link
Member

bkamins commented Mar 23, 2020

Something like RowNT

NT would wrap the source not a function, so you would write e.g. NT(cols) => ByRow(fun), as the form of passing the values is independent form whether we process whole columns or single rows.

@bkamins
Copy link
Member

bkamins commented Mar 23, 2020

fun.(skipmissing.(Tables.namedtupleiterator(df))) is something you already can do now:

julia> df = DataFrame([1 2 missing
                       missing missing missing
                       missing 2 3])
3×3 DataFrame
│ Row │ x1      │ x2      │ x3      │
│     │ Int64?  │ Int64?  │ Int64?  │
├─────┼─────────┼─────────┼─────────┤
│ 1   │ 1       │ 2       │ missing │
│ 2   │ missing │ missing │ missing │
│ 3   │ missing │ 2       │ 3       │

julia> sum.(skipmissing.(Tables.namedtupleiterator(df)))
3-element Array{Int64,1}:
 3
 0
 5

it is just a question if we want to have a special function providing this functionality (or just assume that users would use the long form)

@pdeffebach
Copy link
Contributor Author

NT would wrap the source not a function, so you would write e.g. NT(cols) => ByRow(fun), as the form of passing the values is independent form whether we process whole columns or single rows.

Yes I see. I forgot that one still might want to pass a named tuple of columns. That syntax wouldn't just work for rows.

I think that would a good solution, yes. I'm assuming that this would iterate over rows of df.

I agree this would be easy to do. However I would really like to get a solution that works elegantly in the new select, transform workflow.

I will try and think of a better name for NT, something that is more intuitive for the user I think.

@floswald
Copy link
Contributor

I had no idea that this works. this is the functionality of rowtotal. sure if we can make it shorter, and work with transform and select, even better.

julia> df.y = sum.(skipmissing.(Tables.namedtupleiterator(select(df,[:x2, :x3])))
              )
3-element Array{Int64,1}:
 2
 0
 5

julia> df
3×4 DataFrame
│ Row │ x1      │ x2      │ x3      │ y     │
│     │ Int64?  │ Int64?  │ Int64?  │ Int64 │
├─────┼─────────┼─────────┼─────────┼───────┤
│ 1   │ 1       │ 2       │ missing │ 2     │
│ 2   │ missing │ missing │ missing │ 0     │
│ 3   │ missing │ 2       │ 3       │ 5     │

@pdeffebach
Copy link
Contributor Author

I'm seeing the value in NT better. People aren't realistically going to use x -> (x...,) in their function for ByRow. At the very least, it's as easy to type as NT.

Additionally, the only function that you would both apply skipmissing to would be sum, mean, var, etc. There is no equivelent of +(args...) for mean.

I still think that maybe the following could work

select(df, :, [:x1, :x2] => fun)
select(df, :, [:x1, :x2] => NT(fun))
select(df, :, [:x1, :x2] => ByRow(fun))
select(df, :, [:x1, :x2] => ByRowNT(fun))

@bkamins
Copy link
Member

bkamins commented Apr 4, 2020

Just to be clear - the conclusion is that if we add an option to pass NamedTuple to function this issue is resolved?

PS. my current proposal for the name of the wrapper is AsTable which is not very long and I hope explicit in what it promises to pass conceptually.

@pdeffebach
Copy link
Contributor Author

clused with the new AsTable syntax #2183

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

4 participants