Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

@groupby failed on CategoricalString with missing value #278

Open
AnselmJeong opened this issue Aug 18, 2019 · 8 comments
Open

@groupby failed on CategoricalString with missing value #278

AnselmJeong opened this issue Aug 18, 2019 · 8 comments

Comments

@AnselmJeong
Copy link

Let's say, we try to @groupby on a CategoricalString column

df = DataFrame(
    fruit    =  ["Apple", "Banana", "Cherry", "Apple", "Banana", "Cherry"],
    price    =  [1.2, 2.0, 0.4, 1.2, 2.0, 0.4])

categorical!(df, :fruit)
df |> @groupby(_.fruit) |> @map({key=key(_), m=mean(_.price)})

The result is exactly as expected.

3×2 DataFrame
│ Row │ key          │ m       │
│     │ Categorical… │ Float64 │
├─────┼──────────────┼─────────┤
│ 1   │ Apple        │ 1.2     │
│ 2   │ Banana       │ 2.0     │
│ 3   │ Cherry       │ 0.4     │

Next, we make the same column contain some missing values.

df = DataFrame(
    fruit    =  ["Apple", missing, "Cherry", "Apple", "Banana", "Cherry"],
    price    =  [1.2, 2.0, 0.4, 1.2, 2.0, 0.4])

categorical!(df, :fruit)
df |> @groupby(_.fruit) |> @map({key=key(_), m=mean(_.price)})

Then, if failed with an error message stating somewhat like this.

ERROR: MethodError: convert(::Type{DataValue{CategoricalString{UInt32}}}, ::CategoricalString{UInt32}) is ambiguous

However, this behavior is weird since String column with similar missing values does not issue any complaints.

df = DataFrame(
    fruit    =  ["Apple", missing, "Cherry", "Apple", "Banana", "Cherry"],
    price    =  [1.2, 2.0, 0.4, 1.2, 2.0, 0.4])

df |> @groupby(_.fruit) |> @map({key=key(_), m=mean(_.price)})
key      │ m
─────────┼────
"Apple"  │ 1.2
#NA      │ 2.0
"Cherry" │ 0.4
"Banana" │ 2.0

This limitation may have to be fixed, since it is natural that String columns are recoded as Categorical columns before beginning any analysis.

Thanks.

@AnselmJeong
Copy link
Author

AnselmJeong commented Aug 18, 2019

Similar behavior found on @filter missing values.

df = DataFrame(
    fruit    =  ["Apple", missing, "Cherry", "Apple", "Banana", "Cherry"],
    price    =  [1.2, 2.0, 0.4, 1.2, 2.0, 0.4])

categorical!(df, :fruit)
df |> @filter(!isna(_.fruit)) |> DataFrame

It fails with the same ERROR message:

LoadError: MethodError: convert(::Type{DataValue{CategoricalString{UInt32}}}, ::CategoricalString{UInt32}) is ambiguous

This example works without any error if String colums were not converted into Categorical columns or if there were not any missing values.

The same kind of error persists with simpler types of filtering.

df |> @filter(_.fruit == "Banana") |> DataFrame

Also fails with the same ERROR message.

Conclusively, @filter or @groupby seems not work with Categorical variables with missing value.

Thanks.

@greimel
Copy link

greimel commented Aug 20, 2019

Same for

df |> @select(:price)

or

df |> @filter(true)

(with the above DataFrame).

So it seems that the problem occurs when transforming the DataFrame to a Query.jl iterator.

@davidanthoff
Copy link
Member

This is actually unrelated to Query.jl, one can get the same error with:

using IteratorInterfaceExtensions

it = getiterator(df)

first(it)

The implementation for that lives in Tables.jl.

The root problem though seems to be that both DataValues.jl and CategoricalArrays.jl add methods to convert that don't play well together.

I'm a bit lost what to do here, maybe @quinnj has an idea? Also CCing @nalimilan, the resident categorical expert. Is this a case that needs special casing in Tables.jl?

@nalimilan
Copy link

I haven't been able to reproduce locally (even after updating all packages). Is there anything special I need to do? I have DataValues 0.4.12.

@davidanthoff
Copy link
Member

Here is what I'm using:

(repo1) pkg> st
    Status `C:\Users\david\.julia\environments\repo1\Project.toml`
  [a93c6f00] DataFrames v0.19.2
  [e7dc6d0d] DataValues v0.4.12
  [82899510] IteratorInterfaceExtensions v1.0.0

And the code:

julia> using DataFrames, IteratorInterfaceExtensions, DataValues

julia> df = DataFrame(fruit= ["Apple", missing, "Cherry", "Apple", "Banana", "Cherry"], price= [1.2, 2.0, 0.4, 1.2, 2.0, 0.4]);

julia> categorical!(df, :fruit);

julia> it = getiterator(df);

julia> first(it)
ERROR: MethodError: convert(::Type{DataValue{CategoricalString{UInt32}}}, ::CategoricalString{UInt32}) is ambiguous. Candidates:
  convert(::Type{S}, x::T) where {S, T<:(Union{CategoricalString{R}, CategoricalValue{T,R} where T} where R)} in CategoricalArrays at C:\Users\david\.julia\packages\CategoricalArrays\xjesC\src\value.jl:91
  convert(::Type{DataValue{T}}, x::T) where T in DataValues at C:\Users\david\.julia\packages\DataValues\XQWvG\src\scalar\core.jl:31
Possible fix, define
  convert(::Type{DataValue{T<:(Union{CategoricalString{R}, CategoricalValue{T,R} where T} where R)}}, ::T<:(Union{CategoricalString{R}, CategoricalValue{T,R} where T} where R))
Stacktrace:
 [1] macro expansion at C:\Users\david\.julia\packages\Tables\FXXeK\src\tofromdatavalues.jl:102 [inlined]
 [2] iterate(::Tables.DataValueRowIterator{NamedTuple{(:fruit, :price),Tuple{DataValue{CategoricalString{UInt32}},Float64}},Tables.RowIterator{NamedTuple{(:fruit, :price),Tuple{CategoricalArray{Union{Missing, String},1,UInt32,String,CategoricalString{UInt32},Missing},Array{Float64,1}}}}}, ::Tuple{}) at C:\Users\david\.julia\packages\Tables\FXXeK\src\tofromdatavalues.jl:96 (repeats 2 times)
 [3] first(::Tables.DataValueRowIterator{NamedTuple{(:fruit, :price),Tuple{DataValue{CategoricalString{UInt32}},Float64}},Tables.RowIterator{NamedTuple{(:fruit, :price),Tuple{CategoricalArray{Union{Missing, String},1,UInt32,String,CategoricalString{UInt32},Missing},Array{Float64,1}}}}}) at .\abstractarray.jl:342
 [4] top-level scope at REPL[11]:1 

Make sure you load DataValues, otherwise Tables.jl will use a different code path.

@nalimilan
Copy link

Still no error with these versions:

(v1.2) pkg> st
    Status `~/.julia/environments/v1.2/Project.toml`
  [6e4b80f9] BenchmarkTools v0.4.2
  [336ed68f] CSV v0.5.0 [`~/.julia/dev/CSV`]
  [324d7699] CategoricalArrays v0.5.5 [`~/.julia/dev/CategoricalArrays`]
  [a93c6f00] DataFrames v0.19.2
  [e7dc6d0d] DataValues v0.4.12
  [82899510] IteratorInterfaceExtensions v1.0.0
  [2dfb63ee] PooledArrays v0.5.2+ [`~/.julia/dev/PooledArrays`]
  [1a8c2f83] Query v0.12.1
  [295af30f] Revise v1.0.2
  [bd369af6] Tables v0.2.6

@davidanthoff
Copy link
Member

@nalimilan: here is a reproducer repo that you can run with this: Binder

@nalimilan
Copy link

Thanks. So this is indeed a legitimate ambiguity which cannot be fixed separately in either package. I think @quinnj's JuliaLang/julia#32613 should help fixing this, but that won't work for current Julia versions. Maybe we should define that unwrap method in Compat or in DataAPI. But first we should bump that PR so that at least it's merged in master.

Do you think this is a new problem? I still don't understand why I didn't observe it in my original tests.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

4 participants