@groupby failed on CategoricalString with missing value #278

AnselmJeong · 2019-08-18T05:04:49Z

Let's say, we try to @groupby on a CategoricalString column

df = DataFrame(
    fruit    =  ["Apple", "Banana", "Cherry", "Apple", "Banana", "Cherry"],
    price    =  [1.2, 2.0, 0.4, 1.2, 2.0, 0.4])

categorical!(df, :fruit)
df |> @groupby(_.fruit) |> @map({key=key(_), m=mean(_.price)})

The result is exactly as expected.

3×2 DataFrame
│ Row │ key          │ m       │
│     │ Categorical… │ Float64 │
├─────┼──────────────┼─────────┤
│ 1   │ Apple        │ 1.2     │
│ 2   │ Banana       │ 2.0     │
│ 3   │ Cherry       │ 0.4     │

Next, we make the same column contain some missing values.

df = DataFrame(
    fruit    =  ["Apple", missing, "Cherry", "Apple", "Banana", "Cherry"],
    price    =  [1.2, 2.0, 0.4, 1.2, 2.0, 0.4])

categorical!(df, :fruit)
df |> @groupby(_.fruit) |> @map({key=key(_), m=mean(_.price)})

Then, if failed with an error message stating somewhat like this.

ERROR: MethodError: convert(::Type{DataValue{CategoricalString{UInt32}}}, ::CategoricalString{UInt32}) is ambiguous

However, this behavior is weird since String column with similar missing values does not issue any complaints.

df = DataFrame(
    fruit    =  ["Apple", missing, "Cherry", "Apple", "Banana", "Cherry"],
    price    =  [1.2, 2.0, 0.4, 1.2, 2.0, 0.4])

df |> @groupby(_.fruit) |> @map({key=key(_), m=mean(_.price)})

key      │ m
─────────┼────
"Apple"  │ 1.2
#NA      │ 2.0
"Cherry" │ 0.4
"Banana" │ 2.0

This limitation may have to be fixed, since it is natural that String columns are recoded as Categorical columns before beginning any analysis.

Thanks.

The text was updated successfully, but these errors were encountered:

AnselmJeong · 2019-08-18T05:27:26Z

Similar behavior found on @filter missing values.

df = DataFrame(
    fruit    =  ["Apple", missing, "Cherry", "Apple", "Banana", "Cherry"],
    price    =  [1.2, 2.0, 0.4, 1.2, 2.0, 0.4])

categorical!(df, :fruit)
df |> @filter(!isna(_.fruit)) |> DataFrame

It fails with the same ERROR message:

LoadError: MethodError: convert(::Type{DataValue{CategoricalString{UInt32}}}, ::CategoricalString{UInt32}) is ambiguous

This example works without any error if String colums were not converted into Categorical columns or if there were not any missing values.

The same kind of error persists with simpler types of filtering.

df |> @filter(_.fruit == "Banana") |> DataFrame

Also fails with the same ERROR message.

Conclusively, @filter or @groupby seems not work with Categorical variables with missing value.

Thanks.

greimel · 2019-08-20T12:21:08Z

Same for

df |> @select(:price)

or

df |> @filter(true)

(with the above DataFrame).

So it seems that the problem occurs when transforming the DataFrame to a Query.jl iterator.

davidanthoff · 2019-08-25T04:49:31Z

This is actually unrelated to Query.jl, one can get the same error with:

using IteratorInterfaceExtensions

it = getiterator(df)

first(it)

The implementation for that lives in Tables.jl.

The root problem though seems to be that both DataValues.jl and CategoricalArrays.jl add methods to convert that don't play well together.

I'm a bit lost what to do here, maybe @quinnj has an idea? Also CCing @nalimilan, the resident categorical expert. Is this a case that needs special casing in Tables.jl?

nalimilan · 2019-08-25T15:14:17Z

I haven't been able to reproduce locally (even after updating all packages). Is there anything special I need to do? I have DataValues 0.4.12.

davidanthoff · 2019-08-25T18:21:35Z

Here is what I'm using:

(repo1) pkg> st
    Status `C:\Users\david\.julia\environments\repo1\Project.toml`
  [a93c6f00] DataFrames v0.19.2
  [e7dc6d0d] DataValues v0.4.12
  [82899510] IteratorInterfaceExtensions v1.0.0

And the code:

julia> using DataFrames, IteratorInterfaceExtensions, DataValues

julia> df = DataFrame(fruit= ["Apple", missing, "Cherry", "Apple", "Banana", "Cherry"], price= [1.2, 2.0, 0.4, 1.2, 2.0, 0.4]);

julia> categorical!(df, :fruit);

julia> it = getiterator(df);

julia> first(it)
ERROR: MethodError: convert(::Type{DataValue{CategoricalString{UInt32}}}, ::CategoricalString{UInt32}) is ambiguous. Candidates:
  convert(::Type{S}, x::T) where {S, T<:(Union{CategoricalString{R}, CategoricalValue{T,R} where T} where R)} in CategoricalArrays at C:\Users\david\.julia\packages\CategoricalArrays\xjesC\src\value.jl:91
  convert(::Type{DataValue{T}}, x::T) where T in DataValues at C:\Users\david\.julia\packages\DataValues\XQWvG\src\scalar\core.jl:31
Possible fix, define
  convert(::Type{DataValue{T<:(Union{CategoricalString{R}, CategoricalValue{T,R} where T} where R)}}, ::T<:(Union{CategoricalString{R}, CategoricalValue{T,R} where T} where R))
Stacktrace:
 [1] macro expansion at C:\Users\david\.julia\packages\Tables\FXXeK\src\tofromdatavalues.jl:102 [inlined]
 [2] iterate(::Tables.DataValueRowIterator{NamedTuple{(:fruit, :price),Tuple{DataValue{CategoricalString{UInt32}},Float64}},Tables.RowIterator{NamedTuple{(:fruit, :price),Tuple{CategoricalArray{Union{Missing, String},1,UInt32,String,CategoricalString{UInt32},Missing},Array{Float64,1}}}}}, ::Tuple{}) at C:\Users\david\.julia\packages\Tables\FXXeK\src\tofromdatavalues.jl:96 (repeats 2 times)
 [3] first(::Tables.DataValueRowIterator{NamedTuple{(:fruit, :price),Tuple{DataValue{CategoricalString{UInt32}},Float64}},Tables.RowIterator{NamedTuple{(:fruit, :price),Tuple{CategoricalArray{Union{Missing, String},1,UInt32,String,CategoricalString{UInt32},Missing},Array{Float64,1}}}}}) at .\abstractarray.jl:342
 [4] top-level scope at REPL[11]:1

Make sure you load DataValues, otherwise Tables.jl will use a different code path.

nalimilan · 2019-08-26T09:45:44Z

Still no error with these versions:

(v1.2) pkg> st
    Status `~/.julia/environments/v1.2/Project.toml`
  [6e4b80f9] BenchmarkTools v0.4.2
  [336ed68f] CSV v0.5.0 [`~/.julia/dev/CSV`]
  [324d7699] CategoricalArrays v0.5.5 [`~/.julia/dev/CategoricalArrays`]
  [a93c6f00] DataFrames v0.19.2
  [e7dc6d0d] DataValues v0.4.12
  [82899510] IteratorInterfaceExtensions v1.0.0
  [2dfb63ee] PooledArrays v0.5.2+ [`~/.julia/dev/PooledArrays`]
  [1a8c2f83] Query v0.12.1
  [295af30f] Revise v1.0.2
  [bd369af6] Tables v0.2.6

davidanthoff · 2019-08-26T19:33:59Z

@nalimilan: here is a reproducer repo that you can run with this:

nalimilan · 2019-09-01T16:05:04Z

Thanks. So this is indeed a legitimate ambiguity which cannot be fixed separately in either package. I think @quinnj's JuliaLang/julia#32613 should help fixing this, but that won't work for current Julia versions. Maybe we should define that unwrap method in Compat or in DataAPI. But first we should bump that PR so that at least it's merged in master.

Do you think this is a new problem? I still don't understand why I didn't observe it in my original tests.

davidanthoff added bug upstream labels Aug 25, 2019

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

@groupby failed on CategoricalString with missing value #278

@groupby failed on CategoricalString with missing value #278

AnselmJeong commented Aug 18, 2019

AnselmJeong commented Aug 18, 2019 •

edited

Loading

greimel commented Aug 20, 2019

davidanthoff commented Aug 25, 2019

nalimilan commented Aug 25, 2019

davidanthoff commented Aug 25, 2019

nalimilan commented Aug 26, 2019

davidanthoff commented Aug 26, 2019

nalimilan commented Sep 1, 2019

@groupby failed on CategoricalString with missing value #278

@groupby failed on CategoricalString with missing value #278

Comments

AnselmJeong commented Aug 18, 2019

AnselmJeong commented Aug 18, 2019 • edited Loading

greimel commented Aug 20, 2019

davidanthoff commented Aug 25, 2019

nalimilan commented Aug 25, 2019

davidanthoff commented Aug 25, 2019

nalimilan commented Aug 26, 2019

davidanthoff commented Aug 26, 2019

nalimilan commented Sep 1, 2019

AnselmJeong commented Aug 18, 2019 •

edited

Loading