Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Confusing levels fallback #54

Open
aplavin opened this issue Oct 28, 2022 · 9 comments
Open

Confusing levels fallback #54

aplavin opened this issue Oct 28, 2022 · 9 comments

Comments

@aplavin
Copy link

aplavin commented Oct 28, 2022

Looking at the concrete implementation of levels in CategoricalArrays, I see that this function conveniently extracts possible levels from both arrays and individual values:

julia> a = CategoricalArray(["abc", "def", "abc"])
3-element CategoricalArray{String,1,UInt32}:
 "abc"
 "def"
 "abc"

julia> x = a[1]
CategoricalValue{String, UInt32} "abc"

julia> levels(a)
2-element Vector{String}:
 "abc"
 "def"

julia> levels(x)
2-element Vector{String}:
 "abc"
 "def"

However, the fallback implemented in DataAPI itself is only correct for collections, not individual values:

# like unique(x), makes sense?
julia> levels(["abc", "def", "abc"])
2-element Vector{String}:
 "abc"
 "def"

# makes no sense:
julia> levels("abc")
3-element Vector{Char}:
 'a': ASCII/Unicode U+0061 (category Ll: Letter, lowercase)
 'b': ASCII/Unicode U+0062 (category Ll: Letter, lowercase)
 'c': ASCII/Unicode U+0063 (category Ll: Letter, lowercase)

# especially confusing given that:
julia> x == "abc"
true

Maybe, the fallback shouldn't exist at all, and something like haslevels(x)::Bool added instead?

@bkamins
Copy link
Member

bkamins commented Oct 28, 2022

is only correct for collections

The problem is different. The function is correct, but it is defined for all collections, e.g.:

julia> levels("abc")
3-element Vector{Char}:
 'a': ASCII/Unicode U+0061 (category Ll: Letter, lowercase)
 'b': ASCII/Unicode U+0062 (category Ll: Letter, lowercase)
 'c': ASCII/Unicode U+0063 (category Ll: Letter, lowercase)

The problem is, that many collections, like "abc" or 1 are perceived by users as individual values not as collections.

The same problem is with unique.

What we maybe could do is changing that levels is defined for arrays by default (the problem is that it would be breaking).

@aplavin
Copy link
Author

aplavin commented Oct 28, 2022

What we maybe could do is changing that levels is defined for arrays by default

There are also arrays commonly perceived as individual values - most notably, SVectors and FieldArrays from StaticArrays.

My point is that "levels of a collection" and "levels of a value" are fundamentally different functions, if talking about their generic versions, not specific levels(::CategoricalArray) and levels(::CategoricalValue). So, maybe the fallback shouldn't be defined at all? Is there any actual usage of it?

@bkamins
Copy link
Member

bkamins commented Oct 28, 2022

Let me summarize the discussion, so that I am clear what problem we want to resolve (I think you raised it in Slack but I do not remember the details, and soon it will be lost on Slack). Current state is the following:

  • for collections levels recovers levels of a collection;
  • for CategoricalValue as a special case levels recovers levels from a source array of this categorical value.

In what cases the current behavior is problematic (I mean in practice; I understand your conceptual concerns so there is no need to comment on this).

@aplavin
Copy link
Author

aplavin commented Oct 31, 2022

In what cases the current behavior is problematic

First, I saw the levels(::CategoricalValue) function. It returns the set of levels the value is chosen from, exactly what I needed.
Then, I wanted to support both categorical and non-categorical values, and needed a way to retrieve levels only if they are defined for the value. So, I tried levels(any other value) such as levels("abc"). I expected nothing as the result, or at worst ("abc",), but definitely not ['a', 'b', 'c'].

So, currently it looks like levels is hardly useable with generic values - one needs to know in advance that the input is a CategoricalValue, otherwise levels(val) don't make much sense.

@bkamins
Copy link
Member

bkamins commented Oct 31, 2022

one needs to know in advance that the input is a CategoricalValue, otherwise levels(val) don't make much sense.

I understand the current design the opposite way (I do not want to say that it should not be changed, but I want to express my current understanding of the design):

  • levels normally should be only called on a collection; you should not call it on a value that is not a collection
  • as a special case, if you know that you have a CategoricalValue you can call it to get levels of a parent of this value

So, you should not call levels on any value that is not a collection, unless you know beforehand that it is a CategoricalValue.

Therefore, what you essentially ask for is how to check that a non-collection is CategoricalValue without having CategoricalArrays.jl as a dependency. This could be doable (e.g. we could add some DataAPI.jl method to check for it). However, a normal way to do it would be to add dependency on CategoricalArrays.jl and just check if some value is a CategoricalValue.

An alternative would be to add something like parentlevels function, so that parentlevels would be defined for CategoricalValue as the same as levels, and would be nothing by default as you want. @nalimilan - what do you think?

@aplavin
Copy link
Author

aplavin commented Oct 31, 2022

Yes, I do understand the levels-of-collection PoV, maybe it's more common indeed. I'm talking from the perspective of someone who only needed "parent levels" of a single value, and found that the levels method does exactly what's needed for CategoricalValues (but only for them).

@bkamins
Copy link
Member

bkamins commented Oct 31, 2022

Agreed. That is why maybe adding parentlevels to be explicit makes sense. Let us wait for @nalimilan to respond what he things, as he designed these features.

@nalimilan
Copy link
Member

Yes, I think the best solution would be to add an API that allows checking whether a value or array is categorical, whose only implementations (currently) would be CategoricalArray and CategoricalValue.

It's true that the fact that the levels fallback works on any iterable value and can return weird results for strings and numbers is a bit confusing, but introducing parentlevels would be overkill IMO.

@bkamins
Copy link
Member

bkamins commented Nov 2, 2022

And for the mean time having CategoricalArrays.jl as a dependency is a temporary solution.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants