categorical with levels and recoding at once #389

Open
jkrumbiegel opened this issue Mar 26, 2022 · 2 comments

Comments

@jkrumbiegel

jkrumbiegel commented Mar 26, 2022

I looked through the issues but didn't see anything comparable. Apologies if I missed something and am duplicating old discussions.

Whenever I work with categorical data, it's usually something simple like "male"/"female", but the original dataset often codes it with placeholders such as 1 and 2 or 'm' and 'f'. So if I want a categorical array with "male" and "female", I have to take two steps: create the array and then recode it. I feel it would be more straightforward to allow recoding when the array is created, which could also be faster if there's a lot of data. I'm thinking of an API that takes a vector of pairs, like this:

arr = [1, 2, 2, 1, 2, 1]
cat = categorical(arr, levels = [2 => "female", 1 => "male"])

As you can see, this lets me set the categorical values I want and, at the same time, an ordering that differs from the natural 1, 2 sequence.

I think usually one would need to do something like this:

cat = recode(categorical(arr, levels = [2, 1]), 1 => "male", 2 => "female")

This gets more cumbersome the more levels there are, and two full arrays need to be created.
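
To make the one-pass idea concrete, here is a minimal sketch in plain Julia, without CategoricalArrays. `encode_with_levels` is a hypothetical helper, not an existing function of the package: it reads the vector of pairs once, uses the pair order as the level order, and produces integer codes plus the label vector in a single traversal of the data.

```julia
# Hypothetical helper (NOT part of CategoricalArrays.jl), Base Julia only.
# The position of each pair in `mapping` defines the level order.
function encode_with_levels(arr, mapping)
    labels = [last(p) for p in mapping]
    # raw value => integer code (1-based position of its pair)
    codes = Dict(first(p) => UInt32(i) for (i, p) in enumerate(mapping))
    # Single pass over the data; throws a KeyError for a value with no pair.
    refs = [codes[x] for x in arr]
    return refs, labels
end

arr = [1, 2, 2, 1, 2, 1]
refs, labels = encode_with_levels(arr, [2 => "female", 1 => "male"])

# Decoding the codes gives back the recoded values.
decoded = labels[refs]
```

Here `labels` is `["female", "male"]`, so "female" comes first even though its raw code 2 sorts after 1, and no intermediate categorical array of the raw values is built.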

@greimel
Contributor

greimel commented Mar 29, 2022

I've been looking for the same functionality. You have two cases in mind. The one where arr ⊆ [1,2] works like this.

CategoricalArray{String,1}(
	arr,
	CategoricalPool(Dict("female" => 2, "male" => 1))
)

(I think it's undocumented though)

@nalimilan, shouldn't it be possible to construct a CategoricalArray from a refarray and leveldict? Is there a specific reason this doesn't exist? Would you mind a PR making categorical(refarray, leveldict) possible? E.g.

function categorical(refarray::AbstractArray{R, N},
                     invleveldict::Dict{V,R},
                     ordered=false
) where {N, V, R <: Integer}
	CategoricalArray{V,N}(refarray, CategoricalPool(invleveldict, ordered))
end

Probably one could also allow leveldict::Dict{R,V} and !(R <: Int) (which is the other case, arr ⊆ ['m','f'], that @jkrumbiegel mentioned).
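
For the second case, where the raw values are not integers, a rough sketch in Base Julia might look like the following. It assumes the desired level order is passed separately (here the hypothetical variable `levelorder`), because a `Dict` carries no ordering of its own. None of these names are CategoricalArrays API.

```julia
# Sketch only, Base Julia: derive integer refs from non-integer raw values.
leveldict = Dict('m' => "male", 'f' => "female")   # raw value => label
levelorder = ["female", "male"]                    # assumed to be given explicitly

# label => integer code, i.e. the "invleveldict" shape discussed above
invleveldict = Dict(l => UInt32(i) for (i, l) in enumerate(levelorder))

raw = ['m', 'f', 'f', 'm']
# Two lookups per element: raw value -> label -> integer code.
refs = [invleveldict[leveldict[x]] for x in raw]
```

The resulting `refs` and `invleveldict` have the same shape as the arguments of the constructor call shown earlier in this comment.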

@nalimilan
Member

Yeah this definitely makes sense. I haven't implemented these yet because I concentrated on getting the basics right, without working too much on convenience. But feel free to make a PR.

There are a few subtle issues to address though:

  • Do we want to just wrap the input vector or to make a copy? I'd tend to avoid a copy, given that array constructors tend to be wrappers. Alternatively, CategoricalArray could avoid copying while categorical makes a copy (possibly with an argument to choose the best behavior).
  • If we don't make a copy, we are forced to use the input vector's type as the reference type. This may not be what is intended in general, as one often has a Vector{Int} input (as in @jkrumbiegel's example), but UInt32 takes half the memory, is faster to process (e.g. for grouping) and reduces the amount of recompilation of functions (since it's the default type).
  • When adding different constructors, we must ensure no ambiguity can happen (now or later). As @greimel noted, they could take either invleveldict or leveldict as the second argument. Yet I don't think it's possible to distinguish these in dispatch since Dict{Int, Int} could be both. One solution would be to pass these as keyword arguments to distinguish them, though that wouldn't allow inferring the return type. We could also decide that the two-argument constructors would always take the refs as the first argument, so it would make more sense that the second argument would either be a vector of levels or leveldict.
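
Two of the points above can be illustrated with a small Base Julia sketch. The first part compares wrapping an `Int64` vector with converting it to `UInt32`. The second part shows why a `Dict{Int,Int}` is ambiguous and how keyword arguments would disambiguate it. `build_levels` is a hypothetical function written for this illustration, not a proposed or existing CategoricalArrays signature.

```julia
arr = Int64[1, 2, 2, 1, 2, 1]

# Wrapping: no copy, but the reference type is forced to be Int64.
wrapped = arr
# Converting: one copy, but each reference takes 4 bytes instead of 8.
converted = convert(Vector{UInt32}, arr)
bytes = (sizeof(wrapped), sizeof(converted))   # (48, 24)

# Hypothetical function: the keyword name, not the Dict type, says which
# direction the mapping goes.
function build_levels(refs::AbstractVector{<:Integer};
                      leveldict::Union{Nothing,AbstractDict}=nothing,
                      invleveldict::Union{Nothing,AbstractDict}=nothing)
    (leveldict === nothing) == (invleveldict === nothing) &&
        throw(ArgumentError("pass exactly one of leveldict or invleveldict"))
    # leveldict maps code => level; invleveldict maps level => code and is inverted.
    lookup = leveldict !== nothing ? leveldict :
             Dict(code => level for (level, code) in invleveldict)
    return [lookup[r] for r in refs]
end

# The same Dict{Int,Int} is valid under both readings and gives different results.
d = Dict(1 => 2, 2 => 3, 3 => 1)
as_leveldict = build_levels([1, 2, 3]; leveldict = d)        # [2, 3, 1]
as_invleveldict = build_levels([1, 2, 3]; invleveldict = d)  # [3, 1, 2]
```

Since `d` is a valid argument under both readings, a positional method taking a `Dict` could not tell them apart by type alone, which is the dispatch problem described in the last bullet.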
