
by drops zero-length categorical groups #2106

Open
dgkf opened this issue Feb 10, 2020 · 10 comments
Labels: grouping, non-breaking (The proposed change is not breaking)
Milestone: 2.0

Comments

@dgkf

dgkf commented Feb 10, 2020

Related to #2104 and #1256

One of the behaviors revised in dplyr (as of 0.8.0) concerns whether grouping by categorical variables produces zero-length groups for unrepresented category levels. I haven't seen this behavior touched on in other issues, and wanted to raise it as a topic of consideration. dplyr added a keyword argument called .drop which, when set to FALSE, retains groups for unrepresented categorical levels.

## existing behavior
x = categorical(["a", "a", "b", "b"])
levels!(x, ["a", "b", "c"])
df = DataFrame(x = x, y = [1, 2, 3, 4])
by(df, :x, length = :y => length)
# 2×2 DataFrame
# │ Row │ x            │ length │
# │     │ Categorical… │ Int64  │
# ├─────┼──────────────┼────────┤
# │ 1   │ a            │ 2      │
# │ 2   │ b            │ 2      │

## alternative output
by(df, :x, length = :y => length)
# 3×2 DataFrame
# │ Row │ x            │ length │
# │     │ Categorical… │ Int64  │
# ├─────┼──────────────┼────────┤
# │ 1   │ a            │ 2      │
# │ 2   │ b            │ 2      │
# │ 3   │ c            │ 0      │

Just a couple of ideas - this could possibly reuse skipmissing=false, which could be interpreted colloquially as "missing groups", although I understand this conflates the concept a bit with the value missing. Alternatively, it might be nice to introduce something analogous to .drop that specifies the behavior for zero-length groups specifically.

There are certainly times when you want to retain the fact that a dataset contains no values of a specific level, and a feature like this would be very handy there.

@bkamins
Member

bkamins commented Feb 10, 2020

There are two issues:

  1. as far as I remember, the assumption that the length of a group is greater than zero is scattered around the split-apply-combine code, so adding this might be tricky (but doable);
  2. it is not clear to me how we should handle the case of a missing level when grouping by several columns (it is ambiguous what values the other grouping variables should take then).

How do you see the second one?

For now what you want can be achieved by something like:

d = Dict([lev => Int[] for lev in levels(df.x)])
foreach(((i, v),) -> push!(d[v], i), enumerate(df.x))

or

levels(df.x) .=> findall.(.==(levels(df.x)), Ref(df.x))

and then you can work with the indices to compute what you need.
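The same idea can be shown self-contained in base Julia (plain vectors stand in for df.x and levels(df.x) here, so this is just a sketch of the workaround, not DataFrames.jl API):

```julia
# Per-level row indices, including levels with no observations.
xlevels = ["a", "b", "c"]          # stands in for levels(df.x)
x = ["a", "a", "b", "b"]           # stands in for df.x

d = Dict(lev => Int[] for lev in xlevels)           # one empty index list per level
foreach(((i, v),) -> push!(d[v], i), enumerate(x))  # record each row under its level

# Zero-length groups now show up naturally when summarizing:
counts = [lev => length(d[lev]) for lev in xlevels]
# counts == ["a" => 2, "b" => 2, "c" => 0]
```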

@dgkf
Author

dgkf commented Feb 10, 2020

Thanks @bkamins - I only raise it for discussion as it's sort of cropped up in #2104 and I think it's an important thing to design around and consider early. I fully appreciate that there are ways to address this already (although your code is much cleaner than whatever I would have stumbled upon) - I wasn't raising it because it's impossible, but rather just inconvenient.

@bkamins:
how we should handle the case of a missing level when grouping with several columns

From my perspective, the purpose here is to treat your data as observations of a system, not as a complete dataset. If you want to summarize observations of possible outcomes, you often also want to reflect which outcomes weren't observed. When grouping by multiple factor columns, you would then want to characterize all the combinations of those columns' levels.

This can produce some enormous datasets, so I wouldn't want this behavior to be the default, but think that a convenient way of retaining that information would be helpful when desired.
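The multi-column case described above can be sketched in base Julia with Iterators.product (the column levels here are made up for illustration):

```julia
# All key combinations for two categorical grouping columns,
# observed or not; the count grows multiplicatively with the levels.
xlevels = ["a", "b", "c"]
ylevels = ["lo", "hi"]

allkeys = vec(collect(Iterators.product(xlevels, ylevels)))
# 3 * 2 = 6 combinations, including pairs never seen in the data
```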

@bkamins
Member

bkamins commented Feb 11, 2020

From my perspective, the purpose here is to treat your data as observations of a system, not as a complete dataset.

OK. Now I get the idea.

However, if this is the case, maybe it is better to add this option to #1864 (i.e. to add an additional kwarg that would request expansion over all levels of a categorical column) rather than to groupby / by? What do you think? For your use case you would just call expand on the result of by and indicate that missing combinations should be filled with 0.

@bkamins
Member

bkamins commented Feb 11, 2020

Just as an additional comment on why this is problematic with by/groupby: we currently do not store the values of the grouping variables per group. Each group is identified by its first row.

Potentially as an extension to #2095 we might add this option.

@dgkf
Author

dgkf commented Feb 11, 2020

Something like that would certainly work. My first reaction is that it feels comfortable when you expect one observation per set of indexing variables, but it feels like a bit of a hack if the goal is to summarize over possible values. In both cases, it assumes that missing has a very particular meaning that might conflict with what missing already represents in a dataset.

Expanding observations

When the goal is just to reflect unobserved possible values as missing, this feels quite comfortable - at least for complete data sets (without existing missing values).

x = categorical(["a", "b"])
levels!(x, ["a", "b", "c"])
df = DataFrame(x = x, y = [1, 2])
expanddf!(df, [:x])
# 3×2 DataFrame
# │ Row │ x            │ y       │
# │     │ Categorical… │ Int64⍰  │
# ├─────┼──────────────┼─────────┤
# │ 1   │ a            │ 1       │
# │ 2   │ b            │ 2       │
# │ 3   │ c            │ missing │

Summarizing unobserved groups

To do some simple summary statistics where you also want to reflect unobserved groups, this feels like a bit of a hack. You have to introduce new missing data only to summarize it. If missing is used meaningfully in your dataset already (e.g. a measurement was not recorded as part of the observation), you also run the risk of conflating missing values introduced for expanding your dataset with missing values that have meaning in the context of the data.

x = categorical(["a", "a", "b"])
levels!(x, ["a", "b", "c"])
df = DataFrame(x = x, y = [1, 1, 2])
expanddf!(df, [:x])
by(df, :x, n_observations = :y => y -> count(!ismissing, y))
# 3×2 DataFrame
# │ Row │ x            │ n_observations │
# │     │ Categorical… │ Int64          │
# ├─────┼──────────────┼────────────────┤
# │ 1   │ a            │ 2              │
# │ 2   │ b            │ 1              │
# │ 3   │ c            │ 0              │
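The counting step above can be sketched in base Julia (plain vectors standing in for the expanded data frame), which shows why one must count non-missing values rather than rows once filler rows have been introduced:

```julia
# x: grouping values after expansion; y: observations, with one filler
# missing row introduced for the unobserved level "c".
x = ["a", "a", "b", "c"]
y = [1, 1, 2, missing]

counts = [lev => count(i -> x[i] == lev && !ismissing(y[i]), eachindex(x))
          for lev in ["a", "b", "c"]]
# counts == ["a" => 2, "b" => 1, "c" => 0]
```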

@bkamins
Member

bkamins commented Feb 11, 2020

By expanddf, do you mean expand from #1864?

If yes then note that its signature is:

function expand(df::AbstractDataFrame, indexcols; error::Bool=true, complete::Bool=false, fill=missing, replaceallmissing::Bool=false)

and you can choose the value of fill you want (it is only missing by default). So your last example would rather be written as:

expand(by(df, :x, n_observations = :x => length), [:x], fill=0, expand_categorical=true)

if we added a keyword like expand_categorical that would force expansion of a categorical vector.

CC @nalimilan - in this case the column :x correctly retains all levels after by so we are OK, xref #2104.

@dgkf
Author

dgkf commented Feb 11, 2020

Good catch - I didn't think to group first and then expand, but that's probably the more suitable approach.

The reason expanding first seems more idiomatic to me is that you will often summarize with more than one function at a time, and the handling of the "zero-observation" case might produce a different value for each summarizing function, whereas the fill parameter appears to fill all new cells with the same value.

For instance, suppose someone summarizes over an unobserved group within their data, computing both the number of records in that category and the maximum of a value column. They might expect the record count to be filled with 0 and the maximum to be filled with missing. Perhaps fill would instead need to accept something like a mapping of column name to fill value, possibly with a default? That interface starts to feel cumbersome to me, but it would at least let users address this behavior.
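That per-column fill idea can be sketched in base Julia (all names here are illustrative, not an existing DataFrames.jl interface): each summary column gets its own fill value for unobserved groups.

```julia
# Per-column fill values for unobserved groups (hypothetical interface).
fills = (n_records = 0, max_y = missing)
observed = Dict("a" => (n_records = 2, max_y = 5))  # summaries for seen groups
alllevels = ["a", "b"]                              # "b" was never observed

rows = [lev => get(observed, lev, fills) for lev in alllevels]
# rows[1] == ("a" => (n_records = 2, max_y = 5))
# rows[2].second.n_records == 0; rows[2].second.max_y is missing
```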

@nalimilan
Member

I can see that being able to do droplevels=false (let's call it that) would be useful in some situations. Though as a default it could be annoying, since many summary functions will fail for empty groups (in R they more often return NA or Inf). For more complex cases, using expand may be better. For reference, the dplyr discussion is in PR tidyverse/dplyr#3492.

As long as we don't change the default, we can introduce this feature at any point in the 1.x series. Though as @bkamins noted it would require adapting the internals. That's tricky since we don't want to make a copy of all unique values when there are many groups.

@bkamins
Member

bkamins commented Feb 11, 2020

Just to add - I expect that after #2095 we will see groupby and GroupedDataFrame used much more often.
Currently people mostly use by, but once we have fast lookup, GroupedDataFrame will provide first-class indexing for DataFrames.jl. In particular, I can easily imagine people running groupby even when the key is unique (to have quick lookup).

@bkamins bkamins added the non-breaking The proposed change is not breaking label Feb 12, 2020
@bkamins bkamins added this to the 2.0 milestone Feb 12, 2020
@bkamins
Member

bkamins commented Feb 12, 2020

I am giving it the 2.0 milestone as, after thinking it over, I believe it would be nice to have in groupby as well (unless we cannot find an efficient way to do it while staying performant and non-breaking).
