Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

NA in levels vs NA in values #337

Merged
merged 15 commits into from
Jan 9, 2023
Merged

NA in levels vs NA in values #337

merged 15 commits into from
Jan 9, 2023

Conversation

hadley
Copy link
Member

@hadley hadley commented Jan 5, 2023

This is an attempt to take a comprehensive look at handling NAs in levels vs NAs in values:

Copy link
Member

@mine-cetinkaya-rundel mine-cetinkaya-rundel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like making it more explicit (ha!) what "explicit" means so moving from value->level / level->value language from "explicit NA" sounds good to me.

I left a few typo fixes as well as some suggestions for how to present this idea in the vignette.

R/na.R Outdated Show resolved Hide resolved
vignettes/forcats.Rmd Outdated Show resolved Hide resolved
vignettes/forcats.Rmd Outdated Show resolved Hide resolved
vignettes/forcats.Rmd Outdated Show resolved Hide resolved

There are two ways to represent a missing value in a factor:

- You can include it in the values of the factor; it does not appear in the levels and `is.na()` reports it as missing. This is the usual way to encode a missing value in a factor:

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's not just the "usual" but also the "default", so maybe "This is how missing values are encoded in a factor by default"?

- You can include in the levels of the factor and `is.na()` does not report it as missing. This requires a little more work to create:

```{r}
f <- factor(c("x", "y", NA), exclude = NULL)

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there a way to say "don't exclude NAs" without specifying exclude = NULL? If not, I think it would be good to add a sentence about why we brought NULL into play all of a sudden, like something about "this works because in R NULLs are ..."

vignettes/forcats.Rmd Outdated Show resolved Hide resolved
vignettes/forcats.Rmd Outdated Show resolved Hide resolved
Comment on lines 81 to 82
`NA`s in the values tend to be best for data analysis, since `is.na()` works as you'd expect.
`NA`s in the levels are often better for display since you can control where they appear.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would move these two sentences to the end of this section since they're like a summary.

Also, I wonder about saying something like

You should be careful about using fct_na_value_to_level() in a data transformation pipeline where you overwrite your data since that will affect all downstream analysis as well. This should only be done if you will be using that updated data frame or creating displays or summaries, chances are you don't want NAs as an explicit level in your data analysis. If you do, you might consider relabeling them as something like "Unknown" or "Not available".

Maybe this is an overreach, but I feel like this function should be used as a quick fix, but if you really care about those NAs it's more practical to not make them NAs but something else, so they don't accidentally get swooped up by another function that treats NAs in a special way, say in a modelling context or something...

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you do, you might consider relabeling them as something like "Unknown" or "Not available".

Which you'd do with fct_na_value_to_level(x, name = "Unknown") if we add the idea of fct_explicit_na(na_level =) back in

NAMESPACE Show resolved Hide resolved
NEWS.md Outdated Show resolved Hide resolved
NEWS.md Outdated Show resolved Hide resolved
R/count.R Show resolved Hide resolved
}

names(levs) <- names(new)[rep(seq_along(new), vapply(new, length, integer(1)))]
out <- fct_recode(.f, !!!levs)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since fct_recode() disappeared does that mean this weirdness went away? #291 (comment)

I was expecting to see a test for that or something

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I added #339 to think about this more. I can't figure out if it deserves a test or not. On the one hand, it was arguably a bug, so it should get a test. On the other hand, thinking about fct_collapse() from first principles, it seems strange to test specific values of other_level.

R/na.R Outdated Show resolved Hide resolved
R/na.R Outdated Show resolved Hide resolved
R/na.R Outdated
f <- check_factor(f)

f <- fct_expand(f, NA)
f[is.na(f)] <- NA
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can't figure out what this line is for. Do you have an example of why this is needed?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is like f[is.na(f)] <- "x" where the RHS is matched into the levels (which now contain NA) because of the previous line.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm probably missing something but to me it looked like is.na(f) would always return FALSE because fct_expand() turns NA values into integer codes

library(forcats)

x <- factor(c("a", NA))
unclass(x)
#> [1]  1 NA
#> attr(,"levels")
#> [1] "a"

# No NAs here anymore!
x <- fct_expand(x, NA)
unclass(x)
#> [1] 1 2
#> attr(,"levels")
#> [1] "a" NA

Created on 2023-01-05 with reprex v2.0.2.9000

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oooooh, I didn't notice that. Yes, you're right that it's unnecessary.

Comment on lines 81 to 82
`NA`s in the values tend to be best for data analysis, since `is.na()` works as you'd expect.
`NA`s in the levels are often better for display since you can control where they appear.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you do, you might consider relabeling them as something like "Unknown" or "Not available".

Which you'd do with fct_na_value_to_level(x, name = "Unknown") if we add the idea of fct_explicit_na(na_level =) back in

```{r}
#| fig.alt: >
#| The bar chart of hair color, now ordered so that NAs, now labelled
#| "(missing)" are ordered where you'd expect: in between white (4) and
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Narrator: "They were not, in fact, labelled "(missing)""

return(f)
# Replace specified levels (if any), with other.
# @param keep A logical vector the same length as `levels(f)`
lvls_other <- function(f, keep, other_level = "Other") {
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I also added this new helper function to ensure that the implementation of "othering" is identical for all code paths.

#' @rdname fct_na_value_to_level
#' @param extra_levels Optionally, a character vector giving additional levels
#' that should also be converted to `NA` values.
fct_na_level_to_value <- function(f, extra_levels = NULL) {
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I made this extra_levels because if you don't include NA, it's easy to create factors that have a mix of NAs in values and levels.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Makes sense. I was actually wondering if we even needed extra_levels at all, but there doesn't seem to be any other way to do this in forcats, so it seems useful to have

x <- factor(c("x", "missing"))

forcats::fct_na_level_to_value(x, extra_levels = "missing")
#> [1] x    <NA>
#> Levels: x

It is a nice way of declaring, "no, missing should really be a missing value"

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm actually surprised this has never come up before. It seems pretty useful

R/na.R Outdated
Comment on lines 39 to 43
if (!identical(level, NA) && !is_string(level)) {
cli::cli_abort(
"{.arg level} must be a string or {.code NA}, not {.obj_type_friendly level}."
)
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

check_string(level, allow_na = TRUE) allows for logical NA

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh I guess you aren't using the checkers yet

R/na.R Outdated
fct_na_value_to_level <- function(f, level = NA) {
if (!identical(level, NA) && !is_string(level)) {
cli::cli_abort(
"{.arg level} must be a string or {.code NA}, not {.obj_type_friendly level}."
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think you need {.obj_type_friendly {level}}

It isn't letting me do a "suggestion" for some reason

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Doh

Comment on lines +46 to +50
f <- fct_expand(f, NA)
new_levels <- levels(f)
new_levels[is.na(new_levels)] <- level

lvls_revalue(f, new_levels)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So if you already have an NA level, this will rename it to level? Seems reasonable

#' @rdname fct_na_value_to_level
#' @param extra_levels Optionally, a character vector giving additional levels
#' that should also be converted to `NA` values.
fct_na_level_to_value <- function(f, extra_levels = NULL) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Makes sense. I was actually wondering if we even needed extra_levels at all, but there doesn't seem to be any other way to do this in forcats, so it seems useful to have

x <- factor(c("x", "missing"))

forcats::fct_na_level_to_value(x, extra_levels = "missing")
#> [1] x    <NA>
#> Levels: x

It is a nice way of declaring, "no, missing should really be a missing value"

R/na.R Outdated Show resolved Hide resolved
@hadley hadley merged commit 201f30e into main Jan 9, 2023
@hadley hadley deleted the other-na branch January 9, 2023 20:16
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

fct_infreq() doesn't reorder NA levels fct_collapse() should allow other_level = NA
3 participants