-
Notifications
You must be signed in to change notification settings - Fork 127
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
NA in levels vs NA in values #337
Conversation
#Conflicts: # NEWS.md
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I like making it more explicit (ha!) what "explicit" means so moving from value->level / level->value language from "explicit NA" sounds good to me.
I left a few typo fixes as well as some suggestions for how to present this idea in the vignette.
vignettes/forcats.Rmd
Outdated
|
||
There are two ways to represent a missing value in a factor: | ||
|
||
- You can include it in the values of the factor; it does not appear in the levels and `is.na()` reports it as missing. This is the usual way to encode a missing value in a factor: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's not just the "usual" but also the "default", so maybe "This is how missing values are encoded in a factor by default"?
vignettes/forcats.Rmd
Outdated
- You can include in the levels of the factor and `is.na()` does not report it as missing. This requires a little more work to create: | ||
|
||
```{r} | ||
f <- factor(c("x", "y", NA), exclude = NULL) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there a way to say "don't exclude NAs" without specifying exclude = NULL
? If not, I think it would be good to add a sentence about why we brought NULL
into play all of a sudden, like something about "this works because in R NULL
s are ..."
vignettes/forcats.Rmd
Outdated
`NA`s in the values tend to be best for data analysis, since `is.na()` works as you'd expect. | ||
`NA`s in the levels are often better for display since you can control where they appear. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would move these two sentences to the end of this section since they're like a summary.
Also, I wonder about saying something like
You should be careful about using
fct_na_value_to_level()
in a data transformation pipeline where you overwrite your data since that will affect all downstream analysis as well. This should only be done if you will be using that updated data frame or creating displays or summaries, chances are you don't want NAs as an explicit level in your data analysis. If you do, you might consider relabeling them as something like "Unknown" or "Not available".
Maybe this is an overreach, but I feel like this function should be used as a quick fix, but if you really care about those NA
s it's more practical to not make them NA
s but something else, so they don't accidentally get swooped up by another function that treats NAs in a special way, say in a modelling context or something...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If you do, you might consider relabeling them as something like "Unknown" or "Not available".
Which you'd do with fct_na_value_to_level(x, name = "Unknown")
if we add the idea of fct_explicit_na(na_level =)
back in
} | ||
|
||
names(levs) <- names(new)[rep(seq_along(new), vapply(new, length, integer(1)))] | ||
out <- fct_recode(.f, !!!levs) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since fct_recode()
disappeared does that mean this weirdness went away? #291 (comment)
I was expecting to see a test for that or something
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I added #339 to think about this more. I can't figure out if it deserves a test or not. On the one hand, it was arguably a bug, so it should get a test. On the other hand, thinking about fct_collapse()
from first principles, it seems strange to test specific values of other_level
.
R/na.R
Outdated
f <- check_factor(f) | ||
|
||
f <- fct_expand(f, NA) | ||
f[is.na(f)] <- NA |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I can't figure out what this line is for. Do you have an example of why this is needed?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is like f[is.na(f)] <- "x"
where the RHS is matched into the levels (which now contain NA
) because of the previous line.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm probably missing something but to me it looked like is.na(f)
would always return FALSE
because fct_expand()
turns NA
values into integer codes
library(forcats)
x <- factor(c("a", NA))
unclass(x)
#> [1] 1 NA
#> attr(,"levels")
#> [1] "a"
# No NAs here anymore!
x <- fct_expand(x, NA)
unclass(x)
#> [1] 1 2
#> attr(,"levels")
#> [1] "a" NA
Created on 2023-01-05 with reprex v2.0.2.9000
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oooooh, I didn't notice that. Yes, you're right that it's unnecessary.
vignettes/forcats.Rmd
Outdated
`NA`s in the values tend to be best for data analysis, since `is.na()` works as you'd expect. | ||
`NA`s in the levels are often better for display since you can control where they appear. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If you do, you might consider relabeling them as something like "Unknown" or "Not available".
Which you'd do with fct_na_value_to_level(x, name = "Unknown")
if we add the idea of fct_explicit_na(na_level =)
back in
vignettes/forcats.Rmd
Outdated
```{r} | ||
#| fig.alt: > | ||
#| The bar chart of hair color, now ordered so that NAs, now labelled | ||
#| "(missing)" are ordered where you'd expect: in between white (4) and |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Narrator: "They were not, in fact, labelled "(missing)"
"
Co-authored-by: Davis Vaughan <davis@rstudio.com> Co-authored-by: Mine Cetinkaya-Rundel <cetinkaya.mine@gmail.com>
return(f) | ||
# Replace specified levels (if any), with other. | ||
# @param keep A logical vector the same length as `levels(f)` | ||
lvls_other <- function(f, keep, other_level = "Other") { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I also added this new helper function to ensure that the implementation of "othering" is identical for all code paths.
#' @rdname fct_na_value_to_level | ||
#' @param extra_levels Optionally, a character vector giving additional levels | ||
#' that should also be converted to `NA` values. | ||
fct_na_level_to_value <- function(f, extra_levels = NULL) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I made this extra_levels
because if you don't include NA
, it's easy to create factors that have a mix of NAs in values and levels.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Makes sense. I was actually wondering if we even needed extra_levels
at all, but there doesn't seem to be any other way to do this in forcats, so it seems useful to have
x <- factor(c("x", "missing"))
forcats::fct_na_level_to_value(x, extra_levels = "missing")
#> [1] x <NA>
#> Levels: x
It is a nice way of declaring, "no, missing
should really be a missing value"
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm actually surprised this has never come up before. It seems pretty useful
R/na.R
Outdated
if (!identical(level, NA) && !is_string(level)) { | ||
cli::cli_abort( | ||
"{.arg level} must be a string or {.code NA}, not {.obj_type_friendly level}." | ||
) | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
check_string(level, allow_na = TRUE)
allows for logical NA
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh I guess you aren't using the checkers yet
R/na.R
Outdated
fct_na_value_to_level <- function(f, level = NA) { | ||
if (!identical(level, NA) && !is_string(level)) { | ||
cli::cli_abort( | ||
"{.arg level} must be a string or {.code NA}, not {.obj_type_friendly level}." |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think you need {.obj_type_friendly {level}}
It isn't letting me do a "suggestion" for some reason
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Doh
f <- fct_expand(f, NA) | ||
new_levels <- levels(f) | ||
new_levels[is.na(new_levels)] <- level | ||
|
||
lvls_revalue(f, new_levels) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So if you already have an NA
level, this will rename it to level
? Seems reasonable
#' @rdname fct_na_value_to_level | ||
#' @param extra_levels Optionally, a character vector giving additional levels | ||
#' that should also be converted to `NA` values. | ||
fct_na_level_to_value <- function(f, extra_levels = NULL) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Makes sense. I was actually wondering if we even needed extra_levels
at all, but there doesn't seem to be any other way to do this in forcats, so it seems useful to have
x <- factor(c("x", "missing"))
forcats::fct_na_level_to_value(x, extra_levels = "missing")
#> [1] x <NA>
#> Levels: x
It is a nice way of declaring, "no, missing
should really be a missing value"
This is an attempt to take a comprehensive look at handling
NA
s in levels vsNAs
in values:fct_na_levels_to_values()
andfct_na_values_to_levels()
(and deprecatingfct_explicit_na()
)fct_infreq()
doesn't reorder NA levels #326)other_level = NA
everywhere. (Fixesfct_collapse()
should allowother_level = NA
#291)