Add vec_proxy_missing() #1928

khusmann · 2024-04-15T23:55:30Z

vec_proxy_missing() adds a new proxy that allows developers to extend the is.na(), na.omit() etc. behavior of their vctrs (as I describe in (#1925). It's useful/needed when vctrs contain special values (or data structures) that represent different types of missingness in their data. In these cases, you need a separate proxy for equal and missing. It also lets you redefine the meaning of complete.

I wanted to take an experimental shot at the implementation here because it didn't seem to hard, and I wanted to play around with the resulting API. I'm extremely satisfied with the result -- this is exactly what I need for implementing a generic vctrs-backed Result<> type in the library I'm working on, and I think could have similar applications elsewhere (like haven::labelled_spss, or declared)

I've added a few tests and documentation updates, but it's only a first-pass. I would be happy to do a deeper & more thorough dive if this is considered for inclusion into vctrs.

Thank you for your consideration!

khusmann · 2024-08-15T19:21:18Z

@lionel- @DavisVaughan

Ok, I know I passionately advocated for this yesterday at conf, but after starting thinking about this again after a long break, I'm embarrassingly reconsidering the need for this (for my use case, at least).

I realized that for my application, it may actually be more theoretically consistent for comparisons with missing values (including missing reasons) to result in missing values. In base R, we have:

1 == NA => NA
NA == NA => NA

similarly:

1 + NA => NA
1 - NA => NA
(etc).

In interlacer, I say "operations on interlaced values are applied to the value channel". This way, when adding two interlaced vectors, only the values are added in the vector and missing values get automatically "masked out". This relies on:

1 + na("reason") => NA
1 - na("reason") => NA

Therefore, equality should also only apply to the value channel, just like for missing values without reasons:

1 == na("reason") => NA
na("reason1") == na("reason2") => NA

But what I was asking for in this PR was the ability to compare missing reasons:

na("reason1") == na("reason2") => false
na("reason1") == na("reason1") => true

But this is actually a different type of comparison, and breaks the interlacer rule if we allow it! (that is, breaks the rule that operations apply on values, masking out missingness). Instead, we should think of it as a new kind of comparison that compares values AND missing values (rather than comparing values and masks out missing). So I think what I actually need is to define a new form of the == and != operators that compare both the value and missingness of a value, and always return a boolean (never missing). Something like:

na("reason1") %==% na("reason2") => false
na("reason1") %==% na("reason1") => true
1 %==% na("reason") => false

This operator would also make sense with unlabelled missingness (raw NA values):

1 %==% NA => false
NA %==% NA => true
na("reason1") %==% NA => false

Anyway, tldr I need to think about this a little more… but I think we'll probably be able to close the PR! (if you have any theory input/perspective on this I'd really appreciate it but no pressure; I know y'all are super busy)

Sorry for the flip flop on this!

The interlacer rule is "operations are applied to the value channel" and missing values are automatically masked out. the original behavior did not follow this in order to allow for comparsion of missing reasons. Instead, for missing reason comparsions we will need to define new equality operators that compare both values and missing reason channels. See discussion here for more info: r-lib/vctrs#1928

khusmann · 2024-09-05T03:20:09Z

The latest version of interlacer (v0.4.0) now has the behavior I describe above, which so far seems to be working well... I'm closing this issue now!

DavisVaughan · 2024-09-05T13:49:13Z

I think it makes sense that all the vctrs ops operate on the underlying value, but then you could have special interlacer_equals() operations or something like that (possibly an infix op like you showed) that "know" about the different kinds of missingness as well

It's kind of like how in ivs you can do vctrs::vec_set_difference() and it uses the vctrs definition of sets, but there is also ivs::iv_set_difference() that uses the interval definition of sets

khusmann · 2024-09-05T18:28:05Z

It's kind of like how in ivs you can do vctrs::vec_set_difference() and it uses the vctrs definition of sets, but there is also ivs::iv_set_difference() that uses the interval definition of sets

Good parallel! (and ivs looks super cool btw)

The other nice part of the infix %==% operator is it is I'm finding that even without interlaced types the definition makes sense and is actually pretty useful. It can be used to determine if the values of two columns are identical in values and missing pattern, even with unlabeled missingness:

library(interlacer)

a <- c(1, 2, NA, 3)
b <- c(1, 2, NA, NA)

a == b
#> [1] TRUE TRUE   NA   NA

a %==% b
#> [1]  TRUE  TRUE  TRUE FALSE

# This becomes handy if I want to see if the vectors are identical
# in values and in missing pattern:
all(a %==% b)
#> [1] FALSE
all(a %==% a)
#> [1] TRUE

# By contrast, if I naively use base `==` I get `NA`
all(a == b)
#> [1] NA
all(a == a)
#> [1] NA

# To get the same behavior as all(a %==% b) requires a bit of gymnastics:
all(ifelse(is.na(a == b), is.na(a) & is.na(b), a == b))
#> [1] FALSE

^{Created on 2024-09-05 with reprex v2.0.2}

As far as I'm aware, there aren't any other convenient ways to make this basic kind of value & missingness pattern comparison in base or rlang or vctrs... Right?

Sure, there's all.equal() and identical(), but 1. they operate on the full vector & don't have vectorized versions and 2. the former is for "near" equality, and the latter additionally compares attributes (rather than just values).

DavisVaughan · 2024-09-05T19:17:44Z

vec_equal(na_equal = TRUE) offers some of this behavior

a <- c(1, 2, NA, 3)
b <- c(1, 2, NA, NA)

a == b
#> [1] TRUE TRUE   NA   NA

vctrs::vec_equal(a, b)
#> [1] TRUE TRUE   NA   NA

vctrs::vec_equal(a, b, na_equal = TRUE)
#> [1]  TRUE  TRUE  TRUE FALSE

^{Created on 2024-09-05 with reprex v2.0.2}

khusmann · 2024-09-05T21:07:12Z

Ah nice! So I can simplify / generalize the definition of x %==% y to be:

`%==%` <- function(x, y) {
  args <- vec_cast_common(x, y)
  vec_equal(vec_proxy(args[[1]]), vec_proxy(args[[2]]), na_equal = TRUE)
}

(Note vec_proxy() is used here instead of vec_proxy_equal() because vec_proxy_equal() returns a vctr designed for value comparison with ==, not for deep structural comparison as we want here)

With this definition the operator becomes meaningful for all vctrs record types as a "record equal" (or "deep equal"?) operator, as opposed to the usual "value equal". Cool.

For ivs, for example, this would allow you to compare intervals with one or more bounds missing. Interval vectors essentially can have 3 types of structural missingness: missing start, missing end, and missing both. So you'd have:

iv(5, 5) %==% iv(5, 5) # TRUE
iv(5, NA) %==% iv(5, NA) # TRUE
iv(NA, 5) %==% iv(NA, 5) # TRUE
iv(NA, NA) %==% iv(NA, NA) # TRUE

iv(5, NA) %==% iv(NA, NA) # FALSE
# Etc.

Compared to behavior of regular ==:

iv(5, 5) == iv(5, 5) # TRUE
iv(5, NA) == iv(5, NA) # NA
iv(NA, 5) == iv(NA, 5) # NA
iv(NA, NA) == iv(NA, NA) # NA

iv(5, NA) == iv(NA, NA) # NA
# Etc.

This makes sense because when we test iv(5, NA) == iv(5, NA), we're asking "were these the same measured interval?" and the correct answer is "we don't know, because we forgot what the upper bound was... so we can't say either way". Whereas when we ask iv(5, NA) %==% iv(5, NA), we're asking "are these record entries equivalent?" and the answer is "yes, because they both have the same lower bound and are both missing the upper bound".

Of course right now in ivs when either the start or end value of the input is NA you force both to be NA (to avoid needing to deal with the complexity of incomplete intervals, which is totally valid). Still, it's interesting to think about applications of this operator outside of interlacer... If this has more broad application, it might be worth considering for inclusion into vctrs.

Anyway, thanks for thinking this thru with me! I'll continue to stew, and pls don't hesitate to tag me if you see any more applications or parallels on anything else you're working on...

khusmann added 2 commits April 15, 2024 16:29

add vec_proxy_missing(), a proxy for extending NA vctrs behaviors

5b5caad

add tests for vec_proxy_missing()

8fac6c2

khusmann mentioned this pull request Sep 5, 2024

upgrade to v0.4.0 khusmann/interlacer#11

Merged

khusmann closed this Sep 5, 2024

khusmann mentioned this pull request Sep 5, 2024

feature request: vec_proxy_na() #1925

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add vec_proxy_missing() #1928

Add vec_proxy_missing() #1928

khusmann commented Apr 15, 2024 •

edited

Loading

khusmann commented Aug 15, 2024 •

edited

Loading

khusmann commented Sep 5, 2024 •

edited

Loading

DavisVaughan commented Sep 5, 2024

khusmann commented Sep 5, 2024

DavisVaughan commented Sep 5, 2024

khusmann commented Sep 5, 2024 •

edited

Loading

Add vec_proxy_missing() #1928

Add vec_proxy_missing() #1928

Conversation

khusmann commented Apr 15, 2024 • edited Loading

khusmann commented Aug 15, 2024 • edited Loading

khusmann commented Sep 5, 2024 • edited Loading

DavisVaughan commented Sep 5, 2024

khusmann commented Sep 5, 2024

DavisVaughan commented Sep 5, 2024

khusmann commented Sep 5, 2024 • edited Loading

khusmann commented Apr 15, 2024 •

edited

Loading

khusmann commented Aug 15, 2024 •

edited

Loading

khusmann commented Sep 5, 2024 •

edited

Loading

khusmann commented Sep 5, 2024 •

edited

Loading