Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

row binding data frames with Surv objects #443

Closed
topepo opened this issue Aug 3, 2023 · 8 comments
Closed

row binding data frames with Surv objects #443

topepo opened this issue Aug 3, 2023 · 8 comments
Labels
bug an unexpected problem or unintended behavior

Comments

@topepo
Copy link
Member

topepo commented Aug 3, 2023

validation_set() uses rbind() to concatenate the training and validation sets.

This fails for Surv objects (but dplyr::bind_rows() does the right thing):

library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union
library(survival)

lung$suv_obj <- Surv(lung$time, lung$status)
lung_1 <- head(lung)
lung_2 <- tail(lung)


r_bound <- rbind(lung_1, lung_2)
inherits(r_bound$suv_obj, "Surv")
#> [1] FALSE
class(r_bound$suv_obj)
#> [1] "matrix" "array"

dplyr_bound <- bind_rows(lung_1, lung_2)
inherits(dplyr_bound$suv_obj, "Surv")
#> [1] TRUE
class(dplyr_bound$suv_obj)
#> [1] "Surv"

Created on 2023-08-03 with reprex v2.0.2

@topepo topepo added the bug an unexpected problem or unintended behavior label Aug 3, 2023
@topepo
Copy link
Member Author

topepo commented Aug 3, 2023

There are a lot of other places that this will occur when the Surv object is in a basic R data frame.

For example, the partitioning functions use base R subscripting:

x$data[sort(x$train_id), , drop = FALSE]

and that drops the class. These can be changed to:

  ind <- sort(x$train_id)
  dplyr::slice(x$data, ind)

and it solves the issue.

We would have to do this change in a lot of places:

  • {training,testing,validation}.initial_validation_split
  • as.data.frame.rsplit

so it would be a system-wide change. For some methods, the latter uses column subscripting at the same time so I would also use dplyr::select() to keep the Surv objects intact.

@topepo
Copy link
Member Author

topepo commented Aug 3, 2023

I think I know what to do about these lines but will hold off before making changes there.

permuted_col <-
   x$data[as.integer(x, data = data, ...), x$col_id, drop = FALSE]
x$data[, x$col_id] <- permuted_col

@topepo
Copy link
Member Author

topepo commented Aug 3, 2023

The fix also changes the row names of the resulting object (restarting the numbers):

── Failure ('test-rsplit.R:31:3'): as.data.frame ───────────────────────────────
as.data.frame(rs4) (`actual`) not equal to dat1[c(1, 1, 1, 2, 2, 2), ] (`expected`).

`attr(actual, 'row.names')` is an integer vector (1, 2, 3, 4, 5, ...)
`attr(expected, 'row.names')` is a character vector ('1', '1.1', '1.2', '2', '2.1', ...)

@hfrick
Copy link
Member

hfrick commented Aug 10, 2023

@topepo I'm trying to understand why the subsetting via [ is a problem with Surv objects. It isn't in the reprex below so I'm wondering what I'm missing?

library(dplyr)
library(survival)

lung2 <- lung %>% mutate(surv_obj = Surv(time, status))
class(lung2)
#> [1] "data.frame"

lung2[1:3, , drop = FALSE] %>% 
  pull(surv_obj) %>% 
  class()
#> [1] "Surv"

lung2[1:3, "surv_obj" , drop = FALSE] %>% 
  pull(surv_obj) %>% 
  class()
#> [1] "Surv"

Created on 2023-08-10 by the reprex package (v2.0.1)

@EmilHvitfeldt
Copy link
Member

As far as I understand, the problem arrises when you don't load the survival package.

@hfrick
Copy link
Member

hfrick commented Aug 10, 2023

Ah, thank you! I think that situation might not be that common but still totally worth guarding against. That's just different from the initial example because rbind() drops the class even with the survival package loaded.

srv <-
  list(
    age = c(74, 68, 56, 57, 60, 74, 76, 77, 39, 75, 66, 58),
    sex = c(1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 1, 2),
    surv_obj = structure(
      c(306, 455, 1010, 210, 883, 1022, 1, 1, 0, 1, 1, 0,
        116, 188,  191, 105, 174,  177, 1, 0, 0, 0, 0, 0),
      dim = c(12L, 2L),
      dimnames = list(NULL, c("time", "status")),
      type = "right",
      class = "Surv"))
surv_df <-
  structure(
    srv,
    row.names = paste(1:12),
    class = "data.frame")

surv_df
#>    age sex surv_obj.time surv_obj.status
#> 1   74   1           306             116
#> 2   68   1           455             188
#> 3   56   1          1010             191
#> 4   57   1           210             105
#> 5   60   1           883             174
#> 6   74   1          1022             177
#> 7   76   1             1               1
#> 8   77   1             1               0
#> 9   39   1             0               0
#> 10  75   2             1               0
#> 11  66   1             1               0
#> 12  58   2             0               0
surv_df$surv_obj |> class()
#> [1] "Surv"

surv_df[1:3, , drop = FALSE]$surv_obj |> class()
#> [1] "matrix" "array"

Created on 2023-08-10 by the reprex package (v2.0.1)

hfrick added a commit that referenced this issue Aug 14, 2023
* indexing changes for #443

* dplyr::filter -> vctrs::vec_slice

* update tests and news for differences in row names

* be more specific

* use `expect_s3_class()`

* vctrs is already in Imports

* more realistic-looking Surv object

* safe-guard Surv in permutations

---------

Co-authored-by: Hannah Frick <hannah@rstudio.com>
@hfrick
Copy link
Member

hfrick commented Aug 14, 2023

closed in #444

@hfrick hfrick closed this as completed Aug 14, 2023
@github-actions
Copy link

This issue has been automatically locked. If you believe you have found a related problem, please file a new issue (with a reprex: https://reprex.tidyverse.org) and link to this issue.

@github-actions github-actions bot locked and limited conversation to collaborators Aug 29, 2023
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
bug an unexpected problem or unintended behavior
Projects
None yet
Development

No branches or pull requests

3 participants