[r] Implement as generics for SOMASparseNDArrayReader #1458

Open
wants to merge 6 commits into
base: main

Conversation

pablo-gar
Member

@pablo-gar pablo-gar commented Jun 7, 2023

Issue and/or context:
#1453

Changes:

  • Add coerce functions for SOMASparseNDArrayReader and TableIter
  • Update vignettes to reflect changes and make them more R-like
  • Add tests
  • Update some tests

Notes for Reviewer:

  • @aaronwolen for visibility and general approval
  • @mojaveazure Please take a look at the changes, paying close attention to the vignettes, and let me know whether they work for the general R user
  • @eddelbuettel please let me know if the implementation of the method for the as generics is correct

Usage

A more R-like behavior is enabled for $read()

library(tiledbsoma)

human_experiment = load_dataset("soma-exp-pbmc-small")
soma_df <- human_experiment$obs
soma_sparse <-  human_experiment$ms$get("RNA")$X$get("data")

as.data.frame(soma_df$read())
# # A tibble: 80 × 9
#    soma_joinid orig.ident  nCount_RNA nFeature_RNA RNA_snn_res.0.8 letter.idents
#          <int> <chr>            <dbl>        <int> <chr>           <chr>        
#  1           0 SeuratProj…         70           47 0               A            
#  2           1 SeuratProj…         85           52 0               A            
#  3           2 SeuratProj…         87           50 1               B            
#  4           3 SeuratProj…        127           56 0               A            
#  5           4 SeuratProj…        173           53 0               A            
#  6           5 SeuratProj…         70           48 0               A            
#  7           6 SeuratProj…         64           36 0               A            
#  8           7 SeuratProj…         72           45 0               A            
#  9           8 SeuratProj…         52           36 0               A            
# 10           9 SeuratProj…        100           41 0               A            
# # ℹ 70 more rows
# # ℹ 3 more variables: groups <chr>, RNA_snn_res.1 <chr>, obs_id <chr>
# # ℹ Use `print(n = ...)` to see more rows

arrow::as_arrow_table(soma_df$read())
# Table
# 80 rows x 9 columns
# $soma_joinid <int64 not null>
# $orig.ident <large_string>
# $nCount_RNA <double>
# $nFeature_RNA <int32>
# $RNA_snn_res.0.8 <large_string>
# $letter.idents <large_string>
# $groups <large_string>
# $RNA_snn_res.1 <large_string>
# $obs_id <large_string>
    
arrow::as_arrow_table(soma_sparse$read())
# Table
# 4456 rows x 3 columns
# $soma_dim_0 <int64 not null>
# $soma_dim_1 <int64 not null>
# $soma_data <double not null>

as.data.frame(soma_sparse$read())
# # A tibble: 4,456 × 3
#    soma_dim_0 soma_dim_1 soma_data
#         <int>      <int>     <dbl>
#  1          0          1      4.97
#  2          0          5      4.97
#  3          0          8      6.06
#  4          0         11      4.97
#  5          0         22      4.97
#  6          0         30      6.35
#  7          0         33      4.97
#  8          0         34      6.57
#  9          0         36      4.97
# 10          0         38      4.97
# # ℹ 4,446 more rows
# # ℹ Use `print(n = ...)` to see more rows

as(soma_sparse$read(), "TsparseMatrix")[1:5, 1:5]
# 5 x 5 sparse Matrix of class "dgTMatrix"
#                             
# [1,] . 4.968821 . .        .
# [2,] . .        . 4.776153 .
# [3,] . .        . .        .
# [4,] . .        . .        .
# [5,] . .        . 4.074201 .

as(soma_sparse$read(), "CsparseMatrix")[1:5, 1:5]
# 5 x 5 sparse Matrix of class "dgCMatrix"
#                             
# [1,] . 4.968821 . .        .
# [2,] . .        . 4.776153 .
# [3,] . .        . .        .
# [4,] . .        . .        .
# [5,] . .        . 4.074201 .

as(soma_sparse$read(), "RsparseMatrix")[1:5, 1:5]
# 5 x 5 sparse Matrix of class "dgTMatrix"
#                             
# [1,] . 4.968821 . .        .
# [2,] . .        . 4.776153 .
# [3,] . .        . .        .
# [4,] . .        . .        .
# [5,] . .        . 4.074201 .

@codecov-commenter

codecov-commenter commented Jun 7, 2023

Codecov Report

Patch coverage has no change and project coverage change: -11.89 ⚠️

Comparison is base (a7eb23e) 63.97% compared to head (acd83e3) 52.08%.

❗ Current head acd83e3 differs from pull request most recent head ca0593a. Consider uploading reports for the commit ca0593a to get more accurate results

Additional details and impacted files
@@             Coverage Diff             @@
##             main    #1458       +/-   ##
===========================================
- Coverage   63.97%   52.08%   -11.89%     
===========================================
  Files         101       72       -29     
  Lines        8219     5725     -2494     
===========================================
- Hits         5258     2982     -2276     
+ Misses       2961     2743      -218     
Flag     Coverage Δ
python   ?
r        52.08% <ø> (+0.03%) ⬆️

see 31 files with indirect coverage changes

'utils-uris.R'
'utils.R'
'write_seurat.R'
'write_soma.R'
Contributor

Hm. Why would we need this now when we did not before?

Member

No, the Collate section is not necessary. My guess is it was generated when @includes were part of the code, then added as part of this PR.

apis/r/R/utils.R Outdated
@@ -71,7 +71,7 @@ arrow_to_dt <- function(arrlst) {
}

##' @noRd
-as_arrow_table <- function(arrlst) {
+to_arrow_table <- function(arrlst) {
Contributor

@eddelbuettel eddelbuettel Jun 7, 2023

I could possibly make that simple list of two external pointers a simple S3 class (which I considered for the simple features like pretty printing). That may open the door for a dispatch of as_arrow_table.CLASSHERE as in your coercion utilities. Is that better?

In the short term the renaming is fine but I do like these "verbs" to start with 'as' ...

(We could also decide to make it .as_arrow_table() with a leading dot. It is already a non-documented, non-exported internal helper.)
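
The S3 idea in this comment can be sketched roughly as follows. This is an illustration only, not the package's code: the class name `soma_arrow_pair`, the constructor `new_soma_arrow_pair()`, and the generic `to_table()` are hypothetical, and plain R objects stand in for the two external pointers so that the example runs without Arrow.

```r
# Hypothetical constructor: tag the two-element list with an S3 class.
# In the real code the two fields would be external pointers to Arrow structs.
new_soma_arrow_pair <- function(array_data, schema) {
  structure(
    list(array_data = array_data, schema = schema),
    class = "soma_arrow_pair"
  )
}

# Having a class makes simple features such as pretty printing possible.
print.soma_arrow_pair <- function(x, ...) {
  cat("<soma_arrow_pair> fields:", paste(names(x), collapse = ", "), "\n")
  invisible(x)
}

# Hypothetical generic standing in for as_arrow_table(); with a class in
# place, a method can be dispatched on it.
to_table <- function(x, ...) UseMethod("to_table")

to_table.soma_arrow_pair <- function(x, ...) {
  # A real method would build an Arrow Table from the two pointers;
  # here a data.frame is returned as a stand-in.
  as.data.frame(x$array_data)
}

pair <- new_soma_arrow_pair(list(a = 1:3), schema = "a: int32")
print(pair)
tbl <- to_table(pair)
```

The point of the sketch is only that a class attribute on the list is enough for both `print()` and a coercion generic to dispatch on it.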

Member

> I could possibly make that simple list of two external pointers a simple S3 class

Love this idea. Almost did it myself #1461

Member

@aaronwolen aaronwolen Jun 7, 2023

But @pablo-gar, note that I removed the internal as_arrow_table() in #1461 since it was redundant with soma_array_to_arrow_table() and conflicted with arrow::as_arrow_table().

Contributor

For correctness, let's add "proposing to remove as_arrow_table() in #1461" and maybe we should move a little slower here / not quite do that.

Contributor

@eddelbuettel eddelbuettel Jun 7, 2023

> Love this idea.

Come to think of it, in tiledb-r I changed this and am now returning an Arrow Table each time. That is simpler. Maybe we should do that here too. Then as_arrow_table() would become an R-level internal function (not exported, not visible) that wraps around sr_next() and the other data gatherers (i.e. soma_array_reader()) and, instead of being handed a list of two external pointers to Arrow structs, returns an Arrow Table made from them.

Contributor

@eddelbuettel eddelbuettel left a comment

I left questions inline. This is quite promising, and I trust the unit tests...

@pablo-gar
Member Author

Thanks @eddelbuettel, you got here a little earlier than expected. I wanted to take a final look before opening it up for review, since I opened the PR very late yesterday. I did a final review and everything looks good.

I'll address your comments shortly.

@pablo-gar pablo-gar marked this pull request as ready for review June 7, 2023 17:03
Member

@mojaveazure mojaveazure left a comment

I'm curious as to the reason for implementing this at the SparseArrayRead level instead of the SparseArray level; I would think that as(array, "RsparseMatrix") would be more user-friendly than as(array$read(), "RsparseMatrix")

That's not to say the SparseArrayRead coercions can't exist, the array-level could simply be

methods::setOldClass("SOMASparseNDArray")

#' @importClassesFrom Matrix TsparseMatrix
#'
methods::setAs(from = "SOMASparseNDArray", to = "TsparseMatrix", def = \(from) methods::as(from$read(), "TsparseMatrix"))

but perhaps there's some issue that I'm not thinking of for why we're not providing coercions at the array-level

Additional comments provided within the code

@@ -0,0 +1,40 @@
#' Coercion methods for SOMA classes

#' @importFrom arrow as_arrow_table
Member

Should we reexport as_arrow_table?

#' @importFrom arrow as_arrow_table
#' @export
#'
arrow::as_arrow_table

If not, these methods will be inaccessible to the end-user without library(arrow) first

Comment on lines +15 to +19
# Coerce \link[tiledbsoma]{SOMASparseNDArrayRead} to Matrix::\link[Matrix]{dgTMatrix}
setAs(from = "SOMASparseNDArrayRead",
to = "TsparseMatrix",
def = function(from) from$sparse_matrix()$concat()
)
Member

The R6 classes may need to be declared as an old class with methods::setOldClass()

methods::setOldClass("SOMASparseNDArrayRead")

We may also need to import the target classes from Matrix

#' @importClassesFrom Matrix TsparseMatrix CsparseMatrix RsparseMatrix 
#'
NULL
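
A self-contained sketch of the pattern suggested here, under stated assumptions: `ToyRead` is a hypothetical stand-in for the reader class (it is not the tiledbsoma class), and it stores its triplets directly instead of reading from an array. It shows an R6 class being registered with `setOldClass()` so that `setAs()` can use it in a coercion signature.

```r
library(methods)
library(R6)
library(Matrix)

# Hypothetical toy reader holding 1-based triplets and the matrix shape.
ToyRead <- R6Class("ToyRead",
  public = list(
    i = NULL, j = NULL, x = NULL, dims = NULL,
    initialize = function(i, j, x, dims) {
      self$i <- i
      self$j <- j
      self$x <- x
      self$dims <- dims
    }
  )
)

# R6 objects carry an S3 class vector c("ToyRead", "R6"); registering it
# lets S4 treat "ToyRead" as a known class in method signatures.
setOldClass(c("ToyRead", "R6"))

# Coercion from the toy reader to a triplet-form sparse matrix.
setAs("ToyRead", "TsparseMatrix", function(from) {
  sparseMatrix(i = from$i, j = from$j, x = from$x,
               dims = from$dims, repr = "T")
})

rd <- ToyRead$new(i = c(1L, 2L), j = c(2L, 4L),
                  x = c(4.97, 4.78), dims = c(5L, 5L))
m <- as(rd, "TsparseMatrix")
```

Inside a package, the `library(Matrix)` call would be replaced by the `@importClassesFrom` directive shown in the comment above.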

)

# Coerce \link[tiledbsoma]{SOMASparseNDArrayRead} to Matrix::\link[Matrix]{dgCMatrix}
setAs(from = "SOMASparseNDArrayRead",
Member

We could also provide a delayed method for SeuratObject::as.sparse(); this is called in SeuratObject::CreateSeuratObject() and SeuratObject::CreateAssayObject() and would allow passing a sparse array/sparse array read directly to those functions

#' @exportS3Method SeuratObject::as.sparse
#'
as.sparse.SOMASparseNDArrayRead <- function(x, ...) {
  as(x, "CsparseMatrix")
}

@pablo-gar
Member Author

Thanks @mojaveazure for all the recommendations! I was scratching my head with some behaviors that are going to be solved by your suggestions.

> I'm curious as to the reason for implementing this at the SparseArrayRead level instead of the SparseArray level; I would think that as(array, "RsparseMatrix") would be more user-friendly than as(array$read(), "RsparseMatrix")
>
> That's not to say the SparseArrayRead coercions can't exist, the array-level could simply be [...]

Yes, I thought about this, and there's a trade-off: if we add them at the SparseArray level, we don't have access to the arguments of read(), most importantly coords and value_filter. I could add coercions to all of SOMASparseNDArray, SOMASparseNDArrayReader, and TableIter; we would get good coverage, although the UX may be a little confusing.

What do you think?
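
The trade-off described above can be illustrated with a small base-R sketch. Everything here is hypothetical: `toy_array`, `toy_read()`, and the two S3 classes are stand-ins, not tiledbsoma objects, and `coords` is simplified to a filter on the first dimension.

```r
# Hypothetical array: triplet data tagged with an S3 class.
toy_array <- structure(
  list(triplets = data.frame(
    soma_dim_0 = c(0L, 0L, 1L, 2L),
    soma_dim_1 = c(1L, 5L, 3L, 3L),
    soma_data  = c(4.97, 4.97, 4.78, 4.07)
  )),
  class = "toy_array"
)

# Hypothetical read(): the only place where `coords` can be supplied.
toy_read <- function(arr, coords = NULL) {
  tr <- arr$triplets
  if (!is.null(coords)) {
    tr <- tr[tr$soma_dim_0 %in% coords, , drop = FALSE]
  }
  structure(list(triplets = tr), class = "toy_read")
}

# Read-level coercion: works on whatever the read selected.
as.data.frame.toy_read <- function(x, ...) x$triplets

# Array-level coercion: convenient, but it can only call the read with
# its defaults, so the caller cannot subset.
as.data.frame.toy_array <- function(x, ...) as.data.frame(toy_read(x))

full   <- as.data.frame(toy_array)                        # all rows
subset_rows <- as.data.frame(toy_read(toy_array, coords = 0L))  # filtered
```

The array-level call is shorter, but only the read-level form lets the user pass `coords`.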

@mojaveazure
Member

mojaveazure commented Jun 7, 2023

> if we add them at SparseArray level we don't have access to the arguments of read() importantly coords and value_filter

That's true, and is the reason why we built as.sparse() in Seurat

One option that I'm toying with is using as.matrix() instead of methods::as() for coercion. The formals for as.matrix() are x and ..., which gives us the flexibility to add new parameters to the method definition (so long as x is first and ... are present), to give access to coords, value_filter, reindex (once that gets implemented), and repr

The odd thing would be that our as.matrix() methods would return sparse matrices from Matrix rather than standard S3 matrices. I'd argue that there's (some) precedent for this, though, as as.data.frame() on Arrow tables returns tibbles, not standard data.frames, but not enough to actually recommend this without hearing from @eddelbuettel and @aaronwolen
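
A minimal sketch of the `as.matrix()` option, assuming a hypothetical S3 class `toy_sparse_array` that stores 0-based triplets; the `coords` and `repr` parameters mirror the names mentioned above but their behavior here is simplified for illustration (`coords` filters on the first dimension only).

```r
library(Matrix)

# Hypothetical array with 0-based triplets and a fixed shape.
toy <- structure(
  list(
    i = c(0L, 0L, 1L, 2L),
    j = c(1L, 3L, 3L, 0L),
    x = c(4.97, 6.06, 4.78, 4.07),
    shape = c(3L, 4L)
  ),
  class = "toy_sparse_array"
)

# `x` comes first and `...` is present, as the generic requires; extra
# named parameters can then be added to the method.
as.matrix.toy_sparse_array <- function(x, ..., coords = NULL,
                                       repr = c("C", "T", "R")) {
  repr <- match.arg(repr)
  keep <- if (is.null(coords)) rep(TRUE, length(x$i)) else x$i %in% coords
  # Returns a sparse matrix from the Matrix package, not a base matrix.
  sparseMatrix(
    i = x$i[keep] + 1L, j = x$j[keep] + 1L, x = x$x[keep],
    dims = x$shape, repr = repr
  )
}

m_all  <- as.matrix(toy)                            # dgCMatrix by default
m_row0 <- as.matrix(toy, coords = 0L, repr = "T")   # only first row, triplet form
```

As noted in the comment, the surprising part is the return type: callers of `as.matrix()` would usually expect a dense base matrix.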

@eddelbuettel
Contributor

eddelbuettel commented Jun 7, 2023

I am mostly confused at this point. Just as I was getting used to method chaining (which we have tamed nicely now), we revert to the (to me still more familiar) converters, i.e. as(...) or as_...(). I don't actually have that much practical experience crafting elaborate class structures such as these, so I'm not sure whether there are S3 / S4 / R6 intersection pitfalls waiting for us.

(And yes: I'm not a fan of as.data.frame() returning a tibble here:

> arr <- tiledb_array("/tmp/tiledb/penguins", return_as="arrow") 
> class(as.data.frame(at))                                                                      
[1] "tbl_df"     "tbl"        "data.frame"                                                    
> 

But that is arrow and not our call. tibble::as_tibble() is more honest.)

Edit: Sorry about the close/open -- that was a fat-fingered early hit on the return key.

@eddelbuettel eddelbuettel reopened this Jun 7, 2023
@aaronwolen
Member

aaronwolen commented Jun 7, 2023

My preference would be to update the vignettes to leverage the new iterated reader in a separate PR so we can discuss those changes in isolation.

Re

> if we add them at SparseArray level we don't have access to the arguments of read() importantly coords and value_filter

That's true but I'm not sure as.data.frame(sdf$read()) provides much of a usability improvement over the status quo. Perhaps some more design discussions would be beneficial before we start adding these S3 helpers for our R6 classes.

x$tables()$concat()
}

#' Coerce \link[tiledbsoma]{SOMASparseNDArrayRead} to \link{data.frame} or \link[tibble]{tibble}
Member

We could mention the existence of these methods in the docs for SOMASparseNDArray but I don't think we should add a description here, which results in a separate documentation entry.

@pablo-gar
Member Author

pablo-gar commented Jun 12, 2023

I incorporated some of @mojaveazure's suggestions. I still want to include a few more things suggested in the comments and do some testing.

In addition, after a conversation with @aaronwolen, I will hold off on merging this PR. He has asked me to wait until a broader strategy is created for R generics across all of TileDB-SOMA.

@johnkerl johnkerl changed the title Implement as generics for SOMASparseNDArrayReader [r] Implement as generics for SOMASparseNDArrayReader Jun 15, 2023