Write DelayedArray to file on the R side. #35

LTLA · 2021-01-27T07:55:58Z

Closes #32.

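library(zellkonverter)
library(scRNAseq)
library(DelayedArray)  # provides the DelayedArray() constructor

# Wrap the counts in a DelayedArray to exercise the new writer.
sce <- ZeiselBrainData()
counts(sce) <- DelayedArray(counts(sce))

# Write the result to a temporary H5AD file.
temp <- tempfile(fileext = ".h5ad")
writeH5AD(sce, temp)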

Some points:

Need to test that it works with a DA in the first assay, in later assays, or both.
Need to test that the result is still readable by readH5AD.

Known to-dos:

The code does not respect is_sparse() being TRUE. A little bit of work is required to achieve a one-pass writer that does not rely on knowing the total number of non-zero entries.

R/write.R

LTLA · 2021-01-29T09:03:55Z

Latest commit solves the to-do above, but needs a lot of testing.

Briefly, we set up extensible HDF5 datasets and then iterate across row-wise chunks of the sparse DA. For each chunk, we extract the non-zero values, convert them to compressed sparse row format, and append them to the existing HDF5 datasets, using h5set_extent to enlarge each dataset and h5writeDataset to write the new values into the expanded space. This respects the memory constraints of block processing, as we never load all non-zero values into memory at once, while remaining a single-pass approach for rapid processing, as we do not need to know the total number of non-zeros in advance.
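The chunked append described above can be sketched in base R. This is only an illustration under stated assumptions, not the code in this PR: block_to_csr() and write_csr_in_chunks() are hypothetical names, the input is an ordinary in-memory matrix rather than a sparse DA, and growing R vectors stand in for the h5set_extent and h5writeDataset calls.

```r
# Hypothetical helper: convert one block of contiguous rows to CSR pieces.
block_to_csr <- function(block) {
    tb <- t(block)                        # transpose so that which() walks row by row
    nz <- which(tb != 0, arr.ind = TRUE)  # column 1: column of 'block', column 2: row of 'block'
    list(
        values  = tb[tb != 0],
        columns = as.integer(nz[, 1]) - 1L,               # zero-based column indices
        per_row = tabulate(nz[, 2], nbins = nrow(block))  # non-zeros in each row of the block
    )
}

# Hypothetical single-pass writer: the total number of non-zeros is never needed up front.
write_csr_in_chunks <- function(mat, chunk_rows = 2L) {
    stopifnot(nrow(mat) > 0L)
    values <- numeric(0)
    columns <- integer(0)
    pointers <- 0  # kept as double so that large totals do not overflow 32-bit integers
    for (start in seq(1L, nrow(mat), by = chunk_rows)) {
        end <- min(start + chunk_rows - 1L, nrow(mat))
        csr <- block_to_csr(mat[start:end, , drop = FALSE])
        # Stand-in for enlarging the on-disk datasets and writing into the new space.
        values <- c(values, csr$values)
        columns <- c(columns, csr$columns)
        pointers <- c(pointers, pointers[length(pointers)] + cumsum(csr$per_row))
    }
    list(values = values, columns = columns, pointers = pointers)
}

m <- matrix(c(0, 2, 0,
              1, 0, 3,
              0, 0, 0,
              4, 0, 0,
              0, 5, 6), nrow = 5, byrow = TRUE)
res <- write_csr_in_chunks(m, chunk_rows = 2L)
```

Only one block of rows is held at a time, and the running offset carried in the last pointer is the only state needed between blocks.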

The code here makes a number of assumptions:

I used values, indptrs and index for the values, column indices and pointers, respectively. I can't remember what H5AD actually calls them, and writing this now, I suspect I mixed up index and indptrs.
Column indices are assumed to be zero-based.
Pointers are stored as 64-bit unsigned integers, as the number of non-zeros can exceed ~3 billion for large datasets.
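To make the roles of the three arrays concrete, here is a small base R sketch that rebuilds a dense matrix from CSR components. As far as I know, the AnnData on-disk format names these arrays data, indices and indptr, which is what the sketch uses; csr_to_dense() itself is a hypothetical helper and not part of zellkonverter. The pointers are held as doubles here because R has no native unsigned 64-bit type and its integers stop at .Machine$integer.max.

```r
# Hypothetical helper: rebuild a dense matrix from CSR components.
# 'indices' holds zero-based column indices; 'indptr' has one entry per row plus one.
csr_to_dense <- function(data, indices, indptr, ncol) {
    nrow <- length(indptr) - 1L
    out <- matrix(0, nrow = nrow, ncol = ncol)
    for (i in seq_len(nrow)) {
        n <- indptr[i + 1L] - indptr[i]  # number of non-zeros in row i
        if (n > 0) {
            k <- indptr[i] + seq_len(n)          # one-based positions into data/indices
            out[i, indices[k] + 1L] <- data[k]   # shift zero-based columns to one-based
        }
    }
    out
}

dense <- csr_to_dense(
    data    = c(2, 1, 3, 4, 5, 6),
    indices = c(1, 0, 2, 0, 1, 2),
    indptr  = c(0, 1, 3, 3, 4, 6),
    ncol    = 3
)

# R integers are 32-bit signed, so offsets beyond this limit need another type.
int_limit <- .Machine$integer.max
```

Swapping indices and indptr would usually show up here as a pointer vector whose length is not the number of rows plus one.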

lazappi · 2021-01-30T15:12:03Z

Added a couple of tests and fixed how the loop for rewriting matrices handled paths.

LTLA · 2021-01-30T19:40:20Z

Hit a roadblock in the form of grimbough/rhdf5#79; the AnnData reader doesn't like the fixed-width byte strings that rhdf5 emits.

For the time being, I suggest we just add a clause to the initial check for DAs so that we do not skip a DA if is_sparse() is TRUE. This implies that it will be realized into memory in SCE2AnnData; hopefully this is tolerated on the user's machine. We can switch back to the current block processing behavior once the rhdf5 issue is resolved.

LTLA · 2021-02-12T09:32:38Z

@lazappi were you going to work on this?

lazappi · 2021-02-12T09:39:46Z

Sorry I've lost track of it a bit. Is it just the is_sparse() check that we need to add?

LTLA · 2021-02-12T17:30:40Z

Yes, I think line 60 in my PR should only skip the assay if it's a DA and !is_sparse(). Sparse DAs should be coerced to in-memory dgCMatrix objects for the time being.
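The intended branching can be written as a small truth table. This is a sketch only: assay_strategy() and its return labels are hypothetical, and in the real code the two flags would presumably come from is(x, "DelayedArray") and is_sparse(x) on each assay.

```r
# Hypothetical decision helper mirroring the check discussed above.
assay_strategy <- function(is_delayed, sparse) {
    if (is_delayed && !sparse) {
        "skip_and_write_from_R"  # dense DA: skipped in the conversion, written block-wise later
    } else if (is_delayed && sparse) {
        "realize_as_dgCMatrix"   # sparse DA: coerced to an in-memory sparse matrix for now
    } else {
        "convert_as_usual"       # ordinary in-memory assay
    }
}

strategies <- c(
    dense_da  = assay_strategy(TRUE, FALSE),
    sparse_da = assay_strategy(TRUE, TRUE),
    in_memory = assay_strategy(FALSE, FALSE)
)
```

Only the dense DA case is skipped; the sparse DA case falls back to realization until the rhdf5 issue is resolved.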

Avoid grimbough/rhdf5#79

lazappi · 2021-02-16T15:33:55Z

Ok, I've added that check. The code for writing sparse DelayedArrays can't be reached by anything now, which made the test coverage drop, but that's fine; it's better to have it there for when the rhdf5 issue is fixed. Anything else to add?

LTLA · 2021-02-16T15:53:53Z

Think that's it, just slap on some # nocov start and ends around it for the time being, I guess.
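For reference, covr excludes everything between a pair of # nocov start and # nocov end comments from coverage. A minimal sketch of the markers follows, using a hypothetical placeholder function rather than the actual sparse writer in R/write.R:

```r
# nocov start
# Hypothetical placeholder for a writer that is currently unreachable.
write_sparse_da_placeholder <- function(...) {
    stop("sparse DelayedArray writer is disabled until the rhdf5 issue is resolved")
}
# nocov end

# Code outside the markers is still counted by covr as usual.
covered_helper <- function(x) x + 1
```

The markers are plain comments, so they have no effect on how the code runs.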

Write DelayedArray to file on the R side.

2cc6f41

lazappi reviewed Jan 28, 2021


Support single-pass writing of a sparse DA to HDF5.

38c007f

LTLA and others added 4 commits January 30, 2021 01:18

Edge closer towards correct writing of sparse matrices.

3debab3

Fix layer path when deleting DelayedArrays in writeH5AD

37fe0ee

Add DelayedArray tests for writeH5AD

dd7891d

Import DelayedArray::type()

58383f3

lazappi added 2 commits February 16, 2021 14:20

Avoid skipping sparse DelayedArrays in writeH5AD()

271383b

Avoid grimbough/rhdf5#79

Add tests for sparse DelayedArrays

57fc36f

lazappi added 3 commits February 18, 2021 11:43

Skip code coverage for currently unused writers

c6d4a36

Document handling of DelayedArrays

f502788

Bump version and update NEWS

37e74f7

lazappi merged commit 62ff57f into theislab:master Feb 18, 2021
