Zarr Deprecation Notice June 2020 |
---|
The Zarr output has been deprecated and is no longer generated by skylab pipelines as of June 2020. |
The version of the ZARR output can be identified by examining the optimus_output_schema_version
attribute of the root node. The file schema should be versioned using semantic versioning.
The secondary analysis pipelines shall write expression matrices and associated tabular metadata using the zarr version 2 DirectoryStore format. The format is specified in more detail here.
The top level entity contains all the data arrays and has two attributes:
- sample_id: The sample identifier
- optimus_output_schema_version: A string with the version identifier of the ZARR output schema
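As a rough sketch of how a reader might check these attributes: in the zarr version 2 directory store layout, a group's attributes are kept in a JSON file named .zattrs inside the group directory. The example below reads that file with the Python standard library only and splits the version string into numeric parts. It assumes a plain MAJOR.MINOR.PATCH version with no suffix; the function names are hypothetical and not part of any pipeline code.

```python
import json
from pathlib import Path


def read_root_attributes(store_path):
    """Return the root group's attributes from a zarr v2 directory store."""
    # In the v2 directory layout, group attributes live in a ".zattrs" JSON file.
    with open(Path(store_path) / ".zattrs") as handle:
        return json.load(handle)


def parse_schema_version(version):
    """Split a 'MAJOR.MINOR.PATCH' string into a tuple of ints.

    Assumption: a plain three-part version with no pre-release or build suffix.
    """
    major, minor, patch = (int(part) for part in version.split("."))
    return major, minor, patch


def schema_version(store_path):
    """Read optimus_output_schema_version from the root group and parse it."""
    attrs = read_root_attributes(store_path)
    return parse_schema_version(attrs["optimus_output_schema_version"])
```

Because the result is a tuple of integers, versions can be compared directly, for example `schema_version(path) >= (1, 0, 0)`.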
Data are stored in the following chunked and compressed arrays:
- gene_metadata_numeric_name ("<U80" # little-endian 80 char unicode)
- gene_metadata_numeric (np.float32)
- cell_metadata_float_name ("<U80" # little-endian 80 char unicode)
- cell_metadata_float (np.float32)
- cell_metadata_bool_name ("<U80" # little-endian 80 char unicode)
- cell_metadata_bool (np.bool)
- cell_id ("<U80")
- gene_id ("<U80")
- gene_metadata_string_name ("<U80")
- gene_metadata_string ("<U80")
- expression (np.float32)
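The dtypes listed above can be checked against a store without loading any data, since each zarr version 2 array directory holds a .zarray JSON file with a "dtype" entry in NumPy typestr form. The following is a minimal sketch under that assumption; the `EXPECTED_DTYPES` table and the `dtype_mismatches` helper are illustrative names, and the typestr values ("<f4" for np.float32, "|b1" for np.bool) are the usual little-endian encodings rather than something this document specifies.

```python
import json
from pathlib import Path

# Expected zarr v2 "dtype" strings (NumPy typestr form) for each array listed above.
# "<f4" is little-endian float32 and "|b1" is a one-byte boolean.
EXPECTED_DTYPES = {
    "gene_metadata_numeric_name": "<U80",
    "gene_metadata_numeric": "<f4",
    "cell_metadata_float_name": "<U80",
    "cell_metadata_float": "<f4",
    "cell_metadata_bool_name": "<U80",
    "cell_metadata_bool": "|b1",
    "cell_id": "<U80",
    "gene_id": "<U80",
    "gene_metadata_string_name": "<U80",
    "gene_metadata_string": "<U80",
    "expression": "<f4",
}


def dtype_mismatches(store_path):
    """Return {array_name: (expected, found)} for arrays whose dtype differs.

    An array whose .zarray file is missing is reported with found=None.
    """
    problems = {}
    for name, expected in EXPECTED_DTYPES.items():
        meta_file = Path(store_path) / name / ".zarray"
        if not meta_file.is_file():
            problems[name] = (expected, None)
            continue
        found = json.loads(meta_file.read_text())["dtype"]
        if found != expected:
            problems[name] = (expected, found)
    return problems
```

An empty dictionary means every listed array is present with the expected dtype.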
The following documentation outlines the format of the ZARR output before versioning started.
The secondary analysis pipelines shall write expression matrices and associated tabular metadata using the zarr version 2 directory store format. The format is specified more precisely here, but in general zarr stores contain groups and arrays. Groups can contain other groups and arrays. Arrays are stored chunked and compressed. The file names and directory structure of the zarr store convey the group and chunk structure.
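To make the directory convention concrete, here is a small sketch that walks a directory store and labels each directory as a group or an array. It relies only on the zarr version 2 convention that a group directory contains a .zgroup file and an array directory contains a .zarray file; the function name `describe_store` is hypothetical.

```python
import os


def describe_store(store_path):
    """Map each relative path in a zarr v2 directory store to 'group' or 'array'.

    The store root itself is reported under the key ".".
    Directories holding neither marker file are skipped.
    """
    nodes = {}
    for dirpath, _dirnames, filenames in os.walk(store_path):
        rel = os.path.relpath(dirpath, store_path)
        if ".zgroup" in filenames:
            nodes[rel] = "group"
        elif ".zarray" in filenames:
            nodes[rel] = "array"
    return nodes
```

The remaining files in an array directory are its chunks, one file per chunk.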
Expression values and metadata are stored in eleven arrays. These arrays are stored within a single zarr group named {unique_id}.zarr, where {unique_id} is a string that uniquely identifies this group. This could be a UUID generated by the DCP or a user-submitted identifier.
- expression - The expression values themselves. Rows correspond to cells and columns to genes. These must represent "raw" expression values, though the precise definition of those values may vary by experiment type. Any normalization is deferred, but metadata that enables normalization should be included in the metadata arrays defined below.
  - shape: (n_cells, n_genes)
  - compressor: Blosc(cname='lz4', clevel=5, shuffle=SHUFFLE, blocksize=0)
  - chunk shape: (10000, 10000)
  - dtype: float32
- cell_id - Identifiers for each cell. The means of identifying a cell varies by experiment type. For example, in droplet data this is a barcode; in well-based assays it could be a well or cell suspension identifier. The ith element of this array identifies the ith row of the expression array. Elements in this array must be unique.
  - shape: (n_cells,)
  - compressor: Blosc(cname='lz4', clevel=5, shuffle=SHUFFLE, blocksize=0)
  - chunk shape: (10000,)
  - dtype: U40
- gene_id - Identifiers for each gene. The ith element of this array identifies the gene at the ith column of the expression array. The gene_id array must be identical for all expression data produced by HCA secondary analysis pipelines. Elements in this array must be unique.
  - shape: (n_genes,)
  - compressor: Blosc(cname='lz4', clevel=5, shuffle=SHUFFLE, blocksize=0)
  - chunk shape: (10000,)
  - dtype: U40
- cell_metadata_numeric - Numeric metadata values associated with each cell. These vary by experiment type but should be as consistent as possible. These include quality control values as well as other computed values that aid in cell filtering and selection.
  - shape: (n_cells, n_cell_metadata_numeric_values)
  - compressor: Blosc(cname='lz4', clevel=5, shuffle=SHUFFLE, blocksize=0)
  - chunk shape: (10000, n_cell_metadata_numeric_values)
  - dtype: float32
- cell_metadata_string - String metadata values associated with each cell. As with cell_metadata_numeric, these will vary by experiment type.
  - shape: (n_cells, n_cell_metadata_string_values)
  - compressor: Blosc(cname='lz4', clevel=5, shuffle=SHUFFLE, blocksize=0)
  - chunk shape: (10000, n_cell_metadata_string_values)
  - dtype: U40
- gene_metadata_numeric - Numeric metadata values associated with each gene. Like cell_metadata_numeric, these can vary by experiment type, but care should be taken to maintain consistency.
  - shape: (n_gene_metadata_numeric_values, n_genes)
  - compressor: Blosc(cname='lz4', clevel=5, shuffle=SHUFFLE, blocksize=0)
  - chunk shape: (n_gene_metadata_numeric_values, n_genes)
  - dtype: float32
- gene_metadata_string - String metadata values associated with each gene.
  - shape: (n_gene_metadata_string_values, n_genes)
  - compressor: Blosc(cname='lz4', clevel=5, shuffle=SHUFFLE, blocksize=0)
  - chunk shape: (n_gene_metadata_string_values, n_genes)
  - dtype: U40
- cell_metadata_numeric_name - Field names for cell_metadata_numeric. Elements in this array must be unique.
  - shape: (n_cell_metadata_numeric_values,)
  - compressor: Blosc(cname='lz4', clevel=5, shuffle=SHUFFLE, blocksize=0)
  - chunk shape: (n_cell_metadata_numeric_values,)
  - dtype: U40
- cell_metadata_string_name - Field names for cell_metadata_string. Elements in this array must be unique.
  - shape: (n_cell_metadata_string_values,)
  - compressor: Blosc(cname='lz4', clevel=5, shuffle=SHUFFLE, blocksize=0)
  - chunk shape: (n_cell_metadata_string_values,)
  - dtype: U40
- gene_metadata_numeric_name - Field names for gene_metadata_numeric. Elements in this array must be unique.
  - shape: (n_gene_metadata_numeric_values,)
  - compressor: Blosc(cname='lz4', clevel=5, shuffle=SHUFFLE, blocksize=0)
  - chunk shape: (n_gene_metadata_numeric_values,)
  - dtype: U40
- gene_metadata_string_name - Field names for gene_metadata_string. Elements in this array must be unique.
  - shape: (n_gene_metadata_string_values,)
  - compressor: Blosc(cname='lz4', clevel=5, shuffle=SHUFFLE, blocksize=0)
  - chunk shape: (n_gene_metadata_string_values,)
  - dtype: U40
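The shapes above are tied together: cell-indexed arrays share their first dimension with the rows of expression, gene-indexed arrays share their last dimension with its columns, and each *_name array has one entry per metadata field. The sketch below checks those relationships, plus identifier uniqueness, on plain Python values. It is an illustration only; `check_consistency` is a hypothetical helper, and the caller is assumed to have already read the shapes and identifiers from the store.

```python
def check_consistency(shapes, cell_ids, gene_ids):
    """Return a list of problems found; an empty list means none were found.

    shapes: dict mapping each array name to its shape (tuple or list of ints).
    cell_ids, gene_ids: sequences of identifier strings.
    """
    problems = []
    n_cells, n_genes = shapes["expression"]

    # Cell-indexed arrays share their first dimension with the expression rows.
    for name in ("cell_id", "cell_metadata_numeric", "cell_metadata_string"):
        if shapes[name][0] != n_cells:
            problems.append(f"{name}: expected {n_cells} cells, found {shapes[name][0]}")

    # gene_id is 1-D; the gene metadata arrays carry genes on their last dimension.
    for name in ("gene_id", "gene_metadata_numeric", "gene_metadata_string"):
        if shapes[name][-1] != n_genes:
            problems.append(f"{name}: expected {n_genes} genes, found {shapes[name][-1]}")

    # Each *_name array has one entry per metadata field of its companion array.
    field_axis = {
        "cell_metadata_numeric": 1,
        "cell_metadata_string": 1,
        "gene_metadata_numeric": 0,
        "gene_metadata_string": 0,
    }
    for name, axis in field_axis.items():
        if tuple(shapes[name + "_name"]) != (shapes[name][axis],):
            problems.append(f"{name}_name: length does not match {name}")

    # Identifiers must be unique.
    for label, ids in (("cell_id", cell_ids), ("gene_id", gene_ids)):
        if len(set(ids)) != len(ids):
            problems.append(f"{label}: duplicate identifiers")

    return problems
```

The same uniqueness check would apply to the four *_name arrays; it is left out here to keep the sketch short.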
To illustrate how the zarr directory store manifests in an HCA DCP bundle, consider a bundle with 25,000 cells and no more than 10,000 genes. Cell-indexed arrays then span three chunks along their first dimension, and gene-indexed arrays fit in a single chunk. That bundle should have files with the following names:
expression_matrix/.zgroup
expression_matrix/expression/.zarray
expression_matrix/expression/0.0
expression_matrix/expression/1.0
expression_matrix/expression/2.0
expression_matrix/cell_id/.zarray
expression_matrix/cell_id/0
expression_matrix/cell_id/1
expression_matrix/cell_id/2
expression_matrix/gene_id/.zarray
expression_matrix/gene_id/0
expression_matrix/cell_metadata_numeric/.zarray
expression_matrix/cell_metadata_numeric/0.0
expression_matrix/cell_metadata_numeric/1.0
expression_matrix/cell_metadata_numeric/2.0
expression_matrix/cell_metadata_string/.zarray
expression_matrix/cell_metadata_string/0.0
expression_matrix/cell_metadata_string/1.0
expression_matrix/cell_metadata_string/2.0
expression_matrix/gene_metadata_string/.zarray
expression_matrix/gene_metadata_string/0.0
expression_matrix/gene_metadata_numeric/.zarray
expression_matrix/gene_metadata_numeric/0.0
expression_matrix/cell_metadata_numeric_name/.zarray
expression_matrix/cell_metadata_numeric_name/0
expression_matrix/cell_metadata_string_name/.zarray
expression_matrix/cell_metadata_string_name/0
expression_matrix/gene_metadata_numeric_name/.zarray
expression_matrix/gene_metadata_numeric_name/0
expression_matrix/gene_metadata_string_name/.zarray
expression_matrix/gene_metadata_string_name/0
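The chunk file names in a listing like this follow from each array's shape and chunk shape: in the zarr version 2 directory layout, a chunk is stored under its per-dimension chunk indices joined by ".". The sketch below derives the expected names under that convention. The helper names are hypothetical, the default group name is simply the one used in the listing, and a real store may hold fewer chunk files if some chunks were never written.

```python
import itertools


def chunk_keys(shape, chunks):
    """List zarr v2 chunk keys for an array: chunk indices joined by '.'."""
    # Number of chunks along each dimension, rounding up for a partial last chunk.
    counts = [-(-size // chunk) for size, chunk in zip(shape, chunks)]
    grid = itertools.product(*(range(count) for count in counts))
    return [".".join(str(i) for i in index) for index in grid]


def expected_files(array_name, shape, chunks, group="expression_matrix"):
    """List the metadata file and chunk files expected for one array."""
    files = [f"{group}/{array_name}/.zarray"]
    files += [f"{group}/{array_name}/{key}" for key in chunk_keys(shape, chunks)]
    return files


# cell_id for 25,000 cells with a chunk shape of (10000,):
for path in expected_files("cell_id", (25000,), (10000,)):
    print(path)
```

For cell_id this prints the .zarray file followed by chunks 0, 1 and 2, since 25,000 cells need three chunks of up to 10,000 elements each.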