matrix_format_spec.md

Zarr Deprecation Notice June 2020
The Zarr output has been deprecated and is no longer generated by skylab pipelines as of June 2020.

Version 1.0

The version of the ZARR output can be identified by examining the optimus_output_schema_version attribute of the root node. The file schema should follow semantic versioning.

Format

The secondary analysis pipelines shall write expression matrices and associated tabular metadata using the zarr version 2 DirectoryStore format. The format is specified in more detail here.

The top level entity contains all the data arrays and has two attributes:

  • sample_id: The sample identifier
  • optimus_output_schema_version: A string with the version identifier of the ZARR output schema
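As a minimal sketch of inspecting these attributes without a Zarr library: in the zarr version 2 directory layout, a group's user attributes are stored as JSON in a `.zattrs` file inside the group directory. The function names below are hypothetical, and the parsing assumes a plain dotted version string such as `1.0.0`.

```python
import json
from pathlib import Path


def read_root_attrs(store_path):
    """Return the user attributes of the root group of a zarr v2 directory store."""
    attrs_file = Path(store_path) / ".zattrs"
    with attrs_file.open() as fh:
        return json.load(fh)


def parse_schema_version(version):
    """Split a dotted version string such as '1.0.0' into a tuple of ints."""
    return tuple(int(part) for part in version.split("."))


def major_schema_version(store_path):
    """Return the major component of optimus_output_schema_version."""
    attrs = read_root_attrs(store_path)
    return parse_schema_version(attrs["optimus_output_schema_version"])[0]
```

A reader could branch on the major version to decide which array names and dtypes to expect.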

Data are stored in the following chunked and compressed arrays:

  • gene_metadata_numeric_name ("<U80" # little-endian 80 char unicode)
  • gene_metadata_numeric (np.float32)
  • cell_metadata_float_name ("<U80" # little-endian 80 char unicode)
  • cell_metadata_float (np.float32)
  • cell_metadata_bool_name ("<U80" # little-endian 80 char unicode)
  • cell_metadata_bool (np.bool)
  • cell_id ("<U80")
  • gene_id ("<U80")
  • gene_metadata_string_name ("<U80")
  • gene_metadata_string ("<U80")
  • expression (np.float32)

Pre-version 1.

The following documentation outlines the format of the ZARR output before versioning started.

Format

The secondary analysis pipelines shall write expression matrices and associated tabular metadata using the zarr version 2 directory store format. The format is specified more precisely here, but in general zarr stores contain groups and arrays. Groups can contain other groups and arrays. Arrays are stored chunked and compressed. The file names and directory structure of the zarr store convey the group and chunk structure.
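To make the directory convention concrete, here is a small sketch that walks a directory store using only the standard library. It relies on the zarr version 2 convention that a group directory contains a `.zgroup` file and an array directory contains a `.zarray` file; the function name `list_nodes` is hypothetical.

```python
import os


def list_nodes(store_path):
    """Walk a zarr v2 directory store and classify each directory.

    A directory holding a `.zgroup` file is a group; one holding a
    `.zarray` file is an array. Paths are relative to the store root,
    with the root itself reported as "/".
    """
    groups, arrays = [], []
    for dirpath, _dirnames, filenames in os.walk(store_path):
        rel = os.path.relpath(dirpath, store_path)
        name = "/" if rel == "." else rel.replace(os.sep, "/")
        if ".zgroup" in filenames:
            groups.append(name)
        if ".zarray" in filenames:
            arrays.append(name)
    return sorted(groups), sorted(arrays)
```

Every other file inside an array directory is a chunk, named after its position in the chunk grid.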

Groups and Arrays

Expression values and metadata are stored in eleven arrays. These arrays are stored within a single zarr group named {unique_id}.zarr, where {unique_id} is a string that uniquely identifies this array. This could be a UUID generated by the DCP or a user-submitted identifier.

  • expression - The expression values themselves. Rows correspond to cells and columns to genes. These must represent "raw" expression values, though the precise definition of those values may vary by experiment type. Any normalization is deferred, but metadata that enables normalization should be included in the metadata arrays defined below.

    • shape: (n_cells, n_genes)
    • compressor: Blosc(cname='lz4', clevel=5, shuffle=SHUFFLE, blocksize=0)
    • chunk shape: (10000, 10000)
    • dtype: float32
  • cell_id - Identifiers for each cell. The means of identifying a cell varies by experiment type. For example, in droplet data this is a barcode; in well-based assays it could be a well or cell suspension identifier. The ith element of this array identifies the ith row of the expression array. Elements in this array must be unique.

    • shape: (n_cells,)
    • compressor: Blosc(cname='lz4', clevel=5, shuffle=SHUFFLE, blocksize=0)
    • chunk shape: (10000,)
    • dtype: U40
  • gene_id - Identifiers for each gene. The ith element of this array identifies the gene at the ith column of the expression array. The gene_id array must be identical for all expression data produced by HCA secondary analysis pipelines. Elements in this array must be unique.

    • shape: (n_genes,)
    • compressor: Blosc(cname='lz4', clevel=5, shuffle=SHUFFLE, blocksize=0)
    • chunk shape: (10000,)
    • dtype: U40
  • cell_metadata_numeric - Numeric metadata values associated with each cell. These vary by experiment type but should be as consistent as possible. These include quality control values as well as other computed values that aid in cell filtering and selection.

    • shape: (n_cells, n_cell_metadata_numeric_values)
    • compressor: Blosc(cname='lz4', clevel=5, shuffle=SHUFFLE, blocksize=0)
    • chunk shape: (10000, n_cell_metadata_numeric_values)
    • dtype: float32
  • cell_metadata_string - String metadata values associated with each cell. As with cell_metadata_numeric, these will vary by experiment type.

    • shape: (n_cells, n_cell_metadata_string_values)
    • compressor: Blosc(cname='lz4', clevel=5, shuffle=SHUFFLE, blocksize=0)
    • chunk shape: (10000, n_cell_metadata_string_values)
    • dtype: U40
  • gene_metadata_numeric - Numeric metadata values associated with each gene. Like cell_metadata_numeric these can vary by experiment type, but care should be taken to maintain consistency.

    • shape: (n_gene_metadata_numeric_values, n_genes)
    • compressor: Blosc(cname='lz4', clevel=5, shuffle=SHUFFLE, blocksize=0)
    • chunk shape: (n_gene_metadata_numeric_values, n_genes)
    • dtype: float32
  • gene_metadata_string - String metadata values associated with each gene.

    • shape: (n_gene_metadata_string_values, n_genes)
    • compressor: Blosc(cname='lz4', clevel=5, shuffle=SHUFFLE, blocksize=0)
    • chunk shape: (n_gene_metadata_string_values, n_genes)
    • dtype: U40
  • cell_metadata_numeric_name - Field names for cell_metadata_numeric. Elements in this array must be unique.

    • shape: (n_cell_metadata_numeric_values,)
    • compressor: Blosc(cname='lz4', clevel=5, shuffle=SHUFFLE, blocksize=0)
    • chunk shape: (n_cell_metadata_numeric_values,)
    • dtype: U40
  • cell_metadata_string_name - Field names for cell_metadata_string. Elements in this array must be unique.

    • shape: (n_cell_metadata_string_values,)
    • compressor: Blosc(cname='lz4', clevel=5, shuffle=SHUFFLE, blocksize=0)
    • chunk shape: (n_cell_metadata_string_values,)
    • dtype: U40
  • gene_metadata_numeric_name - Field names for gene_metadata_numeric. Elements in this array must be unique.

    • shape: (n_gene_metadata_numeric_values,)
    • compressor: Blosc(cname='lz4', clevel=5, shuffle=SHUFFLE, blocksize=0)
    • chunk shape: (n_gene_metadata_numeric_values,)
    • dtype: U40
  • gene_metadata_string_name - Field names for gene_metadata_string. Elements in this array must be unique.

    • shape: (n_gene_metadata_string_values,)
    • compressor: Blosc(cname='lz4', clevel=5, shuffle=SHUFFLE, blocksize=0)
    • chunk shape: (n_gene_metadata_string_values,)
    • dtype: U40
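The shapes listed above are tied together by a handful of shared dimensions. The sketch below derives the expected shape of each of the eleven arrays from those dimensions and reports any mismatch. The function names and the plain-dict input are assumptions made for illustration; this is not part of the pipeline code.

```python
def expected_shapes(n_cells, n_genes, n_cell_num, n_cell_str, n_gene_num, n_gene_str):
    """Return {array_name: shape} for the eleven arrays described above."""
    return {
        "expression": (n_cells, n_genes),
        "cell_id": (n_cells,),
        "gene_id": (n_genes,),
        "cell_metadata_numeric": (n_cells, n_cell_num),
        "cell_metadata_string": (n_cells, n_cell_str),
        "gene_metadata_numeric": (n_gene_num, n_genes),
        "gene_metadata_string": (n_gene_str, n_genes),
        "cell_metadata_numeric_name": (n_cell_num,),
        "cell_metadata_string_name": (n_cell_str,),
        "gene_metadata_numeric_name": (n_gene_num,),
        "gene_metadata_string_name": (n_gene_str,),
    }


def check_shapes(shapes):
    """Return a list of inconsistencies in {array_name: shape}; empty if none.

    Assumes `expression` and the four *_name arrays are present; the
    shared dimensions are read from them and every array is compared
    against the shape those dimensions imply.
    """
    n_cells, n_genes = shapes["expression"]
    expected = expected_shapes(
        n_cells,
        n_genes,
        shapes["cell_metadata_numeric_name"][0],
        shapes["cell_metadata_string_name"][0],
        shapes["gene_metadata_numeric_name"][0],
        shapes["gene_metadata_string_name"][0],
    )
    return [
        f"{name}: expected {shape}, found {shapes.get(name)}"
        for name, shape in expected.items()
        if shapes.get(name) != shape
    ]
```

Note that cell metadata arrays put cells on the first axis while gene metadata arrays put genes on the second axis, mirroring the row and column layout of `expression`.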

Example Bundle Structure

To illustrate how the zarr directory store manifests in an HCA DCP bundle, consider a bundle with 25,000 cells. That bundle should have files with the following names:

expression_matrix/.zgroup
expression_matrix/expression/.zarray
expression_matrix/expression/0.0
expression_matrix/expression/1.0
expression_matrix/expression/2.0
expression_matrix/cell_id/.zarray
expression_matrix/cell_id/0
expression_matrix/cell_id/1
expression_matrix/cell_id/2
expression_matrix/gene_id/.zarray
expression_matrix/gene_id/0
expression_matrix/cell_metadata_numeric/.zarray
expression_matrix/cell_metadata_numeric/0.0
expression_matrix/cell_metadata_numeric/1.0
expression_matrix/cell_metadata_numeric/2.0
expression_matrix/cell_metadata_string/.zarray
expression_matrix/cell_metadata_string/0.0
expression_matrix/cell_metadata_string/1.0
expression_matrix/cell_metadata_string/2.0
expression_matrix/gene_metadata_string/.zarray
expression_matrix/gene_metadata_string/0.0
expression_matrix/gene_metadata_numeric/.zarray
expression_matrix/gene_metadata_numeric/0.0
expression_matrix/cell_metadata_numeric_name/.zarray
expression_matrix/cell_metadata_numeric_name/0
expression_matrix/cell_metadata_string_name/.zarray
expression_matrix/cell_metadata_string_name/0
expression_matrix/gene_metadata_numeric_name/.zarray
expression_matrix/gene_metadata_numeric_name/0
expression_matrix/gene_metadata_string_name/.zarray
expression_matrix/gene_metadata_string_name/0
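
Chunk file names follow from the array shape and chunk shape. In a zarr version 2 store with the default `.` separator, each chunk is named by its index along every dimension, first dimension first. The following is a rough sketch of that rule; the helper name `chunk_keys` is hypothetical.

```python
import itertools
import math


def chunk_keys(shape, chunks):
    """List zarr v2 chunk keys for an array of `shape` stored in `chunks`.

    The number of chunks along each dimension is ceil(dim / chunk), and
    a key joins the per-dimension chunk indices with ".".
    """
    counts = [math.ceil(dim / chunk) for dim, chunk in zip(shape, chunks)]
    grid = itertools.product(*(range(count) for count in counts))
    return [".".join(str(i) for i in index) for index in grid]
```

For example, `chunk_keys((25000,), (10000,))` yields the three `cell_id` chunk files for a bundle of 25,000 cells.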