Lk remove zarr readme (#350)
Updated Optimus documentation for Zarr removal
ekiernan authored Jun 9, 2020
1 parent 05c4505 commit f9edfe4
Showing 5 changed files with 27 additions and 28 deletions.
16 changes: 10 additions & 6 deletions docs/matrix_format_spec.md
@@ -1,7 +1,11 @@
# Version 1.0
| Zarr Deprecation Notice June 2020 |
| --- |
| The Zarr output has been deprecated and is no longer generated by skylab pipelines as of June 2020. |

### Version 1.0
The version of the ZARR output can be identified by examining the `optimus_output_schema_version` attribute of the root node. Semantic versioning of the file schema should be used.
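
As a rough illustration of the version check described above, the sketch below reads the root attributes without depending on the zarr package. It relies on the zarr version 2 convention that a group's user attributes live in a JSON file named `.zattrs` inside the group directory. The function name `read_schema_version` is hypothetical, and returning `None` for a store without the attribute (for example, one written before versioning started) is an assumption, not documented pipeline behavior.

```python
import json
from pathlib import Path
from typing import Optional


def read_schema_version(store_path: str) -> Optional[str]:
    """Return optimus_output_schema_version from the root .zattrs, or None.

    Hypothetical helper: assumes store_path is a zarr v2 DirectoryStore root.
    """
    attrs_file = Path(store_path) / ".zattrs"
    if not attrs_file.is_file():
        # No user attributes are stored on the root group.
        return None
    with attrs_file.open(encoding="utf-8") as handle:
        attrs = json.load(handle)
    version = attrs.get("optimus_output_schema_version")
    return None if version is None else str(version)
```

A caller could treat a `None` result as "pre-version 1" output, though that interpretation should be confirmed against real files.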

## Format
#### Format

The secondary analysis pipelines shall write expression matrices and associated tabular metadata using the zarr version 2 DirectoryStore format. The format is specified in more detail [here](https://zarr.readthedocs.io/en/stable/spec/v2.html).

@@ -24,15 +28,15 @@ Data are stored in the following chunked and compressed arrays:
- gene_metadata_string ("<U80")
- expression (dtype=np.float32)

# Pre-version 1.
### Pre-version 1.
The following documentation outlines the format of the ZARR output before versioning started.


## Format
#### Format

The secondary analysis pipelines shall write expression matrices and associated tabular metadata using the zarr version 2 directory store format. The format is specified more precisely [here](https://zarr.readthedocs.io/en/stable/spec/v2.html), but in general zarr stores contain groups and arrays. Groups can contain other groups and arrays. Arrays are stored chunked and compressed. The file names and directory structure of the zarr store convey the group and chunk structure.
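
To make the directory layout more concrete, here is a minimal sketch that inspects a directory store using only the standard library. It relies on two zarr version 2 conventions: each array directory holds a `.zarray` JSON metadata file with `dtype`, `shape`, and `chunks` keys, and chunk files are named by their chunk indices joined with `.` (the default separator). The function names `list_arrays` and `chunk_keys` are hypothetical and not part of any pipeline code.

```python
import itertools
import json
import math
from pathlib import Path
from typing import Dict, List, Sequence, Tuple


def list_arrays(store_path: str) -> Dict[str, Tuple[object, tuple, tuple]]:
    """Map each array's path inside the store to (dtype, shape, chunks)."""
    root = Path(store_path)
    arrays = {}
    for meta_file in sorted(root.rglob(".zarray")):
        with meta_file.open(encoding="utf-8") as handle:
            meta = json.load(handle)
        # The array name is the directory that contains the .zarray file.
        name = meta_file.parent.relative_to(root).as_posix()
        arrays[name] = (meta["dtype"], tuple(meta["shape"]), tuple(meta["chunks"]))
    return arrays


def chunk_keys(shape: Sequence[int], chunks: Sequence[int]) -> List[str]:
    """List the chunk file names expected for an array, e.g. '0.0', '1.0'."""
    # Number of chunks along each dimension, rounding up for a partial last chunk.
    counts = [math.ceil(n / c) for n, c in zip(shape, chunks)]
    grid = itertools.product(*(range(k) for k in counts))
    return [".".join(str(i) for i in index) for index in grid]
```

Comparing `chunk_keys(shape, chunks)` against the files actually present in an array directory is one way to spot missing chunks, keeping in mind that zarr may omit chunks that contain only the fill value.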

## Groups and Arrays
### Groups and Arrays

Expression values and metadata are stored in eleven arrays. These arrays are stored within a single zarr group named `{unique_id}.zarr`, where `{unique_id}` is a string that uniquely identifies this array. This could be a UUID generated by the DCP or a user-submitted identifier.

@@ -103,7 +107,7 @@ Expression values and metadata are stored in eleven arrays. These arrays are sto
- chunk shape: `(n_gene_metadata_string.values,)`
- dtype: `U40`

# Example Bundle Structure
### Example Bundle Structure

To illustrate how the zarr directory store manifests in an HCA DCP bundle,
consider a bundle with 25,000 cells. That bundle should have files with the
11 changes: 5 additions & 6 deletions pipelines/optimus/Loom_schema.md
@@ -1,6 +1,6 @@
# What's in the Optimus Pipeline Loom File?

The Loom file is an HDF5 file generated using [Loompy v.2.0.17](http://loompy.org/). It contains global attributes detailing how counts were generated for the single-cell or single-nuclei parameters ([Table 1](#table-1-global-attributes)). It additionally contains UMI-corrected counts as well as multiple metrics for both individual cells (the columns of the matrix; [Table 2](#table-2-column-attributes-cell-metrics)) and individual genes (the rows of the matrix; [Table 3](#table-3-row-attributes-gene-metrics)). The tables below document these metrics, list which tools generate them, and define them. This Loom file is an optional output of the Optimus pipeline. The default matrix output of the Optimus pipeline is a Zarr Array. The Loom file is directly derived from the Zarr and contains the same information with only minor header updates for schema compatibility.
The Loom file is an HDF5 file generated using [Loompy v.3.0.6](http://loompy.org/). It contains global attributes detailing how counts were generated for the single-cell or single-nuclei parameters ([Table 1](#table-1-global-attributes)). It additionally contains UMI-corrected counts as well as multiple metrics for both individual cells (the columns of the matrix; [Table 2](#table-2-column-attributes-cell-metrics)) and individual genes (the rows of the matrix; [Table 3](#table-3-row-attributes-gene-metrics)). The tables below document these metrics, list which tools generate them, and define them. This Loom file is the default matrix output of the Optimus pipeline.
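
For readers who want to inspect such a file, the following is a small sketch using loompy's connection object, which exposes global attributes as `attrs`, column attributes as `ca`, and row attributes as `ra`. The helper names `summarize_loom` and `summarize_loom_file` are hypothetical, and the file path is whatever Loom output you have on hand. The loompy import is deferred so the summary logic can be exercised without loompy installed.

```python
from typing import Any, Dict


def summarize_loom(ds: Any) -> Dict[str, Any]:
    """Summarize an open Loom connection (any object with shape, attrs, ca, ra)."""
    # Rows of the matrix are genes and columns are cells.
    n_genes, n_cells = ds.shape
    return {
        "n_genes": n_genes,
        "n_cells": n_cells,
        "global_attributes": sorted(ds.attrs.keys()),
        "column_attributes": sorted(ds.ca.keys()),
        "row_attributes": sorted(ds.ra.keys()),
    }


def summarize_loom_file(path: str) -> Dict[str, Any]:
    """Open a Loom file read-only and summarize it."""
    # Imported lazily so summarize_loom stays usable without loompy.
    import loompy

    with loompy.connect(path, mode="r") as ds:
        return summarize_loom(ds)
```

The attribute names returned can then be checked against the tables in this document.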

**Note**: Loom files generated by Optimus are different from the final Loom file distributed on the [Human Cell Atlas Data Portal](https://data.humancellatlas.org/explore/projects), which removes some of the metadata detailed in this document and contains additional metadata relating to each individual project.

@@ -11,13 +11,14 @@ The global attributes in the Loom apply to the whole file, not any specific part
| :-- | :-- |
| LOOM_SPEC_VERSION | String with the loom file spec version |
| expression_data_type | String describing if the pipeline counts exonic or whole transcript (exonic and intronic) reads. For the single-cell mode (counting_mode = sc_rna), the value will be "exonic"; for the single-nuclei mode (counting_mode = sn_rna), the value will be "whole_transcript" |
| sample_id | The sample or cell id listed in the pipeline configuration file. This can be any string, but we recommend it be consistent with any sample metadata. |
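
The relationship between `counting_mode` and `expression_data_type` described in the table can be written down as a simple lookup. This is only an illustrative sketch: the names `EXPRESSION_DATA_TYPE_BY_COUNTING_MODE`, `expected_expression_data_type`, and `check_expression_data_type` are hypothetical, and raising an error for an unrecognized mode is an assumption rather than pipeline behavior.

```python
from typing import Mapping

# Mapping taken from the expression_data_type row of the table above.
EXPRESSION_DATA_TYPE_BY_COUNTING_MODE = {
    "sc_rna": "exonic",            # single-cell mode
    "sn_rna": "whole_transcript",  # single-nuclei mode (exonic and intronic)
}


def expected_expression_data_type(counting_mode: str) -> str:
    """Return the expression_data_type expected for a counting mode."""
    try:
        return EXPRESSION_DATA_TYPE_BY_COUNTING_MODE[counting_mode]
    except KeyError:
        raise ValueError(f"unknown counting_mode: {counting_mode!r}") from None


def check_expression_data_type(global_attrs: Mapping[str, str], counting_mode: str) -> bool:
    """True if the file's global attribute matches the given counting mode."""
    expected = expected_expression_data_type(counting_mode)
    return global_attrs.get("expression_data_type") == expected
```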


## Table 2. Column Attributes (Cell Metrics)

| Cell Metrics | Program |Details |
|:---|:---:|:--------------------|
|`CellID` | [SC Tools](https://github.com/HumanCellAtlas/sctools/tree/master/src/sctools/metrics) | The unique identifier for each cell based on cell barcodes |
|`cell_names` | [SC Tools](https://github.com/HumanCellAtlas/sctools/tree/master/src/sctools/metrics) | The unique identifier for each cell based on cell barcodes |
|`n_reads`|[SC Tools](https://github.com/HumanCellAtlas/sctools/tree/master/src/sctools/metrics)| The number of reads associated with this entity. [Metrics Definitions](https://sctools.readthedocs.io/en/latest/sctools.metrics.html#sctools.metrics.aggregator.CellMetrics.n_reads)|
|`noise_reads`|[SC Tools](https://github.com/HumanCellAtlas/sctools/tree/master/src/sctools/metrics)| Number of reads that are categorized by 10x Genomics Cell Ranger as "noise". Refers to long polymers, or reads with high numbers of N (ambiguous) nucleotides. [Metrics Definitions](https://sctools.readthedocs.io/en/latest/sctools.metrics.html#sctools.metrics.aggregator.CellMetrics.noise_reads)|
|`perfect_molecule_barcodes`|[SC Tools](https://github.com/HumanCellAtlas/sctools/tree/master/src/sctools/metrics)| The number of reads with molecule barcodes that have no errors. [Metrics Definitions](https://sctools.readthedocs.io/en/latest/sctools.metrics.html#sctools.metrics.aggregator.CellMetrics.perfect_molecule_barcodes)|
@@ -63,8 +64,8 @@ The global attributes in the Loom apply to the whole file, not any specific part

| Gene Metrics | Program |Details |
|-------------------------------|--------------------|------------------------|
|`Accession` | [GENCODE GTF](https://www.gencodegenes.org/) | The gene_id listed in the GENCODE GTF |
|`Gene` | [GENCODE GTF](https://www.gencodegenes.org/) | The unique gene_name provided in the GENCODE GTF |
|`ensembl_ids` | [GENCODE GTF](https://www.gencodegenes.org/) | The gene_id listed in the GENCODE GTF |
|`gene_names` | [GENCODE GTF](https://www.gencodegenes.org/) | The unique gene_name provided in the GENCODE GTF |
|`n_reads`|[SC Tools](https://github.com/HumanCellAtlas/sctools/tree/master/src/sctools/metrics)| The number of reads associated with this entity. [Metrics Definitions](https://sctools.readthedocs.io/en/latest/sctools.metrics.html#sctools.metrics.aggregator.CellMetrics.n_reads)|
|`noise_reads`|[SC Tools](https://github.com/HumanCellAtlas/sctools/tree/master/src/sctools/metrics)| The number of reads that are categorized by 10x Genomics Cell Ranger as "noise". Refers to long polymers, or reads with high numbers of N (ambiguous) nucleotides. [Metrics Definitions](https://sctools.readthedocs.io/en/latest/sctools.metrics.html#sctools.metrics.aggregator.CellMetrics.noise_reads)|
|`perfect_molecule_barcodes`|[SC Tools](https://github.com/HumanCellAtlas/sctools/tree/master/src/sctools/metrics)| The number of reads with molecule barcodes that have no errors. [Metrics Definitions](https://sctools.readthedocs.io/en/latest/sctools.metrics.html#sctools.metrics.aggregator.CellMetrics.perfect_molecule_barcodes)|
@@ -93,5 +94,3 @@ The global attributes in the Loom apply to the whole file, not any specific part
|`number_cells_expressing`|[SC Tools](https://github.com/HumanCellAtlas/sctools/tree/master/src/sctools/metrics)| The number of cells that detect this gene. [Metrics Definitions](https://sctools.readthedocs.io/en/latest/sctools.metrics.html#sctools.metrics.aggregator.GeneMetrics.number_cells_expressing)|
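
As this diff reads, three attributes are renamed: `CellID` becomes `cell_names`, `Accession` becomes `ensembl_ids`, and `Gene` becomes `gene_names`. Code that must read files written both before and after the change could fall back to the older name, as in the sketch below. The helper `get_attribute` and the `RENAMED_ATTRIBUTES` table are hypothetical, and the function is written against a plain mapping of attribute names to values rather than any particular library object.

```python
from typing import Any, Mapping

# New attribute name -> older name it replaced, taken from the tables in this diff.
RENAMED_ATTRIBUTES = {
    "cell_names": "CellID",
    "ensembl_ids": "Accession",
    "gene_names": "Gene",
}


def get_attribute(attrs: Mapping[str, Any], new_name: str) -> Any:
    """Look up an attribute by its new name, falling back to the old one."""
    if new_name in attrs:
        return attrs[new_name]
    old_name = RENAMED_ATTRIBUTES.get(new_name)
    if old_name is not None and old_name in attrs:
        return attrs[old_name]
    raise KeyError(f"neither {new_name!r} nor {old_name!r} found")
```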




Binary file modified pipelines/optimus/Optimus_diagram.png
Binary file modified pipelines/optimus/Optimus_diagram.pptx
Binary file not shown.