SOMA – for “Stack Of Matrices, Annotated” – is a flexible, extensible, and open-source API enabling access to data in a variety of formats. SOMA is designed to be general-purpose for data that can be modeled as one or more sets of 2D annotated matrices with measurements of features across observations. The driving use case of SOMA is for single-cell data in the form of annotated matrices where observations are frequently cells and features are genes, proteins, or genomic regions.
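As a rough illustration of the data model described above, the sketch below builds a toy "stack" of annotated matrices in plain Python. It is not the SOMA or TileDB-SOMA API: the `AnnotatedMatrix` class, its `select` method, and the sample cells and genes are all hypothetical names and values invented for this example.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple


@dataclass
class AnnotatedMatrix:
    """Toy stand-in for one annotated 2D matrix (hypothetical, not the SOMA API)."""

    obs: List[dict]  # one annotation record per observation (row), e.g. a cell
    var: List[dict]  # one annotation record per feature (column), e.g. a gene
    # Sparse measurements: (row index, column index) -> value
    X: Dict[Tuple[int, int], float] = field(default_factory=dict)

    def select(
        self,
        obs_pred: Callable[[dict], bool],
        var_pred: Callable[[dict], bool],
    ) -> List[List[float]]:
        """Return the dense sub-matrix for rows and columns whose annotations match."""
        rows = [i for i, o in enumerate(self.obs) if obs_pred(o)]
        cols = [j for j, v in enumerate(self.var) if var_pred(v)]
        return [[self.X.get((i, j), 0.0) for j in cols] for i in rows]


# Made-up example data: three cells measured across two genes.
rna = AnnotatedMatrix(
    obs=[
        {"cell_id": "c1", "cell_type": "B cell"},
        {"cell_id": "c2", "cell_type": "T cell"},
        {"cell_id": "c3", "cell_type": "B cell"},
    ],
    var=[{"gene": "CD19"}, {"gene": "CD3E"}],
    X={(0, 0): 5.0, (1, 1): 7.0, (2, 0): 3.0},
)

# The "stack": one annotated matrix per measurement type, keyed by name.
stack = {"RNA": rna}

# Use the annotations to pick out CD19 values for B cells only.
b_cells_cd19 = stack["RNA"].select(
    lambda o: o["cell_type"] == "B cell",
    lambda v: v["gene"] == "CD19",
)
print(b_cells_cd19)
```

The point of the sketch is only that annotations on observations and features drive which part of the matrix is read. A real implementation would run such queries against on-disk storage rather than in-memory dictionaries.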
Datasets generated by profiling single cells are rapidly increasing in size and complexity. This has resulted in a need for scalable solutions to accommodate data sizes that no longer fit in memory and flexibility to accommodate the diversity of data being produced.
To address these emerging needs in the single-cell ecosystem, the Chan Zuckerberg Initiative in partnership with TileDB is:
- Driving the development of SOMA.
- Providing its first implementation, TileDB-SOMA, which utilizes the TileDB Embedded engine.
- Adopting TileDB-SOMA at CZ CELLxGENE Discover to build its Census, which provides efficient access to and querying of a corpus containing nearly 50 million cells compiled from 700+ datasets.
The SOMA specification and its TileDB-SOMA implementation provide the following capabilities for single-cell data:
- An abstract specification with flexibility for data from multiple modalities (e.g. RNA, spatial, epigenomics)
- A format to store and access datasets larger than memory, as compared to the current paradigm of .h5ad/.mtx/.tgz/.RData/.h5Seurat/etc.
  - Eliminates in-memory limitations by providing query-ready data management for reading and writing at low latency and cloud scale.
- R and Python APIs with the flexibility to expand to other languages.
- SOMA abstract specification — language-agnostic SOMA API specification.
- Python SOMA specification — persistence-layer–agnostic Python definition of SOMA core types.
- TileDB-SOMA — Python and R implementation of SOMA specification using TileDB Embedded. R coming soon.
- R SOMA specification and its implementation through TileDB-SOMA.
- End-user documentation for both Python and R TileDB-SOMA APIs, including a getting-started guide, notebooks, and API reference.
- We expect the TileDB-SOMA repository to be the front door for reporting and tracking implementation issues: https://github.com/single-cell-data/TileDB-SOMA/issues. For spec-related issues, please submit an issue at https://github.com/single-cell-data/SOMA/issues.
- If you believe you have found a security issue, please disclose it responsibly by contacting security@chanzuckerberg.com instead of filing an issue.
- Feedback is appreciated, as this is a community-driven project. If you have well-scoped feature requests or discussion topics, please add them to https://github.com/single-cell-data/SOMA/issues. For any other inquiries, please reach out to soma@chanzuckerberg.com.
- If you would like to learn more about SOMA or keep up to date with the latest developments, please join our mailing list.
This project adheres to CZI's Contributor Covenant code of conduct. By participating, you are expected to uphold this code. Please report unacceptable behavior to opensource@chanzuckerberg.com.