In the Gene Expression team at the EBI we produce the Single Cell Expression Atlas (SCXA), using a consistent pipeline to analyse data from the raw FASTQ files and produce the results visible in the SCXA interface. Intermediate in this process is an 'analysis bundle', in which the subset of results needed is gathered and formatted correctly for loading into our databases and indices.
We sometimes encounter datasets without raw data, but where the user has an annData file containing expression matrices and derived results. The purpose of this repository is to provide a way of producing an analysis bundle directly from an annData object, allowing us (with some manual curation) to input these experiments. It will also be useful to simplify our own processes, since we can take the annData files now produced at the end of our analysis to produce a bundle in one step.
This is the implementation of an internal strategy document.
See also a walk-through for a specific example dataset.
An SCXA analysis bundle contains:
- MTX-format expression matrices (raw, filtered, normalised)
- TSV-format cell metadata used in analysis
- TSV-format gene metadata used in analysis
- TSV-format dimension reductions of different parameterisation (t-SNE, UMAP)
- An annData-format file containing all the above
- Reference files (GTF, cDNA) used in analysis
- A software report detailing tools and versions used
- A manifest file containing all the above
Here is an example manifest:
Description File Parameterisation
software_versions_file software.tsv
mtx_matrix_rows filtered_normalised/genes.tsv.gz filtered_normalised
mtx_matrix_cols filtered_normalised/barcodes.tsv.gz filtered_normalised
mtx_matrix_content filtered_normalised/matrix.mtx.gz filtered_normalised
tsv_matrix filtered_normalised/filtered_normalised.tsv filtered_normalised
mtx_matrix_rows tpm/genes.tsv.gz tpm
mtx_matrix_cols tpm/barcodes.tsv.gz tpm
mtx_matrix_content tpm/matrix.mtx.gz tpm
tsv_matrix tpm/tpm.tsv tpm
mtx_matrix_rows raw/genes.tsv.gz raw
mtx_matrix_cols raw/barcodes.tsv.gz raw
mtx_matrix_content raw/matrix.mtx.gz raw
tsv_matrix raw/raw.tsv raw
mtx_matrix_rows raw_filtered/genes.tsv.gz raw_filtered
mtx_matrix_cols raw_filtered/barcodes.tsv.gz raw_filtered
mtx_matrix_content raw_filtered/matrix.mtx.gz raw_filtered
tsv_matrix raw_filtered/raw_filtered.tsv raw_filtered
mtx_matrix_rows tpm_filtered/genes.tsv.gz tpm_filtered
mtx_matrix_cols tpm_filtered/barcodes.tsv.gz tpm_filtered
mtx_matrix_content tpm_filtered/matrix.mtx.gz tpm_filtered
tsv_matrix tpm_filtered/tpm_filtered.tsv tpm_filtered
cell_metadata E-MTAB-6077.cell_metadata.tsv
condensed_sdrf E-MTAB-6077.condensed-sdrf.tsv
project_file E-MTAB-6077.project.h5ad
reference_transcriptome reference/Danio_rerio.GRCz11.cdna.all.104.fa.gz
reference_annotation reference/Danio_rerio.GRCz11.104.gtf.gz
gene_metadata reference/gene_annotation.txt
protocol smart-seq
tsne_embeddings tsne_perplexity_1.tsv 1
tsne_embeddings tsne_perplexity_10.tsv 10
umap_embeddings umap_n_neighbors_10.tsv 10
umap_embeddings umap_n_neighbors_100.tsv 100
cluster_markers markers_2.tsv 2
cluster_markers markers_23.tsv 23
cluster_markers markers_32.tsv 32
cluster_markers markers_42.tsv 42
cluster_markers markers_49.tsv 49
cluster_markers markers_5.tsv 5
marker_stats filtered_normalised_stats.csv filtered_normalised
marker_stats tpm_filtered_stats.csv tpm_filtered
cluster_memberships clusters_for_bundle.txt
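A manifest like this can be read back into per-file records. The sketch below is only an illustration: it assumes the manifest is tab-delimited with the three header columns shown, and that the third column may be empty or absent. `read_manifest` is a hypothetical helper, not part of the atlas-anndata package.

```python
import csv
import io


def read_manifest(handle):
    """Parse a bundle manifest into a list of dicts, one per line.

    Assumes a tab-delimited file whose header row is
    Description, File, Parameterisation (an assumption, not a documented spec).
    """
    records = []
    reader = csv.reader(handle, delimiter="\t")
    header = next(reader)
    for row in reader:
        if not row:
            continue  # skip blank lines
        # Pad short rows so a missing parameterisation becomes an empty string
        row = row + [""] * (len(header) - len(row))
        records.append(dict(zip(header, row)))
    return records


# A few lines taken from the example manifest above
example = (
    "Description\tFile\tParameterisation\n"
    "software_versions_file\tsoftware.tsv\t\n"
    "tsne_embeddings\ttsne_perplexity_1.tsv\t1\n"
    "tsne_embeddings\ttsne_perplexity_10.tsv\t10\n"
    "cluster_memberships\tclusters_for_bundle.txt\n"
)

records = read_manifest(io.StringIO(example))
tsne_files = [r["File"] for r in records if r["Description"] == "tsne_embeddings"]
print(tsne_files)
```

Keeping every line as a separate record (rather than keying on Description) matters because descriptions such as tsne_embeddings repeat once per parameterisation.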
The steps required to produce a bundle are:
1. Initialise the bundle (init step). Process the annData file to determine what information is available, store a summary of this information in a YAML-format configuration file (anndata-config.yaml in the bundle directory), and use that starting configuration to generate bundle files for the first time.
2. Examine the cell metadata files and use that information to refine configuration related to cell metadata. This includes flagging fields that should be included in the pre-MAGE-TAB files and, for droplet experiments, finding the field that separates cells from different libraries. Also check the gene metadata at this stage to confirm that Ensembl gene identifiers are available.
3. Initialise MAGE-TAB (init_magetab step). Generate bundle files for the second time, based on the configuration refined in step 2. This will include pre-MAGE-TAB files suitable as a basis for curation.
4. Undertake curation based on the pre-MAGE-TAB files generated at step 3.
5. Inject curated metadata into the annData object and configuration (inject_magetab step). Condense the new SDRF file (including zoomification) and inject the new metadata back into the annData object where it can be used in curation. Output a new configuration that includes any new fields added in curation.
6. Make final edits to the config YAML. Complete anything marked with FILL ME, set the load_to_scxa_db matrix and flag any fields (e.g. cell type) for marker generation.
7. Generate the final bundle suitable for loading into SCXA (final step).
Once finalised, the config YAML should be added to the scxa-metadata repo alongside the MAGE-TAB files.
The process is progressive, with information on an experiment gradually being accumulated for potential partial analysis and loading. Steps 1-5 could be undertaken by curators, with the bundle and configuration then being handed off to bioinformaticians. However, additional author-provided information (e.g. on matrix processing) is likely to be needed at step 6, and that information could be gathered from authors alongside curation in step 4.
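The four scripted steps (init, init_magetab, inject_magetab, final) each correspond to one invocation of make_bundle_from_anndata, with manual review or curation in between. As a rough sketch, the command lines could be assembled like this; `build_command` is a hypothetical helper that only builds argument lists and runs nothing, and it uses only the --anndata-file and --droplet options that appear in this document.

```python
import shlex

# Steps run by make_bundle_from_anndata, in the order they are performed
AUTOMATED_STEPS = ["init", "init_magetab", "inject_magetab", "final"]


def build_command(exp_name, step, anndata_file=None, droplet=False):
    """Return the argument list for one bundle step (nothing is executed)."""
    if step not in AUTOMATED_STEPS:
        raise ValueError(f"unknown step: {step}")
    cmd = ["make_bundle_from_anndata"]
    if step == "init":
        # The annData file is only supplied when the bundle is first initialised
        if anndata_file is None:
            raise ValueError("the init step needs an annData file")
        cmd += ["--anndata-file", anndata_file]
        if droplet:
            cmd.append("--droplet")
    cmd += [exp_name, step]
    return cmd


for step in AUTOMATED_STEPS:
    cmd = build_command("E-MTAB-6077", step, anndata_file="E-MTAB-6077.project.h5ad")
    print(shlex.join(cmd))
```

The printed commands are meant to be run one at a time, since the configuration has to be inspected and edited by hand between steps.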
This repository contains a Python package which should be used to facilitate production of a bundle.
Probably the simplest install method is with Conda (or Mamba), e.g. to create a new environment with the package:
conda create -n atlas-anndata atlas-anndata
Or install it from PyPI:
pip install atlas-anndata
The make_bundle_from_anndata script is the central utility with which a bundle is created over the steps above. The usage is:
Usage: make_bundle_from_anndata [OPTIONS] EXP_NAME
[init|init_magetab|inject_magetab|final]
Build a bundle directory compatible with Single Cell Expression Atlas
(SCXA) build proceseses
exp_name - Specify an Expression Atlas identifier for this experiment.
step - Specify the bundle creation step. One of:
* 'init': start a bundle based on anndata file content
* 'init_magetab': create starting MAGE-TAB files as a
basis for curation, having checked and refined
configuration at the 'init' stage.
* 'inject_magetab': With curation done, read metadata
from the scxa-metadata repo, and modify bundle
configuration and annData object accordingly.
* 'final': Having made any refinements to the field
configuration modified by 'inject_magetab', produce the
final bundle.
Options:
--anndata-file PATH For the 'init' stage, specify a path to a
file of the annData hdf5 specification, with
all necessaryinformation for SCXA.
--bundle_dir PATH A directory under which bundle directories
should be created. Defaults to the current
working directory.
--atlas-style Assume the tight conventions from SCXA, e.g.
on .obsm slot naming?
--analysis-versions-file PATH A four-column tab-delimited file with analys,
analysis, version and citation
--droplet Is this a droplet experiment?
--gene-id-field TEXT Field in .var where gene ID is stored.
--gene-name-field TEXT Field in .var where gene name (symbol) is
stored.
--sample-field TEXT Field in .obs which separates cells from
different libraries.
--default-clustering TEXT Of the unsupervised clusterings, which
clustering should be set as the default? If
not set, the middle (or first middle)
clustering will be selected, or if --atlas-
style is set, this will be the clustering
corresponding to a resolution of 1.
--max-rank-for-stats INTEGER For how many top marker genes should stats
(mean, median expression) be output?
--matrix-for-markers TEXT Where cell groups in the configuration file
have been flagged with markers, which matrix
should be used? Can be X, or an entry in
.layers(). The matrix must be appropriate for
Scanpy's tl.rank_genes_groups() method,
usually meaning filtered, normalised and log
transformed, but without additional scaling.
[required]
--conda-prefix PATH Specify a Conda directory to be used for
environments when running Snakemake
workflows.
--scxa-metadata-branch TEXT When searching the SCXA metadata repository
for curation for this experiment, which
branch should we use? [required]
--sanitize-columns When adding data from curation into the
anndata object, should we remove the Comment,
Characteristic etc?
--exp-name TEXT Specify an Expression Atlas identifier that
will be used for this experiment. If not set,
a placeholder value E-EXP-1234 will be used
and can be edited in the bundle later.
[required]
--scxa-db-scale INTEGER To what overall scale should cell counts be
multiplied for the SCXA DB? A multiplier will
be calculated from this value and the median
cell-wise sum in the given matrix.
--help Show this message and exit.
(Note that the --atlas-style flag is probably only useful for annData files produced by the Expression Atlas team, and relies on a number of assumptions about the content of the file in order to infer some additional information.)
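The help text for --scxa-db-scale says a multiplier is calculated from the supplied value and the median cell-wise sum of the chosen matrix. The exact calculation is not given here, so the sketch below assumes the simplest reading: the multiplier is the target scale divided by the median of the per-cell totals. `scale_multiplier` is a hypothetical name and the matrix is a plain list of per-cell rows.

```python
from statistics import median


def scale_multiplier(matrix, target_scale):
    """Multiplier that brings the median cell-wise sum to target_scale.

    `matrix` is a list of rows, one row of expression values per cell.
    The formula (target / median cell sum) is an assumption for illustration.
    """
    cell_sums = [sum(row) for row in matrix]
    return target_scale / median(cell_sums)


# Three cells with totals 10, 20 and 40, so the median cell-wise sum is 20
cells = [[4, 6, 0], [5, 5, 10], [10, 10, 20]]
multiplier = scale_multiplier(cells, 1_000_000)
print(multiplier)  # 50000.0
```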
1. Bundle initialisation: Produce a YAML format annData description file and starting bundle content
To produce a valid bundle from an annData file, we need to describe that file, outlining which of the cell/gene metadata columns, matrices, dimension reductions etc. should be included. This is done via a YAML-format config file (see example). We then produce a starting bundle based on that config.
Both of these things are accomplished by make_bundle_from_anndata with the 'init' step:
make_bundle_from_anndata --anndata-file atlas_anndata/data/bundles/E-MTAB-6077/E-MTAB-6077.project.h5ad E-MTAB-6077 init
(Note: we would supply --droplet at this stage for a droplet experiment.)
This will create a starting bundle, by default in the current directory (see --bundle-dir), including a starting version of the bundle configuration and a copy of the annData file which we will eventually customise and distribute:
> tree E-MTAB-6077/
E-MTAB-6077/
├── anndata-config.yaml
├── E-MTAB-6077.cell_metadata.tsv
├── E-MTAB-6077.project.h5ad
├── MANIFEST
└── reference
    └── gene_annotation.txt
1 directory, 5 files
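A quick way to confirm that the init step produced the files listed in this tree is to check for them by name. This is a sketch only: the expected file list is copied from the tree above, and `missing_bundle_files` is a hypothetical helper rather than anything shipped with the package.

```python
import tempfile
from pathlib import Path


def missing_bundle_files(bundle_dir, exp_name):
    """List expected starting-bundle files that are absent from bundle_dir."""
    expected = [
        "anndata-config.yaml",
        f"{exp_name}.cell_metadata.tsv",
        f"{exp_name}.project.h5ad",
        "MANIFEST",
        "reference/gene_annotation.txt",
    ]
    root = Path(bundle_dir)
    return [name for name in expected if not (root / name).is_file()]


# Demonstration on a throwaway directory holding only two of the five files
with tempfile.TemporaryDirectory() as tmp:
    bundle = Path(tmp) / "E-MTAB-6077"
    (bundle / "reference").mkdir(parents=True)
    (bundle / "MANIFEST").touch()
    (bundle / "reference" / "gene_annotation.txt").touch()
    missing = missing_bundle_files(bundle, "E-MTAB-6077")

print(missing)
```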
The config is likely to be wrong in a number of ways, but it's just a starting point.
With the configuration file and unmoderated content available to us we can make some sensible decisions about some of those settings in the YAML file.
For SCXA we need the gene symbol and ID fields. The configuration YAML might have populated these if the default field names are present, but you may well get:
gene_meta:
  id_field: FILL ME with a string
  name_field: FILL ME with a string
You need to look at the reference/gene_annotation.txt file from the bundle directory and set these fields. id_field must be a field containing Ensembl gene IDs. If these are not available we cannot work with a dataset. name_field is a field containing gene symbols.
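When the gene annotation table has many columns, a pattern match can suggest which one holds Ensembl gene IDs. The sketch below assumes the file is tab-delimited with a header row and uses a regular expression for the common Ensembl gene ID shape (ENS, optional species letters, G, eleven digits). Some species in Ensembl use other identifier schemes, so treat an empty result as a prompt to look manually. The column names and IDs in the example are made up.

```python
import csv
import io
import re

# Common Ensembl gene stable ID shape, optionally followed by a version suffix
ENSEMBL_GENE_ID = re.compile(r"^ENS[A-Z]*G\d{11}(\.\d+)?$")


def ensembl_id_columns(handle, min_fraction=0.9):
    """Return columns where at least min_fraction of values look like Ensembl gene IDs."""
    reader = csv.DictReader(handle, delimiter="\t")
    hits = {name: 0 for name in (reader.fieldnames or [])}
    n_rows = 0
    for row in reader:
        n_rows += 1
        for name in hits:
            if ENSEMBL_GENE_ID.match(row[name] or ""):
                hits[name] += 1
    if n_rows == 0:
        return []
    return [name for name, count in hits.items() if count / n_rows >= min_fraction]


# Illustrative table with invented zebrafish-style IDs
rows = [f"ENSDARG{i:011d}\tgene_{i}" for i in (1, 2, 3)]
example = "gene_id\tgene_name\n" + "\n".join(rows) + "\n"
id_columns = ensembl_id_columns(io.StringIO(example))
print(id_columns)
```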
The sample field is encoded in the config generated above like:
sample_field: FILL ME with a string
The value of this configuration field must be a field name from NONAME.cell_metadata.tsv in the bundle directory corresponding to a field that separates cells from different libraries.
If there is no such field, please use the sample field in this file, which is usually derived from the cell identifiers and should only be used in the absence of more concrete information.
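To shortlist possible sample fields, it can help to count distinct values per column of the cell metadata table. The heuristic sketched here (more than one value, but fewer values than cells) is an assumption for illustration and does not replace checking with the authors. It presumes a tab-delimited file with a header row, and the column names in the example are invented.

```python
import csv
import io


def summarise_columns(handle):
    """Count cells and distinct values per column of a tab-delimited table."""
    reader = csv.DictReader(handle, delimiter="\t")
    distinct = {name: set() for name in (reader.fieldnames or [])}
    n_cells = 0
    for row in reader:
        n_cells += 1
        for name in distinct:
            distinct[name].add(row[name])
    return n_cells, {name: len(values) for name, values in distinct.items()}


def candidate_sample_fields(handle):
    """Columns with more than one value but fewer values than cells."""
    n_cells, counts = summarise_columns(handle)
    return [name for name, n in counts.items() if 1 < n < n_cells]


# Four cells from two libraries; 'id' is unique per cell, 'organism' is constant
example = (
    "id\tlibrary\torganism\n"
    "cell1\tlibA\tDanio rerio\n"
    "cell2\tlibA\tDanio rerio\n"
    "cell3\tlibB\tDanio rerio\n"
    "cell4\tlibB\tDanio rerio\n"
)
candidates = candidate_sample_fields(io.StringIO(example))
print(candidates)
```

Fields such as clusterings will also pass this filter, so the shortlist still needs a human decision.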
Starting MAGE-TAB content can now be generated with the init_magetab step of make_bundle_from_anndata:
make_bundle_from_anndata E-MTAB-6077 init_magetab
The experiment name ('E-MTAB-6077' here) and step ('init_magetab') must be supplied. The annData and configuration will be read from the bundle directory, assumed by default to be in the current working directory (see above).
The pre-MAGE-TAB can now be used to start curation by the curation team. Assuming contact with authors is occurring at this point, it would save time if the information needed at step 6 could also be gathered here, such as the processing status of component matrices. Otherwise the appropriate contact should be passed to bioinformaticians, who will need this info prior to producing the final bundle.
MAGE-TAB format metadata should now be available in the scxa-metadata repository, from where we can retrieve it to be condensed, zoomified, and added to the annData object. This can all be done with another run of make_bundle_from_anndata:
make_bundle_from_anndata E-MTAB-6077 inject_magetab
This will pull the curated metadata from the scxa-metadata repo, condense the SDRF (adding ontology terms) and re-generate cell-wise annotations that will be used to enrich the content of the annData file, with any new fields from curation added to .obs. It will also add configuration for those fields to the YAML.
Note that --conda-prefix can be specified here. It sets the location where Conda environments will be stored for the workflow that does SDRF condensation etc.
We're almost ready to create our final bundle, but we must first finalise all the configuration in anndata-config.yaml for the bundle.
Cell metadata from annData objects is a mixture of input sample metadata provided by the author and annotations added over the course of analysis. The latter may not be appropriate for inclusion in the metadata in SCXA. Check the fields described in cell_meta, especially their kind ('curation', 'clustering' or 'analysis'). Curation fields are those present before analysis: biological and technical information for cells and samples. Clustering indicates the results of unsupervised cell clustering stored in .obs. Analysis is everything else, comprising all other fields added to .obs during analysis.
Most importantly:
- Flip curation to analysis for any field entry which should not ultimately form part of the SCXA experiment MAGE-TAB format metadata.
- Ensure any fields corresponding to unsupervised clusterings are flagged correctly.
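Reviewing field kinds is easier with the fields grouped by kind. The following sketch works on a plain list of entries with `slot` and `kind` keys, as they might look after loading the YAML; the exact nesting inside anndata-config.yaml is not assumed here, and both helper names and the example fields are hypothetical.

```python
import copy
from collections import defaultdict

ALLOWED_KINDS = {"curation", "clustering", "analysis"}


def slots_by_kind(entries):
    """Group field names (slots) by their 'kind' for review."""
    grouped = defaultdict(list)
    for entry in entries:
        grouped[entry["kind"]].append(entry["slot"])
    return dict(grouped)


def reclassify(entries, slots, new_kind):
    """Return a copy of entries with 'kind' set to new_kind for the named slots."""
    if new_kind not in ALLOWED_KINDS:
        raise ValueError(f"unknown kind: {new_kind}")
    updated = copy.deepcopy(entries)
    for entry in updated:
        if entry["slot"] in slots:
            entry["kind"] = new_kind
    return updated


# Hypothetical entries: n_genes was added during analysis but starts as 'curation'
cell_meta = [
    {"slot": "organism_part", "kind": "curation"},
    {"slot": "n_genes", "kind": "curation"},
    {"slot": "leiden", "kind": "clustering"},
]
reviewed = reclassify(cell_meta, {"n_genes"}, "analysis")
print(slots_by_kind(reviewed))
```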
Unlike our standard submission pathways, for pre-analysed data we need additional information before the data are ingested for SCXA, which must currently be provided via the configuration YAML. The completed config from the following steps should be added to the scxa-metadata repo alongside the MAGE-TAB files.
- Under analyses please describe the analysis that was done. At a minimum you should describe the reference used (see the example) and the mapping tool used.
- Under matrices check that you want all these matrices to be considered. You can remove any matrix that's not useful, and you should check the processing flags for each one.
- Under load_to_scxa_db please state the matrix that should be used by Atlas in expression-based displays. This should be filtered and normalised but not scaled or transformed. If no matrix in the object matches these criteria please remove this part of the config and Atlas will not show displays for this experiment based on expression values.
- Check the dimension reductions described, again paying attention to 'kind'.
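The load_to_scxa_db criteria in the list above (filtered and normalised, but not scaled or transformed) can be expressed as a filter over the matrix entries. This sketch takes entries as plain dicts carrying the boolean flags used in the config, and it assumes "filtered" means both cell_filtered and gene_filtered are true, which is an interpretation rather than a documented rule. The helper name and the non-X slot names are hypothetical.

```python
def scxa_db_candidates(matrices):
    """Return slots of matrices that are filtered and normalised,
    but neither scaled nor log-transformed."""
    candidates = []
    for matrix in matrices:
        filtered = matrix.get("cell_filtered") and matrix.get("gene_filtered")
        transformed = matrix.get("scaled") or matrix.get("log_transformed")
        if filtered and matrix.get("normalised") and not transformed:
            candidates.append(matrix["slot"])
    return candidates


matrices = [
    # .X holding a scaled, log-transformed matrix: not suitable
    {"slot": "X", "cell_filtered": True, "gene_filtered": True,
     "normalised": True, "log_transformed": True, "scaled": True},
    # a layer with filtered, normalised values: suitable
    {"slot": "normalised", "cell_filtered": True, "gene_filtered": True,
     "normalised": True, "log_transformed": False, "scaled": False},
    # unfiltered raw counts: not suitable
    {"slot": "raw", "cell_filtered": False, "gene_filtered": False,
     "normalised": False, "log_transformed": False, "scaled": False},
]
suitable = scxa_db_candidates(matrices)
print(suitable)
```

An empty result corresponds to the case described above where the load_to_scxa_db section should be removed from the config.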
For all sections, check the example for an idea of how things should look. For example, under matrices there will be an entry pertaining to the content of .X. You should add a name (e.g. 'scaled'), and check that the status of all the boolean flags is correct. This may well require some queries to the authors.
- cell_filtered: true
  gene_filtered: true
  log_transformed: true
  measure: counts
  name: FILL ME with a string
  normalised: true
  parameters: {}
  scaled: true
  slot: X
You would fill the 'name' field here with something more descriptive for the matrix.
Any categorical field can be used for marker detection (where appropriate matrices are available). This involves flipping the 'markers' status on a field annotation, e.g. changing:
- default: false
  kind: clustering
  markers: false
  parameters: {}
  slot: leiden
... to
- default: false
  kind: clustering
  markers: true
  parameters: {}
  slot: leiden
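If several fields need marker detection switched on, the edit can be scripted against the loaded entries rather than made by hand. A minimal sketch, assuming entries are dicts with `slot` and `markers` keys as in the snippet above; `enable_markers` is a hypothetical helper and `cell_type` is an invented field name.

```python
import copy


def enable_markers(entries, slots):
    """Return a copy of entries with markers switched on for the named slots."""
    wanted = set(slots)
    updated = copy.deepcopy(entries)
    for entry in updated:
        if entry["slot"] in wanted:
            entry["markers"] = True
            wanted.discard(entry["slot"])
    if wanted:
        # Fail loudly on a misspelt field name instead of silently skipping it
        raise KeyError(f"fields not found in config: {sorted(wanted)}")
    return updated


fields = [
    {"default": False, "kind": "clustering", "markers": False,
     "parameters": {}, "slot": "leiden"},
    {"default": False, "kind": "curation", "markers": False,
     "parameters": {}, "slot": "cell_type"},
]
flagged = enable_markers(fields, ["leiden"])
print([f["slot"] for f in flagged if f["markers"]])
```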
Having edited the config YAML, you should validate it against a schema we provide and the annData file itself. We can use this mechanism to ensure that inputs match the expectations of Single Cell Expression Atlas.
validate_anndata_with_config E-MTAB-6077
This will check that the config in the bundle directory for this experiment matches with the modified annData object there.
Bundling steps will also run this automatically before proceeding, but running it yourself will flag any issues early. If the validation flags any issues, resolve them.
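Before running the validator, a simple pre-check for leftover FILL ME placeholders can save a round trip. The sketch below walks a configuration that has already been loaded into Python dicts and lists (for instance with a YAML parser) and reports where placeholders remain. It is not a substitute for validate_anndata_with_config, and the layout of the example config is illustrative only.

```python
def find_placeholders(node, path=""):
    """Yield paths of string values that still start with 'FILL ME'."""
    if isinstance(node, dict):
        for key, value in node.items():
            child = f"{path}.{key}" if path else str(key)
            yield from find_placeholders(value, child)
    elif isinstance(node, list):
        for index, value in enumerate(node):
            yield from find_placeholders(value, f"{path}[{index}]")
    elif isinstance(node, str) and node.startswith("FILL ME"):
        yield path


# Illustrative config fragment; the real file may nest things differently
config = {
    "gene_meta": {"id_field": "gene_id", "name_field": "FILL ME with a string"},
    "sample_field": "FILL ME with a string",
    "matrices": [{"slot": "X", "name": "FILL ME with a string", "scaled": True}],
}
unfilled = list(find_placeholders(config))
print(unfilled)
```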
With the configuration finalised, it's just a case of running the final step for make_bundle_from_anndata:
make_bundle_from_anndata E-MTAB-6077 final
All bundle content, including matrices, dimension reductions etc., will be written, and this should form the final bundle that can be read by SCXA loading processes.