Code for recreating results from "A Bayesian nonparametric semi-supervised model for integration of multiple single-cell experiments." Given multiple single-cell RNA-seq datasets with some shared genes, sstGPLVM fits a joint latent space that can be used for downstream analysis.
processing contains Jupyter notebooks for converting the original data to HDF5 files that are used as input to the scripts. Simulated data can be generated in the analysis notebook provided. The original single-cell data are available at:
- Pancreas data: https://github.com/MarioniLab/MNN2017. First follow the pre-alignment processing steps in the provided R files.
- Gilad data: https://github.com/jdblischak/singlecell-qtl
- seqFISH+ data: https://github.com/CaiGroup/seqFISH-PLUS
alignment-scripts contains Python scripts for fitting the model to data. It also contains a Python script for calculating the average Wasserstein-based distance between a fitted latent space and the true latent space.
sstGPLVM is implemented in Python 2.7 with:
- numpy 1.14.5
- pandas 0.23.3
- h5py 2.8.0
- tensorflow 1.6.0
- edward 1.3.5
- sklearn 0.19.2
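As a convenience, the pinned versions above can be compared against the current environment. The following is a minimal sketch and not part of the repository; the helper name check_versions is hypothetical, and the import names edward and sklearn are assumed to correspond to the packages listed above.

```python
import importlib

# Versions listed in this README, keyed by import name (assumed names).
EXPECTED = {
    "numpy": "1.14.5",
    "pandas": "0.23.3",
    "h5py": "2.8.0",
    "tensorflow": "1.6.0",
    "edward": "1.3.5",
    "sklearn": "0.19.2",
}


def check_versions(expected=EXPECTED):
    """Return {module: (status, found_version)}.

    status is 'ok', 'mismatch', or 'missing'. Hypothetical helper,
    not shipped with sstGPLVM.
    """
    report = {}
    for name, wanted in sorted(expected.items()):
        try:
            module = importlib.import_module(name)
        except ImportError:
            # The package is not installed in this environment.
            report[name] = ("missing", None)
            continue
        found = getattr(module, "__version__", None)
        report[name] = ("ok" if found == wanted else "mismatch", found)
    return report
```

Calling check_versions() returns a dictionary that can be printed or inspected before running the alignment scripts; a mismatch is only a warning sign, since other versions may or may not work.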
Input:
- A numpy array or sparse CSR/CSC matrix of scRNA counts (or other types of data), with N cells (samples) as rows and p genes (features) as columns (loaded to y_train). Input this directly into the code as y_train.
- A numpy array of relevant metadata, with N cells as rows and m metadata features as columns (loaded to z_init). It is also possible to structure the metadata with some missing cells that can be imputed (see alignment-seqfish for an example).
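The expected shapes of the two inputs can be illustrated with toy data. This is a sketch under stated assumptions: the sizes, the Poisson counts, and the choice of metadata columns (a batch label and a continuous covariate) are invented for illustration; only the names y_train and z_init and the rows-by-columns layout come from the description above.

```python
import numpy as np

rng = np.random.RandomState(0)

# Toy sizes (assumptions): cells, genes, metadata features.
N, p, m = 100, 20, 2

# Toy count matrix: N cells as rows, p genes as columns.
y_train = rng.poisson(lam=2.0, size=(N, p))

# Toy metadata: one row per cell, one column per metadata feature.
# Column 0 is a hypothetical batch label, column 1 a continuous covariate.
z_init = np.column_stack([
    rng.randint(0, 2, size=N),
    rng.normal(size=N),
])

# Both inputs must describe the same N cells in the same row order.
assert y_train.shape == (N, p)
assert z_init.shape == (N, m)
```

How missing metadata entries are encoded is not shown here; the alignment-seqfish script is the reference for that case.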
Options: The following parameters can be set in the script to adjust inference:
- Degrees of freedom (--df) - default: 4
- Use t-distribution error model (otherwise normal error) (--T) - default: True
- Initial number of dimensions (--Q) - default: 3
- Kernel function
  - Matern 1/2, 3/2, 5/2 (--m12, --m32, --m52) - default: False
  - Periodic (--per_bool) - default: False
- Number of inducing points (--m) - default: 30
- Batch size (--M) - default: 250
- Max iterations (--iterations) - default: 5000
- Save frequency (--save_freq) - default: 250
- Sparse data type (is CSC or CSR) (--sparse) - default: False
- PCA initialization (otherwise random initialization) (--pca_init) - default: True
- Output directory (--out) - default: ./test
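The options above could be expressed with argparse roughly as follows. This is a hypothetical sketch, not the parser used in alignment-scripts: the flag names and defaults come from the list above, while the argument types, the str2bool helper, and the convention of passing booleans as explicit True/False values are assumptions.

```python
import argparse


def str2bool(value):
    """Parse 'True'/'False'-style strings (assumed convention)."""
    lowered = value.lower()
    if lowered in ("true", "t", "1", "yes"):
        return True
    if lowered in ("false", "f", "0", "no"):
        return False
    raise argparse.ArgumentTypeError("expected a boolean, got %r" % value)


def build_parser():
    """Build a parser mirroring the documented flags and defaults."""
    parser = argparse.ArgumentParser(description="Hypothetical sstGPLVM options")
    parser.add_argument("--df", type=float, default=4.0, help="degrees of freedom")
    parser.add_argument("--T", type=str2bool, default=True,
                        help="t-distribution error model (otherwise normal)")
    parser.add_argument("--Q", type=int, default=3,
                        help="initial number of latent dimensions")
    # Kernel choices: Matern 1/2, 3/2, 5/2 and periodic.
    for flag in ("--m12", "--m32", "--m52", "--per_bool"):
        parser.add_argument(flag, type=str2bool, default=False)
    parser.add_argument("--m", type=int, default=30, help="number of inducing points")
    parser.add_argument("--M", type=int, default=250, help="batch size")
    parser.add_argument("--iterations", type=int, default=5000)
    parser.add_argument("--save_freq", type=int, default=250)
    parser.add_argument("--sparse", type=str2bool, default=False,
                        help="input is a CSC or CSR sparse matrix")
    parser.add_argument("--pca_init", type=str2bool, default=True,
                        help="PCA initialization (otherwise random)")
    parser.add_argument("--out", type=str, default="./test", help="output directory")
    return parser
```

With this sketch, build_parser().parse_args(["--m52", "True", "--Q", "5"]) would select the Matern 5/2 kernel with five initial latent dimensions and leave every other option at its default.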
Output: an HDF5 file containing:
- Latent mapping posterior (mean and variance)
- Gene-specific noise
- Kernel hyperparameters (variance, lengthscale)
- Inducing points in latent and high-dimensional space
- The final metadata (Z) variables
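Because the dataset names inside the output file are not listed here, one way to explore it is to walk the file and print every dataset with its shape. The helper below is a generic h5py sketch (list_datasets is a hypothetical name) and assumes nothing about how sstGPLVM names its datasets.

```python
import h5py


def list_datasets(path):
    """Return {dataset_name: shape} for every dataset in an HDF5 file."""
    found = {}

    def visit(name, obj):
        # Record datasets only; groups are traversed automatically.
        if isinstance(obj, h5py.Dataset):
            found[name] = obj.shape

    with h5py.File(path, "r") as handle:
        handle.visititems(visit)
    return found
```

For example, list_datasets("path/to/output.hdf5"), with the path replaced by a file written to the --out directory, returns the names needed to load the latent posterior, noise, and kernel hyperparameters individually.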
analysis-nbs contains Jupyter notebooks and the required output files for recreating figures from the paper.