Code and data for Manifold Alignment of Multiple single cell data

architverma1/sc-manifold-alignment

sc-manifold-alignment

Code for recreating results from "A Bayesian nonparametric semi-supervised model for integration of multiple single-cell experiments." Given multiple single-cell RNA-seq datasets with some shared genes, sstGPLVM fits a joint latent space that can be used for downstream analysis.

Data Processing

processing contains Jupyter notebooks for converting the original data to the hdf5 files that the scripts take as input. Simulated data can be generated in the analysis notebook provided. Original single-cell data is available at:

  1. Pancreas data: https://github.com/MarioniLab/MNN2017. First follow the pre-alignment processing steps in the provided R files.
  2. Gilad data: https://github.com/jdblischak/singlecell-qtl
  3. seqFISH+ data: https://github.com/CaiGroup/seqFISH-PLUS

Fitting

alignment-scripts contains Python scripts for fitting the model to data. It also contains a Python script for calculating the average Wasserstein-based distance between a fit and the true latent space.

Requirements

sstGPLVM is implemented in Python 2.7 with:

  1. numpy 1.14.5
  2. pandas 0.23.3
  3. h5py 2.8.0
  4. tensorflow 1.6.0
  5. edward 1.3.5
  6. scikit-learn (sklearn) 0.19.2

Running

Input:

  1. A numpy array or sparse CSR/CSC matrix of scRNA counts (or another type of data), with N cells (samples) as rows and p genes (features) as columns. Pass this directly to the code as y_train.
  2. A numpy array of relevant metadata, with N cells as rows and m metadata features as columns (loaded into z_init). The metadata can also be structured with missing values for some cells, which can then be imputed (see alignment-seqfish for an example).
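
As a minimal sketch of the expected input shapes, the snippet below builds toy stand-ins for y_train and z_init and checks that they share one row per cell. The sizes, the random data, and the helper name check_inputs are all assumptions made for illustration; they are not part of the repository's scripts.

```python
import numpy as np

rng = np.random.RandomState(0)

# Toy sizes (assumed): N cells, p genes, m metadata features.
N, p, m = 100, 20, 1

# Toy count matrix: rows are cells, columns are genes.
y_train = rng.poisson(lam=2.0, size=(N, p)).astype(np.float64)

# Toy metadata: one column holding a dataset label (0 or 1) per cell.
z_init = rng.randint(0, 2, size=(N, m)).astype(np.float64)


def check_inputs(y, z):
    """Hypothetical helper: confirm y and z are 2-D with one row per cell."""
    if y.ndim != 2 or z.ndim != 2:
        raise ValueError("y_train and z_init must both be 2-D")
    if y.shape[0] != z.shape[0]:
        raise ValueError("y_train and z_init must have the same number of rows")
    return y.shape[0], y.shape[1], z.shape[1]


n_cells, n_genes, n_meta = check_inputs(y_train, z_init)
```

The same shape check should also work for a scipy CSR/CSC matrix, since those expose ndim and shape as well.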

Options: The following parameters can be set in the script to control inference:

  1. Degrees of freedom (--df) - default: 4
  2. Use t-Distribution error model (otherwise normal error) (--T) - default: True
  3. Initial Number of Dimensions (--Q) - default: 3
  4. Kernel Function
    • Matern 1/2, 3/2, 5/2 (--m12, --m32, --m52) - default: False
    • Periodic (--per_bool) - default: False
  5. Number of Inducing Points (--m) - default: 30
  6. Batch size (--M) - default: 250
  7. Max iterations (--iterations) - default: 5000
  8. Save frequency (--save_freq) - default: 250
  9. Sparse input data (CSC or CSR matrix) (--sparse) - default: False
  10. PCA Initialization (otherwise random initialization) (--pca_init) - default: True
  11. Output directory (--out) - default: ./test

Output: hdf5 file with

  1. Latent mapping posterior (mean and variance)
  2. Gene-specific noise
  3. Kernel hyperparameters (variance, lengthscale)
  4. Inducing points in latent and high-dimensional space
  5. The final metadata (Z) variables
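
The dataset names inside the output file are not listed here, so one way to inspect a result is to walk the file with h5py and report every dataset and its shape. The sketch below does this on a throwaway demo file; the helper name summarize_hdf5 and the dataset names x_mean and x_var are made up for illustration and are not the names the scripts write.

```python
import os
import tempfile

import h5py
import numpy as np


def summarize_hdf5(path):
    """Return {dataset_name: shape} for every dataset in an HDF5 file."""
    shapes = {}

    def visit(name, obj):
        # Record datasets only; groups are traversed automatically.
        if isinstance(obj, h5py.Dataset):
            shapes[name] = obj.shape

    with h5py.File(path, "r") as f:
        f.visititems(visit)
    return shapes


# Demo on a temporary file with hypothetical dataset names.
demo_path = os.path.join(tempfile.mkdtemp(), "demo.hdf5")
with h5py.File(demo_path, "w") as f:
    f.create_dataset("x_mean", data=np.zeros((10, 3)))
    f.create_dataset("x_var", data=np.ones((10, 3)))

summary = summarize_hdf5(demo_path)
```

Pointing summarize_hdf5 at a file in the output directory (./test by default) should show which arrays hold the latent posterior, noise, kernel hyperparameters, inducing points, and metadata.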

Analysis

analysis-nbs contains Jupyter notebooks and the required output files for recreating figures from the paper.
