A pipeline for self-supervised phenotype detection from confocal images of Ciona robusta embryos.
This pipeline requires the following R packages: circlize, class, cluster, ComplexHeatmap, fgsea, ggplot2, ggpubr, igraph, keras, leiden, optparse, parallel, purrr, STRINGdb, and umap.
This pipeline also uses the custom packages dirfns and moreComplexHeatmap.
```sh
# parse segmentation data from /segdat
Rscript readEmbryos.R

# train autoencoder models
Rscript encode.R

# generate graph of known target protein interactions from STRINGdb
Rscript interactions.R

# generate and score clusters for randomized hyperparameters
Rscript leiden.R

# generate plots and characterize optimal clusterings
Rscript plot.clust.R
```
The pipeline uses features extracted from segmentation of confocal images in Imaris. Summary statistics are extracted for segmented cells in each embryo. From the cell segmentation statistics, 116 embryo-level parameters are computed. Parameters are normalized by z-score, then scaled between -1 and 1.
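As an illustration, here is one common way to combine z-scoring with scaling to [-1, 1] (a minimal NumPy sketch; the exact scaling used in readEmbryos.R may differ, e.g. it could scale by a fixed range rather than the per-column maximum):

```python
import numpy as np

def normalize_params(X):
    """Z-score each parameter (column), then rescale so values span [-1, 1]."""
    z = (X - X.mean(axis=0)) / X.std(axis=0)
    return z / np.abs(z).max(axis=0)

# Hypothetical parameter matrix: 3 embryos x 2 parameters on very different scales
X = np.array([[1.0, 200.0], [2.0, 100.0], [6.0, 300.0]])
X_norm = normalize_params(X)  # each column now has maximum |value| == 1
```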
Sample parameters are often strongly correlated. This is undesirable for self-supervised learning because each parameter contributes additively to the distance used for clustering, so phenotypes captured by multiple correlated parameters receive disproportionate weight. Linear methods of dimension reduction (e.g. PCA) assume the data can be represented as linear combinations of independent variables. We could not assume that our measured input parameters were independent or linearly related, so we instead used an autoencoder for dimension reduction.
An autoencoder is a neural network architecture widely used for denoising and image recognition. It works by encoding the input data into a lower dimensional representation that can be decoded with minimal loss. By extracting this lower dimensional encoding (the "bottleneck" or "embedding" layer), an autoencoder can be used for dimension reduction. This results in an embedding that corresponds to the information content of the input data rather than absolute distance in phenotype space.
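The pipeline's encode.R trains keras autoencoders; as a language-agnostic illustration of the idea, the sketch below trains a minimal linear autoencoder with a 2-d bottleneck on toy data (all dimensions and data here are hypothetical):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: 200 "embryos" x 8 correlated parameters
latent = rng.normal(size=(200, 2))
X = np.tanh(latent @ rng.normal(size=(2, 8)))

# Linear autoencoder with a 2-d bottleneck, trained by gradient descent on MSE
W_enc = rng.normal(scale=0.1, size=(8, 2))
W_dec = rng.normal(scale=0.1, size=(2, 8))
mse0 = np.mean((X @ W_enc @ W_dec - X) ** 2)
for _ in range(500):
    Z = X @ W_enc               # embedding (bottleneck layer)
    err = Z @ W_dec - X         # reconstruction error
    W_dec -= 0.05 * Z.T @ err / len(X)
    W_enc -= 0.05 * X.T @ (err @ W_dec.T) / len(X)
mse1 = np.mean((X @ W_enc @ W_dec - X) ** 2)

embedding = X @ W_enc           # low-dimensional representation used downstream
```

A real autoencoder adds nonlinear activations and more layers, but the principle is the same: the bottleneck is kept only if it reconstructs the input with low loss.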
encode.R
trains four autoencoders using embedding layers of 2, 3, 7, and 14 dimensions. leiden.R selects the optimal embedding based on the Akaike Information Criterion, defined as

$$AIC = 2k - 2\ln\hat{L}$$

where $k$ is the number of model parameters and $\hat{L}$ is the maximized likelihood of the model.
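As a numeric illustration of AIC-based selection (hypothetical reconstruction errors, assuming Gaussian residuals so the log-likelihood follows from the error variance):

```python
import math

def aic(k, log_likelihood):
    # AIC = 2k - 2 ln(L): penalizes parameters, rewards fit
    return 2 * k - 2 * log_likelihood

def gaussian_log_likelihood(sq_errors):
    # ln L of residuals under a zero-mean Gaussian with MLE variance
    n = len(sq_errors)
    var = sum(sq_errors) / n
    return -0.5 * n * (math.log(2 * math.pi * var) + 1)

# A small embedding with modest error can beat a large embedding with low error,
# because the extra dimensions are penalized
aic_small = aic(2, gaussian_log_likelihood([0.4, 0.5, 0.3, 0.6]))
aic_large = aic(14, gaussian_log_likelihood([0.1, 0.2, 0.1, 0.2]))
```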
Euclidean distance between embeddings is used to compute a k-nearest neighbors graph. The graph is then partitioned into clusters by maximizing modularity, which is defined as

$$Q = \frac{1}{2m}\sum_{ij}\left[A_{ij} - \frac{k_i k_j}{2m}\right]\delta(c_i, c_j)$$

where $A_{ij}$ is the weight of the edge between nodes $i$ and $j$, $k_i = \sum_j A_{ij}$ is the degree of node $i$, $m$ is the total edge weight of the graph, $c_i$ is the cluster assignment of node $i$, and $\delta$ is the Kronecker delta.
Because optimizing modularity is NP-hard, we used the Leiden algorithm to approximate an optimal solution.
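The pipeline computes this with the R igraph/leiden packages; as a direct illustration of the modularity formula, here is a NumPy sketch evaluated on a toy graph:

```python
import numpy as np

def modularity(A, communities):
    """Q = (1/2m) * sum_ij [A_ij - k_i k_j / 2m] * delta(c_i, c_j)."""
    k = A.sum(axis=1)                       # node degrees
    two_m = A.sum()                         # 2m (each edge counted twice)
    c = np.asarray(communities)
    delta = c[:, None] == c[None, :]        # same-cluster indicator
    return ((A - np.outer(k, k) / two_m) * delta).sum() / two_m

# Two triangles joined by a single bridge edge (2, 3)
edges = [(0, 1), (0, 2), (1, 2), (2, 3), (3, 4), (3, 5), (4, 5)]
A = np.zeros((6, 6))
for i, j in edges:
    A[i, j] = A[j, i] = 1

Q = modularity(A, [0, 0, 0, 1, 1, 1])  # = 5/14, about 0.357
```

Splitting the graph at the bridge yields positive modularity because most edges fall within, not between, the two groups.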
Though modularity ensures that clusters are well-connected, the number of clusters returned depends on the choice of hyperparameters. leiden.R
performs clustering for 100 randomized hyperparameter combinations. We selected the optimal clustering using the scoring criteria described below.
Ortholog Lookup
interactions.R
uses STRINGdb to construct a network of known protein interactions among the perturbed genes. Because the C. robusta network is poorly characterized, we use Ensembl to obtain orthologs from M. musculus and H. sapiens.
GSEA
The known protein interactions can be treated as a gene set for GSEA. Pairs of conditions can be ranked by the number of graph edges between embryos in the two conditions. An enrichment score is calculated based on the occurrence of known interactions near the top of the ranked list. An optimal cutoff can then be chosen at the rank where the running enrichment score peaks.
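The pipeline uses the fgsea R package for this step; for illustration, here is a minimal unweighted (Kolmogorov–Smirnov-style) running-sum enrichment score, with hypothetical items:

```python
def enrichment_score(ranked_items, gene_set):
    """Walk the ranked list; step up on gene-set hits, down on misses."""
    hits = [x in gene_set for x in ranked_items]
    n_hits = sum(hits)
    n_miss = len(ranked_items) - n_hits
    up, down = 1.0 / n_hits, 1.0 / n_miss
    score, best = 0.0, 0.0
    for h in hits:
        score += up if h else -down
        best = max(best, score)                 # maximum of the running sum
    return best

# Hits concentrated at the top of the ranking give the maximal score of 1.0
es = enrichment_score(list("abcdefgh"), {"a", "b", "c"})  # es == 1.0
```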
Gene Network
A gene network can be created from the ranked condition pairs, with each edge weighted by the number of graph edges between embryos in the two conditions. After selecting a cutoff, lower-scoring edges are removed to give a reduced network.
Reduced GSEA
Condition pairs for each clustering are ranked by the proportion of edges between embryos in the same cluster versus edges between embryos in different clusters. An enrichment score can be calculated as with the known protein interactions above.
Comparison to Known Protein Interactions
A second gene network is constructed using partial modularity between pairs of conditions. We define the partial modularity of conditions $u$ and $v$ as

$$Q_{uv} = \frac{1}{2m}\sum_{i \in u}\sum_{j \in v}\left[A_{ij} - \frac{k_i k_j}{2m}\right]$$

where $A_{ij}$, $k_i$, and $m$ are as defined for modularity above, and the sums run over the embryos in conditions $u$ and $v$.
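A NumPy sketch of this quantity, evaluated on a toy graph of two triangles joined by a bridge edge (the groups stand in for two hypothetical conditions):

```python
import numpy as np

def partial_modularity(A, members_u, members_v):
    """Q_uv = (1/2m) * sum over i in u, j in v of [A_ij - k_i k_j / 2m]."""
    k = A.sum(axis=1)
    two_m = A.sum()
    total = 0.0
    for i in members_u:
        for j in members_v:
            total += A[i, j] - k[i] * k[j] / two_m
    return total / two_m

# Two triangles bridged by edge (2, 3)
edges = [(0, 1), (0, 2), (1, 2), (2, 3), (3, 4), (3, 5), (4, 5)]
A = np.zeros((6, 6))
for i, j in edges:
    A[i, j] = A[j, i] = 1

q_within = partial_modularity(A, [0, 1, 2], [0, 1, 2])   # positive: densely linked
q_between = partial_modularity(A, [0, 1, 2], [3, 4, 5])  # negative: sparsely linked
```

Pairs of conditions with more edges between them than expected by chance get positive partial modularity, which serves as an edge weight in the second network.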
Mean Silhouette Width
Pointwise silhouette width is defined as

$$s(i) = \frac{b(i) - a(i)}{\max\{a(i),\, b(i)\}}$$

where $a(i)$ is the mean distance from point $i$ to the other points in its cluster and $b(i)$ is the smallest mean distance from $i$ to the points of any other cluster. The mean silhouette width over all embryos measures how well-separated the clusters are.
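A direct NumPy implementation of this definition, checked on a toy example of two tight, well-separated clusters:

```python
import numpy as np

def silhouette(X, labels):
    """s(i) = (b_i - a_i) / max(a_i, b_i), using Euclidean distance."""
    X, labels = np.asarray(X, float), np.asarray(labels)
    n = len(X)
    D = np.linalg.norm(X[:, None] - X[None, :], axis=-1)  # pairwise distances
    s = np.empty(n)
    for i in range(n):
        same = labels == labels[i]
        a = D[i, same & (np.arange(n) != i)].mean()        # own-cluster mean distance
        b = min(D[i, labels == c].mean()                   # nearest other cluster
                for c in set(labels.tolist()) if c != labels[i])
        s[i] = (b - a) / max(a, b)
    return s

# Two tight clusters far apart: mean silhouette width close to 1
X = [[0, 0], [0, 1], [10, 0], [10, 1]]
mean_sw = silhouette(X, [0, 0, 1, 1]).mean()
```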
plot.clust.R
tests for enrichment of experimental perturbations and experimenter-labeled phenotypes in each cluster using a hypergeometric test. For each condition, the probability of observing at least $q$ embryos from the condition in a cluster is

$$P(X \ge q) = \sum_{k=q}^{\min(K,\,n)} \frac{\binom{K}{k}\binom{N-K}{n-k}}{\binom{N}{n}}$$

where $N$ is the total number of embryos, $K$ is the number of embryos in the condition, $n$ is the number of embryos in the cluster, and $q$ is the number of embryos in both. We define the odds ratio

$$OR = \frac{q/(n-q)}{(K-q)/(N-K-n+q)}$$
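A sketch of the hypergeometric tail probability and odds ratio from standard library combinatorics (the counts below are hypothetical):

```python
from math import comb

def hypergeom_tail(N, K, n, q):
    """P(X >= q): at least q of the n cluster members fall in the K-member condition."""
    return sum(comb(K, k) * comb(N - K, n - k)
               for k in range(q, min(K, n) + 1)) / comb(N, n)

def odds_ratio(N, K, n, q):
    """Odds of condition membership inside the cluster vs. outside it."""
    return (q / (n - q)) / ((K - q) / (N - K - n + q))

# Hypothetical counts: 100 embryos, 10 in the condition, cluster of 8, overlap of 5
p = hypergeom_tail(100, 10, 8, 5)       # small p: overlap unlikely by chance
oddsr = odds_ratio(100, 10, 8, 5)       # = (5/3) / (5/87) = 29.0
```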