Update tutorial notebooks

Genentech · Mar 22, 2024 · b322be1
1 parent 5f272b3
commit b322be1
Showing 5 changed files with 1,813 additions and 669 deletions.
diff --git a/docs/notebooks/advanced_tutorial.ipynb b/docs/notebooks/advanced_tutorial.ipynb
diff --git a/docs/notebooks/cell_annotation_tutorial.ipynb b/docs/notebooks/cell_annotation_tutorial.ipynb
@@ -5,30 +5,29 @@
    "id": "8943365d-1db3-4b0f-9ba0-18ed16e124fd",
    "metadata": {},
    "source": [
-    "# SCimilarity search for IPF derived myofibroblasts-similar cells across 22.7M cells\n",
-    "This tutorial is to familiarize users with SCimilarity's basic cell search function.\n",
+    "# Annotating cell types\n",
+    "This tutorial familiarizes users with SCimilarity's basic cell annotation functionality.\n",
     "\n",
-    "- System requirements\n",
-    "   - At least 64GB of RAM\n",
-    "   - SCimilarity package installed\n",
-    " - Note: these are large files. Downloading and processing can take a several minutes.\n",
+    "System requirements:\n",
+    "\n",
+    "  - At least 64GB of RAM\n",
     "\n",
     "## 0. Required software and data\n",
     "Things you need for this demo:\n",
     "\n",
     " 0. [SCimilarity](https://github.com/Genentech/scimilarity) package should already be installed.\n",
     "\n",
-    " 1. SCimilarity trained model. [Download SCimilarity models](https://zenodo.org/record/8240464).\n",
+    " 1. SCimilarity trained model. [Download SCimilarity models](https://zenodo.org/records/10685499). Note, this is a large tarball - downloading and uncompressing can take several minutes.\n",
     "\n",
-    " 2. Data. We will use [Adams et al., 2020](https://www.science.org/doi/10.1126/sciadv.aba1983?url_ver=Z39.88-2003&rfr_id=ori:rid:crossref.org&rfr_dat=cr_pub%20%200pubmed) healthy and IPF lung scRNA-seq data. [Download tutorial data](https://zenodo.org/record/8242083)."
+    " 2. A dataset to annotate. We will use [Adams et al., 2020](https://www.science.org/doi/10.1126/sciadv.aba1983?url_ver=Z39.88-2003&rfr_id=ori:rid:crossref.org&rfr_dat=cr_pub%20%200pubmed) healthy and IPF lung scRNA-seq data. [Download tutorial data](https://zenodo.org/record/8242083)."
    ]
   },
   {
    "cell_type": "markdown",
    "id": "21a6de3d-b297-420c-a731-59bc2eee5d81",
    "metadata": {},
    "source": [
-    "If the models haven't been downloaded please uncomment and run the two command below"
+    "If the model hasn't been downloaded, please uncomment and run the two commands below."
    ]
   },
   {
@@ -38,16 +37,17 @@
    "metadata": {},
    "outputs": [],
    "source": [
-    "#!curl -L -o /models/model_v1.tar.gz https://zenodo.org/record/8240464/files/model_v1.tar.gz?download=1\n",
-    "#!tar -xzvf /models/model_v1.tar.gz"
+    "#!curl -L -o /models/model_v1.1.tar.gz \\\n",
+    "#https://zenodo.org/records/10685499/files/model_v1.1.tar.gz?download=1\n",
+    "#!tar -xzvf /models/model_v1.1.tar.gz"
    ]
   },
   {
    "cell_type": "markdown",
    "id": "45048601-6f1b-4ee3-a60e-ebc5a7ce6deb",
    "metadata": {},
    "source": [
-    "If the models haven't been downloaded please uncomment and run the two command below"
+    "If the data hasn't been downloaded, please uncomment and run the command below."
    ]
   },
   {
@@ -57,30 +57,24 @@
    "metadata": {},
    "outputs": [],
    "source": [
-    "#!curl -L -o \"/data/GSE136831_subsample.h5ad\" https://zenodo.org/record/8242083/files/GSE136831_subsample.h5ad?download=1"
+    "#!curl -L -o \"/data/GSE136831_subsample.h5ad\" \\\n",
+    "#https://zenodo.org/record/8242083/files/GSE136831_subsample.h5ad?download=1"
    ]
   },
   {
    "cell_type": "code",
-   "execution_count": 3,
+   "execution_count": null,
    "id": "c91db7eb",
    "metadata": {},
-   "outputs": [
-    {
-     "name": "stderr",
-     "output_type": "stream",
-     "text": [
-      "/home/kuot/miniconda3/envs/gpy/lib/python3.10/site-packages/pandas/core/computation/expressions.py:21: UserWarning: Pandas requires version '2.8.4' or newer of 'numexpr' (version '2.7.3' currently installed).\n",
-      "  from pandas.core.computation.check import NUMEXPR_INSTALLED\n"
-     ]
-    }
-   ],
+   "outputs": [],
    "source": [
     "import scanpy as sc\n",
+    "\n",
     "sc.set_figure_params(dpi=100)\n",
     "\n",
     "import warnings\n",
-    "warnings.filterwarnings('ignore')"
+    "\n",
+    "warnings.filterwarnings(\"ignore\")"
    ]
   },
   {
@@ -121,7 +115,9 @@
    },
    "outputs": [],
    "source": [
-    "model_path = '/models/model_v1'\n",
+    "# Instantiate the CellAnnotation object\n",
+    "# Set model_path to the location of the uncompressed model\n",
+    "model_path = \"/models/model_v1.1\"\n",
     "ca = CellAnnotation(model_path=model_path)"
    ]
   },
@@ -140,7 +136,9 @@
    "metadata": {},
    "outputs": [],
    "source": [
-    "data_path = '/data/GSE136831_subsample.h5ad'\n",
+    "# Load the tutorial data\n",
+    "# Set data_path to the location of the tutorial dataset\n",
+    "data_path = \"/data/GSE136831_subsample.h5ad\"\n",
     "adams = sc.read(data_path)"
    ]
   },
@@ -155,7 +153,7 @@
     "#### Match feature space with SCimilarity models \n",
     "SCimilarity's gene expression ordering is fixed. New data should be reordered to match it, so that it is consistent with how the model was trained. Genes that are not present in the new data will be zero-filled to comply with the expected structure. Genes that are not present in SCimilarity's gene ordering will be filtered out. \n",
     "\n",
-    "Note: SCimilarity was trained with high data dropout to increase robustness to differences in gene lists. "
+    "Note, SCimilarity was trained with high data dropout to increase robustness to differences in gene lists. "
    ]
   },
   {
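The gene reordering described in this hunk can be sketched in plain Python. This is a minimal illustration under stated assumptions, not SCimilarity's own implementation: the helper `align_to_gene_order` and the toy gene names are hypothetical, and plain lists stand in for the expression matrix so the example runs without the tutorial's dependencies.

```python
def align_to_gene_order(matrix, genes, gene_order):
    """Return rows of `matrix` with columns reordered to `gene_order`.

    Genes missing from `genes` are zero-filled; genes that are not in
    `gene_order` are dropped.
    """
    # Map each gene in the query data to its column index.
    column_of = {gene: j for j, gene in enumerate(genes)}
    aligned = []
    for row in matrix:
        aligned.append(
            [row[column_of[g]] if g in column_of else 0.0 for g in gene_order]
        )
    return aligned


# Toy query data: 2 cells x 3 genes (hypothetical names and values).
query_genes = ["GENE_B", "GENE_X", "GENE_A"]
query = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]
model_gene_order = ["GENE_A", "GENE_B", "GENE_C"]  # hypothetical fixed ordering

aligned = align_to_gene_order(query, query_genes, model_gene_order)
print(aligned)
```

Here `GENE_X` is dropped because the fixed ordering does not contain it, and `GENE_C` is zero-filled because the query data lacks it.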
@@ -217,7 +215,7 @@
    },
    "outputs": [],
    "source": [
-    "adams.obsm['X_scimilarity'] = ca.get_embeddings(adams.X)"
+    "adams.obsm[\"X_scimilarity\"] = ca.get_embeddings(adams.X)"
    ]
   },
   {
@@ -238,7 +236,7 @@
    },
    "outputs": [],
    "source": [
-    "sc.pp.neighbors(adams, use_rep='X_scimilarity')\n",
+    "sc.pp.neighbors(adams, use_rep=\"X_scimilarity\")\n",
     "sc.tl.umap(adams)"
    ]
   },
@@ -275,7 +273,7 @@
     }
    ],
    "source": [
-    "sc.pl.umap(adams, color='celltype_raw', legend_fontsize=5)"
+    "sc.pl.umap(adams, color=\"celltype_raw\", legend_fontsize=5)"
    ]
   },
   {
@@ -302,18 +300,18 @@
     "## 3. Cell type classification\n",
     "\n",
     "Two methods within the CellAnnotation class:\n",
-    " 1. `annotate_dataset` - automatically computes embeddings\n",
-    " 2. `get_predictions` - more detailed control of annotation\n",
+    " 1. `annotate_dataset` - automatically computes embeddings.\n",
+    " 2. `get_predictions_kNN` - more detailed control of annotation.\n",
     "\n",
     "*Description of inputs*\n",
-    " - X_scimilarity: embeddings from the model, which can be used to generate UMAPs in lieu of PCA and is in theory general across datasets    \n",
+    " - `X_scimilarity`: embeddings from the model, which can be used in lieu of PCA to generate UMAPs and generalize across datasets.\n",
     "\n",
     "*Description of outputs*\n",
-    " - predictions: celltype label predictions.\n",
-    " - nn_idxs: indicies of cells in the SCimilarity reference. \n",
-    " - nn_dists: the minimum distance within k=50 nearest neighbors.\n",
-    " - nn_stats: a dataframe containing useful metrics such as: \n",
-    "     - hits: the distribution of celltypes in k=50 nearest neighbors."
+    " - `predictions`: cell type annotation predictions.\n",
+    " - `nn_idxs`: indices of cells in the SCimilarity reference. \n",
+    " - `nn_dists`: the minimum distance within k=50 nearest neighbors.\n",
+    " - `nn_stats`: a dataframe containing useful metrics such as: \n",
+    "   - `hits`: the distribution of cell types in k=50 nearest neighbors."
    ]
   },
   {
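To make the outputs listed in this hunk concrete, here is a small sketch of nearest-neighbor label prediction. It assumes a simple majority vote over Euclidean neighbors, which may differ from how SCimilarity actually scores neighbors. The function `knn_predict`, the 2-D embeddings, and `k=3` are hypothetical stand-ins (the tutorial uses k=50 against the full reference).

```python
import math
from collections import Counter


def knn_predict(query, reference, labels, k=3):
    """Label one query embedding by majority vote over its k nearest reference cells."""
    # Distance from the query to every reference embedding.
    dists = [math.dist(query, ref) for ref in reference]
    # Indices of the k closest reference cells, nearest first.
    nn_idxs = sorted(range(len(reference)), key=dists.__getitem__)[:k]
    nn_dists = [dists[i] for i in nn_idxs]
    # Distribution of labels among the neighbors (comparable to `hits`).
    hits = Counter(labels[i] for i in nn_idxs)
    prediction = hits.most_common(1)[0][0]
    return prediction, nn_idxs, nn_dists, hits


# Hypothetical 2-D embeddings and labels for a tiny reference.
reference = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.2), (5.0, 5.0), (5.1, 5.0)]
labels = ["B cell", "B cell", "mast cell", "pericyte", "pericyte"]

prediction, nn_idxs, nn_dists, hits = knn_predict((0.02, 0.01), reference, labels)
print(prediction, nn_idxs, dict(hits))
```

The returned names mirror the tutorial's outputs only loosely: `nn_idxs` indexes into the toy reference, `nn_dists` holds the distances to those neighbors, and `hits` counts labels among them.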
@@ -349,16 +347,18 @@
     }
    ],
    "source": [
-    "predictions, nn_idxs, nn_dists, nn_stats = ca.get_predictions_kNN(adams.obsm['X_scimilarity'])\n",
-    "adams.obs['predictions_unconstrained'] = predictions.values"
+    "predictions, nn_idxs, nn_dists, nn_stats = ca.get_predictions_kNN(\n",
+    "    adams.obsm[\"X_scimilarity\"]\n",
+    ")\n",
+    "adams.obs[\"predictions_unconstrained\"] = predictions.values"
    ]
   },
   {
    "cell_type": "markdown",
    "id": "40001f20-cfbb-450e-9c05-c5dcf2664de2",
    "metadata": {},
    "source": [
-    "#### Since each cell is classified independently, there is higher classification noise, filtering out low count cells can reduce the noise in visualization"
+    "Since each cell is classified independently, there is higher classification noise. Filtering out cell types with few cells can reduce the noise in visualization."
    ]
   },
   {
@@ -387,11 +387,13 @@
    ],
    "source": [
     "celltype_counts = adams.obs.predictions_unconstrained.value_counts()\n",
-    "well_represented_celltypes = celltype_counts[celltype_counts>20].index\n",
+    "well_represented_celltypes = celltype_counts[celltype_counts > 20].index\n",
     "\n",
-    "sc.pl.umap(adams[adams.obs.predictions_unconstrained.isin(well_represented_celltypes)], \n",
-    "           color='predictions_unconstrained', \n",
-    "           legend_fontsize=5)"
+    "sc.pl.umap(\n",
+    "    adams[adams.obs.predictions_unconstrained.isin(well_represented_celltypes)],\n",
+    "    color=\"predictions_unconstrained\",\n",
+    "    legend_fontsize=5,\n",
+    ")"
    ]
   },
   {
@@ -409,11 +411,11 @@
    },
    "source": [
     "### Constrained classification\n",
-    "By classifying against the full reference, we can get redundant cell types, e.g. activated CD8-positive, alpha-beta T cell and CD8-positive, alpha-beta T cell.\n",
+    "By classifying against the full reference, we can get redundant cell types, such as activated CD8-positive, alpha-beta T cell and CD8-positive, alpha-beta T cell.\n",
     "\n",
-    "Alternatively we can subset the reference to just the cell types we want to classify to. This also reduces noise in cell type annotation.\n",
+    "Alternatively, we can subset the reference to just the cell types we want to classify to. This also reduces noise in cell type annotation.\n",
     "\n",
-    "Note: subsetting can slow classification speeds due kNN optimization for full reference classification."
+    "Note, subsetting can slow classification speeds as the kNN is optimized for the full reference."
    ]
   },
   {
@@ -425,14 +427,41 @@
    },
    "outputs": [],
    "source": [
-    "target_celltypes = ['alveolar macrophage', 'macrophage', 'natural killer cell', 'ciliated cell', 'mature NK T cell',\n",
-    "                    'B cell', 'fibroblast', 'classical monocyte', 'type II pneumocyte', 'endothelial cell of vascular tree',\n",
-    "                    'club cell', 'endothelial cell of lymphatic vessel', 'CD8-positive, alpha-beta T cell',\n",
-    "                    'respiratory basal cell', 'mast cell', 'type I pneumocyte', 'secretory cell', 'CD4-positive, alpha-beta T cell',\n",
-    "                    'lung macrophage', 'plasma cell', 'basal cell', 'non-classical monocyte', 'plasmacytoid dendritic cell',\n",
-    "                    'lung ciliated cell', 'vascular associated smooth muscle cell', 'conventional dendritic cell',\n",
-    "                    'goblet cell', 'smooth muscle cell', 'pericyte', 'regulatory T cell', 'myofibroblast cell',\n",
-    "                    'neuroendocrine cell', 'pulmonary ionocyte']\n",
+    "target_celltypes = [\n",
+    "    \"alveolar macrophage\",\n",
+    "    \"macrophage\",\n",
+    "    \"natural killer cell\",\n",
+    "    \"ciliated cell\",\n",
+    "    \"mature NK T cell\",\n",
+    "    \"B cell\",\n",
+    "    \"fibroblast\",\n",
+    "    \"classical monocyte\",\n",
+    "    \"type II pneumocyte\",\n",
+    "    \"endothelial cell of vascular tree\",\n",
+    "    \"club cell\",\n",
+    "    \"endothelial cell of lymphatic vessel\",\n",
+    "    \"CD8-positive, alpha-beta T cell\",\n",
+    "    \"respiratory basal cell\",\n",
+    "    \"mast cell\",\n",
+    "    \"type I pneumocyte\",\n",
+    "    \"secretory cell\",\n",
+    "    \"CD4-positive, alpha-beta T cell\",\n",
+    "    \"lung macrophage\",\n",
+    "    \"plasma cell\",\n",
+    "    \"basal cell\",\n",
+    "    \"non-classical monocyte\",\n",
+    "    \"plasmacytoid dendritic cell\",\n",
+    "    \"lung ciliated cell\",\n",
+    "    \"vascular associated smooth muscle cell\",\n",
+    "    \"conventional dendritic cell\",\n",
+    "    \"goblet cell\",\n",
+    "    \"smooth muscle cell\",\n",
+    "    \"pericyte\",\n",
+    "    \"regulatory T cell\",\n",
+    "    \"myofibroblast cell\",\n",
+    "    \"neuroendocrine cell\",\n",
+    "    \"pulmonary ionocyte\",\n",
+    "]\n",
     "\n",
     "ca.safelist_celltypes(target_celltypes)"
    ]
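The effect of constraining the reference can be illustrated with a toy nearest-neighbor lookup. This is only a sketch: `nearest_label` is a hypothetical helper that uses a single nearest neighbor and Euclidean distance, and it does not reproduce what `safelist_celltypes` does internally.

```python
import math


def nearest_label(query, reference, labels, safelist=None):
    """Return the label of the closest reference cell, optionally restricted to a safelist."""
    # Keep only reference cells whose label is allowed.
    candidates = [
        i for i, label in enumerate(labels) if safelist is None or label in safelist
    ]
    best = min(candidates, key=lambda i: math.dist(query, reference[i]))
    return labels[best]


# Hypothetical 2-D embeddings with two near-redundant T cell labels.
reference = [(0.0, 0.0), (0.1, 0.0), (1.0, 1.0)]
labels = [
    "activated CD8-positive, alpha-beta T cell",
    "CD8-positive, alpha-beta T cell",
    "B cell",
]
query = (0.02, 0.0)

unconstrained = nearest_label(query, reference, labels)
constrained = nearest_label(
    query, reference, labels, safelist={"CD8-positive, alpha-beta T cell", "B cell"}
)
print(unconstrained, "->", constrained)
```

With the safelist applied, the query falls back to the closest allowed label instead of the redundant "activated" variant.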
@@ -489,7 +518,7 @@
     }
    ],
    "source": [
-    "sc.pl.umap(adams, color='celltype_hint', legend_fontsize=5)"
+    "sc.pl.umap(adams, color=\"celltype_hint\", legend_fontsize=5)"
    ]
   },
   {
@@ -498,9 +527,9 @@
    "metadata": {},
    "source": [
     "### Annotation QC\n",
-    "Cell annotation also computes QC metrics for our annotations. One of which, `min_dist`, represents the minimum distance between a cell in the query dataset and all cells in the training set. The greater `min_dist`, (i.e. the further away from what the model has seen before) the less confidence we have in the model's prediction. \n",
+    "Cell annotation also computes QC metrics for our annotations. One of these, `min_dist`, represents the minimum distance between a cell in the query dataset and all cells in the training set. The greater the `min_dist` (i.e., the further the cell is from what the model has seen before), the less confidence we have in the model's prediction. \n",
     "\n",
-    "Note: for different applications and questions different min_dist ranges have different implications."
+    "Note, for different applications and questions, different `min_dist` ranges have different implications."
    ]
   },
   {
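As a rough illustration of the `min_dist` idea, the sketch below computes the minimum Euclidean distance from each query embedding to a toy reference and flags cells beyond a cutoff. The helper name, the embeddings, and the cutoff of 0.1 are assumptions for illustration: the value is borrowed from the plot's `vmax`, it is not a recommended threshold, and SCimilarity's distance metric may differ.

```python
import math


def min_dist_to_reference(query, reference):
    """Smallest distance from one query embedding to any reference embedding."""
    return min(math.dist(query, ref) for ref in reference)


# Hypothetical 2-D embeddings.
reference = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]
queries = {"near": (0.0, 0.05), "far": (3.0, 4.0)}

threshold = 0.1  # hypothetical cutoff; choose one that suits your application
min_dists = {name: min_dist_to_reference(q, reference) for name, q in queries.items()}

# Cells far from anything in the reference get lower-confidence annotations.
low_confidence = [name for name, d in min_dists.items() if d > threshold]
print(min_dists, low_confidence)
```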
@@ -528,37 +557,29 @@
     }
    ],
    "source": [
-    "sc.pl.umap(adams, color='min_dist', vmax=.1)"
+    "sc.pl.umap(adams, color=\"min_dist\", vmax=0.1)"
    ]
   },
   {
    "cell_type": "markdown",
    "id": "9f68462f-66ef-4505-99db-7386f063733f",
    "metadata": {},
    "source": [
-    "## Conclusion: How to apply to your own datasets\n",
+    "## Conclusion\n",
     "This notebook outlines how to take a dataset and perform cell type annotation.\n",
     "\n",
-    " - Keep in mind that the datasets that you analyze with SCimilarity should fit the following criteria:\n",
-    "   - Data generated from 10X Chromium machine (models are trained using this data only).\n",
-    "   - Human scRNA-seq data.\n",
-    "   - Normalized from counts with SCimilarity functions or using the same process. Different normalizations will have poor results."
+    "Keep in mind that the datasets that you analyze with SCimilarity should fit the following criteria:\n",
+    "  - Data generated from the 10x Genomics Chromium platform (models are trained using this data only).\n",
+    "  - Human scRNA-seq data.\n",
+    "  - Counts normalized with SCimilarity functions or using the same process. Different normalizations will give poor results."
    ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": null,
-   "id": "3044acc6-ea02-49a5-89f6-f4cf5e9e29e4",
-   "metadata": {},
-   "outputs": [],
-   "source": []
   }
  ],
  "metadata": {
   "kernelspec": {
-   "display_name": "gpy",
+   "display_name": "Python 3 (ipykernel)",
    "language": "python",
-   "name": "gpy"
+   "name": "python3"
   },
   "language_info": {
    "codemirror_mode": {
@@ -570,7 +591,7 @@
    "name": "python",
    "nbconvert_exporter": "python",
    "pygments_lexer": "ipython3",
-   "version": "3.10.13"
+   "version": "3.11.4"
   }
  },
  "nbformat": 4,