broadinstitute · shntnu · Mar 21, 2021 · May 30, 2020 · May 30, 2020 · May 30, 2020
diff --git a/README.md b/README.md
@@ -3,17 +3,18 @@
 The Library of Integrated Network-Based Cellular Signatures (LINCS) Project aims to create publicly available resources to characterize how cells respond to perturbation.
 This repository stores Cell Painting readouts and associated data-processing pipelines for the LINCS Cell Painting dataset.
 
+In this project, the [Connectivity Map](https://clue.io/team) team perturbed A549 cells with 1,571 compounds across 6 doses in 5 technical replicates.
 The data represent **a subset** of the [Broad Drug Repurposing Hub](https://clue.io/repurposing#home) collection of compounds.
 
-In this project, the [Connectivity Map](https://clue.io/team) team perturbed A549 cells with ~1,500 compounds across 6 doses in 5 technical replicates.
 We refer to this dataset as `LINCS Pilot 1`.
 We also include data for the second batch of LINCS Cell Painting data, which we refer to as `LKCP`.
 
 For a specific list of compounds tested, see [`metadata`](https://github.com/broadinstitute/lincs-cell-painting/tree/master/metadata).
 You can interactively explore information about the compounds in the [CLUE Repurposing app](https://clue.io/repurposing-app).
+
 The [Morphology Connectivity Hub](https://clue.io/morphology) is the primary source of this dataset.
 
-## Image-Based profiling
+## Image-based profiling
 
 We apply a unified, image-based profiling pipeline to all 136 384-well plates from `LINCS Pilot 1`, and all 135 384-well plates from `LKCP`.
 We use [pycytominer](https://github.com/cytomining/pycytominer) as the primary tool for image-based profiling.
@@ -27,6 +28,10 @@ For more details about image-based profiling in general, please refer to [Caiced
 
 We use [conda](https://docs.conda.io/en/latest/) to manage the computational environment.
 
+To install conda, see the [Miniconda installation instructions](https://docs.conda.io/en/latest/miniconda.html).
+
+We recommend installing conda by downloading and running the `.sh` installer and accepting the defaults.
+
 After installing conda, execute the following to install and navigate to the environment:
 
 ```bash

diff --git a/consensus/2016_04_01_a549_48hr_batch1/2016_04_01_a549_48hr_batch1_consensus_median.csv.gz b/consensus/2016_04_01_a549_48hr_batch1/2016_04_01_a549_48hr_batch1_consensus_median.csv.gz
diff --git a/...nsus/2016_04_01_a549_48hr_batch1/2016_04_01_a549_48hr_batch1_consensus_median_dmso.csv.gz b/...nsus/2016_04_01_a549_48hr_batch1/2016_04_01_a549_48hr_batch1_consensus_median_dmso.csv.gz
diff --git a/...04_01_a549_48hr_batch1/2016_04_01_a549_48hr_batch1_consensus_median_feature_select.csv.gz b/...04_01_a549_48hr_batch1/2016_04_01_a549_48hr_batch1_consensus_median_feature_select.csv.gz
diff --git a/..._a549_48hr_batch1/2016_04_01_a549_48hr_batch1_consensus_median_feature_select_dmso.csv.gz b/..._a549_48hr_batch1/2016_04_01_a549_48hr_batch1_consensus_median_feature_select_dmso.csv.gz
diff --git a/consensus/2016_04_01_a549_48hr_batch1/2016_04_01_a549_48hr_batch1_consensus_modz.csv.gz b/consensus/2016_04_01_a549_48hr_batch1/2016_04_01_a549_48hr_batch1_consensus_modz.csv.gz
diff --git a/consensus/2016_04_01_a549_48hr_batch1/2016_04_01_a549_48hr_batch1_consensus_modz_dmso.csv.gz b/consensus/2016_04_01_a549_48hr_batch1/2016_04_01_a549_48hr_batch1_consensus_modz_dmso.csv.gz
diff --git a/...6_04_01_a549_48hr_batch1/2016_04_01_a549_48hr_batch1_consensus_modz_feature_select.csv.gz b/...6_04_01_a549_48hr_batch1/2016_04_01_a549_48hr_batch1_consensus_modz_feature_select.csv.gz
diff --git a/...01_a549_48hr_batch1/2016_04_01_a549_48hr_batch1_consensus_modz_feature_select_dmso.csv.gz b/...01_a549_48hr_batch1/2016_04_01_a549_48hr_batch1_consensus_modz_feature_select_dmso.csv.gz
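The eight consensus files listed above follow one naming pattern: batch name, consensus operation, an optional `_feature_select` tag, and a suffix set by the plate normalization strategy. Below is a minimal sketch of that pattern. The helper `consensus_file_names` and the dictionary keys are hypothetical names chosen for illustration; only the resulting file names come from this commit.

```python
from itertools import product
from pathlib import Path

# Batch name and operations as they appear in this commit's file listing.
BATCH = "2016_04_01_a549_48hr_batch1"
OPERATIONS = ["median", "modz"]

# Plate normalization strategy -> output suffix.
# The keys are illustrative labels; the suffixes match the listed files.
SUFFIXES = {"whole_plate": ".csv.gz", "dmso": "_dmso.csv.gz"}

# Feature selection flag -> tag inserted before the suffix.
FEATURE_SELECT_TAGS = {False: "", True: "_feature_select"}


def consensus_file_names(batch):
    """Hypothetical helper: enumerate the eight consensus output paths."""
    paths = []
    for operation, suffix, tag in product(
        OPERATIONS, SUFFIXES.values(), FEATURE_SELECT_TAGS.values()
    ):
        paths.append(Path(batch, f"{batch}_consensus_{operation}{tag}{suffix}"))
    return paths


for path in consensus_file_names(BATCH):
    print(path)
```

Two operations, two normalization strategies, and two feature selection settings give 2 x 2 x 2 = 8 files per batch.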
diff --git a/consensus/README.md b/consensus/README.md
@@ -39,3 +39,21 @@ We then recode the dose points into ascending numerical levels and add a new met
 
 Note we generated per-well DMSO consensus signatures and per compound-dose pair consensus signatures for compounds.
 The per-well DMSO profiles can help to assess plate-associated batch effects.
+
+## Reproduce Pipeline
+
+The pipeline can be reproduced by executing the following:
+
+```bash
+# Make sure conda environment is activated
+conda activate lincs
+
+# Reproduce the pipeline for producing bulk signatures
+ipython scripts/nbconverted/build-consensus-signatures.py
+```
+
+The `scripts/nbconverted/*.py` files were generated from the Jupyter notebooks in this folder with:
+
+```sh
+jupyter nbconvert --to=script --FilesWriter.build_directory=scripts/nbconverted *.ipynb
+```
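After the pipeline has run, a quick sanity check is to read the header of one gzip-compressed output and separate metadata columns from feature columns. The sketch below uses only the standard library and assumes that metadata columns carry the `Metadata_` prefix seen in this commit. The helper `split_columns`, the demo file, and the two feature column names are made up for illustration.

```python
import csv
import gzip
import tempfile
from pathlib import Path


def split_columns(csv_gz_path):
    """Return (metadata_columns, feature_columns) from a gzipped CSV header."""
    with gzip.open(csv_gz_path, mode="rt", newline="") as handle:
        header = next(csv.reader(handle))
    metadata = [col for col in header if col.startswith("Metadata_")]
    features = [col for col in header if not col.startswith("Metadata_")]
    return metadata, features


# Demo on a tiny stand-in file; the feature names below are invented.
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = Path(tmp_dir, "demo_consensus_modz.csv.gz")
    with gzip.open(demo_path, mode="wt", newline="") as handle:
        writer = csv.writer(handle)
        writer.writerow(
            ["Metadata_pert_well", "Metadata_dose_recode",
             "Cells_feature_a", "Nuclei_feature_b"]
        )
        writer.writerow(["A01", "1", "0.25", "-1.5"])
    metadata_cols, feature_cols = split_columns(demo_path)

print(metadata_cols)
print(feature_cols)
```

To check a real output, point `split_columns` at one of the files written by the pipeline instead of the demo file.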
diff --git a/consensus/build-consensus-signatures.ipynb b/consensus/build-consensus-signatures.ipynb
diff --git a/consensus/scripts/nbconverted/build-consensus-signatures.py b/consensus/scripts/nbconverted/build-consensus-signatures.py
@@ -6,14 +6,18 @@
 # Here, we generate consensus signatures for the LINCS Drug Repurposing Hub Cell Painting subset.
 # See the project [README.md](README.md) for more details.
 # 
-# This notebook generates four files; one per plate normalization and consensus normalization strategy.
+# This notebook generates eight files: one for each combination of plate normalization and consensus normalization strategy, with and without feature selection.
 # 
-# | Plate Normalization | Consensus Normalization | Consensus Suffix |
-# | :------------------: | :------------------------: | -----------------: |
-# | DMSO | Median | `<BATCH>_consensus_median_dmso.csv.gz` |
-# | DMSO | MODZ | `<BATCH>_consensus_modz_dmso.csv.gz` |
-# | Whole Plate | Median | `<BATCH>_consensus_median.csv.gz` |
-# | Whole Plate | MODZ | `<BATCH>_consensus_modz.csv.gz` |
+# | Feature Selection | Plate Normalization | Consensus Normalization | Consensus Suffix |
+# | :---------------- | :------------------: | :------------------------: | -----------------: |
+# | No  | DMSO | Median | `<BATCH>_consensus_median_dmso.csv.gz` |
+# | No  | DMSO | MODZ | `<BATCH>_consensus_modz_dmso.csv.gz` |
+# | No  | Whole Plate | Median | `<BATCH>_consensus_median.csv.gz` |
+# | No  | Whole Plate | MODZ | `<BATCH>_consensus_modz.csv.gz` |
+# | Yes | DMSO | Median | `<BATCH>_consensus_median_feature_select_dmso.csv.gz` |
+# | Yes | DMSO | MODZ | `<BATCH>_consensus_modz_feature_select_dmso.csv.gz` |
+# | Yes | Whole Plate | Median | `<BATCH>_consensus_median_feature_select.csv.gz` |
+# | Yes | Whole Plate | MODZ | `<BATCH>_consensus_modz_feature_select.csv.gz` |
 
 # In[1]:
 
@@ -31,7 +35,7 @@
 
 from pycytominer.aggregate import aggregate
 from pycytominer.consensus import modz_base
-
+from pycytominer.feature_select import feature_select
 from pycytominer.cyto_utils import infer_cp_features
 
 
@@ -141,9 +145,9 @@ def consensus_apply(df, operation, cp_features, replicate_cols):
     del all_profiles_df
 
 
-# ## Create Consensus Profiles
+# ## Create Consensus Profiles, with and without feature selection
 # 
-# We generate two different consensus profiles for each of the normalization strategies. This generates four different files.
+# For each of the two plate normalization strategies, we generate two consensus profiles (median and MODZ), each with and without feature selection. This produces eight files in total.
 
 # In[7]:
 
@@ -155,12 +159,22 @@ def consensus_apply(df, operation, cp_features, replicate_cols):
     "Metadata_pert_well",
     "Metadata_mmoles_per_liter",
     "Metadata_dose_recode",
+    "Metadata_moa",
+    "Metadata_target",
 ]
 
 
 # In[8]:
 
 
+# feature selection operations
+feature_select_ops = [
+    "drop_na_columns",
+    "variance_threshold",
+    "correlation_threshold",
+    "blacklist",
+]
+
 all_consensus_dfs = {}
 for norm_strat in file_bases:
     all_profiles_df = all_profiles_dfs[norm_strat]
@@ -170,7 +184,9 @@ def consensus_apply(df, operation, cp_features, replicate_cols):
     for operation in operations:
         print(f"Now calculating {operation} consensus for {norm_strat} normalization")
 
-        consensus_profiles[operation] = consensus_apply(
+        consensus_profiles[operation] = {}
+
+        consensus_profiles[operation]["no_feat_select"] = consensus_apply(
             all_profiles_df,
             operation=operation,
             cp_features=cp_norm_features,
@@ -179,31 +195,77 @@ def consensus_apply(df, operation, cp_features, replicate_cols):
 
         # How many DMSO profiles per well?
         print(
-            f"There are {consensus_profiles[operation].shape[0]} {operation} consensus profiles for {norm_strat} normalization"
+            f"There are {consensus_profiles[operation]['no_feat_select'].shape[0]} {operation} consensus profiles for {norm_strat} normalization"
+        )
+
+        # feature selection
+        print(
+            f"Now feature selecting on {operation} consensus for {norm_strat} normalization"
+        )
+
+        consensus_profiles[operation]["feat_select"] = feature_select(
+            profiles=consensus_profiles[operation]["no_feat_select"],
+            features="infer",
+            operation=feature_select_ops,
+        )
+
+        # How many columns (metadata plus features) remain after feature selection?
+        print(
+            f"There are {consensus_profiles[operation]['feat_select'].shape[1]} columns (metadata and features) in the feature-selected {operation} consensus profiles for {norm_strat} normalization"
         )
 
     all_consensus_dfs[norm_strat] = consensus_profiles
 
 
-# ## Merge and Output Consensus Signatures
+# ## Merge and Output Consensus Signatures, with and without feature selection
 
 # In[9]:
 
 
+float_format = "%5g"
+compression = "gzip"
+
 for norm_strat in file_bases:
     file_suffix = file_bases[norm_strat]["output_file_suffix"]
     for operation in operations:
+
+        # No feature selection
         consensus_file = f"{batch}_consensus_{operation}{file_suffix}"
         consensus_file = pathlib.Path(batch, consensus_file)
 
-        consensus_df = all_consensus_dfs[norm_strat][operation]
+        consensus_df = all_consensus_dfs[norm_strat][operation]["no_feat_select"]
 
         print(
-            f"Now Writing: Consensus Operation: {operation}; Norm Strategy: {norm_strat}\nFile: {consensus_file}"
+            f"Now Writing: Feature selection: No; Consensus Operation: {operation}; Norm Strategy: {norm_strat}\nFile: {consensus_file}"
         )
         print(consensus_df.shape)
 
         consensus_df.to_csv(
-            consensus_file, sep=",", compression="gzip", float_format="%5g", index=False
+            consensus_file,
+            sep=",",
+            compression=compression,
+            float_format=float_format,
+            index=False,
+        )
+
+        # With feature selection
+        consensus_feat_df = all_consensus_dfs[norm_strat][operation]["feat_select"]
+
+        consensus_feat_file = (
+            f"{batch}_consensus_{operation}_feature_select{file_suffix}"
+        )
+        consensus_feat_file = pathlib.Path(batch, consensus_feat_file)
+
+        print(
+            f"Now Writing: Feature selection: Yes; Consensus Operation: {operation}; Norm Strategy: {norm_strat}\nFile: {consensus_feat_file}"
+        )
+        print(consensus_feat_df.shape)
+
+        consensus_feat_df.to_csv(
+            consensus_feat_file,
+            sep=",",
+            compression=compression,
+            float_format=float_format,
+            index=False,
         )
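The `float_format = "%5g"` setting controls how floating-point values are written to the CSV files. The sketch below applies the same printf-style format string directly in Python to show its effect, assuming pandas applies `float_format` to each float in this way. The example values are arbitrary.

```python
# "%5g" is a printf-style general format: the default six significant
# digits, with the result left-padded to a minimum width of five characters.
FLOAT_FORMAT = "%5g"

examples = [0.123456789, 1.0, 0.5, 123456789.0]
formatted = [FLOAT_FORMAT % value for value in examples]

for value, text in zip(examples, formatted):
    print(repr(value), "->", repr(text))
```

Values are rounded to six significant digits, so the written files are slightly lossy compared with the in-memory profiles, and short values are padded with leading spaces.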
 
diff --git a/environment.yml b/environment.yml
@@ -2,6 +2,7 @@ name: lincs
 channels:
 - conda-forge
 dependencies:
+- pip=21.0.1
 - conda-forge::pandas=1.0.1
 - conda-forge::tabulate=0.8.7
 - conda-forge::jupyter=1.0.0