Update for how we really run things #59

Merged
merged 33 commits into from
Jun 23, 2021
a11b609
Create 01-overview.Rmd
bethac07 Jun 18, 2021
1d9d37f
Update 01-overview.Rmd
bethac07 Jun 18, 2021
0a37b22
Update and rename 01-config.Rmd to 02-config.Rmd
bethac07 Jun 18, 2021
35f75b0
Update 06-create-profiles.Rmd
bethac07 Jun 18, 2021
2b83270
Update and rename 02-config-for-image-analysis.Rmd to 03-config-for-i…
bethac07 Jun 18, 2021
50146e0
Update and rename 03-setup-pipelines.Rmd to 04-setup-images.Rmd
bethac07 Jun 18, 2021
428cd78
Delete 04-setup-jobs.Rmd
bethac07 Jun 18, 2021
f3921fa
Update 02-config.Rmd
bethac07 Jun 18, 2021
3f7d8b6
Update 02-config.Rmd
bethac07 Jun 18, 2021
25c2da4
Update 04-setup-images.Rmd
bethac07 Jun 18, 2021
c5e20d8
Update and rename 05-run-jobs.Rmd to 05-run-cellprofiler.Rmd
bethac07 Jun 19, 2021
560c2c0
Update 04-setup-images.Rmd
bethac07 Jun 19, 2021
07e7691
Update 04-setup-images.Rmd
bethac07 Jun 19, 2021
dbf154f
Added renv to make bookdown environment easy to set up
shntnu Jun 19, 2021
cae0340
Minor formatting, clarifications
shntnu Jun 19, 2021
a344b3c
Formatting, added TODOs
shntnu Jun 19, 2021
ff79a54
Add TODO, drop deprecated text
shntnu Jun 19, 2021
e8d966c
Formatting edits
shntnu Jun 19, 2021
5c662b0
Formatting, URLs
shntnu Jun 19, 2021
845104b
Formatting
shntnu Jun 19, 2021
132f2bd
Formatting
shntnu Jun 19, 2021
2835588
Drop deprecated text
shntnu Jun 19, 2021
3c72d3e
Update 06-create-profiles.Rmd
bethac07 Jun 23, 2021
5927bc9
Update 02-config.Rmd
bethac07 Jun 23, 2021
5b0ae66
Update 02-config.Rmd
bethac07 Jun 23, 2021
6d80bce
Delete 03-config-for-image-analysis.Rmd
bethac07 Jun 23, 2021
ed2f4c9
Rename 04-setup-images.Rmd to 03-setup-images.Rmd
bethac07 Jun 23, 2021
7fe6219
Rename 05-run-cellprofiler.Rmd to 04-run-cellprofiler.Rmd
bethac07 Jun 23, 2021
096f971
Update and rename 06-create-profiles.Rmd to 05-create-profiles.Rmd
bethac07 Jun 23, 2021
1658967
Update and rename 07-appendix.Rmd to 06-appendix.Rmd
bethac07 Jun 23, 2021
3590f30
Update 05-create-profiles.Rmd
bethac07 Jun 23, 2021
b787263
Update 05-create-profiles.Rmd
bethac07 Jun 23, 2021
57e0ad6
Update 05-create-profiles.Rmd
bethac07 Jun 23, 2021
1 change: 1 addition & 0 deletions .Rprofile
Original file line number Diff line number Diff line change
@@ -0,0 +1 @@
source("renv/activate.R")
131 changes: 0 additions & 131 deletions 01-config.Rmd

This file was deleted.

87 changes: 87 additions & 0 deletions 01-overview.Rmd
Original file line number Diff line number Diff line change
@@ -0,0 +1,87 @@
# Introduction

This handbook describes the process of running a Cell Painting experiment.
While the code here describes doing so in the context of running [Distributed-CellProfiler](https://github.com/CellProfiler/Distributed-CellProfiler) on AWS against images generated by a PerkinElmer microscope,
then collating the data with [cytominer-database](https://github.com/cytomining/cytominer-database) and analyzing it with [pycytominer](https://github.com/cytomining/pycytominer), the basic procedure for running a Cell Painting experiment is the same regardless of the microscope or the processing platform.
Briefly, the steps are:

## Collect your software

For the specific use case here, this involves

- [Distributed-CellProfiler](https://github.com/CellProfiler/Distributed-CellProfiler)
- [pe2loaddata](https://github.com/broadinstitute/pe2loaddata)
- [cytominer-database](https://github.com/cytomining/cytominer-database)
- [pycytominer](https://github.com/cytomining/pycytominer)

along with their dependencies.
Almost certainly, you will need a GUI installation of CellProfiler, local to your images (either on a local machine or a VM), that matches the version you plan to run on your cluster.

> _TODO: Maybe clarify that users don't need to install these at this point because we provide instructions later?_

## Collect your pipelines

You will minimally require these pipelines

- `illum` (illumination correction)
- `analysis` (segmentation and feature extraction)

But you may also want pipelines for

- Z projection
- QC
- assay development

## Determine how to get your image lists to CellProfiler

CellProfiler needs to understand image sets: for each field of view that was captured, how many channels there were, what you would like to name each channel, and which file names correspond to each channel.
If you are using an Opera Phenix or Operetta, you can use the `pe2loaddata` program to automatically generate a CSV listing the image sets, which can be passed to CellProfiler via the LoadData module.
Otherwise, you have a couple of different options:

1. You can write a script of your own that parses the files from your microscope and creates a similar CSV.
Minimally, you need a `FileName` and `PathName` column for each channel (e.g. `FileName_OrigDNA`), and `Metadata` columns for each piece of metadata CellProfiler needs (e.g. `Metadata_Plate`, `Metadata_Well`, and `Metadata_Site`).
1. You can point the 4 input modules of CellProfiler at a local copy of the files, let them create your image sets, export CSVs using CellProfiler's "Export Image Set Listing" option, and feed those into the pipelines to be run on your cluster.
1. You can alter all of your pipelines to use the 4 input modules of CellProfiler rather than the LoadData module, add and configure the CreateBatchFiles module in each, and use the resulting batch files in your cluster environment.

These options are ordered from most to least scripting proficiency required, and correspondingly from least to most CellProfiler proficiency required.
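As a sketch of option 1, a minimal script along these lines can generate a LoadData-style CSV. The file-naming scheme (`r01c01f01-ch1.tiff`) and paths here are hypothetical examples of our own, so adapt the parsing to whatever your microscope actually produces.

```sh
# Hypothetical sketch of option 1: build a LoadData-style CSV for a
# single-channel ("OrigDNA") plate from files named like r01c01f01-ch1.tiff
# (row 01, column 01, field 01, channel 1). Adapt to your naming scheme.
IMAGE_DIR=/tmp/demo_images
OUT_CSV=/tmp/load_data.csv

# Demo setup only: create a couple of dummy files so the loop has input.
rm -rf "$IMAGE_DIR" && mkdir -p "$IMAGE_DIR"
touch "$IMAGE_DIR/r01c01f01-ch1.tiff" "$IMAGE_DIR/r01c02f01-ch1.tiff"

rows=ABCDEFGHIJKLMNOP
echo "FileName_OrigDNA,PathName_OrigDNA,Metadata_Plate,Metadata_Well,Metadata_Site" > "$OUT_CSV"
for f in "$IMAGE_DIR"/*-ch1.tiff; do
  name=$(basename "$f")
  # Slice row, column, and field out of the file name.
  row=${name:1:2}; col=${name:4:2}; site=${name:7:2}
  well="${rows:$((10#$row - 1)):1}${col}"   # e.g. r01c01 -> A01
  echo "$name,$IMAGE_DIR,Plate1,$well,$((10#$site))" >> "$OUT_CSV"
done
```

With a multi-channel plate you would add a `FileName_`/`PathName_` column pair per channel and emit one row per field of view rather than per file.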

## Execute your CellProfiler pipelines

### (Optional) Z projection

If your images were acquired at multiple Z planes, you will need to Z-project them.
All subsequent steps should be run on the projected images.

### (Optional) QC

You may want to run a quality control pipeline to determine your imaging plate quality.
You can choose to run this locally or on your cluster compute environment.
You will need to evaluate the results of this pipeline somehow, in CellProfiler-Analyst, KNIME, SpotFire, etc.
You may run illumination correction and assay development steps in the meantime, but should hold analysis steps until the results are evaluated.

### Illumination correction

You need to run a pipeline that is grouped by plate and creates an illumination correction function.
Since it is grouped by plate, you don't need very many CPUs to run this, but it will take 6-24 hours depending on settings and image size.
Assay development and analysis require this step to complete.

### (Optional) Assay Development

If desired, you can run a pipeline that executes on one image per well, carries out your segmentation (but not measurement) steps, and produces one or more images you can use to evaluate segmentation quality (either individually or by stitching them together first).
This is not required but allows you to ensure that your segmentation parameters look reasonable across the variety of phenotypes present in your data.
If you run this step, hold the final analysis step until its results have been evaluated.

## Analysis

This pipeline segments the cells, measures both the whole images and the individual cells, and creates output CSVs (or can write to a MySQL host if so configured).
This is typically run on each image site in parallel and thus can be sped up by using a large number of CPUs.

## Aggregate your data

Since the analysis is run in parallel, unless you are using a MySQL host you will have many sets of CSVs that need to be combined into a single file per plate.
This is currently done with the `cytominer-database` program.
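As an illustration, the ingest step typically looks something like the following. The batch, plate name, and paths here are hypothetical, and the exact flags should be checked against the cytominer-database documentation; the command is assembled and printed as a dry run, so remove the `echo` to execute it for real.

```sh
# Hedged sketch: collapse per-site CSVs for one plate into a single SQLite
# backend with cytominer-database. Plate name and paths are hypothetical;
# verify the flags against the cytominer-database docs before running.
BATCH_ID=2016_04_01_a549_48hr_batch1
PLATE_ID=SQ00015201   # hypothetical plate identifier
INPUT_DIR=~/bucket/projects/${PROJECT_NAME:-demo}/workspace/analysis/${BATCH_ID}/${PLATE_ID}/analysis

# Assembled and printed as a dry run; run "$CMD" directly to actually ingest.
CMD="cytominer-database ingest ${INPUT_DIR} sqlite:///${PLATE_ID}.sqlite -c ingest_config.ini"
echo "$CMD"
```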

## Create and manipulate per-well profiles

The final step is to create per-well profiles, annotate them with metadata, and do steps such as plate normalization and feature selection.
These are accomplished via a "profiling recipe" using `pycytominer`.
62 changes: 0 additions & 62 deletions 02-config-for-image-analysis.Rmd

This file was deleted.

111 changes: 111 additions & 0 deletions 02-config.Rmd
Original file line number Diff line number Diff line change
@@ -0,0 +1,111 @@
# (PART) Configuration {-}

# Configure Environment for Full Profiling Pipeline

This workflow assumes you have already set up an AWS account with an S3 bucket and EFS, and created a VM per the instructions below.

## Launch an AWS Virtual Machine for making CSVs and running Distributed-CellProfiler

Launch an EC2 node using AMI `cytomining/images/hvm-ssd/cytominer-bionic-trusty-18.04-amd64-server-*`, created using [cytominer-vm](https://github.com/cytomining/cytominer-vm).

You will need to create an AMI for your own infrastructure because the provisioning includes mounting S3 and EFS, which is account specific.
We recommend using an `m4.xlarge` instance with an 8 GB EBS volume.

Note: Proper configuration is essential to mount the S3 bucket.
The following configuration provides an example, named `imaging-platform` (modifications will be necessary).

* Launch an ec2 instance on AWS
* AMI: `cytomining/images/hvm-ssd/cytominer-ubuntu-trusty-18.04-amd64-server-1529668435`
* Instance Type: m4.xlarge
* Network: vpc-35149752
* Subnet: Default (imaging platform terraform)
* IAM role: `s3-imaging-platform-role`
* No Tags
* Select Existing Security Group: `SSH_HTTP`
* Review and Launch
* `ssh -i <USER>.pem ubuntu@<Public DNS IPv4>`

After starting the instance, ensure that the S3 bucket is mounted on `~/bucket`.
If not, run `sudo mount -a`.


Log in to the EC2 instance.


Enter your AWS credentials

```sh
aws configure
```

The infrastructure is configured with one S3 bucket.
Mount this S3 bucket (if it is not automatically mounted)

```sh
sudo mount -a
```

Check that the bucket was mounted.
This path should exist:

```sh
ls ~/bucket/projects
```

## Create a tmux session

You will want to retain environment variables once defined, and for processes to keep running when you are not connected, so you should create a tmux session to work in:

```sh
tmux new -s sessionname
```

You can detach from this session at any time by typing `Ctrl+b`, then `d`.
To reattach to an existing session, type `tmux a -t sessionname`.

You can list existing sessions with `tmux list-sessions` and kill any misbehaving session with `tmux kill-session -t sessionname`.

## Define Environment Variables

These variables will be used throughout the project to tag instances, logs, etc., so that you know which machines are working on what, which files to operate on, and where your logs are.

```sh
PROJECT_NAME=2015_10_05_DrugRepurposing_AravindSubramanian_GolubLab_Broad

BATCH_ID=2016_04_01_a549_48hr_batch1

BUCKET=imaging-platform

MAXPROCS=3 # m4.xlarge has 4 cores; this should be # of cores on your instance - 1
```
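Because a stray unset variable can silently produce wrong paths later on, a quick sanity check like the following can help. This is our own addition, not part of the original instructions; the values are repeated here so the snippet stands alone.

```sh
# Optional sanity check (our own addition): fail fast if any required
# variable is unset. Values repeated from above so the snippet is
# self-contained; in a real session they would already be defined.
PROJECT_NAME=2015_10_05_DrugRepurposing_AravindSubramanian_GolubLab_Broad
BATCH_ID=2016_04_01_a549_48hr_batch1
BUCKET=imaging-platform
MAXPROCS=3

for var in PROJECT_NAME BATCH_ID BUCKET MAXPROCS; do
  # Bash indirect expansion: abort with a message if $var is empty or unset.
  : "${!var:?ERROR: $var is not set}"
done
echo "all required variables are set"
```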

## Create Directories

```sh
mkdir -p ~/efs/${PROJECT_NAME}/workspace/

cd ~/efs/${PROJECT_NAME}/workspace/

mkdir -p log/${BATCH_ID}
```

## Download Software

```sh
cd ~/efs/${PROJECT_NAME}/workspace/
mkdir software
cd software
git clone https://github.com/broadinstitute/pe2loaddata.git
git clone https://github.com/CellProfiler/Distributed-CellProfiler.git

cd ..
```

If these repos have already been cloned, `git pull` to make sure they are up to date.
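One way to script that update is a convenience loop of our own (not part of the official instructions), assuming the directory layout created above:

```sh
# Convenience loop (our own sketch): pull each repo if it is already cloned,
# skipping any that are missing. Assumes the directory layout created above;
# SOFTWARE_DIR can be overridden for testing.
SOFTWARE_DIR="${SOFTWARE_DIR:-$HOME/efs/${PROJECT_NAME:-demo}/workspace/software}"
mkdir -p "$SOFTWARE_DIR"
cd "$SOFTWARE_DIR"
for repo in pe2loaddata Distributed-CellProfiler; do
  if [ -d "$repo/.git" ]; then
    git -C "$repo" pull
  else
    echo "$repo not cloned yet; skipping"
  fi
done
```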

This is the resulting structure of `software` on EFS (one level below `workspace`):
```
└── software
├── Distributed-CellProfiler
└── pe2loaddata
```