
Update for how we really run things #59

Merged
merged 33 commits into master from jump
Jun 23, 2021

Conversation

bethac07
Member

Adds an overview section that even non-DCP users can follow, as well as (brief) instructions along the way for non-Phenix users.

@bethac07
Member Author

Currently, this draft needs the addition of profile-creation step recommendations; I still need to figure out how to handle the SQLite creation step specifically. Someone (presumably Niranj) can then add the post-aggregation step (with the recipe set to aggregate SQLites into per-well CSVs or not, depending on whether or not we keep cytominer-scripts, per the options below).

  • Running it with cytominer-scripts is one option, but as of now it won't work in the updated AMI (and the new pe2loaddata doesn't work on the old AMI; we tried, but it's REALLY hard to get Python 3.8 on there), so we could
    • send the AMI to each partner (which we CAN do, it's trivial, but we should consider whether that's how we want to go, since it's kind of annoying to need a machine that does literally just that one step), or
    • try to figure out why cytominer_scripts is being annoying about making backends on an updated Ubuntu 18 AMI. I have sunk a couple of hours into this without succeeding, but it's possible someone with R experience could do it faster. In that case, we can update the AMI to add R back in and then keep the instructions here more or less the same.
  • We can write a new non-R script that does all the valuable things that collate.R does, but just not in R, i.e. handling automatic downloading of the files, creation of the database, indexing of the database, upload of the files, and deletion of the temp files, so that if you need to do multiple rounds, you can. This is fine and possibly less annoying than cytominer_scripts, but we do need to host it somewhere (unless we make it, say, part of the recipe).
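For concreteness, here is a minimal sketch of what that non-R wrapper could look like. Everything here is an assumption for illustration: the bucket prefix, scratch paths, and the `ingest_config.ini` name are hypothetical, and it assumes the AWS CLI and cytominer-database's `ingest` command are available on PATH.

```python
# Hedged sketch of the proposed non-R collate wrapper. Bucket, paths, and
# the ingest config name are hypothetical; assumes the AWS CLI and the
# cytominer-database `ingest` command are installed.
import sqlite3
import subprocess
from pathlib import Path


def build_sync_cmd(s3_prefix, local_dir):
    # Step 1: mirror the per-plate ExportToSpreadsheet CSVs from S3.
    return ["aws", "s3", "sync", s3_prefix, str(local_dir)]


def index_backend(sqlite_path):
    # Step 3: index the backend so downstream per-well queries are fast.
    with sqlite3.connect(sqlite_path) as conn:
        conn.execute(
            "CREATE INDEX IF NOT EXISTS image_idx "
            "ON Image (TableNumber, ImageNumber)"
        )


def collate_plate(s3_prefix, scratch, plate, config="ingest_config.ini"):
    local_dir = Path(scratch) / plate
    subprocess.run(build_sync_cmd(s3_prefix, local_dir), check=True)
    backend = local_dir / f"{plate}.sqlite"
    # Step 2: build the SQLite backend from the downloaded CSVs.
    subprocess.run(
        ["cytominer-database", "ingest", str(local_dir),
         f"sqlite:///{backend}", "-c", config],
        check=True,
    )
    index_backend(str(backend))
    # Steps 4-5 (upload the backend, delete the temp CSVs) would follow,
    # so that repeated rounds don't fill the disk.
    return backend
```

The upload and cleanup steps are left as comments because where the backend should land is exactly the open question above.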

@shntnu
Member

shntnu commented Jun 19, 2021

@bethac07 Hooray! And goodbye cellpainting_scripts!

I'm very much in favor of this option:

  • We can write a new non-R script that does all the valuable things that collate.R does, but just not in R, i.e. handling automatic downloading of the files, creation of the database, indexing of the database, upload of the files, and deletion of the temp files, so that if you need to do multiple rounds, you can. This is fine and possibly less annoying than cytominer_scripts, but we do need to host it somewhere (unless we make it, say, part of the recipe).

At first, I thought this could live in pycytominer, perhaps in cyto_utils. A new collate.py would download the CellProfiler ExportToSpreadsheet CSV files locally and then call cytominer-database on them, respecting the folder structure.

But I think this is too bespoke to live inside pycytominer. I think it should live inside profiling-recipe instead. @gwaygenomics and @niranjchandrasekaran can decide; let us know what you think, folks (and sorry for the weekend ping! Please do ignore until next week).

@bethac07
Member Author

I'm happy to write it (or at least take an initial pass at it), if we think it should be a python script.

One thing to keep in mind, though, is that it is 100% mandatory that whatever this solution is can be run in parallel, since it takes 12-18 hours per plate. If the recipe doesn't currently handle running plates in parallel versus in sequence (this is my understanding, but I'm not sure it's true), the script needs to be executed separately from the rest of the recipe, at least for now.
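As a sketch of what that fan-out could look like: since the per-plate work is mostly downloads and subprocess calls (I/O-bound rather than CPU-bound in the driver script), a thread pool is enough to run plates concurrently. `collate_plate` below is only a stand-in for the real per-plate step.

```python
# Sketch of running the collate step across plates in parallel.
# collate_plate is a placeholder for the real 12-18 h per-plate work,
# which in practice shells out to download/ingest/upload commands.
from concurrent.futures import ThreadPoolExecutor, as_completed


def collate_plate(plate):
    # Stand-in for download -> ingest -> index -> upload of one plate.
    return f"{plate}.sqlite"


def collate_all(plates, workers=4):
    backends = {}
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = {pool.submit(collate_plate, p): p for p in plates}
        for fut in as_completed(futures):
            backends[futures[fut]] = fut.result()
    return backends
```

If the per-plate work were CPU-bound Python instead of subprocess calls, a `ProcessPoolExecutor` would be the drop-in replacement.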

@shntnu
Member

shntnu commented Jun 19, 2021 via email

@gwaybio
Member

gwaybio commented Jun 21, 2021

We can write a new non-R script that does all the valuable things that collate.R does, but just not in R, i.e. handling automatic downloading of the files, creation of the database, indexing of the database, upload of the files, and deletion of the temp files, so that if you need to do multiple rounds, you can. This is fine and possibly less annoying than cytominer_scripts, but we do need to host it somewhere (unless we make it, say, part of the recipe).

But I think this is too bespoke to live inside pycytominer. I think it should live inside profiling-recipe instead.

Agreed, it's too bespoke for pycytominer, but I think pycytominer is still a better option than the recipe. IMO the recipe shouldn't contain any processing code, just an instruction set (a recipe) on how to process the ingredients (the data).

Perhaps the way forward is to add it to pycytominer.cyto_utils for now, and then spin it off into a new repo once it's more mature. Another option is to write it as a new tool from the start. I don't know all the details (next to none, actually), but it would be great in general if CellProfiler would make handling a lot of these downstream data hygiene tasks easier.

@bethac07
Member Author

it would be great in general if CellProfiler would make handling a lot of these downstream data hygiene tasks easier

I mean, CellProfiler is absolutely capable of writing to a single SQLite file in the first place, but then we would not be able to parallelize across the number of CPUs that we currently do; it's a choice in how we've decided to run the data. We could also choose to write to a central MySQL database, but a decision was made at some point not to, presumably due to hosting costs/hassle.

@bethac07
Member Author

I've started a branch to do this in: https://github.com/cytomining/pycytominer/tree/jump

@shntnu
Member

shntnu commented Jun 21, 2021

We could also choose to write to a central MySQL database, but a decision was made at some point not to- presumably due to hosting costs/hassle.

Correct

@shntnu
Member

shntnu commented Jun 21, 2021

I don't know all the details (next to none, actually), but it would be great in general if CellProfiler would make handling a lot of these downstream data hygiene tasks easier.

The details are pretty simple – collate.R is a wrapper for

  • downloading CSV files locally (because they are usually on S3)
  • calling cytominer-database
  • cleaning

We wouldn't need collate if

  • the files are not on S3, or
  • we had a way to directly call cytominer-database on S3 objects

So it comes down to the cytominer-database rewrite :)

For now, the plan Beth has sounds sensible to keep things moving. But eventually, the rewrite is what will fix this issue. When we do that, we might discover that there are some simple changes that can be made in ExportToSpreadsheet to make the new cytominer-database / cytominer-transport easier to write, e.g. storing the CSVs in a certain way so that it is easy to read them as a Parquet dataset.
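As one purely hypothetical illustration of "storing the CSVs in a certain way": if the per-well CSVs were written into a Hive-style partitioned tree, a Parquet-oriented dataset reader could treat plate and well as partition columns. The layout and field names below are made up for the sketch.

```python
# Hypothetical sketch: write per-well rows into a Hive-style partitioned
# layout (plate=<p>/well=<w>/data.csv), the kind of structure a Parquet
# dataset reader can interpret as partition columns. Field names invented.
import csv
from pathlib import Path


def write_partitioned(rows, root):
    by_key = {}
    for row in rows:
        by_key.setdefault((row["plate"], row["well"]), []).append(row)
    for (plate, well), group in by_key.items():
        out_dir = Path(root) / f"plate={plate}" / f"well={well}"
        out_dir.mkdir(parents=True, exist_ok=True)
        with open(out_dir / "data.csv", "w", newline="") as fh:
            writer = csv.DictWriter(fh, fieldnames=list(group[0]))
            writer.writeheader()
            writer.writerows(group)
```

Whether ExportToSpreadsheet could emit something like this directly is exactly the kind of question the rewrite would surface.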

@bethac07
Member Author

@shntnu, the initial pass for all my parts is complete. We'll want to make some edits if/when collate.py and its associated changes get pulled, and once it's in a more readable format I'll have my team propose edits, but our part is complete.

We do need some public documentation of the profiling steps again, but if we definitely don't want to use cytominer_scripts anymore, I would say we should pull sooner rather than later and then add them as soon as we can.

@shntnu
Member

shntnu commented Jun 23, 2021

We do need some public documentation of the profiling steps again, but if we definitely don't want to use cytominer_scripts anymore, I would say we should pull sooner rather than later and then add them as soon as we can.

Agreed, let's merge! Please do so, just in case you still have some pending commits to push.

Tagging @niranjchandrasekaran so he is aware that we should plan to move the gdoc to this handbook at some point in the near future (but not urgent for JUMP because the gdoc exists).

@shntnu shntnu marked this pull request as ready for review June 23, 2021 20:04
@bethac07 bethac07 merged commit 8ffd6c9 into master Jun 23, 2021
@bethac07 bethac07 deleted the jump branch June 23, 2021 20:57
@bethac07 bethac07 mentioned this pull request Jul 29, 2021