
Update for how we really run things #59

Merged
merged 33 commits into master from jump
Jun 23, 2021

Conversation

bethac07
Member

Adds an overview section that even non-DCP users can follow, as well as (brief) instructions along the way for non-Phenix users.

@bethac07
Member Author

Currently, this draft needs the addition of profile-creation step recommendations; I still need to figure out how to handle the SQLite creation step specifically. Someone (presumably Niranj) can then add the post-aggregation step (with the recipe set to aggregate SQLites into per-well CSVs or not, depending on whether or not we keep cytominer-scripts, per the options below).

  • Running it with cytominer-scripts is one option, but as of now it won't work in the updated AMI (and the new pe2loaddata doesn't work on the old AMI; we tried, but it's REALLY hard to get Python 3.8 on there), so we could
    • send the AMI to each partner (which we CAN do, it's trivial, but we should consider whether that's how we want to go, since it's kind of annoying to need a machine that does literally just that one step), or
    • try to figure out why cytominer_scripts is being annoying about making backends on an updated Ubuntu 18 AMI. I have sunk a couple of hours into this without succeeding, but it's possible someone with R experience could do it faster. In that case, we can update the AMI to add R back in and then keep the instructions here more or less the same.
  • We can write a new non-R script that does all the valuable things that collate.R does, but just not in R, i.e. handling automatic downloading of the files, creation of the database, indexing of the database, upload of the files, and deletion of the temp files, so that if you need to do multiple rounds, you can. This is fine and possibly less annoying than cytominer_scripts, but we do need to host it somewhere (unless we make it, say, part of the recipe).
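For concreteness, here is a minimal sketch of what that non-R wrapper could look like. Everything here is an assumption for illustration: the bucket prefix, scratch paths, and the `ingest_config.ini` name are hypothetical, and it assumes the AWS CLI and cytominer-database's `ingest` command are available on PATH.

```python
# Hedged sketch of the proposed non-R collate wrapper. Bucket, paths, and
# the ingest config name are hypothetical; assumes the AWS CLI and the
# cytominer-database `ingest` command are installed.
import sqlite3
import subprocess
from pathlib import Path


def build_sync_cmd(s3_prefix, local_dir):
    # Step 1: mirror the per-plate ExportToSpreadsheet CSVs from S3.
    return ["aws", "s3", "sync", s3_prefix, str(local_dir)]


def index_backend(sqlite_path):
    # Step 3: index the backend so downstream per-well queries are fast.
    with sqlite3.connect(sqlite_path) as conn:
        conn.execute(
            "CREATE INDEX IF NOT EXISTS image_idx "
            "ON Image (TableNumber, ImageNumber)"
        )


def collate_plate(s3_prefix, scratch, plate, config="ingest_config.ini"):
    local_dir = Path(scratch) / plate
    subprocess.run(build_sync_cmd(s3_prefix, local_dir), check=True)
    backend = local_dir / f"{plate}.sqlite"
    # Step 2: build the SQLite backend from the downloaded CSVs.
    subprocess.run(
        ["cytominer-database", "ingest", str(local_dir),
         f"sqlite:///{backend}", "-c", config],
        check=True,
    )
    index_backend(str(backend))
    # Steps 4-5 (upload the backend, delete the temp CSVs) would follow,
    # so that repeated rounds don't fill the disk.
    return backend
```

The upload and cleanup steps are left as comments because where the backend should land is exactly the open question above.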

@shntnu
Member

shntnu commented Jun 19, 2021

@bethac07 Hooray! And goodbye cellpainting_scripts!

I'm very much in favor of this option:

  • We can write a new non-R script that does all the valuable things that collate.R does, but just not in R, i.e. handling automatic downloading of the files, creation of the database, indexing of the database, upload of the files, and deletion of the temp files, so that if you need to do multiple rounds, you can. This is fine and possibly less annoying than cytominer_scripts, but we do need to host it somewhere (unless we make it, say, part of the recipe).

At first, I thought this could live in pycytominer, perhaps in cyto_utils. A new collate.py would download the CellProfiler ExportToSpreadsheet CSV files locally and then call cytominer-database on them, respecting the folder structure.

But I think this is too bespoke to live inside pycytominer. I think it should live inside profiling-recipe instead. @gwaygenomics and @niranjchandrasekaran can decide; let us know what you think, folks (and sorry for the weekend ping! Please do ignore until next week).

@bethac07
Member Author

I'm happy to write it (or at least take an initial pass at it), if we think it should be a python script.

One thing to keep in mind, though, is that it is 100% mandatory that whatever this solution is can be run in parallel, since it takes 12-18 hours per plate. If the recipe doesn't currently handle running plates in parallel versus in sequence (this is my understanding, but I'm not sure it's true), the script needs to be executed separately from the rest of the recipe, at least for now.
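As a sketch of what that fan-out could look like: since the per-plate work is mostly downloads and subprocess calls (I/O-bound rather than CPU-bound in the driver script), a thread pool is enough to run plates concurrently. `collate_plate` below is only a stand-in for the real per-plate step.

```python
# Sketch of running the collate step across plates in parallel.
# collate_plate is a placeholder for the real 12-18 h per-plate work,
# which in practice shells out to download/ingest/upload commands.
from concurrent.futures import ThreadPoolExecutor, as_completed


def collate_plate(plate):
    # Stand-in for download -> ingest -> index -> upload of one plate.
    return f"{plate}.sqlite"


def collate_all(plates, workers=4):
    backends = {}
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = {pool.submit(collate_plate, p): p for p in plates}
        for fut in as_completed(futures):
            backends[futures[fut]] = fut.result()
    return backends
```

If the per-plate work were CPU-bound Python instead of subprocess calls, a `ProcessPoolExecutor` would be the drop-in replacement.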

@shntnu
Member

shntnu commented Jun 19, 2021 via email

@gwaybio
Member

gwaybio commented Jun 21, 2021

We can write a new non-R script that does all the valuable things that collate.R does, but just not in R, i.e. handling automatic downloading of the files, creation of the database, indexing of the database, upload of the files, and deletion of the temp files, so that if you need to do multiple rounds, you can. This is fine and possibly less annoying than cytominer_scripts, but we do need to host it somewhere (unless we make it, say, part of the recipe).

But I think this is too bespoke to live inside pycytominer. I think it should live inside profiling-recipe instead.

Agreed, it's too bespoke for pycytominer, but I think pycytominer is still a better option than the recipe. IMO the recipe shouldn't contain any processing code, just an instruction set (a recipe) on how to process the ingredients (the data).

Perhaps the way forward is to add it to pycytominer.cyto_utils for now, and then spin it off into a new repo once it's more mature. Another option is to write it as a new tool from the start. I don't know all the details (next to none, actually), but it would be great in general if CellProfiler would make handling a lot of these downstream data hygiene tasks easier.

@bethac07
Member Author

it would be great in general if CellProfiler would make handling a lot of these downstream data hygiene tasks easier

I mean, CellProfiler is absolutely capable of writing to a single SQLite file in the first place, but then we would not be able to parallelize across the number of CPUs that we currently do; it's a choice in how we've decided to run the data. We could also choose to write to a central MySQL database, but a decision was made at some point not to, presumably due to hosting costs/hassle.

@bethac07
Member Author

I've started a branch to do this in: https://github.com/cytomining/pycytominer/tree/jump

@shntnu
Member

shntnu commented Jun 21, 2021

We could also choose to write to a central MySQL database, but a decision was made at some point not to- presumably due to hosting costs/hassle.

Correct

@shntnu
Member

shntnu commented Jun 21, 2021

I don't know all the details (next to none, actually), but it would be great in general if CellProfiler would make handling a lot of these downstream data hygiene tasks easier.

The details are pretty simple – collate.R is a wrapper for

  • downloading CSV files locally (because they are usually on S3)
  • calling cytominer-database
  • cleaning

We wouldn't need collate if

  • the files are not on S3, or
  • we had a way to directly call cytominer-database on S3 objects

So it comes down to the cytominer-database rewrite :)

For now, the plan Beth has sounds sensible to keep things moving. But eventually, the rewrite is what will fix this issue. When we do that, we might discover that there are some simple changes that can be made in ExportToSpreadsheet to make the new cytominer-database / cytominer-transport easier to write, e.g. storing the CSVs in a certain way so that it is easy to read them as a Parquet dataset.
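As one purely hypothetical illustration of "storing the CSVs in a certain way": if the per-well CSVs were written into a Hive-style partitioned tree, a Parquet-oriented dataset reader could treat plate and well as partition columns. The layout and field names below are made up for the sketch.

```python
# Hypothetical sketch: write per-well rows into a Hive-style partitioned
# layout (plate=<p>/well=<w>/data.csv), the kind of structure a Parquet
# dataset reader can interpret as partition columns. Field names invented.
import csv
from pathlib import Path


def write_partitioned(rows, root):
    by_key = {}
    for row in rows:
        by_key.setdefault((row["plate"], row["well"]), []).append(row)
    for (plate, well), group in by_key.items():
        out_dir = Path(root) / f"plate={plate}" / f"well={well}"
        out_dir.mkdir(parents=True, exist_ok=True)
        with open(out_dir / "data.csv", "w", newline="") as fh:
            writer = csv.DictWriter(fh, fieldnames=list(group[0]))
            writer.writeheader()
            writer.writerows(group)
```

Whether ExportToSpreadsheet could emit something like this directly is exactly the kind of question the rewrite would surface.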

@bethac07
Member Author

@shntnu, the initial pass for all my parts is complete. We'll want to make some edits if/when collate.py and its associated changes get pulled, and once it's in a more readable format I'll have my team propose edits, but our part is complete.

We do need some public documentation of the profiling steps again, but if we definitely don't want to use cytominer_scripts anymore, I would say we should pull sooner rather than later and then add them as soon as we can.

@shntnu
Member

shntnu commented Jun 23, 2021

We do need some public documentation of the profiling steps again, but if we definitely don't want to use cytominer_scripts anymore, I would say we should pull sooner rather than later and then add them as soon as we can.

Agreed, let's merge! Please do so, just in case you still have some pending commits to push.

Tagging @niranjchandrasekaran so he is aware that we should plan to move the gdoc to this handbook at some point in the near future (but not urgent for JUMP because the gdoc exists).

@shntnu shntnu marked this pull request as ready for review June 23, 2021 20:04
@bethac07 bethac07 merged commit 8ffd6c9 into master Jun 23, 2021
@bethac07 bethac07 deleted the jump branch June 23, 2021 20:57
@bethac07 bethac07 mentioned this pull request Jul 29, 2021