pycytominer integration #2

gwaybio · 2020-08-14T14:19:14Z

In cytomining/pycytominer#78 I am working towards integrating DeepProfiler processing into pycytominer. Currently and by default, DeepProfiler outputs .npz files storing numpy arrays of single cell profiles. In cytomining/DeepProfiler#229 we discuss a potential update to the .npz file output to also include metadata information.

There are a couple of decision points that we need to make to move the integration forward, which will be partially driven by the goals in the DeepProfilerExperiments repo. In cytomining/DeepProfiler#229 (comment) I bring up two different points of consideration: 1) How to use index.csv and 2) Feature prefix style.

I think both of these decision points are relatively minor, and any pycytominer code will be flexible enough to handle multiple metadata options and to support a customizable feature prefix. The feature prefix question is mostly about what the default prefix should be (DP and DP_ are two options).
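To make the prefix question concrete, here is a minimal sketch of how a customizable prefix could be applied to feature names. The helper `make_feature_names` is hypothetical (it is not part of pycytominer or DeepProfiler), and zero-based numbering is an assumption. It normalizes the trailing underscore so that DP and DP_ yield the same column names.

```python
def make_feature_names(n_features, prefix="DP"):
    """Build feature column names such as DP_0, DP_1, ...

    Hypothetical helper for illustration only; not a pycytominer function.
    """
    # Drop any trailing underscore so "DP" and "DP_" behave identically.
    stem = prefix.rstrip("_")
    # Zero-based numbering is an assumption made for this sketch.
    return [f"{stem}_{i}" for i in range(n_features)]


print(make_feature_names(3))
print(make_feature_names(3, prefix="DP_"))
```

With this kind of normalization, the choice between DP and DP_ as the default only affects what users type, not the resulting column names.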

Additional topics

I think that these topics are more pressing than the first two listed above: Will the profiles be updated for each dataset to include the metadata .npz format? Or will we proceed without recalculating? If we proceed without recalculating (which I think is the likely scenario), we need to settle on a pycytominer strategy.

Strategy

I do not think that pycytominer should include code to parse plate, well, and site information from filenames. This is a very fragile way of storing these variables - I believe that they should come from an internal source or be stored in an external file that includes file path information pointing to files with corresponding metadata. The latter is also fragile (file names are mutable!), but not as fragile as the metadata-in-file name paradigm.
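As a rough illustration of the external-file option, the sketch below builds a lookup from profile file path to metadata using only the standard library. The column names (`Metadata_Plate`, `Metadata_Well`, `Metadata_Site`, `Profile_Path`), the example rows, and the helper `load_metadata_lookup` are all assumptions made for this example; they are not the confirmed layout of DeepProfiler's index.csv.

```python
import csv
import io

# Hypothetical index contents; column names and values are illustrative only.
INDEX_CSV = """Metadata_Plate,Metadata_Well,Metadata_Site,Profile_Path
22123,B02,1,features/22123/B02/1.npz
22123,B03,1,features/22123/B03/1.npz
"""


def load_metadata_lookup(handle, path_column="Profile_Path"):
    """Map each profile file path to its metadata fields."""
    lookup = {}
    for row in csv.DictReader(handle):
        # Remove the path from the row and use it as the key.
        path = row.pop(path_column)
        lookup[path] = row
    return lookup


lookup = load_metadata_lookup(io.StringIO(INDEX_CSV))
print(lookup["features/22123/B02/1.npz"])
```

The metadata here never depends on how a file is named, only on the path recorded in the index, which is why a renamed file breaks the lookup loudly (a missing key) rather than silently yielding wrong values.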

However, since we probably won't recompute profiles, we require a strategy to incorporate metadata from file names. Therefore, I propose that we take multiple pycytominer steps to integrate these metadata (instead of dealing with all of the processing internally in pycytominer).

The proposed workflow is as follows:

1. Ingest the current .npz files in pycytominer
2. Extract plate, well, and site from the file name
3. Append these metadata to the pycytominer load_npz() output
4. Reingest this file, now with metadata, into pycytominer and proceed with standard downstream processing
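The file name parsing step in the workflow above (extracting plate, well, and site) might look like the sketch below. It assumes file names follow a `<plate>_<well>_s<site>.npz` pattern, as in Week1_22123_B02_s1.npz, and treats everything before the well as the plate identifier; both are assumptions, and other datasets may need a different pattern. The function `parse_profile_filename` and the `Metadata_*` keys are hypothetical names, not existing pycytominer code.

```python
import re
from pathlib import Path

# Assumed pattern: <plate>_<well>_s<site>, e.g. Week1_22123_B02_s1
FILENAME_PATTERN = re.compile(
    r"^(?P<plate>.+)_(?P<well>[A-Za-z]+\d+)_s(?P<site>\d+)$"
)


def parse_profile_filename(path):
    """Extract plate, well, and site from a profile file name.

    Raises ValueError if the name does not match the assumed pattern,
    so that unexpected names fail loudly instead of producing bad metadata.
    """
    stem = Path(path).stem  # file name without directory or extension
    match = FILENAME_PATTERN.match(stem)
    if match is None:
        raise ValueError(f"Unrecognized profile file name: {path}")
    return {
        "Metadata_Plate": match["plate"],
        "Metadata_Well": match["well"],
        "Metadata_Site": int(match["site"]),
    }


print(parse_profile_filename("outputs/Week1_22123_B02_s1.npz"))
```

The resulting dictionary could then be attached to every single-cell row loaded from the corresponding .npz file before the data are reingested.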

I will proceed with this strategy for now, but please do suggest alternatives! We can always pivot strategies later on if this ends up being clunky or doesn't reduce code.


gwaybio · 2020-08-14T17:57:20Z

Adding a question closely related to cytomining/DeepProfiler#229 (comment)

The reason to allow for both is for backwards compatibility with legacy DeepProfiler datasets that only have index.csv files.

For these files specifically (for example, the ones in DeepProfilerExperiments), is parsing the file name the only way to extract Plate, Well, and Site metadata? Or is there a better way?

jccaicedo · 2020-08-20T17:10:01Z

We can easily recompute DeepProfiler features. We're happy to do so to test the new format. Computing features is not as expensive as training a model. Given that we are constantly training and evaluating features, the feature computation part is kind of routine and can be repeated any time.

So I would suggest to ignore backwards compatibility issues or rescuing old feature files already computed. It's easier to delete these files and generate new ones with the best format that we agree to have 🙂

I missed the feature file list before. I think this needs additional implementation in DeepProfiler. I will add this comment to our other discussion.

gwaybio · 2020-08-20T18:13:42Z

> So I would suggest to ignore backwards compatibility issues or rescuing old feature files already computed. It's easier to delete these files and generate new ones with the best format that we agree to have

Great! This will make the code in each DeepProfiler experiment notebook (for each dataset) much cleaner and more streamlined.

jccaicedo · 2020-09-07T14:22:19Z

We are recomputing features for Cell Painting datasets. I will make a note in this thread when features are available to start integrating pycytominer in the downstream analysis.

gwaybio · 2020-09-25T12:51:54Z

@jccaicedo I set aside time today to push the DeepProfiler-pycytominer integration further along. Two questions:

1. Are the example .npz feature files in cytomining/DeepProfiler#229 (comment) ("adding metadata information to saving .npz files") the finalized DeepProfiler output files?
2. If so, can I use one of them in the pycytominer tests?

Also, one quick note: I went through all of our existing discussion once more and it was fun

gwaybio · 2021-02-19T19:33:26Z

@jccaicedo - The DeepProfiler and CellProfiler comparison analysis keeps popping into my head. I am wondering if there are any updates?

I am writing in this thread b/c of the questions I had a couple months ago. I'd like to finalize the DeepProfiler integration in cytomining/pycytominer#78, and I think this is where I can contribute most to your project.

gwaybio · 2021-05-14T20:33:54Z

Hey @jccaicedo, @michaelbornholdt and I just chatted about the remaining steps to add DeepProfiler integration into pycytominer. I think we're very close! Here is a summary of our current plan - please feel free to modify.

1. You provide us permission to include the file Week1_22123_B02_s1.npz in pycytominer as an example data file for testing.
   - I've not yet added the file in cytomining/pycytominer#78 ("Adding functionality to aggregate and annotate DeepProfiler output"), but it is ready to go here.
   - If you cannot make that data public, can you suggest a more suitable alternative?
2. I will add it to that pull request.
3. Michael will troubleshoot, add/modify tests, and complete the integration.

Once these steps are done, Michael will be in a much better position to benchmark his DeepProfiler comparison experiments!

jccaicedo · 2021-05-19T16:47:44Z

@gwaygenomics You can add that file to your tests. The plan looks great!

gwaybio · 2021-05-19T17:16:31Z

Awesome! I added the file in cytomining/pycytominer@d237b41

@michaelbornholdt - I just now realized that pycytominer.cyto_utils.DeepProfiler_processing.AggregateDeepProfiler requires testing in cytomining/pycytominer#78. This will likely require you to add the remaining .npz files and the index.csv that I sent you on Slack to the appropriate folder.

jccaicedo mentioned this issue Aug 20, 2020

adding metadata information to saving .npz files cytomining/DeepProfiler#229

Closed

gwaybio mentioned this issue Sep 25, 2020

Adding functionality to aggregate and annotate DeepProfiler output cytomining/pycytominer#78

Merged

13 tasks
