Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

pycytominer integration #2

Open
gwaybio opened this issue Aug 14, 2020 · 9 comments
Open

pycytominer integration #2

gwaybio opened this issue Aug 14, 2020 · 9 comments

Comments

@gwaybio
Copy link
Member

gwaybio commented Aug 14, 2020

In cytomining/pycytominer#78 I am working towards integrating DeepProfiler processing into pycytominer. Currently and by default, DeepProfiler outputs .npz files storing numpy arrays of single cell profiles. In cytomining/DeepProfiler#229 we discuss a potential update to the .npz file output to also include metadata information.

There are a couple of decision points that we need to make to move the integration forward, which will be partially driven by the goals in the DeepProfilerExperiments repo. In cytomining/DeepProfiler#229 (comment) I bring up two different points of consideration: 1) How to use index.csv and 2) Feature prefix style.

I think both of these decision points are relatively minor, and any pycytominer code will be flexible to handle multiple metadata options and enable a customizable feature prefix. The question about feature prefix is most directly related to what we think the default prefix should be (DP or DP_ are two options)

Additional topics

I think that these topics are more pressing than the first two listed above: Will the profiles be updated for each dataset to include the metadata .npz format? Or, will we proceed without recalculating? If we proceed without recalculating (which I think is the likely scenario), we need to settle on pycytominer strategy.

Strategy

I do not think that pycytominer should include code to parse plate, well, and site information from filenames. This is a very fragile way of storing these variables - I believe that they should come from an internal source or be stored in an external file that includes file path information pointing to files with corresponding metadata. The latter is also fragile (file names are mutable!), but not as fragile as the metadata-in-file name paradigm.

However, since we probably won't recompute profiles, we require a strategy to incorporate metadata from file names. Therefore, I propose that we take multiple pycytominer steps to integrate these metadata (instead of dealing with all of the processing internally in pycytominer).

The proposed workflow is as follows:

  1. Ingest current .npz files in pycytominer
  2. Extract out plate, well, and site from file name
  3. Append these metadata to a pycytominer load_npz() output
  4. Reingest this file with metadata back into pycytominer and proceed with standard downstream processing

I will proceed with this strategy for now, but please do suggest alternatives! We can always pivot strategies later on if this ends up being clunky or doesn't reduce code.

@gwaybio
Copy link
Member Author

gwaybio commented Aug 14, 2020

Adding a question closely related to cytomining/DeepProfiler#229 (comment)

The reason to allow for both is for backwards compatibility with legacy DeepProfiler datasets that only have index.csv files.

For these files specifically (for example, the ones in DeepProfilerExperiments), is the only way to extract Plate, Well, and Site metadata is to parse the file name? Or is there a better way?

@jccaicedo
Copy link
Collaborator

We can easily recompute DeepProfiler features. We're happy to do so to test the new format. Computing features is not as expensive as training a model. Given that we are constantly training and evaluating features, the feature computation part is kind of routine and can be repeated any time.

So I would suggest to ignore backwards compatibility issues or rescuing old feature files already computed. It's easier to delete these files and generate new ones with the best format that we agree to have 🙂

I missed on the feature file list before. I think this needs additional implementation in DeepProfiler. I will add this comment to our other discussion.

@gwaybio
Copy link
Member Author

gwaybio commented Aug 20, 2020

So I would suggest to ignore backwards compatibility issues or rescuing old feature files already computed. It's easier to delete these files and generate new ones with the best format that we agree to have

Great! This will make the code in each DeepProfiler experiment notebook (for each dataset) much cleaner and more streamlined.

@jccaicedo
Copy link
Collaborator

We are recomputing features for Cell Painting datasets. I will make a note in this thread when features are available to start integrating pycytominer in the downstream analysis.

@gwaybio
Copy link
Member Author

gwaybio commented Sep 25, 2020

@jccaicedo I set aside time today to push the DeepProfiler-pycytominer integration further along. Two questions:

  1. Are the example .npz file features here adding metadata information to saving .npz files cytomining/DeepProfiler#229 (comment) the finalized DeepProfiler output files?
  2. If so, can I use one of them in the pycytominer tests?

Also, one quick note: I went through all of our existing discussion once more and it was fun

@gwaybio
Copy link
Member Author

gwaybio commented Feb 19, 2021

@jccaicedo - The DeepProfiler and CellProfiler comparison analysis keeps popping into my head. I am wondering if there are any updates?

I am writing in this thread b/c of the questions I had a couple months ago. I'd like to finalize the DeepProfiler integration in cytomining/pycytominer#78, and I think this is where I can contribute most to your project.

@gwaybio
Copy link
Member Author

gwaybio commented May 14, 2021

Hey @jccaicedo, @michaelbornholdt and I just chatted about the remaining steps to add DeepProfiler integration into pycytominer. I think we're very close! Here is a summary of our current plan - please feel free to modify.

  1. You provide us permission to include the file Week1_22123_B02_s1.npz in pycytominer as an example data file for testing.
  2. I will add it to that pull request
  3. Michael will troubleshoot, add/modify tests, and complete the integration

Once these things happen, then Michael will be more readily able to benchmark his DeepProfiler comparison experiments!

@jccaicedo
Copy link
Collaborator

@gwaygenomics You can add that file to your tests. The plan looks great!

@gwaybio
Copy link
Member Author

gwaybio commented May 19, 2021

Awesome! I added the file in cytomining/pycytominer@d237b41

@michaelbornholdt - I just now realized that pycytominer.cyto_utils.DeepProfiler_processing.AggregateDeepProfiler requires testing in cytomining/pycytominer#78, which will likely require you to add the remaining .npz files and index.csv that I sent you on slack in the appropriate folder

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants