Add collate.py #160

shntnu · 2021-07-29T10:24:38Z

Description

Add collate.py, which will run cytominer-database, database file indexing, and aggregation per our standard workflow.
This replaces collate.R in the previous cytominer_scripts repository.
Collation is added as a callable function from within pycytominer, but to facilitate running multiple plates in parallel with GNU-parallel, a command line interface is also provided. This is necessary since this step can take 6-18 hours for an individual plate and we often want to run 20+ plates at a time.
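For anyone who wants to drive the callable function from Python rather than GNU-parallel, here is a minimal sketch of fanning plates out to a worker pool. It swaps in `concurrent.futures` from the standard library; `collate_plate` is a hypothetical stand-in, not the real collate signature, and the thread pool assumes each call spends most of its time waiting on external processes.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def collate_plate(batch, config, plate):
    """Hypothetical stand-in for the real collate call; returns a status string."""
    return f"{batch}/{plate}: collated with {config}"


def collate_plates(batch, config, plates, max_workers=4):
    """Run collate_plate for every plate concurrently; map plate -> result."""
    results = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(collate_plate, batch, config, p): p for p in plates}
        for future in as_completed(futures):
            plate = futures[future]
            try:
                results[plate] = future.result()
            except Exception as exc:  # one failed plate should not stop the rest
                results[plate] = f"failed: {exc}"
    return results


if __name__ == "__main__":
    statuses = collate_plates("batch1", "ingest_config.ini", ["P1", "P2"])
    for plate, status in sorted(statuses.items()):
        print(plate, status)
```

Failures are recorded per plate rather than raised, so a single bad plate in a 20-plate run does not discard the others.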

See also cytomining/profiling-handbook#59 (comment)

Commits should be squashed before merging.

What is the nature of your change?

Enhancement (adds functionality).
This change requires a documentation update.

Checklist

Please ensure that all boxes are checked before indicating that a pull request is ready for review.

I have read the CONTRIBUTING.md guidelines.
My code follows the style guidelines of this project.
I have performed a self-review of my own code.
I have commented my code, particularly in hard-to-understand areas.
I have made corresponding changes to the documentation.
My changes generate no new warnings.
New and existing unit tests pass locally with my changes.
I have added tests that prove my fix is effective or that my feature works.
I have deleted all non-relevant text in this pull request template.

codecov-commenter · 2021-07-29T10:27:25Z

Codecov Report

Merging #160 (ead2547) into master (02d0522) will decrease coverage by 2.32%.
The diff coverage is 66.31%.

@@            Coverage Diff             @@
##           master     #160      +/-   ##
==========================================
- Coverage   98.04%   95.71%   -2.33%     
==========================================
  Files          50       53       +3     
  Lines        2403     2593     +190     
==========================================
+ Hits         2356     2482     +126     
- Misses         47      111      +64

Flag	Coverage Δ
unittests	`95.71% <66.31%> (-2.33%)`	⬇️

Flags with carried forward coverage won't be shown. Click here to find out more.

Impacted Files	Coverage Δ
pycytominer/cyto_utils/collate_cmd.py	`0.00% <0.00%> (ø)`
setup.py	`0.00% <ø> (ø)`
pycytominer/cyto_utils/collate.py	`54.73% <54.73%> (ø)`
pycytominer/tests/test_cyto_utils/test_util.py	`96.35% <90.00%> (-0.51%)`	⬇️
pycytominer/cyto_utils/__init__.py	`100.00% <100.00%> (ø)`
pycytominer/cyto_utils/util.py	`98.86% <100.00%> (+0.05%)`	⬆️
pycytominer/tests/test_cyto_utils/test_collate.py	`100.00% <100.00%> (ø)`


shntnu · 2021-07-29T10:36:16Z

Notes from Slack below

Gregory Way Today at 6:14 AM
this is frightening to me 🙀
[two image attachments: image.png, image.png]

18 replies

Gregory Way 20 minutes ago
well, maybe starting to frighten

Gregory Way 18 minutes ago
I think JUMP is good to drive development (very good!), the fear is driving too fast without seatbelts and not considering other cars on the road that might benefit from various improvements

Gregory Way 17 minutes ago
but maybe the rules of the road aren't conducive to these kinds of equitable and safe improvements?

Gregory Way 16 minutes ago
if so, then we need to figure out a plan to make sure pycytominer has a sustainable future 🙂

Gregory Way 15 minutes ago
it's also possible JUMP has a merge plan - if so, then my fears can easily be calmed!

Shantanu Singh 10 minutes ago
It's a single file (mostly) https://github.com/cytomining/pycytominer/pull/160/files

Shantanu Singh 9 minutes ago
https://broadinstitute.slack.com/archives/C3QFQ3WQM/p1624420346066300

Beth Cimini
I'm having what I'm sure is probably a dumb pycytominer error- I'm trying to add the aggregation steps to collate.py so that we have a cleaner drop in replacement for the old collate.R . Issue in thread.
Thread in ip-profiling | Jun 22nd | View message

Shantanu Singh 8 minutes ago
^^ Copied that chat because that's some (but not all) of the context for the contrib

Gregory Way 7 minutes ago
cool, ok, less worried 🙂

Gregory Way 7 minutes ago
i trust that means there is also a merge plan

Shantanu Singh 7 minutes ago
Essentially, the idea is to replace collate.R and create these new instructions here https://cytomining.github.io/profiling-handbook/create-profiles.html#create-database-backend

Shantanu Singh 7 minutes ago
i.e.
python3 pycytominer/cyto_utils/collate.py ${BATCH_ID} pycytominer/cyto_utils/ingest_config.ini {1} \

Shantanu Singh 3 minutes ago
Regarding the merge plan – it's quite possible we will move it out of there and into the future cytominer-database replacement because it doesn't quite fit into the pycytominer framework... but maybe it does fit in, given that pycytominer is kinda monolithic. Hm

Gregory Way 3 minutes ago
gotcha! This seems to be a pretty big API change too. Pycytominer hasn't supported command line interaction in the past (which is fine to introduce, it just complicates things)

Shantanu Singh 3 minutes ago
Exactly

Gregory Way 2 minutes ago
cool, glad to know there's a plan!

Shantanu Singh < 1 minute ago
Although the command line part can be addressed easily – we needn't have

pycytominer/pycytominer/cyto_utils/collate.py

Lines 141 to 157 in a2fa463

    
    if __name__ == '__main__':
        import argparse

        parser = argparse.ArgumentParser(description='Collate CSVs')
        parser.add_argument('batch', help='Batch name to process')
        parser.add_argument('config', help='config file to pass to cytominer-database')
        parser.add_argument('plate', help='Plate name to process')
        parser.add_argument('--base', '--base-directory', dest='base_directory', default='../..', help='Base directory where the CSV files will be located')
        parser.add_argument('--column', default=None, help='An existing column to be explicitly copied to a Metadata_Plate column if Metadata_Plate was not set')
        parser.add_argument('--munge', action='store_true', default=False, help='Whether munge should be passed to cytominer-database, if True will break a single object CSV down by objects')
        parser.add_argument('--pipeline', default='analysis', help='A string used in path creation')
        parser.add_argument('--remote', default=None, help='A remote AWS directory, if set CSV files will be synced down from at the beginning and to which SQLite files will be synced up at the end of the run')
        parser.add_argument('--temp', default='/tmp', help='The temporary directory to be used by cytominer-databases for output')
        parser.add_argument('--overwrite', action='store_true', default=False, help='Whether or not to overwrite an sqlite that exists in the temporary directory if it already exists')

        args = parser.parse_args()

        collate(args.batch, args.config, args.plate, base_directory=args.base_directory, column=args.column, munge=args.munge, pipeline=args.pipeline, remote=args.remote, temp=args.temp, overwrite=args.overwrite)

in the code, and instead move that out into a standalone using https://github.com/google/python-fire


Shantanu Singh < 1 minute ago
I'll copy these comments to the PR so we have notes there

bethac07 · 2021-07-29T12:20:46Z

Additional insight- cytomining/profiling-handbook#59 (comment)

shntnu · 2022-03-31T21:18:27Z

@bethac07 do you have any thoughts on whether this (collate) should exist as a separate tool, or is it worth wrapping up this PR? I know you are booked out but wondering what crumbs should be left on this PR for anyone who has the capacity to work on this

bethac07 · 2022-03-31T21:26:15Z

So all this needs is tests. I can imagine a couple of things

1. We decide we can live without tests, since cyto_utils is a bit more wild-west, and then we rebase and pull
2. I or somebody else writes some tests, and then we rebase and pull
3. It moves to its own repo, which seems a bit over the top but fine, whatever, I don't actually care
4. It moves somewhere else - cytominer-database repo? I don't love c-d specifically because it's got all that not-quite-working parquet stuff in it (this version uses the non-parquet version) but I can't think of anywhere else appropriate
5. We work on adding parallelization to the recipe, which is essentially the only problem that this solves (that backends take 8-12 hours apiece and the recipe doesn't currently support parallelization so for ie 20 plates we'd much rather have this in parallel than in series) and then we move anything useful (like auto-file-downloads) to the recipe repo and then close this without merging.

@niranjchandrasekaran can comment on the value of the last one. I can write some dumb tests quickly, but presumably we want non-dumb tests.

shntnu · 2022-03-31T21:37:30Z

After a quick skim, I think it’s fine for this to live here in this repo. Greg’s concern was that this command line functionality would be a break from the API but I think that alone can be yanked into a separate repo if that bothers us too much.

shntnu · 2022-04-01T18:14:33Z

I can help filter

We work on adding parallelization to the recipe, which is essentially the only problem that this solves (that backends take 8-12 hours apiece and the recipe doesn't currently support parallelization so for ie 20 plates we'd much rather have this in parallel than in series) and then we move anything useful (like auto-file-downloads) to the recipe repo and then close this without merging.

I think this is the major deciding factor and @niranjchandrasekaran is the best positioned to decide if this is practical. If it is, then this seems like the best solution to me.

If not, then we can exclude the following two options below right away because I think it is fine for collate to live in pycytominer; Greg’s concern was that this command line functionality would be a break from the API but I think that alone can be yanked into a separate repo if that bothers us too much. Or, we can delete the command-line functionality and use https://github.com/google/python-fire to create a command-line (and do this bit in the recipe) if @bethac07 thinks that is a fine way to go in general in such cases (we have a function for which we want to create a command-line tool)

It moves to its own repo

It moves somewhere else - cytominer-database repo?

The next q is, assuming it lives in pycytominer, does it really need to have a test? I think yes, so I'd go with 2. I say this because SQLite is often the point where things fail; testing that the index exists in the SQLite file would be a very valuable test (admittedly run_check_errors in the code already goes some distance to flag failure)
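A rough sketch of what such an index check could look like, using only the standard-library `sqlite3` module. The table name `Image`, the index name `table_image_idx`, and the indexed columns are illustrative assumptions, not necessarily what collate.py produces; a real test would open the SQLite file written by collate instead of an in-memory database.

```python
import sqlite3


def list_indexes(conn):
    """Return the names of all indexes recorded in the SQLite schema table."""
    rows = conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'index'"
    ).fetchall()
    return {name for (name,) in rows}


def test_index_exists():
    # In-memory database as a stand-in for the backend file (assumption).
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute("CREATE TABLE Image (TableNumber INTEGER, ImageNumber INTEGER)")
        assert list_indexes(conn) == set()
        conn.execute(
            "CREATE INDEX IF NOT EXISTS table_image_idx "
            "ON Image(TableNumber, ImageNumber)"
        )
        assert "table_image_idx" in list_indexes(conn)
    finally:
        conn.close()
```

Querying `sqlite_master` keeps the check independent of how the index was created, so it would work whether the indexing step runs through Python or through the sqlite3 command-line tool.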

HOWEVER, if that ends up being a blocker (no one has the capacity) we can bump the decision to the BDFL of this package to decide if we declare cyto_utils to be a wild-west :) The only concern I have is that it might set a bad precedent that will eventually drive down code coverage

We decide we can live without tests, since cyto_utils is a bit more wild-west, and then we rebase and pull

I or somebody else writes some tests, and then we rebase and pull

PS – If we do write a test, it is perfectly ok to test only for remote=None

niranjchandrasekaran · 2022-04-01T23:35:47Z

I think merging collate.py and the recipe is what we should do. The recipe will benefit from collate.py's parallelization and I think it fits better in the recipe repo. Some more context in cytomining/profiling-recipe#30 (look for "Combining collate.py and the recipe"). We can continue the discussion there.

bethac07 · 2022-05-23T02:43:08Z

Finally got around to writing some tests; I also tracked down why adding image features outside of the recipe wasn't working.

I know the thought was that this will eventually move to the recipe, and I still think there's a reasonable argument that it should, but I also think that it's not unreasonable that someone using pycytominer outside of the recipe may want to concatenate data, plus it may be a while until we get around to adding parallelization to the recipe. Since the CLI aspect was the part with concerns, I've separated that out; once we have a nice way to run this in the recipe, that file can just be deleted if need be.
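To illustrate the separation, here is a minimal sketch of a thin command-line wrapper that only parses arguments and delegates to a library function. The `collate` below is a placeholder that echoes its arguments, and only a subset of the options is shown; the actual contents of collate_cmd.py may differ.

```python
import argparse


def collate(batch, config, plate, munge=False, overwrite=False):
    """Placeholder for the library function; it just echoes its arguments."""
    return {
        "batch": batch,
        "config": config,
        "plate": plate,
        "munge": munge,
        "overwrite": overwrite,
    }


def build_parser():
    """Build the argument parser; kept separate so it can be tested directly."""
    parser = argparse.ArgumentParser(description="Collate CSVs")
    parser.add_argument("batch", help="Batch name to process")
    parser.add_argument("config", help="Config file to pass to cytominer-database")
    parser.add_argument("plate", help="Plate name to process")
    parser.add_argument("--munge", action="store_true")
    parser.add_argument("--overwrite", action="store_true")
    return parser


def main(argv=None):
    """Parse argv (defaults to sys.argv[1:]) and delegate to collate."""
    args = build_parser().parse_args(argv)
    return collate(
        args.batch, args.config, args.plate,
        munge=args.munge, overwrite=args.overwrite,
    )
```

A real wrapper file would end with the usual `if __name__ == "__main__": main()` guard; it is left out here so the sketch can be imported and called directly, e.g. `main(["batch1", "ingest_config.ini", "plate1", "--munge"])`. Keeping the wrapper this thin means deleting it later does not touch the library function.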

gwaybio

Looks great - I am indeed less concerned about merging this code in with tests. Although I will note that the test coverage drops about 5% (~50% test coverage of collate) - do you feel comfortable moving forward with this coverage?

I've made several specific comments and suggestions in line, that we should also discuss before merging.

I'll also outline some broad strokes comments below:

The collate function makes several SQLite commands. I wonder if it would be easier to read/maintain the script if you abstract these calls to some sort of a SQLite command builder (like you have already with run_check_errors)
We should consider removing cytominer-database as a required dependency. I am just too shaky when it comes to this code base. Then again, it's been stable for several years and hasn't broken... but on the flip side, adding it to the requirements explicitly introduces a versioned bond that might prevent (or stall) improvements to pycytominer.
- In the same vein, do we need to specifically mention the sqlite dependency somewhere?
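One possible shape for the SQLite command builder mentioned in the first point, sketched with the standard-library `sqlite3` module. The function names, the default index naming scheme, and the `Cells` table in the usage note are all hypothetical; the existing code may shell out to the sqlite3 command-line tool instead, in which case only the statement-building half would carry over.

```python
import sqlite3


def build_index_statement(table, columns, index_name=None):
    """Build a CREATE INDEX statement for one table.

    Table and column names are interpolated directly because SQLite cannot
    bind identifiers as parameters, so only pass trusted names.
    """
    index_name = index_name or f"{table}_{'_'.join(columns)}_idx"
    cols = ", ".join(columns)
    return f"CREATE INDEX IF NOT EXISTS {index_name} ON {table}({cols});"


def run_statements(conn, statements):
    """Execute statements in order; report which one failed, then commit."""
    for statement in statements:
        try:
            conn.execute(statement)
        except sqlite3.Error as exc:
            raise RuntimeError(f"SQLite statement failed: {statement}") from exc
    conn.commit()
```

With this split, `build_index_statement("Cells", ["TableNumber", "ImageNumber"])` can be unit-tested as a plain string, and `run_statements` gives one place to attach error reporting.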

Vision

Once we're happy with the PR, I'm happy to merge this contribution into pycytominer. It's a solid short-term solution that works, and is practical given our current limitations (funding, long-term software maintenance support, etc.)

As @niranjchandrasekaran has previously mentioned, this code actually belongs in some sort of profiling recipe. This remains our long-term goal.

Another important note is that we may want to release a pycytominer version 0.1.5 prior to merging this contribution (this contribution can be pycytominer version 0.2), just to make sure pycytominer is completely up-to-date functionally, in an intermediate, but stable version.

I'm definitely interested in your thoughts on all aspects of this. Thanks for the PR!

pycytominer/cyto_utils/collate.py

pycytominer/cyto_utils/util.py

.gitignore

pycytominer/tests/test_cyto_utils/test_collate.py

requirements.txt

Co-authored-by: Gregory Way <gregory.way@gmail.com>

bethac07 · 2022-06-08T16:45:21Z

I'm not sure which version you were looking at where you saw a 5% code coverage drop, but the current one says it's about a 2.3% drop. I am personally fine with the amount of test coverage. In a perfect world I'd like to be able to test the AWS download functionality, but because we sync whole plates down at a time, we would need to host the 4 test data sites somewhere public as a "pseudo-plate", which I think is probably overkill.

I think all other comments here are addressed.

gwaybio

Thanks @bethac07 !

I made a couple more comments below. Let's resolve these before merge.

One last thing: What do you think about this?

Another important note is that we may want to release a pycytominer version 0.1.5 prior to merging this contribution (this contribution can be pycytominer version 0.2), just to make sure pycytominer is completely up-to-date functionally, in an intermediate, but stable version.

I'll also loop in @d33bs about this decision point. Dave, does this release schedule sound like a reasonable strategy? (Beth, note that @d33bs's #203 PR is to a develop branch, but his #197 is to master.)

If the release strategy looks kosher, then we'll likely want to:

merge Improve Memory Performance Within merge_single_cells #197
Release v0.1.5
merge this PR
Release v0.2
Work on develop for future releases

README.md

pycytominer/cyto_utils/collate.py

setup.py

gwaybio · 2022-06-08T17:24:19Z

@niranjchandrasekaran - would you like to take a peek at this prior to merge? It's not necessary since we only require one maintainer approval, but I thought since you work with the profiling recipe more often than I do these days, that you might have some unique insights.

bethac07 · 2022-06-08T17:46:04Z

merge #197
Release v0.1.5
merge this PR
Release v0.2
Work on develop for future releases

This PR really doesn't touch any of the rest of pycytominer, so I don't think you necessarily need to version-pull-version (and I'm not actually sure that just the addition of the functionality is a full-blown 0.X.0 version bump), but your repo, your rules :)

niranjchandrasekaran

Took a brief look at the PR and everything looks good to me!

gwaybio

I made some minor edits, which I will commit now

one-sentence-per-line markdown convention
fixing one or two spacing and punctuation typos

Let's follow #207 prior to merging (thanks Niranj for the quick scan!)

README.md

gwaybio · 2022-06-17T22:16:03Z

merging now! 🎉

Updated instructions for when collating is pulled

shntnu changed the title ~~Jump~~ Add collate.py Jul 29, 2021

reboot, with tests

ea1661d

bethac07 force-pushed the jump branch from 9447556 to ea1661d Compare May 23, 2022 02:30

bethac07 marked this pull request as ready for review May 23, 2022 02:32

bethac07 added 3 commits May 22, 2022 22:36

Update setup.py

6a60e8e

Update requirements.txt

38d580d

Update setup.py

653eba1

bethac07 mentioned this pull request May 23, 2022

Updated instructions for when collating is pulled cytomining/profiling-handbook#69

Merged

bethac07 and others added 4 commits May 23, 2022 07:20

Update __init__.py

eabfea2

Update util.py

4d99459

Add utils test, move import to fix cells tests

9d9c36f

black

69748a8

bethac07 requested review from niranjchandrasekaran and gwaybio May 23, 2022 13:01

bethac07 added 5 commits May 23, 2022 15:59

Update collate.py

60cd5f2

Update collate_cmd.py

add06f3

Update collate_cmd.py

fcdda11

Update collate.py

21548fc

Update test_collate.py

49eab8b

gwaybio requested changes May 25, 2022

View reviewed changes

update .gitignore

75dcac6

bethac07 and others added 5 commits June 8, 2022 09:28

Move cytominer-database to extras

9ddffe0

install extras during build/test workflows

b1dcf61

Move ingest config

b9bdd5c

Apply suggestions from code review

486fc16

Co-authored-by: Gregory Way <gregory.way@gmail.com>

address fstrings, pathlib, suggested name changes

ead2547

gwaybio self-requested a review June 8, 2022 17:03

gwaybio mentioned this pull request Jun 8, 2022

Parameterized compartment names in collate.py #204

Open

gwaybio reviewed Jun 8, 2022

View reviewed changes

README.md Outdated Show resolved Hide resolved

pycytominer/cyto_utils/collate.py Show resolved Hide resolved

setup.py Show resolved Hide resolved

Update README.md

ff2dda6

This was referenced Jun 8, 2022

Conda recipe to support optional install of cytominer-database conda-forge/pycytominer-feedstock#1

Open

Next pycytominer release #207

Closed

niranjchandrasekaran approved these changes Jun 8, 2022

View reviewed changes

gwaybio approved these changes Jun 9, 2022

View reviewed changes

README.md Outdated Show resolved Hide resolved

README.md Outdated Show resolved Hide resolved

README.md Outdated Show resolved Hide resolved

gwaybio added 3 commits June 9, 2022 08:53

Update README.md

8326488

Update README.md

713e1cd

Update README.md

1e32392

gwaybio merged commit f8ce3b4 into master Jun 17, 2022

gwaybio deleted the jump branch June 17, 2022 22:16

gwaybio mentioned this pull request Aug 29, 2022

Add to Function Converting SQLite to Pandas DataFrame and Merging of Additional Metadata #228

Merged

13 tasks

shntnu restored the jump branch October 18, 2022 20:07

gwaybio mentioned this pull request Mar 28, 2023

New cyto tool: create cell locations file #257

Merged

13 tasks

kenibrewer deleted the jump branch November 7, 2023 13:31

shntnu added a commit to cytomining/profiling-handbook that referenced this pull request Mar 21, 2024

Merge pull request #69 from cytomining/pycytominer/issues/160

d8a3b69

Updated instructions for when collating is pulled

gwaybio mentioned this pull request Sep 16, 2024

Increase test coverage for collate #442

Open
