Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

New cyto tool: create cell locations file #257

Merged
merged 83 commits into from
Apr 7, 2023

Conversation

shntnu
Copy link
Member

@shntnu shntnu commented Feb 24, 2023

modified from EmbeddedArtistry

Description

This introduces a new tool to append X, Y locations of all cells in an image to each row of a LoadData CSV file. This will make it a lot easier for deep learning feature extraction workflows to do their thing (e.g. DeepProfiler will no longer need to download single-cell SQLite files to extract locations, etc.)

What is the nature of your change?

  • Bug fix (fixes an issue).
  • Enhancement (adds functionality).
  • Breaking change (fix or feature that would cause existing functionality to not work as expected).
  • This change requires a documentation update.

Checklist

Please ensure that all boxes are checked before indicating that a pull request is ready for review.

  • I have read the CONTRIBUTING.md guidelines.
  • My code follows the style guidelines of this project.
  • I have performed a self-review of my own code.
  • I have commented my code, particularly in hard-to-understand areas.
  • I have made corresponding changes to the documentation.
  • My changes generate no new warnings.
  • New and existing unit tests pass locally with my changes.
  • I have added tests that prove my fix is effective or that my feature works.
  • I have deleted all non-relevant text in this pull request template.

@shntnu
Copy link
Member Author

shntnu commented Feb 24, 2023

python \
  pycytominer/cyto_utils/cell_locations_cmd.py \
  --input_parquet_file pycytominer/tests/test_data/cell_locations_example_data/load_data_with_illum_subset.parquet \
  --sqlite_file pycytominer/tests/test_data/cell_locations_example_data/BR00126114_subset.sqlite \
  --output_parquet_file \
  ~/Desktop/load_data_with_illum_and_cell_location_subset.parquet 

@codecov-commenter
Copy link

codecov-commenter commented Feb 24, 2023

Codecov Report

Merging #257 (e77c6cf) into master (7e57132) will decrease coverage by 0.25%.
The diff coverage is 89.36%.

@@            Coverage Diff             @@
##           master     #257      +/-   ##
==========================================
- Coverage   95.71%   95.47%   -0.25%     
==========================================
  Files          53       57       +4     
  Lines        2826     3048     +222     
==========================================
+ Hits         2705     2910     +205     
- Misses        121      138      +17     
Flag Coverage Δ
unittests 95.47% <89.36%> (-0.25%) ⬇️

Flags with carried forward coverage won't be shown. Click here to find out more.

Impacted Files Coverage Δ
pycytominer/cyto_utils/cell_locations_cmd.py 0.00% <0.00%> (ø)
pycytominer/cyto_utils/collate.py 54.73% <0.00%> (ø)
setup.py 0.00% <ø> (ø)
pycytominer/cyto_utils/cell_locations.py 84.16% <84.16%> (ø)
pycytominer/cyto_utils/load.py 87.34% <93.75%> (+0.77%) ⬆️
pycytominer/tests/test_cyto_utils/conftest.py 100.00% <100.00%> (ø)
...miner/tests/test_cyto_utils/test_cell_locations.py 100.00% <100.00%> (ø)
pycytominer/tests/test_cyto_utils/test_load.py 100.00% <100.00%> (ø)

... and 3 files with indirect coverage changes

📣 We’re building smart automated test selection to slash your CI/CD build times. Learn more

@shntnu
Copy link
Member Author

shntnu commented Feb 25, 2023

This can be run on command line using fire:

metadata_input="s3://cellpainting-gallery/test-cpg0016-jump/source_4/workspace/load_data_csv/2021_08_23_Batch12/BR00126114/load_data_with_illum_subset.parquet"
single_single_cell_input="s3://cellpainting-gallery/test-cpg0016-jump/source_4/workspace/backend/2021_08_23_Batch12/BR00126114/BR00126114_subset.sqlite"
augmented_metadata_output="s3://cellpainting-gallery/test-cpg0016-jump/source_4/workspace/load_data_csv/2021_08_23_Batch12/BR00126114/load_data_with_illum_and_cell_location_subset.parquet"

python \
    pycytominer/cyto_utils/cell_locations_cmd.py \
    --metadata_input ${metadata_input} \
    --single_cell_input ${single_single_cell_input}   \
    --augmented_metadata_output ${augmented_metadata_output} \
    add_cell_location

load_data_with_illum_and_cell_location_subset.parquet.zip

@shntnu
Copy link
Member Author

shntnu commented Feb 27, 2023

Currently, one of the test fixtures (cell_loc3) requires two S3 objects to be available. I didn't have the time to figure out how to use moto to avoid this. I've added a snippet below that would go into conftest.py, but more work is needed to replace the test entirely.

For now, cell_loc3 is being run using objects on an S3 bucket, but this is not reliable. Further, I am not checking for output to be on S3.

import boto3
from typing import List
from moto import mock_s3
from moto.server import ThreadedMotoServer

# moto patterns are from
# https://github.com/cytomining/CytoTable/blob/415f2ecd94b66c979bfda595448264089f9a1c80/tests/conftest.py
# credits: @gwaygenomics and @d33b


@pytest.fixture(scope="session", name="s3_session")
def fixture_s3_session() -> boto3.session.Session:
    """
    Yield a mocked boto session for s3 tests.
    """

    # start a moto server for use in testing
    server = ThreadedMotoServer()
    server.start()

    with mock_s3():
        yield boto3.session.Session()


@pytest.fixture(name="example_local_sources")
def fixture_example_local_sources() -> List[pathlib.Path]:
    """
    Provide a list of example sources
    """
    return [
        pathlib.Path(__file__).parent.parent
        / "test_data"
        / "cell_locations_example_data"
        / "BR00126114_subset.sqlite",
        pathlib.Path(__file__).parent.parent
        / "test_data"
        / "cell_locations_example_data"
        / "load_data_with_illum_subset.parquet",
    ]


@pytest.fixture()
def example_s3_endpoint(
    s3_session: boto3.session.Session,
    example_local_sources: List[pathlib.Path],
) -> str:
    """
    Create a mocked bucket which includes example sources
    """
    # s3 is a fixture defined above that yields a boto3 s3 client.
    # Feel free to instantiate another boto3 S3 client -- Keep note of the region though.
    endpoint_url = "http://localhost:5000"
    bucket_name = "example"

    # create s3 client
    s3_client = s3_session.client("s3", endpoint_url=endpoint_url)

    # create a bucket for content to land in
    s3_client.create_bucket(Bucket=bucket_name)

    # upload each example file to the mock bucket
    for source_path in example_local_sources:
        s3_client.upload_file(
            Filename=str(source_path),
            Bucket=bucket_name,
            Key=source_path.name,
        )

    # return endpoint url for use in testing
    return endpoint_url


@pytest.fixture(name="single_cell_input_file_s3")
def fixture_single_cell_input_file_s3(example_s3_endpoint) -> str:
    """
    Provide a single cell input file for cell_locations test data
    """
    return f"{example_s3_endpoint}/example/BR00126114_subset.sqlite"

@shntnu
Copy link
Member Author

shntnu commented Feb 27, 2023

Meanwhile, @Arkkienkeli, can you inspect the parquet file attached in #257 (comment) to see if this kind of output works for you (for DeepProfiler) instead of creating a locations file? That is, if such a (parquet) file were available, would you no longer need to create a locations file?

See the logic in https://github.com/shntnu/pycytominer/blob/bc2f01dfa0252bba5f3aa04cd8488f8dc7fe5de9/pycytominer/cyto_utils/cell_locations.py#L14-L38 to figure out what's happening

@Arkkienkeli
Copy link
Member

Hi @shntnu,
Just to clarify, DeepProfiler cannot work with parquet files at the moment.

In the example you shared, Nuclei_Location_Center_X and Nuclei_Location_Center_Y are arrays, though DeepProfiler reads files with single location per line, for example:

Nuclei_Location_Center_X,Nuclei_Location_Center_Y
219,34
175,45
1065,51
227,58
784,67

So either we update DeepProfiler or I will need to convert those files into our format.

@shntnu
Copy link
Member Author

shntnu commented Apr 2, 2023

I'm requesting some changes and would be interested in your thoughts and input based on the comments. Thank you so much for your efforts!

Thank you so much for the detailed review -- I learned a lot!

I've resolved all your comments except this one, related to cell_locations.py:

I like the placement of this file within the repository and it makes me wonder if it'd be a good idea to extend the SingleCells class at some point via CellLocation (for example, CellLocation(SingleCells) or something of that nature). Doing this would mean one would have a unified way to access features and capabilities together. Changes to this code might be sizable, so perhaps this is best encapsulated as a new issue? Interested in your thoughts.

I skimmed the SingleCells class, and my initial impression was that the overlaps were fairly minimal.

But it's possible that refactoring SingleCells would expose more reusable functionality that CellLocations could benefit from.

For example, some of these CellLocations methods could be moved into SingleCells or into a new utils module:

Move to a utils module:

  • _expanduser
  • _parse_s3_path
  • _s3_file_exists
  • _download_s3

Move to SingleCells:

  • _get_single_cell_engine
  • _check_single_cell_correctness (needs to be generalized further)
  • _get_joined_image_nuclei_tables and _load_single_cell (maybe combine with merge_single_cells)

@shntnu shntnu requested a review from d33bs April 2, 2023 13:17
Copy link
Member

@d33bs d33bs left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice work! Thank you for making changes and acknowledging my comments @shntnu ! I've provided some additional comments and would like to request further changes. I especially feel the example in the readme and anonymous S3 access items should be addressed (or documented).

Responding to your comment:

"But it's possible that refactoring SingleCells would expose more reusable functionality that CellLocations could benefit from."

Thank you for addressing this! Based on what you mentioned I feel this may be reaching the limits or out of scope for this PR. Perhaps this is a new issue for the project to explore later on?

.github/workflows/codecov.yml Show resolved Hide resolved
.github/workflows/codecov.yml Show resolved Hide resolved
.github/workflows/python-app.yml Show resolved Hide resolved
pycytominer/cyto_utils/cell_locations.py Outdated Show resolved Hide resolved
pycytominer/cyto_utils/cell_locations.py Outdated Show resolved Hide resolved
pycytominer/cyto_utils/cell_locations.py Outdated Show resolved Hide resolved
README.md Outdated Show resolved Hide resolved
README.md Show resolved Hide resolved
README.md Show resolved Hide resolved
README.md Outdated Show resolved Hide resolved
@shntnu
Copy link
Member Author

shntnu commented Apr 5, 2023

Thanks for the excellent feedback @d33bs

All set now!

This resulted in 4 new issues #266 #267 #268 #269

@shntnu shntnu requested a review from d33bs April 5, 2023 20:38
Copy link
Member

@d33bs d33bs left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks so much for the changes and for raising the additional issues @shntnu ! I only had one comment which could benefit from feedback with this review and overall felt this LGTM.

README.md Show resolved Hide resolved
pycytominer/cyto_utils/cells.py Show resolved Hide resolved
pycytominer/cyto_utils/cell_locations.py Show resolved Hide resolved
@shntnu
Copy link
Member Author

shntnu commented Apr 7, 2023

Thanks so much for the changes and for raising the additional issues @shntnu !

Your feedback was really motivating!

I only had one comment which could benefit from feedback with this review and overall felt this LGTM.

All set now

@shntnu
Copy link
Member Author

shntnu commented Apr 7, 2023

@d33bs If there's nothing further, please go ahead and merge and we'll launch a bunch of jobs that are waiting on this.

Thank you once again for all the feedback along the way!

Copy link
Member

@gwaybio gwaybio left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks for this contribution @shntnu and you're review @d33bs ! I am very happy to approve and merge

@gwaybio gwaybio merged commit a5ae6c8 into cytomining:master Apr 7, 2023
@shntnu shntnu deleted the ss_cell_locations branch April 16, 2023 19:33
@shntnu shntnu mentioned this pull request Mar 8, 2024
13 tasks
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants