Fix of .merge_single_cells() to Load Single-Cell Data into Dataframes #219

Merged
merged 9 commits into from
Aug 16, 2022

Conversation

bunnech
Contributor

@bunnech bunnech commented Aug 16, 2022

Description

This pull request addresses issue #195, concerning a potential memory leak in the .merge_single_cells() function as mentioned in pull request #194 and issue #215.

The updated .merge_single_cells() function combines the different compartments, image data, and metadata and loads them into a Pandas dataframe. The old version did not finish converting a >10 GB .sqlite file into dataframes within 4 hours. The current fix finishes the task in < 15 minutes.
This was achieved by removing the dependency on pd.read_sql (as suggested by @johnarevalo) and by eliminating the temporary dataframes of the compartments to be merged, which consumed additional memory. There might be better solutions, but this is working well for me currently.
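
To illustrate the general idea described above (this is a minimal sketch, not the code from this pull request), a table can be read through a plain sqlite3 cursor into one preallocated NumPy array instead of going through pd.read_sql. The helper name load_table, the table name, and the column names are hypothetical, and the sketch assumes all columns are numeric and contain no NULLs.

```python
import sqlite3

import numpy as np
import pandas as pd


def load_table(conn, table, chunk_size=1000):
    """Load an all-numeric SQLite table into a DataFrame without pd.read_sql.

    Hypothetical helper for illustration only. Table names cannot be bound
    as SQL parameters, so `table` must come from a trusted source.
    """
    cur = conn.cursor()
    n_rows = cur.execute(f"SELECT COUNT(*) FROM {table}").fetchone()[0]
    cur.execute(f"SELECT * FROM {table}")
    columns = [description[0] for description in cur.description]

    # Preallocate the full array once, so no intermediate dataframes are built.
    values = np.empty((n_rows, len(columns)), dtype=np.float64)
    start = 0
    while True:
        rows = cur.fetchmany(chunk_size)
        if not rows:
            break
        values[start : start + len(rows)] = rows
        start += len(rows)
    return pd.DataFrame(values, columns=columns)


# Small in-memory example database (made-up data).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE cells (ObjectNumber INTEGER, Cells_AreaShape_Area REAL)")
conn.executemany(
    "INSERT INTO cells VALUES (?, ?)", [(i, i * 1.5) for i in range(5)]
)
cells_df = load_table(conn, "cells", chunk_size=2)
```

Reading in chunks keeps the number of Python row objects alive at any time small, while the single preallocated array avoids repeated copies.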

Let me know what you think!

What is the nature of your change?

  • Bug fix (fixes an issue).
  • Enhancement (adds functionality).
  • Breaking change (fix or feature that would cause existing functionality to not work as expected).
  • This change requires a documentation update.

Checklist

Please ensure that all boxes are checked before indicating that a pull request is ready for review.

  • I have read the CONTRIBUTING.md guidelines.
  • My code follows the style guidelines of this project.
  • I have performed a self-review of my own code.
  • I have commented my code, particularly in hard-to-understand areas.
  • I have made corresponding changes to the documentation.
  • My changes generate no new warnings.
  • New and existing unit tests pass locally with my changes.
  • I have added tests that prove my fix is effective or that my feature works.
  • I have deleted all non-relevant text in this pull request template.

@gwaybio gwaybio self-requested a review August 16, 2022 14:39
Member

@gwaybio gwaybio left a comment

Thanks for this contribution! I made several comments throughout that we should discuss prior to merging.

I have two additional discussion items as well. Can you comment on:

  1. I'm wondering why we need to separate metadata features and morphology features at all. Would it make sense to process all features the same way? Ah, but maybe if we do that they won't be in the correct order in the final sc_df?
  2. I am inclined to ask for tests for the three new functions you added (prior to name change, is_feature_col, count, and get_columns). I know that they are tested through the other, existing tests, but I think they will be more robust to future changes if they themselves are tested.

@bunnech
Contributor Author

bunnech commented Aug 16, 2022

  1. I'm wondering why we need to separate metadata features and morphology features at all. Would it make sense to process all features the same way? Ah, but maybe if we do that they won't be in the correct order in the final sc_df?

We need to load the morphology features as a np.array because this is much faster than adding them via a pd.DataFrame. I then merge both into a joint pd.DataFrame, which I return per compartment. Metadata features might contain strings or objects, so a np.array is not possible for them. That's why we separate them.
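
A minimal sketch of this split, with made-up rows and column names (the Metadata_ and Cells_ names and the n_meta split point are assumptions for illustration, not the pull request's code): metadata stays in an object-friendly dataframe, the numeric features go through a float np.array, and the two are joined column-wise.

```python
import numpy as np
import pandas as pd

# Hypothetical rows as they might come back from a database cursor.
rows = [
    ("plate_1", "A01", 1, 120.0, 0.5),
    ("plate_1", "A02", 2, 98.5, 0.7),
]
columns = [
    "Metadata_Plate",
    "Metadata_Well",
    "Metadata_ObjectNumber",
    "Cells_Area",
    "Cells_Eccentricity",
]
n_meta = 3  # assumption: the first three columns are metadata

# Metadata may hold strings or objects, so it is kept out of the float array.
meta_df = pd.DataFrame([row[:n_meta] for row in rows], columns=columns[:n_meta])

# Morphology features are purely numeric and fit a single float array.
feat = np.array([row[n_meta:] for row in rows], dtype=np.float64)
feat_df = pd.DataFrame(feat, columns=columns[n_meta:])

# Join both parts into one dataframe for the compartment.
compartment_df = pd.concat([meta_df, feat_df], axis=1)
```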

  2. I am inclined to ask for tests for the three new functions you added (prior to name change, is_feature_col, count, and get_columns). I know that they are tested through the other, existing tests, but I think they will be more robust to future changes if they themselves are tested.

I removed is_feature_col now. The other two functions are tested via pytests on load_compartment. Not sure which additional test to design.

@bunnech
Contributor Author

bunnech commented Aug 16, 2022

@gwaybio, I pushed the edits suggested by you. Let me know if anything else is required!

Thanks again!

Member

@gwaybio gwaybio left a comment

I'm wondering why we need to separate metadata features and morphology features at all. Would it make sense to process all features the same way? Ah, but maybe if we do that they won't be in the correct order in the final sc_df?

We need to load the morphology features as a np.array because this is much faster than adding them via a pd.DataFrame. I then merge both into a joint pd.DataFrame, which I return per compartment. Metadata features might contain strings or objects, so a np.array is not possible for them. That's why we separate them.

Got it! Makes perfect sense, thanks for your explanation.

I am inclined to ask for tests for the three new functions you added (prior to name change, is_feature_col, count, and get_columns). I know that they are tested through the other, existing tests, but I think they will be more robust to future changes if they themselves are tested.

I removed is_feature_col now. The other two functions are tested via pytests on load_compartment. Not sure which additional test to design.

I like the elegance of removing is_feature_col. The other two functions I think can be tested rather easily:

count_sql_table_rows()

# Sketch of a test for count_sql_table_rows() (pseudocode)
def test_sc_count_sql_table():  # or something
    # Initialize the class the same way we do in the other tests (arguments omitted)
    sc = SingleCells()

    # Iterate over initialized compartments
    for compartment in sc.compartments:
        result_row_count = sc.count_sql_table_rows(table=compartment)
        # expected_row_count: whatever the example data was initialized with
        assert result_row_count == expected_row_count

get_sql_table_col_names()

# Sketch of a test for get_sql_table_col_names() (pseudocode)
def test_get_sql_table_col_names():  # or something
    # Initialize the class the same way we do in the other tests (arguments omitted)
    sc = SingleCells()

    # Iterate over initialized compartments
    for compartment in sc.compartments:
        meta_cols, feat_cols = sc.get_sql_table_col_names(table=compartment)
        # expected_meta_cols / expected_feat_cols: the columns of the example data
        assert meta_cols == expected_meta_cols
        assert feat_cols == expected_feat_cols
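
For readers who want something runnable, the same testing pattern can be sketched against an in-memory SQLite database. The helpers count_rows and split_col_names below are hypothetical stand-ins, not the SingleCells methods, and the rule that metadata columns start with "Metadata_" is an assumption made only for this example.

```python
import sqlite3


def count_rows(conn, table):
    """Hypothetical stand-in for a row-count helper."""
    return conn.execute(f"SELECT COUNT(*) FROM {table}").fetchone()[0]


def split_col_names(conn, table, meta_prefix="Metadata_"):
    """Hypothetical stand-in: split column names into metadata and features."""
    cursor = conn.execute(f"SELECT * FROM {table} LIMIT 1")
    names = [description[0] for description in cursor.description]
    meta_cols = [name for name in names if name.startswith(meta_prefix)]
    feat_cols = [name for name in names if not name.startswith(meta_prefix)]
    return meta_cols, feat_cols


def make_example_db(n_rows=10):
    """Build a small in-memory database with one made-up compartment table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE cells (Metadata_ObjectNumber INTEGER, Cells_a REAL, Cells_b REAL)"
    )
    conn.executemany(
        "INSERT INTO cells VALUES (?, ?, ?)",
        [(i, i * 0.1, i * 0.2) for i in range(n_rows)],
    )
    return conn


def test_count_rows():
    conn = make_example_db(n_rows=10)
    assert count_rows(conn, "cells") == 10


def test_split_col_names():
    conn = make_example_db()
    meta_cols, feat_cols = split_col_names(conn, "cells")
    assert meta_cols == ["Metadata_ObjectNumber"]
    assert feat_cols == ["Cells_a", "Cells_b"]
```

Because the expected values are fixed by the fixture itself, these tests would fail if the helpers changed behavior, independent of any higher-level loading test.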

@bunnech
Contributor Author

bunnech commented Aug 16, 2022

The other two functions I think can be tested rather easily:

Thanks! Added some pytests!

Member

@gwaybio gwaybio left a comment

Two suggestions, one additional request, and one unresolved comment.

After you address these three things, I will merge (pending passing tests).

@codecov-commenter

Codecov Report

Merging #219 (f41d72a) into master (7fe47c9) will increase coverage by 0.04%.
The diff coverage is 100.00%.

❗ Current head f41d72a differs from pull request most recent head 1c6c000. Consider uploading reports for the commit 1c6c000 to get more accurate results

@@            Coverage Diff             @@
##           master     #219      +/-   ##
==========================================
+ Coverage   95.51%   95.56%   +0.04%     
==========================================
  Files          53       53              
  Lines        2697     2727      +30     
==========================================
+ Hits         2576     2606      +30     
  Misses        121      121              
Flag        Coverage Δ
unittests   95.56% <100.00%> (+0.04%) ⬆️

Impacted Files                                    Coverage Δ
pycytominer/cyto_utils/cells.py                   96.61% <100.00%> (+0.34%) ⬆️
pycytominer/tests/test_cyto_utils/test_cells.py   100.00% <100.00%> (ø)

@bunnech
Contributor Author

bunnech commented Aug 16, 2022

Two suggestions, one additional request, and one unresolved comment.

Addressed them! Thanks a lot for the feedback and guidance!

@gwaybio
Member

gwaybio commented Aug 16, 2022

closes #220

Thanks again, @bunnech!
