Fix of `.merge_single_cells()` to Load Single-Cell Data into Dataframes #219

bunnech · 2022-08-16T14:08:28Z

Description

This pull request addresses issue #195, concerning a potential memory leak in the .merge_single_cells() function as mentioned in pull request #194 and issue #215.

The updated .merge_single_cells() function combines the different compartments, image data, and metadata and loads them into a Pandas dataframe. The old version did not finish converting a >10 GB .sqlite file into dataframes within 4 hours. The current fix finishes the task in < 15 minutes.
This was achieved by removing the dependency on pd.read_sql (as suggested by @johnarevalo) and by eliminating the temporary per-compartment dataframes that were created before merging and consumed additional memory. There might be better solutions, but this is working well for me currently.
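
A minimal sketch of the idea described above, not the code from this pull request: read a table through the standard-library `sqlite3` cursor instead of `pd.read_sql`, and build the dataframe in one step. The table name `Cells`, its columns, and the helper `load_table` are hypothetical and only serve as illustration.

```python
import sqlite3

import pandas as pd


def load_table(conn: sqlite3.Connection, table: str) -> pd.DataFrame:
    """Load one table into a dataframe without pd.read_sql (hypothetical helper)."""
    # Table names cannot be bound as SQL parameters, so quote the identifier.
    cursor = conn.execute(f'SELECT * FROM "{table}"')
    columns = [description[0] for description in cursor.description]
    # fetchall() returns a list of row tuples; build the dataframe once
    # instead of creating and merging intermediate dataframes.
    return pd.DataFrame(cursor.fetchall(), columns=columns)


# Tiny in-memory database standing in for a real single-cell .sqlite file.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE Cells "
    "(ImageNumber INTEGER, ObjectNumber INTEGER, Cells_AreaShape_Area REAL)"
)
conn.executemany(
    "INSERT INTO Cells VALUES (?, ?, ?)",
    [(1, 1, 10.5), (1, 2, 12.0)],
)
cells_df = load_table(conn, "Cells")
```

For files that do not fit in memory, `cursor.fetchmany()` could replace `fetchall()` to read the table in chunks; whether that is needed depends on the data.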

Let me know what you think!

What is the nature of your change?

Bug fix (fixes an issue).
Enhancement (adds functionality).
Breaking change (fix or feature that would cause existing functionality to not work as expected).
This change requires a documentation update.

Checklist

Please ensure that all boxes are checked before indicating that a pull request is ready for review.

I have read the CONTRIBUTING.md guidelines.
My code follows the style guidelines of this project.
I have performed a self-review of my own code.
I have commented my code, particularly in hard-to-understand areas.
I have made corresponding changes to the documentation.
My changes generate no new warnings.
New and existing unit tests pass locally with my changes.
I have added tests that prove my fix is effective or that my feature works.
I have deleted all non-relevant text in this pull request template.

gwaybio

Thanks for this contribution! I made several comments throughout that we should discuss prior to merging.

I have two additional discussion items as well. Can you comment on:

I'm wondering why we need to separate metadata features and morphology features at all. Would it make sense to process all features the same way? Ah, but maybe if we do that they won't be in the correct order in the final sc_df?
I am inclined to ask for tests for the three new functions you added (prior to name change, is_feature_col, count, and get_columns). I know that they are tested through the other, existing tests, but I think they will be more robust to future changes if they themselves are tested.

bunnech · 2022-08-16T17:59:10Z

I'm wondering why we need to separate metadata features and morphology features at all. Would it make sense to process all features the same way? Ah, but maybe if we do that they won't be in the correct order in the final sc_df?

We are required to load the morphology features as a np.array as this is much faster than adding it via a pd.DataFrame. Then I merge both into a joint pd.DataFrame, which I return per compartment. Metadata features might contain strings or objects, so a np.array here is not possible. That's why we separate them.
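
To illustrate the reasoning above, here is a small sketch with made-up column names and values (not taken from pycytominer): numeric morphology features fit into a single float `np.array`, whereas metadata containing strings is kept in a regular dataframe, and the two are then joined column-wise with metadata first.

```python
import numpy as np
import pandas as pd

# Hypothetical rows as they might come back from a SQL cursor:
# two metadata values followed by two morphology measurements.
rows = [
    ("plate_1", "A01", 10.5, 0.25),
    ("plate_1", "A02", 12.0, 0.75),
]
meta_cols = ["Metadata_Plate", "Metadata_Well"]
feat_cols = ["Cells_AreaShape_Area", "Cells_AreaShape_Eccentricity"]
n_meta = len(meta_cols)

# Metadata may hold strings or objects, so it goes into a dataframe directly.
meta_df = pd.DataFrame([row[:n_meta] for row in rows], columns=meta_cols)

# Morphology features are purely numeric and fit into one float64 array.
features = np.array([row[n_meta:] for row in rows], dtype="float64")
feat_df = pd.DataFrame(features, columns=feat_cols)

# Join column-wise: metadata first, then morphology features.
compartment_df = pd.concat([meta_df, feat_df], axis=1)
```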

I am inclined to ask for tests for the three new functions you added (prior to name change, is_feature_col, count, and get_columns). I know that they are tested through the other, existing tests, but I think they will be more robust to future changes if they themselves are tested.

I removed is_feature_col now. The other two functions are tested via pytests on load_compartment. Not sure which additional test to design.

bunnech · 2022-08-16T18:05:09Z

@gwaybio, I pushed the edits suggested by you. Let me know if anything else is required!

Thanks again!

gwaybio

I'm wondering why we need to separate metadata features and morphology features at all. Would it make sense to process all features the same way? Ah, but maybe if we do that they won't be in the correct order in the final sc_df?

We are required to load the morphology features as a np.array as this is much faster than adding it via a pd.DataFrame. Then I merge both into a joint pd.DataFrame, which I return per compartment. Metadata features might contain strings or objects, so a np.array here is not possible. That's why we separate them.

Got it! Makes perfect sense, thanks for your explanation.

I am inclined to ask for tests for the three new functions you added (prior to name change, is_feature_col, count, and get_columns). I know that they are tested through the other, existing tests, but I think they will be more robust to future changes if they themselves are tested.

I removed is_feature_col now. The other two functions are tested via pytests on load_compartment. Not sure which additional test to design.

I like the elegance of removing is_feature_col. The other two functions I think can be tested rather easily:

`count_sql_table_rows()`

# INSERT TESTING FUNCTION DEFINITION
def test_sc_count_sql_table():  # or something
    # Some python pseudocode testing logic:
    sc = SingleCells()  # Initiate class equivalently as we do in other tests

    # Iterate over initialized compartments
    for compartment in sc.compartments:
        result_row_count = sc.count_sql_table_rows(table=compartment)
        # expected_row_count: row count of the example data initialized for the tests
        assert result_row_count == expected_row_count

`get_sql_table_col_names()`

# INSERT TESTING FUNCTION DEFINITION
def test_get_sql_table_col_names():  # or something
    # Some python pseudocode testing logic:
    sc = SingleCells()  # Initiate class equivalently as we do in other tests

    # Iterate over initialized compartments
    for compartment in sc.compartments:
        meta_cols, feat_cols = sc.get_sql_table_col_names(table=compartment)
        # Compare against the column names known from the example data
        assert meta_cols == expected_meta_cols
        assert feat_cols == expected_feat_cols
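
The two pseudocode tests above could be made concrete along the following lines. This is a sketch only: `count_sql_table_rows` and `get_sql_table_col_names` are written here as standalone functions over an in-memory SQLite database. They share names with the methods under discussion but are not their actual implementation, and both the `Metadata_` prefix rule and the example table are assumptions.

```python
import sqlite3


def count_sql_table_rows(conn, table):
    """Return the number of rows in a table (standalone stand-in)."""
    (num_rows,) = conn.execute(f'SELECT COUNT(*) FROM "{table}"').fetchone()
    return num_rows


def get_sql_table_col_names(conn, table):
    """Split a table's columns into metadata and feature names (stand-in).

    Assumption: metadata columns are those starting with "Metadata_".
    """
    cursor = conn.execute(f'SELECT * FROM "{table}" LIMIT 1')
    col_names = [description[0] for description in cursor.description]
    meta_cols = [c for c in col_names if c.startswith("Metadata_")]
    feat_cols = [c for c in col_names if not c.startswith("Metadata_")]
    return meta_cols, feat_cols


def make_example_db():
    """Build a small in-memory database with one hypothetical compartment."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE cells (Metadata_Well TEXT, Cells_a REAL, Cells_b REAL)"
    )
    conn.executemany(
        "INSERT INTO cells VALUES (?, ?, ?)",
        [("A01", 1.0, 2.0), ("A02", 3.0, 4.0), ("A03", 5.0, 6.0)],
    )
    return conn


def test_count_sql_table_rows():
    conn = make_example_db()
    assert count_sql_table_rows(conn, "cells") == 3


def test_get_sql_table_col_names():
    conn = make_example_db()
    meta_cols, feat_cols = get_sql_table_col_names(conn, "cells")
    assert meta_cols == ["Metadata_Well"]
    assert feat_cols == ["Cells_a", "Cells_b"]
```

Each test builds its own database, so the tests stay independent of each other when collected by `pytest`.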

bunnech · 2022-08-16T18:47:25Z

The other two functions I think can be tested rather easily:

Thanks! Added some pytests!

gwaybio

Two suggestions, one additional request, and one unresolved comment.

After you address these three things, I will merge (pending passing tests)

codecov-commenter · 2022-08-16T19:06:47Z

Codecov Report

Merging #219 (f41d72a) into master (7fe47c9) will increase coverage by 0.04%.
The diff coverage is 100.00%.

❗ Current head f41d72a differs from pull request most recent head 1c6c000. Consider uploading reports for the commit 1c6c000 to get more accurate results

@@            Coverage Diff             @@
##           master     #219      +/-   ##
==========================================
+ Coverage   95.51%   95.56%   +0.04%     
==========================================
  Files          53       53              
  Lines        2697     2727      +30     
==========================================
+ Hits         2576     2606      +30     
  Misses        121      121

| Flag | Coverage Δ | |
|---|---|---|
| unittests | `95.56% <100.00%> (+0.04%)` | ⬆️ |

Flags with carried forward coverage won't be shown.

| Impacted Files | Coverage Δ | |
|---|---|---|
| pycytominer/cyto_utils/cells.py | `96.61% <100.00%> (+0.34%)` | ⬆️ |
| pycytominer/tests/test_cyto_utils/test_cells.py | `100.00% <100.00%> (ø)` | |

bunnech · 2022-08-16T19:12:29Z

Two suggestions, one additional request, and one unresolved comment.

Addressed them! Thanks a lot for the feedback and guidance!

gwaybio · 2022-08-16T19:43:00Z

closes #220

Thanks again @bunnech !

bunnech added 4 commits August 13, 2022 17:47

Remove temporary df in merge function and add test mode.

852d862

Remove test environment and replace Pandas read_sql function.

766e4c2

Dump to parquet and add filename.

0118ccd

Correct dtype and adapt pytest.

abb2774

gwaybio self-requested a review August 16, 2022 14:39

gwaybio requested changes Aug 16, 2022

Add feedback.

0ef040e

gwaybio reviewed Aug 16, 2022

Add additional pytests.

f41d72a

gwaybio approved these changes Aug 16, 2022

Add additional pytests.

1c6c000

Add additional pytests.

e9d9edb

This was referenced Aug 16, 2022

Reorder singleCells() load_compartment features (metadata before morphology features) #220

Closed

Address Pandas Dataframe Merge and Other Performance Issues #199

Closed

Change existing pytests to consider order meta features, then morphological features.

b8ad4c8

gwaybio mentioned this pull request Aug 16, 2022

Potential memory leak in SingleCell's .merge_single_cells() method #195

Closed

gwaybio merged commit 338475b into cytomining:master Aug 16, 2022

d33bs mentioned this pull request Aug 22, 2022

Build SQLite conversion tool #205

Closed

axiomcura mentioned this pull request Sep 30, 2022

Operating system kernel kills merge_single_cells() process due to out of memory error #233

Open
