Fix of .merge_single_cells() to Load Single-Cell Data into Dataframes #219
Thanks for this contribution! I made several comments throughout that we should discuss prior to merging.
I have two additional discussion items as well. Can you comment on:
- I'm wondering why we need to separate metadata features and morphology features at all. Would it make sense to process all features the same way? Ah, but maybe if we do that they won't be in the correct order in the final sc_df?
- I am inclined to ask for tests for the three new functions you added (prior to name change, is_feature_col, count, and get_columns). I know that they are tested through the other, existing tests, but I think they will be more robust to future changes if they themselves are tested.
@gwaybio, I pushed the edits you suggested. Let me know if anything else is required! Thanks again!
I'm wondering why we need to separate metadata features and morphology features at all. Would it make sense to process all features the same way? Ah, but maybe if we do that they won't be in the correct order in the final sc_df?
We are required to load the morphology features as a np.array as this is much faster than adding it via a pd.DataFrame. Then I merge both into a joint pd.DataFrame, which I return per compartment. Metadata features might contain strings or objects, so a np.array here is not possible. That's why we separate them.
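The split described above can be sketched roughly as follows. This is a minimal illustration, not the code from this PR: the table name, the column names, and the `load_table` helper are all hypothetical, and an in-memory SQLite database stands in for a real single-cell file.

```python
import sqlite3

import numpy as np
import pandas as pd

# Hypothetical example table; real compartment tables have many more columns.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE cells (TableNumber TEXT, ImageNumber INTEGER, "
    "Cells_Area REAL, Cells_Intensity REAL)"
)
conn.executemany(
    "INSERT INTO cells VALUES (?, ?, ?, ?)",
    [("a", 1, 10.0, 0.5), ("a", 1, 12.0, 0.7), ("b", 2, 9.0, 0.4)],
)


def load_table(conn, table, meta_cols, feat_cols):
    """Hypothetical helper: load metadata and features separately, then join."""
    cols = ", ".join(meta_cols + feat_cols)
    rows = conn.execute(f"SELECT {cols} FROM {table}").fetchall()
    n_meta = len(meta_cols)
    # Metadata may hold strings or objects, so it goes straight into a DataFrame.
    meta_df = pd.DataFrame([r[:n_meta] for r in rows], columns=meta_cols)
    # Morphology features are numeric, so one float array holds all of them.
    feats = np.array([r[n_meta:] for r in rows], dtype=float)
    feats = feats.reshape(-1, len(feat_cols))  # keeps the shape 2-D even with no rows
    feat_df = pd.DataFrame(feats, columns=feat_cols)
    # Metadata columns come first, followed by the feature columns.
    return pd.concat([meta_df, feat_df], axis=1)


sc_df = load_table(
    conn, "cells", ["TableNumber", "ImageNumber"], ["Cells_Area", "Cells_Intensity"]
)
```

Keeping the two groups apart until the final `pd.concat` also fixes the column order of the result, which relates to the ordering concern raised in the question.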
Got it! Makes perfect sense, thanks for your explanation.
I am inclined to ask for tests for the three new functions you added (prior to name change, is_feature_col, count, and get_columns). I know that they are tested through the other, existing tests, but I think they will be more robust to future changes if they themselves are tested.
I removed is_feature_col now. The other two functions are tested via pytests on load_compartment. Not sure which additional test to design.
I like the elegance of removing is_feature_col. The other two functions I think can be tested rather easily:
count_sql_table_rows()
    # INSERT TESTING FUNCTION DEFINITION
    def test_sc_count_sql_table():  # or something
        # Some Python pseudocode testing logic:
        sc = SingleCells()  # Initiate class equivalently as we do in other tests
        # Iterate over initialized compartments
        for compartment in sc.compartments:
            result_row_count = sc.count_sql_table_rows(table=compartment)
            # expected_row_count: the number of rows the example data was initialized with
            assert result_row_count == expected_row_count
get_sql_table_col_names()
    # INSERT TESTING FUNCTION DEFINITION
    def test_get_sql_table_col_names():  # or something
        # Some Python pseudocode testing logic:
        sc = SingleCells()  # Initiate class equivalently as we do in other tests
        # Iterate over initialized compartments
        for compartment in sc.compartments:
            meta_cols, feat_cols = sc.get_sql_table_col_names(table=compartment)
            assert meta_cols == expected_meta_cols
            assert feat_cols == expected_feat_cols
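For reference, here is a self-contained version of the same testing idea that runs without the package. It is only a sketch: the free functions `count_sql_table_rows` and `get_sql_table_col_names` below are hypothetical stand-ins for the `SingleCells` methods, the rule for which columns count as metadata is an assumption, and the example database is made up.

```python
import sqlite3

# Assumption for illustration: these names mark metadata columns.
ASSUMED_META_COLS = {"tablenumber", "imagenumber", "objectnumber"}
COMPARTMENTS = ("cells", "nuclei")


def count_sql_table_rows(conn, table):
    """Hypothetical stand-in: number of rows in a table."""
    (n_rows,) = conn.execute(f"SELECT COUNT(*) FROM {table}").fetchone()
    return n_rows


def get_sql_table_col_names(conn, table):
    """Hypothetical stand-in: split column names into metadata and features."""
    cursor = conn.execute(f"SELECT * FROM {table} LIMIT 1")
    names = [desc[0] for desc in cursor.description]
    meta_cols = [c for c in names if c.lower() in ASSUMED_META_COLS]
    feat_cols = [c for c in names if c.lower() not in ASSUMED_META_COLS]
    return meta_cols, feat_cols


def make_example_db(n_rows=5):
    """Build a small in-memory database with one table per compartment."""
    conn = sqlite3.connect(":memory:")
    for compartment in COMPARTMENTS:
        conn.execute(
            f"CREATE TABLE {compartment} (TableNumber TEXT, ImageNumber INTEGER, "
            f"ObjectNumber INTEGER, {compartment}_Area REAL)"
        )
        conn.executemany(
            f"INSERT INTO {compartment} VALUES (?, ?, ?, ?)",
            [("t", 1, i, float(i)) for i in range(n_rows)],
        )
    return conn


def test_count_sql_table_rows():
    conn = make_example_db(n_rows=5)
    for compartment in COMPARTMENTS:
        assert count_sql_table_rows(conn, compartment) == 5


def test_get_sql_table_col_names():
    conn = make_example_db()
    for compartment in COMPARTMENTS:
        meta_cols, feat_cols = get_sql_table_col_names(conn, compartment)
        assert meta_cols == ["TableNumber", "ImageNumber", "ObjectNumber"]
        assert feat_cols == [f"{compartment}_Area"]
```

Both tests compare against values that are known from how the example database was built, which is the role `expected_meta_cols` and the expected row count play in the pseudocode above.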
Thanks! Added some pytests!
Two suggestions, one additional request, and one unresolved comment.
After you address these three things, I will merge (pending passing tests)
Codecov Report
@@ Coverage Diff @@
## master #219 +/- ##
==========================================
+ Coverage 95.51% 95.56% +0.04%
==========================================
Files 53 53
Lines 2697 2727 +30
==========================================
+ Hits 2576 2606 +30
Misses 121 121
Addressed them! Thanks a lot for the feedback and guidance!
Description
This pull request addresses issue #195, concerning a potential memory leak in the .merge_single_cells() function, as mentioned in pull request #194 and issue #215.
The updated .merge_single_cells() function combines different compartments, image data, and metadata and loads them into a Pandas dataframe. The old version did not finish converting a >10 GB .sqlite file into dataframes within 4 hours. The current fix finishes the task in < 15 minutes.
This was achieved by removing the dependency on pd.read_sql (as suggested by @johnarevalo) and by getting rid of the temporarily created dataframes of compartments to be merged, which took additional memory. There might be better solutions, but this is working well for me currently.
Let me know what you think!
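One way to load a table without pd.read_sql is to count the rows first, preallocate a single NumPy array, and fill it from a plain database cursor in chunks. The sketch below illustrates that idea only; it is not the code from this PR, and the `load_features` helper, the table, the column names, and the chunk size are all hypothetical.

```python
import sqlite3

import numpy as np
import pandas as pd


def load_features(conn, table, feat_cols, chunk_size=2):
    """Hypothetical helper: stream numeric columns into one preallocated array."""
    (n_rows,) = conn.execute(f"SELECT COUNT(*) FROM {table}").fetchone()
    # Allocate the full array once instead of growing intermediate dataframes.
    out = np.empty((n_rows, len(feat_cols)), dtype=np.float64)
    cols = ", ".join(feat_cols)
    cursor = conn.execute(f"SELECT {cols} FROM {table}")
    start = 0
    while True:
        chunk = cursor.fetchmany(chunk_size)
        if not chunk:
            break
        # Copy this chunk of rows into its slice of the preallocated array.
        out[start : start + len(chunk)] = chunk
        start += len(chunk)
    return pd.DataFrame(out, columns=feat_cols)


# Made-up example data in an in-memory database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE nuclei (Nuclei_Area REAL, Nuclei_Perimeter REAL)")
conn.executemany(
    "INSERT INTO nuclei VALUES (?, ?)",
    [(float(i), float(2 * i)) for i in range(5)],
)

nuclei_df = load_features(conn, "nuclei", ["Nuclei_Area", "Nuclei_Perimeter"])
```

In this sketch, only one chunk of Python tuples is held alongside the final array at any time, so peak memory stays close to the size of the result.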