Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Added check before dropping well column #241

Merged
merged 8 commits into from
Dec 16, 2022
Merged

Added check before dropping well column #241

merged 8 commits into from
Dec 16, 2022

Conversation

kenibrewer
Copy link
Member

Description

Simple bugfix for #240 where a missing check for identical column names results in an inappropriately dropped column.

What is the nature of your change?

  • Bug fix (fixes an issue).
  • Enhancement (adds functionality).
  • Breaking change (fix or feature that would cause existing functionality to not work as expected).
  • This change requires a documentation update.

Checklist

Please ensure that all boxes are checked before indicating that a pull request is ready for review.

  • I have read the CONTRIBUTING.md guidelines.
  • My code follows the style guidelines of this project.
  • I have performed a self-review of my own code.
  • I have commented my code, particularly in hard-to-understand areas.
  • I have made corresponding changes to the documentation.
  • My changes generate no new warnings.
  • New and existing unit tests pass locally with my changes.

@gwaybio
Copy link
Member

gwaybio commented Nov 15, 2022

Thanks for your contribution @kenibrewer - can you add a small test that would have failed the previous functionality that this contribution now fixes?

@gwaybio
Copy link
Member

gwaybio commented Nov 15, 2022

Actually, maybe the simpler solution would be to drop the expression .drop(join_on[0], axis="columns"

But, then we run into the issue when join_on[0] == join_on[1] with auto suffixes changing both column names. We'd have to add something like: suffixes=("_platemap", "")

What do you think?

@codecov-commenter
Copy link

Codecov Report

Merging #241 (c91d099) into master (e43be52) will increase coverage by 0.00%.
The diff coverage is 100.00%.

@@           Coverage Diff           @@
##           master     #241   +/-   ##
=======================================
  Coverage   95.77%   95.77%           
=======================================
  Files          53       53           
  Lines        2794     2796    +2     
=======================================
+ Hits         2676     2678    +2     
  Misses        118      118           
Flag Coverage Δ
unittests 95.77% <100.00%> (+<0.01%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

Impacted Files Coverage Δ
pycytominer/cyto_utils/features.py 89.79% <ø> (ø)
pycytominer/feature_select.py 94.59% <ø> (ø)
pycytominer/annotate.py 100.00% <100.00%> (ø)
...tests/test_cyto_utils/test_feature_drop_outlier.py 100.00% <100.00%> (ø)
pycytominer/tests/test_feature_select.py 100.00% <100.00%> (ø)

📣 We’re building smart automated test selection to slash your CI/CD build times. Learn more

@kenibrewer
Copy link
Member Author

@gwaybio I added an example to the related issue page that illustrates the problem. I also added a test_case to test_annotate.py that crashes with the old code and works with the new code. In the process, I also discovered that a failure to work on fresh dataframe copies in individual tests was causing operations in one test to affect the others. So I fixed that too.

Actually, maybe the simpler solution would be to drop the expression .drop(join_on[0], axis="columns"

But, then we run into the issue when join_on[0] == join_on[1] with auto suffixes changing both column names. We'd have to add something like: suffixes=("_platemap", "")

What do you think?

I think there is value to eliminating redundant information by having dropping the platemap version of the well column name. So I think we should still drop the column when appropriate. However, even if we drop the expression .drop(join_on[0], axis="columns", we won't get any additional columns with auto-suffixes than we would anyways.

Addressing the auto_suffix problem separately, I actually do really like the idea of adding the suffixes=("_platemap", "") parameter regardless. An additional nice feature would be if after the merge any new '_platemap' columns are checked against their matching data column and are dropped if they are identical.

Copy link
Member

@gwaybio gwaybio left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A couple discussion questions and required changes prior to review. Thanks again for your contribution!

pycytominer/tests/test_annotate.py Outdated Show resolved Hide resolved
pycytominer/tests/test_annotate.py Outdated Show resolved Hide resolved
pycytominer/tests/test_annotate.py Outdated Show resolved Hide resolved
pycytominer/tests/test_annotate.py Outdated Show resolved Hide resolved
pycytominer/tests/test_annotate.py Outdated Show resolved Hide resolved
pycytominer/tests/test_annotate.py Outdated Show resolved Hide resolved
@gwaybio
Copy link
Member

gwaybio commented Dec 15, 2022

@kenibrewer - please correct me if I'm wrong, but I think we're waiting on your reply to this comment prior to moving forward with merging this PR. Thanks again!

@kenibrewer
Copy link
Member Author

Hi @gwaybio,
You are correct that you are waiting on me to submit some additional changes prior to merging. I had some other priorities the past few weeks but have some time to work on this now.
Related to this bug. I've think I've discovered some other bugs related to missing .copy(deep=True) in other pycytominer functions. I'll submit those as a separate issue though.

@kenibrewer
Copy link
Member Author

kenibrewer commented Dec 16, 2022

Revisiting your suggestion about adding suffixes, I think it would be good to add suffixes = ["", "platemap"] to the platemap merge and suffixes = ["", "external"] to the external metadata merge. Both of those changes would result in much more readable and predictable outputs than the default "_x" and "_y" suffixes.

The one problem with this change, however, is that it would likely break pipelines/code (including a few of my own projects), that already include logic around renaming/handling the "_x" and "_y" columns that are generated by the current codebase. I still think the change is worthwhile, but I'm not sure how you handle potentially breaking changes.

@gwaybio
Copy link
Member

gwaybio commented Dec 16, 2022

Thanks @kenibrewer - thanks for following up and I'm excited you're able to work on this PR again. I'll respond to a couple of your thoughts and then provide a code review.

I've think I've discovered some other bugs related to missing .copy(deep=True) in other pycytominer functions. I'll submit those as a separate issue though.

Fantastic - thanks! Once you create a new issue, please link this PR

I think it would be good to add suffixes = ["", "platemap"] to the platemap merge and suffixes = ["", "external"] to the external metadata merge.

Sounds good. My sense is that this change belongs in a separate PR. Let's focus on dropping the well column (and fixing any immediate bugs that arise from this closer investigation) here, and save this potentially breaking change for a subsequent PR.

I'm not sure how you handle potentially breaking changes.

We haven't had a hard and fast policy, but we typically mint a new release version which will kickoff a new pypi and conda build. If you're using a specific pycytominer version (or github hash) in your previous pipelines, then this should have no impact on past work other than needing to modify code for future projects if new pycytominer versions are good enough to warrant a version upgrade.

@gwaybio gwaybio self-requested a review December 16, 2022 16:26
Copy link
Member

@gwaybio gwaybio left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good! I'll kick off tests now. Please review my minor suggestions and then, once the tests pass, I'll check in with you and merge.

Thanks again for your contribution!

pycytominer/cyto_utils/load.py Outdated Show resolved Hide resolved
pycytominer/tests/test_annotate.py Outdated Show resolved Hide resolved
@gwaybio
Copy link
Member

gwaybio commented Dec 16, 2022

Looks like the tests will pass - are we set to merge this in @kenibrewer ?

@kenibrewer
Copy link
Member Author

@gwaybio Yep. Ready for merge. 🚀

@gwaybio gwaybio self-requested a review December 16, 2022 19:00
Copy link
Member

@gwaybio gwaybio left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🚀

@gwaybio gwaybio merged commit 0d75bc6 into cytomining:master Dec 16, 2022
@gwaybio
Copy link
Member

gwaybio commented Dec 16, 2022

Thanks again @kenibrewer

When you get a chance, please create new issues in https://github.com/cytomining/pycytominer/issues. I believe they involved 1) copy() based bugs and 2) specific merge suffixes. I don't want to lose track of these

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants