Add "noise removal" feature selection #153

Merged: 15 commits merged into cytomining:master on Aug 27, 2021

@ruifanp ruifanp commented Jul 7, 2021

modified from EmbeddedArtistry

Description

Added noise removal as a feature selection operation. It has been demonstrated to make features more informative on multiple datasets.

What is the nature of your change?

  • Bug fix (fixes an issue).
  • Enhancement (adds functionality).
  • Breaking change (fix or feature that would cause existing functionality to not work as expected).
  • This change requires a documentation update.

Checklist

  • I have read the CONTRIBUTING.md guidelines.
  • My code follows the style guidelines of this project.
  • I have performed a self-review of my own code.
  • I have commented my code, particularly in hard-to-understand areas.
  • I have made corresponding changes to the documentation.
  • My changes generate no new warnings.
  • New and existing unit tests pass locally with my changes.
  • I have added tests that prove my fix is effective or that my feature works.
  • I have deleted all non-relevant text in this pull request template.

@ruifanp
Contributor Author

ruifanp commented Jul 27, 2021

This adds the noise removal feature selection operation. The data must be normalized before this step is applied. The operation repeats the following calculation for each feature:

  1. For each unique perturbation group, compute the standard deviation of the feature across that group's replicates.
  2. Compute the mean of those standard deviations, weighted by the number of replicates in each group.
  3. If the weighted mean is higher than the cutoff, discard the feature.
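
As a rough illustration of the three steps above (not the code added in this PR), here is a minimal pandas sketch. The function name `noise_removal_sketch` and its parameters are hypothetical. It assumes the sample standard deviation (`ddof=1`, the pandas default) and at least two replicates per group; the actual implementation may make different choices.

```python
import pandas as pd


def noise_removal_sketch(df, features, perturb_groups, stdev_cutoff):
    """Return the features whose replicate-weighted mean stdev exceeds the cutoff.

    Hypothetical helper for illustration only; not part of pycytominer.
    """
    # One group label per row, aligned to the data frame's index.
    groups = pd.Series(list(perturb_groups), index=df.index, name="group")
    grouped = df[features].groupby(groups)

    # Step 1: standard deviation of each feature within each perturbation group.
    group_stdev = grouped.std()  # assumption: sample stdev (ddof=1)

    # Step 2: mean of the group stdevs, weighted by replicate count.
    weights = grouped.size()
    weighted_mean = group_stdev.mul(weights, axis=0).sum() / weights.sum()

    # Step 3: features above the cutoff are the ones to discard.
    return weighted_mean[weighted_mean > stdev_cutoff].index.tolist()


# Made-up data with unequal group sizes (two replicates of "a", four of "b").
toy_df = pd.DataFrame({"f": [0, 2, 0, 0, 0, 0], "g": [0, 10, 0, 0, 0, 0]})
toy_groups = ["a", "a", "b", "b", "b", "b"]
print(noise_removal_sketch(toy_df, ["f", "g"], toy_groups, stdev_cutoff=0.5))
```

Weighting by replicate count means a noisy group with few replicates contributes less than a large, tight group, which is why the toy data uses unequal group sizes.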

@ruifanp
Contributor Author

ruifanp commented Jul 27, 2021

Example of this in action:

import pandas as pd
from pycytominer import feature_select

df = pd.DataFrame(
    {
        "x": [1, 3, 8, 5, 2, 2],
        "y": [1, 2, 8, 5, 2, 1],
        "z": [9, 3, 8, 9, 2, 9],
        "zz": [0, -3, 8, 9, 6, 9],
    }
).reset_index(drop=True)

Suppose the first three rows are replicates of perturbation a and the last three are replicates of perturbation b:
perturb_list = ["a", "a", "a", "b", "b", "b"]

Run feature selection:
feature_select(df, features=df.columns.tolist(), operation="noise_removal", perturb_list=perturb_list, stdev_cutoff=3)

Note that features can be inferred, but you must always provide the perturbation list.
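
To make the arithmetic for this example concrete, the per-group standard deviations can be checked with plain pandas. This is only a sketch: it assumes the sample standard deviation (`ddof=1`), which may not match the convention used by the implementation in this PR, and because both groups have three replicates the weighted mean reduces to a plain mean. The variable names are illustrative.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "x": [1, 3, 8, 5, 2, 2],
        "y": [1, 2, 8, 5, 2, 1],
        "z": [9, 3, 8, 9, 2, 9],
        "zz": [0, -3, 8, 9, 6, 9],
    }
)
perturb_list = ["a", "a", "a", "b", "b", "b"]

# One label per row; grouping by an aligned Series avoids any ambiguity
# with column names.
groups = pd.Series(perturb_list, index=df.index, name="perturbation")

# Standard deviation of every feature within each perturbation group
# (assumption: ddof=1, the pandas default).
group_stdev = df.groupby(groups).std()

# Equal group sizes, so the replicate-weighted mean is just the mean.
mean_stdev = group_stdev.mean()

# Features whose mean stdev exceeds the cutoff of 3 would be discarded.
dropped = mean_stdev[mean_stdev > 3].index.tolist()

print(group_stdev.round(3))
print(mean_stdev.round(3))
print(dropped)
```

Under these assumptions, x and y average roughly 2.67 and 2.93 and stay below the cutoff, while z and zz average roughly 3.63 and 3.71 and would be discarded. With a population standard deviation (`ddof=0`) the values shrink, so the outcome near the cutoff depends on that convention.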

@ruifanp ruifanp marked this pull request as ready for review July 27, 2021 18:57
Member

@niranjchandrasekaran niranjchandrasekaran left a comment

@ruifanp I am looking forward to using this approach for the JUMP datasets. I have made a few suggestions. I hope they make sense and are useful to you.

I am also tagging @gwaygenomics to take a look at this PR. His reviews are legendary 😄 and one can learn a lot from his suggestions.

pycytominer/feature_select.py (outdated, resolved)
pycytominer/feature_select.py (outdated, resolved)
pycytominer/operations/noise_removal.py (outdated, resolved)
pycytominer/operations/noise_removal.py (outdated, resolved)
pycytominer/operations/noise_removal.py (outdated, resolved)
@gwaybio
Member

gwaybio commented Jul 28, 2021

@ruifanp - amazing to have this contribution! I will provide my review once you address @niranjchandrasekaran's comments.

BTW, in the future, I don't think that both of us will need to review. Certainly supplemental comments can be helpful, but our CONTRIBUTING document only specifies one required maintainer acceptance:

> All pull requests must be reviewed and approved by at least one project maintainer in order to be merged.

From https://github.com/cytomining/pycytominer/blob/372f1086318841670a916f20334333b06e6b84c9/CONTRIBUTING.md#pull-requests

@gwaybio
Member

gwaybio commented Jul 28, 2021

I'll run the tests now, to see if anything obvious needs fixing

Member

@gwaybio gwaybio left a comment

Thanks again for this contribution @ruifanp - it's especially awesome considering JUMP is likely to benefit from it immediately.

My two main comments are:

  1. An enhancement to allow a user to also specify a metadata column
  2. A clarification on what you're testing for with data_unique_test_df - is this different than what you're testing for earlier in that function? If it is different, then I think you need to at the very least add comments on what each line is doing - very hard to read!

pycytominer/feature_select.py (outdated, resolved)
pycytominer/feature_select.py (outdated, resolved)
pycytominer/feature_select.py (outdated, resolved)
pycytominer/operations/noise_removal.py (outdated, resolved)
pycytominer/operations/noise_removal.py (outdated, resolved)
pycytominer/operations/noise_removal.py (outdated, resolved)
pycytominer/operations/noise_removal.py (outdated, resolved)
pycytominer/tests/test_feature_select.py (resolved)
pycytominer/tests/test_feature_select.py (resolved)
pycytominer/tests/test_feature_select.py (outdated, resolved)
Member

@gwaybio gwaybio left a comment

One required simplification, one question, one typo fix.

Very close!

pycytominer/operations/noise_removal.py (resolved)
pycytominer/tests/test_feature_select.py (outdated, resolved)
pycytominer/tests/test_feature_select.py (resolved)
pycytominer/tests/test_feature_select.py (outdated, resolved)
@codecov-commenter

Codecov Report

Merging #153 (24ce3b6) into master (372f108) will decrease coverage by 0.03%.
The diff coverage is 96.66%.

@@            Coverage Diff             @@
##           master     #153      +/-   ##
==========================================
- Coverage   98.09%   98.05%   -0.04%     
==========================================
  Files          49       50       +1     
  Lines        2253     2313      +60     
==========================================
+ Hits         2210     2268      +58     
- Misses         43       45       +2     
| Flag | Coverage Δ |
|---|---|
| unittests | 98.05% <96.66%> (-0.04%) ⬇️ |

| Impacted Files | Coverage Δ |
|---|---|
| pycytominer/operations/noise_removal.py | 90.00% <90.00%> (ø) |
| pycytominer/feature_select.py | 94.59% <100.00%> (+0.30%) ⬆️ |
| pycytominer/operations/__init__.py | 100.00% <100.00%> (ø) |
| pycytominer/tests/test_feature_select.py | 100.00% <100.00%> (ø) |

Member

@niranjchandrasekaran niranjchandrasekaran left a comment

Looks great @ruifanp!

@gwaybio
Member

gwaybio commented Aug 27, 2021

With one maintainer approval, it looks like this can be merged! Niranj, are you all set to merge Ruifan's contribution? (Sometimes it makes sense to hold off a couple of days, but I am not sure that is the case here.)

@niranjchandrasekaran
Member

> Niranj, are you all set to merge Ruifan's contribution? (sometimes it makes sense to hold off a couple days, but I am not sure that is the case here)

Yes, will merge now.

@niranjchandrasekaran niranjchandrasekaran merged commit 9a55470 into cytomining:master Aug 27, 2021
@gwaybio gwaybio mentioned this pull request Oct 7, 2021