-
Notifications
You must be signed in to change notification settings - Fork 2
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Feature/exe 1769 dsb implementation #147
Conversation
Codecov ReportAttention: Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## dev #147 +/- ##
==========================================
+ Coverage 81.80% 81.83% +0.02%
==========================================
Files 118 119 +1
Lines 6568 6605 +37
==========================================
+ Hits 5373 5405 +32
- Misses 1195 1200 +5 ☔ View full report in Codecov by Sentry. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Excellent! I have focuses on the python aspect here. I think @maxkarlsson is more qualified to comment on the actual implementation of the method.
Very nice and clear code. I had some minor suggestions.
def dsb_normalize( | ||
raw_expression: pd.DataFrame, isotype_controls: Union[List, None] = None | ||
): | ||
"""empty-droplet-free method as implemented in the dsb package. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we should add a reference to the original package here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Mulè, M.P., Martins, A.J. & Tsang, J.S. Normalizing and denoising protein expression data from droplet-based single cell profiling. Nat Commun 13, 2099 (2022).
https://doi.org/10.1038/s41467-022-29356-8
…randomizations for the same seed across platforms.
Changed terminology somewhat. expression --> abundance limma --> _regress_out_confounder baseline --> background
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks great! I have some comments and a control question for the PCA step. I have a little trouble understanding whether the linear regression is equivalent, mostly because I am not used to the syntax. I would suggest that we generate data for both versions of the normalization and verify that the result is close to exactly the same. Great and quick work!
I also made some commits with terminology changes. Please hava a look and see if you agree with the changes.
current_axis = dataframe.loc[i, :] if axis == 0 else dataframe.loc[:, i] | ||
gmm = gmm.fit(current_axis.to_frame()) | ||
background[i] = np.min(gmm.means_) | ||
scores[i] = np.abs(gmm.means_[1] - gmm.means_[0]) / np.sum(gmm.covariances_) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What is done at this line? Scores for a marker is calculated as how many standard deviations the positive population mean is from the negative population mean?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes! right now the score is not used anywhere as we do the baseline subtraction for all markers, but could later be used as a check for existence of negative population.
|
||
def _get_background_abundance(dataframe: pd.DataFrame, axis=0): | ||
"""Fit a double gaussian distribution to the abundance data and return the mean | ||
of the first gaussian as an estimation of the background level.""" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We also return some scores, right?
log_abundance = log_abundance - marker_background | ||
component_background, _ = _get_background_abundance(log_abundance, axis=0) | ||
|
||
if isotype_controls is not None: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we should force the user to input isotypes, since dsb relies on those.
if isotype_controls is not None: | ||
control_signals = log_abundance.loc[:, isotype_controls] | ||
control_signals["component_background"] = component_background | ||
pheno = PCA(n_components=1).fit_transform(control_signals) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are variables in control_signals
scaled to unit variance before PCA?
…dsb_normalize to be called with at least one isotype.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks great!
Description
Added dsb_normalization as part of the new normalization package in the analysis stage. dsb normalization has been shown to effectively compensate for noise effects in expression data.
Fixes: #(issue number)
Type of change
Please delete options that are not relevant.
How Has This Been Tested?
It has been tested against a similar function that is already implemented in R and the results are almost identical (A slight variation is expected given that the gaussian mixture model stage is indeterministic).
A corresponding test function is implemented to compare against future changes to the function.
PR checklist: