Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Feature/exe 1769 dsb implementation #147

Merged
merged 14 commits into from
Jun 3, 2024
Merged

Conversation

ptajvar
Copy link
Contributor

@ptajvar ptajvar commented May 31, 2024

Description

Added dsb_normalization as part of the new normalization package in the analysis stage. dsb normalization has been shown to effectively compensate for noise effects in expression data.

Fixes: #(issue number)

Type of change

Please delete options that are not relevant.

  • New feature (non-breaking change which adds functionality)

How Has This Been Tested?

It has been tested against a similar function that is already implemented in R and the results are almost identical (A slight variation is expected given that the gaussian mixture model stage is indeterministic).
A corresponding test function is implemented to compare against future changes to the function.

PR checklist:

  • This comment contains a description of changes (with reason).
  • I have performed a self-review of my own code
  • My changes generate no new warnings
  • I have checked my code and documentation and corrected any misspellings
  • I have documented any significant changes to the code in CHANGELOG.md

@ptajvar ptajvar self-assigned this May 31, 2024
Copy link

codecov bot commented May 31, 2024

Codecov Report

Attention: Patch coverage is 94.59459% with 2 lines in your changes are missing coverage. Please review.

Project coverage is 81.83%. Comparing base (7c69253) to head (9562e38).

Files Patch % Lines
src/pixelator/analysis/normalization/__init__.py 94.59% 2 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##              dev     #147      +/-   ##
==========================================
+ Coverage   81.80%   81.83%   +0.02%     
==========================================
  Files         118      119       +1     
  Lines        6568     6605      +37     
==========================================
+ Hits         5373     5405      +32     
- Misses       1195     1200       +5     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

Copy link
Contributor

@johandahlberg johandahlberg left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Excellent! I have focuses on the python aspect here. I think @maxkarlsson is more qualified to comment on the actual implementation of the method.

Very nice and clear code. I had some minor suggestions.

src/pixelator/analysis/normalization/__init__.py Outdated Show resolved Hide resolved
src/pixelator/analysis/normalization/__init__.py Outdated Show resolved Hide resolved
def dsb_normalize(
raw_expression: pd.DataFrame, isotype_controls: Union[List, None] = None
):
"""empty-droplet-free method as implemented in the dsb package.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should add a reference to the original package here.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Mulè, M.P., Martins, A.J. & Tsang, J.S. Normalizing and denoising protein expression data from droplet-based single cell profiling. Nat Commun 13, 2099 (2022).
https://doi.org/10.1038/s41467-022-29356-8

Copy link
Contributor

@maxkarlsson maxkarlsson left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks great! I have some comments and a control question for the PCA step. I have a little trouble understanding whether the linear regression is equivalent, mostly because I am not used to the syntax. I would suggest that we generate data for both versions of the normalization and verify that the result is close to exactly the same. Great and quick work!

I also made some commits with terminology changes. Please hava a look and see if you agree with the changes.

src/pixelator/analysis/normalization/__init__.py Outdated Show resolved Hide resolved
src/pixelator/analysis/normalization/__init__.py Outdated Show resolved Hide resolved
src/pixelator/analysis/normalization/__init__.py Outdated Show resolved Hide resolved
current_axis = dataframe.loc[i, :] if axis == 0 else dataframe.loc[:, i]
gmm = gmm.fit(current_axis.to_frame())
background[i] = np.min(gmm.means_)
scores[i] = np.abs(gmm.means_[1] - gmm.means_[0]) / np.sum(gmm.covariances_)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is done at this line? Scores for a marker is calculated as how many standard deviations the positive population mean is from the negative population mean?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes! right now the score is not used anywhere as we do the baseline subtraction for all markers, but could later be used as a check for existence of negative population.


def _get_background_abundance(dataframe: pd.DataFrame, axis=0):
"""Fit a double gaussian distribution to the abundance data and return the mean
of the first gaussian as an estimation of the background level."""
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We also return some scores, right?

log_abundance = log_abundance - marker_background
component_background, _ = _get_background_abundance(log_abundance, axis=0)

if isotype_controls is not None:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should force the user to input isotypes, since dsb relies on those.

if isotype_controls is not None:
control_signals = log_abundance.loc[:, isotype_controls]
control_signals["component_background"] = component_background
pheno = PCA(n_components=1).fit_transform(control_signals)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are variables in control_signals scaled to unit variance before PCA?

maxkarlsson and others added 2 commits June 3, 2024 11:20
@ptajvar ptajvar requested a review from maxkarlsson June 3, 2024 11:00
Copy link
Contributor

@maxkarlsson maxkarlsson left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks great!

@ptajvar ptajvar merged commit 206c806 into dev Jun 3, 2024
15 checks passed
@ptajvar ptajvar deleted the feature/exe-1769-dsb-implementation branch June 3, 2024 11:19
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants