Regressing out cell count #15
Update 5/31: results and visualizations were regenerated after removing some data following QC and making sure the index is not used as a feature when calculating metrics (see #16). Overall, the conclusions hold: cell count adjustment seems to help, while subtracting the well mean on top of that can hurt replicate retrieval performance.
🎉 Thanks for getting that done lightning fast! Can we visualize what the
pattern looks like for the mean in each well position? I’m curious if it’s
the left right pattern.
It’s surprising to me that this correction worsens outcomes!
|
Questions from Arnaud
For the first Q: we have indeed been using a downstream evaluation: the ability to retrieve the same ORF reagent in a different well position (and usually a different batch), which we want to be high, and the ability to retrieve a different ORF in the same well position, which we want to be low. So far our results look bad and confusing, and we are trying to figure out what is going on! Perhaps @alxndrkalinin can address your other two Qs.
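As a rough illustration of how such a retrieval can be scored, here is a minimal sketch of average precision over a ranked list of neighbors. The function name and the toy ranking are hypothetical, and the thread does not say how neighbors are ranked or which implementation is used, so treat this only as a picture of the metric, not of the actual evaluation code.

```python
def average_precision(ranked_relevance):
    """Average precision for one query.

    ranked_relevance: booleans ordered from most to least similar neighbor;
    True marks a neighbor that counts as a match (e.g. same ORF, different well).
    """
    hits = 0
    precisions = []
    for rank, relevant in enumerate(ranked_relevance, start=1):
        if relevant:
            hits += 1
            # Precision at the rank where this match was found.
            precisions.append(hits / rank)
    return sum(precisions) / len(precisions) if precisions else 0.0


# Toy ranking (made up): matches are found at ranks 1 and 3.
toy_ranking = [True, False, True, False]
toy_ap = average_precision(toy_ranking)
```

For "same ORF, different well" a high value is desirable; for "same well, different ORF" the same score should ideally stay low.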
We argued that regression should be done as early as possible in the processing pipeline to capture relationships between variables. Adjusting for plate-to-plate variation before regression does not seem desirable (not yet thought through too deeply).
On Alex's list is to expand the evaluation from the 37 duplicated ORF reagents (because who knows, perhaps there is something systematically weird about these reagents) to same-gene ORF reagents and even to same-GO-term reagents. It occurred to me that perhaps we ought to be using MOA matching in Target2 plates in each batch as another evaluation? It's compounds rather than ORFs but many technical variations/artifacts should have affected them the same (except the initial steps of virus production, etc) |
Update 6/6
We have previously said that
After we discussed normalization, I updated the plate visualizations of feature means above using normalized data. I also made additional per-position visualizations. We discussed with @shntnu that cell count adjustment seems to mostly reduce the top-down effect, but not the left-right one.
Since features are comparable when normalized, I also made a plot for all features averaged per well position:
I also made similar visualizations per feature type:
Cell features
Nuclei features
Cytoplasm features
Image features
Given that there aren't many more unique gene symbols vs ORFs, plots and metrics for
PCA results
|
@alxndrkalinin Do you mean the code doesn't separate the profiles by plates? If so, it is true that pycytominer does not normalize by plate in the typical profiling workflow. We separate the profiles by plate before doing |
Update 6/12
I changed pre-processing to perform normalization per-plate and recalculated all downstream results. Note, that doing so resulted in some
I updated plate visualizations of feature means using per-plate normalized data.
All features averaged
Cell features
Nuclei features
Cytoplasm features
Image features
WDN: whole dataset normalization
3.1 Same well, different ORF/gene (lower is better, N=368)
3.2v1 Same ORF, different well (higher is better, N=37)
3.2v2 Same gene, different well (higher is better, N=264)
3.3 Same ORF/gene, same well (expected to drop, but not too much, N=3653)
Other experiments included PCA with and without image features and cosine kernel PCA, but did not show improvement over the results above.
In the See an example
Whereas in the
An example of a gene with all replicates in the same batch
An example of a gene with 3 different batches and 4 well positions
|
(DRAFT) Testing DINO features here: https://github.com/jump-cellpainting/morphmap/issues/91#issuecomment-1595712135 |
1. Motivation
Exploratory visual QC (#7) and retrievability metrics (#12) analyses showed that: (1) there are patterns in cell count variation across well positions / plates / batches, and (2) this variation is related to the ability to retrieve ORF replicates, i.e. ORFs with high cell count variability tend to have lower mAP values.
2. Approach
To address that, we explored regressing out cell counts from other features and recalculating retrievability metrics to measure the effect of this correction. As the first step, we added cell count as a feature by aggregating all of the metadata early in the preprocessing pipeline (d91cbd5). Then, for each feature, we fit a linear model that predicts the feature from cell count, and replace the actual feature values with the residuals from this model.
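A minimal sketch of regressing out cell count from a single feature is shown below. It assumes an ordinary least squares fit of the form feature ~ intercept + slope * cell_count, so that the residuals stay on the feature's scale. The helper name `regress_out` and the toy values are hypothetical and are not taken from the pipeline code.

```python
from statistics import fmean


def regress_out(feature, cell_count):
    """Return residuals of an OLS fit: feature ~ intercept + slope * cell_count."""
    x_mean = fmean(cell_count)
    y_mean = fmean(feature)
    sxx = sum((x - x_mean) ** 2 for x in cell_count)
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(cell_count, feature))
    slope = sxy / sxx if sxx else 0.0
    intercept = y_mean - slope * x_mean
    # The residual is the part of the feature not explained by cell count.
    return [y - (intercept + slope * x) for x, y in zip(cell_count, feature)]


# Toy data (made up): a feature that partly tracks cell count, plus noise.
cell_count = [120, 250, 310, 90, 400, 180]
noise = [0.05, -0.02, 0.01, -0.04, 0.03, -0.03]
feature = [0.8 + 0.002 * c + e for c, e in zip(cell_count, noise)]

residuals = regress_out(feature, cell_count)
```

By construction, OLS residuals have zero mean and zero sample covariance with cell count, which is what "regressing out" means here.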
2.1 Constant and low count features
Because plate effect correction is the first step in the preprocessing pipeline, all features are still present in the dataset, including those that have constant values across all samples (e.g. min/max intensity value can be 0/65535). When fitting a linear model to these features, the resulting residuals are not exactly zero due to rounding. Instead, they are equal to some small numbers, which can correlate well with cell count, producing the opposite of the desired effect.
Effects of regressing out cell count on a constant feature
We visualized the number of unique values per feature vs correlation to cell count to confirm that no features with fewer than a few hundred unique values have high correlations with cell count. Based on this result, we only regress out cell count from features that have more than 100 unique values. One idea we did not explore is whether it would help to skip the regression for features that are not highly correlated with cell count in the first place.
Visualizing # of unique feature values vs cell count
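The unique-value rule described above can be sketched as a guard around the regression step. This is an illustration under stated assumptions: the helper names (`ols_residuals`, `adjust_feature`), the `min_unique` parameter, and the toy data are hypothetical. Only the threshold of 100 unique values comes from the text.

```python
from statistics import fmean


def ols_residuals(feature, cell_count):
    """Residuals of an OLS fit: feature ~ intercept + slope * cell_count."""
    x_mean = fmean(cell_count)
    y_mean = fmean(feature)
    sxx = sum((x - x_mean) ** 2 for x in cell_count)
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(cell_count, feature))
    slope = sxy / sxx
    intercept = y_mean - slope * x_mean
    return [y - (intercept + slope * x) for x, y in zip(cell_count, feature)]


def adjust_feature(feature, cell_count, min_unique=100):
    """Regress out cell count only if the feature has more than min_unique unique values."""
    if len(set(feature)) <= min_unique:
        # Constant or low-count feature: leave untouched so that rounding-level
        # residuals cannot pick up a spurious correlation with cell count.
        return list(feature)
    return ols_residuals(feature, cell_count)


# Toy data (made up): 200 wells, one constant feature and one varied feature.
cell_count = [100 + 3 * i for i in range(200)]
constant = [65535.0] * 200
varied = [0.01 * c + (i % 7) * 0.5 + i * 1e-3 for i, c in enumerate(cell_count)]

constant_adjusted = adjust_feature(constant, cell_count)  # returned unchanged
varied_adjusted = adjust_feature(varied, cell_count)      # replaced by residuals
```

Skipping the fit entirely for such features avoids depending on how rounding behaves in a particular linear algebra routine.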
2.2 Adding cell count back as a feature
After regressing out cell count, we can add cell count back as a separate feature. However, we found that it is later filtered out at the `feature_select` step of the pipeline. The reason is that, as an integer count feature, cell count has a unique values / sample size ratio of ~0.06 (see visualization below), which is below the cutoff value `unique_cut=0.1` that is used as one of the criteria to filter out low-variance features in pycytominer. It turns out that earlier versions of pycytominer had a more relaxed cutoff value of `0.01`, which was later replaced by `0.1`, probably because of a typo (see cytomining/pycytominer#282). To prevent cell count from being removed by this criterion, we use `feature_selection` with `unique_cut=0.01`, as per the original pycytominer default value. This results in a different number of features selected from any subset, so we reran preprocessing for all uncorrected and cc-adjusted subsets.
Cell count unique values / sample size ratio
3. Results
3.1 Same well, different ORF
Same well, different ORF plots
3.2 Same ORF, different well
Same ORF, different well plots
3.3 Same ORF, same well
Same ORF, same well plots
Observations: