Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Regressing out cell count #15

Open
alxndrkalinin opened this issue May 23, 2023 · 11 comments
Open

Regressing out cell count #15

alxndrkalinin opened this issue May 23, 2023 · 11 comments

Comments

@alxndrkalinin
Copy link
Member

alxndrkalinin commented May 23, 2023

1. Motivation

Exploratory visual QC (#7) and retrievability metrics (#12) analyses showed that: (1) there are patterns in cell count variation across well positions / plates / batches, and (2) this variation has a relationship with an ability to retrieve ORF replicate, i.e. ORFs with high cell count variability tend to have lower mAP values.

2. Approach

To address that, we explored regressing out cell counts from other features and recalculating the effect of this correction on retrievability metrics. As the first step, we added cell count as a feature by aggregating all of the metadata early in the preprocessing pipeline (d91cbd5). Then, for each feature, we fit a linear model to predict cell count from this feature, and replace actual feature values with residuals from this model.

2.1 Constant and low count features

Because plate effect correction is the first step in the preprocessing pipeline, all features are present in the dataset, including those that have constant values across all samples (e.g. min/max intensity value can be 0/65535). When fitting a linear model using these features, resulting residuals are not exactly zero due to rounding. Instead, they're equal to some small numbers, which can correlate well with cell count, producing the effect opposite to desired.

Effects of regressing out cell count on a constant feature
Before After
before after

We visualized the number of unique values per feature vs correlation to cell count to confirm that no features with less than a few hundred unique values have high correlations with cell count. Based on this result, we only regress out cell count from features that have more than 100 unique values. One idea we did not explore is whether it'd help to not regress cell count from features that are not highly correlated with cell count in the first place.

Visualizing # of unique feature values vs cell count

unique_vs_cc

2.2 Adding cell count back as a feature

After regressing out cell count, we can add cell count as a separate feature. However, we found out that it is later filtered out at the feature_select step of the pipeline. The reason for that is that as a integer count feature, cell type has a unique values / sample size ratio ~0.06 (see visualization below), which is below the cutoff value unique_cut=0.1 that is used as one of the criteria to filter out low variance features in pycytominer. Turns out, earlier versions of pycytominer had a more relaxed cutoff value of 0.01, which later was replaced by 0.1, probably because of a typo (see cytomining/pycytominer#282). To prevent cell count being remove by this criterion, we use feature_selection with unique_cut=0.01, as per original pycytominer default value. This results in a different number of features selected from any subset, so we reran preprocessing for all uncorrected and cc-adjusted subsets.

Cell count unique values / sample size ratio

unique_size_ratio

3. Results

3.1 Same well, different ORF

Setting Data mmAP Fraction retrieved (p<0.05)
same well, diff ORF raw->subset 0.0636 0.139 (51/368)
same well, diff ORF raw->subset->cc adjust 0.0583 0.0217 (8/368)
same well, diff ORF raw->subset->well correct 0.114 0.25 (92/368)
same well, diff ORF raw->subset->cc adjust->well correct 0.379 0.723 (266/368)
Same well, different ORF plots

same_well_diff_orf

3.2 Same ORF, different well

Setting Data mmAP Fraction retrieved (p<0.05)
same ORF, diff well raw->subset 0.00974 0.0 (0/37)
same ORF, diff well raw->subset->cc adjust 0.0202 0.027 (1/37)
same ORF, diff well raw->subset->well correct 0.0166 0.027 (1/37)
same ORF, diff well raw->subset->cc adjust->well correct 0.00834 0.0 (0/37)
Same ORF, different well plots

same_orf_diff_well

3.2 Same ORF, same well

Setting Data mmAP Fraction retrieved (p<0.05)
same ORF, same well raw->subset 0.195 0.903 (3297/3653)
same ORF, same well raw->subset->cc adjust 0.0856 0.417 (1524/3653)
same ORF, same well raw->subset->well correct 0.286 0.93 (3397/3653)
same ORF, same well raw->subset->cc adjust->well correct 0.538 0.989 (3612/3653)
Same ORF, same well plots

same_orf_diff_well

Observations:

  • cc adjustment corrects for plate effects better than well mean correction
  • a combination of both actually makes things WORSE (perhaps, due to an overcorrection)
@alxndrkalinin alxndrkalinin changed the title Regressing out cel count Regressing out cell count May 23, 2023
@alxndrkalinin
Copy link
Member Author

Update 5/31: results and visualizations were re-generated after removing some data after QC and making sure index isn't used as a feature when calculating metrics (see #16).

Overall, the conclusions hold: cell count adjustment seems to help, while subtracting well mean on top of that can hurt the replicate retrieval performance.

cc @shntnu @AnneCarpenter

@AnneCarpenter
Copy link

AnneCarpenter commented May 31, 2023 via email

@alxndrkalinin
Copy link
Member Author

alxndrkalinin commented Jun 2, 2023

Here are visualizations of mean values across the whole ORF dataset, for Cells_AreaShape_Area:

Raw profiles Cell count adjusted
Well position corrected CC adjust + well pos correct

for Cells_Intensity_MeanIntensity_AGP:

Raw profiles Cell count adjusted
Well position corrected CC adjust + well pos correct

Additionally, there are visualizations for Image_Threshold_SumOfEntropies_CellsIncludingEdges, which originally had high correlation with cell count:

  • before adjustment: 0.719
  • after adjustment: 1.77e-14
Raw profiles Cell count adjusted
Well position corrected CC adjust + well pos correct

@AnneCarpenter
Copy link

Questions from Arnaud

  • I am still wondering the impact of the regression part on classification/clustering algorithms. Did u quantify that with some ad-hoc generated data.
  • Also I am surprised that this step isn’t done after normalization (any reasons?).
  • For the correlation threshold to use to select features that will go through this process, u are choosing 0.4?

For the first Q: we've indeed been using something downstream for evaluation: ability to retrieve the same ORF reagent in a different well position (and usually different batch) - which we want to be high. And, ability to retrieve a different ORF in the same well position - which we want to be low. So far our results look bad and confusing and we are trying to figure out what is going on! perhaps @alxndrkalinin can address your other two Qs.

@shntnu
Copy link
Collaborator

shntnu commented Jun 5, 2023

  • Also I am surprised that this step isn’t done after normalization (any reasons?).

We argued that regression should be done as early in the processing pipeline to capture relationships between variables. Adjusting for plate-to-plate variation before regression does not seem desirable (not thought this through too deeply)

@AnneCarpenter
Copy link

On Alex's list is to expand the evaluation from the 37 duplicated ORF reagents (because who knows, perhaps there is something systematically weird about these reagents) to same-gene ORF reagents and even to same-GO-term reagents. It occurred to me that perhaps we ought to be using MOA matching in Target2 plates in each batch as another evaluation? It's compounds rather than ORFs but many technical variations/artifacts should have affected them the same (except the initial steps of virus production, etc)

@alxndrkalinin
Copy link
Member Author

Update 6/6

  • Normalization

We have previously said that RobustMAD normalizes per-plate. I just checked and it does not seem to be the case. Is that something we want to reconsider or is it not super important? (cc @niranjchandrasekaran @johnarevalo who're also using it)

  • Per-position mean feature visualizations

After we discussed normalization, I updated plate visualizations of feature means above using normalized data. I also made additional per-position visualizations.

We discussed with @shntnu that cell count adjustment seems to mostly reduce top-down effect, but not the left-right one.

Cell Counts - Raw profiles Cell Counts - Well position corrected

Since features are comparable when normalized, I also made plot for all features averaged per well position:

All features - Raw profiles All features - Cell count adjusted
All features - Position corrected All features - CC adjusted and position corrected

I also made similar visualization per feature type:

Cell features
Cell features - Raw profiles Cell features - Cell count adjusted
Cell features - Position corrected Cell features - CC adjusted and position corrected
Nuclei features
Nuclei features - Raw profiles Nuclei features - Cell count adjusted
Nuclei features - Position corrected Nuclei features - CC adjusted and position corrected
Cytoplasm features
Cytoplasm features - Raw profiles Cytoplasm features - Cell count adjusted
Cytoplasm features - Position corrected Cytoplasm features - CC adjusted and position corrected
Image features
Image features - Raw profiles Image features - Cell count adjusted
Image features - Position corrected Image features - CC adjusted and position corrected
  • Retrieving same gene in different wells (instead of ORFs)

Given that there aren't many more unique gene symbols vs ORFs, plots and metrics for Same Well – Different ORF & Same Well – Same ORF remain the same as above.

  • Replacing ORFs with gene symbols, increases sample size, but produces metric contradictory to our previous conclusions
Setting Data mmAP Fraction retrieved (p<0.05)
same gene, diff well raw->subset 0.0415 0.273 (72/264)
same gene, diff well raw->subset->cc adjust 0.0262 0.121 (32/264)
same gene, diff well raw->subset->well correct 0.0367 0.284 (75/264)
same gene, diff well raw->subset->cc adjust->well correct 0.0278 0.152 (40/264)
Plots

same_gene_diff_well

  • Removing Image features (per convo w/ @anne) improves retrieval
Setting Data mmAP Fraction retrieved (p<0.05)
same gene, diff well raw->subset 0.0475 0.333 (88/264)
same gene, diff well raw->subset->cc adjust 0.0311 0.155 (41/264)
same gene, diff well raw->subset->well correct 0.0432 0.33 (87/264)
same gene, diff well raw->subset->cc adjust->well correct 0.0304 0.152 (40/264)
Plots

same_orf_diff_well

  • Preliminary results with PCA haven't showed any improvement in retrieval
PCA results
Setting Data mmAP Fraction retrieved (p<0.05)
same gene, diff well raw->subset 0.0338 0.227 (60/264)
same gene, diff well raw->subset->cc adjust 0.015 0.00379 (1/264)
same gene, diff well raw->subset->well correct 0.0187 0.0606 (16/264)
same gene, diff well raw->subset->cc adjust->well correct 0.0189 0.0833 (22/264)

same_orf_diff_well

@niranjchandrasekaran
Copy link
Member

We have previously said that RobustMAD normalizes per-plate. I just checked and it does not seem to be the case.

@alxndrkalinin Do you mean the code doesn't separate the profiles by plates? If so, it is true that pycytominer does not normalize by plate in the typical profiling workflow. We separate the profiles by plate before doing RobustMAD normalization.

@alxndrkalinin
Copy link
Member Author

Update 6/12

  • 1. Normalization

I changed pre-processing to perform normalization per-plate and recalculated all downstream results. Note, that doing so resulted in some Inf values after RobustMAD because some Cytoplasm features have MAD=0 and our default epsilon=0.0. For plate visualizations, I removed those features, which should probably be done in preprocessing, so that's a TODO item.

  • 2. Per-position mean feature visualizations

I updated plate visualizations of feature means using per-plate normalized data.

Cell Count
Cell Counts - Raw profiles Cell Counts - Well position corrected
All features averaged
All features - Raw profiles All features - Cell count adjusted
All features - Position corrected All features - CC adjusted and position corrected
Cell features
Cell features - Raw profiles Cell features - Cell count adjusted
Cell features - Position corrected Cell features - CC adjusted and position corrected
Nuclei features
Nuclei features - Raw profiles Nuclei features - Cell count adjusted
Nuclei features - Position corrected Nuclei features - CC adjusted and position corrected
Cytoplasm features
Cytoplasm features - Raw profiles Cytoplasm features - Cell count adjusted
Cytoplasm features - Position corrected Cytoplasm features - CC adjusted and position corrected
Image features
Image features - Raw profiles Image features - Cell count adjusted
Image features - Position corrected Image features - CC adjusted and position corrected
  • 3. Retrieving ORFs/genes with pre-plate normalization

WDN–whole dataset normalization
PPN-IM–pre-plate normalization with image features
PPN-NI–per-plate normalization without image features
FR–fraction retrieved

3.1 Same well, different ORF/gene (lower is better, N=368)

Setting Data mmAP WDN FR WDN (p<0.05) mmAP PPN-IM FR PPN-IM (p<0.05) mmAP PPN-NI FR PPN-NI (p<0.05)
same well, diff ORF raw->subset 0.0636 0.139 0.148 0.899 0.164 0.938
same well, diff ORF raw->subset->cc adjust 0.0583 0.0217 0.0919 0.543 0.0974 0.625
same well, diff ORF raw->subset->well correct 0.114 0.25 0.241 0.864 0.184 0.69
same well, diff ORF raw->subset->cc adjust->well correct 0.379 0.723 0.391 0.959 0.362 0.886

3.2v1 Same ORF, different well (higher is better, N=37)

Setting Data mmAP WDN Fraction WDN (p<0.05) mmAP PPN-IM Fraction PPN-IM (p<0.05)
same ORF, different well raw->subset 0.00974 0.0 (0/37) 0.0341 0.0811 (3/37)
same ORF, different well raw->subset->cc adjust 0.0202 0.027 (1/37) 0.0314 0.0541 (2/37)
same ORF, different well raw->subset->well correct 0.0166 0.027 (1/37) 0.0505 0.108 (4/37)
same ORF, different well raw->subset->cc adjust->well correct 0.00834 0.0 (0/37) 0.0274 0.027 (1/37)

3.2v2 Same gene, different well (higher is better, N=264)

Setting Data mmAP WDN FR WDN (p<0.05) mmAP PPN-IM FR PPN-IM (p<0.05) mmAP PPN-NI FR PPN-NI (p<0.05)
same gene, different well raw->subset 0.0415 0.273 0.027 0.102 0.0351 0.121
same gene, different well raw->subset->cc adjust 0.0262 0.121 0.0233 0.053 0.0257 0.0682
same gene, different well raw->subset->well correct 0.0367 0.284 0.027 0.0606 0.0307 0.072
same gene, different well raw->subset->cc adjust->well correct 0.0278 0.152 0.0225 0.0758 0.0225 0.0758

3.3 Same ORF/gene, same well (expected to drop, but not too much, N=3653)

Setting Data mmAP WDN Fraction WDN (p<0.05) mmAP PPN-IM Fraction PPN-IM (p<0.05) mmAP PPN-NI Fraction PPN-NI (p<0.05)
same ORF, same well raw->subset 0.195 0.903 0.211 0.778 0.232 0.79
same ORF, same well raw->subset->cc adjust 0.0856 0.417 0.0863 0.346 0.102 0.405
same ORF, same well raw->subset->well correct 0.286 0.93 0.307 0.857 0.268 0.794
same ORF, same well raw->subset->cc adjust->well correct 0.538 0.989 0.412 0.929 0.396 0.903
  • 3. Misc

Other experiments included PCA with and w/o image features and Cosine kernel PCA, but did not show improvement over the results above.

@alxndrkalinin
Copy link
Member Author

alxndrkalinin commented Jun 13, 2023

In the 3.2v1 Same ORF, different well (higher is better, N=37) scenario, all 37 ORFs have 5x same-well replicates in one batch and 5x same-well replicates in another.

See an example
JCP2022_900041
          Metadata_Batch Metadata_Plate Metadata_Well
6670   2021_06_14_Batch6     BR00123945           F08
8794   2021_06_14_Batch6     BR00124766           F08
9148   2021_06_14_Batch6     BR00124767           F08
9502   2021_06_14_Batch6     BR00124768           F08
9856   2021_06_14_Batch6     BR00124769           F08
10921  2021_07_12_Batch8     BR00124787           H13
11281  2021_07_12_Batch8     BR00124788           H13
12703  2021_07_12_Batch8     BR00125619           H13
13062  2021_07_12_Batch8     BR00125620           H13
13422  2021_07_12_Batch8     BR00125621           H13

Whereas in the 3.2v2 Same gene, different well (higher is better, N=264) scenario, 33% of genes (89/264) come from a single batch and 1% (3/264) are from 3 batches:

# unique batches # unique gene symbols mmAP Fraction retrieved (p<0.05)
2 172 0.0254 0.038 (10 / 264)
1 89 0.0300 0.06 (16 / 264)
3 3 0.0256 0.004 (1 / 264)
An example of a gene with all replicates in the same batch
ADGRG7
      Metadata_JCP2022      Metadata_Batch Metadata_Plate Metadata_Well
17443   JCP2022_904157  2021_08_09_Batch11     BR00126538           H01
17539   JCP2022_908625  2021_08_09_Batch11     BR00126538           P01
17800   JCP2022_904157  2021_08_09_Batch11     BR00126539           H01
17896   JCP2022_908625  2021_08_09_Batch11     BR00126539           P01
18157   JCP2022_904157  2021_08_09_Batch11     BR00126540           H01
18253   JCP2022_908625  2021_08_09_Batch11     BR00126540           P01
18514   JCP2022_904157  2021_08_09_Batch11     BR00126541           H01
18610   JCP2022_908625  2021_08_09_Batch11     BR00126541           P01
18871   JCP2022_904157  2021_08_09_Batch11     BR00126542           H01
18967   JCP2022_908625  2021_08_09_Batch11     BR00126542           P01
An example of a gene with 3 different batches and 4 well positions
ATM
     Metadata_JCP2022     Metadata_Batch Metadata_Plate Metadata_Well
1942   JCP2022_909964  2021_05_31_Batch2     BR00121552           D23
2210   JCP2022_909964  2021_05_31_Batch2     BR00121553           D23
2478   JCP2022_909964  2021_05_31_Batch2     BR00121554           D23
2746   JCP2022_909964  2021_05_31_Batch2     BR00121555           D23
3014   JCP2022_909964  2021_05_31_Batch2     BR00121556           D23
5109   JCP2022_909963  2021_05_17_Batch4     BR00123785           K01
5464   JCP2022_909963  2021_05_17_Batch4     BR00123786           K01
5819   JCP2022_909963  2021_05_17_Batch4     BR00123787           K01
6174   JCP2022_909963  2021_05_17_Batch4     BR00123790           K01
6529   JCP2022_909963  2021_05_17_Batch4     BR00123791           K01
6705   JCP2022_909963  2021_06_14_Batch6     BR00123945           D18
6706   JCP2022_909964  2021_06_14_Batch6     BR00123945           N22
8829   JCP2022_909963  2021_06_14_Batch6     BR00124766           D18
8830   JCP2022_909964  2021_06_14_Batch6     BR00124766           N22
9183   JCP2022_909963  2021_06_14_Batch6     BR00124767           D18
9184   JCP2022_909964  2021_06_14_Batch6     BR00124767           N22
9537   JCP2022_909963  2021_06_14_Batch6     BR00124768           D18
9538   JCP2022_909964  2021_06_14_Batch6     BR00124768           N22
9891   JCP2022_909963  2021_06_14_Batch6     BR00124769           D18
9892   JCP2022_909964  2021_06_14_Batch6     BR00124769           N22

@shntnu
Copy link
Collaborator

shntnu commented Jun 17, 2023

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants