Regressing out cell count #15

alxndrkalinin · 2023-05-23T18:29:35Z

1. Motivation

Exploratory visual QC (#7) and retrievability metrics (#12) analyses showed that: (1) there are patterns in cell count variation across well positions / plates / batches, and (2) this variation has a relationship with an ability to retrieve ORF replicate, i.e. ORFs with high cell count variability tend to have lower mAP values.

2. Approach

To address that, we explored regressing out cell counts from other features and recalculating the effect of this correction on retrievability metrics. As the first step, we added cell count as a feature by aggregating all of the metadata early in the preprocessing pipeline (d91cbd5). Then, for each feature, we fit a linear model to predict cell count from this feature, and replace actual feature values with residuals from this model.

2.1 Constant and low count features

Because plate effect correction is the first step in the preprocessing pipeline, all features are present in the dataset, including those that have constant values across all samples (e.g. min/max intensity value can be 0/65535). When fitting a linear model using these features, resulting residuals are not exactly zero due to rounding. Instead, they're equal to some small numbers, which can correlate well with cell count, producing the effect opposite to desired.

Effects of regressing out cell count on a constant feature

Before	After

We visualized the number of unique values per feature vs correlation to cell count to confirm that no features with less than a few hundred unique values have high correlations with cell count. Based on this result, we only regress out cell count from features that have more than 100 unique values. One idea we did not explore is whether it'd help to not regress cell count from features that are not highly correlated with cell count in the first place.

Visualizing # of unique feature values vs cell count

2.2 Adding cell count back as a feature

After regressing out cell count, we can add cell count as a separate feature. However, we found out that it is later filtered out at the feature_select step of the pipeline. The reason for that is that as a integer count feature, cell type has a unique values / sample size ratio ~0.06 (see visualization below), which is below the cutoff value unique_cut=0.1 that is used as one of the criteria to filter out low variance features in pycytominer. Turns out, earlier versions of pycytominer had a more relaxed cutoff value of 0.01, which later was replaced by 0.1, probably because of a typo (see cytomining/pycytominer#282). To prevent cell count being remove by this criterion, we use feature_selection with unique_cut=0.01, as per original pycytominer default value. This results in a different number of features selected from any subset, so we reran preprocessing for all uncorrected and cc-adjusted subsets.

Cell count unique values / sample size ratio

3. Results

3.1 Same well, different ORF

Setting	Data	mmAP	Fraction retrieved (p<0.05)
same well, diff ORF	raw->subset	0.0636	0.139 (51/368)
same well, diff ORF	raw->subset->cc adjust	0.0583	0.0217 (8/368)
same well, diff ORF	raw->subset->well correct	0.114	0.25 (92/368)
same well, diff ORF	raw->subset->cc adjust->well correct	0.379	0.723 (266/368)

Same well, different ORF plots

3.2 Same ORF, different well

Setting	Data	mmAP	Fraction retrieved (p<0.05)
same ORF, diff well	raw->subset	0.00974	0.0 (0/37)
same ORF, diff well	raw->subset->cc adjust	0.0202	0.027 (1/37)
same ORF, diff well	raw->subset->well correct	0.0166	0.027 (1/37)
same ORF, diff well	raw->subset->cc adjust->well correct	0.00834	0.0 (0/37)

Same ORF, different well plots

3.2 Same ORF, same well

Setting	Data	mmAP	Fraction retrieved (p<0.05)
same ORF, same well	raw->subset	0.195	0.903 (3297/3653)
same ORF, same well	raw->subset->cc adjust	0.0856	0.417 (1524/3653)
same ORF, same well	raw->subset->well correct	0.286	0.93 (3397/3653)
same ORF, same well	raw->subset->cc adjust->well correct	0.538	0.989 (3612/3653)

Same ORF, same well plots

Observations:

cc adjustment corrects for plate effects better than well mean correction
a combination of both actually makes things WORSE (perhaps, due to an overcorrection)

The text was updated successfully, but these errors were encountered:

alxndrkalinin · 2023-05-31T22:07:49Z

Update 5/31: results and visualizations were re-generated after removing some data after QC and making sure index isn't used as a feature when calculating metrics (see #16).

Overall, the conclusions hold: cell count adjustment seems to help, while subtracting well mean on top of that can hurt the replicate retrieval performance.

cc @shntnu @AnneCarpenter

AnneCarpenter · 2023-05-31T23:08:03Z

🎉 Thanks for getting that done lightning fast! Can we visualize what the pattern looks like for the mean in each well position? I’m curious if it’s the left right pattern. It’s surprising to me that this correction worsens outcomes! -- Sent from my mobile phone

alxndrkalinin · 2023-06-02T19:52:54Z

Here are visualizations of mean values across the whole ORF dataset, for Cells_AreaShape_Area:

Raw profiles	Cell count adjusted

Well position corrected	CC adjust + well pos correct

for Cells_Intensity_MeanIntensity_AGP:

Raw profiles	Cell count adjusted

Well position corrected	CC adjust + well pos correct

Additionally, there are visualizations for Image_Threshold_SumOfEntropies_CellsIncludingEdges, which originally had high correlation with cell count:

before adjustment: 0.719
after adjustment: 1.77e-14

Raw profiles	Cell count adjusted

Well position corrected	CC adjust + well pos correct

AnneCarpenter · 2023-06-05T12:20:28Z

Questions from Arnaud

I am still wondering the impact of the regression part on classification/clustering algorithms. Did u quantify that with some ad-hoc generated data.
Also I am surprised that this step isn’t done after normalization (any reasons?).
For the correlation threshold to use to select features that will go through this process, u are choosing 0.4?

For the first Q: we've indeed been using something downstream for evaluation: ability to retrieve the same ORF reagent in a different well position (and usually different batch) - which we want to be high. And, ability to retrieve a different ORF in the same well position - which we want to be low. So far our results look bad and confusing and we are trying to figure out what is going on! perhaps @alxndrkalinin can address your other two Qs.

shntnu · 2023-06-05T18:51:49Z

Also I am surprised that this step isn’t done after normalization (any reasons?).

We argued that regression should be done as early in the processing pipeline to capture relationships between variables. Adjusting for plate-to-plate variation before regression does not seem desirable (not thought this through too deeply)

AnneCarpenter · 2023-06-06T12:20:41Z

On Alex's list is to expand the evaluation from the 37 duplicated ORF reagents (because who knows, perhaps there is something systematically weird about these reagents) to same-gene ORF reagents and even to same-GO-term reagents. It occurred to me that perhaps we ought to be using MOA matching in Target2 plates in each batch as another evaluation? It's compounds rather than ORFs but many technical variations/artifacts should have affected them the same (except the initial steps of virus production, etc)

alxndrkalinin · 2023-06-07T04:02:45Z

Update 6/6

Normalization

We have previously said that RobustMAD normalizes per-plate. I just checked and it does not seem to be the case. Is that something we want to reconsider or is it not super important? (cc @niranjchandrasekaran @johnarevalo who're also using it)

Per-position mean feature visualizations

After we discussed normalization, I updated plate visualizations of feature means above using normalized data. I also made additional per-position visualizations.

We discussed with @shntnu that cell count adjustment seems to mostly reduce top-down effect, but not the left-right one.

Cell Counts - Raw profiles	Cell Counts - Well position corrected

Since features are comparable when normalized, I also made plot for all features averaged per well position:

All features - Raw profiles	All features - Cell count adjusted

All features - Position corrected	All features - CC adjusted and position corrected

I also made similar visualization per feature type:

Cell features

Cell features - Raw profiles	Cell features - Cell count adjusted

Cell features - Position corrected	Cell features - CC adjusted and position corrected

Nuclei features

Nuclei features - Raw profiles	Nuclei features - Cell count adjusted

Nuclei features - Position corrected	Nuclei features - CC adjusted and position corrected

Cytoplasm features

Cytoplasm features - Raw profiles	Cytoplasm features - Cell count adjusted

Cytoplasm features - Position corrected	Cytoplasm features - CC adjusted and position corrected

Image features

Image features - Raw profiles	Image features - Cell count adjusted

Image features - Position corrected	Image features - CC adjusted and position corrected

Retrieving same gene in different wells (instead of ORFs)

Given that there aren't many more unique gene symbols vs ORFs, plots and metrics for Same Well – Different ORF & Same Well – Same ORF remain the same as above.

Replacing ORFs with gene symbols, increases sample size, but produces metric contradictory to our previous conclusions

Setting	Data	mmAP	Fraction retrieved (p<0.05)
same gene, diff well	raw->subset	0.0415	0.273 (72/264)
same gene, diff well	raw->subset->cc adjust	0.0262	0.121 (32/264)
same gene, diff well	raw->subset->well correct	0.0367	0.284 (75/264)
same gene, diff well	raw->subset->cc adjust->well correct	0.0278	0.152 (40/264)

Plots

Removing Image features (per convo w/ @anne) improves retrieval

Setting	Data	mmAP	Fraction retrieved (p<0.05)
same gene, diff well	raw->subset	0.0475	0.333 (88/264)
same gene, diff well	raw->subset->cc adjust	0.0311	0.155 (41/264)
same gene, diff well	raw->subset->well correct	0.0432	0.33 (87/264)
same gene, diff well	raw->subset->cc adjust->well correct	0.0304	0.152 (40/264)

Plots

Preliminary results with PCA haven't showed any improvement in retrieval

PCA results

Setting	Data	mmAP	Fraction retrieved (p<0.05)
same gene, diff well	raw->subset	0.0338	0.227 (60/264)
same gene, diff well	raw->subset->cc adjust	0.015	0.00379 (1/264)
same gene, diff well	raw->subset->well correct	0.0187	0.0606 (16/264)
same gene, diff well	raw->subset->cc adjust->well correct	0.0189	0.0833 (22/264)

niranjchandrasekaran · 2023-06-07T12:12:35Z

We have previously said that RobustMAD normalizes per-plate. I just checked and it does not seem to be the case.

@alxndrkalinin Do you mean the code doesn't separate the profiles by plates? If so, it is true that pycytominer does not normalize by plate in the typical profiling workflow. We separate the profiles by plate before doing RobustMAD normalization.

alxndrkalinin · 2023-06-13T03:46:37Z

Update 6/12

1. Normalization

I changed pre-processing to perform normalization per-plate and recalculated all downstream results. Note, that doing so resulted in some Inf values after RobustMAD because some Cytoplasm features have MAD=0 and our default epsilon=0.0. For plate visualizations, I removed those features, which should probably be done in preprocessing, so that's a TODO item.

2. Per-position mean feature visualizations

I updated plate visualizations of feature means using per-plate normalized data.

Cell Count

Cell Counts - Raw profiles	Cell Counts - Well position corrected

All features averaged

All features - Raw profiles	All features - Cell count adjusted

All features - Position corrected	All features - CC adjusted and position corrected

Cell features

Cell features - Raw profiles	Cell features - Cell count adjusted

Cell features - Position corrected	Cell features - CC adjusted and position corrected

Nuclei features

Nuclei features - Raw profiles	Nuclei features - Cell count adjusted

Nuclei features - Position corrected	Nuclei features - CC adjusted and position corrected

Cytoplasm features

Cytoplasm features - Raw profiles	Cytoplasm features - Cell count adjusted

Cytoplasm features - Position corrected	Cytoplasm features - CC adjusted and position corrected

Image features

Image features - Raw profiles	Image features - Cell count adjusted

Image features - Position corrected	Image features - CC adjusted and position corrected

3. Retrieving ORFs/genes with pre-plate normalization

WDN–whole dataset normalization
PPN-IM–pre-plate normalization with image features
PPN-NI–per-plate normalization without image features
FR–fraction retrieved

3.1 Same well, different ORF/gene (lower is better, N=368)

Setting	Data	mmAP WDN	FR WDN (p<0.05)	mmAP PPN-IM	FR PPN-IM (p<0.05)	mmAP PPN-NI	FR PPN-NI (p<0.05)
same well, diff ORF	raw->subset	0.0636	0.139	0.148	0.899	0.164	0.938
same well, diff ORF	raw->subset->cc adjust	0.0583	0.0217	0.0919	0.543	0.0974	0.625
same well, diff ORF	raw->subset->well correct	0.114	0.25	0.241	0.864	0.184	0.69
same well, diff ORF	raw->subset->cc adjust->well correct	0.379	0.723	0.391	0.959	0.362	0.886

3.2v1 Same ORF, different well (higher is better, N=37)

Setting	Data	mmAP WDN	Fraction WDN (p<0.05)	mmAP PPN-IM	Fraction PPN-IM (p<0.05)
same ORF, different well	raw->subset	0.00974	0.0 (0/37)	0.0341	0.0811 (3/37)
same ORF, different well	raw->subset->cc adjust	0.0202	0.027 (1/37)	0.0314	0.0541 (2/37)
same ORF, different well	raw->subset->well correct	0.0166	0.027 (1/37)	0.0505	0.108 (4/37)
same ORF, different well	raw->subset->cc adjust->well correct	0.00834	0.0 (0/37)	0.0274	0.027 (1/37)

3.2v2 Same gene, different well (higher is better, N=264)

Setting	Data	mmAP WDN	FR WDN (p<0.05)	mmAP PPN-IM	FR PPN-IM (p<0.05)	mmAP PPN-NI	FR PPN-NI (p<0.05)
same gene, different well	raw->subset	0.0415	0.273	0.027	0.102	0.0351	0.121
same gene, different well	raw->subset->cc adjust	0.0262	0.121	0.0233	0.053	0.0257	0.0682
same gene, different well	raw->subset->well correct	0.0367	0.284	0.027	0.0606	0.0307	0.072
same gene, different well	raw->subset->cc adjust->well correct	0.0278	0.152	0.0225	0.0758	0.0225	0.0758

3.3 Same ORF/gene, same well (expected to drop, but not too much, N=3653)

Setting	Data	mmAP WDN	Fraction WDN (p<0.05)	mmAP PPN-IM	Fraction PPN-IM (p<0.05)	mmAP PPN-NI	Fraction PPN-NI (p<0.05)
same ORF, same well	raw->subset	0.195	0.903	0.211	0.778	0.232	0.79
same ORF, same well	raw->subset->cc adjust	0.0856	0.417	0.0863	0.346	0.102	0.405
same ORF, same well	raw->subset->well correct	0.286	0.93	0.307	0.857	0.268	0.794
same ORF, same well	raw->subset->cc adjust->well correct	0.538	0.989	0.412	0.929	0.396	0.903

3. Misc

Other experiments included PCA with and w/o image features and Cosine kernel PCA, but did not show improvement over the results above.

alxndrkalinin · 2023-06-13T20:05:12Z

In the 3.2v1 Same ORF, different well (higher is better, N=37) scenario, all 37 ORFs have 5x same-well replicates in one batch and 5x same-well replicates in another.

See an example

JCP2022_900041
          Metadata_Batch Metadata_Plate Metadata_Well
6670   2021_06_14_Batch6     BR00123945           F08
8794   2021_06_14_Batch6     BR00124766           F08
9148   2021_06_14_Batch6     BR00124767           F08
9502   2021_06_14_Batch6     BR00124768           F08
9856   2021_06_14_Batch6     BR00124769           F08
10921  2021_07_12_Batch8     BR00124787           H13
11281  2021_07_12_Batch8     BR00124788           H13
12703  2021_07_12_Batch8     BR00125619           H13
13062  2021_07_12_Batch8     BR00125620           H13
13422  2021_07_12_Batch8     BR00125621           H13

Whereas in the 3.2v2 Same gene, different well (higher is better, N=264) scenario, 33% of genes (89/264) come from a single batch and 1% (3/264) are from 3 batches:

# unique batches	# unique gene symbols	mmAP	Fraction retrieved (p<0.05)
2	172	0.0254	0.038 (10 / 264)
1	89	0.0300	0.06 (16 / 264)
3	3	0.0256	0.004 (1 / 264)

An example of a gene with all replicates in the same batch

ADGRG7
      Metadata_JCP2022      Metadata_Batch Metadata_Plate Metadata_Well
17443   JCP2022_904157  2021_08_09_Batch11     BR00126538           H01
17539   JCP2022_908625  2021_08_09_Batch11     BR00126538           P01
17800   JCP2022_904157  2021_08_09_Batch11     BR00126539           H01
17896   JCP2022_908625  2021_08_09_Batch11     BR00126539           P01
18157   JCP2022_904157  2021_08_09_Batch11     BR00126540           H01
18253   JCP2022_908625  2021_08_09_Batch11     BR00126540           P01
18514   JCP2022_904157  2021_08_09_Batch11     BR00126541           H01
18610   JCP2022_908625  2021_08_09_Batch11     BR00126541           P01
18871   JCP2022_904157  2021_08_09_Batch11     BR00126542           H01
18967   JCP2022_908625  2021_08_09_Batch11     BR00126542           P01

An example of a gene with 3 different batches and 4 well positions

ATM
     Metadata_JCP2022     Metadata_Batch Metadata_Plate Metadata_Well
1942   JCP2022_909964  2021_05_31_Batch2     BR00121552           D23
2210   JCP2022_909964  2021_05_31_Batch2     BR00121553           D23
2478   JCP2022_909964  2021_05_31_Batch2     BR00121554           D23
2746   JCP2022_909964  2021_05_31_Batch2     BR00121555           D23
3014   JCP2022_909964  2021_05_31_Batch2     BR00121556           D23
5109   JCP2022_909963  2021_05_17_Batch4     BR00123785           K01
5464   JCP2022_909963  2021_05_17_Batch4     BR00123786           K01
5819   JCP2022_909963  2021_05_17_Batch4     BR00123787           K01
6174   JCP2022_909963  2021_05_17_Batch4     BR00123790           K01
6529   JCP2022_909963  2021_05_17_Batch4     BR00123791           K01
6705   JCP2022_909963  2021_06_14_Batch6     BR00123945           D18
6706   JCP2022_909964  2021_06_14_Batch6     BR00123945           N22
8829   JCP2022_909963  2021_06_14_Batch6     BR00124766           D18
8830   JCP2022_909964  2021_06_14_Batch6     BR00124766           N22
9183   JCP2022_909963  2021_06_14_Batch6     BR00124767           D18
9184   JCP2022_909964  2021_06_14_Batch6     BR00124767           N22
9537   JCP2022_909963  2021_06_14_Batch6     BR00124768           D18
9538   JCP2022_909964  2021_06_14_Batch6     BR00124768           N22
9891   JCP2022_909963  2021_06_14_Batch6     BR00124769           D18
9892   JCP2022_909964  2021_06_14_Batch6     BR00124769           N22

shntnu · 2023-06-17T14:09:41Z

(DRAFT)

Testing DINO features here: https://github.com/jump-cellpainting/morphmap/issues/91#issuecomment-1595712135

alxndrkalinin mentioned this issue May 23, 2023

Document project plan #2

Closed

alxndrkalinin changed the title ~~Regressing out cel count~~ Regressing out cell count May 23, 2023

alxndrkalinin mentioned this issue May 31, 2023

update cell count adjustments #16

Merged

alxndrkalinin mentioned this issue Jun 20, 2023

Classification-based analysis of plate position effect #21

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Regressing out cell count #15

Regressing out cell count #15

alxndrkalinin commented May 23, 2023 •

edited

Loading

alxndrkalinin commented May 31, 2023

AnneCarpenter commented May 31, 2023 via email

alxndrkalinin commented Jun 2, 2023 •

edited

Loading

AnneCarpenter commented Jun 5, 2023

shntnu commented Jun 5, 2023

AnneCarpenter commented Jun 6, 2023

alxndrkalinin commented Jun 7, 2023

Normalization

Per-position mean feature visualizations

Retrieving same gene in different wells (instead of ORFs)

niranjchandrasekaran commented Jun 7, 2023

alxndrkalinin commented Jun 13, 2023

1. Normalization

2. Per-position mean feature visualizations

3. Retrieving ORFs/genes with pre-plate normalization

3. Misc

alxndrkalinin commented Jun 13, 2023 •

edited

Loading

shntnu commented Jun 17, 2023

Regressing out cell count #15

Regressing out cell count #15

Comments

alxndrkalinin commented May 23, 2023 • edited Loading

1. Motivation

2. Approach

2.1 Constant and low count features

2.2 Adding cell count back as a feature

3. Results

3.1 Same well, different ORF

3.2 Same ORF, different well

3.2 Same ORF, same well

alxndrkalinin commented May 31, 2023

AnneCarpenter commented May 31, 2023 via email

alxndrkalinin commented Jun 2, 2023 • edited Loading

AnneCarpenter commented Jun 5, 2023

shntnu commented Jun 5, 2023

AnneCarpenter commented Jun 6, 2023

alxndrkalinin commented Jun 7, 2023

Normalization

Per-position mean feature visualizations

Retrieving same gene in different wells (instead of ORFs)

niranjchandrasekaran commented Jun 7, 2023

alxndrkalinin commented Jun 13, 2023

1. Normalization

2. Per-position mean feature visualizations

3. Retrieving ORFs/genes with pre-plate normalization

3.1 Same well, different ORF/gene (lower is better, N=368)

3.2v1 Same ORF, different well (higher is better, N=37)

3.2v2 Same gene, different well (higher is better, N=264)

3.3 Same ORF/gene, same well (expected to drop, but not too much, N=3653)

3. Misc

alxndrkalinin commented Jun 13, 2023 • edited Loading

shntnu commented Jun 17, 2023

alxndrkalinin commented May 23, 2023 •

edited

Loading

alxndrkalinin commented Jun 2, 2023 •

edited

Loading

alxndrkalinin commented Jun 13, 2023 •

edited

Loading