Fix ADT filtering #379

sjspielman · 2023-07-14T14:02:50Z

Closes #378

This PR fixes the ADT filtering bug, with some aspects up for discussion during review. I took the approach outlined in the corresponding issue:

- If both of the cleanTagCounts output columns discard and zero.ambient have NAs, totally bail on filtering (rather than just "skipping" filtering on cells with NA - I thought a uniform approach would be cleaner).
  - In this case, metadata(processed_sce)$adt_scpca_filter_method is given a value of No filter, as the kids say these days.
  - The colData column processed_sce$adt_scpca_filter is given a value of NULL.
  - log1p normalization will be performed here on all cells.
- Otherwise, first try to filter on discard. If this column has NAs, fall back to filtering on zero.ambient.
  - sizeFactors are calculated here for normalization under this circumstance, as before.
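
The decision logic described above can be sketched in base R. This is only an illustration under assumptions: `choose_adt_filter()` is a hypothetical helper, plain logical vectors stand in for the discard and zero.ambient columns from cleanTagCounts, and the method labels other than "No filter" are made up for the example. The real code works on the altExp of the SCE object.

```r
# Hypothetical helper (not from the PR): pick which column to filter on.
# `discard` and `zero_ambient` are logical vectors with one value per cell.
choose_adt_filter <- function(discard, zero_ambient) {
  if (!anyNA(discard)) {
    # Preferred case: discard is fully populated
    list(method = "discard", remove = discard)
  } else if (!anyNA(zero_ambient)) {
    # Fallback: discard has NAs but zero.ambient does not
    list(method = "zero.ambient", remove = zero_ambient)
  } else {
    # Both columns contain NAs: bail on filtering entirely
    list(method = "No filter", remove = NULL)
  }
}

res <- choose_adt_filter(
  discard = c(FALSE, NA, TRUE),
  zero_ambient = c(FALSE, FALSE, TRUE)
)
res$method
```

Because the choice is made once per library, every cell is filtered by the same rule, which is the uniform approach argued for above.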

I also found another tab-related bug (!) wherein log1p normalization was not actually put back into the SCE object after calculation, so that's good! I moved this code back out of the if.

One note here: for the case of No filter, I could have just calculated log1p directly on the SCE object, rather than taking the "code detour" that ensures filtered-out cells have NA normalized values. If we prefer, I could add an if before entering the normalization to check whether no filter was performed, and then just straight up do a logcounts(altExp(sce)) <- log1p(counts(altExp(sce))).
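
To make the trade-off concrete, here is a minimal base R sketch. It is an assumption-laden stand-in: a plain matrix replaces the altExp counts, `log1p_with_na()` is a hypothetical helper, and log1p is used on both paths purely to show the NA handling (the real filtered path normalizes with size factors).

```r
# Toy ADT counts: rows are ADTs, columns are cells (assumed data)
adt_counts <- matrix(
  c(0, 3, 10, 1, 0, 7),
  nrow = 2,
  dimnames = list(c("ADT1", "ADT2"), c("cell1", "cell2", "cell3"))
)

# Hypothetical "detour": normalize retained cells only and leave
# NA in the columns of filtered-out cells
log1p_with_na <- function(counts, keep) {
  out <- matrix(
    NA_real_,
    nrow = nrow(counts),
    ncol = ncol(counts),
    dimnames = dimnames(counts)
  )
  out[, keep] <- log1p(counts[, keep, drop = FALSE])
  out
}

# With a filter, cell2 is removed and gets NA normalized values
detour <- log1p_with_na(adt_counts, keep = c(TRUE, FALSE, TRUE))

# With no filter, every cell is kept, so the detour and the direct
# calculation give the same matrix
no_filter <- log1p_with_na(adt_counts, keep = rep(TRUE, 3))
direct <- log1p(adt_counts)
isTRUE(all.equal(no_filter, direct))
```

When nothing is filtered the two paths agree, so the extra if would be a readability shortcut rather than a change in results.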

I then updated the QC report accordingly.

I added more logic for some of the output in cite.rmd; rather than only checking for has_processed, we also need to check that !is.null(processed_sce$adt_scpca_filter).
I also rearranged some text to better match the final context.

I'm attaching two versions of a rendered QC report on a library that first caught this error:

- SCPCL000706_qc.html.txt is the QC report as usual run through the full workflow
- qc_report-no-filtering.html.txt is a modified version of the above. To create this notebook, I manually modified the processed SCE so that processed_sce$adt_scpca_filter <- NULL and metadata(processed_sce)$adt_scpca_filter_method <- "No filter" and directly rendered the report (not via the workflow).

I know this may be a lot to digest here, so please let me know where I can clarify!! Tagging everyone for review since it would be good to have a couple eyes on this, I think.

jashapiro · 2023-07-14T15:01:03Z

bin/post_process_sce.R

+  use_discard <- sum(is.na(altExp(sce, alt_exp)$discard)) == 0
+  use_zero.ambient <- sum(is.na(altExp(sce, alt_exp)$zero.ambient)) == 0


So this is what I was asking about earlier, but the way I asked might have been confusing. In the case where we have discard for some cells but for other cells it is NA, should we not use it for those cells?

Which is to say should we be doing this fallback on a cell-by-cell basis rather than as a whole?

I am somewhat surprised that you can have NA for some but not all high ambient calls, but if that is really happening, then I think we probably want to use the information where we have it.
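
For comparison with the all-or-nothing rule, the cell-by-cell fallback asked about here could look roughly like the following base R sketch. The vectors are invented example data standing in for the discard and zero.ambient columns.

```r
# Assumed per-cell calls; NA means cleanTagCounts made no call
discard <- c(FALSE, NA, TRUE, NA)
zero_ambient <- c(FALSE, TRUE, FALSE, NA)

# Use discard where it exists, otherwise fall back to zero.ambient
per_cell_remove <- ifelse(is.na(discard), zero_ambient, discard)
per_cell_remove
```

The fourth cell stays NA because neither column has a value for it, which is exactly the group of cells the rest of this thread is concerned about.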

sjspielman · 2023-07-14

Honestly, I'm of two minds about this. I understand why we might want to use what filtering information we have, but I am specifically wary about using different kinds of filtering for the same dataset. I can imagine a situation where we have:

- some cells filtered based on discard
- some cells filtered based on zero.ambient
- some cells unfiltered only because they have NA for both of these values. For these cells, I would imagine something funky could be going on and maybe they should be filtered, but they would get retained in a way that might make it appear as though they are fine to retain. This is also sort of related to a comment you left about NULL -> "Keep". This also makes me nervous in a similar vein, because it may give users the impression that we have a good reason to keep those cells, when actually we have no reliable information either way.

All of this could be documented of course, but I do feel uncomfortable about using different filtering approaches for different cells in the same library. I'm happy to be convinced otherwise, but I need a little more convincing.

jashapiro · 2023-07-14

Yes, I agree that it is potentially a bit wonky, but I'd really be curious about how often each case is occurring. And are these cells that would be filtered for other reasons? If you can look at one of these datasets and give a breakdown, that would be helpful for evaluation.

On the change from NULL to "Keep": that is for the case where we do no filtering, so the column is describing what we did, not what we are endorsing. What I want to avoid is a large number of places where we and downstream users have to start checking is.null(sce$adt_scpca_filter) before using it, when the alternative is pretty much always going to be just skipping the filtering, which is the same effect as if we had filled it in with "Keep".
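
A small base R sketch of why filling in "Keep" spares downstream code a NULL check. A data.frame stands in for colData here, and the barcodes are invented for the example.

```r
# Stand-in for colData(processed_sce); barcodes are made up
cell_data <- data.frame(barcode = c("AAAC", "AAAG", "AAAT"))

# With a NULL (absent) column, a naive comparison silently yields a
# zero-length result, so callers must guard with is.null() first
length(cell_data$adt_scpca_filter == "Keep")

# Filling the column with "Keep" when no filtering was done lets the
# same subsetting code run unchanged and retain every cell
cell_data$adt_scpca_filter <- "Keep"
retained <- cell_data[cell_data$adt_scpca_filter == "Keep", , drop = FALSE]
nrow(retained)
```

The column then records what was done (nothing was removed) without any special-casing in the code that consumes it.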

sjspielman · 2023-07-14

> If you can look at one of these datasets and give a breakdown, that would be helpful for evaluation.

good call, will do!

> that is for the case where we do no filtering, so the column is describing what we did, not what we are endorsing. What I want to avoid is a large number of places where we and downstream users have to start checking is.null(sce$adt_scpca_filter) before using it

👍 convinced!

allyhawkins · 2023-07-14

Just jumping in here, and I wanted to echo that I agree with Josh that it's worth identifying the proportion of cells in a sample that have an NA value. Another idea I wanted to bring up: what about directly using the NA value in the filter column, so that we have Keep, Remove, and NA, rather than filtering some cells on zero.ambient and others on discard? This is just a column with suggestions and not actual filtering, so I think it might be helpful to inform users that a specific cell received an NA and no call was made by cleanTagCounts regarding whether the cell should be discarded or not.
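
A minimal base R sketch of the three-level column, along with the NA proportion that the inventory would report. The discard vector is invented example data, not output from a real library.

```r
# Assumed cleanTagCounts calls for four cells; NA means no call was made
discard <- c(FALSE, TRUE, NA, FALSE)

# NA in the test propagates through ifelse(), giving Keep / Remove / NA
adt_scpca_filter <- ifelse(discard, "Remove", "Keep")
adt_scpca_filter

# Count each level, including NA, and compute the fraction of cells
# for which no call was made
table(adt_scpca_filter, useNA = "ifany")
prop_na <- mean(is.na(adt_scpca_filter))
prop_na
```

With this encoding a single source column drives the label, so no cell is labeled by a different rule than its neighbors.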

sjspielman · 2023-07-14

> This is just a column with suggestions and not actual filtering, so I think it might be helpful to inform users that a specific cell received an NA and no call was made by cleanTagCounts regarding whether the cell should be discarded or not.

I like this idea! Full inventory of NA cells coming soon...

bin/post_process_sce.R

jashapiro · 2023-07-17

I think this looks fine, but I made a couple of simplification suggestions (all() and any()).
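
The suggestions themselves are not shown in this thread. As an assumption about their general shape, the `sum(is.na(...)) == 0` pattern from the earlier diff can be written more directly with any() or all():

```r
# Example column with one missing call (invented data)
discard <- c(FALSE, NA, TRUE)

# Pattern from the earlier diff
use_discard_sum <- sum(is.na(discard)) == 0

# Equivalent forms that state the intent more directly
use_discard_any <- !any(is.na(discard))
use_discard_all <- all(!is.na(discard))

c(use_discard_sum, use_discard_any, use_discard_all)
```

All three evaluate to the same logical value; the any() and all() forms simply read closer to the question being asked.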

bin/post_process_sce.R

sjspielman · 2023-07-17T21:53:49Z

I hadn't tested yet actually, just coming back to do that now! Will accept suggestions and then give it a full go to be extra sure!

Co-authored-by: Joshua Shapiro <josh.shapiro@ccdatalab.org>

sjspielman · 2023-07-18T13:28:55Z

I tested this with four libraries that had previously failed (errored) due to filtering, and all four succeeded. While running, one of them failed on the first try of sce_qc_report due to RAM, so I went ahead and added the memory label to that process (#377) and re-tested that library. It no longer fails on the first try, so this is all set to go!

sjspielman added 3 commits July 11, 2023 11:44

Merge pull request #374 from AlexsLemonade/development

4b4de3c

Merging in `development` for `v0.5.2` release

Bug fix for when ADT filtering fails, and associated QC updates. Tested

6fe4179

Text rearrangement and no filter warning

499a908

sjspielman requested review from jashapiro and allyhawkins and removed request for jashapiro July 14, 2023 14:02

jashapiro reviewed Jul 14, 2023

bin/post_process_sce.R (outdated)

sjspielman added 4 commits July 14, 2023 16:51

NULL -> 'Keep' and associated QC report updates

a93152c

WIP: cell filtering strategies with local running code

2aaacaa

solidify logic for adt filtering; not yet tested

94d87be

failure warnings a little better, maybe?

479355d

jashapiro approved these changes Jul 17, 2023

bin/post_process_sce.R (outdated)

bin/post_process_sce.R (outdated)

sjspielman and others added 3 commits July 17, 2023 17:54

Update bin/post_process_sce.R

550852b

Co-authored-by: Joshua Shapiro <josh.shapiro@ccdatalab.org>

Update bin/post_process_sce.R

e228662

Co-authored-by: Joshua Shapiro <josh.shapiro@ccdatalab.org>

add memory label to sce_qc_report for #377

eca66fb

sjspielman merged commit 5806d1b into development Jul 18, 2023
2 checks passed

sjspielman deleted the sjspielman/378-fix-adt-filtering branch July 18, 2023 13:29

This was referenced Jul 18, 2023

Add memory label to sce_qc_report process #377

Closed

Bug: ADT post-processing fails when cleanTagCounts assumptions are not met #378

Closed
