Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix ADT filtering #379

Merged
merged 10 commits into from
Jul 18, 2023
Merged

Fix ADT filtering #379

merged 10 commits into from
Jul 18, 2023

Conversation

sjspielman
Copy link
Member

Closes #378

This PR fixes the ADT filtering bug, with some aspects up for discussion during review. I took the approach outlined in the corresponding issue:

  • If both of the cleanTagCounts output columns discard and zero.ambient have NAs, totally bail on filtering (rather that just "skipping" filtering on cells with NA - I thought a uniform approach would be cleaner).
    • In this case, metadata(processed_sce)$adt_scpca_filter_method is given a value of No filter, as the kids say these days.
    • The colData column processed_sce$adt_scpca_filter is given a value of NULL.
    • log1p normalization will be performed here on all cells
  • Otherwise, first try to filter on discard. If this column has NA's, fall back to filtering on zero.ambient
    • sizeFactors are calculated here for normalization under this circumstance, as before

I also found another tab-related bug (!) wherein log1p normalization was not actually put back into the SCE object after calculation, so that's good! I moved this code back out of the if.

  • One note here: for the case of No filter, I could have just calculated log1p directly on the SCE object, rather than the "code detour" to ensure filtered out cells have NA normalized values. But I could add an if before entering the normalization to check if no filter was performed, and then just straight up do a logcounts(altExp(sce)) <- log1p(counts(altExp(sce)).

I then updated the QC report accordingly.

  • I added more logic for some of the output in cite.rmd; rather than only checking for has_processed, we need to also check that generally check that !is.null(processed_sce$adt_scpca_filter).
  • I also rearranged some text to better match the final context.

I'm attaching two versions of a rendered QC report on a library that first caught this error:

  • SCPCL000706_qc.html.txt is the QC report as usual run through the full workflow
  • qc_report-no-filtering.html.txt is a modified version of the above. To create this notebook, I manually modified the processed SCE so that processed_sce$adt_scpca_filter <- NULL and metadata(processed_sce)$adt_scpca_filter_method <- "No filter" and directly rendered the report (not via the workflow).

I know this may be a lot to digest here, so please let me know where I can clarify!! Tagging everyone for review since it would be good to have a couple eyes on this, I think.

@sjspielman sjspielman requested review from jashapiro and allyhawkins and removed request for jashapiro July 14, 2023 14:02
Comment on lines 101 to 102
use_discard <- sum(is.na(altExp(sce, alt_exp)$discard)) == 0
use_zero.ambient <- sum(is.na(altExp(sce, alt_exp)$zero.ambient)) == 0
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So this is what I was asking about earlier, but the way I asked might have been confusing. In the case where we have discard for some cells but for other cells it is NA, should we not use it for those cells?

Which is to say should we be doing this fallback on a cell-by-cell basis rather than as a whole?

I am somewhat surprised that you can have NA for some but not all high ambient calls, but if that is really happening, then I think we probably want to use the information where we have it.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Honestly, I'm of two minds about this. I understand why we might want to use what filtering information we have, but I am specifically wary about using different kinds of filtering for the same dataset. I can imagine a situation where we have..

  • some cells filtered based on discard
  • some cells filtered based on zero.ambient
  • some cells unfiltered only because they have NA for both of these values. For these cells, I would imagine something funky could be going on and maybe they should be filtered, but they would get retained in a way that it might appear as though they are fine to retain. This is also sort of related to a comment you left about NULL -> "Keep".. this also makes me nervous in similar vein because it may give the impression to users that we have a good reason to keep those cells, when actually we have no reliable information either way.

All of this could be documented of course, but I do feel uncomfortable about using different filtering approaches for different cells in the same library. I'm happy to be convinced otherwise, but I need a little more convincing..

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I agree that it is potentially a bit wonky, but I'd really be curious about how often each case is occurring. And are these cells that would be filtered for other reasons? If you can look at one of these datasets and give a breakdown, that would be helpful for evaluation.

On the change from NULL to "Keep": that is for the case where we do no filtering, so the column is describing what we did not what we are endorsing. What I want to avoid is a large number of places where we and downstream users have to start checking is.null(sce$adt_scpca_filter) before using it, when the alternative is pretty much always going to be just skipping the filtering, which is the same effect as if we had filled it in with "Keep".

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you can look at one of these datasets and give a breakdown, that would be helpful for evaluation.

good call, will do!

that is for the case where we do no filtering, so the column is describing what we did not what we are endorsing. What I want to avoid is a large number of places where we and downstream users have to start checking is.null(sce$adt_scpca_filter) before using it

👍 convinced!

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just jumping in here, and I wanted to echo that I agree with Josh that it's worth identifying the proportion of cells in a sample that have an NA value. Another idea I wanted to bring up was directly using the NA value in the filter column? So we have Keep, Remove, and NA. Rather than filtering some on zero.ambient and others on discard. This is just a column with suggestions and not actual filtering, so I think it might be helpful to inform users that a specific cell received an NA and no call was made by cleanTagCounts regarding whether the cell should be discarded or not.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is just a column with suggestions and not actual filtering, so I think it might be helpful to inform users that a specific cell received an NA and no call was made by cleanTagCounts regarding whether the cell should be discarded or not.

I like this idea! Full inventory of NA cells coming soon...

bin/post_process_sce.R Outdated Show resolved Hide resolved
Copy link
Member

@jashapiro jashapiro left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this looks fine, but I made a couple of simplification suggestions (all() and any()).

bin/post_process_sce.R Outdated Show resolved Hide resolved
bin/post_process_sce.R Outdated Show resolved Hide resolved
@sjspielman
Copy link
Member Author

I hadn't tested yet actually, just coming back to do that now! Will accept suggestions and then give it a full go to be extra sure!

sjspielman and others added 3 commits July 17, 2023 17:54
Co-authored-by: Joshua Shapiro <josh.shapiro@ccdatalab.org>
Co-authored-by: Joshua Shapiro <josh.shapiro@ccdatalab.org>
@sjspielman
Copy link
Member Author

I tested this with four libraries that had previously failed (error'd) due to filtering, and all four succeeded. While running, one of them failed on the first try of sce_qc_report due to RAM, so I went ahead and added the memory label to that process (#377) and re-tested that library. No longer fails on the try first, so this is all set to go!

@sjspielman sjspielman merged commit 5806d1b into development Jul 18, 2023
2 checks passed
@sjspielman sjspielman deleted the sjspielman/378-fix-adt-filtering branch July 18, 2023 13:29
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants