Fix ADT filtering #379
Conversation
Merging in `development` for `v0.5.2` release
bin/post_process_sce.R

```r
use_discard <- sum(is.na(altExp(sce, alt_exp)$discard)) == 0
use_zero.ambient <- sum(is.na(altExp(sce, alt_exp)$zero.ambient)) == 0
```
So this is what I was asking about earlier, but the way I asked might have been confusing. In the case where we have `discard` for some cells but for other cells it is `NA`, should we not use it for those cells? Which is to say, should we be doing this fallback on a cell-by-cell basis rather than as a whole?

I am somewhat surprised that you can have `NA` for some but not all high ambient calls, but if that is really happening, then I think we probably want to use the information where we have it.
Honestly, I'm of two minds about this. I understand why we might want to use what filtering information we have, but I am specifically wary about using different kinds of filtering for the same dataset. I can imagine a situation where we have:

- some cells filtered based on `discard`
- some cells filtered based on `zero.ambient`
- some cells unfiltered only because they have `NA` for both of these values

For these last cells, I would imagine something funky could be going on and maybe they should be filtered, but they would get retained in a way that makes it appear as though they are fine to retain. This is also related to a comment you left about `NULL` -> `"Keep"`; that also makes me nervous in a similar vein, because it may give users the impression that we have a good reason to keep those cells, when actually we have no reliable information either way.

All of this could be documented of course, but I do feel uncomfortable about using different filtering approaches for different cells in the same library. I'm happy to be convinced otherwise, but I need a little more convincing..
Yes, I agree that it is potentially a bit wonky, but I'd really be curious about how often each case is occurring. And are these cells that would be filtered for other reasons? If you can look at one of these datasets and give a breakdown, that would be helpful for evaluation.

On the change from `NULL` to `"Keep"`: that is for the case where we do no filtering, so the column is describing what we did, not what we are endorsing. What I want to avoid is a large number of places where we and downstream users have to start checking `is.null(sce$adt_scpca_filter)` before using it, when the alternative is pretty much always going to be just skipping the filtering, which is the same effect as if we had filled it in with `"Keep"`.
> If you can look at one of these datasets and give a breakdown, that would be helpful for evaluation.

Good call, will do!

> that is for the case where we do no filtering, so the column is describing what we did, not what we are endorsing. What I want to avoid is a large number of places where we and downstream users have to start checking `is.null(sce$adt_scpca_filter)` before using it

👍 convinced!
Just jumping in here, and I wanted to echo that I agree with Josh that it's worth identifying the proportion of cells in a sample that have an `NA` value. Another idea I wanted to bring up was directly using the `NA` value in the filter column, so we would have `Keep`, `Remove`, and `NA`, rather than filtering some cells on `zero.ambient` and others on `discard`. This is just a column with suggestions and not actual filtering, so I think it might be helpful to inform users that a specific cell received an `NA` and no call was made by `cleanTagCounts` regarding whether the cell should be discarded or not.
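A minimal sketch of how that three-valued column might be built (purely illustrative, not the actual workflow code; it assumes the ADT `altExp` carries `cleanTagCounts`' `discard` column):

```r
# Sketch: ifelse() naturally propagates NA, so cells where cleanTagCounts
# made no call get NA in the filter column rather than a Keep/Remove guess.
sce$adt_scpca_filter <- ifelse(altExp(sce)$discard, "Remove", "Keep")
```

Because `ifelse()` returns `NA` wherever its test is `NA`, no extra branching is needed to mark the "no call" cells.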
> This is just a column with suggestions and not actual filtering, so I think it might be helpful to inform users that a specific cell received an `NA` and no call was made by `cleanTagCounts` regarding whether the cell should be discarded or not.

I like this idea! Full inventory of `NA` cells coming soon...
I think this looks fine, but I made a couple of simplification suggestions (`all()` and `any()`).

I hadn't tested yet actually, just coming back to do that now! Will accept suggestions and then give it a full go to be extra sure!
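For reference, the suggested `all()`/`any()` simplification presumably turns the earlier `sum(is.na(...)) == 0` checks into something like this sketch (either form works; both are equivalent to the original):

```r
# A column is only usable for filtering when it contains no NA values.
use_discard      <- all(!is.na(altExp(sce, alt_exp)$discard))
use_zero.ambient <- !anyNA(altExp(sce, alt_exp)$zero.ambient)
```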
Co-authored-by: Joshua Shapiro <josh.shapiro@ccdatalab.org>
I tested this with four libraries that had previously failed (errored) due to filtering, and all four succeeded. While running, one of them failed on the first try of
Closes #378
This PR fixes the ADT filtering bug, with some aspects up for discussion during review. I took the approach outlined in the corresponding issue:

- If the `cleanTagCounts` output columns `discard` and `zero.ambient` have `NA`s, totally bail on filtering (rather than just "skipping" filtering on cells with `NA`; I thought a uniform approach would be cleaner).
  - `metadata(processed_sce)$adt_scpca_filter_method` is given a value of `No filter`, as the kids say these days.
  - The `colData` column `processed_sce$adt_scpca_filter` is given a value of `NULL`.
  - `log1p` normalization will be performed here on all cells.
- Otherwise, filter on `discard`. If this column has `NA`s, fall back to filtering on `zero.ambient`.
  - `sizeFactor`s are calculated here for normalization under this circumstance, as before.

I also found another tab-related bug (!) wherein `log1p` normalization was not actually put back into the SCE object after calculation, so that's good! I moved this code back out of the `if`.
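Roughly, the bail-vs-fallback logic described here might look like the following (an illustrative sketch only, with hypothetical filter-method labels for the non-bail branches; the real implementation lives in `bin/post_process_sce.R`):

```r
# Sketch of the filtering decision, not the actual workflow code.
adt <- altExp(processed_sce)

if (anyNA(adt$discard) && anyNA(adt$zero.ambient)) {
  # Both columns have NAs: bail on filtering entirely rather than
  # mixing approaches cell by cell.
  metadata(processed_sce)$adt_scpca_filter_method <- "No filter"
  processed_sce$adt_scpca_filter <- NULL
} else if (!anyNA(adt$discard)) {
  # Preferred: filter on cleanTagCounts' discard column.
  processed_sce$adt_scpca_filter <- ifelse(adt$discard, "Remove", "Keep")
} else {
  # Fallback: filter on zero.ambient when discard has NAs.
  processed_sce$adt_scpca_filter <- ifelse(adt$zero.ambient, "Remove", "Keep")
}
```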
In the `No filter` case, I could have just calculated `log1p` directly on the SCE object, rather than taking the "code detour" to ensure filtered-out cells have `NA` normalized values. But I could add an `if` before entering the normalization to check whether no filter was performed, and then just straight up do `logcounts(altExp(sce)) <- log1p(counts(altExp(sce)))`.

I then updated the QC report accordingly.
- In `cite.rmd`, rather than only checking for `has_processed`, we also need to generally check that `!is.null(processed_sce$adt_scpca_filter)`.
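A sketch of what that combined guard might look like in the report setup (illustrative only; `has_adt_filter` is a hypothetical name, while `has_processed` comes from the existing report code):

```r
# Hypothetical guard: only render ADT filtering results when a processed
# SCE exists AND a non-NULL filter column was actually recorded.
has_adt_filter <- has_processed && !is.null(processed_sce$adt_scpca_filter)
```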
I'm attaching two versions of a rendered QC report on a library that first caught this error:

- `SCPCL000706_qc.html.txt` is the QC report as usual, run through the full workflow.
- `qc_report-no-filtering.html.txt` is a modified version of the above. To create this notebook, I manually modified the processed SCE so that `processed_sce$adt_scpca_filter <- NULL` and `metadata(processed_sce)$adt_scpca_filter_method <- "No filter"`, and directly rendered the report (not via the workflow).

I know this may be a lot to digest here, so please let me know where I can clarify!! Tagging everyone for review since it would be good to have a couple eyes on this, I think.