-
Notifications
You must be signed in to change notification settings - Fork 2
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Remove any unused cell hash oligos from unfiltered and fix conversion issue #675
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Problem 2 (the bug I found) solution LGTM! Since I'm less familiar with the other code, I'll let @jashapiro weigh in primarily there.
bin/generate_unfiltered_sce.R
Outdated
@@ -178,5 +196,6 @@ unfiltered_sce <- unfiltered_sce |> | |||
# add dataframe with sample metadata to sce metadata | |||
add_sample_metadata(metadata_df = sample_metadata_df) | |||
|
|||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
modules/sce-processing.nf
Outdated
|
||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
bin/generate_unfiltered_sce.R
Outdated
if (!file.exists(opt$cellhash_pool_file)) { | ||
stop("Can't find cellhash_pool_file") | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
NB Josh recently discovered that stopifnot()
got better when we weren't looking! I am not suggesting to change this code, just sharing the good news - https://blog.r-hub.io/2022/03/10/input-checking/
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just to make this more explicit, now we can do things like:
if (!file.exists(opt$cellhash_pool_file)) { | |
stop("Can't find cellhash_pool_file") | |
} | |
stopifnot("Can't find cellhash_pool_file" = file.exists(opt$cellhash_pool_file)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this is mostly good, but I had a couple hopefully quick thoughts.
I was torn about the first solution... I think that part of the reason I wanted to keep all tags in the unfiltered SCE is that we would be able to see more easily if there were some error in the data with respect to the assigned HTOs (if other HTOs came up that were not expected). But I looked at the QC report, and it seems we are indeed only reporting the expected HTOs, so I think this is probably fine, but I'm also not sure I see where the problem with including the ADT is?
I'm also a little worried about not making any cellhash AnnData objects. Are we not outputting those at all? I can see that we might want to at some point, so I think it may be better to have them than not, at least in the case where we have more than one HTO. So it may be better to just have bash check for the output file from the conversion and not exclude cellhash tables from conversion at all. Again, unless that is a problem for some other reason that I am not currently thinking of.
bin/generate_unfiltered_sce.R
Outdated
if (!file.exists(opt$cellhash_pool_file)) { | ||
stop("Can't find cellhash_pool_file") | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just to make this more explicit, now we can do things like:
if (!file.exists(opt$cellhash_pool_file)) { | |
stop("Can't find cellhash_pool_file") | |
} | |
stopifnot("Can't find cellhash_pool_file" = file.exists(opt$cellhash_pool_file)) |
modules/export-anndata.nf
Outdated
@@ -12,7 +12,7 @@ process export_anndata{ | |||
script: | |||
rna_hdf5_file = "${meta.library_id}_${file_type}_rna.hdf5" | |||
feature_hdf5_file = "${meta.library_id}_${file_type}_${meta.feature_type}.hdf5" | |||
feature_present = meta.feature_type in ["adt", "cellhash"] | |||
feature_present = meta.feature_type == "adt" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Minor, but can we leave this as in
for future expansion? Also, we should be sure to update the stub in parallel if we are changing logic here.
feature_present = meta.feature_type == "adt" | |
feature_present = meta.feature_type in ["adt"] |
I guess I'm not totally clear here on not wanting the HTO features to come out as AnnData? Can you elaborate on that?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Another thought is that we can check whether ${feature_hdf5_file}
exists after sce_to_anndata.R
and only run move_counts_anndata.py
on that file if it exists. sce_to_anndata.R
already prints a warning in this case, so I think we should be fine on that front.
I'm not entirely sure what you mean here? This should only be affecting cell hash data, so not sure what you mean by including the ADT here?
We had made the decision to not convert any multiplexed libraries to AnnData objects. So we don't want to go through the |
Brain-finger disconnect. I basically meant the altexp with all HTO tags. |
Yes, this does argue for just skipping all cellhash data! But I think it may still be worth doing the check in bash to prevent other possible errors, unless we always want those errors to kill the workflow. |
It's not entirely a problem, but I feel like it's a little misleading. If those tags weren't even used in the experiment, why would they be provided in the results? I think the case that you mentioned where a different oligo was recorded is definitely a possibility, but would you still trust that data if you didn't record the HTO correctly? Ultimately, with the change to only set |
Do you mean adding a check before running |
I'm more worried about an error in the cellhash pool file... Having the info just makes it easier to check! Also, you might be able to see if there were barcodes that were more likely to be mis-assigned because of sequencing error. It isn't a great check, but it is something, which is why I would lean toward leaving it in.
Yes. Something like:
in place of
|
I updated this PR to keep the original plan of keeping all HTOs in the altExp for unfiltered objects. I then only made changes in the export AnnData process. Here, This should be ready for another review. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good to me. I think we only want to move processed counts for features too, right?
""" | ||
sce_to_anndata.R \ | ||
--input_sce_file ${sce_file} \ | ||
--output_rna_h5 ${rna_hdf5_file} \ | ||
--output_feature_h5 ${feature_hdf5_file} \ | ||
${feature_present ? "--output_feature_h5 ${feature_hdf5_file}" : ''} \ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't know if this had to be conditional, but it doesn't hurt.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For some reason I was still getting the unfiltered feature file being saved without the conditional, so I added it in to help prevent that.
Co-authored-by: Joshua Shapiro <josh.shapiro@ccdatalab.org>
We have one specific library that both contains cell hashing oligos, but only has one of those oligos and isn't actually multiplexed. This means that although it's just a regular single-cell library, it has an
altExp
with cell hash data with 1 HTO and hasmeta.feature_type = cellhash
in the workflow.For this specific library, we want to convert the RNA data to AnnData, but not the HTO altExp, which again should only have 1 HTO in it. However, what was happening is that it was saving the HTO altExp for the unfiltered object only as AnnData, but not the processed and filtered objects. This is because we have a check that looks to see if the altExp has > 1 row. If it doesn't meet that criteria then the HTO data won't be saved as a separate AnnData object. This allowed this library to go through the conversion process, but only output the RNA, or so we thought.
Problem 1: The
unfiltered_cellhash.adt
was being converted, when that shouldn't happen. The reason is because when we quantify cell hash expression, we use an index that contains all possible HTO's found in all libraries in that project. So, this means when we create the unfiltered SCE object we are adding an altExp that has 1 row for every HTO found in the project. For this particular library, there is only one HTO that should be there, but we see all 12. Later when we demux, we remove any HTO's that aren't assigned to that library based on the cellhash pool file we have. However, that removes the extra unexpressed HTOs from filtered and processed but not unfiltered. So, we still have 12 rows which passes the > 1 row check when converting to AnnData.To combat this problem, I added a step to the script when we generate merged unfiltered objects, to filter only cellhash data to contain only the HTO's that are actually present in that library. I think before we didn't want to do any filtering in "unfiltered", but I don't think it makes sense to have an object with HTOs that weren't even added to that library to begin with.
Note we might be able to remove the code I copied over from
add_cellhash_calls.R
from that script, but I don't think it really matters either way.Problem 2: Because this one library is technically a "cellhash" library, but doesn't actually have any cellhash altExp data, Nextflow is told that there should be an extra feature AnnData object to run through
move_counts_anndata.py
. But that's only done for the processed object, and that file does not, in fact, exist since we didn't pass the n > 1 rule. Generally, we just want to avoid accidentally creating any HTO altExp objects, so I removed "cellhash" from the options when checking for if a feature is present.