Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Remove any unused cell hash oligos from unfiltered and fix conversion issue #675

Merged
merged 10 commits into from
Jan 30, 2024

Conversation

allyhawkins
Copy link
Member

We have one specific library that both contains cell hashing oligos, but only has one of those oligos and isn't actually multiplexed. This means that although it's just a regular single-cell library, it has an altExp with cell hash data with 1 HTO and has meta.feature_type = cellhash in the workflow.

For this specific library, we want to convert the RNA data to AnnData, but not the HTO altExp, which again should only have 1 HTO in it. However, what was happening is that it was saving the HTO altExp for the unfiltered object only as AnnData, but not the processed and filtered objects. This is because we have a check that looks to see if the altExp has > 1 row. If it doesn't meet that criteria then the HTO data won't be saved as a separate AnnData object. This allowed this library to go through the conversion process, but only output the RNA, or so we thought.

Problem 1: The unfiltered_cellhash.adt was being converted, when that shouldn't happen. The reason is because when we quantify cell hash expression, we use an index that contains all possible HTO's found in all libraries in that project. So, this means when we create the unfiltered SCE object we are adding an altExp that has 1 row for every HTO found in the project. For this particular library, there is only one HTO that should be there, but we see all 12. Later when we demux, we remove any HTO's that aren't assigned to that library based on the cellhash pool file we have. However, that removes the extra unexpressed HTOs from filtered and processed but not unfiltered. So, we still have 12 rows which passes the > 1 row check when converting to AnnData.

To combat this problem, I added a step to the script when we generate merged unfiltered objects, to filter only cellhash data to contain only the HTO's that are actually present in that library. I think before we didn't want to do any filtering in "unfiltered", but I don't think it makes sense to have an object with HTOs that weren't even added to that library to begin with.
Note we might be able to remove the code I copied over from add_cellhash_calls.R from that script, but I don't think it really matters either way.

Problem 2: Because this one library is technically a "cellhash" library, but doesn't actually have any cellhash altExp data, Nextflow is told that there should be an extra feature AnnData object to run through move_counts_anndata.py. But that's only done for the processed object, and that file does not, in fact, exist since we didn't pass the n > 1 rule. Generally, we just want to avoid accidentally creating any HTO altExp objects, so I removed "cellhash" from the options when checking for if a feature is present.

Copy link
Member

@sjspielman sjspielman left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Problem 2 (the bug I found) solution LGTM! Since I'm less familiar with the other code, I'll let @jashapiro weigh in primarily there.

@@ -178,5 +196,6 @@ unfiltered_sce <- unfiltered_sce |>
# add dataframe with sample metadata to sce metadata
add_sample_metadata(metadata_df = sample_metadata_df)


Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change

Comment on lines 142 to 143


Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change

Comment on lines 163 to 165
if (!file.exists(opt$cellhash_pool_file)) {
stop("Can't find cellhash_pool_file")
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

NB Josh recently discovered that stopifnot() got better when we weren't looking! I am not suggesting to change this code, just sharing the good news - https://blog.r-hub.io/2022/03/10/input-checking/

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just to make this more explicit, now we can do things like:

Suggested change
if (!file.exists(opt$cellhash_pool_file)) {
stop("Can't find cellhash_pool_file")
}
stopifnot("Can't find cellhash_pool_file" = file.exists(opt$cellhash_pool_file))

Copy link
Member

@jashapiro jashapiro left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is mostly good, but I had a couple hopefully quick thoughts.

I was torn about the first solution... I think that part of the reason I wanted to keep all tags in the unfiltered SCE is that we would be able to see more easily if there were some error in the data with respect to the assigned HTOs (if other HTOs came up that were not expected). But I looked at the QC report, and it seems we are indeed only reporting the expected HTOs, so I think this is probably fine, but I'm also not sure I see where the problem with including the ADT is?

I'm also a little worried about not making any cellhash AnnData objects. Are we not outputting those at all? I can see that we might want to at some point, so I think it may be better to have them than not, at least in the case where we have more than one HTO. So it may be better to just have bash check for the output file from the conversion and not exclude cellhash tables from conversion at all. Again, unless that is a problem for some other reason that I am not currently thinking of.

Comment on lines 163 to 165
if (!file.exists(opt$cellhash_pool_file)) {
stop("Can't find cellhash_pool_file")
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just to make this more explicit, now we can do things like:

Suggested change
if (!file.exists(opt$cellhash_pool_file)) {
stop("Can't find cellhash_pool_file")
}
stopifnot("Can't find cellhash_pool_file" = file.exists(opt$cellhash_pool_file))

@@ -12,7 +12,7 @@ process export_anndata{
script:
rna_hdf5_file = "${meta.library_id}_${file_type}_rna.hdf5"
feature_hdf5_file = "${meta.library_id}_${file_type}_${meta.feature_type}.hdf5"
feature_present = meta.feature_type in ["adt", "cellhash"]
feature_present = meta.feature_type == "adt"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Minor, but can we leave this as in for future expansion? Also, we should be sure to update the stub in parallel if we are changing logic here.

Suggested change
feature_present = meta.feature_type == "adt"
feature_present = meta.feature_type in ["adt"]

I guess I'm not totally clear here on not wanting the HTO features to come out as AnnData? Can you elaborate on that?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Another thought is that we can check whether ${feature_hdf5_file} exists after sce_to_anndata.R and only run move_counts_anndata.py on that file if it exists. sce_to_anndata.R already prints a warning in this case, so I think we should be fine on that front.

@allyhawkins
Copy link
Member Author

But I looked at the QC report, and it seems we are indeed only reporting the expected HTOs, so I think this is probably fine, but I'm also not sure I see where the problem with including the ADT is?

I'm not entirely sure what you mean here? This should only be affecting cell hash data, so not sure what you mean by including the ADT here?

I'm also a little worried about not making any cellhash AnnData objects. Are we not outputting those at all? I can see that we might want to at some point, so I think it may be better to have them than not, at least in the case where we have more than one HTO. So it may be better to just have bash check for the output file from the conversion and not exclude cellhash tables from conversion at all. Again, unless that is a problem for some other reason that I am not currently thinking of.

We had made the decision to not convert any multiplexed libraries to AnnData objects. So we don't want to go through the export_anndata process, if the library contains cell hashing. So I think if you want to change that design, that's an entirely different question and will also affect the portal.

@jashapiro
Copy link
Member

But I looked at the QC report, and it seems we are indeed only reporting the expected HTOs, so I think this is probably fine, but I'm also not sure I see where the problem with including the ADT is?
I'm not entirely sure what you mean here? This should only be affecting cell hash data, so not sure what you mean by including the ADT here?

Brain-finger disconnect. I basically meant the altexp with all HTO tags.

@jashapiro
Copy link
Member

We had made the decision to not convert any multiplexed libraries to AnnData objects. So we don't want to go through the export_anndata process, if the library contains cell hashing. So I think if you want to change that design, that's an entirely different question and will also affect the portal.

Yes, this does argue for just skipping all cellhash data! But I think it may still be worth doing the check in bash to prevent other possible errors, unless we always want those errors to kill the workflow.

@allyhawkins
Copy link
Member Author

Brain-finger disconnect. I basically meant the altexp with all HTO tags.

It's not entirely a problem, but I feel like it's a little misleading. If those tags weren't even used in the experiment, why would they be provided in the results? I think the case that you mentioned where a different oligo was recorded is definitely a possibility, but would you still trust that data if you didn't record the HTO correctly?

Ultimately, with the change to only set feature_present to True with ADT data only, then it probably doesn't matter anymore, but I think it's a question of if we want to provide it to users and if we think it's important to have.

@allyhawkins
Copy link
Member Author

Yes, this does argue for just skipping all cellhash data! But I think it may still be worth doing the check in bash to prevent other possible errors, unless we always want those errors to kill the workflow.

Do you mean adding a check before running move_counts_anndata.py that the files exist?

@jashapiro
Copy link
Member

It's not entirely a problem, but I feel like it's a little misleading. If those tags weren't even used in the experiment, why would they be provided in the results? I think the case that you mentioned where a different oligo was recorded is definitely a possibility, but would you still trust that data if you didn't record the HTO correctly?

I'm more worried about an error in the cellhash pool file... Having the info just makes it easier to check! Also, you might be able to see if there were barcodes that were more likely to be mis-assigned because of sequencing error. It isn't a great check, but it is something, which is why I would lean toward leaving it in.

Yes, this does argue for just skipping all cellhash data! But I think it may still be worth doing the check in bash to prevent other possible errors, unless we always want those errors to kill the workflow.

Do you mean adding a check before running move_counts_anndata.py that the files exist?

Yes. Something like:

if [ -f "${feature_hdf5_file}" ]; then
  move_counts_anndata.py --anndata_file "${feature_hdf5_file}"
fi

in place of

${feature_present ? "move_counts_anndata.py --anndata_file ${feature_hdf5_file}" : ''}

@allyhawkins
Copy link
Member Author

I updated this PR to keep the original plan of keeping all HTOs in the altExp for unfiltered objects. I then only made changes in the export AnnData process. Here, feature_present is only true if it's ADT and not cellhash. I also added a check for the existence of the feature file before running move_counts_anndata.py. I tested this and no cellhash HDF5 files were produced.

This should be ready for another review.

Copy link
Member

@jashapiro jashapiro left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me. I think we only want to move processed counts for features too, right?

"""
sce_to_anndata.R \
--input_sce_file ${sce_file} \
--output_rna_h5 ${rna_hdf5_file} \
--output_feature_h5 ${feature_hdf5_file} \
${feature_present ? "--output_feature_h5 ${feature_hdf5_file}" : ''} \
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't know if this had to be conditional, but it doesn't hurt.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For some reason I was still getting the unfiltered feature file being saved without the conditional, so I added it in to help prevent that.

modules/export-anndata.nf Outdated Show resolved Hide resolved
Co-authored-by: Joshua Shapiro <josh.shapiro@ccdatalab.org>
@allyhawkins allyhawkins merged commit cd4499b into main Jan 30, 2024
3 checks passed
@allyhawkins allyhawkins deleted the allyhawkins/filter-cellhash branch January 30, 2024 18:44
@allyhawkins allyhawkins mentioned this pull request Jan 31, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants