Remove any unused cell hash oligos from unfiltered and fix conversion issue #675

allyhawkins · 2024-01-25T18:35:06Z

We have one specific library that contains cell hashing oligos but has only one of those oligos and isn't actually multiplexed. This means that although it's just a regular single-cell library, it has an altExp with cell hash data for 1 HTO and has meta.feature_type = cellhash in the workflow.

For this specific library, we want to convert the RNA data to AnnData, but not the HTO altExp, which again should only have 1 HTO in it. However, the HTO altExp was being saved as AnnData for the unfiltered object only, and not for the processed and filtered objects. This is because we have a check that looks to see if the altExp has > 1 row. If it doesn't meet that criterion, the HTO data won't be saved as a separate AnnData object. This allowed this library to go through the conversion process but only output the RNA, or so we thought.

Problem 1: The unfiltered_cellhash.adt was being converted, when that shouldn't happen. When we quantify cell hash expression, we use an index that contains all possible HTOs found in all libraries in that project. This means that when we create the unfiltered SCE object, we add an altExp that has 1 row for every HTO found in the project. For this particular library, there is only one HTO that should be there, but we see all 12. Later, when we demux, we remove any HTOs that aren't assigned to that library based on the cellhash pool file we have. However, that removes the extra unexpressed HTOs from the filtered and processed objects but not from the unfiltered object. So the unfiltered altExp still has 12 rows, which passes the > 1 row check when converting to AnnData.
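
To make the interaction concrete, here is a minimal Python sketch of the row-count rule described above. The function name is hypothetical and the logic is only an approximation of the check in the conversion script; the row counts are the ones quoted for this library.

```python
def should_export_feature(n_altexp_rows: int) -> bool:
    """Mirror the described rule: export the altExp only if it has > 1 row."""
    return n_altexp_rows > 1


# Row counts described for this library's HTO altExp.
altexp_rows = {
    "unfiltered": 12,  # one row per HTO in the project-wide index
    "filtered": 1,     # unassigned HTOs removed during demultiplexing
    "processed": 1,
}

exported = {name: should_export_feature(n) for name, n in altexp_rows.items()}
print(exported)
```

Only the unfiltered object clears the threshold, which matches the stray unfiltered feature file.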

To combat this problem, I added a step to the script where we generate merged unfiltered objects that filters the cellhash data to contain only the HTOs that are actually present in that library. I think before we didn't want to do any filtering in "unfiltered", but I don't think it makes sense to have an object with HTOs that weren't even added to that library to begin with.
Note we might be able to remove the code I copied over from add_cellhash_calls.R from that script, but I don't think it really matters either way.
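
As a rough illustration of that filtering step, the sketch below keeps only the HTOs listed for a library. It is written in Python rather than R, and the HTO names, the counts, and the `keep_expected_htos` helper are all hypothetical; the real step works on the SCE altExp and the cellhash pool file.

```python
def keep_expected_htos(
    counts_by_hto: dict[str, list[int]], expected_htos: set[str]
) -> dict[str, list[int]]:
    """Keep only the rows whose HTO is listed for this library."""
    return {
        hto: counts
        for hto, counts in counts_by_hto.items()
        if hto in expected_htos
    }


# Hypothetical altExp contents: 12 HTOs from the project-wide index.
altexp = {f"HTO{i:02d}": [0, 0, 0] for i in range(1, 13)}
altexp["HTO03"] = [5, 9, 2]  # the only oligo actually added to this library

# Hypothetical pool-file entry for this library.
expected = {"HTO03"}

filtered_altexp = keep_expected_htos(altexp, expected)
print(len(altexp), "->", len(filtered_altexp))
```

With a single remaining row, the unfiltered altExp would no longer pass the > 1 row check.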

Problem 2: Because this one library is technically a "cellhash" library but doesn't actually have any cellhash altExp data, Nextflow is told that there should be an extra feature AnnData object to run through move_counts_anndata.py. But that's only done for the processed object, and that file does not, in fact, exist since it didn't pass the n > 1 rule. Generally, we just want to avoid accidentally creating any HTO altExp objects, so I removed "cellhash" from the options when checking whether a feature is present.
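
The mismatch can be sketched as follows. This is a simplified Python model, not the Nextflow code: the `expected_outputs` helper and the `library01` ID are hypothetical, and the file name pattern follows the one used in export-anndata.nf.

```python
def expected_outputs(library_id, file_type, feature_type, feature_types_with_output):
    """List the files the workflow expects for one library and object type."""
    outputs = [f"{library_id}_{file_type}_rna.hdf5"]
    if feature_type in feature_types_with_output:
        outputs.append(f"{library_id}_{file_type}_{feature_type}.hdf5")
    return outputs


# Only the RNA file is written: the 1-row altExp fails the > 1 row check.
produced = {"library01_processed_rna.hdf5"}

before = expected_outputs("library01", "processed", "cellhash", ["adt", "cellhash"])
after = expected_outputs("library01", "processed", "cellhash", ["adt"])

missing_before = [f for f in before if f not in produced]
missing_after = [f for f in after if f not in produced]
print(missing_before, missing_after)
```

Before the change, the workflow expects a cellhash file that was never written; after it, nothing is missing.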

sjspielman

Problem 2 (the bug I found) solution LGTM! Since I'm less familiar with the other code, I'll let @jashapiro weigh in primarily there.

sjspielman · 2024-01-25T18:48:31Z

bin/generate_unfiltered_sce.R

@@ -178,5 +196,6 @@ unfiltered_sce <- unfiltered_sce |>
  # add dataframe with sample metadata to sce metadata
  add_sample_metadata(metadata_df = sample_metadata_df)

+


Suggested change

sjspielman · 2024-01-25T18:48:55Z

modules/sce-processing.nf

+
+

Suggested change

sjspielman · 2024-01-25T18:51:06Z

bin/generate_unfiltered_sce.R

+    if (!file.exists(opt$cellhash_pool_file)) {
+      stop("Can't find cellhash_pool_file")
+    }


NB Josh recently discovered that stopifnot() got better when we weren't looking! I am not suggesting to change this code, just sharing the good news - https://blog.r-hub.io/2022/03/10/input-checking/


jashapiro

I think this is mostly good, but I had a couple hopefully quick thoughts.

I was torn about the first solution... I think that part of the reason I wanted to keep all tags in the unfiltered SCE is that we would be able to see more easily if there were some error in the data with respect to the assigned HTOs (if other HTOs came up that were not expected). But I looked at the QC report, and it seems we are indeed only reporting the expected HTOs, so I think this is probably fine, but I'm also not sure I see where the problem with including the ADT is?

I'm also a little worried about not making any cellhash AnnData objects. Are we not outputting those at all? I can see that we might want to at some point, so I think it may be better to have them than not, at least in the case where we have more than one HTO. So it may be better to just have bash check for the output file from the conversion and not exclude cellhash tables from conversion at all. Again, unless that is a problem for some other reason that I am not currently thinking of.

jashapiro · 2024-01-29T20:03:14Z

bin/generate_unfiltered_sce.R

+    if (!file.exists(opt$cellhash_pool_file)) {
+      stop("Can't find cellhash_pool_file")
+    }


Just to make this more explicit, now we can do things like:

Suggested change

-    if (!file.exists(opt$cellhash_pool_file)) {
-      stop("Can't find cellhash_pool_file")
-    }
+    stopifnot("Can't find cellhash_pool_file" = file.exists(opt$cellhash_pool_file))

jashapiro · 2024-01-29T20:10:52Z

modules/export-anndata.nf

@@ -12,7 +12,7 @@ process export_anndata{
    script:
      rna_hdf5_file = "${meta.library_id}_${file_type}_rna.hdf5"
      feature_hdf5_file = "${meta.library_id}_${file_type}_${meta.feature_type}.hdf5"
-      feature_present = meta.feature_type in ["adt", "cellhash"]
+      feature_present = meta.feature_type == "adt"


Minor, but can we leave this as an "in" check for future expansion? Also, we should be sure to update the stub in parallel if we are changing logic here.

Suggested change

-      feature_present = meta.feature_type == "adt"
+      feature_present = meta.feature_type in ["adt"]

I guess I'm not totally clear here on not wanting the HTO features to come out as AnnData? Can you elaborate on that?

Another thought is that we can check whether ${feature_hdf5_file} exists after sce_to_anndata.R and only run move_counts_anndata.py on that file if it exists. sce_to_anndata.R already prints a warning in this case, so I think we should be fine on that front.

allyhawkins · 2024-01-29T22:36:01Z

> But I looked at the QC report, and it seems we are indeed only reporting the expected HTOs, so I think this is probably fine, but I'm also not sure I see where the problem with including the ADT is?

I'm not entirely sure what you mean here? This should only be affecting cell hash data, so not sure what you mean by including the ADT here?

> I'm also a little worried about not making any cellhash AnnData objects. Are we not outputting those at all? I can see that we might want to at some point, so I think it may be better to have them than not, at least in the case where we have more than one HTO. So it may be better to just have bash check for the output file from the conversion and not exclude cellhash tables from conversion at all. Again, unless that is a problem for some other reason that I am not currently thinking of.

We had made the decision to not convert any multiplexed libraries to AnnData objects. So we don't want to go through the export_anndata process, if the library contains cell hashing. So I think if you want to change that design, that's an entirely different question and will also affect the portal.

jashapiro · 2024-01-29T22:54:18Z

> > But I looked at the QC report, and it seems we are indeed only reporting the expected HTOs, so I think this is probably fine, but I'm also not sure I see where the problem with including the ADT is?
>
> I'm not entirely sure what you mean here? This should only be affecting cell hash data, so not sure what you mean by including the ADT here?

Brain-finger disconnect. I basically meant the altexp with all HTO tags.

jashapiro · 2024-01-29T22:58:20Z

> We had made the decision to not convert any multiplexed libraries to AnnData objects. So we don't want to go through the export_anndata process, if the library contains cell hashing. So I think if you want to change that design, that's an entirely different question and will also affect the portal.

Yes, this does argue for just skipping all cellhash data! But I think it may still be worth doing the check in bash to prevent other possible errors, unless we always want those errors to kill the workflow.

allyhawkins · 2024-01-29T22:59:52Z

> Brain-finger disconnect. I basically meant the altexp with all HTO tags.

It's not entirely a problem, but I feel like it's a little misleading. If those tags weren't even used in the experiment, why would they be provided in the results? I think the case that you mentioned where a different oligo was recorded is definitely a possibility, but would you still trust that data if you didn't record the HTO correctly?

Ultimately, with the change to set feature_present to true only for ADT data, it probably doesn't matter anymore, but I think it's a question of whether we want to provide it to users and whether we think it's important to have.

allyhawkins · 2024-01-29T23:01:21Z

> Yes, this does argue for just skipping all cellhash data! But I think it may still be worth doing the check in bash to prevent other possible errors, unless we always want those errors to kill the workflow.

Do you mean adding a check before running move_counts_anndata.py that the files exist?

jashapiro · 2024-01-30T00:02:24Z

> It's not entirely a problem, but I feel like it's a little misleading. If those tags weren't even used in the experiment, why would they be provided in the results? I think the case that you mentioned where a different oligo was recorded is definitely a possibility, but would you still trust that data if you didn't record the HTO correctly?

I'm more worried about an error in the cellhash pool file... Having the info just makes it easier to check! Also, you might be able to see if there were barcodes that were more likely to be mis-assigned because of sequencing error. It isn't a great check, but it is something, which is why I would lean toward leaving it in.

> > Yes, this does argue for just skipping all cellhash data! But I think it may still be worth doing the check in bash to prevent other possible errors, unless we always want those errors to kill the workflow.
>
> Do you mean adding a check before running move_counts_anndata.py that the files exist?

Yes. Something like:

if [ -f "${feature_hdf5_file}" ]; then
  move_counts_anndata.py --anndata_file "${feature_hdf5_file}"
fi

in place of

${feature_present ? "move_counts_anndata.py --anndata_file ${feature_hdf5_file}" : ''}


allyhawkins · 2024-01-30T16:52:44Z

I updated this PR to keep the original plan of keeping all HTOs in the altExp for unfiltered objects. I then only made changes in the export AnnData process. Here, feature_present is only true if it's ADT and not cellhash. I also added a check for the existence of the feature file before running move_counts_anndata.py. I tested this and no cellhash HDF5 files were produced.
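
The two guards described above can be modelled roughly as follows. This is a Python sketch of the logic only, not the Nextflow or bash code in the process; the helper names and the `library01` file names are hypothetical.

```python
import tempfile
from pathlib import Path


def conversion_args(feature_type: str, feature_file: Path) -> list[str]:
    """Build the feature-related arguments for the conversion step."""
    feature_present = feature_type in ["adt"]  # cellhash no longer counts
    return ["--output_feature_h5", str(feature_file)] if feature_present else []


def should_move_counts(feature_file: Path) -> bool:
    """Only post-process the feature file if conversion actually wrote it."""
    return feature_file.is_file()


with tempfile.TemporaryDirectory() as tmp:
    adt_file = Path(tmp) / "library01_processed_adt.hdf5"
    cellhash_file = Path(tmp) / "library01_processed_cellhash.hdf5"
    adt_file.touch()  # stand-in for a file written by the conversion step

    adt_args = conversion_args("adt", adt_file)
    cellhash_args = conversion_args("cellhash", cellhash_file)
    move_adt = should_move_counts(adt_file)
    move_cellhash = should_move_counts(cellhash_file)

print(bool(adt_args), bool(cellhash_args), move_adt, move_cellhash)
```

The existence check is redundant when feature_present is already false, but it also covers cases where a feature file is expected and the conversion step skips it.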

This should be ready for another review.

jashapiro

Looks good to me. I think we only want to move processed counts for features too, right?

jashapiro · 2024-01-30T17:35:04Z

modules/export-anndata.nf

      """
      sce_to_anndata.R \
        --input_sce_file ${sce_file} \
        --output_rna_h5 ${rna_hdf5_file} \
-        --output_feature_h5 ${feature_hdf5_file} \
+        ${feature_present ? "--output_feature_h5 ${feature_hdf5_file}" : ''} \


I don't know if this had to be conditional, but it doesn't hurt.

For some reason I was still getting the unfiltered feature file being saved without the conditional, so I added it in to help prevent that.


allyhawkins and others added 5 commits January 25, 2024 11:03

pass cellhash pool file for creating unfiltered

1b97dd2

remove cellhash that don't exist in unfiltered

28ce238

make sure to skip cellhash for moving feature counts

dadb516

just don't convert adt for cellhash

4cb7610

equals not in

ccf8a90

allyhawkins requested review from jashapiro and sjspielman January 25, 2024 18:35

missing quotes

90b2221

sjspielman reviewed Jan 25, 2024


allyhawkins mentioned this pull request Jan 25, 2024

Prepare for scpca-nf release v0.7.2 #669

Closed

12 tasks

jashapiro reviewed Jan 29, 2024


allyhawkins added 3 commits January 30, 2024 09:55

Revert "pass cellhash pool file for creating unfiltered"

f957926

This reverts commit 1b97dd2.

Revert "remove cellhash that don't exist in unfiltered"

67a5648

This reverts commit 28ce238.

check for feature file

168c645

allyhawkins requested a review from jashapiro January 30, 2024 16:52

jashapiro approved these changes Jan 30, 2024


only move processed feature counts

f196b1d

Co-authored-by: Joshua Shapiro <josh.shapiro@ccdatalab.org>

allyhawkins merged commit cd4499b into main Jan 30, 2024
3 checks passed

allyhawkins deleted the allyhawkins/filter-cellhash branch January 30, 2024 18:44

allyhawkins mentioned this pull request Jan 31, 2024

Prep for v0.7.2 #680

Merged
