Add CellAssign process #479

sjspielman · 2023-10-02T19:26:22Z

Towards #415

The next step is running cell assign, done here! I think given all the nice organizational changes in #476 & #477 this is pretty straightforward for review. The one additional thing to note is that I realized we probably need to be including singler_reference_name in the same way that we include cellassign_reference_name. This wasn't previously needed since we had been doing singler celltyping and adding to SCE object in the same process, where the reference name was accessible from the trained singler model itself. Now that we will be adding singler & cellassign to the final step all at once in the last process, we'll need to grab both reference names for input.

I also have one discussion-y comment to bring up at this point - my memory from our figjam session 🎸 was that we had decided not to keep using indicator variables in meta, e.g. meta.has_singler=true (edit lowercase true, this isn't R). I am not seeing any notes that explicitly say this though, so let's make sure I am remembering correctly! We can/should certainly still add relevant information to SCE metadata as we had been doing, but with our current celltyping workflow structure I don't really see the need for anything in the json metadata unless I am missing something?

Noting the next PR will write the process & R script for adding celltypes into the processed SCE.

…ference_name variable. Run cellassign and prepare channel for next steps

jashapiro

Looks good. I suggested a few changes:

Using static filenames within the "bash script" of the process. Since these files are isolated and never "escape" the process, it isn't worth assigning them library-specific names.
Similarly, for checkpoint files, it doesn't seem worth the effort to give them special names: the directory they are in will already encode specificity to the library
Removing the singler_reference_name variable. We didn't have it before because it is already in the .rds files.
- Speaking of the .rds files: If we keep this structure where we are not passing the singler results directly along to cellassign, we might want to trim the output singler_results.rds file in classify_SingleR.R output to save space. This would be a separate PR, and I would probably want to make trimming an option at the script level (which we would invoke in the process). Ignore this: I was confused about the actual output at this point... we aren't outputting the full SCE like before.
Finally: no need to add extra values to the input channel for assign_celltypes: As long as they are in meta, we can access that from within the process.

modules/classify-celltypes.nf

jashapiro · 2023-10-02T20:47:07Z

modules/classify-celltypes.nf

+         // singler reference name 
+         Utils.parseNA(it.singler_ref_name),       


The reason we didn't have this before is because it is encoded in the rds file of the SingleR model. Since the cellassign input is just a tsv, we don't have that info.

When we pass along the SingleR results, that also should have the reference name in the output .rds file. Since we have to read that in anyway to get the full results tables we want to add to the output, we will have that information and don't need to pass it separately through the workflow.

Suggested change

// singler reference name

Utils.parseNA(it.singler_ref_name),

just noting i'll circle back to this suggestion after final.final discussion in the PR!

The reason we didn't have this before is because it is encoded in the rds file of the SingleR model. Since the cellassign input is just a tsv, we don't have that info.

I was a bit confused: the reference name was being parsed out of the file name:

scpca-nf/bin/classify_SingleR.R

Line 78 in 67e175b

model_names <- stringr::str_remove(basename(model_files), "_model.rds")

Looking back at #402, it seems like we added that variable to the table mostly for convenience, but we could have parsed it out of the file name. It seems like we are using a shorthand at the moment, but the full file name (less .tsv) should probably be included as the name (i.e. "PanglaoDB-blood" rather than "blood"). In nextflow this can be done with file("path/my_model.tsv").baseName which strips both the directory and the extension.

And fun addition, if we want to do the same string parsing for the singler model files, which end with "_model.rds", we could do: file("path/my_model.tsv").baseName - '_model' which is so odd, but it works. If you want to be safer, use a little regex instead: file("path/my_model.tsv").baseName - ~'_model$'. Yes, you can make a regex by just adding a ~ before a quote. Groovy is so strange, but kind of cool sometimes.

just noting i'll circle back to this suggestion after final.final discussion in the PR!

My current suggestion is to punt this to a separate PR: it isn't really part of getting the cellassign process done.

jashapiro · 2023-10-02T20:47:36Z

modules/classify-celltypes.nf

+        .map{ project_id, meta, processed_sce, singler_model_file, singler_reference_name, cellassign_reference_file, cellassign_reference_name ->
          meta.celltype_publish_dir = "${params.checkpoints_dir}/celltype/${meta.library_id}";
          meta.singler_dir = "${meta.celltype_publish_dir}/${meta.library_id}_singler";
          meta.cellassign_dir = "${meta.celltype_publish_dir}/${meta.library_id}_cellassign";
-          meta.singler_model_file = singler_model_file;
+          meta.singler_model_file = singler_model_file; 
+          meta.singler_reference_name = singler_reference_name; 


Just noting updating these to remove the singler_reference_name as well.

modules/classify-celltypes.nf

jashapiro · 2023-10-03T00:57:47Z

I also have one discussion-y comment to bring up at this point - my memory from our figjam session 🎸 was that we had decided not to keep using indicator variables in meta, e.g. meta.has_singler=true (edit lowercase true, this isn't R). I am not seeing any notes that explicitly say this though, so let's make sure I am remembering correctly! We can/should certainly still add relevant information to SCE metadata as we had been doing, but with our current celltyping workflow structure I don't really see the need for anything in the json metadata unless I am missing something?

The presence of a reference file(s) for singleR and/or cellassign will be enough to say whether that processing has happened. There should not be any need for a separate boolean value.

jashapiro · 2023-10-03T12:26:14Z

Removing the singler_reference_name variable. We didn't have it before because it is already in the .rds files.

I realized that I am not 100% sure whether the reference name is in the output .rds object. It was there when we included the full SCE object, but maybe not now? I'd still say it is better if we can include it in the object itself, rather than creating an extra meta variable and associated column that we would have to add to the input table.

sjspielman · 2023-10-03T13:09:04Z

A couple thoughts on reference names -

The singler trained model name and the cellassign reference names are both present in their respective file names, .e.g. BlueprintEncodeData_model.rds & PanglaoDB-blood.tsv. So, since these filenames are in meta, I guess we don't need to pass either the singler or cellassign reference names in separately. We could extract this info from filenames with some string manipulation. (Not loving that we have _ vs -, though..).
In terms of adding the singler model name to the singler object saved inrds, I'm not sure how that would play out - my idea was to keep that as the exact structure that singler exports so that people can actually use it if they want to with singler's native functions, e.g. heatmap. It's a DataFrame, so we'd have to add a column with model name which should be fine unless there is crazy strict checking in singler's functions. But, if we do my point in 1^ , then this isn't needed.

Edit - I realize we may be saying different things when referring to RDS files!

In my point 1, I'm referring to the "trained" model rds files saved scpca-references/celltype/singler_models/
In my point 2, I'm referring to the rds output from classifying singler.

sjspielman · 2023-10-03T13:12:29Z

I realized that I am not 100% sure whether the reference name is in the output .rds object. It was there when we included the full SCE object, but maybe not now?

It is not in the output rds from the singler process at this point:

scpca-nf/modules/classify-celltypes.nf

Line 25 in fb5f3bf

--output_singler_results_file "${singler_dir}/singler_results.rds" \

Co-authored-by: Joshua Shapiro <josh.shapiro@ccdatalab.org>

jashapiro · 2023-10-03T14:03:07Z

I realized that I am not 100% sure whether the reference name is in the output .rds object. It was there when we included the full SCE object, but maybe not now?

It is not in the output rds from the singler process at this point:

scpca-nf/modules/classify-celltypes.nf

Line 25 in fb5f3bf

--output_singler_results_file "${singler_dir}/singler_results.rds" \

What I meant by this is whether the reference name makes it into the DataFrame object. When we were outputting the full SCE, we were tagging the SingleR results in the metadata by the reference name, so we were fine there. But looking into it a bit more, I don't think the SingleR results DataFrame itself contains that information. But we can add it quite trivially, as DataFrame objects have metadata slots just like the SCE. So we could add in something like metadata(singler_result)$singler_reference <- singler_reference_name to store the reference name along with the results itself. We can't do that with a TSV output from cellassign unless we add a comment line or separate column, which is a whole mess of its own. We would have the scpca-meta.json to pull it from though anyway, so that may not be a concern.

If we can parse the reference name out of the file name to add to the nextflow meta object rather than using a separate column in the input table, I might prefer that, but w might have to revisit the logic of adding the cellassign reference name in the first place.

sjspielman · 2023-10-03T14:43:29Z

But we can add it quite trivially, as DataFrame objects have metadata slots just like the SCE. So we could add in something like metadata(singler_result)$singler_reference <- singler_reference_name to store the reference name along with the results itself.

Right, I would just want to make sure this doesn't cause singler's native functions to break, which it really shouldn't. Let me double check this*...

Edit - confirmed, SingleR doesn't care about extra metadata in the DataFrame object.

jashapiro · 2023-10-03T14:47:38Z

But we can add it quite trivially, as DataFrame objects have metadata slots just like the SCE. So we could add in something like metadata(singler_result)$singler_reference <- singler_reference_name to store the reference name along with the results itself.

Right, I would just want to make sure this doesn't cause singler's native functions to break, which it really shouldn't. Let me double check though...

As long as we are adding in metadata, I'm fine with passing in the reference names for both cellassign and singler directly as an argument to the assign script. The only thing we have to be cautious about (and we have to do this anyway) is that we need to make sure we are passing in the correct value. It is possible for the meta object to become disconnected from the file name/reference that was actually used if we are not careful, but I expect that the repeat_celltyping logic should handle that when we add it.

sjspielman · 2023-10-03T15:47:37Z

As long as we are adding in metadata, I'm fine with passing in the reference names for both cellassign and singler directly as an argument to the assign script. The only thing we have to be cautious about (and we have to do this anyway) is that we need to make sure we are passing in the correct value.

Feels like this could serve as an extra check as well - we'd want to make sure the singler reference name that gets passed in is the same as the name recorded in the singler DataFrame's metadata.

…trings are also included in model filenames

…ociated checks

sjspielman · 2023-10-03T17:36:41Z

Changes since last time:

9de11fc: I got rid of all tracking of the reference names in favor of "regexing them," if you will..., out of the reference/model file names directly.
09565d1: I updated the SinglerR script to add the reference name as DataFrame metadata, with a couple checks to go with it

EDIT!

We would have the scpca-meta.json to pull it from though anyway, so that may not be a concern.

Based on my code changes, ^ requires pulling out of the reference name for cellassign, which seems like what we were aiming for anyways?

bin/classify_SingleR.R

jashapiro

This looks fine, with a few small suggestions (some just whitespace).
Any concerns related to tracking reference file names can be settled when we are actually using that part.

bin/classify_SingleR.R

modules/classify-celltypes.nf

jashapiro · 2023-10-03T19:12:23Z

modules/classify-celltypes.nf

          meta.cellassign_reference_file = cellassign_reference_file;
-          meta.cellassign_reference_name = cellassign_reference_name;


I assume that by removing this here, you plan to add some code to add_celltypes_to_sce to pull out the cellassign reference name?

Yes, that was the idea!

Co-authored-by: Joshua Shapiro <josh.shapiro@ccdatalab.org>

sjspielman · 2023-10-03T19:29:50Z

Just for posterity...

stub failed for 3c2715c, but none of these changes really should have caused that?
Locally, stub ran fine for me at that commit!
I retriggered the CI stub check and now it's passing 🤷‍♀️

sjspielman added 4 commits October 2, 2023 17:23

initial code for cellassign, not tested and probably buggy

f255997

Merge branch 'development' into sjspielman/add-process-for-cellassign

ead459e

Lots of debugging and updates, including passing through a singler_re…

3d7c1c9

…ference_name variable. Run cellassign and prepare channel for next steps

spacing and delete NO_NAME which isn't needed

474d8de

sjspielman requested a review from jashapiro October 2, 2023 19:28

jashapiro reviewed Oct 3, 2023

View reviewed changes

sjspielman and others added 4 commits October 3, 2023 09:13

Apply suggestions from code review

61b1b6f

Co-authored-by: Joshua Shapiro <josh.shapiro@ccdatalab.org>

Apply suggestions from code review

6ee6043

Co-authored-by: Joshua Shapiro <josh.shapiro@ccdatalab.org>

cellassign reference file variable name

4ada0c8

missing a quote

88039d7

jashapiro closed this Oct 3, 2023

jashapiro reopened this Oct 3, 2023

sjspielman added 2 commits October 3, 2023 17:28

remove singler and cellassign reference names from meta since those s…

9de11fc

…trings are also included in model filenames

Add reference name from model file to DataFrame metadata, and add ass…

09565d1

…ociated checks

sjspielman requested a review from jashapiro October 3, 2023 17:37

sjspielman commented Oct 3, 2023

View reviewed changes

bin/classify_SingleR.R Outdated Show resolved Hide resolved

switch to stringr::str_match

d686b60

jashapiro approved these changes Oct 3, 2023

View reviewed changes

Apply suggestions from code review

3c2715c

Co-authored-by: Joshua Shapiro <josh.shapiro@ccdatalab.org>

sjspielman merged commit 1e3ab54 into development Oct 3, 2023
3 checks passed

sjspielman deleted the sjspielman/add-process-for-cellassign branch October 3, 2023 19:49

sjspielman mentioned this pull request Oct 3, 2023

Split up cell types QC report #483

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add CellAssign process #479

Add CellAssign process #479

sjspielman commented Oct 2, 2023 •

edited

Loading

jashapiro left a comment •

edited

Loading

jashapiro Oct 2, 2023

sjspielman Oct 3, 2023

jashapiro Oct 3, 2023

jashapiro Oct 3, 2023

jashapiro Oct 2, 2023

jashapiro commented Oct 3, 2023

jashapiro commented Oct 3, 2023

sjspielman commented Oct 3, 2023 •

edited

Loading

sjspielman commented Oct 3, 2023 •

edited

Loading

jashapiro commented Oct 3, 2023

sjspielman commented Oct 3, 2023 •

edited

Loading

jashapiro commented Oct 3, 2023 •

edited

Loading

sjspielman commented Oct 3, 2023

sjspielman commented Oct 3, 2023 •

edited

Loading

jashapiro left a comment

jashapiro Oct 3, 2023

sjspielman Oct 3, 2023

sjspielman commented Oct 3, 2023

		// singler reference name
		Utils.parseNA(it.singler_ref_name),

		meta.cellassign_reference_file = cellassign_reference_file;
		meta.cellassign_reference_name = cellassign_reference_name;

Add CellAssign process #479

Add CellAssign process #479

Conversation

sjspielman commented Oct 2, 2023 • edited Loading

jashapiro left a comment • edited Loading

Choose a reason for hiding this comment

jashapiro Oct 2, 2023

Choose a reason for hiding this comment

sjspielman Oct 3, 2023

Choose a reason for hiding this comment

jashapiro Oct 3, 2023

Choose a reason for hiding this comment

jashapiro Oct 3, 2023

Choose a reason for hiding this comment

jashapiro Oct 2, 2023

Choose a reason for hiding this comment

jashapiro commented Oct 3, 2023

jashapiro commented Oct 3, 2023

sjspielman commented Oct 3, 2023 • edited Loading

sjspielman commented Oct 3, 2023 • edited Loading

jashapiro commented Oct 3, 2023

sjspielman commented Oct 3, 2023 • edited Loading

jashapiro commented Oct 3, 2023 • edited Loading

sjspielman commented Oct 3, 2023

sjspielman commented Oct 3, 2023 • edited Loading

jashapiro left a comment

Choose a reason for hiding this comment

jashapiro Oct 3, 2023

Choose a reason for hiding this comment

sjspielman Oct 3, 2023

Choose a reason for hiding this comment

sjspielman commented Oct 3, 2023

sjspielman commented Oct 2, 2023 •

edited

Loading

jashapiro left a comment •

edited

Loading

sjspielman commented Oct 3, 2023 •

edited

Loading

sjspielman commented Oct 3, 2023 •

edited

Loading

sjspielman commented Oct 3, 2023 •

edited

Loading

jashapiro commented Oct 3, 2023 •

edited

Loading

sjspielman commented Oct 3, 2023 •

edited

Loading