Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add CellAssign process #479

Merged
merged 12 commits into from
Oct 3, 2023
Merged

Conversation

sjspielman
Copy link
Member

@sjspielman sjspielman commented Oct 2, 2023

Towards #415

The next step is running cell assign, done here! I think given all the nice organizational changes in #476 & #477 this is pretty straightforward for review. The one additional thing to note is that I realized we probably need to be including singler_reference_name in the same way that we include cellassign_reference_name. This wasn't previously needed since we had been doing singler celltyping and adding to SCE object in the same process, where the reference name was accessible from the trained singler model itself. Now that we will be adding singler & cellassign to the final step all at once in the last process, we'll need to grab both reference names for input.

I also have one discussion-y comment to bring up at this point - my memory from our figjam session 🎸 was that we had decided not to keep using indicator variables in meta, e.g. meta.has_singler=true (edit lowercase true, this isn't R). I am not seeing any notes that explicitly say this though, so let's make sure I am remembering correctly! We can/should certainly still add relevant information to SCE metadata as we had been doing, but with our current celltyping workflow structure I don't really see the need for anything in the json metadata unless I am missing something?

Noting the next PR will write the process & R script for adding celltypes into the processed SCE.

Copy link
Member

@jashapiro jashapiro left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good. I suggested a few changes:

  • Using static filenames within the "bash script" of the process. Since these files are isolated and never "escape" the process, it isn't worth assigning them library-specific names.
  • Similarly, for checkpoint files, it doesn't seem worth the effort to give them special names: the directory they are in will already encode specificity to the library
  • Removing the singler_reference_name variable. We didn't have it before because it is already in the .rds files.
    • Speaking of the .rds files: If we keep this structure where we are not passing the singler results directly along to cellassign, we might want to trim the output singler_results.rds file in classify_SingleR.R output to save space. This would be a separate PR, and I would probably want to make trimming an option at the script level (which we would invoke in the process). Ignore this: I was confused about the actual output at this point... we aren't outputting the full SCE like before.
  • Finally: no need to add extra values to the input channel for assign_celltypes: As long as they are in meta, we can access that from within the process.

modules/classify-celltypes.nf Outdated Show resolved Hide resolved
modules/classify-celltypes.nf Outdated Show resolved Hide resolved
modules/classify-celltypes.nf Outdated Show resolved Hide resolved
modules/classify-celltypes.nf Outdated Show resolved Hide resolved
modules/classify-celltypes.nf Outdated Show resolved Hide resolved
modules/classify-celltypes.nf Outdated Show resolved Hide resolved
Comment on lines 124 to 125
// singler reference name
Utils.parseNA(it.singler_ref_name),
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The reason we didn't have this before is because it is encoded in the rds file of the SingleR model. Since the cellassign input is just a tsv, we don't have that info.

When we pass along the SingleR results, that also should have the reference name in the output .rds file. Since we have to read that in anyway to get the full results tables we want to add to the output, we will have that information and don't need to pass it separately through the workflow.

Suggested change
// singler reference name
Utils.parseNA(it.singler_ref_name),

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

just noting i'll circle back to this suggestion after final.final discussion in the PR!

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The reason we didn't have this before is because it is encoded in the rds file of the SingleR model. Since the cellassign input is just a tsv, we don't have that info.

I was a bit confused: the reference name was being parsed out of the file name:

model_names <- stringr::str_remove(basename(model_files), "_model.rds")

Looking back at #402, it seems like we added that variable to the table mostly for convenience, but we could have parsed it out of the file name. It seems like we are using a shorthand at the moment, but the full file name (less .tsv) should probably be included as the name (i.e. "PanglaoDB-blood" rather than "blood"). In nextflow this can be done with file("path/my_model.tsv").baseName which strips both the directory and the extension.

And fun addition, if we want to do the same string parsing for the singler model files, which end with "_model.rds", we could do: file("path/my_model.tsv").baseName - '_model' which is so odd, but it works. If you want to be safer, use a little regex instead: file("path/my_model.tsv").baseName - ~'_model$'. Yes, you can make a regex by just adding a ~ before a quote. Groovy is so strange, but kind of cool sometimes.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

just noting i'll circle back to this suggestion after final.final discussion in the PR!

My current suggestion is to punt this to a separate PR: it isn't really part of getting the cellassign process done.

Comment on lines 138 to 143
.map{ project_id, meta, processed_sce, singler_model_file, singler_reference_name, cellassign_reference_file, cellassign_reference_name ->
meta.celltype_publish_dir = "${params.checkpoints_dir}/celltype/${meta.library_id}";
meta.singler_dir = "${meta.celltype_publish_dir}/${meta.library_id}_singler";
meta.cellassign_dir = "${meta.celltype_publish_dir}/${meta.library_id}_cellassign";
meta.singler_model_file = singler_model_file;
meta.singler_model_file = singler_model_file;
meta.singler_reference_name = singler_reference_name;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just noting updating these to remove the singler_reference_name as well.

modules/classify-celltypes.nf Outdated Show resolved Hide resolved
@jashapiro
Copy link
Member

I also have one discussion-y comment to bring up at this point - my memory from our figjam session 🎸 was that we had decided not to keep using indicator variables in meta, e.g. meta.has_singler=true (edit lowercase true, this isn't R). I am not seeing any notes that explicitly say this though, so let's make sure I am remembering correctly! We can/should certainly still add relevant information to SCE metadata as we had been doing, but with our current celltyping workflow structure I don't really see the need for anything in the json metadata unless I am missing something?

The presence of a reference file(s) for singleR and/or cellassign will be enough to say whether that processing has happened. There should not be any need for a separate boolean value.

@jashapiro
Copy link
Member

  • Removing the singler_reference_name variable. We didn't have it before because it is already in the .rds files.

I realized that I am not 100% sure whether the reference name is in the output .rds object. It was there when we included the full SCE object, but maybe not now? I'd still say it is better if we can include it in the object itself, rather than creating an extra meta variable and associated column that we would have to add to the input table.

@sjspielman
Copy link
Member Author

sjspielman commented Oct 3, 2023

A couple thoughts on reference names -

  1. The singler trained model name and the cellassign reference names are both present in their respective file names, .e.g. BlueprintEncodeData_model.rds & PanglaoDB-blood.tsv. So, since these filenames are in meta, I guess we don't need to pass either the singler or cellassign reference names in separately. We could extract this info from filenames with some string manipulation. (Not loving that we have _ vs -, though..).
  2. In terms of adding the singler model name to the singler object saved inrds, I'm not sure how that would play out - my idea was to keep that as the exact structure that singler exports so that people can actually use it if they want to with singler's native functions, e.g. heatmap. It's a DataFrame, so we'd have to add a column with model name which should be fine unless there is crazy strict checking in singler's functions. But, if we do my point in 1^ , then this isn't needed.

Edit - I realize we may be saying different things when referring to RDS files!

  • In my point 1, I'm referring to the "trained" model rds files saved scpca-references/celltype/singler_models/
  • In my point 2, I'm referring to the rds output from classifying singler.

@sjspielman
Copy link
Member Author

sjspielman commented Oct 3, 2023

I realized that I am not 100% sure whether the reference name is in the output .rds object. It was there when we included the full SCE object, but maybe not now?

It is not in the output rds from the singler process at this point:

--output_singler_results_file "${singler_dir}/singler_results.rds" \

sjspielman and others added 4 commits October 3, 2023 09:13
Co-authored-by: Joshua Shapiro <josh.shapiro@ccdatalab.org>
Co-authored-by: Joshua Shapiro <josh.shapiro@ccdatalab.org>
@jashapiro
Copy link
Member

I realized that I am not 100% sure whether the reference name is in the output .rds object. It was there when we included the full SCE object, but maybe not now?

It is not in the output rds from the singler process at this point:

--output_singler_results_file "${singler_dir}/singler_results.rds" \

What I meant by this is whether the reference name makes it into the DataFrame object. When we were outputting the full SCE, we were tagging the SingleR results in the metadata by the reference name, so we were fine there. But looking into it a bit more, I don't think the SingleR results DataFrame itself contains that information. But we can add it quite trivially, as DataFrame objects have metadata slots just like the SCE. So we could add in something like metadata(singler_result)$singler_reference <- singler_reference_name to store the reference name along with the results itself. We can't do that with a TSV output from cellassign unless we add a comment line or separate column, which is a whole mess of its own. We would have the scpca-meta.json to pull it from though anyway, so that may not be a concern.

If we can parse the reference name out of the file name to add to the nextflow meta object rather than using a separate column in the input table, I might prefer that, but w might have to revisit the logic of adding the cellassign reference name in the first place.

@sjspielman
Copy link
Member Author

sjspielman commented Oct 3, 2023

But we can add it quite trivially, as DataFrame objects have metadata slots just like the SCE. So we could add in something like metadata(singler_result)$singler_reference <- singler_reference_name to store the reference name along with the results itself.

Right, I would just want to make sure this doesn't cause singler's native functions to break, which it really shouldn't. Let me double check this*...

Edit - confirmed, SingleR doesn't care about extra metadata in the DataFrame object.

@jashapiro
Copy link
Member

jashapiro commented Oct 3, 2023

But we can add it quite trivially, as DataFrame objects have metadata slots just like the SCE. So we could add in something like metadata(singler_result)$singler_reference <- singler_reference_name to store the reference name along with the results itself.

Right, I would just want to make sure this doesn't cause singler's native functions to break, which it really shouldn't. Let me double check though...

As long as we are adding in metadata, I'm fine with passing in the reference names for both cellassign and singler directly as an argument to the assign script. The only thing we have to be cautious about (and we have to do this anyway) is that we need to make sure we are passing in the correct value. It is possible for the meta object to become disconnected from the file name/reference that was actually used if we are not careful, but I expect that the repeat_celltyping logic should handle that when we add it.

@jashapiro jashapiro closed this Oct 3, 2023
@jashapiro jashapiro reopened this Oct 3, 2023
@sjspielman
Copy link
Member Author

As long as we are adding in metadata, I'm fine with passing in the reference names for both cellassign and singler directly as an argument to the assign script. The only thing we have to be cautious about (and we have to do this anyway) is that we need to make sure we are passing in the correct value.

Feels like this could serve as an extra check as well - we'd want to make sure the singler reference name that gets passed in is the same as the name recorded in the singler DataFrame's metadata.

@sjspielman
Copy link
Member Author

sjspielman commented Oct 3, 2023

Changes since last time:

  • 9de11fc: I got rid of all tracking of the reference names in favor of "regexing them," if you will..., out of the reference/model file names directly.
  • 09565d1: I updated the SinglerR script to add the reference name as DataFrame metadata, with a couple checks to go with it

EDIT!

We would have the scpca-meta.json to pull it from though anyway, so that may not be a concern.

Based on my code changes, ^ requires pulling out of the reference name for cellassign, which seems like what we were aiming for anyways?

bin/classify_SingleR.R Outdated Show resolved Hide resolved
Copy link
Member

@jashapiro jashapiro left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks fine, with a few small suggestions (some just whitespace).
Any concerns related to tracking reference file names can be settled when we are actually using that part.

bin/classify_SingleR.R Outdated Show resolved Hide resolved
bin/classify_SingleR.R Outdated Show resolved Hide resolved
bin/classify_SingleR.R Outdated Show resolved Hide resolved
modules/classify-celltypes.nf Outdated Show resolved Hide resolved
modules/classify-celltypes.nf Outdated Show resolved Hide resolved
Comment on lines 137 to -122
meta.cellassign_reference_file = cellassign_reference_file;
meta.cellassign_reference_name = cellassign_reference_name;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I assume that by removing this here, you plan to add some code to add_celltypes_to_sce to pull out the cellassign reference name?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, that was the idea!

Co-authored-by: Joshua Shapiro <josh.shapiro@ccdatalab.org>
@sjspielman
Copy link
Member Author

Just for posterity...

  • stub failed for 3c2715c, but none of these changes really should have caused that?
  • Locally, stub ran fine for me at that commit!
  • I retriggered the CI stub check and now it's passing 🤷‍♀️

@sjspielman sjspielman merged commit 1e3ab54 into development Oct 3, 2023
3 checks passed
@sjspielman sjspielman deleted the sjspielman/add-process-for-cellassign branch October 3, 2023 19:49
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants