Add CellAssign process #479
…ference_name variable. Run cellassign and prepare channel for next steps
Looks good. I suggested a few changes:
- Using static filenames within the "bash script" of the process. Since these files are isolated and never "escape" the process, it isn't worth assigning them library-specific names.
- Similarly, for checkpoint files, it doesn't seem worth the effort to give them special names: the directory they are in will already encode specificity to the library
- Removing the singler_reference_name variable. We didn't have it before because it is already in the .rds files. Speaking of the .rds files (ignore this: I was confused about the actual output at this point... we aren't outputting the full SCE like before): If we keep this structure where we are not passing the singler results directly along to cellassign, we might want to trim the output singler_results.rds file in classify_SingleR.R output to save space. This would be a separate PR, and I would probably want to make trimming an option at the script level (which we would invoke in the process).
- Finally: no need to add extra values to the input channel for assign_celltypes: As long as they are in meta, we can access that from within the process.
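A minimal plain-Groovy sketch of that last point, assuming the channel emits (meta, file) pairs. The library IDs, file names, and the `describe` closure are made up for illustration; in the workflow the closure would be a Nextflow process reading from `meta`.

```groovy
// Stand-in for channel items: each item is [meta, processed_sce].
// All values here are hypothetical.
def items = [
    [[library_id: 'lib01', cellassign_reference_file: 'refs/PanglaoDB-blood.tsv'], 'lib01_processed.rds'],
    [[library_id: 'lib02', cellassign_reference_file: ''], 'lib02_processed.rds'],
]

// Hypothetical stand-in for a process body: it receives only (meta, sce_file)
// and pulls anything else it needs out of meta.
def describe = { meta, sce_file ->
    "${meta.library_id}: ${sce_file} with reference '${meta.cellassign_reference_file}'".toString()
}

// Groovy spreads each two-element list across the closure's two parameters.
def summaries = items.collect { meta, sce_file -> describe(meta, sce_file) }
summaries.each { println it }
```

Because everything travels inside `meta`, the input tuple keeps the same shape no matter how many reference-related values are added later.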
modules/classify-celltypes.nf
Outdated
// singler reference name
Utils.parseNA(it.singler_ref_name),
The reason we didn't have this before is because it is encoded in the rds file of the SingleR model. Since the cellassign input is just a tsv, we don't have that info.
When we pass along the SingleR results, that also should have the reference name in the output .rds file. Since we have to read that in anyway to get the full results tables we want to add to the output, we will have that information and don't need to pass it separately through the workflow.
// singler reference name
Utils.parseNA(it.singler_ref_name),
just noting i'll circle back to this suggestion after final.final discussion in the PR!
The reason we didn't have this before is because it is encoded in the rds file of the SingleR model. Since the cellassign input is just a tsv, we don't have that info.
I was a bit confused: the reference name was being parsed out of the file name:
scpca-nf/bin/classify_SingleR.R
Line 78 in 67e175b
model_names <- stringr::str_remove(basename(model_files), "_model.rds")
Looking back at #402, it seems like we added that variable to the table mostly for convenience, but we could have parsed it out of the file name. It seems like we are using a shorthand at the moment, but the full file name (less .tsv) should probably be included as the name (i.e. "PanglaoDB-blood" rather than "blood"). In nextflow this can be done with file("path/my_model.tsv").baseName which strips both the directory and the extension.
And fun addition, if we want to do the same string parsing for the singler model files, which end with "_model.rds", we could do: file("path/my_model.tsv").baseName - '_model' which is so odd, but it works. If you want to be safer, use a little regex instead: file("path/my_model.tsv").baseName - ~'_model$'. Yes, you can make a regex by just adding a ~ before a quote. Groovy is so strange, but kind of cool sometimes.
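To make the difference between the two subtractions concrete, here is a small plain-Groovy sketch. The `baseName` closure is a hypothetical stand-in for Nextflow's file(...).baseName so the snippet runs without Nextflow, and the file names are invented.

```groovy
import java.nio.file.Paths

// Hypothetical stand-in for Nextflow's file(path).baseName:
// drop the directory, then drop the last extension.
def baseName = { String path ->
    def name = Paths.get(path).fileName.toString()
    def dot = name.lastIndexOf('.')
    dot > 0 ? name.substring(0, dot) : name
}

// CellAssign reference: the full file name, less .tsv
def cellassignName = baseName('refs/PanglaoDB-blood.tsv')

// A made-up SingleR model file whose name contains "_model" twice
def modelBase = baseName('models/blood_model_v2_model.rds')

// String minus String removes the FIRST occurrence, wherever it is
def plainMinus = modelBase - '_model'

// String minus Pattern removes the first regex match; the $ anchors it to the end
def regexMinus = modelBase - ~'_model$'

println cellassignName  // PanglaoDB-blood
println plainMinus      // blood_v2_model
println regexMinus      // blood_model_v2
```

The plain string version only misbehaves when "_model" also appears earlier in the name, which is the case the anchored regex guards against.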
just noting i'll circle back to this suggestion after final.final discussion in the PR!
My current suggestion is to punt this to a separate PR: it isn't really part of getting the cellassign process done.
modules/classify-celltypes.nf
Outdated
.map{ project_id, meta, processed_sce, singler_model_file, singler_reference_name, cellassign_reference_file, cellassign_reference_name ->
  meta.celltype_publish_dir = "${params.checkpoints_dir}/celltype/${meta.library_id}";
  meta.singler_dir = "${meta.celltype_publish_dir}/${meta.library_id}_singler";
  meta.cellassign_dir = "${meta.celltype_publish_dir}/${meta.library_id}_cellassign";
  meta.singler_model_file = singler_model_file;
  meta.singler_reference_name = singler_reference_name;
Just noting updating these to remove the singler_reference_name as well.
The presence of a reference file(s) for singleR and/or cellassign will be enough to say whether that processing has happened. There should not be any need for a separate boolean value.
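As a rough plain-Groovy illustration of that idea: this assumes a missing reference shows up in meta as null or an empty string (how Utils.parseNA actually represents NA is not shown here), and the `stepsFor` closure and all values are hypothetical.

```groovy
// Hypothetical meta maps; a missing reference is assumed to be null or ''.
def libraries = [
    [library_id: 'lib01', singler_model_file: 'models/ref_model.rds', cellassign_reference_file: ''],
    [library_id: 'lib02', singler_model_file: null, cellassign_reference_file: 'refs/PanglaoDB-blood.tsv'],
]

// Groovy truthiness treats both null and '' as false, so the file entries
// themselves can act as the indicator; no has_singler-style flag is needed.
def stepsFor = { meta ->
    def steps = []
    if (meta.singler_model_file) steps << 'singler'
    if (meta.cellassign_reference_file) steps << 'cellassign'
    steps
}

libraries.each { println "${it.library_id}: ${stepsFor(it)}" }
```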
I realized that I am not 100% sure whether the reference name is in the output
A couple thoughts on reference names -
Edit - I realize we may be saying different things when referring to RDS files!
It is not in the output rds from the singler process at this point: scpca-nf/modules/classify-celltypes.nf Line 25 in fb5f3bf
Co-authored-by: Joshua Shapiro <josh.shapiro@ccdatalab.org>
What I meant by this is whether the reference name makes it into the If we can parse the reference name out of the file name to add to the nextflow
Right, I would just want to make sure this doesn't cause singler's native functions to break, which it really shouldn't. Let me double check this*... Edit - confirmed, SingleR doesn't care about extra metadata in the DataFrame object.
As long as we are adding in metadata, I'm fine with passing in the reference names for both cellassign and singler directly as an argument to the assign script. The only thing we have to be cautious about (and we have to do this anyway) is that we need to make sure we are passing in the correct value. It is possible for the
Feels like this could serve as an extra check as well - we'd want to make sure the singler reference name that gets passed in is the same as the name recorded in the singler DataFrame's metadata.
…trings are also included in model filenames
Changes since last time:
EDIT!
Based on my code changes, ^ requires pulling out of the reference name for cellassign, which seems like what we were aiming for anyways?
This looks fine, with a few small suggestions (some just whitespace).
Any concerns related to tracking reference file names can be settled when we are actually using that part.
meta.cellassign_reference_file = cellassign_reference_file;
meta.cellassign_reference_name = cellassign_reference_name;
I assume that by removing this here, you plan to add some code to add_celltypes_to_sce to pull out the cellassign reference name?
Yes, that was the idea!
Co-authored-by: Joshua Shapiro <josh.shapiro@ccdatalab.org>
Just for posterity...
Towards #415
The next step is running cell assign, done here! I think given all the nice organizational changes in #476 & #477 this is pretty straightforward for review. The one additional thing to note is that I realized we probably need to be including singler_reference_name in the same way that we include cellassign_reference_name. This wasn't previously needed since we had been doing singler celltyping and adding to SCE object in the same process, where the reference name was accessible from the trained singler model itself. Now that we will be adding singler & cellassign to the final step all at once in the last process, we'll need to grab both reference names for input.
I also have one discussion-y comment to bring up at this point - my memory from our figjam session 🎸 was that we had decided not to keep using indicator variables in meta, e.g. meta.has_singler=true (edit lowercase true, this isn't R). I am not seeing any notes that explicitly say this though, so let's make sure I am remembering correctly! We can/should certainly still add relevant information to SCE metadata as we had been doing, but with our current celltyping workflow structure I don't really see the need for anything in the json metadata unless I am missing something?
Noting the next PR will write the process & R script for adding celltypes into the processed SCE.