Add CellAssign process to main workflow #476
…Os in script for where to change strategies
I started with some suggestions, but I ended up realizing we were missing something, which requires a bit more of a rewrite.
I'm recommending changing the output to be a folder, which means a few different things, discussed below. We might want to do that first as a separate PR for the singleR step to test things out, even though I made all my suggestions here.
Oh, and checking for null: `!thing` is usually sufficient (and would be here). It is true if `thing` is `null`, as `null` is treated as `false` in logic, unlike in R (as is `0` or `""` or `[]`, but you usually want those to be treated as `false` as well).
# Convert SCE to AnnData
sce_to_anndata.R \
  --input_sce_file ${processed_rds} \
  --output_rna_h5 ${processed_hdf5}
We aren't outputting this, so the "Nextflow way" would be to just give it a file name for use in the process.
-   --output_rna_h5 ${processed_hdf5}
+   --output_rna_h5 anndata.hdf5
  --input_sce_file ${processed_rds} \
  --output_rna_h5 ${processed_hdf5}

# Run CellAssign
predict_cellassign.py \
  --input_hdf5_file ${processed_hdf5} \
Following earlier suggestion
-   --input_hdf5_file ${processed_hdf5} \
+   --input_hdf5_file anndata.hdf5 \
script:
  cellassign_predictions = "${meta.library_id}_predictions.tsv"
  processed_hdf5 = "${meta.library_id}_processed.hdf5"
Following other suggestions, we shouldn't need this
-   processed_hdf5 = "${meta.library_id}_processed.hdf5"
publishDir (
  path: "${params.checkpoints_dir}/celltype/${meta.library_id}",
  mode: 'copy',
  pattern: "*{_predictions.tsv,.json}" // Only the prediction matrix (tsv) and meta
Noting that we don't actually write the metadata here, but we should. Missed this in the singleR step too.
To do this there are a couple of options: the easiest is probably to adjust the output so we are writing out a folder rather than the individual files. This makes a few things easier downstream, and removes a bunch of definitions... I will follow with suggestions implementing a version of this.
The first is to modify the pattern (sorry glob!):
-   pattern: "*{_predictions.tsv,.json}" // Only the prediction matrix (tsv) and meta
+   pattern: "cellassign"
// we only run celltyping for rows with a singler model file
// branch here so we have meta and processed sce in the .skip
.branch{
  skip: it[2] == null
  run: true
}
I think we probably want this to be done separately for singler and for cellassign? So I would move this down to be part of the `singler_input_ch` definition, and then mix it back before going to the cellassign step, repeating the logic.
Now I'm going back to my earlier thought about when to assign the files and use `NO_FILE`, because we could pretty easily use that for this branch logic, and then just pass all files to the singler process.
Pseudocode below. This assumes that there are no nulls, just `file()` results:
singler_input_ch = celltype_input_ch
  // skip rows where the singler model is the NO_FILE placeholder
  .branch{
    skip: it[2].name == "NO_FILE"
    run: true
  }

classify_singleR(singler_input_ch.run)

cellassign_input_ch = classify_singleR.out
  // add on blank file for skipped singleR results and mix back in
  .mix(singler_input_ch.skip.map{ it.asList() + [ file(empty_file) ] })
  .branch{
    skip: it[3].name == "NO_FILE"
    run: true
  }
output:
  tuple val(meta), path(cellassign_predictions), val(ref_name)
  tuple val(meta), path(processed_rds), path(singler_annotations_tsv), path(singler_full_results), path(cellassign_predictions_tsv)
I'm aspirationally changing the singler annotations output here, assuming you have changed that process as well to be similar. Note that the cellassign path is now a constant, which we will create below.
-   tuple val(meta), path(processed_rds), path(singler_annotations_tsv), path(singler_full_results), path(cellassign_predictions_tsv)
+   tuple val(meta), path(processed_rds), path(singler_dir), path("cellassign")
The updated input would be:
  tuple val(meta), path(processed_rds), path(singler_dir), path(cellassign_ref)

"""
# Convert SCE to AnnData
Create an output directory:
- # Convert SCE to AnnData
+ # create output directory
+ mkdir -p cellassign
+ # Convert SCE to AnnData
predict_cellassign.py \
  --input_hdf5_file ${processed_hdf5} \
  --output_predictions ${cellassign_predictions} \
  --reference ${cellassign_reference_mtx} \
  --output_predictions ${cellassign_predictions_tsv} \
Constant path for output
-   --output_predictions ${cellassign_predictions_tsv} \
+   --output_predictions cellassign/predictions.tsv \
  --threads ${task.cpus}
"""
-   --threads ${task.cpus}
- """
+   --threads ${task.cpus}
+ # write out metadata for tracking
+ echo '${Utils.makeJson(meta)}' > cellassign/scpca-meta.json
+ """
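For illustration only, here is a rough standalone shell sketch of the metadata step. It runs outside Nextflow, so `meta_json` and its contents are hypothetical stand-ins for whatever string `Utils.makeJson(meta)` would interpolate into the script. The point is that quoting the value keeps the shell from stripping the double quotes inside the JSON.

```shell
# Hypothetical stand-in for the JSON string Nextflow would interpolate
# from Utils.makeJson(meta); the keys and values here are made up.
meta_json='{"library_id": "library01", "sample_id": "sample01"}'

# Create the output directory (no error if it already exists)
mkdir -p cellassign

# Quote the value so the JSON is written verbatim
echo "${meta_json}" > cellassign/scpca-meta.json
```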
touch "${cellassign_predictions_tsv}"
touch "${processed_hdf5}"
You'll want to update the stub to create the output folder as well
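Something like the following might work for the stub body. This is only a sketch, assuming the folder layout from the suggestions in this review (`cellassign/predictions.tsv` and `cellassign/scpca-meta.json`); the stub just needs to create empty placeholders at the same paths the real script writes.

```shell
# stub: create the same output folder as the real script
mkdir -p cellassign

# empty placeholders for the expected files inside the folder
touch cellassign/predictions.tsv
touch cellassign/scpca-meta.json
```

Since the process output would then be the `cellassign` directory itself, the stub no longer needs the per-library file name variables.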
Ignore most of this review and look at #477 first.
Towards #415
This PR takes the next steps towards celltyping moving into the main workflow:
`celltype_input_ch.run` and `celltype_input_ch.skip`, based on whether or not the singler file value is null (is there a better way to check if a value is `null` here? I couldn't find it!)
`NA`s, to ensure the flow of the skipped branches gets "tested".