Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add CellAssign process to main workflow #476

Closed
wants to merge 7 commits into from

Conversation

sjspielman
Copy link
Member

@sjspielman sjspielman commented Sep 29, 2023

Towards #415

This PR takes the next steps towards celltyping moving into the main workflow:

  • I added branching to create two channels in the subworkflow, celltype_input_ch.run and celltype_input_ch.skip, based on whether or not the singler file value is null (is there a better way to check if a value is null here? I couldn't find it!)
    • However, I am not currently checking if the file exists. This probably needs to happen somewhere before going into the process, but wanted to get feedback on this branching first before upgrading it.
  • I then prepared a channel for heading into CellAssign, and updated the CellAssign process itself. I included the same type of branch here as well - skip the cellassign if its file value is null. I again have no checks for the file existing and I am also not checking the reference name. Do we think I should?
  • I have written the code with the SingleR output files coming along for the right into the CellAssign process, but I left a couple TODO comments about where we would change code to have that not happen. In that case, we'd join these files back up with cell assign output for input to the final process that will add annotations to the SCE, in the next PR.
  • A couple more TODOs for subsequent PRs are sprinkled in there, too. Let me know if anything about them is unclear.
  • (Edit) - I also updated the stub celltype file to have some NAs, to ensure the flow of the skipped branches gets "tested".

Copy link
Member

@jashapiro jashapiro left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I started with some suggestions, but I ended up realizing we were missing something, which requires a bit more of a rewrite.

I'm recommending changing the output to be a folder, which means a few different things, discussed below. We might want to do that first as a separate PR for the singleR to test things out, even though I made all my suggestions here.

Oh, and checking for null: !thing is usally sufficient (and would be here)
It is true if thing is null, as null is treated as false in logic, unlike in R (as is 0 or "" or [], but you usually want those to be treated as false as well)

# Convert SCE to AnnData
sce_to_anndata.R \
--input_sce_file ${processed_rds} \
--output_rna_h5 ${processed_hdf5}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We aren't outputting this, so the "Nextflow way" would be to just give it a file name for use in the process.

Suggested change
--output_rna_h5 ${processed_hdf5}
--output_rna_h5 anndata.hdf5

--input_sce_file ${processed_rds} \
--output_rna_h5 ${processed_hdf5}

# Run CellAssign
predict_cellassign.py \
--input_hdf5_file ${processed_hdf5} \
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Following earlier suggestion

Suggested change
--input_hdf5_file ${processed_hdf5} \
--input_hdf5_file anndata.hdf5 \

script:
cellassign_predictions = "${meta.library_id}_predictions.tsv"
processed_hdf5 = "${meta.library_id}_processed.hdf5"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Following other suggestions, we shouldn't need this

Suggested change
processed_hdf5 = "${meta.library_id}_processed.hdf5"

publishDir (
path: "${params.checkpoints_dir}/celltype/${meta.library_id}",
mode: 'copy',
pattern: "*{_predictions.tsv,.json}" // Only the prediction matrix (tsv) and meta
Copy link
Member

@jashapiro jashapiro Sep 29, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Noting that we don't actually write the metadata here, but we should. Missed this in the singleR step to.

To do this there are a couple of options: the easiest is probably to actually adjust the output so we are writing out a folder rather than the individual files. This actually makes a few things easier downstream, and removes a bunch if definitions... I will follow with suggestions implementing a version of this.

The first is to modify the pattern (sorry glob!)

Suggested change
pattern: "*{_predictions.tsv,.json}" // Only the prediction matrix (tsv) and meta
pattern: "cellassign"

Comment on lines +124 to +129
// we only run celltyping for rows with a singler model file
// branch here so we have meta and processed sce in the .skip
.branch{
skip: it[2] == null
run: true
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we probably want this to be done separately for singler and for cellassign? So I would move this down to be part of the singler_input_ch definition, and then mix it back before going to cellassign step, repeating the logic.

Now I'm going back to my earlier thought about when to assign the files and use NO_FILE, because we could pretty easily use that for this branch logic, and then just pass all files to the singler process.

Pseudocode below: This assumes that there are no nulls, just file() results

singler_input_ch = celltype_input_ch
  .branch{
    skip: it[2].name == "NO_FILE"
    run: true
  }

  classify_singleR(singler_input_ch.run)
  
cellassign_input_ch = classify_singleR.out
   // add on blank file for skipped singleR results and mix back in
  .mix(singler_input_ch.skip.map{it.asList() + [ file(empty_file) ] )
  .branch{
    skip: it[3].name == "NO_FILE"
    run: true
  }

output:
tuple val(meta), path(cellassign_predictions), val(ref_name)
tuple val(meta), path(processed_rds), path(singler_annotations_tsv), path(singler_full_results), path(cellassign_predictions_tsv)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm aspirationally changing the singler annotations output here to assume you have changed that process as well to be similar. Note that the cellassign path is now a constant, which we will create below

Suggested change
tuple val(meta), path(processed_rds), path(singler_annotations_tsv), path(singler_full_results), path(cellassign_predictions_tsv)
tuple val(meta), path(processed_rds), path(singler_dir), path("cellassign")

The updated input: would be:

tuple val(meta), path(processed_rds), path(singler_dir), path(cellassign_ref)

"""
# Convert SCE to AnnData
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

create an output directory

Suggested change
# Convert SCE to AnnData
# create output directory
mkdir -p cellassign
# Convert SCE to AnnData

predict_cellassign.py \
--input_hdf5_file ${processed_hdf5} \
--output_predictions ${cellassign_predictions} \
--reference ${cellassign_reference_mtx} \
--output_predictions ${cellassign_predictions_tsv} \
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Constant path for output

Suggested change
--output_predictions ${cellassign_predictions_tsv} \
--output_predictions cellassign/predictions.tsv \

Comment on lines 66 to 67
--threads ${task.cpus}
"""
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
--threads ${task.cpus}
"""
--threads ${task.cpus}
# write out metadata for tracking
echo ${echo ${Utils.makeJson(meta)} > cellassign/scpca-meta.json
"""

Comment on lines +72 to +73
touch "${cellassign_predictions_tsv}"
touch "${processed_hdf5}"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You'll want to update the stub to create the output folder as well

@jashapiro
Copy link
Member

jashapiro commented Oct 2, 2023

Ignore most of this review and look at #477 first.

@sjspielman
Copy link
Member Author

Given changes in #477 & #478, I'm going to close out this PR and open a fresh one so that the PR is shorter without all the above "deprecated" comments. Bonus, no need to ever find out what conflict horrors await when merging those changes into this branch 😬

@sjspielman sjspielman closed this Oct 2, 2023
@sjspielman sjspielman deleted the sjspielman/add-cellassign-process branch October 2, 2023 15:48
@sjspielman sjspielman restored the sjspielman/add-cellassign-process branch October 2, 2023 17:09
@sjspielman sjspielman deleted the sjspielman/add-cellassign-process branch October 2, 2023 19:14
@sjspielman sjspielman mentioned this pull request Oct 2, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants