-
Notifications
You must be signed in to change notification settings - Fork 2
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Deal with small numbers of cells and CellAssign #662
Deal with small numbers of cells and CellAssign #662
Conversation
I think maybe skip it for small numbers of cells. My interpretation is that we are running into trouble when the validation set is getting too small... I might expect maybe that the 10 would fail with I would check with just 10 cells to see which of us is correct, and see if |
Okay, you're right. So less than 30 it doesn't work with the default and then anything 30 or above works. Based on your reply, I'll update this to just skip running CellAssign for anything < 30 cells. |
@jashapiro I changed how I was dealing with a failed CellAssign run when adding in annotations to the SCE object. If no assignments, then no additional metadata is added and the only addition is a note in the column for CellAssign cell type annotations that CellAssign failed to assign. Do we want to include a note in the report that CellAssign failed due to a low number of cells? Right now, it just won't show any plots that require CellAssign as it won't be present as a cell type method in the metadata. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks fine, with my main comment being about the order of options. I am always a fan of failing first, mostly because it is usually the shorter block, which makes following indentation/nesting easier.
The other significant comment is about the column value for skipped cells.
I made a comment about changing barcode
to barcodes
for consistency, but I think it is too late for that, given that we do not want to be rerunning CellAssign, so please ignore that comment.
# if the only column is the barcode column then CellAssign didn't complete successfully | ||
# create data frame with celltype and prediction as NA | ||
# celltype will later get converted to Unclassified cell | ||
if (colnames(predictions) == "barcode") { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just because I saw it in another review: are we doing singular barcode
here because of AnnData conventions or plural to match SCE?
bin/add_celltypes_to_sce.R
Outdated
} else { | ||
# if failed then note that in the cell type column | ||
sce$cellassign_celltype_annotation <- "CellAssign unable to assign cell types." | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we switch the if/else order here? I like the short block first for easier tracing
bin/add_celltypes_to_sce.R
Outdated
metadata(sce)$cellassign_reference_organs <- cellassign_organs | ||
} else { | ||
# if failed then note that in the cell type column | ||
sce$cellassign_celltype_annotation <- "CellAssign unable to assign cell types." |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a rather long string to put in every row! Can we come up with something shorter? CellAssign not run
or CellAssign skipped
or something like that?
Or even just Not run
or Skipped
with no CellAssign
since that is part of the column name?
bin/predict_cellassign.py
Outdated
model.train() | ||
predictions = model.predict() | ||
predictions["barcode"] = subset_adata.obs_names | ||
else: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Again, I would switch the order of these for consistency/ease. if subset_adata.n_obs < 30:
Co-authored-by: Joshua Shapiro <josh.shapiro@ccdatalab.org>
@jashapiro I switched the order of the Just wanted to also circle back to my previous question:
|
I saw this question, but I'm not really sure! I feel like it is probably worth a note, but I would not want to dwell on it. Skipping the plots definitely seems correct, though. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Approving, with the thought that if there are updates to the report, those can be a separate PR?
So I went to run the next set of projects and pretty quickly
CellAssign
failed forSCPCL000784
. I did some digging and this processed object only has 6 cells, which seems to be part of the reason it's failing. I think with so few cells, the default values that define the training set don't work. After doing some googling of the errors that I was seeing (which were slightly different locally vs. on Nextflow), I saw a lot of mention of the default training size not working with fewer items. So I went through the CellAssign docs, and noticed that if I set thetrain_size
to be smaller, then it worked. With 6 cells,train_size=0.6
does not work, buttrain_size=0.5
does work.I went ahead and implemented a check that if there are < 10 cells (from my interpretation that's when we would have issues), then we adjust the
train_size
. Another idea is that we don't even runCellAssign
in this case and skip it. Thoughts?Also, when I was working on this, I thought it was an issue with the genes not overlapping the reference, but that turned out to be me loading in the wrong reference set. But I think we may want to add a check to make sure that we actually have genes left, because that also failed to run CellAssign. So I added an error if none of the genes in the reference are also in the provided AnnData object.
For some added additional context, this was the error I was getting in Nextflow:
And then this was the error I was getting when running through the cellassign script locally: