
Deal with small numbers of cells and CellAssign #662

Merged: 12 commits merged into main from allyhawkins/no-overlapping-genes-with-refs on Jan 22, 2024

Conversation

@allyhawkins (Member)

So I went to run the next set of projects, and pretty quickly CellAssign failed for SCPCL000784. I did some digging, and this processed object only has 6 cells, which seems to be part of the reason it's failing. I think with so few cells, the default values that define the training set don't work. After googling the errors I was seeing (which were slightly different locally vs. on Nextflow), I saw a lot of mention of the default training size not working with so few observations. So I went through the CellAssign docs and noticed that if I set the `train_size` to be smaller, then it worked. With 6 cells, `train_size=0.6` does not work, but `train_size=0.5` does work.

I went ahead and implemented a check so that if there are < 10 cells (from my interpretation, that's when we would have issues), we adjust the `train_size` (see the sketch below). Another idea is that we don't even run CellAssign in this case and just skip it. Thoughts?
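Roughly, the idea is something like this (just a sketch; the 10-cell cutoff and the 0.5 value are the parts up for discussion, and the actual code in the prediction script may look a bit different):

```python
def train_cellassign(model, n_cells: int) -> None:
    """Train a CellAssign model, shrinking train_size when there are very few cells."""
    if n_cells < 10:
        # the default train_size of 0.9 leaves too few cells for the validation split
        model.train(train_size=0.5)
    else:
        model.train()
```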

Also, when I was working on this, I thought it was an issue with the genes not overlapping the reference, but that turned out to be me loading the wrong reference set. Still, I think we want a check to make sure that we actually have genes left after intersecting with the reference, because CellAssign also fails to run in that case. So I added an error if none of the genes in the reference are also in the provided AnnData object.
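Something along these lines (sketch only; `ref_genes` and the exact error message are illustrative, not necessarily what ends up in the script):

```python
# keep only the reference genes that are present in the AnnData object
shared_genes = [gene for gene in ref_genes if gene in adata.var_names]
if len(shared_genes) == 0:
    raise ValueError(
        "None of the genes in the reference are found in the provided AnnData object."
    )
# CellAssign is then run on the shared genes only
subset_adata = adata[:, shared_genes].copy()
```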

For some additional context, this was the error I was getting in Nextflow:

‘SummarizedExperiment’ 
  Registered S3 methods overwritten by 'zellkonverter':
    method                                             from      
    py_to_r.numpy.ndarray                              reticulate
    py_to_r.pandas.core.arrays.categorical.Categorical reticulate
  /usr/local/lib/python3.10/dist-packages/scvi/_settings.py:63: UserWarning: Since v1.0.0, scvi-tools no longer uses a random seed by default. Run `scvi.settings.seed = 0` to reproduce results from previous versions.
    self.seed = seed
  /usr/local/lib/python3.10/dist-packages/scvi/_settings.py:70: UserWarning: Setting `dl_pin_memory_gpu_training` is deprecated in v1.0 and will be removed in v1.1. Please pass in `pin_memory` to the data loaders instead.
    self.dl_pin_memory_gpu_training = (
  Global seed set to 2021
  No GPU/TPU found, falling back to CPU. (Set TF_CPP_MIN_LOG_LEVEL=0 and rerun for more info.)
  GPU available: False, used: False
  TPU available: False, using: 0 TPU cores
  IPU available: False, using: 0 IPUs
  HPU available: False, using: 0 HPUs
  /usr/local/lib/python3.10/dist-packages/lightning/pytorch/loops/fit_loop.py:281: PossibleUserWarning: The number of training batches (1) is smaller than the logging interval Trainer(log_every_n_steps=10). Set a lower value for log_every_n_steps if you want to see logs for the training epoch.
    rank_zero_warn(
  Traceback (most recent call last):
    File "/usr/local/lib/python3.10/dist-packages/lightning/pytorch/trainer/connectors/data_connector.py", line 391, in _check_dataloader_iterable
      iter(dataloader)  # type: ignore[call-overload]
  TypeError: 'NoneType' object is not iterable
  During handling of the above exception, another exception occurred:
  Traceback (most recent call last):
    File "/home/nextflow-bin/predict_cellassign.py", line 106, in <module>
      model.train()
    File "/usr/local/lib/python3.10/dist-packages/scvi/external/cellassign/_model.py", line 231, in train
      return runner()
    File "/usr/local/lib/python3.10/dist-packages/scvi/train/_trainrunner.py", line 99, in __call__
      self.trainer.fit(self.training_plan, self.data_splitter)
    File "/usr/local/lib/python3.10/dist-packages/scvi/train/_trainer.py", line 186, in fit
      super().fit(*args, **kwargs)
    File "/usr/local/lib/python3.10/dist-packages/lightning/pytorch/trainer/trainer.py", line 532, in fit
      call._call_and_handle_interrupt(
    File "/usr/local/lib/python3.10/dist-packages/lightning/pytorch/trainer/call.py", line 43, in _call_and_handle_interrupt
      return trainer_fn(*args, **kwargs)
    File "/usr/local/lib/python3.10/dist-packages/lightning/pytorch/trainer/trainer.py", line 571, in _fit_impl
      self._run(model, ckpt_path=ckpt_path)
    File "/usr/local/lib/python3.10/dist-packages/lightning/pytorch/trainer/trainer.py", line 980, in _run
      results = self._run_stage()
    File "/usr/local/lib/python3.10/dist-packages/lightning/pytorch/trainer/trainer.py", line 1023, in _run_stage
      self.fit_loop.run()
    File "/usr/local/lib/python3.10/dist-packages/lightning/pytorch/loops/fit_loop.py", line 198, in run
      self.on_run_start()
    File "/usr/local/lib/python3.10/dist-packages/lightning/pytorch/loops/fit_loop.py", line 309, in on_run_start
      self.epoch_loop.val_loop.setup_data()
    File "/usr/local/lib/python3.10/dist-packages/lightning/pytorch/loops/evaluation_loop.py", line 168, in setup_data
      _check_dataloader_iterable(dl, source, trainer_fn)
    File "/usr/local/lib/python3.10/dist-packages/lightning/pytorch/trainer/connectors/data_connector.py", line 407, in _check_dataloader_iterable
      raise TypeError(
  TypeError: An invalid dataloader was returned from `DataSplitter.val_dataloader()`. Found None.

And then this was the error I was getting when running the cellassign script locally:

GPU available: False, used: False
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs
/Users/allyhawkins/miniconda3/envs/scvi/lib/python3.10/site-packages/pytorch_lightning/trainer/configuration_validator.py:376: LightningDeprecationWarning: The `Callback.on_batch_end` hook was deprecated in v1.6 and will be removed in v1.8. Please use `Callback.on_train_batch_end` instead.
  rank_zero_deprecation(
/Users/allyhawkins/miniconda3/envs/scvi/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py:1933: PossibleUserWarning: The number of training batches (1) is smaller than the logging interval Trainer(log_every_n_steps=10). Set a lower value for log_every_n_steps if you want to see logs for the training epoch.
  rank_zero_warn(
/Users/allyhawkins/miniconda3/envs/scvi/lib/python3.10/site-packages/pytorch_lightning/utilities/data.py:127: UserWarning: Total length of `AnnDataLoader` across ranks is zero. Please make sure this was your intention.
  rank_zero_warn(
Epoch 1/400:   0%|                                                                                                         | 0/400 [00:00<?, ?it/s]Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/allyhawkins/miniconda3/envs/scvi/lib/python3.10/site-packages/scvi/external/cellassign/_model.py", line 223, in train
    return runner()
  File "/Users/allyhawkins/miniconda3/envs/scvi/lib/python3.10/site-packages/scvi/train/_trainrunner.py", line 74, in __call__
    self.trainer.fit(self.training_plan, self.data_splitter)
  File "/Users/allyhawkins/miniconda3/envs/scvi/lib/python3.10/site-packages/scvi/train/_trainer.py", line 188, in fit
    super().fit(*args, **kwargs)
  File "/Users/allyhawkins/miniconda3/envs/scvi/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py", line 770, in fit
    self._call_and_handle_interrupt(
  File "/Users/allyhawkins/miniconda3/envs/scvi/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py", line 723, in _call_and_handle_interrupt
    return trainer_fn(*args, **kwargs)
  File "/Users/allyhawkins/miniconda3/envs/scvi/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py", line 811, in _fit_impl
    results = self._run(model, ckpt_path=self.ckpt_path)
  File "/Users/allyhawkins/miniconda3/envs/scvi/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py", line 1236, in _run
    results = self._run_stage()
  File "/Users/allyhawkins/miniconda3/envs/scvi/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py", line 1323, in _run_stage
    return self._run_train()
  File "/Users/allyhawkins/miniconda3/envs/scvi/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py", line 1353, in _run_train
    self.fit_loop.run()
  File "/Users/allyhawkins/miniconda3/envs/scvi/lib/python3.10/site-packages/pytorch_lightning/loops/base.py", line 205, in run
    self.on_advance_end()
  File "/Users/allyhawkins/miniconda3/envs/scvi/lib/python3.10/site-packages/pytorch_lightning/loops/fit_loop.py", line 294, in on_advance_end
    self.trainer._call_callback_hooks("on_train_epoch_end")
  File "/Users/allyhawkins/miniconda3/envs/scvi/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py", line 1636, in _call_callback_hooks
    fn(self, self.lightning_module, *args, **kwargs)
  File "/Users/allyhawkins/miniconda3/envs/scvi/lib/python3.10/site-packages/pytorch_lightning/callbacks/early_stopping.py", line 179, in on_train_epoch_end
    self._run_early_stopping_check(trainer)
  File "/Users/allyhawkins/miniconda3/envs/scvi/lib/python3.10/site-packages/pytorch_lightning/callbacks/early_stopping.py", line 190, in _run_early_stopping_check
    if trainer.fast_dev_run or not self._validate_condition_metric(  # disable early_stopping with fast_dev_run
  File "/Users/allyhawkins/miniconda3/envs/scvi/lib/python3.10/site-packages/pytorch_lightning/callbacks/early_stopping.py", line 145, in _validate_condition_metric
    raise RuntimeError(error_msg)
RuntimeError: Early stopping conditioned on metric `elbo_validation` which is not available. Pass in or modify your `EarlyStopping` callback to use any of the following: `train_loss`, `train_loss_step`, `train_loss_epoch`, `elbo_train`, `kl_global_train`, `kl_local_train`, `reconstruction_loss_train`

@jashapiro (Member)

> I went ahead and implemented a check so that if there are < 10 cells (from my interpretation, that's when we would have issues), we adjust the `train_size`. Another idea is that we don't even run CellAssign in this case and just skip it. Thoughts?

I think maybe skip it for small numbers of cells.

My interpretation is that we are running into trouble when the validation set gets too small... I might expect that 10 cells would fail with 0.9 if 6 was failing with 0.6: maybe it needs at least 3 cells in the validation set?

I would check with just 10 cells to see which of us is correct, and see if 0.9 works there. Or find the minimum number of cells that does work. (I'm going to guess 30.)

@allyhawkins (Member, Author)

> I think maybe skip it for small numbers of cells.
>
> My interpretation is that we are running into trouble when the validation set gets too small... I might expect that 10 cells would fail with 0.9 if 6 was failing with 0.6: maybe it needs at least 3 cells in the validation set?
>
> I would check with just 10 cells to see which of us is correct, and see if 0.9 works there. Or find the minimum number of cells that does work. (I'm going to guess 30.)

Okay, you're right. With fewer than 30 cells it doesn't work with the default, and anything with 30 or more cells works. Based on your reply, I'll update this to just skip running CellAssign for anything < 30 cells.
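Roughly, the skip would look something like this (sketch only; the barcode-only fallback is what lets the downstream code tell that nothing was assigned, though the exact implementation may differ):

```python
import pandas as pd

if subset_adata.n_obs >= 30:
    model.train()
    predictions = model.predict()
    predictions["barcode"] = subset_adata.obs_names
else:
    # too few cells for a usable train/validation split: skip CellAssign and
    # write out a barcode-only table so downstream steps know nothing was assigned
    predictions = pd.DataFrame({"barcode": subset_adata.obs_names})
```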

@allyhawkins (Member, Author)

@jashapiro I changed how I was dealing with a failed CellAssign run when adding annotations to the SCE object. If there are no assignments, then no additional metadata is added, and the only addition is a note in the CellAssign cell type annotation column saying that CellAssign was unable to assign cell types.

Do we want to include a note in the report that CellAssign failed due to a low number of cells? Right now, it just won't show any plots that require CellAssign as it won't be present as a cell type method in the metadata.

@jashapiro (Member) left a comment

This looks fine, with my main comment being about the order of the if/else branches. I am always a fan of failing first, mostly because the failure case is usually the shorter block, which makes following indentation/nesting easier.

The other significant comment is about the column value for skipped cells.

I made a comment about changing barcode to barcodes for consistency, but I think it is too late for that, given that we do not want to be rerunning CellAssign, so please ignore that comment.

# if the only column is the barcode column then CellAssign didn't complete successfully
# create data frame with celltype and prediction as NA
# celltype will later get converted to Unclassified cell
if (colnames(predictions) == "barcode") {
Member:

Just because I saw it in another review: are we doing singular barcode here because of AnnData conventions or plural to match SCE?

Comment on lines 243 to 246
} else {
# if failed then note that in the cell type column
sce$cellassign_celltype_annotation <- "CellAssign unable to assign cell types."
}
Member:

Can we switch the if/else order here? I like the short block first for easier tracing.

metadata(sce)$cellassign_reference_organs <- cellassign_organs
} else {
# if failed then note that in the cell type column
sce$cellassign_celltype_annotation <- "CellAssign unable to assign cell types."
Member:

This is a rather long string to put in every row! Can we come up with something shorter? `CellAssign not run` or `CellAssign skipped` or something like that?

Or even just `Not run` or `Skipped`, with no `CellAssign` since that is part of the column name?

model.train()
predictions = model.predict()
predictions["barcode"] = subset_adata.obs_names
else:
Member:

Again, I would switch the order of these for consistency/ease, putting `if subset_adata.n_obs < 30:` first.
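i.e., something like this (just the shape of it, with the branch bodies elided):

```python
if subset_adata.n_obs < 30:
    ...  # short skip/failure branch first
else:
    ...  # train, predict, and add barcodes
```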

allyhawkins and others added 2 commits January 22, 2024 09:58
Co-authored-by: Joshua Shapiro <josh.shapiro@ccdatalab.org>
@allyhawkins (Member, Author)

@jashapiro I switched the order of the if/else statements in both places. I knew you had a preference, but thought it was the other way around, so whoops!
And then I also switched to using `Not run` as the value in the cell type column.

Just wanted to also circle back to my previous question:

> Do we want to include a note in the report that CellAssign failed due to a low number of cells? Right now, it just won't show any plots that require CellAssign as it won't be present as a cell type method in the metadata.

@jashapiro (Member)

> Do we want to include a note in the report that CellAssign failed due to a low number of cells? Right now, it just won't show any plots that require CellAssign as it won't be present as a cell type method in the metadata.

I saw this question, but I'm not really sure! I feel like it is probably worth a note, but I would not want to dwell on it. Skipping the plots definitely seems correct, though.

@jashapiro (Member) left a comment

Approving, with the thought that if there are updates to the report, those can be a separate PR?

@allyhawkins merged commit 6cf64ed into main on Jan 22, 2024 (3 checks passed)
@allyhawkins deleted the allyhawkins/no-overlapping-genes-with-refs branch on January 22, 2024 at 17:10