
Deal with small numbers of cells and CellAssign #662

Merged: 12 commits merged into main from allyhawkins/no-overlapping-genes-with-refs on Jan 22, 2024

Conversation

@allyhawkins (Member)

So I went to run the next set of projects, and pretty quickly CellAssign failed for SCPCL000784. I did some digging, and this processed object only has 6 cells, which seems to be part of the reason it's failing. I think with so few cells, the default values that define the training set don't work. After googling the errors I was seeing (which were slightly different locally vs. on Nextflow), I saw a lot of mention of the default training size not working with so few observations. So I went through the CellAssign docs and noticed that if I set the `train_size` to be smaller, then it worked. With 6 cells, `train_size=0.6` does not work, but `train_size=0.5` does work.

I went ahead and implemented a check so that if there are < 10 cells (from my interpretation, that's when we would have issues), we adjust the `train_size` (see the sketch below). Another idea is that we don't even run CellAssign in this case and just skip it. Thoughts?
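Roughly, the idea is something like this (just a sketch; the 10-cell cutoff and the 0.5 value are the parts up for discussion, and the actual code in the prediction script may look a bit different):

```python
def train_cellassign(model, n_cells: int) -> None:
    """Train a CellAssign model, shrinking train_size when there are very few cells."""
    if n_cells < 10:
        # the default train_size of 0.9 leaves too few cells for the validation split
        model.train(train_size=0.5)
    else:
        model.train()
```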

Also, when I was working on this, I thought it was an issue with the genes not overlapping the reference, but that turned out to be me loading the wrong reference set. Still, I think we want a check to make sure that we actually have genes left after intersecting with the reference, because CellAssign also fails to run in that case. So I added an error if none of the genes in the reference are also in the provided AnnData object.
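Something along these lines (sketch only; `ref_genes` and the exact error message are illustrative, not necessarily what ends up in the script):

```python
# keep only the reference genes that are present in the AnnData object
shared_genes = [gene for gene in ref_genes if gene in adata.var_names]
if len(shared_genes) == 0:
    raise ValueError(
        "None of the genes in the reference are found in the provided AnnData object."
    )
# CellAssign is then run on the shared genes only
subset_adata = adata[:, shared_genes].copy()
```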

For some additional context, this was the error I was getting in Nextflow:

‘SummarizedExperiment’ 
  Registered S3 methods overwritten by 'zellkonverter':
    method                                             from      
    py_to_r.numpy.ndarray                              reticulate
    py_to_r.pandas.core.arrays.categorical.Categorical reticulate
  /usr/local/lib/python3.10/dist-packages/scvi/_settings.py:63: UserWarning: Since v1.0.0, scvi-tools no longer uses a random seed by default. Run `scvi.settings.seed = 0` to reproduce results from previous versions.
    self.seed = seed
  /usr/local/lib/python3.10/dist-packages/scvi/_settings.py:70: UserWarning: Setting `dl_pin_memory_gpu_training` is deprecated in v1.0 and will be removed in v1.1. Please pass in `pin_memory` to the data loaders instead.
    self.dl_pin_memory_gpu_training = (
  Global seed set to 2021
  No GPU/TPU found, falling back to CPU. (Set TF_CPP_MIN_LOG_LEVEL=0 and rerun for more info.)
  GPU available: False, used: False
  TPU available: False, using: 0 TPU cores
  IPU available: False, using: 0 IPUs
  HPU available: False, using: 0 HPUs
  /usr/local/lib/python3.10/dist-packages/lightning/pytorch/loops/fit_loop.py:281: PossibleUserWarning: The number of training batches (1) is smaller than the logging interval Trainer(log_every_n_steps=10). Set a lower value for log_every_n_steps if you want to see logs for the training epoch.
    rank_zero_warn(
  Traceback (most recent call last):
    File "/usr/local/lib/python3.10/dist-packages/lightning/pytorch/trainer/connectors/data_connector.py", line 391, in _check_dataloader_iterable
      iter(dataloader)  # type: ignore[call-overload]
  TypeError: 'NoneType' object is not iterable
  During handling of the above exception, another exception occurred:
  Traceback (most recent call last):
    File "/home/nextflow-bin/predict_cellassign.py", line 106, in <module>
      model.train()
    File "/usr/local/lib/python3.10/dist-packages/scvi/external/cellassign/_model.py", line 231, in train
      return runner()
    File "/usr/local/lib/python3.10/dist-packages/scvi/train/_trainrunner.py", line 99, in __call__
      self.trainer.fit(self.training_plan, self.data_splitter)
    File "/usr/local/lib/python3.10/dist-packages/scvi/train/_trainer.py", line 186, in fit
      super().fit(*args, **kwargs)
    File "/usr/local/lib/python3.10/dist-packages/lightning/pytorch/trainer/trainer.py", line 532, in fit
      call._call_and_handle_interrupt(
    File "/usr/local/lib/python3.10/dist-packages/lightning/pytorch/trainer/call.py", line 43, in _call_and_handle_interrupt
      return trainer_fn(*args, **kwargs)
    File "/usr/local/lib/python3.10/dist-packages/lightning/pytorch/trainer/trainer.py", line 571, in _fit_impl
      self._run(model, ckpt_path=ckpt_path)
    File "/usr/local/lib/python3.10/dist-packages/lightning/pytorch/trainer/trainer.py", line 980, in _run
      results = self._run_stage()
    File "/usr/local/lib/python3.10/dist-packages/lightning/pytorch/trainer/trainer.py", line 1023, in _run_stage
      self.fit_loop.run()
    File "/usr/local/lib/python3.10/dist-packages/lightning/pytorch/loops/fit_loop.py", line 198, in run
      self.on_run_start()
    File "/usr/local/lib/python3.10/dist-packages/lightning/pytorch/loops/fit_loop.py", line 309, in on_run_start
      self.epoch_loop.val_loop.setup_data()
    File "/usr/local/lib/python3.10/dist-packages/lightning/pytorch/loops/evaluation_loop.py", line 168, in setup_data
      _check_dataloader_iterable(dl, source, trainer_fn)
    File "/usr/local/lib/python3.10/dist-packages/lightning/pytorch/trainer/connectors/data_connector.py", line 407, in _check_dataloader_iterable
      raise TypeError(
  TypeError: An invalid dataloader was returned from `DataSplitter.val_dataloader()`. Found None.

And then this was the error I was getting when running the cellassign script locally:

GPU available: False, used: False
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs
/Users/allyhawkins/miniconda3/envs/scvi/lib/python3.10/site-packages/pytorch_lightning/trainer/configuration_validator.py:376: LightningDeprecationWarning: The `Callback.on_batch_end` hook was deprecated in v1.6 and will be removed in v1.8. Please use `Callback.on_train_batch_end` instead.
  rank_zero_deprecation(
/Users/allyhawkins/miniconda3/envs/scvi/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py:1933: PossibleUserWarning: The number of training batches (1) is smaller than the logging interval Trainer(log_every_n_steps=10). Set a lower value for log_every_n_steps if you want to see logs for the training epoch.
  rank_zero_warn(
/Users/allyhawkins/miniconda3/envs/scvi/lib/python3.10/site-packages/pytorch_lightning/utilities/data.py:127: UserWarning: Total length of `AnnDataLoader` across ranks is zero. Please make sure this was your intention.
  rank_zero_warn(
Epoch 1/400:   0%|                                                                                                         | 0/400 [00:00<?, ?it/s]Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/allyhawkins/miniconda3/envs/scvi/lib/python3.10/site-packages/scvi/external/cellassign/_model.py", line 223, in train
    return runner()
  File "/Users/allyhawkins/miniconda3/envs/scvi/lib/python3.10/site-packages/scvi/train/_trainrunner.py", line 74, in __call__
    self.trainer.fit(self.training_plan, self.data_splitter)
  File "/Users/allyhawkins/miniconda3/envs/scvi/lib/python3.10/site-packages/scvi/train/_trainer.py", line 188, in fit
    super().fit(*args, **kwargs)
  File "/Users/allyhawkins/miniconda3/envs/scvi/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py", line 770, in fit
    self._call_and_handle_interrupt(
  File "/Users/allyhawkins/miniconda3/envs/scvi/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py", line 723, in _call_and_handle_interrupt
    return trainer_fn(*args, **kwargs)
  File "/Users/allyhawkins/miniconda3/envs/scvi/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py", line 811, in _fit_impl
    results = self._run(model, ckpt_path=self.ckpt_path)
  File "/Users/allyhawkins/miniconda3/envs/scvi/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py", line 1236, in _run
    results = self._run_stage()
  File "/Users/allyhawkins/miniconda3/envs/scvi/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py", line 1323, in _run_stage
    return self._run_train()
  File "/Users/allyhawkins/miniconda3/envs/scvi/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py", line 1353, in _run_train
    self.fit_loop.run()
  File "/Users/allyhawkins/miniconda3/envs/scvi/lib/python3.10/site-packages/pytorch_lightning/loops/base.py", line 205, in run
    self.on_advance_end()
  File "/Users/allyhawkins/miniconda3/envs/scvi/lib/python3.10/site-packages/pytorch_lightning/loops/fit_loop.py", line 294, in on_advance_end
    self.trainer._call_callback_hooks("on_train_epoch_end")
  File "/Users/allyhawkins/miniconda3/envs/scvi/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py", line 1636, in _call_callback_hooks
    fn(self, self.lightning_module, *args, **kwargs)
  File "/Users/allyhawkins/miniconda3/envs/scvi/lib/python3.10/site-packages/pytorch_lightning/callbacks/early_stopping.py", line 179, in on_train_epoch_end
    self._run_early_stopping_check(trainer)
  File "/Users/allyhawkins/miniconda3/envs/scvi/lib/python3.10/site-packages/pytorch_lightning/callbacks/early_stopping.py", line 190, in _run_early_stopping_check
    if trainer.fast_dev_run or not self._validate_condition_metric(  # disable early_stopping with fast_dev_run
  File "/Users/allyhawkins/miniconda3/envs/scvi/lib/python3.10/site-packages/pytorch_lightning/callbacks/early_stopping.py", line 145, in _validate_condition_metric
    raise RuntimeError(error_msg)
RuntimeError: Early stopping conditioned on metric `elbo_validation` which is not available. Pass in or modify your `EarlyStopping` callback to use any of the following: `train_loss`, `train_loss_step`, `train_loss_epoch`, `elbo_train`, `kl_global_train`, `kl_local_train`, `reconstruction_loss_train`

@jashapiro (Member)

> I went ahead and implemented a check so that if there are < 10 cells (from my interpretation, that's when we would have issues), we adjust the `train_size`. Another idea is that we don't even run CellAssign in this case and just skip it. Thoughts?

I think maybe skip it for small numbers of cells.

My interpretation is that we are running into trouble when the validation set gets too small... I might expect that 10 cells would fail with 0.9 if 6 was failing with 0.6: maybe it needs at least 3 cells in the validation set?

I would check with just 10 cells to see which of us is correct, and see if 0.9 works there. Or find the minimum number of cells that does work. (I'm going to guess 30.)

@allyhawkins (Member, Author)

> I think maybe skip it for small numbers of cells.
>
> My interpretation is that we are running into trouble when the validation set gets too small... I might expect that 10 cells would fail with 0.9 if 6 was failing with 0.6: maybe it needs at least 3 cells in the validation set?
>
> I would check with just 10 cells to see which of us is correct, and see if 0.9 works there. Or find the minimum number of cells that does work. (I'm going to guess 30.)

Okay, you're right. With fewer than 30 cells it doesn't work with the default, and anything with 30 or more cells works. Based on your reply, I'll update this to just skip running CellAssign for anything < 30 cells.
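Roughly, the skip would look something like this (sketch only; the barcode-only fallback is what lets the downstream code tell that nothing was assigned, though the exact implementation may differ):

```python
import pandas as pd

if subset_adata.n_obs >= 30:
    model.train()
    predictions = model.predict()
    predictions["barcode"] = subset_adata.obs_names
else:
    # too few cells for a usable train/validation split: skip CellAssign and
    # write out a barcode-only table so downstream steps know nothing was assigned
    predictions = pd.DataFrame({"barcode": subset_adata.obs_names})
```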

@allyhawkins (Member, Author)

@jashapiro I changed how I was dealing with a failed CellAssign run when adding annotations to the SCE object. If there are no assignments, then no additional metadata is added, and the only addition is a note in the CellAssign cell type annotation column saying that CellAssign was unable to assign cell types.

Do we want to include a note in the report that CellAssign failed due to a low number of cells? Right now, it just won't show any plots that require CellAssign as it won't be present as a cell type method in the metadata.

@jashapiro (Member) left a comment

This looks fine, with my main comment being about the order of the if/else branches. I am always a fan of failing first, mostly because the failure case is usually the shorter block, which makes following indentation/nesting easier.

The other significant comment is about the column value for skipped cells.

I made a comment about changing barcode to barcodes for consistency, but I think it is too late for that, given that we do not want to be rerunning CellAssign, so please ignore that comment.

# if the only column is the barcode column then CellAssign didn't complete successfully
# create data frame with celltype and prediction as NA
# celltype will later get converted to Unclassified cell
if (colnames(predictions) == "barcode") {
Member:

Just because I saw it in another review: are we doing singular barcode here because of AnnData conventions or plural to match SCE?

Comment on lines 243 to 246
} else {
# if failed then note that in the cell type column
sce$cellassign_celltype_annotation <- "CellAssign unable to assign cell types."
}
Member:

Can we switch the if/else order here? I like the short block first for easier tracing.

metadata(sce)$cellassign_reference_organs <- cellassign_organs
} else {
# if failed then note that in the cell type column
sce$cellassign_celltype_annotation <- "CellAssign unable to assign cell types."
Member:

This is a rather long string to put in every row! Can we come up with something shorter? `CellAssign not run` or `CellAssign skipped` or something like that?

Or even just `Not run` or `Skipped`, with no `CellAssign` since that is part of the column name?

model.train()
predictions = model.predict()
predictions["barcode"] = subset_adata.obs_names
else:
Member:

Again, I would switch the order of these for consistency/ease, putting `if subset_adata.n_obs < 30:` first.
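i.e., something like this (just the shape of it, with the branch bodies elided):

```python
if subset_adata.n_obs < 30:
    ...  # short skip/failure branch first
else:
    ...  # train, predict, and add barcodes
```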

allyhawkins and others added 2 commits January 22, 2024 09:58
Co-authored-by: Joshua Shapiro <josh.shapiro@ccdatalab.org>
@allyhawkins (Member, Author)

@jashapiro I switched the order of the if/else statements in both places. I knew you had a preference, but thought it was the other way around, so whoops!
And then I also switched to using `Not run` as the value in the cell type column.

Just wanted to also circle back to my previous question:

> Do we want to include a note in the report that CellAssign failed due to a low number of cells? Right now, it just won't show any plots that require CellAssign as it won't be present as a cell type method in the metadata.

@jashapiro (Member)

> Do we want to include a note in the report that CellAssign failed due to a low number of cells? Right now, it just won't show any plots that require CellAssign as it won't be present as a cell type method in the metadata.

I saw this question, but I'm not really sure! I feel like it is probably worth a note, but I would not want to dwell on it. Skipping the plots definitely seems correct, though.

@jashapiro (Member) left a comment

Approving, with the thought that if there are updates to the report, those can be a separate PR?

@allyhawkins merged commit 6cf64ed into main on Jan 22, 2024 (3 checks passed)
@allyhawkins deleted the allyhawkins/no-overlapping-genes-with-refs branch on January 22, 2024 at 17:10