Transformer backend error on CUDA #1774

fakezeta · 2024-02-29T08:49:50Z

LocalAI version:

quay.io/go-skynet/local-ai:master-cublas-cuda12-ffmpeg

Environment, CPU architecture, OS, and Version:

Windows 11 Docker 25.03 with wsl2 backend
Kernel Version: 5.15.133.1-microsoft-standard-WSL2
Operating System: Docker Desktop
OSType: linux
Architecture: x86_64
CPUs: 12
Total Memory: 15.62GiB
GPU NVidia 3060Ti 8GB VRAM

Describe the bug

Running intfloat/multilingual-e5-base with the transformers backend and cuda: true fails. The logs show the following error:

RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cpu! (when checking argument for argument index in method wrapper_CUDA__index_select)

To Reproduce

Request an embedding from AnythingLLM with the following embedding model configuration:

name: text-embedding-ada-002
backend: transformers
cuda: true
embeddings: true
low_vram: true
f16: true
device: cuda:0
parameters:
  model: intfloat/multilingual-e5-base
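
The same code path can probably be exercised without AnythingLLM by calling LocalAI's OpenAI-compatible embeddings endpoint directly. The sketch below only builds the request. The base URL http://localhost:8080 and the helper name build_embedding_request are assumptions for illustration, not part of the original report.

```python
import json
import urllib.request


def build_embedding_request(base_url: str, model: str, text: str) -> urllib.request.Request:
    """Build a POST request for an OpenAI-style /v1/embeddings endpoint.

    `base_url` is an assumption (for example http://localhost:8080);
    adjust it to wherever the LocalAI container is exposed.
    """
    payload = json.dumps({"model": model, "input": text}).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/v1/embeddings",
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Hypothetical usage (requires a running LocalAI instance):
#   req = build_embedding_request(
#       "http://localhost:8080", "text-embedding-ada-002", "query: hello world"
#   )
#   with urllib.request.urlopen(req) as resp:
#       print(json.load(resp))
```

The model name in the request is the name: field of the YAML above, not the Hugging Face model id.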

Expected behavior

The embedding is generated.

Logs

8:26AM DBG GRPC(intfloat/multilingual-e5-base-127.0.0.1:46411): stderr Server started. Listening on: 127.0.0.1:46411
8:26AM DBG GRPC Service Ready
8:26AM DBG GRPC: Loading model with options: {state:{NoUnkeyedLiterals:{} DoNotCompare:[] DoNotCopy:[] atomicMessageInfo:<nil>} sizeCache:0 unknownFields:[] Model:intfloat/multilingual-e5-base ContextSize:0 Seed:0 NBatch:512 F16Memory:true MLock:false MMap:false VocabOnly:false LowVRAM:true Embeddings:true NUMA:false NGPULayers:0 MainGPU: TensorSplit: Threads:12 LibrarySearchPath: RopeFreqBase:0 RopeFreqScale:0 RMSNormEps:0 NGQA:0 ModelFile:/models/intfloat/multilingual-e5-base Device: UseTriton:false ModelBaseName: UseFastTokenizer:false PipelineType: SchedulerType: CUDA:true CFGScale:0 IMG2IMG:false CLIPModel: CLIPSubfolder: CLIPSkip:0 ControlNet: Tokenizer: LoraBase: LoraAdapter: LoraScale:0 NoMulMatQ:false DraftModel: AudioPath: Quantization: MMProj: RopeScaling: YarnExtFactor:0 YarnAttnFactor:0 YarnBetaFast:0 YarnBetaSlow:0 Type:}
8:26AM DBG GRPC(intfloat/multilingual-e5-base-127.0.0.1:46411): stderr Loading model intfloat/multilingual-e5-base to CUDA.
8:26AM DBG GRPC(intfloat/multilingual-e5-base-127.0.0.1:46411): stderr Traceback (most recent call last):
8:26AM DBG GRPC(intfloat/multilingual-e5-base-127.0.0.1:46411): stderr   File "/opt/conda/envs/transformers/lib/python3.11/site-packages/grpc/_server.py", line 552, in _call_behavior
8:26AM DBG GRPC(intfloat/multilingual-e5-base-127.0.0.1:46411): stderr     response_or_iterator = behavior(argument, context)
8:26AM DBG GRPC(intfloat/multilingual-e5-base-127.0.0.1:46411): stderr                            ^^^^^^^^^^^^^^^^^^^^^^^^^^^
8:26AM DBG GRPC(intfloat/multilingual-e5-base-127.0.0.1:46411): stderr   File "/build/backend/python/transformers/transformers_server.py", line 112, in Embedding
8:26AM DBG GRPC(intfloat/multilingual-e5-base-127.0.0.1:46411): stderr     model_output = self.model(**encoded_input)
8:26AM DBG GRPC(intfloat/multilingual-e5-base-127.0.0.1:46411): stderr                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^
8:26AM DBG GRPC(intfloat/multilingual-e5-base-127.0.0.1:46411): stderr   File "/opt/conda/envs/transformers/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
8:26AM DBG GRPC(intfloat/multilingual-e5-base-127.0.0.1:46411): stderr            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
8:26AM DBG GRPC(intfloat/multilingual-e5-base-127.0.0.1:46411): stderr   File "/opt/conda/envs/transformers/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
8:26AM DBG GRPC(intfloat/multilingual-e5-base-127.0.0.1:46411): stderr     return forward_call(*args, **kwargs)
8:26AM DBG GRPC(intfloat/multilingual-e5-base-127.0.0.1:46411): stderr            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
8:26AM DBG GRPC(intfloat/multilingual-e5-base-127.0.0.1:46411): stderr   File "/opt/conda/envs/transformers/lib/python3.11/site-packages/transformers/models/xlm_roberta/modeling_xlm_roberta.py", line 830, in forward
8:26AM DBG GRPC(intfloat/multilingual-e5-base-127.0.0.1:46411): stderr     embedding_output = self.embeddings(
8:26AM DBG GRPC(intfloat/multilingual-e5-base-127.0.0.1:46411): stderr                        ^^^^^^^^^^^^^^^^
8:26AM DBG GRPC(intfloat/multilingual-e5-base-127.0.0.1:46411): stderr   File "/opt/conda/envs/transformers/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
8:26AM DBG GRPC(intfloat/multilingual-e5-base-127.0.0.1:46411): stderr     return self._call_impl(*args, **kwargs)
8:26AM DBG GRPC(intfloat/multilingual-e5-base-127.0.0.1:46411): stderr            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
8:26AM DBG GRPC(intfloat/multilingual-e5-base-127.0.0.1:46411): stderr   File "/opt/conda/envs/transformers/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
8:26AM DBG GRPC(intfloat/multilingual-e5-base-127.0.0.1:46411): stderr     return forward_call(*args, **kwargs)
8:26AM DBG GRPC(intfloat/multilingual-e5-base-127.0.0.1:46411): stderr            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
8:26AM DBG GRPC(intfloat/multilingual-e5-base-127.0.0.1:46411): stderr   File "/opt/conda/envs/transformers/lib/python3.11/site-packages/transformers/models/xlm_roberta/modeling_xlm_roberta.py", line 126, in forward
8:26AM DBG GRPC(intfloat/multilingual-e5-base-127.0.0.1:46411): stderr     inputs_embeds = self.word_embeddings(input_ids)
8:26AM DBG GRPC(intfloat/multilingual-e5-base-127.0.0.1:46411): stderr                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
8:26AM DBG GRPC(intfloat/multilingual-e5-base-127.0.0.1:46411): stderr   File "/opt/conda/envs/transformers/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
8:26AM DBG GRPC(intfloat/multilingual-e5-base-127.0.0.1:46411): stderr     return self._call_impl(*args, **kwargs)
8:26AM DBG GRPC(intfloat/multilingual-e5-base-127.0.0.1:46411): stderr            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
8:26AM DBG GRPC(intfloat/multilingual-e5-base-127.0.0.1:46411): stderr   File "/opt/conda/envs/transformers/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
8:26AM DBG GRPC(intfloat/multilingual-e5-base-127.0.0.1:46411): stderr     return forward_call(*args, **kwargs)
8:26AM DBG GRPC(intfloat/multilingual-e5-base-127.0.0.1:46411): stderr            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
8:26AM DBG GRPC(intfloat/multilingual-e5-base-127.0.0.1:46411): stderr   File "/opt/conda/envs/transformers/lib/python3.11/site-packages/torch/nn/modules/sparse.py", line 162, in forward
8:26AM DBG GRPC(intfloat/multilingual-e5-base-127.0.0.1:46411): stderr     return F.embedding(
8:26AM DBG GRPC(intfloat/multilingual-e5-base-127.0.0.1:46411): stderr            ^^^^^^^^^^^^
8:26AM DBG GRPC(intfloat/multilingual-e5-base-127.0.0.1:46411): stderr   File "/opt/conda/envs/transformers/lib/python3.11/site-packages/torch/nn/functional.py", line 2233, in embedding
8:26AM DBG GRPC(intfloat/multilingual-e5-base-127.0.0.1:46411): stderr     return torch.embedding(weight, input, padding_idx, scale_grad_by_freq, sparse)
8:26AM DBG GRPC(intfloat/multilingual-e5-base-127.0.0.1:46411): stderr            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
8:26AM DBG GRPC(intfloat/multilingual-e5-base-127.0.0.1:46411): stderr RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cpu! (when checking argument for argument index in method wrapper_CUDA__index_select)

Additional context

I've implemented a fix locally and opened this Issue to track it.
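
The traceback suggests that the tokenized inputs stay on the CPU while the model weights are on cuda:0. A minimal sketch of the usual remedy is shown below: move every tensor in the encoded batch to the model's device before the forward pass. This is not necessarily the change merged for this issue, and the helper name move_batch_to_device is hypothetical.

```python
from typing import Any, Mapping


def move_batch_to_device(batch: Mapping[str, Any], device: Any) -> dict:
    """Return a copy of `batch` with every tensor-like value moved to `device`.

    Any value exposing a `.to(device)` method (such as a torch.Tensor) is
    moved; other values are passed through unchanged.
    """
    return {
        name: value.to(device) if hasattr(value, "to") else value
        for name, value in batch.items()
    }


# Hypothetical use inside an embedding handler (variable names are assumptions):
#   encoded_input = tokenizer(text, return_tensors="pt")
#   encoded_input = move_batch_to_device(encoded_input, model.device)
#   model_output = model(**encoded_input)
```

With the inputs and the embedding weights on the same device, the index_select call inside torch.embedding should no longer see mixed devices.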


#1775 and fix: Transformer backend error on CUDA #1774 (#1823)
* fixes #1775 and #1774
  Add BitsAndBytes Quantization and fixes embedding on CUDA devices
* Manage 4bit and 8 bit quantization
  Manage different BitsAndBytes options with the quantization: parameter in yaml
* fix compilation errors on non CUDA environment

…for Openvino and CUDA (#1892)
* fixes #1775 and #1774
  Add BitsAndBytes Quantization and fixes embedding on CUDA devices
* Manage 4bit and 8 bit quantization
  Manage different BitsAndBytes options with the quantization: parameter in yaml
* fix compilation errors on non CUDA environment
* OpenVINO draft
  First draft of OpenVINO integration in transformer backend
* first working implementation
* Streaming working
* Small fix for regression on CUDA and XPU
* use pip version of optimum[openvino]
* Update backend/python/transformers/transformers_server.py
  Signed-off-by: Ettore Di Giacinto <mudler@users.noreply.github.com>
---------
Signed-off-by: Ettore Di Giacinto <mudler@users.noreply.github.com>
Co-authored-by: Ettore Di Giacinto <mudler@users.noreply.github.com>

fakezeta added the bug and unconfirmed labels Feb 29, 2024

fakezeta mentioned this issue Feb 29, 2024: feat: Add Bitsandbytes quantization for transformer backend #1775 (Closed)

fakezeta mentioned this issue Mar 12, 2024: feat: Add Bitsandbytes quantization for transformer backend #1775 and fix: Transformer backend error on CUDA #1774 #1823 (Merged)

fakezeta closed this as completed Mar 27, 2024
