multi gpu - transformers/modeling_utils.py - Trying to set a tensor of shape torch.Size([32000, 4096]) in "weight" (which has shape torch.Size([0])), this look incorrect. #1240

Open
manishiitg opened this issue Feb 1, 2024 · 4 comments
Labels: bug (Something isn't working)

Comments

@manishiitg

Please check that this issue hasn't been reported before.

  • I searched previous Bug Reports and didn't find any similar reports.

Expected Behavior

Training should start successfully with the given QLoRA + DeepSpeed ZeRO-3 configuration.

Current behaviour

Traceback (most recent call last):
  File "/root/miniconda3/envs/py3.10/lib/python3.10/runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/root/miniconda3/envs/py3.10/lib/python3.10/runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "/workspace/axolotl/src/axolotl/cli/train.py", line 59, in <module>
    fire.Fire(do_cli)
  File "/root/miniconda3/envs/py3.10/lib/python3.10/site-packages/fire/core.py", line 141, in Fire
    component_trace = _Fire(component, args, parsed_flag_args, context, name)
  File "/root/miniconda3/envs/py3.10/lib/python3.10/site-packages/fire/core.py", line 475, in _Fire
    component, remaining_args = _CallAndUpdateTrace(
  File "/root/miniconda3/envs/py3.10/lib/python3.10/site-packages/fire/core.py", line 691, in _CallAndUpdateTrace
    component = fn(*varargs, **kwargs)
  File "/workspace/axolotl/src/axolotl/cli/train.py", line 35, in do_cli
    return do_train(parsed_cfg, parsed_cli_args)
  File "/workspace/axolotl/src/axolotl/cli/train.py", line 55, in do_train
    return train(cfg=cfg, cli_args=cli_args, dataset_meta=dataset_meta)
  File "/workspace/axolotl/src/axolotl/train.py", line 80, in train
    model, peft_config = load_model(cfg, tokenizer, inference=cli_args.inference)
  File "/workspace/axolotl/src/axolotl/utils/models.py", line 624, in load_model
    raise err
  File "/workspace/axolotl/src/axolotl/utils/models.py", line 585, in load_model
    model = getattr(transformers, model_type).from_pretrained(
  File "/root/miniconda3/envs/py3.10/lib/python3.10/site-packages/transformers/modeling_utils.py", line 3504, in from_pretrained
    ) = cls._load_pretrained_model(
  File "/root/miniconda3/envs/py3.10/lib/python3.10/site-packages/transformers/modeling_utils.py", line 3924, in _load_pretrained_model
    new_error_msgs, offload_index, state_dict_index = _load_state_dict_into_meta_model(
  File "/root/miniconda3/envs/py3.10/lib/python3.10/site-packages/transformers/modeling_utils.py", line 805, in _load_state_dict_into_meta_model
    set_module_tensor_to_device(model, param_name, param_device, **set_module_kwargs)
  File "/root/miniconda3/envs/py3.10/lib/python3.10/site-packages/accelerate/utils/modeling.py", line 310, in set_module_tensor_to_device
    raise ValueError(
ValueError: Trying to set a tensor of shape torch.Size([32000, 4096]) in "weight" (which has shape torch.Size([0])), this look incorrect.
Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]
[2024-02-01 07:03:05,572] [ERROR] [axolotl.load_model:623] [PID:77] [RANK:0] Trying to set a tensor of shape torch.Size([32000, 4096]) in "weight" (which has shape torch.Size([0])), this look incorrect.
Traceback (most recent call last):
  File "/workspace/axolotl/src/axolotl/utils/models.py", line 585, in load_model
    model = getattr(transformers, model_type).from_pretrained(
  File "/root/miniconda3/envs/py3.10/lib/python3.10/site-packages/transformers/modeling_utils.py", line 3504, in from_pretrained
    ) = cls._load_pretrained_model(
  File "/root/miniconda3/envs/py3.10/lib/python3.10/site-packages/transformers/modeling_utils.py", line 3924, in _load_pretrained_model
    new_error_msgs, offload_index, state_dict_index = _load_state_dict_into_meta_model(
  File "/root/miniconda3/envs/py3.10/lib/python3.10/site-packages/transformers/modeling_utils.py", line 805, in _load_state_dict_into_meta_model
    set_module_tensor_to_device(model, param_name, param_device, **set_module_kwargs)
  File "/root/miniconda3/envs/py3.10/lib/python3.10/site-packages/accelerate/utils/modeling.py", line 310, in set_module_tensor_to_device
    raise ValueError(
ValueError: Trying to set a tensor of shape torch.Size([32000, 4096]) in "weight" (which has shape torch.Size([0])), this look incorrect.

Steps to reproduce

!docker run --gpus all \
  -v /root/.cache:/root/.cache \
  -v /home/gcpuser/sky_workdir:/sky_workdir \
  winglian/axolotl:main-py3.10-cu118-2.0.1 \
  accelerate launch -m axolotl.cli.train /sky_workdir/hi-qlora-hi-2.yaml --deepspeed /sky_workdir/zero3_bf16.json
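
The referenced /sky_workdir/zero3_bf16.json is not attached to the report. For context, a typical ZeRO-3 bf16 DeepSpeed config (a sketch modeled on the zero3_bf16.json that axolotl ships in deepspeed_configs/; the exact values here are assumptions, not the reporter's file) looks roughly like this:

{
  "zero_optimization": {
    "stage": 3,
    "overlap_comm": true,
    "contiguous_gradients": true,
    "reduce_bucket_size": "auto",
    "stage3_prefetch_bucket_size": "auto",
    "stage3_param_persistence_threshold": "auto",
    "stage3_gather_16bit_weights_on_model_save": true
  },
  "bf16": {
    "enabled": true
  },
  "gradient_accumulation_steps": "auto",
  "gradient_clipping": "auto",
  "train_batch_size": "auto",
  "train_micro_batch_size_per_gpu": "auto",
  "wall_clock_breakdown": false
}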

Config yaml

base_model: teknium/OpenHermes-2.5-Mistral-7B
model_type: MistralForCausalLM
tokenizer_type: LlamaTokenizer
is_mistral_derived_model: true

load_in_8bit: false
load_in_4bit: true
strict: false

chat_template: chatml
datasets:
  - path: manishiitg/chat-instruct-hi-v4
    type: completion

hub_model_id: manishiitg/open-aditi-chat-hi-1.5
hf_use_auth_token: true

wandb_project: open-aditi-chat-hi-1.5

dataset_prepared_path: manishiitg
push_dataset_to_hub: manishiitg
val_set_size: 0
output_dir: /sky-notebook/manishiitg/open-aditi-chat-hi-1.5

adapter: qlora
lora_model_dir:
save_safetensors: true

sequence_len: 4096
sample_packing: true
pad_to_sequence_len: true

lora_r: 16
lora_alpha: 32
lora_dropout: 0.05
lora_target_linear: true
lora_fan_in_fan_out:
lora_target_modules:
  - gate_proj
  - down_proj
  - up_proj
  - q_proj
  - v_proj
  - k_proj
  - o_proj

lora_modules_to_save:
  - embed_tokens
  - lm_head

wandb_entity:
wandb_watch:
wandb_run_id:
wandb_log_model:

gradient_accumulation_steps: 4
micro_batch_size: 9
num_epochs: 3
optimizer: adamw_bnb_8bit
lr_scheduler: cosine
learning_rate: 0.0002

adam_beta2: 0.95
adam_epsilon: 0.00001
max_grad_norm: 1.0

train_on_inputs: false
group_by_length: false
bf16: true
fp16: false
tf32: false

gradient_checkpointing: true
early_stopping_patience:
resume_from_checkpoint:
auto_resume_from_checkpoints: true ## manage checkpoint resume from here
local_rank:
logging_steps: 1
xformers_attention:
flash_attention: true

warmup_steps: 10
eval_steps: 0
eval_table_size:
eval_table_max_new_tokens: 128
save_steps: 20 ## increase based on your dataset
save_strategy: steps
debug:
deepspeed:
weight_decay: 0.0
fsdp:
fsdp_config:
special_tokens:
  bos_token: ""
  eos_token: ""
  unk_token: ""
tokens: # these are delimiters
  - "<|im_start|>"
  - "<|im_end|>"

Possible solution

No response

Which Operating Systems are you using?

  • Linux
  • macOS
  • Windows

Python Version

3.10

axolotl branch-commit

main

Acknowledgements

  • My issue title is concise, descriptive, and in title casing.
  • I have searched the existing issues to make sure this bug has not been reported yet.
  • I am using the latest version of axolotl.
  • I have provided enough information for the maintainers to reproduce and diagnose the issue.
@manishiitg added the bug label on Feb 1, 2024
@manishiitg
Author

This got fixed after switching to ZeRO-2 and reducing the batch size.

@manishiitg
Author

This issue occurs mainly with zero3_bf16.json; it works well with ZeRO-2. Batch size is not the issue.
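
For reference, the working setup differs from the ZeRO-3 file only in the ZeRO stage. A minimal ZeRO-2 sketch, assuming the same bf16/auto settings as above (not the reporter's actual file):

{
  "zero_optimization": {
    "stage": 2,
    "overlap_comm": true,
    "contiguous_gradients": true,
    "reduce_bucket_size": "auto"
  },
  "bf16": {
    "enabled": true
  },
  "gradient_accumulation_steps": "auto",
  "gradient_clipping": "auto",
  "train_batch_size": "auto",
  "train_micro_batch_size_per_gpu": "auto"
}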

@ehartford
Collaborator

This is a real error that is preventing deepspeed zero3 from working as it should.
I am using a known good deepspeed zero3 config that works on other systems besides axolotl - but with axolotl it doesn't work.

@ehartford reopened this on Mar 1, 2024
@Nagico

Nagico commented Mar 11, 2024

This is a real error that is preventing deepspeed zero3 from working as it should. I am using a known good deepspeed zero3 config that works on other systems besides axolotl - but with axolotl it doesn't work.

It's a problem with DeepSpeed:

huggingface/transformers#29266 (comment)

So maybe we cannot use QLoRA with DeepSpeed currently, because of bitsandbytes.


And if this project doesn't use zero_init=True, that could work around the compatibility problem between bitsandbytes and DeepSpeed ZeRO-3:

microsoft/DeepSpeed#4295
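
To illustrate the reasoning above: under ZeRO-3's zero.Init (zero_init=True), parameters are partitioned as soon as the module is constructed, so each rank holds only an empty placeholder, and loading the full 32000 x 4096 embedding weight into it fails with exactly the shape torch.Size([0]) error from the traceback. A minimal sketch (assumed, not from the issue; run it under a distributed launcher such as deepspeed or torchrun):

# Minimal sketch: under deepspeed.zero.Init, freshly constructed parameters are
# partitioned immediately. param.data on each rank becomes an empty placeholder
# and the full shape is only tracked in DeepSpeed's ds_shape metadata, which is
# why set_module_tensor_to_device rejects the incoming [32000, 4096] tensor.
import deepspeed
import torch.nn as nn

deepspeed.init_distributed()  # normally handled by the launcher

with deepspeed.zero.Init():
    emb = nn.Embedding(32000, 4096)

print(emb.weight.shape)     # torch.Size([0])  (partitioned placeholder)
print(emb.weight.ds_shape)  # torch.Size([32000, 4096])  (logical full shape)

In accelerate's DeepSpeed integration, this construction-time partitioning is controlled by the zero3_init_flag setting, which appears to be what the linked workarounds toggle.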
