multi gpu - transformers/modeling_utils.py - Trying to set a tensor of shape torch.Size([32000, 4096]) in "weight" (which has shape torch.Size([0])), this look incorrect. #1240

Open
manishiitg opened this issue Feb 1, 2024 · 4 comments
Labels: bug (Something isn't working)

Comments

@manishiitg

Please check that this issue hasn't been reported before.

  • I searched previous Bug Reports and didn't find any similar reports.

Expected Behavior

Training should start successfully with the given QLoRA + DeepSpeed ZeRO-3 configuration.

Current behaviour

Traceback (most recent call last):
  File "/root/miniconda3/envs/py3.10/lib/python3.10/runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/root/miniconda3/envs/py3.10/lib/python3.10/runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "/workspace/axolotl/src/axolotl/cli/train.py", line 59, in <module>
    fire.Fire(do_cli)
  File "/root/miniconda3/envs/py3.10/lib/python3.10/site-packages/fire/core.py", line 141, in Fire
    component_trace = _Fire(component, args, parsed_flag_args, context, name)
  File "/root/miniconda3/envs/py3.10/lib/python3.10/site-packages/fire/core.py", line 475, in _Fire
    component, remaining_args = _CallAndUpdateTrace(
  File "/root/miniconda3/envs/py3.10/lib/python3.10/site-packages/fire/core.py", line 691, in _CallAndUpdateTrace
    component = fn(*varargs, **kwargs)
  File "/workspace/axolotl/src/axolotl/cli/train.py", line 35, in do_cli
    return do_train(parsed_cfg, parsed_cli_args)
  File "/workspace/axolotl/src/axolotl/cli/train.py", line 55, in do_train
    return train(cfg=cfg, cli_args=cli_args, dataset_meta=dataset_meta)
  File "/workspace/axolotl/src/axolotl/train.py", line 80, in train
    model, peft_config = load_model(cfg, tokenizer, inference=cli_args.inference)
  File "/workspace/axolotl/src/axolotl/utils/models.py", line 624, in load_model
    raise err
  File "/workspace/axolotl/src/axolotl/utils/models.py", line 585, in load_model
    model = getattr(transformers, model_type).from_pretrained(
  File "/root/miniconda3/envs/py3.10/lib/python3.10/site-packages/transformers/modeling_utils.py", line 3504, in from_pretrained
    ) = cls._load_pretrained_model(
  File "/root/miniconda3/envs/py3.10/lib/python3.10/site-packages/transformers/modeling_utils.py", line 3924, in _load_pretrained_model
    new_error_msgs, offload_index, state_dict_index = _load_state_dict_into_meta_model(
  File "/root/miniconda3/envs/py3.10/lib/python3.10/site-packages/transformers/modeling_utils.py", line 805, in _load_state_dict_into_meta_model
    set_module_tensor_to_device(model, param_name, param_device, **set_module_kwargs)
  File "/root/miniconda3/envs/py3.10/lib/python3.10/site-packages/accelerate/utils/modeling.py", line 310, in set_module_tensor_to_device
    raise ValueError(
ValueError: Trying to set a tensor of shape torch.Size([32000, 4096]) in "weight" (which has shape torch.Size([0])), this look incorrect.
Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]
[2024-02-01 07:03:05,572] [ERROR] [axolotl.load_model:623] [PID:77] [RANK:0] Trying to set a tensor of shape torch.Size([32000, 4096]) in "weight" (which has shape torch.Size([0])), this look incorrect.
Traceback (most recent call last):
  File "/workspace/axolotl/src/axolotl/utils/models.py", line 585, in load_model
    model = getattr(transformers, model_type).from_pretrained(
  File "/root/miniconda3/envs/py3.10/lib/python3.10/site-packages/transformers/modeling_utils.py", line 3504, in from_pretrained
    ) = cls._load_pretrained_model(
  File "/root/miniconda3/envs/py3.10/lib/python3.10/site-packages/transformers/modeling_utils.py", line 3924, in _load_pretrained_model
    new_error_msgs, offload_index, state_dict_index = _load_state_dict_into_meta_model(
  File "/root/miniconda3/envs/py3.10/lib/python3.10/site-packages/transformers/modeling_utils.py", line 805, in _load_state_dict_into_meta_model
    set_module_tensor_to_device(model, param_name, param_device, **set_module_kwargs)
  File "/root/miniconda3/envs/py3.10/lib/python3.10/site-packages/accelerate/utils/modeling.py", line 310, in set_module_tensor_to_device
    raise ValueError(
ValueError: Trying to set a tensor of shape torch.Size([32000, 4096]) in "weight" (which has shape torch.Size([0])), this look incorrect.

Steps to reproduce

!docker run --gpus all \
  -v /root/.cache:/root/.cache \
  -v /home/gcpuser/sky_workdir:/sky_workdir \
  winglian/axolotl:main-py3.10-cu118-2.0.1 \
  accelerate launch -m axolotl.cli.train /sky_workdir/hi-qlora-hi-2.yaml --deepspeed /sky_workdir/zero3_bf16.json
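
The referenced /sky_workdir/zero3_bf16.json is not attached to the report. For context, a typical ZeRO-3 bf16 DeepSpeed config (a sketch modeled on the zero3_bf16.json that axolotl ships in deepspeed_configs/; the exact values here are assumptions, not the reporter's file) looks roughly like this:

{
  "zero_optimization": {
    "stage": 3,
    "overlap_comm": true,
    "contiguous_gradients": true,
    "reduce_bucket_size": "auto",
    "stage3_prefetch_bucket_size": "auto",
    "stage3_param_persistence_threshold": "auto",
    "stage3_gather_16bit_weights_on_model_save": true
  },
  "bf16": {
    "enabled": true
  },
  "gradient_accumulation_steps": "auto",
  "gradient_clipping": "auto",
  "train_batch_size": "auto",
  "train_micro_batch_size_per_gpu": "auto",
  "wall_clock_breakdown": false
}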

Config yaml

base_model: teknium/OpenHermes-2.5-Mistral-7B
model_type: MistralForCausalLM
tokenizer_type: LlamaTokenizer
is_mistral_derived_model: true

load_in_8bit: false
load_in_4bit: true
strict: false

chat_template: chatml
datasets:
  - path: manishiitg/chat-instruct-hi-v4
    type: completion

hub_model_id: manishiitg/open-aditi-chat-hi-1.5
hf_use_auth_token: true

wandb_project: open-aditi-chat-hi-1.5

dataset_prepared_path: manishiitg
push_dataset_to_hub: manishiitg
val_set_size: 0
output_dir: /sky-notebook/manishiitg/open-aditi-chat-hi-1.5

adapter: qlora
lora_model_dir:
save_safetensors: true

sequence_len: 4096
sample_packing: true
pad_to_sequence_len: true

lora_r: 16
lora_alpha: 32
lora_dropout: 0.05
lora_target_linear: true
lora_fan_in_fan_out:
lora_target_modules:
  - gate_proj
  - down_proj
  - up_proj
  - q_proj
  - v_proj
  - k_proj
  - o_proj

lora_modules_to_save:
  - embed_tokens
  - lm_head

wandb_entity:
wandb_watch:
wandb_run_id:
wandb_log_model:

gradient_accumulation_steps: 4
micro_batch_size: 9
num_epochs: 3
optimizer: adamw_bnb_8bit
lr_scheduler: cosine
learning_rate: 0.0002

adam_beta2: 0.95
adam_epsilon: 0.00001
max_grad_norm: 1.0

train_on_inputs: false
group_by_length: false
bf16: true
fp16: false
tf32: false

gradient_checkpointing: true
early_stopping_patience:
resume_from_checkpoint:
auto_resume_from_checkpoints: true ## manage checkpoint resume from here
local_rank:
logging_steps: 1
xformers_attention:
flash_attention: true

warmup_steps: 10
eval_steps: 0
eval_table_size:
eval_table_max_new_tokens: 128
save_steps: 20 ## increase based on your dataset
save_strategy: steps
debug:
deepspeed:
weight_decay: 0.0
fsdp:
fsdp_config:
special_tokens:
  bos_token: ""
  eos_token: ""
  unk_token: ""
tokens: # these are delimiters
  - "<|im_start|>"
  - "<|im_end|>"

Possible solution

No response

Which Operating Systems are you using?

  • Linux
  • macOS
  • Windows

Python Version

3.10

axolotl branch-commit

main

Acknowledgements

  • My issue title is concise, descriptive, and in title casing.
  • I have searched the existing issues to make sure this bug has not been reported yet.
  • I am using the latest version of axolotl.
  • I have provided enough information for the maintainers to reproduce and diagnose the issue.
@manishiitg added the bug label on Feb 1, 2024
@manishiitg
Author

This got fixed after switching to ZeRO-2 and reducing the batch size.

@manishiitg
Author

This issue occurs mainly with zero3_bf16.json; it works well with ZeRO-2. Batch size is not the issue.
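
For reference, the working setup differs from the ZeRO-3 file only in the ZeRO stage. A minimal ZeRO-2 sketch, assuming the same bf16/auto settings as above (not the reporter's actual file):

{
  "zero_optimization": {
    "stage": 2,
    "overlap_comm": true,
    "contiguous_gradients": true,
    "reduce_bucket_size": "auto"
  },
  "bf16": {
    "enabled": true
  },
  "gradient_accumulation_steps": "auto",
  "gradient_clipping": "auto",
  "train_batch_size": "auto",
  "train_micro_batch_size_per_gpu": "auto"
}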

@ehartford
Collaborator

This is a real error that is preventing deepspeed zero3 from working as it should.
I am using a known good deepspeed zero3 config that works on other systems besides axolotl - but with axolotl it doesn't work.

@ehartford reopened this on Mar 1, 2024
@Nagico

Nagico commented Mar 11, 2024

This is a real error that is preventing deepspeed zero3 from working as it should. I am using a known good deepspeed zero3 config that works on other systems besides axolotl - but with axolotl it doesn't work.

It's a problem with DeepSpeed:

huggingface/transformers#29266 (comment)

So maybe we cannot use QLoRA with DeepSpeed currently, because of bitsandbytes.


And if this project doesn't use zero_init=True, that could work around the compatibility problem between bitsandbytes and DeepSpeed ZeRO-3:

microsoft/DeepSpeed#4295
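
To illustrate the reasoning above: under ZeRO-3's zero.Init (zero_init=True), parameters are partitioned as soon as the module is constructed, so each rank holds only an empty placeholder, and loading the full 32000 x 4096 embedding weight into it fails with exactly the shape torch.Size([0]) error from the traceback. A minimal sketch (assumed, not from the issue; run it under a distributed launcher such as deepspeed or torchrun):

# Minimal sketch: under deepspeed.zero.Init, freshly constructed parameters are
# partitioned immediately. param.data on each rank becomes an empty placeholder
# and the full shape is only tracked in DeepSpeed's ds_shape metadata, which is
# why set_module_tensor_to_device rejects the incoming [32000, 4096] tensor.
import deepspeed
import torch.nn as nn

deepspeed.init_distributed()  # normally handled by the launcher

with deepspeed.zero.Init():
    emb = nn.Embedding(32000, 4096)

print(emb.weight.shape)     # torch.Size([0])  (partitioned placeholder)
print(emb.weight.ds_shape)  # torch.Size([32000, 4096])  (logical full shape)

In accelerate's DeepSpeed integration, this construction-time partitioning is controlled by the zero3_init_flag setting, which appears to be what the linked workarounds toggle.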
