
[BUG] pydantic_core._pydantic_core.ValidationError: 1 validation error for DeepSpeedZeroConfig #6525

Closed
jagadish-amd opened this issue Sep 11, 2024 · 15 comments
Labels: bug (Something isn't working), training

@jagadish-amd (Contributor)

We are using DeepSpeed, transformers, and accelerate to fine-tune a Qwen LLM, and hit the issue below.
[rank2]: pydantic_core._pydantic_core.ValidationError: 1 validation error for DeepSpeedZeroConfig
[rank2]: stage3_prefetch_bucket_size
[rank2]: Input should be a valid integer, got a number with a fractional part [type=int_from_float, input_value=15099494.4, input_type=float]
[rank2]: For further information visit https://errors.pydantic.dev/2.9/v/int_from_float

Relevant stack trace:
[rank2]: File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/transformers/trainer.py", line 1539, in train
[rank2]: return inner_training_loop(
[rank2]: File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/transformers/trainer.py", line 1690, in _inner_training_loop
[rank2]: model, self.optimizer, self.lr_scheduler = self.accelerator.prepare(
[rank2]: File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/accelerate/accelerator.py", line 1318, in prepare
[rank2]: result = self._prepare_deepspeed(*args)
[rank2]: File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/accelerate/accelerator.py", line 1815, in _prepare_deepspeed
[rank2]: engine, optimizer, _, lr_scheduler = deepspeed.initialize(**kwargs)
[rank2]: File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/deepspeed/__init__.py", line 179, in initialize
[rank2]: config_class = DeepSpeedConfig(config, mpu, mesh_device=mesh_device)
[rank2]: File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/deepspeed/runtime/config.py", line 797, in __init__
[rank2]: self._initialize_params(copy.copy(self._param_dict))
[rank2]: File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/deepspeed/runtime/config.py", line 817, in _initialize_params
[rank2]: self.zero_config = get_zero_config(param_dict)
[rank2]: File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/deepspeed/runtime/zero/config.py", line 71, in get_zero_config
[rank2]: return DeepSpeedZeroConfig(**zero_config_dict)
[rank2]: File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/deepspeed/runtime/config_utils.py", line 57, in __init__
[rank2]: super().__init__(**data)
[rank2]: File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/pydantic/main.py", line 211, in __init__
[rank2]: validated_self = self.__pydantic_validator__.validate_python(data, self_instance=self)
[rank2]: pydantic_core._pydantic_core.ValidationError: 1 validation error for DeepSpeedZeroConfig

This was working prior to the Pydantic migration PR #5167.

In our case, the stage3_prefetch_bucket_size parameter in DeepSpeedZeroConfig is calculated as 0.9 * hidden_size * hidden_size, as per
https://github.com/huggingface/transformers/blob/main/src/transformers/integrations/deepspeed.py#L244.
hidden_size is 4096, so stage3_prefetch_bucket_size turns out to be 15099494.4, which is not an integer.
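
For illustration, here is a minimal sketch (ZeroConfigSketch is a made-up stand-in, not the actual DeepSpeedZeroConfig model; only the field name is taken from the error above) showing why Pydantic v2 rejects the value:

from pydantic import BaseModel, ValidationError

class ZeroConfigSketch(BaseModel):
    # Pydantic v2 coerces whole-number floats (e.g. 4.0) to int,
    # but rejects floats with a fractional part (int_from_float error).
    stage3_prefetch_bucket_size: int

hidden_size = 4096
try:
    ZeroConfigSketch(stage3_prefetch_bucket_size=0.9 * hidden_size * hidden_size)
except ValidationError as e:
    print(e)  # Input should be a valid integer, got a number with a fractional part

Casting to int before the value reaches the config model, as in the diff below, avoids this.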

One possible solution is to convert the stage3_prefetch_bucket_size value to an int in the transformers library (the DeepSpeed integration file):

diff --git a/src/transformers/integrations/deepspeed.py b/src/transformers/integrations/deepspeed.py
index aae1204ac..622080d41 100644
--- a/src/transformers/integrations/deepspeed.py
+++ b/src/transformers/integrations/deepspeed.py
@@ -241,7 +241,7 @@ class HfTrainerDeepSpeedConfig(HfDeepSpeedConfig):
                 # automatically assign the optimal config values based on model config
                 self.fill_only(
                     "zero_optimization.stage3_prefetch_bucket_size",
-                    0.9 * hidden_size * hidden_size,
+                    int(0.9 * hidden_size * hidden_size),
                 )
                 self.fill_only(
                     "zero_optimization.stage3_param_persistence_threshold",

I am not sure if this is the right solution; requesting the DeepSpeed team's help here.

jagadish-amd added the bug (Something isn't working) and training labels Sep 11, 2024
@jagadish-amd (Contributor, Author)

ping @jithunnair-amd @loadams

loadams self-assigned this Sep 11, 2024
@adk9 (Contributor) commented Sep 11, 2024

> I am not sure if this is the right solution; requesting the DeepSpeed team's help here.

This is actually the right solution. It looks like a fix for this is now in place upstream (huggingface/transformers#33402) and will make it to the next transformers release.

@jagadish-amd (Contributor, Author)

> > I am not sure if this is the right solution; requesting the DeepSpeed team's help here.
>
> This is actually the right solution. It looks like a fix for this is now in place upstream (huggingface/transformers#33402) and will make it to the next transformers release.

ha ha, yeah! Thanks for the info.

@loadams (Contributor) commented Sep 16, 2024

Thanks @adk9 - @jagadish-amd - should we close this bug for now?

@jagadish-amd (Contributor, Author)

> Thanks @adk9 - @jagadish-amd - should we close this bug for now?

Sure @loadams, we can close the issue.

@loadams (Contributor) commented Sep 16, 2024

Thanks! And thanks for pointing this out.

loadams closed this as completed Sep 16, 2024
@chadj2 commented Oct 21, 2024

I was able to solve this by downgrading deepspeed:

deepspeed = "0.14.5"

@loadams (Contributor) commented Oct 23, 2024

@chadj2 - this should be fixed if you're using the latest transformers and deepspeed.

@yourssmile

[rank4]: Input should be a valid integer, got a number with a fractional part [type=int_from_float, input_value=15099494.4, input_type=float]
[rank4]: For further information visit https://errors.pydantic.dev/2.9/v/int_from_float
[rank5]: Traceback (most recent call last):
[rank5]: File "/data/home/xucong/VTimeLLM-main/vtimellm/train/train_mem.py", line 20, in <module>
[rank5]: train()
[rank5]: File "/data/home/xucong/VTimeLLM-main/vtimellm/train/train.py", line 371, in train
[rank5]: trainer.train()
[rank5]: File "/root/miniforge3-py310/envs/py10flash/lib/python3.10/site-packages/transformers/trainer.py", line 1539, in train
[rank5]: return inner_training_loop(
[rank5]: File "/root/miniforge3-py310/envs/py10flash/lib/python3.10/site-packages/transformers/trainer.py", line 1656, in _inner_training_loop
[rank5]: model, self.optimizer = self.accelerator.prepare(self.model, self.optimizer)
[rank5]: File "/root/miniforge3-py310/envs/py10flash/lib/python3.10/site-packages/accelerate/accelerator.py", line 1318, in prepare
[rank5]: result = self._prepare_deepspeed(*args)
[rank5]: File "/root/miniforge3-py310/envs/py10flash/lib/python3.10/site-packages/accelerate/accelerator.py", line 1815, in _prepare_deepspeed
[rank5]: engine, optimizer, _, lr_scheduler = deepspeed.initialize(**kwargs)
[rank5]: File "/root/miniforge3-py310/envs/py10flash/lib/python3.10/site-packages/deepspeed/__init__.py", line 179, in initialize
[rank5]: config_class = DeepSpeedConfig(config, mpu, mesh_device=mesh_device)
[rank5]: File "/root/miniforge3-py310/envs/py10flash/lib/python3.10/site-packages/deepspeed/runtime/config.py", line 797, in __init__
[rank5]: self._initialize_params(copy.copy(self._param_dict))
[rank5]: File "/root/miniforge3-py310/envs/py10flash/lib/python3.10/site-packages/deepspeed/runtime/config.py", line 817, in _initialize_params
[rank5]: self.zero_config = get_zero_config(param_dict)
[rank5]: File "/root/miniforge3-py310/envs/py10flash/lib/python3.10/site-packages/deepspeed/runtime/zero/config.py", line 71, in get_zero_config
[rank5]: return DeepSpeedZeroConfig(**zero_config_dict)
[rank5]: File "/root/miniforge3-py310/envs/py10flash/lib/python3.10/site-packages/deepspeed/runtime/config_utils.py", line 57, in __init__
[rank5]: super().__init__(**data)
[rank5]: File "/root/miniforge3-py310/envs/py10flash/lib/python3.10/site-packages/pydantic/main.py", line 209, in __init__
[rank5]: validated_self = self.__pydantic_validator__.validate_python(data, self_instance=self)
[rank5]: pydantic_core._pydantic_core.ValidationError: 1 validation error for DeepSpeedZeroConfig

I am still experiencing this problem with DeepSpeed version 0.15.1.

@loadams (Contributor) commented Oct 31, 2024

@yourssmile - what transformers version are you using?

@yourssmile

Thanks. My transformers version is 4.31.0.

Here's part of my "pip list". Hope it helps:

Package       Version
deepspeed     0.15.1
flash_attn    2.6.3
peft          0.13.2
pillow        10.4.0
pip           24.2
torch         2.5.1
torchvision   0.20.1
tqdm          4.66.6
transformers  4.31.0

@loadams (Contributor) commented Nov 1, 2024

Hi @yourssmile - as mentioned above in this thread, you need a newer version of transformers that contains the fix (huggingface/transformers#33402) and supports the latest DeepSpeed versions (0.15.0+), so you will need to move to a newer transformers release.

@yourssmile

Thank you @loadams !! This issue was resolved after I upgraded to transformers 4.46.1.

@chadj2 commented Nov 4, 2024

> @chadj2 - this should be fixed if you're using the latest transformers and deepspeed.

I wish I could upgrade, but something starting after 0.14.1 allocates a lot more VRAM and causes an OOM when I use ZeRO stage 2. As I mentioned earlier, 0.14.5 fixes this specific issue, but I am stuck on 0.14.1.

It's an older model based on the original StarCoder, so I am hoping this issue will just go away when I upgrade to a more recent model.

@oraluben (Contributor) commented Nov 4, 2024

@chadj2 you could still apply huggingface/transformers@ecf7024 manually to your transformers installation to work around that.
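
For anyone pinned to an older transformers that still passes the float, another workaround sketch: set the bucket size explicitly as an integer in your DeepSpeed config so the auto-fill path never computes the fractional value. This assumes the fill_only call shown in the diff above only fills values left as "auto" (which its name suggests, but verify against your installed version); the hidden_size below is just an example.

# Hypothetical ds_config dict passed to the Trainer / deepspeed.initialize;
# the explicit int() keeps Pydantic's integer validation happy.
hidden_size = 4096  # your model's hidden size
ds_config = {
    "zero_optimization": {
        "stage": 3,
        "stage3_prefetch_bucket_size": int(0.9 * hidden_size * hidden_size),
        # other ZeRO options as needed
    },
    # rest of your DeepSpeed config
}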
