[BUG] pydantic_core._pydantic_core.ValidationError: 1 validation error for DeepSpeedZeroConfig #6525
Comments
ping @jithunnair-amd @loadams
This is actually the right solution. It looks like a fix for this is now in place upstream (huggingface/transformers#33402) and will make it into the next release.
Ha ha, yeah! Thanks for the info.
Thanks @adk9 - @jagadish-amd - should we close this bug for now?
Sure @loadams, we can close the issue.
Thanks! And thanks for pointing this out.
I was able to solve this by downgrading deepspeed.
@chadj2 - this should be fixed if you're using the latest transformers and deepspeed.
[rank4]: Input should be a valid integer, got a number with a fractional part [type=int_from_float, input_value=15099494.4, input_type=float]
I am still experiencing this problem in version 0.15.1.
@yourssmile - what transformers version are you using?
Thanks! Here's part of my "pip list"; hope it helps:
deepspeed 0.15.1
Hi @yourssmile - as mentioned above in this thread, you need a newer version of transformers that contains this commit, which supports the latest DeepSpeed versions (0.15.0+). So you will need to move to a newer version of transformers.
Thank you @loadams!! This issue was resolved after I upgraded to transformers 4.46.1.
I wish I could upgrade, but something starting in 0.14.1 allocates a lot more VRAM and causes an OOM when I use ZeRO stage 2. As I mentioned earlier, 0.14.5 fixes this specific issue, but I am stuck on 0.14.1. It's an older model based on the original StarCoder, so I am hoping this issue will just go away if I upgrade to a more recent model.
@chadj2 you could still apply huggingface/transformers@ecf7024 manually to your transformers installation to work around that.
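For anyone applying that commit by hand, a minimal sketch of the shape of the change, based on the stage3_prefetch_bucket_size calculation described in the issue body below; the exact diff is in the linked commit, so treat this as an illustration rather than the verbatim patch:

```python
# Illustrative only; the real code lives in transformers' deepspeed integration.
hidden_size = 4096  # model-dependent; 4096 matches the report below

# Before the fix: the auto-derived default is a float.
stage3_prefetch_bucket_size = 0.9 * hidden_size * hidden_size       # 15099494.4

# After the fix: cast to int so DeepSpeedZeroConfig's validation accepts it.
stage3_prefetch_bucket_size = int(0.9 * hidden_size * hidden_size)  # 15099494
```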
We are using DeepSpeed, transformers, and accelerate to fine-tune a Qwen LLM, and hit the issue below.
[rank2]: pydantic_core._pydantic_core.ValidationError: 1 validation error for DeepSpeedZeroConfig
[rank2]: stage3_prefetch_bucket_size
[rank2]: Input should be a valid integer, got a number with a fractional part [type=int_from_float, input_value=15099494.4, input_type=float]
[rank2]: For further information visit https://errors.pydantic.dev/2.9/v/int_from_float
Relevant stack:
[rank2]: File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/transformers/trainer.py", line 1539, in train
[rank2]: return inner_training_loop(
[rank2]: File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/transformers/trainer.py", line 1690, in _inner_training_loop
[rank2]: model, self.optimizer, self.lr_scheduler = self.accelerator.prepare(
[rank2]: File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/accelerate/accelerator.py", line 1318, in prepare
[rank2]: result = self._prepare_deepspeed(*args)
[rank2]: File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/accelerate/accelerator.py", line 1815, in _prepare_deepspeed
[rank2]: engine, optimizer, _, lr_scheduler = deepspeed.initialize(**kwargs)
[rank2]: File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/deepspeed/init.py", line 179, in initialize
[rank2]: config_class = DeepSpeedConfig(config, mpu, mesh_device=mesh_device)
[rank2]: File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/deepspeed/runtime/config.py", line 797, in init
[rank2]: self._initialize_params(copy.copy(self._param_dict))
[rank2]: File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/deepspeed/runtime/config.py", line 817, in _initialize_params
[rank2]: self.zero_config = get_zero_config(param_dict)
[rank2]: File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/deepspeed/runtime/zero/config.py", line 71, in get_zero_config
[rank2]: return DeepSpeedZeroConfig(**zero_config_dict)
[rank2]: File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/deepspeed/runtime/config_utils.py", line 57, in init
[rank2]: super().init(**data)
[rank2]: File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/pydantic/main.py", line 211, in init
[rank2]: validated_self = self.pydantic_validator.validate_python(data, self_instance=self)
[rank2]: pydantic_core._pydantic_core.ValidationError: 1 validation error for DeepSpeedZeroConfig
This was working prior to the Pydantic migration PR #5167.
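For context, pydantic v2 coerces a float into an int-typed field only when it has no fractional part; otherwise it raises the int_from_float error shown above. A minimal sketch that reproduces the same failure (the model here is a stand-in for the real DeepSpeedZeroConfig, not the actual class):

```python
from pydantic import BaseModel, ValidationError

class ZeroConfigSketch(BaseModel):
    stage3_prefetch_bucket_size: int  # same field name as the real config

try:
    ZeroConfigSketch(stage3_prefetch_bucket_size=15099494.4)
except ValidationError as e:
    print(e)
    # 1 validation error for ZeroConfigSketch
    # stage3_prefetch_bucket_size
    #   Input should be a valid integer, got a number with a fractional part
    #   [type=int_from_float, input_value=15099494.4, input_type=float]
```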
In our case, the stage3_prefetch_bucket_size parameter in DeepSpeedZeroConfig is calculated as 0.9 * hidden_size * hidden_size, as per
https://github.com/huggingface/transformers/blob/main/src/transformers/integrations/deepspeed.py#L244.
hidden_size is 4096, so stage3_prefetch_bucket_size turns out to be 15099494.4.
One possible solution is to convert the stage3_prefetch_bucket_size value to an int in the transformers library (the deepspeed integration file).
I am not sure if this is the right solution, so I am requesting the DeepSpeed team's help here.
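A quick check of the arithmetic shows why the value has a fractional part and what the proposed int conversion would produce:

```python
hidden_size = 4096
value = 0.9 * hidden_size * hidden_size
print(value)       # 15099494.4 -> rejected by the int-typed pydantic field
print(int(value))  # 15099494   -> would pass validation
```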