
[BUG] pydantic_core._pydantic_core.ValidationError: 1 validation error for DeepSpeedZeroConfig #6525

Closed
jagadish-amd opened this issue Sep 11, 2024 · 15 comments
Labels: bug (Something isn't working), training

@jagadish-amd (Contributor)

We are using DeepSpeed, transformers, and accelerate to fine-tune a Qwen LLM, and hit the issue below.
[rank2]: pydantic_core._pydantic_core.ValidationError: 1 validation error for DeepSpeedZeroConfig
[rank2]: stage3_prefetch_bucket_size
[rank2]: Input should be a valid integer, got a number with a fractional part [type=int_from_float, input_value=15099494.4, input_type=float]
[rank2]: For further information visit https://errors.pydantic.dev/2.9/v/int_from_float

Relevant stack trace:
[rank2]: File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/transformers/trainer.py", line 1539, in train
[rank2]: return inner_training_loop(
[rank2]: File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/transformers/trainer.py", line 1690, in _inner_training_loop
[rank2]: model, self.optimizer, self.lr_scheduler = self.accelerator.prepare(
[rank2]: File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/accelerate/accelerator.py", line 1318, in prepare
[rank2]: result = self._prepare_deepspeed(*args)
[rank2]: File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/accelerate/accelerator.py", line 1815, in _prepare_deepspeed
[rank2]: engine, optimizer, _, lr_scheduler = deepspeed.initialize(**kwargs)
[rank2]: File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/deepspeed/__init__.py", line 179, in initialize
[rank2]: config_class = DeepSpeedConfig(config, mpu, mesh_device=mesh_device)
[rank2]: File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/deepspeed/runtime/config.py", line 797, in __init__
[rank2]: self._initialize_params(copy.copy(self._param_dict))
[rank2]: File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/deepspeed/runtime/config.py", line 817, in _initialize_params
[rank2]: self.zero_config = get_zero_config(param_dict)
[rank2]: File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/deepspeed/runtime/zero/config.py", line 71, in get_zero_config
[rank2]: return DeepSpeedZeroConfig(**zero_config_dict)
[rank2]: File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/deepspeed/runtime/config_utils.py", line 57, in __init__
[rank2]: super().__init__(**data)
[rank2]: File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/pydantic/main.py", line 211, in __init__
[rank2]: validated_self = self.__pydantic_validator__.validate_python(data, self_instance=self)
[rank2]: pydantic_core._pydantic_core.ValidationError: 1 validation error for DeepSpeedZeroConfig

This was working prior to the Pydantic migration PR #5167.

In our case, the stage3_prefetch_bucket_size parameter in DeepSpeedZeroConfig is calculated as 0.9 * hidden_size * hidden_size, as per
https://github.com/huggingface/transformers/blob/main/src/transformers/integrations/deepspeed.py#L244.
hidden_size is 4096, so stage3_prefetch_bucket_size turns out to be 15099494.4, which is not an integer.
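
For illustration, here is a minimal sketch (ZeroConfigSketch is a made-up stand-in, not the actual DeepSpeedZeroConfig model; only the field name is taken from the error above) showing why Pydantic v2 rejects the value:

from pydantic import BaseModel, ValidationError

class ZeroConfigSketch(BaseModel):
    # Pydantic v2 coerces whole-number floats (e.g. 4.0) to int,
    # but rejects floats with a fractional part (int_from_float error).
    stage3_prefetch_bucket_size: int

hidden_size = 4096
try:
    ZeroConfigSketch(stage3_prefetch_bucket_size=0.9 * hidden_size * hidden_size)
except ValidationError as e:
    print(e)  # Input should be a valid integer, got a number with a fractional part

Casting to int before the value reaches the config model, as in the diff below, avoids this.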

One possible solution is to convert the stage3_prefetch_bucket_size value to an int in the transformers library (the DeepSpeed integration file):

diff --git a/src/transformers/integrations/deepspeed.py b/src/transformers/integrations/deepspeed.py
index aae1204ac..622080d41 100644
--- a/src/transformers/integrations/deepspeed.py
+++ b/src/transformers/integrations/deepspeed.py
@@ -241,7 +241,7 @@ class HfTrainerDeepSpeedConfig(HfDeepSpeedConfig):
                 # automatically assign the optimal config values based on model config
                 self.fill_only(
                     "zero_optimization.stage3_prefetch_bucket_size",
-                    0.9 * hidden_size * hidden_size,
+                    int(0.9 * hidden_size * hidden_size),
                 )
                 self.fill_only(
                     "zero_optimization.stage3_param_persistence_threshold",

I am not sure if this is the right solution; requesting the DeepSpeed team's help here.

jagadish-amd added the bug (Something isn't working) and training labels Sep 11, 2024
@jagadish-amd (Contributor, Author)

ping @jithunnair-amd @loadams

loadams self-assigned this Sep 11, 2024
@adk9 (Contributor) commented Sep 11, 2024

> I am not sure if this is the right solution; requesting the DeepSpeed team's help here.

This is actually the right solution. It looks like a fix for this is now in place upstream (huggingface/transformers#33402) and will make it to the next transformers release.

@jagadish-amd (Contributor, Author)

> > I am not sure if this is the right solution; requesting the DeepSpeed team's help here.
>
> This is actually the right solution. It looks like a fix for this is now in place upstream (huggingface/transformers#33402) and will make it to the next transformers release.

ha ha, yeah! Thanks for the info.

@loadams (Contributor) commented Sep 16, 2024

Thanks @adk9 - @jagadish-amd - should we close this bug for now?

@jagadish-amd (Contributor, Author)

> Thanks @adk9 - @jagadish-amd - should we close this bug for now?

Sure @loadams, we can close the issue.

@loadams (Contributor) commented Sep 16, 2024

Thanks! And thanks for pointing this out.

loadams closed this as completed Sep 16, 2024
@chadj2 commented Oct 21, 2024

I was able to solve this by downgrading deepspeed:

deepspeed = "0.14.5"

@loadams (Contributor) commented Oct 23, 2024

@chadj2 - this should be fixed if you're using the latest transformers and deepspeed.

@yourssmile

[rank4]: Input should be a valid integer, got a number with a fractional part [type=int_from_float, input_value=15099494.4, input_type=float]
[rank4]: For further information visit https://errors.pydantic.dev/2.9/v/int_from_float
[rank5]: Traceback (most recent call last):
[rank5]: File "/data/home/xucong/VTimeLLM-main/vtimellm/train/train_mem.py", line 20, in <module>
[rank5]: train()
[rank5]: File "/data/home/xucong/VTimeLLM-main/vtimellm/train/train.py", line 371, in train
[rank5]: trainer.train()
[rank5]: File "/root/miniforge3-py310/envs/py10flash/lib/python3.10/site-packages/transformers/trainer.py", line 1539, in train
[rank5]: return inner_training_loop(
[rank5]: File "/root/miniforge3-py310/envs/py10flash/lib/python3.10/site-packages/transformers/trainer.py", line 1656, in _inner_training_loop
[rank5]: model, self.optimizer = self.accelerator.prepare(self.model, self.optimizer)
[rank5]: File "/root/miniforge3-py310/envs/py10flash/lib/python3.10/site-packages/accelerate/accelerator.py", line 1318, in prepare
[rank5]: result = self._prepare_deepspeed(*args)
[rank5]: File "/root/miniforge3-py310/envs/py10flash/lib/python3.10/site-packages/accelerate/accelerator.py", line 1815, in _prepare_deepspeed
[rank5]: engine, optimizer, _, lr_scheduler = deepspeed.initialize(**kwargs)
[rank5]: File "/root/miniforge3-py310/envs/py10flash/lib/python3.10/site-packages/deepspeed/__init__.py", line 179, in initialize
[rank5]: config_class = DeepSpeedConfig(config, mpu, mesh_device=mesh_device)
[rank5]: File "/root/miniforge3-py310/envs/py10flash/lib/python3.10/site-packages/deepspeed/runtime/config.py", line 797, in __init__
[rank5]: self._initialize_params(copy.copy(self._param_dict))
[rank5]: File "/root/miniforge3-py310/envs/py10flash/lib/python3.10/site-packages/deepspeed/runtime/config.py", line 817, in _initialize_params
[rank5]: self.zero_config = get_zero_config(param_dict)
[rank5]: File "/root/miniforge3-py310/envs/py10flash/lib/python3.10/site-packages/deepspeed/runtime/zero/config.py", line 71, in get_zero_config
[rank5]: return DeepSpeedZeroConfig(**zero_config_dict)
[rank5]: File "/root/miniforge3-py310/envs/py10flash/lib/python3.10/site-packages/deepspeed/runtime/config_utils.py", line 57, in __init__
[rank5]: super().__init__(**data)
[rank5]: File "/root/miniforge3-py310/envs/py10flash/lib/python3.10/site-packages/pydantic/main.py", line 209, in __init__
[rank5]: validated_self = self.__pydantic_validator__.validate_python(data, self_instance=self)
[rank5]: pydantic_core._pydantic_core.ValidationError: 1 validation error for DeepSpeedZeroConfig

I am still experiencing this problem with DeepSpeed version 0.15.1.

@loadams (Contributor) commented Oct 31, 2024

@yourssmile - what transformers version are you using?

@yourssmile

Thanks. My transformers version is 4.31.0.

Here's part of my "pip list". Hope it helps:

Package       Version
deepspeed     0.15.1
flash_attn    2.6.3
peft          0.13.2
pillow        10.4.0
pip           24.2
torch         2.5.1
torchvision   0.20.1
tqdm          4.66.6
transformers  4.31.0

@loadams (Contributor) commented Nov 1, 2024

Hi @yourssmile - as mentioned above in this thread, you need a newer version of transformers that contains the fix (huggingface/transformers#33402) and supports the latest DeepSpeed versions (0.15.0+), so you will need to move to a newer transformers release.

@yourssmile

Thank you @loadams !! This issue was resolved after I upgraded to transformers 4.46.1.

@chadj2 commented Nov 4, 2024

> @chadj2 - this should be fixed if you're using the latest transformers and deepspeed.

I wish I could upgrade, but something starting after 0.14.1 allocates a lot more VRAM and causes an OOM when I use ZeRO stage 2. As I mentioned earlier, 0.14.5 fixes this specific issue, but I am stuck on 0.14.1.

It's an older model based on the original StarCoder, so I am hoping this issue will just go away when I upgrade to a more recent model.

@oraluben (Contributor) commented Nov 4, 2024

@chadj2 you could still apply huggingface/transformers@ecf7024 manually to your transformers installation to work around that.
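
For anyone pinned to an older transformers that still passes the float, another workaround sketch: set the bucket size explicitly as an integer in your DeepSpeed config so the auto-fill path never computes the fractional value. This assumes the fill_only call shown in the diff above only fills values left as "auto" (which its name suggests, but verify against your installed version); the hidden_size below is just an example.

# Hypothetical ds_config dict passed to the Trainer / deepspeed.initialize;
# the explicit int() keeps Pydantic's integer validation happy.
hidden_size = 4096  # your model's hidden size
ds_config = {
    "zero_optimization": {
        "stage": 3,
        "stage3_prefetch_bucket_size": int(0.9 * hidden_size * hidden_size),
        # other ZeRO options as needed
    },
    # rest of your DeepSpeed config
}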
