-
Notifications
You must be signed in to change notification settings - Fork 990
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
FSDP requires larger LR to converge at similar rate as DeepSpeed #2624
Comments
What's interesting here from my perspective is ages ago NVIDIA came out with a report that you should modify your LR by the number of GPUs, and there's been mixed results as to why that should be done. So on one hand I'm pleasantly surprised to see this behavior? https://huggingface.co/docs/accelerate/concept_guides/performance#learning-rates |
@muellerzr thanks for sharing the above. But In this case we are not comparing a multi-gpu with a single-gpu case, but rather two multi-gpu cases with the exact same number of gpus. Below is a debug run im doing where I only have 1 data point to force the convergence in a deterministic and fast manner. But as you see, the LRs are stepped exactly the same. The first loss and grad norm is exactly the same, but even with the same LRs, the loss behaviors are different. I find this very perplexing. DS
FSDP
|
Hello @fabianlim, can you provide a minimal reproducer for the above? Are you using mixed-precision and gradient accumulation? |
@pacman100 I have updated the description to include the reproduction script. As you can see in the args, I do activate Also i have attached two runs here, for different LR's Some observations
Somehow I feel all this points to a precision issue. Somehow at very low LRs, if the updates are too small, then some precision losses cause problems with the FSDP parameter updates. Or conversely, DS is holding something at higher precision (hence the more memory consumption) and is able to function at higher LRs. @stas00 any ideas from your side? |
@fabianlim - in the FSDP mixed precision policy you are running here, what is the reduce_dtype? |
@lessw2020 Thanks for the suggestion! I realize now that:
To summarize my understanding:
I updated the reproduction script to allow the To conclude:
|
@muellerzr is it worth writing a short blog post on this? I think it is very useful for folks who use both stacks. |
A blog certainly, a portion in the docs is even better, probably under a new concept guide |
@muellerzr I have started to draft a concept guide here. Just putting the link here to solicitate early with you, but its still very rough and I will be adding more to it https://github.com/fabianlim/accelerate/blob/update_docs_fsdp_deepspeed/docs/source/concept_guides/fsdp_and_deepspeed.md While preparing it I tried to reference the other existing concept guides. Currently it only has two sections
Any suggestions for more sections? Tried to say high level as much as possible and not go too much detail into code as this may change over time. |
Hi @fabianlim, can you check the per-gpu-batch-size on each rank when using DS ZeRO3 and FSDP? The fact that ZeRO3 consumes more memory than FSDP makes me think perhaps the way these two frameworks decide the per-gpu-batch-size is different. You can also explicitly set the per-gpu-batch-size for DS by changing "auto" in "train_micro_batch_size_per_gpu": "auto" to a specific number (e.g., 16). You may also want to check how to set a fixed per-gpu-batch-size for FSDP. |
@minjiazhang they are the same, as you can see I set the The reason for the extra memory, is because in DS because local flat params, optimizer params, and gradients, are kept in fp32, see here, whereas in FSDP it follows the |
@fabianlim I think that alone is a great improvement to the docs already! Would you like to open the PR? :) |
@muellerzr sure will do, give me a day or two to proof read it again! |
Thank you, Fabian! That makes sense. Then I agree the casting issue of bf16 to the master fp32 weight is likely the one that caused the memory consumption difference and the convergence gap. |
After reading this thread, basically the problem comes from FSDP keeping the training regime to the one of the initial model weights, regardless of the training regime specified by the user (w/ or w/o AMP). Therefore if the model was loaded in half-precision - it should probably raise an exception if a user is trying to apply AMP to it. I think this should be done on the code level and not documentation - nobody reads documentation. But this can be easily fixed in the code. i.e.:
Summary: Using AMP with 16bit model should be clearly an assertion |
Reopening issue again since we may be adding more code. @stas00 thanks for the suggestion. I took a stab at doing this after the FSDP wrap; I havnt tested it yet because im in the middle of something else. Idea is to look only at the FSDP flat_params, and only upcast those. Only thing Im not that sure of if its possible to have a flat_param on the meta device, when Maybe we can both code updates and documentation; the docs also serve to explain other similarities/ differences in FSDP/DS, besides this upcasting issue. I will raise a PR soon. |
Conceptually your proposal looks good, @fabianlim - thank you for leading this important investigation! Here is a bunch of suggestions/questions/concerns:
I know it'd be disruptive/BC-breaking but perhaps asserting and guiding the user to do the right thing would be a cleaner solution in the long term? This would work better for any custom loop where the user can correct their code. But solutions like HF Trainer would still require done-on-behalf-of-user approach, since the user can't control the dtype of the model. Not sure what the best solution is and perhaps it might be a good idea to involve HF Transformers maintainers to make sure this solution is done across the board including HF Trainer. |
@stas00 thanks for the thoughts! these are very helpful!
Yes I think I will try to propose something, and carefully document all the implications. Your thoughts are very helpful since you have pointed out certain things I might have overlooked. Then we should get the trainer maintainers to review. |
Your follow up notes resonate well, @fabianlim The other thing I forgot to mention is to also think about how fp8 trainings will happen, as I'd imagine they would be AMP as well, but a different |
@stas00 thanks for your suggestion! Got it, will consider |
Btw we’re working with the nvidia team rn to enable FSDP + fp8 currently, should be out by next release (and merged hopefully within a week). See #2655 |
@muellerzr thanks for pointing it out, i see that the changes will be necessary to delay the autocast after the FSDP wrap. Can #2655 be tested at this point, just so I can get a sense of it? |
Working on that soon @fabianlim :) Will update when its working enough to toy with on FSDP |
@muellerzr btw the command line arguments here almost never get updated. It seems it is not integrated with Sphinx https://huggingface.co/docs/accelerate/en/package_reference/cli#accelerate-launch. I guess we should also manually update this reference? |
I raised an inital PR. I have implemented and tested the casting logic requested by @stas00 . I had to use a few flags that are internal to the pytorch FSDP wrapper, that was the downside of it. I have also updated the documentation along with it. |
Hello @fabianlim and everyone part of this discussion, Thank you @fabianlim for the updated reproducer code. We did mention in several issues that the base model has to be loaded in the default FP32 dtype with AMP for full finetuning when all the parameters are trainable, basically all trainable parmas need to be in FP32 when using AMP. Please refer this PR to enable FA2 with the model loaded in FP32 for proper convergence huggingface/transformers#28142 as well as the issue comment tagged therein huggingface/transformers#26498 (comment). Thank you for the PR with the detailed concept guide ✨ and the logic to upcast the flat parameters to FP32. |
In terms of the Trainer being inline here, the changes from the above PR would be used by the Trainer and as such no changes on its side are required. |
@pacman100 thanks for your comments! Yes I was aware that it has been documented that when using AMP, one should load the model in fp32. However because nothing is really stopping the user to load in low precision while using mixed-precision, @stas00's suggested to avoid the difference in behavior of DS and FSDP in this (wrong-usage) scenario, just upcast it automatically and inform the user we are doing this. Thanks for pointing out huggingface/transformers#28142. Understand that this updates a warning to inform user to load in This picture repeats the FSDP and DS experiments with FA2. I verified that the models were indeed loaded with FA2 and 16 bit weights. I added an
|
@pacman100, relying on a user to remember to read some specific docs or even to be aware that they exist is futile. Do you not want users to succeed and continue using Accelerate and not switch to another framework? At the very least I think you'd want to assert (not warn) if AMP is used with non-FP32 weights and give a user a way to override the assert via a flag or an env var if the really want to waste more compute and lose time. I fail to see how not preventing invalid use of Accelerate/FSDP is good to anybody. Perhaps I'm missing something? This problem is going to affect more and more people as more and more models are distributed in half-precision. |
I can be open to allowing a flag, else raising a |
I commented on this before getting to the new PR #2674 so I'm not sure if my comment is still relevant - doesn't Accelerate now do the right thing and upcast on behalf of the user automatically? and if so shouldn't this issue be closed then? |
Yes indeed you are right, that PR included upcasting so this can be closed. CC @fabianlim :) |
@muellerzr @stas00 yes that is correct accelerate will now do the upcasting. Closing this issue! |
System Info
Deepspeed config
To reproduce use this script
Information
Tasks
no_trainer
script in theexamples
folder of thetransformers
repo (such asrun_no_trainer_glue.py
)Reproduction
mistralai/Mistral-7B-v0.1
) on multi-gpu training (say 4 nodes) for both FSDP and DeepSpeed (DS). Set FSDP toFULL_SHARD
and DS toZero3
.How slow is FSDP? Empircally, we found that if we increase the FSDP learning rate by the factor of the number of gpus used, then roughly be the same, as observed in the plots below.
This is unexpected as if the learning rate is set higher for the FSDP experiment, we expect the FSDP curve to converge faster. Our initial guess was that DS perhaps sums the gradients instead of averaging, but for zero3, we clearly see averaging being done if we trace into _avg_scatter_grads. In particular, the averaging will happen inside reduce_scatter_coalesced
Expected behavior
Expect that FSDP and DS, with all training parameters set the same, will converge at the same speed without having to scale the LRs.
@pacman100 @stas00 perhaps DS is doing an extra reduction somewhere that is not obvious?
The text was updated successfully, but these errors were encountered: