loss of 175B Megatron-LM doesn't converge with memory-efficient-attention when dropout.p=0.1 and attn-bias=LowerTriangularMask #724
Comments
Hi @ZhangDY-6483, xformers.ops.memory_efficient_attention(..., op=xformers.ops.fmha.MemoryEfficientAttentionFlashAttentionOp) [1] only works because you are on A100, and if you are on f16/bf16 and don't use a …
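For context, a minimal sketch of the call being discussed, assuming xformers ~0.0.16+ (tensor shapes are illustrative; xformers expects query/key/value as (batch, seq_len, num_heads, head_dim)):

```python
import torch
import xformers.ops as xops

# Illustrative shapes: (batch, seq_len, num_heads, head_dim)
q = torch.randn(2, 1024, 16, 64, device="cuda", dtype=torch.float16)
k = torch.randn(2, 1024, 16, 64, device="cuda", dtype=torch.float16)
v = torch.randn(2, 1024, 16, 64, device="cuda", dtype=torch.float16)

# Forcing the Flash-Attention backend, as in the comment above: this op
# requires f16/bf16 inputs and recent hardware such as A100.
out = xops.memory_efficient_attention(
    q, k, v,
    attn_bias=xops.LowerTriangularMask(),  # the causal mask from the issue title
    p=0.1,                                 # the dropout probability from the issue title
    op=xops.fmha.MemoryEfficientAttentionFlashAttentionOp,
)
```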
Thanks, and can I use memory-efficient attention with fp32 and pipeline parallelism (no data/model parallelism)?
Flash will only work with f16/bf16 unfortunately. I'm not exactly sure what you mean by pipeline parallel - this should not interfere with the attention part, I think.
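A sketch of how one might dispatch on dtype around this limitation; the CUTLASS op name is assumed from xformers of that era (passing op=None also lets xformers pick a backend automatically):

```python
import torch
import xformers.ops as xops

def attention_any_dtype(q, k, v, attn_bias=None, p=0.0):
    # Flash only supports f16/bf16; for fp32 inputs, fall back to the
    # CUTLASS-based op (name assumed from xformers ~0.0.16).
    if q.dtype in (torch.float16, torch.bfloat16):
        op = xops.fmha.MemoryEfficientAttentionFlashAttentionOp
    else:
        op = xops.fmha.MemoryEfficientAttentionCutlassOp
    return xops.memory_efficient_attention(q, k, v, attn_bias=attn_bias, p=p, op=op)
```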
Sorry, I mean pipeline parallelism, which is a strategy to train large models on multiple GPUs. (Ref 1: PIPELINE PARALLELISM; Ref 2: https://arxiv.org/abs/1811.06965)
I've got a fix for the dropout issue. I need to do some more testing, but it should be available next week hopefully :)
It should be fixed as of 70161e5, and will be included in the next release (0.0.19). In the meantime, you can also use a development build.
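Once on a build containing the fix, a rough way to sanity-check the masked path: dropout is disabled here (p=0.0) because the kernels use different RNG streams, so dropout outputs are not directly comparable; shapes and tolerances are illustrative:

```python
import torch
import xformers.ops as xops

# With p=0.0, the memory-efficient kernel should match a plain PyTorch
# causal attention to within fp16 tolerance.
B, M, H, K = 2, 256, 8, 64
q, k, v = (torch.randn(B, M, H, K, device="cuda", dtype=torch.float16) for _ in range(3))

out = xops.memory_efficient_attention(q, k, v, attn_bias=xops.LowerTriangularMask(), p=0.0)

# Reference in (B, H, M, K) layout: scores -> causal mask -> softmax -> matmul.
qt, kt, vt = (t.transpose(1, 2) for t in (q, k, v))
scores = qt @ kt.transpose(-1, -2) / (K ** 0.5)
causal = torch.ones(M, M, device="cuda", dtype=torch.bool).triu(1)
scores = scores.masked_fill(causal, float("-inf"))
ref = (scores.softmax(dim=-1) @ vt).transpose(1, 2)

print(torch.allclose(out, ref, atol=2e-3, rtol=2e-3))
```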
🐛 Bug
The loss doesn't converge; the curve is shown in the attached figure.
Command
To Reproduce
Steps to reproduce the behavior:
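The reproduction boils down to swapping Megatron-LM's standard attention sequence (scores -> causal mask -> softmax -> dropout -> matmul) for a single memory_efficient_attention call. A hypothetical sketch of that substitution follows; the function and attribute names are illustrative, not Megatron-LM's actual code:

```python
import xformers.ops as xops

def attention_forward(self, q, k, v):
    # q, k, v: (batch, seq_len, num_heads, head_dim). Hypothetical drop-in
    # replacement for the masked-softmax attention in a GPT-style
    # Megatron-LM attention layer.
    return xops.memory_efficient_attention(
        q, k, v,
        attn_bias=xops.LowerTriangularMask(),
        p=self.attention_dropout_p,  # 0.1 in this report - the failing configuration
    )
```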
Expected behavior
transformers.txt
Loss has the same trend before and after the replacement.
Environment
Please copy and paste the output from the PyTorch environment collection script (or fill out the checklist below manually). You can run the script with:
python -m torch.utils.collect_env
PyTorch version: 1.13.1+cu117
Is debug build: False
CUDA used to build PyTorch: 11.7
ROCM used to build PyTorch: N/A
OS: Ubuntu 18.04.6 LTS (x86_64)
GCC version: (GCC) 8.2.0
Clang version: 3.8.0 (tags/RELEASE_380/final)
CMake version: version 3.26.1
Libc version: glibc-2.26
Python version: 3.7.13 (default, Apr 24 2022, 01:04:09) [GCC 7.5.0] (64-bit runtime)
Python platform: Linux-4.14.0_1-0-0-44-x86_64-with-Ubuntu-18.04-bionic
Is CUDA available: True
CUDA runtime version: 11.7.64
CUDA_MODULE_LOADING set to: LAZY
GPU models and configuration:
GPU 0: NVIDIA A100-SXM4-80GB
GPU 1: NVIDIA A100-SXM4-80GB
GPU 2: NVIDIA A100-SXM4-80GB
GPU 3: NVIDIA A100-SXM4-80GB
GPU 4: NVIDIA A100-SXM4-80GB
GPU 5: NVIDIA A100-SXM4-80GB
GPU 6: NVIDIA A100-SXM4-80GB
GPU 7: NVIDIA A100-SXM4-80GB
Nvidia driver version: 470.82.01
cuDNN version: Probably one of the following:
/usr/lib/libcudnn.so.8.4.1
/usr/lib/libcudnn_adv_infer.so.8.4.1
/usr/lib/libcudnn_adv_train.so.8.4.1
/usr/lib/libcudnn_cnn_infer.so.8.4.1
/usr/lib/libcudnn_cnn_train.so.8.4.1
/usr/lib/libcudnn_ops_infer.so.8.4.1
/usr/lib/libcudnn_ops_train.so.8.4.1
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True
Versions of relevant libraries:
[pip3] mypy-extensions==1.0.0
[pip3] numpy==1.21.6
[pip3] torch==1.13.1
[pip3] torchaudio==0.13.1+cu117
[pip3] torchvision==0.14.1+cu117
[conda] Could not collect
Additional context