NaN tensor values problem for GTX16xx users (no problem on other devices) #7908

Closed · YipKo opened this issue May 20, 2022 · 43 comments · Fixed by #7917 or #8804
Labels: bug (Something isn't working)

YipKo commented May 20, 2022

Search before asking

  • I have searched the YOLOv5 issues and found no similar bug report.

YOLOv5 Component

Training, Validation

Bug

I used YOLOv5 with the demo dataset (coco128) and found that the box and obj losses are NaN. Also, no detections appear on the validation images. This only happens on GTX 1660 Ti devices (GPU mode); when I use the CPU, or train on Google Colab (Tesla K80) or an RTX 2070, everything works fine.

Environment

  • Windows 10 10.0.19044.1706
  • YOLOv5-6.1 (version 6.1)
  • NVIDIA GTX 1660 Ti, 6 GB
  • Python 3.9
  • cudatoolkit-11.3.1
  • pytorch-1.11.0-py3.9_cuda11.3_cudnn8_0
  • (also tried pytorch-1.11.0-py3.9_cuda11.5_cudnn8_0)
  • (with dependencies installed correctly)

Minimal Reproducible Example

The command used for training is
python train.py

Additional

There are other issues that also discuss this problem.

However, I have tried a PyTorch build with CUDA 11.5 (whose cuDNN version is 8.3.0 > 8.2.2), and I also tried downloading cuDNN from NVIDIA and copying the DLL files into the relevant folder in torch/lib; the problem still could not be solved.

Another workaround is to downgrade to a PyTorch build with CUDA 10.2 (tested, and it works), but this is currently not feasible as CUDA 10.2 PyTorch builds are no longer available for Windows.

Are you willing to submit a PR?

  • Yes I'd like to help by submitting a PR!
YipKo added the bug label on May 20, 2022
github-actions bot (Contributor) commented May 20, 2022

👋 Hello @MarkDeia, thank you for your interest in YOLOv5 🚀! Please visit our ⭐️ Tutorials to get started, where you can find quickstart guides for simple tasks like Custom Data Training all the way to advanced concepts like Hyperparameter Evolution.

If this is a 🐛 Bug Report, please provide screenshots and minimum viable code to reproduce your issue, otherwise we can not help you.

If this is a custom training ❓ Question, please provide as much information as possible, including dataset images, training logs, screenshots, and a public link to online W&B logging if available.

For business inquiries or professional support requests please visit https://ultralytics.com or email support@ultralytics.com.

Requirements

Python>=3.7.0 with all requirements.txt installed including PyTorch>=1.7. To get started:

git clone https://github.com/ultralytics/yolov5  # clone
cd yolov5
pip install -r requirements.txt  # install

Environments

YOLOv5 may be run in any of the following up-to-date verified environments (with all dependencies including CUDA/CUDNN, Python and PyTorch preinstalled):

Status

CI CPU testing

If this badge is green, all YOLOv5 GitHub Actions Continuous Integration (CI) tests are currently passing. CI tests verify correct operation of YOLOv5 training (train.py), validation (val.py), inference (detect.py) and export (export.py) on macOS, Windows, and Ubuntu every 24 hours and on every commit.

@glenn-jocher (Member)
@MarkDeia you may be able to work around this by disabling AMP in train.py: anywhere that says enabled=cuda, set it to enabled=False.
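Concretely, the change looks roughly like this (a sketch of the pattern described above; the stand-in model and exact lines in train.py are illustrative and may differ by version):

import torch
import torch.nn as nn
from torch.cuda import amp

cuda = torch.cuda.is_available()
device = torch.device('cuda' if cuda else 'cpu')
model = nn.Conv2d(3, 16, 3).to(device)    # illustrative stand-in for the YOLOv5 model
imgs = torch.randn(1, 3, 64, 64, device=device)

scaler = amp.GradScaler(enabled=False)    # was: amp.GradScaler(enabled=cuda)
with amp.autocast(enabled=False):         # was: amp.autocast(enabled=cuda)
    pred = model(imgs)                    # forward pass now runs in full FP32
print(pred.dtype)                         # torch.float32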

YipKo (Author) commented May 20, 2022

@glenn-jocher Thanks for your reply. By turning off the automatic mixed precision function, the box/obj/cls values are back to normal, but the P/R/mAP values during validation are still 0.

At first I thought the problem was the CUDA/cuDNN dependency that ships with PyTorch, but NVIDIA claims this problem was solved in cuDNN 8.2.2.
Using the code from that issue (via MobileNetV2), I tested it in a PyTorch + CUDA 11.3 environment (with the cuDNN 8.2.2 DLL files copied in), and the outputs are normal.

I am very confused: the AMP and FP16 values seem to be fine. It looks like the problem of returning NaN with half precision has been fixed, yet the problem still exists in YOLOv5 training and validation.

Also, detection works well with python detect.py.

@glenn-jocher (Member)
@MarkDeia 0 labels means you have zero labels. Without labels there won't be any metrics obviously.

YipKo (Author) commented May 20, 2022

@glenn-jocher In fact, I think I set the labels of the validation dataset correctly, since the validation set is the same as the training set in the coco128 dataset.
As I mentioned, when I used the previous version of PyTorch (pytorch==1.10.1 torchvision==0.11.2 torchaudio==0.10.1 cudatoolkit=10.2) everything ran correctly; when I switched to PyTorch 1.11 (CUDA 11.3) using conda, the problem arose. The code and dataset did not change at all between the two runs, and I launched both with python train.py.

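For reference, that earlier working environment can be recreated with a conda command along these lines (a sketch; as noted above, the CUDA 10.2 builds may no longer be downloadable for Windows):

conda install pytorch==1.10.1 torchvision==0.11.2 torchaudio==0.10.1 cudatoolkit=10.2 -c pytorch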

@glenn-jocher (Member)
@MarkDeia they're two separate issues. The 0 labels count indicates that there are simply no labels in your validation set, which has nothing to do with CUDA, your environment, or your hardware. There is no fundamental problem with detecting labels, since your training has box and cls losses.

YipKo (Author) commented May 20, 2022

@glenn-jocher I don't quite understand, since I am new to this; what is causing there to be no labels in my validation set?

glenn-jocher (Member) commented May 21, 2022

Your dataset is structured incorrectly. To train correctly your data must be in YOLOv5 format. Please see our Train Custom Data tutorial for full documentation on dataset setup and all steps required to start training your first model. A few excerpts from the tutorial:

1.1 Create dataset.yaml

COCO128 is an example small tutorial dataset composed of the first 128 images in COCO train2017. These same 128 images are used for both training and validation to verify our training pipeline is capable of overfitting. data/coco128.yaml, shown below, is the dataset config file that defines 1) the dataset root directory path and relative paths to train / val / test image directories (or *.txt files with image paths), 2) the number of classes nc and 3) a list of class names:

# Train/val/test sets as 1) dir: path/to/imgs, 2) file: path/to/imgs.txt, or 3) list: [path/to/imgs1, path/to/imgs2, ..]
path: ../datasets/coco128  # dataset root dir
train: images/train2017  # train images (relative to 'path') 128 images
val: images/train2017  # val images (relative to 'path') 128 images
test:  # test images (optional)

# Classes
nc: 80  # number of classes
names: [ 'person', 'bicycle', 'car', 'motorcycle', 'airplane', 'bus', 'train', 'truck', 'boat', 'traffic light',
         'fire hydrant', 'stop sign', 'parking meter', 'bench', 'bird', 'cat', 'dog', 'horse', 'sheep', 'cow',
         'elephant', 'bear', 'zebra', 'giraffe', 'backpack', 'umbrella', 'handbag', 'tie', 'suitcase', 'frisbee',
         'skis', 'snowboard', 'sports ball', 'kite', 'baseball bat', 'baseball glove', 'skateboard', 'surfboard',
         'tennis racket', 'bottle', 'wine glass', 'cup', 'fork', 'knife', 'spoon', 'bowl', 'banana', 'apple',
         'sandwich', 'orange', 'broccoli', 'carrot', 'hot dog', 'pizza', 'donut', 'cake', 'chair', 'couch',
         'potted plant', 'bed', 'dining table', 'toilet', 'tv', 'laptop', 'mouse', 'remote', 'keyboard', 'cell phone',
         'microwave', 'oven', 'toaster', 'sink', 'refrigerator', 'book', 'clock', 'vase', 'scissors', 'teddy bear',
         'hair drier', 'toothbrush' ]  # class names

1.2 Create Labels

After using a tool like Roboflow Annotate to label your images, export your labels to YOLO format, with one *.txt file per image (if no objects in image, no *.txt file is required). The *.txt file specifications are:

  • One row per object
  • Each row is class x_center y_center width height format.
  • Box coordinates must be in normalized xywh format (from 0 to 1). If your boxes are in pixels, divide x_center and width by image width, and y_center and height by image height (see the conversion sketch after this list).
  • Class numbers are zero-indexed (start from 0).
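For example, a small helper that performs this conversion could look like the following (to_yolo_xywh is an illustrative name, not part of YOLOv5):

def to_yolo_xywh(x_min, y_min, x_max, y_max, img_w, img_h):
    # Convert a pixel-space corner box to YOLO's normalized "x_center y_center width height".
    x_center = (x_min + x_max) / 2 / img_w
    y_center = (y_min + y_max) / 2 / img_h
    width = (x_max - x_min) / img_w
    height = (y_max - y_min) / img_h
    return x_center, y_center, width, height

# Example: a 200x400 px box with top-left corner (100, 50) in a 1280x720 image, class 0
print(0, *to_yolo_xywh(100, 50, 300, 450, 1280, 720))
# -> 0 0.15625 0.3472222222222222 0.15625 0.5555555555555556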

Image Labels

The label file corresponding to the above image contains 2 persons (class 0) and a tie (class 27).

1.3 Organize Directories

Organize your train and val images and labels according to the example below. YOLOv5 assumes /coco128 is inside a /datasets directory next to the /yolov5 directory. YOLOv5 locates labels automatically for each image by replacing the last instance of /images/ in each image path with /labels/. For example:

../datasets/coco128/images/im0.jpg  # image
../datasets/coco128/labels/im0.txt  # label

Good luck 🍀 and let us know if you have any other questions!

YipKo (Author) commented May 21, 2022

@glenn-jocher I think you may have misunderstood me. I ran the same code and the same dataset in both the PyTorch + CUDA 11.3 and PyTorch + CUDA 10.2 environments, yet the problem only occurred in the CUDA 11.3 environment. Furthermore, I was using the YOLOv5 demo dataset (coco128), so I don't think there is any problem with the structure of my dataset (I confirm that my data, coco128, is in YOLOv5 format).
I am puzzled by this result, but since the same run works in a different environment, I am more inclined to think there is no problem with my dataset or the integrity of the code.

In any case, it is clear that at least part of the problem comes from the autocast function in torch\cuda\amp\autocast_mode.py.
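A minimal way to probe that suspicion is to run a single convolution under autocast and check the output for NaNs (an illustrative test, not YOLOv5 code; the layer shape is arbitrary):

import torch
import torch.nn as nn

device = torch.device('cuda')
conv = nn.Conv2d(3, 64, kernel_size=6, stride=2, padding=2).to(device)
x = torch.randn(1, 3, 640, 640, device=device)

with torch.cuda.amp.autocast():
    y = conv(x)                      # runs in FP16 through cuDNN while autocast is active

print(y.dtype, 'contains NaN:', torch.isnan(y).any().item())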

glenn-jocher (Member) commented May 21, 2022

@MarkDeia well, I can't really say what the issue might be. If you can help us recreate the problem with a minimum reproducible example we can get started debugging it, but given your hardware I don't think it's reproducible in other environments.

In any case I'd always recommend running in our Docker image if you are having issues with a local environment. See https://docs.ultralytics.com/yolov5/environments/docker_image_quickstart_tutorial/

YipKo (Author) commented May 21, 2022

@glenn-jocher Thank you for your patience despite your busy schedule. 👍
Given how specific this problem is (NVIDIA GTX 16xx only), everyone who encounters the same issue, please chime in. By the way, maybe you should get more sleep :) After all, it's the weekend.

glenn-jocher (Member) commented May 21, 2022

@MarkDeia what we can do, which won't solve your problem but will probably help a lot of people, is to run a check before training to make sure that everything works correctly, and if not, refer users to this issue or a tutorial about their options.

There have definitely been multiple users that have run into issues, usually with a combination of CUDA 11, Windows, Conda and consumer cards.

I'm not sure what the minimum test might be; after all, we don't want to run a short COCO128 training before everyone's actual trainings, as that would probably do more harm than good. OK, I've got it: we can run inference with and without AMP, and the check will be a torch.allclose() on the outputs. If you run this on your system, what do you see? On Colab we get the same detections, with boxes accurate to <1 pixel.

# PyTorch Hub
import torch

# Model
model = torch.hub.load('ultralytics/yolov5', 'yolov5s')

# Images
dir = 'https://ultralytics.com/images/'
imgs = [dir + f for f in ('zidane.jpg', 'bus.jpg')]  # batch of images

# Inference
results = model(imgs)
model.amp = True
results_amp = model(imgs)
print(results.xyxy[0] - results_amp.xyxy[0])

tensor([[-0.44983, -0.21283,  0.20471, -0.35834, -0.00050,  0.00000],
        [ 0.05951,  0.02808, -0.19067,  0.33899, -0.00065,  0.00000],
        [-0.05856, -0.06934, -0.00732,  0.04700,  0.00124,  0.00000],
        [-0.10693,  0.35675,  0.36877,  0.09174, -0.00141,  0.00000]], device='cuda:0')
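The comparison described above could be wrapped up roughly like this (a sketch of the idea, not the exact code that landed in PR #7917; amp_outputs_match is an illustrative helper name):

import torch

def amp_outputs_match(model, im, atol=1.0):
    # Run the same image with AMP off and on and compare the raw detections.
    model.amp = False
    det_fp32 = model(im).xyxy[0]
    model.amp = True
    det_amp = model(im).xyxy[0]
    # A broken AMP path usually changes the number of detections, so compare shapes first.
    return det_fp32.shape == det_amp.shape and torch.allclose(det_fp32, det_amp, atol=atol)

model = torch.hub.load('ultralytics/yolov5', 'yolov5s')
print(amp_outputs_match(model, 'https://ultralytics.com/images/zidane.jpg'))  # True on a healthy setup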

YipKo (Author) commented May 21, 2022

@glenn-jocher
print(results_amp.xyxy[0])
print(results.xyxy[0])
The outputs are as follows:

tensor([], device='cuda:0', size=(0, 6))
tensor([[7.42550e+02, 4.80370e+01, 1.14120e+03, 7.16642e+02, 8.81825e-01, 0.00000e+00],
        [4.42060e+02, 4.37528e+02, 4.96809e+02, 7.09839e+02, 6.87341e-01, 2.70000e+01],
        [1.25191e+02, 1.93681e+02, 7.11992e+02, 7.13047e+02, 6.39421e-01, 0.00000e+00],
        [9.82893e+02, 3.08357e+02, 1.02737e+03, 4.20092e+02, 2.62013e-01, 2.70000e+01]], device='cuda:0')

Since they have different dimensions, they cannot be subtracted, and from this result we can tell that something apparently went wrong when running the AMP function. I will continue to try to find the root of the problem, but it may take a few weeks as I can only debug in my spare time.

@glenn-jocher (Member)
@MarkDeia perfect! That's all I need. I'll work on a PR.

glenn-jocher self-assigned this on May 21, 2022
glenn-jocher linked a pull request on May 21, 2022 that will close this issue
glenn-jocher (Member) commented May 21, 2022

@MarkDeia can you run this code and verify that you get an AMP failure notice before training starts? This tests PR #7917 which automatically disables AMP if the two image results don't match just as I proposed earlier. This won't solve all the problems but hopefully it will help many users.


git clone https://github.com/ultralytics/yolov5 -b amp_check  # clone
cd yolov5
python train.py --epochs 3

YipKo (Author) commented May 22, 2022

@glenn-jocher Glad you added AMP verification, even though I still have problems with the verification process after turning AMP off; but, as you say, this won't solve all the problems, though hopefully it will help many users.
There is a slight error in the check_amp function in this PR, which I have commented on under #7917.

tahvane1 commented Jun 5, 2022

I have this same issue on a 1080 Ti. Even after the fix you issued, labels are sometimes zeroed after training for a while. I also tried with the --device cpu flag and got zero labels at some point as well. Sometimes training succeeds with the GPU...

abadia24 commented Jun 20, 2022

Same issue here. I followed the tutorial and these are the results after 1.5 hours of training. I also have a 1660 Ti laptop GPU.
results.csv

glenn-jocher (Member) commented Jun 21, 2022

@abadia24 NaNs are unrecoverable, so if you ever see an epoch with them you can immediately terminate training, as the rest of the run will contain them too.
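A minimal guard in that spirit (illustrative, not YOLOv5's own code) is to abort as soon as the loss goes non-finite:

import torch

def stop_if_nonfinite(loss: torch.Tensor) -> None:
    # NaN/Inf losses never recover, so abort immediately instead of finishing the schedule.
    if not torch.isfinite(loss).all():
        raise RuntimeError(f'Non-finite loss detected ({loss}); terminating training')

stop_if_nonfinite(torch.tensor(1.23))  # passes silently; a NaN loss here would raise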

In the meantime you might try training in Docker, which is a self-contained Linux environment with everything verified to be working correctly.

Environments

YOLOv5 may be run in any of the following up-to-date verified environments (with all dependencies including CUDA/CUDNN, Python and PyTorch preinstalled):

glenn-jocher removed the TODO label on Jun 21, 2022
evo11x commented Jun 21, 2022

It seems that this problem is much broader. I have the same NaN problem running Ray with PyTorch on an RTX 3060 with CUDA 11.x (Windows 11 and Ubuntu 20).
I tried many CUDA version combinations with older PyTorch releases (except CUDA 10.x) with no success.
The problem appears when using multithreading with higher GPU throughput: the faster you run it, the sooner it crashes.

mhw-Parker commented Jul 12, 2022

Hi, I have also met the same problem.
System: Windows 10
GPU: NVIDIA GTX 1660 Ti
CUDA: CUDA & cudatoolkit 11.3
PyTorch: 1.11

I have already used torch.cuda.is_available() to check that my GPU environment is set up correctly.
I can also run detect.py on the GPU.
But when I try to train on my own custom data, an error occurs.
With --device 0 (GPU): (screenshot)

With --device cpu: (screenshot)

Can I really not train with my GTX 1660 Ti GPU?

glenn-jocher (Member) commented Jul 12, 2022

@mhw-Parker are you using the latest version of YOLOv5? What does your AMP check say before training starts?

@Raziel619

Hi, I also have the same issue. I'm running the following:

YOLOv5 version: Latest from master (07/30/2022)
PyTorch version: 1.12.0
CUDA version: 11.6
GPU: GTX 1660

All AMP checks passed. When I run the same script with the same dataset on the CPU, I get valid results.

Note: I had to replace the torch requirements from the repo with the following for torch.cuda.is_available() to return True:
torch==1.12.0+cu116
torchaudio==0.12.0+cu116
torchvision==0.13.0+cu116

glenn-jocher (Member) commented Jul 30, 2022

@Raziel619 that's strange that the AMP checks passed yet you're still seeing problems. You might try disabling AMP completely by setting amp=False here, i.e. simulating an AMP check failure.

yolov5/train.py

Line 128 in 1e89807

amp = check_amp(model) # check AMP
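In other words, the suggested edit at that line is roughly (a sketch, not a committed change):

# amp = check_amp(model)  # check AMP   <- original line
amp = False  # hard-disable AMP, equivalent to simulating a failed AMP check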

glenn-jocher (Member) commented Jul 30, 2022

@Raziel619 you might also try training inside the Docker image for the best stability.

Environments

YOLOv5 may be run in any of the following up-to-date verified environments (with all dependencies including CUDA/CUDNN, Python and PyTorch preinstalled):

@Raziel619

I disabled AMP completely and it did improve results somewhat: I'm no longer getting NaNs for "train/box_loss", "train/obj_loss", and "train/cls_loss", but I'm getting all zeros or NaNs for almost everything else. See the attached results.
results.csv

glenn-jocher (Member) commented Jul 31, 2022

@Raziel619 hmm. Validation is done at half precision; maybe try adding half=False here to the val.run() call?

yolov5/train.py

Lines 367 to 377 in 1e89807

results, maps, _ = val.run(data_dict,
                           batch_size=batch_size // WORLD_SIZE * 2,
                           imgsz=imgsz,
                           model=ema.ema,
                           single_cls=single_cls,
                           dataloader=val_loader,
                           save_dir=save_dir,
                           plots=False,
                           callbacks=callbacks,
                           compute_loss=compute_loss)
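That is, the suggestion is to pass the flag explicitly, roughly like this (a sketch of the edit; half is val.run's FP16 flag):

results, maps, _ = val.run(data_dict,
                           batch_size=batch_size // WORLD_SIZE * 2,
                           imgsz=imgsz,
                           half=False,  # force FP32 validation
                           model=ema.ema,
                           single_cls=single_cls,
                           dataloader=val_loader,
                           save_dir=save_dir,
                           plots=False,
                           callbacks=callbacks,
                           compute_loss=compute_loss)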

@Raziel619

That fixes it! Yay! Thank you so much for these swift responses; super excited to get started with some training.

@glenn-jocher (Member)
@Raziel619 good news 😃! Your original issue may now be partially resolved ✅ in PR #8804. This PR doesn't fix the root cause, but it does disable FP16 validation if the AMP checks fail, or simply if you manually set amp=False.

To receive this update:

  • Git – git pull from within your yolov5/ directory, or git clone https://github.com/ultralytics/yolov5 again
  • PyTorch Hub – force-reload with model = torch.hub.load('ultralytics/yolov5', 'yolov5s', force_reload=True)
  • Notebooks – view the updated notebooks (Colab, Kaggle)
  • Docker – sudo docker pull ultralytics/yolov5:latest to update your image

Thank you for spotting this issue and informing us of the problem. Please let us know if this update resolves the issue for you, and feel free to inform us of any other issues you discover or feature requests that come to mind. Happy trainings with YOLOv5 🚀!

@Yang-Jianzhang

Hi, I think this issue does not only happen on consumer cards, because I also have the same issue training on 8x Tesla V100.

When turning off AMP, the training time doubles.

I think the best way to handle this problem is to downgrade the CUDA version to 10.x.

Caterina1996 commented Dec 13, 2022

Hello! I was having the same problem on an NVIDIA GeForce GTX 1650 on Ubuntu 20, with a conda environment and CUDA 11. I was getting NaNs when training on coco128. The easiest way to solve it for me was installing CUDA 10 in my environment with:

conda install pytorch torchvision cuda100 -c pytorch

@Tommyisr

I disabled cuDNN in PyTorch and it solved the issue with NaN values, but I'm not sure whether it will affect the performance of the training process.

torch.backends.cudnn.enabled = False

Windows 10
YOLOv8
NVIDIA GTX 1660 Super
Conda env
NVIDIA GTX 1660
Python 3.9
cudatoolkit-11.3.1
pytorch-1.12

@glenn-jocher (Member)
@Tommyisr thank you for sharing your experience with the community! Disabling cudnn can indeed resolve the NaN issue for some users, but it may come with a performance tradeoff. We recommend monitoring the training process to evaluate whether there are noticeable impacts on performance.
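One rough way to gauge that tradeoff on your own GPU is a small timing comparison like the following (an illustrative micro-benchmark, not part of YOLOv5):

import time
import torch
import torch.nn as nn

def bench(cudnn_enabled, iters=50):
    # Time forward+backward passes of a single conv layer with cuDNN toggled on or off.
    torch.backends.cudnn.enabled = cudnn_enabled
    conv = nn.Conv2d(64, 64, 3, padding=1).cuda()
    x = torch.randn(16, 64, 160, 160, device='cuda')
    torch.cuda.synchronize()
    t0 = time.time()
    for _ in range(iters):
        conv(x).sum().backward()
    torch.cuda.synchronize()
    return time.time() - t0

print(f'cuDNN on: {bench(True):.2f}s  cuDNN off: {bench(False):.2f}s')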

For anyone encountering similar issues, please feel free to try the solutions mentioned here and share your results. Your feedback helps the community improve the overall YOLOv5 experience.

For more information and troubleshooting tips, please refer to the Ultralytics YOLOv5 Documentation. If you have any further questions or issues, don't hesitate to reach out. Happy training!

Alarmod added a commit to Alarmod/MRI_MedicalAnalysis that referenced this issue Aug 4, 2024
Alarmod commented Aug 4, 2024

Same error with a GTX 1650; the fix with torch.backends.cudnn.enabled = False helped me...

@glenn-jocher (Member)
Hi @Alarmod,

Thank you for sharing your solution! Disabling cudnn with torch.backends.cudnn.enabled = False can indeed help resolve NaN issues on certain GPUs, including the GTX 1650. This workaround is useful for those experiencing similar problems.

However, please be aware that disabling cudnn might impact the performance of your training process. It's a good idea to monitor your training times and model performance to ensure that the trade-off is acceptable for your use case.

For those encountering similar issues, here are a few additional steps you can try:

  1. Update to the Latest Versions: Ensure you are using the latest versions of YOLOv5, PyTorch, and CUDA. Sometimes, updates include important bug fixes and performance improvements.
  2. Experiment with Different CUDA Versions: As mentioned earlier, using an older CUDA version (e.g., CUDA 10.x) has resolved the issue for some users.
  3. Disable AMP: If you're using Automatic Mixed Precision (AMP), try disabling it to see if it resolves the issue.

If the problem persists, please provide more details about your environment and setup, and we will do our best to assist you further.

Thank you for being a part of the YOLO community, and happy training! 🚀

Alarmod commented Aug 5, 2024

I checked PyTorch 2.3.1 and 2.4.0 (latest) for CUDA 11.8 on Windows 10. I tried disabling AMP after model load with model.amp = False; it didn't help.

@glenn-jocher (Member)
Hi @Alarmod,

Thank you for your detailed follow-up. It's unfortunate that disabling AMP didn't resolve the issue for you. Given that you've already tried the latest versions of PyTorch (2.3.1 and 2.4.0) with CUDA 11.8 on Windows 10, let's explore a few more potential solutions:

  1. Disable cudnn: As mentioned by other users, disabling cudnn can sometimes resolve NaN issues. You can do this by adding the following line to your script:

    torch.backends.cudnn.enabled = False
  2. Check for NaNs in Data: Ensure that your input data does not contain NaNs or infinities. You can add a simple check to your data loading pipeline:

    def check_for_nans(tensor):
        if torch.isnan(tensor).any() or torch.isinf(tensor).any():
            raise ValueError("Input tensor contains NaNs or infinities")
    
    # Example usage
    for images, targets in dataloader:
        check_for_nans(images)
        check_for_nans(targets)
  3. Experiment with Different CUDA Versions: Some users have found success by downgrading to an older CUDA version, such as CUDA 10.x. You can create a new conda environment with a different CUDA version to test this:

    conda create -n yolov5_cuda10 python=3.9
    conda activate yolov5_cuda10
    conda install pytorch torchvision cudatoolkit=10.2 -c pytorch
  4. Update YOLOv5: Ensure you are using the latest version of YOLOv5. You can update your local repository by running:

    git pull
    pip install -r requirements.txt
  5. Check for Hardware Issues: Occasionally, hardware issues such as overheating or faulty memory can cause NaN values. Monitoring your GPU temperature and running hardware diagnostics might help identify any underlying issues.

If the problem persists, please provide any additional error messages or logs that might help diagnose the issue further. Your feedback is invaluable in helping us improve YOLOv5 for everyone.

Thank you for your patience and for being an active member of the YOLO community! 🚀

Alarmod commented Aug 5, 2024

The data has no NaNs. I updated the drivers to the latest version and checked PyTorch with CUDA 12.4: same error. It works properly only when I use torch.backends.cudnn.enabled = False. The hardware works fine.

With CUDA 11-12 the error occurs with torch.backends.cudnn.enabled = True and FP16 data...

@glenn-jocher (Member)
Hi @Alarmod,

Thank you for the detailed update! It's great to hear that your data is clean and your hardware is functioning correctly. Given that the issue persists across different CUDA versions and is resolved by disabling cudnn, it seems like the problem might be related to cudnn's handling of FP16 data on your specific hardware.

While disabling cudnn is a viable workaround, it can impact performance. Here are a few additional steps you can consider to potentially resolve this issue without disabling cudnn:

  1. Use a Different cudnn Version: Sometimes, specific versions of cudnn have bugs that are resolved in other versions. You can try using a different cudnn version by installing it manually. For example, if you're using conda, you can specify a different cudnn version:

    conda install cudnn=8.2
  2. Experiment with Different PyTorch Versions: While you've tried the latest versions, sometimes specific combinations of PyTorch and cudnn work better together. You might want to try a few different versions of PyTorch to see if any of them resolve the issue.

  3. Check for Known Issues: Review the PyTorch GitHub issues and forums for any similar reports. There might be ongoing discussions or patches available for this specific issue.

  4. Custom cudnn Settings: Sometimes, tweaking cudnn settings can help. For example, you can set the cudnn benchmark mode to True to allow cudnn to find the best algorithms for your hardware:

    torch.backends.cudnn.benchmark = True
  5. Report the Issue: If the problem persists, consider reporting it to the PyTorch team with detailed information about your setup. This can help them identify and fix the issue in future releases.

Here's a quick summary of the steps you can try:

import torch

# Enable cudnn benchmark mode
torch.backends.cudnn.benchmark = True

# Optionally, try different cudnn versions
# conda install cudnn=8.2

Thank you for your patience and for contributing to the community by sharing your findings. If you have any further questions or need additional assistance, feel free to ask. Happy training! 🚀

Alarmod commented Aug 6, 2024

@glenn-jocher
Enabling the option torch.backends.cudnn.benchmark=True helped, but after testing it many times I again came across a non-working cuDNN configuration that was chosen for being faster.

So torch.backends.cudnn.benchmark=True doesn't guarantee a good result.

@glenn-jocher (Member)
Hi @Alarmod,

Thank you for your feedback and for sharing your experience with torch.backends.cudnn.benchmark=True. It's great to hear that it helped in some cases, but I understand that it doesn't guarantee consistent results across all configurations.

Given the variability you're encountering, here are a few additional steps you can try to achieve more stable performance:

  1. Use Specific cudnn Algorithms: You can manually set specific cudnn algorithms for convolution operations. This can sometimes bypass problematic configurations. For example:

    torch.backends.cudnn.deterministic = True
  2. Update cudnn and CUDA: Ensure you are using the latest versions of cudnn and CUDA. Sometimes, updates include important bug fixes and performance improvements.

  3. Experiment with Different PyTorch and cudnn Versions: As mentioned earlier, different combinations of PyTorch and cudnn versions might yield better stability. You might want to try a few different versions to see if any of them resolve the issue.

  4. Monitor GPU Utilization and Temperature: Ensure that your GPU is not overheating or being throttled, as this can sometimes cause inconsistent performance.

  5. Check for Known Issues: Review the PyTorch GitHub issues and forums for any similar reports. There might be ongoing discussions or patches available for this specific issue.

Here's a quick summary of the steps you can try:

import torch

# Enable cudnn deterministic mode
torch.backends.cudnn.deterministic = True

# Optionally, try different cudnn versions
# conda install cudnn=8.2

If the problem persists, please ensure that you are using the latest versions of YOLOv5, PyTorch, and CUDA. If the issue is reproducible in the latest versions, consider reporting it to the PyTorch team with detailed information about your setup. This can help them identify and fix the issue in future releases.

Thank you for your patience and for contributing to the community by sharing your findings. If you have any further questions or need additional assistance, feel free to ask. Happy training! 🚀

Alarmod commented Aug 6, 2024

@glenn-jocher
With torch.backends.cudnn.deterministic=True I have the same error :)
I use the latest versions of the libraries...

Thus, it can be argued that the error is clearly in the convolution code inside cuDNN or in CUDA 11.8+.

@glenn-jocher (Member)
Hi @Alarmod,

Thank you for your continued efforts in troubleshooting this issue and for confirming that you're using the latest versions of the libraries. Given that torch.backends.cudnn.deterministic=True didn't resolve the issue and considering your findings, it does seem likely that the problem lies within the convolution code in cudnn or CUDA 11.8+.

Here are a few additional steps you can take to further isolate and potentially resolve the issue:

  1. File an Issue with PyTorch: Since this appears to be a deeper issue with cudnn or CUDA, I recommend filing a detailed issue on the PyTorch GitHub repository. Include all relevant details about your setup, the specific error messages, and the steps you've taken so far. This will help the PyTorch developers investigate and address the issue.

  2. Use CPU for Debugging: As a temporary workaround, you can switch to CPU mode for debugging and development. This can help you continue your work while the issue is being investigated:

    device = torch.device('cpu')
    model.to(device)
  3. Experiment with Different cudnn Algorithms: If you haven't already, you can try experimenting with different cudnn algorithms to see if any specific configuration works better:

    torch.backends.cudnn.benchmark = True
  4. Community Feedback: Engage with the community on forums like the PyTorch Discussion Forum or the Ultralytics YOLOv5 Discussions. Other users might have encountered similar issues and could offer additional insights or solutions.

  5. Monitor Updates: Keep an eye on updates to PyTorch, cudnn, and CUDA. Future releases might include fixes or improvements that address this issue.

Your persistence and detailed feedback are incredibly valuable to the community. If you have any further questions or need additional assistance, feel free to ask here. We're all in this together, and your contributions help make YOLOv5 better for everyone! 🚀

Thank you for being an active member of the YOLO community, and happy training!
