NaN tensor values problem for GTX16xx users (no problem on other devices) #7908
👋 Hello @MarkDeia, thank you for your interest in YOLOv5 🚀! Please visit our ⭐️ Tutorials to get started, where you can find quickstart guides for simple tasks like Custom Data Training all the way to advanced concepts like Hyperparameter Evolution. If this is a 🐛 Bug Report, please provide screenshots and minimum viable code to reproduce your issue, otherwise we cannot help you. If this is a custom training ❓ Question, please provide as much information as possible, including dataset images, training logs, screenshots, and a public link to online W&B logging if available. For business inquiries or professional support requests please visit https://ultralytics.com or email support@ultralytics.com.

Requirements
Python>=3.7.0 with all requirements.txt installed including PyTorch>=1.7. To get started:

git clone https://github.com/ultralytics/yolov5  # clone
cd yolov5
pip install -r requirements.txt  # install

Environments
YOLOv5 may be run in any of the following up-to-date verified environments (with all dependencies including CUDA/CUDNN, Python and PyTorch preinstalled):

Status
If this badge is green, all YOLOv5 GitHub Actions Continuous Integration (CI) tests are currently passing. CI tests verify correct operation of YOLOv5 training (train.py), validation (val.py), inference (detect.py) and export (export.py) on macOS, Windows, and Ubuntu every 24 hours and on every commit.
@MarkDeia you may be able to work around this by disabling AMP in train.py, anywhere it is referenced.
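For readers hitting the same problem, here is a minimal sketch of what disabling AMP in a standard torch.cuda.amp training step looks like (a toy model and optimizer are used purely for illustration; train.py's actual variable names differ):

import torch
import torch.nn as nn

# Toy setup purely to illustrate the pattern
model = nn.Linear(10, 1).cuda()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

amp = False  # set to False to disable mixed precision end to end
scaler = torch.cuda.amp.GradScaler(enabled=amp)

x = torch.randn(4, 10, device='cuda')
y = torch.randn(4, 1, device='cuda')

with torch.cuda.amp.autocast(enabled=amp):   # autocast is a no-op when disabled
    loss = nn.functional.mse_loss(model(x), y)

scaler.scale(loss).backward()  # scaling is a pass-through when amp is False
scaler.step(optimizer)
scaler.update()
optimizer.zero_grad()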
@glenn-jocher Thanks for your reply. By turning off the automatic mixed precision function, the box/obj/cls values are back to normal, but the P, R, and mAP values during validation are still 0. At first, I thought the problem was the CUDA/cuDNN dependency that comes with PyTorch, but NVIDIA claims that this problem was solved in cuDNN 8.2.2. I am very confused; the amp and fp16 values seem to be fine. It looks like the problem of returning NaN with half precision has been fixed upstream, but the problem still exists in YOLOv5 training and validation.
@MarkDeia 0 labels means you have zero labels. Without labels there won't be any metrics, obviously.
@MarkDeia they're two separate issues. The Labels 0 is indicating that there are simply no labels in your validation set, which has nothing to do with CUDA or your environment or hardware. There is no fundamental problem with detecting labels, as your training has box and cls losses.
@glenn-jocher I don't understand since I'm new to this, so what is causing there to be no labels in my validation set?
Your dataset is structured incorrectly. To train correctly your data must be in YOLOv5 format. Please see our Train Custom Data tutorial for full documentation on dataset setup and all steps required to start training your first model. A few excerpts from the tutorial:

1.1 Create dataset.yaml
COCO128 is an example small tutorial dataset composed of the first 128 images in COCO train2017. These same 128 images are used for both training and validation to verify our training pipeline is capable of overfitting. data/coco128.yaml, shown below, is the dataset config file that defines 1) the dataset root directory path and relative paths to train / val / test image directories (or *.txt files with image paths), 2) the number of classes nc and 3) a list of class names:

# Train/val/test sets as 1) dir: path/to/imgs, 2) file: path/to/imgs.txt, or 3) list: [path/to/imgs1, path/to/imgs2, ..]
path: ../datasets/coco128 # dataset root dir
train: images/train2017 # train images (relative to 'path') 128 images
val: images/train2017 # val images (relative to 'path') 128 images
test: # test images (optional)
# Classes
nc: 80 # number of classes
names: [ 'person', 'bicycle', 'car', 'motorcycle', 'airplane', 'bus', 'train', 'truck', 'boat', 'traffic light',
'fire hydrant', 'stop sign', 'parking meter', 'bench', 'bird', 'cat', 'dog', 'horse', 'sheep', 'cow',
'elephant', 'bear', 'zebra', 'giraffe', 'backpack', 'umbrella', 'handbag', 'tie', 'suitcase', 'frisbee',
'skis', 'snowboard', 'sports ball', 'kite', 'baseball bat', 'baseball glove', 'skateboard', 'surfboard',
'tennis racket', 'bottle', 'wine glass', 'cup', 'fork', 'knife', 'spoon', 'bowl', 'banana', 'apple',
'sandwich', 'orange', 'broccoli', 'carrot', 'hot dog', 'pizza', 'donut', 'cake', 'chair', 'couch',
'potted plant', 'bed', 'dining table', 'toilet', 'tv', 'laptop', 'mouse', 'remote', 'keyboard', 'cell phone',
'microwave', 'oven', 'toaster', 'sink', 'refrigerator', 'book', 'clock', 'vase', 'scissors', 'teddy bear',
'hair drier', 'toothbrush' ]  # class names

1.2 Create Labels
After using a tool like Roboflow Annotate to label your images, export your labels to YOLO format, with one *.txt file per image (if no objects are in the image, no *.txt file is required).
The label file corresponding to the above image contains 2 persons (class 0) and a tie (class 27).

1.3 Organize Directories
Organize your train and val images and labels according to the example below. YOLOv5 assumes /coco128 is inside a /datasets directory next to the /yolov5 directory, and locates labels automatically for each image by replacing the last instance of /images/ in each image path with /labels/. For example:

../datasets/coco128/images/im0.jpg  # image
../datasets/coco128/labels/im0.txt  # label

Good luck 🍀 and let us know if you have any other questions!
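For reference, a YOLO-format label file is a plain text file with one row per object: class index followed by normalized x_center, y_center, width and height (all in the range 0-1). A hypothetical labels/im0.txt with a person and a tie could look like this (values are illustrative):

0 0.481719 0.634028 0.690625 0.713278
27 0.364844 0.795833 0.078125 0.400000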
@glenn-jocher I think you may have misunderstood me. I ran the same code and the same dataset in both pytorch_cuda11.3 and pytorch_cuda10.2 environments, and the problem only occurred in the pytorch_cuda11.3 environment. Furthermore, I was using the YOLOv5 demo dataset (coco128), so I don't think there is any problem with the structure of my dataset (I confirm that my data, coco128, is in YOLOv5 format). In any case, it is certain that part of the problem comes from the autocast function in torch\cuda\amp\autocast_mode.py.
@MarkDeia well, I can't really say what might be the issue. If you can help us recreate the problem with a minimum reproducible example we could get started debugging it, but given your hardware I don't think it's reproducible on other environments. In any case I'd always recommend running in our Docker image if you are having issues with a local environment. See https://docs.ultralytics.com/yolov5/environments/docker_image_quickstart_tutorial/
@glenn-jocher Thank you for taking time out of your busy schedule to reply. 👍
@MarkDeia well, what we can do, which won't solve your problem but will probably help a lot of people, is to run a check before training to make sure that everything works correctly, and if not, refer users to this issue or a tutorial about their options. There have definitely been multiple users that have run into issues, usually with a combination of CUDA 11, Windows, Conda and consumer cards. I'm not sure what the minimum test might be; after all, we don't want to have to run a short COCO128 training before everyone's actual trainings, as that would probably do more harm than good.

Ok, I've got it. We can run inference with and without AMP, and the check will be a torch.allclose() on the outputs. If you run this on your system, what do you see? On Colab we get the same detections, with boxes accurate to <1 pixel.

# PyTorch Hub
import torch
# Model
model = torch.hub.load('ultralytics/yolov5', 'yolov5s')
# Images
dir = 'https://ultralytics.com/images/'
imgs = [dir + f for f in ('zidane.jpg', 'bus.jpg')] # batch of images
# Inference
results = model(imgs)
model.amp = True
results_amp = model(imgs)
print(results.xyxy[0] - results_amp.xyxy[0])
tensor([[-0.44983, -0.21283, 0.20471, -0.35834, -0.00050, 0.00000],
[ 0.05951, 0.02808, -0.19067, 0.33899, -0.00065, 0.00000],
[-0.05856, -0.06934, -0.00732, 0.04700, 0.00124, 0.00000],
[-0.10693, 0.35675, 0.36877, 0.09174, -0.00141, 0.00000]], device='cuda:0')
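For reference, the proposed check boils down to a single torch.allclose() call on these outputs; the tolerance below is illustrative, and the comparison assumes both result tensors have the same shape:

close = torch.allclose(results.xyxy[0], results_amp.xyxy[0], atol=1.0)  # True on healthy setups
print(close)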
@glenn-jocher
Since they have different dimensions, they cannot be subtracted, and from the result we can tell that there was apparently an error when running the AMP function. I will continue to try to find the root of the problem, but it may take a few weeks as I can only debug in my spare time.
@MarkDeia perfect! That's all I need. I'll work on a PR.
@MarkDeia can you run this code and verify that you get an AMP failure notice before training starts? This tests PR #7917, which automatically disables AMP if the two image results don't match, just as I proposed earlier. This won't solve all the problems, but hopefully it will help many users.

git clone https://github.com/ultralytics/yolov5 -b amp_check  # clone
cd yolov5
python train.py --epochs 3
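For context, a simplified sketch of the kind of check the PR performs: run the same input with and without AMP and fall back to FP32 if the outputs disagree. This is written for a generic model whose forward pass returns a single tensor; the real check in YOLOv5 differs in its details:

import torch

def amp_ok(model, im, atol=1.0):
    """Return True if AMP inference matches FP32 inference for one input."""
    model = model.eval().cuda()
    im = im.cuda()
    with torch.no_grad():
        ref = model(im).float()                      # FP32 reference
        with torch.cuda.amp.autocast(enabled=True):
            out = model(im).float()                  # AMP result
    return ref.shape == out.shape and torch.allclose(ref, out, atol=atol)

# Usage sketch: disable AMP for the rest of training if the check fails
# amp = amp_ok(model, sample_image)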
@glenn-jocher Glad you added the AMP verification. I still have problems with validation after turning off AMP, but as you say, this won't solve all the problems but will hopefully help many users.
I have this same issue on a 1080 Ti. Even after the fix you issued, the labels are sometimes zeroed after training for a while. I also tried with the --device cpu flag and got zero labels at some point as well. Sometimes training succeeds with the GPU...
Same issue here. I followed the tutorial and these are the results from training for 1.5 hours. I also have a laptop GTX 1660 Ti.
@abadia24 NaNs are unrecoverable, so if you ever see an epoch with them you can immediately terminate, as the rest of training will contain them. In the meantime you might try training in Docker, which is a self-contained Linux environment with everything verified working correctly.

Environments
YOLOv5 may be run in any of the following up-to-date verified environments (with all dependencies including CUDA/CUDNN, Python and PyTorch preinstalled):
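For anyone trying the Docker route, the quickstart tutorial essentially amounts to pulling the official image and running it with GPU access, roughly as follows (commands assume a Linux host with the NVIDIA container toolkit installed):

t=ultralytics/yolov5:latest  # official image on Docker Hub
sudo docker pull $t
sudo docker run -it --ipc=host --gpus all $t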
It seems that this problem is much broader. I have the same problem with NaNs, running Ray with PyTorch and an RTX 3060, CUDA 11.x (Windows 11 and Ubuntu 20).
@mhw-Parker are you using the latest version of YOLOv5? What does your AMP check say before training starts?
Hi, I also have the same issue. I'm running the following:

YOLOv5 version: latest from master (07/30/2022)

All AMP checks passed. When I run the same script with the same dataset on the CPU, I get valid results. Note: I had to replace the torch requirements from the repo with the following for torch.cuda.is_available() to be set to true:
@Raziel619 that's strange that the AMP checks passed yet you're still seeing problems. You might try disabling AMP completely by setting it to False at this line in train.py (Line 128 in 1e89807).
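A minimal illustration of that kind of change, assuming the referenced line is the AMP auto-detection call in train.py (the exact code at that line may differ):

# amp = check_amp(model)  # original: auto-detect whether AMP works on this device
amp = False               # force-disable AMP for this training run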
@Raziel619 you might also try training inside the Docker image for the best stability.

Environments
YOLOv5 may be run in any of the following up-to-date verified environments (with all dependencies including CUDA/CUDNN, Python and PyTorch preinstalled):
I disabled AMP completely and it did improve results somewhat, as I'm no longer getting NaNs for "train/box_loss", "train/obj_loss", "train/cls_loss", but I'm getting all zeros or NaNs for almost everything else. See attached for results.
@Raziel619 hmm. Validation is done at half precision; maybe try adding half=False to the val.run() call here (train.py, Lines 367 to 377 in 1e89807)?
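Roughly what that looks like if you call val.run() directly; the parameter names below come from val.py's run() signature and the weights path is just an example:

import val  # yolov5/val.py

# half=False forces FP32 during validation even on CUDA devices
results, maps, times = val.run(data='coco128.yaml',
                               weights='yolov5s.pt',
                               half=False)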
That fixes it! Yay! Thank you so much for these swift responses; I'm super excited to start some training.
@Raziel619 good news 😃! Your original issue may now be partially resolved ✅ in PR #8804. This PR doesn't resolve the original issue, but it does disable FP16 validation if the AMP checks fail, or simply if you manually disable AMP. To receive this update:
Thank you for spotting this issue and informing us of the problem. Please let us know if this update resolves the issue for you, and feel free to inform us of any other issues you discover or feature requests that come to mind. Happy trainings with YOLOv5 🚀!
Hi, I think this issue doesn't only happen on consumer cards, because I also have the same issue training on 8x Tesla V100. When turning off AMP, the training time doubles. I think the best way to handle this problem is to downgrade CUDA to 10.x.
Hello! I was having the same problem here on an NVIDIA GeForce GTX 1650 in Ubuntu 20 with a conda environment and CUDA 11. I was finding NaNs when training on coco128. The easiest way to solve it for me was setting CUDA 10 in my environment with:
I disabled cudnn in PyTorch and it solved the issue with NaN values, but I'm not sure whether it will affect the performance of the training process.
Windows 10
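For reference, that workaround is a single line placed before training starts; it trades cuDNN's tuned kernels for PyTorch's fallback implementations, so slower epochs are to be expected:

import torch

torch.backends.cudnn.enabled = False  # bypass cuDNN kernels entirely
print(torch.backends.cudnn.enabled)   # -> False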
@Tommyisr thank you for sharing your experience with the community! Disabling cudnn can indeed resolve the NaN issue for some users, but it may come with a performance tradeoff. We recommend monitoring the training process to evaluate whether there are noticeable impacts on performance.

For anyone encountering similar issues, please feel free to try the solutions mentioned here and share your results. Your feedback helps the community improve the overall YOLOv5 experience. For more information and troubleshooting tips, please refer to the Ultralytics YOLOv5 Documentation. If you have any further questions or issues, don't hesitate to reach out. Happy training!
Same error with a GTX 1650; the torch.backends.cudnn.enabled = False fix helped me...
Hi @Alarmod, thank you for sharing your solution! Disabling cudnn via torch.backends.cudnn.enabled = False can indeed resolve this. However, please be aware that disabling cudnn might impact the performance of your training process. It's a good idea to monitor your training times and model performance to ensure that the trade-off is acceptable for your use case. For those encountering similar issues, here are a few additional steps you can try:
If the problem persists, please provide more details about your environment and setup, and we will do our best to assist you further. Thank you for being a part of the YOLO community, and happy training! 🚀
I checked PyTorch 2.3.1 and 2.4.0 (latest) with CUDA 11.8 on Windows 10. I tried disabling AMP after model load with model.amp = False; it didn't help me.
Hi @Alarmod, Thank you for your detailed follow-up. It's unfortunate that disabling AMP didn't resolve the issue for you. Given that you've already tried the latest versions of PyTorch (2.3.1 and 2.4.0) with CUDA 11.8 on Windows 10, let's explore a few more potential solutions:
If the problem persists, please provide any additional error messages or logs that might help diagnose the issue further. Your feedback is invaluable in helping us improve YOLOv5 for everyone. Thank you for your patience and for being an active member of the YOLO community! 🚀
The data contains no NaNs. I updated the drivers to the latest and checked PyTorch with CUDA 12.4; same error. It works properly only when I use torch.backends.cudnn.enabled = False. The hardware works fine. With CUDA 11-12 the error occurs with torch.backends.cudnn.enabled = True and FP16 data...
Hi @Alarmod, Thank you for the detailed update! It's great to hear that your data is clean and your hardware is functioning correctly. Given that the issue persists across different CUDA versions and is resolved by disabling cudnn, it seems like the problem might be related to cudnn's handling of FP16 data on your specific hardware. While disabling cudnn is a viable workaround, it can impact performance. Here are a few additional steps you can consider to potentially resolve this issue without disabling cudnn:
Here's a quick summary of the steps you can try:

import torch
# Enable cudnn benchmark mode
torch.backends.cudnn.benchmark = True
# Optionally, try different cudnn versions
# conda install cudnn=8.2

Thank you for your patience and for contributing to the community by sharing your findings. If you have any further questions or need additional assistance, feel free to ask. Happy training! 🚀
@glenn-jocher So torch.backends.cudnn.benchmark = True doesn't guarantee a good result.
Hi @Alarmod, thank you for your feedback and for sharing your experience with torch.backends.cudnn.benchmark = True. Given the variability you're encountering, here are a few additional steps you can try to achieve more stable performance:
Here's a quick summary of the steps you can try:

import torch
# Enable cudnn deterministic mode
torch.backends.cudnn.deterministic = True
# Optionally, try different cudnn versions
# conda install cudnn=8.2

If the problem persists, please ensure that you are using the latest versions of YOLOv5, PyTorch, and CUDA. If the issue is reproducible in the latest versions, consider reporting it to the PyTorch team with detailed information about your setup. This can help them identify and fix the issue in future releases. Thank you for your patience and for contributing to the community by sharing your findings. If you have any further questions or need additional assistance, feel free to ask. Happy training! 🚀
@glenn-jocher Thus, it can be argued that the error is clearly in the convolution code inside CUDNN or in CUDA 11.8+ |
Hi @Alarmod, thank you for your continued efforts in troubleshooting this issue and for confirming that you're using the latest versions of the libraries. Given that the problem disappears when cuDNN is disabled, here are a few additional steps you can take to further isolate and potentially resolve the issue:
Your persistence and detailed feedback are incredibly valuable to the community. If you have any further questions or need additional assistance, feel free to ask here. We're all in this together, and your contributions help make YOLOv5 better for everyone! 🚀 Thank you for being an active member of the YOLO community, and happy training!
Please ensure you're using the latest YOLOv5 version and try reducing the batch size or disabling AMP to see if it resolves the issue. If the problem persists, consider using a Docker environment for consistency.
Search before asking
YOLOv5 Component
Training, Validation
Bug
I used YOLOv5 to test with the demo dataset (coco128) and found that the box and obj losses are NaN. Also, no detections appear on the validation images. This only happens on GTX 1660 Ti devices (GPU mode); when I use the CPU or Google Colab (Tesla K80) / an RTX 2070 for training, everything works fine.
Environment
Minimal Reproducible Example
The command used for training is:
python train.py
Additional
There are issues here also discussing the same problem.
However, I have tried PyTorch with CUDA 11.5 (whose cuDNN version is 8.3.0 > 8.2.2), and I also tried downloading cuDNN from NVIDIA and copying the DLL files into the relevant folder in torch/lib; the problem still cannot be solved.
Another workaround is to downgrade to PyTorch with CUDA 10.2 (tested, and it works), but this is currently not feasible as CUDA 10.2 PyTorch builds are no longer available for Windows.
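For reference, CUDA 10.2 wheels were still published for Linux at the time; a typical install command looked like the following (version numbers are illustrative, and no equivalent Windows build exists, as noted above):

pip install torch==1.12.1+cu102 torchvision==0.13.1+cu102 --extra-index-url https://download.pytorch.org/whl/cu102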
Are you willing to submit a PR?