CUDA Installation failed in bare-bones Ubuntu 20.04. See log at /var/log/cuda-installer.log for details. #57

Open
VbsmRobotic opened this issue Nov 13, 2023 · 8 comments


@VbsmRobotic

VbsmRobotic commented Nov 13, 2023

Hello everyone,

I'm having trouble installing CUDA on the bare-bones Ubuntu 20.04 image. Has anyone come across similar issues? Here's the process I've been following:

1- Download the CUDA installer using the following command:
    $ wget https://developer.download.nvidia.com/compute/cuda/11.6.2/local_installers/cuda_11.6.2_510.47.03_linux_sbsa.run
2- Run the installer with elevated privileges:
    $ sudo sh cuda_11.6.2_510.47.03_linux_sbsa.run

Unfortunately, the installation failed, and the installer advised me to check the log at /var/log/cuda-installer.log for more details. Any insights or solutions would be greatly appreciated.

[Image: CUDA-Driver]

[Image: error]

$ cat /var/log/cuda-installer.log
[INFO]: Driver not installed.
[INFO]: Checking compiler version...
[INFO]: gcc location: /usr/bin/gcc

[INFO]: gcc version: gcc version 9.4.0 (Ubuntu 9.4.0-1ubuntu1~20.04.2)

[INFO]: Initializing menu
[INFO]: Setup complete
[INFO]: Components to install:
[INFO]: Driver
[INFO]: 510.47.03
[INFO]: Executing NVIDIA-Linux-aarch64-510.47.03.run --ui=none --no-questions --accept-license --disable-nouveau --no-cc-version-check --install-libglvnd 2>&1
[INFO]: Finished with code: 36096
[ERROR]: Install of driver component failed.
[ERROR]: Install of 510.47.03 failed, quitting

@Qengineering
Owner

Sorry, you cannot install CUDA 11 on a Jetson Nano because of low-level incompatibilities.
The 'regular' CUDA version is 10, and it is already installed. There is no need to use the CUDA installer.
This assumes we are talking about the 'old' Jetson Nano, not the Orin.
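
Before running any installer, it may help to confirm that a toolkit is already on the image. The sketch below is only an illustration: `cuda_toolkit_status` is a hypothetical helper name, and `/usr/local/cuda` is assumed to be the install root (the usual JetPack location). As far as I know, the `linux_sbsa` runfile used above targets Arm server platforms rather than Jetson boards, which would be consistent with the driver step failing.

```shell
# Hypothetical helper: report whether a CUDA toolkit is already present
# under the given root (default: /usr/local/cuda, assumed JetPack location).
cuda_toolkit_status() {
    root=${1:-/usr/local/cuda}
    if [ -x "$root/bin/nvcc" ]; then
        echo "found $root/bin/nvcc"
        # CUDA 10.x toolkits normally ship a version.txt; print it if present.
        if [ -r "$root/version.txt" ]; then
            cat "$root/version.txt"
        fi
        return 0
    fi
    echo "no nvcc under $root/bin"
    return 1
}

# Check the default location; "|| true" keeps the script going either way.
cuda_toolkit_status /usr/local/cuda || true
```

If the helper reports that nvcc was found, the toolkit is installed and only the shell environment needs attention.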

@VbsmRobotic
Author

Thank you for your prompt response. I appreciate the clarification about CUDA compatibility on the Jetson Nano. However, when I check the CUDA version with nvcc --version, the installed CUDA version is not found. Could you kindly provide guidance on how to resolve this issue? Thank you.
[Image: nvcc_V]

@Qengineering
Owner

nvcc should be located in the /usr/local/cuda/bin/ folder.
Please add that location to your PATH.
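
A minimal sketch of what that could look like, assuming CUDA lives in the default `/usr/local/cuda` location (the `CUDA_HOME` variable name and the `lib64` subdirectory are assumptions here, not something stated in this thread):

```shell
# Assumption: CUDA is installed in the default JetPack location.
CUDA_HOME=/usr/local/cuda

# Prepend the toolkit's bin and lib64 directories for the current shell.
export PATH="$CUDA_HOME/bin${PATH:+:$PATH}"
export LD_LIBRARY_PATH="$CUDA_HOME/lib64${LD_LIBRARY_PATH:+:$LD_LIBRARY_PATH}"

# Confirm that the shell can now find nvcc.
if command -v nvcc >/dev/null 2>&1; then
    nvcc --version
else
    echo "nvcc still not found; check that $CUDA_HOME/bin exists"
fi
```

To make the change permanent, the two `export` lines can be appended to `~/.bashrc`, followed by `source ~/.bashrc` or opening a new terminal.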

@VbsmRobotic
Author

Thank you for your generous help. I've successfully incorporated the changes into the bashrc file and verified the CUDA version is now visible.

@VbsmRobotic
Author

Hello,
I am reaching out for guidance based on the information provided in the following link: https://forums.developer.nvidia.com/t/pytorch-for-jetson/72048.

To install PyTorch on my Jetson Nano, I've created a virtual environment using Python 3.6, as specified in the requirements for JetPack 4. However, during the installation process, I encountered the following error:

(py_env) jetson@nano:~$ python3
Python 3.6.15 (default, Nov 15 2023, 11:27:50)
[GCC 9.4.0] on linux
Type "help", "copyright", "credits" or "license" for more information.

>>> import torch
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/jetson/vahid_ws/Jetson-Nano-OCR-Detection/build/config_virtualenv/py_env/lib/python3.6/site-packages/torch/__init__.py", line 195, in <module>
    _load_global_deps()
  File "/home/jetson/vahid_ws/Jetson-Nano-OCR-Detection/build/config_virtualenv/py_env/lib/python3.6/site-packages/torch/__init__.py", line 148, in _load_global_deps
    ctypes.CDLL(lib_path, mode=ctypes.RTLD_GLOBAL)
  File "/usr/local/lib/python3.6/ctypes/__init__.py", line 348, in __init__
    self._handle = _dlopen(self._name, mode)
OSError: libmpi_cxx.so.20: cannot open shared object file: No such file or directory

I would greatly appreciate your advice on resolving this issue. Your assistance is invaluable to me at this stage.

Thank you in advance

@Qengineering
Owner

Tip: ask ChatGPT. It can give valuable answers. In your case:
The error you're encountering indicates that the libmpi_cxx.so.20 shared library cannot be found. This library is part of the Message Passing Interface (MPI) library. It seems like there might be an issue with your MPI installation or the environment variables related to it.

Here are a few steps you can take to address this issue:

  1. Check MPI Installation:
    Make sure that MPI is correctly installed on your system. You may need to reinstall MPI or ensure that the required libraries are available. On a Debian-based system, you can use the following command to install MPI:

    sudo apt-get install libopenmpi-dev

    If you are using a different package manager or operating system, adjust the command accordingly.

  2. Set Environment Variable:
    If MPI is correctly installed, you may need to set the LD_LIBRARY_PATH environment variable to include the directory where libmpi_cxx.so.20 is located. You can do this by adding the following line to your shell profile file (e.g., ~/.bashrc or ~/.bash_profile):

    export LD_LIBRARY_PATH=/path/to/mpi/lib:$LD_LIBRARY_PATH

    Replace /path/to/mpi/lib with the actual path to the directory containing the MPI libraries.

  3. Reinstall PyTorch:
    If you installed PyTorch inside a virtual environment, consider deactivating the environment, reactivating it, and reinstalling PyTorch. This can sometimes resolve compatibility issues:

    deactivate
    source py_env/bin/activate
    pip install torch

    Make sure to replace py_env with the actual name of your virtual environment.

  4. Update PyTorch:
    Ensure that you are using the latest version of PyTorch. You can upgrade PyTorch using the following command:

    pip install --upgrade torch

    This will install the latest version of PyTorch and its dependencies.

After performing these steps, try running your Python script again. If the issue persists, there may be other system-specific factors at play, and additional troubleshooting may be needed.

@VbsmRobotic
Author

Thank you for your message. I have successfully set the environment variable. To locate libmpi_cxx, I used the following command:
$ find / -name 'libmpi_cxx*' 2>/dev/null
/usr/lib/aarch64-linux-gnu/openmpi/lib/libmpi_cxx.so.40.20.1
/usr/lib/aarch64-linux-gnu/openmpi/lib/libmpi_cxx.so
/usr/lib/aarch64-linux-gnu/libmpi_cxx.so.40.20.1
/usr/lib/aarch64-linux-gnu/libmpi_cxx.so
/usr/lib/aarch64-linux-gnu/libmpi_cxx.so.40
Additionally, I've added the following line to the bashrc file to address the PyTorch installation issue:
export LD_LIBRARY_PATH=/usr/lib/aarch64-linux-gnu:$LD_LIBRARY_PATH

Following the instructions from this link, I executed the following commands for PyTorch installation:
$ wget https://nvidia.box.com/shared/static/p57jwntv436lfrd78inwl7iml6p13fzh.whl -O torch-1.8.0-cp36-cp36m-linux_aarch64.whl
$ sudo apt-get install python3-pip libopenblas-base libopenmpi-dev libomp-dev
$ pip3 install 'Cython<3'
$ pip3 install numpy torch-1.8.0-cp36-cp36m-linux_aarch64.whl

The installation was successful with the following packages installed:
Successfully installed Cython-0.29.36
Successfully installed dataclasses-0.8 numpy-1.19.5 torch-1.8.0 typing-extensions-4.1.1

However, I encountered an issue when attempting to install torchvision. Following the instructions here, I executed the following commands:
$ sudo apt-get install libjpeg-dev zlib1g-dev libpython3-dev libopenblas-dev libavcodec-dev libavformat-dev libswscale-dev
$ git clone --branch v0.9.0 https://github.com/pytorch/vision torchvision
$ cd torchvision
$ export BUILD_VERSION=0.9.0
$ python3 setup.py install --user

Unfortunately, I encountered the same error:
(py_env) jetson@nano:~/vahid_ws/Jetson-Nano-OCR-Detection/PyTorchJetson_JetPack4/torchvision$ python3 setup.py install --user
Traceback (most recent call last):
  File "setup.py", line 12, in <module>
    import torch
  File "/home/jetson/vahid_ws/Jetson-Nano-OCR-Detection/build/config_virtualenv/py_env/lib/python3.6/site-packages/torch/__init__.py", line 195, in <module>
    _load_global_deps()
  File "/home/jetson/vahid_ws/Jetson-Nano-OCR-Detection/build/config_virtualenv/py_env/lib/python3.6/site-packages/torch/__init__.py", line 148, in _load_global_deps
    ctypes.CDLL(lib_path, mode=ctypes.RTLD_GLOBAL)
  File "/usr/local/lib/python3.6/ctypes/__init__.py", line 348, in __init__
    self._handle = _dlopen(self._name, mode)
OSError: libmpi_cxx.so.20: cannot open shared object file: No such file or directory

I appreciate your assistance in resolving this issue. Any advice you can provide would be invaluable at this stage. Thank you in advance.
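
One detail in the output above may explain why LD_LIBRARY_PATH did not help: the loader asks for the exact file name libmpi_cxx.so.20, while the find results list only libmpi_cxx.so.40 variants. A search path can only help if a file with the requested name exists somewhere on it. The sketch below checks for that; `find_soname` is a hypothetical helper name, and the two directories are simply the ones from the find output.

```shell
# Hypothetical helper: print the first path that matches the exact file
# name the loader asked for; return non-zero if no directory contains it.
find_soname() {
    soname=$1
    shift
    for dir in "$@"; do
        if [ -e "$dir/$soname" ]; then
            printf '%s\n' "$dir/$soname"
            return 0
        fi
    done
    return 1
}

# Directories taken from the find output in this comment.
if ! find_soname libmpi_cxx.so.20 \
        /usr/lib/aarch64-linux-gnu \
        /usr/lib/aarch64-linux-gnu/openmpi/lib; then
    echo "libmpi_cxx.so.20 is not present; LD_LIBRARY_PATH alone cannot fix this"
fi
```

If the check fails, the likely cause (an assumption, not confirmed in this thread) is that the torch 1.8.0 wheel was built against the older OpenMPI shipped with the Ubuntu 18.04 base of JetPack 4, whereas this Ubuntu 20.04 image provides a newer OpenMPI. A wheel built against the OpenMPI version on this image would avoid the mismatch.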

@KalanaRatnayake

> Thank you for your generous help. I've successfully incorporated the changes into the bashrc file and verified the CUDA version is now visible.

Can you share which commands you used? I'm facing the same issue.
