Punctuation and Capitalization Model not working #1210

Closed
ican24 opened this issue Sep 27, 2024 · 6 comments

ican24 commented Sep 27, 2024

Dear Team,

For more than a week I have been trying to install your TransformerEngine module on both of my machines so that I can use the Punctuation and Capitalization Model.
Unfortunately, all attempts have failed.

I get the following error when I run the simple example from https://docs.nvidia.com/nemo-framework/user-guide/latest/nemotoolkit/nlp/punctuation_and_capitalization.html

from nemo.collections.nlp.models import PunctuationCapitalizationModel

# to get the list of pre-trained models
PunctuationCapitalizationModel.list_available_models()

# Download and load the pre-trained BERT-based model
model = PunctuationCapitalizationModel.from_pretrained("punctuation_en_bert")

# try the model on a few examples
model.add_punctuation_capitalization(['how are you', 'great how about you'])  


Traceback (most recent call last):
  File "/maroc/nemoasr/pc.py", line 1, in <module>
    from nemo.collections.nlp.models import PunctuationCapitalizationModel
  File "/maroc/NeMo/nemo/collections/nlp/__init__.py", line 15, in <module>
    from nemo.collections.nlp import data, losses, models, modules
  File "/maroc/NeMo/nemo/collections/nlp/data/__init__.py", line 16, in <module>
    from nemo.collections.nlp.data.entity_linking.entity_linking_dataset import EntityLinkingDataset
  File "/maroc/NeMo/nemo/collections/nlp/data/entity_linking/__init__.py", line 15, in <module>
    from nemo.collections.nlp.data.entity_linking.entity_linking_dataset import EntityLinkingDataset
  File "/maroc/NeMo/nemo/collections/nlp/data/entity_linking/entity_linking_dataset.py", line 22, in <module>
    from nemo.core.classes import Dataset
  File "/maroc/NeMo/nemo/core/__init__.py", line 16, in <module>
    from nemo.core.classes import *
  File "/maroc/NeMo/nemo/core/classes/__init__.py", line 33, in <module>
    from nemo.core.classes.modelPT import ModelPT
  File "/maroc/NeMo/nemo/core/classes/modelPT.py", line 29, in <module>
    from megatron.core.optimizer import OptimizerConfig, get_megatron_optimizer
  File "/home/deep/.local/lib/python3.10/site-packages/megatron/core/optimizer/__init__.py", line 8, in <module>
    from transformer_engine.pytorch.optimizers import FusedAdam as Adam
  File "/home/deep/.local/lib/python3.10/site-packages/transformer_engine/__init__.py", line 10, in <module>
    import transformer_engine.common
  File "/home/deep/.local/lib/python3.10/site-packages/transformer_engine/common/__init__.py", line 118, in <module>
    _TE_LIB_CTYPES = _load_library()
  File "/home/deep/.local/lib/python3.10/site-packages/transformer_engine/common/__init__.py", line 89, in _load_library
    return ctypes.CDLL(so_path, mode=ctypes.RTLD_GLOBAL)
  File "/usr/local/lib/python3.10/ctypes/__init__.py", line 374, in __init__
    self._handle = _dlopen(self._name, mode)
OSError: /home/deep/.local/lib/python3.10/site-packages/transformer_engine/libtransformer_engine.so: undefined symbol: cudnnBackendExecute

Can someone help me solve this puzzle?
Thank you in advance.

The environment configuration is shown below:

OS: Ubuntu 20.04.6 LTS

NVIDIA-SMI 550.54.14
Driver Version: 550.54.14
CUDA Version: 12.4

nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2024 NVIDIA Corporation
Built on Tue_Feb_27_16:19:38_PST_2024
Cuda compilation tools, release 12.4, V12.4.99
Build cuda_12.4.r12.4/compiler.33961263_0

cuDNN: 8.9.6
torch 2.4.1
torchvision 0.19.1
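
The versions above were collected by hand. As a rough sketch, the PyTorch side of this list can also be gathered from inside the same interpreter that raises the error. The helper name `collect_versions` is made up for illustration; the `torch` attributes it reads are standard PyTorch ones. Note that the cuDNN version PyTorch reports is the one PyTorch itself loads, which may not match the system-wide package.

```python
import platform


def collect_versions():
    """Gather version details that are typically requested in a bug report."""
    info = {"python": platform.python_version()}
    try:
        import torch
    except (ImportError, OSError):
        # PyTorch is missing or cannot be loaded in this interpreter.
        info["torch"] = None
        return info
    info["torch"] = torch.__version__
    # CUDA version PyTorch was built against (None for CPU-only builds).
    info["torch_cuda"] = torch.version.cuda
    # cuDNN version as an integer, or None when cuDNN is unavailable.
    info["torch_cudnn"] = torch.backends.cudnn.version()
    return info


if __name__ == "__main__":
    for key, value in collect_versions().items():
        print(f"{key}: {value}")
```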

Member

ptrendx commented Sep 27, 2024

Hello @ican24, one clarifying question - how did you install Transformer Engine and cuDNN? The cuDNN team told us that this symbol exists in cuDNN 8.9.6, so maybe the issue comes from the incorrect installation of cuDNN?
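
One way to check whether the cuDNN library that the dynamic loader actually finds exports this symbol is a small `ctypes` probe. This is only a sketch: `exports_symbol` is a hypothetical helper, and which file `ctypes.util.find_library("cudnn")` resolves to depends on the machine, so the result says something about the loader's default choice and not necessarily about the copy Transformer Engine picks up.

```python
import ctypes
import ctypes.util


def exports_symbol(library, symbol):
    """Return True if `library` can be loaded and exports `symbol`."""
    try:
        handle = ctypes.CDLL(library)
    except OSError:
        # The library is missing or cannot be loaded at all.
        return False
    # ctypes resolves symbols lazily; a missing symbol raises AttributeError,
    # which hasattr() reports as False.
    return hasattr(handle, symbol)


if __name__ == "__main__":
    # Ask the system loader which libcudnn it would use by default.
    path = ctypes.util.find_library("cudnn")
    if path is None:
        print("No libcudnn found on the default library search path")
    else:
        print(path, exports_symbol(path, "cudnnBackendExecute"))
```

If this prints `False`, or reports a different cuDNN than the one that was installed, that points at a stale or partial cuDNN installation on the search path.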

Author

ican24 commented Sep 27, 2024

Dear Przemyslaw Tredak,
Thank you for your reply.

I have tried to install Transformer Engine in all three ways described in
https://docs.nvidia.com/deeplearning/transformer-engine/user-guide/installation.html#installation-from-source

The last one I tried was

pip install .

I installed cuDNN by running the following commands:

sudo dpkg -i cudnn-local-repo-ubuntu2004-8.9.7.29_1.0-1_amd64.deb
sudo dpkg -i libcudnn8_8.9.7.29-1+cuda12.2_amd64.deb 
sudo dpkg -i libcudnn8-dev_8.9.7.29-1+cuda12.2_amd64.deb

I have tried installing cuDNN two or three times.
Do I need to change the way cuDNN is installed?

Author

ican24 commented Sep 28, 2024

This may be important.
I had trouble installing ninja in the Ubuntu 20.04 environment.
More precisely, it reported:
ninja: error: loading 'build.ninja': No such file or directory
In the end I installed it from source, following the instructions at
https://github.com/ninja-build/ninja

./configure.py --bootstrap
cmake -Bbuild-cmake
cmake --build build-cmake 

After that, the ninja errors during installation went away and I was able to install TransformerEngine successfully.

Author

ican24 commented Sep 28, 2024

I updated cuDNN and installed it as follows:

sudo dpkg -i cudnn-local-repo-ubuntu2004-9.4.0_1.0-1_amd64.deb
sudo apt-get -y install cudnn-cuda-12
sudo apt-get install --reinstall libcudnn9-cuda-12
sudo apt-get install --reinstall libcudnn9-cuda-dev-12

No change! The same error remains:

OSError: /home/deep/.local/lib/python3.10/site-packages/transformer_engine/libtransformer_engine.so: undefined symbol: cudnnBackendExecute

Collaborator

timmoon10 commented Sep 28, 2024

Transformer Engine searches in the following places for cuDNN:

For more detail, see _load_cudnn() in TransformerEngine/transformer_engine/common/__init__.py (line 48 in 7b152a8).

I'd try setting CUDNN_HOME so that TE can find the correct installation of cuDNN 8.9. If that doesn't work, I'd check your Python environment to make sure there aren't any old versions of cuDNN installed (if this does turn out to be the root problem, see https://github.com/NVIDIA/TransformerEngine/pull/1036/files/4722693e616673b95dcb54cd1ad1d0f2fd34c68c#r1688981168).
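
The check of the Python environment could look roughly like the sketch below. Both helper names are made up for illustration. It assumes that pip-installed cuDNN copies show up as distributions with "cudnn" in their name and as `libcudnn*.so*` files somewhere under site-packages; it does not reproduce Transformer Engine's own search order.

```python
import os
import site
from importlib import metadata
from pathlib import Path


def installed_cudnn_packages():
    """Return (name, version) for each installed distribution that mentions cudnn."""
    found = []
    for dist in metadata.distributions():
        try:
            name = dist.metadata["Name"] or ""
        except Exception:
            # Skip distributions whose metadata cannot be read.
            continue
        if "cudnn" in name.lower():
            found.append((name, dist.version))
    return sorted(found)


def find_cudnn_libraries(roots):
    """Recursively collect libcudnn shared objects under the given directories."""
    hits = []
    for root in roots:
        root = Path(root)
        if root.is_dir():
            hits.extend(sorted(str(p) for p in root.rglob("libcudnn*.so*")))
    return hits


if __name__ == "__main__":
    print("cuDNN-related packages:", installed_cudnn_packages())
    # CUDNN_HOME is the variable mentioned above; unset or empty entries are skipped.
    roots = [os.environ.get("CUDNN_HOME", "")]
    roots += site.getsitepackages() + [site.getusersitepackages()]
    for path in find_cudnn_libraries(r for r in roots if r):
        print(path)
```

More than one cuDNN major version in this output would suggest that a copy other than the intended one can be loaded first.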

Author

ican24 commented Sep 29, 2024

> Transformer Engine searches in the following places for cuDNN:
>
> For more detail:
>
> TransformerEngine/transformer_engine/common/__init__.py
> Line 48 in 7b152a8
> def _load_cudnn():
>
> I'd try setting CUDNN_HOME so that TE can find the correct installation of cuDNN 8.9. If that doesn't work, I'd check your Python environment to make sure there aren't any old versions of cuDNN installed (if this does turn out to be the root problem: https://github.com/NVIDIA/TransformerEngine/pull/1036/files/4722693e616673b95dcb54cd1ad1d0f2fd34c68c#r1688981168).

Thank you!
In the end I reinstalled all the packages. It is hard to recount all the details.
I can only say that it was a horribly difficult and long story.
However, it is working now.

ican24 closed this as completed Sep 29, 2024