
CM script not recognizing custom nvmitten wheel on ARM CPU + Nvidia GPU setup #1991

Open
ChrisHuang96 opened this issue Dec 18, 2024 · 17 comments

Comments

@ChrisHuang96

I am running the BERT benchmark on a machine equipped with an ARM CPU and Nvidia GPU, using the NVIDIA NGC aarch64-ubuntu22.04 container. The test is executed with the following command:

cm run script --tags=run-mlperf,inference,_find-performance,_full,_r4.1-dev \
   --model=bert-99 \
   --implementation=nvidia \
   --framework=tensorrt \
   --category=datacenter \
   --scenario=Offline \
   --execution_mode=test \
   --device=cuda  \
   --docker --quiet \
   --test_query_count=500

However, the test fails due to an error with NVIDIA mitten:

633.2 /usr/bin/python3 -m pip install "/opt/nvmitten-0.1.3b0-cp38-cp38-linux_x86_64.whl"
633.5 WARNING: Requirement '/opt/nvmitten-0.1.3b0-cp38-cp38-linux_x86_64.whl' looks like a filename, but the file does not exist
633.5 ERROR: nvmitten-0.1.3b0-cp38-cp38-linux_x86_64.whl is not a supported wheel on this platform.
633.6 
633.6 CM error: Portable CM script failed (name = get-generic-python-lib, return code = 256)

I have compiled nvmitten-0.1.3b0-cp310-cp310-linux_aarch64.whl from the mitten repository source code, and it installs and works correctly. However, I am unsure how to modify the CM script to use nvmitten-0.1.3b0-cp310-cp310-linux_aarch64.whl instead of the default nvmitten-0.1.3b0-cp38-cp38-linux_x86_64.whl. This issue has me stuck, and I would appreciate some assistance.
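
Conceptually, I would expect the wheel selection to be keyed on the host architecture, something like the sketch below (this is only an illustration, not the actual CM script code; the filenames are the two wheels mentioned above):

# Illustrative sketch only - not the actual CM script logic.
# Pick the nvmitten wheel matching the host architecture.
ARCH=$(uname -m)
if [ "${ARCH}" = "aarch64" ]; then
    WHEEL=/opt/nvmitten-0.1.3b0-cp310-cp310-linux_aarch64.whl   # locally built ARM wheel (Python 3.10)
else
    WHEEL=/opt/nvmitten-0.1.3b0-cp38-cp38-linux_x86_64.whl      # x86_64 wheel shipped with the container (Python 3.8)
fi
python3 -m pip install "${WHEEL}"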

@arjunsuresh
Contributor

Hi @ChrisHuang96, the library installation happens inside the Docker container - let me add support for the aarch64 wheel and update you.

@arjunsuresh
Contributor

Hi @ChrisHuang96, can you please test it now after doing cm pull repo? We have now added support for the aarch64 whl file, but since we don't have access to an aarch64 machine, it is not yet tested.

@ChrisHuang96
Author

Hi @arjunsuresh , I've just re-pulled the repo and everything is working well now. I’m able to install mitten and its dependencies for aarch64. Thanks for your support.

By the way, is there a global flag that can be set to preserve the models and datasets downloaded each time the CM script is run, so that they don't need to be re-downloaded during subsequent executions? With the default commands, every time I re-run the same command, the download process is repeated.

@arjunsuresh
Contributor

That's nice @ChrisHuang96

Actually, by default all models and downloads are cached within a CM run. If Docker is used, the cache won't persist when running in non-interactive mode, but we mount the model and dataset folders from the host and thus avoid repeated downloads.
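
For illustration, the host mounts in the CM-generated docker command look roughly like this (the paths here are placeholders, not the exact mount points we use):

docker run --rm -it \
   -v /host/path/to/models:/models \
   -v /host/path/to/datasets:/datasets \
   <image> bash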

Can you please share the command used and the model/dataset which is getting re-downloaded?

@ChrisHuang96
Author

ChrisHuang96 commented Dec 20, 2024

I used the test command from the MLPerf Inference Documentation, Nvidia -> Datacenter -> Performance Estimation for Offline Scenario:

cm run script --tags=run-mlperf,inference,_find-performance,_full,_r4.1-dev \
   --model=bert-99.9 \
   --implementation=nvidia \
   --framework=tensorrt \
   --category=datacenter \
   --scenario=Offline \
   --execution_mode=test \
   --device=cuda  \
   --quiet \
   --test_query_count=500

Each time the same command is executed, it seems to enter a different working directory (something like /root/CM/repos/local/cache/xxx). When using Docker, is there a way to cache these folders on the host machine rather than re-downloading them every time, especially the large model files below:

model file downloaded from:

dataset downloaded from:
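
(For reference, I can list the cached entries under /root/CM/repos/local/cache with the CM CLI; the tag filter below is just an example:)

cm show cache
cm show cache --tags=get,ml-model,bert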

@arjunsuresh
Contributor

arjunsuresh commented Dec 20, 2024

Thank you @ChrisHuang96 for sharing the command used. This is the correct link for the Nvidia implementation - note that the --docker option is mandatory for it.
https://docs.mlcommons.org/inference/benchmarks/language/bert/#__tabbed_1_2

"When using Docker, is there a way to cache these folders on the host machine rather than re-downloading them every time, especially for large model files below"

The BERT model and dataset are not mounted from the host as they are relatively small - the model is about a GB and the dataset is much smaller. We can add this option - so far we have done it only for models and datasets that are GBs in size.

@ChrisHuang96
Author

Thanks for your reply. I hope support can be added for mounting the BERT model files from the host, because I have a slow network and it takes about 1.5 hours to download even GB-scale data.

@arjunsuresh
Contributor

No worries. We can add that, but there are many other downloads, like whl files, which are also close to a GB in size. But once you are inside the container, everything is cached, right? Is there any reason to relaunch the container?

@ChrisHuang96
Author

Because there are multiple mirrors for wheels (or PyPI), I can add a Docker startup option to choose a faster mirror site, whereas the model generally comes from a single source.

"Is there any reason to relaunch the container?" - This device is not solely for me; other colleagues may restart it, and we might also adjust the system configuration to optimize benchmarking results.

@arjunsuresh
Contributor

Thank you @ChrisHuang96 for explaining the situation.

I have added a quick fix so that the BERT model is not re-downloaded for the Nvidia implementation - it is now downloaded once inside the container and the downloaded files are then copied to a mounted folder. cm pull repo will pick up this change.

Also, if the system is not restarted, docker start can be used to restart a stopped container. And if a system restart is expected, docker commit can be used to save the state of a docker container: https://docs.docker.com/reference/cli/docker/container/commit/
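
For example (standard Docker CLI; the container ID and image name below are placeholders):

# restart a stopped container, keeping its filesystem state
docker start -ai <container_id>

# or snapshot the container into an image that survives a host reboot
docker commit <container_id> mlperf-bert:checkpoint
docker run -it mlperf-bert:checkpoint bash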

@ChrisHuang96
Author

Thank you @arjunsuresh , I will check it later.

@ChrisHuang96
Author

Hi @arjunsuresh, it works well, but I ran into a new error later:

CM error: Please envoke cmr "get tensorrt _dev" --tar_file={full path to the TensorRT tar file}!

It looks like the TensorRT package was not found. Is there any way to solve this problem?
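
(For reference, the command the error message asks for would be of this form, pointing at a TensorRT tarball downloaded from the NVIDIA developer site; the path is a placeholder:)

cmr "get tensorrt _dev" --tar_file=/path/to/TensorRT-<version>.tar.gz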

@arjunsuresh
Contributor

@ChrisHuang96 are you using the --docker option in the cm run command?

@ChrisHuang96
Author

ChrisHuang96 commented Dec 24, 2024

Yes, I use the Performance Estimation for Offline Scenario command from the MLCommons page:

cm run script --tags=run-mlperf,inference,_find-performance,_full,_r4.1-dev \
   --model=bert-99 \
   --implementation=nvidia \
   --framework=tensorrt \
   --category=datacenter \
   --scenario=Offline \
   --execution_mode=test \
   --device=cuda  \
   --docker --quiet \
   --test_query_count=500

Indeed, it includes the --docker flag. Is there anything wrong with this flag?

@arjunsuresh
Contributor

Yes, the --docker flag is needed. Can you please retry now? We just fixed a bug in TensorRT detection on aarch64.

@ChrisHuang96
Author

The same error happened. I ran cm rm cache -f first and re-ran the docker command on the physical machine; it fetched the model data and cloned the repos. Do I need to re-install CM on the physical machine?

The output is:

Successfully installed nvmitten-0.1.3b0
WARNING: Running pip as the 'root' user can result in broken permissions and conflicting behaviour with the system package manager. It is recommended to use a virtual environment instead: https://pip.pypa.io/warnings/venv
INFO:root:           ! call "postprocess" from /root/CM/repos/mlcommons@mlperf-automations/script/get-nvidia-mitten/customize.py
INFO:root:    * cm run script "get cuda _cudnn"
INFO:root:      * cm run script "detect os"
INFO:root:             ! cd /root/CM/repos/local/cache/e4864b9a2c9d4057
INFO:root:             ! call /root/CM/repos/mlcommons@mlperf-automations/script/detect-os/run.sh from tmp-run.sh
INFO:root:             ! call "postprocess" from /root/CM/repos/mlcommons@mlperf-automations/script/detect-os/customize.py
INFO:root:        # Requested paths: /root/miniforge3/bin:/root/miniforge3/condabin:/usr/include/aarch64-linux-gnu:/usr/local/cuda-12.2/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:/usr/local/games:/snap/bin:/usr/local/cuda/bin:/usr/cuda/bin:/usr/local/cuda-11/bin:/usr/cuda-11/bin:/usr/local/cuda-12/bin:/usr/cuda-12/bin:/usr/local/packages/cuda
INFO:root:        * /usr/local/cuda-12.2/bin/nvcc
INFO:root:               ! cd /root/CM/repos/local/cache/e4864b9a2c9d4057
INFO:root:               ! call /root/CM/repos/mlcommons@mlperf-automations/script/get-cuda/run.sh from tmp-run.sh
INFO:root:               ! call "detect_version" from /root/CM/repos/mlcommons@mlperf-automations/script/get-cuda/customize.py
        Detected version: 12.2
INFO:root:        # Found artifact in /usr/local/cuda-12.2/bin/nvcc
INFO:root:           ! cd /root/CM/repos/local/cache/e4864b9a2c9d4057
INFO:root:           ! call /root/CM/repos/mlcommons@mlperf-automations/script/get-cuda/run.sh from tmp-run.sh
INFO:root:           ! call "postprocess" from /root/CM/repos/mlcommons@mlperf-automations/script/get-cuda/customize.py
        Detected version: 12.2
INFO:root:    * cm run script "get nvidia cudnn"
INFO:root:      * cm run script "detect os"
INFO:root:             ! cd /root/CM/repos/local/cache/11ee6c00501549a8
INFO:root:             ! call /root/CM/repos/mlcommons@mlperf-automations/script/detect-os/run.sh from tmp-run.sh
INFO:root:             ! call "postprocess" from /root/CM/repos/mlcommons@mlperf-automations/script/detect-os/customize.py
INFO:root:      * cm run script "detect sudo"
INFO:root:             ! cd /root/CM/repos/local/cache/11ee6c00501549a8
INFO:root:             ! call /root/CM/repos/mlcommons@mlperf-automations/script/detect-sudo/run.sh from tmp-run.sh
INFO:root:             ! call "postprocess" from /root/CM/repos/mlcommons@mlperf-automations/script/detect-sudo/customize.py
INFO:root:        # Requested paths: /root/miniforge3/bin:/root/miniforge3/condabin:/usr/include/aarch64-linux-gnu:/usr/local/cuda-12.2/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:/usr/local/games:/snap/bin:/usr/local/cuda/bin:/usr/cuda/bin:/usr/local/cuda-11/bin:/usr/cuda-11/bin:/usr/local/cuda-12/bin:/usr/cuda-12/bin:/usr/local/packages/cuda:/usr/local/cuda/lib64:/usr/cuda/lib64:/usr/local/cuda/lib:/usr/cuda/lib:/usr/local/cuda-11/lib64:/usr/cuda-11/lib:/usr/local/cuda-12/lib:/usr/cuda-12/lib:/usr/local/packages/cuda/lib:$CUDNN_ROOT/lib:/lib/aarch64-linux-gnu:/usr/lib/aarch64-linux-gnu:/usr/local/lib:/lib:/usr/lib
INFO:root:        # Found artifact in /lib/aarch64-linux-gnu/libcudnn.so
INFO:root:           ! cd /root/CM/repos/local/cache/11ee6c00501549a8
INFO:root:           ! call /root/CM/repos/mlcommons@mlperf-automations/script/get-cudnn/run.sh from tmp-run.sh
INFO:root:           ! call "postprocess" from /root/CM/repos/mlcommons@mlperf-automations/script/get-cudnn/customize.py
INFO:root:ENV[CM_CUDA_PATH_INCLUDE_CUDNN]: /usr/include
INFO:root:ENV[CM_CUDA_PATH_LIB_CUDNN]: /lib/aarch64-linux-gnu
INFO:root:ENV[CM_CUDNN_VERSION]: 8.9.0
INFO:root:ENV[CM_CUDA_PATH_LIB_CUDNN_EXISTS]: yes
INFO:root:ENV[CM_CUDA_VERSION]: 12.2
INFO:root:ENV[CM_CUDA_VERSION_STRING]: cu122
INFO:root:ENV[CM_NVCC_BIN_WITH_PATH]: /usr/local/cuda-12.2/bin/nvcc
INFO:root:ENV[CUDA_HOME]: /usr/local/cuda-12.2
INFO:root:    * cm run script "get tensorrt"
INFO:root:      * cm run script "detect os"
INFO:root:             ! cd /root/CM/repos/local/cache/384e00817d854eae
INFO:root:             ! call /root/CM/repos/mlcommons@mlperf-automations/script/detect-os/run.sh from tmp-run.sh
INFO:root:             ! call "postprocess" from /root/CM/repos/mlcommons@mlperf-automations/script/detect-os/customize.py
INFO:root:      * cm run script "get python3"
INFO:root:           ! load /root/CM/repos/local/cache/f1ac7e837ce84f26/cm-cached-state.json
INFO:root:Path to Python: /root/miniforge3/bin/python3
INFO:root:Python version: 3.11.9

CM error: Please envoke cmr "get tensorrt _dev" --tar_file={full path to the TensorRT tar file}!

My _cm.yaml in script/get-tensorrt is:

alias: get-tensorrt
automation_alias: script
automation_uid: 5b4e0237da074764
cache: true
category: CUDA automation
clean_files: []
default_env: {}
deps:
- tags: detect,os
- names:
  - python
  - python3
  tags: get,python3
docker: {}
input_description:
  input: Full path to the installed TensorRT library (nvinfer)
  tar_file: Full path to the TensorRT Tar file downloaded from the Nvidia website
    (https://developer.nvidia.com/tensorrt)
input_mapping:
  input: CM_INPUT
  tar_file: CM_TENSORRT_TAR_FILE_PATH
new_env_keys:
- CM_TENSORRT_*
- +PATH
- +C_INCLUDE_PATH
- +CPLUS_INCLUDE_PATH
- +LD_LIBRARY_PATH
- +DYLD_FALLBACK_LIBRARY_PATH
- + LDFLAGS
tags:
- get
- tensorrt
- nvidia
uid: 2a84ca505e4c408d
variations:
  dev:
    env:
      CM_TENSORRT_REQUIRE_DEV: 'yes'

Thanks for your help!

@ChrisHuang96
Author

Hi @arjunsuresh, there is a new problem with TensorRT on aarch64. The CM scripts automatically download the source code of PyTorch and CMake; when building torch with CMake, an error occurred:

-- Building version 2.1.0a0+git32f93b1
cmake -GNinja -DBUILD_PYTHON=True -DBUILD_TEST=True -DBUILD_TRTLLM=0 -DCMAKE_BUILD_TYPE=Release -DCMAKE_INSTALL_PREFIX=/root/CM/repos/local/cache/df98f6be06064639/pytorch/torch -DCMAKE_PREFIX_PATH=/root/miniforge3/envs/cmenv/lib/python3.10/site-packages -DCUDA_NVCC_EXECUTABLE=/usr/local/cuda-12.2/bin/nvcc -DNUMPY_INCLUDE_DIR=/root/miniforge3/envs/cmenv/lib/python3.10/site-packages/numpy/core/include -DPYTHON_EXECUTABLE=/root/miniforge3/envs/cmenv/bin/python3 -DPYTHON_INCLUDE_DIR=/root/miniforge3/envs/cmenv/include/python3.10 -DPYTHON_LIBRARY=/root/miniforge3/envs/cmenv/lib/libpython3.10.a -DTORCH_BUILD_VERSION=2.1.0a0+git32f93b1 -DTORCH_CUDA_ARCH_LIST=Ampere Ada Hopper -DUSE_CUDA=1 -DUSE_CUDNN=1 -DUSE_NUMPY=True /root/CM/repos/local/cache/df98f6be06064639/pytorch
-- The CXX compiler identification is GNU 11.4.0
-- The C compiler identification is GNU 11.4.0
-- Detecting CXX compiler ABI info
-- Detecting CXX compiler ABI info - failed
-- Check for working CXX compiler: /usr/bin/c++
-- Check for working CXX compiler: /usr/bin/c++ - broken
CMake Error at /root/CM/repos/local/cache/567e8b5fdef84850/share/cmake-3.25/Modules/CMakeTestCXXCompiler.cmake:63 (message):
  The C++ compiler

    "/usr/bin/c++"

  is not able to compile a simple test program.

  It fails with the following output:

    Change Dir: /root/CM/repos/local/cache/df98f6be06064639/pytorch/build/CMakeFiles/CMakeScratch/TryCompile-2PYROQ
    
    Run Build Command(s):/usr/local/bin/ninja cmTC_328a1 && [1/2] Building CXX object CMakeFiles/cmTC_328a1.dir/testCXXCompiler.cxx.o
    [2/2] Linking CXX executable cmTC_328a1
    FAILED: cmTC_328a1 
    : && /usr/bin/c++ -O3 -O2 -Wno-error=uninitialized -Wno-error=maybe-uninitialized -fno-strict-aliasing /root/CM/repos/local/cache/7a4a89600a234856/TensorRT-8.6.1.6/lib -O3 -O2 CMakeFiles/cmTC_328a1.dir/testCXXCompiler.cxx.o -o cmTC_328a1   && :
    /usr/bin/ld: cannot find /root/CM/repos/local/cache/7a4a89600a234856/TensorRT-8.6.1.6/lib: file format not recognized
    collect2: error: ld returned 1 exit status
    ninja: build stopped: subcommand failed. 

It looks like we should use -L/root/CM/repos/local/cache/7a4a89600a234856/TensorRT-8.6.1.6/lib instead of the bare path /root/CM/repos/local/cache/7a4a89600a234856/TensorRT-8.6.1.6/lib on the gcc command line. Do you have any suggestions for this problem? Thank you very much.
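
To make the difference concrete, this is roughly what I mean (the -lnvinfer is only an example of what would normally accompany the search path; the actual fix is presumably in whatever linker flags CM injects):

# current (broken): the lib directory is passed to the linker as if it were an input file
/usr/bin/c++ ... /root/CM/repos/local/cache/7a4a89600a234856/TensorRT-8.6.1.6/lib ... -o cmTC_328a1

# intended: pass it as a library search path instead
/usr/bin/c++ ... -L/root/CM/repos/local/cache/7a4a89600a234856/TensorRT-8.6.1.6/lib -lnvinfer ... -o cmTC_328a1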
