CM script not recognizing custom nvmitten wheel on ARM CPU + Nvidia GPU setup #1991
Hi @ChrisHuang96, the library installation happens inside the Docker container. Let me add support for an aarch64 wheel and update you.
Hi @ChrisHuang96, can you please test it now after re-pulling the repo?
Hi @arjunsuresh, I've just re-pulled the repo and everything is working well now. I'm able to install mitten and its dependencies for aarch64. Thanks for your support. By the way, is there a global flag that can be set to preserve the models and datasets downloaded each time the CM script is run, so that they don't need to be re-downloaded during subsequent executions? With the default commands, every time I re-run the same command, it repeats the download process.
That's nice, @ChrisHuang96. Actually, by default all the models and downloads are cached in a CM run. If Docker is used in non-interactive mode the in-container cache won't persist, but we mount the model and dataset folders from the host and thus avoid repeated downloads. Can you please share the command used and the model/dataset which is getting re-downloaded?
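For reference, a minimal sketch of how the CM cache can be inspected or cleared from the host; the tags used below are only illustrative, not the exact ones produced by this run:

# List cached CM entries (illustrative tags)
cm show cache --tags=get,ml-model,bert
# Force-remove a cache entry so the next run re-downloads it
cm rm cache --tags=get,ml-model,bert -f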
I use the test command from the MLPerf Inference documentation, Nvidia -> datacenter -> Performance Estimation for Offline Scenario:

cm run script --tags=run-mlperf,inference,_find-performance,_full,_r4.1-dev \
   --model=bert-99.9 \
   --implementation=nvidia \
   --framework=tensorrt \
   --category=datacenter \
   --scenario=Offline \
   --execution_mode=test \
   --device=cuda \
   --quiet \
   --test_query_count=500

Each time the same command is executed, the model file is downloaded again from:
and the dataset is downloaded from:
Thank you @ChrisHuang96 for sharing the command used. This is the correct link for the Nvidia implementation. Regarding "When using Docker, is there a way to cache these folders on the host machine rather than re-downloading them every time, especially for large model files below": the BERT model and dataset are not mounted from the host as they are relatively small - the model is about a GB and the dataset is much smaller. We can add this option; so far we have done it only for models and datasets which take GBs of space.
Thanks for your reply. I hope you can add support for mounting the BERT model files from the host, because I have a slow network and it takes about 1.5 hours to download even GB-level data.
No worries, we can add that, but there are many other downloads like whl files which are also close to a GB in size. But once you are inside the container everything is cached, right? Is there any reason to relaunch the container?
Because there are multiple mirrors for wheels (or PyPI), I can add a Docker startup option to choose a faster mirror site, whereas the model is generally available from a single source. As for "Is there any reason to relaunch the container?" - this device is not solely for me; other colleagues may restart it, and we might also adjust the system configuration to optimize benchmarking results.
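For reference, a faster PyPI mirror can usually be selected inside the container with plain pip; this is a generic pip sketch rather than a CM flag, and the mirror URL is only an example:

# Make pip use a mirror for all subsequent installs (example URL)
python3 -m pip config set global.index-url https://pypi.tuna.tsinghua.edu.cn/simple
# Or per-install:
python3 -m pip install -i https://pypi.tuna.tsinghua.edu.cn/simple <package>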
Thank you @ChrisHuang96 for explaining the situation. I have added a quick fix to not re-download the BERT model for the Nvidia implementation - it will now be downloaded once inside the container and the downloaded files are then copied to a mounted folder. Also, if the system is not restarted, the existing container and its cache can be reused.
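A generic illustration of the mounted-folder idea (the image name and paths below are hypothetical, not the ones CM generates):

# Mount a host folder into the container so downloaded models survive relaunches
docker run --gpus all -it \
  -v /data/mlperf_downloads:/root/downloads \
  my-mlperf-image:latest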
Thank you @arjunsuresh, I will check it later.
Hi @arjunsuresh, it works well, but I hit a new error later:

CM error: Please envoke cmr "get tensorrt _dev" --tar_file={full path to the TensorRT tar file}!

It looks like the TensorRT package is not found. Is there any way to solve this problem?
@ChrisHuang96 are you using the --docker option?
Yes, I use the Performance Estimation for Offline Scenario command from the MLCommons page:

cm run script --tags=run-mlperf,inference,_find-performance,_full,_r4.1-dev \
   --model=bert-99 \
   --implementation=nvidia \
   --framework=tensorrt \
   --category=datacenter \
   --scenario=Offline \
   --execution_mode=test \
   --device=cuda \
   --docker --quiet \
   --test_query_count=500

Indeed, it includes the --docker option.
Yes.
The same error happened after I tried that. The output is:

Successfully installed nvmitten-0.1.3b0
WARNING: Running pip as the 'root' user can result in broken permissions and conflicting behaviour with the system package manager. It is recommended to use a virtual environment instead: https://pip.pypa.io/warnings/venv
INFO:root: ! call "postprocess" from /root/CM/repos/mlcommons@mlperf-automations/script/get-nvidia-mitten/customize.py
INFO:root: * cm run script "get cuda _cudnn"
INFO:root: * cm run script "detect os"
INFO:root: ! cd /root/CM/repos/local/cache/e4864b9a2c9d4057
INFO:root: ! call /root/CM/repos/mlcommons@mlperf-automations/script/detect-os/run.sh from tmp-run.sh
INFO:root: ! call "postprocess" from /root/CM/repos/mlcommons@mlperf-automations/script/detect-os/customize.py
INFO:root: # Requested paths: /root/miniforge3/bin:/root/miniforge3/condabin:/usr/include/aarch64-linux-gnu:/usr/local/cuda-12.2/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:/usr/local/games:/snap/bin:/usr/local/cuda/bin:/usr/cuda/bin:/usr/local/cuda-11/bin:/usr/cuda-11/bin:/usr/local/cuda-12/bin:/usr/cuda-12/bin:/usr/local/packages/cuda
INFO:root: * /usr/local/cuda-12.2/bin/nvcc
INFO:root: ! cd /root/CM/repos/local/cache/e4864b9a2c9d4057
INFO:root: ! call /root/CM/repos/mlcommons@mlperf-automations/script/get-cuda/run.sh from tmp-run.sh
INFO:root: ! call "detect_version" from /root/CM/repos/mlcommons@mlperf-automations/script/get-cuda/customize.py
Detected version: 12.2
INFO:root: # Found artifact in /usr/local/cuda-12.2/bin/nvcc
INFO:root: ! cd /root/CM/repos/local/cache/e4864b9a2c9d4057
INFO:root: ! call /root/CM/repos/mlcommons@mlperf-automations/script/get-cuda/run.sh from tmp-run.sh
INFO:root: ! call "postprocess" from /root/CM/repos/mlcommons@mlperf-automations/script/get-cuda/customize.py
Detected version: 12.2
INFO:root: * cm run script "get nvidia cudnn"
INFO:root: * cm run script "detect os"
INFO:root: ! cd /root/CM/repos/local/cache/11ee6c00501549a8
INFO:root: ! call /root/CM/repos/mlcommons@mlperf-automations/script/detect-os/run.sh from tmp-run.sh
INFO:root: ! call "postprocess" from /root/CM/repos/mlcommons@mlperf-automations/script/detect-os/customize.py
INFO:root: * cm run script "detect sudo"
INFO:root: ! cd /root/CM/repos/local/cache/11ee6c00501549a8
INFO:root: ! call /root/CM/repos/mlcommons@mlperf-automations/script/detect-sudo/run.sh from tmp-run.sh
INFO:root: ! call "postprocess" from /root/CM/repos/mlcommons@mlperf-automations/script/detect-sudo/customize.py
INFO:root: # Requested paths: /root/miniforge3/bin:/root/miniforge3/condabin:/usr/include/aarch64-linux-gnu:/usr/local/cuda-12.2/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:/usr/local/games:/snap/bin:/usr/local/cuda/bin:/usr/cuda/bin:/usr/local/cuda-11/bin:/usr/cuda-11/bin:/usr/local/cuda-12/bin:/usr/cuda-12/bin:/usr/local/packages/cuda:/usr/local/cuda/lib64:/usr/cuda/lib64:/usr/local/cuda/lib:/usr/cuda/lib:/usr/local/cuda-11/lib64:/usr/cuda-11/lib:/usr/local/cuda-12/lib:/usr/cuda-12/lib:/usr/local/packages/cuda/lib:$CUDNN_ROOT/lib:/lib/aarch64-linux-gnu:/usr/lib/aarch64-linux-gnu:/usr/local/lib:/lib:/usr/lib
INFO:root: # Found artifact in /lib/aarch64-linux-gnu/libcudnn.so
INFO:root: ! cd /root/CM/repos/local/cache/11ee6c00501549a8
INFO:root: ! call /root/CM/repos/mlcommons@mlperf-automations/script/get-cudnn/run.sh from tmp-run.sh
INFO:root: ! call "postprocess" from /root/CM/repos/mlcommons@mlperf-automations/script/get-cudnn/customize.py
INFO:root:ENV[CM_CUDA_PATH_INCLUDE_CUDNN]: /usr/include
INFO:root:ENV[CM_CUDA_PATH_LIB_CUDNN]: /lib/aarch64-linux-gnu
INFO:root:ENV[CM_CUDNN_VERSION]: 8.9.0
INFO:root:ENV[CM_CUDA_PATH_LIB_CUDNN_EXISTS]: yes
INFO:root:ENV[CM_CUDA_VERSION]: 12.2
INFO:root:ENV[CM_CUDA_VERSION_STRING]: cu122
INFO:root:ENV[CM_NVCC_BIN_WITH_PATH]: /usr/local/cuda-12.2/bin/nvcc
INFO:root:ENV[CUDA_HOME]: /usr/local/cuda-12.2
INFO:root: * cm run script "get tensorrt"
INFO:root: * cm run script "detect os"
INFO:root: ! cd /root/CM/repos/local/cache/384e00817d854eae
INFO:root: ! call /root/CM/repos/mlcommons@mlperf-automations/script/detect-os/run.sh from tmp-run.sh
INFO:root: ! call "postprocess" from /root/CM/repos/mlcommons@mlperf-automations/script/detect-os/customize.py
INFO:root: * cm run script "get python3"
INFO:root: ! load /root/CM/repos/local/cache/f1ac7e837ce84f26/cm-cached-state.json
INFO:root:Path to Python: /root/miniforge3/bin/python3
INFO:root:Python version: 3.11.9
CM error: Please envoke cmr "get tensorrt _dev" --tar_file={full path to the TensorRT tar file}!

My get-tensorrt script configuration:

alias: get-tensorrt
automation_alias: script
automation_uid: 5b4e0237da074764
cache: true
category: CUDA automation
clean_files: []
default_env: {}
deps:
- tags: detect,os
- names:
  - python
  - python3
  tags: get,python3
docker: {}
input_description:
  input: Full path to the installed TensorRT library (nvinfer)
  tar_file: Full path to the TensorRT Tar file downloaded from the Nvidia website
    (https://developer.nvidia.com/tensorrt)
input_mapping:
  input: CM_INPUT
  tar_file: CM_TENSORRT_TAR_FILE_PATH
new_env_keys:
- CM_TENSORRT_*
- +PATH
- +C_INCLUDE_PATH
- +CPLUS_INCLUDE_PATH
- +LD_LIBRARY_PATH
- +DYLD_FALLBACK_LIBRARY_PATH
- +LDFLAGS
tags:
- get
- tensorrt
- nvidia
uid: 2a84ca505e4c408d
variations:
  dev:
    env:
      CM_TENSORRT_REQUIRE_DEV: 'yes'

Thanks for your help!
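In case it is useful, the workaround the error message itself points to is to download the TensorRT tarball for aarch64 from the Nvidia website and register it with CM; a minimal sketch, with a placeholder file path:

# Register a locally downloaded TensorRT tarball with CM (the path is a placeholder)
cmr "get tensorrt _dev" --tar_file=/full/path/to/TensorRT-8.6.1.6.Linux.aarch64-gnu.tar.gz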
Hi @arjunsuresh, there is a new problem with TensorRT on aarch64. The CM scripts automatically download the source code of PyTorch and build it, but the build fails:

-- Building version 2.1.0a0+git32f93b1
cmake -GNinja -DBUILD_PYTHON=True -DBUILD_TEST=True -DBUILD_TRTLLM=0 -DCMAKE_BUILD_TYPE=Release -DCMAKE_INSTALL_PREFIX=/root/CM/repos/local/cache/df98f6be06064639/pytorch/torch -DCMAKE_PREFIX_PATH=/root/miniforge3/envs/cmenv/lib/python3.10/site-packages -DCUDA_NVCC_EXECUTABLE=/usr/local/cuda-12.2/bin/nvcc -DNUMPY_INCLUDE_DIR=/root/miniforge3/envs/cmenv/lib/python3.10/site-packages/numpy/core/include -DPYTHON_EXECUTABLE=/root/miniforge3/envs/cmenv/bin/python3 -DPYTHON_INCLUDE_DIR=/root/miniforge3/envs/cmenv/include/python3.10 -DPYTHON_LIBRARY=/root/miniforge3/envs/cmenv/lib/libpython3.10.a -DTORCH_BUILD_VERSION=2.1.0a0+git32f93b1 -DTORCH_CUDA_ARCH_LIST=Ampere Ada Hopper -DUSE_CUDA=1 -DUSE_CUDNN=1 -DUSE_NUMPY=True /root/CM/repos/local/cache/df98f6be06064639/pytorch
-- The CXX compiler identification is GNU 11.4.0
-- The C compiler identification is GNU 11.4.0
-- Detecting CXX compiler ABI info
-- Detecting CXX compiler ABI info - failed
-- Check for working CXX compiler: /usr/bin/c++
-- Check for working CXX compiler: /usr/bin/c++ - broken
CMake Error at /root/CM/repos/local/cache/567e8b5fdef84850/share/cmake-3.25/Modules/CMakeTestCXXCompiler.cmake:63 (message):
The C++ compiler
"/usr/bin/c++"
is not able to compile a simple test program.
It fails with the following output:
Change Dir: /root/CM/repos/local/cache/df98f6be06064639/pytorch/build/CMakeFiles/CMakeScratch/TryCompile-2PYROQ
Run Build Command(s):/usr/local/bin/ninja cmTC_328a1 && [1/2] Building CXX object CMakeFiles/cmTC_328a1.dir/testCXXCompiler.cxx.o
[2/2] Linking CXX executable cmTC_328a1
FAILED: cmTC_328a1
: && /usr/bin/c++ -O3 -O2 -Wno-error=uninitialized -Wno-error=maybe-uninitialized -fno-strict-aliasing /root/CM/repos/local/cache/7a4a89600a234856/TensorRT-8.6.1.6/lib -O3 -O2 CMakeFiles/cmTC_328a1.dir/testCXXCompiler.cxx.o -o cmTC_328a1 && :
/usr/bin/ld: cannot find /root/CM/repos/local/cache/7a4a89600a234856/TensorRT-8.6.1.6/lib: file format not recognized
collect2: error: ld returned 1 exit status
ninja: build stopped: subcommand failed.

Looks like we should pass the TensorRT lib directory to the linker with a -L prefix; right now the bare path /root/CM/repos/local/cache/7a4a89600a234856/TensorRT-8.6.1.6/lib ends up in the link command and ld rejects it with "file format not recognized".
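A short sketch of the distinction, assuming the path shown in the log above: the linker expects library search directories to be prefixed with -L (with libraries then named via -lnvinfer and the like), not a bare directory dropped into the flags:

# What effectively reaches the linker now (a bare directory in the flags):
#   /root/CM/repos/local/cache/7a4a89600a234856/TensorRT-8.6.1.6/lib
# What the linker expects instead:
export LDFLAGS="-L/root/CM/repos/local/cache/7a4a89600a234856/TensorRT-8.6.1.6/lib"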
I am running the BERT benchmark on a machine equipped with an ARM CPU and an Nvidia GPU, using the NVIDIA NGC aarch64-ubuntu22.04 container. The test is executed with the following command:

However, the test fails due to an error with NVIDIA mitten:

I have compiled nvmitten-0.1.3b0-cp310-cp310-linux_aarch64.whl from the mitten repository source code, and it installs and works correctly. However, I am unsure how to modify the CM script to use nvmitten-0.1.3b0-cp310-cp310-linux_aarch64.whl instead of the default nvmitten-0.1.3b0-cp38-cp38-linux_x86_64.whl. This issue is quite perplexing, and I hope to receive assistance.
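For context, installing the locally built wheel by hand is plain pip usage; the open question in this issue is how to make the CM script pick it up instead of the hard-coded x86_64 wheel:

# Install the locally built aarch64 wheel (this alone does not change which wheel the CM script looks for)
python3 -m pip install --force-reinstall ./nvmitten-0.1.3b0-cp310-cp310-linux_aarch64.whl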