NotImplementedError: Cannot copy out of meta tensor; no data! #50

Open
zdaoguang opened this issue Feb 10, 2023 · 9 comments

@zdaoguang

zdaoguang commented Feb 10, 2023

Hi,

I am using the DeepSpeed framework to speed up inference of BLOOM 7.1B, as shown below:

deepspeed --num_gpus 4 bloom-inference-scripts/bloom-ds-inference.py --name bigscience/bloom-7b1

But I got the following error instead:

(bloom) xxx@HOST-xxx:~/projects/transformers-bloom-inference/bloom-inference-scripts$ bash run_deepspeed.sh 
[2023-02-10 17:46:16,148] [WARNING] [runner.py:186:fetch_hostfile] Unable to find hostfile, will proceed with training with local resources only.
[2023-02-10 17:46:16,202] [INFO] [runner.py:548:main] cmd = /home/caojunzhi/anaconda3/envs/chatgpt/bin/python -u -m deepspeed.launcher.launch --world_info=eyJsb2NhbGhvc3QiOiBbMF19 --master_addr=127.0.0.1 --master_port=29500 --enable_each_rank_log=None bloom-ds-inference.py --name bigscience/bloom-7b1
[2023-02-10 17:46:19,604] [INFO] [launch.py:142:main] WORLD INFO DICT: {'localhost': [0]}
[2023-02-10 17:46:19,604] [INFO] [launch.py:148:main] nnodes=1, num_local_procs=1, node_rank=0
[2023-02-10 17:46:19,604] [INFO] [launch.py:161:main] global_rank_mapping=defaultdict(<class 'list'>, {'localhost': [0]})
[2023-02-10 17:46:19,604] [INFO] [launch.py:162:main] dist_world_size=1
[2023-02-10 17:46:19,604] [INFO] [launch.py:164:main] Setting CUDA_VISIBLE_DEVICES=0
[2023-02-10 17:46:23,455] [INFO] [comm.py:657:init_distributed] Initializing TorchBackend in DeepSpeed with backend nccl
*** Loading the model bigscience/bloom-7b1
Fetching 13 files: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 13/13 [00:00<00:00, 33951.40it/s]
Fetching 13 files: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 13/13 [00:00<00:00, 8339.85it/s]
Fetching 13 files: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 13/13 [00:00<00:00, 7358.43it/s]
Fetching 13 files: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 13/13 [00:00<00:00, 26572.10it/s]
[2023-02-10 17:46:33,775] [INFO] [logging.py:68:log_dist] [Rank 0] DeepSpeed info: version=0.8.0, git-hash=unknown, git-branch=unknown
[2023-02-10 17:46:33,778] [WARNING] [config_utils.py:67:_process_deprecated_field] Config parameter mp_size is deprecated use tensor_parallel.tp_size instead
[2023-02-10 17:46:33,779] [INFO] [logging.py:68:log_dist] [Rank 0] quantize_bits = 8 mlp_extra_grouping = False, quantize_groups = 1
Installed CUDA version 11.1 does not match the version torch was compiled with 11.7 but since the APIs are compatible, accepting this combination
Using /data/xxx/.cache/torch_extensions/py310_cu117 as PyTorch extensions root...
Detected CUDA files, patching ldflags
Emitting ninja build file /data/xxx/.cache/torch_extensions/py310_cu117/transformer_inference/build.ninja...
Building extension module transformer_inference...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
ninja: no work to do.
Loading extension module transformer_inference...
Time to load transformer_inference op: 0.1198277473449707 seconds
[2023-02-10 17:46:34,344] [INFO] [logging.py:68:log_dist] [Rank 0] DeepSpeed-Inference config: {'layer_id': 0, 'hidden_size': 4096, 'intermediate_size': 16384, 'heads': 32, 'num_hidden_layers': -1, 'fp16': True, 'pre_layer_norm': True, 'local_rank': -1, 'stochastic_mode': False, 'epsilon': 1e-05, 'mp_size': 1, 'q_int8': False, 'scale_attention': True, 'triangular_masking': True, 'local_attention': False, 'window_size': 1, 'rotary_dim': -1, 'rotate_half': False, 'rotate_every_two': True, 'return_tuple': True, 'mlp_after_attn': True, 'mlp_act_func_type': <ActivationFuncType.GELU: 1>, 'specialized_mode': False, 'training_mp_size': 1, 'bigscience_bloom': True, 'max_out_tokens': 1024, 'scale_attn_by_inverse_layer_idx': False, 'enable_qkv_quantization': False, 'use_mup': False, 'return_single_tuple': False}
Installed CUDA version 11.1 does not match the version torch was compiled with 11.7 but since the APIs are compatible, accepting this combination
Using /data/xxx/.cache/torch_extensions/py310_cu117 as PyTorch extensions root...
No modifications detected for re-loaded extension module transformer_inference, skipping build step...
Loading extension module transformer_inference...
Time to load transformer_inference op: 0.0038442611694335938 seconds
Loading 2 checkpoint shards: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:21<00:00,  9.94s/it]checkpoint loading time at rank 0: 21.33984684944153 sec
Loading 2 checkpoint shards: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:21<00:00, 10.67s/it]
Traceback (most recent call last):
  File "/data/xxx/projects/transformers-bloom-inference/bloom-inference-scripts/bloom-ds-inference.py", line 181, in <module>
    model = deepspeed.init_inference(
  File "/home/caojunzhi/anaconda3/envs/chatgpt/lib/python3.10/site-packages/deepspeed/__init__.py", line 311, in init_inference
    engine = InferenceEngine(model, config=ds_inference_config)
  File "/home/caojunzhi/anaconda3/envs/chatgpt/lib/python3.10/site-packages/deepspeed/inference/engine.py", line 129, in __init__
    self.module.to(device)
  File "/home/caojunzhi/anaconda3/envs/chatgpt/lib/python3.10/site-packages/transformers/modeling_utils.py", line 1749, in to
    return super().to(*args, **kwargs)
  File "/home/caojunzhi/anaconda3/envs/chatgpt/lib/python3.10/site-packages/torch/nn/modules/module.py", line 989, in to
    return self._apply(convert)
  File "/home/caojunzhi/anaconda3/envs/chatgpt/lib/python3.10/site-packages/torch/nn/modules/module.py", line 641, in _apply
    module._apply(fn)
  File "/home/caojunzhi/anaconda3/envs/chatgpt/lib/python3.10/site-packages/torch/nn/modules/module.py", line 641, in _apply
    module._apply(fn)
  File "/home/caojunzhi/anaconda3/envs/chatgpt/lib/python3.10/site-packages/torch/nn/modules/module.py", line 664, in _apply
    param_applied = fn(param)
  File "/home/caojunzhi/anaconda3/envs/chatgpt/lib/python3.10/site-packages/torch/nn/modules/module.py", line 987, in convert
    return t.to(device, dtype if t.is_floating_point() or t.is_complex() else None, non_blocking)
NotImplementedError: Cannot copy out of meta tensor; no data!
[2023-02-10 17:46:57,652] [INFO] [launch.py:318:sigkill_handler] Killing subprocess 25235
[2023-02-10 17:46:57,653] [ERROR] [launch.py:324:sigkill_handler] ['/home/caojunzhi/anaconda3/envs/chatgpt/bin/python', '-u', 'bloom-ds-inference.py', '--local_rank=0', '--name', 'bigscience/bloom-7b1'] exits with return code = 1

My main conda environment is:

accelerate               0.16.0
deepspeed                0.8.0
deepspeed-mii            0.0.2
huggingface-hub          0.12.0
tokenizers               0.12.1
torch                    1.13.1
transformers             4.26.0

My nvidia-smi info is:

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.141.03   Driver Version: 470.141.03   CUDA Version: 11.4     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla V100-PCIE...  Off  | 00000000:00:06.0 Off |                    0 |
| N/A   35C    P0    37W / 250W |   1253MiB / 16160MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  Tesla V100-PCIE...  Off  | 00000000:00:07.0 Off |                    0 |
| N/A   37C    P0    40W / 250W |   2411MiB / 16160MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   2  Tesla V100-PCIE...  Off  | 00000000:00:08.0 Off |                    0 |
| N/A   32C    P0    24W / 250W |      4MiB / 16160MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   3  Tesla V100-PCIE...  Off  | 00000000:00:09.0 Off |                    0 |
| N/A   33C    P0    24W / 250W |      4MiB / 16160MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

Can you help me solve this bug? Thank you very much!

@mayank31398
Collaborator

This is a bug in DeepSpeed. Can you report it there?
Also, FYI, DS-inference doesn't work with PyTorch 1.13.1 yet.
I would suggest falling back to 1.12.1.
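For example, in a CUDA 11.6 environment, pinning the matching wheel would look roughly like this (the index URL is the standard PyTorch wheel index; adjust the suffix to whatever CUDA build fits your setup):

pip install torch==1.12.1+cu116 --extra-index-url https://download.pytorch.org/whl/cu116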

@zdaoguang
Author

zdaoguang commented Feb 12, 2023

Thanks for your reply. When I downgraded torch to 1.12.1 and switched CUDA to the matching version (10.2.89), the previous error indeed disappeared, but a new one appeared, as shown in the log below.
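For reference, the torch build and the system toolkit can be cross-checked with something like:

python -c "import torch; print(torch.__version__, torch.version.cuda)"
nvcc --version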

[2023-02-12 10:19:51,085] [WARNING] [runner.py:186:fetch_hostfile] Unable to find hostfile, will proceed with training with local resources only.
[2023-02-12 10:19:51,252] [INFO] [runner.py:548:main] cmd = /usr/local/tools/Python-3.10.9/bin/python3.10 -u -m deepspeed.launcher.launch --world_info=eyJsb2NhbGhvc3QiOiBbMF19 --master_addr=127.0.0.1 --master_port=29500 --enable_each_rank_log=None bloom-ds-inference.py --name /home/zandaoguang/downloads/bloom-7b1
[2023-02-12 10:19:53,867] [INFO] [launch.py:142:main] WORLD INFO DICT: {'localhost': [0]}
[2023-02-12 10:19:53,868] [INFO] [launch.py:148:main] nnodes=1, num_local_procs=1, node_rank=0
[2023-02-12 10:19:53,868] [INFO] [launch.py:161:main] global_rank_mapping=defaultdict(<class 'list'>, {'localhost': [0]})
[2023-02-12 10:19:53,868] [INFO] [launch.py:162:main] dist_world_size=1
[2023-02-12 10:19:53,868] [INFO] [launch.py:164:main] Setting CUDA_VISIBLE_DEVICES=0
[2023-02-12 10:19:56,839] [INFO] [comm.py:657:init_distributed] Initializing TorchBackend in DeepSpeed with backend nccl
*** Loading the model /home/zandaoguang/downloads/bloom-7b1
[2023-02-12 10:20:01,592] [INFO] [logging.py:68:log_dist] [Rank 0] DeepSpeed info: version=0.8.0, git-hash=unknown, git-branch=unknown
[2023-02-12 10:20:01,594] [WARNING] [config_utils.py:67:_process_deprecated_field] Config parameter mp_size is deprecated use tensor_parallel.tp_size instead
[2023-02-12 10:20:01,594] [INFO] [logging.py:68:log_dist] [Rank 0] quantize_bits = 8 mlp_extra_grouping = False, quantize_groups = 1
Using /root/.cache/torch_extensions/py310_cu102 as PyTorch extensions root...
Detected CUDA files, patching ldflags
Emitting ninja build file /root/.cache/torch_extensions/py310_cu102/transformer_inference/build.ninja...
Building extension module transformer_inference...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
[1/4] //usr/local/cuda-10.2/bin/nvcc  -DTORCH_EXTENSION_NAME=transformer_inference -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE=\"_gcc\" -DPYBIND11_STDLIB=\"_libstdcpp\" -DPYBIND11_BUILD_ABI=\"_cxxabi1011\" -I/usr/local/tools/Python-3.10.9/lib/python3.10/site-packages/deepspeed/ops/csrc/transformer/inference/includes -I/usr/local/tools/Python-3.10.9/lib/python3.10/site-packages/deepspeed/ops/csrc/includes -isystem /usr/local/tools/Python-3.10.9/lib/python3.10/site-packages/torch/include -isystem /usr/local/tools/Python-3.10.9/lib/python3.10/site-packages/torch/include/torch/csrc/api/include -isystem /usr/local/tools/Python-3.10.9/lib/python3.10/site-packages/torch/include/TH -isystem /usr/local/tools/Python-3.10.9/lib/python3.10/site-packages/torch/include/THC -isystem //usr/local/cuda-10.2/include -isystem /usr/local/tools/Python-3.10.9/include/python3.10 -D_GLIBCXX_USE_CXX11_ABI=0 -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr -gencode=arch=compute_70,code=compute_70 -gencode=arch=compute_70,code=sm_70 --compiler-options '-fPIC' -O3 --use_fast_math -std=c++14 -U__CUDA_NO_HALF_OPERATORS__ -U__CUDA_NO_HALF_CONVERSIONS__ -U__CUDA_NO_HALF2_OPERATORS__ -gencode=arch=compute_70,code=sm_70 -gencode=arch=compute_70,code=compute_70 -c /usr/local/tools/Python-3.10.9/lib/python3.10/site-packages/deepspeed/ops/csrc/transformer/inference/csrc/gelu.cu -o gelu.cuda.o 
FAILED: gelu.cuda.o 
//usr/local/cuda-10.2/bin/nvcc  -DTORCH_EXTENSION_NAME=transformer_inference -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE=\"_gcc\" -DPYBIND11_STDLIB=\"_libstdcpp\" -DPYBIND11_BUILD_ABI=\"_cxxabi1011\" -I/usr/local/tools/Python-3.10.9/lib/python3.10/site-packages/deepspeed/ops/csrc/transformer/inference/includes -I/usr/local/tools/Python-3.10.9/lib/python3.10/site-packages/deepspeed/ops/csrc/includes -isystem /usr/local/tools/Python-3.10.9/lib/python3.10/site-packages/torch/include -isystem /usr/local/tools/Python-3.10.9/lib/python3.10/site-packages/torch/include/torch/csrc/api/include -isystem /usr/local/tools/Python-3.10.9/lib/python3.10/site-packages/torch/include/TH -isystem /usr/local/tools/Python-3.10.9/lib/python3.10/site-packages/torch/include/THC -isystem //usr/local/cuda-10.2/include -isystem /usr/local/tools/Python-3.10.9/include/python3.10 -D_GLIBCXX_USE_CXX11_ABI=0 -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr -gencode=arch=compute_70,code=compute_70 -gencode=arch=compute_70,code=sm_70 --compiler-options '-fPIC' -O3 --use_fast_math -std=c++14 -U__CUDA_NO_HALF_OPERATORS__ -U__CUDA_NO_HALF_CONVERSIONS__ -U__CUDA_NO_HALF2_OPERATORS__ -gencode=arch=compute_70,code=sm_70 -gencode=arch=compute_70,code=compute_70 -c /usr/local/tools/Python-3.10.9/lib/python3.10/site-packages/deepspeed/ops/csrc/transformer/inference/csrc/gelu.cu -o gelu.cuda.o 
/usr/local/tools/Python-3.10.9/lib/python3.10/site-packages/deepspeed/ops/csrc/includes/conversion_utils.h(268): error: identifier "__double2half" is undefined

1 error detected in the compilation of "/tmp/tmpxft_00006b7b_00000000-6_gelu.cpp1.ii".
[2/4] //usr/local/cuda-10.2/bin/nvcc  -DTORCH_EXTENSION_NAME=transformer_inference -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE=\"_gcc\" -DPYBIND11_STDLIB=\"_libstdcpp\" -DPYBIND11_BUILD_ABI=\"_cxxabi1011\" -I/usr/local/tools/Python-3.10.9/lib/python3.10/site-packages/deepspeed/ops/csrc/transformer/inference/includes -I/usr/local/tools/Python-3.10.9/lib/python3.10/site-packages/deepspeed/ops/csrc/includes -isystem /usr/local/tools/Python-3.10.9/lib/python3.10/site-packages/torch/include -isystem /usr/local/tools/Python-3.10.9/lib/python3.10/site-packages/torch/include/torch/csrc/api/include -isystem /usr/local/tools/Python-3.10.9/lib/python3.10/site-packages/torch/include/TH -isystem /usr/local/tools/Python-3.10.9/lib/python3.10/site-packages/torch/include/THC -isystem //usr/local/cuda-10.2/include -isystem /usr/local/tools/Python-3.10.9/include/python3.10 -D_GLIBCXX_USE_CXX11_ABI=0 -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr -gencode=arch=compute_70,code=compute_70 -gencode=arch=compute_70,code=sm_70 --compiler-options '-fPIC' -O3 --use_fast_math -std=c++14 -U__CUDA_NO_HALF_OPERATORS__ -U__CUDA_NO_HALF_CONVERSIONS__ -U__CUDA_NO_HALF2_OPERATORS__ -gencode=arch=compute_70,code=sm_70 -gencode=arch=compute_70,code=compute_70 -c /usr/local/tools/Python-3.10.9/lib/python3.10/site-packages/deepspeed/ops/csrc/transformer/inference/csrc/relu.cu -o relu.cuda.o 
FAILED: relu.cuda.o 
//usr/local/cuda-10.2/bin/nvcc  -DTORCH_EXTENSION_NAME=transformer_inference -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE=\"_gcc\" -DPYBIND11_STDLIB=\"_libstdcpp\" -DPYBIND11_BUILD_ABI=\"_cxxabi1011\" -I/usr/local/tools/Python-3.10.9/lib/python3.10/site-packages/deepspeed/ops/csrc/transformer/inference/includes -I/usr/local/tools/Python-3.10.9/lib/python3.10/site-packages/deepspeed/ops/csrc/includes -isystem /usr/local/tools/Python-3.10.9/lib/python3.10/site-packages/torch/include -isystem /usr/local/tools/Python-3.10.9/lib/python3.10/site-packages/torch/include/torch/csrc/api/include -isystem /usr/local/tools/Python-3.10.9/lib/python3.10/site-packages/torch/include/TH -isystem /usr/local/tools/Python-3.10.9/lib/python3.10/site-packages/torch/include/THC -isystem //usr/local/cuda-10.2/include -isystem /usr/local/tools/Python-3.10.9/include/python3.10 -D_GLIBCXX_USE_CXX11_ABI=0 -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr -gencode=arch=compute_70,code=compute_70 -gencode=arch=compute_70,code=sm_70 --compiler-options '-fPIC' -O3 --use_fast_math -std=c++14 -U__CUDA_NO_HALF_OPERATORS__ -U__CUDA_NO_HALF_CONVERSIONS__ -U__CUDA_NO_HALF2_OPERATORS__ -gencode=arch=compute_70,code=sm_70 -gencode=arch=compute_70,code=compute_70 -c /usr/local/tools/Python-3.10.9/lib/python3.10/site-packages/deepspeed/ops/csrc/transformer/inference/csrc/relu.cu -o relu.cuda.o 
/usr/local/tools/Python-3.10.9/lib/python3.10/site-packages/deepspeed/ops/csrc/includes/conversion_utils.h(268): error: identifier "__double2half" is undefined

1 error detected in the compilation of "/tmp/tmpxft_00006b7c_00000000-6_relu.cpp1.ii".
[3/4] //usr/local/cuda-10.2/bin/nvcc  -DTORCH_EXTENSION_NAME=transformer_inference -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE=\"_gcc\" -DPYBIND11_STDLIB=\"_libstdcpp\" -DPYBIND11_BUILD_ABI=\"_cxxabi1011\" -I/usr/local/tools/Python-3.10.9/lib/python3.10/site-packages/deepspeed/ops/csrc/transformer/inference/includes -I/usr/local/tools/Python-3.10.9/lib/python3.10/site-packages/deepspeed/ops/csrc/includes -isystem /usr/local/tools/Python-3.10.9/lib/python3.10/site-packages/torch/include -isystem /usr/local/tools/Python-3.10.9/lib/python3.10/site-packages/torch/include/torch/csrc/api/include -isystem /usr/local/tools/Python-3.10.9/lib/python3.10/site-packages/torch/include/TH -isystem /usr/local/tools/Python-3.10.9/lib/python3.10/site-packages/torch/include/THC -isystem //usr/local/cuda-10.2/include -isystem /usr/local/tools/Python-3.10.9/include/python3.10 -D_GLIBCXX_USE_CXX11_ABI=0 -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr -gencode=arch=compute_70,code=compute_70 -gencode=arch=compute_70,code=sm_70 --compiler-options '-fPIC' -O3 --use_fast_math -std=c++14 -U__CUDA_NO_HALF_OPERATORS__ -U__CUDA_NO_HALF_CONVERSIONS__ -U__CUDA_NO_HALF2_OPERATORS__ -gencode=arch=compute_70,code=sm_70 -gencode=arch=compute_70,code=compute_70 -c /usr/local/tools/Python-3.10.9/lib/python3.10/site-packages/deepspeed/ops/csrc/transformer/inference/csrc/layer_norm.cu -o layer_norm.cuda.o 
FAILED: layer_norm.cuda.o 
//usr/local/cuda-10.2/bin/nvcc  -DTORCH_EXTENSION_NAME=transformer_inference -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE=\"_gcc\" -DPYBIND11_STDLIB=\"_libstdcpp\" -DPYBIND11_BUILD_ABI=\"_cxxabi1011\" -I/usr/local/tools/Python-3.10.9/lib/python3.10/site-packages/deepspeed/ops/csrc/transformer/inference/includes -I/usr/local/tools/Python-3.10.9/lib/python3.10/site-packages/deepspeed/ops/csrc/includes -isystem /usr/local/tools/Python-3.10.9/lib/python3.10/site-packages/torch/include -isystem /usr/local/tools/Python-3.10.9/lib/python3.10/site-packages/torch/include/torch/csrc/api/include -isystem /usr/local/tools/Python-3.10.9/lib/python3.10/site-packages/torch/include/TH -isystem /usr/local/tools/Python-3.10.9/lib/python3.10/site-packages/torch/include/THC -isystem //usr/local/cuda-10.2/include -isystem /usr/local/tools/Python-3.10.9/include/python3.10 -D_GLIBCXX_USE_CXX11_ABI=0 -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr -gencode=arch=compute_70,code=compute_70 -gencode=arch=compute_70,code=sm_70 --compiler-options '-fPIC' -O3 --use_fast_math -std=c++14 -U__CUDA_NO_HALF_OPERATORS__ -U__CUDA_NO_HALF_CONVERSIONS__ -U__CUDA_NO_HALF2_OPERATORS__ -gencode=arch=compute_70,code=sm_70 -gencode=arch=compute_70,code=compute_70 -c /usr/local/tools/Python-3.10.9/lib/python3.10/site-packages/deepspeed/ops/csrc/transformer/inference/csrc/layer_norm.cu -o layer_norm.cuda.o 
/usr/local/tools/Python-3.10.9/lib/python3.10/site-packages/deepspeed/ops/csrc/includes/conversion_utils.h(268): error: identifier "__double2half" is undefined

/usr/local/tools/Python-3.10.9/lib/python3.10/site-packages/deepspeed/ops/csrc/transformer/inference/csrc/layer_norm.cu(59): warning: variable "residual_buffer" was declared but never referenced
          detected during:
            instantiation of "void fused_ln<T,unRoll,threadsPerGroup,maxThreads>(T *, const T *, const T *, const T *, float, int) [with T=__half, unRoll=1, threadsPerGroup=1, maxThreads=256]" 
(165): here
            instantiation of "void launch_fused_ln(T *, const T *, const T *, const T *, float, int, int, cudaStream_t) [with T=__half]" 
(191): here

/usr/local/tools/Python-3.10.9/lib/python3.10/site-packages/deepspeed/ops/csrc/transformer/inference/csrc/layer_norm.cu(60): warning: variable "bias_buffer" was declared but never referenced
          detected during:
            instantiation of "void fused_ln<T,unRoll,threadsPerGroup,maxThreads>(T *, const T *, const T *, const T *, float, int) [with T=__half, unRoll=1, threadsPerGroup=1, maxThreads=256]" 
(165): here
            instantiation of "void launch_fused_ln(T *, const T *, const T *, const T *, float, int, int, cudaStream_t) [with T=__half]" 
(191): here

/usr/local/tools/Python-3.10.9/lib/python3.10/site-packages/deepspeed/ops/csrc/includes/reduction_utils.h(520): error: class "cooperative_groups::__v1::thread_block_tile<32U>" has no member "meta_group_rank"
          detected during:
            instantiation of "void fused_ln<T,unRoll,threadsPerGroup,maxThreads>(T *, const T *, const T *, const T *, float, int) [with T=__half, unRoll=1, threadsPerGroup=1, maxThreads=256]" 
/usr/local/tools/Python-3.10.9/lib/python3.10/site-packages/deepspeed/ops/csrc/transformer/inference/csrc/layer_norm.cu(165): here
            instantiation of "void launch_fused_ln(T *, const T *, const T *, const T *, float, int, int, cudaStream_t) [with T=__half]" 
/usr/local/tools/Python-3.10.9/lib/python3.10/site-packages/deepspeed/ops/csrc/transformer/inference/csrc/layer_norm.cu(191): here

/usr/local/tools/Python-3.10.9/lib/python3.10/site-packages/deepspeed/ops/csrc/includes/reduction_utils.h(409): error: class "cooperative_groups::__v1::thread_block_tile<32U>" has no member "meta_group_size"
          detected during:
            instantiation of "void reduce::partitioned_block<Op,num_threads>(cooperative_groups::__v1::thread_block &, cooperative_groups::__v1::thread_block_tile<32U> &, float &) [with Op=reduce::ROpType::Add, num_threads=1]" 
/usr/local/tools/Python-3.10.9/lib/python3.10/site-packages/deepspeed/ops/csrc/transformer/inference/csrc/layer_norm.cu(72): here
            instantiation of "void fused_ln<T,unRoll,threadsPerGroup,maxThreads>(T *, const T *, const T *, const T *, float, int) [with T=__half, unRoll=1, threadsPerGroup=1, maxThreads=256]" 
/usr/local/tools/Python-3.10.9/lib/python3.10/site-packages/deepspeed/ops/csrc/transformer/inference/csrc/layer_norm.cu(165): here
            instantiation of "void launch_fused_ln(T *, const T *, const T *, const T *, float, int, int, cudaStream_t) [with T=__half]" 
/usr/local/tools/Python-3.10.9/lib/python3.10/site-packages/deepspeed/ops/csrc/transformer/inference/csrc/layer_norm.cu(191): here

/usr/local/tools/Python-3.10.9/lib/python3.10/site-packages/deepspeed/ops/csrc/includes/reduction_utils.h(414): error: class "cooperative_groups::__v1::thread_block_tile<32U>" has no member "meta_group_rank"
          detected during:
            instantiation of "void reduce::partitioned_block<Op,num_threads>(cooperative_groups::__v1::thread_block &, cooperative_groups::__v1::thread_block_tile<32U> &, float &) [with Op=reduce::ROpType::Add, num_threads=1]" 
/usr/local/tools/Python-3.10.9/lib/python3.10/site-packages/deepspeed/ops/csrc/transformer/inference/csrc/layer_norm.cu(72): here
            instantiation of "void fused_ln<T,unRoll,threadsPerGroup,maxThreads>(T *, const T *, const T *, const T *, float, int) [with T=__half, unRoll=1, threadsPerGroup=1, maxThreads=256]" 
/usr/local/tools/Python-3.10.9/lib/python3.10/site-packages/deepspeed/ops/csrc/transformer/inference/csrc/layer_norm.cu(165): here
            instantiation of "void launch_fused_ln(T *, const T *, const T *, const T *, float, int, int, cudaStream_t) [with T=__half]" 
/usr/local/tools/Python-3.10.9/lib/python3.10/site-packages/deepspeed/ops/csrc/transformer/inference/csrc/layer_norm.cu(191): here

/usr/local/tools/Python-3.10.9/lib/python3.10/site-packages/deepspeed/ops/csrc/includes/reduction_utils.h(421): error: class "cooperative_groups::__v1::thread_block_tile<32U>" has no member "meta_group_rank"
          detected during:
            instantiation of "void reduce::partitioned_block<Op,num_threads>(cooperative_groups::__v1::thread_block &, cooperative_groups::__v1::thread_block_tile<32U> &, float &) [with Op=reduce::ROpType::Add, num_threads=1]" 
/usr/local/tools/Python-3.10.9/lib/python3.10/site-packages/deepspeed/ops/csrc/transformer/inference/csrc/layer_norm.cu(72): here
            instantiation of "void fused_ln<T,unRoll,threadsPerGroup,maxThreads>(T *, const T *, const T *, const T *, float, int) [with T=__half, unRoll=1, threadsPerGroup=1, maxThreads=256]" 
/usr/local/tools/Python-3.10.9/lib/python3.10/site-packages/deepspeed/ops/csrc/transformer/inference/csrc/layer_norm.cu(165): here
            instantiation of "void launch_fused_ln(T *, const T *, const T *, const T *, float, int, int, cudaStream_t) [with T=__half]" 
/usr/local/tools/Python-3.10.9/lib/python3.10/site-packages/deepspeed/ops/csrc/transformer/inference/csrc/layer_norm.cu(191): here

/usr/local/tools/Python-3.10.9/lib/python3.10/site-packages/deepspeed/ops/csrc/includes/reduction_utils.h(422): error: class "cooperative_groups::__v1::thread_block_tile<32U>" has no member "meta_group_size"
          detected during:
            instantiation of "void reduce::partitioned_block<Op,num_threads>(cooperative_groups::__v1::thread_block &, cooperative_groups::__v1::thread_block_tile<32U> &, float &) [with Op=reduce::ROpType::Add, num_threads=1]" 
/usr/local/tools/Python-3.10.9/lib/python3.10/site-packages/deepspeed/ops/csrc/transformer/inference/csrc/layer_norm.cu(72): here
            instantiation of "void fused_ln<T,unRoll,threadsPerGroup,maxThreads>(T *, const T *, const T *, const T *, float, int) [with T=__half, unRoll=1, threadsPerGroup=1, maxThreads=256]" 
/usr/local/tools/Python-3.10.9/lib/python3.10/site-packages/deepspeed/ops/csrc/transformer/inference/csrc/layer_norm.cu(165): here
            instantiation of "void launch_fused_ln(T *, const T *, const T *, const T *, float, int, int, cudaStream_t) [with T=__half]" 
/usr/local/tools/Python-3.10.9/lib/python3.10/site-packages/deepspeed/ops/csrc/transformer/inference/csrc/layer_norm.cu(191): here

/usr/local/tools/Python-3.10.9/lib/python3.10/site-packages/deepspeed/ops/csrc/includes/reduction_utils.h(447): error: class "cooperative_groups::__v1::thread_block_tile<32U>" has no member "meta_group_rank"
          detected during:
            instantiation of "void reduce::partitioned_block<Op,num_threads>(cooperative_groups::__v1::thread_block &, cooperative_groups::__v1::thread_block_tile<32U> &, float &) [with Op=reduce::ROpType::Add, num_threads=1]" 
/usr/local/tools/Python-3.10.9/lib/python3.10/site-packages/deepspeed/ops/csrc/transformer/inference/csrc/layer_norm.cu(72): here
            instantiation of "void fused_ln<T,unRoll,threadsPerGroup,maxThreads>(T *, const T *, const T *, const T *, float, int) [with T=__half, unRoll=1, threadsPerGroup=1, maxThreads=256]" 
/usr/local/tools/Python-3.10.9/lib/python3.10/site-packages/deepspeed/ops/csrc/transformer/inference/csrc/layer_norm.cu(165): here
            instantiation of "void launch_fused_ln(T *, const T *, const T *, const T *, float, int, int, cudaStream_t) [with T=__half]" 
/usr/local/tools/Python-3.10.9/lib/python3.10/site-packages/deepspeed/ops/csrc/transformer/inference/csrc/layer_norm.cu(191): here

/usr/local/tools/Python-3.10.9/lib/python3.10/site-packages/deepspeed/ops/csrc/transformer/inference/csrc/layer_norm.cu(59): warning: variable "residual_buffer" was declared but never referenced
          detected during:
            instantiation of "void fused_ln<T,unRoll,threadsPerGroup,maxThreads>(T *, const T *, const T *, const T *, float, int) [with T=__half, unRoll=1, threadsPerGroup=2, maxThreads=256]" 
(167): here
            instantiation of "void launch_fused_ln(T *, const T *, const T *, const T *, float, int, int, cudaStream_t) [with T=__half]" 
(191): here

/usr/local/tools/Python-3.10.9/lib/python3.10/site-packages/deepspeed/ops/csrc/transformer/inference/csrc/layer_norm.cu(60): warning: variable "bias_buffer" was declared but never referenced
          detected during:
            instantiation of "void fused_ln<T,unRoll,threadsPerGroup,maxThreads>(T *, const T *, const T *, const T *, float, int) [with T=__half, unRoll=1, threadsPerGroup=2, maxThreads=256]" 
(167): here
            instantiation of "void launch_fused_ln(T *, const T *, const T *, const T *, float, int, int, cudaStream_t) [with T=__half]" 
(191): here

/usr/local/tools/Python-3.10.9/lib/python3.10/site-packages/deepspeed/ops/csrc/transformer/inference/csrc/layer_norm.cu(59): warning: variable "residual_buffer" was declared but never referenced
          detected during:
            instantiation of "void fused_ln<T,unRoll,threadsPerGroup,maxThreads>(T *, const T *, const T *, const T *, float, int) [with T=__half, unRoll=1, threadsPerGroup=4, maxThreads=256]" 
(169): here
            instantiation of "void launch_fused_ln(T *, const T *, const T *, const T *, float, int, int, cudaStream_t) [with T=__half]" 
(191): here

/usr/local/tools/Python-3.10.9/lib/python3.10/site-packages/deepspeed/ops/csrc/transformer/inference/csrc/layer_norm.cu(60): warning: variable "bias_buffer" was declared but never referenced
          detected during:
            instantiation of "void fused_ln<T,unRoll,threadsPerGroup,maxThreads>(T *, const T *, const T *, const T *, float, int) [with T=__half, unRoll=1, threadsPerGroup=4, maxThreads=256]" 
(169): here
            instantiation of "void launch_fused_ln(T *, const T *, const T *, const T *, float, int, int, cudaStream_t) [with T=__half]" 
(191): here

/usr/local/tools/Python-3.10.9/lib/python3.10/site-packages/deepspeed/ops/csrc/transformer/inference/csrc/layer_norm.cu(59): warning: variable "residual_buffer" was declared but never referenced
          detected during:
            instantiation of "void fused_ln<T,unRoll,threadsPerGroup,maxThreads>(T *, const T *, const T *, const T *, float, int) [with T=__half, unRoll=1, threadsPerGroup=8, maxThreads=256]" 
(171): here
            instantiation of "void launch_fused_ln(T *, const T *, const T *, const T *, float, int, int, cudaStream_t) [with T=__half]" 
(191): here

/usr/local/tools/Python-3.10.9/lib/python3.10/site-packages/deepspeed/ops/csrc/transformer/inference/csrc/layer_norm.cu(60): warning: variable "bias_buffer" was declared but never referenced
          detected during:
            instantiation of "void fused_ln<T,unRoll,threadsPerGroup,maxThreads>(T *, const T *, const T *, const T *, float, int) [with T=__half, unRoll=1, threadsPerGroup=8, maxThreads=256]" 
(171): here
            instantiation of "void launch_fused_ln(T *, const T *, const T *, const T *, float, int, int, cudaStream_t) [with T=__half]" 
(191): here

/usr/local/tools/Python-3.10.9/lib/python3.10/site-packages/deepspeed/ops/csrc/transformer/inference/csrc/layer_norm.cu(59): warning: variable "residual_buffer" was declared but never referenced
          detected during:
            instantiation of "void fused_ln<T,unRoll,threadsPerGroup,maxThreads>(T *, const T *, const T *, const T *, float, int) [with T=__half, unRoll=1, threadsPerGroup=16, maxThreads=256]" 
(173): here
            instantiation of "void launch_fused_ln(T *, const T *, const T *, const T *, float, int, int, cudaStream_t) [with T=__half]" 
(191): here

/usr/local/tools/Python-3.10.9/lib/python3.10/site-packages/deepspeed/ops/csrc/transformer/inference/csrc/layer_norm.cu(60): warning: variable "bias_buffer" was declared but never referenced
          detected during:
            instantiation of "void fused_ln<T,unRoll,threadsPerGroup,maxThreads>(T *, const T *, const T *, const T *, float, int) [with T=__half, unRoll=1, threadsPerGroup=16, maxThreads=256]" 
(173): here
            instantiation of "void launch_fused_ln(T *, const T *, const T *, const T *, float, int, int, cudaStream_t) [with T=__half]" 
(191): here

/usr/local/tools/Python-3.10.9/lib/python3.10/site-packages/deepspeed/ops/csrc/transformer/inference/csrc/layer_norm.cu(59): warning: variable "residual_buffer" was declared but never referenced
          detected during:
            instantiation of "void fused_ln<T,unRoll,threadsPerGroup,maxThreads>(T *, const T *, const T *, const T *, float, int) [with T=__half, unRoll=2, threadsPerGroup=256, maxThreads=256]" 
(178): here
            instantiation of "void launch_fused_ln(T *, const T *, const T *, const T *, float, int, int, cudaStream_t) [with T=__half]" 
(191): here

/usr/local/tools/Python-3.10.9/lib/python3.10/site-packages/deepspeed/ops/csrc/transformer/inference/csrc/layer_norm.cu(60): warning: variable "bias_buffer" was declared but never referenced
          detected during:
            instantiation of "void fused_ln<T,unRoll,threadsPerGroup,maxThreads>(T *, const T *, const T *, const T *, float, int) [with T=__half, unRoll=2, threadsPerGroup=256, maxThreads=256]" 
(178): here
            instantiation of "void launch_fused_ln(T *, const T *, const T *, const T *, float, int, int, cudaStream_t) [with T=__half]" 
(191): here

/usr/local/tools/Python-3.10.9/lib/python3.10/site-packages/deepspeed/ops/csrc/transformer/inference/csrc/layer_norm.cu(59): warning: variable "residual_buffer" was declared but never referenced
          detected during:
            instantiation of "void fused_ln<T,unRoll,threadsPerGroup,maxThreads>(T *, const T *, const T *, const T *, float, int) [with T=__half, unRoll=4, threadsPerGroup=256, maxThreads=256]" 
(181): here
            instantiation of "void launch_fused_ln(T *, const T *, const T *, const T *, float, int, int, cudaStream_t) [with T=__half]" 
(191): here

/usr/local/tools/Python-3.10.9/lib/python3.10/site-packages/deepspeed/ops/csrc/transformer/inference/csrc/layer_norm.cu(60): warning: variable "bias_buffer" was declared but never referenced
          detected during:
            instantiation of "void fused_ln<T,unRoll,threadsPerGroup,maxThreads>(T *, const T *, const T *, const T *, float, int) [with T=__half, unRoll=4, threadsPerGroup=256, maxThreads=256]" 
(181): here
            instantiation of "void launch_fused_ln(T *, const T *, const T *, const T *, float, int, int, cudaStream_t) [with T=__half]" 
(191): here

/usr/local/tools/Python-3.10.9/lib/python3.10/site-packages/deepspeed/ops/csrc/transformer/inference/csrc/layer_norm.cu(59): warning: variable "residual_buffer" was declared but never referenced
          detected during:
            instantiation of "void fused_ln<T,unRoll,threadsPerGroup,maxThreads>(T *, const T *, const T *, const T *, float, int) [with T=__half, unRoll=6, threadsPerGroup=256, maxThreads=256]" 
(184): here
            instantiation of "void launch_fused_ln(T *, const T *, const T *, const T *, float, int, int, cudaStream_t) [with T=__half]" 
(191): here

/usr/local/tools/Python-3.10.9/lib/python3.10/site-packages/deepspeed/ops/csrc/transformer/inference/csrc/layer_norm.cu(60): warning: variable "bias_buffer" was declared but never referenced
          detected during:
            instantiation of "void fused_ln<T,unRoll,threadsPerGroup,maxThreads>(T *, const T *, const T *, const T *, float, int) [with T=__half, unRoll=6, threadsPerGroup=256, maxThreads=256]" 
(184): here
            instantiation of "void launch_fused_ln(T *, const T *, const T *, const T *, float, int, int, cudaStream_t) [with T=__half]" 
(191): here

/usr/local/tools/Python-3.10.9/lib/python3.10/site-packages/deepspeed/ops/csrc/transformer/inference/csrc/layer_norm.cu(59): warning: variable "residual_buffer" was declared but never referenced
          detected during:
            instantiation of "void fused_ln<T,unRoll,threadsPerGroup,maxThreads>(T *, const T *, const T *, const T *, float, int) [with T=__half, unRoll=8, threadsPerGroup=256, maxThreads=256]" 
(187): here
            instantiation of "void launch_fused_ln(T *, const T *, const T *, const T *, float, int, int, cudaStream_t) [with T=__half]" 
(191): here

/usr/local/tools/Python-3.10.9/lib/python3.10/site-packages/deepspeed/ops/csrc/transformer/inference/csrc/layer_norm.cu(60): warning: variable "bias_buffer" was declared but never referenced
          detected during:
            instantiation of "void fused_ln<T,unRoll,threadsPerGroup,maxThreads>(T *, const T *, const T *, const T *, float, int) [with T=__half, unRoll=8, threadsPerGroup=256, maxThreads=256]" 
(187): here
            instantiation of "void launch_fused_ln(T *, const T *, const T *, const T *, float, int, int, cudaStream_t) [with T=__half]" 
(191): here

/usr/local/tools/Python-3.10.9/lib/python3.10/site-packages/deepspeed/ops/csrc/transformer/inference/csrc/layer_norm.cu(59): warning: variable "residual_buffer" was declared but never referenced
          detected during:
            instantiation of "void fused_ln<T,unRoll,threadsPerGroup,maxThreads>(T *, const T *, const T *, const T *, float, int) [with T=float, unRoll=1, threadsPerGroup=1, maxThreads=256]" 
(165): here
            instantiation of "void launch_fused_ln(T *, const T *, const T *, const T *, float, int, int, cudaStream_t) [with T=float]" 
(199): here

/usr/local/tools/Python-3.10.9/lib/python3.10/site-packages/deepspeed/ops/csrc/transformer/inference/csrc/layer_norm.cu(60): warning: variable "bias_buffer" was declared but never referenced
          detected during:
            instantiation of "void fused_ln<T,unRoll,threadsPerGroup,maxThreads>(T *, const T *, const T *, const T *, float, int) [with T=float, unRoll=1, threadsPerGroup=1, maxThreads=256]" 
(165): here
            instantiation of "void launch_fused_ln(T *, const T *, const T *, const T *, float, int, int, cudaStream_t) [with T=float]" 
(199): here

/usr/local/tools/Python-3.10.9/lib/python3.10/site-packages/deepspeed/ops/csrc/transformer/inference/csrc/layer_norm.cu(59): warning: variable "residual_buffer" was declared but never referenced
          detected during:
            instantiation of "void fused_ln<T,unRoll,threadsPerGroup,maxThreads>(T *, const T *, const T *, const T *, float, int) [with T=float, unRoll=1, threadsPerGroup=2, maxThreads=256]" 
(167): here
            instantiation of "void launch_fused_ln(T *, const T *, const T *, const T *, float, int, int, cudaStream_t) [with T=float]" 
(199): here

/usr/local/tools/Python-3.10.9/lib/python3.10/site-packages/deepspeed/ops/csrc/transformer/inference/csrc/layer_norm.cu(60): warning: variable "bias_buffer" was declared but never referenced
          detected during:
            instantiation of "void fused_ln<T,unRoll,threadsPerGroup,maxThreads>(T *, const T *, const T *, const T *, float, int) [with T=float, unRoll=1, threadsPerGroup=2, maxThreads=256]" 
(167): here
            instantiation of "void launch_fused_ln(T *, const T *, const T *, const T *, float, int, int, cudaStream_t) [with T=float]" 
(199): here

/usr/local/tools/Python-3.10.9/lib/python3.10/site-packages/deepspeed/ops/csrc/transformer/inference/csrc/layer_norm.cu(59): warning: variable "residual_buffer" was declared but never referenced
          detected during:
            instantiation of "void fused_ln<T,unRoll,threadsPerGroup,maxThreads>(T *, const T *, const T *, const T *, float, int) [with T=float, unRoll=1, threadsPerGroup=4, maxThreads=256]" 
(169): here
            instantiation of "void launch_fused_ln(T *, const T *, const T *, const T *, float, int, int, cudaStream_t) [with T=float]" 
(199): here

/usr/local/tools/Python-3.10.9/lib/python3.10/site-packages/deepspeed/ops/csrc/transformer/inference/csrc/layer_norm.cu(60): warning: variable "bias_buffer" was declared but never referenced
          detected during:
            instantiation of "void fused_ln<T,unRoll,threadsPerGroup,maxThreads>(T *, const T *, const T *, const T *, float, int) [with T=float, unRoll=1, threadsPerGroup=4, maxThreads=256]" 
(169): here
            instantiation of "void launch_fused_ln(T *, const T *, const T *, const T *, float, int, int, cudaStream_t) [with T=float]" 
(199): here

/usr/local/tools/Python-3.10.9/lib/python3.10/site-packages/deepspeed/ops/csrc/transformer/inference/csrc/layer_norm.cu(59): warning: variable "residual_buffer" was declared but never referenced
          detected during:
            instantiation of "void fused_ln<T,unRoll,threadsPerGroup,maxThreads>(T *, const T *, const T *, const T *, float, int) [with T=float, unRoll=1, threadsPerGroup=8, maxThreads=256]" 
(171): here
            instantiation of "void launch_fused_ln(T *, const T *, const T *, const T *, float, int, int, cudaStream_t) [with T=float]" 
(199): here

/usr/local/tools/Python-3.10.9/lib/python3.10/site-packages/deepspeed/ops/csrc/transformer/inference/csrc/layer_norm.cu(60): warning: variable "bias_buffer" was declared but never referenced
          detected during:
            instantiation of "void fused_ln<T,unRoll,threadsPerGroup,maxThreads>(T *, const T *, const T *, const T *, float, int) [with T=float, unRoll=1, threadsPerGroup=8, maxThreads=256]" 
(171): here
            instantiation of "void launch_fused_ln(T *, const T *, const T *, const T *, float, int, int, cudaStream_t) [with T=float]" 
(199): here

/usr/local/tools/Python-3.10.9/lib/python3.10/site-packages/deepspeed/ops/csrc/transformer/inference/csrc/layer_norm.cu(59): warning: variable "residual_buffer" was declared but never referenced
          detected during:
            instantiation of "void fused_ln<T,unRoll,threadsPerGroup,maxThreads>(T *, const T *, const T *, const T *, float, int) [with T=float, unRoll=1, threadsPerGroup=16, maxThreads=256]" 
(173): here
            instantiation of "void launch_fused_ln(T *, const T *, const T *, const T *, float, int, int, cudaStream_t) [with T=float]" 
(199): here

/usr/local/tools/Python-3.10.9/lib/python3.10/site-packages/deepspeed/ops/csrc/transformer/inference/csrc/layer_norm.cu(60): warning: variable "bias_buffer" was declared but never referenced
          detected during:
            instantiation of "void fused_ln<T,unRoll,threadsPerGroup,maxThreads>(T *, const T *, const T *, const T *, float, int) [with T=float, unRoll=1, threadsPerGroup=16, maxThreads=256]" 
(173): here
            instantiation of "void launch_fused_ln(T *, const T *, const T *, const T *, float, int, int, cudaStream_t) [with T=float]" 
(199): here

/usr/local/tools/Python-3.10.9/lib/python3.10/site-packages/deepspeed/ops/csrc/transformer/inference/csrc/layer_norm.cu(59): warning: variable "residual_buffer" was declared but never referenced
          detected during:
            instantiation of "void fused_ln<T,unRoll,threadsPerGroup,maxThreads>(T *, const T *, const T *, const T *, float, int) [with T=float, unRoll=4, threadsPerGroup=256, maxThreads=256]" 
(178): here
            instantiation of "void launch_fused_ln(T *, const T *, const T *, const T *, float, int, int, cudaStream_t) [with T=float]" 
(199): here

/usr/local/tools/Python-3.10.9/lib/python3.10/site-packages/deepspeed/ops/csrc/transformer/inference/csrc/layer_norm.cu(60): warning: variable "bias_buffer" was declared but never referenced
          detected during:
            instantiation of "void fused_ln<T,unRoll,threadsPerGroup,maxThreads>(T *, const T *, const T *, const T *, float, int) [with T=float, unRoll=4, threadsPerGroup=256, maxThreads=256]" 
(178): here
            instantiation of "void launch_fused_ln(T *, const T *, const T *, const T *, float, int, int, cudaStream_t) [with T=float]" 
(199): here

/usr/local/tools/Python-3.10.9/lib/python3.10/site-packages/deepspeed/ops/csrc/transformer/inference/csrc/layer_norm.cu(59): warning: variable "residual_buffer" was declared but never referenced
          detected during:
            instantiation of "void fused_ln<T,unRoll,threadsPerGroup,maxThreads>(T *, const T *, const T *, const T *, float, int) [with T=float, unRoll=8, threadsPerGroup=256, maxThreads=256]" 
(181): here
            instantiation of "void launch_fused_ln(T *, const T *, const T *, const T *, float, int, int, cudaStream_t) [with T=float]" 
(199): here

/usr/local/tools/Python-3.10.9/lib/python3.10/site-packages/deepspeed/ops/csrc/transformer/inference/csrc/layer_norm.cu(60): warning: variable "bias_buffer" was declared but never referenced
          detected during:
            instantiation of "void fused_ln<T,unRoll,threadsPerGroup,maxThreads>(T *, const T *, const T *, const T *, float, int) [with T=float, unRoll=8, threadsPerGroup=256, maxThreads=256]" 
(181): here
            instantiation of "void launch_fused_ln(T *, const T *, const T *, const T *, float, int, int, cudaStream_t) [with T=float]" 
(199): here

/usr/local/tools/Python-3.10.9/lib/python3.10/site-packages/deepspeed/ops/csrc/transformer/inference/csrc/layer_norm.cu(59): warning: variable "residual_buffer" was declared but never referenced
          detected during:
            instantiation of "void fused_ln<T,unRoll,threadsPerGroup,maxThreads>(T *, const T *, const T *, const T *, float, int) [with T=float, unRoll=12, threadsPerGroup=256, maxThreads=256]" 
(184): here
            instantiation of "void launch_fused_ln(T *, const T *, const T *, const T *, float, int, int, cudaStream_t) [with T=float]" 
(199): here

/usr/local/tools/Python-3.10.9/lib/python3.10/site-packages/deepspeed/ops/csrc/transformer/inference/csrc/layer_norm.cu(60): warning: variable "bias_buffer" was declared but never referenced
          detected during:
            instantiation of "void fused_ln<T,unRoll,threadsPerGroup,maxThreads>(T *, const T *, const T *, const T *, float, int) [with T=float, unRoll=12, threadsPerGroup=256, maxThreads=256]" 
(184): here
            instantiation of "void launch_fused_ln(T *, const T *, const T *, const T *, float, int, int, cudaStream_t) [with T=float]" 
(199): here

/usr/local/tools/Python-3.10.9/lib/python3.10/site-packages/deepspeed/ops/csrc/transformer/inference/csrc/layer_norm.cu(59): warning: variable "residual_buffer" was declared but never referenced
          detected during:
            instantiation of "void fused_ln<T,unRoll,threadsPerGroup,maxThreads>(T *, const T *, const T *, const T *, float, int) [with T=float, unRoll=16, threadsPerGroup=256, maxThreads=256]" 
(187): here
            instantiation of "void launch_fused_ln(T *, const T *, const T *, const T *, float, int, int, cudaStream_t) [with T=float]" 
(199): here

/usr/local/tools/Python-3.10.9/lib/python3.10/site-packages/deepspeed/ops/csrc/transformer/inference/csrc/layer_norm.cu(60): warning: variable "bias_buffer" was declared but never referenced
          detected during:
            instantiation of "void fused_ln<T,unRoll,threadsPerGroup,maxThreads>(T *, const T *, const T *, const T *, float, int) [with T=float, unRoll=16, threadsPerGroup=256, maxThreads=256]" 
(187): here
            instantiation of "void launch_fused_ln(T *, const T *, const T *, const T *, float, int, int, cudaStream_t) [with T=float]" 
(199): here

7 errors detected in the compilation of "/tmp/tmpxft_00006b7d_00000000-6_layer_norm.cpp1.ii".
ninja: build stopped: subcommand failed.
Traceback (most recent call last):
  File "/usr/local/tools/Python-3.10.9/lib/python3.10/site-packages/torch/utils/cpp_extension.py", line 1808, in _run_ninja_build
    subprocess.run(
  File "/usr/local/tools/Python-3.10.9/lib/python3.10/subprocess.py", line 526, in run
    raise CalledProcessError(retcode, process.args,
subprocess.CalledProcessError: Command '['ninja', '-v']' returned non-zero exit status 1.

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/data/zandaoguang/projects/transformers-bloom-inference/bloom-inference-scripts/bloom-ds-inference.py", line 183, in <module>
    model = deepspeed.init_inference(
  File "/usr/local/tools/Python-3.10.9/lib/python3.10/site-packages/deepspeed/__init__.py", line 311, in init_inference
    engine = InferenceEngine(model, config=ds_inference_config)
  File "/usr/local/tools/Python-3.10.9/lib/python3.10/site-packages/deepspeed/inference/engine.py", line 126, in __init__
    self._apply_injection_policy(config)
  File "/usr/local/tools/Python-3.10.9/lib/python3.10/site-packages/deepspeed/inference/engine.py", line 339, in _apply_injection_policy
    replace_transformer_layer(client_module,
  File "/usr/local/tools/Python-3.10.9/lib/python3.10/site-packages/deepspeed/module_inject/replace_module.py", line 792, in replace_transformer_layer
    replaced_module = replace_module(model=model,
  File "/usr/local/tools/Python-3.10.9/lib/python3.10/site-packages/deepspeed/module_inject/replace_module.py", line 1061, in replace_module
    replaced_module, _ = _replace_module(model, policy)
  File "/usr/local/tools/Python-3.10.9/lib/python3.10/site-packages/deepspeed/module_inject/replace_module.py", line 1088, in _replace_module
    _, layer_id = _replace_module(child, policies, layer_id=layer_id)
  File "/usr/local/tools/Python-3.10.9/lib/python3.10/site-packages/deepspeed/module_inject/replace_module.py", line 1088, in _replace_module
    _, layer_id = _replace_module(child, policies, layer_id=layer_id)
  File "/usr/local/tools/Python-3.10.9/lib/python3.10/site-packages/deepspeed/module_inject/replace_module.py", line 1078, in _replace_module
    replaced_module = policies[child.__class__][0](child,
  File "/usr/local/tools/Python-3.10.9/lib/python3.10/site-packages/deepspeed/module_inject/replace_module.py", line 782, in replace_fn
    new_module = replace_with_policy(child,
  File "/usr/local/tools/Python-3.10.9/lib/python3.10/site-packages/deepspeed/module_inject/replace_module.py", line 473, in replace_with_policy
    new_module = transformer_inference.DeepSpeedTransformerInference(
  File "/usr/local/tools/Python-3.10.9/lib/python3.10/site-packages/deepspeed/model_implementations/transformers/ds_transformer.py", line 53, in __init__
    inference_cuda_module = builder.load()
  File "/usr/local/tools/Python-3.10.9/lib/python3.10/site-packages/deepspeed/ops/op_builder/builder.py", line 462, in load
    return self.jit_load(verbose)
  File "/usr/local/tools/Python-3.10.9/lib/python3.10/site-packages/deepspeed/ops/op_builder/builder.py", line 497, in jit_load
    op_module = load(
  File "/usr/local/tools/Python-3.10.9/lib/python3.10/site-packages/torch/utils/cpp_extension.py", line 1202, in load
    return _jit_compile(
  File "/usr/local/tools/Python-3.10.9/lib/python3.10/site-packages/torch/utils/cpp_extension.py", line 1425, in _jit_compile
    _write_ninja_file_and_build_library(
  File "/usr/local/tools/Python-3.10.9/lib/python3.10/site-packages/torch/utils/cpp_extension.py", line 1537, in _write_ninja_file_and_build_library
    _run_ninja_build(
  File "/usr/local/tools/Python-3.10.9/lib/python3.10/site-packages/torch/utils/cpp_extension.py", line 1824, in _run_ninja_build
    raise RuntimeError(message) from e
RuntimeError: Error building extension 'transformer_inference'
[2023-02-12 10:20:03,879] [INFO] [launch.py:318:sigkill_handler] Killing subprocess 27397
[2023-02-12 10:20:03,880] [ERROR] [launch.py:324:sigkill_handler] ['/usr/local/tools/Python-3.10.9/bin/python3.10', '-u', 'bloom-ds-inference.py', '--local_rank=0', '--name', '/home/zandaoguang/downloads/bloom-7b1'] exits with return code = 1

My conda environment's package list (Python 3.10.9):

Package            Version
------------------ ----------
accelerate         0.16.0
certifi            2022.12.7
charset-normalizer 3.0.1
deepspeed          0.8.0
filelock           3.9.0
hjson              3.1.0
huggingface-hub    0.12.0
idna               3.4
ninja              1.11.1
numpy              1.24.2
packaging          23.0
pip                22.3.1
psutil             5.9.4
py-cpuinfo         9.0.0
pydantic           1.10.4
PyYAML             6.0
regex              2022.10.31
requests           2.28.2
setuptools         65.5.0
tokenizers         0.12.1
torch              1.12.1
tqdm               4.64.1
transformers       4.26.0
typing_extensions  4.4.0
urllib3            1.26.14

The nvcc -V result is:

nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2019 NVIDIA Corporation
Built on Wed_Oct_23_19:24:38_PDT_2019
Cuda compilation tools, release 10.2, V10.2.89

Can you help me solve it? Thanks.

@mayank31398
Collaborator

mayank31398 commented Feb 12, 2023

I am not really sure. I haven't seen this before, but it seems like CUDA is unable to compile some of the DeepSpeed kernels.
I am using CUDA 11.6 with 8x A100 80GB GPUs.
Can you try switching to CUDA 11.6?
If not, there is a Dockerfile that has been tested and works fine.

However, you will need to modify it a bit for the standalone script.
I am using it for the inference server.
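If a CUDA 11.6 toolkit is already installed somewhere on the machine, a rough sketch of pointing DeepSpeed's JIT build at it would be (the install path below is just an assumed location; adjust it to wherever the toolkit actually lives):

export CUDA_HOME=/usr/local/cuda-11.6   # assumed toolkit location
export PATH=$CUDA_HOME/bin:$PATH
export LD_LIBRARY_PATH=$CUDA_HOME/lib64:$LD_LIBRARY_PATH
nvcc --version                          # toolkit DeepSpeed will compile its ops with
python -c "import torch; print(torch.version.cuda)"   # CUDA version torch was built against

The two printed CUDA versions should line up for the kernel build to work (as the earlier log shows, DeepSpeed 0.8.0 tolerates minor-version differences within CUDA 11).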

@zdaoguang
Author

Actually, I can only use CUDA 10.2; with other CUDA versions I get the following error:

[2023-02-12 16:48:25,193] [WARNING] [runner.py:179:fetch_hostfile] Unable to find hostfile, will proceed with training with local resources only.
[2023-02-12 16:48:25,352] [INFO] [runner.py:508:main] cmd = /home/caojunzhi/anaconda3/envs/chatgpt/bin/python -u -m deepspeed.launcher.launch --world_info=eyJsb2NhbGhvc3QiOiBbMF19 --master_addr=127.0.0.1 --master_port=29500 bloom-ds-inference.py --name /home/zandaoguang/downloads/bloom-7b1
[2023-02-12 16:48:27,793] [INFO] [launch.py:142:main] WORLD INFO DICT: {'localhost': [0]}
[2023-02-12 16:48:27,793] [INFO] [launch.py:148:main] nnodes=1, num_local_procs=1, node_rank=0
[2023-02-12 16:48:27,793] [INFO] [launch.py:161:main] global_rank_mapping=defaultdict(<class 'list'>, {'localhost': [0]})
[2023-02-12 16:48:27,793] [INFO] [launch.py:162:main] dist_world_size=1
[2023-02-12 16:48:27,793] [INFO] [launch.py:164:main] Setting CUDA_VISIBLE_DEVICES=0
[2023-02-12 16:48:30,664] [INFO] [comm.py:633:init_distributed] Initializing TorchBackend in DeepSpeed with backend nccl
*** Loading the model /home/zandaoguang/downloads/bloom-7b1
[2023-02-12 16:48:35,960] [INFO] [logging.py:68:log_dist] [Rank 0] DeepSpeed info: version=0.7.6, git-hash=unknown, git-branch=unknown
[2023-02-12 16:48:35,963] [WARNING] [config_utils.py:67:_process_deprecated_field] Config parameter mp_size is deprecated use tensor_parallel.tp_size instead
[2023-02-12 16:48:35,963] [INFO] [logging.py:68:log_dist] [Rank 0] quantize_bits = 8 mlp_extra_grouping = False, quantize_groups = 1
Traceback (most recent call last):
  File "/data/zandaoguang/projects/transformers-bloom-inference/bloom-inference-scripts/bloom-ds-inference.py", line 183, in <module>
    model = deepspeed.init_inference(
  File "/home/caojunzhi/anaconda3/envs/chatgpt/lib/python3.10/site-packages/deepspeed/__init__.py", line 311, in init_inference
    engine = InferenceEngine(model, config=ds_inference_config)
  File "/home/caojunzhi/anaconda3/envs/chatgpt/lib/python3.10/site-packages/deepspeed/inference/engine.py", line 124, in __init__
    self._apply_injection_policy(config)
  File "/home/caojunzhi/anaconda3/envs/chatgpt/lib/python3.10/site-packages/deepspeed/inference/engine.py", line 349, in _apply_injection_policy
    replace_transformer_layer(client_module,
  File "/home/caojunzhi/anaconda3/envs/chatgpt/lib/python3.10/site-packages/deepspeed/module_inject/replace_module.py", line 881, in replace_transformer_layer
    replaced_module = replace_module(model=model,
  File "/home/caojunzhi/anaconda3/envs/chatgpt/lib/python3.10/site-packages/deepspeed/module_inject/replace_module.py", line 1139, in replace_module
    replaced_module, _ = _replace_module(model, policy)
  File "/home/caojunzhi/anaconda3/envs/chatgpt/lib/python3.10/site-packages/deepspeed/module_inject/replace_module.py", line 1166, in _replace_module
    _, layer_id = _replace_module(child, policies, layer_id=layer_id)
  File "/home/caojunzhi/anaconda3/envs/chatgpt/lib/python3.10/site-packages/deepspeed/module_inject/replace_module.py", line 1166, in _replace_module
    _, layer_id = _replace_module(child, policies, layer_id=layer_id)
  File "/home/caojunzhi/anaconda3/envs/chatgpt/lib/python3.10/site-packages/deepspeed/module_inject/replace_module.py", line 1156, in _replace_module
    replaced_module = policies[child.__class__][0](child,
  File "/home/caojunzhi/anaconda3/envs/chatgpt/lib/python3.10/site-packages/deepspeed/module_inject/replace_module.py", line 871, in replace_fn
    new_module = replace_with_policy(child,
  File "/home/caojunzhi/anaconda3/envs/chatgpt/lib/python3.10/site-packages/deepspeed/module_inject/replace_module.py", line 454, in replace_with_policy
    new_module = transformer_inference.DeepSpeedTransformerInference(
  File "/home/caojunzhi/anaconda3/envs/chatgpt/lib/python3.10/site-packages/deepspeed/model_implementations/transformers/ds_transformer.py", line 53, in __init__
    inference_cuda_module = builder.load()
  File "/home/caojunzhi/anaconda3/envs/chatgpt/lib/python3.10/site-packages/deepspeed/ops/op_builder/builder.py", line 459, in load
    return self.jit_load(verbose)
  File "/home/caojunzhi/anaconda3/envs/chatgpt/lib/python3.10/site-packages/deepspeed/ops/op_builder/builder.py", line 474, in jit_load
    assert_no_cuda_mismatch()
  File "/home/caojunzhi/anaconda3/envs/chatgpt/lib/python3.10/site-packages/deepspeed/ops/op_builder/builder.py", line 100, in assert_no_cuda_mismatch
    raise Exception(
Exception: Installed CUDA version 11.1 does not match the version torch was compiled with 10.2, unable to compile cuda/cpp extensions without a matching cuda version.
[2023-02-12 16:48:37,805] [INFO] [launch.py:318:sigkill_handler] Killing subprocess 15223
[2023-02-12 16:48:37,805] [ERROR] [launch.py:324:sigkill_handler] ['/home/caojunzhi/anaconda3/envs/chatgpt/bin/python', '-u', 'bloom-ds-inference.py', '--local_rank=0', '--name', '/home/zandaoguang/downloads/bloom-7b1'] exits with return code = 1

The pip list result is:

Package                  Version
------------------------ ----------
accelerate               0.15.0
aiohttp                  3.8.3
aiosignal                1.3.1
anyio                    3.6.2
asttokens                2.2.1
async-timeout            4.0.2
asyncio                  3.4.3
attrs                    22.2.0
backcall                 0.2.0
certifi                  2022.12.7
charset-normalizer       2.1.1
click                    8.1.3
comm                     0.1.2
datasets                 2.9.0
debugpy                  1.6.6
decorator                5.1.1
deepspeed                0.7.6
deepspeed-mii            0.0.4
dill                     0.3.6
executing                1.2.0
fastapi                  0.89.1
filelock                 3.9.0
Flask                    2.2.2
Flask-API                3.0.post1
Flask-Cors               3.0.10
frozenlist               1.3.3
fsspec                   2023.1.0
grpcio                   1.51.1
grpcio-tools             1.50.0
gunicorn                 20.1.0
h11                      0.14.0
hjson                    3.1.0
huggingface-hub          0.10.1
idna                     3.4
ipdb                     0.13.11
ipykernel                6.21.0
ipython                  8.9.0
itsdangerous             2.1.2
jedi                     0.18.2
Jinja2                   3.1.2
joblib                   1.2.0
jupyter_client           8.0.2
jupyter_core             5.2.0
MarkupSafe               2.1.2
matplotlib-inline        0.1.6
multidict                6.0.4
multiprocess             0.70.14
ninja                    1.11.1
numpy                    1.24.1
nvidia-cublas-cu11       11.10.3.66
nvidia-cuda-nvrtc-cu11   11.7.99
nvidia-cuda-runtime-cu11 11.7.99
nvidia-cudnn-cu11        8.5.0.96
packaging                23.0
pandas                   1.5.3
parso                    0.8.3
pexpect                  4.8.0
pickleshare              0.7.5
Pillow                   9.4.0
pip                      23.0
platformdirs             2.6.2
prompt-toolkit           3.0.36
protobuf                 4.21.12
psutil                   5.9.4
ptyprocess               0.7.0
pure-eval                0.2.2
py-cpuinfo               9.0.0
pyarrow                  11.0.0
pydantic                 1.10.2
Pygments                 2.14.0
python-dateutil          2.8.2
pytz                     2022.7.1
PyYAML                   6.0
pyzmq                    25.0.0
regex                    2022.10.31
requests                 2.28.2
responses                0.18.0
sacremoses               0.0.53
sentencepiece            0.1.97
setuptools               65.6.3
six                      1.16.0
sniffio                  1.3.0
stack-data               0.6.2
starlette                0.22.0
tokenizers               0.12.1
tomli                    2.0.1
torch                    1.12.1
torchvision              0.13.1
tornado                  6.2
tqdm                     4.64.1
traitlets                5.9.0
transformers             4.25.1
typing_extensions        4.4.0
urllib3                  1.26.14
uvicorn                  0.19.0
wcwidth                  0.2.6
Werkzeug                 2.2.2
wheel                    0.37.1
xxhash                   3.2.0
yarl                     1.8.2

The nvcc -V result is:

nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2020 NVIDIA Corporation
Built on Mon_Oct_12_20:09:46_PDT_2020
Cuda compilation tools, release 11.1, V11.1.105
Build cuda_11.1.TC455_06.29190527_0

The nvidia-smi result is:

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.161.03   Driver Version: 470.161.03   CUDA Version: 11.4     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla V100-PCIE...  Off  | 00000000:00:06.0 Off |                    0 |
| N/A   32C    P0    26W / 250W |      0MiB / 16160MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  Tesla V100-PCIE...  Off  | 00000000:00:07.0 Off |                    0 |
| N/A   33C    P0    28W / 250W |      0MiB / 16160MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   2  Tesla V100-PCIE...  Off  | 00000000:00:08.0 Off |                    0 |
| N/A   32C    P0    24W / 250W |      0MiB / 16160MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   3  Tesla V100-PCIE...  Off  | 00000000:00:09.0 Off |                    0 |
| N/A   33C    P0    24W / 250W |      0MiB / 16160MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

It feels like a version issue, but I have tried to make the versions match the ones in your Dockerfile. Have you encountered this problem before? Thank you again.

@mayank31398
Copy link
Collaborator

I think your environment has CUDA 11.1 installed, while torch was compiled with CUDA 10.2.
Can you install a torch build that matches your installed CUDA version?
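
For example, if you can move the system toolkit to CUDA 11.6 (the version the repo's Dockerfile targets), the matching torch install would be:

pip install torch==1.12.1+cu116 --extra-index-url https://download.pytorch.org/whl/cu116

The key point is that torch.version.cuda and the toolkit reported by nvcc -V agree, so DeepSpeed can JIT-compile its kernels.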

@wohenniubi

Hi @mayank31398, I ran into a similar issue when using the DeepSpeed framework to speed up BLOOM inference. Could you please take a look? Many thanks.

The cmd is shown below:
deepspeed --num_gpus 8 --module inference_server.benchmark --model_name bigscience/bloom --model_class AutoModelForCausalLM --dtype fp16 --deployment_framework ds_inference --benchmark_cycles 5

The log is listed as follows:

[root@7656ea32130c transformers-bloom-inference]# deepspeed --num_gpus 8 --module inference_server.benchmark --model_name bigscience/bloom --model_class AutoModelForCausalLM --dtype fp16 --deployment_framework ds_inference --benchmark_cycles 5
[2023-03-09 06:03:27,119] [WARNING] [runner.py:179:fetch_hostfile] Unable to find hostfile, will proceed with training with local resources only.
[2023-03-09 06:03:30,403] [INFO] [runner.py:508:main] cmd = /opt/conda/envs/inference/bin/python -u -m deepspeed.launcher.launch --world_info=eyJsb2NhbGhvc3QiOiBbMCwgMSwgMiwgMywgNCwgNSwgNiwgN119 --master_addr=127.0.0.1 --master_port=29500 --module inference_server.benchmark --model_name bigscience/bloom --model_class AutoModelForCausalLM --dtype fp16 --deployment_framework ds_inference --benchmark_cycles 5
[2023-03-09 06:03:32,070] [INFO] [launch.py:135:main] 0 NV_LIBNCCL_DEV_PACKAGE=libnccl-devel-2.12.10-1+cuda11.6
[2023-03-09 06:03:32,070] [INFO] [launch.py:135:main] 0 NCCL_VERSION=2.12.10
[2023-03-09 06:03:32,070] [INFO] [launch.py:135:main] 0 NV_LIBNCCL_PACKAGE_VERSION=2.12.10-1
[2023-03-09 06:03:32,070] [INFO] [launch.py:135:main] 0 NV_LIBNCCL_PACKAGE=libnccl-2.12.10-1+cuda11.6
[2023-03-09 06:03:32,070] [INFO] [launch.py:135:main] 0 NV_LIBNCCL_DEV_PACKAGE_NAME=libnccl-devel
[2023-03-09 06:03:32,070] [INFO] [launch.py:135:main] 0 NV_LIBNCCL_PACKAGE_NAME=libnccl
[2023-03-09 06:03:32,070] [INFO] [launch.py:135:main] 0 NV_LIBNCCL_VERSION=2.12.10
[2023-03-09 06:03:32,070] [INFO] [launch.py:135:main] 0 NV_LIBNCCL_DEV_PACKAGE_VERSION=2.12.10-1
[2023-03-09 06:03:32,070] [INFO] [launch.py:142:main] WORLD INFO DICT: {'localhost': [0, 1, 2, 3, 4, 5, 6, 7]}
[2023-03-09 06:03:32,070] [INFO] [launch.py:148:main] nnodes=1, num_local_procs=8, node_rank=0
[2023-03-09 06:03:32,070] [INFO] [launch.py:161:main] global_rank_mapping=defaultdict(<class 'list'>, {'localhost': [0, 1, 2, 3, 4, 5, 6, 7]})
[2023-03-09 06:03:32,070] [INFO] [launch.py:162:main] dist_world_size=8
[2023-03-09 06:03:32,070] [INFO] [launch.py:164:main] Setting CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7
[2023-03-09 06:03:34,840] [INFO] [comm.py:633:init_distributed] Initializing TorchBackend in DeepSpeed with backend nccl
Downloading (…)lve/main/config.json: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 573/573 [00:00<00:00, 41.8kB/s]
/cos/HF_cache/models--bigscience--bloom/snapshots/ea51bbb9a58423efb336e2d6c900a8b3dc64b2eb
[2023-03-09 06:03:44,806] [INFO] [logging.py:68:log_dist] [Rank 0] DeepSpeed info: version=0.7.6, git-hash=unknown, git-branch=unknown
[2023-03-09 06:03:44,807] [WARNING] [config_utils.py:67:_process_deprecated_field] Config parameter mp_size is deprecated use tensor_parallel.tp_size instead
[2023-03-09 06:03:44,807] [WARNING] [config_utils.py:67:_process_deprecated_field] Config parameter mp_size is deprecated use tensor_parallel.tp_size instead
[2023-03-09 06:03:44,808] [WARNING] [config_utils.py:67:_process_deprecated_field] Config parameter mp_size is deprecated use tensor_parallel.tp_size instead
[2023-03-09 06:03:44,808] [WARNING] [config_utils.py:67:_process_deprecated_field] Config parameter mp_size is deprecated use tensor_parallel.tp_size instead
[2023-03-09 06:03:44,809] [INFO] [logging.py:68:log_dist] [Rank 0] quantize_bits = 8 mlp_extra_grouping = False, quantize_groups = 1
[2023-03-09 06:03:44,808] [WARNING] [config_utils.py:67:_process_deprecated_field] Config parameter mp_size is deprecated use tensor_parallel.tp_size instead
[2023-03-09 06:03:44,808] [WARNING] [config_utils.py:67:_process_deprecated_field] Config parameter mp_size is deprecated use tensor_parallel.tp_size instead
[2023-03-09 06:03:44,809] [WARNING] [config_utils.py:67:_process_deprecated_field] Config parameter mp_size is deprecated use tensor_parallel.tp_size instead
[2023-03-09 06:03:44,809] [WARNING] [config_utils.py:67:_process_deprecated_field] Config parameter mp_size is deprecated use tensor_parallel.tp_size instead
Using /root/.cache/torch_extensions/py39_cu116 as PyTorch extensions root...
Creating extension directory /root/.cache/torch_extensions/py39_cu116/transformer_inference...
Using /root/.cache/torch_extensions/py39_cu116 as PyTorch extensions root...
Using /root/.cache/torch_extensions/py39_cu116 as PyTorch extensions root...
Using /root/.cache/torch_extensions/py39_cu116 as PyTorch extensions root...
Using /root/.cache/torch_extensions/py39_cu116 as PyTorch extensions root...
Using /root/.cache/torch_extensions/py39_cu116 as PyTorch extensions root...
Using /root/.cache/torch_extensions/py39_cu116 as PyTorch extensions root...
Using /root/.cache/torch_extensions/py39_cu116 as PyTorch extensions root...
Detected CUDA files, patching ldflags
Emitting ninja build file /root/.cache/torch_extensions/py39_cu116/transformer_inference/build.ninja...
Building extension module transformer_inference...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
[1/9] /usr/local/cuda/bin/nvcc  -DTORCH_EXTENSION_NAME=transformer_inference -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE=\"_gcc\" -DPYBIND11_STDLIB=\"_libstdcpp\" -DPYBIND11_BUILD_ABI=\"_cxxabi1011\" -I/opt/conda/envs/inference/lib/python3.9/site-packages/deepspeed/ops/csrc/transformer/inference/includes -I/opt/conda/envs/inference/lib/python3.9/site-packages/deepspeed/ops/csrc/includes -isystem /opt/conda/envs/inference/lib/python3.9/site-packages/torch/include -isystem /opt/conda/envs/inference/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -isystem /opt/conda/envs/inference/lib/python3.9/site-packages/torch/include/TH -isystem /opt/conda/envs/inference/lib/python3.9/site-packages/torch/include/THC -isystem /usr/local/cuda/include -isystem /opt/conda/envs/inference/include/python3.9 -D_GLIBCXX_USE_CXX11_ABI=0 -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr -gencode=arch=compute_80,code=compute_80 -gencode=arch=compute_80,code=sm_80 --compiler-options '-fPIC' -O3 --use_fast_math -std=c++14 -U__CUDA_NO_HALF_OPERATORS__ -U__CUDA_NO_HALF_CONVERSIONS__ -U__CUDA_NO_HALF2_OPERATORS__ -gencode=arch=compute_80,code=sm_80 -gencode=arch=compute_80,code=compute_80 -c /opt/conda/envs/inference/lib/python3.9/site-packages/deepspeed/ops/csrc/transformer/inference/csrc/dequantize.cu -o dequantize.cuda.o
[2/9] /usr/local/cuda/bin/nvcc  -DTORCH_EXTENSION_NAME=transformer_inference -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE=\"_gcc\" -DPYBIND11_STDLIB=\"_libstdcpp\" -DPYBIND11_BUILD_ABI=\"_cxxabi1011\" -I/opt/conda/envs/inference/lib/python3.9/site-packages/deepspeed/ops/csrc/transformer/inference/includes -I/opt/conda/envs/inference/lib/python3.9/site-packages/deepspeed/ops/csrc/includes -isystem /opt/conda/envs/inference/lib/python3.9/site-packages/torch/include -isystem /opt/conda/envs/inference/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -isystem /opt/conda/envs/inference/lib/python3.9/site-packages/torch/include/TH -isystem /opt/conda/envs/inference/lib/python3.9/site-packages/torch/include/THC -isystem /usr/local/cuda/include -isystem /opt/conda/envs/inference/include/python3.9 -D_GLIBCXX_USE_CXX11_ABI=0 -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr -gencode=arch=compute_80,code=compute_80 -gencode=arch=compute_80,code=sm_80 --compiler-options '-fPIC' -O3 --use_fast_math -std=c++14 -U__CUDA_NO_HALF_OPERATORS__ -U__CUDA_NO_HALF_CONVERSIONS__ -U__CUDA_NO_HALF2_OPERATORS__ -gencode=arch=compute_80,code=sm_80 -gencode=arch=compute_80,code=compute_80 -c /opt/conda/envs/inference/lib/python3.9/site-packages/deepspeed/ops/csrc/transformer/inference/csrc/relu.cu -o relu.cuda.o
[3/9] /usr/local/cuda/bin/nvcc  -DTORCH_EXTENSION_NAME=transformer_inference -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE=\"_gcc\" -DPYBIND11_STDLIB=\"_libstdcpp\" -DPYBIND11_BUILD_ABI=\"_cxxabi1011\" -I/opt/conda/envs/inference/lib/python3.9/site-packages/deepspeed/ops/csrc/transformer/inference/includes -I/opt/conda/envs/inference/lib/python3.9/site-packages/deepspeed/ops/csrc/includes -isystem /opt/conda/envs/inference/lib/python3.9/site-packages/torch/include -isystem /opt/conda/envs/inference/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -isystem /opt/conda/envs/inference/lib/python3.9/site-packages/torch/include/TH -isystem /opt/conda/envs/inference/lib/python3.9/site-packages/torch/include/THC -isystem /usr/local/cuda/include -isystem /opt/conda/envs/inference/include/python3.9 -D_GLIBCXX_USE_CXX11_ABI=0 -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr -gencode=arch=compute_80,code=compute_80 -gencode=arch=compute_80,code=sm_80 --compiler-options '-fPIC' -O3 --use_fast_math -std=c++14 -U__CUDA_NO_HALF_OPERATORS__ -U__CUDA_NO_HALF_CONVERSIONS__ -U__CUDA_NO_HALF2_OPERATORS__ -gencode=arch=compute_80,code=sm_80 -gencode=arch=compute_80,code=compute_80 -c /opt/conda/envs/inference/lib/python3.9/site-packages/deepspeed/ops/csrc/transformer/inference/csrc/transform.cu -o transform.cuda.o
/opt/conda/envs/inference/lib/python3.9/site-packages/deepspeed/ops/csrc/transformer/inference/csrc/transform.cu(56): warning #177-D: variable "lane" was declared but never referenced

/opt/conda/envs/inference/lib/python3.9/site-packages/deepspeed/ops/csrc/transformer/inference/csrc/transform.cu(95): warning #177-D: variable "half_dim" was declared but never referenced

/opt/conda/envs/inference/lib/python3.9/site-packages/deepspeed/ops/csrc/transformer/inference/csrc/transform.cu(112): warning #177-D: variable "vals_half" was declared but never referenced

/opt/conda/envs/inference/lib/python3.9/site-packages/deepspeed/ops/csrc/transformer/inference/csrc/transform.cu(113): warning #177-D: variable "output_half" was declared but never referenced

/opt/conda/envs/inference/lib/python3.9/site-packages/deepspeed/ops/csrc/transformer/inference/csrc/transform.cu(130): warning #177-D: variable "lane" was declared but never referenced

[4/9] /usr/local/cuda/bin/nvcc  -DTORCH_EXTENSION_NAME=transformer_inference -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE=\"_gcc\" -DPYBIND11_STDLIB=\"_libstdcpp\" -DPYBIND11_BUILD_ABI=\"_cxxabi1011\" -I/opt/conda/envs/inference/lib/python3.9/site-packages/deepspeed/ops/csrc/transformer/inference/includes -I/opt/conda/envs/inference/lib/python3.9/site-packages/deepspeed/ops/csrc/includes -isystem /opt/conda/envs/inference/lib/python3.9/site-packages/torch/include -isystem /opt/conda/envs/inference/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -isystem /opt/conda/envs/inference/lib/python3.9/site-packages/torch/include/TH -isystem /opt/conda/envs/inference/lib/python3.9/site-packages/torch/include/THC -isystem /usr/local/cuda/include -isystem /opt/conda/envs/inference/include/python3.9 -D_GLIBCXX_USE_CXX11_ABI=0 -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr -gencode=arch=compute_80,code=compute_80 -gencode=arch=compute_80,code=sm_80 --compiler-options '-fPIC' -O3 --use_fast_math -std=c++14 -U__CUDA_NO_HALF_OPERATORS__ -U__CUDA_NO_HALF_CONVERSIONS__ -U__CUDA_NO_HALF2_OPERATORS__ -gencode=arch=compute_80,code=sm_80 -gencode=arch=compute_80,code=compute_80 -c /opt/conda/envs/inference/lib/python3.9/site-packages/deepspeed/ops/csrc/transformer/inference/csrc/apply_rotary_pos_emb.cu -o apply_rotary_pos_emb.cuda.o
[5/9] /usr/local/cuda/bin/nvcc  -DTORCH_EXTENSION_NAME=transformer_inference -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE=\"_gcc\" -DPYBIND11_STDLIB=\"_libstdcpp\" -DPYBIND11_BUILD_ABI=\"_cxxabi1011\" -I/opt/conda/envs/inference/lib/python3.9/site-packages/deepspeed/ops/csrc/transformer/inference/includes -I/opt/conda/envs/inference/lib/python3.9/site-packages/deepspeed/ops/csrc/includes -isystem /opt/conda/envs/inference/lib/python3.9/site-packages/torch/include -isystem /opt/conda/envs/inference/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -isystem /opt/conda/envs/inference/lib/python3.9/site-packages/torch/include/TH -isystem /opt/conda/envs/inference/lib/python3.9/site-packages/torch/include/THC -isystem /usr/local/cuda/include -isystem /opt/conda/envs/inference/include/python3.9 -D_GLIBCXX_USE_CXX11_ABI=0 -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr -gencode=arch=compute_80,code=compute_80 -gencode=arch=compute_80,code=sm_80 --compiler-options '-fPIC' -O3 --use_fast_math -std=c++14 -U__CUDA_NO_HALF_OPERATORS__ -U__CUDA_NO_HALF_CONVERSIONS__ -U__CUDA_NO_HALF2_OPERATORS__ -gencode=arch=compute_80,code=sm_80 -gencode=arch=compute_80,code=compute_80 -c /opt/conda/envs/inference/lib/python3.9/site-packages/deepspeed/ops/csrc/transformer/inference/csrc/softmax.cu -o softmax.cuda.o
/opt/conda/envs/inference/lib/python3.9/site-packages/deepspeed/ops/csrc/transformer/inference/csrc/softmax.cu(275): warning #177-D: variable "alibi_offset" was declared but never referenced

/opt/conda/envs/inference/lib/python3.9/site-packages/deepspeed/ops/csrc/transformer/inference/csrc/softmax.cu(430): warning #177-D: variable "warp_num" was declared but never referenced

[6/9] /usr/local/cuda/bin/nvcc  -DTORCH_EXTENSION_NAME=transformer_inference -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE=\"_gcc\" -DPYBIND11_STDLIB=\"_libstdcpp\" -DPYBIND11_BUILD_ABI=\"_cxxabi1011\" -I/opt/conda/envs/inference/lib/python3.9/site-packages/deepspeed/ops/csrc/transformer/inference/includes -I/opt/conda/envs/inference/lib/python3.9/site-packages/deepspeed/ops/csrc/includes -isystem /opt/conda/envs/inference/lib/python3.9/site-packages/torch/include -isystem /opt/conda/envs/inference/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -isystem /opt/conda/envs/inference/lib/python3.9/site-packages/torch/include/TH -isystem /opt/conda/envs/inference/lib/python3.9/site-packages/torch/include/THC -isystem /usr/local/cuda/include -isystem /opt/conda/envs/inference/include/python3.9 -D_GLIBCXX_USE_CXX11_ABI=0 -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr -gencode=arch=compute_80,code=compute_80 -gencode=arch=compute_80,code=sm_80 --compiler-options '-fPIC' -O3 --use_fast_math -std=c++14 -U__CUDA_NO_HALF_OPERATORS__ -U__CUDA_NO_HALF_CONVERSIONS__ -U__CUDA_NO_HALF2_OPERATORS__ -gencode=arch=compute_80,code=sm_80 -gencode=arch=compute_80,code=compute_80 -c /opt/conda/envs/inference/lib/python3.9/site-packages/deepspeed/ops/csrc/transformer/inference/csrc/gelu.cu -o gelu.cuda.o
[7/9] /usr/local/cuda/bin/nvcc  -DTORCH_EXTENSION_NAME=transformer_inference -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE=\"_gcc\" -DPYBIND11_STDLIB=\"_libstdcpp\" -DPYBIND11_BUILD_ABI=\"_cxxabi1011\" -I/opt/conda/envs/inference/lib/python3.9/site-packages/deepspeed/ops/csrc/transformer/inference/includes -I/opt/conda/envs/inference/lib/python3.9/site-packages/deepspeed/ops/csrc/includes -isystem /opt/conda/envs/inference/lib/python3.9/site-packages/torch/include -isystem /opt/conda/envs/inference/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -isystem /opt/conda/envs/inference/lib/python3.9/site-packages/torch/include/TH -isystem /opt/conda/envs/inference/lib/python3.9/site-packages/torch/include/THC -isystem /usr/local/cuda/include -isystem /opt/conda/envs/inference/include/python3.9 -D_GLIBCXX_USE_CXX11_ABI=0 -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr -gencode=arch=compute_80,code=compute_80 -gencode=arch=compute_80,code=sm_80 --compiler-options '-fPIC' -O3 --use_fast_math -std=c++14 -U__CUDA_NO_HALF_OPERATORS__ -U__CUDA_NO_HALF_CONVERSIONS__ -U__CUDA_NO_HALF2_OPERATORS__ -gencode=arch=compute_80,code=sm_80 -gencode=arch=compute_80,code=compute_80 -c /opt/conda/envs/inference/lib/python3.9/site-packages/deepspeed/ops/csrc/transformer/inference/csrc/layer_norm.cu -o layer_norm.cuda.o
[8/9] c++ -MMD -MF pt_binding.o.d -DTORCH_EXTENSION_NAME=transformer_inference -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE=\"_gcc\" -DPYBIND11_STDLIB=\"_libstdcpp\" -DPYBIND11_BUILD_ABI=\"_cxxabi1011\" -I/opt/conda/envs/inference/lib/python3.9/site-packages/deepspeed/ops/csrc/transformer/inference/includes -I/opt/conda/envs/inference/lib/python3.9/site-packages/deepspeed/ops/csrc/includes -isystem /opt/conda/envs/inference/lib/python3.9/site-packages/torch/include -isystem /opt/conda/envs/inference/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -isystem /opt/conda/envs/inference/lib/python3.9/site-packages/torch/include/TH -isystem /opt/conda/envs/inference/lib/python3.9/site-packages/torch/include/THC -isystem /usr/local/cuda/include -isystem /opt/conda/envs/inference/include/python3.9 -D_GLIBCXX_USE_CXX11_ABI=0 -fPIC -std=c++14 -O3 -std=c++14 -g -Wno-reorder -c /opt/conda/envs/inference/lib/python3.9/site-packages/deepspeed/ops/csrc/transformer/inference/csrc/pt_binding.cpp -o pt_binding.o
[9/9] c++ pt_binding.o gelu.cuda.o relu.cuda.o layer_norm.cuda.o softmax.cuda.o dequantize.cuda.o apply_rotary_pos_emb.cuda.o transform.cuda.o -shared -lcurand -L/opt/conda/envs/inference/lib/python3.9/site-packages/torch/lib -lc10 -lc10_cuda -ltorch_cpu -ltorch_cuda_cu -ltorch_cuda_cpp -ltorch -ltorch_python -L/usr/local/cuda/lib64 -lcudart -o transformer_inference.so
Loading extension module transformer_inference...
Time to load transformer_inference op: 24.044259786605835 seconds
Loading extension module transformer_inference...
Loading extension module transformer_inference...
Time to load transformer_inference op: 24.013221502304077 seconds
Loading extension module transformer_inference...
Time to load transformer_inference op: 23.90252995491028 seconds
Loading extension module transformer_inference...
Time to load transformer_inference op: 24.007809162139893 seconds
Time to load transformer_inference op: 24.017361402511597 seconds
Loading extension module transformer_inference...
Time to load transformer_inference op: 23.906622886657715 seconds
Loading extension module transformer_inference...
Loading extension module transformer_inference...
Time to load transformer_inference op: 24.007588863372803 seconds
Time to load transformer_inference op: 24.015749216079712 seconds
[2023-03-09 06:04:09,565] [INFO] [logging.py:68:log_dist] [Rank 0] DeepSpeed-Inference config: {'layer_id': 0, 'hidden_size': 14336, 'intermediate_size': 57344, 'heads': 112, 'num_hidden_layers': -1, 'fp16': True, 'pre_layer_norm': True, 'local_rank': -1, 'stochastic_mode': False, 'epsilon': 1e-05, 'mp_size': 8, 'q_int8': False, 'scale_attention': True, 'triangular_masking': True, 'local_attention': False, 'window_size': 1, 'rotary_dim': -1, 'rotate_half': False, 'rotate_every_two': True, 'return_tuple': True, 'mlp_after_attn': True, 'mlp_act_func_type': <ActivationFuncType.GELU: 1>, 'specialized_mode': False, 'training_mp_size': 1, 'bigscience_bloom': True, 'max_out_tokens': 1024, 'scale_attn_by_inverse_layer_idx': False}
Using /root/.cache/torch_extensions/py39_cu116 as PyTorch extensions root...
No modifications detected for re-loaded extension module transformer_inference, skipping build step...
Loading extension module transformer_inference...
Time to load transformer_inference op: 0.06393146514892578 seconds
Using /root/.cache/torch_extensions/py39_cu116 as PyTorch extensions root...
No modifications detected for re-loaded extension module transformer_inference, skipping build step...
Loading extension module transformer_inference...
Using /root/.cache/torch_extensions/py39_cu116 as PyTorch extensions root...Time to load transformer_inference op: 0.061557769775390625 seconds

No modifications detected for re-loaded extension module transformer_inference, skipping build step...
Loading extension module transformer_inference...
Time to load transformer_inference op: 0.061757564544677734 seconds
Using /root/.cache/torch_extensions/py39_cu116 as PyTorch extensions root...
Using /root/.cache/torch_extensions/py39_cu116 as PyTorch extensions root...
No modifications detected for re-loaded extension module transformer_inference, skipping build step...
Loading extension module transformer_inference...
Time to load transformer_inference op: 0.06235527992248535 seconds
No modifications detected for re-loaded extension module transformer_inference, skipping build step...
Loading extension module transformer_inference...
Time to load transformer_inference op: 0.06160426139831543 seconds
Using /root/.cache/torch_extensions/py39_cu116 as PyTorch extensions root...
No modifications detected for re-loaded extension module transformer_inference, skipping build step...
Loading extension module transformer_inference...
Time to load transformer_inference op: 0.06882047653198242 seconds
Using /root/.cache/torch_extensions/py39_cu116 as PyTorch extensions root...
No modifications detected for re-loaded extension module transformer_inference, skipping build step...
Loading extension module transformer_inference...
Time to load transformer_inference op: 0.06495046615600586 seconds
Using /root/.cache/torch_extensions/py39_cu116 as PyTorch extensions root...
No modifications detected for re-loaded extension module transformer_inference, skipping build step...
Loading extension module transformer_inference...
Time to load transformer_inference op: 0.07005953788757324 seconds
Using /root/.cache/torch_extensions/py39_cu116 as PyTorch extensions root...
No modifications detected for re-loaded extension module transformer_inference, skipping build step...
Loading extension module transformer_inference...
Time to load transformer_inference op: 0.05634450912475586 seconds
Using /root/.cache/torch_extensions/py39_cu116 as PyTorch extensions root...
No modifications detected for re-loaded extension module transformer_inference, skipping build step...
Loading extension module transformer_inference...
Time to load transformer_inference op: 0.05931544303894043 seconds
Using /root/.cache/torch_extensions/py39_cu116 as PyTorch extensions root...
No modifications detected for re-loaded extension module transformer_inference, skipping build step...
Loading extension module transformer_inference...
Time to load transformer_inference op: 0.06092071533203125 seconds
Using /root/.cache/torch_extensions/py39_cu116 as PyTorch extensions root...
No modifications detected for re-loaded extension module transformer_inference, skipping build step...
Loading extension module transformer_inference...
Time to load transformer_inference op: 0.05466651916503906 seconds
Using /root/.cache/torch_extensions/py39_cu116 as PyTorch extensions root...
No modifications detected for re-loaded extension module transformer_inference, skipping build step...
Loading extension module transformer_inference...
Time to load transformer_inference op: 0.058559417724609375 seconds
Using /root/.cache/torch_extensions/py39_cu116 as PyTorch extensions root...
No modifications detected for re-loaded extension module transformer_inference, skipping build step...
Loading extension module transformer_inference...
Time to load transformer_inference op: 0.05735135078430176 seconds
Using /root/.cache/torch_extensions/py39_cu116 as PyTorch extensions root...
No modifications detected for re-loaded extension module transformer_inference, skipping build step...
Loading extension module transformer_inference...
Time to load transformer_inference op: 0.05769968032836914 seconds
Using /root/.cache/torch_extensions/py39_cu116 as PyTorch extensions root...
No modifications detected for re-loaded extension module transformer_inference, skipping build step...
Loading extension module transformer_inference...
Time to load transformer_inference op: 0.06432437896728516 seconds
Loading 0 checkpoint shards: 0it [00:00, ?it/s]checkpoint loading time at rank 6: 0.0035653114318847656 sec
Loading 0 checkpoint shards: 0it [00:00, ?it/s]
Loading 0 checkpoint shards: 0it [00:00, ?it/s]checkpoint loading time at rank 4: 0.0038051605224609375 sec
Loading 0 checkpoint shards: 0it [00:00, ?it/s]
Loading 0 checkpoint shards: 0it [00:00, ?it/s]checkpoint loading time at rank 3: 0.0014710426330566406 sec
Loading 0 checkpoint shards: 0it [00:00, ?it/s]
Traceback (most recent call last):
  File "/opt/conda/envs/inference/lib/python3.9/runpy.py", line 197, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/opt/conda/envs/inference/lib/python3.9/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/nfs/users1/usera/test/transformers-bloom-inference/inference_server/benchmark.py", line 119, in <module>
    main()
  File "/nfs/users1/usera/test/transformers-bloom-inference/inference_server/benchmark.py", line 115, in main
    benchmark_end_to_end(args)
  File "/nfs/users1/usera/test/transformers-bloom-inference/inference_server/benchmark.py", line 48, in benchmark_end_to_end
    model, initialization_time = run_and_log_time(partial(ModelDeployment, args=args, grpc_allowed=False))
  File "/nfs/users1/usera/test/transformers-bloom-inference/inference_server/utils/utils.py", line 152, in run_and_log_time
Loading 0 checkpoint shards: 0it [00:00, ?it/s]checkpoint loading time at rank 7: 0.002664327621459961 sec
Loading 0 checkpoint shards: 0it [00:00, ?it/s]
    results = execs()
  File "/nfs/users1/usera/test/transformers-bloom-inference/inference_server/model_handler/deployment.py", line 54, in __init__
    self.model = get_model_class(args.deployment_framework)(args)
  File "/nfs/users1/usera/test/transformers-bloom-inference/inference_server/models/ds_inference.py", line 53, in __init__
Traceback (most recent call last):
  File "/opt/conda/envs/inference/lib/python3.9/runpy.py", line 197, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/opt/conda/envs/inference/lib/python3.9/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/nfs/users1/usera/test/transformers-bloom-inference/inference_server/benchmark.py", line 119, in <module>
    main()
  File "/nfs/users1/usera/test/transformers-bloom-inference/inference_server/benchmark.py", line 115, in main
    benchmark_end_to_end(args)
  File "/nfs/users1/usera/test/transformers-bloom-inference/inference_server/benchmark.py", line 48, in benchmark_end_to_end
    model, initialization_time = run_and_log_time(partial(ModelDeployment, args=args, grpc_allowed=False))
  File "/nfs/users1/usera/test/transformers-bloom-inference/inference_server/utils/utils.py", line 152, in run_and_log_time
    results = execs()
  File "/nfs/users1/usera/test/transformers-bloom-inference/inference_server/model_handler/deployment.py", line 54, in __init__
    self.model = get_model_class(args.deployment_framework)(args)
  File "/nfs/users1/usera/test/transformers-bloom-inference/inference_server/models/ds_inference.py", line 53, in __init__
    self.model = deepspeed.init_inference(
  File "/opt/conda/envs/inference/lib/python3.9/site-packages/deepspeed/__init__.py", line 311, in init_inference
        self.model = deepspeed.init_inference(engine = InferenceEngine(model, config=ds_inference_config)

  File "/opt/conda/envs/inference/lib/python3.9/site-packages/deepspeed/inference/engine.py", line 127, in __init__
  File "/opt/conda/envs/inference/lib/python3.9/site-packages/deepspeed/__init__.py", line 311, in init_inference
    self.module.to(device)
  File "/opt/conda/envs/inference/lib/python3.9/site-packages/transformers/modeling_utils.py", line 1749, in to
    engine = InferenceEngine(model, config=ds_inference_config)
  File "/opt/conda/envs/inference/lib/python3.9/site-packages/deepspeed/inference/engine.py", line 127, in __init__
    self.module.to(device)
  File "/opt/conda/envs/inference/lib/python3.9/site-packages/transformers/modeling_utils.py", line 1749, in to
    return super().to(*args, **kwargs)
  File "/opt/conda/envs/inference/lib/python3.9/site-packages/torch/nn/modules/module.py", line 927, in to
    return super().to(*args, **kwargs)
  File "/opt/conda/envs/inference/lib/python3.9/site-packages/torch/nn/modules/module.py", line 927, in to
    return self._apply(convert)
  File "/opt/conda/envs/inference/lib/python3.9/site-packages/torch/nn/modules/module.py", line 579, in _apply
    return self._apply(convert)
  File "/opt/conda/envs/inference/lib/python3.9/site-packages/torch/nn/modules/module.py", line 579, in _apply
    module._apply(fn)
  File "/opt/conda/envs/inference/lib/python3.9/site-packages/torch/nn/modules/module.py", line 579, in _apply
    module._apply(fn)
  File "/opt/conda/envs/inference/lib/python3.9/site-packages/torch/nn/modules/module.py", line 579, in _apply
    module._apply(fn)
  File "/opt/conda/envs/inference/lib/python3.9/site-packages/torch/nn/modules/module.py", line 602, in _apply
    param_applied = fn(param)
  File "/opt/conda/envs/inference/lib/python3.9/site-packages/torch/nn/modules/module.py", line 925, in convert
    module._apply(fn)
  File "/opt/conda/envs/inference/lib/python3.9/site-packages/torch/nn/modules/module.py", line 602, in _apply
    return t.to(device, dtype if t.is_floating_point() or t.is_complex() else None, non_blocking)
NotImplementedError: Cannot copy out of meta tensor; no data!

Here, I use the docker image generated by the Dockerfile from https://github.com/huggingface/transformers-bloom-inference/blob/main/Dockerfile. The pip list shows

Package            Version
------------------ ------------
accelerate         0.16.0
anyio              3.6.2
certifi            2022.12.7
charset-normalizer 3.1.0
click              8.1.3
deepspeed          0.7.6
fastapi            0.89.1
filelock           3.9.0
Flask              2.2.3
Flask-API          3.0.post1
grpcio             1.51.3
grpcio-tools       1.50.0
gunicorn           20.1.0
h11                0.14.0
hjson              3.1.0
huggingface-hub    0.12.1
idna               3.4
importlib-metadata 6.0.0
itsdangerous       2.1.2
Jinja2             3.1.2
MarkupSafe         2.1.2
ninja              1.11.1
numpy              1.24.2
packaging          23.0
pip                23.0.1
protobuf           4.22.1
psutil             5.9.4
py-cpuinfo         9.0.0
pydantic           1.10.2
PyYAML             6.0
regex              2022.10.31
requests           2.28.2
setuptools         65.6.3
sniffio            1.3.0
starlette          0.22.0
tokenizers         0.13.2
torch              1.12.1+cu116
tqdm               4.65.0
transformers       4.26.1
typing_extensions  4.5.0
urllib3            1.26.14
uvicorn            0.19.0
Werkzeug           2.2.3
wheel              0.38.4
zipp               3.15.0

The nvidia-smi shows:

 +-----------------------------------------------------------------------------+
| NVIDIA-SMI 510.108.03   Driver Version: 510.108.03   CUDA Version: 11.6     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA A100-SXM...  On   | 00000000:27:00.0 Off |                    0 |
| N/A   32C    P0    69W / 400W |     35MiB / 81920MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   1  NVIDIA A100-SXM...  On   | 00000000:2A:00.0 Off |                    0 |
| N/A   29C    P0    66W / 400W |     35MiB / 81920MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   2  NVIDIA A100-SXM...  On   | 00000000:51:00.0 Off |                    0 |
| N/A   31C    P0    69W / 400W |     35MiB / 81920MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   3  NVIDIA A100-SXM...  On   | 00000000:57:00.0 Off |                    0 |
| N/A   33C    P0    63W / 400W |     35MiB / 81920MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   4  NVIDIA A100-SXM...  On   | 00000000:9E:00.0 Off |                    0 |
| N/A   32C    P0    65W / 400W |     35MiB / 81920MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   5  NVIDIA A100-SXM...  On   | 00000000:A4:00.0 Off |                    0 |
| N/A   30C    P0    63W / 400W |     35MiB / 81920MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   6  NVIDIA A100-SXM...  On   | 00000000:C7:00.0 Off |                    0 |
| N/A   29C    P0    64W / 400W |     35MiB / 81920MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   7  NVIDIA A100-SXM...  On   | 00000000:CA:00.0 Off |                    0 |
| N/A   32C    P0    66W / 400W |     35MiB / 81920MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+

The nvcc -V shows:

nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2019 NVIDIA Corporation
Built on Sun_Jul_28_19:07:16_PDT_2019
Cuda compilation tools, release 10.1, V10.1.243

@mayank31398
Collaborator

The dockerfile works out of the box.
Can you give it a shot?

@wohenniubi

Many thanks for your prompt response, @mayank31398.

The dockerfile is as follows:

root@super-klb:~/test/transformers-bloom-inference-GPU# cat Dockerfile
FROM nvidia/cuda:11.6.1-devel-ubi8 as base

RUN dnf install -y --disableplugin=subscription-manager make git && dnf clean all --disableplugin=subscription-manager

# taken from pytorch's dockerfile
RUN curl -L -o ./miniconda.sh -O https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh && \
    chmod +x ./miniconda.sh && \
    ./miniconda.sh -b -p /opt/conda && \
    rm ./miniconda.sh

ENV PYTHON_VERSION=3.9 \
    PATH=/opt/conda/envs/inference/bin:/opt/conda/bin:${PATH}

# create conda env
RUN conda create -n inference python=${PYTHON_VERSION} pip -y

# change shell to activate env
SHELL ["conda", "run", "-n", "inference", "/bin/bash", "-c"]

FROM base as conda

# update conda
RUN conda update -n base -c defaults conda -y
# cmake
RUN conda install -c anaconda cmake -y

# necessary stuff
RUN pip install torch==1.12.1+cu116 --extra-index-url https://download.pytorch.org/whl/cu116 \
    transformers==4.26.1 \
    deepspeed==0.7.6 \
    accelerate==0.16.0 \
    gunicorn==20.1.0 \
    flask \
    flask_api \
    fastapi==0.89.1 \
    uvicorn==0.19.0 \
    jinja2==3.1.2 \
    pydantic==1.10.2 \
    huggingface_hub==0.12.1 \
    grpcio-tools==1.50.0 \
    --no-cache-dir

# clean conda env
RUN conda clean -ya

# change this as you like 🤗
ENV TRANSFORMERS_CACHE=/cos/HF_cache \
    HUGGINGFACE_HUB_CACHE=${TRANSFORMERS_CACHE}

FROM conda as app

WORKDIR /src
RUN chmod -R g+w /src

RUN mkdir /.cache && \
    chmod -R g+w /.cache

ENV PORT=5000 \
    UI_PORT=5001
EXPOSE ${PORT}
EXPOSE ${UI_PORT}

#CMD git clone https://github.com/huggingface/transformers-bloom-inference.git && \
#    cd transformers-bloom-inference && \
#    # install grpc and compile protos
#    make gen-proto && \
#    make bloom-560m

I simply comment out the last 5 lines and run them manually inside the container (to avoid re-cloning the repo every time I docker exec into the created instance from another terminal).
Specifically, here are my steps to build the docker image and launch the instance:

git clone https://github.com/huggingface/transformers-bloom-inference transformers-bloom-inference-GPU
cd transformers-bloom-inference-GPU
# comment out the last 5 lines of the Dockerfile as mentioned above
docker build -t transformers-bloom:v1.0 .
docker run --gpus all -it --name="bloom" -v /nfs/users/test:/nfs/users/test -w /nfs/users/test transformers-bloom:v1.0

Then, inside the container, I run make bloom-176b, launch the benchmark, and hit the NotImplementedError: Cannot copy out of meta tensor; no data!

git clone https://github.com/huggingface/transformers-bloom-inference
cd transformers-bloom-inference && \
    # install grpc and compile protos
    make gen-proto && \
    make bloom-176b
deepspeed --num_gpus 8 --module inference_server.benchmark --model_name bigscience/bloom --model_class AutoModelForCausalLM --dtype fp16 --deployment_framework ds_inference --benchmark_cycles 5

To supplement: I can successfully run the benchmark for bloom-3b and get the perf data. First, add the following target to the Makefile:

bloom-3b:
        make ui

        TOKENIZERS_PARALLELISM=false \
        MODEL_NAME=bigscience/bloom-3b \
        MODEL_CLASS=AutoModelForCausalLM \
        DEPLOYMENT_FRAMEWORK=ds_inference \
        DTYPE=fp16 \
        MAX_INPUT_LENGTH=32 \
        MAX_BATCH_SIZE=4 \
        CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 \
        gunicorn -t 0 -w 1 -b 127.0.0.1:5000 inference_server.server:app --access-logfile - --access-logformat '%(h)s %(t)s "%(r)s" %(s)s %(b)s'

Then

bloom-3b
deepspeed --num_gpus 8 --module inference_server.benchmark --model_name bigscience/bloom-3b --model_class AutoModelForCausalLM --dtype fp16 --deployment_framework ds_inference --benchmark_cycles 5

[screenshot: bloom-3b benchmark output]

@mayank31398
Collaborator

Not sure why 176b is not working. I will try to look into it :)
