
Running openai_demo.py raises "CUDA error: device-side assert triggered" #28

Closed
1 of 2 tasks
leeaction opened this issue Jan 22, 2025 · 4 comments

@leeaction

System Info

Running the following command in the app directory:

python openai_demo.py --model_path THUDM/cogagent-9b-20241220 --host 0.0.0.0 --port 7870

produces the following error:

../aten/src/ATen/native/cuda/TensorCompare.cu:110: _assert_async_cuda_kernel: block: [0,0,0], thread: [0,0,0] Assertion probability tensor contains either inf, nan or element < 0 failed.
Exception in thread Thread-2 (generate_text):
Traceback (most recent call last):
File "/data1/liangzengyan/programs/miniconda3/envs/cognew/lib/python3.10/threading.py", line 1016, in _bootstrap_inner
self.run()
File "/data1/liangzengyan/programs/miniconda3/envs/cognew/lib/python3.10/threading.py", line 953, in run
self._target(*self._args, **self._kwargs)
File "/root/work/CogAgent/app/openai_demo.py", line 348, in generate_text
model.generate(**model_inputs, **gen_kwargs)
File "/data1/liangzengyan/programs/miniconda3/envs/cognew/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
return func(*args, **kwargs)
File "/data1/liangzengyan/programs/miniconda3/envs/cognew/lib/python3.10/site-packages/transformers/generation/utils.py", line 2255, in generate
result = self._sample(
File "/data1/liangzengyan/programs/miniconda3/envs/cognew/lib/python3.10/site-packages/transformers/generation/utils.py", line 3300, in _sample
next_tokens = torch.multinomial(probs, num_samples=1).squeeze(1)
RuntimeError: CUDA error: device-side assert triggered
Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.
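The assertion in the traceback comes from `torch.multinomial` validating its input distribution: any `inf`, `nan`, or negative entry in the sampling probabilities trips it. The same check can be reproduced eagerly on CPU, where it raises a `RuntimeError` instead of a deferred device-side assert (a minimal sketch, independent of CogAgent):

```python
import torch

# torch.multinomial rejects distributions containing inf, nan, or
# negative entries. On CUDA this surfaces as the device-side assert
# seen above; on CPU the same validation raises a RuntimeError eagerly.
probs = torch.tensor([[0.5, float("nan"), 0.5]])

try:
    torch.multinomial(probs, num_samples=1)
    raised = False
except RuntimeError as e:
    raised = True
    print(e)  # message mentions inf, nan, or element < 0

assert raised
```

In the reported setup, the NaNs most likely originate upstream, in logits that overflowed during the forward pass, before sampling ever runs.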

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.103.01 Driver Version: 470.103.01 CUDA Version: 12.2 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 Tesla V100-PCIE... Off | 00000000:21:04.0 Off | 0 |
| N/A 34C P0 37W / 250W | 2508MiB / 32510MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
+-----------------------------------------------------------------------------+

Name: transformers
Version: 4.48.1

Name: torch
Version: 2.5.1

Name: torchvision
Version: 0.20.1

Who can help?

No response

Information

  • The official example scripts
  • My own modified scripts and tasks

Reproduction

python openai_demo.py --model_path THUDM/cogagent-9b-20241220 --host 0.0.0.0 --port 7870

Expected behavior

I would like to understand the cause of this problem.

@zRzRzRzRzRzRzR zRzRzRzRzRzRzR self-assigned this Jan 23, 2025
@zRzRzRzRzRzRzR
Member

Are you running inference in BF16? This looks like something such as a vocabulary (logits) overflow, which can happen under FP16.

@leeaction
Author

> Are you running inference in BF16? This looks like something such as a vocabulary (logits) overflow, which can happen under FP16.

Since the V100 does not support BF16, I specified the dtype as float16, but the error still occurs. Do the model weights need to be converted to float16 first?
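For reference, "specifying the dtype as float16" typically amounts to a load like the following. This is a hedged sketch using the standard transformers API; `device_map="auto"` (which requires accelerate) and `trust_remote_code=True` are assumptions about how this particular checkpoint is loaded, not confirmed from the demo script:

```python
import torch
from transformers import AutoModelForCausalLM

# Sketch: force fp16 weights on a pre-Ampere GPU such as the V100,
# which has no native bfloat16 support.
model = AutoModelForCausalLM.from_pretrained(
    "THUDM/cogagent-9b-20241220",
    torch_dtype=torch.float16,  # V100 cannot run BF16 kernels
    device_map="auto",          # requires the accelerate package
    trust_remote_code=True,     # assumption: custom model code in the repo
)
```

Note that fp16's narrow exponent range is exactly what makes logits overflow to inf/NaN more likely, so this load path can still hit the assert above.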

@zRzRzRzRzRzRzR
Member

Probably neither will work. The V100 inherently carries the risk of this problem, and we have not tested on that card. You can convert to FP16 for inference, but the output quality will be much worse, and it is not guaranteed to fix the issue.

@leeaction
Author

I quantized the model to int8 and it runs now; the inference quality also looks acceptable.
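One common way to get int8 inference like this is 8-bit loading through bitsandbytes. The sketch below assumes the quantization was done via transformers' `BitsAndBytesConfig`; the commenter's exact method is not shown in the thread:

```python
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# Sketch: load the checkpoint with int8 weights via bitsandbytes.
# Int8 matmuls keep accumulations in wider precision, which sidesteps
# the fp16 overflow that produces NaN probabilities on the V100.
quant_config = BitsAndBytesConfig(load_in_8bit=True)

model = AutoModelForCausalLM.from_pretrained(
    "THUDM/cogagent-9b-20241220",
    quantization_config=quant_config,
    device_map="auto",        # requires the accelerate package
    trust_remote_code=True,   # assumption: custom model code in the repo
)
```

This also roughly halves the ~18 GB fp16 footprint, which helps on a 32 GB V100.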
