Hi, I ran into the issue below when trying to serve GPT2 as described in the guide. Could anyone help me check whether this is a configuration-related error?
$ python examples/inference/api_server_openai/query_http_requests.py
chunk content: {"generated_text":null,"tool_calls":null,"num_input_tokens":null,"num_input_tokens_batch":null,"num_generated_tokens":null,"num_generated_tokens_batch":null,"preprocessing_time":null,"generation_time":null,"timestamp":1715708425.9959323,"finish_reason":null,"error":{"object":"error","message":"Internal Server Error","internal_message":"Internal Server Error","type":"InternalServerError","param":{},"code":500}}
Traceback (most recent call last):
File "/home/rcp_user/yongqiang/llm-on-ray/examples/inference/api_server_openai/query_http_requests.py", line 90, in <module>
raise e
File "/home/rcp_user/yongqiang/llm-on-ray/examples/inference/api_server_openai/query_http_requests.py", line 85, in <module>
choices = json.loads(chunk)["choices"]
~~~~~~~~~~~~~~~~~^^^^^^^^^^^
KeyError: 'choices'
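The KeyError is only the client tripping over the error payload: the chunk above carries an "error" object and no "choices" field, so json.loads(chunk)["choices"] fails. For reference, a small guard that surfaces the server error instead of the KeyError (illustrative only, not the shipped query_http_requests.py):

import json

# Hypothetical guard; the payload below is the failing chunk from the run above,
# trimmed to the relevant fields.
chunk = '{"generated_text": null, "finish_reason": null, "error": {"object": "error", "message": "Internal Server Error", "type": "InternalServerError", "code": 500}}'
payload = json.loads(chunk)
if payload.get("error"):
    print(f"server error {payload['error']['code']}: {payload['error']['message']}")
else:
    choices = payload["choices"]

The real failure is therefore on the server side, so the next step is the replica log.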
Checking the logs with "ray logs cluster":
$ ray logs cluster worker-f07bb5a711a3c88ae7720f125167d0d2a7e64799231f33910f71ac72-01000000-1624109.err
--- Log has been truncated to last 1000 lines. Use `--tail` flag to toggle. Set to -1 for getting the entire file. ---
:job_id:01000000
:actor_name:ServeReplica:router:PredictorDeployment
/home/rcp_user/anaconda3/envs/ray/lib/python3.11/site-packages/huggingface_hub/file_download.py:1132: FutureWarning: `resume_download` is deprecated and will be removed in version 1.0.0. Downloads always resume when possible. If you want to force a new download, use `force_download=True`.
warnings.warn(
/home/rcp_user/anaconda3/envs/ray/lib/python3.11/site-packages/huggingface_hub/file_download.py:1132: FutureWarning: `resume_download` is deprecated and will be removed in version 1.0.0. Downloads always resume when possible. If you want to force a new download, use `force_download=True`.
warnings.warn(
2024-05-14 13:39:51,629 - _logger.py - IPEX - WARNING - [NotSupported]fail to apply ipex.llm.optimize due to: Could not run 'ipex_prepack::linear_prepack' with arguments from the 'CUDA' backend. This could be because the operator doesn't exist for this backend, or was omitted during the selective/custom build process (if using custom build). If you are a Facebook employee using PyTorch on mobile, please visit https://fburl.com/ptmfixes for possible resolutions. 'ipex_prepack::linear_prepack' is only available for these backends: [CPU, Meta, BackendSelect, Python, FuncTorchDynamicLayerBackMode, Functionalize, Named, Conjugate, Negative, ZeroTensor, ADInplaceOrView, AutogradOther, AutogradCPU, AutogradCUDA, AutogradXLA, AutogradMPS, AutogradXPU, AutogradHPU, AutogradLazy, AutogradMeta, Tracer, AutocastCPU, AutocastCUDA, FuncTorchBatched, BatchedNestedTensor, FuncTorchVmapMode, Batched, VmapMode, FuncTorchGradWrapper, PythonTLSSnapshot, FuncTorchDynamicLayerFrontMode, PreDispatch, PythonDispatcher].
CPU: registered at /opt/workspace/ipex-cpu-dev/csrc/cpu/jit/cpu/kernels/RegisterOpContextClass.cpp:192 [kernel]
Meta: registered at ../aten/src/ATen/core/MetaFallbackKernel.cpp:23 [backend fallback]
BackendSelect: fallthrough registered at ../aten/src/ATen/core/BackendSelectFallbackKernel.cpp:3 [backend fallback]
Python: registered at ../aten/src/ATen/core/PythonFallbackKernel.cpp:154 [backend fallback]
FuncTorchDynamicLayerBackMode: registered at ../aten/src/ATen/functorch/DynamicLayer.cpp:497 [backend fallback]
Functionalize: registered at ../aten/src/ATen/FunctionalizeFallbackKernel.cpp:324 [backend fallback]
Named: registered at ../aten/src/ATen/core/NamedRegistrations.cpp:7 [backend fallback]
Conjugate: registered at ../aten/src/ATen/ConjugateFallback.cpp:17 [backend fallback]
Negative: registered at ../aten/src/ATen/native/NegateFallback.cpp:18 [backend fallback]
ZeroTensor: registered at ../aten/src/ATen/ZeroTensorFallback.cpp:86 [backend fallback]
ADInplaceOrView: fallthrough registered at ../aten/src/ATen/core/VariableFallbackKernel.cpp:86 [backend fallback]
AutogradOther: registered at ../aten/src/ATen/core/VariableFallbackKernel.cpp:53 [backend fallback]
AutogradCPU: registered at ../aten/src/ATen/core/VariableFallbackKernel.cpp:57 [backend fallback]
AutogradCUDA: registered at ../aten/src/ATen/core/VariableFallbackKernel.cpp:65 [backend fallback]
AutogradXLA: registered at ../aten/src/ATen/core/VariableFallbackKernel.cpp:69 [backend fallback]
AutogradMPS: registered at ../aten/src/ATen/core/VariableFallbackKernel.cpp:77 [backend fallback]
AutogradXPU: registered at ../aten/src/ATen/core/VariableFallbackKernel.cpp:61 [backend fallback]
AutogradHPU: registered at ../aten/src/ATen/core/VariableFallbackKernel.cpp:90 [backend fallback]
AutogradLazy: registered at ../aten/src/ATen/core/VariableFallbackKernel.cpp:73 [backend fallback]
AutogradMeta: registered at ../aten/src/ATen/core/VariableFallbackKernel.cpp:81 [backend fallback]
Tracer: registered at ../torch/csrc/autograd/TraceTypeManual.cpp:297 [backend fallback]
AutocastCPU: fallthrough registered at ../aten/src/ATen/autocast_mode.cpp:378 [backend fallback]
AutocastCUDA: fallthrough registered at ../aten/src/ATen/autocast_mode.cpp:244 [backend fallback]
FuncTorchBatched: registered at ../aten/src/ATen/functorch/LegacyBatchingRegistrations.cpp:731 [backend fallback]
BatchedNestedTensor: registered at ../aten/src/ATen/functorch/LegacyBatchingRegistrations.cpp:758 [backend fallback]
FuncTorchVmapMode: fallthrough registered at ../aten/src/ATen/functorch/VmapModeRegistrations.cpp:27 [backend fallback]
Batched: registered at ../aten/src/ATen/LegacyBatchingRegistrations.cpp:1075 [backend fallback]
VmapMode: fallthrough registered at ../aten/src/ATen/VmapModeRegistrations.cpp:33 [backend fallback]
FuncTorchGradWrapper: registered at ../aten/src/ATen/functorch/TensorWrapper.cpp:202 [backend fallback]
PythonTLSSnapshot: registered at ../aten/src/ATen/core/PythonFallbackKernel.cpp:162 [backend fallback]
FuncTorchDynamicLayerFrontMode: registered at ../aten/src/ATen/functorch/DynamicLayer.cpp:493 [backend fallback]
PreDispatch: registered at ../aten/src/ATen/core/PythonFallbackKernel.cpp:166 [backend fallback]
PythonDispatcher: registered at ../aten/src/ATen/core/PythonFallbackKernel.cpp:158 [backend fallback]
, fallback to the origin model
The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
ERROR 2024-05-14 13:40:25,987 router_PredictorDeployment dh04lq3r 442f8811-5cef-45f9-916b-e4e964ef92dc /v1/chat/completions replica.py:359 - Request failed:
ray::ServeReplica:router:PredictorDeployment.handle_request_with_rejection() (pid=1624109, ip=10.97.102.172)
File "/home/rcp_user/anaconda3/envs/ray/lib/python3.11/site-packages/ray/serve/_private/utils.py", line 168, in wrap_to_ray_error
raise exception
File "/home/rcp_user/anaconda3/envs/ray/lib/python3.11/site-packages/ray/serve/_private/replica.py", line 1131, in call_user_method
result = await self._handle_user_method_result(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/rcp_user/anaconda3/envs/ray/lib/python3.11/site-packages/ray/serve/_private/replica.py", line 1038, in _handle_user_method_result
async for r in result:
File "/home/rcp_user/anaconda3/envs/ray/lib/python3.11/site-packages/llm_on_ray/inference/predictor_deployment.py", line 444, in openai_call
yield await self.handle_non_streaming(input, config)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/rcp_user/anaconda3/envs/ray/lib/python3.11/site-packages/llm_on_ray/inference/predictor_deployment.py", line 242, in handle_non_streaming
return await self.handle_dynamic_batch((input, config))
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/rcp_user/anaconda3/envs/ray/lib/python3.11/site-packages/ray/serve/batching.py", line 579, in batch_wrapper
return await enqueue_request(args, kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/rcp_user/anaconda3/envs/ray/lib/python3.11/site-packages/ray/serve/batching.py", line 265, in _assign_func_results
results = await func_future
^^^^^^^^^^^^^^^^^
File "/home/rcp_user/anaconda3/envs/ray/lib/python3.11/site-packages/llm_on_ray/inference/predictor_deployment.py", line 269, in handle_dynamic_batch
batch_results = self.predictor.generate(prompts, **config)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/rcp_user/yongqiang/llm-on-ray/llm_on_ray/inference/predictors/transformer_predictor.py", line 123, in generate
gen_tokens = self.model.generate(
^^^^^^^^^^^^^^^^^^^^
File "/home/rcp_user/anaconda3/envs/ray/lib/python3.11/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
return func(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^
File "/home/rcp_user/anaconda3/envs/ray/lib/python3.11/site-packages/transformers/generation/utils.py", line 1622, in generate
result = self._sample(
^^^^^^^^^^^^^
File "/home/rcp_user/anaconda3/envs/ray/lib/python3.11/site-packages/transformers/generation/utils.py", line 2791, in _sample
outputs = self(
^^^^^
File "/home/rcp_user/anaconda3/envs/ray/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/rcp_user/anaconda3/envs/ray/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
return forward_call(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/rcp_user/anaconda3/envs/ray/lib/python3.11/site-packages/transformers/models/gpt2/modeling_gpt2.py", line 1305, in forward
transformer_outputs = self.transformer(
^^^^^^^^^^^^^^^^^
File "/home/rcp_user/anaconda3/envs/ray/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/rcp_user/anaconda3/envs/ray/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
return forward_call(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/rcp_user/anaconda3/envs/ray/lib/python3.11/site-packages/transformers/models/gpt2/modeling_gpt2.py", line 1119, in forward
outputs = block(
^^^^^^
File "/home/rcp_user/anaconda3/envs/ray/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/rcp_user/anaconda3/envs/ray/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
return forward_call(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/rcp_user/anaconda3/envs/ray/lib/python3.11/site-packages/transformers/models/gpt2/modeling_gpt2.py", line 616, in forward
hidden_states = self.ln_1(hidden_states)
^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/rcp_user/anaconda3/envs/ray/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/rcp_user/anaconda3/envs/ray/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
return forward_call(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/rcp_user/anaconda3/envs/ray/lib/python3.11/site-packages/torch/nn/modules/normalization.py", line 201, in forward
return F.layer_norm(
^^^^^^^^^^^^^
File "/home/rcp_user/anaconda3/envs/ray/lib/python3.11/site-packages/torch/nn/functional.py", line 2573, in layer_norm
return torch.layer_norm(input, normalized_shape, weight, bias, eps, torch.backends.cudnn.enabled)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
RuntimeError: expected scalar type BFloat16 but found Float
INFO 2024-05-14 13:40:25,988 router_PredictorDeployment dh04lq3r 442f8811-5cef-45f9-916b-e4e964ef92dc /v1/chat/completions replica.py:373 - OPENAI_CALL ERROR 751.0ms
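The root cause is the RuntimeError at the bottom of the trace: when generation reaches GPT-2's LayerNorm, the activations and the layer parameters are in different floating dtypes. That is consistent with the earlier warning that ipex.llm.optimize could not run on the CUDA backend and fell back to the original model, so a bf16 precision setting presumably meant for the IPEX/CPU path no longer matches the rest of the pipeline. A minimal, self-contained sketch of the same kind of mismatch (illustrative only, not the llm-on-ray code path):

import torch

# LayerNorm parameters left in float32 while the hidden states arrive in bfloat16.
# Depending on the PyTorch build and device this either raises the same
# "expected scalar type BFloat16 but found Float" RuntimeError or silently
# up-casts; the replica in the log above hits the RuntimeError.
ln = torch.nn.LayerNorm(768)                            # float32 weight and bias
hidden = torch.randn(1, 4, 768, dtype=torch.bfloat16)   # bf16 activations
try:
    ln(hidden)
    print("mixed-dtype layer_norm accepted on this build")
except RuntimeError as e:
    print(f"RuntimeError: {e}")

# Keeping everything in a single dtype avoids the mismatch:
out = ln.to(torch.bfloat16)(hidden)
print(out.dtype)                                        # torch.bfloat16

My guess is that the configured bf16 precision is still being applied even though the IPEX optimization fell back on CUDA, leaving weights and activations in different dtypes, but I am not sure which precision setting the guide intends for this hardware.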
Below is the environment:
$ nvidia-smi
Tue May 14 13:42:53 2024
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.129.03 Driver Version: 535.129.03 CUDA Version: 12.2 |
|-----------------------------------------+----------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+======================+======================|
| 0 NVIDIA GeForce RTX 4090 Off | 00000000:1B:00.0 Off | Off |
| 0% 37C P8 30W / 450W | 17514MiB / 24564MiB | 0% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+
| 1 NVIDIA GeForce RTX 4090 Off | 00000000:3D:00.0 Off | Off |
| 0% 38C P8 20W / 450W | 14502MiB / 24564MiB | 0% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+
+---------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
I use conda as the virtual environment; the Python version is:
GPT2 configuration: