DJL-TensorRT-LLM Bug: TypeError: Got unsupported ScalarType BFloat16 #1816

Open
rileyhun opened this issue Apr 25, 2024 · 3 comments
Labels
bug Something isn't working

Comments

@rileyhun

Description


I am building the DJL-Serving TensorRT-LLM LMI inference container from scratch and deploying it on SageMaker Endpoints for the Zephyr-7B model. Unfortunately, I run into an error from the tensorrt_llm_toolkit: TypeError: Got unsupported ScalarType BFloat16

Expected Behavior

Expected the DJL-Serving image built from https://github.com/deepjavalibrary/djl-serving/blob/master/serving/docker/tensorrt-llm.Dockerfile to run successfully on SageMaker Endpoints.

Error Message


2024-04-25T11:17:00.976-07:00	[INFO ] LmiUtils - convert_py: File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm_toolkit/build_scripts/llama/convert_checkpoint.py", line 1480, in covert_and_save
2024-04-25T11:17:00.976-07:00	[INFO ] LmiUtils - convert_py: weights = convert_hf_llama(
2024-04-25T11:17:00.976-07:00	[INFO ] LmiUtils - convert_py: File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm_toolkit/build_scripts/llama/convert_checkpoint.py", line 1179, in convert_hf_llama
2024-04-25T11:17:00.976-07:00	[INFO ] LmiUtils - convert_py: np.pad(lm_head_weights.detach().cpu().numpy(),
2024-04-25T11:17:01.227-07:00	[INFO ] LmiUtils - convert_py: TypeError: Got unsupported ScalarType BFloat16
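
For context, a minimal sketch (not taken from the toolkit) of the underlying incompatibility: NumPy has no bfloat16 dtype, so calling .numpy() on a bfloat16 tensor raises exactly this TypeError, while casting to float32 first works. The weights tensor below is only an illustration.

import torch

weights = torch.randn(4, 4, dtype=torch.bfloat16)

try:
    weights.detach().cpu().numpy()  # raises: TypeError: Got unsupported ScalarType BFloat16
except TypeError as e:
    print(e)

as_fp32 = weights.detach().cpu().float().numpy()  # cast to float32 before converting to NumPy
print(as_fp32.dtype)  # float32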

How to Reproduce?


  • Create model:
from sagemaker.utils import name_from_base

model_name = name_from_base(f"my-model-djl-tensorrt")
print(model_name)

create_model_response = sm_client.create_model(
    ModelName=model_name,
    ExecutionRoleArn=role,
    PrimaryContainer={
        "Image": inference_image_uri,
        "ModelDataUrl": code_artifact,
        "Environment": {
            "ENGINE": "MPI",
            "OPTION_TENSOR_PARALLEL_DEGREE": "8",
            "OPTION_USE_CUSTOM_ALL_REDUCE": "false",
            "OPTION_OUTPUT_FORMATTER": "json",
            "OPTION_MAX_ROLLING_BATCH_SIZE": "16",
            "OPTION_MODEL_LOADING_TIMEOUT": "1000",
            "OPTION_MAX_INPUT_LEN": "5000",
            "OPTION_MAX_OUTPUT_LEN": "1000",
            "OPTION_DTYPE": "bf16"
        }
    },
)
model_arn = create_model_response["ModelArn"]

print(f"Created Model: {model_arn}")
  • Create endpoint config:
endpoint_config_name = f"{model_name}-config"
endpoint_name = f"{model_name}-endpoint"

endpoint_config_response = sm_client.create_endpoint_config(
    EndpointConfigName=endpoint_config_name,
    ProductionVariants=[
        {
            "VariantName": "variant1",
            "ModelName": model_name,
            "InstanceType": instance_type,
            "InitialInstanceCount": 1,
            "ModelDataDownloadTimeoutInSeconds": 2400,
            "ContainerStartupHealthCheckTimeoutInSeconds": 2400,
        },
    ],
)
endpoint_config_response
  • Create SageMaker endpoint:
create_endpoint_response = sm_client.create_endpoint(
    EndpointName=f"{endpoint_name}", EndpointConfigName=endpoint_config_name
)
print(f"Created Endpoint: {create_endpoint_response['EndpointArn']}")
@ydm-amazon
Contributor

Hi Riley, thanks for raising the issue. This is most likely an error in the checkpoint conversion script in NVIDIA/TensorRT-LLM: it loads the weights directly and converts them to NumPy on the CPU, and NumPy does not support the BFloat16 type. I'd suggest creating a ticket in the TensorRT-LLM repo about this issue.

To work around this issue in the meantime, you could manually convert the model to fp32 and save it before loading it.
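
For example, a rough sketch of that workaround with transformers (the model id and output path below are placeholders, not taken from this issue):

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "HuggingFaceH4/zephyr-7b-beta"  # placeholder for the Zephyr-7B checkpoint in use
out_dir = "zephyr-7b-fp32"                 # placeholder local path

# Load the checkpoint with fp32 weights so the conversion script never
# has to call .numpy() on bfloat16 tensors.
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.float32)
tokenizer = AutoTokenizer.from_pretrained(model_id)

model.save_pretrained(out_dir)
tokenizer.save_pretrained(out_dir)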

@rileyhun
Author

Hello @ydm-amazon,

Thanks for following up. I'll check w/ the TensorRT-LLM repo about the issue.

I also wanted to point out that I don't get this issue when using the following ARGs in the Dockerfile:

ARG djl_version=0.27.0~SNAPSHOT

# Base Deps
ARG cuda_version=cu122
ARG python_version=3.10
ARG torch_version=2.1.0
ARG pydantic_version=2.6.1
ARG cuda_python_version=12.2.0
ARG ammo_version=0.5.0
ARG janus_version=1.0.0
ARG pynvml_version=11.5.0
ARG s5cmd_version=2.2.2

# HF Deps
ARG transformers_version=4.36.2
ARG accelerate_version=0.25.0

# Trtllm Deps
ARG tensorrtlibs_version=9.2.0.post12.dev5
ARG trtllm_toolkit_version=0.7.1
ARG trtllm_version=v0.7.1

@ydm-amazon
Contributor

That's right - we know that TensorRT-LLM switched to a different way of loading the model from 0.7.1 to 0.8.0, so that may have caused the issue. We're also looking into our trtllm toolkit 0.8.0 to see if there's something there that may also contribute to the issue.
