DJL-TensorRT-LLM Bug: TypeError: Got unsupported ScalarType BFloat16 #1816

Open
rileyhun opened this issue Apr 25, 2024 · 3 comments
Labels
bug Something isn't working

Comments

@rileyhun

Description


I am building the DJL-Serving TensorRT-LLM LMI inference container from scratch and deploying it on SageMaker Endpoints for the Zephyr-7B model. Unfortunately, I run into an error from the tensorrt_llm_toolkit: TypeError: Got unsupported ScalarType BFloat16

Expected Behavior

Expected the DJL-Serving image built from https://github.com/deepjavalibrary/djl-serving/blob/master/serving/docker/tensorrt-llm.Dockerfile to run successfully on SageMaker Endpoints.

Error Message


2024-04-25T11:17:00.976-07:00	[INFO ] LmiUtils - convert_py: File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm_toolkit/build_scripts/llama/convert_checkpoint.py", line 1480, in covert_and_save
2024-04-25T11:17:00.976-07:00	[INFO ] LmiUtils - convert_py: weights = convert_hf_llama(
2024-04-25T11:17:00.976-07:00	[INFO ] LmiUtils - convert_py: File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm_toolkit/build_scripts/llama/convert_checkpoint.py", line 1179, in convert_hf_llama
2024-04-25T11:17:00.976-07:00	[INFO ] LmiUtils - convert_py: np.pad(lm_head_weights.detach().cpu().numpy(),
2024-04-25T11:17:01.227-07:00	[INFO ] LmiUtils - convert_py: TypeError: Got unsupported ScalarType BFloat16
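
For context, a minimal sketch (not taken from the toolkit) of the underlying incompatibility: NumPy has no bfloat16 dtype, so calling .numpy() on a bfloat16 tensor raises exactly this TypeError, while casting to float32 first works. The weights tensor below is only an illustration.

import torch

weights = torch.randn(4, 4, dtype=torch.bfloat16)

try:
    weights.detach().cpu().numpy()  # raises: TypeError: Got unsupported ScalarType BFloat16
except TypeError as e:
    print(e)

as_fp32 = weights.detach().cpu().float().numpy()  # cast to float32 before converting to NumPy
print(as_fp32.dtype)  # float32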

How to Reproduce?


  • Create model:
from sagemaker.utils import name_from_base

model_name = name_from_base(f"my-model-djl-tensorrt")
print(model_name)

create_model_response = sm_client.create_model(
    ModelName=model_name,
    ExecutionRoleArn=role,
    PrimaryContainer={
        "Image": inference_image_uri,
        "ModelDataUrl": code_artifact,
        "Environment": {
            "ENGINE": "MPI",
            "OPTION_TENSOR_PARALLEL_DEGREE": "8",
            "OPTION_USE_CUSTOM_ALL_REDUCE": "false",
            "OPTION_OUTPUT_FORMATTER": "json",
            "OPTION_MAX_ROLLING_BATCH_SIZE": "16",
            "OPTION_MODEL_LOADING_TIMEOUT": "1000",
            "OPTION_MAX_INPUT_LEN": "5000",
            "OPTION_MAX_OUTPUT_LEN": "1000",
            "OPTION_DTYPE": "bf16"
        }
    },
)
model_arn = create_model_response["ModelArn"]

print(f"Created Model: {model_arn}")
  • Create endpoint config:
endpoint_config_name = f"{model_name}-config"
endpoint_name = f"{model_name}-endpoint"

endpoint_config_response = sm_client.create_endpoint_config(
    EndpointConfigName=endpoint_config_name,
    ProductionVariants=[
        {
            "VariantName": "variant1",
            "ModelName": model_name,
            "InstanceType": instance_type,
            "InitialInstanceCount": 1,
            "ModelDataDownloadTimeoutInSeconds": 2400,
            "ContainerStartupHealthCheckTimeoutInSeconds": 2400,
        },
    ],
)
endpoint_config_response
  • Create SageMaker endpoint:
create_endpoint_response = sm_client.create_endpoint(
    EndpointName=f"{endpoint_name}", EndpointConfigName=endpoint_config_name
)
print(f"Created Endpoint: {create_endpoint_response['EndpointArn']}")
@ydm-amazon
Contributor

Hi Riley, thanks for raising the issue. This is most likely an error in the checkpoint conversion script in NVIDIA/TensorRT-LLM: it loads the weights directly and converts them to NumPy on the CPU, and NumPy does not support the BFloat16 type. I'd suggest creating a ticket in the TensorRT-LLM repo about this issue.

To work around this issue in the meantime, you could manually convert the model to fp32 and save it before loading it.
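
For example, a rough sketch of that workaround with transformers (the model id and output path below are placeholders, not taken from this issue):

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "HuggingFaceH4/zephyr-7b-beta"  # placeholder for the Zephyr-7B checkpoint in use
out_dir = "zephyr-7b-fp32"                 # placeholder local path

# Load the checkpoint with fp32 weights so the conversion script never
# has to call .numpy() on bfloat16 tensors.
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.float32)
tokenizer = AutoTokenizer.from_pretrained(model_id)

model.save_pretrained(out_dir)
tokenizer.save_pretrained(out_dir)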

@rileyhun
Author

Hello @ydm-amazon,

Thanks for following up. I'll check w/ the TensorRT-LLM repo about the issue.

I also wanted to point out that I don't get this issue when using the following ARGs in the Dockerfile:

ARG djl_version=0.27.0~SNAPSHOT

# Base Deps
ARG cuda_version=cu122
ARG python_version=3.10
ARG torch_version=2.1.0
ARG pydantic_version=2.6.1
ARG cuda_python_version=12.2.0
ARG ammo_version=0.5.0
ARG janus_version=1.0.0
ARG pynvml_version=11.5.0
ARG s5cmd_version=2.2.2

# HF Deps
ARG transformers_version=4.36.2
ARG accelerate_version=0.25.0

# Trtllm Deps
ARG tensorrtlibs_version=9.2.0.post12.dev5
ARG trtllm_toolkit_version=0.7.1
ARG trtllm_version=v0.7.1

@ydm-amazon
Contributor

That's right - we know that TensorRT-LLM switched to a different way of loading the model from 0.7.1 to 0.8.0, so that may have caused the issue. We're also looking into our trtllm toolkit 0.8.0 to see if there's something there that may also contribute to the issue.
