The vLLM backend supports sending additional outputs from vLLM on top of the usual `text_output` when requested.

All additional outputs are disabled by default and need to be enabled on a per-request basis. If enabled, the corresponding output tensor will be set on all responses from the request.

The reason why the sequence finished. See the vLLM documentation for more details. To enable, set the `return_finish_reason` input tensor to `True`. The reason will be sent as a string on the `finish_reason` output tensor. A complete client example is shown at the end of this section.

The cumulative log probability of the generated output text. See the vLLM documentation for more details. To enable, set the `return_cumulative_logprob` input tensor to `True`. The floating point value will be sent on the `cumulative_logprob` output tensor.

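A minimal sketch of how this could be requested from a Triton gRPC client follows; the prompt, the model name `vLLM_model_name`, and the endpoint `localhost:8001` mirror the complete example at the end of this section and are placeholders:

```python
import numpy as np
import tritonclient.grpc as grpcclient

# Prompt plus the per-request flag enabling the additional output.
inputs = [grpcclient.InferInput("text_input", [1], "BYTES")]
inputs[-1].set_data_from_numpy(
    np.array(["example prompt".encode("utf-8")], dtype=np.object_)
)
inputs.append(grpcclient.InferInput("return_cumulative_logprob", [1], "BOOL"))
inputs[-1].set_data_from_numpy(np.array([True], dtype=bool))


def callback(result, error):
    if error is None:
        # Floating point cumulative log probability of the text generated so far.
        print(result.as_numpy("cumulative_logprob"))


with grpcclient.InferenceServerClient("localhost:8001") as client:
    client.start_stream(callback)
    client.async_stream_infer("vLLM_model_name", inputs=inputs)
    client.stop_stream()
```
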
The log probabilities of the top-probability tokens at each position, when log probabilities are requested via the sampling parameters. Only the log probabilities of the new tokens generated since the last response are returned on each new response. See the vLLM documentation for more details on log probabilities. To enable, set the `return_logprobs` input tensor to `True`. The log probabilities will be sent on the `logprobs` output tensor as a serialized JSON string.

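Building on the imports and `inputs` list from the sketch above, a fragment showing how the `logprobs` output could be requested and decoded (the exact JSON layout follows vLLM's log probability structure):

```python
import json

# Flag appended to the request's `inputs` list.
inputs.append(grpcclient.InferInput("return_logprobs", [1], "BOOL"))
inputs[-1].set_data_from_numpy(np.array([True], dtype=bool))


def callback(result, error):
    if error is None:
        # The log probabilities arrive as a serialized JSON string;
        # json.loads accepts the raw tensor element directly.
        print(json.loads(result.as_numpy("logprobs")[0]))
```
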
The number of token IDs in the prompt. See the vLLM documentation for more details. To enable, set the `return_num_input_tokens` input tensor to `True`. The unsigned integer value will be sent on the `num_input_tokens` output tensor.

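Requesting and reading this count follows the same pattern; a fragment, again building on the earlier sketch:

```python
# Flag appended to the request's `inputs` list.
inputs.append(grpcclient.InferInput("return_num_input_tokens", [1], "BOOL"))
inputs[-1].set_data_from_numpy(np.array([True], dtype=bool))


def callback(result, error):
    if error is None:
        # Unsigned integer count of the prompt's token IDs.
        print(result.as_numpy("num_input_tokens"))
```
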
The number of token IDs in the generated output text sent on this response, i.e. the difference between the number of token IDs generated up to this response and up to the previous response. If this is the first response, the previous length is presumed to be zero. See the vLLM documentation for more details on the token IDs of the generated output text. To enable, set the `return_num_output_tokens` input tensor to `True`. The unsigned integer value will be sent on the `num_output_tokens` output tensor.

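Because each response reports only the tokens generated since the previous response, a streaming client could sum the per-response counts to track the running total. A sketch, assuming `return_num_output_tokens` was set to `True` on the request; `total_output_tokens` is a hypothetical accumulator:

```python
total_output_tokens = 0


def callback(result, error):
    global total_output_tokens
    if error is None:
        # Each response carries only the newly generated tokens, so the
        # running sum reconstructs the total number of output tokens so far.
        total_output_tokens += int(result.as_numpy("num_output_tokens")[0])
```
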
For example, the following complete client requests the finish reason in addition to the generated text:

```python
import numpy as np
import tritonclient.grpc as grpcclient

inputs = []

# Prompt for the vLLM model.
inputs.append(grpcclient.InferInput("text_input", [1], "BYTES"))
inputs[-1].set_data_from_numpy(
    np.array(["example prompt".encode("utf-8")], dtype=np.object_)
)

# Enable the additional "finish_reason" output for this request.
inputs.append(grpcclient.InferInput("return_finish_reason", [1], "BOOL"))
inputs[-1].set_data_from_numpy(np.array([True], dtype=bool))


def callback(result, error):
    ...
    # The finish reason arrives as a string on the "finish_reason" tensor.
    print(result.as_numpy(name="finish_reason"))


with grpcclient.InferenceServerClient("localhost:8001") as client:
    client.start_stream(callback)
    client.async_stream_infer("vLLM_model_name", inputs=inputs, ...)
    client.stop_stream()
```

- Enabling additional outputs may impact performance; only request additional outputs when necessary.