This tool load-tests large language models running in different runtimes or behind different Azure APIs.
- Generate a custom testing dataset tailored to your use case.
- Load test with customizable concurrency, duration, and warmup options.
- Support for various Azure AI endpoints, including Azure OpenAI, Azure OpenAI Embeddings, and Azure Model Catalog (serverless and managed compute).
- Detailed throughput and latency statistics in JSON format.
- Modular design, enabling developers to integrate new endpoints with minimal effort.
- UI Support.
- Batch support for embedding endpoints.
We use the `cl100k_base` encoder from the `tiktoken` package to count tokens. To change the token encoder, modify the `num_tokens_from_string` function in `plugins/plugin.py`.
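As a rough sketch, a counting function of this kind typically looks like the following with `tiktoken` (the actual body in `plugins/plugin.py` may differ):

```python
import tiktoken

def num_tokens_from_string(string: str) -> int:
    # Swap "cl100k_base" for another encoding name to change the tokenizer.
    encoding = tiktoken.get_encoding("cl100k_base")
    return len(encoding.encode(string))
```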
- Azure Model as a Platform (Managed Compute).
- Azure Model as a Service (Serverless API).
- Azure-OpenAI.
- Azure-OpenAI-Embeddings.
- Cohere-Embeddings.
- Python 3.9 or newer
1- Generate Dataset (Optional):
```bash
python datasets/generate_dataset.py --tok_input_length 250 --tok_output_length 50 --N 100 --output_file datasets/random_text_dataset.jsonl
```
- `--tok_input_length`: The length of the input in tokens (minimum 25).
- `--tok_output_length`: The length of the output in tokens.
- `--N`: The number of samples to generate.
- `--output_file`: The name of the output file (default is `random_text_dataset.jsonl`).
2- Running the Tool:
```
usage: load_test.py [-h] [-c CONFIG] [-log {warn,warning,info,debug}]

optional arguments:
  -h, --help            show this help message and exit
  -c CONFIG, --config CONFIG
                        config YAML file name
  -log {warn,warning,info,debug}, --log_level {warn,warning,info,debug}
                        Provide logging level. Example --log_level debug, default=warning
```
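For example, to run a test against the settings in `config.yaml` with more verbose logging:

```bash
python load_test.py --config config.yaml --log_level info
```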
The tool's behavior can be customized using a YAML configuration file; see `config.yaml` for a full example. The available options are described below, and a minimal sketch combining the general options follows the list.
- format: Specifies the format of the output file. Currently supports "json". Future support for "pickle" may be added.
- dir: Directory where the output file will be saved.
- file: Name of the output file.
- warmup: Boolean flag to enable or disable warmup requests before the actual load test.
- warmup_options:
- requests: Number of warmup requests to send.
- timeout_sec: Timeout for each warmup request in seconds.
- type: Specifies the type of storage. Currently, only "local" is supported. Future support for other storage types is planned.
- file: Path to the dataset file in JSONL format. You can also auto-generate a dataset using `datasets/generate_dataset.py`.
- max_queries: Maximum number of queries to use from the dataset.
- min_input_tokens: Minimum number of input tokens per query.
- max_input_tokens: Maximum number of input tokens per query.
- max_output_tokens: Maximum number of output tokens per query.
- max_sequence_tokens: Maximum number of tokens in a sequence.
- type: Type of load pattern. Currently supports "constant". Future options include "loadgen" and "stair-step".
- concurrency: Number of concurrent requests.
- duration: Duration of the load test in seconds. Future support for different time formats like "100s", "10m", etc., is planned.
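A minimal sketch combining the general options above. The top-level section names used here (`output`, `storage`, `dataset`) are assumptions for illustration; `load_options` appears in the example output further below. Check `config.yaml` in the repo for the authoritative layout.

```yaml
# Hypothetical layout; only the option names are taken from the descriptions above.
output:
  format: "json"        # currently only "json" is supported
  dir: "results"
  file: "results.json"
warmup: true
warmup_options:
  requests: 2
  timeout_sec: 30
storage:
  type: "local"         # currently only "local" is supported
dataset:
  file: "datasets/random_text_dataset.jsonl"
  max_queries: 100
  min_input_tokens: 25
  max_input_tokens: 250
  max_output_tokens: 50
  max_sequence_tokens: 300
load_options:
  type: "constant"      # currently only "constant" is supported
  concurrency: 8
  duration: 20          # seconds
```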
- plugin: "azure_maap_plugin"
- plugin_options:
- url: Endpoint URL for the Azure MaaP service.
- key: API key for authentication.
- deployment: Deployment name.
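A hedged sketch of this plugin's configuration section (all values are placeholders):

```yaml
plugin: "azure_maap_plugin"
plugin_options:
  url: "<managed-compute-endpoint-url>"  # placeholder
  key: "<api-key>"
  deployment: "<deployment-name>"
```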
- plugin: "azure_serverless_plugin"
- plugin_options:
- streaming: Boolean flag to enable or disable streaming.
- url: Endpoint URL for the Azure Serverless service. Don't forget to include `/v1/chat/completions`.
- model_name: Name of the model.
- key: API key for authentication.
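A hedged sketch for the serverless plugin (placeholder values; note the required `/v1/chat/completions` path):

```yaml
plugin: "azure_serverless_plugin"
plugin_options:
  streaming: true
  url: "https://<your-endpoint>/v1/chat/completions"  # placeholder host
  model_name: "<model-name>"
  key: "<api-key>"
```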
- plugin: "azure_openai_plugin"
- plugin_options:
- streaming: Boolean flag to enable or disable streaming.
- url: Endpoint URL for the Azure-OpenAI service.
- key: API key for authentication.
- deployment: Deployment name.
- api_version: API version to use.
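A hedged sketch for the Azure-OpenAI plugin (placeholder values):

```yaml
plugin: "azure_openai_plugin"
plugin_options:
  streaming: true
  url: "https://<resource-name>.openai.azure.com/"  # placeholder
  key: "<api-key>"
  deployment: "<deployment-name>"
  api_version: "<api-version>"
```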
- plugin: "azure_oai_embeddings_plugin"
- plugin_options:
- url: Endpoint URL for the Azure-OpenAI Embeddings service.
- key: API key for authentication.
- deployment: Deployment name.
- api_version: API version to use.
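And a hedged sketch for the embeddings plugin (placeholder values):

```yaml
plugin: "azure_oai_embeddings_plugin"
plugin_options:
  url: "https://<resource-name>.openai.azure.com/"  # placeholder
  key: "<api-key>"
  deployment: "<embedding-deployment-name>"
  api_version: "<api-version>"
```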
The tool logs a results summary to stdout and writes the detailed test results, along with the summary, in JSON format. The JSON output contains:
- An array of results, one element per request sent during the test.
- A detailed summary report of the run.
- All the configuration metadata for the run.
For example:
"results": [
...
],
"config": {
...
"load_options": {
"type": "constant",
"concurrency": 8,
"duration": 20
...
},
"summary": {
"output_tokens_throughput": 159.25729928295627,
"input_tokens_throughput": 1592.5729928295625,
"full_duration": 20.093270540237427,
"total_requests": 16,
"complete_request_per_sec": 0.79, # number of competed requests / full_duration
"total_failures": 0,
"failure_rate": 0.0
#time per ouput_token
"tpot": {
"min": 0.010512285232543946,
"max": 0.018693844079971312,
"median": 0.01216195583343506,
"mean": 0.012808671338217597,
"percentile_80": 0.012455177783966065,
"percentile_90": 0.01592913103103638,
"percentile_95": 0.017840550780296324,
"percentile_99": 0.018523185420036312
},
#time to first token
"ttft": {
"min": 0.4043765068054199,
"max": 0.5446293354034424,
"median": 0.46433258056640625,
"mean": 0.4660029411315918,
"percentile_80": 0.51033935546875,
"percentile_90": 0.5210948467254639,
"percentile_95": 0.5295632600784301,
"percentile_99": 0.54161612033844
},
#input token latency
"itl": {
"min": 0.008117493672586566,
"max": 0.01664590356337964,
"median": 0.009861880810416522,
"mean": 0.010531313198552402,
"percentile_80": 0.010261738599844314,
"percentile_90": 0.013813444118403915,
"percentile_95": 0.015781731761280615,
"percentile_99": 0.016473069202959836
},
#time to ack
"tt_ack": {
"min": 0.404374361038208,
"max": 0.544623851776123,
"median": 0.464330792427063,
"mean": 0.46600091457366943,
"percentile_80": 0.5103373527526855,
"percentile_90": 0.5210925340652466,
"percentile_95": 0.5295597910881042,
"percentile_99": 0.5416110396385193
},
"response_time": {
"min": 2.102457046508789,
"max": 3.7387688159942627,
"median": 2.3843793869018555,
"mean": 2.5091602653265,
"percentile_80": 2.4795608520507812,
"percentile_90": 2.992232322692871,
"percentile_95": 3.541854977607727,
"percentile_99": 3.6993860483169554
},
"output_tokens": {
"min": 200,
"max": 200,
"median": 200.0,
"mean": 200.0,
"percentile_80": 200.0,
"percentile_90": 200.0,
"percentile_95": 200.0,
"percentile_99": 200.0
},
"input_tokens": {
"min": 2000,
"max": 2000,
"median": 2000.0,
"mean": 2000.0,
"percentile_80": 2000.0,
"percentile_90": 2000.0,
"percentile_95": 2000.0,
"percentile_99": 2000.0
},
}
}
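As a sanity check on the example above, the throughput figures are total tokens divided by `full_duration`: 16 requests × 200 output tokens = 3,200 tokens, and 3,200 ÷ 20.09 s ≈ 159.26, matching `output_tokens_throughput` (likewise 16 × 2,000 ÷ 20.09 ≈ 1,592.57 for `input_tokens_throughput`).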
Contributions to this tool are welcome!