This page is a project tracker for getting halo models such as llama3 and grok-1 working on one or more MI3xx GPUs using SHARK/IREE.
- V1 (Nov 2024): SDXL on MI300X in SPX mode, with performance claims on 8x MI300X in CPX mode
- V2 (Dec 2024): llama3.1 405B sharded across 8 MI300X GPUs, performant at the level of vLLM PyTorch (fused-ops eager mode)
TPn: Tensor parallelism across n GPUs, where a large tensor is sharded across the GPUs using sharktank and the scatter/gather to/from the GPUs is expressed in a single MLIR module
TTFT: Time To First Token (time from the start of prompt processing to the first token generated by the prefill stage)
ITL: Inter-Token Latency (average time between successive tokens generated in the decode phase, from the second token onwards)
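For reference, a minimal sketch (illustrative only, not taken from any benchmark harness on this page) of how TTFT and ITL are derived from per-token arrival timestamps:

```python
# Minimal sketch: deriving TTFT and ITL from per-token arrival times.
# Function and variable names here are illustrative, not an existing API.
def ttft_and_itl(request_start_s, token_arrival_s):
    """request_start_s: wall-clock time the prompt was submitted.
    token_arrival_s: wall-clock times at which each generated token arrived."""
    ttft = token_arrival_s[0] - request_start_s           # prefill + first token
    gaps = [b - a for a, b in zip(token_arrival_s, token_arrival_s[1:])]
    itl = sum(gaps) / len(gaps) if gaps else 0.0          # average gap, 2nd token onwards
    return ttft, itl

# Example: prompt submitted at t=0.0 s, tokens arrive at 1.80 s, 1.86 s, 1.92 s
# -> TTFT = 1.80 s, ITL ~= 0.06 s
print(ttft_and_itl(0.0, [1.80, 1.86, 1.92]))
```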
See the latest Nightly Llama Test Report. Use the Nod.AI Lab page to ssh into the SharkMi300X machine to find the logs and artifacts needed to triage failures. File an issue (if not already filed/listed) and add it to the Issues table below.
Category | Issue link | Assigned to | Status / notes |
---|---|---|---|
quark quantization | QUARK-71 | Bowen Bow | FP8 matmul should be used in attention |
iree codegen | 18864 | Ian Wood | OOM for 70B |
iree Negative Memory | 19077 | unassigned | op uses -131072 bytes of shared memory |
(Model is assumed to be llama3.1 in the following table, e.g. "8B FP8" means "llama3.1 8B FP8 model")
Item | Current Week (Nov 4-8) | Next Week (Nov 11-15) |
---|---|---|
Machine and Storage | - @saienduri: Set up one more 8x MI300 air-cooled machine (SharkMi300X-4) with 60TB (ETA: 11/5) | |
Sharktank Modeling | - @kyle: Generate, verify, and compile-to-vmfb 405B TP8 FP16 non-decomposed MLIR (ETA: 11/1) - @dan: Add perplexity test for eager mode (PyTorch run) for 8B FP16 and refresh MLIR in the table and on the SharkMi300X machine (ETA: 11/1) - @Dan: Get quark FP8 attention model (ETA: 11/8) - @archana: Debug perplexity test numerics for vmfb for 8B FP16 (Done: 11/7) - @Ian: Add VAE support in sharktank (ETA: 11/8) - @George: Add CLIP support through sharktank (ETA: 11/6) - @Kyle: Add MMDiT for Flux through sharktank (ETA: 11/15) - @Boian: T5XXL implementation through sharktank (ETA: 11/15) | - @Dan: Debug FP8 numeric issue (ETA: 11/11) - @Stephen: Debug Python 3.11/3.12 issues in sharktank (ETA: 11/13) |
Sharding | - @boian: 8-CPU-core sharded FP16 numerically verified PR (wrong numerics issue) (ETA: 11/4) | |
Performance Tuning | - @rob: Reduce IR size and complexity (ETA: 11/4) | |
IREE Code Generation | - @mahesh: Support for non-decomposed decode (ETA: 11/5) | - @stan: FP8 attention (ETA: 11/15) |
Serving | - @xida: Fix the KV cache corruption issue for large prompts (Done: 11/1) - @xida: Get shortfin working for llama3.1 8B FP16 on MI300 (ETA: 11/4) - @ean: Instructions to run SDXL shortfin (Done: 11/4) - @Stephen: Land integration (ETA: 11/5) - @Stephen: Batch-size shortfin benchmark tests with sglang on GPU (ETA: 11/5) - @xida: Add a search algorithm (beam search) beyond greedy to improve chatbot output (ETA: TBD) - @ean: Fix and get batching working properly (ETA: 11/6) | - @Ean: Get multi-GPU working (ETA: 11/13) - @xida: Add a search algorithm (beam search) beyond greedy to improve chatbot output (ETA: 11/15) - @Xida: Debug shortfin numerics (ETA: 11/12) - @Stephen: Select GPU with llama shortfin (ETA: 11/12) - @Xida: Rectify configs for sharktank and shortfin (ETA: 11/12) - @xida: Verify whether the numeric issue is within shortfin or sharktank (ETA: 11/13) - @Avi: Help find SDXL shortfin multi-GPU config for perf (ETA: 11/13) |
Test Automation | - @avi: Finish 8B FP16 automation (ETA: 11/1) - Have automation dashboard showing llama3.1 tests running (Done: 11/5) | - @Avi: Benchmark latency 8B numbers (decomposed and not, for prefill) and TP8 (ETA: 11/11) |
The following naming convention should be used for weights and artifacts (on SharkMI300x and other similar machines):
UnSharded Weights:
/data/<model_name>/weights/<model_size>/<data_type>/<model_name>_<model_size>_<data_type>.irpa
Example: /data/llama-3.1/weights/405b/fp16/llama3.1_405b_fp16.irpa
Sharded Weights:
/data/<model_name>/weights/<model_size>/<data_type>/<shard_size>/<model_name>_<model_size>_<data_type>_<shard_size>_parameters.rank<n>.irpa
Example: /data/llama-3.1/weights/405b/fp16/tp8/llama3.1_405b_fp16_tp8_parameters.rank0.irpa
Artifacts:
/data/<model_name>/artifacts/<model_size>/<model_name>_<model_size>_<data_type>_<attention_kind>_<shard_size>_bs<batch_size>.[mlir | vmfb]
Example: /data/llama-3.1/artifacts/405b/llama3.1_405b_fp16_nondecomposed_tp8_bs4.mlir
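The same convention expressed in code, as a small sketch (the helpers below are hypothetical, not part of sharktank; they just mirror the patterns above):

```python
# Hypothetical helpers that build paths following the convention above.
def unsharded_weights_path(model, size, dtype):
    # e.g. ("llama-3.1", "405b", "fp16") -> /data/llama-3.1/weights/405b/fp16/llama3.1_405b_fp16.irpa
    base = model.replace("-", "")
    return f"/data/{model}/weights/{size}/{dtype}/{base}_{size}_{dtype}.irpa"

def sharded_weights_path(model, size, dtype, tp, rank):
    base = model.replace("-", "")
    return (f"/data/{model}/weights/{size}/{dtype}/tp{tp}/"
            f"{base}_{size}_{dtype}_tp{tp}_parameters.rank{rank}.irpa")

print(unsharded_weights_path("llama-3.1", "405b", "fp16"))
print(sharded_weights_path("llama-3.1", "405b", "fp16", 8, 0))
```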
To generate artifacts on SharkMI300x, follow the sharktank setup instructions, then:
- Export unsharded MLIR (8B FP16 example):
python -m sharktank.examples.export_paged_llm_v1 --irpa-file=/data/llama-3.1/weights/8b/fp16/llama3.1_8b_fp16.irpa --output-mlir f16_dc.mlir --bs=1 --attention-kernel=decomposed
- Shard irpa file:
python3 -m sharktank.examples.sharding.shard_llm_dataset --irpa-file llama3_405b_f16.irpa --output-irpa test.irpa --tensor-parallelism-size 8
- Export to MLIR:
python3 -m sharktank.examples.export_paged_llm_v1 --irpa-file=test.irpa --output-mlir=405b_f16_tp8_decomposed.mlir --output-config=405b_f16_tp8_decomposed.json --bs=4 --attention-kernel decomposed
- Compile (currently FAILS with a compile error that seems related to this PR):
iree-compile 405b_f16_tp8_decomposed.mlir --iree-hip-target=gfx942 --iree-hal-target-backends=rocm -o=405b_f16_tp8_decomposed.vmfb --iree-hal-target-device=hip[0] --iree-hal-target-device=hip[1] --iree-hal-target-device=hip[2] --iree-hal-target-device=hip[3] --iree-hal-target-device=hip[4] --iree-hal-target-device=hip[5] --iree-hal-target-device=hip[6] --iree-hal-target-device=hip[7] --iree-hal-force-indirect-command-buffers=true --iree-stream-resource-memory-model=discrete --iree-hip-legacy-sync=false
(MI300X GPU, SPX Mode)
Item | Generate MLIR | Compile to vmfb | IREE invocation | IREE numeric | Serving numeric |
---|---|---|---|---|---|
llama3.1-8B-FP16-decomposed | PASS TP1 mlir gguf [irpa](https://sharkblobs.blob.core.windows.net/halo-models/llm-dev/llama3_8b/8b_f16.irpa) | PASS vmfb | PASS | tbd | tbd |
llama3.1-8B-FP16-decomposed-TP8 | PASS (MLIR) | PASS | PASS | FAIL (probably) | tbd |
llama3.1-70B-FP16-decomposed | PASS TP1 mlir gguf irpa | PASS vmfb | FAIL OOM | tbd | tbd |
llama3.1-405B-FP16-decomposed | PASS TP1 mlir gguf | tbd | tbd | tbd | tbd |
llama3.1-405B-FP16-decomposed-TP8 | PASS MLIR | PASS vmfb | FAIL Registers | tbd | tbd |
llama3.1-8B-FP8-decomposed | PASS TP1 mlir irpa | Fails in iree, patch | tbd | tbd | tbd |
llama3.1-70B-FP8-decomposed | PASS TP1 mlir irpa | Fails in iree, patch | tbd | tbd | tbd |
llama3.1-405B-FP8-decomposed | tbd | tbd | tbd | tbd | tbd |
(MI300X GPU, SPX Mode)
Item | Generate MLIR | Compile to vmfb | IREE invocation | IREE numeric | Serving numeric |
---|---|---|---|---|---|
llama3.1-8B-FP16 (prefill) | PASS mlir | PASS compile | PASS run | tbd | tbd |
llama3.1-8B-FP16 | PASS mlir | Fails in iree, patch | tbd | tbd | tbd |
llama3.1-70B-FP16 | PASS mlir | Fails in iree, patch | tbd | tbd | tbd |
llama3.1-405B-FP16 | PASS mlir_tp8 | PASS | FAIL OOM | tbd | tbd |
llama3.1-8B-FP8 | PASS mlir | tbd | tbd | tbd | tbd |
llama3.1-70B-FP8 | ETA: 11/1 | tbd | tbd | tbd | tbd |
llama3.1-405B-FP8 | ETA: 11/5 | tbd | tbd | tbd | tbd |
llama-toy-size-FP32-TP2-CPU | PASS | PASS | tbd | tbd | tbd |
(only decode is decomposed) (MI300X GPU, SPX Mode)
Item | Generate MLIR | Compile to vmfb | IREE invocation | IREE numeric | Serving numeric |
---|---|---|---|---|---|
llama3.1-8B-FP16 | PASS mlir_tp1 | tbd | tbd | tbd | tbd |
llama3.1-70B-FP16 | tbd | tbd | tbd | tbd | tbd |
llama3.1-405B-FP16 | PASS mlir_tp8 | tbd | tbd | tbd | tbd |
llama3.1-8B-FP8 | PASS mlir | Fail (attention, Dan currently looking into this) | tbd | tbd | tbd |
llama3.1-70B-FP8 | tbd | tbd | tbd | tbd | tbd |
llama3.1-405B-FP8 | tbd | tbd | tbd | tbd | tbd |
llama-toy-size-FP32-TP2-CPU | tbd | tbd | tbd | tbd | tbd |
Generate IR
python3 -m sharktank.examples.export_paged_llm_v1 --irpa-file <input irpa path with correct sharding and dtype> --output-mlir <output mlir path> --bs <batch size> --tensor-parallelism-size <TP size, if sharding> --attention-kernel <decomposed or torch_sdpa> [--no-fake-quant, FP8 only]
Generate vmfb
iree-compile <input mlir path> --iree-hal-target-backends=rocm --iree-hip-target=gfx942 -o <output vmfb path>
Follow the steps here
In the browser, click on sharkblobs, then click on "Blob containers", and then click on "halo-models".
Or use the command line by first installing the az CLI:
curl -sL https://aka.ms/InstallAzureCLIDeb | sudo bash
Then get the account key for the storage account: click on "Storage Accounts" under Azure Services (or search for "sharkblobs" in the top search bar), click on sharkblobs, and then, in the left sidebar under "Security + networking", click on "Access keys". Copy the account key and use it in the following commands. To upload:
az storage blob upload --account-name sharkblobs --container-name halo-models --name <azure path, example: halo-models/llama3_8b/tp1/llama.mlir> --file <local_path_on_computer> --account-key <key_retrieved_from_directions_above>
To download:
az storage blob download --account-name sharkblobs --container-name halo-models --name <azure path, example: halo-models/llama3_8b/tp1/llama.mlir> --file <local_path_on_computer> --account-key <key_retrieved_from_directions_above>
If you are downloading from "sharkpublic", replace "sharkblobs" in the instructions above with "sharkpublic" and use your account access key for sharkpublic. Example:
az storage blob download --account-name sharkpublic --container-name sharkpublic --name ian/llama8b_f16.gguf --file llama8b_f16.gguf --account-key <key string>
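If you prefer Python over the az CLI, a minimal sketch using the azure-storage-blob package (`pip install azure-storage-blob`); the account, container, and blob names below are just the example from above, and the key is a placeholder:

```python
# Minimal sketch: download a blob from sharkpublic with the azure-storage-blob SDK.
from azure.storage.blob import BlobClient

blob = BlobClient(
    account_url="https://sharkpublic.blob.core.windows.net",
    container_name="sharkpublic",
    blob_name="ian/llama8b_f16.gguf",
    credential="<key string>",  # account key, same one the az CLI commands use
)
with open("llama8b_f16.gguf", "wb") as f:
    f.write(blob.download_blob().readall())
```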
Follow the steps here.
Follow the steps here
Feature | Description | Enabled | Enablement Requirements | Reference(s) |
---|---|---|---|---|
gen | Generate shortfin completion, given a prompt | Yes | Enabled | Shortfin Implementation |
streaming | Stream shortfin completion, given a prompt | Yes | Enabled | Shortfin Implementation |
run_batch | Run batch of disjoint requests with continuous batching | Yes | Enabled | Batch Docs |
fork | Launch parallel prompts | Yes | Enabled | Fork Docs |
choices | Given a set of choices, generate a response based on best log probs | No | Should work with greedy; needs backend implementation | Greedy Token Selection, OpenAI Implementation |
image | Pass an image as part of a multi-modal prompt | No | Multi-modal not supported by shortfin | sgl.image Docs |
regex | Specify a regular expression as a decoding constraint | No | Only supported for local models | Regex Docs |
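For reference, a minimal client sketch exercising the enabled features (gen, run_batch) against a shortfin server; this assumes the standard sglang frontend API and a server already listening at the URL below, both of which are assumptions rather than details from this page:

```python
import sglang as sgl

# Point the sglang frontend at a running shortfin/sglang-compatible server
# (the URL is a placeholder).
sgl.set_default_backend(sgl.RuntimeEndpoint("http://localhost:8000"))

@sgl.function
def answer(s, question):
    s += "Q: " + question + "\n"
    s += "A: " + sgl.gen("answer", max_tokens=64)  # "gen" feature

# Single completion.
state = answer.run(question="What is tensor parallelism?")
print(state["answer"])

# "run_batch" feature: disjoint requests handled with continuous batching.
states = answer.run_batch([
    {"question": "What is TTFT?"},
    {"question": "What is ITL?"},
])
for st in states:
    print(st["answer"])
```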
The latest benchmark results for the SGLang integration can be found here
(Note: Do not update this one)
Models | compile | inference (SPX mode) | tracy |
---|---|---|---|
llama3.1-8b-Q4_1 | PASS | prefill (1817 ms), decode (57.3 ms), commands | prefill decode |
llama3.1-8b-Q4_k | PASS | | |
llama3.1-70b-Q4_1 | PASS | prefill (3543 ms), decode (213 ms), commands | prefill decode |
grok-1-Q4_1 | PASS | FAIL, out of memory | prefill decode |
(Note: From 10/20/2024 onwards, update the Schedule/Numerics tables for llama3.1 artifacts instead of this table.)
- Check small files and MLIR files into llm-dev.
- Upload large files to the "halo-models" container on sharkblobs (Azure) and put a link to them in the table(s) below.
- Store very large files on the GPU server and note the machine name and file location in the table(s) below.
Note: If a link to Azure sharkblobs below gives you an error, either use the az CLI to download the file (see the section Accessing sharkblobs on Azure) or click on sharkblobs, then click on "Blob containers", and navigate to the file manually to download it.
Models | FP16 | FP8 | Q4_1 | Q4_K | Attention IRs |
---|---|---|---|---|---|
llama2-7b | irpa mlir | | | | Attention IRs |
llama3-8b | mlir gguf | mlir irpa | mlir gguf | mlir gguf | |
llama3-70b | mlir gguf | mlir irpa | mlir gguf | mlir gguf | |
llama3-405b | mlir gguf | mlir gguf | mlir gguf | ||
grok-1 | mlir gguf | NA | mlir gguf | gguf |