This page is a project tracker for getting halo models such as llama3 and grok-1 working on one or more MI3xx GPUs using SHARK/IREE.
- V1 (Nov 2024): SDXL on MI300X in SPX mode, with performance claims on 8x MI300X in CPX mode
- V2 (Dec 2024): llama3.1 405B sharded across 8 MI300X GPUs, performant at the level of vLLM PyTorch (fused-ops eager mode)
TPn: Tensor parallelism across n GPUs, where a large tensor is sharded across the GPUs using sharktank and the scatter/gather to/from the GPUs is expressed in a single MLIR module
TTFT: Time To First Token (time from the start of prompt processing to the first token generated by the prefill stage)
ITL: Inter-Token Latency (average time between successive tokens generated in the decode phase, from the second token onwards)
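For reference, a minimal sketch (illustrative only, not taken from any benchmark harness on this page) of how TTFT and ITL are derived from per-token arrival timestamps:

```python
# Minimal sketch: deriving TTFT and ITL from per-token arrival times.
# Function and variable names here are illustrative, not an existing API.
def ttft_and_itl(request_start_s, token_arrival_s):
    """request_start_s: wall-clock time the prompt was submitted.
    token_arrival_s: wall-clock times at which each generated token arrived."""
    ttft = token_arrival_s[0] - request_start_s           # prefill + first token
    gaps = [b - a for a, b in zip(token_arrival_s, token_arrival_s[1:])]
    itl = sum(gaps) / len(gaps) if gaps else 0.0          # average gap, 2nd token onwards
    return ttft, itl

# Example: prompt submitted at t=0.0 s, tokens arrive at 1.80 s, 1.86 s, 1.92 s
# -> TTFT = 1.80 s, ITL ~= 0.06 s
print(ttft_and_itl(0.0, [1.80, 1.86, 1.92]))
```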
See the latest Nightly Llama Test Report. Use the Nod.AI Lab page to ssh into the SharkMi300X machine to find the logs and artifacts needed to triage failures. File an issue (if not already filed/listed) and add it to the Issues table below.
Category | Issue link | Assigned to | Status / notes |
---|---|---|---|
quark quantization | QUARK-71 | Bowen Bow | FP8 matmul should be used in attention |
iree codegen | 18864 | Ian Wood | OOM for 70B |
iree Negative Memory | 19077 | unassigned | op uses -131072 bytes of shared memory |
(Model is assumed to be llama3.1 in the following table, e.g. "8B FP8" means "llama3.1 8B FP8 model")
Item | Current Week (Nov 4-8) | Next Week (Nov 11-15) |
---|---|---|
Machine and Storage | - @saienduri: Set up one more 8x MI300 air-cooled machine (SharkMi300X-4) with 60TB (ETA: 11/5) | |
Sharktank Modeling | - @kyle: Generate, verify, and compile-to-vmfb 405B TP8 FP16 non-decomposed MLIR (ETA: 11/1) - @dan: Add perplexity test for eager mode (PyTorch run) for 8B FP16 and refresh MLIR in the table and on the SharkMi300X machine (ETA: 11/1) - @Dan: Get quark FP8 attention model (ETA: 11/8) - @archana: Debug perplexity test numerics for vmfb for 8B FP16 (Done: 11/7) - @Ian: Add VAE support in sharktank (ETA: 11/8) - @George: Add CLIP support through sharktank (ETA: 11/6) - @Kyle: Add MMDiT for Flux through sharktank (ETA: 11/15) - @Boian: T5XXL implementation through sharktank (ETA: 11/15) | - @Dan: Debug FP8 numeric issue (ETA: 11/11) - @Stephen: Debug Python 3.11/3.12 issues in sharktank (ETA: 11/13) |
Sharding | - @boian: 8-CPU-core sharded FP16 numerically verified PR (wrong numerics issue) (ETA: 11/4) | |
Performance Tuning | - @rob: Reduce IR size and complexity (ETA: 11/4) | |
IREE Code Generation | - @mahesh: Support for non-decomposed decode (ETA: 11/5) | - @stan: FP8 attention (ETA: 11/15) |
Serving | - @xida: Fix the KV cache corruption issue for large prompts (Done: 11/1) - @xida: Get shortfin working for llama3.1 8B FP16 on MI300 (ETA: 11/4) - @ean: Instructions to run SDXL shortfin (Done: 11/4) - @Stephen: Land integration (ETA: 11/5) - @Stephen: Batch-size shortfin benchmark tests with sglang on GPU (ETA: 11/5) - @xida: Add a search algorithm (beam search) beyond greedy to improve chatbot output (ETA: TBD) - @ean: Fix and get batching working properly (ETA: 11/6) | - @Ean: Get multi-GPU working (ETA: 11/13) - @xida: Add a search algorithm (beam search) beyond greedy to improve chatbot output (ETA: 11/15) - @Xida: Debug shortfin numerics (ETA: 11/12) - @Stephen: Select GPU with llama shortfin (ETA: 11/12) - @Xida: Rectify configs for sharktank and shortfin (ETA: 11/12) - @xida: Verify whether the numeric issue is within shortfin or sharktank (ETA: 11/13) - @Avi: Help find SDXL shortfin multi-GPU config for perf (ETA: 11/13) |
Test Automation | - @avi: Finish 8B FP16 automation (ETA: 11/1) - Have automation dashboard showing llama3.1 tests running (Done: 11/5) | - @Avi: Benchmark latency 8B numbers (decomposed and not, for prefill) and TP8 (ETA: 11/11) |
The following naming convention should be used for weights and artifacts (on SharkMI300x and other similar machines):
UnSharded Weights:
/data/<model_name>/weights/<model_size>/<data_type>/<model_name>_<model_size>_<data_type>.irpa
Example: /data/llama-3.1/weights/405b/fp16/llama3.1_405b_fp16.irpa
Sharded Weights:
/data/<model_name>/weights/<model_size>/<data_type>/<shard_size>/<model_name>_<model_size>_<data_type>_<shard_size>_parameters.rank<n>.irpa
Example: /data/llama-3.1/weights/405b/fp16/tp8/llama3.1_405b_fp16_tp8_parameters.rank0.irpa
Artifacts:
/data/<model_name>/artifacts/<model_size>/<model_name>_<model_size>_<data_type>_<attention_kind>_<shard_size>_bs<batch_size>.[mlir | vmfb]
Example: /data/llama-3.1/artifacts/405b/llama3.1_405b_fp16_nondecomposed_tp8_bs4.mlir
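The same convention expressed in code, as a small sketch (the helpers below are hypothetical, not part of sharktank; they just mirror the patterns above):

```python
# Hypothetical helpers that build paths following the convention above.
def unsharded_weights_path(model, size, dtype):
    # e.g. ("llama-3.1", "405b", "fp16") -> /data/llama-3.1/weights/405b/fp16/llama3.1_405b_fp16.irpa
    base = model.replace("-", "")
    return f"/data/{model}/weights/{size}/{dtype}/{base}_{size}_{dtype}.irpa"

def sharded_weights_path(model, size, dtype, tp, rank):
    base = model.replace("-", "")
    return (f"/data/{model}/weights/{size}/{dtype}/tp{tp}/"
            f"{base}_{size}_{dtype}_tp{tp}_parameters.rank{rank}.irpa")

print(unsharded_weights_path("llama-3.1", "405b", "fp16"))
print(sharded_weights_path("llama-3.1", "405b", "fp16", 8, 0))
```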
To generate artifacts on SharkMI300x, follow the sharktank setup instructions, then:
- Export unsharded MLIR (8B FP16 example):
python -m sharktank.examples.export_paged_llm_v1 --irpa-file=/data/llama-3.1/weights/8b/fp16/llama3.1_8b_fp16.irpa --output-mlir f16_dc.mlir --bs=1 --attention-kernel=decomposed
- Shard irpa file:
python3 -m sharktank.examples.sharding.shard_llm_dataset --irpa-file llama3_405b_f16.irpa --output-irpa test.irpa --tensor-parallelism-size 8
- Export to MLIR:
python3 -m sharktank.examples.export_paged_llm_v1 --irpa-file=test.irpa --output-mlir=405b_f16_tp8_decomposed.mlir --output-config=405b_f16_tp8_decomposed.json --bs=4 --attention-kernel decomposed
- Compile (currently FAILS with a compile error that seems related to this PR):
iree-compile 405b_f16_tp8_decomposed.mlir --iree-hip-target=gfx942 --iree-hal-target-backends=rocm -o=405b_f16_tp8_decomposed.vmfb --iree-hal-target-device=hip[0] --iree-hal-target-device=hip[1] --iree-hal-target-device=hip[2] --iree-hal-target-device=hip[3] --iree-hal-target-device=hip[4] --iree-hal-target-device=hip[5] --iree-hal-target-device=hip[6] --iree-hal-target-device=hip[7] --iree-hal-force-indirect-command-buffers=true --iree-stream-resource-memory-model=discrete --iree-hip-legacy-sync=false
(MI300X GPU, SPX Mode)
Item | Generate MLIR | Compile to vmfb | IREE invocation | IREE numeric | Serving numeric |
---|---|---|---|---|---|
llama3.1-8B-FP16-decomposed | PASS TP1 mlir gguf [irpa](https://sharkblobs.blob.core.windows.net/halo-models/llm-dev/llama3_8b/8b_f16.irpa) | PASS vmfb | PASS | tbd | tbd |
llama3.1-8B-FP16-decomposed-TP8 | PASS (MLIR) | PASS | PASS | FAIL (probably) | tbd |
llama3.1-70B-FP16-decomposed | PASS TP1 mlir gguf irpa | PASS vmfb | FAIL OOM | tbd | tbd |
llama3.1-405B-FP16-decomposed | PASS TP1 mlir gguf | tbd | tbd | tbd | tbd |
llama3.1-405B-FP16-decomposed-TP8 | PASS MLIR | PASS vmfb | FAIL Registers | tbd | tbd |
llama3.1-8B-FP8-decomposed | PASS TP1 mlir irpa | Fails in iree, patch | tbd | tbd | tbd |
llama3.1-70B-FP8-decomposed | PASS TP1 mlir irpa | Fails in iree, patch | tbd | tbd | tbd |
llama3.1-405B-FP8-decomposed | tbd | tbd | tbd | tbd | tbd |
(MI300X GPU, SPX Mode)
Item | Generate MLIR | Compile to vmfb | IREE invocation | IREE numeric | Serving numeric |
---|---|---|---|---|---|
llama3.1-8B-FP16 (prefill) | PASS mlir | PASS compile | PASS run | tbd | tbd |
llama3.1-8B-FP16 | PASS mlir | Fails in iree, patch | tbd | tbd | tbd |
llama3.1-70B-FP16 | PASS mlir | Fails in iree, patch | tbd | tbd | tbd |
llama3.1-405B-FP16 | PASS mlir_tp8 | PASS | FAIL OOM | tbd | tbd |
llama3.1-8B-FP8 | PASS mlir | tbd | tbd | tbd | tbd |
llama3.1-70B-FP8 | ETA: 11/1 | tbd | tbd | tbd | tbd |
llama3.1-405B-FP8 | ETA: 11/5 | tbd | tbd | tbd | tbd |
llama-toy-size-FP32-TP2-CPU | PASS | PASS | tbd | tbd | tbd |
(only decode is decomposed) (MI300X GPU, SPX Mode)
Item | Generate MLIR | Compile to vmfb | IREE invocation | IREE numeric | Serving numeric |
---|---|---|---|---|---|
llama3.1-8B-FP16 | PASS mlir_tp1 | tbd | tbd | tbd | tbd |
llama3.1-70B-FP16 | tbd | tbd | tbd | tbd | tbd |
llama3.1-405B-FP16 | PASS mlir_tp8 | tbd | tbd | tbd | tbd |
llama3.1-8B-FP8 | PASS mlir | Fail (attention, Dan currently looking into this) | tbd | tbd | tbd |
llama3.1-70B-FP8 | tbd | tbd | tbd | tbd | tbd |
llama3.1-405B-FP8 | tbd | tbd | tbd | tbd | tbd |
llama-toy-size-FP32-TP2-CPU | tbd | tbd | tbd | tbd | tbd |
Generate IR
python3 -m sharktank.examples.export_paged_llm_v1 --irpa-file <input irpa path with correct sharding and dtype> --output-mlir <output mlir path> --bs <batch size> --tensor-parallelism-size <TP size, if sharding> --attention-kernel <decomposed or torch_sdpa> [--no-fake-quant, FP8 only]
Generate vmfb
iree-compile <input mlir path> --iree-hal-target-backends=rocm --iree-hip-target=gfx942 -o <output vmfb path>
Follow the steps here
In the browser, click on sharkblobs, then click on "Blob containers", and then click on "halo-models".
Or use the command line by first installing the az CLI:
curl -sL https://aka.ms/InstallAzureCLIDeb | sudo bash
Then get the account key for the storage account: click on "Storage Accounts" under Azure Services (or search for "sharkblobs" in the top search bar), click on sharkblobs, and then, in the left sidebar under "Security + networking", click on "Access keys". Copy the account key and use it in the following commands. To upload:
az storage blob upload --account-name sharkblobs --container-name halo-models --name <azure path, example: halo-models/llama3_8b/tp1/llama.mlir> --file <local_path_on_computer> --account-key <key_retrieved_from_directions_above>
To download:
az storage blob download --account-name sharkblobs --container-name halo-models --name <azure path, example: halo-models/llama3_8b/tp1/llama.mlir> --file <local_path_on_computer> --account-key <key_retrieved_from_directions_above>
If you are downloading from "sharkpublic", replace "sharkblobs" in the instructions above with "sharkpublic" and use your account access key for sharkpublic. Example:
az storage blob download --account-name sharkpublic --container-name sharkpublic --name ian/llama8b_f16.gguf --file llama8b_f16.gguf --account-key <key string>
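If you prefer Python over the az CLI, a minimal sketch using the azure-storage-blob package (`pip install azure-storage-blob`); the account, container, and blob names below are just the example from above, and the key is a placeholder:

```python
# Minimal sketch: download a blob from sharkpublic with the azure-storage-blob SDK.
from azure.storage.blob import BlobClient

blob = BlobClient(
    account_url="https://sharkpublic.blob.core.windows.net",
    container_name="sharkpublic",
    blob_name="ian/llama8b_f16.gguf",
    credential="<key string>",  # account key, same one the az CLI commands use
)
with open("llama8b_f16.gguf", "wb") as f:
    f.write(blob.download_blob().readall())
```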
Follow the steps here.
Follow the steps here
Feature | Description | Enabled | Enablement Requirements | Reference(s) |
---|---|---|---|---|
gen | Generate shortfin completion, given a prompt | Yes | Enabled | Shortfin Implementation |
streaming | Stream shortfin completion, given a prompt | Yes | Enabled | Shortfin Implementation |
run_batch | Run batch of disjoint requests with continuous batching | Yes | Enabled | Batch Docs |
fork | Launch parallel prompts | Yes | Enabled | Fork Docs |
choices | Given a set of choices, generate a response based on best log probs | No | Should work with greedy; needs backend implementation | Greedy Token Selection, OpenAI Implementation |
image | Pass an image as part of a multi-modal prompt | No | Multi-modal not supported by shortfin | sgl.image Docs |
regex | Specify a regular expression as a decoding constraint | No | Only supported for local models | Regex Docs |
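For reference, a minimal client sketch exercising the enabled features (gen, run_batch) against a shortfin server; this assumes the standard sglang frontend API and a server already listening at the URL below, both of which are assumptions rather than details from this page:

```python
import sglang as sgl

# Point the sglang frontend at a running shortfin/sglang-compatible server
# (the URL is a placeholder).
sgl.set_default_backend(sgl.RuntimeEndpoint("http://localhost:8000"))

@sgl.function
def answer(s, question):
    s += "Q: " + question + "\n"
    s += "A: " + sgl.gen("answer", max_tokens=64)  # "gen" feature

# Single completion.
state = answer.run(question="What is tensor parallelism?")
print(state["answer"])

# "run_batch" feature: disjoint requests handled with continuous batching.
states = answer.run_batch([
    {"question": "What is TTFT?"},
    {"question": "What is ITL?"},
])
for st in states:
    print(st["answer"])
```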
The latest benchmark results for the SGLang integration can be found here
(Note: Do not update this one)
Models | compile | inference (SPX mode) | tracy |
---|---|---|---|
llama3.1-8b-Q4_1 | PASS | prefill (1817 ms), decode (57.3 ms), commands | prefill decode |
llama3.1-8b-Q4_k | PASS | | |
llama3.1-70b-Q4_1 | PASS | prefill (3543 ms), decode (213 ms), commands | prefill decode |
grok-1-Q4_1 | PASS | FAIL, out of memory | prefill decode |
(Note: From 10/20/2024 onwards, update the Schedule/Numerics tables for llama3.1 artifacts instead of this table.)
- Check small files and MLIR files into llm-dev.
- Upload large files to the "halo-models" container on sharkblobs (Azure) and put a link to them in the table(s) below.
- Store very large files on the GPU server and note the machine name and file location in the table(s) below.
Note: If a link to Azure sharkblobs below gives you an error, either use the az CLI to download the file (see the section Accessing sharkblobs on Azure) or click on sharkblobs, then click on "Blob containers", and navigate to the file manually to download it.
Models | FP16 | FP8 | Q4_1 | Q4_K | Attention IRs |
---|---|---|---|---|---|
llama2-7b | irpa mlir | | | | Attention IRs |
llama3-8b | mlir gguf | mlir irpa | mlir gguf | mlir gguf | |
llama3-70b | mlir gguf | mlir irpa | mlir gguf | mlir gguf | |
llama3-405b | mlir gguf | mlir gguf | mlir gguf | ||
grok-1 | mlir gguf | NA | mlir gguf | gguf |