🚀 Please read this issue for LLaMa GPU inference
Download onnx models here:
| Model | Precision | Size | URL | Demo |
| --- | --- | --- | --- | --- |
| LLaMa-7B | fp32 | 26GB | huggingface | demo_llama.py |
| LLaMa-7B | fp16 | 13GB | huggingface or the hardware model zoo (硬件模型库) | demo_llama.py |
| RWKV-4-palm-430M | fp16 | 920MB | huggingface or the hardware model zoo (硬件模型库) | demo_rwkv.py |
- 05/18 release RWKV-4 onnx models, a standalone script, and an LLM structure comparison
- 05/09 trt gives wrong output values until issue 2928 is solved
- 04/19 remove GPTQ zero-point guidance
- 04/18 export mixed-precision quant table from GPTQ-for-LLaMa
- 04/11 add 13GB onnx-fp16 models
- 04/11 add memory pool, support 2GB RAM laptop ⭐
- 04/10 reduce onnx model size to 26GB
- 04/10 support `temperature`, add `topk` logits warp
- 04/07 add onnxruntime demo
- 04/05 init project
- Release LLaMa-7B and RWKV-400M onnx models and their standalone onnxruntime demos
- No `torch` or `transformers` required
- Support memory pool, works on a 2GB laptop/PC (very slow 🐢)
Why do this?

- Visualization. `graphviz` crashed on the LLaMa model; an LLM visualization tool needs to support nesting or operator folding.
- Quantization. LLMs often repeat themselves, just like fractals. For LLaMa quantization, loading part of the decoder backbone (about 400MB) is enough, so it can be quantized partially.
- Embedded devices. Small boards hit I/O errors when `dd`-ing a single big file.
- Distributed systems. Running LLM inference across many hybrid (FPGA/NPU/GPGPU) devices becomes simpler.
- onnx tools. Device manufacturers already support onnx well; there is no reason to neglect it.
Here is the call graph for LLaMa (RWKV is similar):
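As a rough textual companion to that graph, here is a hedged onnxruntime sketch of how split onnx parts could be chained. The file names, tensor names, and layer count below are assumptions for illustration only; see `demo_llama.py` for the real implementation.

```python
# Hedged sketch only: chain split onnx parts with onnxruntime.
# File names, tensor names, and the layer count are assumptions;
# attention mask / position ids / kv-cache handling are omitted for brevity.
import numpy as np
import onnxruntime as ort

def load(path):
    return ort.InferenceSession(path, providers=["CPUExecutionProvider"])

embed = load("embed.onnx")                                   # token ids -> hidden states
decoders = [load(f"decoder-{i}.onnx") for i in range(32)]    # one part per decoder layer
head = load("norm_head.onnx")                                # hidden states -> logits

input_ids = np.array([[1, 15043]], dtype=np.int64)           # some prompt token ids
hidden = embed.run(None, {"input_ids": input_ids})[0]

for dec in decoders:
    hidden = dec.run(None, {"hidden_in": hidden})[0]         # each part passes hidden states on

logits = head.run(None, {"hidden_in": hidden})[0]
next_token = int(np.argmax(logits[0, -1]))
print("greedy next token id:", next_token)
```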
Try the LLaMa onnxruntime demo; no `torch` is required and the precision has been checked.
```bash
$ python3 -m pip install -r requirements.txt
$ python3 demo_llama.py ${FP16_ONNX_DIR} "bonjour"
..

# If you only have 4GB memory, use `--poolsize`
$ python3 demo_llama.py ${FP16_ONNX_DIR} "bonjour" --poolsize 4
..
Bonjour.

# Try more options
$ python3 demo_llama.py --help
```
Use `demo_rwkv.py` to run RWKV:

```bash
$ python3 demo_rwkv.py ${FP16_ONNX_DIR}
```
To export the RWKV onnx models yourself:

- `git clone` ChatRWKV and download its RWKV models
- copy `onnx_RWKV_in_150_lines.py` into ChatRWKV

```bash
$ git clone https://github.com/BlinkDL/ChatRWKV --depth=1
$ cp llama.onnx/tools/onnx_RWKV_in_150_lines.py ChatRWKV
$ cd ChatRWKV
$ mkdir models
$ python3 onnx_RWKV_in_150_lines.py
```

Then you will get the onnx files:

```bash
$ ls -lah models
..
```
STEP1 Convert to HF format
These models are converted from the alpaca huggingface weights.
- If you are using LLaMa or llama.cpp, convert it to HF format first. Here are the steps:

  ```bash
  # install transformers master
  $ git clone https://github.com/huggingface/transformers
  $ cd transformers && python3 setup.py install
  ..
  $ cd src/transformers
  $ python3 src/transformers/models/llama/convert_llama_weights_to_hf.py --input_dir ${LLaMa_PATH} --model_size 7B --output_dir ${HF_PATH}
  ```
- If you are using alpaca-lora, use this script to merge the LoRA weights (a hedged sketch of that merge follows after this list).
- If you are using alpaca, go to STEP2.
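As an illustration of that merge step, here is a hedged sketch using `peft`; the paths are placeholders and the repo's actual merge script may look different:

```python
# Hedged sketch: fold LoRA adapter weights into the base HF model with peft.
# ${HF_PATH}, ${LORA_PATH}, ${MERGED_HF_PATH} are placeholders.
import torch
from peft import PeftModel
from transformers import LlamaForCausalLM, LlamaTokenizer

base = LlamaForCausalLM.from_pretrained("${HF_PATH}", torch_dtype=torch.float16)
model = PeftModel.from_pretrained(base, "${LORA_PATH}")

merged = model.merge_and_unload()            # merge LoRA deltas into the base weights
merged.save_pretrained("${MERGED_HF_PATH}")

tokenizer = LlamaTokenizer.from_pretrained("${HF_PATH}")
tokenizer.save_pretrained("${MERGED_HF_PATH}")
```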
STEP2 torch.onnx.export
Check out transformers to this hacking branch and run a single inference.

```bash
$ python3 tools/export-onnx.py ${PATH_ALPACA_7B}
```
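For context, this is what a bare `torch.onnx.export` call looks like on a toy-sized, randomly initialized LLaMa config; the real `tools/export-onnx.py` works on the full 7B model and splits it into several onnx files, so treat this purely as an illustration:

```python
# Hedged sketch: torch.onnx.export on a tiny random LLaMa config.
# tools/export-onnx.py handles the real 7B model and splits it differently.
import torch
from transformers import LlamaConfig, LlamaForCausalLM

config = LlamaConfig(
    vocab_size=1000, hidden_size=128, intermediate_size=256,
    num_hidden_layers=2, num_attention_heads=4,
    use_cache=False, return_dict=False,   # plain tuple outputs trace more easily
)
model = LlamaForCausalLM(config).eval()

input_ids = torch.randint(0, config.vocab_size, (1, 8), dtype=torch.long)

torch.onnx.export(
    model,
    (input_ids,),
    "tiny-llama.onnx",
    input_names=["input_ids"],
    output_names=["logits"],
    dynamic_axes={"input_ids": {0: "batch", 1: "seq"},
                  "logits": {0: "batch", 1: "seq"}},
    opset_version=16,
)
```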
STEP3 convert to fp16/tvm
Use `onnxconverter-common.float16`:

```bash
$ cd tools
$ python3 -m pip install -r requirements.txt
$ python3 convert-fp32-to-fp16.py ${FP32_PATH} ${FP16_PATH}
```
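The core of such a conversion is `onnxconverter_common.float16`; here is a hedged sketch of that call (the `keep_io_types` choice is my assumption, and `convert-fp32-to-fp16.py` may differ in details):

```python
# Hedged sketch: fp32 -> fp16 conversion with onnxconverter-common.
# Paths are placeholders; convert-fp32-to-fp16.py may differ in details.
import onnx
from onnxconverter_common import float16

model_fp32 = onnx.load("${FP32_PATH}/model.onnx")
model_fp16 = float16.convert_float_to_float16(
    model_fp32,
    keep_io_types=True,   # keep fp32 graph inputs/outputs, cast internally
)
onnx.save(model_fp16, "${FP16_PATH}/model.onnx")
```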
Or use `relay.vm` to convert to tvm:

```bash
$ cd tools
$ python3 convert-to-tvm.py ${ONNX_PATH} ${OUT_DIR}
```
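Roughly, such a conversion goes through TVM's relay frontend and `relay.vm.compile`; the sketch below is an assumption about that flow (input name, shape, target, and output files are illustrative), not the actual `convert-to-tvm.py`:

```python
# Hedged sketch: compile an onnx model with TVM's relay VM.
# Input name, shape, target, and output files are assumptions.
import onnx
import tvm
from tvm import relay

onnx_model = onnx.load("${ONNX_PATH}")
shape_dict = {"input_ids": (1, 8)}   # assumed graph input name and a fixed shape

mod, params = relay.frontend.from_onnx(onnx_model, shape_dict)
vm_exec = relay.vm.compile(mod, target="llvm", params=params)

# Serialize the compiled executable so it can be loaded later.
code, lib = vm_exec.save()
lib.export_library("${OUT_DIR}/model.so")
with open("${OUT_DIR}/model.ro", "wb") as f:
    f.write(code)
```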
- For the model structure, please read the LLaMa vs. RWKV structure comparison (LLaMa 和 RWKV 结构对比)
- I have compared the output values of `onnxruntime-cpu` and `torch-cuda`; the maximum error is 0.002, not bad
- Currently `demo_llama.py` is equivalent to these configurations (a sampling sketch follows after this list): `temperature=0.1`, `total_tokens=2000`, `top_p=1.0`, `top_k=40`, `repetition_penalty=1.0`
- Mixed-precision kernel optimization is on the way. Here is part of the guidance.
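To make those knobs concrete, here is a hedged numpy sketch of the standard logits-warping pipeline (repetition penalty, temperature, top-k, top-p); it illustrates the common technique, not necessarily the exact code in `demo_llama.py`:

```python
# Hedged sketch of standard logits warping; demo_llama.py may differ in details.
import numpy as np

def warp_and_sample(logits, generated_ids, temperature=0.1, top_k=40,
                    top_p=1.0, repetition_penalty=1.0):
    """logits: 1-D array over the vocabulary; generated_ids: token ids produced so far."""
    logits = logits.astype(np.float64)

    # repetition penalty (CTRL-style): dampen tokens that already appeared
    for tok in set(generated_ids):
        logits[tok] = logits[tok] / repetition_penalty if logits[tok] > 0 \
            else logits[tok] * repetition_penalty

    # temperature: <1 sharpens, >1 flattens the distribution
    logits = logits / max(temperature, 1e-6)

    # top-k: keep only the k largest logits
    if 0 < top_k < logits.size:
        kth = np.sort(logits)[-top_k]
        logits[logits < kth] = -np.inf

    # softmax
    probs = np.exp(logits - np.max(logits))
    probs /= probs.sum()

    # top-p (nucleus): keep the smallest prefix whose probability mass >= top_p
    if top_p < 1.0:
        order = np.argsort(-probs)
        cum = np.cumsum(probs[order])
        drop = order[cum > top_p][1:]   # keep the token that crosses the threshold
        probs[drop] = 0.0
        probs /= probs.sum()

    return int(np.random.choice(len(probs), p=probs))
```

With `top_p=1.0` and `repetition_penalty=1.0`, only the temperature and top-k steps actually reshape the distribution, which matches the defaults listed above.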