A clean, pure-C implementation for quantizing a llama2 model and running the quantized model.
The code is based on llama2.c (Inference Llama 2 in one file of pure C) by Andrej Karpathy, with modifications mainly around quantization and running the quantized model.
Simple instructions:
gcc -O3 -o quantize quantize_8bit.c -lm
./quantize {model_name}.bin
gcc -O3 -march=native runq.c -o runq -lm
./runq llama2_7b_8bit.bin -t {temperature} -p {top_p} -n {max_token} -i "{prompt}"
To use the 64-element-block quantization variant instead, build the alternative quantizer:
gcc -O3 -o quantize quantize_8bit_64block.c -lm
More details can be found in the README.md.