Flask REST API for Testing LLM Models with the Llama.cpp Library. Try large language models without writing code.
This Flask application offers a local chat experience for testing Large Language Models (LLMs) using the Llama.cpp library. Unlike many online platforms, this chat operates entirely offline, ensuring user privacy by eliminating the need for internet access and avoiding data sharing with third-party companies. Users can confidently load and evaluate LLM models in the GGUF format without compromising their data security. The app is under active development, with a focus on enhancing features and maintaining robust privacy measures.
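To give a sense of how the pieces fit together, here is a minimal sketch, not the actual app.py: a hypothetical /chat endpoint combining Flask, Flask-SocketIO, flask-cors, and llama-cpp-python for fully local inference. The endpoint name, port, and model path are illustrative assumptions (the path mirrors the default model described later).

```python
# Minimal sketch (not the real app.py): a Flask + Flask-SocketIO server that
# answers chat requests from a local GGUF model via llama-cpp-python.
# The /chat route name and the model path are illustrative assumptions.
from flask import Flask, request, jsonify
from flask_cors import CORS
from flask_socketio import SocketIO
from llama_cpp import Llama

app = Flask(__name__)
CORS(app)
socketio = SocketIO(app, cors_allowed_origins="*")

# Load a local GGUF model; nothing leaves the machine.
llm = Llama(model_path="models/llama/llama-2-7b-chat.Q8_0.gguf", n_ctx=2048)

@app.route("/chat", methods=["POST"])  # hypothetical endpoint name
def chat():
    prompt = request.json.get("prompt", "")
    out = llm.create_chat_completion(
        messages=[{"role": "user", "content": prompt}], max_tokens=256
    )
    return jsonify(out["choices"][0]["message"])

if __name__ == "__main__":
    # Development server only; see the note on WSGI servers below.
    socketio.run(app, host="127.0.0.1", port=5000)
```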
For production environments, use a dedicated WSGI server instead of the built-in development server.
- Flask
- flask_socketio
- CORS
- llama-cpp-python
pip install Flask flask-socketio flask-cors
Or, from the project folder:
pip install -r requirements.txt
There are different options for installing the llama-cpp package:
- CPU Usage
- CPU + GPU (using one of the many BLAS backends)
- Metal GPU (macOS with an Apple Silicon chip)
pip install --upgrade --quiet llama-cpp-python
llama.cpp supports multiple BLAS backends for faster processing. Use the FORCE_CMAKE=1 environment variable to force the use of cmake and install the pip package for the desired BLAS backend.
Example installation with cuBLAS backend:
CMAKE_ARGS="-DLLAMA_CUBLAS=on" FORCE_CMAKE=1 pip install llama-cpp-python
IMPORTANT: If you have already installed the CPU-only version of the package, you must reinstall it from scratch. Consider the following command:
CMAKE_ARGS="-DLLAMA_CUBLAS=on" FORCE_CMAKE=1 pip install --upgrade --force-reinstall llama-cpp-python --no-cache-dir
llama.cpp supports Apple Silicon as a first-class citizen, optimized through ARM NEON, Accelerate, and Metal frameworks. Use the FORCE_CMAKE=1 environment variable to force the use of cmake and install the pip package for Metal support.
Open a terminal and try the following examples.
Example installation with Metal support:
CMAKE_ARGS="-DLLAMA_METAL=on" FORCE_CMAKE=1 pip install llama-cpp-python
IMPORTANT: If you have already installed a CPU-only version of the package, you must reinstall it from scratch: consider the following command:
CMAKE_ARGS="-DLLAMA_METAL=on" FORCE_CMAKE=1 pip install --upgrade --force-reinstall llama-cpp-python --no-cache-dir
On Linux with an NVIDIA GPU, you may also need to point the build at your C++ compiler and CUDA installation before installing:
export CXX=$(which g++)
export CUDA_PATH=/usr/local/cuda
CMAKE_ARGS="-DLLAMA_CUBLAS=on" pip install llama-cpp-python --upgrade --force-reinstall --no-cache-dir
If you get an error about the CUDA architecture not being found, you can try this command:
CUDA_ARCH=sm_86 CMAKE_ARGS="-DLLAMA_CUBLAS=on" pip install llama-cpp-python --upgrade --force-reinstall --no-cache-dir
In my case the card's architecture is Ampere, which corresponds to sm_86; change it to the value that matches your graphics card.
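Whichever backend you build (cuBLAS or Metal), a quick way to confirm that the GPU build actually took effect is to load a model with full offload and watch the verbose startup log. The snippet below assumes the repo's default model is already in place.

```python
# Quick check that llama-cpp-python was built with GPU support.
# With verbose=True the load log shows backend information; on a working GPU
# build you should see layers being offloaded instead of a pure CPU run.
from llama_cpp import Llama

llm = Llama(
    model_path="models/llama/llama-2-7b-chat.Q8_0.gguf",
    n_gpu_layers=-1,  # offload all layers if the build supports it
    verbose=True,     # prints backend/offload information at load time
)
print(llm("Q: What is 2 + 2? A:", max_tokens=8)["choices"][0]["text"])
```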
If you want to install llama-cpp-python by compiling it from source, you can follow most of the instructions in the repository itself. However, there are some Windows-specific instructions that might be helpful.
- git
- python
- cmake
- Visual Studio Community / Enterprise (ensure you install this with the following setup)
  - Desktop development with C++
  - Python development
  - Embedded Linux development with C++
- Download and install CUDA Toolkit 12.3 from the official NVIDIA website.
- Verify the installation with `nvcc --version` and `nvidia-smi`.
- Copy the files from:
C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v12.\extras\visual_studio_integration\MSBuildExtensions
To the folder:
For Enterprise version:
C:\Program Files\Microsoft Visual Studio\2022\Enterprise\MSBuild\Microsoft\VC\v170\BuildCustomizations
For Community version:
C:\Program Files\Microsoft Visual Studio\2022\Community\MSBuild\Microsoft\VC\v170\BuildCustomizations
Clone the git repository recursively to also get the llama.cpp submodule.
git clone --recursive -j8 https://github.com/abetlen/llama-cpp-python.git
Open a command prompt and set the following environment variables.
set FORCE_CMAKE=1
set CMAKE_ARGS=-DLLAMA_CUBLAS=OFF
If you have an NVIDIA GPU, make sure LLAMA_CUBLAS is set to ON (i.e. set CMAKE_ARGS=-DLLAMA_CUBLAS=ON).
Now you can navigate to the llama-cpp-python directory and install the package.
python3 -m pip install -e .
IMPORTANT: If you have already installed a CPU-only version of the package, you must reinstall it from scratch: consider the following command:
python3 -m pip install -e . --force-reinstall --no-cache-dir
After installation:
Go to the llama-cpp-python installation folder.
If you don't know where the directory is located, you can do a quick search for the file llama_chat_format.py.
Rename the file llama_chat_format.py to llama_chat_format.bk and replace it with the llama_chat_format.py file included in this repository.
This solution is provisional, but it is necessary in order to use models like Mixtral and others that need chat templates not included by default in the llama-cpp library. In the future I will implement a template editor to create, load, and save templates without having to replace library files.
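For reference, what the replacement file does under the hood is register additional chat formats with llama-cpp-python. Below is a rough sketch of how such a registration can look, assuming a llama-cpp-python version that exposes register_chat_format; the "custom-ialab" name and the prompt layout are purely illustrative, not the actual Custom-IALab template.

```python
# Illustrative only: registering an extra chat format with llama-cpp-python.
# The "custom-ialab" name and the prompt layout are assumptions; the real
# template lives in the replacement llama_chat_format.py shipped in this repo.
from llama_cpp.llama_chat_format import ChatFormatterResponse, register_chat_format

@register_chat_format("custom-ialab")
def format_custom_ialab(messages, **kwargs) -> ChatFormatterResponse:
    # Flatten the chat history into a single instruction-style prompt.
    prompt = ""
    for message in messages:
        prompt += f"[{message['role'].upper()}]\n{message['content']}\n"
    prompt += "[ASSISTANT]\n"
    return ChatFormatterResponse(prompt=prompt)
```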
Download the model and place it in the models/llama folder. The path looks like this:
models/llama/llama-2-7b-chat.Q8_0.gguf
This model is used as the default model (it does not need the llama_chat_format.py substitution).
You can add more models in .gguf format from Hugging Face and they will be added directly to the list in the interface.
TheBloke/mixtral_7bx2_moe [ Size - 8.87 GB | Max RAM required - 11.37 GB ] (needs the llama_chat_format.py substitution and the Custom-IALab chat format)
The path looks like this:
models/TheBloke/mixtral_7bx2_moe/mixtral_7bx2_moe.Q5_0.gguf
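As an illustration, once the file substitution is in place the model can be loaded with the custom template through the chat_format argument. The format name below is an assumption; check the replacement llama_chat_format.py for the real identifier.

```python
# Sketch only: loading the Mixtral GGUF with a custom chat format.
# "custom-ialab" is an assumed identifier for the Custom-IALab format that the
# replacement llama_chat_format.py registers; check that file for the real name.
from llama_cpp import Llama

llm = Llama(
    model_path="models/TheBloke/mixtral_7bx2_moe/mixtral_7bx2_moe.Q5_0.gguf",
    chat_format="custom-ialab",  # assumed name registered by the patched file
    n_ctx=4096,
)
reply = llm.create_chat_completion(
    messages=[{"role": "user", "content": "Summarize what a Mixture of Experts is."}]
)
print(reply["choices"][0]["message"]["content"])
```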
IMPORTANT: Remember to use models sized according to the available memory of your graphics card. On macOS with Metal, the maximum memory that can be used for inference is limited to roughly 65-75% of the total unified memory. For CPU inference the limit is the total system RAM, on both Windows and Mac.
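If a model does not quite fit in GPU memory, one option (sketched below with illustrative numbers) is to offload only part of its layers and keep the rest in system RAM:

```python
# Sketch: partial GPU offload for a model that does not fully fit in VRAM.
# 20 is an illustrative layer count; tune it until the model loads without
# running out of GPU memory. The remaining layers run on the CPU.
from llama_cpp import Llama

llm = Llama(
    model_path="models/llama/llama-2-7b-chat.Q8_0.gguf",
    n_gpu_layers=20,  # number of layers to offload to the GPU
    n_ctx=2048,       # context size also affects memory use
)
```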
Run the App:
python app.py
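From another terminal you can then talk to the running server. The snippet below is only a rough client sketch: the port, event names, and payload shape are assumptions, so check app.py and the web interface for the real ones.

```python
# Rough client sketch using the python-socketio package
# (pip install "python-socketio[client]").
# The event names and the port are assumptions, not the app's documented API.
import socketio

sio = socketio.Client()

@sio.on("assistant_response")          # hypothetical event emitted by the server
def on_response(data):
    print("Model:", data)

sio.connect("http://127.0.0.1:5000")   # default Flask/Flask-SocketIO port
sio.emit("user_message", {"prompt": "Hello, who are you?"})  # hypothetical event
sio.wait()
```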
Contributions are welcome! If you find any bugs or have suggestions to improve this Framework, feel free to open an issue or submit a pull request.
OR...
!pullrequest && !putIssue ? user.donate(CoffeeWithPaypal) : null;
Tested on a MacBook Pro M3 Pro (11 CPU cores, 14 GPU cores, 18 GB unified memory, Sonoma 14.1) and on an AMD Ryzen 5600X with an Nvidia RTX 3060 Gaming OC 12 GB and 32 GB of system memory (Linux/Windows). Tested with models up to 12 GB in size | Python version 3.11.7 |
This code is released under the MIT License.