Releases: LostRuins/koboldcpp
koboldcpp-1.0.9beta
- Integrated support for GPT-2! This should theoretically also work with Cerebras models, but I have not tried those yet. This is a great way to get started, as you can now try models so tiny that even a potato CPU can run them. Here's a good one to start with: https://huggingface.co/ggerganov/ggml/resolve/main/ggml-model-gpt-2-117M.bin, with which I can generate 100 tokens in a second.
- Upgraded embedded Kobold Lite to support a Stanford Alpaca compatible Instruct Mode, which can be enabled in settings.
- Removed all `-march=native` and `-mtune=native` flags when building the binary. Compatibility should be more consistent across different devices now.
- Fixed an incorrect flag name used to trigger the ACCELERATE library on Mac OSX. This should give greatly increased performance for OSX users running GPT-J and GPT-2 models, assuming you have ACCELERATE support.
- Added Rep Pen for GPT-J and GPT-2 models, and by extension pyg.cpp; this means repetition penalty now works similarly to the way it does in llama.cpp (see the sketch below this list).
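For illustration only, here is a minimal sketch of the llama.cpp-style repetition penalty convention (positive logits divided by the penalty, negative logits multiplied by it); the function name and parameters are assumptions, not koboldcpp's actual code.

```python
def apply_repetition_penalty(logits, recent_tokens, penalty=1.1):
    """Penalize tokens that already appeared in the recent context window.

    Follows the usual llama.cpp convention: positive logits are divided by
    the penalty and negative logits are multiplied by it, so a repeated
    token always becomes less likely regardless of its sign.
    """
    for tok in set(recent_tokens):
        if logits[tok] > 0:
            logits[tok] /= penalty
        else:
            logits[tok] *= penalty
    return logits
```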
To use, download and run the koboldcpp.exe
Alternatively, drag and drop a compatible ggml model on top of the .exe, or run it and manually select the model in the popup dialog.
and then once loaded, you can connect like this (or use the full koboldai client):
http://localhost:5001
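If you would rather call the API directly instead of using the browser UI, a request along these lines should work, assuming the embedded server exposes the KoboldAI-compatible `/api/v1/generate` endpoint (the parameter names below are illustrative and may differ):

```python
import json
import urllib.request

# Hypothetical example request against the local koboldcpp server.
payload = json.dumps({
    "prompt": "Once upon a time,",
    "max_length": 50,       # tokens to generate (assumed parameter name)
    "temperature": 0.7,
}).encode("utf-8")

req = urllib.request.Request(
    "http://localhost:5001/api/v1/generate",
    data=payload,
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    result = json.loads(resp.read())
    # The KoboldAI API typically returns {"results": [{"text": ...}]}.
    print(result["results"][0]["text"])
```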
koboldcpp-1.0.8beta
- Rebranded to koboldcpp (formerly llamacpp-for-kobold). Library file names and references have changed too, so please let me know if anything is broken!
- Added support for the original GPT4ALL.CPP format!
- Added support for GPT-J formats, including the original 16bit legacy format as well as the 4bit version from Pygmalion.cpp
- Switched compiler flag from `-O3` to `-Ofast`. This should increase generation speed even more, but I don't know if anything will break, so please let me know if it does.
- Changed default threads to scale according to physical core counts instead of `os.cpu_count()`. This will generally result in fewer threads being utilized, but it should provide a better default for slower systems. You can override this manually with the `--threads` parameter (see the sketch below this list).
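Purely as an illustration of the idea (not the project's actual code), choosing a default from physical cores rather than `os.cpu_count()` could look like this; `psutil` is assumed to be available:

```python
import os

import psutil  # assumed helper for detecting the physical core count


def default_thread_count():
    """Pick a default thread count from physical cores, not logical ones.

    os.cpu_count() reports logical cores (hyperthreads included), which can
    oversubscribe slower machines; physical cores are a gentler default.
    """
    physical = psutil.cpu_count(logical=False)
    if physical is None:  # detection can fail on some platforms
        physical = os.cpu_count() or 1
    return max(1, physical)
```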
To use, download and run the koboldcpp.exe
Alternatively, drag and drop a compatible quantized model for llamacpp on top of the .exe, or run it and manually select the model in the popup dialog.
and then once loaded, you can connect like this (or use the full koboldai client):
http://localhost:5001
llamacpp-for-kobold-1.0.7
- Added support for new version of the ggml llamacpp model format (magic=ggjt, version 3). All old versions will continue to be supported.
- Integrated speed improvements from parent repo.
- Fixed an encoding issue with utf-8 in the outputs.
- Improved console debug information during generation, now shows token progress and time taken directly.
- Set non-streaming to be the default mode. You can enable streaming with `--stream`.
To use, download and run the llamacpp-for-kobold.exe
Alternatively, drag and drop a compatible quantized model for llamacpp on top of the .exe, or run it and manually select the model in the popup dialog.
and then once loaded, you can connect like this (or use the full koboldai client):
http://localhost:5001
llamacpp-for-kobold-1.0.6-beta
- This is an experimental release containing new integrations for OpenBLAS, which should more than double initial prompt processing speed on compatible systems!
- Updated Embedded Kobold Lite with the latest version which supports pseudo token streaming. This should make the UI feel much more responsive during prompt generation.
- Switched to argparse; you can view all command line flags with `llamacpp-for-kobold.exe --help` (see the sketch below this list).
- To disable OpenBLAS, you can run it with `--noblas`. Please tell me if you have issues with it, and include your specific OS and platform.
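As a rough sketch of what the argparse setup might look like (the real flag set, names, and defaults in llamacpp-for-kobold may differ):

```python
import argparse

# Illustrative sketch only; the actual flags and defaults may differ.
parser = argparse.ArgumentParser(description="llamacpp-for-kobold server")
parser.add_argument("model_file", help="path to a compatible quantized ggml model")
parser.add_argument("port", nargs="?", type=int, default=5001, help="port to listen on")
parser.add_argument("--noblas", action="store_true", help="disable OpenBLAS acceleration")
args = parser.parse_args()
print(f"Loading {args.model_file} on port {args.port} (noblas={args.noblas})")
```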
To use, download and run the llamacpp-for-kobold.exe
Alternatively, drag and drop a compatible quantized model for llamacpp on top of the .exe, or run it and manually select the model in the popup dialog.
and then once loaded, you can connect like this (or use the full koboldai client):
http://localhost:5001
llamacpp-for-kobold-1.0.5
- Merged the upstream fixes for 65B models.
- Clamped max thread count to 4; this actually provides better results since generation is memory bottlenecked.
- Added support for selecting the KV data type, defaulting to f32 instead of f16.
- Added more default build flags
- Added softprompts endpoint
To use, download and run the llamacpp_for_kobold.exe
Alternatively, drag and drop a compatible quantized model for llamacpp on top of the .exe, or run it and manually select the model in the popup dialog.
and then once loaded, you can connect like this (or use the full koboldai client):
http://localhost:5001
llamacpp-for-kobold-1.0.4
- Added a script to make standalone pyinstaller .exes, which will be used for all future releases. The `llamacpp.dll` and `llama-for-kobold.py` files are still available by cloning the repo and will be included and updated there.
- Added token caching for prompts, allowing fast-forwarding through partially duplicated prompts. This makes edits towards the end of the previous prompt much faster (see the sketch after this list).
- Merged improvements from parent repo.
- Weights not included.
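As an illustration of the idea only (not the actual implementation), prompt token caching amounts to reusing the longest shared prefix between the previous and the new token sequence, so only the tokens after the edit point need to be re-evaluated:

```python
def common_prefix_length(prev_tokens, new_tokens):
    """Count how many leading tokens the old and new prompts share.

    Tokens inside this prefix are already in the model's KV cache from the
    previous request, so evaluation can fast-forward past them.
    """
    n = 0
    for a, b in zip(prev_tokens, new_tokens):
        if a != b:
            break
        n += 1
    return n


# Example: editing only the end of the prompt skips most of the work.
prev = [1, 15, 42, 7, 99]
new = [1, 15, 42, 8, 100]
tokens_to_skip = common_prefix_length(prev, new)  # -> 3
```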
To use, download and run the llamacpp_for_kobold.exe
Alternatively, drag and drop a compatible quantized model for llamacpp on top of the .exe, or run it and manually select the model in the popup dialog.
and then once loaded, you can connect like this (or use the full koboldai client):
http://localhost:5001
llamacpp-for-kobold-1.0.3
- Applied the massive refactor from the parent repo. It was a huge pain but I managed to keep the old tokenizer untouched and retained full support for the original model formats.
- Reduced default batch sizes greatly, as large batch sizes were causing bad output and high memory usage.
- Added support for dynamic context lengths sent from the client.
- TavernAI is working, although I wouldn't recommend it: it spams the server with multiple huge-context requests, so you're going to have a very painful time getting responses.
Weights not included.
To use, download, extract and run (default port is 5001):
llama_for_kobold.py [ggml_quant_model.bin] [port]
and then you can connect like this (or use the full koboldai client):
http://localhost:5001
llamacpp-for-kobold-1.0.2
- Added an embedded version of Kobold Lite inside (AGPL Licensed)
- Updated to the new ggml model format, while still maintaining support for the old format and the old tokenizer.
- Changed license to AGPL v3. The original GGML library and llama.cpp are still under MIT license in their original repos.
Weights not included.
To use, download, extract and run (default port is 5001):
llama_for_kobold.py [ggml_quant_model.bin] [port]
and then you can connect like this (or use the full koboldai client):
http://localhost:5001
llamacpp-for-kobold-1.0.1
- Bugfixes for OSX, and KV caching now allows continuing a previous generation without reprocessing the whole prompt.
- Weights not included.
To use, download, extract and run (default port is 5001):
llama_for_kobold.py [ggml_quant_model.bin] [port]
and then you can connect like this (or use the full koboldai client):
https://lite.koboldai.net/?local=1&port=5001
llamacpp-for-kobold-1.0.0
Initial version
Weights not included.
To use, download, extract and run:
llama_for_kobold.py [ggml_quant_model.bin] [port]