FAQs
Generally, if you put the latest version of the exe in an empty directory, start the program, wait for the download list to finish, then click Install Dependencies and follow the guide to start, you should not run into any problems.
Additionally, the preset configurations are relatively conservative. If your computer can successfully enable custom CUDA kernel acceleration, you can generally use a configuration that calls for 1-2GB more than your actual VRAM. If you have changed the configurations and want to reset them, or if you want to pull the newest preset configurations, delete the local config.json file and restart the program. If you understand the parameters on the Configs page, you can ignore the preset configs or even delete them all. For users on versions before v1.0.8, I recommend deleting config.json and pulling the latest presets: I added and adjusted the 8G, 12G, and 16G presets, and all presets now enable custom CUDA kernel acceleration by default.
Q1: Enabling custom CUDA kernel acceleration causes startup to fail on 16 and 40 series graphics cards.
A1: Delete the cache.json file in the program directory, then restart the program to pull the newest kernel. You can track the progress in the download list.
Q2: The API program fails to start, or is out of date.
A2: As in Q1, delete cache.json, then restart the program to pull the latest API program. Check the download list and wait until it completes.
A typical example of calling the API is to open the browser console, then paste and execute the following code. You should see the model's answer in the output:
fetch("http://127.0.0.1:8000/chat/completions", {
method: "POST",
headers: { "Content-Type": "application/json" },
body: JSON.stringify({ messages: [{ role: "user", content: "Hello" }] }),
})
.then((r) => r.json())
.then(console.log);
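The request format above looks OpenAI-compatible, so streaming output via a `stream: true` flag should also work. The following is a minimal sketch; the SSE chunk format it prints is an assumption rather than something documented here:

```js
// Hedged streaming sketch: assumes the server honors the OpenAI-style
// "stream: true" flag and responds with SSE-style "data: {...}" chunks.
fetch("http://127.0.0.1:8000/chat/completions", {
  method: "POST",
  headers: { "Content-Type": "application/json" },
  body: JSON.stringify({
    messages: [{ role: "user", content: "Hello" }],
    stream: true,
  }),
}).then(async (r) => {
  const reader = r.body.getReader();
  const decoder = new TextDecoder();
  for (;;) {
    const { done, value } = await reader.read();
    if (done) break;
    console.log(decoder.decode(value)); // raw chunks as they arrive
  }
});
```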
Q3: The automatic update download is stuck, or you want to download the update manually and overwrite it. What are the correct steps?
A3: If you also want to pull the latest dependencies, delete cache.json and then start the new version of the exe. If you are deploying in an offline environment, keep cache.json, or at least create a new empty cache.json file, so the program does not try to pull the newest dependencies. If you want to update the API manually in an offline environment, refer to Q6 below.
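For the offline case, any method of creating the empty file works; for example, a one-line Node.js sketch, run from the directory containing the exe:

```js
// Create an empty cache.json next to the exe (offline deployment placeholder).
require("fs").writeFileSync("cache.json", "");
```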
Q4: Clicking Install Dependencies fails, or the dependencies cannot be installed.
A4: Check whether everything in the download list has finished downloading. After the downloads are complete, click Install Dependencies. If a download is not progressing, click its continue button manually. If the download list is blank, the local files are intact and the dependencies can be installed.
If it still doesn't download, you can download the files manually from GitHub: place this folder (https://github.com/josStorer/RWKV-Runner/tree/master/backend-python) next to the exe, then download get-pip.py from https://cdn.jsdelivr.net/gh/pypa/get-pip/public/get-pip.py and put it into the backend-python folder as well, as in the layout sketched below.
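The resulting layout should look roughly like this (the exe name is illustrative):

```
RWKV-Runner.exe
backend-python/
├── get-pip.py
└── ... (the rest of the backend-python folder from GitHub)
```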
Q5: How do I use the model for novel writing?
A5: Download the novel model here (https://huggingface.co/BlinkDL/rwkv-4-novel/tree/main), place it into the models directory, refresh the model list, and then use it on the Completion page.
Note that the novel model is not suitable for chat; it is intended for writing only.
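If you would rather drive completion from code than from the Completion page, here is a hedged sketch. It assumes the server exposes an OpenAI-style /completions route on the same port as the chat example above; the parameter names are likewise assumptions:

```js
// Hedged sketch: assumes an OpenAI-style /completions route and parameters.
fetch("http://127.0.0.1:8000/completions", {
  method: "POST",
  headers: { "Content-Type": "application/json" },
  body: JSON.stringify({
    prompt: "The dragon circled the tower once more, and",
    max_tokens: 200,
  }),
})
  .then((r) => r.json())
  .then(console.log);
```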
Q6: How do I update the API manually in an offline environment?
A6: Just like Q4, download it manually from GitHub and place this folder (https://github.com/josStorer/RWKV-Runner/tree/master/backend-python) next to the exe.
Q7: An error occurs when starting a model.
A7: One possibility is insufficient VRAM. Before starting, open Task Manager and watch the VRAM usage. If it reaches the maximum during startup, the error is likely caused by insufficient VRAM; try reducing the number of loaded layers on the Configs page.
Another possibility is that the error message contains "not enough memory". You can copy the entire error message to Notepad and search for that phrase. If it is found, your system RAM is insufficient, which generally occurs when int8 quantization is enabled. If your computer has plenty of RAM, try closing some unused programs before starting.
If you have limited RAM, typically 16GB, but your graphics card is a 3060 or 4060, which is adequate in performance but unable to complete int8 quantization, you can try increasing the virtual memory. If it still fails, you can download a pre-quantized model from this link (https://huggingface.co/appleatiger/rwkv_cuda_i8/tree/main), put it into the models directory, and refresh the model list.
Note that the models at this link are fully quantized to int8, so you must run them with the loaded layers set to the maximum. If your VRAM can only hold some of the layers, you can ask someone else to convert the model for your layer count and send it to you.
With 8GB of VRAM, you can run the complete 7B int8 model with the CUDA kernel enabled. After downloading from the address above, set the loaded layers to the maximum, select int8 precision, and enable custom CUDA kernel acceleration. Note that Enable High Precision For Last Layer should not be enabled.
Q8: Startup fails because the int8-quantized model was obtained from another source instead of being converted by rwkv-runner.
A8: Models obtained from other sources are usually quantized across all layers. To use one in rwkv-runner, set the layers loaded into VRAM to the maximum, turn off Enable High Precision For Last Layer, select int8 precision, and enable custom CUDA kernel acceleration, i.e. the same configuration described in A7 above.
Q9: You are a 10, 16, 20, 30, or 40 series graphics card user, but custom CUDA kernel acceleration fails to start.
A9: Check whether a folder named torch-1.13.1+cu117.dist-info exists under py310\Lib\site-packages. If it does not, you have most likely installed torch 2.0.x instead. Delete the two torch directories, then run rwkv-runner and let it reinstall the dependencies.
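To check this quickly from the program directory, a small Node.js sketch (the folder names follow A9; nothing beyond that is assumed):

```js
const fs = require("fs");
// List installed torch distributions to see which version is present.
const dists = fs
  .readdirSync("py310/Lib/site-packages")
  .filter((n) => n.startsWith("torch") && n.endsWith(".dist-info"));
console.log(dists.length ? dists : "no torch dist-info found");
```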
A10: Please update the graphics card driver.
A11: Same as Q9 above.