Dear Lyogavin,
Thanks for your wonderful work. I have a question: does AirLLM run faster than llama.cpp? Do you have any data on that?
As far as I know, llama.cpp uses mmap to manage memory. When computation hits a page fault, the OS automatically loads the required tensor weights from disk into memory and computation continues; it also evicts less-used pages when memory pressure is high, all managed by the OS. So llama.cpp can also run very large LLMs, similar to the capability AirLLM provides.
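For reference, this is the mmap access pattern I have in mind, sketched in Python rather than llama.cpp's C/C++; the file name, offset, and tensor shape are made up for illustration and are not llama.cpp's actual GGUF layout:

```python
import mmap
import numpy as np

# Hypothetical single-file weight layout, just to illustrate lazy paging.
WEIGHTS_PATH = "weights.bin"

with open(WEIGHTS_PATH, "rb") as f:
    # Map the whole file; no data is read from disk yet.
    mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)

    # View a tensor at a known (hypothetical) offset/shape without copying.
    # The OS pages the bytes in on first access (a page fault) and may
    # evict them later under memory pressure -- no explicit load/unload code.
    layer0_weights = np.frombuffer(
        mm, dtype=np.float16, count=4096 * 4096, offset=0
    ).reshape(4096, 4096)

    x = np.random.randn(4096).astype(np.float16)
    y = layer0_weights @ x  # touching the pages triggers the actual disk reads
```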
I noticed that AirLLM uses prefetching to overlap disk I/O latency with computation. Will this be faster than llama.cpp (with mmap enabled)? And how large is the improvement?
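This is roughly the prefetch/compute overlap I understand AirLLM to be doing: a double-buffering sketch where the next layer is loaded in a background thread while the current layer computes. The `load_layer`/`apply_layer` stand-ins and the layer count are hypothetical, not AirLLM's actual code:

```python
import concurrent.futures as futures
import numpy as np

NUM_LAYERS = 4  # hypothetical; real models have many more layers

def load_layer(i):
    """Stand-in for reading one layer's weights from disk (e.g. np.load)."""
    return np.random.randn(1024, 1024).astype(np.float16)

def apply_layer(weights, x):
    """Stand-in for the per-layer computation."""
    return weights @ x

x = np.random.randn(1024).astype(np.float16)

with futures.ThreadPoolExecutor(max_workers=1) as pool:
    next_weights = pool.submit(load_layer, 0)              # start loading layer 0
    for i in range(NUM_LAYERS):
        weights = next_weights.result()                     # wait for current layer's weights
        if i + 1 < NUM_LAYERS:
            next_weights = pool.submit(load_layer, i + 1)   # prefetch the next layer...
        x = apply_layer(weights, x)                         # ...while computing this one
```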