Let's collaborate #71

philpax · 2023-03-14T16:08:21Z

philpax
Mar 14, 2023
Maintainer

[apologies for early send, accidentally hit enter]

Hey there! Turns out we think on extremely similar wavelengths - I did the exact same thing as you, for the exact same reasons (libraryification), and through the use of similar abstractions: https://github.com/philpax/ggllama

Couple of differences I spotted on my quick perusal:

My version builds on both Windows and Linux, but fails to infer correctly past the first round. Windows performance is also pretty crappy because ggml doesn't support multithreading on Windows.
I use PhantomData with the Tensors to prevent them outliving the Context they're spawned from.
I vendored llama.cpp in so that I could track it more directly and use its ggml.c/h, and to make it obvious which version I was porting.
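
For anyone unfamiliar with the PhantomData trick, here is a minimal sketch of the idea. `Context`, `Tensor` and their methods are hypothetical stand-ins invented for illustration, not the actual ggllama or ggml types; the raw pointer only mimics what an FFI binding would hold.

```rust
use std::marker::PhantomData;

/// Hypothetical stand-in for a ggml context: it owns the memory tensors point into.
pub struct Context {
    buffer: Vec<f32>,
}

/// A tensor handle holding a raw pointer, as an FFI binding would.
/// `PhantomData<&'ctx Context>` makes the borrow checker treat the handle as if
/// it held a reference to the context, so it cannot outlive the context.
pub struct Tensor<'ctx> {
    ptr: *const f32,
    len: usize,
    _ctx: PhantomData<&'ctx Context>,
}

impl Context {
    pub fn new(buffer: Vec<f32>) -> Self {
        Context { buffer }
    }

    /// Returns a tensor viewing `len` elements starting at `offset`,
    /// or `None` if the range does not fit in the buffer.
    pub fn tensor(&self, offset: usize, len: usize) -> Option<Tensor<'_>> {
        let slice = self.buffer.get(offset..offset.checked_add(len)?)?;
        Some(Tensor {
            ptr: slice.as_ptr(),
            len: slice.len(),
            _ctx: PhantomData,
        })
    }
}

impl<'ctx> Tensor<'ctx> {
    pub fn as_slice(&self) -> &'ctx [f32] {
        // SAFETY: `ptr` and `len` come from a slice of the context's buffer,
        // and `'ctx` guarantees the context is alive and not mutably borrowed.
        unsafe { std::slice::from_raw_parts(self.ptr, self.len) }
    }
}
```

With this in place, returning a `Tensor` from a block in which its `Context` is dropped is rejected at compile time instead of turning into a use-after-free.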

Given yours actually works, I think that it's more promising :p

What are your immediate plans, and what do you want people to help you out with? My plan was to get it working, then librarify it, make a standalone Discord bot with it as a showcase, and then investigate using a Rust-native solution for the tensor manipulation (burn, ndarray, arrayfire, etc) to free it from the ggml dependency.

setzer22 · 2023-03-14T21:20:45Z

setzer22
Mar 14, 2023
Maintainer

Hi! Thanks for your post and for all the help so far 😄

Turns out we think on extremely similar wavelengths

Glad to hear I'm not the only one who saw the potential in this project! I think having the potential to build something this huge and making it a CLI app is not aiming high enough, heh.

Windows performance is also pretty crappy because ggml doesn't support multithreading on Windows.

To be fair, I haven't even tested this on Windows. Maybe it builds just fine. But I didn't know what flags to set to compile with AVX and inference times without AVX are pretty bad (if you're telling me there's no multithreading on top of that, it's probably gonna be unusably slow anyway, unfortunately). I don't have a Windows machine to test this, so I didn't want to promise support for untested systems 😅

I use PhantomData with the Tensors to prevent them outliving the Context they're spawned from.

I thought about this! Even started with this design. But didn't want to force all the code to use a 'ctx lifetime annotation, since those typically become infectious and are a bit unergonomic. So I did the Arc / Weak thing instead, since the performance of cloning or operating with tensor pointers doesn't really matter. Still, seeing how it turned out in the end, I think the lifetime annotation wouldn't have been so bad 🤔.

That said, the "bindings" in ggml.rs do not aim to be a safe abstraction, at least not in their current state! We would need to put a bit more thought into it, because the ownership model is a bit weird there. A tensor is tied to a context, but you can then use tensors from one context to operate with tensors from another context, which makes a graph computation on one context access data from other contexts. So not even lifetime annotations would help here.

What are your immediate plans, and what do you want people to help you out with?

Very good question! So far I was most concerned with whether I could do it 🤣 But now that the library exists, I'm thinking it would be pretty good to start improving this. A few things off the top of my head:

Library-fication: This needs to happen ASAP. Split the current llama-rs crate into two crates, llama-rs would be a library, and llama-rs-cli would be the simple example CLI app we have now. I don't have much interest in making the CLI experience better (porting things like the interactive mode or terminal colors from llama.cpp), but they're welcome in case someone wants to contribute.
Add a server mode, perhaps as an addition to llama-rs-cli that would allow spawning a long-running process that can serve multiple queries. The current usage model doesn't make any sense. You spend a lot of time loading the models from disk (especially if you're using the larger ones) only to throw all that away after a single prompt generation.
Prompt caching: Another thing that doesn't make sense with the current model is when you have a huge prompt, because you need to feed it through the network every time you want to do inference with that "pre-prompt" + some user input. This is described in an issue in llama.cpp (currently unimplemented). If I understood correctly, the gist of it is that we need to dump the contents of the memory_k and memory_v tensors to disk, and load them back, and that would be the same as feeding the model the same prompt again. Choosing a fast compression algorithm would be a good way to mitigate the cost of storing massive tensors on disk.
Another thing I have on my radar is this famous "GPTQ 4bit" quantization. It is well known (I mean, you just need to run a trivial example) that the 4-bit quantization in llama.cpp affects the results of the network. If this GPTQ quantization is capable of keeping the same quality as the f16 version as some sources claim, this would be huge. But I would need to investigate more to be sure.
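
To make the prompt caching idea a bit more concrete, here is a rough sketch of dumping two f32 buffers and reading them back. `KvCache` and its byte layout (a little-endian length followed by little-endian values) are assumptions made up for this example, not an existing format in llama.cpp or llama-rs, and compression is left out entirely.

```rust
use std::io::{self, Read, Write};

/// Hypothetical snapshot of the attention key/value memory.
#[derive(Debug, Clone, PartialEq)]
pub struct KvCache {
    pub memory_k: Vec<f32>,
    pub memory_v: Vec<f32>,
}

/// Writes a u64 element count followed by each value, all little-endian.
fn write_f32s<W: Write>(out: &mut W, values: &[f32]) -> io::Result<()> {
    out.write_all(&(values.len() as u64).to_le_bytes())?;
    for v in values {
        out.write_all(&v.to_le_bytes())?;
    }
    Ok(())
}

/// Reads back a buffer written by `write_f32s`.
fn read_f32s<R: Read>(input: &mut R) -> io::Result<Vec<f32>> {
    let mut len_bytes = [0u8; 8];
    input.read_exact(&mut len_bytes)?;
    let len = u64::from_le_bytes(len_bytes);
    let mut values = Vec::new();
    let mut buf = [0u8; 4];
    for _ in 0..len {
        input.read_exact(&mut buf)?;
        values.push(f32::from_le_bytes(buf));
    }
    Ok(values)
}

impl KvCache {
    /// Serializes both buffers to any writer (a file, a socket, a Vec<u8>).
    pub fn save<W: Write>(&self, out: &mut W) -> io::Result<()> {
        write_f32s(out, &self.memory_k)?;
        write_f32s(out, &self.memory_v)
    }

    /// Restores a cache previously written by `save`.
    pub fn load<R: Read>(input: &mut R) -> io::Result<Self> {
        let memory_k = read_f32s(input)?;
        let memory_v = read_f32s(input)?;
        Ok(KvCache { memory_k, memory_v })
    }
}
```

Because `save` and `load` are generic over `Write` and `Read`, a compressing writer could be slotted in later without touching the format code.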

investigate using a Rust-native solution for the tensor manipulation (burn, ndarray, arrayfire, etc) to free it from the ggml dependency

Yup, also considered that, I'd love to do this if possible :) Not sure how long it would take, but none of the tensor operations I ported seem too complicated. The code should be pretty straightforward to port to a different library, and I'm sure some of the Rust options achieve a more ergonomic (and safe!) API. We just have to keep an eye on performance, but having an already working ggml version means we can just benchmark.

If that change also helps us support GPU inference, that'd be pretty cool. But I don't want to add GPU support if that means people having to mess with CUDA or rocm drivers and report all sorts of issues. Unless it's something portable that works out of the box for people, I'm not interested.

I'm not sure how crazy it would be to build a tensor library on top of wgpu compute shaders. Just throwing that out there for anyone who feels crazy and/or adventurous enough 🤔 But that would eventually mean tensors on wasm, which is pretty darn cool I guess?

Anyway, I'm happy to have someone else on board 😄 If there's anything I mentioned above you'd like to take the lead on, please say so! I'm not going to be making any big changes to the code in a few days.

setzer22 · 2023-03-14T21:59:41Z

setzer22
Mar 14, 2023
Maintainer

Just something to follow w.r.t. GPTQ quantization 👀 https://github.com/ggerganov/llama.cpp/issues/9

mwbryant · 2023-03-14T22:21:27Z

mwbryant
Mar 14, 2023

Also count me in for any future work! I've been obsessed with llama for the past few weeks and getting a solid Rust implementation of a modern machine learning model like this is really impressive. I might try tackling breaking the app into a library in the next few days (unless someone else beats me to it 😄)

philpax · 2023-03-15T00:46:02Z

philpax
Mar 15, 2023
Maintainer Author

To be fair, I haven't even tested this on Windows. Maybe it builds just fine. But I didn't know what flags to set to compile with AVX and inference times without AVX are pretty bad (if you're telling me there's no multithreading on top of that, it's probably gonna be unusably slow anyway, unfortunately). I don't have a Windows machine to test this, so I didn't want to promise support for untested systems 😅

I've been testing on Windows with my patches applied and it seems to work fine. It's probably not as fast as it could be, but it's plenty fast enough!

I thought about this! Even started with this design. But didn't want to force all the code to use a 'ctx lifetime annotation, since those typically become infectious and are a bit unergonomic. So I did the Arc / Weak thing instead, since the performance of cloning or operating with tensor pointers doesn't really matter. Still, seeing how it turned out in the end, I think the lifetime annotation wouldn't have been so bad 🤔.

Yeah, I actually went back and forth on this. I started without any kind of checking, promptly got owned by accessing freed memory, fixed that, and then bolted on the PhantomData afterwards. Turns out it's not too bad, as you've noticed, because the only place where the actual borrows come up is LlamaModel with reference to the context that's created during the loading process. That's a little annoying, but I worked around it by splitting the load into two so that the model could borrow from the separately-stored context:

https://github.com/philpax/ggllama/blob/7b69eb984dc32f8bcd199eb75484c33f24f9ec1f/src/llama.rs#L157-L169

Ideally, we'd still maintain the same LlamaModel interface to the outside world - but it might get annoyingly self-referential. Will play around with it sometime!

That said, the "bindings" in ggml.rs do not aim to be a safe abstraction, at least not in their current state! We would need to put a bit more thought into it, because the ownership model is a bit weird there. A tensor is tied to a context, but you can then use tensors from one context to operate with tensors from another context, which makes a graph computation on one context access data from other contexts. So not even lifetime annotations would help here.

Huh, you're right - hadn't even thought about that. That's... pretty gnarly. I wonder if it's possible for the operands in a binary operation A + B = C to each have their own lifetimes, such that 'a A + 'b B = 'c C where 'a and 'b outlive 'c? I must admit I've never really delved into that kind of lifetime trickery!
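
As a rough sketch of what such bounds could look like in Rust syntax: `Context`, `Tensor`, `Sum` and `add` below are hypothetical and much simpler than ggml (the result is a lazy node that only borrows its operands, whereas ggml would also allocate it in a context), so this shows that the bounds can be written, not that they solve the ggml case.

```rust
/// Hypothetical stand-in for a ggml context that owns tensor data.
pub struct Context {
    data: Vec<f32>,
}

/// A leaf tensor borrowing its values from one context.
#[derive(Clone, Copy)]
pub struct Tensor<'ctx> {
    values: &'ctx [f32],
}

/// A lazy `A + B` node: nothing is computed until `compute` is called,
/// so it has to keep both operands' contexts borrowed.
pub struct Sum<'c> {
    lhs: &'c [f32],
    rhs: &'c [f32],
}

impl Context {
    pub fn new(data: Vec<f32>) -> Self {
        Context { data }
    }

    pub fn tensor(&self) -> Tensor<'_> {
        Tensor { values: &self.data }
    }
}

/// `'a: 'c` reads "'a outlives 'c": the result may live at most as long
/// as the shorter-lived of the two operand contexts.
pub fn add<'a: 'c, 'b: 'c, 'c>(a: Tensor<'a>, b: Tensor<'b>) -> Sum<'c> {
    Sum {
        lhs: a.values,
        rhs: b.values,
    }
}

impl<'c> Sum<'c> {
    /// Performs the deferred element-wise addition.
    pub fn compute(&self) -> Vec<f32> {
        self.lhs.iter().zip(self.rhs).map(|(x, y)| x + y).collect()
    }
}
```

Dropping either context while a `Sum` built from it is still in use is then a compile error.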

Library-fication: This needs to happen ASAP. Split the current llama-rs crate into two crates, llama-rs would be a library, and llama-rs-cli would be the simple example CLI app we have now. I don't have much interest in making the CLI experience better (porting things like the interactive mode or terminal colors from llama.cpp), but they're welcome in case someone wants to contribute.

Well... I'd actually started on this before you replied 😂 Here's the PR. I wrote a Discord bot to prove that it works, too. Hell of a thing to run a binary and be able to run an LLM with friends with an evening's work!

Add a server mode, perhaps as an addition to llama-rs-cli that would allow spawning a long-running process that can serve multiple queries. The current usage model doesn't make any sense. You spend a lot of time loading the models from disk (especially if you're using the larger ones) only to throw all that away after a single prompt generation.

Yeah, I've also thought about this. Seems easy enough to do; I'd do it as a separate application just to keep the concerns separate, and to offer up a simple "API server" that anyone can run. My closest point of comparison is the API for the Automatic1111 Stable Diffusion web UI - it's not the best API, but it does prove that all you need to do is offer up a HTTP interface and They Will Come™.
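
A minimal sketch of the "load once, serve many queries" shape, with the HTTP layer left out and replaced by an in-process channel. `Model`, `Request`, `spawn_server` and `query` are placeholders invented for this example; the "model" just echoes the prompt.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::mpsc;
use std::thread;

/// Counts how many times the placeholder model was loaded.
pub static LOAD_COUNT: AtomicUsize = AtomicUsize::new(0);

/// Placeholder for a loaded model; real loading would read weights from disk.
pub struct Model;

impl Model {
    pub fn load() -> Self {
        LOAD_COUNT.fetch_add(1, Ordering::SeqCst);
        Model
    }

    /// Placeholder "inference": echoes the prompt back.
    pub fn generate(&self, prompt: &str) -> String {
        format!("echo: {}", prompt)
    }
}

/// One query plus the channel on which the answer is sent back.
pub struct Request {
    pub prompt: String,
    pub reply: mpsc::Sender<String>,
}

/// Starts a worker thread that loads the model once and then serves
/// requests until every sender handle has been dropped.
pub fn spawn_server() -> mpsc::Sender<Request> {
    let (tx, rx) = mpsc::channel::<Request>();
    thread::spawn(move || {
        let model = Model::load();
        for request in rx {
            // Ignore the error if the caller stopped waiting for the reply.
            let _ = request.reply.send(model.generate(&request.prompt));
        }
    });
    tx
}

/// Sends one prompt to the server and blocks until the answer arrives.
pub fn query(server: &mpsc::Sender<Request>, prompt: &str) -> Option<String> {
    let (reply, response) = mpsc::channel();
    server
        .send(Request { prompt: prompt.to_string(), reply })
        .ok()?;
    response.recv().ok()
}
```

An HTTP front end would sit in front of `query`, one call per incoming request, while the expensive load still happens only once.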

Prompt caching: Another thing that doesn't make sense with the current model is when you have a huge prompt, because you need to feed it through the network every time you want to do inference with that "pre-prompt" + some user input. This is described in an issue in llama.cpp (currently unimplemented). If I understood correctly, the gist of it is that we need to dump the contents of the memory_k and memory_v tensors to disk, and load them back, and that would be the same as feeding the model the same prompt again. Choosing a fast compression algorithm would be a good way to mitigate the cost of storing massive tensors on disk.

I think this could be exposed through the API, but it's not necessarily something that should be part of the API by default. I'd break apart the inference_with_prompt function into easy-to-manipulate steps, so that users could save the state of the LLM at any given moment, and make that easy to do.

That being said, that sounds pretty reasonable to do for both the CLI and/or the API. Either/or could serve as a "batteries-included" example of how to ship something that's consistently fast with the library.

Another thing I have on my radar is this famous "GPTQ 4bit" quantization. It is well known (I mean, you just need to run a trivial example) that the 4-bit quantization in llama.cpp affects the results of the network. If this GPTQ quantization is capable of keeping the same quality as the f16 version as some sources claim, this would be huge. But I would need to investigate more to be sure.

Oh yeah, it's pretty cool. I haven't played around with it much myself, but the folks over at the text-generation-webui have used it to get LLaMA 30B into 24GB VRAM without much quality loss. Seems like it's something that upstream is looking at, though, so I'm content to wait and see what they do first.

Yup, also considered that, I'd love to do this if possible :) Not sure how long it would take, but none of the tensor operations I ported seem too complicated. The code should be pretty straightforward to port to a different library, and I'm sure some of the Rust options achieve a more ergonomic (and safe!) API. We just have to keep an eye on performance, but having an already working ggml version means we can just benchmark.

Yeah, I think most of the existing Rust ML libraries should be able to handle this. I was surprised at how few operations it used while porting it myself! It's certainly much simpler than Stable Diffusion.

If that change also helps us support GPU inference, that'd be pretty cool. But I don't want to add GPU support if that means people having to mess with CUDA or rocm drivers and report all sorts of issues. Unless it's something portable that works out of the box for people, I'm not interested.

Agreed. I have lost far too much of my time trying to set up CUDA + Torch.

I'm not sure how crazy it would be to build a tensor library on top of wgpu compute shaders. Just throwing that out there for anyone who feels crazy and/or adventurous enough 🤔 But that would eventually mean tensors on wasm, which is pretty darn cool I guess?

Check out wonnx! I'm not sure if it can be used independently from ONNX, but it would be super cool to figure out. Worth having a chat with them at some point.

You could also just run the existing CPU inference in WASM, I think - you might have to get a little clever with how you deal with memory, given the 32-bit memory space, but I think it should be totally feasible to run 7B on the web. The only reason I haven't looked into it is because the weights would have to be hosted somewhere 😅

Anyway, I'm happy to have someone else on board 😄 If there's anything I mentioned above you'd like to take the lead on, please say so! I'm not going to be making any big changes to the code in a few days.

I think we're in a pretty good place! I think it's just figuring out what the best "library API" would look like, and building some applications around it to test it. From there, we can figure out next steps / see what other people need.

asukaminato0721 · 2023-03-15T05:14:58Z

asukaminato0721
Mar 15, 2023

Actually, I did a bit of a port too lol. But I am not familiar with C bindings, and I am not brave enough to port the ggml library, so I just rewrote some code and left it there. It's the utils.cpp:

https://play.rust-lang.org/?version=stable&mode=debug&edition=2021&gist=46f060c6953d228fcb46ea67dea8e8b8

~~and I am also not sure whether it works properly. But it compiles.~~

erlend-sh · 2023-03-15T09:04:21Z

erlend-sh
Mar 15, 2023

@Noeda of https://github.com/Noeda/rllama might wanna tag along here ☺️

Also, a Tauri-app equivalent to https://github.com/lencx/ChatGPT would pair very well with this. Good task for anyone who wants to be involved but doesn’t quite feel comfortable with the low level internals.

setzer22 · 2023-03-15T10:52:14Z

setzer22
Mar 15, 2023
Maintainer

I've been testing on Windows with my patches applied and it seems to work fine. It's probably not as fast as it could be, but it's plenty fast enough!

Glad to hear it! Then forget what I said :) If you're able to test things there, we can aim for good Windows support too. This is probably going to be something where Rust makes it a lot simpler to get going than the C++ version.

I've never really delved into that kind of lifetime trickery!

Me neither, I'm not even sure it is possible 🤔 But definitely interesting! Still, I'd rather go down the route of replacing ggml with a pure Rust solution than to spend a lot of time building safe bindings to ggml.

Well... I'd actually started on this before you replied 😂

❤️

it's not the best API, but it does prove that all you need to do is offer up a HTTP interface and They Will Come ™️

I think that's a very good point! As for the HTTP interface, one interesting requirement the image generation APIs don't have is that with text, you generally want to stream the outputs.

A good way to do this without complicating the protocol is to use something called "chunked transfer encoding", where the server sends the response one piece at a time, and a compatible client can fetch the results as they come without waiting for the end of the HTTP response. Chunked transfer is a pretty old thing and should be well supported in every HTTP client. I know @darthdeus already did a little proof of concept and this works well.
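
For reference, a small sketch of what the chunk framing looks like on the wire in HTTP/1.1: each chunk is its size in hexadecimal, CRLF, the data, CRLF, and a zero-sized chunk ends the body. The function names are made up for this example, and a real server would also send the status line and a `Transfer-Encoding: chunked` header first.

```rust
use std::io::{self, Write};

/// Writes one chunk: hex length, CRLF, payload, CRLF.
/// Empty payloads are skipped, because a zero-length chunk ends the body.
pub fn write_chunk<W: Write>(out: &mut W, data: &[u8]) -> io::Result<()> {
    if data.is_empty() {
        return Ok(());
    }
    write!(out, "{:X}\r\n", data.len())?;
    out.write_all(data)?;
    out.write_all(b"\r\n")
}

/// Writes the terminating zero-length chunk (with no trailer fields).
pub fn finish_chunks<W: Write>(out: &mut W) -> io::Result<()> {
    out.write_all(b"0\r\n\r\n")
}

/// Streams each generated token as its own chunk, flushing after every
/// token so the client sees it right away, then closes the body.
pub fn stream_tokens<W: Write>(out: &mut W, tokens: &[&str]) -> io::Result<()> {
    for token in tokens {
        write_chunk(out, token.as_bytes())?;
        out.flush()?;
    }
    finish_chunks(out)
}
```

In a real server `out` would be the TCP stream, and the loop would pull tokens from the model as they are sampled rather than from a slice.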

That being said, that sounds pretty reasonable to do for both the CLI and/or the API. Either/or could serve as a "batteries-included" example of how to ship something that's consistently fast with the library.

Yes :) I'm really interested in making this as simple as possible. What we could do on the library side is to have the main inference_with_prompt return a MemoryOut<'model> struct containing one or more byte slices for the context memory (the lifetime would make sure the refs never outlive the Model). We could also make that same function take an Option<MemoryIn>, which is the same struct but with owned vecs instead of slices, and that would replace the working memory with a pre-computed cache.

By default, callers just pass in None and ignore the result, and that gives them the original experience. So it's up to the caller to manage / store / serialize / whatever this cache if they want to.
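
Roughly, the shape I have in mind looks like the sketch below. Everything except the names MemoryOut, MemoryIn and inference_with_prompt is made up for illustration (including the `to_memory_in` helper), and the "inference" is a placeholder that just appends the prompt bytes to the memory.

```rust
/// Owned cache a caller can store, serialize and pass back in later.
pub struct MemoryIn {
    pub memory_k: Vec<u8>,
    pub memory_v: Vec<u8>,
}

/// Borrowed view of the model's context memory after an inference call.
pub struct MemoryOut<'model> {
    pub memory_k: &'model [u8],
    pub memory_v: &'model [u8],
}

impl MemoryOut<'_> {
    /// Copies the borrowed slices into an owned cache (hypothetical helper).
    pub fn to_memory_in(&self) -> MemoryIn {
        MemoryIn {
            memory_k: self.memory_k.to_vec(),
            memory_v: self.memory_v.to_vec(),
        }
    }
}

/// Placeholder model that only holds the working memory.
#[derive(Default)]
pub struct Model {
    memory_k: Vec<u8>,
    memory_v: Vec<u8>,
}

impl Model {
    pub fn inference_with_prompt(
        &mut self,
        prompt: &str,
        cache: Option<MemoryIn>,
    ) -> MemoryOut<'_> {
        // A supplied cache replaces the working memory before inference.
        if let Some(cache) = cache {
            self.memory_k = cache.memory_k;
            self.memory_v = cache.memory_v;
        }
        // Placeholder for real inference: record the prompt bytes.
        self.memory_k.extend_from_slice(prompt.as_bytes());
        self.memory_v.extend_from_slice(prompt.as_bytes());
        MemoryOut {
            memory_k: &self.memory_k,
            memory_v: &self.memory_v,
        }
    }
}
```

Since the returned slices borrow from the model, a caller who wants to keep the cache around has to copy it out before the next call, which is exactly the point where they decide how to store it.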

Check out wonnx! I'm not sure if it can be used independently from ONNX, but it would be super cool to figure out. Worth having a chat with them at some point.

Will do!! 👀 The idea of wgpu tensors is just so appealing in that it basically works anywhere with no driver issues and on any GPU.

I think we're in a pretty good place! I think it's just figuring out what the best "library API" would look like, and building some applications around it to test it. From there, we can figure out next steps / see what other people need.

Sounds good :)

setzer22 · 2023-03-15T10:55:12Z

setzer22
Mar 15, 2023
Maintainer

@Noeda of https://github.com/Noeda/rllama might wanna tag along here ☺️

Also, a Tauri-app equivalent to https://github.com/lencx/ChatGPT would pair very well with this. Good task for anyone who wants to be involved but doesn’t quite feel comfortable with the low level internals.

Indeed! The more we are working on this, the better 😄

As a first contact, some benchmarks comparing the ggml here and rllama's OpenCL implementations on CPU would be a good first step to evaluate whether other Rust tensor libraries would fit the bill :)

Noeda · 2023-03-15T16:54:22Z

Noeda
Mar 15, 2023

Howdy :) I am very happy to see LLM stuff picking up in Rust too.

rllama is currently a chimera hybrid of 16-bit and 32-bit floats, where 16-bit floats are used in OpenCL and 32-bit floats in operations not involving OpenCL.

As a first contact, some benchmarks comparing the ggml here and rllama's OpenCL implementations on CPU would be a good first step to evaluate whether other Rust tensor libraries would fit the bill :)

Currently, in terms of performance or memory use, rllama is not competitive with any of the ggml stuff. I have no quantization whatsoever.

I just checked my latest commit and on CPU only OpenCL I got 678ms per token. (with GPU, ~230ms). The llama.cpp project mentions in README.md that they are at around 60ms per token which is 4x faster than even my GPU version.

I have two ideas how to collaborate in near future:

Verification of results. I get reasonable text in my implementation but I don't know if it's really done all correctly, especially tokenization. Would our projects get same output if we set top_k=1, and use the same prompt?
Apples-to-apples benchmarking scripts. I currently run a shell script that tests each configuration of rllama (GPU on/off, LLaMA-7B vs LLaMA-13B). It's very ad-hoc.
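
To spell out why top_k=1 is a good setting for verification: once only the single best candidate survives the filter, sampling has nothing left to choose from, so the output no longer depends on the random number generator. A small sketch, with function names invented for this example rather than taken from either project:

```rust
/// Returns the indices of the `k` largest logits, best first.
pub fn top_k_indices(logits: &[f32], k: usize) -> Vec<usize> {
    let mut indices: Vec<usize> = (0..logits.len()).collect();
    // Sort descending by logit; `total_cmp` provides a total order over f32,
    // and the stable sort keeps the lower index first on ties.
    indices.sort_by(|&a, &b| logits[b].total_cmp(&logits[a]));
    indices.truncate(k);
    indices
}

/// With k = 1 the candidate set has one element, so "sampling" is just argmax.
pub fn greedy_token(logits: &[f32]) -> Option<usize> {
    top_k_indices(logits, 1).first().copied()
}
```

Two implementations that agree on the logits (up to ties) should then produce the same token sequence for the same prompt.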

I am currently working on removing more performance bottlenecks, which might improve rllama's performance and memory use, but after that I can offer to make a simple verification + benchmark suite that knows how to run our projects and verify they get the same results. I also wanted to make pretty graphs showing memory or CPU utilization over time etc. Maybe this would go into a new repository. If you have any ideas here, I'm all ears.

Excited for all of us :) 👍

setzer22 · 2023-03-15T21:34:26Z

setzer22
Mar 15, 2023
Maintainer

I just checked my latest commit and on CPU only OpenCL I got 678ms per token. (with GPU, ~230ms). The llama.cpp project mentions in README.md that they are at around 60ms per token which is 4x faster than even my GPU version.

I guess it depends on the CPU, but my times for the f16 models are closer to 230ms, so I'd be inclined to say GPU and CPU speed is comparable. This also matches my results from when I tried another gpu implementation. On the quantized models, I do get ~100ms/token.

Verification of results. I get reasonable text in my implementation but I don't know if it's really done all correctly, especially tokenization. Would our projects get same output if we set top_k=1, and use the same prompt?

That's a very good idea :) Other than setting top_k and the same prompt, we would need to make sure rng happens in the exact same way. We're currently using whatever rand's thread_rng gives by default, which is bad for reproducibility. Sampling is done using a rand WeightedIndex, and that's the only time the rng is invoked for each of the sampled tokens:

let dist = WeightedIndex::new(&probs).expect("WeightedIndex error");
let idx = dist.sample(rng);

So my guess is that as long as we're both using the rand crate, results should be comparable.
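
One caveat worth illustrating: weighted sampling is a pure function of the weights and the random draw, so comparable results also need the same stream of draws (same generator, same seed) on both sides. The sketch below is a plain cumulative-weight sampler written for illustration; it is not rand's WeightedIndex implementation, and taking the draw `u` as a parameter is an assumption of this example.

```rust
/// Picks an index with probability proportional to its weight, given a
/// uniform draw `u` in [0, 1). Returns `None` if there is nothing to sample.
pub fn sample_weighted(weights: &[f32], u: f32) -> Option<usize> {
    let total: f32 = weights.iter().sum();
    if weights.is_empty() || !(total > 0.0) {
        return None;
    }
    let target = u * total;
    let mut acc = 0.0_f32;
    for (i, &w) in weights.iter().enumerate() {
        acc += w;
        if target < acc {
            return Some(i);
        }
    }
    // Guard against rounding when `u` is very close to 1.
    Some(weights.len() - 1)
}
```

The same weights with the same draw always give the same index, and with a single candidate (the top_k=1 case) the draw stops mattering at all.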

I can offer to make a simple verification + benchmark suite that knows how to run our projects and verify they get the same results

That would be amazing! :)

philpax · 2023-03-16T11:16:04Z

philpax
Mar 16, 2023
Maintainer Author

It's worth noting that quantisation affects both speed and quality, so any benchmarks should be done with the original weights (which will probably limit the maximum size that can be used). Additionally, llama.cpp seems to have some existing bugs around tokenisation and inference at f16.

That is to say - let's get this benchmark on the road, but I think we'll be returning slightly incorrect results until we can address those issues.

1 reply

Noeda Mar 25, 2023

I have maybe one preliminary result I can report: in the past week I implemented a simple 4-bit quantization scheme in rllama. It is a somewhat simple lookup-table based scheme, where every block of 512 floating-point values on a row is considered a bucket, and the values are converted to 16 distinct values within that bucket. https://github.com/Noeda/rllama/blob/k4bit/src/weight_compression.rs#L6 I was looking for something that can quantize fast enough that it can be used at model load time.
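
To give a feel for this kind of scheme, here is a simplified sketch, not the rllama code linked above. It assumes the 16 table entries are evenly spaced between the bucket's minimum and maximum, which may differ from how rllama picks them, and it packs two 4-bit indices per byte.

```rust
pub const LEVELS: usize = 16; // 4 bits per value
pub const BLOCK_SIZE: usize = 512; // values per bucket

/// One quantized bucket: a 16-entry lookup table plus packed 4-bit indices.
pub struct QuantizedBlock {
    pub table: [f32; LEVELS],
    pub packed: Vec<u8>, // two indices per byte, low nibble first
    pub len: usize,
}

/// Quantizes one bucket using evenly spaced levels (an assumption of this sketch).
pub fn quantize_block(values: &[f32]) -> QuantizedBlock {
    let min = values.iter().copied().fold(f32::INFINITY, f32::min);
    let max = values.iter().copied().fold(f32::NEG_INFINITY, f32::max);
    let step = if max > min { (max - min) / (LEVELS - 1) as f32 } else { 0.0 };
    let mut table = [0.0_f32; LEVELS];
    for (i, slot) in table.iter_mut().enumerate() {
        *slot = min + step * i as f32;
    }
    let mut packed = vec![0u8; (values.len() + 1) / 2];
    for (i, &v) in values.iter().enumerate() {
        // Nearest table entry, clamped to the valid index range.
        let idx = if step > 0.0 { ((v - min) / step).round() as usize } else { 0 };
        let idx = idx.min(LEVELS - 1) as u8;
        if i % 2 == 0 {
            packed[i / 2] |= idx;
        } else {
            packed[i / 2] |= idx << 4;
        }
    }
    QuantizedBlock { table, packed, len: values.len() }
}

/// Looks every 4-bit index up in the table again.
pub fn dequantize_block(block: &QuantizedBlock) -> Vec<f32> {
    (0..block.len)
        .map(|i| {
            let byte = block.packed[i / 2];
            let idx = if i % 2 == 0 { byte & 0x0F } else { byte >> 4 };
            block.table[idx as usize]
        })
        .collect()
}

/// Splits a weight row into buckets of `BLOCK_SIZE` values and quantizes each.
pub fn quantize_row(row: &[f32]) -> Vec<QuantizedBlock> {
    row.chunks(BLOCK_SIZE).map(quantize_block).collect()
}
```

With evenly spaced levels, each value moves by at most half a step, so a single outlier in a bucket widens the step and costs precision for all the other values in that bucket.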

The quality is obviously worse. It does still generate meaningful text, but it gets into repetition loops, makes typos and goes off the rails much more easily than before, even in LLaMA-65B.

I also have some very rudimentary scripts that can compare token probabilities across different settings, e.g. LLaMA-65B at 4-bit quantization vs 16-bit floats. The top tokens are still mostly the same, but I think occasionally there will be glitches and unwanted tokens become very probable.

I'm working on getting 8-bit quantization working instead in rllama because the current 4-bit scheme is just too low quality to be usable. Later on I would like to try proper GPTQ-quantization to see how much better it is than my simple scheme.

Also, once the 8-bit quantization is working, I'll try to make my scripts and setup more formalized and documented so we can compare across implementations.

TL;DR: naive 4-bit quantization destroys the usefulness of the model. You need smarter quantization or more bits.

Aisuko · 2023-05-21T02:24:35Z

Aisuko
May 21, 2023

Hi, guys. I am so excited to find you. I am a contributor to other communities (related to the AI domain). I was planning to write a rust-llama.cpp, but I also believe it is not a good idea to depend on C/C++ packages: it is hard to maintain the code and combine them together. And the binding in ggml is such magic. I'd like to contribute here, so I do not need to write a new one. Thanks.
