Support for RWKV #75
RWKV is really interesting, but the Python code is absolutely horrendous. I made an attempt to clean it up a little but the author wasn't interested. I'd look at this one if you were interested in trying to make the Rust code just a frontend to a Python backend: https://github.com/harrisonvanderbyl/rwkvstic But I sort of don't know what the benefit would be of doing that for a CLI app, because there's already stuff like Oobabooga's text thingy that handles a lot of different model formats with a frontend. I believe the rwkvstic repo I linked is on a path to get added to the actual Transformers repo as one of the supported models. If you wanted to do a Python frontend to stuff, maybe it would make more sense to look at making the frontend for Transformers rather than RWKV specifically. Then you'd pretty much get RWKV for free once support is merged. huggingface/transformers#20737 huggingface/transformers#20809 Sort of related, but the TextSynth server (closed source) actually supports/publishes some 4-bit RWKV models: https://bellard.org/ts_server/ It can do pretty fast inference on CPU. I contacted the author and asked if he'd be willing to disclose the model format for those files, and he said he'd probably post a Python converter soon, which would document the format as well.
Nah, we'd want to port the actual inference code to Rust, similarly to what we did for LLaMA itself. The less Python we have in the codebase, the better 🦀
After a whole lot of struggling because I have essentially no idea of what I'm doing with NN or math stuff: https://github.com/KerfuffleV2/smolrsrwkv Obviously that's not useful as anything more than an example of the basics, but it does work and generates tokens not all that much slower than the simple Torch version.
Oh wow, nice one! That's an excellent start - do you want to try porting it to ggml-rs? (#81 exposes it as a separate library, so you should be able to use a Git dependency)
(By the way, are you in the Discord? It'd be great to chat synchronously if need be)
Thanks, I didn't really do much more than port the Python example here though: https://johanwind.github.io/2023/03/23/rwkv_details.html
I'd have to say probably no. I don't think I'm qualified to write a version that's actually worth using in real software. So if you or anyone else wants to use my code as an example or starting point, you're certainly very welcome. I may mess around with trying to get it to work on backends other than ndarray but I still don't think it'll be suitable for production. Also, sad to say, I have a very short attention span. Just generally speaking, you're probably better off not depending on me for anything long term.
Eh, none of us are really LLM experts - it's more a matter of engineering and refinement. I think you're just as qualified as the rest of us. Happy to help any efforts get across the line.
Don't worry, I know the feeling very well. No worries at all; a drive-by contribution is fine as long as it's maintainable, and from what I've seen of your implementation, you're doing fine in that regard. I'd encourage you to give it a try, but no pressure at all - even if you don't do it, your port will still be a great resource for those coming after (especially with those sweet static types!)
I don't know about that! I also really don't know any math stuff either. So I think I'd probably focus more on trying to make it a good example for understanding how the model works in its basic form, rather than something that could be used directly.
I did look, but one thing I noticed is it doesn't seem like GGML supports the operations that are required (or I don't know how to work around/refactor based on it). For example, how to do stuff like subtract tensors from a value, or map something like exponent to the elements? It doesn't seem obvious how to do that with the API it provides. Could be it provides a higher level way to accomplish the same stuff, but unfortunately I don't have a high level understanding of how it works so that's kind of beyond my ability. Anyway, maybe if you or someone showed me how to do that, porting it to GGML would be more practical. Here's an example of the kind of thing I mean: https://github.com/KerfuffleV2/smolrsrwkv/blob/c21cb8008b51aa10fb2c0eaa2a81714e8c27f76f/src/model.rs#LL125C70-L125C70 (that's elementwise map).
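To illustrate what I mean, this kind of thing is a one-liner with ndarray (a rough sketch, not the actual code from the repo):

```rust
use ndarray::Array1;

fn main() {
    let x = Array1::from(vec![0.5f32, -1.0, 2.0]);

    // Elementwise map: exponentiate every element.
    let e = x.mapv(f32::exp);

    // "Subtract a tensor from a value": broadcast the scalar over the array.
    let one_minus_x = 1.0f32 - &x;

    println!("{e}\n{one_minus_x}");
}
```

It's that kind of scalar broadcast and arbitrary elementwise operation that I couldn't find an obvious equivalent for in the GGML API.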
Haha, yeah, I honestly found the Python example very hard to follow since there were no types and it was using weird NumPy shortcuts. I cleaned up/reorganized the Rust version a bit, so hopefully it's easier to understand now. The state part was particularly bad in the initial version.
You know your way around a type system, that's good enough for me :P
Oh... yeah. Just had some free time to look at this, and yeah, I see the problem. For subtracting tensors from a value, you'd use ...
Unfortunately, this is something I recognize from my MATLAB days - it's convenient, but it is absolutely bewildering the first time you see it 😅
Yeah, nice, appreciate it! Given that you can't actually implement a GGML version with the operations it currently supports, that's understandable. Still, tremendously cool to have an RWKV implementation in easy-to-understand Rust - you should be proud!
After profiling it, 80-90% or more of the time is just spent doing the matrix multiplication. The rest isn't really significant. How crazy would it be to just use ggml for only that one operation? Even having to copy the data every time, it probably still would be worth it. It may be a lot faster, but that still wouldn't really make that code suitable for general use, though, since it does everything in 32-bit floats.
I'm not convinced that it would actually end up that much faster than what you have now, though.
Well, the big deal is currently it only uses a single thread. Supposedly ndarray supports threading but I've failed to get it to actually use more than one. I also messed around with Rayon, just running the groups of 3 and 2 matrix multiplications concurrently. So just running those matmul ops in parallel would make a huge, huge difference, I think.
Hmm, yeah, makes sense. Do the matmuls happen at the end or do they need to be copied back and forth?
Probably have to be copied back and forth. You can look here: https://github.com/KerfuffleV2/smolrsrwkv/blob/main/src/model.rs
Each group is basically all together, but they need data from other calculations and further calculations depend on those results as well. (By the way, I'm getting reasonably happy with how readable it is now that it's all split into logical modules.)
Maybe it's possible to construct the missing operations out of the ones that already exist? Setzer's OK with adding more operations to our GGML bindings, too. Your definition of the model looks really clean! Nice work - it's definitely much easier to understand when it's split up like that. I wonder if we can split up llama-rs's model the same way.
I found another way: https://github.com/KerfuffleV2/smolrsrwkv/blob/ab3df0c706ff786a943a3896d7d96b597b45fc98/src/util.rs#L121 It actually runs at acceptable speed now (relatively speaking), even on the 3B model. It's pretty simple now, but getting here was quite the struggle. ndarray claims to use the matrixmultiply crate, which supports threading and even has a feature flag to turn it on, but that does absolutely nothing - ndarray just never calls matrixmultiply for matrix-vector multiplication. The big issue now is that 32-bit models use an insanely impractical amount of memory. I think I may have taken this as far as it can go, because stuff like quantization or using non-float types looks to be a huge undertaking.
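For reference, the gist of the fix is just parallelizing the matrix-vector product over rows by hand rather than relying on ndarray to do it - roughly like this (a simplified sketch, not the exact code from the linked commit):

```rust
use ndarray::{Array1, ArrayView1, ArrayView2, Axis};
use rayon::prelude::*;

/// Matrix-vector product where each output row's dot product runs on the
/// rayon thread pool instead of a single thread.
fn par_matvec(m: ArrayView2<f32>, v: ArrayView1<f32>) -> Array1<f32> {
    let out: Vec<f32> = m
        .axis_iter(Axis(0))      // one view per matrix row
        .collect::<Vec<_>>()     // collect so rayon can split the work
        .into_par_iter()
        .map(|row| row.dot(&v))  // independent dot products, in parallel
        .collect();
    Array1::from(out)
}
```

Since the matrix-vector products dominate the runtime, spreading just those across cores is where almost all of the speedup comes from.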
Fantastic! Yeah, it doesn't look like ndarray supports f16 or lower-precision types - I'm not sure what the best plan of attack is after that. As far as I can tell, none of the other libraries support <f32 inference either. Once #85 lands, we will have opened ggml up to extension by us, so we can investigate implementing the functionality needed for RWKV. I don't have any other suggestions until then 😅
Looks like someone's going for it: https://github.com/saharNooby/rwkv.cpp
Oh, nice. I look forward to ripping off their ide... I mean collaborating in the spirit of open source. This is probably poor etiquette but there's no ability to create issues in that repo: @saharNooby Just in case it's useful, tagging you so you're aware of the efforts/discussion here. Also (in even more poor taste) I'll plug my own little Rust implementation: https://github.com/KerfuffleV2/smolrsrwkv It's possible a C/C++ programmer could look at that and get something out of it (good chance you already know more than I do though). |
Hi all! I completely missed that Issues were not enabled for my repo, but I've enabled them now - thanks for pointing it out! Looks like our work is closely related, both of us extending ggml. For context, here's what I know so far:
My next goal is to implement ...
What I linked is a separate project (the only relation here is that llama-rs wants to implement RWKV and I've also contributed a little bit of code here). My project just uses a general math library to implement a simple version of RWKV by doing the calculations directly on matrices/vectors, so basically it's just useful for understanding how RWKV fits together. From the other stuff you just said, it seems like you're already past the point of something like that being helpful.
I see - I meant your work here in the llama-rs repo. Anyway, you may be interested in the newly added operators. With the new operators, I managed to get non-quantized inference working.
Interesting. Is there a reason to implement those elementwise operations all separately instead of adding a generic elementwise map operation? The matrix multiplications matter so much with this, it's crazy. I'm thinking about just setting up a GGML graph for those operations only; the rest can be done in a naive/simple way and still looks like it achieves 95% of the maximum performance.
I guess it was simpler for me to just add new operations, copy-pasting the code of existing ones, than to invent a new (for ggml) way of doing things... But I agree that a generic map operation would work too, if someone implemented it. I'm not sure how to do it in C/C++.
I can't really help you with the C++ part. Come over to the Rust side! In seriousness though, you may well end up doing less work overall if you take that approach, because it's just one place to worry about changes. You can just have the operation take a function pointer and then apply that function pointer to each element. Basically the same as what the existing operations already do. Something like:

```c
// Apply `fun` to each of the `n` elements of `x`, writing the results to `y`.
inline static void ggml_vec_mapv_f32(
        const int n, float * y, const float * x,
        float (*fun)(float)) {
    for (int i = 0; i < n; ++i) y[i] = fun(x[i]);
}

static void ggml_compute_forward_mapv_f32(
        const struct ggml_compute_params * params,
        const struct ggml_tensor * src0,
        struct ggml_tensor * dst) {
    assert(params->ith == 0);
    assert(ggml_are_same_shape(src0, dst));

    if (params->type == GGML_TASK_INIT || params->type == GGML_TASK_FINALIZE) {
        return;
    }

    const int n  = ggml_nrows(src0);
    const int nc = src0->ne[0];

    assert(dst->nb[0]  == sizeof(float));
    assert(src0->nb[0] == sizeof(float));

    // Placeholder: the function pointer has to be smuggled in somehow,
    // e.g. stored alongside the tensor as extra metadata.
    float (*fun)(float) = GET_FUNCTION_DEFINITION_SOMEHOW();

    // Apply the map row by row.
    for (int i = 0; i < n; i++) {
        ggml_vec_mapv_f32(nc,
            (float *) ((char *) dst->data  + i*( dst->nb[1])),
            (float *) ((char *) src0->data + i*(src0->nb[1])), fun);
    }
}
```

It looks like storing metadata like a function pointer could be pretty annoying. From looking at other functions, it seems like the approach that was used was to stash extra parameters in another tensor. Note: Many years ago I was a C programmer but I never got into C++. Consider the above pseudocode.
It might be getting annoying, me writing so many comments here, but: I've been working on my Rust RWKV implementation and got 8-bit quantization working. I also managed to split it into a library, so it would be pretty easy to interface with: https://github.com/KerfuffleV2/smolrsrwkv/blob/main/smolrwkv-cli/src/main.rs Interestingly, it seems about as fast as the official Python version when running on CPU; however, it's basically still too slow to really be practical - the 3B model runs about as fast as a 13B LLaMA model. I don't know if this is something that would be useful for llama-rs or not.
@KerfuffleV2 Do I understand the quantization algo correctly - that for each matrix row, you determine the min and max value, and then represent each element as an 8-bit integer between that min and max?
Uhhh... I basically cargo culted it from the official version so I don't know that I can give you a good answer here. See:
The second one is the CPU-based matrix multiplication function for when 8-bit quantization is in effect, so those extra values are required at the point you do the MM. Also worth pointing out: my implementation is based on the example code here: https://johanwind.github.io/2023/03/23/rwkv_details.html You probably noticed, but only some of the tensors are (or can be) quantized - the smaller and 1D ones are left as 32-bit. I think there are some differences between that and the official version (although the 8-bit quantization stuff still seemed to fit in, with some changes). For one thing, the official version transposes some of the tensors while mine doesn't, and it also has an extra state item. quick edit: The Rust 8-bit MM function is here, by the way: https://github.com/KerfuffleV2/smolrsrwkv/blob/076d14882be2ca471796c555d3c967d8e4d2585d/smolrwkv/src/util.rs#L159
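For anyone following along, the row-wise 8-bit scheme boils down to something like this (a simplified sketch, not the exact code from either project):

```rust
/// Quantize one row of f32 weights to u8, keeping the row's minimum and the
/// scale so values can be approximately reconstructed later.
fn quantize_row(row: &[f32]) -> (Vec<u8>, f32, f32) {
    let min = row.iter().copied().fold(f32::INFINITY, f32::min);
    let max = row.iter().copied().fold(f32::NEG_INFINITY, f32::max);
    // Guard against a zero scale when every element in the row is identical.
    let scale = ((max - min) / 255.0).max(f32::EPSILON);
    let quantized = row
        .iter()
        .map(|&x| ((x - min) / scale).round() as u8)
        .collect();
    (quantized, min, scale)
}

/// Approximately recover the original value: q * scale + min.
fn dequantize(q: u8, min: f32, scale: f32) -> f32 {
    q as f32 * scale + min
}
```

The per-row min and scale are exactly the "extra values" that have to be carried along into the matrix multiplication so it can dequantize on the fly.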
I've been messing around trying to allow GGML to map arbitrary operations: https://github.com/KerfuffleV2/llama-rs/blob/5fd882035e95501d4127e30c84a838afbffcc95e/ggml/src/lib.rs#L207 This is what it looks like in use: https://github.com/KerfuffleV2/llama-rs/blob/5fd882035e95501d4127e30c84a838afbffcc95e/llama-rs/src/lib.rs#L1310 The first one is just replacing one of the existing operations as a demonstration. Those two ops should enable supporting other models like RWKV where GGML doesn't currently have the required operations. I am going to try to see if I can get this (or something similar) added to GGML, but in the meantime, or if that fails, we could potentially use a patched version of GGML that enables this stuff. The obvious downside is that it would make merging GGML changes more of a pain. (Also, I'm not really a C developer anymore, so while it appears to work, there may be other issues with the current implementation.)
Interesting! Yes, we're open to having our own fork of GGML - we'll just have to manage the patches ourselves. The primary reason I think they would shy away from it is that a native implementation would almost always be faster (especially with vectorised operations); that being said, it may be sufficient for operations that aren't on the hot path. I'm not sure if it'd be better to go with @saharNooby's custom-implemented operations or to use your solution (or both). How many of RWKV's unsupported operations are on the hot path?
Operations on the hot path: basically just the matrix multiplications.
Operations not on the hot path: everything else.
Okay, that's a little bit of an exaggeration, but seriously, most of the stuff (especially for RWKV, as far as I can see) is basically insignificant next to the matrix multiplication. You're 100% right about the map approach likely not being as performant (although note that the map functions work on rows, not individual items, so they could potentially be optimized to use SIMD, etc.). I'd say it's mostly something to enable flexibility rather than being the final optimized solution. I'd rather be able to do something slightly suboptimally than not at all, in most cases.
I'm obviously going to be biased here, but I really prefer the flexible approach rather than hard-coding and having to maintain operations that may only help one particular model. We'll have to see if there's a big performance difference between the two approaches. My intuition tells me it's probably not going to be that significant, but we'll have to see.
I'm not really sure how to answer that. What constitutes "the hot path" when setting up a GGML model? You can see where they set up the RWKV tensors here: https://github.com/saharNooby/rwkv.cpp/blob/e84c446d9533dabef2d8d60735d5924db63362ff/rwkv.cpp#L315
I'm planning to rip off saharNooby's approach with my map-based operations instead and see how it works out in my separate RWKV project.
This is still a fairly long way off, but you can get a better idea of what ops are missing here: https://github.com/KerfuffleV2/smolrsrwkv/blob/e58d2e39de1c9627199fc1f89cebd35f3bc41a61/smolrwkv/src/ggml/graph.rs#L11 (and I only had to rip off my own pre-existing code for this, since it's basically the same except with the operations spelled out). Setting up the model looks relatively clean using impls on the component parts. The same approach could probably be used for llama-based models as well.
Hopefully these posts aren't annoying (let me know if so - maybe this should be in a discussion instead). By the way, this is happening in my separate smolrsrwkv project, not llama-rs.

Finally got the first version of inference with GGML working: KerfuffleV2/smolrsrwkv@995841c It is currently only set up to use full 32-bit floats.

One interesting thing in relation to llama-rs I noticed is that llama-rs builds up the whole graph again for every evaluation step. It looks like llama.cpp does that too, but I really don't know why - it doesn't seem necessary.

Also maybe relevant for @saharNooby: it seems like they're copying all the state in and out of the model at every step. As far as I can tell, that's not actually necessary. You can just do this: https://github.com/KerfuffleV2/smolrsrwkv/blob/995841c57d4a92976af53fe69db934461f06a66a/smolrwkv/src/ggml/graph.rs#L190 Copy the state from the tensor with the new value into the tensor that was set up for state. Then it just naturally loops into itself every evaluation, and it's not necessary for the rest of the code to look at that at all. The only thing I need to copy out is the probabilities at the end of the computation: https://github.com/KerfuffleV2/smolrsrwkv/blob/995841c57d4a92976af53fe69db934461f06a66a/smolrwkv/src/ggml/context.rs#L79

Next step is to get quantized models working + various cleanups.

edit: Doing some profiling, I could barely even find the map functions. The total time spent in both binary and unary map functions appears to be around 0.1%. The time spent in matrix multiplication? Basically everything else. So that part doesn't seem like it's going to matter at all. Oddly enough, the GGML version seems about the same speed as the ndarray-based version, which is disappointing. Hopefully there will be more of a difference with the quantized versions.
No worries! These have been very interesting to keep up with, I just haven't had much to add. Awesome to see the work you've done! Yeah, I think I've heard that traditional backends are on par with GGML for 32-bit. I think it pulls far ahead with quantized values.
You're right about that. I just got quantization working (KerfuffleV2/smolrsrwkv@4208bc7). I got about 1-2 TPS on the 3B model with NDArray 8-bit quantization, while I can actually load the 7B model with GGML 4-bit quantization and it gets 2+ TPS. Right now there's no ability to save a prequantized file, so it's a little slow to load (takes about 90 sec for the 7B model); however, that's only using one thread for quantization.
One weird thing is it actually uses around 17GB RAM even though GGML says it only needed 6.4GB. (This actually occurs just during loading, so it's nothing to do with actually evaluating the model.)
Yeah, agree. I decided to not do anything with it because it does not look like a performance bottleneck.
Please do quality checks - you may be surprised how badly RWKV may perform with quantization. Here is a rough script for calculating perplexity; it may be useful. If you confirm that RWKV breaks after the simple 4-bit quantization implemented in ggml, that would be good to know. I also suggest comparing perplexity against the FP32 version.
As an aside, are there any GGML-format RWKV models floating around? Is there a standard for them?
@philpax I haven't seen any, but saharNooby's project uses the GGMF format with version 100 and includes a converter, so you could make some if you want. I'd probably use a slightly different approach if I implemented it, because it's possible to also precompute the embeddings, which I don't think that project does. quick edit: (By the way, it looks like llama.cpp is open to merging my map operations stuff, so we won't have to maintain a separate version. I just need to clean it up and get it ready for actual merging, hopefully tomorrow.)
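For anyone wondering what "precompute the embeddings" means here: RWKV applies the model's first layer norm (ln0) to the embedded token, and since the embedding lookup is just a row select, that layer norm can be baked into the embedding matrix once at conversion time. A rough sketch, assuming plain slices and PyTorch's default 1e-5 epsilon:

```rust
/// Standard layer norm over one vector: (x - mean) / sqrt(var + eps) * weight + bias.
fn layer_norm(x: &mut [f32], weight: &[f32], bias: &[f32]) {
    let n = x.len() as f32;
    let mean = x.iter().sum::<f32>() / n;
    let var = x.iter().map(|v| (v - mean).powi(2)).sum::<f32>() / n;
    let inv_std = 1.0 / (var + 1e-5).sqrt();
    for (v, (w, b)) in x.iter_mut().zip(weight.iter().zip(bias)) {
        *v = (*v - mean) * inv_std * w + b;
    }
}

/// Bake ln0 into every row of the embedding matrix so it doesn't have to be
/// applied per token at inference time.
fn precompute_embeddings(emb: &mut [Vec<f32>], ln0_weight: &[f32], ln0_bias: &[f32]) {
    for row in emb.iter_mut() {
        layer_norm(row, ln0_weight, ln0_bias);
    }
}
```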
@KerfuffleV2 You mean merge the map operations into ggml/llama.cpp itself?
@philpax No "official" standard, but indeed, I just used the ggmf format with version 100.
Here is the PyTorch -> ggmf converter code (note that the doc comment there is a little out of date; I'll update it soon). And it looks like someone has already uploaded converted RWKV models in this format to HuggingFace...
Interesting - you don't include the vocabulary in there?
Fair enough, I doubt it makes a difference from a performance perspective. I wasn't sure if you knew you didn't have to do that or not - I've been figuring out what I need to do as I go with this stuff. (Also, not unnecessarily copying stuff seems nicer to me, but that's just my own aesthetic preference.)
I do plan to do more testing like that. Right now, I've only been trying to get the basic functionality going. I did notice that ... By the way, I definitely have been looking at your repo and approach to doing stuff. It's been very helpful! I didn't end up actually copying the part where you set up the model, but there was plenty of other useful information - especially the part about the shapes being reversed in GGML; I definitely could have wasted a lot of time there.
That might be a bit too sad for me - I'm really hoping to get something that can run at around the same speed as the equivalent llama model. Right now, based on testing the 14B RWKV model vs 13B Vicuna, RWKV is about twice as slow (1.6 TPS vs 3.3 TPS). That's using ... I'm curious about how your ...
Also, I'd guess the reason why the performance for ...
Are you doing anything special to convert from FP32 (or BF16, which is basically the same) into 16-bit? I looked at your code, and it seemed like the answer is no. Did you check to make sure all the values in the model can actually be represented as FP16 without doing special stuff like quantizing to 16-bit? We're talking about normal machine 16-bit floats here, right? Not bfloat16 (as far as I could see, GGML doesn't have any bfloat16 stuff in it, but I might be mistaken). (I'm also not sure how that would work with my map ops stuff, which currently works with FP32.)
Yeah, it probably doesn't make a performance difference. It also saves a small amount of memory/reduces model file sizes too, since ... After thinking about it a bit more, there might be another reason: if you precompute, then you can do the calculation while the model is still FP32, with higher precision. If you do it later, then you have to perform the calculation on the quantized/converted values. I guess those tensors aren't usually quantized to 4-bit, but it might make a difference if you're converting to 16-bit.
Not who you asked, but RWKV has its own HuggingFace Tokenizers-style tokenizer definition, with the vocab stored outside of the model files. There would be a way to embed it in every copy of the model files, of course, but that would be a bit of a different approach from the status quo.
These posts keep getting longer...
@philpax Yes, the tokenizer remained on the Python side, since that's the side I'm most familiar with; and, more objectively, Python has more tooling for LLMs, like sampling, etc. The tokenizer is also not a performance-critical part of inference, so I saw no real reason to re-implement it in C/C++.
@KerfuffleV2 First: I've tried my best to write an AVX2 impl of the quantized matmul. But for me, quantization is not about performance, it's about the ability to run way larger models than RAM allows. I understand that this is not everyone's viewpoint tho :)

Second: comparing performance of RWKV to Transformers (GPT, LLaMA and its variants, etc.) can be tricky, and depends on the use case. When generating text with an empty or small prompt, latency per token of RWKV vs a Transformer should be comparable (and maybe slower, as you noticed). But consider generating text with a huge prompt that exceeds the Transformer's context length and needs to be cut. A relevant use case is having a long conversation with a chat bot, or collaborative novel writing. We want to generate each new token quickly, but every time the prompt gets cut/shifted, the Transformer's KV cache for it has to be recomputed from scratch, which on CPU can take tens of minutes.

You may say: okay, we can wait for the KV cache calculation once, when running the model on the prompt for the first time. But here is the thing: to generate the next token after the context has been cut again, the whole cache needs to be recomputed again. I've tried to "shift" the cache, but this fundamentally does not work with Transformers -- values in the cache depend on the position of the token, and after shifting the cache the values lose their meaning to the model.

RWKV is not free of these tens of minutes of computation, of course. To compute the next token for a huge prompt, we need to have the state for that prompt, which means processing it once. But the state is small and can simply be saved and updated as generation continues; it never needs to be recomputed just because the context moved. This is the reason why I invest time in RWKV at all -- it may be a worse architecture/model by quality, but the fact is that on CPU it is actually usable with long prompts, in contrast to Transformers.
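To put some rough numbers on why keeping the RWKV state around is so much cheaper than keeping (and recomputing) a Transformer KV cache, here's a back-of-the-envelope sketch. The sizes are illustrative assumptions (32 layers, 4096-wide hidden state, 2048-token context, everything in f32), and it assumes RWKV keeps five state vectors per layer:

```rust
fn main() {
    // Illustrative model dimensions, not tied to any specific checkpoint.
    let (n_layer, n_embd, ctx_len) = (32usize, 4096usize, 2048usize);
    let f32_size = 4usize;

    // Transformer: a K and a V vector per layer for every context position.
    let kv_cache = 2 * n_layer * n_embd * ctx_len * f32_size;

    // RWKV: a handful of state vectors per layer, independent of prompt length.
    let rwkv_state = 5 * n_layer * n_embd * f32_size;

    println!("KV cache:   {} MiB", kv_cache / (1024 * 1024)); // ~2048 MiB
    println!("RWKV state: {} KiB", rwkv_state / 1024);        // ~2560 KiB
}
```

The KV cache also has to be rebuilt from scratch whenever the prompt window shifts, whereas the RWKV state is just updated in place token by token, which is the point being made above.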
I just do a plain conversion in the converter, nothing special.
Hmm, this prompted me to think. I see that the original model files have a size of 2x the parameter count, which suggests they are already stored as 16-bit floats. Though after researching the range of weights in RWKV, I see that there is not much range at all.
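If anyone wants to sanity-check that, a quick way to measure the worst-case round-trip error of converting weights to FP16 and back is something like this (using the `half` crate; just an illustration, not code from either project):

```rust
use half::f16;

/// Worst-case absolute error from converting each weight to f16 and back.
/// Anything outside the f16 range (±65504) turns into infinity here, so an
/// overflow is immediately obvious.
fn max_f16_roundtrip_error(weights: &[f32]) -> f32 {
    weights
        .iter()
        .map(|&w| (w - f16::from_f32(w).to_f32()).abs())
        .fold(0.0f32, f32::max)
}
```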
I didn't see anything related to bfloat16 in ggml either.
Probably, you don't need to do anything -- activations (the hidden state vector) are in FP32 anyway. Edit: added later
Looks like these improvement PRs are not much related to the problem we have here. Also, I've tried naively minimizing some error when quantizing, and this did not help much; I figured it was not worth the complexity and the increased quantization time.
Because we need to store/load an outlier weight at an arbitrary index in the block, and do the matmul in FP32.
Why did I only see this now? Those implementations are stupidly simple. My server uses a custom fork of HF Tokenizers (Rust, Python, etc.): https://github.com/huggingface/tokenizers Performance-wise, numpy+numba is as fast as ggml. dfdx can only use 4 threads for matmul. Matmul takes up 80-90% of the time, so using all the CPU cores is good enough.
You need Zig. Zig has cross-arch SIMD. Optimize cycle count with llvm-mca. Any Rust crate that depends on matrixmultiply is not fast enough. q4_2 and q4_3 are coming; see this.
Interesting, thanks! I'm not sure about adding another language into the mix tho. Ideally, I would like to not even need to fork ggml...
Already merged; it's fast and works well with Raven 7B.
To me, it's better than writing AVX, NEON, and whatever by hand. You can distribute the pre-compiled assembly (generated by Zig) with the repo.
I took a stab at this but couldn't get it across the finish line. If anyone would like to pick up where I've left off, please feel free to reuse the code here: https://github.com/danforbes/llama-rs/blob/dfo/model/rwkv/crates/models/rwkv/src/lib.rs
So this is a pretty immense task and I'd start with #45, but...
It's entirely open-source, so not legally burdened like LLaMA, and (from what I've seen) is more powerful than BLOOM at the same parameter count.
I asked the RWKV Discord which implementation would be worth looking at, and this is what I was told:
So it sounds like rwkv_pip_package is the way to go as source material: https://github.com/BlinkDL/ChatRWKV/blob/main/rwkv_pip_package/src/rwkv/model.py
The following articles are very useful for understanding how RWKV works:
An interesting detail from the latter is the following:
This may pose issues for the GGML 4-bit quantisation format, which is non-optimal. We would likely want GPTQ quantisation.