
Support for RWKV #75

Open
philpax opened this issue Mar 26, 2023 · 47 comments
Labels
issue:enhancement New feature or request topic:model-support Support for new models

Comments

@philpax
Collaborator

philpax commented Mar 26, 2023

So this is a pretty immense task and I'd start with #45, but...

RWKV is an RNN with Transformer-level LLM performance, which can also be directly trained like a GPT transformer (parallelizable). And it's 100% attention-free. You only need the hidden state at position t to compute the state at position t+1. You can use the "GPT" mode to quickly compute the hidden state for the "RNN" mode.

So it's combining the best of RNN and transformer - great performance, fast inference, saves VRAM, fast training, "infinite" ctx_len, and free sentence embedding (using the final hidden state).

It's entirely open-source, so not legally burdened like LLaMA, and (from what I've seen) is more powerful than BLOOM at the same parameter count.
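In practical terms, "RNN mode" inference is just a per-token loop that carries a fixed-size state forward. A rough Rust sketch of what that shape looks like (RwkvState, rwkv_step, and the greedy sampler here are hypothetical placeholders, not anything from an existing codebase):

// Hypothetical placeholder types and functions, for illustration only.
struct RwkvState(Vec<f32>); // fixed-size per-layer state carried between tokens

fn rwkv_step(state: &mut RwkvState, token: u32) -> Vec<f32> {
    // Run one token through all layers, updating `state` in place and
    // returning the logits for the next token.
    unimplemented!()
}

fn greedy_generate(prompt: &[u32], n_tokens: usize) -> Vec<u32> {
    let mut state = RwkvState(Vec::new());
    let mut logits = Vec::new();
    for &tok in prompt {
        // Only `state` needs to survive between tokens.
        logits = rwkv_step(&mut state, tok);
    }
    let mut out = Vec::new();
    for _ in 0..n_tokens {
        let next = logits
            .iter()
            .enumerate()
            .max_by(|a, b| a.1.total_cmp(b.1))
            .map(|(i, _)| i as u32)
            .unwrap();
        out.push(next);
        logits = rwkv_step(&mut state, next);
    }
    out
}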

I asked the RWKV Discord which implementation would be worth looking at, and this is what I was told:

RWKV-LM/RWKV-v4neo/src/model.py is the implementation that's actually used to train the large models, it's cuda only and has tons of features you probably don't need.
rwkv_pip_package only implements inference, but is a good implementation and worth a look, recently got a lot more complex due to supporting more and more strategies and including various optimizations.
ChatRWKV/src/model_run is an older version, but haven't played with it so not sure how good it is. Might be worth a look since it's basically an older version of the one in rwkv_pip_package.
RWKV_in_150_lines.py I still haven't fully checked out, but I know it doesn't support GPT mode, so that may or may not be less useful
Also worth a look is RWKV-v4neo/src/model_run.py, which is a small inference-only impl capable of loading the large RWKV checkpoints
I'm not sure if it has GPT-mode, though

So it sounds like rwkv_pip_package is the way to go as source material:

https://github.com/BlinkDL/ChatRWKV/blob/main/rwkv_pip_package/src/rwkv/model.py

The following articles are very useful for understanding how RWKV works:

An interesting detail from the latter is the following:

The largest number a 16-bit floating point number (float16) can represent is 65 504, anything above that overflows, which is bad. Most of the code has no problems with this, partially because the Layer Normalizations keep values in a reasonable range. However, the RWKV attention contains exponentially large numbers (exp(bonus + k)). In practice, the RWKV attention is implemented in a way where we factor out an exponential factor from num and den to keep everything within float16 range. See for example the time_mixing function in RWKV in 150 lines.
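Concretely, the factoring keeps the numerator and denominator scaled by a shared running exponent, so exp() only ever sees differences. A minimal per-channel Rust sketch of the idea (names follow RWKV_in_150_lines; time_decay is assumed to already be -exp(w) as in that file, and this is an illustration rather than anything from ggml):

/// One channel of the WKV recurrence with the exponential factored out, so that
/// exp() only ever sees differences and stays in a float16-friendly range.
/// `aa`/`bb` are the scaled numerator/denominator, `pp` the shared exponent.
fn wkv_step(
    k: f32, v: f32, time_first: f32, time_decay: f32,
    aa: &mut f32, bb: &mut f32, pp: &mut f32,
) -> f32 {
    // Output: (num + exp(time_first + k) * v) / (den + exp(time_first + k)),
    // evaluated relative to the running maximum exponent.
    let ww = time_first + k;
    let qq = (*pp).max(ww);
    let e1 = (*pp - qq).exp();
    let e2 = (ww - qq).exp();
    let wkv = (e1 * *aa + e2 * v) / (e1 * *bb + e2);

    // State update: decay the old num/den and fold in exp(k) * v,
    // again relative to a fresh shared exponent.
    let ww = *pp + time_decay;
    let qq = ww.max(k);
    let e1 = (ww - qq).exp();
    let e2 = (k - qq).exp();
    *aa = e1 * *aa + e2 * v;
    *bb = e1 * *bb + e2;
    *pp = qq;
    wkv
}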

This may pose issues for the GGML 4-bit quantisation format, which is non-optimal. We would likely want GPTQ quantisation.

@philpax philpax added the issue:enhancement New feature or request label Mar 26, 2023
@KerfuffleV2
Contributor

KerfuffleV2 commented Mar 26, 2023

RWKV is really interesting, but the Python code is absolutely horrendous. I made an attempt to clean it up a little but the author wasn't interested.

I'd look at this one if you were interested in trying to make the Rust code just a frontend to a Python backend: https://github.com/harrisonvanderbyl/rwkvstic

But I sort of don't know what the benefit would be of doing that for a CLI app, because there's already stuff like Oobabooga's text thingy that handles a lot of different model formats with a front end.

I believe the rwkvstic repo I linked is on a path to get added to the actual Transformers repo as one of the supported models. If you wanted to do a Python frontend to stuff, maybe it would make more sense to look at making the frontend for Transformers rather than RWKV specifically. Then you'll pretty much get that for free once support is merged.

huggingface/transformers#20737

huggingface/transformers#20809


Sort of related, but the TextSynth server (closed source) actually supports/publishes some 4bit RWKV models: https://bellard.org/ts_server/

It can do pretty fast inference on CPU. I contacted the author and asked if he'd be willing to disclose the model format for those files, and he said he would probably post a Python converter soon, which would document the format as well.

@philpax
Collaborator Author

philpax commented Mar 26, 2023

Nah, we'd want to port the actual inference code to Rust, similarly to what we did for LLaMA itself. The less Python we have in the codebase, the better 🦀

@KerfuffleV2
Contributor

After a whole lot of struggling because I have essentially no idea what I'm doing with NN or math stuff: https://github.com/KerfuffleV2/smolrsrwkv

Obviously that's not useful as anything more than an example of the basics, but it does work and generates tokens not all that much slower than the simple Torch version.

@philpax
Collaborator Author

philpax commented Mar 27, 2023

Oh wow, nice one! That's an excellent start - do you want to try porting it to ggml-rs? (#81 exposes it as a separate library, so you should be able to use a Git dependency)

@philpax
Collaborator Author

philpax commented Mar 27, 2023

(By the way, are you in the Discord? It'd be great to chat synchronously if need be)

@KerfuffleV2
Contributor

Thanks, I didn't really do much more than port the Python example here though: https://johanwind.github.io/2023/03/23/rwkv_details.html

do you want to try porting it to ggml-rs?

I'd have to say probably no. I don't think I'm qualified to write a version that's actually worth using in real software. So if you or anyone else wants to use my code as an example or starting point, you're certainly very welcome.

I may mess around with trying to get it to work on backends other than ndarray but I still don't think it'll be suitable for production. Also, sad to say, I have a very short attention span. Just generally speaking, you're probably better off not depending on me for anything long term.

@philpax
Collaborator Author

philpax commented Mar 27, 2023

I don't think I'm qualified to write a version that's actually worth using in real software.

Eh, none of us are really LLM experts - it's more a matter of engineering and refinement. I think you're just as qualified as the rest of us. Happy to help any efforts get across the line.

Also, sad to say, I have a very short attention span. Just generally speaking, you're probably better off not depending on me for anything long term.

Don't worry, I know the feeling very well. No worries at all; a driveby contribution is fine as long as it's maintainable, and from what I've seen of your implementation, you're doing fine in that regard.

I'd encourage you to give it a try, but no pressure at all - even if you don't do it, your port will still be a great resource for those coming after (especially with those sweet static types!)

@KerfuffleV2
Contributor

KerfuffleV2 commented Mar 28, 2023

@philpax

I think you're just as qualified as the rest of us.

I don't know about that! I also really don't know any math stuff either. So I think I'd probably focus more on trying to make it a good example for understanding how the model works in its basic form rather than something that could be used directly.

I'd encourage you to give it a try, but no pressure at all

I did look, but one thing I noticed is it doesn't seem like GGML supports the operations that are required (or I don't know how to work around/refactor based on it). For example, how to do stuff like subtract tensors from a value, or map something like exponent to the elements? It doesn't seem obvious how to do that with the API it provides.

Could be it provides a higher level way to accomplish the same stuff, but unfortunately I don't have a high level understanding of how it works so that's kind of beyond my ability.

Anyway, maybe if you or someone showed me how to do that, porting it to GGML would be more practical. Here's an example of the kind of thing I mean: https://github.com/KerfuffleV2/smolrsrwkv/blob/c21cb8008b51aa10fb2c0eaa2a81714e8c27f76f/src/model.rs#LL125C70-L125C70 (that's elementwise map).

even if you don't do it, your port will still be a great resource for those coming after (especially with those sweet static types!)

Haha, yeah, I honestly found the Python example very hard to follow since there were no types and it was using weird NumPy shortcuts like sometensors[a > b] = 0 — indexing with a boolean expression, what? Figuring out silly stuff like that actually took as much time as writing the actual logic.

I cleaned up/reorganized the Rust version a bit, so hopefully it's easier to understand now. The state part was particularly bad in the initial version.

@philpax
Collaborator Author

philpax commented Mar 28, 2023

I also really don't know any math stuff either.

You know your way around a type system, that's good enough for me :P

I did look, but one thing I noticed is it doesn't seem like GGML supports the operations that are required (or I don't know how to work around/refactor based on it). For example, how to do stuff like subtract tensors from a value, or map something like exponent to the elements? It doesn't seem obvious how to do that with the API it provides.

Oh... yeah. Just had some free time to look at this, and yeah, ggml doesn't support some critical operators (it doesn't seem to support any form of exponentiation...).

For subtracting tensors with a value, you'd use op_sub, potentially with new_f32 (if I'm interpreting what you're asking correctly). That being said, there are definitely non-zero challenges here - I was hoping it was just a case of operations we'd left out of the bindings, but it seems like there are actual holes here.

ggml's API is based around building up a computation graph and then executing all the computation at once, which means that if you need an operation it doesn't support, you'll need to add it yourself.

weird NumPy shortcuts like sometensors[a > b] = 0 — indexing with a boolean expression, what?

Unfortunately, this is something I recognize from my MATLAB days - it's convenient, but it is absolutely bewildering the first time you see it 😅

I cleaned up/reorganized the Rust version a bit, so hopefully it's easier to understand now. The state part was particularly bad in the initial version.

Yeah, nice, appreciate it! Given that you can't actually implement a ggml version of this, I don't have any immediate suggestions for you, but you can explore the other Rust ML libraries if you're curious.

Still, tremendously cool to have a RWKV implementation in easy-to-understand Rust - you should be proud!

@KerfuffleV2
Contributor

After profiling it, 80-90% or more of the time is just spent doing the matrix multiplication. The rest isn't really significant. How crazy would it be to just use ggml for only that one operation? Even having to copy the data every time, it would probably still be worth it.

It may be a lot faster, but that still wouldn't really make that code suitable for general use, since it does everything in f32 and there isn't any easy way around that with the current approach.

@philpax
Collaborator Author

philpax commented Mar 29, 2023

I'm not convinced that ggml would be faster for the f32 case; I think most of its benefits come through with quantized values, which would be hard to test with RWKV.

@KerfuffleV2
Contributor

Well, the big deal is currently it only uses a single thread. Supposedly ndarray supports threading but I've failed to get it to actually use more than one.

I also messed around with Rayon just running the 3 and 2 groups of .dot() operations in parallel, but it didn't really make a difference because there's one place it's forced to a single thread and one place where it can only run two in parallel. I only got like a 20-30% speed increase at best, when I have 6 cores and 12 threads available.

So just running those matmul ops in parallel would make a huge, huge difference I think.

@philpax
Collaborator Author

philpax commented Mar 29, 2023

Hmm, yeah, makes sense. Do the matmuls happen at the end or do they need to be copied back and forth?

@KerfuffleV2
Contributor

KerfuffleV2 commented Mar 29, 2023

Probably have to be copied back and forth. You can look here: https://github.com/KerfuffleV2/smolrsrwkv/blob/main/src/model.rs

  1. channel_mixing (3) — k and r are independent, but vk depends on k.
  2. time_mixing (3) — All independent.
  3. evaluate_layer (1) — Only happens once per step rather than per layer.

Each group is basically all together, but they need data from other calculations and further calculations depend on those results as well.
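For reference, the channel-mixing step from the blog post that code follows looks roughly like this in ndarray terms (a sketch, not the exact smolrsrwkv code), which is where the "k and r are independent, vk depends on k" structure comes from:

use ndarray::{Array1, Array2};

fn sigmoid(x: Array1<f32>) -> Array1<f32> {
    x.mapv(|v| 1.0 / (1.0 + (-v).exp()))
}

/// Channel mixing as described in https://johanwind.github.io/2023/03/23/rwkv_details.html.
/// The `k` and `r` matmuls only depend on the inputs, so they could run in
/// parallel; the `vk` matmul has to wait for `k`.
fn channel_mixing(
    x: &Array1<f32>, last_x: &Array1<f32>,
    mix_k: &Array1<f32>, mix_r: &Array1<f32>,
    wk: &Array2<f32>, wr: &Array2<f32>, wv: &Array2<f32>,
) -> (Array1<f32>, Array1<f32>) {
    let xk = x * mix_k + last_x * &mix_k.mapv(|v| 1.0 - v);
    let xr = x * mix_r + last_x * &mix_r.mapv(|v| 1.0 - v);
    let k = wk.dot(&xk);
    let r = wr.dot(&xr);
    let vk = wv.dot(&k.mapv(|v| v.max(0.0).powi(2))); // squared ReLU
    (sigmoid(r) * vk, x.clone()) // output, plus the new `last_x`
}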

(By the way, I'm getting reasonably happy with how readable it is now that it's all split into logical modules. The main function is looking pretty clean. If you have time to skim it, what do you think of the current state as an example for other projects?)

@philpax
Collaborator Author

philpax commented Mar 29, 2023

Maybe it's possible to construct ggml Tensors directly from the ndarray representation? Not sure, though - the reason ggml can parallelize is because the computational graph lets it spread work across threads without having to wait for the main thread to dispatch more work. It might be possible to set up contexts/graphs to execute the matmuls, but I'm not sure if the overhead makes it worth it.

Setzer's OK with adding more operations to ggml, so we'd ideally do it all there, but it'll require some work to wire up (cf #45).


Your definition of the model looks really clean! Nice work - it's definitely much easier to understand when it's split up like that. I wonder if we can split up llama.rs code like that - it might be a little difficult because of the aforementioned graph, but I don't see it being too troublesome.

@KerfuffleV2
Contributor

@philpax

I found another way: https://github.com/KerfuffleV2/smolrsrwkv/blob/ab3df0c706ff786a943a3896d7d96b597b45fc98/src/util.rs#L121

It actually runs at acceptable speed now (relatively speaking), even on the 3B model.

Pretty simple now, but getting here was quite the struggle. ndarray claims to use the matrixmultiply crate, which supports threading and even has a feature flag to turn it on, but it does absolutely nothing. That's because ndarray will just never even call matrixmultiply for matrix-vector multiplication.

The big issue now is 32bit models just use an insanely impractical amount of memory. I think I may have taken this as far as it can go because stuff like quantization or using non-float types looks to be a huge undertaking.

@philpax
Collaborator Author

philpax commented Mar 30, 2023

Fantastic! Yeah, it doesn't look like ndarray supports f16 or lower types - I'm not sure what the best plan of attack is after that. As far as I can tell, none of the other libraries support <f32 inference.

Once #85 lands, we will have opened ggml up to extension by us, so we can investigate adding the functionality needed to implement RWKV. I don't have any other suggestions until then 😅

@philpax
Collaborator Author

philpax commented Mar 30, 2023

Looks like someone's going for it: https://github.com/saharNooby/rwkv.cpp

@KerfuffleV2
Contributor

Oh, nice. I look forward to ripping off their ide... I mean collaborating in the spirit of open source.


This is probably poor etiquette but there's no ability to create issues in that repo:

@saharNooby Just in case it's useful, tagging you so you're aware of the efforts/discussion here. Also (in even more poor taste) I'll plug my own little Rust implementation: https://github.com/KerfuffleV2/smolrsrwkv

It's possible a C/C++ programmer could look at that and get something out of it (good chance you already know more than I do though).

@saharNooby

Hi all! I've completely missed that Issues were not enabled for my repo, but I've enabled them now, thanks for pointing it out!

Looks like our work is closely related by extending ggml, but diverges at actual implementation of the model -- you do it in Rust, I do it in C++. I'll read the thread and code of both repositories and later write a more proper response.

For the context, here's what I know so far:

  • ggml is missing element-wise max(x, y), exp(x), which are required for RWKV
  • it is also missing sigmoid(x), but it has silu(x), so I think I can approximate it with ggml_silu(x) / x
  • layer_norm(x, w, b) works great as ggml_norm(x) * w + b

My next goal is to implement max and exp. I see that this PR would be a great showcase for me of how to add a new operator to ggml!
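On the sigmoid point above: silu(x) = x * sigmoid(x), so silu(x) / x is exactly sigmoid(x) wherever x != 0 (at x = 0 it is 0/0 and needs special-casing). A quick Rust check of the identity:

fn sigmoid(x: f32) -> f32 {
    1.0 / (1.0 + (-x).exp())
}

fn silu(x: f32) -> f32 {
    x * sigmoid(x)
}

fn main() {
    for &x in &[-5.0f32, -0.5, 0.1, 3.0] {
        // silu(x) / x recovers sigmoid(x) for any non-zero x.
        assert!((silu(x) / x - sigmoid(x)).abs() < 1e-6);
    }
}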

@KerfuffleV2
Contributor

@saharNooby

Looks like our work is closely related by extending ggml, but diverges at actual implementation of the model -- you do it in Rust, I do it in C++.

What I linked is a separate project (the only relation here is that llama-rs wants to implement RWKV, and I've also contributed a little bit of code here). My project just uses a general math library to implement a simple version of RWKV by doing the calculations directly on matrices/vectors.

So basically it's just useful for understanding how RWKV fits together. From the other stuff you said just now, it seems like you're already past the point of something like that being helpful.

@saharNooby

saharNooby commented Mar 31, 2023

@KerfuffleV2

What I linked is a separate project

I see. I meant your work here in repo llama-rs :)

Anyway, you may be interested in the newly added exp, max, 1_minus_x and sigmoid operators for ggml: commit. For now, only forward pass and FP32. I've also added a simple test suite that validates results against PyTorch's results.

With new operators, I managed to get non-NaN logits, yay! Will check their correctness against RWKV reference implementation soon.

@KerfuffleV2
Contributor

Interesting. Is there a reason to implement those elementwise operations all separately instead of adding a generic elementwise map operation?

The matrix multiplications matter so much with this, it's crazy. I'm thinking about just setting up a GGML graph for those operations only. The rest can be done in a naive/simple way and still looks like it achieves 95% of the maximum performance.

@saharNooby

Is there a reason to implement those elementwise operations all separately instead of adding a generic elementwise map operation?

I guess it was simpler for me to just add new operations, copy-pasting the code of existing operations, than to invent a new (for ggml) way of doing things... But I agree that a generic map operation would work too, if someone implemented it. Not sure how to do it in C/C++.

@KerfuffleV2
Contributor

I can't really help you with the C++ part. Come over to the Rust side!

In seriousness though, you may well end up doing less work overall if you take that approach because it's just one place to worry about changes. You can just have the operation take a function pointer and then apply that function pointer to the element.

Basically the same as you're doing in ggml_compute_forward_exp_f32, except you'd call the user function instead of ggml_vec_element_wise_exp_f32. And you'd want one for applying a binary operation between two tensors like ggml_compute_forward_max_f32 — the idea would be the same though.

inline static void ggml_vec_mapv_f32(
    const int n, float * y, const float * x,
    float (*fun)(float)) {
    // Apply `fun` to every element of the row.
    for (int i = 0; i < n; ++i) y[i] = fun(x[i]);
}

static void ggml_compute_forward_mapv_f32(
        const struct ggml_compute_params * params,
        const struct ggml_tensor * src0,
        struct ggml_tensor * dst) {
    assert(params->ith == 0);
    assert(ggml_are_same_shape(src0, dst));

    if (params->type == GGML_TASK_INIT || params->type == GGML_TASK_FINALIZE) {
        return;
    }

    const int n  = ggml_nrows(src0);
    const int nc = src0->ne[0];

    assert(dst->nb[0]  == sizeof(float));
    assert(src0->nb[0] == sizeof(float));

    // However the function pointer ends up being stored (see the note below).
    float (*fun)(float) = GET_FUNCTION_DEFINITION_SOMEHOW();

    for (int i = 0; i < n; i++) {
        ggml_vec_mapv_f32(nc,
                (float *) ((char *) dst->data  + i*( dst->nb[1])),
                (float *) ((char *) src0->data + i*(src0->nb[1])), fun);
    }
}

It looks like storing metadata like a function pointer could be pretty annoying. From looking at other functions, it seems like the existing approach is to use the opt field and create a 1d tensor with one single value. However, it seems to only support i32 as an integer type. The horrible hacky way to use that for a function pointer would be to just use two of those 32-bit integers to store the top/bottom halves of the function pointer and then reassemble it when you need it. It's annoying to have to do, but shouldn't really make a difference for performance.

Note: Many years ago I was a C programmer but I never got into C++. Consider the above pseudocode.
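In Rust terms, the pointer-splitting hack described above would look something like this (illustration only; it relies on function pointers being the same size as usize):

// Split a function pointer's address into two u32 halves (e.g. to stash in
// i32-only metadata) and reassemble it later. Illustration of the hack only.
fn pack(f: fn(f32) -> f32) -> (u32, u32) {
    let addr = f as usize as u64;
    ((addr >> 32) as u32, addr as u32)
}

unsafe fn unpack(hi: u32, lo: u32) -> fn(f32) -> f32 {
    let addr = (((hi as u64) << 32) | lo as u64) as usize;
    std::mem::transmute::<usize, fn(f32) -> f32>(addr)
}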

@KerfuffleV2
Contributor

I might be getting annoying writing so many comments here, but:

I've been working on my Rust RWKV implementation and got 8bit quantization working. I also managed to split it into a library so it would be pretty easy to interface with: https://github.com/KerfuffleV2/smolrsrwkv/blob/main/smolrwkv-cli/src/main.rs

Interestingly, it seems about as fast as the official Python version when running on CPU; however, it's basically still too slow to really be practical. The 3B model runs about as fast as a 13B model on llama-rs or llama.cpp. According to profiling, as expected it's just the matrix multiplication that's taking all the time. It's possible it could reach acceptable levels of performance if it could interface with a higher-performance matrix multiplication library.

I don't know if this is something that llama-rs would want to depend on even if performance wasn't a problem. It would require a completely separate non-GGML codepath to use.

@saharNooby

@KerfuffleV2 Do I understand the quantization algo correctly, that for each matrix row, you determine the min and max value, and then represent each element as (uint8) ((e - min) / (max - min) * 255)?

@KerfuffleV2
Contributor

KerfuffleV2 commented Apr 2, 2023

@saharNooby

Uhhh... I basically cargo-culted it from the official version, so I don't know that I can give you a good answer here. See:

  1. https://github.com/BlinkDL/ChatRWKV/blob/0d0abf181356c6f27501274cad18bdf28c83a45b/rwkv_pip_package/src/rwkv/model.py#L237
  2. https://github.com/BlinkDL/ChatRWKV/blob/0d0abf181356c6f27501274cad18bdf28c83a45b/rwkv_pip_package/src/rwkv/model.py#L335

The second one is the CPU-based matrix multiplication function for when 8bit quantization is in effect. So those extra values are required at the point you do the MM.

Also worth pointing out is my implementation is based on the example code here: https://johanwind.github.io/2023/03/23/rwkv_details.html

You probably noticed, but only some of the tensors are (or can be) quantized. The smaller and 1D ones are left as 32bit.

I think there are some differences between that and the official version (although the 8bit quantization stuff still seemed to fit in — with some changes). For one thing, the official version transposes some of the tensors while mine doesn't. It also has an extra state item.

quick edit: The Rust 8bit MM function is here, by the way: https://github.com/KerfuffleV2/smolrsrwkv/blob/076d14882be2ca471796c555d3c967d8e4d2585d/smolrwkv/src/util.rs#L159
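For reference, the simple per-row scheme saharNooby described (with the x255 scale made explicit) looks something like this; the actual ChatRWKV code does something more elaborate, so treat this as a sketch of the idea rather than a port:

/// A row of a weight matrix quantized to u8 with a per-row affine mapping.
struct QuantRow {
    min: f32,
    scale: f32, // (max - min) / 255
    q: Vec<u8>,
}

fn quantize_row(row: &[f32]) -> QuantRow {
    let min = row.iter().copied().fold(f32::INFINITY, f32::min);
    let max = row.iter().copied().fold(f32::NEG_INFINITY, f32::max);
    let scale = (max - min) / 255.0;
    let q = row
        .iter()
        .map(|&e| if scale > 0.0 { ((e - min) / scale).round() as u8 } else { 0 })
        .collect();
    QuantRow { min, scale, q }
}

/// Dot product of a quantized row with an f32 vector, dequantizing on the fly;
/// a matrix-vector multiply is just this applied to every row.
fn dot_quant(row: &QuantRow, x: &[f32]) -> f32 {
    row.q
        .iter()
        .zip(x)
        .map(|(&q, &xv)| (q as f32 * row.scale + row.min) * xv)
        .sum()
}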

@KerfuffleV2
Contributor

I've been messing around trying to allow GGML to map arbitrary operations: https://github.com/KerfuffleV2/llama-rs/blob/5fd882035e95501d4127e30c84a838afbffcc95e/ggml/src/lib.rs#L207

This what it looks like in use: https://github.com/KerfuffleV2/llama-rs/blob/5fd882035e95501d4127e30c84a838afbffcc95e/llama-rs/src/lib.rs#L1310

The first one is just replacing the ggml_add operation, the second one is a nop (it does have to copy the data to the destination though).

Those two ops should enable supporting other models like RWKV where GGML doesn't currently have the required operations. I am going to try to see if I can get this (or something similar) added to GGML, but in the meantime (or if that fails) we could potentially use a patched version of GGML that enables this stuff.

The obvious downside is that it would make merging GGML changes more of a pain. (Also, I'm not really a C developer anymore, so while it appears to work there may be other issues with the current implementation.)

@philpax
Collaborator Author

philpax commented Apr 10, 2023

Interesting! Yes, we're open to having our own fork of GGML - we'll just have to manage the patches ourselves.

The primary reason that I think they would shy away from it is because a native implementation would almost always be faster (especially with vectorised operations); that being said, it may be sufficient for operations that aren't on the hot path.

I'm not sure if it'd be better to go with @saharNooby's custom implemented operations or to use your solution (or both). How much of RWKV's unsupported operations are on the hot path?

@KerfuffleV2
Contributor

KerfuffleV2 commented Apr 10, 2023

that being said, it may be sufficient for operations that aren't on the hot path.

Operations on the hot path:

  • Matrix multiplication.

Operations not on the hot path:

  • Everything else.

Okay, that's a little bit of an exaggeration but seriously, most of the stuff (especially for RWKV as far as I can see) is basically insignificant next to the matrix multiplication.

You're 100% right about the map approach likely not being as performant (although note that the map functions work on rows, not individual items, so it could potentially be optimized to use SIMD, etc.). I'd say it's mostly something to enable flexibility rather than the final optimized solution.

I'd rather be able to do something slightly suboptimally than not at all in most cases.

I'm not sure if it'd be better to go with @saharNooby's custom implemented operations

I'm obviously going to be biased here, but I really prefer the flexible approach rather than hard-coding and having to maintain operations that may only help one particular model. We'll have to see if there's a big performance difference between the two approaches.

My intuition tells me that it's probably not going to be that significant but we'll have to see.

How much of RWKV's unsupported operations are on the hot path?

I'm not really sure how to answer that. What constitutes "the hot path" when setting up a GGML model?

You can see where they set up the RWKV tensors here: https://github.com/saharNooby/rwkv.cpp/blob/e84c446d9533dabef2d8d60735d5924db63362ff/rwkv.cpp#L315

exp, 1_minus_x, sigmoid, max are ones I know definitely weren't already in GGML. I'm not sure if there are others.


I'm planning to rip off saharNooby's approach with my map-based operations instead and see how it works out in my separate RWKV project.

@KerfuffleV2
Contributor

KerfuffleV2 commented Apr 10, 2023

This is still a fairly long way off, but you can get a better idea of what ops are missing here: https://github.com/KerfuffleV2/smolrsrwkv/blob/e58d2e39de1c9627199fc1f89cebd35f3bc41a61/smolrwkv/src/ggml/graph.rs#L11 (and I only had to rip off my own pre-existing code for this, since it's basically the same except with the operations spelled out).

Setting up the model looks relatively clean using impls on the component parts. The same approach could probably be used for llama based models as well.

@KerfuffleV2
Contributor

KerfuffleV2 commented Apr 12, 2023

Hopefully these posts aren't annoying (let me know if so - maybe this should be in a discussion instead). By the way, this is happening in my separate smolrwkv project but my plan is to use that as a testbed and then try to use that to help with adding support for llama-rs. So that's why it's (hopefully) relevant here.


Finally got the first version of inference with GGML working: KerfuffleV2/smolrsrwkv@995841c

It is currently only set up to use full 32bit floats.

One interesting thing in relation to llama-rs I noticed is that it builds up the whole graph again for every evaluation step. It looks like llama.cpp does that too, but I really don't know why. It doesn't seem necessary.

Also maybe relevant for @saharNooby is it seems like they're copying all the state in and out of the model at every step. As far as I can tell though, that's not actually necessary. You can just do this: https://github.com/KerfuffleV2/smolrsrwkv/blob/995841c57d4a92976af53fe69db934461f06a66a/smolrwkv/src/ggml/graph.rs#L190

Copy the state from the tensor with the new value into the tensor that was set up for state. Then it just naturally loops into itself every evaluation and it's actually not necessary for the rest of the code to look at that at all. The only thing I need to copy out is the probabilities at the end of the computation: https://github.com/KerfuffleV2/smolrsrwkv/blob/995841c57d4a92976af53fe69db934461f06a66a/smolrwkv/src/ggml/context.rs#L79


Next step is to get quantized models working + various cleanups.


edit: Doing some profiling, I could barely even find the map functions. The total time spent in both binary and unary map functions appears to be around 0.1%. The time spent in matrix multiplication? Basically everything else. So that part doesn't seem like it's going to matter at all.

Oddly enough, the GGML version seems about the same speed as the ndarray based version which is disappointing. Hopefully there will be more of a difference with the quantized versions.

@philpax
Collaborator Author

philpax commented Apr 12, 2023

Hopefully these posts aren't annoying (let me know if so - maybe this should be in a discussion instead). By the way, this is happening in my separate smolrwkv project but my plan is to use that as a testbed and then try to use that to help with adding support for llama-rs. So that's why it's (hopefully) relevant here.

No worries! These have been very interesting to keep up with, I just haven't had much to add.


Awesome to see the work you've done! Yeah, I think I've heard that traditional backends are on par with GGML for 32-bit. I think it pulls far ahead with quantized values.

@KerfuffleV2
Contributor

KerfuffleV2 commented Apr 13, 2023

Yeah, I think I've heard that traditional backends are on par with GGML for 32-bit. I think it pulls far ahead with quantized values.

You're right about that. I just got quantization working ( KerfuffleV2/smolrsrwkv@4208bc7 ). I got about 1-2 TPS on the 3B model with NDArray 8bit quantization. I can actually load the 7B model with GGML 4bit quantization and it gets 2+ TPS.

Right now there's no ability to save a prequantized file so it's a little slow to load (takes about 90sec for the 7B model), however that's only using one thread for quantization.

   0.055305828s  INFO smolrwkv_cli: Loading model from: /path/to/RWKV-4-Pile-7B-20230109-ctx4096.pth
   0.138817887s  INFO smolrwkv_cli: Backend type: GGML Q4_1
   0.138860247s  INFO load_model: smolrwkv::ggml::loader: Discovering model structure.
   0.186837307s  INFO load_model: smolrwkv::ggml::loader: Precomputing embedding...
   3.673750188s  INFO load_model: smolrwkv::ggml::loader: Loading 32 layer(s):
   3.832211013s  INFO load_model: smolrwkv::ggml::loader: [32/1]: Quantizing att.key.weight([4096, 4096])
   3.911192194s  INFO load_model: smolrwkv::ggml::loader: --> QUANT: len 67108864 -> 12582912 (54525952)

[...]

  97.794115185s  INFO load_model: smolrwkv::ggml::loader: [32/32]: Quantizing ffn.receptance.weight([4096, 4096])
  97.870882902s  INFO load_model: smolrwkv::ggml::loader: --> QUANT: len 67108864 -> 12582912 (54525952)

  97.884595100s  INFO load_model: smolrwkv::ggml::loader: Loading non-layer tensors.
 100.185164070s  INFO smolrwkv_cli: Loaded: layers=32, embed=4096, vocab=50277

[...]

The research has only been made possible thanks to the preparation and diligence of Liu Yaping, a professor of linguistics and head of the Central Plains Linguistic Research Team. According to Liu, the
 165.040212991s  INFO smolrwkv_cli: GGML memory used: 6925436384 (6.44981524348259GiB)
 [end of text]

 165.426812689s  INFO smolrwkv_cli: Completion. Token(s) generated: 101, elapsed time: 44.364179313s, TPS: 2.276620683437021

One weird thing is it actually uses around 17GB RAM even though GGML says it only needed 6.4GB. (This actually occurs just during loading, so it's nothing to do with actually evaluating the model.)

@saharNooby

Also maybe relevant for @saharNooby is it seems like they're copying all the state in and out

Yeah, agree. I decided to not do anything with it because it does not look like a performance bottleneck.

I can actually load the 7B model with GGML 4bit quantization and it gets 2+ TPS.

Please do quality checks; you may be surprised how badly RWKV may perform with quantization. Here is a rough script for calculating perplexity, which may be useful.

If you confirm that RWKV breaks after simple 4-bit quantization implemented in ggml, you may be interested in reading how I tried to solve the issue (TL;DR: added a new quantization format, same disk space as Q4_1, same performance as FP32, which is sad, but works)

I also suggest comparing FP32 with FP16 -- for the same model when using ggml, I get 2x the speed on FP16 compared to FP32. Which is not a trivial improvement, considering that PyTorch can't do FP16 on CPU!

@philpax philpax mentioned this issue Apr 13, 2023
@philpax
Collaborator Author

philpax commented Apr 13, 2023

As an aside, are there any GGML format RWKV models floating around? Is there a standard for them?

@KerfuffleV2
Contributor

KerfuffleV2 commented Apr 14, 2023

@philpax I haven't seen any, but saharNooby's project uses the GGMF format with version 100 and includes a converter, so you could make some if you want. I'd probably use a slightly different approach if I implemented it, because it's possible to also precompute the embeddings, which I don't think that project does.

Quick edit: (By the way, it looks like llama.cpp is open to merging my map operations stuff, so we won't have to maintain a separate version. I just need to clean it up and get it ready for actual merging, hopefully tomorrow.)

@saharNooby

it's possible to also precompute the embeddings

@KerfuffleV2 You mean merge ln0 into the embedding matrix? I've tried it (with the PyTorch impl) and did not notice any performance improvement. As you said, 99% is matmul; I guess it is not worth the additional complexity from a new format and code.

As an aside, are there any GGML format RWKV models floating around? Is there a standard for them?

@philpax No "official" standard, but indeed, I just used llama.cpp's ggmf with version number 100. Here is full format spec in pseudocode:

RWKVModelFile {
  // All ints and floats are in machine byte order.
  // Magic is "ggml" string bytes.
  int32 magic = 0x67676d66;
  int32 version = 100;
  int32 n_vocab;
  int32 n_embed;
  int32 n_layer;
  // 0 if float32, 1 if float16, 2 if Q4_0, 3 if Q4_1, 4 if Q4_1_O.
  int32 data_type;
  // Read until EOF.
  Parameter[] parameters;
}

Parameter {
  int32 dim_count;
  int32 key_length;
  // 0 if float32, 1 if float16, 2 if Q4_0, 3 if Q4_1, 4 if Q4_1_O.
  int32 data_type;
  // Compared to PyTorch's tensor.shape, dimension order is reversed here!
  int32[dim_count] shape;
  // Keys are like "emb.weight", "block.0.ln1.weight".
  uint8[key_length] key_utf8;
  // float32: 4 * element_count bytes.
  // float16: 2 * element_count bytes.
  // Q4_0: element_count / 32 * 20 bytes.
  // Q4_1: element_count / 32 * 24 bytes.
  // Q4_1_O: element_count / 32 * 24 bytes.
  byte[] data;
}

Here is PyTorch -> ggmf converter code (note that doc comment there is a little out of date, I'll update it soon). And looks like someone already uploaded converted RWKV models in this format to HuggingFace...
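For anyone following along, reading that header back is straightforward; a minimal Rust sketch based on the spec above (assuming a little-endian writer, since the spec says machine byte order), not rwkv.cpp's own code:

use std::io::{self, Read};

/// Header fields from the spec above.
#[derive(Debug)]
struct RwkvHeader {
    n_vocab: i32,
    n_embed: i32,
    n_layer: i32,
    data_type: i32, // 0 = f32, 1 = f16, 2 = Q4_0, 3 = Q4_1, 4 = Q4_1_O
}

fn read_i32(r: &mut impl Read) -> io::Result<i32> {
    let mut buf = [0u8; 4];
    r.read_exact(&mut buf)?;
    Ok(i32::from_le_bytes(buf))
}

fn read_header(r: &mut impl Read) -> io::Result<RwkvHeader> {
    let magic = read_i32(r)?;
    let version = read_i32(r)?;
    assert_eq!(magic, 0x6767_6d66, "not a ggmf file");
    assert_eq!(version, 100, "unexpected rwkv.cpp format version");
    Ok(RwkvHeader {
        n_vocab: read_i32(r)?,
        n_embed: read_i32(r)?,
        n_layer: read_i32(r)?,
        data_type: read_i32(r)?,
    })
}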

@philpax
Collaborator Author

philpax commented Apr 14, 2023

Interesting - you don't include the vocabulary in there?

@KerfuffleV2
Contributor

KerfuffleV2 commented Apr 14, 2023

@saharNooby

Yeah, agree. I decided to not do anything with it because it does not look like a performance bottleneck.

Fair enough, I doubt it makes a difference from the performance perspective. I wasn't sure if you knew you didn't have to do that or not; I've been figuring out what I need to do as I go with this stuff. (Also, not unnecessarily copying stuff seems nicer to me, but that's just my own aesthetic preference.)

Please do quality checks; you may be surprised how badly RWKV may perform with quantization.

I do plan to do more testing like that. Right now, I've only been trying to get the basic functionality going. I did notice that q4_0 gave terrible results: the model got into a loop repeating the same short phrase over and over almost immediately. With q4_1 the output at least looks pretty reasonable, not noticeably different to me compared to the float32 version.

By the way, I definitely have been looking at your repo and approach to doing stuff. It's been very helpful! I didn't end up actually copying the part where you set up the model but there was plenty of other useful information. Especially the part about the shapes being reversed in GGML, definitely could have wasted a lot of time there.

(TL;DR: added a new quantization format, same disk space as Q4_1, same performance as FP32, which is sad, but works)

That might be a bit too sad for me, I'm really hoping to get something that can run at around the same speed as the equivalent llama model. Right now based on testing the 14B RWKV model vs 13B Vicuna, RWKV is about twice as slow (1.6TPS vs 3.3TPS). That's using q4_1.

I'm curious about how your q4_1_O relates to the ongoing pulls for improving quantization in llama.cpp:

  1. Q4_0 scale selection using RMSE ggerganov/llama.cpp#835
  2. Use full range for q4_0 quantization ggerganov/llama.cpp#729
  3. More accurate Q4_0 and Q4_1 quantizations ggerganov/llama.cpp#896

Also, I'd guess the reason why the performance for q4_1_O is low is because there isn't an optimized version like there is for the other formats? Or do you think it's just inherently going to be much slower?

I also suggest comparing FP32 with FP16 -- for the same model

Are you doing anything special to convert from FP32 (or BF16 which is basically the same) into 16 bit? I looked at your code, and it seemed like the answer is no. Did you actually check to make sure all the values in the model could actually be represented as FP16 without actually doing special stuff like quantizing to 16 bit?

We're talking about normal machine 16 bit floats here, right? Not bfloat16 (as far as I could see, GGML doesn't have any bfloat16 stuff in it, but I might be mistaken).

(I'm also not sure how that would work with my map ops stuff which currently works with FP32.)

You mean merge ln0 into the embedding matrix?

Yeah, it probably doesn't make a performance difference. It also saves a small amount of memory/reduces model file sizes too, since ln0 is never needed again. I'll admit the main reason I did that was just because I like the idea of precomputing anything that can be precomputed.

After thinking about it a bit more, there might be another reason: If you precompute then you can do the calculation at the point the model is still FP32 with higher precision. If you do it later, then you have to perform the calculation on the quantized values/converted values.

I guess those tensors aren't usually quantized to 4 bit but it might make a difference if you're converting to 16 bit.
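A sketch of what that precomputation amounts to: apply ln0 to every row of the embedding matrix once at load time, while everything is still f32 (the epsilon here is assumed to be PyTorch's LayerNorm default):

use ndarray::{Array1, Array2, Axis};

/// Fold the first layer norm (ln0) into the embedding matrix at load time.
fn precompute_embedding(
    emb: &mut Array2<f32>,    // [n_vocab, n_embed]
    ln0_weight: &Array1<f32>, // [n_embed]
    ln0_bias: &Array1<f32>,   // [n_embed]
) {
    for mut row in emb.axis_iter_mut(Axis(0)) {
        let mean = row.mean().unwrap();
        let var = row.mapv(|v| (v - mean).powi(2)).mean().unwrap();
        let std = (var + 1e-5).sqrt(); // 1e-5: assumed LayerNorm epsilon
        // row = (row - mean) / std * ln0_weight + ln0_bias
        for (i, v) in row.indexed_iter_mut() {
            *v = (*v - mean) / std * ln0_weight[i] + ln0_bias[i];
        }
    }
}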


@philpax

Interesting - you don't include the vocabulary in there?

Not who you asked, but RWKV has its own HuggingFace Tokenizers-style tokenizer definition, with the vocab stored outside of the model files. There would be a way to embed it in every copy of the model files, of course, but that would be a bit of a different approach from the status quo.

@saharNooby

saharNooby commented Apr 14, 2023

These posts keep getting longer...

Interesting - you don't include the vocabulary in there?

@philpax Yes, tokenizer remained on the Python side, since this is the side I'm most familiar with; and more objectively, Python has more tooling for LLM, like sampling, etc. Tokenizer is also not a performance-critical part for inference, so I saw no real reason to re-implement it in C/C++.

That might be a bit too sad for me, I'm really hoping to get something that can run at around the same speed as the equivalent llama model

@KerfuffleV2 First: I've tried my best to write an AVX2 impl of Q4_1_O matmul, but restoration of outlier weights just does not work well with vectorized code. I have some ideas for how to optimize it further, but I'm 99% sure that it would not improve it 2x, and we need some real asm/SIMD expert here to optimize it.

For me, quantization is not about performance, but about the ability to run way larger models than RAM allows. I understand that this is not everyone's viewpoint tho :)

Second: comparing performance of RWKV to Transformers (GPT, LLAMA and its variants, etc.) can be tricky, and depends on the use case.

When generating texts with empty or small prompt, latency per token of RWKV vs Transformer should be comparable (and maybe slower, as you noticed).

But consider generating text with a huge prompt, that exceeds Transformer context length and needs to be cut. Relevant use case is having a long conversation with a chat bot, or collaborative novel writing.

We want to generate n tokens, and Transformer has limitation of ctx_len. So, we cut the context to ctx_len - n tokens, calculate KV cache for the context, and generate n tokens. This KV cache generation takes a huge time on CPU (easily can take tens of minutes), and per token latency will be dominated by this cache calculation.

You may say: okay, we can wait for KV cache calculation once, when running the model on the prompt for the first time. But here is the thing: to generate the next n tokens, you can't just reuse the KV cache. You need to append previously generated tokens to the prompt, cut it to ctx_len - n tokens again, calculate the KV cache again, and wait these tens of minutes again.

I've tried to "shift" the cache, but this fundamentally does not work with Transformers -- values in the cache depend on position of the token, and after shifting the cache values lose their meaning to the model.

RWKV is not free of these tens of minutes of computation, of course. To compute the next token for a huge prompt, we need to have the state for it; that is, we need all tokens of the prompt to have been processed. But after you have the state, you can just continue generating new text -- no additional recomputation needed! It does not matter how many tokens were in the context, or how many tokens you want to generate -- per-token latency is always constant with RWKV.

This is the reason why I invest time in RWKV at all -- it may be a worse architecture/model by quality, but the fact is that on CPU it is actually usable with long prompts, in contrast to Transformers.

Are you doing anything special to convert from FP32

I just do tensor.half() or tensor.float() on PyTorch tensors in the conversion script. Looks like there is nothing special here.

Did you actually check to make sure all the values in the model could actually be represented as FP16 without actually doing special stuff like quantizing to 16 bit?

Hmm, this prompted me to think. I see that the original model files have a size of 2x the parameter count, which suggests FP16, but does not guarantee it. I need to check what the actual storage format is, and, indeed, whether it can break with FP16 conversion for ggml.

Though after researching the range of weights in RWKV, I see that there is not much range -- something like -20..20, which can easily be represented in FP16. But, again, I need to check it to be sure. Edit: I've checked now, RWKV models are distributed in bfloat16 format.

GGML doesn't have any bfloat16 stuff in it, but I might be mistaken

I didn't see anything related to bf16 in ggml either. When I write FP16, I specifically mean the FP16 format from ggml, which is "normal" float16.

I'm also not sure how that would work with my map ops stuff which currently works with FP32

Probably, you don't need to do anything -- activations (hidden state vector) are in FP32 anyway.

Edit: added later

I'm curious about how your q4_1_O relates to the ongoing pulls for improving quantization in llama.cpp:

Looks like these improvement PRs are not much related to Q4_1_O: they try to store weights more efficiently in existing formats, but do not address outliers. As I understand it, LLAMA does not have an issue with outlier weights/activations, so our quantization improvement efforts would not intersect much.

Also, I've tried naively minimizing some error when quantizing, and this did not help much; I figured it is not worth the complexity and increased quantization time.

Also, I'd guess the reason why the performance for q4_1_O is low is because there isn't an optimized version like there is for the other formats? Or do you think it's just inherently going to be much slower?

Because we need to store/load an outlier weight at an arbitrary index in the block, and do the matmul in FP32, I think it is inherently slower. But I see no fundamental reason for it to be slower than FP16 -- it also converts everything to FP32, but is 2x faster.

@philpax philpax added the topic:model-support Support for new models label Apr 20, 2023
@philpax philpax changed the title Implement RWKV Support for RWKV Apr 20, 2023
@iacore
Contributor

iacore commented Apr 22, 2023

Why did I only see this now?

Those implementations are stupidly simple:
My numpy+numba impl (f32 only): https://github.com/iacore/rwkv-np
My full rust impl with dfdx (f32 only): https://github.com/iacore/rwkv-rs

My server using custom fork of rwkv.cpp: https://github.com/iacore/rwkv-flask
My web frontend: https://github.com/iacore/rwkv-web

HF Tokenizer (Rust, Python, etc): https://github.com/huggingface/tokenizers
HF Tokenizer ported to WASM: https://github.com/iacore/rwkv-web/tree/main/src/lib/tokenizers

Performance-wise, numpy+numba is as fast as ggml. dfdx can only use 4 threads for matmul. Matmul takes up 80-90% of the time. Using all the CPU cores is good enough.

@iacore
Contributor

iacore commented Apr 22, 2023

@KerfuffleV2 First: I've tried my best to write an AVX2 impl of Q4_1_O matmul, but restoration of outlier weights just does not work well with vectorized code. I have some ideas for how to optimize it further, but I'm 99% sure that it would not improve it 2x, and we need some real asm/SIMD expert here to optimize it.

You need Zig. Zig has cross-arch SIMD. Optimize cycle count with llvm-mca.

Any Rust crate that depends on matrixmultiply is not fast enough.

q4_2 and q4_3 are coming, see this.

@saharNooby

@iacore

You need Zig. Zig has cross-arch SIMD. Optimize cycle count with llvm-mca.

Interesting, thanks! I'm not sure about adding another language into the mix tho. Ideally, I would like to not even need to fork ggml...

q4_2 and q4_3 are coming, see RWKV/rwkv.cpp#16 (comment).

Already merged; it's fast and works well with Raven 7B.

@iacore
Contributor

iacore commented Apr 23, 2023

Interesting, thanks! I'm not sure about adding another language into the mix tho. Ideally, I would like to not even need to fork ggml...

To me, it's better than writing AVX, NEON and whatever. You can distribute the pre-compiled assembly (by Zig) with the repo.

@danforbes
Contributor

I took a stab at this but couldn't get it across the finish line. If anyone would like to pick up where I've left off, please feel free to reuse the code here: https://github.com/danforbes/llama-rs/blob/dfo/model/rwkv/crates/models/rwkv/src/lib.rs
