
Ported quantize.cpp #84

Merged
merged 22 commits into from
Apr 25, 2023

Conversation

FloppyDisck
Contributor

@FloppyDisck FloppyDisck commented Mar 27, 2023

I went ahead and ported the main quantize.cpp file. My changes involve porting the file while keeping the internal C++ function calls intact. I have plans to port those function calls to remove ggml dependencies in a future PR though.

During the porting process, I faced some challenges since I was not familiar with how to use Context. As a result, I added the half library to handle the f16->f32 conversion. I could remove the dependency if needed, but I'll need some help with working with Context. Something to note is that if there are plans to move away from ggml, then the half library will be necessary.
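For context, this is roughly the kind of f16 -> f32 conversion the half crate is being pulled in for (a minimal sketch; the function name and raw-bits input are assumptions for illustration, not the PR's actual code):

```rust
use half::f16;

/// Convert raw f16 bit patterns into f32 values.
/// Illustrative only; the PR's real conversion goes through its own buffers.
fn f16_bits_to_f32(raw: &[u16]) -> Vec<f32> {
    raw.iter().map(|&bits| f16::from_bits(bits).to_f32()).collect()
}
```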

Additionally, I included some print statements inside the function to mimic the original behavior of quantize.cpp. I can remove those if needed.

Currently, there is no way to access the function since I did not implement a CLI function for it.

I am open to feedback and suggestions on how to improve this pull request.

Resolves #40

@FloppyDisck FloppyDisck changed the title Implemented quantize.cpp Ported quantize.cpp Mar 29, 2023
@philpax
Collaborator

philpax commented Mar 29, 2023

Awesome work! I've left some feedback; it's not the most Rust-y code, but that's fine as it's a port and we can fix that up later on 🙂

Really appreciate you doing this, it's great to get one step closer to being completely standalone 🚀

@FloppyDisck
Contributor Author

Your comments aren't showing up in the PR.

And I fully agree; I'm going to go ahead and fix all the clippy issues, plus see if I can improve some of the logic.

Do you have any recommendations on data reading and writing? I reuse the same buffer multiple times since it's more efficient when working on Rust embedded systems, so I went ahead and did the same here.

@philpax
Collaborator

philpax commented Mar 30, 2023

Weird, I can definitely see the comments here and in the diff. Not sure what's happening there.

For data read/write, not sure - reusing the same buffer seems reasonable if they're semantically similar, but I'd just use a new buffer if they're not. Do you have any examples of something you'd want advice on?
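For what it's worth, the buffer-reuse pattern under discussion looks roughly like this (a hypothetical helper to illustrate the idea, not code from the PR):

```rust
use std::io::{self, Read};

/// Read `n` bytes into a reusable scratch buffer instead of allocating a
/// fresh Vec on every read. Hypothetical helper, for illustration only.
fn read_into_scratch<'a, R: Read>(
    reader: &mut R,
    scratch: &'a mut Vec<u8>,
    n: usize,
) -> io::Result<&'a [u8]> {
    scratch.resize(n, 0);
    reader.read_exact(&mut scratch[..n])?;
    Ok(&scratch[..n])
}
```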

Review threads (outdated, resolved): ggml-raw/ggml/ggml.c, llama-rs/src/ggml.rs, llama-rs/src/quantize.rs
@philpax
Collaborator

philpax commented Apr 6, 2023

I've updated the PR, but the output seems to be incorrect:

thread 'main' panicked at 'Could not load model: TensorWrongSize { tensor_name: "tok_embeddings.weight", path: "../llama-models/v0/ggml-model-q4_0-rust.bin" }', llama-cli/src/cli_args.rs:319:14

Probably an assumption somewhere that I broke. Need to look into it further - any ideas?

I'd also like to support loading unversioned models and GGJT, so this is going to be a bit of a headache in general :(

@FloppyDisck
Contributor Author

Looking at your commits, I found a couple of places where it could've broken. I'll check it out and see if that fixes it.

@FloppyDisck
Contributor Author

@philpax Regarding supporting other types of models: if you can provide the relevant issues, I could look into making those work.

Review thread (outdated, resolved): ggml/src/lib.rs
@philpax philpax added this to the 0.1 milestone Apr 10, 2023
@philpax
Collaborator

philpax commented Apr 13, 2023

Ok, updated to the latest main; haven't tested whether it works. A couple of notes:

  1. It would be really nice to support GGML and GGJT, not just GGMF. Standalone loader #125 should make this much easier, but could be a while off.
  2. I think f32 as the element type for dst of the ggml quantize functions is wrong. The resulting values will not be floats in any meaningful sense. u32 is still wrong, but makes it clearer that it's a different data type.
  3. It'd be a good idea to add these asserts to the ggml quantize functions (assuming that they're correct); a sketch of where they might sit is shown below:
     assert_eq!(src.len(), n as usize);
     assert_eq!(dst.len(), n as usize);
     assert!(hist.len() >= 16);
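A rough sketch of where those asserts could sit, assuming a bindgen-style binding to ggml's ggml_quantize_q4_0 (the extern declaration and wrapper name here are assumptions, not necessarily this PR's actual bindings):

```rust
use std::os::raw::c_void;

extern "C" {
    // Assumed declaration mirroring ggml's C API:
    //   size_t ggml_quantize_q4_0(const float *src, void *dst, int n, int k, int64_t *hist);
    fn ggml_quantize_q4_0(src: *const f32, dst: *mut c_void, n: i32, k: i32, hist: *mut i64) -> usize;
}

/// Hypothetical safe wrapper showing where the suggested asserts would go.
pub fn quantize_q4_0(src: &[f32], dst: &mut [f32], n: i32, k: i32, hist: &mut [i64]) -> usize {
    assert_eq!(src.len(), n as usize);
    assert_eq!(dst.len(), n as usize);
    assert!(hist.len() >= 16);
    unsafe { ggml_quantize_q4_0(src.as_ptr(), dst.as_mut_ptr() as *mut c_void, n, k, hist.as_mut_ptr()) }
}
```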

@KerfuffleV2
Contributor

> u32 is still wrong, but makes it clearer that it's a different data type.

What about u8? When I see that, I think "just a bunch of bytes".

@philpax
Collaborator

philpax commented Apr 13, 2023

> What about u8? When I see that, I think "just a bunch of bytes".

The issue is the size - it should be equivalent to the size of the original array in bytes. I guess we could shove 4*size onto the user; it's not that big of a deal.
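As a quick illustration of that sizing (a hypothetical snippet; 4 here is just size_of::<f32>()):

```rust
fn main() {
    // With a u8 destination buffer, the caller sizes it in bytes,
    // matching the byte size of the original f32 data (4 bytes per element).
    let n: usize = 4096; // assumed element count, for illustration only
    let dst: Vec<u8> = vec![0u8; n * std::mem::size_of::<f32>()]; // i.e. 4 * n
    assert_eq!(dst.len(), 4 * n);
}
```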

@philpax
Collaborator

philpax commented Apr 13, 2023

Merged in main again; the comments/questions from above still apply.

@FloppyDisck
Contributor Author

  1. I agree, but maybe this should be moved into a different issue I can tackle later on? Maybe one based on taking a more modular approach?
  2. In the original implementation, both work and data are f32, although we could have work be a u8 and modify the code accordingly.
  3. I agree; the asserts should work on the current model.

@philpax
Collaborator

philpax commented Apr 13, 2023

  1. Yeah, probably. If Standalone loader #125 gets done sooner rather than later we can give it a shot, but if not I'm happy to merge as-is and then we can patch it up.
  2. Looks like it's a llama_buffer in the original code (https://github.com/ggerganov/llama.cpp/blob/be87b6ed20a5f7528bf491a83e759a9fc6a24fea/llama.cpp#L1596), but that's fine; we could just use a Vec<u8> here.

Cool then, let's see if #125 happens soon; in the meantime we can fix 2 and 3.

@philpax
Collaborator

philpax commented Apr 19, 2023

#125 is going to be merged soon if all goes well, but its ggml-format loader doesn't work in its current stage. Given that, I think we're OK to merge this once that's in. Let's try to get support for the other formats as soon as possible, but I won't let that block merging.

@philpax philpax mentioned this pull request Apr 22, 2023
@philpax
Collaborator

philpax commented Apr 25, 2023

I implemented write support for the loader (now ggml-format), so quantize can now accept any single-file GGML/GGMF/GGJT model and quantize it to GGJT q4_0 or q4_1. It seems to work from my testing, but I don't have that many f16 models lying around.

@philpax philpax merged commit 6e8aa79 into rustformers:main Apr 25, 2023