GGUF file format specification #302

philpax · 2023-06-25T22:32:48Z

Closes #220.

Rendered: https://github.com/philpax/ggml/blob/gguf-spec/docs/gguf.md

Defines a complete specification for the proposed GGUF file format, which should generically describe models to be loaded by any compatible executor.

This is a first draft, so there's still some work that needs to be done - I need to fill in the TODOs and clarify a few things. If you have any suggestions for what should go in the TODOs, please let me know!

Changes from the version in the issue include:

changing of several of the key-value pairs, including splitting them out into per-architecture key-values
decoupling tensor info from tensor data, and aligning both
moving the embedded vocabulary into the metadata, so that it is no longer special-cased

Green-Sky

General question: is this format only for LLMs? What about vision stuff and multiple models in one file? E.g. https://github.com/monatis/clip.cpp does that.

philpax · 2023-06-26T08:42:44Z

General question: is this format only for LLMs? What about vision stuff and multiple models in one file? E.g. https://github.com/monatis/clip.cpp does that.

Nope! LLMs are just the use-case I'm familiar with. We should describe whisper.cpp here and discuss standardising the others too (this is the first I've heard of clip.cpp, that's really cool). Do you have links to other GGML-using projects that aren't LLMs?

monatis · 2023-06-26T09:46:50Z

I'm afraid defining a closed set of metadata vocabulary might be a restrictive design that hinders the speed of innovation in the GGML community. My suggestion would be to define a format for encoding freeform key-value pairs:

One possible way might be

ggml_magic
number_of_pairs
[
    (key_length, key, value_type, value)
    ...
]

value_type can indicate whether the value is an integer (value_type=0) or, if value_type > 0, the length of a string value. Then we can define a function that easily extracts metadata from a given file. This is only a morning idea, but the point is that we need to define the format, not the content.

Almost anything can be reduced to key-value pairs of this type. If needed, we can extend it to a nested structure as well, but I believe that the metadata keys should be open and no model-specific metadata should be defined.

The GGML manifesto states that "The AI models are improving at a very high rate and it is important to stay on top of it." and I think we must not define such keys in order to stay on top of improvements in AI.
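A minimal Python sketch of the layout proposed above, for illustration only. The magic value, the little-endian uint32 widths, and the signed 64-bit integer encoding are all assumptions made for this sketch; none of them come from the proposal or from the spec.

```python
import struct

MAGIC = 0x67676D6C  # hypothetical magic value, assumed for this sketch only


def write_metadata(pairs):
    """Serialize a dict of str -> int | str as (key_length, key, value_type, value) records."""
    out = bytearray(struct.pack("<II", MAGIC, len(pairs)))
    for key, value in pairs.items():
        key_bytes = key.encode("utf-8")
        out += struct.pack("<I", len(key_bytes)) + key_bytes
        if isinstance(value, int):
            # value_type == 0 marks an integer (assumed here to be signed 64-bit).
            out += struct.pack("<Iq", 0, value)
        else:
            value_bytes = value.encode("utf-8")
            # value_type > 0 doubles as the string length, so an empty string
            # cannot be represented in this scheme.
            if not value_bytes:
                raise ValueError("empty strings are not representable")
            out += struct.pack("<I", len(value_bytes)) + value_bytes
    return bytes(out)


def read_metadata(buf):
    """Parse the bytes produced by write_metadata back into a dict."""
    magic, count = struct.unpack_from("<II", buf, 0)
    if magic != MAGIC:
        raise ValueError("bad magic")
    offset = 8
    pairs = {}
    for _ in range(count):
        (key_len,) = struct.unpack_from("<I", buf, offset)
        offset += 4
        key = buf[offset:offset + key_len].decode("utf-8")
        offset += key_len
        (value_type,) = struct.unpack_from("<I", buf, offset)
        offset += 4
        if value_type == 0:
            (value,) = struct.unpack_from("<q", buf, offset)
            offset += 8
        else:
            value = buf[offset:offset + value_type].decode("utf-8")
            offset += value_type
        pairs[key] = value
    return pairs
```

One consequence of overloading value_type as the string length is visible in the sketch: there is no room left for other value types (floats, arrays, booleans) or for empty strings, which is one argument for a separate type tag.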

philpax · 2023-06-26T10:00:03Z

I'm afraid defining a closed set of metadata vocabulary might be a restrictive design that hinders the speed of innovation in the GGML community. My suggestion would be to define a format for encoding freeform key-value pairs:

One possible way might be
ggml_magic
number_of_pairs
[
    (key_length, key, value_type, value)
    ...
]
value_type can indicate whether the value is an integer (value_type=0) or, if value_type > 0, the length of a string value. Then we can define a function that easily extracts metadata from a given file. This is only a morning idea, but the point is that we need to define the format, not the content.

Almost anything can be reduced to key-value pairs of this type. If needed, we can extend it to a nested structure as well, but I believe that the metadata keys should be open and no model-specific metadata should be defined.

The GGML manifesto states that "The AI models are improving at a very high rate and it is important to stay on top of it." and I think we must not define such keys in order to stay on top of improvements in AI.

Yes, that's what's described in the spec. It's not a closed set; the keys that are specified are standardized and guaranteed to always share the same meaning, but users can extend it with their own as required to serve their needs. Ideally, the more popular KVs would end up being standardized as well.

Green-Sky · 2023-06-26T10:39:39Z

Do you have links to other GGML-using projects that aren't LLMs?

check out the README :) https://github.com/ggerganov/ggml#updates

philpax · 2023-06-26T23:21:05Z

I've addressed the review comments 👍

Just asking the people behind each implementation: can you suggest metadata values that should be standardized, if any?

@ggerganov: whisper.cpp
@monatis: clip.cpp
@saharNooby: rwkv.cpp
@PABannier: biogpt.cpp/encodec.cpp
@skeskinen: bert.cpp

Green-Sky · 2023-06-27T10:09:46Z

How does this spec relate to LoRA (ggla)? I don't see it mentioned anywhere. @slaren

philpax · 2023-06-27T12:16:38Z

How does this spec relate to LoRA (ggla)? I don't see it mentioned anywhere. @slaren

Good spot, I actually noticed that this morning and hadn't updated it. What should it look like? I imagine that you want it to

match an existing model exactly, so that it can't be misapplied
be marked as a LoRA

Maybe a subset of the fields of the original LLM, with a general.lora = true field?

slaren · 2023-06-27T14:26:39Z

The LoRA files are very simple currently, it's just a tiny header with a few parameters and a bunch of tensors. I think it should work fine with the way this is designed currently.

The only parameters stored in the header currently are the rank and alpha values of the LoRA. This is not enough to support every type of LoRA, so I wouldn't bother with defining this in a very detailed way for now, we can look into it later.

klosax · 2023-06-27T18:51:58Z

What is the difference between max_seq_len and context_length? Aren't both the maximum usable/recommended context length?

klosax · 2023-06-27T19:39:02Z

I suggest use of special key-values to identify special tokens:

tokenizer.bos_token_id Beginning of sequence marker
tokenizer.eos_token_id End of sequence marker
tokenizer.unk_token_id Unknown token
tokenizer.sep_token_id Separator token
tokenizer.pad_token_id Padding token

jploski · 2023-06-27T20:27:07Z

What is the difference between max_seq_len and context_length? Aren't both the maximum usable/recommended context length?

There is no difference. I suppose it just came into existence because the Falcon implementation was derived from MPT/Replit, which also has this naming.

philpax · 2023-06-27T21:57:56Z

Updated with latest round of feedback.

LoganDark · 2023-06-27T22:16:52Z

Updated with latest round of feedback.

Note that @saharNooby and I are maintainer and contributor (respectively) of a popular RWKV inference library, RWKV.cpp, so the parameters we proposed are indeed the ones needed to properly run inference with the model. You could add them without much trouble.

philpax · 2023-06-27T22:25:43Z

Updated with latest round of feedback.

Note that @saharNooby and I are maintainer and contributor (respectively) of a popular RWKV inference library, RWKV.cpp, so the parameters we proposed are indeed the ones needed to properly run inference with the model. You could add them without much trouble.

Oh, no, I know this; I was just giving you two an opportunity to agree on what the names of those fields should be before I wrote anything up.

LoganDark · 2023-06-28T23:51:09Z

I suggest use of special key-values to identify special tokens:

tokenizer.bos_token_id Beginning of sequence marker
tokenizer.eos_token_id End of sequence marker
tokenizer.unk_token_id Unknown token
tokenizer.sep_token_id Separator token
tokenizer.pad_token_id Padding token

Some models have special tokens for separating two (or more) sides of a chat conversation—OpenAI is one example of a company that trains models like this, in an attempt to disallow the "user" from performing prompt injections, by giving the "system" higher authority. How would this be represented?

monatis · 2023-10-17T06:34:55Z

Big-endian support is proposed in ggerganov/llama.cpp#3552. I think it needs the community's attention because the spec explicitly specifies little-endianness. We need to update it when merging that PR.

Dampfinchen · 2023-10-17T22:30:47Z

So Huggingface just introduced prompt templates embedded into the model files.

https://huggingface.co/docs/transformers/main/chat_templating

This means that when a model creator has set the prompt format in it, inference programs and UIs can detect the right prompt template based on the model files and set it automatically.

IMO, this is absolutely huge and a big step forward into making LLMs more accessible to everyone.

@ggerganov Does GGUF support this?

cztomsik · 2023-10-18T07:53:33Z

This means that when a model creator has set the prompt format in it, inference programs and UIs can detect the right prompt template based on the model files and set it automatically.

You can do this already, just look in the vocab for common tokens and you know which "template" was used.

What Hugging Face added is Jinja-specific. I don't think ggml/llama.cpp is going to re-implement the Jinja templating engine. EDIT: I don't think it's even possible; those templates depend on Python expressions.

Also, it's only for chat, you can use models for other things, and then the chat template is useless, whereas the vocab introspection would still help.

monatis · 2023-10-18T10:49:15Z

So Huggingface just introduced prompt templates embedded into the model files.

I believe this should be the responsibility of downstream executors if they want to support this. It's technically possible to add any template info in GGUF files --you can easily add the system prompt, role names or a full templated string in it and then read from the model file to act accordingly.

Dampfinchen · 2023-10-28T15:09:09Z

So Huggingface just introduced prompt templates embedded into the model files.

I believe this should be the responsibility of downstream executors if they want to support this. It's technically possible to add any template info in GGUF files --you can easily add the system prompt, role names or a full templated string in it and then read from the model file to act accordingly.

So what you are saying is, GGUF files already contain the "chat_template" information in the tokenizer_config.json and it's up to the inference program to use it?

If that's the case, then I agree with you.

monatis · 2023-10-29T16:49:55Z

So what you are saying is, GGUF files already contain the "chat_template" information in the tokenizer_config.json and it's up to the inference program to use it?

It's up to the converter to add those pieces of information to the GGUF file and it's up to the inference program to make use of those. GGML / llama.cpp does not distribute preconverted GGUF files officially --community members do. Anyone can add arbitrary key-value pairs to GGUF files that they want to use after the conversion. The GGUF spec defines only the structure, not the content.

philpax · 2023-10-31T23:32:07Z

Hi all! Apologies for the late follow-up on this, I've been tremendously busy and I'm catching back up on everything now. I've updated the PR to address the comments. @ggerganov I think we should merge this in now, and then let the community make follow-up PRs to fill in any specifics or correct things.

@philpax

One other thing I noticed in the spec: As far as I can tell, the underlying representation of enum ggml_type is never specified, unlike e.g. enum gguf_metadata_value_type: uint32_t

Fixed, thanks!

@philpax Another issue:
    // Padding to the nearest multiple of ALIGNMENT.
    uint8_t _padding[ALIGNMENT - (sizeof(header + tensor_infos) % ALIGNMENT)];
The comment is clear. But as far as I can tell, the pseudocode below it fails in the case where sizeof(header + tensor_infos) is divisible by ALIGNMENT. In that case - where weights can start immediately on next byte, according to comment - the modulo is 0, and we would add 32 bytes of unnecessary padding.

Potential fix could be
    uint8_t _padding[(ALIGNMENT - (sizeof(header + tensor_infos) % ALIGNMENT)) % ALIGNMENT];

Well-spotted. I've reworded the relevant section to make it more obvious what the intended goal is there; I was going to go with your fix, but figured that a different approach might be clearer.
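The off-by-one-block behaviour described above can be checked with a short Python sketch. ALIGNMENT = 32 is taken from the "32 bytes of unnecessary padding" remark; the function names are hypothetical, and `align_offset` is only one possible way to phrase the intent, not necessarily the wording the spec ended up with.

```python
ALIGNMENT = 32  # implied by the "32 bytes of unnecessary padding" remark above


def padding_original(size):
    # Formula from the original pseudocode: yields ALIGNMENT rather than 0
    # when size is already a multiple of ALIGNMENT.
    return ALIGNMENT - (size % ALIGNMENT)


def padding_fixed(size):
    # Proposed fix: the outer modulo maps the full-ALIGNMENT case back to 0.
    return (ALIGNMENT - (size % ALIGNMENT)) % ALIGNMENT


def align_offset(offset):
    # Equivalent view: round the offset up to the next multiple of ALIGNMENT.
    return offset + padding_fixed(offset)


for size in (64, 65, 95):
    print(size, padding_original(size), padding_fixed(size), align_offset(size))
```

For an already aligned size such as 64, `padding_original` returns 32 while `padding_fixed` returns 0; for every other size the two formulas agree.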

I've addressed the review comments 👍

Just asking the people behind each implementation: can you suggest metadata values that should be standardized, if any?

@ggerganov: whisper.cpp
@monatis: clip.cpp
@saharNooby: rwkv.cpp
@PABannier: biogpt.cpp/encodec.cpp
@skeskinen: bert.cpp

I am working on bert.cpp and found something for you.
* Because the training data differs, the subword symbol may differ too, e.g. ## or others (##bcd, elf##)
  
  * I have read some configs from Hugging Face; it is hard for me to find a stable key that points to the symbol.

* For the BERT transformer, there is a hyperparameter: type_vocab_size
  
  * Maybe the KV design is enough for it.

I'm sorry, but I didn't understand what you were suggesting; would you be able to produce a diff to the spec? I can edit it as required, I'm just not sure what the exact changes you'd like me to make are 😞

@philpax I would like to replace {arch}.rope.scale_linear with the following keys, for YaRN:

[llm].rope.scaling.type: string, can be none, linear, or yarn
[llm].rope.scaling.factor: float32, replaces scale_linear
[llm].rope.scaling.original_context_length: uint32_t, original context length of base model
[llm].rope.scaling.finetuned: bool, true if model has been finetuned with RoPE scaling

see ggerganov/llama.cpp#2268

Done. That PR's not merged yet, so I've left the old key in there in the meantime so that implementers aren't confused.
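While both the old and new keys coexist, a reader might resolve them along these lines. This is a hedged Python sketch: the `resolve_rope_scaling` helper, the plain-dict metadata, the default values, and the rule of treating the legacy `scale_linear` value as a linear scaling factor are all assumptions for illustration, not behaviour confirmed by the spec or by llama.cpp.

```python
def resolve_rope_scaling(metadata, arch):
    """Return RoPE scaling settings from a metadata dict, preferring the new keys."""
    prefix = f"{arch}.rope."
    scaling_type = metadata.get(prefix + "scaling.type")
    if scaling_type is not None:
        # New-style keys are present: read them, with assumed defaults.
        return {
            "type": scaling_type,
            "factor": metadata.get(prefix + "scaling.factor", 1.0),
            "original_context_length": metadata.get(
                prefix + "scaling.original_context_length"
            ),
            "finetuned": metadata.get(prefix + "scaling.finetuned", False),
        }
    legacy = metadata.get(prefix + "scale_linear")
    if legacy is not None:
        # Assumption: the legacy key maps onto linear scaling with the same factor.
        return {
            "type": "linear",
            "factor": legacy,
            "original_context_length": None,
            "finetuned": False,
        }
    # Neither key present: assume no scaling.
    return {
        "type": "none",
        "factor": 1.0,
        "original_context_length": None,
        "finetuned": False,
    }
```

Preferring the new keys means a file that carries both sets stays unambiguous once the old key is eventually dropped.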

Big-endian support is proposed in ggerganov/llama.cpp#3552. I think it needs the community's attention because the spec explicitly specifies little-endianness. We need to update it when merging that PR.

Thanks for mentioning this. As far as I can tell, the only functional difference is the version number has changed, which is... not ideal? How do you tell apart a little-endian and big-endian file? I've updated to v3 nonetheless and written a little about it, but this seems like something we should rectify ASAP.
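Absent an explicit flag, one workaround a reader could try is a heuristic on the version field. The Python sketch below assumes the file starts with the four ASCII bytes `GGUF` followed by a uint32 version, and that real version numbers are small; `guess_byte_order` is a hypothetical helper, and the heuristic is a guess rather than anything the spec defines.

```python
import struct

GGUF_MAGIC = b"GGUF"  # assumed to be stored as raw bytes regardless of byte order


def guess_byte_order(header):
    """Guess "little" or "big" from the first 8 bytes of a file."""
    if header[:4] != GGUF_MAGIC:
        raise ValueError("not a GGUF file")
    (as_little,) = struct.unpack_from("<I", header, 4)
    (as_big,) = struct.unpack_from(">I", header, 4)
    # Version numbers are small, so the interpretation that yields the
    # smaller value is the plausible one. Ties fall back to little-endian.
    return "little" if as_little <= as_big else "big"
```

This breaks down for byte-symmetric values and would stop working if versions ever grew large, which is why an explicit marker in the file would be more robust.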

Regarding the discussion of prompt templates: I think this is a great idea (improves the single-file usability of GGUF more still), which is why there's a section reserved for it in the spec. Nobody's defined what this should look like for GGUF yet as the design space was too wide when we were initially looking at it.

It looks like Hugging Face are using Jinja templates for their prompt templates, which is a simple and straightforward solution to the problem, but may not be appropriate for us: Jinja supports a lot of things we don't care about, and its canonical implementation is Python.

I'd love to hear suggestions for how we could embed prompt templates that cover the majority of what people would want to do with prompts while still remaining relatively simple to implement.

teleprint-me · 2023-11-01T03:12:38Z

@philpax

It looks like Hugging Face are using Jinja templates for their prompt templates, which is a simple and straightforward solution to the problem, but may not be appropriate for us: Jinja supports a lot of things we don't care about, and its canonical implementation is Python.

I'd love to hear suggestions for how we could embed prompt templates that cover the majority of what people would want to do with prompts while still remaining relatively simple to implement.

Speaking from experience with llama-cpp-python, it's best to let users define their templates. Rigid templates have already shown their limitations. Templates are often not well-understood and can be error-prone. This was especially evident with the transformers library, where some templates were simply incorrect due to inadequate research and implementation by the developers. Fortunately, transformers allows for customization, mitigating this issue to an extent.

The models are quite adaptable and the use of grammars provides the extensibility developers need. Over-complicating this with inflexible templates could become an obstacle.

I appreciate @ggerganov's original recommendation on handling special tokens, and the parameterized approach has its merits as it automates a previously tedious process. In Python, I've experimented with a flexible design that aligns well with @ggerganov's original idea and translates smoothly to JSON.

# Default chat formatting templates for reusability.
# These templates can be reused or modified on a model-by-model basis.

# Template for HuggingFace-based models.
huggingface_template = {
    "model": "meta-llama/Llama-2-7b-chat-hf",
    "jinja": None,
    "tokenize": False,
}

# Common formatting settings applicable to all roles in chat models.
common_template: llama_types.CommonTemplate = {
    "separators": {
        "after_system": "\n",
        "between_messages": "\n",
        "end_of_response": "",
    },
    "default_termination": {
        "role": "assistant",  # Default role for termination
        "message": None,  # Default termination message (None for assistant)
    },
    "include_prompt": False,  # Whether to include user prefix/postfix in prompts
}

# Template for Llama-2 model.
llama2_template: llama_types.ChatMLTemplate = {
    "roles": {
        "system": {
            "prefix": "<<SYS>>",  # System message prefix
            "postfix": "<</SYS>>",  # System message postfix
            "format": None,  # Optionally specify a custom format
        },
        "user": {
            "prefix": "[INST] ",
            "postfix": " [/INST]",  # Model generates from here
            "format": None,
        },
        "assistant": {
            "prefix": "",  # No prefix for assistant role by default
            "postfix": "",  # No postfix for assistant role by default
            "format": None,  # Custom format for assistant (if needed)
        },
    }
}
# Merge common settings into the llama2_template to reduce code duplication.
llama2_template |= common_template

Regarding JSON, it might not be the most convenient in C++, but it's manageable. You could consider using nlohmann/json, although I understand the preference for minimal dependencies.

What sets this library apart is its commitment to user control and flexibility, a feature I highly respect. Imposing rigid templates could compromise this strength.

Translating this to C++ should be relatively straightforward, as shown in the provided example.

#include <iostream>
#include <map>
#include <string>

struct CommonTemplate {
    std::string after_system;
    std::string between_messages;
    std::string end_of_response;
    // Add other common fields
};

struct RoleTemplate {
    std::string prefix;
    std::string postfix;
    // Add other role-specific fields
};

int main() {
    CommonTemplate common_template = { "\n", "\n", "" };

    std::map<std::string, RoleTemplate> roles;
    roles["system"] = { "<<SYS>>", "<</SYS>>" };
    roles["user"] = { "[INST] ", " [/INST]" };
    roles["assistant"] = { "", "" };

    std::map<std::string, std::map<std::string, RoleTemplate>> llama2_template;
    llama2_template["roles"] = roles;

    // Adding common settings to llama2_template can be done by another structure
    // or by manual assignment if the key names differ between the common and role-specific settings.

    std::cout << "System Prefix: " << llama2_template["roles"]["system"].prefix << std::endl;
    std::cout << "User Prefix: " << llama2_template["roles"]["user"].prefix << std::endl;

    return 0;
}

However, I'd strongly caution against imposing fixed templates. User-centric design has been the strong suit of this library, and I would appreciate keeping it that way.

ggerganov · 2023-11-01T17:01:21Z

@philpax

Thank you once again for initiating GGUF, writing and maintaining the spec and coordinating the community around it!
GGUF is a huge benefit not only for ggml, but for many other projects as well. Great job!

Will proceed with merging the PR and let follow-up PRs update the spec as needed.

@teleprint-me

My idea about including chat templates in GGUF only extends to the point where it is a container of opaque templates associated with the given model. A project using GGUF files can decide what to do with these templates, but specifically for llama.cpp, I think the only functionality that it will provide is a way to query the templates through the API. It is super unlikely that we will ever implement JSON, Jinja or any other template language at such a low level.

Without knowing the specifics of these templates and assuming they are just a parsable string, I would suggest that we store them in GGUF just as strings. Probably an array of strings since maybe there could be more than one template associated with the model.

philpax · 2023-11-01T18:09:28Z

Woohoo! Glad to see it in; it's been great to see widespread adoption 🎉

I also agree on the subject of prompt templates - we wouldn't require anyone to use them, but they're there to ease use with supported executors. I think we might include hints as to how they're used (i.e. indicate that it's a Jinja template or whatever), but it should otherwise be pretty freeform. A discussion for another issue, perhaps?

monatis · 2023-11-01T18:14:47Z

The Python package has ~35k monthly downloads: https://pypistats.org/packages/gguf

ggerganov · 2023-11-01T18:25:04Z

The Python package has ~35k monthly downloads: https://pypistats.org/packages/gguf

There are more than 1300 GGUF models hosted on HuggingFace:

https://huggingface.co/models?sort=trending&search=gguf 🎉

earonesty · 2023-11-06T16:15:52Z

we should be able to embed the jinja2 templates from huggingface in the gguf metadata, would be really helpful. jinja2 is easy to parse, lightweight

https://github.com/jinja2cpp/Jinja2Cpp

!!

philpax · 2023-11-08T17:05:27Z

I'd suggest opening another issue or a PR to the spec for that. I don't think llama.cpp will support actually using the templates, but it would be good to standardise the metadata keys for prompt templates regardless.

Dampfinchen · 2023-11-08T18:34:23Z

I'd suggest opening another issue or a PR to the spec for that. I don't think llama.cpp will support actually using the templates, but it would be good to standardise the metadata keys for prompt templates regardless.

There is already one here ggerganov/llama.cpp#3810 (comment)

Support for it directly in the llama.cpp UI would be cool, but it's not a big deal if it isn't implemented. What would be a big deal, however, is for the converter.py to write the chat template metadata into the GGUF file so that programs can read it and adjust the prompt template automatically.

FSSRepo · 2023-11-08T20:39:37Z

There should be an option in the GGUF API in ggml.c for tensor data to be written directly to a file instead of being kept in memory until gguf_write_to_file is called. Keeping it in memory causes significant memory consumption when converting large models. My suggestion is to first write the key-value elements and tensor metadata with their precalculated offsets to the file, then progressively write each tensor's data and release its memory once written.

cebtenzzre · 2023-11-09T00:23:04Z

This is causing significant memory consumption when converting large models.

Depending on which conversion script you're running, you may be running into ggerganov/llama.cpp#3433.

ggerganov · 2023-11-28T07:36:01Z

HuggingFace just added a GGUF filter to their UI 🎉

docs: gguf spec first pass

d2fbcb2

philpax mentioned this pull request Jun 25, 2023

ggml : unified file format #220

Closed

Green-Sky reviewed Jun 25, 2023

ggerganov reviewed Jun 26, 2023
