server: improve correctness of request parsing and responses #2929
Test 1: `/v1/chat/completions`

Request:

```json
{
"model": "Llama 3 8B Instruct",
"temperature": 0,
"messages": [
{
"role": "user",
"content": "this is a conversation between john and another person.\njohn: what is your name?"
},
{
"role": "assistant",
"content": "bob: my name is bob"
},
{
"role": "user",
"content": "john: could you please repeat that?"
}
]
}
```

Response on main:

```json
{
"choices": [
{
"finish_reason": "length",
"index": 0,
"message": {
"content": "Other Person: My name is Emily.\n\n(Note: I'll respond as the other",
"role": "assistant"
},
"references": []
}
],
"created": 1725387192,
"id": "foobarbaz",
"model": "Llama 3 8B Instruct",
"object": "text_completion",
"usage": {
"completion_tokens": 16,
"prompt_tokens": 40,
"total_tokens": 56
}
}
```

Why the above is unexpected: The model forgot its previous response, despite it being explicitly provided in the request.

Response with this PR:

```json
{
"choices": [
{
"finish_reason": "stop",
"index": 0,
"logprobs": null,
"message": {
"content": "bob: sure thing, john! my name is bob.",
"role": "assistant"
},
"references": null
}
],
"created": 1725385942,
"id": "placeholder",
"model": "Llama 3 8B Instruct",
"object": "chat.completion",
"usage": {
"completion_tokens": 12,
"prompt_tokens": 53,
"total_tokens": 65
}
}
```

Test 2: `/v1/completions`

Request:

```json
{
"model": "Llama 3 8B Instruct",
"temperature": 0,
"echo": true,
"prompt": "France is a"
}
```

Response on main:

```json
{
"choices": [
{
"finish_reason": "stop",
"index": 0,
"logprobs": null,
"references": [],
"text": "France is a\nI think I know where this is going!\n\n...country!"
}
],
"created": 1725387245,
"id": "foobarbaz",
"model": "Llama 3 8B Instruct",
"object": "text_completion",
"usage": {
"completion_tokens": 12,
"prompt_tokens": 14,
"total_tokens": 26
}
}
```

Why the above is unexpected: The model's response is joined to the prompt with a newline, and a chat template is applied behind the scenes, even though this is not a chat endpoint.

Response with this PR:

```json
{
"choices": [
{
"finish_reason": "length",
"index": 0,
"logprobs": null,
"references": null,
"text": "France is a country with a rich history and culture, known for its iconic landmarks like the E"
}
],
"created": 1725386185,
"id": "placeholder",
"model": "Llama 3 8B Instruct",
"object": "text_completion",
"usage": {
"completion_tokens": 16,
"prompt_tokens": 5,
"total_tokens": 21
}
}
```

Test 3: `/v1/chat/completions`

Request:

```json
{
"model": "Llama 3 8B Instruct",
"prompt": "what is 2+2?",
"messages": [ { "role": "user", "content": "?" } ]
}
```

Response on main:

```json
{
"choices": [
{
"finish_reason": "length",
"index": 0,
"message": {
"content": "The answer to 2 + 2 is... (drumroll please)...",
"role": "assistant"
},
"references": []
}
],
"created": 1725390175,
"id": "foobarbaz",
"model": "Llama 3 8B Instruct",
"object": "text_completion",
"usage": {
"completion_tokens": 16,
"prompt_tokens": 19,
"total_tokens": 35
}
}
```

Why the above is unexpected: The server not only failed to reject the invalid argument, it actually generated a response based on its content.

Response with this PR:

```json
{
"error": {
"code": null,
"message": "Unrecognized request argument supplied: prompt",
"param": null,
"type": "invalid_request_error"
}
}
```
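For reference, a minimal reproduction harness (hypothetical, not part of this PR) that POSTs the Test 2 body to the local API server using libcurl. It assumes GPT4All's default API server port (4891); adjust the URL if your configuration differs.

```cpp
// Hypothetical reproduction harness, not part of this PR: POSTs the Test 2
// body to the local API server with libcurl. The URL assumes the default
// local API server port (4891).
#include <curl/curl.h>

int main()
{
    CURL *curl = curl_easy_init();
    if (!curl)
        return 1;

    // Request body from Test 2 above
    const char *body =
        R"({"model": "Llama 3 8B Instruct", "temperature": 0,)"
        R"( "echo": true, "prompt": "France is a"})";

    curl_slist *headers = curl_slist_append(nullptr, "Content-Type: application/json");
    curl_easy_setopt(curl, CURLOPT_URL, "http://localhost:4891/v1/completions");
    curl_easy_setopt(curl, CURLOPT_HTTPHEADER, headers);
    curl_easy_setopt(curl, CURLOPT_POSTFIELDS, body);

    // With no write callback installed, the response JSON goes to stdout.
    CURLcode res = curl_easy_perform(curl);

    curl_slist_free_all(headers);
    curl_easy_cleanup(curl);
    return res == CURLE_OK ? 0 : 1;
}
```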
VS 2019 does not support C++23 `std::optional::transform`.
Marking as draft until I fully sort out the CI issues. We need to use VS 2022, but CMake can't find the C++ compiler if I simply switch the machine image, probably because the hardcoded Enter-VsDevShell equivalent needs to be updated.
Windows is building fine now, but the gcc we are using on Linux is too old for `std::optional::transform` (we need GCC 12, from May 2022, but have GCC 11), and on macOS there is an incompatibility between `std::format` and OS deployment targets older than Ventura 13.3 (which cannot be used on pre-2017 Intel Macs).
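For context, `std::optional::transform` is one of C++23's monadic operations on `std::optional`, first shipped in GCC 12 and VS 2022. A hypothetical sketch (not code from this PR) of the guarded fallback older toolchains would need:

```cpp
// Hypothetical illustration, not this PR's code: apply a function to an
// optional's contained value, falling back to a manual check pre-C++23.
#include <optional>
#include <string>

std::optional<std::size_t> lengthOf(const std::optional<std::string> &s)
{
#if defined(__cpp_lib_optional) && __cpp_lib_optional >= 202110L
    // C++23 monadic operation: maps over the contained value, if any
    return s.transform([](const std::string &v) { return v.size(); });
#else
    // Pre-C++23 fallback with identical behavior
    return s ? std::optional<std::size_t>(s->size()) : std::nullopt;
#endif
}
```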
The final nail in the coffin for `std::format` is that on Linux it requires libstdc++ 13, which was released in April of last year and was not made available until Ubuntu Lunar 23.04. Somewhere around 50% of our Ubuntu users are still on something older than 23.04, despite Ubuntu Noble 24.04 LTS having released earlier this year.
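A minimal sketch of the kind of feature-test guard this situation forces (hypothetical; the `fmtns` alias and the fmt library fallback are assumptions for illustration, not this PR's code):

```cpp
// Hypothetical portability guard, not code from this PR: __cpp_lib_format is
// only defined once the standard library actually ships <format>
// (libstdc++ 13 on Linux; newer deployment targets on macOS).
#include <version>

#if defined(__cpp_lib_format)
    #include <format>
    namespace fmtns = std;
#else
    #include <fmt/format.h>  // assumed fallback to the fmt library
    namespace fmtns = fmt;
#endif

#include <string>

std::string usageLine(int completion, int prompt)
{
    // fmt::format and std::format share this basic call syntax
    return fmtns::format("{} completion + {} prompt tokens", completion, prompt);
}
```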
The string-formatting changes are only tangentially related to the thrust of this PR though, right? Perhaps it would be best to separate the concerns here, since some of these fixes should go in sooner.
Seems to be failing on CI still...
CI passed, actually. I canceled the Windows and macOS builds because they succeeded with the previous commit.
This PR significantly improves the accuracy of the local API server's request parsing and responses.
Functional changes:

- `isChat` is now a finer distinction, and some parameters are no longer supported with one endpoint or the other, since they aren't in the spec (a sketch of this style of validation appears at the end of this description)
- Fix `/v1/completions` to stop applying a prompt template (it should already be applied by the client, if any) and to give token-accurate responses with `echo=true` (whitespace was being trimmed too eagerly)
- Fix `/v1/chat/completions` to insert actual `user` and `assistant` messages into the history using fakeReply, instead of just joining all of the user messages into a string to prefix the prompt

Other changes:
Outstanding issues (all existing before this PR):
- `user`/`assistant` queries in follow-up requests
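To illustrate the stricter per-endpoint parsing described in the functional changes above, here is a hypothetical helper that finds the first request key an endpoint does not recognize, which is the kind of check behind the `Unrecognized request argument supplied: prompt` error in Test 3. The helper name and signature are assumptions, not this PR's actual code.

```cpp
// Hypothetical sketch of per-endpoint argument checking; firstUnknownKey and
// its shape are illustrative, not this PR's actual code.
#include <optional>
#include <set>
#include <string>
#include <vector>

std::optional<std::string> firstUnknownKey(
    const std::vector<std::string> &requestKeys,
    const std::set<std::string> &allowed)
{
    for (const auto &key : requestKeys)
        if (!allowed.count(key))
            return key; // e.g. "prompt" sent to /v1/chat/completions
    return std::nullopt;
}
```

An endpoint would pass its spec'd parameter set as `allowed` and turn any returned key into an `invalid_request_error` response like the one shown in Test 3.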