server: improve correctness of request parsing and responses #2929

Merged
cebtenzzre merged 16 commits into main from improve-local-server on Sep 9, 2024

Conversation

cebtenzzre (Member) commented Aug 30, 2024

This PR significantly improves the accuracy of the local API server's request parsing and responses.

Functional changes:

  • Split the chat and completions endpoints, since they differ significantly. This is a functional change: what was previously just an isChat flag is now a finer distinction, and some parameters are no longer accepted by one endpoint or the other because they are not part of the OpenAI spec
  • Rewrite request parsing using dedicated classes with stricter type checking and more meaningful client error responses (a rough sketch follows this list)
  • Fix /v1/completions to stop applying a prompt template (the client should already have applied one, if any) and to give token-accurate responses with echo=true (whitespace was being trimmed too eagerly)
  • Fix /v1/chat/completions to insert actual user and assistant messages into the history using fakeReply, instead of just joining all of the user messages into a string to prefix the prompt
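
For illustration, the parsing change is in the spirit of the sketch below; the class name, field set, and choice of JSON library (nlohmann/json) are placeholders for this example, not the actual implementation in this PR.

// Hypothetical sketch of type-checked request parsing; not code from this PR.
#include <nlohmann/json.hpp>
#include <stdexcept>
#include <string>

struct CompletionRequest {
    std::string model;
    std::string prompt;
    double      temperature = 1.0;
    bool        echo        = false;
};

// Rejects unknown or mistyped fields with a message naming the offending
// parameter, similar in spirit to the error responses shown in the tests below.
CompletionRequest parseCompletionRequest(const nlohmann::json &body)
{
    CompletionRequest req;
    for (const auto &[key, value] : body.items()) {
        if (key == "model" && value.is_string())
            req.model = value.get<std::string>();
        else if (key == "prompt" && value.is_string())
            req.prompt = value.get<std::string>();
        else if (key == "temperature" && value.is_number())
            req.temperature = value.get<double>();
        else if (key == "echo" && value.is_boolean())
            req.echo = value.get<bool>();
        else
            throw std::invalid_argument("Unrecognized or mistyped request argument supplied: " + key);
    }
    return req;
}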

Other changes:

  • Change llmodel to use std::optional for fakeReply, since constructing a pointer to a std::string requires too much boilerplate at the call site (see the sketch after this list)
  • Apply a few cleanups to existing code as recommended by iwyu/clazy/clang-tidy
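
A minimal sketch of the fakeReply change (the signatures below are invented for the example and are not llmodel's real interface):

// Why std::optional is less boilerplate than a string pointer for an optional argument.
#include <optional>
#include <string>
#include <string_view>

// Before: "no fake reply" vs. "this fake reply" is expressed with a pointer.
void promptOld(std::string_view prompt, const std::string *fakeReply = nullptr) { /* ... */ }

// After: the optional owns its value, so callers can pass a literal directly.
void promptNew(std::string_view prompt, std::optional<std::string> fakeReply = std::nullopt) { /* ... */ }

void caller()
{
    // Pointer API: a named std::string must be constructed and outlive the call.
    std::string reply = "bob: my name is bob";
    promptOld("john: what is your name?", &reply);

    // std::optional API: the conversion happens implicitly at the call site.
    promptNew("john: what is your name?", "bob: my name is bob");
    promptNew("john: what is your name?"); // no fake reply
}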

Outstanding issues (all existing before this PR):

  • Reproducibility is inhibited by changing top_k, n_batch, repeat_penalty, and repeat_last_n in the UI; these should really be configured by the client
  • Reproducibility may be inhibited by changing promptTemplate in the UI; perhaps the client would like to configure this as well
  • Reproducibility is inhibited by not knowing which model was actually used when the requested model could not be found
  • Reproducibility is inhibited by enabling LocalDocs in the UI (the client probably wants control of this), which affects the model's context in a way that cannot be represented by user/assistant queries in follow-up requests

cebtenzzre requested a review from manyoso on September 3, 2024 17:29
cebtenzzre marked this pull request as ready for review on September 3, 2024 17:29
cebtenzzre changed the title from "WIP: local API server improvements" to "Improve correctness of local API server parsing and responses" on Sep 3, 2024
cebtenzzre (Member, Author) commented Sep 3, 2024

Test 1: /v1/chat/completions

Request
{
    "model": "Llama 3 8B Instruct",
    "temperature": 0,
    "messages": [
        {
            "role": "user",
            "content": "this is a conversation between john and another person.\njohn: what is your name?"
        },
        {
            "role": "assistant",
            "content": "bob: my name is bob"
        },
        {
            "role": "user",
            "content": "john: could you please repeat that?"
        }
    ]
}
Response on main
{
  "choices": [
    {
      "finish_reason": "length",
      "index": 0,
      "message": {
        "content": "Other Person: My name is Emily.\n\n(Note: I'll respond as the other",
        "role": "assistant"
      },
      "references": []
    }
  ],
  "created": 1725387192,
  "id": "foobarbaz",
  "model": "Llama 3 8B Instruct",
  "object": "text_completion",
  "usage": {
    "completion_tokens": 16,
    "prompt_tokens": 40,
    "total_tokens": 56
  }
}

Why the above is unexpected: The model forgot its previous response, despite it being explicitly provided in the request.

Response with this PR
{
  "choices": [
    {
      "finish_reason": "stop",
      "index": 0,
      "logprobs": null,
      "message": {
        "content": "bob: sure thing, john! my name is bob.",
        "role": "assistant"
      },
      "references": null
    }
  ],
  "created": 1725385942,
  "id": "placeholder",
  "model": "Llama 3 8B Instruct",
  "object": "chat.completion",
  "usage": {
    "completion_tokens": 12,
    "prompt_tokens": 53,
    "total_tokens": 65
  }
}

Test 2: /v1/completions

Request
{
    "model": "Llama 3 8B Instruct",
    "temperature": 0,
    "echo": true,
    "prompt": "France is a"
}
Response on main
{
  "choices": [
    {
      "finish_reason": "stop",
      "index": 0,
      "logprobs": null,
      "references": [],
      "text": "France is a\nI think I know where this is going!\n\n...country!"
    }
  ],
  "created": 1725387245,
  "id": "foobarbaz",
  "model": "Llama 3 8B Instruct",
  "object": "text_completion",
  "usage": {
    "completion_tokens": 12,
    "prompt_tokens": 14,
    "total_tokens": 26
  }
}

Why the above is unexpected: The model's response is joined to the query with a newline, and a chat template is applied behind the scenes, even though this is not a chat endpoint.

Response with this PR
{
  "choices": [
    {
      "finish_reason": "length",
      "index": 0,
      "logprobs": null,
      "references": null,
      "text": "France is a country with a rich history and culture, known for its iconic landmarks like the E"
    }
  ],
  "created": 1725386185,
  "id": "placeholder",
  "model": "Llama 3 8B Instruct",
  "object": "text_completion",
  "usage": {
    "completion_tokens": 16,
    "prompt_tokens": 5,
    "total_tokens": 21
  }
}

Test 3: /v1/chat/completions

Request
{
    "model": "Llama 3 8B Instruct",
    "prompt": "what is 2+2?",
    "messages": [ { "role": "user", "content": "?" } ]
}
Response on main
{
  "choices": [
    {
      "finish_reason": "length",
      "index": 0,
      "message": {
        "content": "The answer to 2 + 2 is... (drumroll please)...",
        "role": "assistant"
      },
      "references": []
    }
  ],
  "created": 1725390175,
  "id": "foobarbaz",
  "model": "Llama 3 8B Instruct",
  "object": "text_completion",
  "usage": {
    "completion_tokens": 16,
    "prompt_tokens": 19,
    "total_tokens": 35
  }
}

Why the above is unexpected: The server not only failed to reject the invalid prompt argument, it actually generated a response based on its content.

Response with this PR
{
  "error": {
    "code": null,
    "message": "Unrecognized request argument supplied: prompt",
    "param": null,
    "type": "invalid_request_error"
  }
}

cebtenzzre changed the title from "Improve correctness of local API server parsing and responses" to "server: improve correctness of request parsing and responses" on Sep 3, 2024
VS 2019 does not support C++23 std::optional::transform.

cebtenzzre marked this pull request as draft on September 4, 2024 15:26
cebtenzzre (Member, Author) commented

Marking as draft until I fully sort out the CI issues. We need to use VS 2022, but CMake can't find the C++ compiler if I just switch the machine image, probably because the hardcoded Enter-VsDevShell equivalent needs to be updated.

cebtenzzre marked this pull request as ready for review on September 4, 2024 19:57
cebtenzzre (Member, Author) commented Sep 4, 2024

Windows is building fine now, but the GCC we are using on Linux is too old for std::optional::transform (we need GCC 12, released May 2022, but have GCC 11), and on macOS std::format is incompatible with OS deployment targets older than Ventura 13.3 (which cannot be used on pre-2017 Intel Macs).
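
For reference, this is the C++23 feature at issue, alongside the C++17-style code it would otherwise need (illustrative only, not code from this PR):

#include <optional>
#include <string>

std::optional<std::size_t> lengthOf(const std::optional<std::string> &s)
{
#if defined(__cpp_lib_optional) && __cpp_lib_optional >= 202110L
    // C++23 monadic transform (GCC 12+, VS 2022): empty stays empty, otherwise apply the lambda.
    return s.transform([](const std::string &v) { return v.size(); });
#else
    // Pre-C++23 equivalent: spell out the empty case by hand.
    return s ? std::optional<std::size_t>(s->size()) : std::nullopt;
#endif
}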

cebtenzzre marked this pull request as draft on September 5, 2024 14:55
cebtenzzre (Member, Author) commented

The final nail in the coffin for std::format is that on Linux it requires libstdc++ 13, which was released in April 2023 and was not made available until Ubuntu Lunar 23.04. Somewhere around 50% of our Ubuntu users are still on something older than 23.04, despite Ubuntu Noble 24.04 LTS having released earlier this year.
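
For reference, std::format and the {fmt} library call commonly used as a drop-in replacement on older toolchains (illustrative only; whether {fmt} would be used here is an assumption):

#include <string>
#include <version>  // for the __cpp_lib_format feature-test macro

#if defined(__cpp_lib_format)           // libstdc++ 13+, or macOS 13.3+ deployment targets
  #include <format>
  namespace fmtns = std;
#else                                   // assumption: the {fmt} library is available as a fallback
  #include <fmt/format.h>
  namespace fmtns = fmt;
#endif

std::string greeting(const std::string &name, int n)
{
    // Identical format-string syntax either way.
    return fmtns::format("hello {}, you have {} new messages", name, n);
}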

manyoso (Collaborator) commented Sep 5, 2024

The changes that you're attempting to make with regard to string formatting are tangentially related to the thrust of the change though, right? Perhaps it would be best to separate the concerns here as some of these fixes should go in sooner?

cebtenzzre marked this pull request as ready for review on September 6, 2024 23:24
cebtenzzre requested a review from manyoso on September 6, 2024 23:24
manyoso (Collaborator) commented Sep 7, 2024

Seems to be failing on CI still...

cebtenzzre (Member, Author) commented

Seems to be failing on CI still...

CI passed, actually. I canceled the Windows and macOS builds because they succeeded with the previous commit.

cebtenzzre merged commit 3900528 into main on Sep 9, 2024
6 of 21 checks passed
cebtenzzre deleted the improve-local-server branch September 9, 2024 14:48
cebtenzzre added commits that referenced this pull request on Sep 30, 2024