
server: init functional tests #5566

Merged: 100 commits merged into ggerganov:master from test/server-add-ci-test on Feb 24, 2024

Conversation

@phymbert (Collaborator) commented Feb 18, 2024

Motivation

Tests were listed in #4216 as an improvement request.

The idea is to ensure all server routes work properly by using the Gherkin language to define test cases, following a BDD approach. Gherkin is designed to be human-readable and describes use cases of a software system.

Example

  @llama.cpp
  Scenario: Multi users
    Given a prompt:
      """
      Write a very long story about AI.
      """
    And a prompt:
      """
      Write another very long music lyrics.
      """
    And 32 max tokens to predict
    Given concurrent completion requests
    Then the server is busy
    And  all slots are busy
    Then the server is idle
    And  all slots are idle
    Then all prompts are predicted

Proposed changes

A CI workflow is triggered which builds and starts the server in the background, then the test scenarios are launched with Python.
A very small model is used to quickly generate responses, and a fixed seed is set to ensure reproducibility.

The Gherkin glue is written in Python using behave.
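
As a rough illustration (not the exact glue shipped in this PR), behave step definitions for the scenario above could look like the sketch below; the step names and context attributes are assumptions, and context.text carries the triple-quoted prompt from the feature file.

# Illustrative sketch only: behave step definitions matching the scenario above.
# Step names and context attributes are assumptions, not the PR's actual glue.
from behave import given, then

@given('a prompt:')
def step_a_prompt(context):
    # context.text holds the docstring written under the step in the feature file
    if not hasattr(context, 'prompts'):
        context.prompts = []
    context.prompts.append(context.text)

@given('{n_predict:d} max tokens to predict')
def step_n_predict(context, n_predict):
    context.n_predict = n_predict

@then('all prompts are predicted')
def step_all_predicted(context):
    # responses are collected by the concurrent completion request step (not shown)
    assert all(r.get('content') for r in context.responses)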

Restriction

This is not designed to assess the performance of the server.

Expected scenarios:

(from @ngxson's comment)

  • health and slots endpoints
  • completion endpoint
  • OAI-compatible chat completion requests with and without streaming
  • multimodal prompts
  • passing multi-user scenario
  • multi-user scenario on the OAI-compatible endpoint with streaming
  • multi-user scenario where the total number of tokens to predict exceeds the KV cache size
  • server wrong usage scenario, like in Infinite loop of "context shift" #3969
  • slots shifting
  • continuous batching
  • embeddings endpoint
  • embeddings endpoint with image
  • multi-user embeddings endpoint: Segmentation fault #5655
  • OpenAI-compatible embeddings API
  • tokenize endpoint
  • infill endpoint
  • CORS and API key scenario @Azeirah
  • upon receiving an incomplete unicode sequence, the JSON converter crashes the whole process @ngxson

An example of a passing GitHub workflow can be found here.

TODO:

  • Use the different cmake build matrix of the server to run the tests: sanitizer: [ADDRESS, THREAD, UNDEFINED], build_type: [Debug, Release], build: [noavx, avx2, avx, avx512, clblast, openblas, kompute, vulkan, cublas]. Even if it works on Docker Ubuntu with CPU only, it is interesting to see whether it still works when no GPU device is available.
  • start the server within the scenario background instead of starting it manually, to pass custom parameters
  • make the server binary path configurable
  • fix the /slots and /health endpoints to properly access slots data over the queue_tasks: server: health: fix race condition on slots data using tasks queue #5634
  • fix slots[].state in the health endpoint, which may be inconsistent with the total idle slots under a race condition
  • change the CI build trigger
  • use asyncio and aiohttp to trigger concurrent HTTP requests (see the sketch after this list)
  • fix the async OpenAI HTTP client with streaming
  • sometimes health check requests are triggered after the completion requests, making the scenario unstable
  • capture stdout/stderr of server_process, maybe saving it to a second file called llama-stdout.log @ngxson
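
Below is the sketch referenced in the asyncio/aiohttp item; it is illustrative only and assumes a server already listening on localhost:8080 with the /completion endpoint.

# Rough sketch for the asyncio/aiohttp TODO item. Assumes a llama.cpp server
# is already running on localhost:8080; the request payload is kept minimal.
import asyncio
import aiohttp

async def complete(session, prompt, n_predict=32):
    async with session.post('http://localhost:8080/completion',
                            json={'prompt': prompt, 'n_predict': n_predict}) as resp:
        resp.raise_for_status()
        return await resp.json()

async def run_concurrently(prompts):
    async with aiohttp.ClientSession() as session:
        # fire all completion requests at once and wait for every response
        return await asyncio.gather(*(complete(session, p) for p in prompts))

results = asyncio.run(run_concurrently([
    'Write a very long story about AI.',
    'Write another very long music lyrics.',
]))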

@ggerganov (Owner)

I'm quite out of my depth here, but if you can figure out a way to add server tests it would be awesome. I've sent you a collaborator invite.

@phymbert (Collaborator, Author)

@RafaAguilar @ngxson As you were part of the tests-related discussion, do you feel OK with the proposed approach here? If so, I will continue with asynchronous requests and the multi-user scenario.

@ngxson (Collaborator) commented Feb 18, 2024

Great idea, thanks for starting this PR. Some suggestions:

  1. Since the number of test cases is not very big, can we reduce the number of files? (so that future contributors can find things more easily)
  2. It would be nice to have a .sh script that runs server -m ... and python -m behave all at once. Again, this is to make it easier for future contributors. It can also be useful when we want to dockerize the test script in the future.
  3. Note to myself: maybe I can fine-tune the bloom 560m model for use in this test. The smallest usable GGUF that we can find on HF is currently tinyllama-2-1b-miniguanaco.Q2_K.gguf. I tried fine-tuning ahxt/LiteLlama-460M-1T but the result is unusable.

@Azeirah (Contributor) commented Feb 19, 2024

Excellent! I'll be reading this PR today and see if I can add a test or help out in some way :)

@Azeirah (Contributor) commented Feb 19, 2024

Great idea, thanks for starting this PR. Some suggestions:

  1. Since the number of test cases is not very big, can we reduce the number of files? (so that future contributors can find things more easily)
  2. It would be nice to have a .sh script that runs server -m ... and python -m behave all at once. Again, this is to make it easier for future contributors. It can also be useful when we want to dockerize the test script in the future.
  3. Note to myself: maybe I can fine-tune the bloom 560m model for use in this test. The smallest usable GGUF that we can find on HF is currently tinyllama-2-1b-miniguanaco.Q2_K.gguf. I tried fine-tuning ahxt/LiteLlama-460M-1T but the result is unusable.

Would it be possible to train a nonsensical 1M-parameter model? It should be really cheap and fast even on commodity hardware. These tests aren't meant to interact with the output in any meaningful way anyway.

I'm mentioning this because I see the trial run took 15 minutes for just two features with 3 scenarios each! Imagine the time needed to run 20-30 tests!

@ngxson (Collaborator) commented Feb 19, 2024

@Azeirah Yes, it's possible, but the problem is that these models never want to output the EOS token (to terminate the output). It's also possible to rely on n_predict to stop the generation after X tokens.
Another problem is that small models tend to output invalid bytes instead of words (because part of the llama vocab is bytes, which allows it to handle unicode). Maybe I need to limit the usable tokens in its vocab.
Anyway, I'll look into this this week; it's still a good exercise for me to train a model from scratch.
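
For illustration, capping generation with n_predict on the server's /completion endpoint could look like the sketch below (assuming a server running locally on port 8080):

# Sketch only: cap generation via n_predict since the tiny model never emits EOS.
# Assumes a llama.cpp server listening on localhost:8080.
import requests

resp = requests.post('http://localhost:8080/completion',
                     json={'prompt': 'Write a very long story about AI.',
                           'n_predict': 32})
resp.raise_for_status()
print(resp.json()['content'])  # at most 32 predicted tokens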

@Azeirah (Contributor) commented Feb 19, 2024

@Azeirah Yes, it's possible, but the problem is that these models never want to output the EOS token (to terminate the output). It's also possible to rely on n_predict to stop the generation after X tokens.
Another problem is that small models tend to output invalid bytes instead of words (because part of the llama vocab is bytes, which allows it to handle unicode). Maybe I need to limit the usable tokens in its vocab.
Anyway, I'll look into this this week; it's still a good exercise for me to train a model from scratch.

Fair enough, I think it should be doable to make a model that behaves well enough. Potentially it could be trained explicitly to bias towards EOS, haha. I agree it would be a fun exercise; unfortunately I have a 7900 XTX and I believe it cannot be used for training :(

In addition to that, we of course have no clue what kind of hardware these tests will be run on, but if it's a virtual core on a Xeon or something similar, maybe we can try compiling with OpenBLAS? I'm not sure whether it'd even be worth investigating, depending on the speedup and the variety of weird hardware you could get on GitHub Actions. No clue what kind of control you'd get there over the underlying (virtualised) hardware.


Other than that, I think it's fine that the tests are in separate files. It's just how behave is meant to be used: each feature is one file, and related scenarios belong to one feature. I'm somewhat familiar with BDD myself since I use a loosely inspired variant at work. Do you think BDD is unclear to some people? I could write a short readme explaining it.

@ngxson (Collaborator) commented Feb 19, 2024

Also, one case that I have never tested before is invalid unicode.

In my personal project (which uses llama.h), on receiving responses via llama_token_to_piece, I pass them to nlohmann/json to convert them to a JSON string. That's the same thing we're using in the server example. Upon receiving an incomplete unicode sequence, the JSON converter crashes the whole process.

It would be nice if someone could test whether this is the case for server.cpp (with stream=True, for example).
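
To make the failure mode concrete, here is a small Python illustration (the actual crash happens in the C++ JSON converter; this only shows why a partial multi-byte sequence cannot be decoded):

# Illustration only: the first byte of a two-byte UTF-8 character, taken alone,
# cannot be decoded, analogous to the incomplete sequence that breaks the JSON converter.
partial = 'é'.encode('utf-8')[:1]
try:
    partial.decode('utf-8')
except UnicodeDecodeError as err:
    print('incomplete unicode sequence:', err)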

@ngxson (Collaborator) commented Feb 19, 2024

@Azeirah I believe GitHub's hosted runners are Xeons with shared CPU cores. The performance is not meant to be consistent, though. I believe it cannot use anything better than AVX2.

For training, I'm using a GTX 1660 Ti. I initially purchased it for gaming two years ago, but who knew that I would now need more VRAM than that :'( Back then, the dealer offered me a 3080 Ti at a fairly good price, but I refused. Nowadays, for anything bigger than 1B, I need to rent a VPS on Google Cloud; it's more or less the same price as Colab notebooks, but more flexible and with persistent storage.

@phymbert (Collaborator, Author)

Great idea, thanks for starting this PR. Some suggestions:

  1. Since the number of test cases is not very big, can we reduce the number of files? (so that future contributors can find things more easily)
  2. It would be nice to have a .sh script that runs server -m ... and python -m behave all at once. Again, this is to make it easier for future contributors. It can also be useful when we want to dockerize the test script in the future.
  3. Note to myself: maybe I can fine-tune the bloom 560m model for use in this test. The smallest usable GGUF that we can find on HF is currently tinyllama-2-1b-miniguanaco.Q2_K.gguf. I tried fine-tuning ahxt/LiteLlama-460M-1T but the result is unusable.

Done 👍

@phymbert (Collaborator, Author)

@ggerganov @ngxson Any idea how to improve the prompt eval time on the GitHub runners? Should we give OpenBLAS a try?

@ngxson (Collaborator) commented Feb 19, 2024

@phymbert Can you try this model instead? (pay attention to setting n_predict or max_tokens, because the model never outputs the EOS token)

https://huggingface.co/ngxson/dummy-llama/blob/main/llama_xs_q4.bin

I have no idea if OpenBLAS will help. You can try if you want.

@Azeirah I tried to overfit an 86M model, but unfortunately it does not seem to output any of the examples. On the bright side, it outputs mostly text (not invalid bytes, as I said earlier), so it is still usable for the test. The Q4_K_M size is only 57 MB.

@phymbert phymbert marked this pull request as draft February 19, 2024 22:02
@phymbert (Collaborator, Author)

@phymbert Can you try this model instead? (pay attention to setting n_predict or max_tokens, because the model never outputs the EOS token)

https://huggingface.co/ngxson/dummy-llama/blob/main/llama_xs_q4.bin

I have no idea if OpenBLAS will help. You can try if you want.

@Azeirah I tried to overfit an 86M model, but unfortunately it does not seem to output any of the examples. On the bright side, it outputs mostly text (not invalid bytes, as I said earlier), so it is still usable for the test. The Q4_K_M size is only 57 MB.

Nice, thanks. It took 0m0.481s. Note that I have also reduced the KV cache size.

@ggerganov (Owner)

@ggerganov @ngxson Any idea how to improve the prompt eval time on the GitHub runners? Should we give OpenBLAS a try?

@phymbert

The best way to improve the speed is to use as small a model as possible. You can try @karpathy's tinyllamas: https://huggingface.co/karpathy/tinyllamas

Here are instructions for converting to GGUF and using in llama.cpp:

https://github.com/ggerganov/llama.cpp/tree/master/examples/convert-llama2c-to-ggml

For convenience, I've uploaded the smallest 260K model (~1 MB) in GGUF format here:

https://huggingface.co/ggml-org/models/blob/main/tinyllamas/stories260K.gguf

Example:

# get the model
wget https://huggingface.co/ggml-org/models/resolve/main/tinyllamas/stories260K.gguf

# run sample inference
./main -m ./stories260K.gguf -p "One day, Lily met" -n 128 -c 256

 One day, Lily met a boy named Timmy. Tim was very happy to help her mommy. He wanted to play with the ball all day. Suddenly, something unexpected happened. A little girl came over and saw a big tree. She was very sad.
Timmy wanted to play with the ball. He thought it would be fun! When he reached up, he found it st

llama_print_timings:        load time =      80.26 ms
llama_print_timings:      sample time =       1.70 ms /   128 runs   (    0.01 ms per token, 75427.22 tokens per second)
llama_print_timings: prompt eval time =       3.06 ms /     7 tokens (    0.44 ms per token,  2288.33 tokens per second)
llama_print_timings:        eval time =     134.04 ms /   127 runs   (    1.06 ms per token,   947.46 tokens per second)
llama_print_timings:       total time =     142.59 ms /   134 tokens

This should be ideal for CI

@ggerganov (Owner)

Btw, one thing that would greatly improve the state of the server in terms of debugging issues is to add detailed logs. Things like incoming requests, parameters, batch info, etc. As much information as possible should be dumped in the log file. There is some info currently saved in llama.log, but there should be more.

This probably needs a separate PR to avoid this change becoming too big, but I thought I would mention it in case you are interested in further helping out with maintenance.

@Azeirah (Contributor) commented Feb 23, 2024

I will review this fully tomorrow, I'm a bit sick but I have energy when I plan it out.

@phymbert (Collaborator, Author)

Wow! Very nice work - this would be very useful and should help to improve server significantly

multi users with total number of tokens to predict exceeds the KV Cache size, fixed to be confirmed: #3969

What was the fix?

No fix was applied, actually; IMHO it's a wrong usage of the server when neither --n_predict nor "max_tokens" is set. If you provide n_predict|max_tokens in the request, the server behaves well. I have updated the PR description as it was confusing. But IMHO, a server should never loop infinitely.

I have also added a wrong_usages.feature file to trace and reproduce this kind of user issue.

@ggerganov Regarding #5655, I have reproduced it in issues.feature, to run it:
DEBUG=ON ./tests.sh --no-skipped --tags bug

It can be investigated/fixed in another PR.

Thanks for the review. I will give concurrent streaming requests with aiohttp one last chance, then merge this first version.

@phymbert (Collaborator, Author)

I will review this fully tomorrow, I'm a bit sick but I have energy when I plan it out.

@Azeirah No worries, take care, it can wait for tomorrow 👍

@phymbert added the testing and server/webui labels on Feb 23, 2024
@Azeirah (Contributor) commented Feb 23, 2024

Wow! Very nice work - this would be very useful and should help to improve server significantly

multi users with total number of tokens to predict exceeds the KV Cache size, fixed to be confirmed: #3969

What was the fix?

No fix was applied, actually; IMHO it's a wrong usage of the server when neither --n_predict nor "max_tokens" is set. If you provide n_predict|max_tokens in the request, the server behaves well. I have updated the PR description as it was confusing. But IMHO, a server should never loop infinitely.

I have also added a wrong_usages.feature file to trace and reproduce this kind of user issue.

@ggerganov Regarding #5655, I have reproduced it in issues.feature, to run it: DEBUG=ON ./tests.sh --no-skipped --tags bug

It can be investigated/fixed in another PR.

Thanks for the review. I will give concurrent streaming requests with aiohttp one last chance, then merge this first version.

In case the server is started with undesirable parameters, we should either abort or at the very least offer a clear warning with a suggested solution. Is that the case now?

I try to focus a lot on usability for end users.

@Azeirah (Contributor) left a comment

I think it's excellent! I ran it 5 times to check for flakes as well, especially given the warning in the readme and since I have a capable PC: 0 flakes.

I did a quick look at all the code; I didn't read deep into the implementation but did look at the tests. I think it's a fantastic set-up for server tests! :)
If there's a reason to add any tests later, this set-up will be very easy to extend, and it runs very stably, so it's ready for merging if you ask me.

I only had a tiny comment about the readme again, but it's very minor.

examples/server/tests/README.md (outdated, resolved)
examples/server/tests/README.md (resolved)
examples/server/tests/features/issues.feature (resolved)
@phymbert phymbert merged commit 525213d into ggerganov:master Feb 24, 2024
109 checks passed
@phymbert phymbert deleted the test/server-add-ci-test branch February 24, 2024 11:29
@phymbert (Collaborator, Author)

Btw, one thing that would greatly improve the state of the server in terms of debugging issues is to add detailed logs. Things like incoming requests, parameters, batch info, etc. As much information as possible should be dumped in the log file. There is some info currently saved in llama.log, but there should be more.

This probably needs a separate PR to avoid this change becoming too big, but I thought I would mention it in case you are interested in further helping out with maintenance.

On it, especially in update_slots, as it is a nightmare to understand what's going on.

jordankanter pushed a commit to jordankanter/llama.cpp that referenced this pull request Mar 13, 2024
* server: tests: init scenarios
 - health and slots endpoints
 - completion endpoint
 - OAI compatible chat completion requests w/ and without streaming
 - completion multi users scenario
 - multi users scenario on OAI compatible endpoint with streaming
 - multi users with total number of tokens to predict exceeds the KV Cache size
 - server wrong usage scenario, like in Infinite loop of "context shift" ggerganov#3969
 - slots shifting
 - continuous batching
 - embeddings endpoint
 - multi users embedding endpoint: Segmentation fault ggerganov#5655
 - OpenAI-compatible embeddings API
 - tokenize endpoint
 - CORS and api key scenario

* server: CI GitHub workflow


---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
hodlen pushed a commit to hodlen/llama.cpp that referenced this pull request Apr 1, 2024
* server: tests: init scenarios
 - health and slots endpoints
 - completion endpoint
 - OAI compatible chat completion requests w/ and without streaming
 - completion multi users scenario
 - multi users scenario on OAI compatible endpoint with streaming
 - multi users with total number of tokens to predict exceeds the KV Cache size
 - server wrong usage scenario, like in Infinite loop of "context shift" ggerganov#3969
 - slots shifting
 - continuous batching
 - embeddings endpoint
 - multi users embedding endpoint: Segmentation fault ggerganov#5655
 - OpenAI-compatible embeddings API
 - tokenize endpoint
 - CORS and api key scenario

* server: CI GitHub workflow


---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
Labels: server/webui, testing
Projects: none
5 participants