
server: init functional tests #5566

Merged: 100 commits merged into ggerganov:master from test/server-add-ci-test on Feb 24, 2024

Conversation

@phymbert (Collaborator) commented Feb 18, 2024

Motivation

Tests were listed in #4216 as an improvement request.

The idea is to ensure all server routes work properly by using the Gherkin language to define test cases, following a BDD approach. Gherkin is designed to be human-readable and describes use cases of a software system.

Example

  @llama.cpp
  Scenario: Multi users
    Given a prompt:
      """
      Write a very long story about AI.
      """
    And a prompt:
      """
      Write another very long music lyrics.
      """
    And 32 max tokens to predict
    Given concurrent completion requests
    Then the server is busy
    And  all slots are busy
    Then the server is idle
    And  all slots are idle
    Then all prompts are predicted

Proposed changes

A CI workflow is triggered which builds and starts the server in the background, then the test scenarios are launched with Python.
A very small model is used to quickly generate responses, and a fixed seed is set to ensure reproducibility.

The Gherkin glue is written in Python using behave.
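
As a rough illustration (not the exact glue shipped in this PR), behave step definitions for the scenario above could look like the sketch below; the step names and context attributes are assumptions, and context.text carries the triple-quoted prompt from the feature file.

# Illustrative sketch only: behave step definitions matching the scenario above.
# Step names and context attributes are assumptions, not the PR's actual glue.
from behave import given, then

@given('a prompt:')
def step_a_prompt(context):
    # context.text holds the docstring written under the step in the feature file
    if not hasattr(context, 'prompts'):
        context.prompts = []
    context.prompts.append(context.text)

@given('{n_predict:d} max tokens to predict')
def step_n_predict(context, n_predict):
    context.n_predict = n_predict

@then('all prompts are predicted')
def step_all_predicted(context):
    # responses are collected by the concurrent completion request step (not shown)
    assert all(r.get('content') for r in context.responses)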

Restriction

This is not designed to assess the performance of the server.

Expected scenarios:

(from @ngxson's comment)

  • health and slots endpoints
  • completion endpoint
  • OAI-compatible chat completion requests with and without streaming
  • multimodal prompts
  • passing multi-user scenario
  • multi-user scenario on the OAI-compatible endpoint with streaming
  • multi-user scenario where the total number of tokens to predict exceeds the KV cache size
  • server wrong usage scenario, like in Infinite loop of "context shift" #3969
  • slots shifting
  • continuous batching
  • embeddings endpoint
  • embeddings endpoint with image
  • multi-user embeddings endpoint: Segmentation fault #5655
  • OpenAI-compatible embeddings API
  • tokenize endpoint
  • infill endpoint
  • CORS and API key scenario @Azeirah
  • upon receiving an incomplete unicode sequence, the JSON converter crashes the whole process @ngxson

An example of a passing GitHub workflow can be found here.

TODO:

  • Use the different cmake build matrix of the server to run the tests: sanitizer: [ADDRESS, THREAD, UNDEFINED], build_type: [Debug, Release], build: [noavx, avx2, avx, avx512, clblast, openblas, kompute, vulkan, cublas]. Even if it works on Docker Ubuntu with CPU only, it is interesting to see whether it still works when no GPU device is available.
  • start the server within the scenario background instead of starting it manually, to pass custom parameters
  • make the server binary path configurable
  • fix the /slots and /health endpoints to properly access slots data over the queue_tasks: server: health: fix race condition on slots data using tasks queue #5634
  • fix slots[].state in the health endpoint, which may be inconsistent with the total idle slots under a race condition
  • change the CI build trigger
  • use asyncio and aiohttp to trigger concurrent HTTP requests (see the sketch after this list)
  • fix the async OpenAI HTTP client with streaming
  • sometimes health check requests are triggered after the completion requests, making the scenario unstable
  • capture stdout/stderr of server_process, maybe saving it to a second file called llama-stdout.log @ngxson
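
Below is the sketch referenced in the asyncio/aiohttp item; it is illustrative only and assumes a server already listening on localhost:8080 with the /completion endpoint.

# Rough sketch for the asyncio/aiohttp TODO item. Assumes a llama.cpp server
# is already running on localhost:8080; the request payload is kept minimal.
import asyncio
import aiohttp

async def complete(session, prompt, n_predict=32):
    async with session.post('http://localhost:8080/completion',
                            json={'prompt': prompt, 'n_predict': n_predict}) as resp:
        resp.raise_for_status()
        return await resp.json()

async def run_concurrently(prompts):
    async with aiohttp.ClientSession() as session:
        # fire all completion requests at once and wait for every response
        return await asyncio.gather(*(complete(session, p) for p in prompts))

results = asyncio.run(run_concurrently([
    'Write a very long story about AI.',
    'Write another very long music lyrics.',
]))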

@ggerganov (Owner)

I'm quite out of my depth here, but if you can figure out a way to add server tests it would be awesome. I've sent you a collaborator invite.

@phymbert (Collaborator, Author)

@RafaAguilar @ngxson As you were part of the tests-related discussion, do you feel OK with the proposed approach here? If so, I will continue with asynchronous requests and the multi-user scenario.

@ngxson (Collaborator) commented Feb 18, 2024

Great idea, thanks for starting this PR. Some suggestions:

  1. Since the number of test cases is not very big, can we reduce the number of files? (so that future contributors can find things more easily)
  2. It would be nice to have a .sh script that runs server -m ... and python -m behave all at once. Again, this is to make it easier for future contributors. It can also be useful when we want to dockerize the test script in the future.
  3. Note to myself: maybe I can fine-tune the bloom 560m model for use in this test. The smallest usable GGUF that we can find on HF is currently tinyllama-2-1b-miniguanaco.Q2_K.gguf. I tried fine-tuning ahxt/LiteLlama-460M-1T but the result is unusable.

@Azeirah (Contributor) commented Feb 19, 2024

Excellent! I'll be reading this PR today and see if I can add a test or help out in some way :)

@Azeirah (Contributor) commented Feb 19, 2024

Great idea, thanks for starting this PR. Some suggestions:

  1. Since the number of test cases is not very big, can we reduce the number of files? (so that future contributors can find things more easily)
  2. It would be nice to have a .sh script that runs server -m ... and python -m behave all at once. Again, this is to make it easier for future contributors. It can also be useful when we want to dockerize the test script in the future.
  3. Note to myself: maybe I can fine-tune the bloom 560m model for use in this test. The smallest usable GGUF that we can find on HF is currently tinyllama-2-1b-miniguanaco.Q2_K.gguf. I tried fine-tuning ahxt/LiteLlama-460M-1T but the result is unusable.

Would it be possible to train a nonsensical 1M-parameter model? It should be really cheap and fast even on commodity hardware. These tests aren't meant to interact with the output in any meaningful way anyway.

I'm mentioning this because I see the trial run took 15 minutes for just two features with 3 scenarios each! Imagine the time needed to run 20-30 tests!

@ngxson (Collaborator) commented Feb 19, 2024

@Azeirah Yes, it's possible, but the problem is that these models never want to output the EOS token (to terminate the output). It's also possible to rely on n_predict to stop the generation after X tokens.
Another problem is that small models tend to output invalid bytes instead of words (because part of the llama vocab is bytes, which allows it to handle unicode). Maybe I need to limit the usable tokens in its vocab.
Anyway, I'll look into this this week; it's still a good exercise for me to train a model from scratch.
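
For illustration, capping generation with n_predict on the server's /completion endpoint could look like the sketch below (assuming a server running locally on port 8080):

# Sketch only: cap generation via n_predict since the tiny model never emits EOS.
# Assumes a llama.cpp server listening on localhost:8080.
import requests

resp = requests.post('http://localhost:8080/completion',
                     json={'prompt': 'Write a very long story about AI.',
                           'n_predict': 32})
resp.raise_for_status()
print(resp.json()['content'])  # at most 32 predicted tokens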

@Azeirah (Contributor) commented Feb 19, 2024

@Azeirah Yes, it's possible, but the problem is that these models never want to output the EOS token (to terminate the output). It's also possible to rely on n_predict to stop the generation after X tokens.
Another problem is that small models tend to output invalid bytes instead of words (because part of the llama vocab is bytes, which allows it to handle unicode). Maybe I need to limit the usable tokens in its vocab.
Anyway, I'll look into this this week; it's still a good exercise for me to train a model from scratch.

Fair enough, I think it should be doable to make a model that behaves well enough. Potentially it could be trained explicitly to bias towards EOS, haha. I agree it would be a fun exercise; unfortunately I have a 7900 XTX and I believe it cannot be used for training :(

In addition to that, we of course have no clue what kind of hardware these tests will be run on, but if it's a virtual core on a Xeon or something similar, maybe we can try compiling with OpenBLAS? I'm not sure whether it'd even be worth investigating, depending on the speedup and the variety of weird hardware you could get on GitHub Actions. No clue what kind of control you'd get there over the underlying (virtualised) hardware.


Other than that, I think it's fine that the tests are in separate files. It's just how behave is meant to be used: each feature is one file, and related scenarios belong to one feature. I'm somewhat familiar with BDD myself since I use a loosely inspired variant at work. Do you think BDD is unclear to some people? I could write a short readme explaining it.

@ngxson (Collaborator) commented Feb 19, 2024

Also, one case that I have never tested before is invalid unicode.

In my personal project (which uses llama.h), on receiving responses via llama_token_to_piece, I pass them to nlohmann/json to convert them to a JSON string. That's the same thing we're using in the server example. Upon receiving an incomplete unicode sequence, the JSON converter crashes the whole process.

It would be nice if someone could test whether this is the case for server.cpp (with stream=True, for example).
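
To make the failure mode concrete, here is a small Python illustration (the actual crash happens in the C++ JSON converter; this only shows why a partial multi-byte sequence cannot be decoded):

# Illustration only: the first byte of a two-byte UTF-8 character, taken alone,
# cannot be decoded, analogous to the incomplete sequence that breaks the JSON converter.
partial = 'é'.encode('utf-8')[:1]
try:
    partial.decode('utf-8')
except UnicodeDecodeError as err:
    print('incomplete unicode sequence:', err)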

@ngxson (Collaborator) commented Feb 19, 2024

@Azeirah I believe GitHub's hosted runners are Xeons with shared CPU cores. The performance is not meant to be consistent, though. I believe it cannot use anything better than AVX2.

For training, I'm using a GTX 1660 Ti. I initially purchased it for gaming two years ago, but who knew that I would now need more VRAM than that :'( Back then, the dealer offered me a 3080 Ti at a fairly good price, but I refused. Nowadays, for anything bigger than 1B, I need to rent a VPS on Google Cloud; it's more or less the same price as Colab notebooks, but more flexible and with persistent storage.

@phymbert (Collaborator, Author)

Great idea, thanks for starting this PR. Some suggestions:

  1. Since the number of test cases is not very big, can we reduce the number of files? (so that future contributors can find things more easily)
  2. It would be nice to have a .sh script that runs server -m ... and python -m behave all at once. Again, this is to make it easier for future contributors. It can also be useful when we want to dockerize the test script in the future.
  3. Note to myself: maybe I can fine-tune the bloom 560m model for use in this test. The smallest usable GGUF that we can find on HF is currently tinyllama-2-1b-miniguanaco.Q2_K.gguf. I tried fine-tuning ahxt/LiteLlama-460M-1T but the result is unusable.

Done 👍

@phymbert (Collaborator, Author)

@ggerganov @ngxson Any idea how to improve the prompt eval time on the GitHub runners? Should we give OpenBLAS a try?

@ngxson (Collaborator) commented Feb 19, 2024

@phymbert Can you try this model instead? (pay attention to setting n_predict or max_tokens, because the model never outputs the EOS token)

https://huggingface.co/ngxson/dummy-llama/blob/main/llama_xs_q4.bin

I have no idea if OpenBLAS will help. You can try if you want.

@Azeirah I tried to overfit an 86M model, but unfortunately it does not seem to output any of the examples. On the bright side, it outputs mostly text (not invalid bytes, as I said earlier), so it is still usable for the test. The Q4_K_M size is only 57 MB.

@phymbert phymbert marked this pull request as draft February 19, 2024 22:02
@phymbert (Collaborator, Author)

@phymbert Can you try this model instead? (pay attention to setting n_predict or max_tokens, because the model never outputs the EOS token)

https://huggingface.co/ngxson/dummy-llama/blob/main/llama_xs_q4.bin

I have no idea if OpenBLAS will help. You can try if you want.

@Azeirah I tried to overfit an 86M model, but unfortunately it does not seem to output any of the examples. On the bright side, it outputs mostly text (not invalid bytes, as I said earlier), so it is still usable for the test. The Q4_K_M size is only 57 MB.

Nice, thanks. It took 0m0.481s. Note that I have also reduced the KV cache size.

@ggerganov (Owner)

@ggerganov @ngxson Any idea how to improve the prompt eval time on the GitHub runners? Should we give OpenBLAS a try?

@phymbert

The best way to improve the speed is to use as small a model as possible. You can try @karpathy's tinyllamas: https://huggingface.co/karpathy/tinyllamas

Here are instructions for converting to GGUF and using in llama.cpp:

https://github.com/ggerganov/llama.cpp/tree/master/examples/convert-llama2c-to-ggml

For convenience, I've uploaded the smallest 260K model (~1 MB) in GGUF format here:

https://huggingface.co/ggml-org/models/blob/main/tinyllamas/stories260K.gguf

Example:

# get the model
wget https://huggingface.co/ggml-org/models/resolve/main/tinyllamas/stories260K.gguf

# run sample inference
./main -m ./stories260K.gguf -p "One day, Lily met" -n 128 -c 256

 One day, Lily met a boy named Timmy. Tim was very happy to help her mommy. He wanted to play with the ball all day. Suddenly, something unexpected happened. A little girl came over and saw a big tree. She was very sad.
Timmy wanted to play with the ball. He thought it would be fun! When he reached up, he found it st

llama_print_timings:        load time =      80.26 ms
llama_print_timings:      sample time =       1.70 ms /   128 runs   (    0.01 ms per token, 75427.22 tokens per second)
llama_print_timings: prompt eval time =       3.06 ms /     7 tokens (    0.44 ms per token,  2288.33 tokens per second)
llama_print_timings:        eval time =     134.04 ms /   127 runs   (    1.06 ms per token,   947.46 tokens per second)
llama_print_timings:       total time =     142.59 ms /   134 tokens

This should be ideal for CI

@ggerganov (Owner)

Btw, one thing that would greatly improve the state of the server in terms of debugging issues is to add detailed logs. Things like incoming requests, parameters, batch info, etc. As much information as possible should be dumped in the log file. There is some info currently saved in llama.log, but there should be more.

This probably needs a separate PR to avoid this change becoming too big, but I thought I would mention it in case you are interested in further helping out with maintenance.

@Azeirah (Contributor) commented Feb 23, 2024

I will review this fully tomorrow, I'm a bit sick but I have energy when I plan it out.

@phymbert (Collaborator, Author)

Wow! Very nice work - this would be very useful and should help to improve server significantly

multi users with total number of tokens to predict exceeds the KV Cache size, fixed to be confirmed: #3969

What was the fix?

No fix was applied, actually; IMHO it's a wrong usage of the server when neither --n_predict nor "max_tokens" is set. If you provide n_predict|max_tokens in the request, the server behaves well. I have updated the PR description as it was confusing. But IMHO, a server should never loop infinitely.

I have also added a wrong_usages.feature file to trace and reproduce this kind of user issue.

@ggerganov Regarding #5655, I have reproduced it in issues.feature, to run it:
DEBUG=ON ./tests.sh --no-skipped --tags bug

It can be investigated/fixed in another PR.

Thanks for the review. I will give concurrent streaming requests with aiohttp one last chance, then merge this first version.

@phymbert (Collaborator, Author)

I will review this fully tomorrow, I'm a bit sick but I have energy when I plan it out.

@Azeirah No worries, take care, it can wait for tomorrow 👍

@phymbert added the testing and server/webui labels on Feb 23, 2024
@Azeirah (Contributor) commented Feb 23, 2024

Wow! Very nice work - this would be very useful and should help to improve server significantly

multi users with total number of tokens to predict exceeds the KV Cache size, fixed to be confirmed: #3969

What was the fix?

No fix was applied, actually; IMHO it's a wrong usage of the server when neither --n_predict nor "max_tokens" is set. If you provide n_predict|max_tokens in the request, the server behaves well. I have updated the PR description as it was confusing. But IMHO, a server should never loop infinitely.

I have also added a wrong_usages.feature file to trace and reproduce this kind of user issue.

@ggerganov Regarding #5655, I have reproduced it in issues.feature, to run it: DEBUG=ON ./tests.sh --no-skipped --tags bug

It can be investigated/fixed in another PR.

Thanks for the review. I will give concurrent streaming requests with aiohttp one last chance, then merge this first version.

In case the server is started with undesirable parameters, we should either abort or at the very least offer a clear warning with a suggested solution. Is that the case now?

I try to focus a lot on usability for end users.

@Azeirah (Contributor) left a comment

I think it's excellent! I ran it 5 times to check for flakes as well, especially given the warning in the readme and since I have a capable PC: 0 flakes.

I did a quick look at all the code; I didn't read deep into the implementation but did look at the tests. I think it's a fantastic set-up for server tests! :)
If there's a reason to add any tests later, this set-up will be very easy to extend, and it runs very stably, so it's ready for merging if you ask me.

I only had a tiny comment about the readme again, but it's very minor.

examples/server/tests/README.md (outdated, resolved)
examples/server/tests/README.md (resolved)
examples/server/tests/features/issues.feature (resolved)
@phymbert phymbert merged commit 525213d into ggerganov:master Feb 24, 2024
109 checks passed
@phymbert phymbert deleted the test/server-add-ci-test branch February 24, 2024 11:29
@phymbert (Collaborator, Author)

Btw, one thing that would greatly improve the state of the server in terms of debugging issues is to add detailed logs. Things like incoming requests, parameters, batch info, etc. As much information as possible should be dumped in the log file. There is some info currently saved in llama.log, but there should be more.

This probably needs a separate PR to avoid this change becoming too big, but I thought I would mention it in case you are interested in further helping out with maintenance.

On it, especially in update_slots, as it is a nightmare to understand what's going on.

jordankanter pushed a commit to jordankanter/llama.cpp that referenced this pull request Mar 13, 2024
* server: tests: init scenarios
 - health and slots endpoints
 - completion endpoint
 - OAI compatible chat completion requests w/ and without streaming
 - completion multi users scenario
 - multi users scenario on OAI compatible endpoint with streaming
 - multi users with total number of tokens to predict exceeds the KV Cache size
 - server wrong usage scenario, like in Infinite loop of "context shift" ggerganov#3969
 - slots shifting
 - continuous batching
 - embeddings endpoint
 - multi users embedding endpoint: Segmentation fault ggerganov#5655
 - OpenAI-compatible embeddings API
 - tokenize endpoint
 - CORS and api key scenario

* server: CI GitHub workflow


---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
hodlen pushed a commit to hodlen/llama.cpp that referenced this pull request Apr 1, 2024
* server: tests: init scenarios
 - health and slots endpoints
 - completion endpoint
 - OAI compatible chat completion requests w/ and without streaming
 - completion multi users scenario
 - multi users scenario on OAI compatible endpoint with streaming
 - multi users with total number of tokens to predict exceeds the KV Cache size
 - server wrong usage scenario, like in Infinite loop of "context shift" ggerganov#3969
 - slots shifting
 - continuous batching
 - embeddings endpoint
 - multi users embedding endpoint: Segmentation fault ggerganov#5655
 - OpenAI-compatible embeddings API
 - tokenize endpoint
 - CORS and api key scenario

* server: CI GitHub workflow


---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
Labels: server/webui, testing
Projects: none
5 participants