
server : parallel decoding and multimodal #3589

Closed
FSSRepo wants to merge 41 commits

Conversation

FSSRepo (Collaborator) commented Oct 11, 2023

Based on the server I've already created, I'm working on this PR. There are commented-out sections to save myself from unnecessary headaches. I haven't even tested the results of this PR yet.

Here's the plan for a complete implementation:

  • Make completion work in streaming mode.
  • Fix non-streaming mode. In progress...
  • Fix the embedding endpoint. Inherent issue with the CUDA backend: "Bug: Invalid Embeddings if GPU offloaded (CUDA)" #3625
  • Test grammars and some completion options.
  • Fix the infill function and test it; it would make my life easier if someone could provide a way to test that endpoint.
  • Conduct completion and chat tests, with and without the UI.
  • Fix the context reset mode. I need to understand the approach here, since slots share the entire assigned context, so one slot could consume most of it and leave the others with none.
  • Refactor the code by removing duplication and unnecessary comments, and prepare it for merging into the master branch.
  • Beam search is fixed but not tested; if any issues come up with it, I will take care of them.

cebtenzzre marked this pull request as draft on October 11, 2023 22:23
FSSRepo changed the title from "server: parallel decoding implementation" to "server : parallel decoding implementation" on Oct 11, 2023
ggerganov (Owner):

You should rebase and adopt the latest sampling changes from: #3543

FSSRepo (Collaborator, Author) commented Oct 12, 2023

@KerfuffleV2 With the new way of working with samplers, if each slot has different llama_sampling_params, then I would have to create a llama_sampling_context every time I want to perform a completion.

It would be ideal if there were a function llama_set_sampling_params(llama_sampling_context ctx, llama_sampling_params params) to change the parameters without having to create the context every time.

llama_sampling_context llama_sampling_context_init(
        const struct gpt_params & params,
                  llama_grammar * grammar) {
  llama_sampling_context result;

  result.params = params.sampling_params;
  result.grammar = grammar;
  return result;
}

To reduce headaches, it should be:

llama_sampling_context llama_sampling_context_init(
        const struct llama_sampling_params & params,
                  llama_grammar * grammar) {
  llama_sampling_context result;

  result.params = params;
  result.grammar = grammar;
  return result;
}

mmmmmmm
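For illustration, the setter suggested above could be as small as the following sketch (a hypothetical helper, not part of the actual sampling API; pass-by-reference is used here instead of the pass-by-value signature mentioned above):

// Hypothetical helper, as suggested above (not part of the current API).
// Assumes the llama_sampling_context / llama_sampling_params types from the
// sampling refactor (#3543); it simply swaps the parameters in place so the
// context does not have to be recreated for every completion.
static void llama_set_sampling_params(
              llama_sampling_context & ctx,
        const llama_sampling_params  & params) {
    ctx.params = params;
}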

FSSRepo (Collaborator, Author) commented Oct 12, 2023

The completion function is working fine; however, for some reason, the server crashes when it finishes. I haven't been able to identify the cause.

KerfuffleV2 (Collaborator):

mmmmmmm

Yes, I agree with that. Looks like there are also likely more changes coming: #3601

for some reason, the server crashes when it finishes. I haven't been able to identify the cause.

Maybe try compiling with address sanitization turned on? (Support was recently added to normal Makefile builds as well.)

FSSRepo (Collaborator, Author) commented Oct 12, 2023

Maybe try compiling with address sanitization turned on? (Support was recently added to normal Makefile builds as well.)

It was a null pointer; I was accessing, from inside a lambda, an object defined outside of it, which always trips me up because the compiler never raises any error in that case.
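To illustrate the kind of pitfall described here (a generic sketch, not the actual server code): a lambda can capture a pointer whose target is gone by the time the callback runs, and the compiler will not complain.

#include <functional>
#include <iostream>
#include <memory>

// Generic sketch of the pitfall: the lambda captures a raw pointer by value,
// the owning object is destroyed, and the compiler cannot warn about the
// dangling access when the callback is finally invoked.
std::function<void()> make_callback() {
    auto obj = std::make_unique<int>(42);
    int * raw = obj.get();
    return [raw]() { std::cout << *raw << "\n"; };  // 'obj' dies when we return
}

int main() {
    auto cb = make_callback();  // the pointed-to int is already freed here
    cb();                       // undefined behavior: reads freed memory
    return 0;
}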

FSSRepo (Collaborator, Author) commented Oct 13, 2023

@ggerganov I need to know what to do when the context gets crowded, given that the server is used by multiple clients who share the entire context window. I assume the context swap should only occur when a single slot is fully utilizing the context, or that there should be a context limit per slot.

When I use the embeddings, it doesn't return the same values for some reason. I've already checked the embedding pipeline to compare it with the server implementation, but I'm not getting good results.

FSSRepo (Collaborator, Author) commented Oct 13, 2023

Am I wrong?

Content:

{
  "content": "Which is your dog?"
}

PR server:

{
"embedding": [
    0.0013885498046875,
    0.02568817138671875,
    -0.02291107177734375,
    0.0013885498046875,
    0.0013885498046875,
    -0.02291107177734375,
    0.0013885498046875,
    0.0013885498046875,
    0.0013885498046875,
    -0.02291107177734375,
    -0.02291107177734375,
    -0.02291107177734375,
    0.0013885498046875,
    0.0013885498046875,
]
}

master server:

{
  "embedding": [
    0.0013885498046875,
    0.02568817138671875,
    -0.02291107177734375,
    0.0013885498046875,
    0.0013885498046875,
    -0.02291107177734375,
    0.0013885498046875,
    0.0013885498046875,
    0.0013885498046875,
    -0.02291107177734375,
    -0.02291107177734375,
    -0.02291107177734375,
    0.0013885498046875,
    0.0013885498046875,
]
}

Embedding example:

2.071708 0.094213 -0.879323 -0.868794 -1.877190 0.149166 0.456107 -1.180170 -1.899972 0.324895 0.137798 1.375898 -0.929520 -0.724631 -3.895488 0.835485 2.534587 -1.334827 -1.068944 -1.522295 -2.574473 -1.089735 0.195261 0.207192 0.607332 -0.160022 -2.255088 0.348395 -0.843356 0.270982 -1.236908 -0.836289 -1.600230 -0.213995 -0.724451 1.241112 0.571537 -1.612993 -0.924750 -1.141107 0.301533 0.476756 0.925662 -0.778871 0.644059 -2.298353 -1.250245 0.888944 1.218905 1.417424 -1.571156 0.380106 1.282110 1.155357 0.254458 -0.391138 4.462999 -2.220449 -0.675895 2.643535 -0.059323 -0.518516 -1.783109 1.390705 1.135090 2.257272 ....

Edit:

This is a problem when I offload the model to CUDA; related issue: #3625.
Fixed.

FSSRepo (Collaborator, Author) commented Oct 13, 2023

@ggerganov @monatis The server example now supports the LLaVA model.

[attached test image]

Response:

[Screenshot 2023-10-13 184159]

FSSRepo (Collaborator, Author) commented Oct 13, 2023

[attached input image]

Params:

{
  "stream": false,
  "n_predict": -1,
  "temperature": 0.1,
  "prompt": "A chat between a curious human and an artificial intelligence assistant.  The assistant gives helpful, detailed, and polite answers to the human's questions.\nUSER: [(img)]describe the image in detail.\nASSISTANT:",
  "image_data": "image data in base64",
  "stop": ["</s>"]
}

Response:

The image features a desk with two computer monitors on it. One of the monitors has an unusual decoration: a small toy lamb figurine is sitting on top of it, adding a playful touch to the workspace. A keyboard and mouse are also present on the desk, indicating that this is likely a computer setup.

In addition to the main desk area, there's another smaller desk visible in the background. The room appears to be well-equipped for work or study purposes, with various items such as books and a TV placed around the space.

monatis (Collaborator) commented Oct 14, 2023

{
  "stream": false,
  "n_predict": -1,
  "temperature": 0.1,
  "prompt": "A chat between a curious human and an artificial intelligence assistant.  The assistant gives helpful, detailed, and polite answers to the human's questions.\nUSER: [(img)]describe the image in detail.\nASSISTANT:",
  "image_data": "image data in base64",
  "stop": ["</s>"]
}

Awesome work! Thanks @FSSRepo for taking this on.

FSSRepo (Collaborator, Author) commented Oct 16, 2023

@Seikho Can you test now?

cebtenzzre changed the title from "server : parallel decoding implementation" to "server : parallel decoding and multimodal" on Oct 16, 2023
Seikho commented Oct 17, 2023

The stopping string issue is all good in commit d7eca25; however, I can't start the server on commit 4d18043 (your most recent commit).

FSSRepo (Collaborator, Author) commented Oct 17, 2023

The stopping string issue is all good in commit d7eca25; however, I can't start the server on commit 4d18043 (your most recent commit).

Is there a visible error, or just a silent stop? Send the output and the parameters you are giving to the server.

Seikho commented Oct 17, 2023

At the time it was a silent failure, although I didn't run with -v.
Now that I've switched back to that commit, recompiled, and tried again, it's running fine. Sorry for the wild goose chase.

ggerganov (Owner):

I'll do a review tomorrow. First I will merge #3624 and then push changes to this PR to adapt it to the new sampling API.

I need to know what to do when the context gets crowded, given that the server is used by multiple clients who share the entire context window. I assume the context swap should only occur when a single slot is fully utilizing the context, or that there should be a context limit per slot.

There should be a max context per client (n_ctx_per_slot), and the swap should be performed per client.
The total context should be n_system + n_ctx_per_slot*n_slot, where n_system is the number of tokens in the common system prompt.
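As a rough sketch of that budgeting (illustrative values and names, not the actual server variables):

#include <cstdio>

int main() {
    // Illustrative numbers only (assumptions, not values from this PR):
    const int n_system       = 128;  // tokens of the shared system prompt
    const int n_ctx_per_slot = 512;  // max context window per client/slot
    const int n_slot         = 4;    // number of parallel slots

    // Total context to allocate, as described above:
    const int n_ctx_total = n_system + n_ctx_per_slot*n_slot;

    printf("n_ctx_total = %d tokens\n", n_ctx_total);  // 128 + 512*4 = 2176
    return 0;
}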

When I use the embeddings, it doesn't return the same values for some reason. I've already checked the embedding pipeline to compare it with the server implementation, but I'm not getting good results.

Might be fixed via #3657

Do I leave the multimodal support in the server enabled by default, or do I make it an option in CMake?

Enable by default

FSSRepo (Collaborator, Author) commented Oct 17, 2023

There should be a max context per client (n_ctx_per_slot), and the swap should be performed per client. The total context should be n_system + n_ctx_per_slot*n_slot, where n_system is the number of tokens in the common system prompt.

I'm somewhat confused about this. Is n_ctx_per_slot a parameter for each slot or a global one? The per-slot context should not exceed the model's n_ctx: a user might assign 512 to one slot and 768 to another, but if the configured context is only 1024, the last 256 tokens cannot be allocated.

Also, I have no idea how the seed parameter would fit into parallel decoding.

FSSRepo (Collaborator, Author) commented Oct 18, 2023

@ggerganov Are you reviewing this now? I will try to adapt to the new changes in the sampling API.

ggerganov (Owner):

I didn't find the time today, but tomorrow this will be the first thing I do. Until then, you can make changes if you want to. I'll mark the PR when I start reviewing.

ggerganov self-requested a review on October 19, 2023 09:09
ggerganov marked this pull request as ready for review on October 19, 2023 09:09
ggerganov (Owner):

@FSSRepo I can't push into your branch:

$ git push FSSRepo HEAD:master
error: Authentication error: Authentication required: You must have push access to verify locks
error: failed to push some refs to 'https://github.com/FSSRepo/llama.cpp'

Let's continue in #3677; I've sent you a collaborator invite.
