
dummy output for fastchat inference #688

Closed
leiwen83 opened this issue Nov 1, 2023 · 17 comments

@leiwen83 (Contributor) commented Nov 1, 2023

Hi @wsxiaoys ,

I am currently investigating the slow response issue reported by @itlackey in #421 (comment).

I tried StarCoder-1B with the fastchat interface, and here is a summary of my findings:

  1. fastchat serving itself is fast enough, and there is actually no API flooding of the kind @itlackey mentioned. But it is true that responses are currently very slow.
  2. The root cause is that fastchat generates many unnecessary tokens, which are then discarded by the tabby server.
        // I used the tabby source code itself for the completion test.
        let state = completions::CompletionState::new

       // The completion result returned from fastchat is:
       '(\n            engine.clone(),\n            prompt_template,\n            args.model.clone(),\n            args.model.clone(),\n            args.model.clone(),\n            args.model.clone(),\n            args.model.clone(),\n            args.model.clone(),\n            args.model.clone(),\n            args.model.clone(),\n            args.model.clone(),\n            args.model.clone(),\n            args.model.clone(),\n            args.model.clone(),\n            args.model.clone(),\n            args.model.clone(),\n            args.model.clone(),\n            args.model.clone(),\n            args.model.', logprobs=None, finish_reason='length'

      // tabby chooses to keep only:
(
            engine.clone(),
            prompt_template,
            args.model.clone(),

So nearly 600 characters are generated, which takes 1.8s on a 3090, while only about 100 are useful. If we could make it stop right at those 100 characters, the response time would be reduced to roughly 1/6, which would make it fast enough.

There may be two approaches to achieving this, and I'd like to hear your suggestion.

  1. Let the tabby server explicitly tell fastchat some stop words, maybe the matching closing bracket?
  2. Currently in fastchat.rs I call the fastchat API in a blocking way, i.e. I expect to get the full response when the call returns. Maybe a streaming call is preferable for the code-completion case, so that the tabby server could cancel further generation after the first 100 characters? A rough sketch follows.
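
    To illustrate option 2, here is a minimal sketch (not the current fastchat.rs code) of streaming from an OpenAI-compatible /v1/completions endpoint and stopping early. The endpoint URL and model name are placeholders, and the SSE handling is deliberately simplified; a real implementation would parse the "data: {...}" frames.

        // Sketch only: stream tokens from an OpenAI-compatible /v1/completions
        // endpoint (as exposed by fastchat) and stop once enough text arrived.
        // Assumes reqwest with the "stream" and "json" features plus futures-util.
        use futures_util::StreamExt;

        async fn complete_streaming(prompt: &str, max_chars: usize) -> anyhow::Result<String> {
            let client = reqwest::Client::new();
            let resp = client
                .post("http://localhost:8000/v1/completions") // assumed fastchat address
                .json(&serde_json::json!({
                    "model": "starcoder-1b", // assumed model name
                    "prompt": prompt,
                    "max_tokens": 128,
                    "stream": true
                }))
                .send()
                .await?;

            let mut stream = resp.bytes_stream();
            let mut text = String::new();
            while let Some(chunk) = stream.next().await {
                // A real implementation would parse the SSE "data: {...}" frames
                // and extract choices[0].text; appending raw bytes is a stand-in.
                text.push_str(&String::from_utf8_lossy(&chunk?));
                if text.len() >= max_chars {
                    // Returning drops the response body, which should abort
                    // further generation on the server side.
                    break;
                }
            }
            Ok(text)
        }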

Thx~

@wsxiaoys (Member) commented Nov 1, 2023

Two things you might consider verifying:

  1. Is there any reason the fastchat server doesn't obey the max_tokens output length?
    https://github.com/TabbyML/tabby/blob/main/crates/http-api-bindings/src/fastchat.rs#L65C9-L65C9
    .max_decoding_length(128)

For /v1/completions, it is set to 128 by default.

  2. If fastchat supports stop words, you might pass per-language stop words to the fastchat server (a sketch follows below).
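
    For reference, if the fastchat server honors the OpenAI-style stop parameter, the request body built in fastchat.rs could carry per-language stop words roughly like this. This is only a sketch: the model name and the chosen stop words are placeholders, and whether fastchat actually applies stop is exactly what needs verifying.

        // Hypothetical request body; whether fastchat honors `stop` must be verified.
        let body = serde_json::json!({
            "model": "starcoder-1b",   // assumed model name
            "prompt": prompt,
            "max_tokens": 128,
            // e.g. stop at a blank line or a closing brace for Rust-like languages
            "stop": ["\n\n", "\n}"]
        });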

@leiwen83 (Contributor, Author) commented Nov 2, 2023

For 1, I checked max_tokens; it is correctly passed to fastchat, and fastchat does stop at 128 tokens.

For the output I mentioned, although it is over 600 characters, it is actually only 128 tokens.

       '(\n            engine.clone(),\n            prompt_template,\n            args.model.clone(),\n            args.model.clone(),\n            args.model.clone(),\n            args.model.clone(),\n            args.model.clone(),\n            args.model.clone(),\n            args.model.clone(),\n            args.model.clone(),\n            args.model.clone(),\n            args.model.clone(),\n            args.model.clone(),\n            args.model.clone(),\n            args.model.clone(),\n            args.model.clone(),\n            args.model.clone(),\n            args.model.clone(),\n            args.model.', logprobs=None, finish_reason='length'

However, from the plugin log I see that what the plugin received is much shorter than what completions.rs sent out.

INFO - #com.tabbyml.intellijtabby.agent.Agent - Parsed agent output:        '(\n            engine.clone(),\n            prompt_template,\n            args.model.clone(),\n    '

So where does the shortening of the completion text take place?

For 2, fastchat supports stop words, but I am not sure whether that would truly help... In this case there seems to be no valid stop word that could be used.

@leiwen83 (Contributor, Author) commented Nov 3, 2023

I root-caused the issue: http-api-bindings currently doesn't read the prompt template from the model's tabby.json, which makes the output far from what we need. #696 fixes it.
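
    For context, this is roughly how a StarCoder-style FIM template from tabby.json gets applied to the prefix/suffix pair. A minimal sketch only; the template string and variable names are illustrative rather than the exact code in #696.

        // Sketch: build the FIM prompt from the template in the model's tabby.json.
        // The template shown is the usual StarCoder-style one; the actual file may differ.
        let prompt_template = "<fim_prefix>{prefix}<fim_suffix>{suffix}<fim_middle>";
        let prompt = prompt_template
            .replace("{prefix}", prefix)
            .replace("{suffix}", suffix);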

@wsxiaoys (Member) commented Nov 3, 2023

I'm glad you've identified the issue, but I don't believe that using #696 is the correct approach to address it.

You can find the relevant code here: https://github.com/TabbyML/tabby/blob/main/crates/http-api-bindings/src/fastchat.rs#L64

In this code, we already have tokens[0] as the prefix and tokens[1] as the suffix. The best course of action would be to simply pass suffix into the suffix field of fastapi and let fastapi handle the prompt template.
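
    Schematically, that suggestion would mean a request along these lines. Illustrative only: the model name is a placeholder, and whether the fastchat backend actually consumes an OpenAI-style suffix field is precisely what is in question here.

        // Sketch: pass prefix and suffix separately instead of assembling the
        // FIM prompt in tabby; the backend would then own the prompt template.
        let body = serde_json::json!({
            "model": "starcoder-1b",   // assumed model name
            "prompt": tokens[0],       // prefix
            "suffix": tokens[1],       // suffix
            "max_tokens": 128
        });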

@leiwen83 (Contributor, Author) commented Nov 3, 2023

Fastchat doesn't recognize prefix/suffix. It only accepts the incoming prompt and passes it as-is to the next level, such as vllm, which does the real inference. So passing tokens[0] here would make the inference result differ from what is produced with ggml, which uses the prompt template from tabby.json.

@wsxiaoys (Member) commented Nov 3, 2023

I'll suggest patching fastapi to achieve that

@leiwen83 (Contributor, Author) commented Nov 4, 2023

I'll suggest patching fastapi to achieve that

You mean adding a patch on the fastchat side? But that is somewhat different logic from how we currently handle ggml.
In the ggml case, tabby already assembles the FIM prompt before calling inference. Maybe we should put ggml and http-binding at the same level?

@leiwen83 (Contributor, Author) commented Nov 5, 2023

@wsxiaoys, shall we consider unifying the current prompt generation across ggml/http-binding/ctranslate, so that they share a common interface?
I see great demand for http-binding: LLM models are developing very fast, and we may need to change the serving backend from time to time to meet needs such as performance, serving at scale, or simply enabling new models quickly.
Making the prompt-handling logic the same across serving backends would be very helpful for identifying issues and for long-term support, and also for comparing different solutions.

And once the logic is unified, other serving backends could easily be integrated, like gRPC? For large-scale deployments and for performance, an HTTP interface may still carry some cost compared with gRPC.

@leiwen83 (Contributor, Author) commented Nov 6, 2023

@wsxiaoys, I tested several new cases and found that even with <fim_prefix>, <fim_suffix>, and <fim_middle> set correctly, it would still generate dummy output. So shall we consider adding an option to use "\n" as the stop word?

@wsxiaoys (Member) commented Nov 6, 2023

As mentioned earlier, given that experimental-http has the ability to connect to any HTTP API endpoint, I recommend implementing the FastChat-specific logic in an external HTTP server.

@leiwen83 (Contributor, Author) commented Nov 6, 2023

As mentioned earlier, given that experimental-http has the ability to connect to any HTTP API endpoint, I recommend implementing the FastChat-specific logic in an external HTTP server.

Fastchat itself is not a good place for such logic. In our local practice, we use fastchat as the control plane for model serving, with vllm as the data-plane backend. That means fastchat only does its job of serving an OpenAI-like interface; it doesn't change the prompt but passes it as-is to vllm.

So do you mean having some other logic sit between the tabby server and fastchat? That seems like quite a burden for code completion. If this logic is kept only inside fastchat.rs, without changing the current experimental-http logic, would that be more acceptable?

@wsxiaoys (Member) commented Nov 6, 2023

I still prefer the implementation in Tabby's repository to be something generic rather than specific to a vendor (in this case, if you want to customize stop words). Therefore, I suggest either attempting to resolve the issue on FastChat or routing requests through an intermediate API server.
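
    As an illustration of the "intermediate API server" route, a small proxy could inject the vendor-specific stop words before forwarding to fastchat. This is a sketch under stated assumptions: axum 0.7, reqwest, and tokio; the listening port, the fastchat address, and the choice of stop words are all placeholders.

        // Sketch: proxy /v1/completions, add stop words, forward to fastchat.
        use axum::{routing::post, Json, Router};
        use serde_json::Value;

        async fn completions(Json(mut body): Json<Value>) -> Json<Value> {
            // Attach stop words that the upstream tabby request did not set.
            body["stop"] = serde_json::json!(["\n\n"]);
            let resp = reqwest::Client::new()
                .post("http://localhost:8000/v1/completions") // assumed fastchat address
                .json(&body)
                .send()
                .await
                .expect("forwarding to fastchat failed")
                .json::<Value>()
                .await
                .expect("invalid JSON from fastchat");
            Json(resp)
        }

        #[tokio::main]
        async fn main() {
            let app = Router::new().route("/v1/completions", post(completions));
            let listener = tokio::net::TcpListener::bind("0.0.0.0:9000").await.unwrap();
            axum::serve(listener, app).await.unwrap();
        }

    Tabby's experimental-http device would then point at this proxy instead of fastchat directly, keeping the vendor-specific behavior out of the Tabby repository.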

@leiwen83 (Contributor, Author) commented Nov 6, 2023

I see. So I will keep the change on my local side. But do you have any plan to fix this dummy output in general?

@wsxiaoys (Member) commented Nov 6, 2023

No, experimental features are not actively being considered for maintenance by the team. I also plan to remove the experimental-http device from the default feature set and retain it solely as a reference implementation

@leiwen83 (Contributor, Author) commented Nov 6, 2023

No, experimental features are not actively being considered for maintenance by the team. I also plan to remove the experimental-http device from the default feature set and retain it solely as a reference implementation

Sorry for the confusion; the dummy output I mentioned, with FIM in place, also exists with ggml serving. That is why I asked whether it could be fixed.

@wsxiaoys (Member) commented Nov 6, 2023

I don't see that from the discussions in this thread. Feel free to file a bug if you find a supported device (gpu, cuda or metal) is not functioning in any scenario. Thank you!

@leiwen83 (Contributor, Author) commented Nov 6, 2023

I see. I will try to put together a test case that can be made public, and then open a separate thread for that.

Thx~

@leiwen83 closed this as completed Nov 6, 2023