dummy output for fastchat inference #688
Comments
Two things you might consider verifying:
1. Whether `max_tokens` is correctly passed into FastChat (and what it is set to).
2. Whether stop words could be used to end the generation earlier.
For 1, I checked `max_tokens`: it is correctly passed into FastChat, and FastChat does stop at 128 tokens. The output I mentioned is over 600 characters, but it is in fact only 128 tokens.
However, from the plugin log I see that what the plugin received is much shorter than what completions.rs sent out. So where does the completion text get shortened?
For 2, FastChat does support stop words, but I am not sure they would truly help, since in this case there seems to be no valid stop word that could be used.
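For reference, a minimal sketch (not Tabby code) of how `max_tokens` and stop sequences are passed to a FastChat OpenAI-compatible completion endpoint; the URL, model name, and parameter values are assumptions for illustration:

```python
# Sketch: query a FastChat OpenAI-compatible completion endpoint with an
# explicit token budget and candidate stop sequences.
import requests

FASTCHAT_URL = "http://localhost:8000/v1/completions"  # hypothetical endpoint

payload = {
    "model": "starcoder-1b",     # hypothetical model name registered in FastChat
    "prompt": "def fib(n):",
    "max_tokens": 128,           # FastChat counts tokens here, not characters
    "temperature": 0.2,
    "stop": ["\n\n"],            # candidate stop sequences; usefulness depends on the model
}

resp = requests.post(FASTCHAT_URL, json=payload, timeout=30)
resp.raise_for_status()
print(resp.json()["choices"][0]["text"])
```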
I root-caused the issue: http-api-bindings currently doesn't read the prompt template from the model's tabby.json, which makes the output far from what we need. #696 fixed it.
I'm glad you've identified the issue, but I don't believe that #696 is the correct approach to address it. You can find the relevant code here: https://github.com/TabbyML/tabby/blob/main/crates/http-api-bindings/src/fastchat.rs#L64. In this code, we already have
FastChat doesn't recognize prefix/suffix. It only passes the incoming prompt as-is to the next level (e.g. vLLM) to do the real inference. So tokens[0] here would make the inference result differ from what ggml produces, which uses the prompt template from tabby.json.
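To make the prompt-template point concrete, here is a hedged sketch of the rendering being discussed: if the serving side only forwards the raw prompt, the FIM template from the model's tabby.json has to be applied before the request is sent. The template string follows the StarCoder-style FIM tokens mentioned later in this thread; the function and its defaults are illustrative assumptions, not Tabby's actual implementation:

```python
# Sketch: apply a FIM-style prompt template (as a model's tabby.json might
# define it) instead of sending only the raw prefix.

def render_fim_prompt(
    prefix: str,
    suffix: str,
    template: str = "<fim_prefix>{prefix}<fim_suffix>{suffix}<fim_middle>",
) -> str:
    """Fill a FIM template with the code before and after the cursor."""
    return template.format(prefix=prefix, suffix=suffix)

# Sending only the prefix (roughly what a bare tokens[0] would be) loses the
# FIM structure; rendering the template restores it:
prefix = "def add(a, b):\n    return "
suffix = "\n"
print(render_fim_prompt(prefix, suffix))
```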
I'd suggest patching FastChat to achieve that.
You mean adding a patch on the FastChat side? But that is somewhat different handling logic from how we currently deal with ggml.
@wsxiaoys, shall we consider unifying the current prompt generation across ggml/http-binding/ctranslate so that they share a common interface? Once the logic is unified, other serving backends could be integrated easily, e.g. gRPC. For massive deployments and performance, an HTTP interface may still carry some cost compared with gRPC.
@wsxiaoys, I tested several new cases and found that even with <fim_prefix>/<fim_suffix>/<fim_middle> set correctly, it still generates dummy output. So shall we consider adding an option to use "\n" as the stop word?
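A hedged sketch of the "\n" stop-word idea: either ask the backend to stop at the first newline, or truncate the returned completion on the client side if the backend ignores the stop sequence. The request shape and values are assumptions for illustration:

```python
# Sketch: stop single-line completions at the first newline.

def truncate_at_newline(completion: str) -> str:
    """Keep only the first line of a multi-line completion."""
    return completion.split("\n", 1)[0]

request_body = {
    "prompt": "<fim_prefix>def add(a, b):\n    return <fim_suffix>\n<fim_middle>",
    "max_tokens": 128,
    "stop": ["\n"],   # ask the server to stop at the first newline
}

# Client-side fallback if the server does not honor the stop sequence:
completion = "a + b\n# unrelated dummy text the model keeps generating"
print(truncate_at_newline(completion))   # -> "a + b"
```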
As mentioned earlier, given that experimental-http can connect to any HTTP API endpoint, I recommend implementing the FastChat-specific logic in an external HTTP server.
FastChat itself is not a good place for such logic. In our local practice, we use FastChat as the control plane for model serving, with vLLM as the data-plane backend: FastChat only does its serving job (the OpenAI-style interface) and does not change the prompt; it passes it as-is to vLLM. So do you mean having some other logic sit between the Tabby server and FastChat? That seems like quite a burden for code completion. If this logic were kept only inside fastchat.rs, without changing the current experimental-http logic, would that be more acceptable?
I still prefer the implementation in Tabby's repository to be generic rather than vendor-specific (in this case, customizing stop words). Therefore, I suggest either attempting to resolve the issue in FastChat or routing requests through an intermediate API server.
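As an illustration of the intermediate-API-server option, here is a minimal sketch of a proxy that sits between Tabby's experimental-http backend and FastChat, applies a FIM template, and injects stop words before forwarding. All names, ports, the template, and the request/response shapes are assumptions for illustration, not a definitive implementation:

```python
# Sketch: a tiny proxy between Tabby (experimental-http) and FastChat that
# rewrites the prompt and adds vendor-specific stop words.
import requests
from fastapi import FastAPI
from pydantic import BaseModel

FASTCHAT_URL = "http://localhost:8000/v1/completions"  # hypothetical FastChat endpoint
FIM_TEMPLATE = "<fim_prefix>{prefix}<fim_suffix>{suffix}<fim_middle>"

app = FastAPI()

class CompletionRequest(BaseModel):
    prompt: str           # code before the cursor
    suffix: str = ""      # code after the cursor
    max_tokens: int = 128

@app.post("/v1/completions")
def complete(req: CompletionRequest):
    payload = {
        "model": "starcoder-1b",  # hypothetical model name registered in FastChat
        "prompt": FIM_TEMPLATE.format(prefix=req.prompt, suffix=req.suffix),
        "max_tokens": req.max_tokens,
        "stop": ["\n\n"],         # vendor-specific stop words live here, not in Tabby
    }
    resp = requests.post(FASTCHAT_URL, json=payload, timeout=30)
    resp.raise_for_status()
    return resp.json()
```

Tabby's experimental-http backend would then point at this proxy instead of FastChat directly.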
I see. So I'll keep the change on my local side. But do you have any plan to fix this dummy output in general?
No, experimental features are not actively being considered for maintenance by the team. I also plan to remove the
Sorry for the confusion; the dummy output I mentioned, with FIM present, also exists with ggml serving. That is why I asked whether it could be fixed.
I don't see that from the discussions in this thread. Feel free to file a bug if you find a supported device (
I see. I will try to get some test cases that can be made public, and then open a separate thread for that. Thx~
Hi @wsxiaoys ,
I am currently investigating the slow-response issue reported by @itlackey in #421 (comment).
I tried with StarCoder-1B over the FastChat interface; here is a summary of my findings:
So nearly 600 characters are generated, which takes the 3090 about 1.8 s, while only about 100 characters are useful. If we could make it stop right at those 100 characters, the response time would drop to roughly 1/6, which would be fast enough.
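A quick back-of-the-envelope check of that estimate, assuming generation time scales roughly linearly with output length on the 3090:

```python
# Rough latency estimate if generation stopped after the useful part.
generated_chars, useful_chars, total_time_s = 600, 100, 1.8
estimated_time_s = total_time_s * useful_chars / generated_chars
print(estimated_time_s)  # ~0.3 s, i.e. about 1/6 of the original latency
```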
There may be two approaches to achieving this, and I'd like to hear your suggestions.
Thx~