Implement Response Streaming #726

Merged: 35 commits into microsoft:main on Dec 1, 2024

Conversation

@carlodek (Contributor) commented on Aug 1, 2024

Motivation and Context (Why the change? What's the scenario?)

Add option to stream Ask result tokens without waiting for the full answer to be ready.

High level description (Approach, Design)

  • New stream boolean option for the Ask API, false by default. When true, answer tokens are streamed as soon as they are generated by LLMs.
  • New MemoryAnswer.StreamState enum property: Error, Reset, Append, Last (sketched after this list).
  • If moderation is enabled, the content is validated at the end. In case of moderation failure, the service returns an answer with StreamState = Reset and the new content to show to the end user.
  • Streaming uses SSE message format.
  • By default, SSE streams end with a [DONE] token. This can be disabled via KM settings.
  • SSE payload is optimized, returning RelevantSources only in the first SSE message.
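
As a rough illustration of the StreamState values listed above, the enum could be declared along these lines; the type name, namespace, and serialization attributes in the actual codebase may differ:

// Sketch only: the StreamState values described above; the real declaration may differ.
public enum StreamStates
{
    Error,  // something went wrong; the message carries the error details
    Reset,  // discard everything received so far and replace it with this message's text
    Append, // append this message's text to the answer built so far
    Last    // final chunk of the stream
}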

Example request:

curl 'http://127.0.0.1:9001/ask' --header 'Content-Type: application/json' \
    --data '{"question": "which storage engines can I use with Kernel Memory?", "stream": true }'

Response:

data: {"streamState":"append","question":"which storage engines can I use with Kernel Memory?","noResult":false,"text":"","relevantSources":[... cut ...]}

data: {"streamState":"append","noResult":false,"text":"The"}

data: {"streamState":"append","noResult":false,"text":" storage"}

data: {"streamState":"append","noResult":false,"text":" engines"}

data: {"streamState":"append","noResult":false,"text":" that"}

[...]

data: {"streamState":"append","noResult":false,"text":"work"}

data: {"streamState":"append","noResult":false,"text":" in"}

data: {"streamState":"append","noResult":false,"text":" progress"}

data: {"streamState":"append","noResult":false,"text":")"}

data: [DONE]
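
For illustration, here is a minimal C# client sketch that consumes the SSE stream shown above. The endpoint URL and JSON field names follow the example; the anonymous payload type and the error handling are hypothetical and not part of this PR:

using System;
using System.IO;
using System.Net.Http;
using System.Net.Http.Json;
using System.Text;
using System.Text.Json;

// Hypothetical request payload, mirroring the curl example above.
var payload = new { question = "which storage engines can I use with Kernel Memory?", stream = true };

using var http = new HttpClient();
using var request = new HttpRequestMessage(HttpMethod.Post, "http://127.0.0.1:9001/ask")
{
    Content = JsonContent.Create(payload)
};

// Read the body as it arrives instead of buffering the full response.
using var response = await http.SendAsync(request, HttpCompletionOption.ResponseHeadersRead);
response.EnsureSuccessStatusCode();

using var reader = new StreamReader(await response.Content.ReadAsStreamAsync());
var answer = new StringBuilder();

while (await reader.ReadLineAsync() is { } line)
{
    if (!line.StartsWith("data: ")) { continue; }   // skip blank SSE separator lines
    var data = line["data: ".Length..];
    if (data == "[DONE]") { break; }                // optional end-of-stream token

    using var doc = JsonDocument.Parse(data);
    var state = doc.RootElement.GetProperty("streamState").GetString();
    var text = doc.RootElement.GetProperty("text").GetString() ?? string.Empty;

    switch (state)
    {
        case "append":
        case "last":
            answer.Append(text);                    // add the new token(s)
            break;
        case "reset":
            answer.Clear();                         // e.g. moderation replaced the answer
            answer.Append(text);
            break;
        case "error":
            throw new InvalidOperationException(text);
    }
}

Console.WriteLine(answer);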

You can now call Azure OpenAI and OpenAI with streaming.
@carlodek carlodek requested a review from dluc as a code owner August 1, 2024 14:06
@carlodek (Contributor, Author) commented on Aug 1, 2024 via email

@dluc (Collaborator) commented on Oct 16, 2024

Update: for this feature to be merged, there are a couple of things to do:

  • Check the similar PR Implement new streaming ask endpoint (WIP) #400 and decide which approach to take
  • Support content moderation. The stream of tokens needs to be validated while it is streamed, at a configurable frequency. If at any point the text moderation fails, the stream needs to be reset, e.g. by sending a special token or similar (see the sketch below).
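
As a hedged illustration of that second point, and not the PR's actual implementation, moderation at a configurable frequency during streaming could look roughly like this; the StreamChunk record, the moderation delegate, and the window size are hypothetical names introduced here for the sketch:

using System;
using System.Collections.Generic;
using System.Text;
using System.Threading.Tasks;

// Hypothetical chunk shape; the real MemoryAnswer type in Kernel Memory differs.
public sealed record StreamChunk(string StreamState, string Text);

public static class StreamingModerationSketch
{
    // Emits "append" chunks as tokens arrive, checks moderation every N tokens,
    // and emits a single "reset" chunk (then stops) if moderation fails.
    public static async IAsyncEnumerable<StreamChunk> ModerateWhileStreamingAsync(
        IAsyncEnumerable<string> tokens,
        Func<string, Task<bool>> isSafeAsync,   // hypothetical moderation callback
        int checkEveryNTokens = 20)             // the "configurable frequency"
    {
        var answerSoFar = new StringBuilder();
        var sinceLastCheck = 0;

        await foreach (var token in tokens)
        {
            answerSoFar.Append(token);
            yield return new StreamChunk("append", token);

            if (++sinceLastCheck < checkEveryNTokens) { continue; }
            sinceLastCheck = 0;

            if (!await isSafeAsync(answerSoFar.ToString()))
            {
                // Tell the client to discard everything received so far and
                // show this replacement text instead.
                yield return new StreamChunk("reset",
                    "The generated content was removed by content moderation.");
                yield break;
            }
        }

        yield return new StreamChunk("last", string.Empty);
    }
}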

@dluc added the "waiting for author" label on Oct 16, 2024
@nurkmez2
I have been waiting for this for a long time. If you could add the Ask Stream endpoint in the next version, it would be great.

@nurkmez2
Hi @carlodek,
It would be great if you could complete the ask_stream endpoint feature.
Thanks and regards

@roldengarm (Contributor)
Any updates on this please, @dluc? It has been a long time; hopefully we can get this sorted soon.

@carlodek (Contributor, Author)
Hello, I've noticed that this works like a charm with Ollama too! Streaming calls through the ask_stream endpoint also work with Ollama. @dluc, let me know if I need to do anything more.

@dluc (Collaborator) left a review comment

There are several out-of-scope and unnecessary changes, plus some that affect the solution's security. First, I would ask you to undo all of these changes, including the code-style ones, keeping the PR to the bare minimum. Thanks!

Review threads (outdated, resolved) on: .dockerignore, .gitignore, Directory.Packages.props, Dockerfile, service/Core/Core.csproj, service/Service/Program.cs
@dluc (Collaborator) commented on Nov 29, 2024

I'll make some changes to see if we can merge it:

  • move the [DONE] token to the end of the stream, separate from the stream chunks
  • merge stream behavior into the existing /ask endpoint, using an optional boolean flag to choose whether to stream or not
  • remove code duplication in SearchClient
  • revisit how content moderation is implemented

carlodek and others added 3 commits on November 29, 2024:
  • changed x in token
  • removed unnecessary blank line
@dluc force-pushed the stream_response branch 5 times, most recently from c7f558a to 817c90b on December 1, 2024
- reduce code duplication
- reduce stream payload size
- stream reset on moderation
- handle errors
- add streaming examples
@dluc removed the "waiting for author" label on Dec 1, 2024
@dluc changed the title from "added ask_stream endpoint" to "Implement Response Streaming" on Dec 1, 2024
@dluc merged commit 77fd7be into microsoft:main on Dec 1, 2024
6 checks passed