Implement new streaming ask endpoint (WIP) #400
Conversation
@microsoft-github-policy-service agree
Also, I want to note that the way I have this implemented currently just returns the whole answer as a stream of strings. Another implementation I was considering returns a stream of PartialMemoryAnswer (a MemoryAnswer with nullable properties), where the first object carries the metadata (such as facts/citations) and every part after the first one contains only its piece of text. Example:

```csharp
var noAnswerFound = new PartialMemoryAnswer
{
    Question = question,
    NoResult = true,
    Result = this._config.EmptyAnswer,
};

StringBuilder bufferedAnswer = new(); // Buffer the result in memory to count chars and detect the "empty answer" sentinel
bool finishedRequiredBuffering = false;
var watch = Stopwatch.StartNew();

await foreach (var x in this.GenerateAnswerAsync(question, facts.ToString()).WithCancellation(cancellationToken).ConfigureAwait(false))
{
    if (x is null || x.Length == 0)
    {
        continue;
    }

    bufferedAnswer.Append(x);
    int currentLength = bufferedAnswer.Length;

    if (!finishedRequiredBuffering)
    {
        if (currentLength <= this._config.EmptyAnswer.Length && ValueIsEquivalentTo(bufferedAnswer.ToString(), this._config.EmptyAnswer))
        {
            this._log.LogTrace("Answer generated in {0} msecs. No relevant memories found", watch.ElapsedMilliseconds);
            noAnswerFound.NoResultReason = "No relevant memories found";
            yield return noAnswerFound;
            yield break;
        }

        if (currentLength > this._config.EmptyAnswer.Length)
        {
            finishedRequiredBuffering = true;

            // First part: carries the metadata (citations) plus everything buffered so far
            yield return new PartialMemoryAnswer
            {
                Question = question,
                NoResult = false,
                Result = bufferedAnswer.ToString(),
                RelevantSources = citationList, // built earlier in the method, outside this snippet
            };
        }
    }
    else
    {
        // Subsequent parts: text only, no metadata. Using "else" here avoids re-emitting
        // the chunk that was already included in the buffered first part above.
        yield return new PartialMemoryAnswer
        {
            Result = x
        };
    }

    if (this._log.IsEnabled(LogLevel.Trace) && currentLength >= 30)
    {
        this._log.LogTrace("{0} chars generated", currentLength);
    }
}
```
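For reference, a minimal sketch of what the hypothetical PartialMemoryAnswer type could look like, inferred purely from the properties used in the snippet above (it is not part of this PR):

```csharp
// Hypothetical type, not in the PR: a MemoryAnswer-like class with nullable
// properties, so parts after the first can omit all the metadata.
public class PartialMemoryAnswer
{
    public string? Question { get; set; }
    public bool? NoResult { get; set; }
    public string? NoResultReason { get; set; }
    public string? Result { get; set; }
    public List<Citation>? RelevantSources { get; set; }
}
```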
service/Core/MemoryServerless.cs (outdated)

```diff
@@ -257,4 +257,29 @@ public Task DeleteDocumentAsync(string documentId, string? index = null, Cancell
         minRelevance: minRelevance,
         cancellationToken: cancellationToken);
 }
+
+/// <inheritdoc />
+public IAsyncEnumerable<string> AskStreamingAsync(
```
This new method returns only the answer, missing information about sources, relevance, etc. Why not stream a JSON object?
> This new method returns only the answer, missing information about sources, relevance, etc. Why not stream a JSON object?
I did a small write-up here.
In this PR I made it a stream of strings, because using an object would require making some design choices:
Is the extra information (sources, relevance, etc.) sent with every "answer part", or only with the first/last one?
In an implementation I did in a personal project, I sent all the metadata in the first response part, and only sent a messageID + messageChunk with every remaining part.
If you prefer that approach, or have another solution, I can rewrite these changes.
I think everyone will want streaming without giving up the other metadata, in particular the list of sources and their relevance. I would use the same approach as OpenAI streaming, returning a content_update property within the response object that includes the token, and potentially sending the list of sources on the first response only, using the same approach (something like sources_update), to avoid the overhead of sending the same sources with every token.
Something like this pseudo structure (I would reuse the existing class, just change it a bit to support "_update"):

```
response:
{
    sources_update {
        stream of items // sent all at once with the first token, then the list stays empty
    },
    content_update {
        next string token
    }
}
```
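A rough C# sketch of that idea, assuming hypothetical SourcesUpdate/ContentUpdate members added to a MemoryAnswer-like class (the class, property, and wire names here are illustrative, not the final API):

```csharp
using System.Text.Json.Serialization;

// Illustrative only: a streaming-friendly variant of the answer class,
// where source metadata travels once and later chunks carry just the token.
public class StreamingMemoryAnswer
{
    [JsonPropertyName("question")]
    public string? Question { get; set; }

    // Populated only on the first chunk; empty or null afterwards.
    [JsonPropertyName("sources_update")]
    public List<Citation>? SourcesUpdate { get; set; }

    // The next string token of the answer text.
    [JsonPropertyName("content_update")]
    public string? ContentUpdate { get; set; }
}

// What a serialized stream might look like under this scheme:
//   {"question":"...","sources_update":[{...},{...}],"content_update":"The"}
//   {"content_update":" answer"}
//   {"content_update":" continues"}
```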
Made some changes to return MemoryAnswer instead of string.
I'm not sure what exactly you want with MemoryAnswer: new properties for streaming, or modifying the existing ones?
In my latest changes I've used the existing properties for the time being.
Just my two cents: that sounds perfect @dluc. We're interested in streaming as well; it's one of the reasons we haven't implemented KM yet, and we do need the sources too.
Sending the sources only on the first response makes perfect sense.
Just wondering when this will be merged? Are there any blockers? I could possibly assist if need be.
some conflicts to address
if it helps - I rebased and fixed some issues => https://github.com/chaelli/kernel-memory/tree/100-streaming-askAsync-response
Resolved these
Any reason this is waiting? Do we need to add SSE format for the endpoint? (If so, maybe this works: #625 (reply in thread) - it did for my quick test in the browser, but my experience with SSE is minimal ;))
@dluc just checking in, can this be merged soon please? We're about to build a UI using Kernel Memory and streaming is a must-have. @JonathanVelkeneers can you please check as well?
Sorry to keep this on hold, but I cannot merge yet. I'd suggest working with a fork so you're not blocked. It's an important feature, but there's some other critical work in progress.
@dluc thanks for the reply, understood. Do you by any chance have a rough ETA for when you'll be ready to merge it? We've already forked the repo so we can auto-deploy it to our infra, so the service side is no big deal.
@JonathanVelkeneers I played around a bit with SSE and now have a working solution that does "clean" SSE (via GET). Basically, I use two endpoints: one that starts the answer generation and returns an id, and one that then streams the answer.
When requesting the answer from the browser, this is the basic functionality.
Maybe you can add this to your branch?
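A minimal sketch of the two-endpoint pattern described above, assuming ASP.NET Core minimal APIs; the routes, type names, and placeholder token generator are illustrative assumptions, not the author's actual code:

```csharp
using System.Collections.Concurrent;
using System.Runtime.CompilerServices;

var builder = WebApplication.CreateBuilder(args);
var app = builder.Build();

// In-memory registry of pending questions, keyed by a generated id.
var pending = new ConcurrentDictionary<string, string>();

// Endpoint 1: register the question and return an id to the caller.
app.MapPost("/ask-stream", (AskRequest req) =>
{
    var id = Guid.NewGuid().ToString("N");
    pending[id] = req.Question;
    return Results.Ok(new { id });
});

// Endpoint 2: stream the answer for that id as Server-Sent Events (via GET).
app.MapGet("/ask-stream/{id}", async (string id, HttpResponse response, CancellationToken ct) =>
{
    if (!pending.TryRemove(id, out var question))
    {
        response.StatusCode = StatusCodes.Status404NotFound;
        return;
    }

    response.ContentType = "text/event-stream";
    response.Headers["Cache-Control"] = "no-cache";

    await foreach (var token in GenerateAnswerAsync(question, ct))
    {
        await response.WriteAsync($"data: {token}\n\n", ct);
        await response.Body.FlushAsync(ct); // push each token immediately, no buffering
    }
});

app.Run();

// Placeholder standing in for the real memory pipeline's token stream.
static async IAsyncEnumerable<string> GenerateAnswerAsync(
    string question, [EnumeratorCancellation] CancellationToken ct = default)
{
    foreach (var token in new[] { "Streaming", " works", "!" })
    {
        ct.ThrowIfCancellationRequested();
        yield return token;
        await Task.Delay(100, ct);
    }
}

record AskRequest(string Question);
```

On the browser side, a plain EventSource pointed at the GET endpoint would consume the stream.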
@dluc does it make sense for us to rebase this? Or will it take much longer to merge?
No worries, I'll take care of the rebase ;-)
Is this feature still going to be implemented?
Apologies for the delay; I had to put this topic on hold, and we currently have two PRs for the same feature. I'll need to review them and decide on the best approach. Unfortunately, no ETA at the moment.
Update: for this feature to be merged, there are a couple of things to do:
@JonathanVelkeneers
@JonathanVelkeneers thanks for helping! There were two PRs addressing the same feature, and I had to decide between them. The Response Streaming implementation is now merged into …
Motivation and Context (Why the change? What's the scenario?)
Draft PR for issue #100.
This draft PR was briefly discussed on Discord. The changes still need work.
Currently missing:
- Getting the streaming to work in WebClient; it currently still buffers the response (I was not able to find a fix).

High level description (Approach, Design)
Used the existing AskAsync method as a reference point for most of the code.
The new streaming endpoint does not return a MemoryAnswer; it returns the actual text result as an async enumerable.
If sources are needed, the SearchAsync method can still be used to fetch them with minimal extra performance overhead.
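As a rough illustration of that stream-of-strings design, assuming a memory instance ("memory") configured elsewhere; the full AskStreamingAsync parameter list (index, filters, minRelevance, cancellation token) is abbreviated here:

```csharp
// Hypothetical caller-side usage; only the question argument is shown.
await foreach (string chunk in memory.AskStreamingAsync("What is Kernel Memory?"))
{
    Console.Write(chunk); // render tokens as they arrive instead of waiting for the full answer
}

// Sources can still be fetched separately through the existing search API.
SearchResult sources = await memory.SearchAsync("What is Kernel Memory?");
```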