API: move to api-version=2 for faster streaming LLM responses (#5446)
Previously, the Cody IDE client used `api-version=1` for LLM chat responses, which sent the full LLM response on every streaming change. This was problematic: it generated a lot of redundant traffic between the IDE client and the remote Sourcegraph instance, eventually resulting in a slower user experience. Now we use `api-version=2`, which only sends the delta between chunks, significantly reducing the number of characters we send over the wire. For example, for a chat response with 3k output tokens, we were previously processing up to 1.8m tokens(!!), while now we only process 3k tokens.

Don't merge until sourcegraph/sourcegraph#293 goes live.

## Test plan

Tested locally and confirmed that both `api-version=1` and `api-version=2` work as expected.

- [x] Update all HTTP recordings to reflect the new API. This should give us good test coverage.
- [x] Manually confirm the web extension is still working. Cody Web has no automated tests, but I ran the demo locally and took this screenshot of the delta encoding in action:

![CleanShot 2024-09-12 at 10 25 52@2x](https://github.com/user-attachments/assets/4189fad2-f23a-4d83-ac1c-96aa462099a2)

## Changelog

* Cody now uses a new LLM API that offers faster performance, especially for long chat responses. This improvement is currently enabled only for Claude models.