Support for rate limited embeddings calculation #877

For large input files, you might encounter an HTTP 429 (rate limit) error from the Azure backend during embeddings calculation.

Please provide a way to limit the requests per second. It would also be great if this did not lead directly to an exception, but to some kind of back-off behavior. There could be multiple users of a single endpoint, so you have no control over the total load, and aichat could crash even if rate limiting is implemented.

Comments
The AIChat RAG sends embedding requests sequentially, without any concurrency. If 429 errors still occur under these circumstances, it's likely a problem with your API server: the API does not meet the demands of a production environment. Maybe you can try to adjust …
Retry may be supported in the future, but it's not planned for the near term. Many scenarios can cause the embeddings interface to fail, such as network errors, inputs exceeding the token limit, rate limiting, etc.
It seems it doesn't matter whether the requests are sequential or parallel: if too many requests arrive within a certain amount of time, the request fails. The problem is that I have no control over the rate limit, and our company does not provide unrestricted access tokens. That seems reasonable, as they could lose significant amounts of money if these tokens were lost. Without a throttling mechanism, I cannot use the RAG function with large or many files. I am using: …

What would be a recommended value for …?
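For reference, here is a minimal sketch of the kind of client-side throttle being requested: at most a fixed number of requests per time window, sleeping when the budget is exhausted. It is written in Rust to match the aichat codebase, but the `Throttle` type and everything in it are invented for illustration; this is not aichat's implementation.

```rust
use std::time::{Duration, Instant};

/// Allow at most `max_per_window` calls per `window`, blocking otherwise.
/// Illustrative sketch only, not aichat code.
struct Throttle {
    window: Duration,
    max_per_window: u32,
    window_start: Instant,
    count: u32,
}

impl Throttle {
    fn new(max_per_window: u32, window: Duration) -> Self {
        Self {
            window,
            max_per_window,
            window_start: Instant::now(),
            count: 0,
        }
    }

    /// Block until another request is allowed.
    fn acquire(&mut self) {
        let elapsed = self.window_start.elapsed();
        if elapsed >= self.window {
            // A new window has started; reset the budget.
            self.window_start = Instant::now();
            self.count = 0;
        } else if self.count >= self.max_per_window {
            // Budget exhausted: sleep until the current window ends.
            std::thread::sleep(self.window - elapsed);
            self.window_start = Instant::now();
            self.count = 0;
        }
        self.count += 1;
    }
}
```

A caller would construct `Throttle::new(5, Duration::from_secs(1))` for a 5-requests-per-second limit and call `acquire()` before each embedding request.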
#879 has been merged; you can try it to see if the problem is solved.
If there is no delay after a failed attempt, a new attempt will likely yield the same result, since the rate-limit quota is still exhausted. The error message includes a hint about the appropriate delay before trying again, but parsing that message may not be practical, as its format could change at any time. A possible solution could be a hard-coded delay of about one minute or, alternatively, a configuration option per model. However, I would only recommend retrying if the backend provides the error code: it doesn't make sense to repeat the request for every failure reason, such as incorrect credentials.
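As a sketch of the retry behavior described above, the loop below pauses for a fixed delay between attempts. The `with_retry` helper and its generic shape are invented for illustration and are not aichat code.

```rust
use std::thread::sleep;
use std::time::Duration;

/// Run `op` up to `attempts` times, pausing `delay` between failures.
/// Without the pause, an immediate retry would hit the same still-active
/// rate limit and fail again.
fn with_retry<T, E>(
    attempts: u32,
    delay: Duration,
    mut op: impl FnMut() -> Result<T, E>,
) -> Result<T, E> {
    let mut result = op();
    for _ in 1..attempts {
        if result.is_ok() {
            break;
        }
        sleep(delay);
        result = op();
    }
    result
}
```

As the comment notes, a production version should ideally retry only for retryable errors such as HTTP 429, which is exactly the error-classification problem discussed here.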
AIChat needs to support many embedding providers, so it is not practical to distinguish between error types. We will provide an environment variable …
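The variable name is truncated above. As an illustration only, consuming such a setting might look like the following; the name `AICHAT_EMBEDDINGS_RETRY_LIMIT` is hypothetical, not the variable aichat actually ships.

```rust
use std::env;

/// Read a retry limit from the environment, falling back to 0
/// (no retries) when the variable is unset or unparsable.
/// The variable name here is hypothetical.
fn embeddings_retry_limit() -> u32 {
    env::var("AICHAT_EMBEDDINGS_RETRY_LIMIT")
        .ok()
        .and_then(|v| v.parse().ok())
        .unwrap_or(0)
}
```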
You can try #882 to see if the problem is solved.
This works for me, thanks.