
Support for rate limited embeddings calculation #877

Closed
martin-eder-zeiss opened this issue Sep 23, 2024 · 8 comments · Fixed by #879
Labels
enhancement New feature or request

Comments

@martin-eder-zeiss

martin-eder-zeiss commented Sep 23, 2024

For large input files, you might encounter this error from the Azure backend during embeddings calculation:

Error: Failed to create embedding

Caused by:
    0: Failed to call embeddings api
    1: Requests to the Embeddings_Create Operation under Azure OpenAI API version 2024-02-01 have exceeded call rate limit of your current OpenAI S0 pricing tier. Please retry after 55 seconds. Please go here: https://aka.ms/oai/quotaincrease if you would like to further increase the default rate limit. (code: 429)

Please provide a way to limit the requests per second. It would also be great if this would not lead directly to an exception, but to some kind of back-off behavior. There could be multiple users of a single endpoint, so you have no control over the total load and aichat could crash even if rate limiting is implemented.

@martin-eder-zeiss martin-eder-zeiss added the enhancement New feature or request label Sep 23, 2024
@sigoden
Owner

sigoden commented Sep 23, 2024

The AIChat RAG sends embedding requests sequentially, without any concurrency. If 429 errors still occur under these circumstances, it's likely a problem with your API server; the API does not meet the demands of a production environment.

Maybe you can try adjusting max_batch_size to process more data at once, reducing the number of requests.

@sigoden sigoden closed this as completed Sep 23, 2024
@sigoden
Owner

sigoden commented Sep 23, 2024

Retry may be supported in the future, but it's not planned for the near term.

Many scenarios can cause the embeddings interface to fail, such as network errors, input exceeding the token limit, rate limiting, etc.
Since AIChat needs to support many embeddings API providers, the printed errors can be very diverse. We need to listen to user feedback before making a decision.

@martin-eder-zeiss
Author

It seems it doesn't matter whether the requests are sequential or parallel. If too many requests arrive within a certain time window, the request fails. The problem is that I have no control over the rate limit, and our company does not provide unrestricted access tokens. This seems reasonable, as they could lose significant amounts of money if these tokens were lost. Without a throttling mechanism, I cannot use the RAG function with large or many files.

I am using:

  - name: text-embedding-3-large
    type: embedding
    input_price: 0.13
    max_tokens_per_chunk: 8191
    default_chunk_size: 3000
    max_batch_size: 100

What would be a recommended value for max_batch_size? All examples for this value are in this order of magnitude.

@sigoden
Owner

sigoden commented Sep 24, 2024

max_batch_size: 100 is already optimal and does not need to be increased further.

#879 has been merged, you can try it to see if the problem is solved.

@martin-eder-zeiss
Author

Error: Failed to create embedding after 3 attempts

Caused by:
    0: Failed to call embeddings api
    1: Requests to the Embeddings_Create Operation under Azure OpenAI API version 2024-02-01 have exceeded call rate limit of your current OpenAI S0 pricing tier. Please retry after 52 seconds. Please go here: https://aka.ms/oai/quotaincrease if you would like to further increase the default rate limit. (code: 429)

If there is no delay after a failed attempt, a new attempt will likely yield the same result since the quota remains active. The message includes a hint about the appropriate delay before trying again. Parsing this message may not be practical, as it could change at any time.

A possible solution could be implementing a hard-coded delay of about one minute or, alternatively, a configuration option per model. However, I would only recommend this if the backend provides the error code. It doesn't make sense to repeat the request for every exception reason, such as incorrect credentials.
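The gating described above could be sketched as follows. This is only an illustration of the idea, assuming the error carries an HTTP status code; the function name and the raw `u16` status are hypothetical, not aichat's actual error-handling API:

```rust
// Sketch only: gate retries on transient HTTP status codes.
// `should_retry` and the raw `u16` status are illustrative assumptions,
// not aichat's actual error-handling code.
fn should_retry(status: u16) -> bool {
    // 429 (rate limited) and 5xx (server-side) errors are worth retrying;
    // 4xx errors such as bad credentials (401) are not.
    status == 429 || (500..=599).contains(&status)
}

fn main() {
    assert!(should_retry(429));
    assert!(!should_retry(401)); // wrong credentials: retrying is pointless
    println!("retry gating ok");
}
```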

@sigoden
Owner

sigoden commented Sep 24, 2024

AIChat needs to support many embedding providers, so it is not practical to distinguish between error types.

We will provide an environment variable AICHAT_EMBEDDINGS_RETRY_LIMIT to let users customize the retry limit.
The retry delay follows exponential backoff: 1 second, 2 seconds, 4 seconds, and so on, doubling with each attempt.
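A minimal sketch of that doubling schedule, assuming the nth retry waits 2^(n-1) seconds (the function name is illustrative, not aichat's actual code):

```rust
// Sketch of the exponential backoff schedule described above:
// retry 1 waits 1s, retry 2 waits 2s, retry 3 waits 4s, ...
fn backoff_delay_secs(attempt: u32) -> u64 {
    1u64 << attempt.min(16) // cap the shift so it cannot overflow
}

fn main() {
    // With a retry limit of 7, the total wait is
    // 1 + 2 + 4 + 8 + 16 + 32 + 64 = 127 seconds, which would
    // outlast a "retry after 52 seconds" quota window.
    let total: u64 = (0..7).map(backoff_delay_secs).sum();
    println!("total backoff over 7 retries: {}s", total);
}
```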

@sigoden
Owner

sigoden commented Sep 24, 2024

@martin-eder-zeiss

You can try #882 to see if the problem is solved.

git checkout origin/feat-retry
AICHAT_EMBEDDINGS_RETRY_LIMIT=7 cargo run

@martin-eder-zeiss
Author

This works for me, thanks.
Is the delay reset after a successful attempt? The exception might occur multiple times in one run if there are many or very large files. If it happens repeatedly, it probably doesn't make sense to start with a huge delay just because the quota is hit again.
