Support for rate limited embeddings calculation #877

For large input files, you might encounter an HTTP 429 (rate limit) error from the Azure backend during embeddings calculation.

Please provide a way to limit the requests per second. It would also be great if this did not lead directly to an exception, but to some kind of back-off behavior. There could be multiple users of a single endpoint, so you have no control over the total load, and aichat could crash even if rate limiting is implemented.

Comments
The AIChat RAG sends embedding requests sequentially, without any concurrency. If 429 errors still occur under these circumstances, it's likely a problem with your API server: the API does not meet the demands of a production environment. Maybe you can try to adjust …
Retry may be supported in the future, but it's not planned for the near term. Many scenarios can cause the embeddings interface to fail, such as network errors, inputs exceeding the token limit, rate limiting, etc.
It seems it doesn't matter whether the requests are sequential or parallel: if too many requests arrive within a certain amount of time, the request fails. The problem is that I have no control over the rate limit, and our company does not provide unrestricted access tokens. That seems reasonable, as they could lose significant amounts of money if these tokens were lost. Without a throttling mechanism, I cannot use the RAG function with large or many files. I am using: …

What would be a recommended value for …?
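For reference, here is a minimal sketch of the kind of client-side throttle being requested: at most a fixed number of requests per time window, sleeping when the budget is exhausted. It is written in Rust to match the aichat codebase, but the `Throttle` type and everything in it are invented for illustration; this is not aichat's implementation.

```rust
use std::time::{Duration, Instant};

/// Allow at most `max_per_window` calls per `window`, blocking otherwise.
/// Illustrative sketch only, not aichat code.
struct Throttle {
    window: Duration,
    max_per_window: u32,
    window_start: Instant,
    count: u32,
}

impl Throttle {
    fn new(max_per_window: u32, window: Duration) -> Self {
        Self {
            window,
            max_per_window,
            window_start: Instant::now(),
            count: 0,
        }
    }

    /// Block until another request is allowed.
    fn acquire(&mut self) {
        let elapsed = self.window_start.elapsed();
        if elapsed >= self.window {
            // A new window has started; reset the budget.
            self.window_start = Instant::now();
            self.count = 0;
        } else if self.count >= self.max_per_window {
            // Budget exhausted: sleep until the current window ends.
            std::thread::sleep(self.window - elapsed);
            self.window_start = Instant::now();
            self.count = 0;
        }
        self.count += 1;
    }
}
```

A caller would construct `Throttle::new(5, Duration::from_secs(1))` for a 5-requests-per-second limit and call `acquire()` before each embedding request.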
#879 has been merged; you can try it to see if the problem is solved.
If there is no delay after a failed attempt, a new attempt will likely yield the same result, since the rate-limit quota is still exhausted. The error message includes a hint about the appropriate delay before trying again, but parsing that message may not be practical, as its format could change at any time. A possible solution could be a hard-coded delay of about one minute or, alternatively, a configuration option per model. However, I would only recommend retrying if the backend provides the error code: it doesn't make sense to repeat the request for every failure reason, such as incorrect credentials.
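As a sketch of the retry behavior described above, the loop below pauses for a fixed delay between attempts. The `with_retry` helper and its generic shape are invented for illustration and are not aichat code.

```rust
use std::thread::sleep;
use std::time::Duration;

/// Run `op` up to `attempts` times, pausing `delay` between failures.
/// Without the pause, an immediate retry would hit the same still-active
/// rate limit and fail again.
fn with_retry<T, E>(
    attempts: u32,
    delay: Duration,
    mut op: impl FnMut() -> Result<T, E>,
) -> Result<T, E> {
    let mut result = op();
    for _ in 1..attempts {
        if result.is_ok() {
            break;
        }
        sleep(delay);
        result = op();
    }
    result
}
```

As the comment notes, a production version should ideally retry only for retryable errors such as HTTP 429, which is exactly the error-classification problem discussed here.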
AIChat needs to support many embedding providers, so it is not practical to distinguish between error types. We will provide an environment variable …
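The variable name is truncated above. As an illustration only, consuming such a setting might look like the following; the name `AICHAT_EMBEDDINGS_RETRY_LIMIT` is hypothetical, not the variable aichat actually ships.

```rust
use std::env;

/// Read a retry limit from the environment, falling back to 0
/// (no retries) when the variable is unset or unparsable.
/// The variable name here is hypothetical.
fn embeddings_retry_limit() -> u32 {
    env::var("AICHAT_EMBEDDINGS_RETRY_LIMIT")
        .ok()
        .and_then(|v| v.parse().ok())
        .unwrap_or(0)
}
```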
You can try #882 to see if the problem is solved.
This works for me, thanks.