
[FEATURE] Introduce bytes-based batch size config to optimize Flint write throughput #304

Closed
dai-chen opened this issue Apr 4, 2024 · 1 comment
Labels
0.4, enhancement (New feature or request)

Comments

@dai-chen
Collaborator

dai-chen commented Apr 4, 2024

Is your feature request related to a problem?

The current fixed, document-count-based batch size configuration (spark.datasource.flint.write.batch_size) poses limitations, particularly when dealing with documents of varying sizes. This can lead to inefficient resource utilization and potential memory issues. For example, smaller documents may leave batches under-filled, while larger documents may cause excessive memory overhead.

What solution would you like?

Introduce a new configuration option (e.g. spark.datasource.flint.write.batch_bytes) to enable bytes-based batch sizing. This adaptive approach would dynamically adjust the batch size based on the overall size of the OpenSearch bulk request body, ensuring optimal resource utilization and minimizing memory overhead.

This proposal is similar to maxFilesPerTrigger and maxBytesPerTrigger in Spark's built-in file source. We're introducing this enhancement specifically for the FlintWrite sink to provide similar adaptability and efficiency when writing data to the OpenSearch cluster.
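A minimal sketch of what bytes-based batching could look like (all class, method, and parameter names here are hypothetical, not the actual Flint writer code): documents accumulate until the serialized request body reaches the configured byte budget, then the batch is flushed as one bulk request.

```scala
import scala.collection.mutable.ArrayBuffer

// Illustrative sketch only: buffer documents until the pending bulk body
// reaches a byte budget, then flush them as a single _bulk request.
// All names are hypothetical, not the actual Flint implementation.
class BytesBasedBatchWriter(batchBytes: Long, sendBulk: Seq[String] => Unit) {
  private val buffer = ArrayBuffer.empty[String]
  private var pendingBytes = 0L

  def write(jsonDoc: String): Unit = {
    buffer += jsonDoc
    pendingBytes += jsonDoc.getBytes("UTF-8").length
    if (pendingBytes >= batchBytes) flush()
  }

  def flush(): Unit = {
    if (buffer.nonEmpty) {
      sendBulk(buffer.toSeq) // one bulk request per flush
      buffer.clear()
      pendingBytes = 0L
    }
  }
}
```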

What alternatives have you considered?

Alternative solutions could involve manually adjusting the batch size based on expected document sizes. However, this approach is:

  1. Less efficient and prone to errors, especially in dynamic environments where document sizes may vary significantly.
  2. Dependent on changing the batch size config, which is a Spark conf set globally (as shown below) rather than being configurable per SQL statement.
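For reference, the manual alternative today would look roughly like this (assuming the batch_size key is read as a Spark session conf; the value shown is illustrative):

```scala
// Manual tuning today: pick a fixed document count per bulk request and set
// it as a global Spark conf. The value 500 is illustrative only.
spark.conf.set("spark.datasource.flint.write.batch_size", "500")
```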

Do you have any additional context?

Currently, FlintWrite is instantiated per partition, so write throughput needs to be optimized globally across partitions. A minimal requirement is to avoid over-pressuring the OpenSearch cluster.

@penghuo
Collaborator

penghuo commented Apr 23, 2024

Set refresh_policy to false

The default value of Flint's spark.datasource.flint.write.refresh_policy is wait_for. For _bulk requests, Flint could use false, which returns immediately without waiting for the refresh to finish.
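For illustration, disabling the wait could look like this (assuming the option is set as a Spark conf; wait_for and false follow the semantics of the OpenSearch bulk refresh parameter):

```scala
// Return from _bulk immediately instead of waiting for the index refresh.
// wait_for is the current Flint default; false skips the wait entirely.
spark.conf.set("spark.datasource.flint.write.refresh_policy", "false")
```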
