
[FEATURE] Introduce bytes-based batch size config to optimize Flint write throughput #304

Closed
dai-chen opened this issue Apr 4, 2024 · 1 comment
Labels
0.4, enhancement (New feature or request)

Comments

@dai-chen
Collaborator

dai-chen commented Apr 4, 2024

Is your feature request related to a problem?

The current fixed, document-count-based batch size configuration (spark.datasource.flint.write.batch_size) poses limitations, particularly when dealing with documents of varying sizes. This can lead to inefficient resource utilization and potential memory issues. For example, smaller documents may leave batches under-filled, while larger documents may cause excessive memory overhead.

What solution would you like?

Introduce a new configuration option (e.g. spark.datasource.flint.write.batch_bytes) to enable bytes-based batch sizing. This adaptive approach would dynamically adjust the batch size based on the overall size of the OpenSearch bulk request body, ensuring optimal resource utilization and minimizing memory overhead.

This proposal is similar to maxFilesPerTrigger and maxBytesPerTrigger in Spark's built-in file source. We're introducing this enhancement specifically for the FlintWrite sink to provide similar adaptability and efficiency when writing data to the OpenSearch cluster.
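A minimal sketch of what bytes-based batching could look like (all class, method, and parameter names here are hypothetical, not the actual Flint writer code): documents accumulate until the serialized request body reaches the configured byte budget, then the batch is flushed as one bulk request.

```scala
import scala.collection.mutable.ArrayBuffer

// Illustrative sketch only: buffer documents until the pending bulk body
// reaches a byte budget, then flush them as a single _bulk request.
// All names are hypothetical, not the actual Flint implementation.
class BytesBasedBatchWriter(batchBytes: Long, sendBulk: Seq[String] => Unit) {
  private val buffer = ArrayBuffer.empty[String]
  private var pendingBytes = 0L

  def write(jsonDoc: String): Unit = {
    buffer += jsonDoc
    pendingBytes += jsonDoc.getBytes("UTF-8").length
    if (pendingBytes >= batchBytes) flush()
  }

  def flush(): Unit = {
    if (buffer.nonEmpty) {
      sendBulk(buffer.toSeq) // one bulk request per flush
      buffer.clear()
      pendingBytes = 0L
    }
  }
}
```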

What alternatives have you considered?

Alternative solutions could involve manually adjusting the batch size based on expected document sizes. However, this approach is:

  1. Less efficient and prone to errors, especially in dynamic environments where document sizes may vary significantly.
  2. Dependent on changing the batch size config, which is a Spark conf set globally (as shown below) rather than being configurable per SQL statement.
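For reference, the manual alternative today would look roughly like this (assuming the batch_size key is read as a Spark session conf; the value shown is illustrative):

```scala
// Manual tuning today: pick a fixed document count per bulk request and set
// it as a global Spark conf. The value 500 is illustrative only.
spark.conf.set("spark.datasource.flint.write.batch_size", "500")
```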

Do you have any additional context?

Currently, FlintWrite is instantiated per partition, so write throughput needs to be optimized globally across partitions. A minimal requirement is to avoid over-pressuring the OpenSearch cluster.

@penghuo
Collaborator

penghuo commented Apr 23, 2024

Set refresh_policy to false

The default value of Flint's spark.datasource.flint.write.refresh_policy is wait_for. For _bulk requests, Flint could use false, which returns immediately without waiting for the refresh to finish.
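For illustration, disabling the wait could look like this (assuming the option is set as a Spark conf; wait_for and false follow the semantics of the OpenSearch bulk refresh parameter):

```scala
// Return from _bulk immediately instead of waiting for the index refresh.
// wait_for is the current Flint default; false skips the wait entirely.
spark.conf.set("spark.datasource.flint.write.refresh_policy", "false")
```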
