Is your feature request related to a problem?
The current fixed batch size configuration (spark.datasource.flint.write.batch_size) poses limitations when documents vary in size, and can lead to inefficient resource utilization and potential memory issues. For example, a batch of small documents underutilizes each bulk request, while a batch of large documents may cause memory overhead.
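For reference, this is how the fixed threshold is tuned today (the conf key comes from this issue; the snippet assumes an active SparkSession in scope as spark):

```scala
// Current behavior (sketch): the bulk flush threshold is a fixed document
// count, regardless of how large each serialized document is.
spark.conf.set("spark.datasource.flint.write.batch_size", "1000")

// 1000 tiny documents  -> each bulk request carries far less payload than it could.
// 1000 large documents -> the bulk request body may grow beyond the available
//                         executor memory or the cluster's request-size limits.
```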
What solution would you like?
Introduce a new configuration option (ex. spark.datasource.flint.write.batch_bytes) to enable bytes-based batch sizing. This adaptive approach would dynamically adjust the batch size based on the overall size of the OpenSearch bulk request body, ensuring optimal resource utilization and minimizing memory overhead.
This proposal is similar to maxFilesPerTrigger and maxBytesPerTrigger in Spark built-in file source. We're introducing this enhancement specifically for the FlintWrite sink to provide similar adaptability and efficiency in writing data to the OpenSearch cluster.
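A minimal sketch of what bytes-based flushing could look like inside the writer, assuming the proposed spark.datasource.flint.write.batch_bytes option feeds the maxBytes limit; BulkBuffer and its flush callback are hypothetical names for illustration, not existing Flint APIs:

```scala
import java.nio.charset.StandardCharsets.UTF_8
import scala.collection.mutable.ArrayBuffer

// Hypothetical buffer: flush whenever either limit is reached first, the
// document count (existing batch_size) or the accumulated request-body bytes
// (proposed batch_bytes).
class BulkBuffer(maxDocs: Int, maxBytes: Long, flush: Seq[String] => Unit) {
  private val lines = ArrayBuffer.empty[String]
  private var pendingBytes = 0L

  def add(bulkLine: String): Unit = {
    lines += bulkLine
    pendingBytes += bulkLine.getBytes(UTF_8).length
    if (lines.size >= maxDocs || pendingBytes >= maxBytes) {
      flush(lines.toSeq)
      lines.clear()
      pendingBytes = 0L
    }
  }

  // Flush whatever is left when the partition finishes.
  def close(): Unit = if (lines.nonEmpty) flush(lines.toSeq)
}

// Example: flush every 1000 docs or every 4 MB of bulk payload, whichever comes first.
val buffer = new BulkBuffer(1000, 4L * 1024 * 1024, batch => println(s"flush ${batch.size} docs"))
```

Sizing by the serialized bulk body rather than by document count keeps each _bulk request within a predictable memory budget regardless of how document sizes vary.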
What alternatives have you considered?
Alternative solutions could involve manually adjusting batch sizes based on the expected document sizes. However, this approach:
- is less efficient and more error-prone, especially in dynamic environments where document sizes vary significantly;
- requires changing the batch size config, which is a Spark conf, instead of being configurable per SQL statement.
Do you have any additional context?
Currently, FlintWrite is instantiated per partition, which makes it necessary to optimize throughput globally. A minimal requirement is to avoid over-pressuring the OpenSearch cluster.
The default value of Flint's spark.datasource.flint.write.refresh_policy is wait_for. For _bulk, Flint could use false, which returns immediately without waiting for the refresh to finish.
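A hedged example of the suggested change, assuming an active SparkSession as spark (the conf key appears above; whether the string "false" is accepted verbatim as a value is an assumption to verify):

```scala
// wait_for (current default) blocks each _bulk response until a refresh makes the
// documents searchable; "false" returns as soon as the bulk request completes, so
// newly written documents only become searchable after the next periodic refresh.
spark.conf.set("spark.datasource.flint.write.refresh_policy", "false")
```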