
[SPARK-49673][CONNECT] Increase CONNECT_GRPC_ARROW_MAX_BATCH_SIZE to 0.7 * CONNECT_GRPC_MAX_MESSAGE_SIZE #48122

Closed

Conversation

dillitz
Contributor

@dillitz dillitz commented Sep 16, 2024

What changes were proposed in this pull request?

Increases the default `maxBatchSize` from 0.7 * 4MiB to 0.7 * 128MiB (= 0.7 * `CONNECT_GRPC_MAX_MESSAGE_SIZE`), making better use of the allowed maximum message size.
This limit is used when creating Arrow batches for the `SqlCommandResult` in the `SparkConnectPlanner` and for `ExecutePlanResponse.ArrowBatch` in `processAsArrowBatches`. For example, this lets us return much larger `LocalRelation`s in the `SqlCommandResult` (e.g., for the `SHOW PARTITIONS` command) while still staying within the gRPC message size limit.
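As a rough illustration (plain Python arithmetic, not the Spark source), the old and new default batch-size caps work out to roughly 2.8 MiB and 89.6 MiB respectively, with the 0.7 factor leaving headroom below the gRPC message size limit:

```python
# Back-of-the-envelope sketch of how the default Arrow batch size relates
# to the gRPC message size limit before and after this change.
MIB = 1024 * 1024

CONNECT_GRPC_MAX_MESSAGE_SIZE = 128 * MIB  # 128MiB gRPC message size limit

# Before: the batch size was capped relative to the 4MiB gRPC default.
old_max_batch_size = int(4 * MIB * 0.7)

# After: the cap is derived from CONNECT_GRPC_MAX_MESSAGE_SIZE, keeping
# ~30% headroom for protobuf framing and other fields in the response.
new_max_batch_size = int(CONNECT_GRPC_MAX_MESSAGE_SIZE * 0.7)

print(old_max_batch_size / MIB)  # ~2.8 MiB
print(new_max_batch_size / MIB)  # ~89.6 MiB
```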

Why are the changes needed?

There are SqlCommandResults that exceed 0.7 * 4MiB.

Does this PR introduce any user-facing change?

`SqlCommandResult`s up to 0.7 * 128MiB are now supported instead of only up to 0.7 * 4MiB, and `ExecutePlanResponse`s will now make better use of the 128MiB limit.

How was this patch tested?

Existing tests.

Was this patch authored or co-authored using generative AI tooling?

No.

@dillitz dillitz changed the title [SPARK-49673] Increase maxBatchSize to 0.7 * CONNECT_GRPC_MAX_MESSAGE_SIZE for sqlCommandResult [SPARK-49673] Increase maxBatchSize to 0.7 * CONNECT_GRPC_MAX_MESSAGE_SIZE in handleSqlCommand Sep 16, 2024
@dillitz dillitz changed the title [SPARK-49673] Increase maxBatchSize to 0.7 * CONNECT_GRPC_MAX_MESSAGE_SIZE in handleSqlCommand [SPARK-49673] Increase CONNECT_GRPC_ARROW_MAX_BATCH_SIZE to 0.7 * CONNECT_GRPC_MAX_MESSAGE_SIZE Sep 17, 2024
@dillitz dillitz requested a review from HyukjinKwon September 17, 2024 08:58
Contributor

@nija-at nija-at left a comment


Any existing tests that use the new limits to ensure that everything still works?

@dillitz
Contributor Author

dillitz commented Sep 17, 2024

Any existing tests that use the new limits to ensure that everything still works?

I've added an E2E test that explicitly exercises multiple batches, and lowered the static limit to 10MiB there to avoid creating too much memory pressure.

@dillitz dillitz requested a review from nija-at September 17, 2024 12:16
@HyukjinKwon HyukjinKwon changed the title [SPARK-49673] Increase CONNECT_GRPC_ARROW_MAX_BATCH_SIZE to 0.7 * CONNECT_GRPC_MAX_MESSAGE_SIZE [SPARK-49673][CONNECT] Increase CONNECT_GRPC_ARROW_MAX_BATCH_SIZE to 0.7 * CONNECT_GRPC_MAX_MESSAGE_SIZE Sep 18, 2024
@HyukjinKwon
Member

Merged to master.

attilapiros pushed a commit to attilapiros/spark that referenced this pull request Oct 4, 2024
Closes apache#48122 from dillitz/increase-sql-command-batch-size.

Authored-by: Robert Dillitz <robert.dillitz@databricks.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
himadripal pushed a commit to himadripal/spark that referenced this pull request Oct 19, 2024
Closes apache#48122 from dillitz/increase-sql-command-batch-size.

Authored-by: Robert Dillitz <robert.dillitz@databricks.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>