[BUG] adaptive query executor and delta optimized table writes don't work on databricks #1059
Hey @martinstuder, thanks for reporting this. We don't support the Databricks 7.2 runtime or AQE with the current 0.2 release. Would you be able to try with the 7.0 runtime with AQE off, since AQE isn't supported there? We are working on support for the Databricks 7.3 runtime in the next 0.3 release, which will hopefully support AQE. You probably already found them, but here are our Databricks docs for reference: https://nvidia.github.io/spark-rapids/docs/get-started/getting-started-databricks.html. I'll also try to reproduce this locally.
Please also turn off CBO, as I don't know that we have tested with that.
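A minimal sketch of turning both features off at the session level, using the standard Spark SQL confs (they can also be set in the cluster's Spark config):

```python
# Disable adaptive query execution and the cost-based optimizer for this session.
spark.conf.set("spark.sql.adaptive.enabled", "false")
spark.conf.set("spark.sql.cbo.enabled", "false")
```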
Hi @tgravescs, thanks for your feedback. I tried with DBR 7.0 with AQE and CBO disabled, and I can confirm that I run into the same issue. I should add, though, that I'm running Databricks Container Services (DCS) using a "GPU-enabled" base image similar to the ones provided by Databricks (see here). So I'm running the non-ML version of the Databricks runtime with
Thanks. I don't think the container service should matter here as long as it's using the same Databricks jar for the 7.0 runtime. I'm trying to reproduce locally, but so far no luck. It might be related to the data (size or schema), or perhaps I'm not doing the write operations beforehand. Since it seems to be stuck in the stats estimator, it might be more related to the data itself, so I'll try some larger data.
Looks like the issue is related to Delta optimized writes. The pipeline is basically as follows:
The resulting
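The original pipeline steps did not survive in this copy. A hypothetical sketch of a pipeline that exercises Delta optimized writes, using the documented `delta.autoOptimize.optimizeWrite` table property (table name, schema, and data are illustrative):

```python
# Assumes a Databricks notebook where `spark` is predefined.
# Create a Delta table with optimized writes enabled on the table itself.
spark.sql("""
    CREATE TABLE IF NOT EXISTS events (id BIGINT, payload STRING)
    USING delta
    TBLPROPERTIES (delta.autoOptimize.optimizeWrite = true)
""")

# Append data; the optimized-write shuffle kicks in on this write.
df = spark.range(0, 1_000_000).selectExpr("id", "CAST(id AS STRING) AS payload")
df.write.format("delta").mode("append").saveAsTable("events")
```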
Thanks, it's great that you found what was causing this. It looks like their optimized writes are basically an adaptive shuffle, which is really similar to AQE, which unfortunately we don't support right now. I'm assuming that just changing the number of partitions won't work for you if previous stages require more partitions to be efficient? It's likely I couldn't reproduce it because I hadn't set up the optimized-write configs on the table. Hopefully I'll be able to reproduce it now and see if there is anything else I can recommend. Disabling GPU-accelerated Parquet writes should cause any Parquet write to fall back to the CPU, bypassing the issue for now; see the sketch below. If you are using another format like ORC, there are similar flags to disable.
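The exact flag was lost from this copy of the comment; spark-rapids does document per-format write switches, so the suggested workaround presumably looked something like this sketch:

```python
# Keep Parquet writes on the CPU so the GPU plan never reaches the failing path.
spark.conf.set("spark.rapids.sql.format.parquet.write.enabled", "false")
# The analogous switch for ORC output:
# spark.conf.set("spark.rapids.sql.format.orc.write.enabled", "false")
```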
Hi team, I can reproduce the StackOverflowError with ONLY AQE on (nothing to do with Delta write optimize). Here is my minimal reproduction, based on Databricks 8.2 ML GPU + 21.10 snapshot jars:
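The reproduction script itself did not survive in this copy. A hypothetical minimal repro in the same spirit (the column names, sizes, and output path are illustrative) would force a shuffle that AQE can re-plan and then write Delta:

```python
from pyspark.sql import functions as F

# Assumes a Databricks notebook (`spark` predefined) with the RAPIDS
# Accelerator installed and adaptive query execution turned on.
spark.conf.set("spark.sql.adaptive.enabled", "true")

df = spark.range(0, 10_000_000).withColumn("key", F.col("id") % 1000)
agg = df.groupBy("key").agg(F.count("id").alias("cnt"))  # shuffle AQE can re-plan

agg.write.format("delta").mode("overwrite").save("/tmp/aqe_repro")  # illustrative path
```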
Then it fails with a StackOverflowError.
Describe the bug
When attempting to write a Delta table using PySpark on Azure Databricks 7.2, I get the following exception (reduced; full exception attached):
Steps/Code to reproduce bug
The issue happens with CBO and AQE both disabled and enabled. Increasing the driver stack size to 4 MB (-Xss4m) does not help. Without the RAPIDS Accelerator For Apache Spark, the export succeeds.
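For context, the driver JVM's stack size cannot be changed from a running notebook; on Databricks the -Xss4m attempt would typically go in the cluster's Spark config as `spark.driver.extraJavaOptions`. A sketch of the equivalent setting when building a SparkSession yourself:

```python
from pyspark.sql import SparkSession

# -Xss4m raises each driver thread's stack to 4 MB; it must be applied before
# the driver JVM starts, so set it at launch time, not at runtime.
spark = (
    SparkSession.builder
    .config("spark.driver.extraJavaOptions", "-Xss4m")
    .getOrCreate()
)
```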
Expected behavior
Delta export succeeds.
Environment details (please complete the following information)
Spark configuration settings:
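The actual settings block was not captured in this copy. A representative configuration, assembled from the options discussed in the thread (all values are illustrative assumptions, not the reporter's actual settings):

```python
# Illustrative values only; the reporter's real configuration was lost.
spark.conf.set("spark.sql.adaptive.enabled", "true")                    # AQE, the suspected trigger
spark.conf.set("spark.sql.cbo.enabled", "true")                         # CBO, also suggested to disable
spark.conf.set("spark.databricks.delta.optimizeWrite.enabled", "true")  # Delta optimized writes
spark.conf.set("spark.rapids.sql.enabled", "true")                      # RAPIDS SQL acceleration active
# The plugin itself is enabled at the cluster level, e.g.:
#   spark.plugins com.nvidia.spark.SQLPlugin
```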
Additional context
full_exception.txt