Initial support for Data Maintenance #9
Conversation
Signed-off-by: Allen Xu <allxu@nvidia.com>
When running the LF_SR function, we see
Cause:
but in the DM data "s_store_returns" definition (official TPCDS_TOOLKIT/tools/tpcds_source.sql, old version: https://github.com/gregrahn/tpcds-kit/blob/master/tools/tpcds_source.sql#L356):
Those definitions are from the TPC-DS Spec.
According to the Spec:
I can fix this by changing
nds/convert_submit_cpu.template (outdated)
```diff
@@ -21,5 +21,13 @@ export SPARK_CONF=("--master" "yarn"
 "--num-executors" "8"
 "--executor-memory" "40G"
 "--executor-cores" "12"
-"--conf" "spark.task.cpus=1")
+"--conf" "spark.task.cpus=1"
+"--packages" "org.apache.iceberg:iceberg-spark-runtime-3.2_2.12:0.13.1"
```
nit: the first character seems misaligned.
nds/convert_submit_cpu.template (outdated)
```diff
@@ -21,5 +21,13 @@ export SPARK_CONF=("--master" "yarn"
 "--num-executors" "8"
 "--executor-memory" "40G"
 "--executor-cores" "12"
-"--conf" "spark.task.cpus=1")
+"--conf" "spark.task.cpus=1"
+"--packages" "org.apache.iceberg:iceberg-spark-runtime-3.2_2.12:0.13.1"
```
Shall we consider the case of not using Iceberg?
In that case, the Iceberg-specific parameters should be enabled via some option, as in the sketch below.
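One hedged way to do that, sketched in Python; the flag name (`--use-iceberg`) and helper function are hypothetical illustrations, not this repo's actual scripts:

```python
# Hypothetical sketch: gate the Iceberg-specific spark-submit arguments behind
# a command-line flag, so non-Iceberg runs don't pull in the runtime jar.
import argparse

BASE_CONF = [
    "--master", "yarn",
    "--num-executors", "8",
    "--executor-memory", "40G",
    "--executor-cores", "12",
    "--conf", "spark.task.cpus=1",
]

ICEBERG_CONF = [
    "--packages", "org.apache.iceberg:iceberg-spark-runtime-3.2_2.12:0.13.1",
    "--conf", "spark.sql.extensions="
              "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions",
]

def build_spark_submit_args(use_iceberg: bool) -> list:
    """Return spark-submit arguments, appending Iceberg ones only when requested."""
    return BASE_CONF + (ICEBERG_CONF if use_iceberg else [])

if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("--use-iceberg", action="store_true",
                        help="add the Iceberg runtime and SQL extensions to the Spark conf")
    args = parser.parse_args()
    print(build_spark_submit_args(args.use_iceberg))
```

Driving the optional packages from a single flag keeps the non-Iceberg path free of the extra runtime jar.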
@jlowe What's our plan/strategy for doing the WHOLE NDS test? Are we going to do it all on Iceberg? For #8 I made the transcode step save the data ONLY to Iceberg; should we keep our old way of saving it to a plain folder? The old way may be friendlier to users who don't know Iceberg, but eventually, if they want to perform the whole NDS test including Data Maintenance, they will have to come back and do the Iceberg writing anyway... Any suggestions for this?
To do the entire NDS suite, one would need to use a format that supports the entire set of operations required by the suite. That would mean using Iceberg, Delta Lake, or some other format that allows incremental table update.

> I've made the transcode step to save the data ONLY to Iceberg

That's not desired. We want to support transcoding to a bunch of different formats, because we're not always going to run the entire suite. We get a lot of useful information from running the significant portion of NDS that works on raw Parquet and ORC files, and we do not want to lose the ability to set up those benchmarks. The transcode needs to be flexible, ideally allowing output to every major format that we want to bench. For now that definitely includes raw Parquet and ORC along with Iceberg (and the ability to control settings for these formats, such as compression codec, probably via separate configs specified either inline or sideband in the Spark instance to use).
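To make that concrete, here is a minimal PySpark sketch of what a format dispatch in the transcode writer could look like; the function signature, catalog name (`local`), and namespace (`nds`) are assumptions for illustration, not the actual nds_transcode.py API:

```python
# A minimal sketch of a format-flexible transcode writer.
from pyspark.sql import DataFrame

def write_table(df: DataFrame, table: str, output_format: str,
                output_prefix: str, compression: str = "snappy") -> None:
    """Write one transcoded table in the requested output format."""
    if output_format in ("parquet", "orc"):
        # Raw files on the filesystem: enough for the query-only portion of NDS.
        (df.write
           .format(output_format)
           .option("compression", compression)
           .mode("overwrite")
           .save(f"{output_prefix}/{table}"))
    elif output_format == "iceberg":
        # Iceberg tables support the incremental updates Data Maintenance needs.
        # Assumes an Iceberg catalog named `local` is configured on the session.
        df.writeTo(f"local.nds.{table}").using("iceberg").createOrReplace()
    else:
        raise ValueError(f"unsupported output format: {output_format}")
```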
Signed-off-by: Allen Xu <allxu@nvidia.com>
Signed-off-by: Allen Xu <allxu@nvidia.com>
For DELETE functions, e.g.
results in
I can break the SQL into
and this works.
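For context, a hedged reconstruction of that two-step workaround, assuming the failing statement is a fact-table DELETE whose keys come from a date_dim subquery (the actual SQL and date range are not shown above, so the table names and dates here are illustrative TPC-DS ones):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("dm-delete-workaround").getOrCreate()

# Single-statement form that reportedly fails to plan on Spark 3.2.0/3.2.1:
# DELETE FROM store_sales WHERE ss_sold_date_sk IN
#   (SELECT d_date_sk FROM date_dim
#    WHERE d_date BETWEEN '1999-01-01' AND '1999-01-07')

# Step 1: resolve the subquery separately into concrete surrogate keys.
rows = spark.sql(
    "SELECT d_date_sk FROM date_dim "
    "WHERE d_date BETWEEN '1999-01-01' AND '1999-01-07'"
).collect()
date_keys = ", ".join(str(r.d_date_sk) for r in rows)

# Step 2: run the DELETE with the keys inlined as literals (no subquery to plan).
if date_keys:
    spark.sql(f"DELETE FROM store_sales WHERE ss_sold_date_sk IN ({date_keys})")
```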
Signed-off-by: Allen Xu <allxu@nvidia.com>
This seems like potentially a bug in Spark 3.2, especially since the error is such a low-level class cast error. I was able to get the same query to plan without an error on Spark 3.1.2.

One issue for our use case: we want to use Spark 3.2.1 as our NDS 2.0 benchmark environment due to performance considerations, especially for query77 (there's a huge performance drop in 3.1.2).

Filed a Spark issue: https://issues.apache.org/jira/browse/SPARK-39454. Update: it's said that this will be fixed in Spark 3.3.0 and Spark 3.2.2.

I verified on Spark 3.2.2; it works.
Signed-off-by: Allen Xu <allxu@nvidia.com>
Signed-off-by: Allen Xu <allxu@nvidia.com>
This is close. Minor comments on documentation, and waiting to hear back on the copyright/license question.
Co-authored-by: Jason Lowe <jlowe@nvidia.com>
Signed-off-by: Allen Xu <allxu@nvidia.com>
Signed-off-by: Allen Xu <allxu@nvidia.com>
Signed-off-by: Allen Xu <allxu@nvidia.com>

This PR adds initial support for part of the Data Maintenance work.

Data Maintenance requires ACID operations such as INSERT and DELETE, and Spark currently doesn't provide native support for them, so we chose Iceberg as the data source metadata manager (see the session-configuration sketch at the end of this description).
With this change, we will:

Fixes #4, #8
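For reference, a minimal sketch of wiring Iceberg into a Spark session so those ACID statements work. It assumes the iceberg-spark-runtime package from the templates is on the classpath; the catalog name (`local`), namespace, and warehouse path are illustrative, not this PR's actual settings:

```python
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("nds-data-maintenance")
    # Iceberg SQL extensions enable DML such as DELETE FROM and MERGE INTO.
    .config("spark.sql.extensions",
            "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions")
    # A Hadoop-type catalog backed by a plain filesystem/HDFS warehouse path.
    .config("spark.sql.catalog.local", "org.apache.iceberg.spark.SparkCatalog")
    .config("spark.sql.catalog.local.type", "hadoop")
    .config("spark.sql.catalog.local.warehouse", "/tmp/nds-iceberg-warehouse")
    .getOrCreate()
)

# Example ACID operations Data Maintenance relies on:
spark.sql("CREATE TABLE IF NOT EXISTS local.nds.t (id INT, v STRING) USING iceberg")
spark.sql("INSERT INTO local.nds.t VALUES (1, 'a'), (2, 'b')")
spark.sql("DELETE FROM local.nds.t WHERE id = 1")
```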