diff --git a/docs/tuning-guide.md b/docs/tuning-guide.md
index e6c9da364f7..c7f3ca7a17e 100644
--- a/docs/tuning-guide.md
+++ b/docs/tuning-guide.md
@@ -272,7 +272,7 @@ from the main [columnar batch size](#columnar-batch-size) setting. Some transco
 load CSV files then write Parquet files) need to lower this setting when using large task input
 partition sizes to avoid GPU out of memory errors.
 
-### Enable Incompatible Operations
+## Enable Incompatible Operations
 
 Configuration key: [`spark.rapids.sql.incompatibleOps.enabled`](configs.md#sql.incompatibleOps.enabled)
 
@@ -331,4 +331,63 @@ Custom Spark SQL Metrics are available which can help identify performance bottl
 Not all metrics are enabled by default. The configuration setting
 `spark.rapids.sql.metrics.level` can be set to `DEBUG`, `MODERATE`, or
 `ESSENTIAL`, with `MODERATE` being the default value. More information about this
-configuration option is available in the configuration documentation.
\ No newline at end of file
+configuration option is available in the configuration documentation.
+
+## Window Operations
+
+Apache Spark supports a few optimizations for different window patterns. In general, Spark
+buffers all of the data for a partition-by key in memory and then loops through the rows
+looking for window boundaries. When it finds a boundary change, it calculates the aggregation
+for that window. This ends up being `O(N^2)`, where `N` is the size of the window. In a few
+cases Spark can improve on that. These optimizations, each of which is shown in the sketch
+after this list, include:
+
+ * Lead/Lag. Spark keeps an offset pointer and can output the result from the buffered data
+   in linear time. Lead and lag only support row-based windows and set up the row ranges
+   automatically based on the requested offset.
+ * Unbounded preceding to unbounded following. Spark does a single aggregation and duplicates
+   the result for every row. This works for both row-based and range-based windows, and the
+   calculation is identical in both cases because the window covers the entire partition-by
+   group.
+ * Unbounded preceding to some specific bound. Spark keeps running state as it walks through
+   each row and outputs an updated result each time. This also works for both row-based and
+   range-based windows. Row-based queries add one row at a time; range-based queries add rows
+   until the order-by column changes.
+ * Some specific bound to unbounded following. Spark still recalculates the aggregation for
+   each window group. The complexity of this is `O(N^2)`, but it only has to check the lower
+   bound when doing the aggregation. This also works for row-based and range-based windows,
+   and for row-based windows the improvement is minimal because each step just removes a
+   single row instead of checking the order-by columns for equality.
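+
+As a minimal sketch of how each of these frame patterns is expressed with the DataFrame API,
+the following Scala snippet assumes a SparkSession named `spark` (as in `spark-shell`) and
+hypothetical data with columns `key`, `ts`, and `value`:
+
+```scala
+import org.apache.spark.sql.expressions.Window
+import org.apache.spark.sql.functions._
+import spark.implicits._
+
+// Hypothetical example data.
+val df = Seq(("a", 1, 10L), ("a", 2, 20L), ("b", 1, 5L)).toDF("key", "ts", "value")
+val byKey = Window.partitionBy("key").orderBy("ts")
+
+// Lead/Lag: a fixed row offset, answered from the buffered data in linear time.
+val withLag = df.withColumn("prev", lag("value", 1).over(byKey))
+
+// Unbounded preceding to unbounded following: one aggregation per group,
+// duplicated for every row in the group.
+val whole = byKey.rowsBetween(Window.unboundedPreceding, Window.unboundedFollowing)
+val withTotal = df.withColumn("total", sum("value").over(whole))
+
+// Unbounded preceding to a specific bound (here the current row): running state.
+val running = byKey.rowsBetween(Window.unboundedPreceding, Window.currentRow)
+val withRunningSum = df.withColumn("running_sum", sum("value").over(running))
+
+// A specific bound to unbounded following: still O(N^2), but only the lower bound moves.
+val tail = byKey.rowsBetween(Window.currentRow, Window.unboundedFollowing)
+val withTailSum = df.withColumn("tail_sum", sum("value").over(tail))
+```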
+
+Some proprietary implementations have further optimizations. For example, Databricks has a
+special case for running windows (rows between unbounded preceding and the current row) that
+avoids caching the entire window in memory and instead caches only the running state between
+rows.
+
+CUDF and the RAPIDS Accelerator do not have the same set of optimizations yet, so performance
+can differ based on the window sizes and the aggregation operations. Most of the time the
+window size is small enough that the parallelism of the GPU can offset the difference in the
+complexity of the algorithm and beat the CPU. In the general case, if `N` is the size of the
+window and `G` is the parallelism of the GPU, then the complexity of a window operation is
+`O(N^2/G)`. The main optimization currently supported by the RAPIDS Accelerator is for running
+windows (rows between unbounded preceding and the current row), and only for a specific set of
+aggregations (a sketch of such a query appears at the end of this section):
+
+ * MIN
+ * MAX
+ * SUM
+ * COUNT
+ * ROW_NUMBER
+ * RANK
+ * DENSE_RANK
+
+For these operations the GPU can use specialized hardware to do the computation in
+approximately `O(N/G * log(N))` time. The details of how this works are a bit complex, but
+they are described somewhat generally
+[here](https://developer.nvidia.com/gpugems/gpugems3/part-vi-gpu-computing/chapter-39-parallel-prefix-sum-scan-cuda).
+
+Some aggregations, such as `count` on a non-nullable column/value, `lead`, or `lag`, can be
+done in constant time, which brings the result down to approximately `O(N/G)` time. In all
+other cases large windows, including those caused by skewed values in the partition-by and
+order-by data, can result in slow performance. If you do run into one of these situations,
+please file an [issue](https://github.com/NVIDIA/spark-rapids/issues/new/choose) so we can
+properly prioritize our work to support more optimizations.
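+
+To illustrate the optimized case, here is a minimal sketch of a running-window query that uses
+only the aggregations listed above. The `sales` data and its `store`, `day`, and `amount`
+columns are hypothetical:
+
+```scala
+import org.apache.spark.sql.expressions.Window
+import org.apache.spark.sql.functions._
+import spark.implicits._
+
+// Hypothetical example data.
+val sales = Seq(("s1", 1, 100L), ("s1", 2, 50L), ("s2", 1, 75L))
+  .toDF("store", "day", "amount")
+
+// A running window: rows between unbounded preceding and the current row.
+val running = Window
+  .partitionBy("store")
+  .orderBy("day")
+  .rowsBetween(Window.unboundedPreceding, Window.currentRow)
+
+// SUM, COUNT, and ROW_NUMBER are all in the set of aggregations above, so
+// each column can be computed with a scan-based algorithm instead of
+// re-aggregating every window.
+val result = sales
+  .withColumn("running_total", sum("amount").over(running))
+  .withColumn("sale_count", count("amount").over(running))
+  .withColumn("row_num", row_number().over(running))
+```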