Gluten 0.5.0
Pre-release
Pre-release
Change log
Generated on 2023-04-07
Gluten 0.5.0
Gluten 0.5.0 is the 1st preview release from the repository(https://github.com/oap-project/gluten).
In this release, we have merged 971 PRs and fixed 216 issues.
Here is the major highlight in Gluten 0.5.0:
- Support Spark3.2 and Spark3.3
- Support Ubuntu20.04 or later
- Support CentOS7 and 8
- Support JDK8 only
- Support GCC9 or later
- Use Substrait as unified plan
- Use Velox as default backend engine
- Use Celeborn as default RSS
- Support most popular data types including Boolean, Byte, Short, Int, Long, Float, Double, Date, Decimal, String, ...etc.
- Support Spill for Sort, Agg, and Join operators
- Run Pass all Spark3.2 Unit Test
- 2.5x speedup in Decision Support Benchmark1(TPC-H Like) testing
- 2x speedup in Decision Support Benchmark2(TPC-DS Like) testing
- Support Intel QAT accelerators in Shuffle compression
Limitations
- Not Support Complex data type such as Array, Map, Struct
- OOM happened in some operators not support Spill
- Decimal result may mismatch in some cases
Features
#974 | [CH] Supprt string repeat function |
#1008 | [CH] Support locate function |
#1273 | Implement cast decimal to int |
#1223 | [CH] support reading from S3 and using Clickhouse local cache to speed up |
#1131 | [Gluten-core] Add an option to only fallback once |
#1165 | Reduce GC Time when executing BHJ for CH backend. |
#1147 | [Gluten-core]Make validate failure logLevel configuable |
#1100 | Making transformer plan log more obvious |
#1112 | Refactor Gluten metrics and add apis for each backend |
#926 | gluten timezone not the same as backend |
#1039 | Remove compute pid metric in shuffle operator. |
#882 | Selective query execution |
#959 | Upgrade Arrow version to 11.0.0 |
#969 | Docker for gluten running on centos 8 |
#986 | Align and enrich metrics compare to Spark |
#972 | Can we separate native dynamic library from build generated jars? |
#913 | No Spark Shim Provider found for 3.2.0 |
#853 | Support named struct type |
#888 | Clickhouse backend broadcast relation support r2c |
#850 | Add cast check in ExpressionTransformer |
#825 | Setup development environment for macOS |
#788 | Pass needed hadoop conf from driver to executor |
Bugs Fixed
#1284 | Scala double data is wronlgy compared with null in a ut |
#729 | Validation failed for GlutenHashAggregateExecTransformer class |
#799 | This operator doesn't support doExecuteColumnar |
#527 | archives for Spark patch versions become unavailable on new releases affecting shims versioning |
#523 | Some basic failed SQL cases |
#1028 | [VL] SusbtraitToVeloxPlan error |
#858 | Sort result mismatch issue with different input records. |
#877 | Array/Map DataType result mismatch issue when containing null value |
#1227 | [CH] Scalar subquery filters execute twice for parquet file |
#1265 | [CH] Rescale decimal trigger fallback |
#1233 | [CH] Fix fallback issue when reading csv files |
#1235 | [CH] Fix missing reading from the broadcasted value when executing DPP |
#1234 | [CH] Fix error 'Invalid number of columns in chunk pushed to OutputPort' when executing hash agg after union all |
#1207 | shims-spark32 and shims-spark33 may be depencied at the same time |
#1161 | Bundle built by buildbundle-veloxbe.sh for Spark3.3 is broken |
#1210 | [CH] Fix the wrong table path of the orders table for TPCH in UT |
#1175 | FileNotFoundException while executing spark jobs -.so files |
#1179 | [VL] CI is failing on boost's checksum |
#1162 | [CK]fix CoaleseBatches metrics |
#1124 | Memory management not suitable with Velox split preload feature. |
#1149 | Run tpc-ds core |
#741 | Handle remainder for the case that its right input is zero |
#1090 | [TPCH][VL] tpch has some query execution error logs but queries could finish and the result is correct |
#1068 | [VL] Managed memory leak in imported Spark UTs |
#772 | Velox does not install folly in centos8 by default, break compile in centos8. |
#789 | Jar conflicts on Arrow and Protobuf between Vanilla Spark and Gluten |
#700 | AARCH64 port of Gluten |
#1027 | [VL] unsupported method |
#1072 | [CH] Fix NPE when executing BatchScanExecTransformer.getInputFilePaths with MergeTree DS V2 |
#489 | cannot build gluten (velox backend) in Amazon Linux 2 |
#1012 | Enable local cache throw exception |
#995 | Fix memory leak for ClickHouse Backend |
#914 | System variables related to Folly could not be found when compiling gluten. |
#990 | Failed to build velox |
#946 | Upgrade arrow version to 10.0.1 |
#860 | CH backend inset result not equals spark result |
#601 | Can't decide data type of null value in gluten test framework, when transforming InteralRow to DataFrame |
#843 | Unable to convert BHJ to SHJ by using hint |
#826 | ch_backend not support inset is empty |
#815 | Gluten + Velox backend does not support Struct dataset with same element name. |
#563 | Error compiling within -Pbackends-xx,spark-3.3,spark-ut |
#560 | An unsupportedOperationException interrupted the query execution |
#770 | VeloxRuntimeError when reading parquet file with only meta data |
#800 | [UT]ExpectedAnswer may not match SparkAnswer when is sorted |
#676 | WholeStageTransformerSuite#logForFailedTest() swallows exceptions |
#790 | Join RuntimeException when having duplicated equal-join keys |
#757 | Parquet scan not offloaded |
#797 | It won't load the libparquet.so.1000 when we use Gluten with Velox backend and run it on the yarn. |
#784 | No Spark Shim Provider found for 3.3.0 |
#547 | Jar conflict issue |
#727 | build from local velox repo doesn't work |
PRs
#1266 | [GLUTEN-1246] [CORE] Fix scale may be negative issue |
#1313 | [VL] Update doc for centos7 install |
#1312 | [CH] Ignore ch backend tpcds suite |
#1198 | [VL] fix: Update Velox setup scripts for centos 7 |
#1294 | [VL] Following #1185, do some clean-ups against Velox + Celeborn CI |
#1196 | [CH-375] Enable ch backend tpcds suite |
#1272 | [VL] Decimal sum support overflow |
#1295 | [VL][doc] Add a developer guide for implementing spark built-in functions in velox |
#1115 | [VL] Enable decimalV2A test |
#1276 | [GLUTEN-1275] [VL] Fix cast double to decimal |
#1289 | [VL] Fix exec_backend_test compile problem introduced by gtest 1.13 |
#1258 | [VL] Add some config options for Velox spill |
#1185 | [VL] Support Celeborn integration testing in Velox CI |
#1285 | [GLUTEN-1284][Fix] Fix an issue in checking null value in a UT |
#1282 | [VL] Refactor Backend's interface and ConfMap |
#1071 | [CH-326] Support partitioning with expressions |
#1241 | Update documents |
#1269 | [VL] Support tpch to use decimal type in Velox CI |
#1259 | [VL] Migrate the Celeborn package to the official version |
#1255 | Minor: Update pull request template with more example inputs |
#1230 | [GLUTEN-1217][VL] UT: Enable UT when partitioned column is decimal type |
#1267 | [VL] Remove Decimal failed SPARK-35955 ut exclude in GlutenDataFrameSuite |
#1260 | [CH-382] Fix bug about collect_list |
#1231 | [VL] Upgrade velox to 2023.3.29 |
#1257 | [VL] Add config to limit the max memory |
#1264 | [VL] Revert "[GLUTEN-1246][CORE][Fix] Fix allowPrecisionLoss false mode an… |
#1253 | [GLUTEN-1199][VL][Fix] Fix npe when task is killed by speculation |
#1248 | [CORE] Centralize local/tmp directory creation |
#1226 | [GLUTEN-1246][CORE][Fix] Fix allowPrecisionLoss false mode and enable some test suites |
#1256 | [VL] Include Velox non-contiguous allocations into Spark memory manager's accountation (patch 2) |
#1200 | [CH] Fix memory leak on coalesce |
#1249 | [GLUTEN-1233] Fix fallback issue when reading csv files |
#1238 | [VL][UT] Enable UT on some cast functions |
#1229 | [VL][UT] Enable UT on decimal when offset being zero |
#1228 | [VL] Include Velox contiguous allocations into Spark memory manager's accountation |
#1242 | [VL][DOC] Update queries to use |
#1247 | [CH] Make sure gluten don't throw exception when some spark confs not set |
#1236 | [GLUTEN-1235] Fix missing reading from the broadcasted value when executing DPP |
#1094 | [VL] Agg support Covariance function |
#1237 | [Gluten-1234] Fix error 'Invalid number of columns in chunk pushed to OutputPort' when executing hash agg after union all |
#1202 | [GLUTEN-1201][CORE] Skip scalar subquery execution during validation |
#1143 | [CH-364] Explicitly invoke JNI_ONUnload when spark driver or executor are shuting down |
#1208 | [VL] Change to use date and decimal type in gluten-it |
#1225 | [CH-384] support reading from S3 and using Clickhouse local cache to … |
#1176 | [VL] Support Decimal type in Gluten |
#1209 | [GLUTEN-1207] Fix a dependency issue for spark shims |
#1206 | [VL] Support using Celeborn in the scenario of switching multiple SparkContexts in the same process of Velox CI |
#1224 | [VL] fix ut failure by groupings size validation fix |
#1219 | [VL]Add Meituan email in contact. |
#1215 | [VL] Clean up EP build scripts |
Usage
Setup proxy on necessary by
--conf "spark.driver.extraJavaOptions=-Dhttp.proxyHost=<> -Dhttp.proxyPort=<> -Dhttps.proxyHost=<> -Dhttps.proxyPort=<>"
spark-shell --name run_gluten \
--master yarn --deploy-mode client \
--conf spark.plugins=io.glutenproject.GlutenPlugin \
--conf spark.gluten.sql.columnar.backend.lib=velox \
--conf spark.memory.offHeap.enabled=true \
--conf spark.memory.offHeap.size=20g \
--conf spark.gluten.loadLibFromJar=true \
--conf spark.shuffle.manager=org.apache.spark.shuffle.sort.ColumnarShuffleManager \
--jars https://github.com/oap-project/gluten/releases/download/0.5.0/gluten-velox-bundle-spark3.2_2.12-ubuntu_20.04-0.5.0-SNAPSHOT.jar,https://github.com/oap-project/gluten/releases/download/0.5.0/gluten-thirdparty-lib-ubuntu-20.04.jar