Skip to content

Gluten 0.5.0

Pre-release
Pre-release
Compare
Choose a tag to compare
@zhejiangxiaomai zhejiangxiaomai released this 07 Apr 09:32
· 3430 commits to main since this release
3c3267a

Change log

Generated on 2023-04-07

Gluten 0.5.0

Gluten 0.5.0 is the 1st preview release from the repository(https://github.com/oap-project/gluten).
In this release, we have merged 971 PRs and fixed 216 issues.

Here is the major highlight in Gluten 0.5.0:

  • Support Spark3.2 and Spark3.3
  • Support Ubuntu20.04 or later
  • Support CentOS7 and 8
  • Support JDK8 only
  • Support GCC9 or later
  • Use Substrait as unified plan
  • Use Velox as default backend engine
  • Use Celeborn as default RSS
  • Support most popular data types including Boolean, Byte, Short, Int, Long, Float, Double, Date, Decimal, String, ...etc.
  • Support Spill for Sort, Agg, and Join operators
  • Run Pass all Spark3.2 Unit Test
  • 2.5x speedup in Decision Support Benchmark1(TPC-H Like) testing
  • 2x speedup in Decision Support Benchmark2(TPC-DS Like) testing
  • Support Intel QAT accelerators in Shuffle compression

Limitations

  • Not Support Complex data type such as Array, Map, Struct
  • OOM happened in some operators not support Spill
  • Decimal result may mismatch in some cases

Features

#974 [CH] Supprt string repeat function
#1008 [CH] Support locate function
#1273 Implement cast decimal to int
#1223 [CH] support reading from S3 and using Clickhouse local cache to speed up
#1131 [Gluten-core] Add an option to only fallback once
#1165 Reduce GC Time when executing BHJ for CH backend.
#1147 [Gluten-core]Make validate failure logLevel configuable
#1100 Making transformer plan log more obvious
#1112 Refactor Gluten metrics and add apis for each backend
#926 gluten timezone not the same as backend
#1039 Remove compute pid metric in shuffle operator.
#882 Selective query execution
#959 Upgrade Arrow version to 11.0.0
#969 Docker for gluten running on centos 8
#986 Align and enrich metrics compare to Spark
#972 Can we separate native dynamic library from build generated jars?
#913 No Spark Shim Provider found for 3.2.0
#853 Support named struct type
#888 Clickhouse backend broadcast relation support r2c
#850 Add cast check in ExpressionTransformer
#825 Setup development environment for macOS
#788 Pass needed hadoop conf from driver to executor

Bugs Fixed

#1284 Scala double data is wronlgy compared with null in a ut
#729 Validation failed for GlutenHashAggregateExecTransformer class
#799 This operator doesn't support doExecuteColumnar
#527 archives for Spark patch versions become unavailable on new releases affecting shims versioning
#523 Some basic failed SQL cases
#1028 [VL] SusbtraitToVeloxPlan error
#858 Sort result mismatch issue with different input records.
#877 Array/Map DataType result mismatch issue when containing null value
#1227 [CH] Scalar subquery filters execute twice for parquet file
#1265 [CH] Rescale decimal trigger fallback
#1233 [CH] Fix fallback issue when reading csv files
#1235 [CH] Fix missing reading from the broadcasted value when executing DPP
#1234 [CH] Fix error 'Invalid number of columns in chunk pushed to OutputPort' when executing hash agg after union all
#1207 shims-spark32 and shims-spark33 may be depencied at the same time
#1161 Bundle built by buildbundle-veloxbe.sh for Spark3.3 is broken
#1210 [CH] Fix the wrong table path of the orders table for TPCH in UT
#1175 FileNotFoundException while executing spark jobs -.so files
#1179 [VL] CI is failing on boost's checksum
#1162 [CK]fix CoaleseBatches metrics
#1124 Memory management not suitable with Velox split preload feature.
#1149 Run tpc-ds core
#741 Handle remainder for the case that its right input is zero
#1090 [TPCH][VL] tpch has some query execution error logs but queries could finish and the result is correct
#1068 [VL] Managed memory leak in imported Spark UTs
#772 Velox does not install folly in centos8 by default, break compile in centos8.
#789 Jar conflicts on Arrow and Protobuf between Vanilla Spark and Gluten
#700 AARCH64 port of Gluten
#1027 [VL] unsupported method
#1072 [CH] Fix NPE when executing BatchScanExecTransformer.getInputFilePaths with MergeTree DS V2
#489 cannot build gluten (velox backend) in Amazon Linux 2
#1012 Enable local cache throw exception
#995 Fix memory leak for ClickHouse Backend
#914 System variables related to Folly could not be found when compiling gluten.
#990 Failed to build velox
#946 Upgrade arrow version to 10.0.1
#860 CH backend inset result not equals spark result
#601 Can't decide data type of null value in gluten test framework, when transforming InteralRow to DataFrame
#843 Unable to convert BHJ to SHJ by using hint
#826 ch_backend not support inset is empty
#815 Gluten + Velox backend does not support Struct dataset with same element name.
#563 Error compiling within -Pbackends-xx,spark-3.3,spark-ut
#560 An unsupportedOperationException interrupted the query execution
#770 VeloxRuntimeError when reading parquet file with only meta data
#800 [UT]ExpectedAnswer may not match SparkAnswer when is sorted
#676 WholeStageTransformerSuite#logForFailedTest() swallows exceptions
#790 Join RuntimeException when having duplicated equal-join keys
#757 Parquet scan not offloaded
#797 It won't load the libparquet.so.1000 when we use Gluten with Velox backend and run it on the yarn.
#784 No Spark Shim Provider found for 3.3.0
#547 Jar conflict issue
#727 build from local velox repo doesn't work

PRs

#1266 [GLUTEN-1246] [CORE] Fix scale may be negative issue
#1313 [VL] Update doc for centos7 install
#1312 [CH] Ignore ch backend tpcds suite
#1198 [VL] fix: Update Velox setup scripts for centos 7
#1294 [VL] Following #1185, do some clean-ups against Velox + Celeborn CI
#1196 [CH-375] Enable ch backend tpcds suite
#1272 [VL] Decimal sum support overflow
#1295 [VL][doc] Add a developer guide for implementing spark built-in functions in velox
#1115 [VL] Enable decimalV2A test
#1276 [GLUTEN-1275] [VL] Fix cast double to decimal
#1289 [VL] Fix exec_backend_test compile problem introduced by gtest 1.13
#1258 [VL] Add some config options for Velox spill
#1185 [VL] Support Celeborn integration testing in Velox CI
#1285 [GLUTEN-1284][Fix] Fix an issue in checking null value in a UT
#1282 [VL] Refactor Backend's interface and ConfMap
#1071 [CH-326] Support partitioning with expressions
#1241 Update documents
#1269 [VL] Support tpch to use decimal type in Velox CI
#1259 [VL] Migrate the Celeborn package to the official version
#1255 Minor: Update pull request template with more example inputs
#1230 [GLUTEN-1217][VL] UT: Enable UT when partitioned column is decimal type
#1267 [VL] Remove Decimal failed SPARK-35955 ut exclude in GlutenDataFrameSuite
#1260 [CH-382] Fix bug about collect_list
#1231 [VL] Upgrade velox to 2023.3.29
#1257 [VL] Add config to limit the max memory
#1264 [VL] Revert "[GLUTEN-1246][CORE][Fix] Fix allowPrecisionLoss false mode an…
#1253 [GLUTEN-1199][VL][Fix] Fix npe when task is killed by speculation
#1248 [CORE] Centralize local/tmp directory creation
#1226 [GLUTEN-1246][CORE][Fix] Fix allowPrecisionLoss false mode and enable some test suites
#1256 [VL] Include Velox non-contiguous allocations into Spark memory manager's accountation (patch 2)
#1200 [CH] Fix memory leak on coalesce
#1249 [GLUTEN-1233] Fix fallback issue when reading csv files
#1238 [VL][UT] Enable UT on some cast functions
#1229 [VL][UT] Enable UT on decimal when offset being zero
#1228 [VL] Include Velox contiguous allocations into Spark memory manager's accountation
#1242 [VL][DOC] Update queries to use
#1247 [CH] Make sure gluten don't throw exception when some spark confs not set
#1236 [GLUTEN-1235] Fix missing reading from the broadcasted value when executing DPP
#1094 [VL] Agg support Covariance function
#1237 [Gluten-1234] Fix error 'Invalid number of columns in chunk pushed to OutputPort' when executing hash agg after union all
#1202 [GLUTEN-1201][CORE] Skip scalar subquery execution during validation
#1143 [CH-364] Explicitly invoke JNI_ONUnload when spark driver or executor are shuting down
#1208 [VL] Change to use date and decimal type in gluten-it
#1225 [CH-384] support reading from S3 and using Clickhouse local cache to …
#1176 [VL] Support Decimal type in Gluten
#1209 [GLUTEN-1207] Fix a dependency issue for spark shims
#1206 [VL] Support using Celeborn in the scenario of switching multiple SparkContexts in the same process of Velox CI
#1224 [VL] fix ut failure by groupings size validation fix
#1219 [VL]Add Meituan email in contact.
#1215 [VL] Clean up EP build scripts

Usage

Setup proxy on necessary by

--conf "spark.driver.extraJavaOptions=-Dhttp.proxyHost=<> -Dhttp.proxyPort=<> -Dhttps.proxyHost=<> -Dhttps.proxyPort=<>"
spark-shell --name run_gluten \
 --master yarn --deploy-mode client \
 --conf spark.plugins=io.glutenproject.GlutenPlugin \
 --conf spark.gluten.sql.columnar.backend.lib=velox \
 --conf spark.memory.offHeap.enabled=true \
 --conf spark.memory.offHeap.size=20g \
 --conf spark.gluten.loadLibFromJar=true \
 --conf spark.shuffle.manager=org.apache.spark.shuffle.sort.ColumnarShuffleManager \
 --jars https://github.com/oap-project/gluten/releases/download/0.5.0/gluten-velox-bundle-spark3.2_2.12-ubuntu_20.04-0.5.0-SNAPSHOT.jar,https://github.com/oap-project/gluten/releases/download/0.5.0/gluten-thirdparty-lib-ubuntu-20.04.jar