Skip to content

Releases: apache/incubator-gluten

Gluten v1.0.0

14 Jul 03:07
bfe394b
Compare
Choose a tag to compare

Release Notes - Gluten - Version 1.0.0

Highlights (Velox backend only)

  • Support Spark 3.2 and Spark3.3
  • Run Pass all Velox, Spark3.2 UTs, and partially Spark3.3 UTs
  • Support Ubuntu 20.04/22.04, CentOS 7/8, alinux 3, Anolis 7/8
  • Support FileSystem: localfs, HDFS, S3, OSS (via s3a)
  • Support data types: Primitive type, Decimal, Date, Timestamp
  • Support 20 operators, detail here
  • Support 164 functions, detail here
  • Support native Parquet write
  • Support native ORC read
  • Support Intel® In-memory Analytics Accelerator (IAA/IAX) hardware accelerator in Shuffle compression
  • Support cap-based spill (static memory allocation) for join/agg/sort operator (experimental feature)
  • Support static build method via vcpkg
  • Support local cache (experimental feature)
  • 2.71x speedup in Decision Support Benchmark1 (TPC-H Like) testing
  • 2.29x speedup in Decision Support Benchmark2 (TPC-DS Like) testing
  • Velox code updated to commit
  • Document improvement for support features and configuration

Known Issues

  • Parquet write only support compression.codec, parquet.block.size and parquet.block.rows configurations
  • Velox backend does not support dynamic partition write and bucket write
  • Spill may throw OutOfMemoryExcetpion

New Features

Improvements

Read more

Gluten 0.5.0

07 Apr 09:32
3c3267a
Compare
Choose a tag to compare
Gluten 0.5.0 Pre-release
Pre-release

Change log

Generated on 2023-04-07

Gluten 0.5.0

Gluten 0.5.0 is the 1st preview release from the repository(https://github.com/oap-project/gluten).
In this release, we have merged 971 PRs and fixed 216 issues.

Here is the major highlight in Gluten 0.5.0:

  • Support Spark3.2 and Spark3.3
  • Support Ubuntu20.04 or later
  • Support CentOS7 and 8
  • Support JDK8 only
  • Support GCC9 or later
  • Use Substrait as unified plan
  • Use Velox as default backend engine
  • Use Celeborn as default RSS
  • Support most popular data types including Boolean, Byte, Short, Int, Long, Float, Double, Date, Decimal, String, ...etc.
  • Support Spill for Sort, Agg, and Join operators
  • Run Pass all Spark3.2 Unit Test
  • 2.5x speedup in Decision Support Benchmark1(TPC-H Like) testing
  • 2x speedup in Decision Support Benchmark2(TPC-DS Like) testing
  • Support Intel QAT accelerators in Shuffle compression

Limitations

  • Not Support Complex data type such as Array, Map, Struct
  • OOM happened in some operators not support Spill
  • Decimal result may mismatch in some cases

Features

#974 [CH] Supprt string repeat function
#1008 [CH] Support locate function
#1273 Implement cast decimal to int
#1223 [CH] support reading from S3 and using Clickhouse local cache to speed up
#1131 [Gluten-core] Add an option to only fallback once
#1165 Reduce GC Time when executing BHJ for CH backend.
#1147 [Gluten-core]Make validate failure logLevel configuable
#1100 Making transformer plan log more obvious
#1112 Refactor Gluten metrics and add apis for each backend
#926 gluten timezone not the same as backend
#1039 Remove compute pid metric in shuffle operator.
#882 Selective query execution
#959 Upgrade Arrow version to 11.0.0
#969 Docker for gluten running on centos 8
#986 Align and enrich metrics compare to Spark
#972 Can we separate native dynamic library from build generated jars?
#913 No Spark Shim Provider found for 3.2.0
#853 Support named struct type
#888 Clickhouse backend broadcast relation support r2c
#850 Add cast check in ExpressionTransformer
#825 Setup development environment for macOS
#788 Pass needed hadoop conf from driver to executor

Bugs Fixed

#1284 Scala double data is wronlgy compared with null in a ut
#729 Validation failed for GlutenHashAggregateExecTransformer class
#799 This operator doesn't support doExecuteColumnar
#527 archives for Spark patch versions become unavailable on new releases affecting shims versioning
#523 Some basic failed SQL cases
#1028 [VL] SusbtraitToVeloxPlan error
#858 Sort result mismatch issue with different input records.
#877 Array/Map DataType result mismatch issue when containing null value
#1227 [CH] Scalar subquery filters execute twice for parquet file
#1265 [CH] Rescale decimal trigger fallback
#1233 [CH] Fix fallback issue when reading csv files
#1235 [CH] Fix missing reading from the broadcasted value when executing DPP
#1234 [CH] Fix error 'Invalid number of columns in chunk pushed to OutputPort' when executing hash agg after union all
#1207 shims-spark32 and shims-spark33 may be depencied at the same time
#1161 Bundle built by buildbundle-veloxbe.sh for Spark3.3 is broken
#1210 [CH] Fix the wrong table path of the orders table for TPCH in UT
#1175 FileNotFoundException while executing spark jobs -.so files
#1179 [VL] CI is failing on boost's checksum
#1162 [CK]fix CoaleseBatches metrics
#1124 Memory management not suitable with Velox split preload feature.
#1149 Run tpc-ds core
#741 Handle remainder for the case that its right input is zero
#1090 [TPCH][VL] tpch has some query execution error logs but queries could finish and the result is correct
#1068 [VL] Managed memory leak in imported Spark UTs
#772 Velox does not install folly in centos8 by default, break compile in centos8.
#789 Jar conflicts on Arrow and Protobuf between Vanilla Spark and Gluten
#700 AARCH64 port of Gluten
#1027 [VL] unsupported method
#1072 [CH] Fix NPE when executing BatchScanExecTransformer.getInputFilePaths with MergeTree DS V2
#489 cannot build gluten (velox backend) in Amazon Linux 2
#1012 Enable local cache throw exception
#995 Fix memory leak for ClickHouse Backend
#914 System variables related to Folly could not be found when compiling gluten.
#990 Failed to build velox
#946 Upgrade arrow version to 10.0.1
#860 CH backend inset result not equals spark result
#601 Can't decide data type of null value in gluten test framework, when transforming InteralRow to DataFrame
#843 Unable to convert BHJ to SHJ by using hint
#826 ch_backend not support inset is empty
#815 Gluten + Velox backend does not support Struct dataset with same element name.
#563 Error compiling within -Pbackends-xx,spark-3.3,spark-ut
#560 An unsupportedOperationException interrupted the query execution
#770 VeloxRuntimeError when reading parquet file with only meta data
#800 [UT]ExpectedAnswer may not match SparkAnswer when is sorted
#676 WholeStageTransformerSuite#logForFailedTest() swallows exceptions
#790 Join RuntimeException when having duplicated equal-join keys
#757 Parquet scan not offloaded
#797 It won't load the libparquet.so.1000 when we use Gluten with Velox backend and run it on the yarn.
#784 No Spark Shim Provider found for 3.3.0
#547 Jar conflict issue
#727 build from local velox repo doesn't work

PRs

#1266 [GLUTEN-1246] [CORE] Fix scale may be negative issue
#1313 [VL] Update doc for centos7 install
#1312 [CH] Ignore ch backend tpcds suite
#1198 [VL] fix: Update Velox setup scripts for centos 7
#1294 [VL] Following #1185, do some clean-ups against Velox + Celeborn CI
[#1196](https://github.com/oa...
Read more