14 Jul 03:07

xieqi

bfe394b

Gluten v1.0.0

Release Notes - Gluten - Version 1.0.0

Highlights (Velox backend only)

Support Spark 3.2 and Spark 3.3
Pass all Velox and Spark 3.2 UTs, and part of the Spark 3.3 UTs
Support Ubuntu 20.04/22.04, CentOS 7/8, alinux 3, Anolis 7/8
Support FileSystem: localfs, HDFS, S3, OSS (via s3a)
Support data types: Primitive type, Decimal, Date, Timestamp
Support 20 operators, detail here
Support 164 functions, detail here
Support native Parquet write
Support native ORC read
Support Intel® In-memory Analytics Accelerator (IAA/IAX) hardware accelerator in Shuffle compression
Support cap-based spill (static memory allocation) for join/agg/sort operators (experimental feature)
Support static build method via vcpkg
Support local cache (experimental feature)
2.71x speedup in Decision Support Benchmark1 (TPC-H Like) testing
2.29x speedup in Decision Support Benchmark2 (TPC-DS Like) testing
Velox code updated to commit
Documentation improvements for supported features and configuration

Known Issues

Parquet write only supports the compression.codec, parquet.block.size and parquet.block.rows configurations
Velox backend does not support dynamic partition write or bucket write
Spill may throw OutOfMemoryException
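
The Parquet write limitation above can be checked mechanically before a job is submitted. The following is a minimal sketch, not part of Gluten: the helper name `split_parquet_write_options` is hypothetical, and the three option names are copied as written in the list above (the real keys may carry a longer prefix, which is an assumption this sketch does not resolve).

```python
# Option names exactly as listed under Known Issues; the helper itself is
# hypothetical and does not call any Gluten or Spark API.
SUPPORTED_PARQUET_WRITE_OPTIONS = frozenset({
    "compression.codec",
    "parquet.block.size",
    "parquet.block.rows",
})


def split_parquet_write_options(options):
    """Partition write options into (honored, ignored) dictionaries.

    An option is considered honored only if its name matches one of the
    three configurations named in the Known Issues list.
    """
    honored, ignored = {}, {}
    for key, value in options.items():
        if key in SUPPORTED_PARQUET_WRITE_OPTIONS:
            honored[key] = value
        else:
            ignored[key] = value
    return honored, ignored


if __name__ == "__main__":
    honored, ignored = split_parquet_write_options({
        "compression.codec": "snappy",
        "parquet.page.size": "1048576",
    })
    print("honored:", honored)
    print("ignored:", ignored)
```

Anything reported as ignored is a setting the native writer would not apply according to this release note, so it is worth reviewing before relying on it.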

New Features

[GLUTEN-1243][VL] Support bit_xor aggregate function
[GLUTEN-1245][VL][Feat] Add VeloxParquetFileFormat to support parquet write in velox backend
[GLUTEN-1270][VL][Feat] Support multiple HDFS endpoints
[GLUTEN-1306][VL] feat: Link static depends via vcpkg
[GLUTEN-1306][FOLLOWUP] vcpkg setup script add alinux3 support
[GLUTEN-1346][VL] Support native velox row to column
[GLUTEN-1367] Support running gluten on anolis
[GLUTEN-1371][VL] Support First/Last aggregate functions
[GLUTEN-1374][VL] RangePartitioning supports velox columnar batch
[GLUTEN-1409][VL] feat: Support named_struct in Velox backend
[GLUTEN-1476][VL] Support GetStructField
[GLUTEN-1478] Support ordered result check for MapData
[GLUTEN-1490] refactor substrait literals using generics, and support map/struct/array literals based on it
[GLUTEN-1521][Core] Support to add the customer columnar rules by config
[GLUTEN-1623][VL] Support asinh, acosh, atanh, sec, csc math functions for Velox backend
[GLUTEN-1638][VL] feat: Add hdfs support in parquet write
[GLUTEN-1640] Support judging whether the execution plan has a fallback
[GLUTEN-1654][VL] support approx_count_distinct for velox
[GLUTEN-1658][CORE] feat: Support SparkResourcesUtil.scala in k8s
[GLUTEN-1662][VL] feat: Support InsertIntoHiveDirCommand in velox parquet write
[GLUTEN-1704][VL] Support metrics on splits and row groups by
[GLUTEN-1794][VL] support split preload
[GLUTEN-1860] StructLiteral support null literal
[CORE] Support submit subqueries concurrently to improve scalar subquery performance
[VL] package.sh support centos7 and centos8
[VL] feat: support partial merge phase in aggregation
[VL] package and velox scripts add alinux support
[VL] feat: support more distinct functions
[VL] Support mocking map stage with no input files in micro benchmark
[VL] add support for reading ORC
[VL] add long decimal type support for Orc file format

Improvements

[GLUTEN-842][VL] convert expand op to expand exec in velox
[GLUTEN-842] remove group id transformer
[GLUTEN-1108][VL] Init NativeRowToColumnarJniWrapper with memory pool and schema
[GLUTEN-1199] Avoid throwing exception from destructor of JavaInputStreamAdaptor
[GLUTEN-1205][VL] Rename some class name and dir name for columnar sh…
[GLUTEN-1205][VL] Refactor shuffle partition writer
[GLUTEN-1205][VL] Refactor shuffle partitioner
[GLUTEN-1205][VL][FOLLOWUP] Refactor shuffle partition writer
[GLUTEN-1209][VL] refactor: Refactor Java Celeborn into an independent module
[GLUTEN-1296][VL] Remove some logs in CI
[GLUTEN-1325][VL] Optimize decimal arithmetic
[GLUTEN-1331][CORE] Enable some functions
[GLUTEN-1336][VL] add spark3.3 UT under connector and expression
[GLUTEN-1336][VL] move Spark3.3 Unit tests to separate job
[GLUTEN-1336][VL] add more spark3.3 UT
[GLUTEN-1336][VL] CI: move slow tests into another job for Spark3.3
[GLUTEN-1357][CORE] Change soft-affinity log level from INFO to DEBUG
[GLUTEN-1369][Core] Move config 'spark.gluten.enabled' to GlutenConfig from QueryPlanSelector
[GLUTEN-1393][VL] feat: Change velox pipeline input from arrow to velox ValueStreamNode
[GLUTEN-1407] Let profile control shim version
[GLUTEN-1416][VL] NoSuchMethodError from shaded Arrow
[GLUTEN-1433][VL] feat: offload timestamp scan to Velox - phase 1
[GLUTEN-1433][VL] Enable GlutenStatisticsCollectionSuite
[GLUTEN-1434][VL] Delete some unused files and functions
[GLUTEN-1434][VL] Refactor to add ColumnarBatchIterator
[GLUTEN-1434][VL] Remove unused arrow code and add GLUTEN_CHECK and GLUTEN_DCHECK
[GLUTEN-1458][VL][CI] feat: Adding Spark3.3 w/ Ubuntu22.04 test
[GLUTEN-1476][VL] Enable scan on struct and map types
[GLUTEN-1476][CORE] Use correct field name in struct type
[GLUTEN-1478][VL] enable timestamp expression tests
[GLUTEN-1478] Enable failed UT in GlutenIntervalExpressionsSuite
[GLUTEN-1478][VL] Enable some spark UTs for cast function
[GLUTEN-1478][VL] Enable tests on casting from string to decimal
[GLUTEN-1478][VL] Enable test on casting from decimal to bool
[GLUTEN-1480][DOC] Refactor to enable github pages
[GLUTEN-1491][VL][feat] Refine row_number() method in velox backend
[GLUTEN-1500][VL] feat: Use 0.6 * task memory cap as spill threshold for all spillable operators
[GLUTEN-1500][VL] Implement OOM cap shared by tasks, and spill threshold shared by tasks and operators
[GLUTEN-1500][VL] Integrate with Velox arbitration API
[GLUTEN-1533][VL][Feat] Replace sort agg with gluten hash agg
[[GLUTEN-1534][VL]](https://github.com/oap-proj...
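
The GLUTEN-1500 entry in the list above sets the spill threshold to 0.6 times the task memory cap. As a rough illustration of that arithmetic only (the function name is hypothetical and this is not Gluten's implementation), the threshold for a given cap can be computed as:

```python
# 0.6 comes from the GLUTEN-1500 entry; it is expressed as 6/10 so the
# calculation stays in integer arithmetic and avoids float rounding.
SPILL_NUMERATOR = 6
SPILL_DENOMINATOR = 10


def spill_threshold_bytes(task_memory_cap_bytes: int) -> int:
    """Return 0.6 * task memory cap, rounded down to whole bytes.

    Hypothetical helper for illustration; it only reproduces the ratio
    stated in the release note.
    """
    if task_memory_cap_bytes < 0:
        raise ValueError("task memory cap must be non-negative")
    return task_memory_cap_bytes * SPILL_NUMERATOR // SPILL_DENOMINATOR


if __name__ == "__main__":
    one_gib = 1 << 30
    print(spill_threshold_bytes(one_gib))
```

Under this ratio, a task capped at 1 GiB would start spilling once a spillable operator's usage passes roughly 614 MiB.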

Contributors

zhouyuan, xieqi, and 40 other contributors

07 Apr 09:32

zhejiangxiaomai

0.5.0

3c3267a

Gluten 0.5.0 Pre-release

Change log

Generated on 2023-04-07

Gluten 0.5.0

Gluten 0.5.0 is the first preview release from the repository (https://github.com/oap-project/gluten).
In this release, we have merged 971 PRs and fixed 216 issues.

Here are the major highlights of Gluten 0.5.0:

Support Spark 3.2 and Spark 3.3
Support Ubuntu 20.04 or later
Support CentOS 7 and 8
Support JDK 8 only
Support GCC 9 or later
Use Substrait as unified plan
Use Velox as default backend engine
Use Celeborn as default RSS
Support most popular data types including Boolean, Byte, Short, Int, Long, Float, Double, Date, Decimal, String, etc.
Support Spill for Sort, Agg, and Join operators
Pass all Spark 3.2 unit tests
2.5x speedup in Decision Support Benchmark1 (TPC-H Like) testing
2x speedup in Decision Support Benchmark2 (TPC-DS Like) testing
Support Intel QAT accelerators in Shuffle compression

Limitations

Complex data types such as Array, Map, and Struct are not supported
OOM may occur in operators that do not support Spill
Decimal results may mismatch in some cases

Features


#974	[CH] Support string repeat function
#1008	[CH] Support locate function
#1273	Implement cast decimal to int
#1223	[CH] support reading from S3 and using Clickhouse local cache to speed up
#1131	[Gluten-core] Add an option to only fallback once
#1165	Reduce GC Time when executing BHJ for CH backend.
#1147	[Gluten-core] Make validate failure logLevel configurable
#1100	Making transformer plan log more obvious
#1112	Refactor Gluten metrics and add apis for each backend
#926	gluten timezone not the same as backend
#1039	Remove compute pid metric in shuffle operator.
#882	Selective query execution
#959	Upgrade Arrow version to 11.0.0
#969	Docker for gluten running on centos 8
#986	Align and enrich metrics compare to Spark
#972	Can we separate native dynamic library from build generated jars?
#913	No Spark Shim Provider found for 3.2.0
#853	Support named struct type
#888	Clickhouse backend broadcast relation support r2c
#850	Add cast check in ExpressionTransformer
#825	Setup development environment for macOS
#788	Pass needed hadoop conf from driver to executor

Bugs Fixed


#1284	Scala double data is wrongly compared with null in a UT
#729	Validation failed for GlutenHashAggregateExecTransformer class
#799	This operator doesn't support doExecuteColumnar
#527	archives for Spark patch versions become unavailable on new releases affecting shims versioning
#523	Some basic failed SQL cases
#1028	[VL] SubstraitToVeloxPlan error
#858	Sort result mismatch issue with different input records.
#877	Array/Map DataType result mismatch issue when containing null value
#1227	[CH] Scalar subquery filters execute twice for parquet file
#1265	[CH] Rescale decimal trigger fallback
#1233	[CH] Fix fallback issue when reading csv files
#1235	[CH] Fix missing reading from the broadcasted value when executing DPP
#1234	[CH] Fix error 'Invalid number of columns in chunk pushed to OutputPort' when executing hash agg after union all
#1207	shims-spark32 and shims-spark33 may be depended on at the same time
#1161	Bundle built by `buildbundle-veloxbe.sh` for Spark3.3 is broken
#1210	[CH] Fix the wrong table path of the orders table for TPCH in UT
#1175	FileNotFoundException while executing spark jobs -.so files
#1179	[VL] CI is failing on boost's checksum
#1162	[CK] Fix CoalesceBatches metrics
#1124	Memory management not suitable with Velox split preload feature.
#1149	Run tpc-ds core
#741	Handle remainder for the case that its right input is zero
#1090	[TPCH][VL] tpch has some query execution error logs but queries could finish and the result is correct
#1068	[VL] Managed memory leak in imported Spark UTs
#772	Velox does not install folly in centos8 by default, break compile in centos8.
#789	Jar conflicts on Arrow and Protobuf between Vanilla Spark and Gluten
#700	AARCH64 port of Gluten
#1027	[VL] unsupported method
#1072	[CH] Fix NPE when executing BatchScanExecTransformer.getInputFilePaths with MergeTree DS V2
#489	cannot build gluten (velox backend) in Amazon Linux 2
#1012	Enable local cache throw exception
#995	Fix memory leak for ClickHouse Backend
#914	System variables related to Folly could not be found when compiling gluten.
#990	Failed to build velox
#946	Upgrade arrow version to 10.0.1
#860	CH backend inset result not equals spark result
#601	Can't decide data type of null value in gluten test framework, when transforming InternalRow to DataFrame
#843	Unable to convert BHJ to SHJ by using hint
#826	ch_backend not support inset is empty
#815	Gluten + Velox backend does not support Struct dataset with same element name.
#563	Error compiling within -Pbackends-xx,spark-3.3,spark-ut
#560	An unsupportedOperationException interrupted the query execution
#770	VeloxRuntimeError when reading parquet file with only meta data
#800	[UT]ExpectedAnswer may not match SparkAnswer when is sorted
#676	WholeStageTransformerSuite#logForFailedTest() swallows exceptions
#790	Join RuntimeException when having duplicated equal-join keys
#757	Parquet scan not offloaded
#797	It won't load the libparquet.so.1000 when we use Gluten with Velox backend and run it on the yarn.
#784	No Spark Shim Provider found for 3.3.0
#547	Jar conflict issue
#727	build from local velox repo doesn't work

PRs


#1266	[GLUTEN-1246] [CORE] Fix scale may be negative issue
#1313	[VL] Update doc for centos7 install
#1312	[CH] Ignore ch backend tpcds suite
#1198	[VL] fix: Update Velox setup scripts for centos 7
#1294	[VL] Following #1185, do some clean-ups against Velox + Celeborn CI
[#1196](https://github.com/oa...


Releases: apache/incubator-gluten
