This release consists of 245 commits from 69 contributors. See credits at the end of this changelog for more information.
Breaking changes:
- make unparser Dialect trait Send + Sync #11504 (y-f-u)
- Implement physical plan serialization for csv COPY plans, add as_any, Debug to FileFormatFactory #11588 (Lordworms)
- Consistent API to set parameters of aggregate and window functions (AggregateExt --> ExprFunctionExt) #11550 (timsaucer)
- Rename ColumnOptions to ParquetColumnOptions #11512 (alamb)
- Rename input_type --> input_types on AggregateFunctionExpr / AccumulatorArgs / StateFieldsArgs #11666 (lewiszlw)
- Rename RepartitionExec metric repart_time to repartition_time #11703 (alamb)
- Remove AggregateFunctionDefinition #11803 (lewiszlw)
- Skipping partial aggregation when it is not helping for high cardinality aggregates #11627 (korowa)
- Optionally create name of aggregate expression from expressions #11776 (lewiszlw)
Performance related:
- feat: Optimize CASE expression for "column or null" use case #11534 (andygrove)
- feat: Optimize CASE expression for usage where then and else values are literals #11553 (andygrove)
- perf: Optimize IsNotNullExpr #11586 (andygrove)
Implemented enhancements:
- feat: Add fail_on_overflow option to BinaryExpr #11400 (andygrove)
- feat: add UDF to_local_time() #11347 (appletreeisyellow)
- feat: switch to using proper Substrait types for IntervalYearMonth and IntervalDayTime #11471 (Blizzara)
- feat: support UDWFs in Substrait #11489 (Blizzara)
- feat: support unnest in GROUP BY clause #11469 (JasonLi-cn)
- feat: support COUNT() #11229 (tshauck)
- feat: consume and produce Substrait type extensions #11510 (Blizzara)
- feat: Error when a SHOW command is passed in with an accompanying non-existent variable #11540 (itsjunetime)
- feat: support Map literals in Substrait consumer and producer #11547 (Blizzara)
- feat: add bounds for unary math scalar functions #11584 (tshauck)
- feat: Add support for cardinality function on maps #11801 (Weijun-H)
- feat: support Utf8View type in starts_with function #11787 (tshauck)
- feat: Expose public method for optimizing physical plans #11879 (andygrove)
Fixed bugs:
- fix: Fix eq properties regression from #10434 #11363 (suremarc)
- fix: make sure JOIN ON expression is boolean type #11423 (jonahgao)
- fix: regexp_replace fails when pattern or replacement is a scalar NULL #11459 (Weijun-H)
- fix: unparser generates wrong sql for derived table with columns #11505 (y-f-u)
- fix: make UnKnownColumns not equal to others physical exprs #11536 (jonahgao)
- fix: fixes trig function order by #11559 (tshauck)
- fix: CASE with NULL #11542 (Weijun-H)
- fix: panic and incorrect results in LogFunc::output_ordering() #11571 (jonahgao)
- fix: expose the fluent API fn for approx_distinct instead of the module #11644 (Michael-J-Ward)
- fix: dont try to coerce list for regex match #11646 (tshauck)
- fix: regr_count now returns Uint64 #11731 (Michael-J-Ward)
- fix: set null_equals_null to false when convert_cross_join_to_inner_join #11738 (jonahgao)
- fix: Add additional required expression for natural join #11713 (Lordworms)
- fix: hash join tests with forced collisions #11806 (korowa)
- fix: collect_columns quadratic complexity #11843 (crepererum)
Documentation updates:
- Minor: Add link to blog to main DataFusion website #11356 (alamb)
- Add to_local_time() in function reference docs #11401 (appletreeisyellow)
- Minor: Consolidate specification doc sections #11427 (alamb)
- Combine the Roadmap / Quarterly Roadmap sections #11426 (alamb)
- Minor: Add an example for backtrace pretty print #11450 (goldmedal)
- Docs: Document creating new extension APIs #11425 (alamb)
- Minor: Clarify which parquet options are used for reading/writing #11511 (alamb)
- Support newlines_in_values CSV option #11533 (connec)
- chore: Minor cleanup simplify_demo() example #11576 (kavirajk)
- Move Datafusion Query Optimizer to library user guide #11563 (devesh-2002)
- Fix typo in doc of Partitioning #11612 (waruto210)
- Doc: A tiny typo in scalar function's doc #11620 (2010YOUY01)
- Change default Parquet writer settings to match arrow-rs (except for compression & statistics) #11558 (wiedld)
- Rename functions-array to functions-nested #11602 (goldmedal)
- Add parser option enable_options_value_normalization #11330 (xinlifoobar)
- Add reference to #comet channel in Arrow Rust Discord server #11637 (ajmarcus)
- Extract catalog API to separate crate, change TableProvider::scan to take a trait rather than SessionState #11516 (findepi)
- doc: why nullable of list item is set to true #11626 (jcsherin)
- Docs: adding explicit mention of test_utils to docs #11670 (edmondop)
- Ensure statistic defaults in parquet writers are in sync #11656 (wiedld)
- Merge string-view2 branch: reading from parquet up to 2x faster for some ClickBench queries (not on by default) #11667 (alamb)
- Doc: Add Sail to known users list #11791 (shehabgamin)
- Move min and max to user defined aggregate function, remove AggregateFunction / AggregateFunctionDefinition::BuiltIn #11013 (edmondop)
- Change name of MAX/MIN udaf to lowercase max/min #11795 (edmondop)
- doc: Add support for map and make_map functions #11799 (Weijun-H)
- Improve readme page in crates.io #11809 (lewiszlw)
- refactor: remove unneeded mut for session context #11864 (sunng87)
Other:
- Prepare 40.0.0 Release #11343 (andygrove)
- Support NULL literals in where clause #11266 (xinlifoobar)
- Implement TPCH substrait integration test, support tpch_6, tpch_10, t… #11349 (Lordworms)
- Fix bug when pushing projection under joins #11333 (jonahgao)
- Minor: some cosmetics in filter.rs, fix clippy due to logical conflict #11368 (comphead)
- Update prost-derive requirement from 0.12 to 0.13 #11355 (dependabot[bot])
- Minor: update dashmap 6.0.1 #11335 (alamb)
- Improve and test dataframe API examples in docs #11290 (alamb)
- Remove redundant unalias_nested calls for creating Filter's #11340 (alamb)
- Enable clone_on_ref_ptr clippy lint on optimizer #11346 (lewiszlw)
- Update termtree requirement from 0.4.1 to 0.5.0 #11383 (dependabot[bot])
- Introduce resources_err! error macro #11374 (comphead)
- Enable clone_on_ref_ptr clippy lint on common #11384 (lewiszlw)
- Track parquet writer encoding memory usage on MemoryPool #11345 (wiedld)
- Minor: remove clones and unnecessary Arcs in from_substrait_rex #11337 (alamb)
- Minor: Change no-statement error message to be clearer #11394 (itsjunetime)
- Change array_agg to return null on no input rather than empty list #11299 (jayzhan211)
- Minor: return "not supported" for COUNT DISTINCT with multiple arguments #11391 (jonahgao)
- Enable clone_on_ref_ptr clippy lint on sql #11380 (lewiszlw)
- Move configuration information out of example usage page #11300 (alamb)
- chore: reuse a single function to create the Substrait TPCH consumer test contexts #11396 (Blizzara)
- refactor: change error type for "no statement" #11411 (crepererum)
- Implement prettier SQL unparsing (more human readable) #11186 (MohamedAbdeen21)
- Move overlay planning to ExprPlanner #11398 (dharanad)
- Coerce types for all union children plans when eliminating nesting #11386 (gruuya)
- Add customizable equality and hash functions to UDFs #11392 (joroKr21)
- Implement ScalarFunction MAKE_MAP and MAP #11361 (goldmedal)
- Improve CommonSubexprEliminate rule with surely and conditionally evaluated stats #11357 (peter-toth)
- fix(11397): surface proper errors in ParquetSink #11399 (wiedld)
- Minor: Add note about SQLLancer fuzz testing to docs #11430 (alamb)
- Trivial: use arrow csv writer's timestamp_tz_format #11407 (tmi)
- Improved unparser documentation #11395 (alamb)
- Avoid calling shutdown after failed write of AsyncWrite #11415 (joroKr21)
- Short term way to make AggregateStatistics still work when min/max is converted to udaf #11261 (Rachelint)
- Implement TPCH substrait integration test, support tpch_13, tpch_14,16 #11405 (Lordworms)
- Minor: fix github action labeler rules #11428 (alamb)
- Minor: change internal error to not supported error for nested field … #11446 (alamb)
- Minor: change Datafusion --> DataFusion in docs #11439 (alamb)
- Support serialization/deserialization for custom physical exprs in proto #11387 (lewiszlw)
- remove termtree dependency #11416 (Kev1n8)
- Add SessionStateBuilder and extract out the registration of defaults #11403 (Omega359)
- integrate consumer tests, implement tpch query 18 to 22 #11462 (Lordworms)
- Docs: Explain the usage of logical expressions for create_aggregate_expr #11458 (jayzhan211)
- Return scalar result when all inputs are constants in map and make_map #11461 (Rachelint)
- Enable clone_on_ref_ptr clippy lint on functions* #11468 (lewiszlw)
- minor: non-overlapping repart_time and send_time metrics #11440 (korowa)
- Minor: rename row_groups.rs to row_group_filter.rs #11481 (alamb)
- Support alternate formats for unparsing datetime to timestamp and interval #11466 (y-f-u)
- chore: Add criterion benchmark for CaseExpr #11482 (andygrove)
- Initial support for StringView, merge changes from string-view development branch #11402 (alamb)
- Replace to_lowercase with to_string in sql example #11486 (lewiszlw)
- Minor: Make execute_input_stream Accessible for Any Sinking Operators #11449 (berkaysynnada)
- Enable clone_on_ref_ptr clippy lints on proto #11465 (lewiszlw)
- upgrade sqlparser 0.47 -> 0.48 #11453 (MohamedAbdeen21)
- Add extension hooks for encoding and decoding UDAFs and UDWFs #11417 (joroKr21)
- Remove element's nullability of array_agg function #11447 (jayzhan211)
- Get expr planners when creating new planner #11485 (jayzhan211)
- Support alternate format for Utf8 unparsing (CHAR) #11494 (sgrebnov)
- implement retract_batch for xor accumulator #11500 (drewhayward)
- Refactor: more clearly delineate between TableParquetOptions and ParquetWriterOptions #11444 (wiedld)
- chore: fix typos of common and core packages #11520 (JasonLi-cn)
- Move spill related functions to spill.rs #11509 (findepi)
- Add tests that show the different defaults for ArrowWriter and TableParquetOptions #11524 (wiedld)
- Create datafusion-physical-optimizer crate #11507 (lewiszlw)
- Minor: Assert test_enabled_backtrace requirements to run #11525 (comphead)
- Move handling of NULL literals in where clause to type coercion pass #11491 (xinlifoobar)
- Update parquet page pruning code to use the StatisticsExtractor #11483 (alamb)
- Enable SortMergeJoin LeftAnti filtered fuzz tests #11535 (comphead)
- chore: fix typos of expr, functions, optimizer, physical-expr-common,… #11538 (JasonLi-cn)
- Minor: Remove clone in PushDownFilter #11532 (jayzhan211)
- Minor: avoid a clone in type coercion #11530 (alamb)
- Move array ArrayAgg to a UserDefinedAggregate #11448 (jayzhan211)
- Move MAKE_MAP to ExprPlanner #11452 (goldmedal)
- chore: fix typos of sql, sqllogictest and substrait packages #11548 (JasonLi-cn)
- Prevent bigger files from being checked in #11508 (findepi)
- Add dialect param to use double precision for float64 in Postgres #11495 (Sevenannn)
- Minor: move SessionStateDefaults into its own module #11566 (alamb)
- refactor: rewrite mega type to an enum containing both cases #11539 (LorrensP-2158466)
- Move sql_compound_identifier_to_expr to ExprPlanner #11487 (dharanad)
- Support SortMergeJoin spilling #11218 (comphead)
- Fix unparser invalid sql for query with order #11527 (y-f-u)
- Provide DataFrame API for map and move map to functions-array #11560 (goldmedal)
- Move OutputRequirements to datafusion-physical-optimizer crate #11579 (xinlifoobar)
- Minor: move Column related tests and rename column.rs #11573 (jonahgao)
- Fix SortMergeJoin antijoin flaky condition #11604 (comphead)
- Improve Union Equivalence Propagation #11506 (mustafasrepo)
- Migrate OrderSensitiveArrayAgg to be a user defined aggregate #11564 (jayzhan211)
- Minor: Disable flaky SMJ antijoin filtered test until the fix #11608 (comphead)
- support Decimal256 type in datafusion-proto #11606 (leoyvens)
- Chore/fifo tests cleanup #11616 (ozankabak)
- Fix Internal Error for an INNER JOIN query #11578 (xinlifoobar)
- test: get file size by func metadata #11575 (zhuliquan)
- Improve unparser MySQL compatibility #11589 (sgrebnov)
- Push scalar functions into cross join #11528 (lewiszlw)
- Remove ArrayAgg Builtin in favor of UDF #11611 (jayzhan211)
- refactor: simplify DFSchema::field_with_unqualified_name #11619 (jonahgao)
- Minor: Use upstream concat_batches from arrow-rs #11615 (alamb)
- Fix: signum function bug when 0.0 input #11580 (getChan)
- Enforce uniqueness of named_struct field names #11614 (dharanad)
- Minor: unnecessary row_count calculation in CrossJoinExec and NestedLoopsJoinExec #11632 (alamb)
- ExprBuilder for Physical Aggregate Expr #11617 (jayzhan211)
- Minor: avoid copying order by exprs in planner #11634 (alamb)
- Unify CI and pre-commit hook settings for clippy #11640 (findepi)
- Parsing SQL strings to Exprs with the qualified schema #11562 (Lordworms)
- Add some zero column tests covering LIMIT, GROUP BY, WHERE, JOIN, and WINDOW #11624 (Kev1n8)
- Refactor/simplify window frame utils #11648 (ozankabak)
- Minor: use ready! macro to simplify FilterExec #11649 (alamb)
- Temporarily pin toolchain version to avoid clippy errors #11655 (findepi)
- Fix clippy errors for Rust 1.80 #11654 (findepi)
- Add CsvExecBuilder for creating CsvExec #11633 (connec)
- chore(deps): update sqlparser requirement from 0.48 to 0.49 #11630 (dependabot[bot])
- Add support for USING to SQL unparser #11636 (wackywendell)
- Run CI with latest (Rust 1.80), add ticket references to commented out tests #11661 (alamb)
- Use AccumulatorArgs::is_reversed in NthValueAgg #11669 (jcsherin)
- Implement physical plan serialization for json Copy plans #11645 (Lordworms)
- Minor: improve documentation on SessionState #11642 (alamb)
- Add LimitPushdown optimization rule and CoalesceBatchesExec fetch #11652 (alihandroid)
- Update to arrow/parquet 52.2.0 #11691 (alamb)
- Minor: Rename RepartitionMetrics::repartition_time to RepartitionMetrics::repart_time to match metric #11478 (alamb)
- Update cache key used in rust CI script #11641 (findepi)
- Fix bug in remove_join_expressions #11693 (jonahgao)
- Initial changes to support using udaf min/max for statistics and opti… #11696 (edmondop)
- Handle nulls in approx_percentile_cont #11721 (Dandandan)
- Reduce repetition in try_process_group_by_unnest and try_process_unnest #11714 (JasonLi-cn)
- Minor: Add example for ScalarUDF::call #11727 (alamb)
- Use cargo release in bench.sh #11722 (alamb)
- expose some fields on session state #11716 (waynexia)
- Make DefaultSchemaAdapterFactory public #11709 (adriangb)
- Check hashes first during probing the aggr hash table #11718 (Rachelint)
- Implement physical plan serialization for parquet Copy plans #11735 (Lordworms)
- Support cross-timezone timestamp comparison via coercion #11711 (jeffreyssmith2nd)
- Minor: Improve documentation for AggregateUDFImpl::state_fields #11740 (lewiszlw)
- Do not push down Sorts if it violates the sort requirements #11678 (alamb)
- Use upstream StatisticsConverter from arrow-rs in DataFusion #11479 (alamb)
- Fix plan_to_sql: Add wildcard projection to SELECT statement if no projection was set #11744 (LatrecheYasser)
- Use upstream DataType::from_str in arrow-cast #11254 (alamb)
- Fix documentation warnings, make CsvExecBuilder and Unparsed pub #11729 (alamb)
- [Minor] Add test for only nulls (empty) as input in APPROX_PERCENTILE_CONT #11760 (Dandandan)
- Add TrackedMemoryPool with better error messages on exhaustion #11665 (wiedld)
- Derive Debug for logical plan nodes #11757 (lewiszlw)
- Minor: add "clickbench extended" queries to slt tests #11763 (alamb)
- Minor: Add comment explaining rationale for hash check #11750 (alamb)
- Fix bug that COUNT(DISTINCT) on StringView panics #11768 (XiangpengHao)
- [Minor] Refactor approx_percentile #11769 (Dandandan)
- minor: always time batch_filter even when the result is an empty batch #11775 (andygrove)
- Improve OOM message when a single reservation request fails to get more bytes. #11771 (wiedld)
- [Minor] Short circuit ApplyFunctionRewrites if there are no function rewrites #11765 (gruuya)
- Fix #11692: Improve doc comments within macros #11694 (Rafferty97)
- Extract CoalesceBatchesStream to a struct #11610 (alamb)
- refactor: move ExecutionPlan and related structs into dedicated mod #11759 (waynexia)
- Minor: Add references to github issue in comments #11784 (findepi)
- Add docs and rename param for Signature::numeric #11778 (matthewmturner)
- Support planning Map literal #11780 (goldmedal)
- Support LogicalPlan Debug differently than Display #11774 (lewiszlw)
- Remove redundant Aggregate when DISTINCT & GROUP BY are in the same query #11781 (mertak-synnada)
- Minor: add ticket reference and fmt #11805 (alamb)
- Improve MSRV CI check to print out problems to log #11789 (alamb)
- Improve log func tests stability #11808 (lewiszlw)
- Add valid Distinct case for aggregation #11814 (mertak-synnada)
- Don't implement create_sliding_accumulator repeatedly #11813 (lewiszlw)
- chore(deps): update rstest requirement from 0.21.0 to 0.22.0 #11811 (dependabot[bot])
- Minor: Update expected output due to logical conflict #11824 (alamb)
- Pass scalar to eq inside nullif #11697 (simonvandel)
- refactor: move aggregate_statistics to datafusion-physical-optimizer #11798 (Weijun-H)
- Minor: refactor probe check into function should_skip_aggregation #11821 (alamb)
- Minor: consolidate path_partition test into core_integration #11831 (alamb)
- Move optimizer integration tests to core_integration #11830 (alamb)
- Bump deprecated version of SessionState::new_with_config_rt to 41.0.0 #11839 (kezhuw)
- Fix partial aggregation skipping with Decimal aggregators #11833 (alamb)
- Fix bug with zero-sized buffer for StringViewArray #11841 (XiangpengHao)
- Reduce clone of Statistics in ListingTable and PartitionedFile #11802 (Rachelint)
- Add LogicalPlan::CreateIndex #11817 (lewiszlw)
- Update object_store to 0.10.2 #11860 (danlgrca)
- Add skipped_aggregation_rows metric to aggregate operator #11706 (alamb)
- Cast Utf8View to Utf8 to support || from StringViewArray #11796 (dharanad)
- Improve nested loop join code #11863 (lewiszlw)
- [Minor]: Refactor to use Result.transpose() #11882 (djanderson)
- support ANY() op #11849 (samuelcolvin)
Thank you to everyone who contributed to this release. Here is a breakdown of commits (PRs merged) per contributor.
48 Andrew Lamb
20 张林伟
9 Jay Zhan
9 Jonah Gao
8 Andy Grove
8 Lordworms
8 Piotr Findeisen
8 wiedld
7 Oleks V
6 Jax Liu
5 Alex Huang
5 Arttu
5 JasonLi
5 Trent Hauck
5 Xin Li
4 Dharan Aditya
4 Edmondo Porcu
4 dependabot[bot]
4 kamille
4 yfu
3 Daniël Heres
3 Eduard Karacharov
3 Georgi Krastev
2 Chris Connelly
2 Chunchun Ye
2 June
2 Marco Neumann
2 Marko Grujic
2 Mehmet Ozan Kabak
2 Michael J Ward
2 Mohamed Abdeen
2 Ruihang Xia
2 Sergei Grebnov
2 Xiangpeng Hao
2 jcsherin
2 kf zheng
2 mertak-synnada
1 Adrian Garcia Badaracco
1 Alexander Rafferty
1 Alihan Çelikcan
1 Ariel Marcus
1 Berkay Şahin
1 Bruce Ritchie
1 Devesh Rahatekar
1 Douglas Anderson
1 Drew Hayward
1 Jeffrey Smith II
1 Kaviraj Kanagaraj
1 Kezhu Wang
1 Leonardo Yvens
1 Lorrens Pantelis
1 Matthew Cramerus
1 Matthew Turner
1 Mustafa Akur
1 Namgung Chan
1 Ning Sun
1 Peter Toth
1 Qianqian
1 Samuel Colvin
1 Shehab Amin
1 Simon Vandel Sillesen
1 Tim Saucer
1 Wendell Smith
1 Yasser Latreche
1 Yongting You
1 danlgrca
1 tmi
1 waruto
1 zhuliquan
Thank you also to everyone who contributed in other ways such as filing issues, reviewing PRs, and providing feedback on this release.