Skip to content

Latest commit

 

History

History
330 lines (316 loc) · 44.6 KB

15.0.0.md

File metadata and controls

330 lines (316 loc) · 44.6 KB

15.0.0 (2022-12-01)

Full Changelog

Breaking changes:

  • Expose remaining parquet config options into ConfigOptions (try 2) #4427 (alamb)
  • Config Cleanup: Remove TaskProperties and KV structure, keep key=value serialization #4382 (alamb)
  • add {TDigest,ScalarValue,Accumulator}::size #4342 (crepererum)
  • API-break: Support SubqueryAlias and remove Alias in Projection #4333 [sql] (jackwener)
  • split try_new_with_schema_alias from original code #4284 (jackwener)
  • Collapse statistics in normal explain plan #4157 (alamb)
  • Linearize binary expressions to reduce proto tree complexity #4115 (isidentical)
  • support SET Timezone #4107 [sql] (waitingkuo)

Implemented enhancements:

  • Refactor Built-in, Aggregate window functions to increase code reuse. #4440
  • Helper to get "root" error #4435
  • Do NOT convert intermediate/source errors to strings. #4434
  • Estimate the total_byte_size of the filter expression's result when selectivity is available #4374
  • refactor the code of the HashJoin #4356
  • CoalesceBatchesExec reports no ordering #4331
  • Introduce tournament tree to achieve better k-way sort-merging #4300
  • Add a checker to confirm physical optimizer rules will keep the physical plan schema immutable #4299
  • Remove the macro rule unary_scalar_expr from expr_fn.rs #4298
  • Remove Alias-in-Projection, replace it with SubqueryAlias #4291
  • reimplement reduce_outer_join #4270
  • Reimplement filter_push_down #4266
  • Reimplement eliminate_limit #4264
  • Reimplement limit_push_down #4263
  • Make a data driven SQL testing tool (so we can reuse duckdb test suite, example) #4248
  • upgrade chrono to 0.4.23 #4224
  • support scan non-string columns partitioned parquet files #4218
  • Allow optimizer rules to skip optimizing plans #4209
  • Supporting specifying schema when create tables #4183
  • Improve ergonomics of creating ListingOptions #4178
  • Add ability to specify external sort information for ParquetExec #4169
  • Add another method to collect referenced columns from an expression #4152
  • Improve EXPLAIN ANALYZE output for parquet exec #4144
  • TableProviderFactory::create should have Optional<DFSchemaRef> parameter #4142
  • Support more expressions in equality join #4140
  • JoinSelection Rule to choose physical join implementation: HashJoin(Partitioned or CollectLeft) or SortMergeJoin base on Stats #4139
  • Allow TPCH tooling to create a combined result for easier processing by outside tools #4127
  • Allow additional options when creating an external table #4125
  • reuse code utils::optimize_children instead of redundant implementation #4120
  • Add test field to PR template #4113
  • Allow for automatic registration of ListingTables #4111
  • Add CI check that configs.md is up-to-date #4108
  • Support SET timezone to non-UTC time zone #4106
  • Parquet predicates contains and true expressions #4091
  • Replace RwLock<HashMap> and Mutex<HashMap> by using DashMap #4077
  • add support for .xz compressed files #4074
  • add a feature gate to make support for compressed files optional #4073
  • Support serializing more deeply nested AND / OR expressions #4066
  • Use f64::total_cmp instead of OrderedFloat #4051
  • Add documentation to make it clear that decimal support is still experimental #4036
  • Simplify Pushed Down Predicates #4020
  • Improve HashJoinExec metrics #4009
  • Move physical plan serde from Ballista to DataFusion #3949
  • Support SubqueryAlias better in planner #3927
  • A framework for expression boundary analysis (and statistics) #3898
  • Replace Filter: Boolean(false) with EmptyRelation #3864
  • Implement statistics estimation for FilterExec #3845
  • Support parquet page filtering for more types: String, Binary(Decimal), Int96 #3833
  • Allow configuring parquet filter pushdown dynamically #3821
  • Unable to register tables in non-cloud S3 servers #3640
  • support more data type in prune for cast/try_cast #3442
  • Disable spill to disk globally #3264
  • Consider to categorize Operator #3216
  • Replace Projection.alias with SubqueryAlias #2212
  • [Optimizer] Eliminate the distinct #2045
  • beautify datafusion's site: https://arrow.apache.org/datafusion/ #1819
  • split datafusion-logical-plan sub-module #1755
  • convert outer join to inner join to improve performance #1585
  • Add sqllogictest for datafusion #1453
  • Add additional simplification rules #1406
  • support more subqueries #1209
  • Add baseline metrics for remaining execution plan nodes #1019
  • Make ExecutionPlan implementations immutable #987
  • Architecture overview may be insufficient in README #980
  • Add a separate configuration setting for parallelism of scanning parquet files #924
  • Support hash repartion elimination #41

Fixed bugs:

  • pyarrow CI failed #4448
  • UnwrapCastInComparison exist bug #4430
  • The CLI panics when passing an invalid explain query #4378
  • HashJoin should return Err when the right side input stream produce Err #4362
  • Optimizer check errors if resulting schema has different metadata #4346
  • Panic with function to_hex #4339
  • LimitPushDown pushdown into limit, result is wrong #4308
  • DESCRIBE statement issue with qualified table references #4303
  • Panic with window function LAST_VALUE #4297
  • CI failed in Compare to postgres #4294
  • Field alias can't work in where clause #4288
  • Some valid filters are not pushed down to parquet scan #4282
  • The type renaming pub type NullColumnarValue = ColumnarValue makes no sense #4271
  • Current limit_push_down can't support cross_join #4256
  • Cargo test fail #4253
  • RightSemi/RightAnti HashJoin has bug, the left_indices is never populated, causing failure to apply join filters. #4247
  • Clippy failures #4245
  • Cannot query s3 data from datafusion-cli #4239
  • Bug parsing interval with negative values #4237
  • cargo test reports errors on the master branch. #4236
  • Doc of the expression functionlog2 is incorrect #4231
  • HashJoin with mode PartitionMode:CollectLeft has bug and can produce wrong result #4230
  • Add ambiguous check when generate projection plan #4210
  • What happened for NDJSON support on CLI? #4198
  • Add ambiguous check when generate join plan #4197
  • Clippy failing on master : error: use of deprecated associated function chrono::NaiveDate::from_ymd: use from_ymd_opt() instead #4187
  • Reimplement the eliminate_cross_join #4176
  • Incorrect handling of column names #4166
  • Update release scripts to support datafusion-benchmarks #4134
  • Bug in interpreting correctly parsed SQL with aliases #4123
  • The percentile argument for ApproxPercentileCont must be Float64, not Decimal128(2, 1) #4103
  • Panic when using array_agg #4080
  • Wrong result for FIRST_VALUE AND LAST_VALUE window functions #4076
  • Round error when casting float to decimal #4071
  • Predicate still has cast when comparing Timestamp(Nano, None) to a timestamp literal, so can't be pushed down or used for pruning #3938
  • Revisit required_child_distribution(), output_partitioning(), output_ordering() implementations in ExecutionPlan's implementations #3653
  • Can't push down projection after do type coercion #3583
  • In some circumstances cast expression is not working #3499
  • output_partitioning() and output_ordering() implementations are wrong in some physical plan implementations with alias #3400
  • Interval Literal doesn't work for timeunit less than millisecond #3204
  • INTERVAL literal with duplicated interval types should raise error #3183
  • Error occurs when only using partition columns in query #1999
  • regex_match does not compile using the g flag #1429
  • between with NULL literals does not work: can't be evaluated because there isn't a common type to coerce the types to #1193
  • [Datafusion] Error with CAST: Unsupported SQL type Time #193

Closed issues:

  • SQL level coverage for when memory limit is exceeded #4404
  • Throw error (not panic) if a listing table specifies an missing partition column #4350
  • Page index pruning fail on complex_expr #4317
  • optimize limit-full join in the limit push down rule #4275
  • infer_schema function is not working with s3 Urls or http endpoints #4269
  • Add support binary boolean operators with nulls #4241
  • Add additional testing to parquet predicate pushdown integration tests #4087
  • Add metrics for parquet page level skipping #4086
  • Add parquet page index pushdown metrics #4058
  • Throw a runtime error if the memory allocated to GroupByHash exceeds a limit #3940
  • support unsigned numeric data type in UnwrapCastInBinaryComparison rule #3702
  • Support type cast in union #2125
  • [EPIC] Memory Limited Sort (Externalized / Spill) #1568
  • Maintain partition information in Union #189
  • Add coercion support for NULL literals #185

Merged pull requests:

  • Make datafusion-sql depend on arrow-schema instead of arrow #4456 [sql] (mbrobbel)
  • replace the comparator for decimal array op scalar using arrow kernel #4453 (liukun4515)
  • Fix pyarrow test #4450 (mvanschellebeeck)
  • Replace &Option<T> with Option<&T> #4446 [sql] (askoa)
  • Improve error handling for array downcasting #4445 (retikulum)
  • Refactor Builtin Window Function Implementation #4441 (mustafasrepo)
  • feat: DataFusionError::find_root #4437 (crepererum)
  • fix: do NOT convert errors to strings but keep the type #4436 (crepererum)
  • The CLI panics when passing an invalid explain query #4429 (comphead)
  • [minor] use arrow kernel concat_batches instead combine_batches #4423 (Ted-Jiang)
  • fix panic on to_hex function for negative numbers #4422 (retikulum)
  • Optimize filter executor in pull-based executor #4421 (xudong963)
  • optimize limit push for join case #4411 (liukun4515)
  • Add integration test for erroring when memory limits are hit #4406 (alamb)
  • feat: ResourceExhausted for memory limit in AggregateStream #4405 (crepererum)
  • Update to arrow 28 #4400 [sql] (tustvold)
  • Update rstest requirement from 0.15.0 to 0.16.0 #4399 (dependabot[bot])
  • Add sqllogictests (v0) #4395 (mvanschellebeeck)
  • improve hashjoin execution metrics #4394 (AssHero)
  • Add with_new_inputs for LogicalPlan #4393 (jackwener)
  • Clean the code in limit.rs. #4391 (HaoYang670)
  • Move physical plan serde from Ballista to DataFusion #4390 (Kikkon)
  • Fix page index pruning fail on complex_expr #4387 (Ted-Jiang)
  • Add check for nested types in equivalent names and types #4380 (alamb)
  • refine the code of build schema for ambiguous check, factor this out into a function #4379 [sql] (AssHero)
  • Refactor the Hash Join #4377 (liukun4515)
  • Minor: Fix typos in the documentation #4376 (martin-g)
  • Include byte size estimates in the filter statistics #4375 (isidentical)
  • HashJoin should return Err when the right side input stream produce Err, add more join UTs to cover different join types #4373 [sql] (mingmwang)
  • feat: ResourceExhausted for memory limit in GroupedHashAggregateStream #4371 (crepererum)
  • Use limit() function instead of show_limit() in the first example #4369 (martin-g)
  • Update env_logger requirement from 0.9 to 0.10 #4367 (dependabot[bot])
  • reimplement push_down_filter to remove global-state #4365 (jackwener)
  • Support to use Schedular in tpch benchmark #4361 (xudong963)
  • Adding more dataframe example to read csv files #4360 (DataPsycho)
  • minor: correct name and typo #4359 (jackwener)
  • Do not log error if page index can not be evaluated #4358 (alamb)
  • Clean the expr_fn - use scalar_expr to create unary scalar expr functions, remove macro unary_scalar_functions #4357 (HaoYang670)
  • Throw error (not panic) if a listing table specifies an missing partition column #4354 (doki23)
  • Improve error handling and add some more types for proper downcasting #4352 (retikulum)
  • Add check to avoid underflow in memory manager #4351 (askoa)
  • Improve error messages when memory is exhausted while sorting #4348 (alamb)
  • Do not error in optimizer if resulting schema has different metadata #4347 (alamb)
  • minor: improve optimizer logging and do not repeat rule name #4345 (alamb)
  • minor: fix typos in test names #4344 [sql] (alamb)
  • Minor: Add docstrings to EliminateOuterJoins optimizer pass #4343 (alamb)
  • Minor: refactor: isolate common memory accounting utils #4341 (crepererum)
  • minor: make plan_from_tables return one plan instead of Vec #4336 [sql] (jackwener)
  • enhancement: when fetch == 0, pushdown limit 0 instead skip+fetch. #4334 (jackwener)
  • Teach optimizer that CoalesceBatchesExec does not destroy output order #4332 (alamb)
  • Add ability to disable DiskManager #4330 (tustvold)
  • Update cli.md #4329 (psvri)
  • fix bug: right semi join can't support the filter #4327 (liukun4515)
  • reimplment eliminate_limit to remove global-state. #4324 (jackwener)
  • Refine Err propagation and avoid unwrap in transform closures #4318 (mingmwang)
  • Add a checker to confirm physical optimizer rules will keep the physical plan schema immutable #4316 (mingmwang)
  • Refactor downcasting functions with downcastvalue macro and improve error handling of ListArray downcasting #4313 (retikulum)
  • minor: add another test case to cover join ambiguous check #4305 [sql] (ygf11)
  • Fix DESCRIBE statement qualified table issue #4304 [sql] (gruuya)
  • Use tournament loser tree for k-way sort-merging, increase merge speed by 50% #4301 (richox)
  • Pin Python setuptools in the CI to fix integration tests #4296 (isidentical)
  • Support SubqueryAlias in optimizer, physcial planner. #4293 (jackwener)
  • minor: avoid a clone into string when checking ambiguous #4292 [sql] (ygf11)
  • replace the comparison op for decimal array op using the arrow-rs kernel #4290 (liukun4515)
  • MINOR: replace {..} with (_), typo, remove outdated TODO #4286 (jackwener)
  • Reduce Expr copies in ParquetExec #4283 (alamb)
  • Fix issue in filter pushdown with overloaded projection index #4281 (thinkharderdev)
  • Skip useless pruning predicates in ParquetExec #4280 (alamb)
  • Push down more predicates into ParquetExec #4279 (alamb)
  • Fix EXPLAIN plan for ParquetExec to show pruning_predicate #4278 (alamb)
  • reimplement limit_push_down to remove global-state, enhance optimize and simplify code. #4276 (jackwener)
  • Bump actions/labeler from 4.0.2 to 4.1.0 #4274 (dependabot[bot])
  • Remove the type alias NullColumnarValue #4273 (HaoYang670)
  • reimplement eliminate_outer_join #4272 (jackwener)
  • Fix bugs in parsing with header row and partitioned by #4268 [sql] (HaoYang670)
  • improve error messages while downcasting UInt32Array, UInt64Array and BooleanArray #4261 (retikulum)
  • add ambiguous check for projection #4260 [sql] (AssHero)
  • Add ambiguous check for join #4258 [sql] (ygf11)
  • support cross_join in limit_push_down #4257 (jackwener)
  • Support parquet page filtering on min_max for decimal128 and string columns #4255 (Ted-Jiang)
  • fix conflict and UT, cleanup redundant legacy code #4252 (jackwener)
  • Minor: remove unecessary clone() in planner #4249 [sql] (alamb)
  • Fix nightly clippy failures #4246 (mvanschellebeeck)
  • Improve Error Handling and Readibility for downcasting Float32Array, Float64Array, StringArray #4244 (retikulum)
  • Use defaults for ListingOptions builder #4243 (mvanschellebeeck)
  • Support binary boolean operators with nulls #4242 (Ted-Jiang)
  • Fixing doc of the expression #4240 (Creampanda)
  • Fix negative interval parsing bug #4238 (Jefffrey)
  • remove duplicate or redundant code #4235 (jackwener)
  • add a checker to confirm optimizer can keep plan schema immutable. #4233 (jackwener)
  • Fix the percentile argument for ApproxPercentileCont must be Float64, not Decimal128(2, 1) #4228 (comphead)
  • refactor how we create listing tables #4227 (timvw)
  • Update sqlparser requirement from 0.26 to 0.27 #4226 [sql] (alamb)
  • upgrade required chrono version to 0.4.23 #4225 (waitingkuo)
  • Support types other than String for partition columns on ListingTables #4221 (doki23)
  • [CBO] JoinSelection Rule, select HashJoin Partition Mode based on the Join Type and available statistics, option for SortMergeJoin #4219 (mingmwang)
  • Remove alias in Union #4212 (jackwener)
  • Add try_optimize method #4208 (andygrove)
  • Provide a builder for ListingOptions with fixups #4207 (alamb)
  • Avoid error with empty iterators used for ScalarValue::iter_to_array #4206 (GrandChaman)
  • Improve error message for regexp_match 'g' flag #4203 (Jefffrey)
  • Return ResourceExhausted errors when memory limit is exceed in GroupedHashAggregateStreamV2 (Row Hash) #4202 (crepererum)
  • Add additional expr boolean simplification rules #4200 (Jefffrey)
  • Update to arrow and parquet 27.0.0 #4199 [sql] (tustvold)
  • Support create table with explicit column definitions #4194 [sql] (doki23)
  • Support all equality predicates in equality join #4193 [sql] (ygf11)
  • add propagate_empty_relation optimizer rule #4192 (jackwener)
  • fix clippy #4190 [sql] (jackwener)
  • Fix clippy by avoiding deprecated functions in chrono #4189 (alamb)
  • Disallow duplicate interval types during parsing #4188 (Jefffrey)
  • Parse nanoseconds for intervals #4186 (Jefffrey)
  • Add rule to reimplement Eliminate cross join and remove it in planner #4185 [sql] (jackwener)
  • [FOLLOWUP] Enforcement Rule: resolve review comments, refactor adjust_input_keys_ordering() #4184 (mingmwang)
  • Simplify boolean parquet pushdown predicate #4182 (Jefffrey)
  • Minor: consolidate parquet custom_reader integration test into parquet_exec #4175 (alamb)
  • minor: remove redundant println and cleanup #4173 (jackwener)
  • Add ability to specify external sort information for ListingTables #4170 (alamb)
  • Improve Error Handling and Readibility for downcasting Decimal128Array #4168 (retikulum)
  • Minor: Remove completed comment on parquet row group pruning #4167 (alamb)
  • Update hashbrown requirement from 0.12 to 0.13 #4164 (dependabot[bot])
  • MINOR: enable dyn_cmp_dict feature on arrow for physical expr crate #4163 (isidentical)
  • Derive filter statistic estimates from the predicate expression #4162 (isidentical)
  • Minor: pass ParquetFileMetrics to build_row_filter in parquet #4161 (alamb)
  • Minor: Extract parquet row group pruning code into its own module #4160 (alamb)
  • Full support for time32 and time64 literal values (ScalarValue) #4156 (andre-cc-natzka)
  • Window frame GROUPS mode support #4155 (zembunia)
  • Improve error messages while downcasting Int64Array #4154 (retikulum)
  • Add another method to collect referenced columns from an expression #4153 [sql] (ygf11)
  • Remove BoxedAsyncFileReader #4150 (tustvold)
  • Support unsigned integers in unwrap_cast_in_comparison Optimizer rule #4149 (alamb)
  • Add support for DataType::Timestamp casts in unwrap_cast_in_comparison optimizer pass #4148 (alamb)
  • Add additional testing for unwrap_cast_in_comparison #4147 (alamb)
  • improve error messages while downcasting Int32Array #4146 (retikulum)
  • Minor: Update docstring on unwrap_cast_in_comparison #4145 (alamb)
  • add schema parameter to table provider factory create method #4143 (milenkovicm)
  • fix: shouldn't pass alias through into subquery. #4141 [sql] (jackwener)
  • Preserve the Cast expression in columnize_expr #4137 [sql] (HaoYang670)
  • Set versions to dependencies with path in benchmarks Cargo.toml file #4136 (ArkashaJavelin)
  • Fix links #4135 (mvanschellebeeck)
  • Use f64::total_cmp instead of OrderedFloat #4133 (comphead)
  • Add parquet integration tests for explicitly smaller page sizes, page pruning #4131 (alamb)
  • Consolidate ParquetExec tests in parquet_exec integration test #4130 (alamb)
  • Minor: Use upstream BooleanArray::true_count #4129 (alamb)
  • Combined TPCH runs & uniformed summaries for benchmarks #4128 (isidentical)
  • Enable TableProviderFactories to receive additional options when creating an external table #4126 [sql] (timvw)
  • Add CI check that configs.md is up-to-date #4124 (mvanschellebeeck)
  • [Part3] Partition and Sort Enforcement, Enforcement rule implementation #4122 (mingmwang)
  • reuse code utils::optimize_children but affect inline. #4121 (jackwener)
  • reuse code utils::optimize_children instead of redundant implementation #4119 (jackwener)
  • Allow listing tables to be created via TableFactories #4112 (avantgardnerio)
  • Update SQL reference to state that decimal support is currently experimental #4109 (andygrove)
  • Add metrics for parquet page level skipping #4105 (Ted-Jiang)
  • Add parser option for parsing SQL numeric literals as decimal #4102 [sql] (andygrove)
  • Replace RwLock<HashMap> and Mutex<HashMap> by using DashMap #4079 (yahoNanJing)
  • Custom window frame support extended to built-in window functions #4078 (mustafasrepo)
  • Enable tests for page index filtering in parquet filter pushdown test #4062 (alamb)
  • [Part2] Partition and Sort Enforcement, ExecutionPlan enhancement #4043 (mingmwang)
  • add support for xz file compression and compression feature #3993 [sql] (Jimexist)
  • Expression boundary analysis framework #3912 (isidentical)