Historical Changelog

30.0.1 (2023-01-04)

Full Changelog

Implemented enhancements:

Fixed bugs:

  • nullif kernel no longer exported #3454 [arrow]
  • PrimitiveArray from ArrayData Unsound For IntervalArray #3439 [arrow]
  • LZ4-compressed PQ files unreadable by Pandas and ClickHouse #3433 [parquet]
  • Parquet Record API: Cannot convert date before Unix epoch to json #3430 [parquet]
  • parquet-fromcsv with writer version v2 does not stop #3408 [parquet]

30.0.0 (2022-12-29)

Full Changelog

Breaking changes:

Implemented enhancements:

  • Add derived implementations of Clone and Debug for ParquetObjectReader #3381 [parquet]
  • Speed up TrackedWrite #3366 [parquet]
  • Is it possible for ArrowWriter to write key_value_metadata after writing all records #3356 [parquet]
  • Add UnionArray test to arrow-pyarrow integration test #3346
  • Document / Deprecate arrow_flight::utils::flight_data_from_arrow_batch #3312 [arrow] [arrow-flight]
  • [FlightSQL] Support HTTPS #3309 [arrow-flight]
  • Support UnionArray in ffi #3304 [arrow]
  • Add support for Azure Data Lake Storage Gen2 (aka: ADLS Gen2) in Object Store library #3283
  • Support casting from String to Decimal #3280 [arrow]
  • Allow ArrowCSV writer to control the display of NULL values #3268 [arrow]

Fixed bugs:

  • FlightSQL example is broken #3386 [arrow-flight]
  • CSV Reader Bounds Incorrectly Handles Header #3364 [arrow]
  • Incorrect output string from try_to_type #3350
  • Decimal arithmetic computation fails to run because decimal type equality #3344 [arrow]
  • Pretty print not implemented for Map #3322 [arrow]
  • ILIKE Kernels Inconsistent Case Folding #3311 [arrow]

Documentation updates:

Merged pull requests:

29.0.0 (2022-12-09)

Full Changelog

Breaking changes:

Implemented enhancements:

  • Support writing BloomFilter in arrow_writer #3275 [parquet]
  • Support casting from unsigned numeric to Decimal256 #3272 [arrow]
  • Support casting from Decimal256 to float types #3266 [arrow]
  • Make arithmetic kernels support DictionaryArray of DecimalType #3254 [arrow]
  • Casting from Decimal256 to unsigned numeric #3239 [arrow]
  • precision is not considered when cast value to decimal #3223 [arrow]
  • Use RegexSet in arrow_csv::infer_field_schema #3211 [arrow]
  • Implement FlightSQL Client #3206 [arrow-flight]
  • Add binary_mut and try_binary_mut #3143 [arrow]
  • Add try_unary_mut #3133 [arrow]

Fixed bugs:

  • Skip null buffer when importing FFI ArrowArray struct if no null buffer in the spec #3290 [arrow]
  • using ahash compile-time-rng kills reproducible builds #3271 [parquet]
  • Decimal128 to Decimal256 Overflows #3265 [arrow]
  • nullif panics on empty array #3261 [arrow]
  • Some more inconsistency between can_cast_types and cast_with_options #3250 [arrow]
  • Enable casting between Dictionary of DecimalArray and DecimalArray #3237 [arrow]
  • new_null_array Panics creating StructArray with non-nullable fields #3226 [arrow]
  • bool should cast from/to Float16Type as can_cast_types returns true #3221 [arrow]
  • Utf8 and LargeUtf8 cannot cast from/to Float16 but can_cast_types returns true #3220 [arrow]
  • Re-enable some tests in arrow-cast crate #3219 [arrow]
  • Off-by-one buffer size error triggers Panic when constructing RecordBatch from IPC bytes (should return an Error) #3215 [arrow]
  • arrow to and from pyarrow conversion results in changes in schema #3136 [arrow]

Documentation updates:

  • better document when we need LargeUtf8 instead of Utf8 #3228 [arrow]

Merged pull requests:

28.0.0 (2022-11-25)

Full Changelog

Breaking changes:

Implemented enhancements:

  • Add iterator to RowSelection #3172 [parquet]
  • create an integration test set for parquet crate against pyspark for working with bloom filters #3167 [parquet]
  • Row Format Size Tracking #3160 [arrow]
  • Add ArrayBuilder::finish_cloned() #3154 [arrow]
  • Optimize memory usage of json reader #3150
  • Add Field::size and DataType::size #3147 [parquet] [arrow]
  • Add like_utf8_scalar_dyn kernel #3145 [arrow]
  • support comparison for decimal128 array with scalar in kernel #3140 [arrow]
  • audit and create a document for bloom filter configurations #3138 [parquet]
  • Should casting a decimal to a smaller scale round or truncate? #3137 [arrow]
  • Upgrade chrono to 0.4.23 #3120
  • Implements more temporal kernels using time_fraction_dyn #3108 [arrow]
  • Upgrade to thrift 0.17 #3105 [parquet] [arrow]
  • Be able to parse time formatted strings #3100 [arrow]
  • Improve "Fail to merge schema" error messages #3095 [arrow]
  • Expose SortingColumn when reading and writing parquet metadata #3090 [parquet]
  • Change Field::metadata to HashMap #3086 [parquet] [arrow]
  • Support bloom filter reading and writing for parquet #3023 [parquet]
  • API to take back ownership of an ArrayRef #2901 [arrow]
  • Specialized Interleave Kernel #2864 [arrow]
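
The rounding-versus-truncation question raised in #3137 can be made concrete with plain integers. The following is only a sketch, not the arrow-rs cast kernel: it assumes a decimal is stored as an unscaled `i128`, that reducing the scale means dropping `digits` fractional digits, and that "rounding" means half away from zero (the entry does not state a rounding mode). The function names are hypothetical, and overflow near the `i128` limits is ignored.

```rust
/// Drop `digits` fractional digits by truncating toward zero.
/// Hypothetical helper for illustration; not an arrow-rs function.
fn rescale_truncate(unscaled: i128, digits: u32) -> i128 {
    // Rust integer division already truncates toward zero.
    unscaled / 10_i128.pow(digits)
}

/// Drop `digits` fractional digits, rounding half away from zero.
/// The rounding mode is an assumption made for this sketch.
fn rescale_round(unscaled: i128, digits: u32) -> i128 {
    let divisor = 10_i128.pow(digits);
    let half = divisor / 2;
    // Nudge the value by half a unit in the direction of its sign,
    // then let truncating division finish the job.
    let adjusted = if unscaled >= 0 {
        unscaled + half
    } else {
        unscaled - half
    };
    adjusted / divisor
}
```

For example, 123.45 stored as `12345` with scale 2 becomes `1234` (123.4) when truncated to scale 1 and `1235` (123.5) when rounded.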

Fixed bugs:

  • arithmetic overflow leads to segfault in concat_batches #3123 [arrow]
  • Clippy failing on master: error: use of deprecated associated function chrono::NaiveDate::from_ymd: use from_ymd_opt() instead #3097 [parquet] [arrow]
  • Pretty print for interval types has wrong formatting #3092 [arrow]
  • Field is not serializable with binary formats #3082 [arrow]
  • Decimal Casts are Unchecked #2986 [arrow]

Closed issues:

Merged pull requests:

27.0.0 (2022-11-11)

Full Changelog

Breaking changes:

Implemented enhancements:

  • Row Format: Option to detach/own a row #3078 [arrow]
  • Row Format: API to check if datatypes are supported #3077 [arrow]
  • Deprecate Buffer::count_set_bits #3067 [arrow]
  • Add Decimal128 and Decimal256 to downcast_primitive #3055 [arrow]
  • Improved UX of creating TimestampNanosecondArray with timezones #3042 [arrow]
  • Cast decimal256 to signed integer #3039 [arrow]
  • Support casting Date64 to Timestamp #3037 [arrow]
  • Check overflow when casting floating point value to decimal256 #3032 [arrow]
  • Compare i256 in validate_decimal256_precision #3024 [arrow]
  • Check overflow when casting floating point value to decimal128 #3020 [arrow]
  • Add macro downcast_temporal_array #3008 [arrow]
  • Replace hour_generic with hour_dyn #3005 [arrow]
  • Replace temporal _generic kernels with dyn #3004 [arrow]
  • Add RowSelection::intersection #3003 [parquet]
  • I would like to round rather than truncate when casting f64 to decimal #2997 [arrow]
  • arrow::compute::kernels::temporal should support nanoseconds #2995 [arrow]
  • Release Arrow 26.0.0 (next release after 25.0.0) #2953 [parquet] [arrow] [arrow-flight]
  • Add timezone offset for debug format of Timestamp with Timezone #2917 [arrow]
  • Support merge RowSelectors when creating RowSelection #2858 [parquet]

Fixed bugs:

  • Inconsistent Nan Handling Between Scalar and Non-Scalar Comparison Kernels #3074 [arrow]
  • Debug format for timestamp ignores timezone #3069 [arrow]
  • Row format decode loses timezone #3063 [arrow]
  • binary operator produces incorrect result on arrays with resized null buffer #3061 [arrow]
  • RLEDecoder Panics on Null Padded Pages #3035 [parquet]
  • Nullif with incorrect valid_count #3031 [arrow]
  • RLEDecoder::get_batch_with_dict may panic on bit-packed runs longer than 1024 #3029 [parquet]
  • Converted type is None according to Parquet Tools when utilizing logical types #3017
  • CompressionCodec LZ4 incompatible with C++ implementation #2988 [parquet]

Documentation updates:

Merged pull requests:

26.0.0 (2022-10-28)

Full Changelog

Breaking changes:

Implemented enhancements:

  • Optimized way to count the numbers of true and false values in a BooleanArray #2963 [arrow]
  • Add pow to i256 #2954 [arrow]
  • Write Generic Code over [Large]BinaryArray and [Large]StringArray #2946 [arrow]
  • Add Page Row Count Limit #2941 [parquet]
  • prettyprint to show timezone offset for timestamp with timezone #2937 [arrow]
  • Cast numeric to decimal256 #2922 [arrow]
  • Add freeze_with_dictionary API to MutableArrayData #2914 [arrow]
  • Support decimal256 array in sort kernels #2911 [arrow]
  • support [+/-]hhmm and [+/-]hh as fixedoffset timezone format #2910 [arrow]
  • Cleanup decimal sort function #2907 [arrow]
  • replace from_timestamp by from_timestamp_opt #2892 [arrow]
  • Move Primitive arity kernels to arrow-array #2787 [arrow]
  • add overflow-checking for negative arithmetic kernel #2662 [arrow]
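
Entry #2963 above asks for a fast way to count true and false values. As a rough sketch of the general idea, and not the arrow-rs implementation, the code below assumes the boolean values are bit-packed least-significant-bit first (as in the Arrow columnar format), starts at bit offset zero, and ignores nulls. The name `count_true` is hypothetical.

```rust
/// Count the set bits among the first `len` bits of a bit-packed buffer.
/// Bits are assumed to be stored least-significant-bit first.
/// Panics if `bytes` holds fewer than `len` bits.
fn count_true(bytes: &[u8], len: usize) -> usize {
    let full_bytes = len / 8;
    // Whole bytes can be counted with the hardware popcount.
    let mut count: usize = bytes[..full_bytes]
        .iter()
        .map(|b| b.count_ones() as usize)
        .sum();
    // Mask off the unused high bits of the trailing partial byte.
    let remaining = len % 8;
    if remaining > 0 {
        let mask = (1u8 << remaining) - 1;
        count += (bytes[full_bytes] & mask).count_ones() as usize;
    }
    count
}

/// The false count follows from the length; no second pass is needed.
fn count_false(bytes: &[u8], len: usize) -> usize {
    len - count_true(bytes, len)
}
```

Counting whole bytes at a time avoids inspecting each boolean individually, which is the main point of the optimization.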

Fixed bugs:

  • Subtle compatibility issue with serve_arrow #2952
  • error[E0599]: no method named total_cmp found for struct f16 in the current scope #2926 [arrow]
  • Fail at rowSelection and_then method #2925 [parquet]
  • Ordering not implemented for FixedSizeBinary types #2904 [arrow]
  • Parquet API: Could not convert timestamp before unix epoch to string/json #2897 [parquet]
  • Overly Pessimistic RLE Size Estimation #2889 [parquet]
  • Memory alignment error in RawPtrBox::new #2882 [arrow]
  • Compilation error under chrono-tz feature #2878 [arrow]
  • AHash Statically Allocates 64 bytes #2875 [parquet]
  • parquet::arrow::arrow_writer::ArrowWriter ignores page size properties #2853 [parquet]

Documentation updates:

Closed issues:

  • SerializedFileWriter comments about multiple call on consumed self #2935 [parquet]
  • Pointer freed error when deallocating ArrayData with shared memory buffer #2874
  • Release Arrow 25.0.0 (next release after 24.0.0) #2820 [parquet] [arrow] [arrow-flight]
  • Replace DecimalArray with PrimitiveArray #2637 [parquet] [arrow]

Merged pull requests:

25.0.0 (2022-10-14)

Full Changelog

Breaking changes:

Implemented enhancements:

Fixed bugs:

  • Don't try to infer nulls in CSV schema inference #2859 [arrow]
  • parquet::arrow::arrow_writer::ArrowWriter ignores page size properties #2853 [parquet]
  • Introducing ArrowNativeTypeOp made it impossible to call kernels from generics #2839 [arrow]
  • Unsound ArrayData to Array Conversions #2834 [parquet] [arrow]
  • Regression: the trait bound for<'de> arrow::datatypes::Schema: serde::de::Deserialize<'de> is not satisfied #2825 [arrow]
  • convert string to timestamp shouldn't apply local timezone offset if there's no explicit timezone info in the string #2813 [arrow]

Closed issues:

  • Add pub api for checking column index is sorted #2848 [parquet]

Merged pull requests:

24.0.0 (2022-09-30)

Full Changelog

Breaking changes:

Implemented enhancements:

  • Include field name in Parquet PrimitiveTypeBuilder error messages #2804 [parquet]
  • Add PrimitiveArray::reinterpret_cast #2785
  • BinaryBuilder and StringBuilder initialization parameters in struct_builder may be wrong #2783 [arrow]
  • Add divide scalar dyn kernel which produces null for division by zero #2767 [arrow]
  • Add divide dyn kernel which produces null for division by zero #2763 [arrow]
  • Improve performance of checked kernels on non-null data #2747 [arrow]
  • Add overflow-checking variants of arithmetic dyn kernels #2739 [arrow]
  • The binary function should not panic on unequal array lengths. #2721 [arrow]

Fixed bugs:

  • min compute kernel is incorrect with sliced buffers in arrow 23 #2779 [arrow]
  • try_unary_dict should check value type of dictionary array #2754 [arrow]

Closed issues:

  • Add back JSON import/export for schema #2762
  • null casting and coercion for Decimal128 #2761
  • Json decoder behavior changed from versions 21 to 21 and returns non-sensical num_rows for RecordBatch #2722 [arrow]
  • Release Arrow 23.0.0 (next release after 22.0.0) #2665 [parquet] [arrow] [arrow-flight]

Merged pull requests:

23.0.0 (2022-09-16)

Full Changelog

Breaking changes:

Implemented enhancements:

  • Cleanup like and nlike utf8 kernels #2744 [arrow]
  • Speedup eq and neq kernels for utf8 arrays #2742 [arrow]
  • API for more ergonomic construction of RecordBatchOptions #2728 [arrow]
  • Automate updates to CHANGELOG-old.md #2726
  • Don't check the DivideByZero error for float modulus #2720 [arrow]
  • try_binary should not panic on unequal array lengths. #2715 [arrow]
  • Add benchmark for bitwise operation #2714 [arrow]
  • Add overflow-checking variants of arithmetic scalar dyn kernels #2712 [arrow]
  • Add divide_opt kernel which produces null values on division by zero error #2709 [arrow]
  • Add DataType function to detect nested types #2704 [arrow]
  • Add support of sorting dictionary of other primitive types #2700 [arrow]
  • Sort indices of dictionary string values #2697 [arrow]
  • Support empty projection in RecordBatch::project #2690 [arrow]
  • Support sorting dictionary encoded primitive integer arrays #2679 [arrow]
  • Use BitIndexIterator in min_max_helper #2674 [arrow]
  • Support building comparator for dictionaries of primitive integer values #2672 [arrow]
  • Change max/min string macro to generic helper function min_max_helper #2657 [arrow]
  • Add overflow-checking variant of arithmetic scalar kernels #2651 [arrow]
  • Compare dictionary with binary array #2644 [arrow]
  • Add overflow-checking variant for primitive arithmetic kernels #2642 [arrow]
  • Use downcast_primitive_array in arithmetic kernels #2639 [arrow]
  • Support DictionaryArray in temporal kernels #2622 [arrow]
  • Inline Generated Thrift Code Into Parquet Crate #2502 [parquet]
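
Several entries in this release concern overflow-checking arithmetic (#2642, #2651, #2712), returning an error rather than panicking on mismatched lengths (#2715), and producing null on division by zero (#2709). The sketch below illustrates those three behaviors with standard-library types only. It is not the arrow-rs API: `ComputeError`, `add_checked_sketch`, and `divide_opt_sketch` are hypothetical names, and `Option<i32>` stands in for a nullable array slot.

```rust
/// Hypothetical error type for this sketch (arrow-rs uses its own error type).
#[derive(Debug, PartialEq)]
enum ComputeError {
    LengthMismatch,
    Overflow,
}

/// Element-wise addition that reports overflow instead of wrapping.
/// A null (`None`) on either side yields a null in the output.
fn add_checked_sketch(
    left: &[Option<i32>],
    right: &[Option<i32>],
) -> Result<Vec<Option<i32>>, ComputeError> {
    // Mismatched lengths are an error value, not a panic.
    if left.len() != right.len() {
        return Err(ComputeError::LengthMismatch);
    }
    left.iter()
        .zip(right)
        .map(|(l, r)| match (l, r) {
            (Some(a), Some(b)) => a
                .checked_add(*b)
                .map(Some)
                .ok_or(ComputeError::Overflow),
            _ => Ok(None),
        })
        .collect()
}

/// Element-wise division that yields null where the divisor is zero.
/// `checked_div` also returns `None` for `i32::MIN / -1`, so that case
/// becomes null here as well (a simplification made for this sketch).
fn divide_opt_sketch(
    left: &[Option<i32>],
    right: &[Option<i32>],
) -> Result<Vec<Option<i32>>, ComputeError> {
    if left.len() != right.len() {
        return Err(ComputeError::LengthMismatch);
    }
    Ok(left
        .iter()
        .zip(right)
        .map(|(l, r)| match (l, r) {
            (Some(a), Some(b)) => a.checked_div(*b),
            _ => None,
        })
        .collect())
}
```

The design choice illustrated here is that the caller decides how failure is surfaced: the checked variant stops at the first overflow with an error, while the `_opt` style variant keeps going and marks the offending slot as null.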

Fixed bugs:

  • Escape contains patterns for utf8 like kernels #2745 [arrow]
  • Float Array should not panic on DivideByZero in the Divide kernel #2719 [arrow]
  • DictionaryBuilders can Create Invalid DictionaryArrays #2684 [parquet] [arrow]
  • arrow crate does not build with features = ["ffi"] and default_features = false. #2670 [arrow]
  • Invalid results with RowSelector having row_count of 0 #2669 [parquet]
  • clippy error: unresolved import crate::array::layout #2659 [arrow]
  • Cast the numeric without the CastOptions #2648 [arrow]
  • Explicitly define overflow behavior for primitive arithmetic kernels #2641 [arrow]
  • update the flight.proto and fix schema to SchemaResult #2571 [arrow] [arrow-flight]
  • Panic when first data page is skipped using ColumnChunkData::Sparse #2543 [parquet]
  • SchemaResult in IPC deviates from other implementations #2445 [arrow] [arrow-flight]

Closed issues:

Merged pull requests:

22.0.0 (2022-09-02)

Full Changelog

Breaking changes:

Implemented enhancements:

  • Add Macros to assist with static dispatch #2635 [arrow]
  • Support comparison between DictionaryArray and BooleanArray #2617 [arrow]
  • Use total_cmp for floating value ordering and remove nan_ordering feature flag #2613 [arrow]
  • Support empty projection in CSV, JSON readers #2603 [arrow]
  • Support SQL-compliant NaN ordering between DictionaryArray and non-DictionaryArray #2599 [arrow]
  • Add dyn_cmp_dict feature flag to gate dyn comparison of dictionary arrays #2596 [arrow]
  • Add max_dyn and min_dyn for max/min for dictionary array #2584 [arrow]
  • Allow FlightSQL implementers to extend do_get() #2581 [arrow-flight]
  • Support SQL-compliant behavior on eq_dyn, neq_dyn, lt_dyn, lt_eq_dyn, gt_dyn, gt_eq_dyn #2569 [arrow]
  • Add sql-compliant feature for enabling sql-compliant kernel behavior #2568
  • Calculate sum for dictionary array #2565 [arrow]
  • Add test for float nan comparison #2556 [arrow]
  • Compare dictionary with string array #2548 [arrow]
  • Compare dictionary with primitive array in lt_dyn, lt_eq_dyn, gt_dyn, gt_eq_dyn #2538 [arrow]
  • Compare dictionary with primitive array in eq_dyn and neq_dyn #2535 [arrow]
  • UnionBuilder Create Children With Capacity #2523 [arrow]
  • Speed up like_utf8_scalar for %pat% #2519 [arrow]
  • Replace macro with TypedDictionaryArray in comparison kernels #2513 [arrow]
  • Use same codebase for boolean kernels #2507 [arrow]
  • Use u8 for Decimal Precision and Scale #2496 [arrow]
  • Integrate skip row without pageIndex in SerializedPageReader in Fuzz Test #2475 [parquet]
  • Avoid unnecessary copies in Arrow IPC reader #2437 [arrow]
  • Add GenericColumnReader::skip_records Missing OffsetIndex Fallback #2433 [parquet]
  • Support Reading PageIndex with ParquetRecordBatchStream #2430 [parquet]
  • Specialize FixedLenByteArrayReader for Parquet #2318 [parquet]
  • Make JSON support Optional via Feature Flag #2300 [arrow]
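
Entry #2613 above switches floating-point ordering to `total_cmp`. As a minimal sketch using only the Rust standard library (not the arrow kernels themselves), `f64::total_cmp` gives every value a definite position, including NaN and signed zeros, whereas `partial_cmp` returns `None` whenever a NaN is involved. The helper names are hypothetical.

```rust
use std::cmp::Ordering;

/// Sort floats using the IEEE 754 total order from the standard library.
/// Hypothetical helper for illustration; not an arrow-rs function.
fn sort_total(values: &mut [f64]) {
    values.sort_by(|a, b| a.total_cmp(b));
}

/// Compare two floats with `partial_cmp`; `None` means "unordered",
/// which is what happens as soon as either side is NaN.
fn partial_order(a: f64, b: f64) -> Option<Ordering> {
    a.partial_cmp(&b)
}
```

Under the total order, `-0.0` sorts before `+0.0` and a positive NaN sorts after every other value, so a sort no longer needs a separate feature flag or special case to decide where NaN goes.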

Fixed bugs:

  • Casting timestamp array to string should not ignore timezone #2607 [arrow]
  • ilike_utf8_scalar kernels have incorrect logic #2544 [arrow]
  • Always validate the array data when creating array in IPC reader #2541 [arrow]
  • Int96Converter Truncates Timestamps #2480 [parquet]
  • Error Reading Page Index When Not Available #2434 [parquet]
  • ParquetFileArrowReader::get_record_reader[_by_columns] batch_size overallocates #2321 [parquet]

Documentation updates:

  • Document All Arrow Features in docs.rs #2633 [arrow]

Closed issues:

  • Add support for CAST from Interval(DayTime) to Timestamp(Nanosecond, None) #2606 [arrow]
  • Why do we check for null in TypedDictionaryArray value function #2564 [arrow]
  • Add the length field for Buffer #2524 [arrow]
  • Avoid large over allocate buffer in async reader #2512 [parquet]
  • Rewriting Decimal Builders using const_generic. #2390 [arrow]
  • Rewrite Decimal Array using const_generic #2384 [arrow]

Merged pull requests:

21.0.0 (2022-08-18)

Full Changelog

Breaking changes:

Implemented enhancements:

  • add into_inner method to ArrowWriter #2491 [parquet]
  • Remove byteorder dependency #2472 [parquet]
  • Return Structured ColumnCloseResult from GenericColumnWriter::close #2465 [parquet]
  • Push ChunkReader into SerializedPageReader #2463 [parquet]
  • Support SerializedPageReader::skip_page without OffsetIndex #2459 [parquet]
  • Support Time64/Time32 comparison #2457 [arrow]
  • Revise FromIterator for Decimal128Array to use Into instead of Borrow #2441 [parquet]
  • Support RowFilter within ParquetRecordBatchReader #2431 [parquet]
  • Remove the field StructBuilder::len #2429 [arrow]
  • Standardize creation and configuration of parquet --> Arrow readers (ParquetRecordBatchReaderBuilder) #2427 [parquet]
  • Use OffsetIndex to Prune IO in ParquetRecordBatchStream #2426 [parquet]
  • Support peek_next_page and skip_next_page in InMemoryPageReader #2406 [parquet]
  • Support casting from Utf8/LargeUtf8 to Binary/LargeBinary #2402 [arrow]
  • Support casting between Decimal128 and Decimal256 arrays #2375 [arrow]
  • Combine multiple selections into the same batch size in skip_records #2358 [parquet]
  • Add API to change timezone for timestamp array #2346 [arrow]
  • Change the output of read_buffer Arrow IPC API to return Result<_> #2342 [arrow]
  • Allow skip_records in GenericColumnReader to skip across row groups #2331 [parquet]
  • Optimize the validation of Decimal256 #2320 [arrow]
  • Implement Skip for DeltaBitPackDecoder #2281 [parquet]
  • Changes to ParquetRecordBatchStream to support row filtering in DataFusion #2270 [parquet]
  • Add ArrayReader::skip_records API #2197 [parquet]

Fixed bugs:

  • Panic in SerializedPageReader without offset index #2503 [parquet]
  • MapArray columns don't handle null values correctly #2484 [arrow]
  • There is no compiler error when using an invalid Decimal type. #2440 [arrow]
  • Flight SQL Server sends incorrect response for DoPutUpdateResult #2403 [arrow-flight]
  • AsyncFileReader No Longer Object-Safe #2372 [parquet]
  • StructBuilder Does not Verify Child Lengths #2252 [arrow]

Closed issues:

Merged pull requests:

20.0.0 (2022-08-05)

Full Changelog

Breaking changes:

Implemented enhancements:

  • Add the constant data type constructors for ListArray #2311 [arrow]
  • Update FlightSqlService trait to pass session info along #2308 [arrow-flight]
  • Optimize take_bits for non-null indices #2306 [arrow]
  • Make FFI support optional via Feature Flag ffi #2302 [arrow]
  • Mark ffi::ArrowArray::try_new as safe #2301 [arrow]
  • Remove test_utils from default arrow-rs features #2298 [arrow]
  • Remove JsonEqual trait #2296 [arrow]
  • Move with_precision_and_scale to Decimal array traits #2291 [arrow]
  • Improve readability and maybe performance of string --> numeric/time/date/timestamp cast kernels #2285 [arrow]
  • Add vectorized unpacking for 8, 16, and 64 bit integers #2276 [parquet]
  • Use initial capacity for interner hashmap #2273 [arrow]
  • Impl FromIterator for Decimal256Array #2248 [arrow]
  • Separate ArrayReader::next_batch with ArrayReader::read_records and ArrayReader::consume_batch #2236 [parquet]
  • Rename DataType::Decimal to DataType::Decimal128 #2228 [arrow]
  • Automatically Grow Parquet BitWriter Buffer #2226 [parquet]
  • Add append_option support to Decimal128Builder and Decimal256Builder #2224 [arrow]
  • Split the FixedSizeBinaryArray and FixedSizeListArray from array_binary.rs and array_list.rs #2217 [arrow]
  • Don't Box Values in PrimitiveDictionaryBuilder #2215 [arrow]
  • Use BitChunks in equal_bits #2186 [arrow]
  • Implement Hash for Schema #2182 [arrow]
  • read decimal data type from parquet file with binary physical type #2159 [parquet]
  • The GenericStringBuilder should use GenericBinaryBuilder #2156 [arrow]
  • Update Rust version to 1.62 #2143 [parquet] [arrow] [arrow-flight]
  • Check precision and scale against maximum value when constructing Decimal128 and Decimal256 #2139 [arrow]
  • Use ArrayAccessor in Decimal128Iter and Decimal256Iter #2138 [arrow]
  • Use ArrayAccessor and FromIterator in Cast Kernels #2137 [arrow]
  • Add TypedDictionaryArray for more ergonomic interaction with DictionaryArray #2136 [arrow]
  • Use ArrayAccessor in Comparison Kernels #2135 [arrow]
  • Support peek_next_page() and skip_next_page in InMemoryColumnChunkReader #2129 [parquet]
  • Lazily materialize the null buffer builder for all array builders. #2125 [arrow]
  • Do value validation for Decimal256 #2112 [arrow]
  • Support skip_def_levels for ColumnLevelDecoder #2107 [parquet]
  • Add integration test for scan rows with selection #2106 [parquet]
  • Support for casting from Utf8/String to Time32 / Time64 #2053 [arrow]
  • Update prost and tonic related crates #2268 [arrow-flight] (carols10cents)

Fixed bugs:

  • temporal conversion functions cannot work on negative input properly #2325 [arrow]
  • IPC writer should truncate string array with all empty string #2312 [arrow]
  • Error order for comparing Decimal128 or Decimal256 #2256 [arrow]
  • Fix maximum and minimum for decimal values for precision greater than 38 #2246 [arrow]
  • IntervalMonthDayNanoType::make_value() does not match C implementation #2234 [arrow]
  • FlightSqlService trait does not allow impls to do handshake #2210 [arrow-flight]
  • EnabledStatistics::None not working #2185 [parquet]
  • Boolean ArrayData Equality Incorrect Slice Handling #2184 [arrow]
  • Publicly export MapFieldNames #2118 [arrow]
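
Entry #2325 above reports that temporal conversions mishandled negative inputs, that is, timestamps before the Unix epoch. The entry does not describe the root cause, so the following is only a sketch of one common pitfall: truncating division rounds toward zero, which puts a pre-epoch instant in the wrong second. Euclidean division from the standard library avoids that. The function names are hypothetical and this is not the arrow-rs code.

```rust
/// Split milliseconds since the Unix epoch into whole seconds and a
/// non-negative nanosecond remainder, correct for negative inputs.
/// Hypothetical helper for illustration; not an arrow-rs function.
fn split_millis(millis: i64) -> (i64, u32) {
    // div_euclid floors for a positive divisor, so -1 ms lands in second -1.
    let secs = millis.div_euclid(1_000);
    // rem_euclid is always in 0..1_000 here, so the cast cannot lose data.
    let nanos = (millis.rem_euclid(1_000) as u32) * 1_000_000;
    (secs, nanos)
}

/// The naive split with `/` and `%`, shown only for contrast: for negative
/// inputs it yields a negative remainder and a second that is one too large.
fn split_millis_naive(millis: i64) -> (i64, i64) {
    (millis / 1_000, (millis % 1_000) * 1_000_000)
}
```

With this split, one millisecond before the epoch is represented as second `-1` plus `999_000_000` nanoseconds, whereas the naive version produces second `0` with a negative remainder.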

Documentation updates:

  • Update instructions on How to join the slack #arrow-rust channel -- or maybe try to switch to discord?? #2192
  • [Minor] Improve arrow and parquet READMEs, document parquet feature flags #2324 [parquet] [arrow] (alamb)

Performance improvements:

Closed issues:

  • Fix wrong logic in calculate_row_count when skipping values #2328 [parquet]
  • Support filter for parquet data type #2126 [parquet]
  • Make skip value in ByteArrayDecoderDictionary avoid decoding #2088 [parquet]

Merged pull requests:

19.0.0 (2022-07-22)

Full Changelog

Breaking changes:

Implemented enhancements:

  • Use total_cmp from std #2130 [arrow]
  • Permit parallel fetching of column chunks in ParquetRecordBatchStream #2110 [parquet]
  • The GenericBinaryBuilder should use buffer builders directly. #2104 [arrow]
  • Pass generate_decimal256_case arrow integration test #2093 [arrow]
  • Rename weekday and weekday0 kernels to num_days_from_monday and days_since_sunday #2065 [arrow]
  • Improve performance of filter_dict #2062 [arrow]
  • Improve performance of set_bits #2060 [arrow]
  • Lazily materialize the null buffer builder of BooleanBuilder #2058 [arrow]
  • BooleanArray::from_iter should omit validity buffer if all values are valid #2055 [arrow]
  • FFI_ArrowSchema should set DICTIONARY_ORDERED flag if a field's dictionary is ordered #2049 [arrow]
  • Support peek_next_page() and skip_next_page in SerializedPageReader #2043 [parquet]
  • Support FFI / C Data Interface for MapType #2037 [arrow]
  • The DecimalArrayBuilder should use FixedSizeBinaryBuilder #2026 [arrow]
  • Enable serialized_reader read specific Page by passing row ranges. #1976 [parquet]

Fixed bugs:

  • type_id and value_offset are incorrect for sliced UnionArray #2086 [arrow]
  • Boolean take kernel does not handle null indices correctly #2057 [arrow]
  • Don't double-count nulls in write_batch_with_statistics #2046 [parquet]
  • Parquet Writer Ignores Statistics specification in WriterProperties #2014 [parquet]

Documentation updates:

  • Improve docstrings + examples for as_primitive_array cast functions #2114 [arrow] (alamb)

Closed issues:

  • Why does serde_json specify the preserve_order feature in arrow package #2095 [arrow]
  • Support skip_values in DictionaryDecoder #2079 [parquet]
  • Support skip_values in ColumnValueDecoderImpl #2078 [parquet]
  • Support skip_values in ByteArrayColumnValueDecoder #2072 [parquet]
  • Several Builder::append methods returning results even though they are infallible #2071
  • Improve formatting of logical plans containing subqueries #2059
  • Return reference from UnionArray::child #2035
  • support write page index #1777 [parquet]

Merged pull requests:

18.0.0 (2022-07-08)

Full Changelog

Breaking changes:

Implemented enhancements:

  • Add DataType::Dictionary support to subtract_scalar, multiply_scalar, divide_scalar #2019 [arrow]
  • Support DictionaryArray in add_scalar kernel #2017 [arrow]
  • Enable column page index read test for all types #2010 [parquet]
  • Simplify FixedSizeBinaryBuilder #2007 [arrow]
  • Support Decimal256Builder and Decimal256Array #1999 [arrow]
  • Support DictionaryArray in unary kernel #1989 [arrow]
  • Add kernel to quickly compute comparisons on Arrays #1987 [arrow]
  • Support DictionaryArray in divide kernel #1982 [arrow]
  • Implement Into<ArrayData> for T: Array #1979 [arrow]
  • Support DictionaryArray in multiply kernel #1972 [arrow]
  • Support DictionaryArray in subtract kernel #1970 [arrow]
  • Declare DecimalArray::length as a constant #1967 [arrow]
  • Support DictionaryArray in add kernel #1950 [arrow]
  • Add builder style methods to Field #1934 [arrow]
  • Make StringDictionaryBuilder faster #1851 [arrow]
  • concat_elements_utf8 should accept arbitrary number of input arrays #1748 [arrow]

Fixed bugs:

  • Array reader for list columns fails to decode if batches fall on row group boundaries #2025 [parquet]
  • ColumnWriterImpl::write_batch_with_statistics incorrect distinct count in statistics #2016 [parquet]
  • ColumnWriterImpl::write_batch_with_statistics can write incorrect page statistics #2015 [parquet]
  • RowFormatter is not part of the public api #2008 [parquet]
  • Infinite Loop possible in ColumnReader::read_batch For Corrupted Files #1997 [parquet]
  • PrimitiveBuilder::finish_dict does not validate dictionary offsets #1978 [arrow]
  • Incorrect n_buffers in FFI_ArrowArray #1959 [arrow]
  • DecimalArray::from_fixed_size_list_array fails when offset > 0 #1958 [arrow]
  • Incorrect (but ignored) metadata written after ColumnChunk #1946 [parquet]
  • Send + Sync impl for Allocation may not be sound unless Allocation is Send + Sync as well #1944 [arrow]
  • Disallow cast from other datatypes to NullType #1923 [arrow]

Documentation updates:

  • The doc of FixedSizeListArray::value_length is incorrect. #1908 [arrow]

Closed issues:

  • Column chunk statistics of min_bytes and max_bytes return wrong size #2021 [parquet]
  • [Discussion] Refactor the Decimals by using constant generic. #2001
  • Move DecimalArray to a new file #1985 [arrow]
  • Support DictionaryArray in multiply kernel #1974
  • close function instead of mutable reference #1969 [parquet]
  • Incorrect null_count of DictionaryArray #1962 [arrow]
  • Support multi diskRanges for ChunkReader #1955 [parquet]
  • Persisting Arrow timestamps with Parquet produces missing TIMESTAMP in schema #1920 [parquet]
  • Separate get_next_page_header from get_next_page in PageReader #1834 [parquet]

Merged pull requests:

17.0.0 (2022-06-24)

Full Changelog

Breaking changes:

Implemented enhancements:

  • add a small doc example showing ArrowWriter being used with a cursor #1927 [parquet]
  • Support cast to/from NULL and DataType::Decimal #1921 [arrow]
  • Add Decimal256 API #1913 [arrow]
  • Add DictionaryArray::key function #1911 [arrow]
  • Support specifying capacities for ListArrays in MutableArrayData #1884 [arrow]
  • Explicitly declare the features used for each dependency #1876 [parquet] [arrow] [arrow-flight]
  • Add Decimal128 API and use it in DecimalArray and DecimalBuilder #1870 [arrow]
  • PrimitiveArray::from_iter should omit validity buffer if all values are valid #1856 [arrow]
  • Add from(v: Vec<Option<&[u8]>>) and from(v: Vec<&[u8]>) for FixedSizeBinaryArray #1852 [arrow]
  • Add Vec-inspired APIs to BufferBuilder #1850 [arrow]
  • PyArrow integration test for C Stream Interface #1847 [arrow]
  • Add nilike support in comparison #1845 [arrow]
  • Split up arrow::array::builder module #1843 [arrow]
  • Add quarter support in temporal kernels #1835 [arrow]
  • Rename ArrayData::validate_dictionary_offset to ArrayData::validate_values #1812 [arrow]
  • Clean up the testing code for substring kernel #1801 [arrow]
  • Speed up substring_by_char kernel #1800 [arrow]

Fixed bugs:

  • unable to write parquet file with UTC timestamp #1932 [parquet]
  • Incorrect max and min decimals #1916 [arrow]
  • dynamic_types example does not print the projection #1902 [arrow]
  • log2(0) panicked at 'attempt to subtract with overflow', parquet/src/util/bit_util.rs:148:5 #1901 [parquet]
  • Final slicing in combine_option_bitmap needs to use bit slices #1899 [arrow]
  • Dictionary IPC writer writes incorrect schema #1892 [arrow]
  • Creating a RecordBatch with null values in non-nullable fields does not cause an error #1888 [arrow]
  • Upgrade regex dependency #1874 [arrow]
  • Miri reports leaks in ffi tests #1872 [arrow]
  • AVX512 + simd binary and/or kernels slower than autovectorized version #1829 [arrow]

Documentation updates:

  • Blog post about arrow 10.0.0 - 16.0.0 #1808
  • Add README for the compute module. #1940 [arrow] (HaoYang670)
  • minor: clarify docstring on DictionaryArray::lookup_key #1910 [arrow] (alamb)
  • minor: add a diagram to docstring for DictionaryArray #1909 [arrow] (alamb)
  • Closes #1902: Print the original and projected RecordBatch in dynamic_types example #1903 [arrow] (martin-g)

Closed issues:

Merged pull requests:

16.0.0 (2022-06-10)

Full Changelog

Breaking changes:

Implemented enhancements:

  • List equality method should work on empty offset ListArray #1817 [arrow]
  • Command line tool to convert CSV to Parquet #1797 [parquet]
  • IPC writer should write validity buffer for UnionArray in V4 IPC message #1793 [arrow]
  • Add function for row alignment with page mask #1790 [parquet]
  • Rust IPC Read should be able to read V4 UnionType Array #1788 [arrow]
  • combine_option_bitmap should accept arbitrary number of input arrays. #1780 [arrow]
  • Add substring_by_char kernels for slicing on character boundaries #1768 [arrow]
  • Support reading PageIndex from column metadata #1761 [parquet]
  • Support casting from DataType::Utf8 to DataType::Boolean #1740 [arrow]
  • Make current position available in FileWriter. #1691 [parquet]
  • Support writing parquet to stdout #1687 [parquet]

Fixed bugs:

  • Incorrect Offset Validation for Sliced List Array Children #1814 [arrow]
  • Parquet Snappy Codec overwrites Existing Data in Decompression Buffer #1806 [parquet]
  • flight_data_to_arrow_batch does not support RecordBatches with no columns #1783 [arrow-flight]
  • parquet does not compile with features=["zstd"] #1630 [parquet]

Documentation updates:

Closed issues:

Merged pull requests:

15.0.0 (2022-05-27)

Full Changelog

Breaking changes:

Implemented enhancements:

  • Rename the string kernel to concatenate_elements #1747 [arrow]
  • ArrayDataBuilder::null_bit_buffer should accept Option<Buffer> as input type #1737 [arrow]
  • Fix schema comparison for non_canonical_map when running flight test #1730 [arrow]
  • Add support in aggregate kernel for BinaryArray #1724 [arrow]
  • Fix incorrect null_count in generate_unions_case integration test #1712 [arrow]
  • Keep type ids in Union datatype to follow Arrow spec and integrate with other implementations #1690 [arrow]
  • Support Reading Alternative List Representations to Arrow From Parquet #1680 [parquet]
  • Speed up the offsets checking #1675 [arrow]
  • Separate Parquet -> Arrow Schema Conversion From ArrayBuilder #1655 [parquet]
  • Add leaf_columns argument to ArrowReader::get_record_reader_by_columns #1653 [parquet]
  • Implement string_concat kernel #1540 [arrow]
  • Improve Unit Test Coverage of ArrayReaderBuilder #1484 [parquet]

Fixed bugs:

  • Parquet write failure (from record batches) when data is nested two levels deep #1744 [parquet]
  • IPC reader may break on projection #1735 [arrow]
  • Latest nightly fails to build with feature simd #1734 [arrow]
  • Trying to write parquet file in parallel results in corrupt file #1717 [parquet]
  • Roundtrip failure when using DELTA_BINARY_PACKED #1708 [parquet]
  • ArrayData::try_new cannot always return expected error. #1707 [arrow]
  • "out of order projection is not supported" after Fix Parquet Arrow Schema Inference #1701 [parquet]
  • Rust is not interoperable with C++ for IPC schemas with dictionaries #1694 [arrow]
  • Incorrect Repeated Field Schema Inference #1681 [parquet]
  • Parquet Treats Embedded Arrow Schema as Authoritative #1663 [parquet]
  • parquet_to_arrow_schema_by_columns Incorrectly Handles Nested Types #1654 [parquet]
  • Inconsistent Arrow Schema When Projecting Nested Parquet File #1652 [parquet]
  • StructArrayReader Cannot Handle Nested Lists #1651 [parquet]
  • Bug (substring kernel): The null buffer is not aligned when offset != 0 #1639 [arrow]

Documentation updates:

  • Parquet command line tool does not install "globally" #1710 [parquet]
  • Improve integration test document to follow Arrow C++ repo CI #1742 [arrow] (viirya)

Merged pull requests:

14.0.0 (2022-05-13)

Full Changelog

Breaking changes:

Implemented enhancements:

  • Add support for DataType::Duration in ffi interface #1688 [arrow]
  • Fix generate_unions_case integration test #1676 [arrow]
  • Add DictionaryArray support for bit_length kernel #1673 [arrow]
  • Add DictionaryArray support for length kernel #1672 [arrow]
  • flight_client_scenarios integration test should receive schema from flight data #1669 [arrow]
  • Unpin Flatbuffer version dependency #1667 [arrow]
  • Add dictionary array support for substring function #1656 [arrow]
  • Exclude dict_id and dict_is_ordered from equality comparison of Field #1646 [arrow]
  • Remove StringOffsetTrait and BinaryOffsetTrait #1644 [arrow]
  • Add tests and examples for UnionArray::from(data: ArrayData) #1643 [arrow]
  • Add methods pub fn offsets_buffer, pub fn types_ids_buffer and pub fn data_buffer for ArrayDataBuilder #1640 [arrow]
  • Fix generate_nested_dictionary_case integration test failure for Rust cases #1635 [arrow]
  • Expose ArrowWriter row group flush in public API #1626 [parquet]
  • Add substring support for FixedSizeBinaryArray #1618 [arrow]
  • Add PrettyPrint for UnionArrays #1594 [arrow]
  • Add SIMD support for the length kernel #1489 [arrow]
  • Support dictionary arrays in length and bit_length #1674 [arrow] (viirya)
  • Add dictionary array support for substring function #1665 [arrow] (sunchao)
  • Add DecimalType support in new_null_array #1659 [arrow] (yjshen)

Fixed bugs:

  • Docs.rs build is broken #1695
  • Interoperability with C++ for IPC schemas with dictionaries #1694
  • UnionArray::is_null incorrect #1625 [arrow]
  • Published Parquet documentation missing arrow::async_reader #1617 [parquet]
  • Files written with Julia's Arrow.jl in IPC format cannot be read by arrow-rs #1335 [arrow]

Documentation updates:

Closed issues:

  • Make OffsetSizeTrait::IS_LARGE as a const value #1658
  • Question: Why are there 3 types of OffsetSizeTraits? #1638
  • Written Parquet file way bigger than input files #1627
  • Ensure there is a single zero in the offsets buffer for an empty ListArray. #1620
  • Filtering UnionArray Changes DataType #1595

Merged pull requests:

13.0.0 (2022-04-29)

Full Changelog

Breaking changes:

Implemented enhancements:

  • Read/write nested dictionary under fixed size list in ipc stream reader/write #1609 [arrow]
  • Add support for BinaryArray in substring kernel #1593 [arrow]
  • Read/write nested dictionary under large list in ipc stream reader/write #1584 [arrow]
  • Read/write nested dictionary under map in ipc stream reader/write #1582 [arrow]
  • Implement Clone for JSON DecoderOptions #1580 [arrow]
  • Add utf-8 validation checking to substring kernel #1575 [arrow]
  • Support casting to/from DataType::Null in cast kernel #1572 [arrow] (WinkerDu)

Fixed bugs:

  • Parquet schema should allow scale == precision for decimal type #1606 [parquet]
  • ListArray::from(ArrayData) dereferences invalid pointer when offsets are empty #1601 [arrow]
  • ArrayData Equality Incorrect Null Mask Offset Handling #1599
  • Filtering UnionArray Incorrectly Handles Runs #1598
  • [Safety] Filtering Dense UnionArray Produces Invalid Offsets #1596
  • [Safety] UnionBuilder Doesn't Check Types #1591
  • Union Layout Should Not Support Separate Validity Mask #1590
  • Incorrect nullable flag when reading maps ( test_read_maps fails when force_validate is active) #1587 [parquet]
  • Output of ipc::reader::tests::projection_should_work fails validation #1548 [arrow]
  • Incorrect min/max statistics for decimals with byte-array notation #1532

Documentation updates:

Closed issues:

  • Dense UnionArray Offsets Are i32 not i8 #1597 [arrow]
  • Replace &Option<T> with Option<&T> in some APIs #1556 [parquet] [arrow]
  • Improve ergonomics of parquet::basic::LogicalType #1554 [parquet]
  • Mark the current substring function as unsafe and rename it. #1541 [arrow]
  • Requirements for Async Parquet API #1473 [parquet]

Merged pull requests:

12.0.0 (2022-04-15)

Full Changelog

Breaking changes:

  • Add ArrowReaderOptions to ParquetFileArrowReader, add option to skip decoding arrow metadata from parquet (#1459) #1558 [parquet] (tustvold)
  • Support RecordBatch with zero columns but non zero row count, add field to RecordBatchOptions (#1536) #1552 [arrow] (tustvold)
  • Consolidate JSON Reader options and DecoderOptions #1539 [arrow] (alamb)
  • Update prost, prost-derive and prost-types to 0.10, tonic, and tonic-build to 0.7 #1510 [arrow-flight] (alamb)
  • Add Json DecoderOptions and support custom format_string for each field #1451 [arrow] (sum12)

Implemented enhancements:

  • Read/write nested dictionary in ipc stream reader/writer #1565 [arrow]
  • Support FixedSizeBinary in the Arrow C data interface #1553 [arrow]
  • Support Empty Column Projection in ParquetRecordBatchReader #1537 [parquet]
  • Support RecordBatch with zero columns but non zero row count #1536 [arrow]
  • Add support for Date32/Date64 <--> String/LargeString in cast kernel #1535 [arrow]
  • Support creating arrays from externally owned memory like Vec or String #1516 [arrow]
  • Speed up the substring kernel #1511 [arrow]
  • Handle Parquet Files With Inconsistent Timestamp Units #1459 [parquet]

Fixed bugs:

  • Error Inferring Schema for LogicalType::UNKNOWN #1557 [parquet]
  • Read dictionary from nested struct in ipc stream reader panics #1549 [arrow]
  • filter produces invalid sparse UnionArrays #1547 [arrow]
  • Documentation for GenericListBuilder is not exposed. #1518 [arrow]
  • cannot read parquet file #1515 [parquet]
  • The substring kernel panics when chars > U+0x007F #1478 [arrow]
  • Hang due to infinite loop when reading some parquet files with RLE encoding and bit packing #1458 [parquet]

Documentation updates:

Closed issues:

  • Interesting benchmark results of min_max_helper #1400

Merged pull requests:

11.1.0 (2022-03-31)

Full Changelog

Implemented enhancements:

  • Implement size_hint and ExactSizedIterator for DecimalArray #1505 [arrow]
  • Support calculate length by chars for StringArray #1493 [arrow]
  • Add length kernel support for ListArray #1470 [arrow]
  • The length kernel should work with BinaryArrays #1464 [arrow]
  • FFI for Arrow C Stream Interface #1348 [arrow]
  • Improve performance of DictionaryArray::try_new() #1313 [arrow]

Fixed bugs:

  • MIRI error in math_checked_divide_op/try_from_trusted_len_iter #1496 [arrow]
  • Parquet Writer Incorrect Definition Levels for Nested NullArray #1480 [parquet]
  • FFI: ArrowArray::try_from_raw shouldn't clone #1425 [arrow]
  • Parquet reader fails to read null list. #1399 [parquet]

Documentation updates:

  • A small mistake in the doc of BinaryArray and LargeBinaryArray #1455 [arrow]
  • A small mistake in the doc of GenericBinaryArray::take_iter_unchecked #1454 [arrow]
  • Add links in the doc of BinaryOffsetSizeTrait #1453 [arrow]
  • The doc of FixedSizeBinaryArray is confusing. #1452 [arrow]
  • Clarify docs that SlicesIterator ignores null values #1504 [arrow] (alamb)
  • Update the doc of BinaryArray and LargeBinaryArray #1471 [arrow] (HaoYang670)

Closed issues:

  • packed_simd vs. portable_simd, which should be used? #1492
  • Cleanup: Use Arrow take kernel Within parquet ListArrayReader #1482 [parquet]

Merged pull requests:

11.0.0 (2022-03-17)

Full Changelog

Breaking changes:

Implemented enhancements:

  • Fix generate_interval_case integration test failure #1445
  • Make the doc examples of ListArray and LargeListArray more readable #1433
  • Redundant if and abs in shift() #1427
  • Improve substring kernel performance #1422 [arrow]
  • Add missing value_unchecked() of FixedSizeBinaryArray #1419
  • Remove duplicate bound check in function shift #1408
  • Support dictionary array in C data interface #1397
  • filter kernel should work with UnionArrays #1394 [arrow]
  • filter kernel should work with FixedSizeListArrays #1393 [arrow]
  • Add doc examples for creating FixedSizeListArray #1392 [arrow]
  • Update rust-version to 1.59 #1377
  • Arrow IPC projection support #1338
  • Implement basic FlightSQL Server #1386 [arrow-flight] (wangfenjin)

Fixed bugs:

  • DictionaryArray::try_new ignores validity bitmap of the keys #1429 [arrow]
  • The doc of GenericListArray is confusing #1424
  • DeltaBitPackDecoder Incorrectly Handles Non-Zero MiniBlock Bit Width Padding #1417 [parquet]
  • DeltaBitPackEncoder Pads Miniblock BitWidths With Arbitrary Values #1416 [parquet]
  • Possible unaligned write with MutableBuffer::push #1410 [arrow]
  • Integration Test is failing on master branch #1398 [arrow]

Documentation updates:

Merged pull requests:

10.0.0 (2022-03-04)

Full Changelog

Breaking changes:

Implemented enhancements:

  • Add extract month and day in temporal.rs #1387
  • Add clone to IpcWriteOptions #1381 [arrow]
  • Support MapArray in filter kernel #1378 [arrow]
  • Add week temporal kernel #1375 [arrow]
  • Improve performance of compare_dict_op #1371 [arrow]
  • Add support for LargeUtf8 in json writer #1357 [parquet]
  • Make arrow::array::builder::MapBuilder public #1354 [arrow]
  • Refactor StructArray::from #1351 [arrow]
  • Refactor RecordBatch::validate_new_batch #1350 [arrow]
  • Remove redundant has_ methods for optional column metadata fields #1344 [parquet]
  • Add write method to JsonWriter #1340 [arrow]
  • Refactor the code of Bitmap::new #1337 [arrow]
  • Use DictionaryArray's iterator in compare_dict_op #1329 [arrow]
  • Add as_decimal_array(arr: &dyn Array) -> &DecimalArray #1312 [arrow]
  • More ergonomic / idiomatic primitive array creation from iterators #1298 [arrow]
  • Implement DictionaryArray support in eq_dyn, neq_dyn, lt_dyn, lt_eq_dyn, gt_dyn, gt_eq_dyn #1201 [arrow]

Fixed bugs:

  • cargo clippy fails on the master branch #1362 [arrow]
  • ArrowArray::try_from_raw should not assume the pointers are from Arc #1333 [arrow]
  • Fix CSV Writer::new to accept delimiter and make WriterBuilder::build use it #1328 [arrow]
  • Make bounds configurable via builder when reading CSV #1327 [arrow]
  • Add with_datetime_format() to CSV WriterBuilder #1272 [arrow]

Performance improvements:

  • Improve performance of min and max aggregation kernels without nulls #1373 [arrow]

Closed issues:

  • Consider removing redundant has_XXX metadata functions in ColumnChunkMetadata #1332

Merged pull requests:

9.1.0 (2022-02-19)

Full Changelog

Implemented enhancements:

Fixed bugs:

  • len is not a parameter of MutableArrayData::extend #1316
  • module data_type is private in Rust Parquet 8.0.0 #1302 [parquet]
  • Test failure: bit_chunk_iterator #1294
  • csv_writer benchmark fails with "no such file or directory" #1292

Documentation updates:

Performance improvements:

Closed issues:

  • Expose column and offset index metadata offset #1317
  • Expose bloom filter metadata offset #1308
  • Improve ergonomics to construct DictionaryArrays from Key and Value arrays #1299
  • Make it easier to iterate over DictionaryArray #1295 [arrow]
  • (WON'T FIX) Don't Intertwine Bit and Byte Aligned Operations in BitReader #1282
  • how to create arrow::array from streamReader #1278
  • Remove scientific notation when converting floats to strings. #983

Merged pull requests:

9.0.2 (2022-02-09)

Full Changelog

Breaking changes:

  • Add Send + Sync to DataType, RowGroupReader, FileReader, ChunkReader. #1264
  • Rename the function Bitmap::len to Bitmap::bit_len to clarify its meaning #1242 [parquet] [arrow] (HaoYang670)
  • Remove unused / broken memory-check feature #1222 [arrow] (jhorstmann)
  • Potentially buffer multiple RecordBatches before writing a parquet row group in ArrowWriter #1214 [parquet] [arrow] (tustvold)

Implemented enhancements:

  • Add async arrow parquet reader #1154 [parquet] [arrow] (tustvold)
  • Rename Bitmap::len to Bitmap::bit_len #1233
  • Extend CSV schema inference to allow scientific notation for floating point types #1215 [arrow]
  • Write Multiple RecordBatch to Parquet Row Group #1211
  • Add doc examples for eq_dyn etc. #1202 [arrow]
  • Add comparison kernels for BinaryArray #1108
  • impl ArrowNativeType for i128 #1098
  • Remove Copy trait bound from dyn scalar kernels #1243 [arrow] (matthewmturner)
  • Add into_inner for IPC FileWriter #1236 [arrow] (yjshen)
  • [Minor] Re-export array::builder::make_builder to make it available for downstream #1235 [arrow] (yjshen)

Fixed bugs:

  • Parquet v8.0.0 panics when reading all null column to NullArray #1245 [parquet]
  • Get Unknown configuration option rust-version when running the rust format command #1240
  • Bitmap Length Validation is Incorrect #1231 [arrow]
  • Writing sliced ListArray or MapArray ignore offsets #1226 [parquet]
  • Remove broken memory-tracking crate feature #1171
  • Revert making parquet::data_type and parquet::arrow::schema experimental #1244 [parquet] (tustvold)

Documentation updates:

Performance improvements:

  • Improve performance for arithmetic kernels with simd feature enabled (except for division/modulo) #1221 [arrow] (jhorstmann)
  • Do not concatenate identical dictionaries #1219 [arrow] (tustvold)
  • Preserve dictionary encoding when decoding parquet into Arrow arrays, 60x perf improvement (#171) #1180 [parquet] (tustvold)

Closed issues:

  • UnalignedBitChunkIterator that iterates through already aligned u64 blocks #1227
  • Remove unused ArrowArrayReader in parquet #1197 [parquet]

Merged pull requests:

8.0.0 (2022-01-20)

Full Changelog

Breaking changes:

Implemented enhancements:

  • Parquet reader should be able to read structs within list #1186 [parquet]
  • Disable serde_json arbitrary_precision feature flag #1174 [arrow]
  • Simplify and reduce code duplication in arithmetic.rs #1160 [arrow]
  • Return Err from JSON writer rather than panic! for unsupported types #1157 [arrow]
  • Support scalar mathematics kernels for Array and scalar value #1153 [arrow]
  • Support DecimalArray in sort kernel #1137
  • Parquet Fuzz Tests #1053
  • BooleanBufferBuilder Append Packed #1038 [arrow]
  • parquet Performance Optimization: StructArrayReader Redundant Level & Bitmap Computation #1034 [parquet]
  • Reduce Public Parquet API #1032 [parquet]
  • Add from_iter_values for binary array #1188 [arrow] (Jimexist)
  • Add support for MapArray in json writer #1149 [arrow] (helgikrs)

Fixed bugs:

  • Empty string arrays with no nulls are not equal #1208 [arrow]
  • Pretty print a RecordBatch containing Float16 triggers a panic #1193 [arrow]
  • Writing structs nested in lists produces an incorrect output #1184 [parquet]
  • Undefined behavior for GenericStringArray::from_iter_values if reported iterator upper bound is incorrect #1144 [arrow]
  • Interval comparisons with simd feature asserts #1136 [arrow]
  • RecordReader Permits Illegal Types #1132 [parquet]

Security fixes:

Documentation updates:

Performance improvements:

  • Improve parquet reading performance for columns with nulls by preserving bitmask when possible (#1037) #1054 [parquet] [arrow] (tustvold)
  • Improve parquet performance: Skip levels computation for required struct arrays in parquet #1035 [parquet] (tustvold)

Closed issues:

  • Generify ColumnReaderImpl and RecordReader #1040 [parquet]
  • Parquet Preserve BitMask #1037

Merged pull requests:

7.0.0 (2022-01-07)

Full Changelog

Arrow

Breaking changes:

  • pretty_format_batches now returns Result<impl Display> rather than String: #975
  • MutableBuffer::typed_data_mut is marked unsafe: #1029
  • UnionArray updated to match the latest Arrow spec, added UnionMode, UnionArray::new() marked unsafe: #885

New Features:

  • Support for Float16Array types #888
  • IPC support for UnionArray #654
  • Dynamic comparison kernels for scalars (e.g. eq_dyn_scalar), including DictionaryArray: #1113

Enhancements:

  • Added Schema::with_metadata and Field::with_metadata #1092
  • Support for custom datetime format for inference and parsing csv files #1112
  • Implement Array for ArrayRef for easier use #1129
  • Pretty printing display support for FixedSizeBinaryArray #1097
  • Dependency Upgrades: pyo3, parquet-format, prost, tonic
  • Avoid allocating vector of indices in lexicographical_partition_ranges #998

Parquet

Fixed bugs:

  • (parquet) Fix reading of dictionary encoded pages with null values: #1130

Changelog

6.5.0 (2021-12-23)

Full Changelog

6.4.0 (2021-12-10)

Full Changelog

6.3.0 (2021-11-26)

Full Changelog

Changes:

6.2.0 (2021-11-12)

Full Changelog

Features / Fixes:

6.1.0 (2021-10-29)

Full Changelog

Features / Fixes:

Other:

6.0.0 (2021-10-13)

Full Changelog

Breaking changes:

Implemented enhancements:

  • Improve parquet binary writer speed by reducing allocations #819
  • Expose buffer operations #808
  • Add doc examples of writing parquet files using ArrowWriter #788

Fixed bugs:

  • JSON reader can create null struct children on empty lists #825
  • Incorrect null count for cast kernel for list arrays #815
  • minute and second temporal kernels do not respect timezone #500
  • Fix data corruption in json decoder f64-to-i64 cast #652 [arrow] (xianwill)

Documentation updates:

5.5.0 (2021-09-24)

Full Changelog

Implemented enhancements:

  • parquet should depend on a small set of arrow features #800
  • Support equality on RecordBatch #735

Fixed bugs:

  • Converting from string to timestamp uses microseconds instead of milliseconds #780
  • Document has no link to RowColumIter #762
  • length on slices with null doesn't work #744

5.4.0 (2021-09-10)

Full Changelog

Implemented enhancements:

  • Upgrade lexical-core to 0.8 #747
  • append_nulls and append_trusted_len_iter for PrimitiveBuilder #725
  • Optimize MutableArrayData::extend for null buffers #397

Fixed bugs:

  • Arithmetic with scalars doesn't work on slices #742
  • Comparisons with scalar don't work on slices #740
  • unary kernel doesn't respect offset #738
  • new_null_array creates invalid struct arrays #734
  • --no-default-features is broken for parquet #733 [parquet]
  • Bitmap::len returns the number of bytes, not bits. #730
  • Decimal logical type is formatted incorrectly by print_schema #713
  • parquet_derive does not support chrono time values #711
  • Numeric overflow when formatting Decimal type #710
  • The integration tests are not running #690

Closed issues:

  • Question: Is there no way to create a DictionaryArray with a pre-arranged mapping? #729

5.3.0 (2021-08-26)

Full Changelog

Implemented enhancements:

  • Add optimized filter kernel for regular expression matching #697
  • Can't cast from timestamp array to string array #587

Fixed bugs:

  • 'Encoding DELTA_BYTE_ARRAY is not supported' with parquet arrow readers #708
  • Support reading json string into binary data type. #701

Closed issues:

  • Resolve Issues with prettytable-rs dependency #69 [arrow]

5.2.0 (2021-08-12)

Full Changelog

Implemented enhancements:

  • Make rand an optional dependency #671
  • Remove undefined behavior in value method of boolean and primitive arrays #645
  • Avoid materialization of indices in filter_record_batch for single arrays #636
  • Add a note about arrow crate security / safety #627
  • Allow the creation of String arrays from an iterator of &Option<&str> #598
  • Support arrow map datatype #395

Fixed bugs:

  • Parquet fixed length byte array columns write byte array statistics #660 [parquet]
  • Parquet boolean columns write Int32 statistics #659 [parquet]
  • Writing Parquet with a boolean column fails #657
  • JSON decoder data corruption for large i64/u64 #653
  • Incorrect min/max statistics for strings in parquet files #641 [parquet]

Closed issues:

  • Release candidate verifying script seems to work on macOS #640
  • Update CONTRIBUTING #342

5.1.0 (2021-07-29)

Full Changelog

Implemented enhancements:

  • Make FFI_ArrowArray empty() public #602
  • exponential sort can be used to speed up lexico partition kernel #586
  • Implement sort() for binary array #568
  • primitive sorting can be improved and more consistent with and without limit if sorted unstably #553

Fixed bugs:

  • Confusing memory usage with CSV reader #623
  • FFI implementation deviates from specification for array release #595
  • Parquet file content is different if ~/.cargo is in a git checkout #589
  • Ensure output of MIRI is checked for success #581
  • MIRI failure in array::ffi::tests::test_struct and other ffi tests #580
  • ListArray equality check may return wrong result #570
  • cargo audit failed #561
  • ArrayData::slice() does not work for nested types such as StructArray #554

Documentation updates:

  • More examples of how to construct Arrays #301

Closed issues:

  • Implement StringBuilder::append_option #263 [arrow]

5.0.0 (2021-07-14)

Full Changelog

Breaking changes:

Implemented enhancements:

Fixed bugs:

  • Error building on master - error: cyclic package dependency: package ahash v0.7.4 depends on itself. Cycle #544
  • IPC reader panics with out of bounds error #541
  • Take kernel doesn't handle nulls and structs correctly #530 [arrow]
  • master fails to compile with default-features=false #529
  • README developer instructions out of date #523
  • Update rustc and packed_simd in CI before 5.0 release #517
  • Incorrect memory usage calculation for dictionary arrays #503 [arrow]
  • sliced null buffers lead to incorrect result in take kernel (and probably on other places) #502
  • Cast of utf8 types and list container types don't respect offset #334 [arrow]
  • fix take kernel null handling on structs #531 [arrow] (bjchambers)
  • Correct array memory usage calculation for dictionary arrays #505 [arrow] (jhorstmann)
  • parquet: improve BOOLEAN writing logic and report error on encoding fail #443 [parquet] (garyanaplan)
  • Fix bug with null buffer offset in boolean not kernel #418 [arrow] (jhorstmann)
  • respect offset in utf8 and list casts #335 [arrow] (ritchie46)
  • Fix comparison of dictionaries with different values arrays (#332) #333 [arrow] (tustvold)
  • ensure null-counts are written for all-null columns #307 [parquet] (crepererum)
  • fix invalid null handling in filter #296 [arrow] (ritchie46)
  • fix NaN handling in parquet statistics #256 (crepererum)

Documentation updates:

Merged pull requests:

4.4.0 (2021-06-24)

Full Changelog

Breaking changes:

  • migrate partition kernel to use Iterator trait #437 [arrow]
  • Remove DictionaryArray::keys_array #391 [arrow]

Implemented enhancements:

  • sort kernel boolean sort can be O(n) #447 [arrow]
  • C data interface for decimal128, timestamp, date32 and date64 #413
  • Add Decimal to CsvWriter #405
  • Use iterators to increase performance of creating Arrow arrays #200 [parquet]

Fixed bugs:

  • Release Audit Tool (RAT) is not being triggered #481
  • Security Vulnerabilities: flatbuffers: read_scalar and read_scalar_at allow transmuting values without unsafe blocks #476
  • Clippy broken after upgrade to rust 1.53 #467
  • Pull Request Labeler is not working #462
  • Arrow 4.3 release: error[E0658]: use of unstable library feature 'partition_point': new API #456
  • parquet reading hangs when row_group contains more than 2048 rows of data #349
  • Fail to build arrow #247
  • JSON reader does not implement iterator #193 [arrow]

Security fixes:

  • Ensure a successful MIRI Run on CI #227

Closed issues:

  • sort kernel has a lot of unnecessary wrapping #446
  • [Parquet] Plain encoded boolean column chunks limited to 2048 values #48 [parquet]

4.3.0 (2021-06-10)

Full Changelog

Implemented enhancements:

  • Add partitioning kernel for sorted arrays #428 [arrow]
  • Implement sort by float lists #427 [arrow]
  • Derive Eq and PartialEq for SortOptions #426 [arrow]
  • use prettier and github action to normalize markdown document syntax #399
  • window::shift can work for more than just primitive array type #392
  • Doctest for ArrayBuilder #366

Fixed bugs:

  • Boolean not kernel does not take offset of null buffer into account #417
  • my contribution not merged in 4.2 release #394
  • window::shift shall properly handle boundary cases #387
  • Parquet WriterProperties.max_row_group_size not wired up #257
  • Out of bound reads in chunk iterator #198 [arrow]

4.2.0 (2021-05-29)

Full Changelog

Breaking changes:

  • DictionaryArray::values() clones the underlying ArrayRef #313 [arrow]

Implemented enhancements:

  • Simplify shift kernel using null array #371
  • Provide Arc-based constructor for parquet::util::cursor::SliceableCursor #368
  • Add badges to crates #361
  • Consider inlining PrimitiveArray::value #328
  • Implement automated release verification script #327
  • Add wasm32 to the list of target architectures of the simd feature #316
  • add with_escape for csv::ReaderBuilder #315 [arrow]
  • IPC feature gate #310
  • csv feature gate #309 [arrow]
  • Add shrink_to / shrink_to_fit to MutableBuffer #297

Fixed bugs:

  • Incorrect crate setup instructions #364
  • Arrow-flight only register rerun-if-changed if file exists #350
  • Dictionary Comparison Uses Wrong Values Array #332
  • Undefined behavior in FFI implementation #322
  • All-null column gets wrong parquet null-counts #306 [parquet]
  • Filter has inconsistent null handling #295

4.1.0 (2021-05-17)

Full Changelog

Implemented enhancements:

  • Add Send to ArrayBuilder #290 [arrow]
  • Improve performance of bound checking option #280 [arrow]
  • extend compute kernel arity to include nullary functions #276
  • Implement FFI / CDataInterface for Struct Arrays #251 [arrow]
  • Add support for pretty-printing Decimal numbers #230 [arrow]
  • CSV Reader String Dictionary Support #228 [arrow]
  • Add Builder interface for adding Arrays to record batches #210 [arrow]
  • Support auto-vectorization for min/max #209 [arrow]
  • Support LargeUtf8 in sort kernel #25 [arrow]

Fixed bugs:

  • no method named select_nth_unstable_by found for mutable reference &mut [T] #283
  • Rust 1.52 Clippy error #266
  • NaNs can break parquet statistics #255 [parquet]
  • u64::MAX does not roundtrip through parquet #254 [parquet]
  • Integration tests failing to compile (flatbuffer) #249 [arrow]
  • Fix compatibility quirks between arrow and parquet structs #245 [parquet]
  • Unable to write non-null Arrow structs to Parquet #244 [parquet]
  • schema: missing field metadata when deserialize #241 [arrow]
  • Arrow does not compile due to flatbuffers upgrade #238 [arrow]
  • Sort with limit panics when the limit includes some but not all nulls, for large arrays #235 [arrow]
  • arrow-rs contains a copy of the "format" directory #233 [arrow]
  • Fix SEGFAULT/ SIGILL in child-data ffi #206 [arrow]
  • Read list field correctly in <struct<list>> #167 [parquet]
  • FFI listarray lead to undefined behavior. #20

Security fixes:

Documentation updates:

  • Comment out the instructions in the PR template #277
  • Update links to datafusion and ballista in README.md #19
  • Update "repository" in Cargo.toml #12

Closed issues:

  • Arrow Aligned Vec #268
  • [Rust]: Tracking issue for AVX-512 #220 [arrow]
  • Umbrella issue for clippy integration #217 [arrow]
  • Support sort #215 [arrow]
  • Support stable Rust #214 [arrow]
  • Remove Rust and point integration tests to arrow-rs repo #211 [arrow]
  • ArrayData buffers are inconsistent across implementations #207
  • 3.0.1 patch release #204
  • Document patch release process #202
  • Simplify Offset #186 [arrow]
  • Typed Bytes #185 [arrow]
  • [CI] docker-compose setup should enable caching #175
  • Improve take primitive performance #174
  • [CI] Try out buildkite #165 [arrow]
  • Update assignees in JIRA where missing #160
  • [Rust]: From<ArrayDataRef> implementations should validate data type #103 [arrow]
  • [DataFusion] Verify that projection push down does not remove aliases columns #99 [arrow]
  • [Rust][DataFusion] Implement modulus expression #98 [arrow]
  • [DataFusion] Add constant folding to expressions during logically planning #96 [arrow]
  • [DataFusion] DataFrame.collect should return RecordBatchReader #95 [arrow]
  • [Rust][DataFusion] Add FORMAT to explain plan and an easy to visualize format #94 [arrow]
  • [DataFusion] Implement metrics framework #90 [arrow]
  • [DataFusion] Implement micro benchmarks for each operator #89 [arrow]
  • [DataFusion] Implement pretty print for physical query plan #88 [arrow]
  • [Archery] Support rust clippy in the lint command #83
  • [rust][datafusion] optimize count(*) queries on parquet sources #75 [arrow]
  • [Rust][DataFusion] Improve like/nlike performance #71 [arrow]
  • [DataFusion] Implement optimizer rule to remove redundant projections #56 [arrow]
  • [DataFusion] Parquet data source does not support complex types #39 [arrow]
  • Merge utils from Parquet and Arrow #32 [arrow] [parquet]
  • Add benchmarks for Parquet #30 [parquet]
  • Mark methods that do not perform bounds checking as unsafe #28 [arrow]
  • Test issue #24 [arrow]
  • This is a test issue #11