ARROW-10413: [Rust] [Parquet] Unignore some tests that are passing now #8546
Conversation
LGTM
Force-pushed from ce06e17 to 3cc3a4b
@carols10cents this needs a rebase :( I had to fix failing tests on the parquet branch before merging it, so there's now a conflict
Cool improvements here. Thanks a lot for taking them on! I left some comments on the code.
My main concern is the `PartialEq` for `ArrayData`, which IMO is not correct. All other comments are smaller improvements.
rust/arrow/src/array/data.rs
Outdated
```rust
impl PartialEq for ArrayData {
    fn eq(&self, other: &Self) -> bool {
        assert_eq!(
            self.data_type(),
            other.data_type(),
            "Data types not the same"
        );
        assert_eq!(self.len(), other.len(), "Lengths not the same");
        // TODO: when adding tests for this, test that we can compare with arrays that have offsets
        assert_eq!(self.offset(), other.offset(), "Offsets not the same");
        assert_eq!(self.null_count(), other.null_count());
        // compare buffers excluding padding
        let self_buffers = self.buffers();
        let other_buffers = other.buffers();
        assert_eq!(self_buffers.len(), other_buffers.len());
        self_buffers.iter().zip(other_buffers).for_each(|(s, o)| {
            compare_buffer_regions(
                s,
                self.offset(), // TODO mul by data length
                o,
                other.offset(), // TODO mul by data len
            );
        });
        // assert_eq!(self.buffers(), other.buffers());
        assert_eq!(self.child_data(), other.child_data());
        // null arrays can skip the null bitmap, thus only compare if there are no nulls
        if self.null_count() != 0 || other.null_count() != 0 {
            compare_buffer_regions(
                self.null_buffer().unwrap(),
                self.offset(),
                other.null_buffer().unwrap(),
                other.offset(),
            )
        }
```
I can't understand this part of the code.
- Why are these asserts instead of equality checks? Doesn't this panic on the comparison?
- Why do we only compare buffers when there are no nulls? What about the other cases?
I am trying to understand the goal here: IMO there are two equalities that we can define for `ArrayData`: a "physical equality" and a "logical equality".

Physical equality compares the data (in bytes) on the `ArrayData`, and two arrays are equal when that data is equal. For example, two `ArrayData` with 1 buffer each and a null bitmap of the form

```
# (buffer, null_bit_as_bools)
a = ([1, 2], [true, false])
b = ([1, 0], [true, false])
```

are different under this equality because their data is different.

Logical equality interprets the data on the `ArrayData` according to the spec and performs the comparison accordingly. In the example above, the two arrays are equal because the `2` and `0` become undefined due to the null bit.
Is this code trying to achieve the former equality, or the latter? If it is the former, then:
- it needs to take into account that buffer equality for the null bitmap and for boolean arrays is different, and
- the condition `if self.null_count() != 0 || other.null_count() != 0` should not be there, as a physical comparison ignores null counts.

If it is trying to achieve logical equality, then it has a lot of issues, as buffer and child equality are not that simple.
Coincidentally, I have been looking at the logical comparison of `ArrayData` as part of #8541.
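To make the distinction concrete, here is a minimal, self-contained Rust sketch. It is not the arrow crate's API: `MiniArray`, `physical_eq`, and `logical_eq` are hypothetical names, and the data layout (a value vector plus a validity vector) only mirrors the example above.

```rust
// Hypothetical model of an array: values plus a validity bitmap.
#[derive(Debug)]
struct MiniArray {
    values: Vec<i32>,
    validity: Vec<bool>, // true = valid, false = null
}

// Physical equality: byte-for-byte, null slots included.
fn physical_eq(a: &MiniArray, b: &MiniArray) -> bool {
    a.values == b.values && a.validity == b.validity
}

// Logical equality: values behind a null bit are undefined,
// so compare the validity first and then only the valid slots.
fn logical_eq(a: &MiniArray, b: &MiniArray) -> bool {
    a.validity == b.validity
        && a.values
            .iter()
            .zip(&b.values)
            .zip(&a.validity)
            .all(|((x, y), valid)| !*valid || x == y)
}

fn main() {
    // The example from the comment above: a = ([1, 2], [T, F]), b = ([1, 0], [T, F]).
    let a = MiniArray { values: vec![1, 2], validity: vec![true, false] };
    let b = MiniArray { values: vec![1, 0], validity: vec![true, false] };
    assert!(!physical_eq(&a, &b)); // bytes differ in the second slot
    assert!(logical_eq(&a, &b)); // the 2 and 0 are masked by the null bit
}
```

Under this sketch the two arrays are physically different but logically equal, which is exactly the ambiguity the `PartialEq` implementation needs to resolve.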
Hi @jorgecarleitao, I introduced these changes in #8200 because we were failing integration tests where:
- We create an Arrow array that has no nulls from JSON, and don't allocate a null buffer (as it's optional: http://arrow.apache.org/docs/format/Columnar.html#validity-bitmaps)
- When reading IPC files & streams, the C++ implementation always has the null buffer allocated

So without changing the comparison to account for the above, both the Parquet roundtrips and the integration tests were failing even though we were comparing logically correct data.
On panics, there's a slight leakage of intention: much of the equality was introduced to aid testing, but because we don't yet have a logging mechanism, something like the below:

```rust
let a: Int32Array = ...;
let b: Int32Array = ...;
// try compare in tests
assert_eq!(&a, &b);
```

would fail with a panic on the assert if the arrays are unequal, but because the comparison itself only returns a `bool`, one wouldn't know where inside the arrays the failure occurred.
That makes it tough to see why tests are failing, hence the practice (I'm fine with calling it bad, as I still do it) of adding asserts.
I'll work with you on #8541 to find a solution that can benefit both us when testing and downstream users if they want a panic-free `ArrayRef::equals(other: &ArrayRef) -> bool` kind of interface.
> If it is trying to achieve logical equality, then it has a lot of issues as buffer and child equality are not that simple.

I appreciate this, and the intention is to achieve logical equality, even with slices. I believe that the path I took, while still incomplete, would help us get there.
The lack of offset handling from that `TODO` could have been motivation not to merge the parquet branch in, but with all the refactoring work happening elsewhere, it's become very inconvenient to keep rebasing and resolving merge conflicts every few days. I checked that tests were passing on both branches before the merge, and on master after the merge.
I see. I would have written a utility function just for that, or used `debug_assert_eq!`, so that in release builds we do not run those asserts.
`ArrayData` is public API, which means that users can now write `array_data1 == array_data2`. That now panics when the two data are not equal (under some definition of equal), which was my only concern.
The test that would have failed in this case would be:
- create `array_data1`
- create `array_data2` with different values from `array_data1`
- assert that equality between them is false
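A minimal sketch of the test described above, using a hypothetical stand-in type (`MiniArrayData` is not the real `ArrayData`): its `eq` returns `false` on a mismatch instead of asserting, so `assert_ne!` can pass. An `eq` built on `assert_eq!` (or `debug_assert_eq!` in a debug build) would panic at the `assert_ne!` line instead.

```rust
// Hypothetical stand-in for ArrayData with a non-panicking PartialEq.
#[derive(Debug)]
struct MiniArrayData {
    len: usize,
    values: Vec<i32>,
}

impl PartialEq for MiniArrayData {
    fn eq(&self, other: &Self) -> bool {
        // Short-circuit on mismatches rather than asserting/panicking.
        self.len == other.len && self.values == other.values
    }
}

fn main() {
    // create array_data1, then array_data2 with different values
    let array_data1 = MiniArrayData { len: 2, values: vec![1, 2] };
    let array_data2 = MiniArrayData { len: 2, values: vec![1, 3] };
    // assert that equality between them is false
    assert_ne!(array_data1, array_data2);
    // equality with itself still holds
    assert!(array_data1 == array_data1);
}
```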
```rust
let expected = Int32Array::from(vec![None; 6]);
```

```rust
// Cast to a dictionary (same value type, Utf8)
```
Suggested change:

```diff
- // Cast to a dictionary (same value type, Utf8)
+ // Cast to an int32
```
```diff
@@ -200,8 +200,7 @@ pub fn can_cast_types(from_type: &DataType, to_type: &DataType) -> bool {
     (Timestamp(_, _), Date32(_)) => true,
     (Timestamp(_, _), Date64(_)) => true,
     // date64 to timestamp might not make sense,
     // end temporal casts
     (Null, Int32) => true,
```
Shouldn't we support (literally) all types?
I only needed `Int32` for Parquet at the time.
rust/parquet/Cargo.toml
Outdated
```diff
@@ -40,6 +40,7 @@
 zstd = { version = "0.5", optional = true }
 chrono = "0.4"
 num-bigint = "0.3"
 arrow = { path = "../arrow", version = "3.0.0-SNAPSHOT", optional = true }
+base64 = { version = "*", optional = true }
```
Shouldn't we bracket this, to avoid backward incompatibilities if `base64` releases a major version?
I'll bracket it.
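For reference, a bracketed requirement might look like the sketch below. The version number `"0.12"` is illustrative (it is not taken from this PR); the point is that Cargo interprets a bare version as a caret requirement, which excludes the next breaking release, unlike `"*"`.

```toml
# Illustrative version number, not from the PR. Cargo reads "0.12" as ^0.12,
# i.e. >= 0.12.0 and < 0.13.0, so a breaking release of base64 will not be
# picked up silently the way "*" allows.
base64 = { version = "0.12", optional = true }
```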
rust/parquet/src/arrow/schema.rs
Outdated
```rust
DataType::LargeUtf8 | DataType::LargeBinary | DataType::LargeList(_) => {
    Err(ArrowError("Large arrays not supported".to_string()))
}
```
❤️
rust/parquet/src/arrow/schema.rs
Outdated
```rust
#[should_panic(expected = "Parquet does not support writing empty structs")]
fn test_empty_struct_field() {
    let arrow_fields = vec![Field::new("struct", DataType::Struct(vec![]), false)];
    let arrow_schema = Schema::new(arrow_fields);
    let converted_arrow_schema = arrow_to_parquet_schema(&arrow_schema);

    assert!(converted_arrow_schema.is_err());
    converted_arrow_schema.unwrap();
}
```
Instead of `should_panic`, could we test the error type and message? `should_panic` is generally reserved for catching panics, not for raising the panic in the test itself.
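A minimal sketch of that pattern, using hypothetical stand-ins (`SchemaError` and `convert_empty_struct` are not the parquet crate's real API): the test asserts on the `Err` value and its message rather than relying on `#[should_panic]`.

```rust
// Hypothetical error type standing in for the real schema-conversion error.
#[derive(Debug, PartialEq)]
enum SchemaError {
    Unsupported(String),
}

// Hypothetical stand-in for arrow_to_parquet_schema on an empty struct.
fn convert_empty_struct() -> Result<(), SchemaError> {
    Err(SchemaError::Unsupported(
        "Parquet does not support writing empty structs".to_string(),
    ))
}

fn main() {
    // Assert on the error type and message directly; no panic machinery needed.
    let err = convert_empty_struct().unwrap_err();
    assert_eq!(
        err,
        SchemaError::Unsupported(
            "Parquet does not support writing empty structs".to_string()
        )
    );
}
```

Compared with `#[should_panic]`, this keeps the test precise: it fails if the function stops returning an error, and also if the error type or message drifts.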
I'll fix what you've raised here in a follow-up PR.
Once rebased, this PR only removes 3 or 4.
Yeah, when I reviewed these, I thought that the writer branch was still not merged into master, and thus ended up reviewing everything. My bad there.
🎉 LGTM!
Force-pushed from daebdc8 to 60d8d51
Rebased! Sorry for any confusion; this PR should now just be removing some.
The tests added, unignored, and modified in this PR should illustrate the behavior I'm going for. This gets 2 ignored Parquet -> Arrow roundtrip tests to pass, because they were only failing due to differences in their `ArrayData`'s `BufferData` capacities.

This leaves `BufferData`'s `PartialEq` implementation as-is (that is, taking the equality of `capacity` into account) and does not modify any of the tests in rust/arrow/src/array/{array, builder, union}.rs or in rust/arrow/src/compute/{kernels/cast, kernels/filter, util}.rs that are testing the internal implementation of `Buffer`.

This *does* change `PartialEq` for `ArrayData` to not assert, to fix the problems noted by @jorgecarleitao on #8546. This has the downside of making test failures where `ArrayData`s should be equal but aren't harder to interpret, because the `assert_eq!`s aren't as granular. We can't use `debug_assert_eq!` either, because then we can't write tests that do `assert_ne!(array_data1, array_data2)` (I have included an example of such a test). I think to solve this downside, we should add a function only used in tests that does the more granular assertions; I am leaving that to a future PR.

Closes #8590 from carols10cents/bufferdata-eq

Authored-by: Carol (Nichols || Goulding) <carol.nichols@gmail.com>
Signed-off-by: alamb <andrew@nerdnetworks.org>
Closes #8546 from carols10cents/unignore-some-tests

Authored-by: Carol (Nichols || Goulding) <carol.nichols@gmail.com>
Signed-off-by: Jorge C. Leitao <jorgecarleitao@gmail.com>