Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ARROW-10413: [Rust] [Parquet] Unignore some tests that are passing now #8546

Closed

Conversation

carols10cents
Copy link
Contributor

No description provided.

@github-actions
Copy link

Copy link
Member

@andygrove andygrove left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@nevi-me nevi-me changed the base branch from rust-parquet-arrow-writer to master October 28, 2020 21:23
@nevi-me
Copy link
Contributor

nevi-me commented Oct 28, 2020

@carols10cents this needs a rebase :( I had to fix failing tests on the parquet branch before merging it, so there's now a conflict

Copy link
Member

@jorgecarleitao jorgecarleitao left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cool improvements here. Thanks a lot for taking them on! I left some comments on the code.

My main concern is the PartialEq for ArrayData, which IMO is not correct. All other comments are smaller improvements..

Comment on lines 212 to 246
impl PartialEq for ArrayData {
fn eq(&self, other: &Self) -> bool {
assert_eq!(
self.data_type(),
other.data_type(),
"Data types not the same"
);
assert_eq!(self.len(), other.len(), "Lengths not the same");
// TODO: when adding tests for this, test that we can compare with arrays that have offsets
assert_eq!(self.offset(), other.offset(), "Offsets not the same");
assert_eq!(self.null_count(), other.null_count());
// compare buffers excluding padding
let self_buffers = self.buffers();
let other_buffers = other.buffers();
assert_eq!(self_buffers.len(), other_buffers.len());
self_buffers.iter().zip(other_buffers).for_each(|(s, o)| {
compare_buffer_regions(
s,
self.offset(), // TODO mul by data length
o,
other.offset(), // TODO mul by data len
);
});
// assert_eq!(self.buffers(), other.buffers());

assert_eq!(self.child_data(), other.child_data());
// null arrays can skip the null bitmap, thus only compare if there are no nulls
if self.null_count() != 0 || other.null_count() != 0 {
compare_buffer_regions(
self.null_buffer().unwrap(),
self.offset(),
other.null_buffer().unwrap(),
other.offset(),
)
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can't understand this part of the code.

  1. why are these asserts instead of equalities? Doesn't this panic? on the comparison?
  2. why do we only compare buffers when there are no nulls? What about the other cases?

I am trying to understand the goal here: IMO there are two equalities that we can do to the ArrayData: a "physical equality" and a "logical equality".

The physical equality compares the data (in bytes) on the ArrayData, and two arrays are equal when that data is equal. For example, two ArrayData with 1 buffer each and a null bitmap of the form:

# (buffer, null_bit_as_bools)
a = ([1, 2], [true, false])
b = ([1, 0], [true, false])

are different under this equality because their data is different.

Logical equality interprets the data on the ArrayData according to the spec and performs a comparison accordingly. In the example above, the two arrays are equal because the 2 and 0 become undefined due to the null bit.

Is this code trying to achieve the former equality, or the latter? If it is the former, then:

  • it needs to take into account that buffer equality for the null bitmap and boolean arrays is different and
  • the condition if self.null_count() != 0 || other.null_count() != 0 should not be there as this is a physical comparison that ignores null counts.

If it is trying to achieve logical equality, then it has a lot of issues as buffer and child equality are not that simple.

Coincidentally, I have been looking at the logical comparison of ArrayData as part of #8541 .

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @jorgecarleitao I introduced these changes in #8200 because we were failing on integration tests where:

So without changing that to account for the above, both the Parquet roundtrips and integration tests were failing even though we were trying to compare logically correct data.

On panics, there's a slight leakage of intention, in the sense that much of the equality was introduced to aid in testing, but because we don't yet have a logging mechanism, something like the below:

let a: Int32Array = ...;
let b: Int32Array = ..;;
// try compare in tests
assert_eq!(&a, &b);

Would fail with a panic on the assert if the arrays are unequal, but because the comparison would return a bool, one wouldn't know where the failure occurred.
So this becomes tough to see why tests are failing, and hence the practise (I'm fine with calling it bad as I still do it) of adding asserts.
I'll work with you on #8541 to find a solution that can benefit both us when testing, and downstream users if they want a panic-free Array:Ref:equals(other: &ArrayRef) -> bool; kind of interface.

If it is trying to achieve logical equality, then it has a lot of issues as buffer and child equality are not that simple.

I appreciate this, and the intention is to achieve logical equality even with slices. I believe that the path that I took, while still incomplete, would help us to get there.

The lack of offset handling from that TODO could have been motivation not to merge the parquet branch in, but with all the refactoring work happening elsewhere, it's really become very inconvenient to keep rebasing and handling merge conflicts every few other days.

I checked that tests were passing on both branches before the merge, and in master after the merge.

Copy link
Member

@jorgecarleitao jorgecarleitao Oct 29, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see. I would have written an utility function just for that, or use debug_assert_eq, so that the in release we do not assert those.

ArrayData is public API, which means that users can now use array_data1 == array_data2. That now panics when the two data is not equal (under some definition of equal), which was my only concern.

The test that would have failed in this case would be:

  1. create array_data1
  2. create array_data2 with different values from array_data1
  3. assert that equality between them is false


let expected = Int32Array::from(vec![None; 6]);

// Cast to a dictionary (same value type, Utf8)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
// Cast to a dictionary (same value type, Utf8)
// Cast to a int32

@@ -200,8 +200,7 @@ pub fn can_cast_types(from_type: &DataType, to_type: &DataType) -> bool {
(Timestamp(_, _), Date32(_)) => true,
(Timestamp(_, _), Date64(_)) => true,
// date64 to timestamp might not make sense,

// end temporal casts
(Null, Int32) => true,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shouldn't we support (literally) all types?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Only needed Int32 for Parquet at the time

@@ -40,6 +40,7 @@ zstd = { version = "0.5", optional = true }
chrono = "0.4"
num-bigint = "0.3"
arrow = { path = "../arrow", version = "3.0.0-SNAPSHOT", optional = true }
base64 = { version = "*", optional = true }
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

shouldn't we bracket this, to avoid backward incompatibilities if base64 releases a major version?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll bracket it.

Comment on lines 270 to 272
DataType::LargeUtf8 | DataType::LargeBinary | DataType::LargeList(_) => {
Err(ArrowError("Large arrays not supported".to_string()))
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

❤️

Comment on lines 1376 to 1384
#[should_panic(expected = "Parquet does not support writing empty structs")]
fn test_empty_struct_field() {
let arrow_fields = vec![Field::new("struct", DataType::Struct(vec![]), false)];
let arrow_schema = Schema::new(arrow_fields);
let converted_arrow_schema = arrow_to_parquet_schema(&arrow_schema);

assert!(converted_arrow_schema.is_err());
converted_arrow_schema.unwrap();
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Instead of should_panic, could we test the error type and message? should_panic is generally reserved to catch panics, not to raise the panic in the test itself.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll fix what you've raised here in a follow up PR

@nevi-me
Copy link
Contributor

nevi-me commented Oct 29, 2020

Cool improvements here. Thanks a lot for taking them on! I left some comments on the code.

My main concern is the PartialEq for ArrayData, which IMO is not correct. All other comments are smaller improvements..

Once rebased, this PR only removes 3 or 4 #[ignore]s in the parquet tests

@jorgecarleitao
Copy link
Member

Cool improvements here. Thanks a lot for taking them on! I left some comments on the code.
My main concern is the PartialEq for ArrayData, which IMO is not correct. All other comments are smaller improvements..

Once rebased, this PR only removes 3 or 4 #[ignore]s in the parquet tests

Yeah, when I reviewed these, I though that the writer branch was still not merged in master, and thus ended up reviewing everything. My bad there.

Copy link
Member

@jorgecarleitao jorgecarleitao left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🎉 LGTM!

@carols10cents
Copy link
Contributor Author

Rebased! Sorry for any confusion-- this PR should now just be removing some #[ignore]s :)

@carols10cents carols10cents deleted the unignore-some-tests branch October 29, 2020 18:39
alamb pushed a commit that referenced this pull request Nov 10, 2020
The tests added, unignored, and modified in this PR should illustrate the behavior I'm going for.

This gets 2 ignored Parquet -> Arrow roundtrip tests to pass because they were only failing due to differences in their `ArrayData`'s `BufferData` capacities.

This leaves `BufferData`'s `PartialEq` implementation as-is (that is, taking the equality of `capacity` into account) and does not modify any of the tests in rust/arrow/src/array/{array, builder, union}.rs or in rust/arrow/src/compute/{kernels/cast, kernels/filter, util}.rs that are testing the internal implementation of `Buffer`.

This *does* change `PartialEq` for `ArrayData` to [not assert to fix the problems noted](#8546 (comment)) by @jorgecarleitao .

This has the downside of making test failures where `ArrayData`s should be equal but aren't harder to interpret because the `assert_eq!`s aren't as granular. We can't use `debug_assert_eq!` either, because then we can't write tests that do `assert_ne!(array_data1, array_data2)` (I have included an example of such a test). I think to solve this downside, we should add a function only used in tests that does the more granular assertions. I am leaving that to a future PR.

Closes #8590 from carols10cents/bufferdata-eq

Authored-by: Carol (Nichols || Goulding) <carol.nichols@gmail.com>
Signed-off-by: alamb <andrew@nerdnetworks.org>
yordan-pavlov pushed a commit to yordan-pavlov/arrow that referenced this pull request Nov 14, 2020
The tests added, unignored, and modified in this PR should illustrate the behavior I'm going for.

This gets 2 ignored Parquet -> Arrow roundtrip tests to pass because they were only failing due to differences in their `ArrayData`'s `BufferData` capacities.

This leaves `BufferData`'s `PartialEq` implementation as-is (that is, taking the equality of `capacity` into account) and does not modify any of the tests in rust/arrow/src/array/{array, builder, union}.rs or in rust/arrow/src/compute/{kernels/cast, kernels/filter, util}.rs that are testing the internal implementation of `Buffer`.

This *does* change `PartialEq` for `ArrayData` to [not assert to fix the problems noted](apache#8546 (comment)) by @jorgecarleitao .

This has the downside of making test failures where `ArrayData`s should be equal but aren't harder to interpret because the `assert_eq!`s aren't as granular. We can't use `debug_assert_eq!` either, because then we can't write tests that do `assert_ne!(array_data1, array_data2)` (I have included an example of such a test). I think to solve this downside, we should add a function only used in tests that does the more granular assertions. I am leaving that to a future PR.

Closes apache#8590 from carols10cents/bufferdata-eq

Authored-by: Carol (Nichols || Goulding) <carol.nichols@gmail.com>
Signed-off-by: alamb <andrew@nerdnetworks.org>
alamb pushed a commit to apache/arrow-rs that referenced this pull request Apr 20, 2021
The tests added, unignored, and modified in this PR should illustrate the behavior I'm going for.

This gets 2 ignored Parquet -> Arrow roundtrip tests to pass because they were only failing due to differences in their `ArrayData`'s `BufferData` capacities.

This leaves `BufferData`'s `PartialEq` implementation as-is (that is, taking the equality of `capacity` into account) and does not modify any of the tests in rust/arrow/src/array/{array, builder, union}.rs or in rust/arrow/src/compute/{kernels/cast, kernels/filter, util}.rs that are testing the internal implementation of `Buffer`.

This *does* change `PartialEq` for `ArrayData` to [not assert to fix the problems noted](apache/arrow#8546 (comment)) by @jorgecarleitao .

This has the downside of making test failures where `ArrayData`s should be equal but aren't harder to interpret because the `assert_eq!`s aren't as granular. We can't use `debug_assert_eq!` either, because then we can't write tests that do `assert_ne!(array_data1, array_data2)` (I have included an example of such a test). I think to solve this downside, we should add a function only used in tests that does the more granular assertions. I am leaving that to a future PR.

Closes #8590 from carols10cents/bufferdata-eq

Authored-by: Carol (Nichols || Goulding) <carol.nichols@gmail.com>
Signed-off-by: alamb <andrew@nerdnetworks.org>
GeorgeAp pushed a commit to sirensolutions/arrow that referenced this pull request Jun 7, 2021
Closes apache#8546 from carols10cents/unignore-some-tests

Authored-by: Carol (Nichols || Goulding) <carol.nichols@gmail.com>
Signed-off-by: Jorge C. Leitao <jorgecarleitao@gmail.com>
GeorgeAp pushed a commit to sirensolutions/arrow that referenced this pull request Jun 7, 2021
The tests added, unignored, and modified in this PR should illustrate the behavior I'm going for.

This gets 2 ignored Parquet -> Arrow roundtrip tests to pass because they were only failing due to differences in their `ArrayData`'s `BufferData` capacities.

This leaves `BufferData`'s `PartialEq` implementation as-is (that is, taking the equality of `capacity` into account) and does not modify any of the tests in rust/arrow/src/array/{array, builder, union}.rs or in rust/arrow/src/compute/{kernels/cast, kernels/filter, util}.rs that are testing the internal implementation of `Buffer`.

This *does* change `PartialEq` for `ArrayData` to [not assert to fix the problems noted](apache#8546 (comment)) by @jorgecarleitao .

This has the downside of making test failures where `ArrayData`s should be equal but aren't harder to interpret because the `assert_eq!`s aren't as granular. We can't use `debug_assert_eq!` either, because then we can't write tests that do `assert_ne!(array_data1, array_data2)` (I have included an example of such a test). I think to solve this downside, we should add a function only used in tests that does the more granular assertions. I am leaving that to a future PR.

Closes apache#8590 from carols10cents/bufferdata-eq

Authored-by: Carol (Nichols || Goulding) <carol.nichols@gmail.com>
Signed-off-by: alamb <andrew@nerdnetworks.org>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants