ARROW-10413: [Rust] [Parquet] Unignore some tests that are passing now #8546
Conversation
LGTM
Force-pushed from ce06e17 to 3cc3a4b
@carols10cents this needs a rebase :( I had to fix failing tests on the parquet branch before merging it, so there's now a conflict
Cool improvements here. Thanks a lot for taking them on! I left some comments on the code.
My main concern is the `PartialEq` for `ArrayData`, which IMO is not correct. All other comments are smaller improvements.
rust/arrow/src/array/data.rs
Outdated
```rust
impl PartialEq for ArrayData {
    fn eq(&self, other: &Self) -> bool {
        assert_eq!(
            self.data_type(),
            other.data_type(),
            "Data types not the same"
        );
        assert_eq!(self.len(), other.len(), "Lengths not the same");
        // TODO: when adding tests for this, test that we can compare with arrays that have offsets
        assert_eq!(self.offset(), other.offset(), "Offsets not the same");
        assert_eq!(self.null_count(), other.null_count());
        // compare buffers excluding padding
        let self_buffers = self.buffers();
        let other_buffers = other.buffers();
        assert_eq!(self_buffers.len(), other_buffers.len());
        self_buffers.iter().zip(other_buffers).for_each(|(s, o)| {
            compare_buffer_regions(
                s,
                self.offset(), // TODO mul by data length
                o,
                other.offset(), // TODO mul by data len
            );
        });
        // assert_eq!(self.buffers(), other.buffers());
        assert_eq!(self.child_data(), other.child_data());
        // null arrays can skip the null bitmap, thus only compare if there are no nulls
        if self.null_count() != 0 || other.null_count() != 0 {
            compare_buffer_regions(
                self.null_buffer().unwrap(),
                self.offset(),
                other.null_buffer().unwrap(),
                other.offset(),
            )
        }
```
I can't understand this part of the code.
- Why are these asserts instead of equality checks? Doesn't this panic on the comparison?
- Why do we only compare buffers when there are no nulls? What about the other cases?
I am trying to understand the goal here: IMO there are two equalities that we can define for `ArrayData`: a "physical equality" and a "logical equality".

Physical equality compares the data (in bytes) on the `ArrayData`, and two arrays are equal when that data is equal. For example, two `ArrayData` with 1 buffer each and a null bitmap of the form

```
# (buffer, null_bit_as_bools)
a = ([1, 2], [true, false])
b = ([1, 0], [true, false])
```

are different under this equality because their data is different.

Logical equality interprets the data on the `ArrayData` according to the spec and performs the comparison accordingly. In the example above, the two arrays are equal because the `2` and `0` become undefined due to the null bit.
Is this code trying to achieve the former equality, or the latter? If it is the former, then:
- it needs to take into account that buffer equality for the null bitmap and for boolean arrays is different, and
- the condition `if self.null_count() != 0 || other.null_count() != 0` should not be there, as a physical comparison ignores null counts.

If it is trying to achieve logical equality, then it has a lot of issues, as buffer and child equality are not that simple.
Coincidentally, I have been looking at the logical comparison of `ArrayData` as part of #8541.
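To make the distinction concrete, here is a minimal, self-contained Rust sketch. It is not the arrow crate's API: `MiniArray`, `physical_eq`, and `logical_eq` are hypothetical names, and the data layout (a value vector plus a validity vector) only mirrors the example above.

```rust
// Hypothetical model of an array: values plus a validity bitmap.
#[derive(Debug)]
struct MiniArray {
    values: Vec<i32>,
    validity: Vec<bool>, // true = valid, false = null
}

// Physical equality: byte-for-byte, null slots included.
fn physical_eq(a: &MiniArray, b: &MiniArray) -> bool {
    a.values == b.values && a.validity == b.validity
}

// Logical equality: values behind a null bit are undefined,
// so compare the validity first and then only the valid slots.
fn logical_eq(a: &MiniArray, b: &MiniArray) -> bool {
    a.validity == b.validity
        && a.values
            .iter()
            .zip(&b.values)
            .zip(&a.validity)
            .all(|((x, y), valid)| !*valid || x == y)
}

fn main() {
    // The example from the comment above: a = ([1, 2], [T, F]), b = ([1, 0], [T, F]).
    let a = MiniArray { values: vec![1, 2], validity: vec![true, false] };
    let b = MiniArray { values: vec![1, 0], validity: vec![true, false] };
    assert!(!physical_eq(&a, &b)); // bytes differ in the second slot
    assert!(logical_eq(&a, &b)); // the 2 and 0 are masked by the null bit
}
```

Under this sketch the two arrays are physically different but logically equal, which is exactly the ambiguity the `PartialEq` implementation needs to resolve.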
Hi @jorgecarleitao, I introduced these changes in #8200 because we were failing integration tests where:
- We create an Arrow array that has no nulls from JSON, and don't allocate a null buffer (as it's optional: http://arrow.apache.org/docs/format/Columnar.html#validity-bitmaps)
- When reading IPC files & streams, the C++ implementation always has the null buffer allocated

So without changing the comparison to account for the above, both the Parquet roundtrips and the integration tests were failing even though we were comparing logically correct data.
On panics, there's a slight leakage of intention: much of the equality was introduced to aid testing, but because we don't yet have a logging mechanism, something like the below:

```rust
let a: Int32Array = ...;
let b: Int32Array = ...;
// try compare in tests
assert_eq!(&a, &b);
```

would fail with a panic on the assert if the arrays are unequal, but because the comparison itself only returns a `bool`, one wouldn't know where inside the arrays the failure occurred.
That makes it tough to see why tests are failing, hence the practice (I'm fine with calling it bad, as I still do it) of adding asserts.
I'll work with you on #8541 to find a solution that can benefit both us when testing and downstream users if they want a panic-free `ArrayRef::equals(other: &ArrayRef) -> bool` kind of interface.
> If it is trying to achieve logical equality, then it has a lot of issues as buffer and child equality are not that simple.

I appreciate this, and the intention is to achieve logical equality, even with slices. I believe that the path I took, while still incomplete, would help us get there.
The lack of offset handling from that `TODO` could have been motivation not to merge the parquet branch in, but with all the refactoring work happening elsewhere, it's become very inconvenient to keep rebasing and resolving merge conflicts every few days. I checked that tests were passing on both branches before the merge, and on master after the merge.
I see. I would have written a utility function just for that, or used `debug_assert_eq!`, so that in release builds we do not run those asserts.
`ArrayData` is public API, which means that users can now write `array_data1 == array_data2`. That now panics when the two data are not equal (under some definition of equal), which was my only concern.
The test that would have failed in this case would be:
- create `array_data1`
- create `array_data2` with different values from `array_data1`
- assert that equality between them is false
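A minimal sketch of the test described above, using a hypothetical stand-in type (`MiniArrayData` is not the real `ArrayData`): its `eq` returns `false` on a mismatch instead of asserting, so `assert_ne!` can pass. An `eq` built on `assert_eq!` (or `debug_assert_eq!` in a debug build) would panic at the `assert_ne!` line instead.

```rust
// Hypothetical stand-in for ArrayData with a non-panicking PartialEq.
#[derive(Debug)]
struct MiniArrayData {
    len: usize,
    values: Vec<i32>,
}

impl PartialEq for MiniArrayData {
    fn eq(&self, other: &Self) -> bool {
        // Short-circuit on mismatches rather than asserting/panicking.
        self.len == other.len && self.values == other.values
    }
}

fn main() {
    // create array_data1, then array_data2 with different values
    let array_data1 = MiniArrayData { len: 2, values: vec![1, 2] };
    let array_data2 = MiniArrayData { len: 2, values: vec![1, 3] };
    // assert that equality between them is false
    assert_ne!(array_data1, array_data2);
    // equality with itself still holds
    assert!(array_data1 == array_data1);
}
```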
```rust
let expected = Int32Array::from(vec![None; 6]);
```

```rust
// Cast to a dictionary (same value type, Utf8)
```
Suggested change:

```diff
- // Cast to a dictionary (same value type, Utf8)
+ // Cast to an int32
```
```diff
@@ -200,8 +200,7 @@ pub fn can_cast_types(from_type: &DataType, to_type: &DataType) -> bool {
     (Timestamp(_, _), Date32(_)) => true,
     (Timestamp(_, _), Date64(_)) => true,
     // date64 to timestamp might not make sense,
     // end temporal casts
     (Null, Int32) => true,
```
Shouldn't we support (literally) all types?
I only needed `Int32` for Parquet at the time.
rust/parquet/Cargo.toml
Outdated
```diff
@@ -40,6 +40,7 @@
 zstd = { version = "0.5", optional = true }
 chrono = "0.4"
 num-bigint = "0.3"
 arrow = { path = "../arrow", version = "3.0.0-SNAPSHOT", optional = true }
+base64 = { version = "*", optional = true }
```
Shouldn't we bracket this, to avoid backward incompatibilities if `base64` releases a major version?
I'll bracket it.
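For reference, a bracketed requirement might look like the sketch below. The version number `"0.12"` is illustrative (it is not taken from this PR); the point is that Cargo interprets a bare version as a caret requirement, which excludes the next breaking release, unlike `"*"`.

```toml
# Illustrative version number, not from the PR. Cargo reads "0.12" as ^0.12,
# i.e. >= 0.12.0 and < 0.13.0, so a breaking release of base64 will not be
# picked up silently the way "*" allows.
base64 = { version = "0.12", optional = true }
```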
rust/parquet/src/arrow/schema.rs
Outdated
```rust
DataType::LargeUtf8 | DataType::LargeBinary | DataType::LargeList(_) => {
    Err(ArrowError("Large arrays not supported".to_string()))
}
```
❤️
rust/parquet/src/arrow/schema.rs
Outdated
```rust
#[should_panic(expected = "Parquet does not support writing empty structs")]
fn test_empty_struct_field() {
    let arrow_fields = vec![Field::new("struct", DataType::Struct(vec![]), false)];
    let arrow_schema = Schema::new(arrow_fields);
    let converted_arrow_schema = arrow_to_parquet_schema(&arrow_schema);

    assert!(converted_arrow_schema.is_err());
    converted_arrow_schema.unwrap();
}
```
Instead of `should_panic`, could we test the error type and message? `should_panic` is generally reserved for catching panics, not for raising the panic in the test itself.
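A minimal sketch of that pattern, using hypothetical stand-ins (`SchemaError` and `convert_empty_struct` are not the parquet crate's real API): the test asserts on the `Err` value and its message rather than relying on `#[should_panic]`.

```rust
// Hypothetical error type standing in for the real schema-conversion error.
#[derive(Debug, PartialEq)]
enum SchemaError {
    Unsupported(String),
}

// Hypothetical stand-in for arrow_to_parquet_schema on an empty struct.
fn convert_empty_struct() -> Result<(), SchemaError> {
    Err(SchemaError::Unsupported(
        "Parquet does not support writing empty structs".to_string(),
    ))
}

fn main() {
    // Assert on the error type and message directly; no panic machinery needed.
    let err = convert_empty_struct().unwrap_err();
    assert_eq!(
        err,
        SchemaError::Unsupported(
            "Parquet does not support writing empty structs".to_string()
        )
    );
}
```

Compared with `#[should_panic]`, this keeps the test precise: it fails if the function stops returning an error, and also if the error type or message drifts.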
I'll fix what you've raised here in a follow-up PR.
Once rebased, this PR only removes 3 or 4.
Yeah, when I reviewed these, I thought that the writer branch was still not merged into master, and thus ended up reviewing everything. My bad there.
🎉 LGTM!
Force-pushed from daebdc8 to 60d8d51
Rebased! Sorry for any confusion; this PR should now just be removing some.
The tests added, unignored, and modified in this PR should illustrate the behavior I'm going for. This gets 2 ignored Parquet -> Arrow roundtrip tests to pass, because they were only failing due to differences in their `ArrayData`'s `BufferData` capacities.

This leaves `BufferData`'s `PartialEq` implementation as-is (that is, taking the equality of `capacity` into account) and does not modify any of the tests in rust/arrow/src/array/{array, builder, union}.rs or in rust/arrow/src/compute/{kernels/cast, kernels/filter, util}.rs that are testing the internal implementation of `Buffer`.

This *does* change `PartialEq` for `ArrayData` to not assert, to fix the problems noted by @jorgecarleitao on #8546. This has the downside of making test failures where `ArrayData`s should be equal but aren't harder to interpret, because the `assert_eq!`s aren't as granular. We can't use `debug_assert_eq!` either, because then we can't write tests that do `assert_ne!(array_data1, array_data2)` (I have included an example of such a test). I think to solve this downside, we should add a function only used in tests that does the more granular assertions; I am leaving that to a future PR.

Closes #8590 from carols10cents/bufferdata-eq

Authored-by: Carol (Nichols || Goulding) <carol.nichols@gmail.com>
Signed-off-by: alamb <andrew@nerdnetworks.org>
Closes #8546 from carols10cents/unignore-some-tests

Authored-by: Carol (Nichols || Goulding) <carol.nichols@gmail.com>
Signed-off-by: Jorge C. Leitao <jorgecarleitao@gmail.com>