Support Utf8View and BinaryView in substrait serialization. #12199

Merged 7 commits into apache:main from the 12118/substrait-serialize-views branch on Sep 6, 2024

Conversation

Contributor

@wiedld wiedld commented Aug 27, 2024

Which issue does this PR close?

Closes #12118

Rationale for this change

We have two new view data types, Utf8View and BinaryView. Support in datafusion is part of this epic, and this specific PR is about adding support for the (de-)serialization of logical and physical plans into the substrait format.

This PR adds new Substrait type variations on existing type classes. For example, the Substrait "string" class can carry different variations, each representing a different physical type (Utf8, LargeUtf8, or Utf8View). If we serialize a string with variation=2 (i.e. the view physical type), then deserializing variation=2 gives us back Utf8View. More background is given here.
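The variation round trip described above can be sketched in isolation. This is a minimal illustration, not the code in this PR: `StringKind`, `to_variation`, and `from_variation` are hypothetical names, and only the three constant names and values are taken from the diff further down.

```rust
/// Hypothetical stand-in for the three Arrow string types discussed above.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum StringKind {
    Utf8,
    LargeUtf8,
    Utf8View,
}

// Variation references; names and values mirror the constants in this PR's diff.
const DEFAULT_CONTAINER_TYPE_VARIATION_REF: u32 = 0;
const LARGE_CONTAINER_TYPE_VARIATION_REF: u32 = 1;
const VIEW_CONTAINER_TYPE_VARIATION_REF: u32 = 2;

/// Producer side: pick the variation that records the physical type.
fn to_variation(kind: StringKind) -> u32 {
    match kind {
        StringKind::Utf8 => DEFAULT_CONTAINER_TYPE_VARIATION_REF,
        StringKind::LargeUtf8 => LARGE_CONTAINER_TYPE_VARIATION_REF,
        StringKind::Utf8View => VIEW_CONTAINER_TYPE_VARIATION_REF,
    }
}

/// Consumer side: recover the physical type, rejecting unknown variations.
fn from_variation(variation: u32) -> Result<StringKind, String> {
    match variation {
        DEFAULT_CONTAINER_TYPE_VARIATION_REF => Ok(StringKind::Utf8),
        LARGE_CONTAINER_TYPE_VARIATION_REF => Ok(StringKind::LargeUtf8),
        VIEW_CONTAINER_TYPE_VARIATION_REF => Ok(StringKind::Utf8View),
        other => Err(format!("unsupported string type variation: {other}")),
    }
}
```

Calling `from_variation(to_variation(kind))` returns the original `kind` for each of the three variants, which is the property the logical plan roundtrip tests rely on.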

What changes are included in this PR?

  • feat(12118): logical plan support for Utf8View (d7be771)
  • feat(12118): physical plan support for Utf8View (b17ae25)
  • feat(12118): logical plan support for BinaryView (f38085d)
  • feat(12118): physical plan support for BinaryView (5c4ebec)

Are these changes tested?

Logical plan: The Utf8View and BinaryView are covered in the logical plan roundtrip serialization tests.

Physical plan: The physical plan roundtrip serialization tests are not yet implemented; there is an ongoing epic to finish the physical plan serialization. I therefore added code for the physical plan Substrait handling of Utf8View and BinaryView (to avoid incurring more tech debt), but this code is not tested.

Are there any user-facing changes?

No API contract change.
Removes the unimplemented errors previously raised when using these new data types in Substrait serialization.

@wiedld wiedld force-pushed the 12118/substrait-serialize-views branch from 700fe41 to 5c4ebec Compare August 27, 2024 18:41
Comment on lines 727 to 739
/// Arrow-cast does not currently handle direct casting from utf8 to binaryView.
#[tokio::test]
async fn binaryview_type_literal_needs_casting_fix() -> Result<()> {
    let err = roundtrip_all_types(
        "select * from data where
        view_binary_col = arrow_cast('binary_view', 'BinaryView');",
    )
    .await;

    assert!(
        matches!(err, Err(e) if e.to_string().contains("Unsupported CAST from Utf8 to BinaryView"))
    );
    Ok(())
Contributor Author

@wiedld wiedld Aug 27, 2024

Looks like we have a few missing arrow_cast implementations for BinaryView (explicit casting). Going to file a ticket in arrow and put up a PR; I'll be assessing possible changes in cast_with_options and can_cast_types.

Note that datafusion's type coercion has been previously updated to prefer coercion to the view types. It's the explicit casting that has coverage gaps.

Contributor Author

@wiedld wiedld Aug 27, 2024

I see sqllogictests which demonstrate what is supported by arrow_cast. My follow-ups will then be: (a) add sqllogictests showing what is, and is not, supported for the new view types, and then (b) make the upstream arrow-rs changes (with some correctness guidance during code review).

Contributor Author

@wiedld wiedld Aug 27, 2024

Sqllogictests added: #12200

Turns out the arrow-cast changes are already made, but not in the current release used in datafusion.

Contributor

We have now updated to the latest arrow-rs so we'll have the correct code #12032

Contributor Author

@wiedld wiedld Sep 6, 2024

Sweet. I've removed the work-arounds and deleted this (no longer applicable) test. Thank you.
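For readers following this thread, the gap and the double-cast workaround can be modelled with a toy example. This is only a sketch of the situation described above at the time of the review: `Ty`, `toy_cast`, and `toy_cast_via_binary` are hypothetical and do not reflect arrow-rs's actual `cast_with_options` or `can_cast_types` implementations.

```rust
/// Hypothetical, simplified type tags; not arrow-rs's `DataType`.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Ty {
    Utf8,
    Binary,
    BinaryView,
}

/// Toy cast check modelling the gap described in this thread:
/// the direct Utf8 -> BinaryView cast is rejected, every other hop succeeds.
fn toy_cast(from: Ty, to: Ty) -> Result<Ty, String> {
    match (from, to) {
        (Ty::Utf8, Ty::BinaryView) => {
            Err(format!("Unsupported CAST from {from:?} to {to:?}"))
        }
        _ => Ok(to),
    }
}

/// The double cast used as a workaround: go through Binary first.
fn toy_cast_via_binary(from: Ty) -> Result<Ty, String> {
    let intermediate = toy_cast(from, Ty::Binary)?;
    toy_cast(intermediate, Ty::BinaryView)
}
```

A test in the style of the one quoted above would assert `matches!(toy_cast(Ty::Utf8, Ty::BinaryView), Err(e) if e.contains("Unsupported CAST"))`, while `toy_cast_via_binary(Ty::Utf8)` succeeds.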

@@ -716,12 +716,29 @@ async fn all_type_literal() -> Result<()> {
date32_col = arrow_cast('2020-01-01', 'Date32') AND
binary_col = arrow_cast('binary', 'Binary') AND
large_binary_col = arrow_cast('large_binary', 'LargeBinary') AND
view_binary_col = arrow_cast(arrow_cast('binary_view', 'Binary'), 'BinaryView') AND
Contributor Author

See the test binaryview_type_literal_needs_casting_fix() below for the reason behind the double casting.

Contributor

I removed this workaround in a5bfedd

@wiedld wiedld marked this pull request as ready for review August 27, 2024 19:32
@@ -52,6 +52,7 @@ pub const DATE_32_TYPE_VARIATION_REF: u32 = 0;
pub const DATE_64_TYPE_VARIATION_REF: u32 = 1;
pub const DEFAULT_CONTAINER_TYPE_VARIATION_REF: u32 = 0;
pub const LARGE_CONTAINER_TYPE_VARIATION_REF: u32 = 1;
pub const VIEW_CONTAINER_TYPE_VARIATION_REF: u32 = 2;
Contributor

FWIW, hardcoding the numbers isn't really the proper way to do type variations. (Rather we should add the variation as an extension and refer to the extension's id.) However, given this is already used for default vs large, I guess adding view makes sense - and they can all be migrated at once to the proper way someday.

Contributor Author

@wiedld wiedld Aug 28, 2024

Thank you. I'll craft a follow up ticket later today, and link here (for future reference).

Contributor

I filed #12355 to track
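As a rough illustration of the alternative raised in this thread (declaring each variation as an extension and referring to it by id rather than by a hardcoded number), the idea could look something like the sketch below. Everything here is an assumption made for illustration: `VariationRegistry`, its methods, and the variation names are hypothetical and are not part of Substrait's or DataFusion's API.

```rust
use std::collections::HashMap;

/// Hypothetical registry that hands out anchors (ids) for named type variations,
/// so a plan refers to a declared variation instead of a hardcoded number.
#[derive(Default)]
struct VariationRegistry {
    anchors_by_name: HashMap<String, u32>,
    names_by_anchor: HashMap<u32, String>,
}

impl VariationRegistry {
    /// Returns the existing anchor for `name`, or assigns the next free one.
    fn register(&mut self, name: &str) -> u32 {
        if let Some(&anchor) = self.anchors_by_name.get(name) {
            return anchor;
        }
        let anchor = self.anchors_by_name.len() as u32;
        self.anchors_by_name.insert(name.to_string(), anchor);
        self.names_by_anchor.insert(anchor, name.to_string());
        anchor
    }

    /// Consumer side: resolve an anchor back to the declared variation name.
    fn name_of(&self, anchor: u32) -> Option<&str> {
        self.names_by_anchor.get(&anchor).map(String::as_str)
    }
}
```

With this shape, the consumer resolves the physical type from the declared name, so the numeric anchor carries no meaning of its own and can differ from plan to plan.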

Contributor

@alamb alamb left a comment

Thank you @wiedld and @Blizzara -- I think this PR is looking good.

@wiedld now that we have updated to the latest arrow-rs (#12032) I updated this PR to remove the workarounds. I figured I had it checked out anyways so I would just push the change

@@ -2351,8 +2371,10 @@ mod test {
round_trip_type(DataType::Binary)?;
round_trip_type(DataType::FixedSizeBinary(10))?;
round_trip_type(DataType::LargeBinary)?;
round_trip_type(DataType::BinaryView)?;
Contributor

👍

@alamb alamb merged commit aed84c2 into apache:main Sep 6, 2024
24 checks passed
Contributor

alamb commented Sep 6, 2024

Thanks again @wiedld and @Blizzara

@alamb alamb deleted the 12118/substrait-serialize-views branch September 6, 2024 11:10
Development

Successfully merging this pull request may close these issues.

Support substrait serialization for ScalarValue::Utf8View and ScalarValue::BinaryView
3 participants