-
Notifications
You must be signed in to change notification settings - Fork 748
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Support casting to and from unions #6218
base: master
Are you sure you want to change the base?
Conversation
child.slice(*offset as usize, 1) | ||
} else { | ||
new_null_array(data_type, 1) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wonder how slow this will be?
In theory we could do something cleverer where we build up chunks rather than having a separate array for each element, but I've no idea how much difference it will make.
More to the point I wonder if there's a completely different way to do this that's much faster?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This will definitely be slow.
I think the better API is this: https://docs.rs/arrow/latest/arrow/array/struct.MutableArrayData.html (which lets you copy chunks of data from different arrays)
Though if what you are trying to do is to get one child array and use the nulls of the parent, you could potentially just replace the null buffer directly
I can probably have some more specific suggestions once we sort out what the semantics of casting a union array are supposed to be (see my other comment)
I think this is ready to review. |
5209176
to
648897f
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you @samuelcolvin -- I am sorry for the delay in reviewing.
I am not sure about the desired semantics of a union cast. Thus I think we should at minimum document what casting Union means in https://docs.rs/arrow/latest/arrow/compute/fn.cast_with_options.html#behavior
It would probably help to research what other systems (e.g. pyarrow) do when casitng UnionArray to other types
Once we are clear / agree on the semantics we can then sort out the implementation
@@ -296,7 +296,7 @@ impl UnionArray { | |||
} | |||
|
|||
/// Returns whether the `UnionArray` is dense (or sparse if `false`). | |||
fn is_dense(&self) -> bool { | |||
pub fn is_dense(&self) -> bool { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍
|
||
(Union(_, _), Union(_, _)) => false, | ||
(Union(from_fields, _), _) => { | ||
from_fields.iter().any(|(_, f)| can_cast_types(f.data_type(), to_type)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Shouldn't this verify we can cast ALL the possible union field types to the target type , not just one of the fields?
)), | ||
|
||
// unions | ||
// we might be able to support this, but it's complex |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍 -- it is also not well specified if there are multiple field types that could be the target
For example, what would Union(Int32, Int64) --> Union(Float64, Decimal128)
do? Would it cast all the integers to Float64 Or to Decimal128? Or something else 🤔
/// | ||
/// # Panics | ||
/// Panics if `type_id` is not present in the union. | ||
fn union_field_array(union_array: &UnionArray, type_id: i8) -> Result<Cow<ArrayRef>, ArrowError> { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since an ArrayRef is already an Arc
I suspect the extra Cow
here doesn't matter much
let cast_array = cast(&union_array, &DataType::Int64).unwrap(); | ||
let as_int64 = as_int_vec::<Int64Type>(&cast_array); | ||
assert_eq!(as_int64, vec![Some(1), None, None, None]); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What are the expected semantics of casting a UnionArray to another type?
What this test seems to verify is some sort of TAKE semantics, where casting from UnionArray
to Int64
basically picks the Union variant that is the closest and takes
So it does
cast(union of [{A=1}, {A=}, {B=3.2}], Int64) --> [1, NULL, NULL]
I sort of expected that the result of casting a UnionArray would be to cast all the elements individually to the target type. So in this example
cast (union of [{A=1}, {A=}, {B=3.2}]) --> [1, NULL, 3]
Where the 3
results from casting 3.2
to 3
child.slice(*offset as usize, 1) | ||
} else { | ||
new_null_array(data_type, 1) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This will definitely be slow.
I think the better API is this: https://docs.rs/arrow/latest/arrow/array/struct.MutableArrayData.html (which lets you copy chunks of data from different arrays)
Though if what you are trying to do is to get one child array and use the nulls of the parent, you could potentially just replace the null buffer directly
I can probably have some more specific suggestions once we sort out what the semantics of casting a union array are supposed to be (see my other comment)
2e0b407
to
8b43aa3
Compare
@alamb ye, I think you make a very good point. I hadn't thought about the case where multiple children (perhaps all children) of the union are valid in the output type. I'll do some research. |
Let's discuss on #6247. |
Which issue does this PR close?
Closes apache/datafusion#10180.
Rationale for this change
I want to cast from unions to primitive types (I've also added support for casting to a union here).
What changes are included in this PR?
Support casting to and from unions (hopefully), tests.
Casting between unions is not yet supported.
Are there any user-facing changes?
no