Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

UnionBuilder produces incorrect Union DataType #1637

Open
tfeda opened this issue May 1, 2022 · 3 comments
Open

UnionBuilder produces incorrect Union DataType #1637

tfeda opened this issue May 1, 2022 · 3 comments
Labels

Comments

@tfeda
Copy link
Contributor

tfeda commented May 1, 2022

Describe the bug
The Union DataType produced by UnionBuilder has non-nullable children Fields after appending nulls in the builder.

To Reproduce
Steps to reproduce the behavior: Try the following code

let mut builder = UnionBuilder::new_dense(4);
builder.append::<Int32Type>("a", 1).unwrap();
builder.append::<Float64Type>("b", 3.0).unwrap();
builder.append_null::<Float64Type>("b").unwrap();
builder.append_null::<Int32Type>("a").unwrap();
let union = builder.build().unwrap();

let schema = Schema::new(vec![
    Field::new(
        "Teamsters",
        DataType::Union(
            vec![
                Field::new("a", DataType::Int32, true),
                Field::new("b", DataType::Float64, true),
            ],
            UnionMode::Dense,
        ),
         false,
    ),
]); 

let batch = RecordBatch::try_new(
    Arc::new(schema),
    vec![Arc::new(union)]
).unwrap();

This code panics:

InvalidArgumentError("column types must match schema types, expected
Union([
Field { name: "a", data_type: Int32, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: None },
Field { name: "b", data_type: Float64, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: None }
], Dense
) but found Union([
Field { name: "a", data_type: Int32, nullable: false, dict_id: 0, dict_is_ordered: false, metadata: None },
Field { name: "b", data_type: Float64, nullable: false, dict_id: 0, dict_is_ordered: false, metadata: None }
], Dense)
at column index 0")

Expected behavior

Depending on the interpretation of the specification, one of 2 things should happen:
A Union's children Fields should inherit its nullabillity (i.e. always be false): Then I think this should error when executing Field::new() with a bad DataType.

A child should be nullable if it is capable of returning None to the parent when unionArray.value(index) is called: This code should run just fine then.

Additional context
I ran into this when working on #1594. I think it's a simple fix: track the nullablility of the UnionBuilder fields rather than always hardcode the child Fields nullability to be false. That being said, I'm not sure if that's the correct understanding of the specification.

@tfeda tfeda added the bug label May 1, 2022
@tustvold
Copy link
Contributor

tustvold commented May 3, 2022

I think automatically determining the nullability of the field based on if that child contains nulls makes a lot of sense, however, it could cause schema-volatility depending on if the written data happens to contain nulls which seems sub-optimal.

I think the safest thing is probably to do what GenericListBuilder does and always set nullable to true, but it does feel a bit off...

Edit: FWIW @jhorstmann reported a similar issue recently with StructArrays - #1611

@tustvold
Copy link
Contributor

tustvold commented May 5, 2022

Further #1649

@gstvg
Copy link
Contributor

gstvg commented Aug 29, 2024

I also ran into this on #6303
We probably should care about the generated type_id of each field too.

Currently, the returned union only contains fields appended at least once. So every build_union calls below returns a different set of children with different type_id each.

        enum Value {
            U64(u64),
            I32(i32),
        }

        fn build_union(values: &[Value]) -> UnionArray {
            let mut builder = UnionBuilder::new_sparse();

            for value in values {
                match value {
                    Value::U64(u) => builder.append::<UInt64Type>("u64", *u).unwrap(),
                    Value::I32(i) => builder.append::<Int32Type>("i32", *i).unwrap(),
                };
            }

            builder.build().unwrap()
        }

        build_union(&[Value::U64(10), Value::I32(-4)]); // u64=0 i32=1
        build_union(&[Value::I32(-4), Value::U64(10)]); // i32=0 u64=1
        build_union(&[Value::U64(10), Value::U64(40)]); // u64=0
        build_union(&[Value::I32(-9), Value::I32(-4)]); // i32=0

Do you think we should follow GenericListBuilder, and add a method to specify UnionFields, or require it as an argument for UnionBuilder constructors?

If the fields are somehow specified, should the returned array contains every child specified, or only those appended at least one time?

I thought about another option, and would like to know what you think

First, we need to add an id argument to the current append* functions:

fn append(&mut self, type_name: &str, id: i8, value: T::Native)
fn append_null(&mut self, type_name: &str, id: i8) 

And then add

fn append_nullable(&mut self, type_name: &str, id:i8, value: Option<T::Native>)

And finally, document that nullable fields should only use append_null and append_nullable, while non nullable ones should only use append, and return an error if the same field is appended with mixed calls

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

3 participants