-
Notifications
You must be signed in to change notification settings - Fork 749
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Implement UnionArray logical_nulls #6303
base: master
Are you sure you want to change the base?
Conversation
Looks good from a brief check, what's the result of running the benchmarks? |
|
Edit: Now it runs faster when there are up to ~10 fields with nulls, with timings slowly increasing for every field with nulls, and from there timings stabilize and depends only on the length of the union New results
|
arrow-array/src/array/union_array.rs
Outdated
None | ||
} | ||
} | ||
|
||
/// Union types always return non null as there is no validity buffer. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I updated docs below, but the others arrays which also should and correctly implements logical_nulls
doesn't include those default methods. Should we delete them too?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we remove these specific implementations, the behavior will not change but the documented reason (because UnionArray has no validity buffer) will be lost.
If we remove these -- can we update the documentation for the default implementations to mention the UnionArray special case?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The comment points to old code, my bad.
I think that the most recent changes already update the documentation, so I removed the methods
Let me know if you agree
I am depressed about the large review backlog in this crate. We are looking for more help from the community reviewing PRs -- see #6418 for more |
I'm starting to review this one. 👀 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you again for the code documentation! This was a fun PR to review.
Most of the suggestions are intended to make it easier for the next reader. I'm not sure about our code coverage policy, and I also think we may need to implement UnionArray::is_nullable (maybe in a followup PR?).
arrow-array/src/array/mod.rs
Outdated
/// Similary, a [`UnionArray`] with any nullable child will always return true, even if all | ||
/// selected values are valid, and therefore would not appear in [`Array::logical_nulls`]. | ||
fn is_nullable(&self) -> bool { | ||
self.null_count() != 0 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does the added code comment match the implementation?
UnionArrays always returns a zero null_count. So this conditional self.null_count() != 0
always evaluates to false, which means is never nullable.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the proposed/added doc comments is what should happen -- but it's not happening now.
Union types have no null validity bitmap (per spec). I believe this is why the null_count is always zero and the UnionArray::nulls is None.
In contrast, I believe the is_nullable is based on the child types (as the docs added here suggest). But what's missing is to define UnionArray::is_nullable
on it's implementation, and in a way that examines the UnionArray's children. (Dictionary arrays already does so.) Do you agree?
If you agree, then maybe add this doc comment & the code change in a follow up PR?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You are absolute right, thank you!
I added an implementation always returning true
just to not be buggy, and will improve in a follow up PR
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you!
arrow-array/src/array/union_array.rs
Outdated
None | ||
} | ||
} | ||
|
||
/// Union types always return non null as there is no validity buffer. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we remove these specific implementations, the behavior will not change but the documented reason (because UnionArray has no validity buffer) will be lost.
If we remove these -- can we update the documentation for the default implementations to mention the UnionArray special case?
arrow-array/src/array/union_array.rs
Outdated
enum SparseStrategy { | ||
Gather, | ||
AllNullsSkipOne, | ||
MixedSkipWithoutNulls, | ||
MixedSkipFullyNull, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you move this enum elsewhere and doc comment what the options mean? Since the names are not self explanatory.
For example, the AllNullsSkipOne
is actually "all fields have some (not all) nulls" and the SkipOne
part refers to how the child arrays are iterated. Having that enum explained up front helps speedup the comprehension. 🙏🏼
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fair enough. I added comments, but please let me know if you think it can be further improved
arrow-array/src/array/union_array.rs
Outdated
// This is the threshold where masking becomes slower than gather | ||
// TODO: test on avx512f(feature is still unstable) | ||
let gather_relative_cost = if cfg!(target_feature = "avx2") { | ||
10 | ||
} else if cfg!(target_feature = "sse4.1") { | ||
3 | ||
} else { | ||
2 | ||
}; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do you have any sources (or data) on specifically why these numbers?
There is a clear preference to using the gather approach when we have many fields (partial null) to iterate thru, but I didn't follow how these exact numbers were chosen.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Indeed, those numbers looks magic. I choose them based on benchmarks, numbers below.
I added comments to be more clear, and modified the else branch, that was choose based on x86 baseline/SSE2 benchmarks, but was also being selected for non x86 archs. I restricted it to x86, and set the gather cost to 0 (always used) on non x86, because I only have x86 to bench.
In resume, for a union with len of 4096:
gather = 6.4 µs (regardless of target feature)
AVX2:
1 selection mask per chunk = 0.8 µs
10 masks = 6.3 µs
11 maks = ~7 µs (slower than gather)
SSE4.1:
1 mask = 2.5µs
2 masks = 5 µs
3 masks = 7.5 µs (slower than gather)
x86 baseline/SSE2:
1 mask = 4 µs
2 masks = 8 µs (slower than gather)
The data for AVX2 is at #6303 (comment)
Data for SSE4.1 and SSE2 is from memory, I will bench it again tomorrow
// Unsafe code below depend on it: | ||
// To remove one branch from the loop, if the a type_id is not utilized, or it's logical_nulls is None/all set, | ||
// we use a null buffer of len 1 and a index_mask of 0, or the true null buffer and usize::MAX otherwise. | ||
// We then unconditionally access the null buffer with index & index_mask, | ||
// which always return 0 for the 1-len buffer, or the true index unchanged otherwise | ||
// We also use a 256 array, so llvm knows that `type_id as u8 as usize` is always in bounds | ||
let mut logical_nulls_array = [(&one_valid, Mask::Zero); 256]; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
❤️
arrow-array/src/array/union_array.rs
Outdated
let chunks = chunks_exact.map(|chunk| { | ||
let chunk_array = <&[i8; 64]>::try_from(chunk).unwrap(); | ||
|
||
mask_chunk(chunk_array, &mut nulls_masks_iter) | ||
}); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit: would you mind renaming the variable chunk_array
(and where the mask_chunk closures are defined) as type_ids_chunk_array
or something similar?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks better, I also renamed chunk
to type_ids_chunk
and remainder
to type_ids_remainder
Feel free to ping any other place I may have missed
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you!
arrow-array/src/array/union_array.rs
Outdated
}, | ||
); | ||
|
||
union_nulls | without_nulls_selected(chunk, &without_nulls_ids) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could we add a comment on this line?
I believe the without_nulls_selected returns the bitmask for the without nulls (the is_d | is_e
), which provides the proper true NullBuffer slot (which translates to not-the-f-array-null, a.k.a. not one of the fully nulls).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure, please let me know if it can be further clarified
arrow-array/src/array/union_array.rs
Outdated
}) | ||
.fold((0, 0), fold); | ||
|
||
!any_nullable_selected | union_nulls |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could we add a comment on this line?
It's basically the inverse of line 480 (see requested comment there). It's clear once reading the NullBuffer docs, but it's an extra hop for anyone newer to the codebase. Thank you!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I modified it to be the same as line 480, and renamed nullable
to with_nulls
because it doesn't applies to nullable fields that happens to have 0 nulls
@@ -380,6 +392,254 @@ impl UnionArray { | |||
_ => unreachable!(), | |||
} | |||
} | |||
|
|||
/// Computes the logical nulls for a sparse union, optimized for when there's a lot of fields without nulls | |||
fn mask_sparse_skip_without_nulls(&self, nulls: Vec<(i8, NullBuffer)>) -> BooleanBuffer { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This function is not hit by any of the test cases, but does get used by the benchmark.
There are also several other branch points in these mask methods, also only used in the benchmark. I don't believe the benchmark tests for correctness (uses black_box(array.logical_nulls()
). What is the policy here @alamb ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think we have any strict policy -- rather the guidelines are to cover the code such that if someone broke it accidentally in a future refactoring, the tests would break.
Exactly how much coverage is enough to meet that bar I think is somewhat of a judgement call and is based on the functionality.
Perhaps you can help @gstvg figure out any missing cases they could add based on your coverage analysis?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure thing. Here is the cov report.
I added in assert!(false)
lines for verification of coverage gaps; also makes skimming the report easier.
# to open
> tar -xvf cov_union_arrays.tar.gz
> open cov_union_arrays.html
# how generated
> RUSTFLAGS='-Z profile -C codegen-units=1' CARGO_CFG_REGEX_DISABLE_AUTO_OPTIMIZATIONS=1 cargo +nightly cov test -p arrow-array --lib
> cargo +nightly cov report --open
# save as complete webpage (give you interactive bits), and tarball
# Also confirmed the arrow-arith tests did not increase coverage.
# Only the benchmarks did -- and those don't check correctness.
Your call on wherever you want correctness coverage @gstvg . Hope this helps!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was thinking it would help to translate the codecov report into a description of what UnionArray should be constructed / have logical_nulls
called on it to improve the coverage
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the cov report.
That's my fault, the test test_sparse_union_logical_mask_mixed_nulls_skip_fully_valid
should have hit this. It's fixed
I also discovered that test_sparse_union_logical_nulls_mask_all_nulls_skip_one
was using the gather strategy, and the SkipOne strategy was only called on a fast_paths
test that doesn't make sense, so I removed it and fixed the skip_one
test.
… into union_logical_nulls
Co-authored-by: wiedld <wiedld@users.noreply.github.com>
@wiedld -- please ping me when you think this PS is ready |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you @gstvg for the clarifications & documentation. ❤️
I can approve after @alamb let's me know about the CI-testing targets, and what is the standard we use. 🙏🏼
Alternatively, there is another suggested approach to make sure we are testing the different sparse strategies.
// Choose the fastest way to compute the logical nulls | ||
// Gather computes one null per iteration, while the others work on 64 nulls chunks, | ||
// but must also compute selection masks, which is expensive, | ||
// so it's cost is the number of selection masks computed per chunk | ||
// Since computing the selection mask gets auto-vectorized, it's performance depends on which simd feature is enabled | ||
// For gather, the cost is the threshold where masking becomes slower than gather, which is determined with benchmarks |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
❤️
// Always use gather on non benchmarked archs because even though it may slower on some cases, | ||
// it's performance depends only on the union length, without being affected by the number of fields | ||
0 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you. These comments are great.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@alamb -- I have a question regarding how CI will be testing these branches.
The default gather_relative_cost=0 means that the default scenario is to always use the SparseStrategy::Gather
except when other architectures AND features are enabled during testing. When I look at our CI (which I believe is here) we won't be testing all of these scenarios.
Not trying to hold up this PR. Just trying to figure out if this means we leave the responsibility on the user to build & run tests on their own platform -- and report to us if the tests errors. Is that ok?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Alternatively, we could have these branches (choosing which SparseStrategy
enum) not be hardcoded to a feature & architecure -- and instead use an approach where we can set the variable separately before running each test. That way all branches can be run through the correctness tests.
Is this a reasonable ask? (I'm new to this codebase & as a code reviewer here.)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In general, the arrow tests are controlled by this file: https://github.com/apache/arrow-rs/blob/master/.github/workflows/arrow.yml
Which runs with various combination of feature flags
Which issue does this PR close?
Closes #6017
Rationale for this change
N/A
What changes are included in this PR?
UnionArray::logical_nulls
implementation, tests and benchesis_null
andis_not_null
tests on unionsCheck if sparse union child arrays length match the length of the parent union
Are there any user-facing changes?
UnionArray::logical_nulls
return correct resultsAdditional info
This is a port of apache/datafusion#11321