-
Notifications
You must be signed in to change notification settings - Fork 1.2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat: Add support for Binary
/LargeBinary
/Utf8
/LargeUtf8
data types in data page statistics
#11136
Conversation
Signed-off-by: Chojan Shang <psiace@apache.org>
Signed-off-by: Chojan Shang <psiace@apache.org>
Signed-off-by: Chojan Shang <psiace@apache.org>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for this contribution @PsiACE -- this looks great to me
@@ -600,6 +600,18 @@ make_data_page_stats_iterator!( | |||
Index::DOUBLE, | |||
f64 | |||
); | |||
make_data_page_stats_iterator!( | |||
MinByteArrayDataPageStatsIterator, | |||
|x: &PageIndex<ByteArray>| { x.min.clone() }, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I spent some time digging in the code and I found that this clone
is cloning a Bytes
(aka it is ref counted, and is not actually copying the bytes around)
[<$stat_type_prefix ByteArrayDataPageStatsIterator>]::new($iterator).map(|x| { | ||
x.into_iter().filter_map(|x| { | ||
x.and_then(|x| { | ||
let res = std::str::from_utf8(x.data()).map(|s| s.to_string()).ok(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I played around with this code and was able to avoid the to_string()
here but the code was more involved:
Some(DataType::Utf8) => {
let mut builder = StringBuilder::new();
let iterator = [<$stat_type_prefix ByteArrayDataPageStatsIterator>]::new($iterator);
for x in iterator {
for x in x.into_iter() {
let Some(x) = x else {
builder.append_null(); // no statistics value
continue;
};
let Ok(x) = std::str::from_utf8(x.data()) else {
log::debug!("Utf8 statistics is a non-UTF8 value, ignoring it.");
builder.append_null();
continue;
};
builder.append_value(x);
}
}
Ok(Arc::new(builder.finish()))
},
I think this is a good one for now until we have a benchmark to optimize (aka #10934)
…types in data page statistics (apache#11136) * *: for binary Signed-off-by: Chojan Shang <psiace@apache.org> * *: for large binary Signed-off-by: Chojan Shang <psiace@apache.org> * feat: add utf8 and largeutf8 Signed-off-by: Chojan Shang <psiace@apache.org> --------- Signed-off-by: Chojan Shang <psiace@apache.org>
Which issue does this PR close?
Part of #11026
Rationale for this change
What changes are included in this PR?
Are these changes tested?
Are there any user-facing changes?