Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: Add support for Binary/LargeBinary/Utf8/LargeUtf8 data types in data page statistics #11136

Merged
merged 4 commits into from
Jun 27, 2024

Conversation

PsiACE
Copy link
Member

@PsiACE PsiACE commented Jun 26, 2024

Which issue does this PR close?

Part of #11026

Rationale for this change

What changes are included in this PR?

Are these changes tested?

Are there any user-facing changes?

Signed-off-by: Chojan Shang <psiace@apache.org>
Signed-off-by: Chojan Shang <psiace@apache.org>
Signed-off-by: Chojan Shang <psiace@apache.org>
@github-actions github-actions bot added the core Core DataFusion crate label Jun 26, 2024
Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for this contribution @PsiACE -- this looks great to me

@@ -600,6 +600,18 @@ make_data_page_stats_iterator!(
Index::DOUBLE,
f64
);
make_data_page_stats_iterator!(
MinByteArrayDataPageStatsIterator,
|x: &PageIndex<ByteArray>| { x.min.clone() },
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I spent some time digging in the code and I found that this clone is cloning a Bytes (aka it is ref counted, and is not actually copying the bytes around)

[<$stat_type_prefix ByteArrayDataPageStatsIterator>]::new($iterator).map(|x| {
x.into_iter().filter_map(|x| {
x.and_then(|x| {
let res = std::str::from_utf8(x.data()).map(|s| s.to_string()).ok();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I played around with this code and was able to avoid the to_string() here but the code was more involved:

                Some(DataType::Utf8) => {
                    let mut builder = StringBuilder::new();
                    let iterator = [<$stat_type_prefix ByteArrayDataPageStatsIterator>]::new($iterator);
                    for x in iterator {
                        for x in x.into_iter() {
                            let Some(x) = x else {
                                builder.append_null(); // no statistics value
                                continue;
                            };

                            let Ok(x) = std::str::from_utf8(x.data()) else {
                                log::debug!("Utf8 statistics is a non-UTF8 value, ignoring it.");
                                builder.append_null();
                                continue;
                            };

                            builder.append_value(x);
                        }
                    }
                    Ok(Arc::new(builder.finish()))
                },

I think this is a good one for now until we have a benchmark to optimize (aka #10934)

@alamb alamb merged commit 64b8eea into apache:main Jun 27, 2024
23 checks passed
@PsiACE PsiACE deleted the try-data-page branch June 28, 2024 00:36
findepi pushed a commit to findepi/datafusion that referenced this pull request Jul 16, 2024
…types in data page statistics (apache#11136)

* *: for binary

Signed-off-by: Chojan Shang <psiace@apache.org>

* *: for large binary

Signed-off-by: Chojan Shang <psiace@apache.org>

* feat: add utf8 and largeutf8

Signed-off-by: Chojan Shang <psiace@apache.org>

---------

Signed-off-by: Chojan Shang <psiace@apache.org>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
core Core DataFusion crate
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants