Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add support for reading orc column statistics #6588

Closed
wants to merge 1 commit into from

Conversation

wypb
Copy link
Contributor

@wypb wypb commented Sep 15, 2023

Currently, the dwrf module only supports reading the column statistics of dwrf format, and does not support reading the column statistics of orc format. When I use velox to read the orc table, the following exception occurs:

presto:tpch_velox> create table region_orc with(format = 'ORC') as select * from tpch.sf10.region;
CREATE TABLE: 5 rows

Query 20230804_032253_00003_tubug, FINISHED, 1 node
Splits: 22 total, 22 done (100.00%)
0:35 [5 rows, 0B] [0 rows/s, 0B/s]

presto:tpch_velox> select * from region_orc;

Query 20230804_043524_00023_m8qw7, FAILED, 1 node
Splits: 2 total, 0 done (0.00%)
0:40 [0 rows, 0B] [0 rows/s, 0B/s]

Query 20230804_043524_00023_m8qw7 failed:  Operator::getOutput failed for [operator: TableScan, plan node ID: 0]: vector

with this pr, we support read orc column statistics, than we can read orc data through velox:

presto:tpch_velox> select * from region_orc;
 regionkey |    name     |                                                       comment
-----------+-------------+---------------------------------------------------------------------------------------------------------------------
         0 | AFRICA      | lar deposits. blithely final packages cajole. regular waters are final requests. regular accounts are according to
         1 | AMERICA     | hs use ironic, even requests. s
         2 | ASIA        | ges. thinly even pinto beans ca
         3 | EUROPE      | ly final courts cajole furiously final excuse
         4 | MIDDLE EAST | uickly special accounts cajole carefully blithely close requests. carefully final asymptotes haggle furiousl
(5 rows)

Query 20230804_044016_00026_m8qw7, FINISHED, 1 node
Splits: 2 total, 2 done (100.00%)
0:14 [5 rows, 603B] [0 rows/s, 44B/s]

presto:tpch_velox>

CC: @Yuhta

@netlify
Copy link

netlify bot commented Sep 15, 2023

Deploy Preview for meta-velox canceled.

Name Link
🔨 Latest commit 48f1ee0
🔍 Latest deploy log https://app.netlify.com/sites/meta-velox/deploys/666a560f236a4b00092a640b

@facebook-github-bot facebook-github-bot added the CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. label Sep 15, 2023
@wypb wypb closed this Sep 15, 2023
@wypb wypb deleted the orc_read branch September 15, 2023 09:10
@wypb wypb restored the orc_read branch September 15, 2023 09:11
@wypb wypb reopened this Sep 15, 2023
@wypb wypb force-pushed the orc_read branch 3 times, most recently from 8b689ff to 70fa5e9 Compare November 30, 2023 07:16
@@ -97,9 +97,9 @@ std::unique_ptr<BinaryStripeStreams> BinaryStreamReader::next() {
stripeReaderBase_, columnSelector_, stripeIndex_++);
}

std::unordered_map<uint32_t, proto::ColumnStatistics>
std::unordered_map<uint32_t, ColumnStatisticsWrapper>
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We have some problem here, the wrapper is just a reference but here we are returning actual values. If ORC does not need it, can we keep the signature and make it DWRF only (with a check)?

@@ -111,7 +111,8 @@ BinaryStreamReader::getStatistics() const {
"Corrupted file detected, Footer stats are missing, but stripes are present");
for (auto node = 0; node < typesSize; node++) {
if (columnSelector_.shouldReadNode(node)) {
stats[node] = proto::ColumnStatistics();
const proto::ColumnStatistics cs = proto::ColumnStatistics();
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@Yuhta Thank you for your review, It should be that temporary local variables are used here, I will modify it.

@wypb wypb force-pushed the orc_read branch 2 times, most recently from 7dd28a8 to 9c15412 Compare December 18, 2023 08:28
@wypb
Copy link
Contributor Author

wypb commented Dec 18, 2023

Hi @Yuhta, Thank you for your review, I have modified the problematic code and the code has been rebased. You can check this diff for the relevant modified code.

auto cs =
google::protobuf::Arena::CreateMessage<proto::ColumnStatistics>(
stripeReaderBase_.getReader().arena());
stats.emplace(node, ColumnStatisticsWrapper(cs));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We still have the same problem, I think for this function you should keep it as it is, and create another one returning proto::orc::ColumnStatistics if it is required by ORC as well. Wrappers can only be created on call sites.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@Yuhta I looked at the code again, ORC does not use this method. I'll keep the previous code logic. Thank you for testing this.

@wypb wypb force-pushed the orc_read branch 2 times, most recently from 2af7412 to c5c5986 Compare December 19, 2023 01:33
@@ -436,12 +766,18 @@ class FooterWrapper : public ProtoWrapperBase {
return dwrfPtr()->statistics();
}

const ::facebook::velox::dwrf::proto::ColumnStatistics& statistics(
const ::facebook::velox::dwrf::proto::ColumnStatistics& statisticsByIndex(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A better name is dwrfStatistics

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed, thanks.

@wypb wypb closed this Dec 20, 2023
@wypb wypb reopened this Dec 20, 2023
@wypb wypb force-pushed the orc_read branch 2 times, most recently from 0d21b06 to 4b3960e Compare December 20, 2023 03:19
@wypb
Copy link
Contributor Author

wypb commented Jan 25, 2024

Hi @Yuhta Do you have any other comment for this PR? thank you.

@wypb
Copy link
Contributor Author

wypb commented Feb 26, 2024

Hi @Yuhta PTAL, Thank you.

@wypb wypb force-pushed the orc_read branch 5 times, most recently from 0fbf1b2 to d886334 Compare June 13, 2024 01:20
@wypb
Copy link
Contributor Author

wypb commented Jun 13, 2024

HI @kevinwilfong I moved the implementation of options() and context() to ColumnStatisticsBase.h file to reduce duplication of code. I also synchronized the latest code of the main branch to make CI green. please help me review it again when you have a chance. Thanks!

@wypb wypb force-pushed the orc_read branch 14 times, most recently from 5ec1137 to cfc7d0c Compare June 13, 2024 02:09
@facebook-github-bot
Copy link
Contributor

@kevinwilfong has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

@facebook-github-bot
Copy link
Contributor

@kevinwilfong merged this pull request in 0c0a973.

Copy link

Conbench analyzed the 1 benchmark run on commit 0c0a9737.

There were no benchmark performance regressions. 🎉

The full Conbench report has more details.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. Merged ready-to-merge PR that have been reviewed and are ready for merging. PRs with this tag notify the Velox Meta oncall
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants