-
Notifications
You must be signed in to change notification settings - Fork 1.2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Support for dictionary encoded INT96 timestamp in parquet files #4680
Conversation
✅ Deploy Preview for meta-velox canceled.
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@rui-mo the implementation looks good to me. Left a couple of comments. Can you add a Parquet File with timestamp type to velox/dwio/parquet/tests/examples
and add a test?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please add some tests in https://github.com/facebookincubator/velox/blob/main/velox/dwio/parquet/tests/reader/E2EFilterTest.cpp
You probably need to make the writer to generate INT96 in writeToMemory
@Yuhta I doubt if the Arrow Bridge supports int96 type. But worth checking. The alternative is to check in a file. |
@majetideepak Using a fixed file gives less coverage, but if the writer is not working then we have to do it this way for now. Either way we should make sure the result is correct with or without filters. |
@majetideepak @Yuhta Thanks for your review! Your comments are well received, and I'm working on them. |
f5c944e
to
a8dee34
Compare
@Yuhta I also tried that. Use |
53d9408
to
1fe1694
Compare
hi @Yuhta, I spent more time on Below is the stack of current timestamp scan. facebook::velox::parquet::PageReader::prepareDictionary(facebook::velox::parquet::thrift::PageHeader const&) in ./velox_dwio_parquet_table_scan_test
1# facebook::velox::parquet::PageReader::seekToPage(long) in ./velox_dwio_parquet_table_scan_test
2# facebook::velox::parquet::PageReader::rowsForPage(facebook::velox::dwio::common::SelectiveColumnReader&, bool, bool, folly::Range<int const*>&, unsigned long const*&) in ./velox_dwio_parquet_table_scan_test
3# void facebook::velox::parquet::PageReader::readWithVisitor<facebook::velox::dwio::common::ColumnVisitor<__int128, facebook::velox::common::AlwaysTrue, facebook::velox::dwio::common::ExtractToReader<facebook::velox::dwio::common::SelectiveIntegerColumnReader>, true> >(facebook::velox::dwio::common::ColumnVisitor<__int128, facebook::velox::common::AlwaysTrue, facebook::velox::dwio::common::ExtractToReader<facebook::velox::dwio::common::SelectiveIntegerColumnReader>, true>&) in ./velox_dwio_parquet_table_scan_test Could you explain more about the possible risks? Thank you. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think using int128_t will work here, the valueSize_ is different and you will end up reading different part of data and even read out of bound.
hi @Yuhta, I spent more time on
Int96Timestamp
type support but it is not easy to make it work through. We also made more tests on theint128_t
workaround, and found it could work for a pure scan. As posted in this PR, int96 in Parquet is converted to Velox Timestamp type (which is of 16-byte length) in PageReader (see link), and onlynumValues * sizeof(Int96Timestamp)
bytes of data was read in PageReader.Below is the stack of current timestamp scan.
facebook::velox::parquet::PageReader::prepareDictionary(facebook::velox::parquet::thrift::PageHeader const&) in ./velox_dwio_parquet_table_scan_test 1# facebook::velox::parquet::PageReader::seekToPage(long) in ./velox_dwio_parquet_table_scan_test 2# facebook::velox::parquet::PageReader::rowsForPage(facebook::velox::dwio::common::SelectiveColumnReader&, bool, bool, folly::Range<int const*>&, unsigned long const*&) in ./velox_dwio_parquet_table_scan_test 3# void facebook::velox::parquet::PageReader::readWithVisitor<facebook::velox::dwio::common::ColumnVisitor<__int128, facebook::velox::common::AlwaysTrue, facebook::velox::dwio::common::ExtractToReader<facebook::velox::dwio::common::SelectiveIntegerColumnReader>, true> >(facebook::velox::dwio::common::ColumnVisitor<__int128, facebook::velox::common::AlwaysTrue, facebook::velox::dwio::common::ExtractToReader<facebook::velox::dwio::common::SelectiveIntegerColumnReader>, true>&) in ./velox_dwio_parquet_table_scan_testCould you explain more about the possible risks? Thank you.
So the assumption here is it is always dictionary-encoded? If this assumption holds all the time, we can probably go this way. It's only a problem when we want to apply a filter on the column of flat values.
Make sure to beef up your E2E filter tests with filters on some primary keys (int64 is fine), and also put timestamp in complex types (array, map, struct) in addition to top-level column.
@Yuhta Thanks for your reply.
Understood the gap here. I guess RLEV1 and Plain encoding are also possible because the column encoding can be set during Parquet write, but we only tested the Parquet generated with default configs.
Got it, will do. |
@bikramSingh91 Could you help import and merge this PR? Thanks! |
@@ -99,13 +99,11 @@ PlanBuilder& PlanBuilder::tableScan( | |||
const RowTypePtr& dataColumns, | |||
const std::unordered_map< | |||
std::string, | |||
std::shared_ptr<connector::ColumnHandle>>& assignments, | |||
bool isFilterPushdownEnabled) { | |||
std::shared_ptr<connector::ColumnHandle>>& assignments) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is removing isFilterPushdownEnabled parameter related to the Timestamp reader?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for your review. To add isFilterPushdownEnabled
parameter was a temporary change before supporting the filter pushdown of Timestamp. After its support in Filter.h, this change has been removed from this PR.
@@ -272,6 +272,16 @@ bool HiveConfig::s3UseProxyFromEnv() const { | |||
return config_->get<bool>(kS3UseProxyFromEnv, false); | |||
} | |||
|
|||
uint8_t HiveConfig::readTimestampUnit(const Config* session) const { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The unit should be read from the Parquet logical type for this column, not set by the user as a config property. See https://github.com/apache/parquet-format/blob/master/LogicalTypes.md#timestamp
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is corresponding to the timestamp unit that can be handled in compute engine (usually milliseconds for Presto and maybe some other values for Spark), not related to the type in the file. The reader should use the more coarse one of both.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@yingsu00 In a Parquet file, the unit of int96 is fixed because it is made up of days and nanos, unlike int64-timestamp, which can have different units. We parse days and nanos from Parquet, while the compute engine may need different units of timestamps, e.g. Presto needs milli while Spark needs micro. This config allows us to adjust timestamp precision according to user's requirement.
Without this change, the filter result could become incorrect. For example, for a Spark filter a == 2000-09-12 22:36:29.000000
, if a is stored as nano unit in Velox, when a is 2000-09-12 22:36:29.000000111
Velox returns false but Spark needs true because it only cares about the micro digits.
Therefore, we need to truncate the value and this logic is also needed for int64-timestamp reader. Does this makes sense? Thanks.
Reference for Int96 in Parquet: https://github.com/apache/parquet-format/pull/49/files#diff-0e877db0daf579f98a11e5e113b29250a2dcae3decb1e83a88db1e6f092bee96R149-R150
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@yingsu00 In a Parquet file, the unit of int96 is fixed because it is made up of days and nanos, unlike int64-timestamp, which can have different units. We parse days and nanos from Parquet, while the compute engine may need different units of timestamps, e.g. Presto needs milli while Spark needs micro. This config allows us to adjust timestamp precision according to user's requirement.
Without this change, the filter result could become incorrect. For example, for a Spark filter
a == 2000-09-12 22:36:29.000000
, if a is stored as nano unit in Velox, when a is2000-09-12 22:36:29.000000111
Velox returns false but Spark needs true because it only cares about the micro digits.Therefore, we need to truncate the value and this logic is also needed for int64-timestamp reader. Does this makes sense? Thanks.
Reference for Int96 in Parquet: https://github.com/apache/parquet-format/pull/49/files#diff-0e877db0daf579f98a11e5e113b29250a2dcae3decb1e83a88db1e6f092bee96R149-R150
THanks @rui-mo for explaining. Sorry I didn't check the INT96 Timestamp spec. Just approved this PR.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for helping review this PR.
We have ported this PR internally and so far running fine. Thanks for working on this @rui-mo ! We do encounter one issue though related IntDecoder reading int128.
Any quick insights on what needs to be done here? i.e. If the data file uses INT96 (12 bytes), readInt128() would read 16 bytes? Then will the reader need to re-align the bytes correspondingly, plus use little endian? |
@chliang71 Thanks for your feedback. I assume this issue is on plain-encoded timestamp reading, while this PR focuses on dictionary-encoding. There is a draft on plain-encoding support oap-project@533bb9e by @mskapilks, which may go into a separate PR after this one. |
I can raise the follow up PR for that change once this is done |
@Yuhta has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator. |
@mbasmanova I assume there has been some discussions on whether to merge this one or #8325 first. Seeing #8325 (comment) & #8325 (comment). If possible, we would like merge this one first as it is ready. |
@rui-mo There is a assertion failure in unit test:
|
@Yuhta Thanks for the catch. I reproduced locally on debug mode and fixed with this change: https://github.com/facebookincubator/velox/pull/4680/files#diff-ae87451c1577f3b47d2863187de8bf30c7351484d39537419016487cc7b2f71cR49-R51. Would you take another look? Thank you. |
@Yuhta has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator. |
Conbench analyzed the 1 benchmark run on commit There were no benchmark performance regressions. 🎉 The full Conbench report has more details. |
Thank you all for helping review this PR. |
Support timestamp reader for Parquet file format to read from dictionary-
encoded INT96 timestamps. Hive configs
kReadTimestampUnit
andkReadTimestampUnitSession
are added to control the precision whenreading timestamps from files.
Parquet documentation for INT96:
https://github.com/apache/parquet-format/pull/49/files#diff-0e877db0daf579f98a11e5e113b29250a2dcae3decb1e83a88db1e6f092bee96R149-R157