Support filter out strip by provided range #126

Merged
merged 1 commit into from
Sep 29, 2024


harveyyue
Contributor

@harveyyue harveyyue commented Sep 28, 2024

Since Spark's ORC file format slices a file into multiple ORC splits, supporting filtering out stripes by a provided byte range avoids reading the whole ORC data file for every split.

@Jefffrey
Collaborator

Could you help me understand the intended use case for this? This API seems a bit unintuitive in requiring the user to specify the exact byte range that should be read from a file (where a stripe only needs to begin inside specified range, not necessarily being contained within the range itself). Would it be a more intuitive API to allow users to specify which stripes they would want to read via their indices perhaps?
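
To make the semantics in question concrete, here is a minimal sketch contrasting the "stripe begins inside the range" rule with a stricter "stripe fully contained in the range" rule. `StripeMeta`, `starts_in_range`, and `contained_in_range` are hypothetical names used only for illustration; they are not taken from this crate's API.

```rust
use std::ops::Range;

/// Hypothetical, simplified stripe descriptor (not the crate's real type).
#[derive(Debug, Clone, Copy, PartialEq)]
struct StripeMeta {
    /// Byte offset of the stripe's first byte in the file.
    offset: u64,
    /// Total stripe length in bytes (index + data + footer).
    length: u64,
}

/// Rule discussed in this PR: keep a stripe if its *start* offset lies in the range.
fn starts_in_range(stripe: &StripeMeta, range: &Range<u64>) -> bool {
    range.contains(&stripe.offset)
}

/// Stricter alternative: keep a stripe only if all of its bytes lie in the range.
fn contained_in_range(stripe: &StripeMeta, range: &Range<u64>) -> bool {
    stripe.offset >= range.start && stripe.offset + stripe.length <= range.end
}
```

For a stripe starting at offset 3 that is 5000 bytes long, the range `0..2000` selects it under the first rule but not under the second. Because a set of contiguous, non-overlapping ranges contains any given start offset exactly once, the first rule assigns every stripe to exactly one split.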

@harveyyue
Contributor Author

harveyyue commented Sep 29, 2024

> Could you help me understand the intended use case for this? This API seems a bit unintuitive in requiring the user to specify the exact byte range that should be read from a file (where a stripe only needs to begin inside specified range, not necessarily being contained within the range itself). Would it be a more intuitive API to allow users to specify which stripes they would want to read via their indices perhaps?

Spark will slice the file part-00000-5ef6a45a-89b8-4048-babe-01fdfd1e0475.c000.zlib.orc into 8 splits, as shown below. This ORC file currently has 4 stripes, and a reader that selects specific stripes couldn't handle this split-file case, but with the specified-range API we can filter out the stripes that don't belong to a given split.
In addition, the ORC Java API's RecordReaderImpl also uses the specified range to filter out stripes.

Spark filePartitions

FilePartition(0,[Lorg.apache.spark.sql.execution.datasources.PartitionedFile;@1fbe9d4): path: file:///Users/xxx/Downloads/2024-09-25/part-00000-5ef6a45a-89b8-4048-babe-01fdfd1e0475.c000.zlib.orc, range: 0-15485986, partition values: [empty row]
FilePartition(1,[Lorg.apache.spark.sql.execution.datasources.PartitionedFile;@14c22e8): path: file:///Users/xxx/Downloads/2024-09-25/part-00000-5ef6a45a-89b8-4048-babe-01fdfd1e0475.c000.zlib.orc, range: 15485986-30971972, partition values: [empty row]
FilePartition(2,[Lorg.apache.spark.sql.execution.datasources.PartitionedFile;@1f69937a): path: file:///Users/xxx/Downloads/2024-09-25/part-00000-5ef6a45a-89b8-4048-babe-01fdfd1e0475.c000.zlib.orc, range: 30971972-46457958, partition values: [empty row]
FilePartition(3,[Lorg.apache.spark.sql.execution.datasources.PartitionedFile;@717d3b2b): path: file:///Users/xxx/Downloads/2024-09-25/part-00000-5ef6a45a-89b8-4048-babe-01fdfd1e0475.c000.zlib.orc, range: 46457958-61943944, partition values: [empty row]
FilePartition(4,[Lorg.apache.spark.sql.execution.datasources.PartitionedFile;@2f16c999): path: file:///Users/xxx/Downloads/2024-09-25/part-00000-5ef6a45a-89b8-4048-babe-01fdfd1e0475.c000.zlib.orc, range: 61943944-77429930, partition values: [empty row]
FilePartition(5,[Lorg.apache.spark.sql.execution.datasources.PartitionedFile;@4f2de5f1): path: file:///Users/xxx/Downloads/2024-09-25/part-00000-5ef6a45a-89b8-4048-babe-01fdfd1e0475.c000.zlib.orc, range: 77429930-92915916, partition values: [empty row]
FilePartition(6,[Lorg.apache.spark.sql.execution.datasources.PartitionedFile;@1c0c4d2d): path: file:///Users/xxx/Downloads/2024-09-25/part-00000-5ef6a45a-89b8-4048-babe-01fdfd1e0475.c000.zlib.orc, range: 92915916-108401902, partition values: [empty row]
FilePartition(7,[Lorg.apache.spark.sql.execution.datasources.PartitionedFile;@43089e4): path: file:///Users/xxx/Downloads/2024-09-25/part-00000-5ef6a45a-89b8-4048-babe-01fdfd1e0475.c000.zlib.orc, range: 108401902-119693586, partition values: [empty row]

Orc file metadata

Processing data file part-00000-5ef6a45a-89b8-4048-babe-01fdfd1e0475.c000.zlib.orc [length: 119693586]
Structure for part-00000-5ef6a45a-89b8-4048-babe-01fdfd1e0475.c000.zlib.orc
File Version: 0.12 with ORC_517 by ORC Java
Rows: 7000000
Compression: ZLIB
Compression size: 262144
Calendar: Julian/Gregorian
Type: struct<_op:string,_ts_ms:bigint,_hoodie_is_deleted:boolean,_precombine_key:string,id:bigint,coin:int,user_id:bigint,account_id:bigint,effective_amount_e8:bigint,pnl_e8:bigint,product_id:bigint,status:int,date:bigint,created_at:bigint,updated_at:bigint,type:int,award_id:bigint,interest_card_pnl_e8:bigint>

Stripe Statistics:
Stripe 1:
Column 0: count: 2140160 hasNull: false
Column 1: count: 2140160 hasNull: false bytesOnDisk: 117 min: r max: r sum: 2140160
Column 2: count: 2140160 hasNull: false bytesOnDisk: 79273 min: 1727310063000 max: 1727320528000 sum: 3696730248666850000
Column 3: count: 2140160 hasNull: false bytesOnDisk: 41 true: 0
Column 4: count: 2140160 hasNull: false bytesOnDisk: 138 min: 1727309378802|0000000000|00000000000000000000|0000000000 max: 1727309378802|0000000000|00000000000000000000|0000000000 sum: 119848960
Column 5: count: 2140160 hasNull: false bytesOnDisk: 4826163 min: 154389335 max: 256789334 sum: 445391857788800
Column 6: count: 2140160 hasNull: false bytesOnDisk: 2333188 min: 1 max: 717 sum: 415121566
Column 7: count: 2140160 hasNull: false bytesOnDisk: 8073284 min: 461578 max: 288005330 sum: 187694142854795
Column 8: count: 2140160 hasNull: false bytesOnDisk: 107 min: 0 max: 0 sum: 0
Column 9: count: 2140160 hasNull: false bytesOnDisk: 10796783 min: 0 max: 7940634944600000000
Column 10: count: 2140160 hasNull: false bytesOnDisk: 6211231 min: 0 max: 217551600000000 sum: 14828543874169736
Column 11: count: 2140160 hasNull: false bytesOnDisk: 2341373 min: 1 max: 428 sum: 312436856
Column 12: count: 2140160 hasNull: false bytesOnDisk: 107 min: 0 max: 0 sum: 0
Column 13: count: 2140160 hasNull: false bytesOnDisk: 1349762 min: 1717200963000 max: 1726274451000 sum: 3685640085065660000
Column 14: count: 2140160 hasNull: false bytesOnDisk: 213586 min: 1717204608000 max: 1726274451000 sum: 3685642207967777000
Column 15: count: 2140160 hasNull: false bytesOnDisk: 1242591 min: 1717204608000 max: 1726274569000 sum: 3685642291884386000
Column 16: count: 2140160 hasNull: false bytesOnDisk: 3563 min: 0 max: 3 sum: 65742
Column 17: count: 2140160 hasNull: false bytesOnDisk: 107 min: 0 max: 0 sum: 0
Column 18: count: 2140160 hasNull: false bytesOnDisk: 255 min: 0 max: 193655787 sum: 525299178
Stripe 2:
Column 0: count: 2160640 hasNull: false
Column 1: count: 2160640 hasNull: false bytesOnDisk: 118 min: r max: r sum: 2160640
Column 2: count: 2160640 hasNull: false bytesOnDisk: 80117 min: 1727310143000 max: 1727320751000 sum: 3732105049953096000
Column 3: count: 2160640 hasNull: false bytesOnDisk: 43 true: 0
Column 4: count: 2160640 hasNull: false bytesOnDisk: 139 min: 1727309378802|0000000000|00000000000000000000|0000000000 max: 1727309378802|0000000000|00000000000000000000|0000000000 sum: 120995840
Column 5: count: 2160640 hasNull: false bytesOnDisk: 4863204 min: 154459415 max: 267052720 sum: 471540114883200
Column 6: count: 2160640 hasNull: false bytesOnDisk: 2318295 min: 1 max: 717 sum: 414120776
Column 7: count: 2160640 hasNull: false bytesOnDisk: 8164727 min: 461578 max: 308047924 sum: 194730114994289
Column 8: count: 2160640 hasNull: false bytesOnDisk: 108 min: 0 max: 0 sum: 0
Column 9: count: 2160640 hasNull: false bytesOnDisk: 10849376 min: 0 max: 9200000000000000000
Column 10: count: 2160640 hasNull: false bytesOnDisk: 6243389 min: 0 max: 217551600000000 sum: 15040754996761726
Column 11: count: 2160640 hasNull: false bytesOnDisk: 2329304 min: 1 max: 428 sum: 310045921
Column 12: count: 2160640 hasNull: false bytesOnDisk: 108 min: 0 max: 0 sum: 0
Column 13: count: 2160640 hasNull: false bytesOnDisk: 1318286 min: 1717287360000 max: 1727137331000 sum: 3722922583673604000
Column 14: count: 2160640 hasNull: false bytesOnDisk: 218007 min: 1717287554000 max: 1727137339000 sum: 3722924901403567000
Column 15: count: 2160640 hasNull: false bytesOnDisk: 523315 min: 1717287554000 max: 1727137346000 sum: 3722924918585875000
Column 16: count: 2160640 hasNull: false bytesOnDisk: 3721 min: 0 max: 3 sum: 73551
Column 17: count: 2160640 hasNull: false bytesOnDisk: 108 min: 0 max: 0 sum: 0
Column 18: count: 2160640 hasNull: false bytesOnDisk: 942 min: 0 max: 550000000 sum: 6754362385
Stripe 3:
Column 0: count: 2176000 hasNull: false
Column 1: count: 2176000 hasNull: false bytesOnDisk: 118 min: r max: r sum: 2176000
Column 2: count: 2176000 hasNull: false bytesOnDisk: 80741 min: 1727309864000 max: 1727319282000 sum: 3758638017170580000
Column 3: count: 2176000 hasNull: false bytesOnDisk: 43 true: 0
Column 4: count: 2176000 hasNull: false bytesOnDisk: 139 min: 1727309378802|0000000000|00000000000000000000|0000000000 max: 1727309378802|0000000000|00000000000000000000|0000000000 sum: 121856000
Column 5: count: 2176000 hasNull: false bytesOnDisk: 4854530 min: 170389335 max: 267189334 sum: 469557746640000
Column 6: count: 2176000 hasNull: false bytesOnDisk: 2309626 min: 1 max: 717 sum: 408335509
Column 7: count: 2176000 hasNull: false bytesOnDisk: 8206884 min: 461578 max: 308790487 sum: 195723938201199
Column 8: count: 2176000 hasNull: false bytesOnDisk: 108 min: 0 max: 0 sum: 0
Column 9: count: 2176000 hasNull: false bytesOnDisk: 10883131 min: 0 max: 9200000000000000000
Column 10: count: 2176000 hasNull: false bytesOnDisk: 6264019 min: 0 max: 252054700000000 sum: 15061862115047130
Column 11: count: 2176000 hasNull: false bytesOnDisk: 2319575 min: 1 max: 428 sum: 310922042
Column 12: count: 2176000 hasNull: false bytesOnDisk: 108 min: 0 max: 0 sum: 0
Column 13: count: 2176000 hasNull: false bytesOnDisk: 1334074 min: 1718842560000 max: 1727137593000 sum: 3748879671302168000
Column 14: count: 2176000 hasNull: false bytesOnDisk: 319777 min: 1718844219000 max: 1727137593000 sum: 3748881950393951000
Column 15: count: 2176000 hasNull: false bytesOnDisk: 635359 min: 1718844221000 max: 1727137599000 sum: 3748881970927861000
Column 16: count: 2176000 hasNull: false bytesOnDisk: 4049 min: 0 max: 3 sum: 164865
Column 17: count: 2176000 hasNull: false bytesOnDisk: 108 min: 0 max: 0 sum: 0
Column 18: count: 2176000 hasNull: false bytesOnDisk: 1203 min: 0 max: 480023027 sum: 6826562910
Stripe 4:
Column 0: count: 523200 hasNull: false
Column 1: count: 523200 hasNull: false bytesOnDisk: 57 min: r max: r sum: 523200
Column 2: count: 523200 hasNull: false bytesOnDisk: 3769 min: 1727311393000 max: 1727318680000 sum: 903730017209474000
Column 3: count: 523200 hasNull: false bytesOnDisk: 19 true: 0
Column 4: count: 523200 hasNull: false bytesOnDisk: 78 min: 1727309378802|0000000000|00000000000000000000|0000000000 max: 1727309378802|0000000000|00000000000000000000|0000000000 sum: 29299200
Column 5: count: 523200 hasNull: false bytesOnDisk: 222790 min: 192989335 max: 238789334 sum: 104815622188000
Column 6: count: 523200 hasNull: false bytesOnDisk: 562190 min: 1 max: 713 sum: 97927017
Column 7: count: 523200 hasNull: false bytesOnDisk: 1974867 min: 461578 max: 268199017 sum: 45696224087078
Column 8: count: 523200 hasNull: false bytesOnDisk: 47 min: 0 max: 0 sum: 0
Column 9: count: 523200 hasNull: false bytesOnDisk: 2631479 min: 0 max: 4601175500100000000
Column 10: count: 523200 hasNull: false bytesOnDisk: 1513070 min: 0 max: 126059600000000 sum: 3588340355210087
Column 11: count: 523200 hasNull: false bytesOnDisk: 564705 min: 1 max: 427 sum: 75786339
Column 12: count: 523200 hasNull: false bytesOnDisk: 47 min: 0 max: 0 sum: 0
Column 13: count: 523200 hasNull: false bytesOnDisk: 346099 min: 1720916160000 max: 1724890866000 sum: 900725815767454000
Column 14: count: 523200 hasNull: false bytesOnDisk: 10449 min: 1720917457000 max: 1724890884000 sum: 900726317556700000
Column 15: count: 523200 hasNull: false bytesOnDisk: 104258 min: 1720917457000 max: 1724890893000 sum: 900726321642114000
Column 16: count: 523200 hasNull: false bytesOnDisk: 774 min: 0 max: 3 sum: 534
Column 17: count: 523200 hasNull: false bytesOnDisk: 47 min: 0 max: 0 sum: 0
Column 18: count: 523200 hasNull: false bytesOnDisk: 107 min: 0 max: 550000000 sum: 680208080

File Statistics:
Column 0: count: 7000000 hasNull: false
Column 1: count: 7000000 hasNull: false bytesOnDisk: 410 min: r max: r sum: 7000000
Column 2: count: 7000000 hasNull: false bytesOnDisk: 243900 min: 1727309864000 max: 1727320751000
Column 3: count: 7000000 hasNull: false bytesOnDisk: 146 true: 0
Column 4: count: 7000000 hasNull: false bytesOnDisk: 494 min: 1727309378802|0000000000|00000000000000000000|0000000000 max: 1727309378802|0000000000|00000000000000000000|0000000000 sum: 392000000
Column 5: count: 7000000 hasNull: false bytesOnDisk: 14766687 min: 154389335 max: 267189334 sum: 1491305341500000
Column 6: count: 7000000 hasNull: false bytesOnDisk: 7523299 min: 1 max: 717 sum: 1335504868
Column 7: count: 7000000 hasNull: false bytesOnDisk: 26419762 min: 461578 max: 308790487 sum: 623844420137361
Column 8: count: 7000000 hasNull: false bytesOnDisk: 370 min: 0 max: 0 sum: 0
Column 9: count: 7000000 hasNull: false bytesOnDisk: 35160769 min: 0 max: 9200000000000000000
Column 10: count: 7000000 hasNull: false bytesOnDisk: 20231709 min: 0 max: 252054700000000 sum: 48519501341188679
Column 11: count: 7000000 hasNull: false bytesOnDisk: 7554957 min: 1 max: 428 sum: 1009191158
Column 12: count: 7000000 hasNull: false bytesOnDisk: 370 min: 0 max: 0 sum: 0
Column 13: count: 7000000 hasNull: false bytesOnDisk: 4348221 min: 1717200963000 max: 1727137593000
Column 14: count: 7000000 hasNull: false bytesOnDisk: 761819 min: 1717204608000 max: 1727137593000
Column 15: count: 7000000 hasNull: false bytesOnDisk: 2505523 min: 1717204608000 max: 1727137599000
Column 16: count: 7000000 hasNull: false bytesOnDisk: 12107 min: 0 max: 3 sum: 304692
Column 17: count: 7000000 hasNull: false bytesOnDisk: 370 min: 0 max: 0 sum: 0
Column 18: count: 7000000 hasNull: false bytesOnDisk: 2507 min: 0 max: 550000000 sum: 14786432553

Stripes:
Stripe: offset: 3 data: 37471669 rows: 2140160 tail: 255 index: 46908
Stream: column 0 section ROW_INDEX start: 3 length 40
Stream: column 1 section ROW_INDEX start: 43 length 938
Stream: column 2 section ROW_INDEX start: 981 length 2139
Stream: column 3 section ROW_INDEX start: 3120 length 837
Stream: column 4 section ROW_INDEX start: 3957 length 1129
Stream: column 5 section ROW_INDEX start: 5086 length 4235
Stream: column 6 section ROW_INDEX start: 9321 length 2535
Stream: column 7 section ROW_INDEX start: 11856 length 4300
Stream: column 8 section ROW_INDEX start: 16156 length 928
Stream: column 9 section ROW_INDEX start: 17084 length 5414
Stream: column 10 section ROW_INDEX start: 22498 length 4707
Stream: column 11 section ROW_INDEX start: 27205 length 2512
Stream: column 12 section ROW_INDEX start: 29717 length 928
Stream: column 13 section ROW_INDEX start: 30645 length 3696
Stream: column 14 section ROW_INDEX start: 34341 length 4408
Stream: column 15 section ROW_INDEX start: 38749 length 4435
Stream: column 16 section ROW_INDEX start: 43184 length 1640
Stream: column 17 section ROW_INDEX start: 44824 length 928
Stream: column 18 section ROW_INDEX start: 45752 length 1159
Stream: column 1 section DATA start: 46911 length 107
Stream: column 1 section LENGTH start: 47018 length 6
Stream: column 1 section DICTIONARY_DATA start: 47024 length 4
Stream: column 2 section DATA start: 47028 length 79273
Stream: column 3 section DATA start: 126301 length 41
Stream: column 4 section DATA start: 126342 length 107
Stream: column 4 section LENGTH start: 126449 length 6
Stream: column 4 section DICTIONARY_DATA start: 126455 length 25
Stream: column 5 section DATA start: 126480 length 4826163
Stream: column 6 section DATA start: 4952643 length 2333188
Stream: column 7 section DATA start: 7285831 length 8073284
Stream: column 8 section DATA start: 15359115 length 107
Stream: column 9 section DATA start: 15359222 length 10796783
Stream: column 10 section DATA start: 26156005 length 6211231
Stream: column 11 section DATA start: 32367236 length 2341373
Stream: column 12 section DATA start: 34708609 length 107
Stream: column 13 section DATA start: 34708716 length 1349762
Stream: column 14 section DATA start: 36058478 length 213586
Stream: column 15 section DATA start: 36272064 length 1242591
Stream: column 16 section DATA start: 37514655 length 3563
Stream: column 17 section DATA start: 37518218 length 107
Stream: column 18 section DATA start: 37518325 length 255
Encoding column 0: DIRECT
Encoding column 1: DICTIONARY_V2[1]
Encoding column 2: DIRECT_V2
Encoding column 3: DIRECT
Encoding column 4: DICTIONARY_V2[1]
Encoding column 5: DIRECT_V2
Encoding column 6: DIRECT_V2
Encoding column 7: DIRECT_V2
Encoding column 8: DIRECT_V2
Encoding column 9: DIRECT_V2
Encoding column 10: DIRECT_V2
Encoding column 11: DIRECT_V2
Encoding column 12: DIRECT_V2
Encoding column 13: DIRECT_V2
Encoding column 14: DIRECT_V2
Encoding column 15: DIRECT_V2
Encoding column 16: DIRECT_V2
Encoding column 17: DIRECT_V2
Encoding column 18: DIRECT_V2
Stripe: offset: 37518835 data: 36913307 rows: 2160640 tail: 252 index: 48653
Stream: column 0 section ROW_INDEX start: 37518835 length 40
Stream: column 1 section ROW_INDEX start: 37518875 length 948
Stream: column 2 section ROW_INDEX start: 37519823 length 2215
Stream: column 3 section ROW_INDEX start: 37522038 length 844
Stream: column 4 section ROW_INDEX start: 37522882 length 1143
Stream: column 5 section ROW_INDEX start: 37524025 length 4376
Stream: column 6 section ROW_INDEX start: 37528401 length 2594
Stream: column 7 section ROW_INDEX start: 37530995 length 4380
Stream: column 8 section ROW_INDEX start: 37535375 length 941
Stream: column 9 section ROW_INDEX start: 37536316 length 5500
Stream: column 10 section ROW_INDEX start: 37541816 length 4727
Stream: column 11 section ROW_INDEX start: 37546543 length 2558
Stream: column 12 section ROW_INDEX start: 37549101 length 941
Stream: column 13 section ROW_INDEX start: 37550042 length 3830
Stream: column 14 section ROW_INDEX start: 37553872 length 4519
Stream: column 15 section ROW_INDEX start: 37558391 length 4537
Stream: column 16 section ROW_INDEX start: 37562928 length 1676
Stream: column 17 section ROW_INDEX start: 37564604 length 941
Stream: column 18 section ROW_INDEX start: 37565545 length 1943
Stream: column 1 section DATA start: 37567488 length 108
Stream: column 1 section LENGTH start: 37567596 length 6
Stream: column 1 section DICTIONARY_DATA start: 37567602 length 4
Stream: column 2 section DATA start: 37567606 length 80117
Stream: column 3 section DATA start: 37647723 length 43
Stream: column 4 section DATA start: 37647766 length 108
Stream: column 4 section LENGTH start: 37647874 length 6
Stream: column 4 section DICTIONARY_DATA start: 37647880 length 25
Stream: column 5 section DATA start: 37647905 length 4863204
Stream: column 6 section DATA start: 42511109 length 2318295
Stream: column 7 section DATA start: 44829404 length 8164727
Stream: column 8 section DATA start: 52994131 length 108
Stream: column 9 section DATA start: 52994239 length 10849376
Stream: column 10 section DATA start: 63843615 length 6243389
Stream: column 11 section DATA start: 70087004 length 2329304
Stream: column 12 section DATA start: 72416308 length 108
Stream: column 13 section DATA start: 72416416 length 1318286
Stream: column 14 section DATA start: 73734702 length 218007
Stream: column 15 section DATA start: 73952709 length 523315
Stream: column 16 section DATA start: 74476024 length 3721
Stream: column 17 section DATA start: 74479745 length 108
Stream: column 18 section DATA start: 74479853 length 942
Encoding column 0: DIRECT
Encoding column 1: DICTIONARY_V2[1]
Encoding column 2: DIRECT_V2
Encoding column 3: DIRECT
Encoding column 4: DICTIONARY_V2[1]
Encoding column 5: DIRECT_V2
Encoding column 6: DIRECT_V2
Encoding column 7: DIRECT_V2
Encoding column 8: DIRECT_V2
Encoding column 9: DIRECT_V2
Encoding column 10: DIRECT_V2
Encoding column 11: DIRECT_V2
Encoding column 12: DIRECT_V2
Encoding column 13: DIRECT_V2
Encoding column 14: DIRECT_V2
Encoding column 15: DIRECT_V2
Encoding column 16: DIRECT_V2
Encoding column 17: DIRECT_V2
Encoding column 18: DIRECT_V2
Stripe: offset: 74481047 data: 37213592 rows: 2176000 tail: 255 index: 48735
Stream: column 0 section ROW_INDEX start: 74481047 length 40
Stream: column 1 section ROW_INDEX start: 74481087 length 957
Stream: column 2 section ROW_INDEX start: 74482044 length 2176
Stream: column 3 section ROW_INDEX start: 74484220 length 849
Stream: column 4 section ROW_INDEX start: 74485069 length 1152
Stream: column 5 section ROW_INDEX start: 74486221 length 4380
Stream: column 6 section ROW_INDEX start: 74490601 length 2580
Stream: column 7 section ROW_INDEX start: 74493181 length 4402
Stream: column 8 section ROW_INDEX start: 74497583 length 943
Stream: column 9 section ROW_INDEX start: 74498526 length 5477
Stream: column 10 section ROW_INDEX start: 74504003 length 4741
Stream: column 11 section ROW_INDEX start: 74508744 length 2584
Stream: column 12 section ROW_INDEX start: 74511328 length 943
Stream: column 13 section ROW_INDEX start: 74512271 length 3817
Stream: column 14 section ROW_INDEX start: 74516088 length 4581
Stream: column 15 section ROW_INDEX start: 74520669 length 4572
Stream: column 16 section ROW_INDEX start: 74525241 length 1714
Stream: column 17 section ROW_INDEX start: 74526955 length 943
Stream: column 18 section ROW_INDEX start: 74527898 length 1884
Stream: column 1 section DATA start: 74529782 length 108
Stream: column 1 section LENGTH start: 74529890 length 6
Stream: column 1 section DICTIONARY_DATA start: 74529896 length 4
Stream: column 2 section DATA start: 74529900 length 80741
Stream: column 3 section DATA start: 74610641 length 43
Stream: column 4 section DATA start: 74610684 length 108
Stream: column 4 section LENGTH start: 74610792 length 6
Stream: column 4 section DICTIONARY_DATA start: 74610798 length 25
Stream: column 5 section DATA start: 74610823 length 4854530
Stream: column 6 section DATA start: 79465353 length 2309626
Stream: column 7 section DATA start: 81774979 length 8206884
Stream: column 8 section DATA start: 89981863 length 108
Stream: column 9 section DATA start: 89981971 length 10883131
Stream: column 10 section DATA start: 100865102 length 6264019
Stream: column 11 section DATA start: 107129121 length 2319575
Stream: column 12 section DATA start: 109448696 length 108
Stream: column 13 section DATA start: 109448804 length 1334074
Stream: column 14 section DATA start: 110782878 length 319777
Stream: column 15 section DATA start: 111102655 length 635359
Stream: column 16 section DATA start: 111738014 length 4049
Stream: column 17 section DATA start: 111742063 length 108
Stream: column 18 section DATA start: 111742171 length 1203
Encoding column 0: DIRECT
Encoding column 1: DICTIONARY_V2[1]
Encoding column 2: DIRECT_V2
Encoding column 3: DIRECT
Encoding column 4: DICTIONARY_V2[1]
Encoding column 5: DIRECT_V2
Encoding column 6: DIRECT_V2
Encoding column 7: DIRECT_V2
Encoding column 8: DIRECT_V2
Encoding column 9: DIRECT_V2
Encoding column 10: DIRECT_V2
Encoding column 11: DIRECT_V2
Encoding column 12: DIRECT_V2
Encoding column 13: DIRECT_V2
Encoding column 14: DIRECT_V2
Encoding column 15: DIRECT_V2
Encoding column 16: DIRECT_V2
Encoding column 17: DIRECT_V2
Encoding column 18: DIRECT_V2
Stripe: offset: 111743629 data: 7934852 rows: 523200 tail: 247 index: 12820
Stream: column 0 section ROW_INDEX start: 111743629 length 24
Stream: column 1 section ROW_INDEX start: 111743653 length 319
Stream: column 2 section ROW_INDEX start: 111743972 length 560
Stream: column 3 section ROW_INDEX start: 111744532 length 264
Stream: column 4 section ROW_INDEX start: 111744796 length 385
Stream: column 5 section ROW_INDEX start: 111745181 length 1043
Stream: column 6 section ROW_INDEX start: 111746224 length 671
Stream: column 7 section ROW_INDEX start: 111746895 length 1140
Stream: column 8 section ROW_INDEX start: 111748035 length 303
Stream: column 9 section ROW_INDEX start: 111748338 length 1481
Stream: column 10 section ROW_INDEX start: 111749819 length 1243
Stream: column 11 section ROW_INDEX start: 111751062 length 693
Stream: column 12 section ROW_INDEX start: 111751755 length 303
Stream: column 13 section ROW_INDEX start: 111752058 length 962
Stream: column 14 section ROW_INDEX start: 111753020 length 1119
Stream: column 15 section ROW_INDEX start: 111754139 length 1180
Stream: column 16 section ROW_INDEX start: 111755319 length 434
Stream: column 17 section ROW_INDEX start: 111755753 length 303
Stream: column 18 section ROW_INDEX start: 111756056 length 393
Stream: column 1 section DATA start: 111756449 length 47
Stream: column 1 section LENGTH start: 111756496 length 6
Stream: column 1 section DICTIONARY_DATA start: 111756502 length 4
Stream: column 2 section DATA start: 111756506 length 3769
Stream: column 3 section DATA start: 111760275 length 19
Stream: column 4 section DATA start: 111760294 length 47
Stream: column 4 section LENGTH start: 111760341 length 6
Stream: column 4 section DICTIONARY_DATA start: 111760347 length 25
Stream: column 5 section DATA start: 111760372 length 222790
Stream: column 6 section DATA start: 111983162 length 562190
Stream: column 7 section DATA start: 112545352 length 1974867
Stream: column 8 section DATA start: 114520219 length 47
Stream: column 9 section DATA start: 114520266 length 2631479
Stream: column 10 section DATA start: 117151745 length 1513070
Stream: column 11 section DATA start: 118664815 length 564705
Stream: column 12 section DATA start: 119229520 length 47
Stream: column 13 section DATA start: 119229567 length 346099
Stream: column 14 section DATA start: 119575666 length 10449
Stream: column 15 section DATA start: 119586115 length 104258
Stream: column 16 section DATA start: 119690373 length 774
Stream: column 17 section DATA start: 119691147 length 47
Stream: column 18 section DATA start: 119691194 length 107
Encoding column 0: DIRECT
Encoding column 1: DICTIONARY_V2[1]
Encoding column 2: DIRECT_V2
Encoding column 3: DIRECT
Encoding column 4: DICTIONARY_V2[1]
Encoding column 5: DIRECT_V2
Encoding column 6: DIRECT_V2
Encoding column 7: DIRECT_V2
Encoding column 8: DIRECT_V2
Encoding column 9: DIRECT_V2
Encoding column 10: DIRECT_V2
Encoding column 11: DIRECT_V2
Encoding column 12: DIRECT_V2
Encoding column 13: DIRECT_V2
Encoding column 14: DIRECT_V2
Encoding column 15: DIRECT_V2
Encoding column 16: DIRECT_V2
Encoding column 17: DIRECT_V2
Encoding column 18: DIRECT_V2

File length: 119693586 bytes
File raw data size: 2387000000 bytes
Padding length: 0 bytes
Padding ratio: 0%

User Metadata:
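
As a worked example of the listing above, the sketch below maps each of the 8 Spark splits to the stripes whose start offset falls inside it. The offsets, row counts, and split boundaries are copied from the dump and the `FilePartition` output; the function names `stripes_for_range` and `stripes_per_split` are hypothetical and only illustrate the rule, they are not this crate's API.

```rust
use std::ops::Range;

/// (start offset, row count) for each stripe, copied from the ORC dump.
const STRIPES: [(u64, u64); 4] = [
    (3, 2_140_160),
    (37_518_835, 2_160_640),
    (74_481_047, 2_176_000),
    (111_743_629, 523_200),
];

/// Split boundaries copied from the Spark `FilePartition` listing.
const SPLIT_BOUNDS: [u64; 9] = [
    0, 15_485_986, 30_971_972, 46_457_958, 61_943_944,
    77_429_930, 92_915_916, 108_401_902, 119_693_586,
];

/// Indices of the stripes whose start offset falls inside `range`.
fn stripes_for_range(range: &Range<u64>) -> Vec<usize> {
    STRIPES
        .iter()
        .enumerate()
        .filter(|(_, (offset, _))| range.contains(offset))
        .map(|(i, _)| i)
        .collect()
}

/// For each Spark split, the stripe indices it would read.
fn stripes_per_split() -> Vec<Vec<usize>> {
    SPLIT_BOUNDS
        .windows(2)
        .map(|w| stripes_for_range(&(w[0]..w[1])))
        .collect()
}
```

With these numbers, splits 0, 2, 4, and 7 each pick up one stripe (stripes 0, 1, 2, and 3 respectively) and the other four splits read nothing, so every stripe is read exactly once and the row counts still add up to 7000000.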

Collaborator

@Jefffrey Jefffrey left a comment


I'm still a bit hesitant on this API itself, but I suppose since there is a need for this functionality and we don't provide any alternative (e.g. select stripe index) then we can introduce this.

Comment on lines +385 to +421
#[test]
pub fn basic_test_with_range() {
    let path = basic_path("test.orc");
    let reader = new_arrow_reader_range(&path, 0..2000);
    let batch = reader.collect::<Result<Vec<_>, _>>().unwrap();

    assert_eq!(5, batch[0].column(0).len());
}

#[test]
pub fn basic_test_with_range_without_data() {
    let path = basic_path("test.orc");
    let reader = new_arrow_reader_range(&path, 100..2000);
    let batch = reader.collect::<Result<Vec<_>, _>>().unwrap();

    assert_eq!(0, batch.len());
}

#[cfg(feature = "async")]
#[tokio::test]
pub async fn async_basic_test_with_range() {
    let path = basic_path("test.orc");
    let reader = new_arrow_stream_reader_range(&path, 0..2000).await;
    let batch = reader.try_collect::<Vec<_>>().await.unwrap();

    assert_eq!(5, batch[0].column(0).len());
}

#[cfg(feature = "async")]
#[tokio::test]
pub async fn async_basic_test_with_range_without_data() {
    let path = basic_path("test.orc");
    let reader = new_arrow_stream_reader_range(&path, 100..2000).await;
    let batch = reader.try_collect::<Vec<_>>().await.unwrap();

    assert_eq!(0, batch.len());
}
Collaborator


👍

Collaborator

@Jefffrey Jefffrey left a comment


Thanks for this 👍

@Jefffrey Jefffrey merged commit 16e0e04 into datafusion-contrib:main Sep 29, 2024
11 checks passed
@harveyyue harveyyue deleted the range branch September 30, 2024 08:14