Parallel Arrow file format reading #8503

alamb · 2023-12-11T21:59:28Z

Is your feature request related to a problem or challenge?

DataFusion can now automatically read CSV and Parquet files in parallel (see #6325 for CSV).

It would be great to do the same for "Arrow" files

Describe the solution you'd like

Basically implement what is described in #6325 for Arrow -- and read a single large arrow file in parallel

Describe alternatives you've considered

Some research may be required -- I am not sure if finding record boundaries is feasible

Additional context

I found this while writing tests for #8451

alamb · 2023-12-11T22:09:06Z

See also #8504

my-vegetable-has-exploded · 2023-12-20T08:53:48Z

I'd like to give this a try.

my-vegetable-has-exploded · 2023-12-24T09:40:01Z

I read the related PRs about Parquet and CSV.
The Parquet parallel scan is based on row groups and the CSV one is based on lines. Both can be split by row and then produce RecordBatches.
I don't think Arrow can be handled like that, since an Arrow file is purely column-based.
But I am wondering whether we can split the scan process into several parts and rebuild the whole batch, since there may be more than one array in the file.

Merry Christmas!

alamb · 2023-12-24T12:43:27Z

But I am wondering whether we can split the scan process into several parts and rebuild the whole batch, since there may be more than one array in the file.

This sounds like a good idea to me in theory -- I am not sure how easy/hard it would be to do with the existing arrow IPC reader

In general, the strategy for parallelizing Parquet and CSV is to split up the file by byte ranges, and then have each of the ArrowFileReaders partitions read the row groups (or CSV lines) that have their first byte within their assigned range

Perhaps we could do the same for arrow files which could use the first byte of the RecordBatches 🤔

This code explains it a bit more: https://github.com/apache/arrow-datafusion/blob/6b433a839948c406a41128186e81572ec1fff689/datafusion/core/src/datasource/physical_plan/file_groups.rs#L35-L79
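The range-based strategy described above can be illustrated with a minimal sketch. It is not the DataFusion implementation: `split_ranges` and `assign_blocks` are hypothetical helper names, and the block offsets are assumed to be already known (for an Arrow file they would come from the footer). Each unit of work (row group, CSV line, or record batch block) goes to the one partition whose byte range contains the unit's first byte.

```rust
use std::ops::Range;

/// Split `[0, file_size)` into `n` contiguous byte ranges of near-equal size.
/// Hypothetical helper for illustration only.
fn split_ranges(file_size: u64, n: u64) -> Vec<Range<u64>> {
    assert!(n > 0, "need at least one partition");
    // Ceiling division so the ranges cover the whole file.
    let chunk = (file_size + n - 1) / n;
    (0..n)
        .map(|i| {
            let start = (i * chunk).min(file_size);
            let end = ((i + 1) * chunk).min(file_size);
            start..end
        })
        .collect()
}

/// For each byte range, collect the indices of the blocks whose first byte
/// falls inside that range. Every block lands in exactly one partition as
/// long as the ranges are contiguous and cover the file.
fn assign_blocks(block_offsets: &[u64], ranges: &[Range<u64>]) -> Vec<Vec<usize>> {
    ranges
        .iter()
        .map(|r| {
            block_offsets
                .iter()
                .enumerate()
                .filter(|(_, off)| r.contains(*off))
                .map(|(i, _)| i)
                .collect()
        })
        .collect()
}
```

For example, with a 500-byte file split three ways and blocks starting at bytes 8, 120, 260 and 410, the first partition would read blocks 0 and 1, the second block 2, and the third block 3. A block that straddles a range boundary is read entirely by the partition that owns its first byte, so no block is read twice or skipped.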

my-vegetable-has-exploded · 2023-12-26T15:10:05Z

Perhaps we could do the same for arrow files which could use the first byte of the RecordBatches 🤔

There may be several RecordBatches (blocks in arrow-rs) in an Arrow file (I didn't notice that before). We can handle them like row groups in Parquet.

I will check whether DICTIONARY can be handled correctly, since there may be delta DICTIONARY batches.

Thanks.

my-vegetable-has-exploded · 2023-12-28T10:44:33Z

I will check whether DICTIONARY can be handled correctly, since there may be delta DICTIONARY batches.

It seems that delta dictionary batches are not supported yet.

I also think a public function that exposes the block offsets is needed upstream, something like:

impl<R: Read + Seek> FileReader<R> {
    // Borrow the reader's list of blocks.
    pub fn blocks(&self) -> &[Block] {
        &self.blocks
    }
    // OR
    // Return only the byte offset of each block.
    pub fn offsets(&self) -> Vec<i64> {
        self.blocks.iter().map(Block::offset).collect()
    }
}

tustvold · 2023-12-28T12:55:45Z

apache/arrow-rs#5249 adds a lower-level reader that should enable this and other use-cases

Delta DICTIONARY.

Delta and replacement dictionaries are only supported by IPC streams, not files

my-vegetable-has-exploded · 2023-12-28T15:11:41Z

Delta DICTIONARY.

Delta and replacement dictionaries are only supported by IPC streams, not files

Got it! Thanks.

my-vegetable-has-exploded · 2023-12-30T13:09:02Z

I will complete this after the next release of arrow-rs.

alamb · 2023-12-31T13:31:31Z

The next release is tracked by apache/arrow-rs#5234

alamb added the enhancement New feature or request label Dec 11, 2023

alamb mentioned this issue Dec 11, 2023

[EPIC] Streaming partitioned writes #6569

Open

38 tasks

my-vegetable-has-exploded mentioned this issue Dec 28, 2023

Support get offsets or blocks info from arrow file. apache/arrow-rs#5252

Closed

my-vegetable-has-exploded mentioned this issue Jan 17, 2024

feat: Parallel Arrow file format reading #8897

Merged

alamb closed this as completed in #8897 Jan 29, 2024
