Convert Arrow <--> Parquet, and hence Awkward <--> Parquet. #343

Merged
jpivarski merged 4 commits into master from jpivarski/parquet-reader-writer on Jul 17, 2020

jpivarski
Member

No description provided.

@jpivarski
Member Author

@martindurant As expected, Parquet support was rather easy: I just had to copy the old code. Awkward PartitionedArrays translate into row groups, top-level RecordArrays translate into record batches, and any other kind of array is written as a record batch with a single field whose name is the empty string. (The same convention applies when reading back: if there's only one field and its name is the empty string, we read it back as a non-record array.)
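The empty-string field convention described above can be sketched without any Parquet library. This is only an illustration under stated assumptions, not the code from this PR: a plain `dict` of columns stands in for a record batch, and the helper names `wrap_as_columns` and `unwrap_columns` are hypothetical.

```python
# Library-free sketch of the naming convention (assumption: a dict of
# columns stands in for an Arrow record batch; helper names are made up).

def wrap_as_columns(array):
    """Present any array as named columns, as a record batch would."""
    if isinstance(array, dict):
        # Record-like data: each field becomes its own column.
        return dict(array)
    # Anything else: a single column whose name is the empty string.
    return {"": array}


def unwrap_columns(columns):
    """Undo wrap_as_columns when reading back."""
    if list(columns) == [""]:
        # Exactly one field named "": the data was not a record.
        return columns[""]
    return dict(columns)


jagged = [[1, 2, 3], [], [4, 5]]
records = {"x": [1, 2, 3], "y": [1.1, 2.2, 3.3]}

jagged_columns = wrap_as_columns(jagged)
jagged_back = unwrap_columns(jagged_columns)
records_back = unwrap_columns(wrap_as_columns(records))
```

One consequence of the convention is that a record with a single field literally named `""` cannot be told apart from a non-record array on reading.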

I also tested it on some ancient samples I made for OAMap, but even after all these years, most of the data structures are not supported by pyarrow:

https://github.com/scikit-hep/awkward-1.0/blob/97b165e53666d13a88da247ea18d577bfdf85761/tests/test_0341-parquet-reader-writer.py#L105-L246

To mitigate this, I added an explode_records option to ak.to_parquet. The transformation isn't lossy, but the result has to be passed through ak.zip after reading to rebuild the original records. Not ideal.
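A rough sketch of why the transformation is reversible, assuming explode_records rewrites lists of records as records of lists. It uses plain Python lists and dicts rather than Awkward Arrays, and the helpers `explode` and `zip_back` are hypothetical stand-ins for what ak.to_parquet and ak.zip would do.

```python
# Pure-Python illustration (assumption: "exploding" turns lists of records
# into one jagged column per field; zipping them back restores the records).

def explode(lists_of_records, fields):
    """Lists of records -> one jagged list per field."""
    return {
        field: [[record[field] for record in inner] for inner in lists_of_records]
        for field in fields
    }


def zip_back(columns):
    """One jagged list per field -> lists of records (like ak.zip)."""
    fields = list(columns)
    length = len(columns[fields[0]])
    return [
        [
            dict(zip(fields, values))
            for values in zip(*(columns[field][i] for field in fields))
        ]
        for i in range(length)
    ]


original = [
    [{"x": 1, "y": 1.1}, {"x": 2, "y": 2.2}],
    [],
    [{"x": 3, "y": 3.3}],
]

exploded = explode(original, ["x", "y"])
restored = zip_back(exploded)
```

Each field keeps the same list lengths after exploding, which is what lets the zip step line the values back up into records.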

But hey, we have both eager and lazy reading now!

jpivarski merged commit f5d3282 into master Jul 17, 2020
jpivarski deleted the jpivarski/parquet-reader-writer branch July 17, 2020 01:54