Add support for reading Parquet files thanks to arrow-dataset #576 #577
base: master
Conversation
Force-pushed from 158ed95 to 3c6e600
Hi, and thanks for the PR. I have nothing to add to the code, but I get this exception trying to run the tests on Linux with both JDK 11 and 17. The issue seems to be on the Arrow side. Do you know of any requirements for it to work?
Hi @koperagen, it seems to be a JNI issue. I just checked, and it works well both on my MacBook Pro (M1) and on a PC with Windows 10 (Intel Core i7). What is the processor architecture on your computer? Normally …
Yes, I do run them with Gradle. The processor is an Intel Core i7. I tried to run the test on TeamCity, but there it fails on Linux as well :(
But the library doesn't have a dependency on any protobuf library, so I assume it could be a linkage error on the Arrow side… maybe? Either this, or the project needs a dependency on native protobuf somehow.
Indeed, I also reproduced the issue with Docker; downgrading the Arrow dependency to version 14.0.2 works around it.
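For reference, a sketch of how one might pin that version in a Gradle build; the `arrow-dataset` coordinates are the Apache-published artifact, though whether the project depends on it directly is an assumption here:

```kotlin
// build.gradle.kts sketch: pin arrow-dataset to the last version that did not
// trigger the JNI crash discussed above (assumed setup, not the PR's build).
dependencies {
    implementation("org.apache.arrow:arrow-dataset:14.0.2")
}
```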
Force-pushed from 3c6e600 to 0dd7498
Can confirm, 14.0.2 works. I tried it and have some requests:
Looks like only URLs that point to files are valid? Can we make this parameter a …
At the same time, it reads the sample file from the tests just fine.
Actually, URI parsing is done natively by Arrow, and it supports only a few file systems; unfortunately, http(s) is not supported yet:
Cf. the Arrow source code: https://github.com/apache/arrow/blob/2a87693134135a8af2ae2b6df41980176431b1c0/cpp/src/arrow/filesystem/filesystem.cc#L679
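For context, here is a minimal sketch of reading Parquet through Arrow Java's public Dataset API, which is the layer this PR builds on; the file path is hypothetical, and note that the URI string is handed to native code, which is where unsupported schemes like http(s) get rejected:

```kotlin
// Sketch using arrow-dataset's public Java API. The URI is parsed natively,
// so only file systems Arrow implements (file, s3, hdfs, ...) are accepted.
import org.apache.arrow.dataset.file.FileFormat
import org.apache.arrow.dataset.file.FileSystemDatasetFactory
import org.apache.arrow.dataset.jni.NativeMemoryPool
import org.apache.arrow.dataset.scanner.ScanOptions
import org.apache.arrow.memory.RootAllocator

fun main() {
    val uri = "file:///tmp/test.parquet" // hypothetical local file
    RootAllocator().use { allocator ->
        FileSystemDatasetFactory(
            allocator, NativeMemoryPool.getDefault(), FileFormat.PARQUET, uri
        ).use { factory ->
            factory.finish().use { dataset ->
                dataset.newScan(ScanOptions(32768L)).use { scanner ->
                    scanner.scanBatches().use { reader ->
                        while (reader.loadNextBatch()) {
                            println("batch rows: ${reader.vectorSchemaRoot.rowCount}")
                        }
                    }
                }
            }
        }
    }
}
```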
I actually tried to read a local copy of that file, and it failed with … Thanks for the clarification about the URI. Let's change that parameter type to …
I ran into the same issue, another problem with JNI (and threads)…
Hi, thanks for the PR. Sorry, I could not understand: will it cover any Parquet files, or only Parquet files keeping something in the Arrow format? I will collect a few Parquet files and get back to you.
I confirm that it should cover every Parquet file. We are facing a JNI error with some Parquet files (not all). I created an issue on the Arrow repository: apache/arrow#20379
@fb64 we made a decision to not merge it immediately, before three things happen: …
Thanks again for your help and collaboration!
No problem!
Related Arrow issue for …
Force-pushed from 71b06f5 to 8b8f706
For information, I just updated this PR with Arrow 16.0.0, which includes fixes for the two issues discovered previously: …
Force-pushed from 8b8f706 to 5ce70b9
Force-pushed from 5ce70b9 to 79fd37d
Hi, I'm interested in df support. I can confirm that cloning from https://github.com/fb64/dataframe/tree/arrow-read-parquet and building with … works. Please note that Arrow required me to set the following args in my build.gradle: …
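The exact arguments were lost from the comment above. Arrow Java does document that its memory module needs `java.nio` opened reflectively on recent JDKs, so a minimal sketch, assuming that is the flag meant here, could be:

```kotlin
// build.gradle.kts sketch (assumption: the elided args are Arrow Java's
// documented requirement). Without this, Arrow's allocator fails on JDK 16+.
tasks.withType<Test> {
    jvmArgs("--add-opens=java.base/java.nio=ALL-UNNAMED")
}
```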
However, when reading a .parquet with 38 million lines, it OOMs with … It works with smaller files though; around 5M rows per file work.

Parquet size: …

I've set Xmx to 4096M on my machine. Should I be able to load 38M lines, or is it too much? Thanks
Hello Laurent, I'm pleased to see your interest in this PR.
Hi. My parquet is 126MB on disk, uses ZSTD compression, its column types are byte array, double, and float (1), and it has 38M lines.

In fact… I'm already using DuckDB ;) Some of my parquets are generated either by Python Polars or by Duck. Our company backends use both Kotlin and Python, so I was looking at your PR with great interest.

I've used KDF before; my use case is to render base64 PNG plot renderings (using Kandy) of the dataframes. I know it works, however only on "small" datasets: I was not aware KDF would load the entire thing in memory. This is not a problem per se: I'm already using Duck, and your article is a great piece of information, I'll use it :) 👍

I was wondering how one would read such a massive dataset directly through …

I hope your branch gets merged soon in 1.4.x ;)

(1) File metadata (per-column details elided): columns name, id, value2, value3, timestamp, value4, processed_timestamp.
Polars (and also Pandas) stores DataFrames in memory. However, Polars uses the Arrow format for its in-RAM representation, which can be more efficient for handling large datasets compared to KDF.

```python
import polars as pl

df_pl = pl.read_parquet('<your_parquet_file>')
print(f'Size: {df_pl.estimated_size()} bytes')
```

It could be useful to add such a method in Kotlin DataFrame.
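A Kotlin counterpart could be sketched as below. Everything here (the `ColumnInfo` type and the per-slot byte costs) is a hypothetical illustration, not Kotlin DataFrame API; the costs assume boxed values on a 64-bit JVM with compressed oops.

```kotlin
// Hypothetical size estimator: boxed on-heap cost per column slot is the
// boxed object (header + payload + padding) plus one compressed reference.
// These are rough, illustrative figures, not measured values.
data class ColumnInfo(val name: String, val boxedBytesPerValue: Long)

fun estimatedSizeBytes(columns: List<ColumnInfo>, rows: Long): Long =
    columns.sumOf { it.boxedBytesPerValue * rows }

fun main() {
    // Shape loosely inspired by the 38M-row file discussed in this thread.
    val cols = listOf(
        ColumnInfo("value2", 24L + 4L), // java.lang.Double: 24 bytes + 4-byte ref
        ColumnInfo("value3", 16L + 4L), // java.lang.Float: 16 bytes + 4-byte ref
        ColumnInfo("id", 24L + 4L),     // java.lang.Long
    )
    println("~${estimatedSizeBytes(cols, 38_000_000)} bytes")
}
```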
Force-pushed from 79fd37d to e66906a
Polars estimated_size returned: 3650708644 bytes == an estimated 3.65GB.

I made a mistake configuring my Gradle build; it was not honoring my heap settings under tests, hence the OOM. I made it work, and it fits in memory using 16GB of heap, which roughly translates to …

To give an order of magnitude for future readers stumbling upon this thread: I'd say loading the whole set on the JVM will use approx. 4-5x the estimated size returned by Polars.

TL;DR: it works.
I understand KDF's minimal Java version is 8, and ZGC did not land under Java <= 15, so ZGC may not be a viable option for massive datasets requiring a big heap (especially if the target is an Android device). I'm using Kotlin on the server side, so ZGC is OK for me. Thanks!
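For readers who want to try it: ZGC became a production GC in JDK 15, and enabling it for tests in Gradle is a one-liner. A minimal sketch, with the heap size chosen to match the figure above:

```kotlin
// build.gradle.kts sketch: run tests with ZGC and a 16GB heap (JDK 15+).
tasks.withType<Test> {
    jvmArgs("-XX:+UseZGC", "-Xmx16g")
}
```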
After testing Parquet and CSV files, I also found that Kotlin DataFrame consumes 4-5 times more memory than Polars and Pandas. This may be due to type boxing or other things related to the JVM, but it is not directly linked to Parquet parsing.
I just dug in to understand why the same DataFrame takes so much more space on the JVM, and it is indeed due to the JVM's memory representation (mark word, class pointer, memory alignment).
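As an aside (not from the thread), the OpenJDK JOL tool makes that representation visible directly, assuming `org.openjdk.jol:jol-core` is on the classpath:

```kotlin
// Prints java.lang.Double's on-heap layout on a typical 64-bit JVM with
// compressed class pointers: 8-byte mark word + 4-byte class pointer +
// 4-byte alignment gap + 8-byte value = 24 bytes per boxed Double,
// versus 8 bytes per element of a primitive DoubleArray.
import org.openjdk.jol.info.ClassLayout

fun main() {
    println(ClassLayout.parseClass(java.lang.Double::class.java).toPrintable())
}
```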
@fb64 We're aware of the JVM space usage of DataFrame. Some weeks ago I actually investigated whether it's possible to swap out the underlying ArrayLists containing the columns' data with primitive arrays (#712). While, yes, it's possible, I found it to be a large trade-off in terms of performance. Yes, we can store values as primitives, but then for each operation, since we use generic functions, they are boxed and unboxed one at a time. This is heavy and is very noticeable for large operations. While we could, in theory, create primitive overloads for each operation we have, it may be worth waiting for Project Valhalla to help us here: https://openjdk.org/jeps/402
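A minimal sketch of the trade-off described above, with hypothetical helpers rather than DataFrame's actual internals:

```kotlin
// Primitive-backed storage is compact, and a specialized loop stays unboxed:
fun sumSpecialized(values: DoubleArray): Double {
    var acc = 0.0
    for (v in values) acc += v          // plain double arithmetic, no boxing
    return acc
}

// ...but any generic code path erases T to Any?, so every element is boxed
// one at a time, which is the per-operation cost mentioned above.
fun <T> sumGeneric(values: Iterable<T>, zero: T, add: (T, T) -> T): T {
    var acc = zero
    for (v in values) acc = add(acc, v) // each Double boxed to pass through T
    return acc
}

fun main() {
    val column = DoubleArray(1_000_000) { it.toDouble() }
    println(sumSpecialized(column))
    // Viewing the primitive array as Iterable<Double> forces boxing:
    println(sumGeneric(column.asIterable(), 0.0) { a, b -> a + b })
}
```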
@Jolanrensen Yes, Valhalla sounds promising for improving memory usage, and combined with the new Vector API it could improve performance significantly.
Force-pushed from e66906a to b123c43
Fixes #576