Potential performance improvements for reading Parquet to StringViewArray/BinaryViewArray #5904
Comments
Can/will this incorporate deduping/interning/implicitly using the gc function that landed recently?
The current gc function doesn't deduplicate strings; it only uses GenericByteViewBuilder to create a new instance of the array.
I filed #5910 to track discussion of this option.
take
Is your feature request related to a problem or challenge? Please describe what you are trying to do.
In https://github.com/apache/arrow-rs/issues/ @ariesdevil, @XiangpengHao, and I implemented pretty fast reading of data from Parquet into an Arrow StringViewArray.
The solution we have so far is #5877, which doesn't copy the string data 🎉 but does track a set of offsets that are then converted into a StringViewArray.
@ariesdevil had a more comprehensive approach in #5557 that built the StringViews directly from the encoded data but hadn't yet removed the string copies.
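For readers less familiar with the view layout, here is a minimal sketch of how a set of (offset, length) pairs into an already-decoded buffer could be turned into 16-byte views without copying long strings. It follows the view layout described in the Arrow columnar format (length, then either inlined bytes or prefix + buffer index + offset). The function names `make_view` and `views_from_offsets` are hypothetical and for illustration only; this is not the code in #5877 or #5557.

```rust
/// Build one 16-byte view for the value at `buffer[offset..offset + len]`.
///
/// Layout assumed here (per the Arrow columnar format):
/// - bytes 0..4: length as little-endian u32
/// - if length <= 12: bytes 4..16 hold the value inline, zero padded
/// - otherwise: bytes 4..8 hold the first 4 bytes of the value (prefix),
///   bytes 8..12 the buffer index, bytes 12..16 the offset into that buffer
///
/// Panics if `offset + len` is out of bounds for `buffer`.
fn make_view(buffer: &[u8], buffer_index: u32, offset: u32, len: u32) -> u128 {
    let start = offset as usize;
    let data = &buffer[start..start + len as usize];

    let mut bytes = [0u8; 16];
    bytes[0..4].copy_from_slice(&len.to_le_bytes());
    if len <= 12 {
        // Short value: stored entirely inside the view.
        bytes[4..4 + data.len()].copy_from_slice(data);
    } else {
        // Long value: only a prefix is copied; the view points into the buffer.
        bytes[4..8].copy_from_slice(&data[..4]);
        bytes[8..12].copy_from_slice(&buffer_index.to_le_bytes());
        bytes[12..16].copy_from_slice(&offset.to_le_bytes());
    }
    u128::from_le_bytes(bytes)
}

/// Hypothetical helper: convert tracked (offset, length) pairs into views
/// that all refer to the same data buffer.
fn views_from_offsets(buffer: &[u8], buffer_index: u32, ranges: &[(u32, u32)]) -> Vec<u128> {
    ranges
        .iter()
        .map(|&(offset, len)| make_view(buffer, buffer_index, offset, len))
        .collect()
}
```

In this sketch only values of 12 bytes or fewer are copied (into the view itself); longer values stay in the original buffer and are referenced by index and offset.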
Describe the solution you'd like
It may be worth looking at the StringViewDecoding to see if there is more performance to be had.
Specifically, we can use the arrow_array_reader/StringViewArray and related benchmarks to profile and identify any additional potential improvements.
Describe alternatives you've considered
It may be good enough now
Additional context