Improve speed of row converter by skipping utf8 checks #6058
Comments
take |
Hi @alamb and @XiangpengHao, I have some observations on this issue. UTF-8 validation only happens when row.config.validate_utf8 is true, and validate_utf8 is only set to true when the rows are initialized from a RowParser. Lines 781 to 788 in 49840ec
I find the only usage of Line 759 in 49840ec
|
That's an excellent observation. I also double-checked the RowConverter in DataFusion and did not find any reference to it either. I think UTF-8 is not being validated in DataFusion (as expected), so we are not slowed down by UTF-8 validation. But we probably have to keep that UTF-8 check because other users may use |
Perfect. Thank you for the investigation @xinlifoobar and the confirmation @XiangpengHao |
Is your feature request related to a problem or challenge? Please describe what you are trying to do.
Part of #5374
@XiangpengHao implemented optimized row format --> ByteView (StringView / BinaryView) encoding/decoding in #5945 / #6044
It also adds benchmarks so we can test🎉
However, as mentioned in https://github.com/apache/arrow-rs/pull/6044/files#r1676804033 if we know that the Row value was created from valid utf8 values, re-validating utf8 is unnecessary.
Describe the solution you'd like
Consider an API that would allow skipping utf8 validation
This would need to be justified by performance benchmarks showing it made a significant difference in performance
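To make the trade-off concrete, here is a minimal sketch of the two decoding paths using only the Rust standard library. The function names `decode_checked` and `decode_unchecked` are hypothetical and are not part of arrow-rs; the sketch only illustrates the difference between validating row bytes and trusting them, not how the row converter is actually implemented.

```rust
use std::str::Utf8Error;

/// Hypothetical helper (not an arrow-rs API): decode bytes with full
/// UTF-8 validation, which scans every byte of the input.
pub fn decode_checked(bytes: &[u8]) -> Result<&str, Utf8Error> {
    std::str::from_utf8(bytes)
}

/// Hypothetical helper (not an arrow-rs API): decode bytes without
/// validating them.
///
/// # Safety
/// The caller must guarantee that `bytes` is valid UTF-8, for example
/// because the bytes were produced by encoding a known-valid string.
pub unsafe fn decode_unchecked(bytes: &[u8]) -> &str {
    // SAFETY: the caller upholds the UTF-8 invariant documented above.
    unsafe { std::str::from_utf8_unchecked(bytes) }
}
```

The checked path rejects invalid input with an error, while the unchecked path is only sound when the bytes are already known to be valid, which is why it is marked `unsafe`. Whether skipping the scan matters in practice would still need to be shown by benchmarks.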
Describe alternatives you've considered
Perhaps it would be an unsafe option on the RowConverter
Additional context