Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improve speed of row converter by skipping utf8 checks #6058

Closed
Tracked by #6163 ...
alamb opened this issue Jul 15, 2024 · 4 comments
Closed
Tracked by #6163 ...

Improve speed of row converter by skipping utf8 checks #6058

alamb opened this issue Jul 15, 2024 · 4 comments
Labels
arrow Changes to the arrow crate enhancement Any new improvement worthy of a entry in the changelog

Comments

@alamb
Copy link
Contributor

alamb commented Jul 15, 2024

Is your feature request related to a problem or challenge? Please describe what you are trying to do.
Part of #5374

@XiangpengHao implemented optimized row format --> ByteView (StringView / BinaryView) encoding/decoding in #5945 / #6044

It also adds benchmarks so we can test🎉

However, as mentioned in https://github.com/apache/arrow-rs/pull/6044/files#r1676804033 if we know that the Row value was created from valid utf8 values, re-validating utf8 is unnecessary.

Describe the solution you'd like

Consider an API that would allow skipping utf8 validation

This would need to be justified by performance benchmarks showing it made a significant difference in performance

Describe alternatives you've considered

Perhaps it would be an unsafe option on the RowConverter

let converter = RowConverter::new(...);

// Safety: only decoding Rows that came from valid String arrays
let converter = unsafe {
  converter.with_validate_utf8(false)
}

Additional context

@xinlifoobar
Copy link
Contributor

take

@xinlifoobar
Copy link
Contributor

Hi @alamb and @XiangpengHao, I have some observations for this issue.

The uf8 validation only happens when row.config.validate_utf8 is true.

https://github.com/apache/arrow-rs/blob/49840ec0f110da5e9a21ce97affd32313d0b720f/arrow-row/src/lib.rs#L1302C1-L1303C1

The validate_utf8 is only set to true when initialized from a RowParser

fn new(fields: Arc<[SortField]>) -> Self {
Self {
config: RowConfig {
fields,
validate_utf8: true,
},
}
}

I find the only usage of RowParse is here in RowConverter and test, did this mean the validate_utf8 will never set to true in the current implementation of arrow-rs and we would have the additional validation?

pub fn parser(&self) -> RowParser {

@XiangpengHao
Copy link
Contributor

did this mean the validate_utf8 will never set to true

That's an excellent observation, I also double checked the RowConverter in DataFusion and also did not find any reference to .parser() or RowParser, for example in GroupValuesRows: https://github.com/apache/datafusion/blob/main/datafusion/physical-plan/src/aggregates/group_values/row.rs#L74-L80

I think utf-8 is not being validated (and is expected) in DataFusion, so we are not slowed by utf-8 validation. But we probably have to keep that utf-8 check because other users may use RowParser

@alamb
Copy link
Contributor Author

alamb commented Aug 8, 2024

Perfect. Thank you for the investigation @xinlifoobar and the confirmation @XiangpengHao

@alamb alamb closed this as not planned Won't fix, can't repro, duplicate, stale Aug 8, 2024
@alamb alamb added the arrow Changes to the arrow crate label Aug 8, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
arrow Changes to the arrow crate enhancement Any new improvement worthy of a entry in the changelog
Projects
None yet
Development

No branches or pull requests

3 participants