Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Prototype ArrayView Types #4253

Closed
tustvold opened this issue May 22, 2023 · 5 comments
Closed

Prototype ArrayView Types #4253

tustvold opened this issue May 22, 2023 · 5 comments
Assignees
Labels
arrow Changes to the arrow crate development-process Related to development process of arrow-rs enhancement Any new improvement worthy of a entry in the changelog

Comments

@tustvold
Copy link
Contributor

Is your feature request related to a problem or challenge? Please describe what you are trying to do.

There is ongoing discussion of introducing an ArrayView type to the format - https://lists.apache.org/thread/r28rw5n39jwtvn08oljl09d4q2c1ysvb

We should explore the design space around this, in particular to gather some empirical data as to the impact of introducing such a type.

Describe the solution you'd like

I would like to prototype an implementation of StringView and explore integrating it into the parquet reader, where it ostensibly could yield to some non-trivial performance improvements

Describe alternatives you've considered

Additional context

@tustvold tustvold added the enhancement Any new improvement worthy of a entry in the changelog label May 22, 2023
@tustvold tustvold self-assigned this May 22, 2023
@tustvold
Copy link
Contributor Author

tustvold commented Jun 7, 2023

The benchmarks in #4378 show half the execution time being spent rewriting data to remove empty array slices. This likely could be optimised, and it is unclear how realistic the benchmark is, but I thought it was an interesting data point.

image

Theoretically ArrayView types would remove the need for this, whilst also removing the memcpy when decoding byte arrays. I'd anticipate roughly a 2x return, with bigger returns for more heavily nested data

@alamb alamb added the arrow Changes to the arrow crate label Jun 7, 2023
@tustvold
Copy link
Contributor Author

My current feelings on this matter are summarized in https://lists.apache.org/thread/1j0hdbfd0q2636zs9z0x19fkcn87gjhf

TLDR I think improving the support for sparse dictionaries may be sufficient to support this use case

@tustvold tustvold closed this as not planned Won't fix, can't repro, duplicate, stale Jul 3, 2023
@tustvold tustvold added the development-process Related to development process of arrow-rs label Jul 14, 2023
@tustvold tustvold reopened this Jul 29, 2023
tustvold added a commit to tustvold/arrow-rs that referenced this issue Jul 29, 2023
tustvold added a commit to tustvold/arrow-rs that referenced this issue Aug 30, 2023
tustvold added a commit to tustvold/arrow-rs that referenced this issue Aug 30, 2023
@alamb
Copy link
Contributor

alamb commented Feb 8, 2024

Filed #5374 to track implementing what was added to the spec

@alamb
Copy link
Contributor

alamb commented Feb 12, 2024

Here was a draft PR: #4585

@alamb
Copy link
Contributor

alamb commented Feb 12, 2024

From my perspective, the prototype was completed in #4585 and follow on work is tracked in #5374 so closing this ticket down

@alamb alamb closed this as completed Feb 12, 2024
tustvold added a commit to tustvold/arrow-rs that referenced this issue Mar 11, 2024
tustvold added a commit to tustvold/arrow-rs that referenced this issue Mar 11, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
arrow Changes to the arrow crate development-process Related to development process of arrow-rs enhancement Any new improvement worthy of a entry in the changelog
Projects
None yet
Development

No branches or pull requests

2 participants