Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

cast kernel support for StringViewArray and BinaryViewArray #5508

Closed
2 tasks
Tracked by #5374
alamb opened this issue Mar 14, 2024 · 5 comments
Closed
2 tasks
Tracked by #5374

cast kernel support for StringViewArray and BinaryViewArray #5508

alamb opened this issue Mar 14, 2024 · 5 comments
Labels
enhancement Any new improvement worthy of a entry in the changelog

Comments

@alamb
Copy link
Contributor

alamb commented Mar 14, 2024

Is your feature request related to a problem or challenge? Please describe what you are trying to do.
This is part of the larger project to implement StringViewArray -- see #5374

In #5481 we added support for StringViewArray and ByteViewArray.

This ticket tracks supporting StringViewArray and ByteViewArray in the cast kernel: https://docs.rs/arrow/latest/arrow/compute/kernels/cast/index.html

Describe the solution you'd like

Specifically the following conversions should be supported in the cast kernels:

  • StringViewArray <--> StringArray
  • StringViewArray <--> LargeStringArray

And similarly for Binary:

  • BinaryViewArray <--> BinaryArray
  • BinaryViewArray <--> LargeBinaryArray

Notes:

  1. Good test coverage is the most important part of this ticket
  2. I recommend 2 PRs (one for StringView and one for BinaryView) to make review easier -- once we have one then the other will largely follow the example of the first
  3. We can track other features in follow on tickets (e.g. DictionaryArray <--> StringViewArray)

Subtasks

Describe alternatives you've considered

Additional context

@alamb alamb added the enhancement Any new improvement worthy of a entry in the changelog label Mar 14, 2024
@alamb alamb changed the title Cast support for StringViewArray and BinaryViewArray cast kernel support for StringViewArray and BinaryViewArray Mar 14, 2024
@alamb
Copy link
Contributor Author

alamb commented Apr 26, 2024

@RinChanNOWWW has a PR to add initial support here: #5686

I updated this ticket with other potential subtasks

@RinChanNOWWW
Copy link
Contributor

I want to discuss about casting from ViewArray to ByteArray.

As we know, we can use ViewArray for random access of byte buffers. So, when converting ViewArray to ByteArray, memory copy is unavoidable. I can't come up with a zero-copy way.

If I implement this operation, I will allocate brand new buffers for the target ByteArray. I want to discuss if there is a better way?

@alamb
Copy link
Contributor Author

alamb commented May 1, 2024

As we know, we can use ViewArray for random access of byte buffers. So, when converting ViewArray to ByteArray, memory copy is unavoidable. I can't come up with a zero-copy way.

If I implement this operation, I will allocate brand new buffers for the target ByteArray. I want to discuss if there is a better way?

I agree there is unlikely to be a zero copy way. The only potential exception I can imagine is if the underlying buffer was already pre-packed (though this would require all strings to be longer than 12 bytes and in order and contiguous). I suspect the time necessary to detect if this was the case would be substantial

Thus I suggest we start with the simple case (copy to new buffer) and we can optimize later of it turns out that is an important usecase.

I think some part of #5513 might be relevant here (namely the operation to compact the strings) 🤔

@alamb
Copy link
Contributor Author

alamb commented Jun 10, 2024

I think what is remaining for this ticket is support for

  • StringViewArray --> StringArray / LargeStringArray
  • BinaryViewArray --> BinaryArray / LargeBinaryArray

As @RinChanNOWWW says in #5508 (comment), these casts are going to require copying the string data into the compacted form required of StringArray

@alamb
Copy link
Contributor Author

alamb commented Jun 10, 2024

Sorry @XiangpengHao corrected me -- it seems that @RinChanNOWWW actually implemented both directions

So I think we can claim this ticket is done.

#5861 tracks adding support for dictionary

@alamb alamb closed this as completed Jun 10, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement Any new improvement worthy of a entry in the changelog
Projects
None yet
Development

No branches or pull requests

2 participants