[FEA] Boolean data type #667
Comments
This will require updating the |
from side conversation with Jake, this will just be a wrapper
In many contexts, one would prefer a bit-resolution Boolean. I'm not suggesting this not go forward as written, but assuming the question of bit-resolution Booleans has not been thoroughly discussed, I suggest such a discussion be held sometime on Slack (this idea came up in a discussion of this question with @williamBlazing). |
@williamBlazing @eyalroz do you have any data (theoretical or empirical) showing that using a bit-resolution bool will be more performant than byte-resolution? Just thinking about it, it seems like it'd be a wash between the two for something like join/groupby. Sure, a bit-wise representation is denser, but requires more work (2 bit shifts and AND) to access each element. That said, I don't love the idea of deviating from Arrow if we don't have a solid reason to do so, because then we'll have to provide functionality to expand a bit-resolution column of bools into a byte-resolution column of bools and then back again. |
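To make the access-cost comparison above concrete, here is a minimal sketch in plain Python (not libgdf code). It assumes bools are packed into 32-bit words with least-significant-bit-first ordering, as Arrow does for its bitmaps; the names `pack_bits`, `get_bit`, and `get_byte` are hypothetical and used only for illustration.

```python
BITS_PER_WORD = 32  # assumption: bits are packed into 32-bit words


def pack_bits(values):
    """Pack a sequence of bools into a list of 32-bit words (LSB first)."""
    n_words = (len(values) + BITS_PER_WORD - 1) // BITS_PER_WORD
    words = [0] * n_words
    for i, v in enumerate(values):
        if v:
            words[i // BITS_PER_WORD] |= 1 << (i % BITS_PER_WORD)
    return words


def get_bit(words, i):
    """Bit-resolution access: one shift to find the word,
    one shift to move the bit into position, and ANDs for masking."""
    return (words[i >> 5] >> (i & 31)) & 1


def get_byte(column, i):
    """Byte-resolution access: a plain indexed load."""
    return column[i]
```

The bit-packed column uses one eighth of the memory of the byte column, at the price of the extra shift-and-mask work per element shown in `get_bit`.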
One concern is that using Numba, and especially using Numba naively for things like UDFs in Python, becomes much more difficult and complicated with bit-resolution booleans. For example, we won't be able to write a simple loop over boolean elements in a Numba kernel. We'd need to check the bit within an int32 element, and that would need to be handled somewhere in the UDF codepath, or the bits would have to be expanded to bytes before running a UDF.
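As a rough illustration of the "expand before running a UDF" option, the sketch below uses plain Python rather than Numba. It assumes 32-bit words with least-significant-bit-first ordering; `expand_bits_to_bytes` and `count_true` are hypothetical names, not part of any existing API.

```python
def expand_bits_to_bytes(words, n):
    """Expand the first n bits of a bit-packed column (32-bit words,
    LSB first) into one byte per element."""
    out = bytearray(n)
    for i in range(n):
        out[i] = (words[i // 32] >> (i % 32)) & 1
    return out


def count_true(bools):
    """A naive UDF-style loop: it only works if each element is
    directly addressable, i.e. on the byte-resolution form."""
    total = 0
    for b in bools:
        if b:
            total += 1
    return total
```

With this approach the UDF itself stays a simple element-wise loop, but every call pays for an expansion pass and a temporary buffer of n bytes.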
@jrhemstad: I was about to start explaining why it is so much better performance-wise to use bit-resolution booleans, and to address your "it'd be a wash" suggestion, but then I realized that this is not the right venue for this discussion. I again suggest we schedule a discussion on Slack. I will make a comment there rather than here.
Add support for a GDF_BOOL data type. This would be int8 to match Python. The Arrow spec calls out a single bit, but that introduces issues (could reuse valid bitmask code).
This does not touch the Python side of the problem