Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[FEA] Boolean data type #667

Closed
4 tasks done
BradReesWork opened this issue Jan 10, 2019 · 7 comments · Fixed by #1142
Closed
4 tasks done

[FEA] Boolean data type #667

BradReesWork opened this issue Jan 10, 2019 · 7 comments · Fixed by #1142
Labels
2 - In Progress Currently a work in progress feature request New feature or request

Comments

@BradReesWork
Copy link
Member

BradReesWork commented Jan 10, 2019

Add support for a GDF_BOOL data type. This would be int8 to match Python. The Arrow spec calls out a single bit, but that introduces issues (could reuse valid bitmask code).

  • update types.h
  • update wrapper_types.hpp
  • update type_dispatcher.hpp
  • update tests/type_test.cpp

This does not touch the Python side of the problem

@BradReesWork BradReesWork added feature request New feature or request Needs Triage Need team to review and classify labels Jan 10, 2019
@jrhemstad
Copy link
Contributor

This will require updating the type_dispatcher as well as creating an appropriate wrapper type in https://github.com/rapidsai/cudf/blob/branch-0.5/cpp/src/utilities/wrapper_types.hpp

@BradReesWork
Copy link
Member Author

from side conversation with Jake, this will just be a wrapper

using cudf_bool = detail::wrapper::<int8_t, GDF_BOOL> in wrapper_types.hpp
it defines a new concrete type associated with GDF_BOOL that simply wraps an int8_t value

@randerzander randerzander added the Python Affects Python cuDF API. label Jan 16, 2019
@randerzander randerzander added depends on libcudf and removed Needs Triage Need team to review and classify Python Affects Python cuDF API. labels Jan 16, 2019
@randerzander
Copy link
Contributor

@mjsamoht - when this and #705 are resolved, I think CSV reader support for booleans will need an update

@BradReesWork BradReesWork added the 2 - In Progress Currently a work in progress label Jan 30, 2019
@eyalroz
Copy link
Collaborator

eyalroz commented Jan 30, 2019

In many contexts, one would prefer a bit-resolution Boolean. I'm not suggesting this not go forward as written, but assuming the question of bit-resolution Booleans has not been thoroughly discussed, I suggest such a discussion be held sometime on Slack (this idea came up in a discussion of this question with @williamBlazing).

@jrhemstad
Copy link
Contributor

jrhemstad commented Jan 30, 2019

@williamBlazing @eyalroz do you have any data (theoretical or empirical) showing that using a bit-resolution bool will be more performant than byte-resolution?

Just thinking about it, it seems like it'd be a wash between the two for something like join/groupby. Sure, a bit-wise representation is denser, but requires more work (2 bit shifts and AND) to access each element.

That said, I don't love the idea of deviating from Arrow if we don't have a solid reason to do so, because then we'll have to provide functionality to expand a bit-resolution column of bools into a byte-resolution column of bools and then back again.

@kkraus14
Copy link
Collaborator

One of the concerns is using Numba and especially using Numba naively for things like UDFs in Python becomes much more difficult and complicated. For example we won't be able to write a simple loop over boolean elements in a Numba kernel. We'd need to check the bit within an int32 element and that would somehow need to be handled in the UDF codepath or have the bit expanded to a byte before running a UDF.

@eyalroz
Copy link
Collaborator

eyalroz commented Jan 31, 2019

@jrhemstad : I was about to start explaining why it is so much better performance-wise to use bit-resolution booleans, and addressing your "it'd be a wash" suggestion, but then I realized that this is not the right venue for this discussion. I again suggest we schedule a discussion on slack. Will make a comment there rather than here.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
2 - In Progress Currently a work in progress feature request New feature or request
Projects
None yet
Development

Successfully merging a pull request may close this issue.

5 participants