-
Notifications
You must be signed in to change notification settings - Fork 912
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[REVIEW] Logical operators (AND, OR, NOT) for libcudf and cuDF #1819
Conversation
* branch-0.8: (231 commits) CHANGELOG. Doc. Endline.' Added table copy functions. Fix merge updated CHANGELOG and removed old tests removed lots of code CHANGELOG. Added printing of submodule status to print_env.sh. Fill mask with zeros when making a null column ENH: Add test for cudf::bool8 in booleans gtest changelog maintain the original series name in series.unique output Use jinja to set conda dep versions Remove duplicate entry in CHANGELOG.md from merge changelog changelog changelog changelog Update conda dependencies ... # Conflicts: # CHANGELOG.md
Hi @devavret : you have access to set labels and projects, so please do so in the future, so reviewers don't miss your PRs that are ready for review. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The C++ side is all clear and clean, good work. The questionable parts are on the Python side. Python reviewers should comment.
Neat, I’ll remember that. I think we should also move away from adding [REVIEW/WIP] to the title, at least for employees that have write access |
Maybe I'm missing something - but does it help to think of The difference is that the result is "downcasted" to a bool so it can be used for indexing (the primary use-case for |
They are indeed being downcasted. That's why I mention this:
Question is, do we want to do that? |
@devavret and I had a quick discussion and came to the following conclusion; please feel free to chime in:
We should probably keep the expectation that In [23]: df
Out[23]:
0 1
0 True 2
1 False 2
In [24]: df[0] & df[1]
Out[24]:
0 False
1 False
dtype: bool
In [25]: np.logical_and(df[0], df[1])
Out[25]:
0 True
1 False
Name: 0, dtype: bool cuDF can provide similar |
On the C++ side, we definitely will do that, because |
…iser list confusion.
* branch-0.8: (34 commits) allow all dataframe stats ops to have kwargs and leverage series methods to error responsibly changelog update dataframe support method tests to go beyond the base case change _apply_support_method to explicitly take the method as the first argument and then pass kwargs Refactor difficult conditional into obvious for loop Updated the changelog Added PR rapidsai#1831 Assuming python is in PATH instead of using PYTHON env var, since gpuCI uses PYTHON for the version number. fix Python style Update CHANGELOG.md JSON Reader: add suport for bool columns Remove tautological clause. Fix style and changelog Fix the other two issues. Repair problem with multiple aggregations on one variable and a single aggregation on another variable. remove unused import in test_csvreader.py Update CHANGELOG.md CSV Reader: default the column type to string when no data is available for inference. Added test of different shapes Raising a NotImplementedError when matching columns have different length ... # Conflicts: # CHANGELOG.md
…rror. change bintop pytests to use pandas testing
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM!
Tasks
AND
andOR
&&
,||
&
,|
behaviour intact forint
and change it forbool
NOT
(no one asked but it's here anyway)!
(not)~
Related Issues/Feature requests
Closes #1564
Closes #1517
Problems faced/Incomplete work
Bool to
cudf::bool8
Just as
cudf::bool8
is implicitly convertible tobool
, the result of a logical operation should be assignable to acudf::bool8
. This is possible whencudf::bool8
has an implicit constructor.Trying to figure this out was taking a lot of time. Partly because recompiling after changing
wrapper_types.hpp
takes long.Difference from Pandas
Choice between bitwise and logical
Here's how pandas behaves:
bool (op) integer
results in a bool but internally, the operation is bitwiseBut C++'s behaviour is different and
int(2) && bool(true)
givestrue
.There's two ways I can implement this - consistent with pandas or consistent with C++.
Problem with pandas' way is that I'd need two operations:
0000 0001 | 0000 0010 = 0000 0011
)0000 0011 -> 0000 0001
)otherwise subsequent bitwise ops will not match pandas' behaviour
0000 0011 (True) & 0000 0010 (2) = True
but0000 0001 (True) & 0000 0010 (2) = False
because even though in libcudf, we consider non-0 char to be
bool(true)
, in cuDF (and also in pandas), we assume that the memory representation is0/1
. That's what subsequent bitwise operations would presume.I copied C++'s behaviour because I think pandas' is wrong and they probably didn't care about this case.
When columns don't match between dataframes
Logical operations are defined between two pandas dataframes which don't have the same columns but the result is logically inconsistent. The common column is correctly operated on but the unmatched columns have data that depends on which dataframe was lhs and which was rhs.
This again seems like a bug in pandas and I've left the behaviour to the cuDF default of filling the column with
int64(nan)