[REVIEW] Logical operators (AND, OR, NOT) for libcudf and cuDF #1819

Merged
devavret merged 16 commits into rapidsai:branch-0.8 on May 27, 2019

Conversation

devavret

@devavret devavret commented May 22, 2019

Tasks

  • Logical AND and OR
    • libcudf implementation &&, ||
      • Gtests
    • cuDF.Series: keep &, | behaviour intact for int and change it for bool
      • Pytests
  • Logical NOT (no one asked but it's here anyway)
    • libcudf implementation of unary ! (not)
      • Gtests
    • cuDF boolean ~
      • Pytests

Related Issues/Feature requests

Closes #1564
Closes #1517

Problems faced/Incomplete work

Bool to cudf::bool8

Just as cudf::bool8 is implicitly convertible to bool, the result of a logical operation should be assignable to a cudf::bool8. This is possible only when cudf::bool8 has an implicit constructor from bool, alongside the existing explicit one:

constexpr explicit wrapper(T v) : value{v} {}
constexpr wrapper(bool v) : value{v} {}

Figuring this out took a lot of time, partly because recompiling after changing wrapper_types.hpp takes a long time.

Difference from Pandas

Choice between bitwise and logical

Here's how pandas behaves:
bool (op) integer results in a bool, but internally the operation is bitwise.

>>> import pandas as pd
>>> bt = pd.Series([True])
>>> i1 = pd.Series([1])
>>> i2 = pd.Series([2])
>>> bt & i1   # 0000 0001 & 0000 0001
0    True
dtype: bool
>>> bt & i2   # 0000 0001 & 0000 0010
0    False
dtype: bool

C++ behaves differently: int(2) && bool(true) gives true.
There are two ways I can implement this: consistent with pandas or consistent with C++.
The problem with pandas' way is that I'd need two operations:

  • first a bitwise op (0000 0001 | 0000 0010 = 0000 0011)
  • and second a cast op to convert the result to bool (0000 0011 -> 0000 0001)

Otherwise, subsequent bitwise ops will not match pandas' behaviour:
0000 0011 (True) & 0000 0010 (2) = True but
0000 0001 (True) & 0000 0010 (2) = False

This is because, even though libcudf treats any non-zero char as bool(true), cuDF (and pandas) assume that the memory representation is 0/1, and subsequent bitwise operations rely on that.
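
The two-step concern can be sketched with plain Python integers standing in for the one-byte column values. This only illustrates the arithmetic; it is not libcudf or cuDF code, and `normalize` is a hypothetical helper name.

```python
def normalize(byte_value: int) -> int:
    """Cast step: collapse any non-zero byte to the canonical 0/1 form."""
    return int(byte_value != 0)


# Step 1: bitwise OR of True (0000 0001) with 2 (0000 0010).
raw = 0b0000_0001 | 0b0000_0010      # 0000 0011: truthy, but not canonical
# Step 2: cast to bool so the stored byte is exactly 1.
canonical = normalize(raw)           # 0000 0001

# A later bitwise AND with 2 differs between the two representations.
without_cast = bool(raw & 0b0000_0010)      # bit 1 is still set
with_cast = bool(canonical & 0b0000_0010)   # bit 1 was cleared by the cast

print(without_cast, with_cast)
```

Skipping the cast saves one operation but leaves a byte that later bitwise operations would interpret differently from pandas.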

I followed C++'s behaviour because I think pandas' behaviour is wrong here, and this case was probably never considered.

When columns don't match between dataframes

Logical operations are defined between two pandas dataframes that don't have the same columns, but the result is logically inconsistent. The common column is operated on correctly, but the unmatched columns contain data that depends on which dataframe was the lhs and which was the rhs.

>>> pdf = pd.DataFrame({'a':[True, True, False, False],
                        'b':[True, False, True, False]})
>>> pdf3 = pd.DataFrame({'c':[True, True, False, False],
                         'b':[True, False, True, False]})
>>> print(pdf | pdf3)

       a      b      c
0   True   True  False
1   True  False  False
2  False   True  False
3  False  False  False

>>> print(pdf3 | pdf)

       a      b      c
0  False   True   True
1  False  False   True
2  False   True  False
3  False  False  False

This again seems like a bug in pandas, so I've left the behaviour at the cuDF default of filling the column with int64(nan).
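
An order-independent treatment of mismatched columns can be sketched with plain dicts of lists standing in for dataframes. `frame_or` is a hypothetical helper, not cuDF's API, and `None` stands in for whatever null fill value the library actually uses.

```python
def frame_or(lhs: dict, rhs: dict) -> dict:
    """Elementwise OR on shared columns; unmatched columns become all-null.

    Assumes both frames are non-empty and all columns have equal length.
    """
    n_rows = len(next(iter(lhs.values())))
    result = {}
    for name in sorted(set(lhs) | set(rhs)):
        if name in lhs and name in rhs:
            result[name] = [a or b for a, b in zip(lhs[name], rhs[name])]
        else:
            # Only one side has this column, so there is nothing to combine.
            result[name] = [None] * n_rows
    return result


pdf = {"a": [True, True, False, False], "b": [True, False, True, False]}
pdf3 = {"c": [True, True, False, False], "b": [True, False, True, False]}

# Swapping the operands does not change the result.
assert frame_or(pdf, pdf3) == frame_or(pdf3, pdf)
```

Filling unmatched columns with nulls makes the operation commutative, unlike the pandas output shown above.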

@devavret devavret requested review from a team as code owners May 22, 2019 16:36
@devavret devavret changed the title Fea logic bool op [REVIEW] Logical operators (AND, OR, NOT) for libcudf and cuDF May 22, 2019
devavret added 2 commits May 22, 2019 22:20
* branch-0.8: (231 commits)
  CHANGELOG.
  Doc.
  Endline.'
  Added table copy functions.
  Fix merge
  updated CHANGELOG and removed old tests
  removed lots of code
  CHANGELOG.
  Added printing of submodule status to print_env.sh.
  Fill mask with zeros when making a null column
  ENH: Add test for cudf::bool8 in booleans gtest
  changelog
  maintain the original series name in series.unique output
  Use jinja to set conda dep versions
  Remove duplicate entry in CHANGELOG.md from merge
  changelog
  changelog
  changelog
  changelog
  Update conda dependencies
  ...

# Conflicts:
#	CHANGELOG.md
@harrism

harrism commented May 23, 2019

Hi @devavret : you have access to set labels and projects, so please do so in the future, so reviewers don't miss your PRs that are ready for review.

@harrism added the labels "3 - Ready for Review", "4 - Needs cuDF Reviewer", "4 - Needs Review", "Python" and "libcudf" on May 23, 2019
@harrism

harrism commented May 23, 2019

@shwina and @thomcom, can you comment on @devavret's concerns about pandas behavior above?

@harrism harrism left a comment

The C++ side is all clear and clean, good work. The questionable parts are on the Python side. Python reviewers should comment.

cpp/include/cudf/types.h (resolved)
python/cudf/dataframe/series.py (outdated, resolved)
@devavret

Hi @devavret : you have access to set labels and projects, so please do so in the future, so reviewers don't miss your PRs that are ready for review.

Neat, I’ll remember that. I think we should also move away from adding [REVIEW/WIP] to the title, at least for employees who have write access.

@shwina

shwina commented May 23, 2019

Here's how pandas behaves:
bool (op) integer results in a bool but internally, the operation is bitwise

Maybe I'm missing something - but does it help to think of & and | as indeed being bitwise operators, consistent with Python/numpy?

The difference is that the result is "downcasted" to a bool so it can be used for indexing (the primary use-case for & and | in Pandas is indexing/selection).

@devavret

devavret commented May 23, 2019

Maybe I'm missing something - but does it help to think of & and | as indeed being bitwise operators, consistent with Python/numpy?

The difference is that the result is "downcasted" to a bool so it can be used for indexing (the primary use-case for & and | in Pandas is indexing/selection).

They are indeed being downcasted. That's why I mention this:

Problem with pandas' way is that I'd need two operations:

  • first a bitwise op (0000 0001 | 0000 0010 = 0000 0011)
  • and second a cast op to convert the result to bool (0000 0011 -> 0000 0001)

otherwise subsequent bitwise ops will not match pandas' behaviour
0000 0011 (True) & 0000 0010 (2) = True but
0000 0001 (True) & 0000 0010 (2) = False

Question is, do we want to do that?

@shwina

shwina commented May 23, 2019

@devavret and I had a quick discussion and came to the following conclusion; please feel free to chime in:

& and | operate bitwise between Python ints/bools and are bitwise operators in NumPy. In Pandas - as @devavret pointed out above - they also appear to operate bitwise, but "downcast" the result to a bool when one of the operands is a bool.

We should probably keep the expectation that & and | are bitwise (not logical) operators in cuDF as well. In Pandas, the way to do a logical AND/OR is to use the np.logical_and/or ufuncs:

In [23]: df
Out[23]: 
       0  1
0   True  2
1  False  2

In [24]: df[0] & df[1]
Out[24]: 
0    False
1    False
dtype: bool

In [25]: np.logical_and(df[0], df[1])
Out[25]: 
0     True
1    False
Name: 0, dtype: bool

cuDF can provide similar cudf.logical_and and cudf.logical_or ufuncs for logical operators.
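
The distinction can be sketched in plain Python without either library. The helper names `bitwise_and_downcast` and `logical_and` are hypothetical; they only mimic the elementwise results in the session above and are not the pandas, NumPy, or cuDF implementations.

```python
def bitwise_and_downcast(lhs, rhs):
    """Mimics the `&` result above: bitwise AND, then cast to bool."""
    return [bool(int(a) & int(b)) for a, b in zip(lhs, rhs)]


def logical_and(lhs, rhs):
    """Mimics a logical AND: each operand is converted to bool first."""
    return [bool(a) and bool(b) for a, b in zip(lhs, rhs)]


col0 = [True, False]
col1 = [2, 2]

print(bitwise_and_downcast(col0, col1))  # [False, False]
print(logical_and(col0, col1))           # [True, False]
```

The first row is where the two disagree: 0000 0001 & 0000 0010 has no bits in common, yet both operands are truthy.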

@harrism

harrism commented May 23, 2019

We should probably keep the expectation that & and | are bitwise (not logical) operators in cuDF as well. In Pandas, the way to do a logical AND/OR is to use the np.logical_and/or ufuncs:

On the C++ side, we definitely will do that, because & and | in C++ are bitwise operators. && and || are logical operators.

@harrism

harrism commented May 27, 2019

Just need to fix that CentOS build, @devavret, and we need a Python review from @shwina, @kkraus14, or @thomcom. Thanks!

devavret added 2 commits May 27, 2019 17:02
* branch-0.8: (34 commits)
  allow all dataframe stats ops to have kwargs and leverage series methods to error responsibly
  changelog
  update dataframe support method tests to go beyond the base case
  change _apply_support_method to explicitly take the method as the first argument and then pass kwargs
  Refactor difficult conditional into obvious for loop
  Updated the changelog
  Added PR rapidsai#1831
  Assuming python is in PATH instead of using PYTHON env var, since gpuCI uses PYTHON for the version number.
  fix Python style
  Update CHANGELOG.md
  JSON Reader: add suport for bool columns
  Remove tautological clause.
  Fix style and changelog
  Fix the other two issues.
  Repair problem with multiple aggregations on one variable and a single aggregation on another variable.
  remove unused import in test_csvreader.py
  Update CHANGELOG.md
  CSV Reader: default the column type to string when no data is available for inference.
  Added test of different shapes
  Raising a NotImplementedError when matching columns have different length
  ...

# Conflicts:
#	CHANGELOG.md

@shwina shwina left a comment

LGTM!

@devavret devavret merged commit cd2a144 into rapidsai:branch-0.8 May 27, 2019
Labels
3 - Ready for Review, 4 - Needs Review, libcudf, Python
Development

Successfully merging this pull request may close these issues.

[FEA] Support *logical* boolean operators
[FEA] Logical and (&&) and or (||) support.
4 participants