Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[REVIEW] Added GDF_BOOL type to cuda code #817

Closed
wants to merge 4 commits into from
Closed

[REVIEW] Added GDF_BOOL type to cuda code #817

wants to merge 4 commits into from

Conversation

BradReesWork
Copy link
Member

Added a GDF_BOOL type.

The type was added as a wrapper type that resolves to an int8

  • update types.h
  • update wrapper_types.hpp
  • update type_dispatcher.hpp
  • update tests/type_test.cpp

This PR does not change any code to start using the new type

closes #667

Copy link
Collaborator

@eyalroz eyalroz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Things I'm missing:

  • All cudf code which uses GDF_INT8 but really uses it as boolean should be converted to use GDF_BOOL.
  • Doxygen comments regarding functions taking or producing columns of booleans (and I don't mean valid pseudo-columns, since they're bit-packed) should be given a once-over to check if they say the input or output has type GDF_BOOL.

Bottom line: I would "request changes", but I don't have the official capacity for that.

@@ -27,6 +27,7 @@ typedef enum {
GDF_TIMESTAMP, /**< Exact timestamp encoded with int64 since UNIX epoch (Default unit millisecond) */
GDF_CATEGORY,
GDF_STRING,
GDF_BOOL,
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wait, where's the typedef int8_t gdf_bool? Or rather, for consistency with the (problematic) typedefs we have now, typedef bool gdf_bool?

Also - Please add a comment explaining the semantics of this type, i.e. the fact that 0 means false and anything else means true.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not sure why we would want all GDF_INT8 code to be referenced as GDF_BOOL? While the datatype might be the same, conceptually they are different.

"the fact that 0 means false and anything else means true." has always been a C standard

Copy link
Contributor

@jrhemstad jrhemstad Feb 25, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not sure why we would want all GDF_INT8 code to be referenced as GDF_BOOL? While the datatype might be the same, conceptually they are different.

That's not what @eyalroz was suggesting. Instead he was just saying to add a typedef int8_t gdf_bool and then use that typedef in the wrapper as

using cudf_bool 	= detail::wrapper<gdf_bool, GDF_BOOL>;

"the fact that 0 means false and anything else means true." has always been a C standard

While this is true, all of the operators for wrapper don't behave this way. Thus, there's a few options:

  1. Add specializations for the pertinent operator::wrapper for gdf_bool such that 0 means false and not zero means true (this is the most work)
  2. Require that for gdf_bool that 0 means false and 1 means true, and all other values are invalid. (Could be done via assert or just trusting the user?)
  3. Use typedef bool gdf_bool which automatically enforces 2., but is problematic because of the machine dependence on sizeof(bool).
  4. Add a normalize function similar in concept to unwrap in that for all types other than gdf_bool, it's just a no-op, but for gdf_bool types, it checks the value of the gdf_bool, if 0 it stays 0, if not zero, it changes it to 1. You would then need to update all of the necessary wrappers operators to then call normalize on the passed in arguments.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll say I don't like the idea of keeping having cudf_foo and gdf_foo. The latter identifiers should all eventually become cudf_foo, and we'll be stuck with a conflict.

Also - @jrhemstad , in option (1.) you mean implementing arithmetic operators, comparison operators etc? I don't think that is really too much work.

I'm in favor of (1.) , but possibly also of (4.)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also - @jrhemstad , in option (1.) you mean implementing arithmetic operators, comparison operators etc? I don't think that is really too much work.

Yes, it's not a lot of work, just more than 2, 3, or 4.

@@ -107,6 +107,7 @@ decltype(auto) type_dispatcher(gdf_dtype dtype,
case GDF_DATE64: { return f.template operator()< date64 >(std::forward<Ts>(args)...); }
case GDF_TIMESTAMP: { return f.template operator()< timestamp >(std::forward<Ts>(args)...); }
case GDF_CATEGORY: { return f.template operator()< category >(std::forward<Ts>(args)...); }
case GDF_BOOL: { return f.template operator()< gdf_bool >(std::forward<Ts>(args)...); }
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note to @jrhemstad : Another reason to put some priority on accepting fs which only have operator() implemented for a subset of the types. Hint hint.

cpp/src/utilities/wrapper_types.hpp Show resolved Hide resolved
@@ -263,6 +263,9 @@ using date32 = detail::wrapper<gdf_date32, GDF_DATE32>;

using date64 = detail::wrapper<gdf_date64, GDF_DATE64>;

using gdf_bool = detail::wrapper<int8_t, GDF_BOOL>;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To be honest, I don't really like how some cudf types have using statements here, and some don't. But while that's the rule - why should gdf_bool be different than the integer and floating-point types? It does correspond to a C/C++ native type after all.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe it would be problematic to use bool natively, because the size of bool in C++ is implementation defined.

Copy link
Collaborator

@eyalroz eyalroz Jan 31, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Of course it would! ... I was just being consistent with the other problematic typedef's:

typedef int gdf_size_type; /**< Limits the maximum size of a gdf_column to 2^31-1 */
typedef gdf_size_type gdf_index_type;
typedef unsigned char gdf_valid_type;
typedef	long	gdf_date64;
typedef	int		gdf_date32;
typedef	int		gdf_category;
typedef	long	gdf_timestamp;

int and long are also of implementation-defined size... I'd be all for


typedef int32_t gdf_size_type;
typedef gdf_size_type gdf_index_type;
typedef uint8_t gdf_valid_type; /* shouldn't this become uint32_t ? */
typedef int64_t gdf_date64;
typedef int32_t gdf_date32;
typedef int64_t gdf_timestamp;
typedef int32_t gdf_category;
typedef int8_t gdf_bool;

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The initial conversation was around rather to simply do a typedef int8_t gdf_bool or to use the type_dispatcher. We had decided to go down the type_dispatcher route, hence the is no typedef.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@BradReesWork : I don't quite understand. How does it make sense to have typedef int gdf_date32; but not have typedef signed char gdf_bool or typedef int8_t gdf_bool?

@@ -49,7 +49,7 @@ struct WrappersTest : public ::testing::Test {
};

using Wrappers = ::testing::Types<cudf::category, cudf::timestamp, cudf::date32,
cudf::date64>;
cudf::date64, cudf::gdf_bool>;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's it, no other tests? I wonder if this is enough.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree that more test would be beneficial. The current set of test does run through the most of the standard mathematical operations. What is missing are boolean operators. I will get those added

@jrhemstad
Copy link
Contributor

jrhemstad commented Jan 31, 2019

@jrhemstad: What implicit integral upcasting for types smaller than int?

@eyalroz referring to this: https://en.cppreference.com/w/cpp/language/implicit_conversion#Integral_promotion

Specifically:

In particular, arithmetic operators do not accept types smaller than int as arguments, and integral promotions are automatically applied

This was problematic because of the following:

template <typename T>
wrapper<T> operator+(wrapper<T> const &lhs, wrapper<T> const &rhs)
{
     return wrapper<T>{lhs.value + rhs.value};
}

This would throw a warning (error, since all warnings are errors) when T == int8_t because of a narrowing conversion from int to int8_t.

Basically, this will fail for wrapper<T,dtype> when T is smaller than an int.

static_assert(std::is_same<T, decltype(lhs.value + rhs.value)>::value);

Example here: https://wandbox.org/permlink/pU48n53H8F76g7SP

Copy link
Contributor

@jrhemstad jrhemstad left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I just realized that operator== will need a specialization for cudf::gdf_bool, because it should be such that if both lhs and rhs are greater than zero, then they are equal.

There will need to be some changes to the unit tests to test this as well.

@kkraus14
Copy link
Collaborator

I just realized that operator== will need a specialization for cudf::gdf_bool, because it should be such that if both lhs and rhs are greater than zero, then they are equal.

There will need to be some changes to the unit tests to test this as well.

In theory shouldn't it be incorrect behavior to have any value other than a 1 or 0 in a GDF_BOOL column? I understand if comparing a bool to an int8 for example, but wouldn't that be a different function / operator?

@eyalroz
Copy link
Collaborator

eyalroz commented Feb 1, 2019

I just realized that operator== will need a specialization for cudf::gdf_bool, because it should be such that if both lhs and rhs are greater than zero, then they are equal.

So, also all other comparison operators need to be changed as well.

Also, we might want to consider a normalize() or canonicalize() method, which for GDF_BOOL would replace non-zero values with 1's and keep 0's as-is.

@jrhemstad
Copy link
Contributor

I just realized that operator== will need a specialization for cudf::gdf_bool, because it should be such that if both lhs and rhs are greater than zero, then they are equal.

So, also all other comparison operators need to be changed as well.

Also, we might want to consider a normalize() or canonicalize() method, which for GDF_BOOL would replace non-zero values with 1's and keep 0's as-is.

So we'd add a normalize() call around all of the wrappers inside the various operators, and it'd just be a pass-through for all but gdf_bool?

@eyalroz
Copy link
Collaborator

eyalroz commented Feb 1, 2019

So we'd add a normalize() call around all of the wrappers inside the various operators, and it'd just be a pass-through for all but gdf_bool?

If by pass-through you mean it would do nothing and be noexcept and inline, then yes.

@kkraus14 kkraus14 added 0 - Waiting on Author Waiting for author to respond to review libcudf Affects libcudf (C++/CUDA) code. labels Feb 11, 2019
@harrism
Copy link
Member

harrism commented Feb 12, 2019

@jrhemstad @BradReesWork What needs to happen with this PR?

@eyalroz
Copy link
Collaborator

eyalroz commented Feb 22, 2019

@harrism

One thing I hope would be considered (whether rejected or accepted) is my typedef int8_t gdf_bool; suggestion. This would be useful, in particular, for the apply_stencil reimplementation and also for consistency IMHO.

@jrhemstad
Copy link
Contributor

@harrism

One thing I hope would be considered (whether rejected or accepted) is my typedef int8_t gdf_bool; suggestion. This would be useful, in particular, for the apply_stencil reimplementation and also for consistency IMHO.

I 100% support this as soon as #599 is completed. Otherwise, the CFFI will fail due to inability to parse the required #include <stdint.h>

@eyalroz
Copy link
Collaborator

eyalroz commented Feb 22, 2019

@jrhemstad :

I 100% support this as soon as #599 is completed. Otherwise, the CFFI will fail due to inability to parse the required #include <stdint.h>

In the mean time, we could use this hack:

#ifndef _STDINT_H
typedef signed char        int8_t;
#endif
typedef int8_t             gdf_bool;

or even simply

typedef signed char  gdf_bool;

and cross our fingers (or require with CMake, or static_assert somewhere) that they're the same type.

@eyalroz
Copy link
Collaborator

eyalroz commented Mar 6, 2019

@BradReesWork , @harrism : Pinging you guys about this PR, because it has a slight (compatible) overlap with my PR #902 , and if it goes forward I'll base myself on what's gone in.

@jrhemstad jrhemstad mentioned this pull request Mar 7, 2019
12 tasks
@jrhemstad
Copy link
Contributor

Superseded by #1142

@jrhemstad jrhemstad closed this Mar 7, 2019
@harrism
Copy link
Member

harrism commented Mar 7, 2019

@jrhemstad is going to take over this PR. We'll move it to 0.7.

@harrism harrism assigned jrhemstad and unassigned BradReesWork Mar 7, 2019
@jrhemstad jrhemstad mentioned this pull request Jun 5, 2019
1 task
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
0 - Waiting on Author Waiting for author to respond to review libcudf Affects libcudf (C++/CUDA) code.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants