-
Notifications
You must be signed in to change notification settings - Fork 43
[WIP] Binary Operators #94
base: master
Are you sure you want to change the base?
Conversation
Implemented first approach, multiple dispatch and generic programming.
This has a conflict with master now |
The 2nd approach using NVRTC ("Jitify" library). Base version for review.
The 2nd approach using NVRTC ("Jitify" library). Base and stable version of the 2nd approach.
The 2nd approach using NVRTC ("Jitify" library). Added kernels for binary operations with default values.
The 2nd approach using NVRTC ("Jitify" library). Improved kernel implementation. Scalar type will be converted at kernel execution (type defined).
The 2nd approach using NVRTC ("Jitify" library). Added all base types to gdf_dtypes.
The 2nd approach using NVRTC ("Jitify" library). Changed gdf_scalar interface.
The 2nd approach using NVRTC ("Jitify" library). Erased 'valid' field input for kernels without a default value.
The 2nd approach using NVRTC ("Jitify" library). Updated valid field in kernels and added integration tests for all kernels.
Using Jitify utility to choose the optimal block size at kernel calls.
Fixed gtest compilation issues.
Fixed gtest issues
Added shared libraries for tests.
The 2nd approach using NVRTC ("Jitify" library). Erased unused functionality.
The 2nd approach using NVRTC ("Jitify" library). Added binary operations.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My biggest concern with this PR is that there is a lot of code and basically zero documentation. First, I think it needs high-level documentation of the functionality and organization, and it needs operational documentation for the main APIs being introduced (both external and internal APIs).
include/gdf/cffi/types.h
Outdated
@@ -16,6 +20,23 @@ typedef enum { | |||
N_GDF_TYPES, /* additional types should go BEFORE N_GDF_TYPES */ | |||
} gdf_dtype; | |||
|
|||
union gdf_data { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are the names in this union required to only be four characters for some reason? It would be nice to spell them out better, especially "invd", "tmst", "dt32" and "dt64".
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There isn't a particular reason. Is there any name style and code style for this project?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Human readable variable names is good practice for every project.
src/binary/binary2/kernel.cpp
Outdated
} | ||
} | ||
|
||
template <typename TypeOut, typename TypeVax, typename TypeVay, typename TypeDef, typename TypeOpe> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Wow, TypeDef
is very risky typename, considering its proximity to typedef
. Could just use TDef, TVax, etc. Or reorder, putting Type
last in the names.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No problem, it will be changed.
src/binary/binary2/kernel.cpp
Outdated
|
||
template <typename TypeOut, typename TypeVax, typename TypeVay, typename TypeDef, typename TypeOpe> | ||
__global__ | ||
void kernel_v_s_d(int size, gdf_data def_data, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should gdf_data
on this line be TypeDef
?
src/binary/binary2/kernel.cpp
Outdated
AbstractOperation<TypeOpe> operation; | ||
out_data[i] = operation.template operate<TypeOut, TypeVax, TypeVay>(vax_data_aux, (TypeVay)vay_data); | ||
|
||
__syncwarp(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What is the reason for __syncwarp()
here? I don't see any sharing of data between threads. Is it necessary? Same question for all the other __syncwarp()
calls.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's not correct. It'll be removed.
src/binary/binary2/launcher.cpp
Outdated
return *this; | ||
} | ||
|
||
Launcher& Launcher::instantiate(gdf_column* out, gdf_column* vax, gdf_column* vay, gdf_binary_operator ope) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since each of these instantiate
functions varies only in the type of one argument, and their bodies are identical, couldn't you template them to reduce redundancy? You might even use variadic templates to collapse all into one instantiate and one launch function.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's a good point
src/binary/binary2/traits.cpp
Outdated
R"***( | ||
#pragma once | ||
|
||
struct IntegralSigned {}; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmmmm, I believe Jitify already has some support for std::type_traits, e.g. https://github.com/NVIDIA/jitify/blob/master/jitify.hpp#L1124
Do you need to roll your own?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I had some issue with that. I expressed it in my daily scrum. It is supposed to be working, but it isn't. I included the libraries, however "common_type" and others aren't working, because of the error messages using nvrtc.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's work with Kate Clark and Ben Barsdell on making Jitify better. Are you in touch with them?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No. I'm not. I remember having issues related to "type_traits" header and with compiler option "--device-as-default-execution-space", which wasn't working.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I will contact them.
return (TypeOut)pow((double)vax, (double)vay); | ||
} | ||
}; | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's nice in this situation (closing brackets far from their openings) to comment what is being closed, e.g.:
} // namespace operation
} // namespace test
} // namespace gdf
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Documentation is important. I was focused on implementation.
The 2nd approach using NVRTC ("Jitify" library). Erased syncwarp primitive.
…into binary-operators-draft
Changed submodule "Jitify" to https protocol.
Added CUDA configuration for NVRTC library.
Added 'stubs' directory in cmake for CUDA library.
Changed default string variable for selection of binary operations.
Added google benchmark for binary operations - NVRTC implementation.
Changed interface for cffi python.
Solved issues related to other tests due to a fix in the storage duration of Jitify cache.
Added 'stubs' directory in the library path (python) for NVRTC and CUDA libraries.
Improved vector library code.
Fixed issue related to inputs verification.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We did a live PR review with @jrhemstad @williamBlazing and @ironbit - requesting more changes. Also, we need the python bindings updated to use the new binary ops API: @mt-jones, can you post an update on where we are with this?
@@ -0,0 +1,36 @@ | |||
#============================================================================= |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we use this for CI? @mike-wendt
GDF_GREATER, | ||
GDF_LESS_EQUAL, | ||
GDF_GREATER_EQUAL, | ||
//GDF_COMBINE, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Add a description/comment why those are not implemented
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Or, preferably, please remove.
src/binary/CMakeLists.txt
Outdated
message(STATUS "Binary Operation Version: V2 - NVRTC") | ||
|
||
set(source_files | ||
src/binary/binary2/binary.cpp |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Rename binary2
folder to binary_nvrtc
or binary_jitify
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The source code has been reorganized, that's why I think it's convenient to reply to every observation.
https://github.com/BlazingDB/libgdf/tree/binary-operators-draft/src/binary-operation/jit
src/binary/binary2/kernel.cpp
Outdated
|
||
template <typename TypeOut, typename TypeVax, typename TypeVay, typename TypeOpe> | ||
__global__ | ||
void kernel_v_s(int size, TypeOut* out_data, TypeVax* vax_data, gdf_data vay_data) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We need kernel_s_v
as well - should be 6 combinations overall, one kernel per the top level GDF function
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Implemented the remaining scalar-vector operations.
src/binary/binary2/launcher.cpp
Outdated
#include "binary/binary2/cuda.h" | ||
|
||
namespace gdf { | ||
static thread_local jitify::JitCache JitCache; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should be able to store JitCache when exiting the process to store in a file, then use an env variable to read it from the specified file to have the cache enabled between different processes
src/binary/binary2/operation.cpp
Outdated
} | ||
}; | ||
/* | ||
struct Add : public AbstractOperation<Add> { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should add a comment with the description why this could be useful, i.e. for the overflows - if we ever wanted to enable this "feature" in the future
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The source code has been optimized for jit.
The commented code has been moved outside the jit code and it has been added a description.
https://github.com/BlazingDB/libgdf/blob/96012d91aff817e6105a8fa8983e1587d237493f/src/binary-operation/jit/code/operation.cpp#L159
src/library/field.h
Outdated
@@ -0,0 +1,109 @@ | |||
/* |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Move library
under bench/binary
or tests/binary
since it's only used in tests, probably also rename to something like binary_helper
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@@ -0,0 +1,203 @@ | |||
/* |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Needs to be moved to src/bench/binary
to have a folder per libgdf operator (one folder for binary, one for parquet, one for joins, etc.)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
src/binary/binary2/kernel.cpp
Outdated
|
||
for (int i=start; i<size; i+=step) { | ||
AbstractOperation<TypeOpe> operation; | ||
out_data[i] = operation.template operate<TypeOut, TypeVax, TypeVay>(vax_data[i], (TypeVay)vay_data); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We need to set the output valid bit mask - I don't see it being handled in the code. It should be an OR between the two bit masks of the two input operands.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The output valid bitmask is processed in all kernels.
The kernels also have been optimized using 'Bit Hacks'. In the benchmark, the kernel (v_v_v_d) reduces its time approx. in 5us in all of its benchmarks, while the kernel (v_v_v) increments its time approx. in 7us due to the bitmask processing.
https://github.com/BlazingDB/libgdf/blob/729cbfac6ae2281894b99644c58643184bd18d85/src/binary-operation/jit/code/kernel.cpp#L82
Reorganized the source code.
Optimized the source code in Jit.
Added namespace documentation.
return "Pow"; | ||
//case GDF_COMBINE: | ||
//case GDF_COMBINE_FIRST: | ||
//case GDF_ROUND: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are these simply not implemented? Probably should remove or make a comment
Moved 'util' library used in tests and benchmark.
Optimized Jit kernel code (bit hacks) and added the output valid bitmask in some kernels.
Added documentation related to the new interface. Changed the interface ('valid' field in gdf_scalar).
Added the changes for scalar vector operations.
Added the standard header for math functions.
Added a 'libcuda' soft link.
…ignificant bit format
… ws working before but could have had weird errors at sizes above 1bn records
Implemented so far is the first implementation concept using nested templating. It has been implemented with support for literals and 14 different binary operators.
We would like feedback on the API implemented, which is:
``
gdf_error gdf_scalar_operation(gdf_column* out, gdf_column* vax, gdf_scalar* vay, gdf_scalar* def, gdf_binary_operator ope);
gdf_error gdf_vector_operation(gdf_column* out, gdf_column* vax, gdf_column* vay, gdf_scalar* def, gdf_binary_operator ope);
enum gdf_binary_operator {
GDF_ADD,
GDF_SUB,
GDF_MUL,
GDF_DIV,
GDF_TRUE_DIV,
GDF_FLOOR_DIV,
GDF_MOD,
GDF_POW,
//GDF_COMBINE,
//GDF_COMBINE_FIRST,
//GDF_ROUND,
GDF_EQUAL,
GDF_NOT_EQUAL,
GDF_LESS,
GDF_GREATER,
GDF_LESS_EQUAL,
GDF_GREATER_EQUAL,
//GDF_PRODUCT,
//GDF_DOT
};
struct gdf_scalar {
union gdf_data {
std::int8_t si08;
std::int16_t si16;
std::int32_t si32;
std::int64_t si64;
float fp32;
double fp64;
};
};
``
We have already started on a second implementation using NVRTC which uses the same API.
Unit tests have not yet been implemented and will be implemented after an implementation method has been decided.
This current implementation right now only supports the second operand being able to be a scalar, which can be improved upon easily later.
In order to compile the binary operation functionality, use the cmake variable "BINARY_OPERATION_VERSION" and choose the binary version.
Example:
cmake -DCMAKE_BUILD_TYPE=Release -DBINARY_OPERATION_VERSION:STRING=V1 ../../code/libgdf
This cmake switch was added right now because compilation time is very large due to all the functions generated by the templated code. For example compilation on a laptop with an 8-core i7 processor and 16Gb ram was:
Time to compile: 77m10.996s
Ram memory used: More than 13.5GB
Release binary size: 69,931,016 bytes