Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[REVIEW] Jitify versions of binaryops for non-homogeneous types #892

Merged
merged 65 commits into from
Mar 9, 2019

Conversation

devavret
Copy link
Contributor

@devavret devavret commented Feb 7, 2019

Work in this PR

This PR carries forward the work of @ironbit from libgdf PR 94.

Closes #593
Closes #1057
Depends on #694

Investigation:

  • Jitify overheads
  • Global cache implementation

Implementation:

  • Move core functionality code
  • Move unit tests
  • cuDF python bindings

Results of Investigation:

Jitify Overheads

I ran a simple vector add on 10 million elements, multiple times to find the overhead of JIT compiling and the overhead of retrieving and running a cached compiled kernel. The results are in the table below (all times are in ms):

Compile Execute API Existing
157.4500 3.22 292.88 3.53831
0.0467 3.13 3.28 3.40279
0.0151 3.14 3.23 3.40932
0.0137 3.11 3.19 3.40577

The cost of JIT compiling is indeed large but the cost of running a cached kernel (hashing and retrieving) is negligible. So any future effort needs to be in cache persistence.
Since the API cost is almost double the compile cost, I also tried to break down the cost of JIT program creation which was divided into two parts: reading the program string and compiling.

Program init Compile
143.8870 131.2620
0.1201 0.0325
0.0506 0.0176
0.0483 0.0164

Running different instantiations of kernels (different binop/Types) after the first one, does not incur this Program init cost, only the Compile cost.
I didn't dig deeper into what the cost of reading the program string is all about.

Global cache implementation

There's three levels of cache persistence I considered:

  • Thread level
  • Process level
  • Inter process level

Thread level

This was already implemented in libgdf PR 94

Process level

Jitify has an environment variable JITIFY_THREAD_SAFE that can be enabled to turn on thread safety. The cache was simply stored globally with that option turned on and a CMake option was added to turn this off if you want.
I confirmed that this worked and did provide performance benefit by running a simple script:

from concurrent.futures import ThreadPoolExecutor
import numpy as np

e = ThreadPoolExecutor(100)

gfs = [cudf.DataFrame({'x': (np.arange(3, 10003)*3).astype('int32'),
                       'y': np.arange(10000).astype('int16')}) for _ in range(1)]

def f(gf, i):
    print('.', end="", flush=True)
    return (gf.x + gf.y)

ts = time.time()
list(e.map(f, gfs * 1000, data))
te = time.time()
print(te-ts)

Which when ran with thread local cache, took 23 s and with process level cache, took 0.7282 s

Inter process level

This was investigated but wasn't implemented because the Jitify API to serialize/deserialize is experimental and also requires the user to take care of the caching. Also to limit the scope of this PR.
Possible ways to implement this:

  • Stored as a file in temporary folder. Can be done portably using std::filesystem::temp_directory_path but would need .lock files as unix doesn't have exclusive access to files.
  • Stored in shared memory.
    In either case, more information is needed about usage scenarios.

Copy link
Contributor

@jrhemstad jrhemstad left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Only did a top-level review of the headers.

cpp/include/cudf/functions.h Outdated Show resolved Hide resolved
cpp/include/cudf/functions.h Outdated Show resolved Hide resolved
cpp/include/cudf/functions.h Outdated Show resolved Hide resolved
cpp/include/cudf/functions.h Outdated Show resolved Hide resolved
cpp/include/cudf/functions.h Outdated Show resolved Hide resolved
cpp/include/cudf/types.h Outdated Show resolved Hide resolved
cpp/include/cudf/types.h Outdated Show resolved Hide resolved
cpp/include/cudf/types.h Outdated Show resolved Hide resolved
- Removed APIs that allowed for a default value
- Removed custom vector implementation from unit test
- Added Changelog
@kkraus14 kkraus14 added 2 - In Progress Currently a work in progress libcudf Affects libcudf (C++/CUDA) code. labels Feb 8, 2019
@harrism
Copy link
Member

harrism commented Feb 11, 2019

@devavret Just want to make sure you address my concerns in my review of the original libgdf PR (rapidsai/libgdf#94), as well as any concerns expressed by other reviewers. Thanks!

* branch-0.6: (74 commits)
  accidental whitespace
  changelog
  remove 0.6 version specification from readme
  Remove duplicate imports and cimport of numpy
  np.ndim will try REALLY hard to cast to a numpy array which makes things take basically forever
  CHANGELOG and repair a few tests.
  Add DataFrame.groupby(level=0) functionality
  CHANGELOG
  Improve Series groupby
  fix indentation in merge
  Add agg method
  Update csv.py
  Update CHANGELOG.md
  Groupby support for Series
  Resolve ambiguous names
  ENH: Use RMM::exec_policy for thrust operations
  Update changelog for PR rapidsai#889
  Deleted old version of test_rmm.py which is now moved to RMM repo
  change wording
  Update CHANGELOG.md
  ...
@devavret
Copy link
Contributor Author

@harrism I presumed those would've been addressed already. I see they haven't. I'm on it.

…nops

* fea-ext-test-utils-improvements:
  Added  member to column_wrapper.
  CHANGELOG.
  Updated size_t to gdf_size_type.
  Added initial table_wrapper class for an abstraction of multiple gdf_columns of different types that can be used for unit testing.
  Added CUDA error checking to column wrapper. Updated operator== to only compare non-null values.
  Added constructor to column_wrapper that accepts solely a host_data vector. Defaults the valid buffer to null.
  Updates for column_wrapper.
  Added wrapper class for gdf_column for unit testing purposes.
  Updated uint32_t to int32_t because unsigned types are not supported gdf_dtypes.
  Added type_to_enum traits struct. Updated print_gdf_column to use type_dispatcher. Hash test currently failing.

# Conflicts:
#	CHANGELOG.md
…nops

* fea-ext-test-utils-improvements:
  moved order of includes.
  Added option to allocat bitmask to column_size constructor for column_wrapper.
  Added documentation.
  Added set_null_count function.
…nops

* fea-ext-test-utils-improvements:
  Documentation for gdf_column conversion operator.
  Added conversion operator from column_wrapper to gdf_column to avoid need to use .get().
  Added first test of column_wrapper.
  Temporarily deleted copy and assignment from column_wrapper.
  Moved tuple for_each to utility folder.
  Documentation.
  Temporarily undefine NDEBUG to run Death tests for type dispatcher.
  Updated Dispatcher Tests to test new traits. Also added additional tests to cause failed tests if a new type is added without updating types.
  Updated constructor documentaiton.
  Added intitial unit tests file for column_wrapper.
  For_each over tuple elements and documentation.
Copy link
Contributor

@jrhemstad jrhemstad left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for taking the time to update this PR and get it working.

Several remnants that were inherited from the original PR need to be updated:

  1. Parameter names like vax and vay are confusing. Should instead be renamed with lhs and rhs or left and right.
  2. Most of the internal implementation files are missing Doxymentation
  3. Lots of custom and reimplemented traits that are already present in Jitify or that should be added to Jitify (e.g., std::common_type). Removing these will significantly simplify this code.
  4. Non-standard error checking functions that should user GDF_REQUIRE.
  5. The null handling in the kernels is unnecessary and inefficient. It can instead be done as a simple memcpy (outside of the kernel).

cpp/include/cudf/functions.h Outdated Show resolved Hide resolved
cpp/include/cudf/types.h Outdated Show resolved Hide resolved
cpp/include/cudf/types.h Outdated Show resolved Hide resolved
cpp/src/binary/jit/code/kernel.cpp Outdated Show resolved Hide resolved
cpp/src/binary/jit/code/kernel.cpp Outdated Show resolved Hide resolved
cpp/src/binary/jit/core/launcher.h Outdated Show resolved Hide resolved
cpp/src/binary/jit/core/launcher.h Show resolved Hide resolved
cpp/src/binary/jit/core/launcher.cpp Show resolved Hide resolved
cpp/src/binary/jit/core/launcher.cpp Outdated Show resolved Hide resolved
cpp/src/binary/jit/core/binop.cpp Outdated Show resolved Hide resolved
cpp/include/cudf/types.h Outdated Show resolved Hide resolved
@@ -85,6 +85,36 @@ typedef struct {
// here we can also hold info for decimal datatype or any other datatype that requires additional information
} gdf_dtype_extra_info;

/**---------------------------------------------------------------------------*
* @union gdf_data
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shouldn't this be gdf_datum, signifying the singular rather than the plural?

gdf_data sounds super- vague, and is also confusing IMHO since it can be interpreted to mean lots of data, or all data in a column etc.

cpp/include/cudf/types.h Outdated Show resolved Hide resolved
cpp/include/cudf/types.h Show resolved Hide resolved
cpp/src/binary/jit/core/binop.cpp Show resolved Hide resolved
cpp/src/binary/jit/core/launcher.cpp Show resolved Hide resolved
return "None";
}
}

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Come on... don't do this. Use an array. Or a flat map (if we have one; probably not). Or at worst, an unordered map.

cpp/include/cudf/functions.h Show resolved Hide resolved
Copy link
Collaborator

@kkraus14 kkraus14 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Minor changes needed

.gitmodules Outdated
@@ -10,3 +10,6 @@
path = thirdparty/rmm
url = https://github.com/rapidsai/rmm.git
branch = branch-0.6
[submodule "thirdparty/jitify"]
path = thirdparty/jitify
url = https://github.com/NVIDIA/jitify.git
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we pin this to a specific branch / tag / commit?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, please pin to a commit to avoid build issues in the future.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will this do? Or do I need to specify it in .gitmodules.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

Copy link
Collaborator

@kkraus14 kkraus14 Mar 7, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We need to specify it in .gitmodules otherwise a git submodule update --init --recursive --remote will update it to the latest commit of master.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can't seem to find a way to pin this to a commit/tag in the .gitmodules file. I read this and update seems to be the only option. But details on update say that the only way to do this is to use none. Is that what I should do?

python/cudf/bindings/binops.pyx Outdated Show resolved Hide resolved
python/cudf/tests/test_binops.py Show resolved Hide resolved
* branch-0.6: (30 commits)
  Update column.cpp
  Update cpp/src/dataframe/column.cpp
  added brief non doxygen comments to column.cpp, removed some more horizontal lines from functions.h
  removed horizontal lines from types.h, made one line comments us ///< instead of /**<, removed verbose language from comments
  changelog
  Use device array instead of numpy array for index gather call. Allow DeviceNDArrays to be used for getitem calls.
  Moved CUDA_CHECK_LAST after CUDA_TRY.
  Documentation.
  Update cpp/tests/error/error_handling_test.cu
  More tests for CHECK_STREAM.
  Documentation.
  Added initial tests for CHECK_STREAM.
  Added STREAM_CHECK macro to synchronize and error check a stream.
  Added test catching exception from CUDA_TRY.
  Updated CUDA_EXPECTS to CUDA_TRY.
  CUDA_EXPECTS.
  In progress.
  removed Reason: text. Added test for the contents of the what() message.
  Removed references to cudf::detail::logic_error in place of cudf::logic_error.
  Moved the exception out of the detail namespace.
  ...

# Conflicts:
#	cpp/include/cudf/types.h
.gitmodules Outdated
@@ -10,3 +10,6 @@
path = thirdparty/rmm
url = https://github.com/rapidsai/rmm.git
branch = branch-0.6
[submodule "thirdparty/jitify"]
path = thirdparty/jitify
url = https://github.com/NVIDIA/jitify.git
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, please pin to a commit to avoid build issues in the future.

cpp/CMakeLists.txt Show resolved Hide resolved
cpp/CMakeLists.txt Outdated Show resolved Hide resolved
DEPENDS stringify
MAIN_DEPENDENCY ${CMAKE_CURRENT_SOURCE_DIR}/include/cudf/types.h)

add_custom_target(stringify_run DEPENDS ${CMAKE_BINARY_DIR}/include/types.h.jit)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

DEPENDS does not have to be a path, it can be a CMake variable.

For example:

add_custom_command(OUTPUT  TYPES_JIT
	                   WORKING_DIRECTORY ${CMAKE_CURRENT_SOURCE_DIR}/include
	                   COMMAND ${CMAKE_BINARY_DIR}/stringify cudf/types.h > ${CMAKE_BINARY_DIR}/include/types.h.jit
                           ...)
...
add_custom_target(stringify_run DEPENDS TYPES_JIT)
...

As long as the MAIN_DEPENDENCY is properly specified to be ${CMAKE_CURRENT_SOURCE_DIR}/include/cudf/types.h, CMake will update jit_types if that file changes.

Note that this is helpful if later you expect your command to have multiple outputs.

I would also consider renaming stringify_run to something more descriptive. I'm not really sure what the purpose of this target is off the cuff, and that obfuscates our build process for new developers, etc.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I tried this but it didn't work. Running make rebuilds launcher.cpp every time.

add_custom_command(OUTPUT STRINGIFIED_HEADERS
                   WORKING_DIRECTORY ${CMAKE_CURRENT_SOURCE_DIR}/include
                   COMMAND ${CMAKE_BINARY_DIR}/stringify cudf/types.h > ${CMAKE_BINARY_DIR}/include/types.h.jit
                   COMMENT "Run stringify on header types.h to convert it to c-str for use in JIT compiled code"
                   DEPENDS stringify
                   MAIN_DEPENDENCY ${CMAKE_CURRENT_SOURCE_DIR}/include/cudf/types.h)

add_custom_target(stringified_headers DEPENDS STRINGIFIED_HEADERS)

add_dependencies(cudf stringified_headers)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry what does the cmake code above have to do with launcher.cpp?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This code is for making and running an executable that comes with Jitify. This executable, called stringify, converts source files into c-strings and writes them to another source file. Now those stringified source files can be used in JIT code compilation. In our specific use case, we wanted to be able to use the definitions in types.h in the JIT kernels. So we use stringify to convert types.h into a string and store it in types.h.jit in the build/include directory. That types.h.jit is included in launcher.cpp. I wanted a way to use CMake to run stringify only when types.h is changed. With @mt-jones ' suggestion, CMake runs it all the time and every time i run make, launcher.cpp is rebuilt because the include is touched.

cpp/CMakeLists.txt Show resolved Hide resolved
@kkraus14 kkraus14 added 3 - Ready for Review Ready for review by team 0 - Waiting on Author Waiting for author to respond to review and removed 5 - Ready to Merge Testing and reviews complete, ready to merge labels Mar 7, 2019
@kkraus14
Copy link
Collaborator

kkraus14 commented Mar 8, 2019

I think the only thing left is the change of the submodule to a fork on the rapidsai project. Once that's done I'll approve.

@GPUtester
Copy link
Collaborator

Can one of the admins verify this patch?

@kkraus14
Copy link
Collaborator

kkraus14 commented Mar 8, 2019

add to whitelist

@devavret
Copy link
Contributor Author

devavret commented Mar 8, 2019

I think the only thing left is the change of the submodule to a fork on the rapidsai project. Once that's done I'll approve.

I'm only waiting on the ops ticket to fork jitify. I cannot see the issue right now. (404 for me). So I'm waiting to be added to the rapids org.

Edit: nvm, found the fork.

@jrhemstad jrhemstad merged commit 0676d5f into rapidsai:branch-0.6 Mar 9, 2019
@devavret devavret deleted the enh-ext-jitify-binops branch March 9, 2019 09:01
@harrism harrism removed the 0 - Waiting on Author Waiting for author to respond to review label Mar 10, 2019
@devavret devavret mentioned this pull request Apr 1, 2019
4 tasks
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
3 - Ready for Review Ready for review by team libcudf Affects libcudf (C++/CUDA) code.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

8 participants