Add transposition to kernel generator #1769

Merged
22 commits merged on Mar 23, 2020

@t4c1
Contributor

t4c1 commented Mar 10, 2020

Summary

Adds transposition to the kernel generator. The existing transposition kernel is removed.

Tests

Added new tests for transposition. Existing tests in opencl/prim/transpose_test.cpp also test the new code.

Side Effects

None.

Checklist

  • Math issue Implement OpenCL kernel generator #1342

  • Copyright holder: Tadej Ciglarič

    The copyright holder is typically you or your assignee, such as a university or company. By submitting this pull request, the copyright holder is agreeing to license the submitted work under the following licenses:
    - Code: BSD 3-clause (https://opensource.org/licenses/BSD-3-Clause)
    - Documentation: CC-BY 4.0 (https://creativecommons.org/licenses/by/4.0/)

  • the basic tests are passing

    • unit tests pass (to run, use: ./runTests.py test/unit)
    • header checks pass, (make test-headers)
    • dependencies checks pass, (make test-math-dependencies)
    • docs build, (make doxygen)
    • code passes the built in C++ standards checks (make cpplint)
  • the code is written in idiomatic C++ and changes are documented in the doxygen

  • the new changes are tested

@t4c1 t4c1 changed the title from "Cl kernel generator transpose" to "Add transposition to kernel generator" Mar 10, 2020
Collaborator

@SteveBronder SteveBronder left a comment

Few Qs related to the templates here

/**
 * Represents a transpose in kernel generator expressions.
 *
 * Warning: transposing this expression is not supported!
Collaborator

This warning is kind of confusing, like we can't do transpose(transpose(x))?

Contributor Author

Whoops, this is not true anymore. Removed.

Comment on lines 25 to 27
 * @tparam Derived derived type
 * @tparam T_a type of first argument
 * @tparam T_b type of second argument
Collaborator

These don't match up

Collaborator

Also this may be a better place for clearer names like calling T Expr

Contributor Author

fixed

Comment on lines 46 to 55
explicit transpose_(T&& a) : base(std::forward<T>(a)) {}

/**
 * Creates a deep copy of this expression.
 * @return copy of \c *this
 */
inline transpose_<std::remove_reference_t<T>> deep_copy() {
  return transpose_<std::remove_reference_t<T>>{
      std::get<0>(arguments_).deep_copy()};
}
Collaborator

This is kinda weird to me. Let's say we call

double b = 10;
transpose_<double> foo(b);

(or insert any non ref type). Then the constructor here is going to be

explicit transpose_(double&& a) : base(std::forward<double>(a)) {}

Where std::forward<double>(a) here is actually going to give back an rvalue reference since forward with an lvalue as the template and rvalue as the type still produces an rvalue (see godbolt below).

I think you need to have a separate template here

template <typename OpT, require_same_t<T, OpT>* = nullptr>
explicit transpose_(OpT&& a) : base(std::forward<OpT>(a)) {}

The godbolt here shows this effect.

https://godbolt.org/z/pWNzSj

Contributor Author

@t4c1 t4c1 Mar 13, 2020

Getting an rvalue when an rvalue is passed in is fine here. Your godbolt example does not compile because the template argument in S<int> a(b); is wrong; it should be S<int&> a(b);

Contributor Author

Also in your example in the comment you have an error - the constructor for

double b = 10;
transpose_<double> foo(b);

would be:

explicit transpose_(double& a) : base(std::forward<double&>(a)) {}
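
To make the reference-collapsing point in this thread concrete, here is a minimal sketch. `wrapper` and `make_wrapper` are hypothetical stand-ins, not Stan Math's `transpose_` or its helper functions; the sketch only assumes that the expression class takes its argument as `T&&`, where `T` is the class template parameter. With `T = double` the constructor parameter is `double&&` and rejects an lvalue; with `T = double&` it collapses to `double&`, which is the type a deducing factory function produces when it is handed an lvalue.

```cpp
#include <type_traits>
#include <utility>

// Hypothetical stand-in for an expression node that stores its argument.
template <typename T>
struct wrapper {
  // T belongs to the class, so T&& is not a forwarding reference here.
  // Reference collapsing decides the parameter type:
  //   T = double   ->  double&&  (binds only rvalues)
  //   T = double&  ->  double&   (binds only lvalues)
  explicit wrapper(T&& a) : arg(std::forward<T>(a)) {}
  T arg;
};

// Hypothetical factory: T is deduced here, so T&& is a forwarding reference.
// An lvalue argument deduces T = double&, an rvalue deduces T = double.
template <typename T>
wrapper<T> make_wrapper(T&& a) {
  return wrapper<T>(std::forward<T>(a));
}

static_assert(!std::is_constructible<wrapper<double>, double&>::value,
              "wrapper<double> cannot be built from an lvalue");
static_assert(std::is_constructible<wrapper<double&>, double&>::value,
              "wrapper<double&> accepts an lvalue");
static_assert(std::is_constructible<wrapper<double>, double>::value,
              "wrapper<double> accepts an rvalue");
static_assert(
    std::is_same<decltype(make_wrapper(std::declval<double&>())),
                 wrapper<double&>>::value,
    "the factory deduces a reference type for lvalues");
```

Under these assumptions, a wrapper built from an lvalue through the factory holds a reference to the caller's object, while one built from an rvalue owns its own copy.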

 * @return part of kernel with code for this and nested expressions
 */
inline kernel_parts generate(const std::string& i, const std::string& j,
                             const std::string var_name_arg) const {
Collaborator

Why pass this string by value?

Contributor Author

No reason. Fixed.
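
A small sketch of what this change buys: a `const std::string` parameter is still a by-value parameter, so the caller's string is copied on every call, whereas `const std::string&` refers to the caller's object. The function names below are hypothetical and only compare addresses to make the copy visible.

```cpp
#include <string>

// By const value: `s` is a separate object, copied from the caller's string.
inline bool same_object_by_value(const std::string s,
                                 const std::string* caller) {
  return &s == caller;
}

// By const reference: `s` is the caller's string itself, no copy is made.
inline bool same_object_by_ref(const std::string& s,
                               const std::string* caller) {
  return &s == caller;
}
```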

@SteveBronder
Collaborator

Odd, it looks like this is failing the Cholesky test. Ping me when it's fixed and I'll do the review.

@SteveBronder
Collaborator

SteveBronder commented Mar 14, 2020

(also fyi feel free to ping me whenever to remind me to review)

@stan-buildbot
Contributor


| Name | Old Result | New Result | Ratio | Performance change (1 - new / old) |
|---|---|---|---|---|
| gp_pois_regr/gp_pois_regr.stan | 4.87 | 4.91 | 0.99 | -0.78% slower |
| low_dim_corr_gauss/low_dim_corr_gauss.stan | 0.02 | 0.02 | 0.99 | -0.95% slower |
| eight_schools/eight_schools.stan | 0.09 | 0.09 | 1.04 | 3.51% faster |
| gp_regr/gp_regr.stan | 0.22 | 0.22 | 0.99 | -1.48% slower |
| irt_2pl/irt_2pl.stan | 6.45 | 6.44 | 1.0 | 0.13% faster |
| performance.compilation | 89.05 | 86.59 | 1.03 | 2.77% faster |
| low_dim_gauss_mix_collapse/low_dim_gauss_mix_collapse.stan | 7.54 | 7.53 | 1.0 | 0.11% faster |
| pkpd/one_comp_mm_elim_abs.stan | 20.42 | 20.91 | 0.98 | -2.42% slower |
| sir/sir.stan | 92.55 | 93.9 | 0.99 | -1.46% slower |
| gp_regr/gen_gp_data.stan | 0.05 | 0.05 | 0.98 | -1.84% slower |
| low_dim_gauss_mix/low_dim_gauss_mix.stan | 2.95 | 2.95 | 1.0 | -0.09% slower |
| pkpd/sim_one_comp_mm_elim_abs.stan | 0.31 | 0.31 | 1.01 | 0.61% faster |
| arK/arK.stan | 1.75 | 1.74 | 1.01 | 1.07% faster |
| arma/arma.stan | 0.66 | 0.66 | 1.0 | 0.34% faster |
| garch/garch.stan | 0.52 | 0.51 | 1.01 | 1.11% faster |
Mean result: 1.00067582644

Commit hash: bf60476


Machine information
ProductName: Mac OS X
ProductVersion: 10.11.6
BuildVersion: 15G22010

CPU:
Intel(R) Xeon(R) CPU E5-1680 v2 @ 3.00GHz

G++:
Configured with: --prefix=/Applications/Xcode.app/Contents/Developer/usr --with-gxx-include-dir=/usr/include/c++/4.2.1
Apple LLVM version 7.0.2 (clang-700.1.81)
Target: x86_64-apple-darwin15.6.0
Thread model: posix

Clang:
Apple LLVM version 7.0.2 (clang-700.1.81)
Target: x86_64-apple-darwin15.6.0
Thread model: posix

@t4c1
Contributor Author

t4c1 commented Mar 19, 2020

@SteveBronder This is (finally) ready for a review.

@SteveBronder
Collaborator

Cool! I'll take a look tmrw

Collaborator

@SteveBronder SteveBronder left a comment

Only one comment on the .eval() there which seems odd. Rest of it looks good!

@@ -81,7 +81,7 @@ inline matrix_cl<T> tri_inverse(const matrix_cl<T>& A) {
   zero_mat.template zeros<stan::math::matrix_cl_view::Entire>();
   inv_padded.template zeros<stan::math::matrix_cl_view::Entire>();
   if (tri_view == matrix_cl_view::Upper) {
-    inv_mat = transpose(inv_mat);
+    inv_mat = transpose(inv_mat).eval();
Collaborator

Why is there a need for .eval() here? Shouldn't the kernel kick off once it's being assigned to an already constructed matrix_cl?

Contributor Author

The problem is that without eval we have aliasing issues, similar to the ones Eigen has. Since the source and destination are the same matrix, threads that work on the lower/upper triangular parts can (and do) overwrite each other's input values with their outputs. With eval we create a new matrix for the destination. That is also how the individual transpose kernel works.
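
The aliasing hazard can be reproduced on the CPU with a minimal sketch. This is not the kernel generator's code: the functions below are hypothetical, the matrix is a plain row-major `std::vector`, and the nested loop stands in for GPU threads that would really run concurrently and in no fixed order.

```cpp
#include <cstddef>
#include <vector>

// Writes the transpose of the n x n row-major matrix `src` into `dst`.
// Each (i, j) iteration stands in for one GPU thread; here they run one after
// another, which is just one of the orders a real device might produce.
inline void transpose_into(const double* src, double* dst, std::size_t n) {
  for (std::size_t i = 0; i < n; ++i) {
    for (std::size_t j = 0; j < n; ++j) {
      dst[j * n + i] = src[i * n + j];
    }
  }
}

// Source and destination are the same buffer: some elements are overwritten
// before they have been read.
inline std::vector<double> transpose_aliased(std::vector<double> a,
                                             std::size_t n) {
  transpose_into(a.data(), a.data(), n);
  return a;
}

// Analogue of calling .eval(): the result goes to a fresh buffer first.
inline std::vector<double> transpose_via_temporary(
    const std::vector<double>& a, std::size_t n) {
  std::vector<double> tmp(a.size());
  transpose_into(a.data(), tmp.data(), n);
  return tmp;
}
```

For the 2 x 2 input {1, 2, 3, 4}, the aliased version yields {1, 2, 2, 4} in this sequential order, while the version with a temporary yields the correct {1, 3, 2, 4}.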

Collaborator

Okay if you can file an issue to write some docs about this (and then add the docs later) then I'm cool with approving this. We should probably have something like Eigen does if we also have the same aliasing issues.

You can either add that as a module (like in the below comment) or follow the instructions here on adding a new page

Comment on lines 11 to 12
using Eigen::MatrixXd;
using stan::math::matrix_cl;
Collaborator

I think it's usually good practice to put these inside the tests rather than have them floating in global scope

Contributor Author

fixed

@SteveBronder
Collaborator

Also have yinz written up a paper about the kernel generator yet? This is starting to get v v v complicated, and something like a module on the Stan Math site would be nice to have for this

@t4c1
Contributor Author

t4c1 commented Mar 20, 2020

Yep, the paper is almost complete, but it does not go into much more detail than the design doc I wrote. I will think about adding something to the site. Remind me, where is the source for that?

@SteveBronder
Collaborator

> Remind me, where is the source for that?

The source for the site?

I think if we had something like the below it would be fine

/**
 * \ingroup opencl
 * \defgroup opencl_kernel_generator OpenCL Kernel Generator
 * [Brief intro and link to paper]
 */

@SteveBronder
Collaborator

I think just including the paper is fine as long as you have some details on the different types of optimizations going on in here

@SteveBronder
Collaborator

Glad we did that design doc; it made my understanding of this whole thing a lot clearer!

@stan-buildbot
Contributor


| Name | Old Result | New Result | Ratio | Performance change (1 - new / old) |
|---|---|---|---|---|
| gp_pois_regr/gp_pois_regr.stan | 4.96 | 4.84 | 1.03 | 2.46% faster |
| low_dim_corr_gauss/low_dim_corr_gauss.stan | 0.02 | 0.02 | 0.99 | -0.79% slower |
| eight_schools/eight_schools.stan | 0.09 | 0.09 | 1.01 | 1.22% faster |
| gp_regr/gp_regr.stan | 0.22 | 0.22 | 1.01 | 0.62% faster |
| irt_2pl/irt_2pl.stan | 6.43 | 6.46 | 1.0 | -0.36% slower |
| performance.compilation | 87.81 | 86.61 | 1.01 | 1.36% faster |
| low_dim_gauss_mix_collapse/low_dim_gauss_mix_collapse.stan | 7.55 | 7.51 | 1.0 | 0.46% faster |
| pkpd/one_comp_mm_elim_abs.stan | 20.74 | 20.76 | 1.0 | -0.12% slower |
| sir/sir.stan | 92.14 | 91.68 | 1.0 | 0.5% faster |
| gp_regr/gen_gp_data.stan | 0.05 | 0.05 | 1.01 | 1.14% faster |
| low_dim_gauss_mix/low_dim_gauss_mix.stan | 3.12 | 2.96 | 1.05 | 5.12% faster |
| pkpd/sim_one_comp_mm_elim_abs.stan | 0.31 | 0.31 | 1.01 | 1.35% faster |
| arK/arK.stan | 1.75 | 1.73 | 1.01 | 0.62% faster |
| arma/arma.stan | 0.66 | 0.66 | 0.99 | -0.72% slower |
| garch/garch.stan | 0.52 | 0.52 | 1.0 | -0.08% slower |
Mean result: 1.0087877676

Commit hash: f1cbd9e

@stan-buildbot
Contributor


| Name | Old Result | New Result | Ratio | Performance change (1 - new / old) |
|---|---|---|---|---|
| gp_pois_regr/gp_pois_regr.stan | 4.88 | 4.87 | 1.0 | 0.18% faster |
| low_dim_corr_gauss/low_dim_corr_gauss.stan | 0.02 | 0.02 | 0.98 | -2.31% slower |
| eight_schools/eight_schools.stan | 0.09 | 0.09 | 0.99 | -0.76% slower |
| gp_regr/gp_regr.stan | 0.22 | 0.22 | 1.01 | 0.96% faster |
| irt_2pl/irt_2pl.stan | 6.44 | 6.5 | 0.99 | -0.9% slower |
| performance.compilation | 87.72 | 86.61 | 1.01 | 1.26% faster |
| low_dim_gauss_mix_collapse/low_dim_gauss_mix_collapse.stan | 7.52 | 7.53 | 1.0 | -0.04% slower |
| pkpd/one_comp_mm_elim_abs.stan | 20.88 | 20.87 | 1.0 | 0.07% faster |
| sir/sir.stan | 96.24 | 90.79 | 1.06 | 5.66% faster |
| gp_regr/gen_gp_data.stan | 0.05 | 0.05 | 0.98 | -2.32% slower |
| low_dim_gauss_mix/low_dim_gauss_mix.stan | 2.95 | 2.95 | 1.0 | -0.21% slower |
| pkpd/sim_one_comp_mm_elim_abs.stan | 0.31 | 0.32 | 0.99 | -1.45% slower |
| arK/arK.stan | 1.74 | 1.74 | 1.0 | -0.12% slower |
| arma/arma.stan | 0.67 | 0.67 | 1.0 | -0.35% slower |
| garch/garch.stan | 0.52 | 0.51 | 1.0 | 0.47% faster |
Mean result: 1.00042266768

Commit hash: f1cbd9e

@t4c1
Contributor Author

t4c1 commented Mar 23, 2020

I added docs about aliasing to doxygen. Extended documentation about the kernel generator will come in its own PR (I plan two more PRs before that).

Collaborator

@SteveBronder SteveBronder left a comment

lgtm!

@stan-buildbot
Contributor


| Name | Old Result | New Result | Ratio | Performance change (1 - new / old) |
|---|---|---|---|---|
| gp_pois_regr/gp_pois_regr.stan | 4.85 | 4.86 | 1.0 | -0.39% slower |
| low_dim_corr_gauss/low_dim_corr_gauss.stan | 0.02 | 0.02 | 0.99 | -1.15% slower |
| eight_schools/eight_schools.stan | 0.09 | 0.09 | 1.01 | 0.59% faster |
| gp_regr/gp_regr.stan | 0.22 | 0.22 | 0.99 | -1.03% slower |
| irt_2pl/irt_2pl.stan | 6.45 | 6.44 | 1.0 | 0.21% faster |
| performance.compilation | 87.61 | 86.54 | 1.01 | 1.22% faster |
| low_dim_gauss_mix_collapse/low_dim_gauss_mix_collapse.stan | 7.58 | 7.52 | 1.01 | 0.76% faster |
| pkpd/one_comp_mm_elim_abs.stan | 21.3 | 20.17 | 1.06 | 5.29% faster |
| sir/sir.stan | 90.92 | 92.78 | 0.98 | -2.05% slower |
| gp_regr/gen_gp_data.stan | 0.05 | 0.05 | 0.98 | -1.77% slower |
| low_dim_gauss_mix/low_dim_gauss_mix.stan | 2.95 | 2.95 | 1.0 | 0.1% faster |
| pkpd/sim_one_comp_mm_elim_abs.stan | 0.31 | 0.31 | 1.0 | 0.46% faster |
| arK/arK.stan | 1.75 | 1.74 | 1.01 | 0.52% faster |
| arma/arma.stan | 0.66 | 0.65 | 1.01 | 0.98% faster |
| garch/garch.stan | 0.51 | 0.51 | 1.0 | -0.09% slower |
Mean result: 1.00272292657

Commit hash: 9174755

@t4c1 t4c1 merged commit d903537 into stan-dev:develop Mar 23, 2020
@t4c1 t4c1 deleted the cl_kernel_generator_transpose branch November 30, 2020 09:25
4 participants