Add transposition to kernel generator #1769

t4c1 · 2020-03-10T08:51:05Z

Summary

Adds transposition to kernel generator. Existing transposition kernel is removed.

Tests

Added new tests for transposition. Existing tests in opencl/prim/transpose_test.cpp also test the new code.

Side Effects

None.

Checklist

Math issue Implement OpenCL kernel generator #1342
Copyright holder: Tadej Ciglarič

The copyright holder is typically you or your assignee, such as a university or company. By submitting this pull request, the copyright holder is agreeing to license the submitted work under the following licenses:
- Code: BSD 3-clause (https://opensource.org/licenses/BSD-3-Clause)
- Documentation: CC-BY 4.0 (https://creativecommons.org/licenses/by/4.0/)
the basic tests are passing
- unit tests pass (to run, use: ./runTests.py test/unit)
- header checks pass, (make test-headers)
- dependencies checks pass, (make test-math-dependencies)
- docs build, (make doxygen)
- code passes the built in C++ standards checks (make cpplint)
the code is written in idiomatic C++ and changes are documented in the doxygen
the new changes are tested

SteveBronder

Few Qs related to the templates here

SteveBronder · 2020-03-11T19:13:48Z

stan/math/opencl/kernel_generator/transpose.hpp

+/**
+ * Represents a transpose in kernel generator expressions.
+ *
+ * Warning: transposing this expression is not supported!


This warning is kind of confusing, like we can't do transpose(transpose(x))?

Whoops, this is not true anymore. Removed.

SteveBronder · 2020-03-11T19:14:14Z

stan/math/opencl/kernel_generator/transpose.hpp

+ * @tparam Derived derived type
+ * @tparam T_a type of first argument
+ * @tparam T_b type of second argument


These don't match up

Also this may be a better place for clearer names like calling T Expr

SteveBronder · 2020-03-11T19:50:50Z

stan/math/opencl/kernel_generator/transpose.hpp

+  explicit transpose_(T&& a) : base(std::forward<T>(a)) {}
+
+  /**
+   * Creates a deep copy of this expression.
+   * @return copy of \c *this
+   */
+  inline transpose_<std::remove_reference_t<T>> deep_copy() {
+    return transpose_<std::remove_reference_t<T>>{
+        std::get<0>(arguments_).deep_copy()};
+  }


This is kinda weird to me. Let's say we call

double b = 10; transpose_<double> foo(b);

(or insert any non ref type). Then the constructor here is going to be

explicit transpose_(double&& a) : base(std::forward<double>(a)) {}

Here std::forward<double>(a) is actually going to give back an rvalue reference, since std::forward with a non-reference type as its template argument always produces an rvalue (see godbolt below).

I think you need to have a separate template here

template <typename OpT, require_same_t<T, OpT>* = nullptr> explicit transpose_(OpT&& a) : base(std::forward<OpT>(a)) {}

The godbolt here shows this effect.

https://godbolt.org/z/pWNzSj

Having an rvalue if an rvalue is passed in is fine here. Your example on godbolt does not compile because you have the wrong template argument in S<int> a(b); this should be S<int&> a(b);

Also in your example in the comment you have an error - the constructor for

double b = 10; transpose_<double> foo(b);

would be:

explicit transpose_(double& a) : base(std::forward<double&>(a)) {}
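
The exchange above turns on whether `T&&` in the constructor behaves as a forwarding reference. A minimal sketch, assuming a hypothetical `holder<T>` in place of `transpose_<T>` and a hypothetical `make_holder` factory (neither name comes from this PR): because `T` is fixed by the class template rather than deduced by the constructor, `holder<int>` gets a constructor taking `int&&` and rejects lvalues, while `holder<int&>` collapses to a constructor taking `int&`.

```cpp
#include <type_traits>
#include <utility>

// Hypothetical stand-in for transpose_<T>. T is a class template parameter,
// so `T&&` in the constructor is NOT a forwarding reference.
template <typename T>
struct holder {
  T value;
  explicit holder(T&& a) : value(std::forward<T>(a)) {}
};

// T = int: the constructor is holder(int&&) and binds to rvalues only.
static_assert(std::is_constructible<holder<int>, int&&>::value, "");
static_assert(!std::is_constructible<holder<int>, int&>::value, "");

// T = int&: `int& &&` collapses to `int&`, so the constructor is
// holder(int&) and binds to lvalues only.
static_assert(std::is_constructible<holder<int&>, int&>::value, "");
static_assert(!std::is_constructible<holder<int&>, int&&>::value, "");

// Hypothetical factory: here T&& IS a forwarding reference, so T is deduced
// as int& for lvalues and int for rvalues, which selects the matching holder.
template <typename T>
holder<T> make_holder(T&& a) {
  return holder<T>(std::forward<T>(a));
}
```

With a factory like this, callers never spell out the template argument, so the lvalue case ends up as `holder<int&>` (stored by reference) and the rvalue case as `holder<int>` (stored by value).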

SteveBronder · 2020-03-11T19:51:15Z

stan/math/opencl/kernel_generator/transpose.hpp

+   * @return part of kernel with code for this and nested expressions
+   */
+  inline kernel_parts generate(const std::string& i, const std::string& j,
+                               const std::string var_name_arg) const {


Why pass this string by value?

No reason. Fixed.

SteveBronder · 2020-03-14T20:11:01Z

Odd, it looks like this is failing the cholesky test? Ping me when it's fixed and I'll do the review

SteveBronder · 2020-03-14T20:11:20Z

(also fyi feel free to ping me whenever to remind me to review)

stan-buildbot · 2020-03-19T19:17:52Z

Name	Old Result	New Result	Ratio	Performance change( 1 - new / old )
gp_pois_regr/gp_pois_regr.stan	4.87	4.91	0.99	-0.78% slower
low_dim_corr_gauss/low_dim_corr_gauss.stan	0.02	0.02	0.99	-0.95% slower
eight_schools/eight_schools.stan	0.09	0.09	1.04	3.51% faster
gp_regr/gp_regr.stan	0.22	0.22	0.99	-1.48% slower
irt_2pl/irt_2pl.stan	6.45	6.44	1.0	0.13% faster
performance.compilation	89.05	86.59	1.03	2.77% faster
low_dim_gauss_mix_collapse/low_dim_gauss_mix_collapse.stan	7.54	7.53	1.0	0.11% faster
pkpd/one_comp_mm_elim_abs.stan	20.42	20.91	0.98	-2.42% slower
sir/sir.stan	92.55	93.9	0.99	-1.46% slower
gp_regr/gen_gp_data.stan	0.05	0.05	0.98	-1.84% slower
low_dim_gauss_mix/low_dim_gauss_mix.stan	2.95	2.95	1.0	-0.09% slower
pkpd/sim_one_comp_mm_elim_abs.stan	0.31	0.31	1.01	0.61% faster
arK/arK.stan	1.75	1.74	1.01	1.07% faster
arma/arma.stan	0.66	0.66	1.0	0.34% faster
garch/garch.stan	0.52	0.51	1.01	1.11% faster
Mean result: 1.00067582644

Jenkins Console Log
Blue Ocean
Commit hash: bf60476

Machine information

ProductName: Mac OS X ProductVersion: 10.11.6 BuildVersion: 15G22010

CPU:
Intel(R) Xeon(R) CPU E5-1680 v2 @ 3.00GHz

G++:
Configured with: --prefix=/Applications/Xcode.app/Contents/Developer/usr --with-gxx-include-dir=/usr/include/c++/4.2.1
Apple LLVM version 7.0.2 (clang-700.1.81)
Target: x86_64-apple-darwin15.6.0
Thread model: posix

Clang:
Apple LLVM version 7.0.2 (clang-700.1.81)
Target: x86_64-apple-darwin15.6.0
Thread model: posix

t4c1 · 2020-03-19T20:45:08Z

@SteveBronder This is (finally) ready for a review.

SteveBronder · 2020-03-20T02:29:35Z

Cool! I'll take a look tmrw

SteveBronder

Only one comment on the .eval() there which seems odd. Rest of it looks good!

SteveBronder · 2020-03-20T19:21:43Z

stan/math/opencl/tri_inverse.hpp

@@ -81,7 +81,7 @@ inline matrix_cl<T> tri_inverse(const matrix_cl<T>& A) {
  zero_mat.template zeros<stan::math::matrix_cl_view::Entire>();
  inv_padded.template zeros<stan::math::matrix_cl_view::Entire>();
  if (tri_view == matrix_cl_view::Upper) {
-    inv_mat = transpose(inv_mat);
+    inv_mat = transpose(inv_mat).eval();


Why is there a need for .eval() here? Shouldn't the kernel kick off once it's being assigned to an already constructed matrix_cl?

The problem is that without eval we have aliasing issues, similar to the ones Eigen has. Since the source and destination are the same matrix, threads that work on the lower and upper triangular parts can (and do) overwrite each other's input values with their outputs. With eval we create a new matrix for the destination. That is also how the individual transpose kernel works.
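
The aliasing problem described above can be reproduced on the CPU. The sketch below is not the kernel generator's code: `Mat`, `transpose_into`, `transpose_aliased`, and `transpose_evaluated` are hypothetical names, and a sequential loop stands in for GPU threads. It only illustrates why writing a transpose into its own source corrupts values, and why evaluating into a fresh matrix first (the role `.eval()` plays here) avoids that.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Row-major n x n matrix in a flat buffer (hypothetical helper type).
using Mat = std::vector<double>;

// Element-wise transpose: dst(i, j) = src(j, i).
// If src and dst are the same buffer, some reads see values that an
// earlier write already replaced.
void transpose_into(const Mat& src, Mat& dst, std::size_t n) {
  assert(src.size() == n * n && dst.size() == n * n);
  for (std::size_t i = 0; i < n; ++i) {
    for (std::size_t j = 0; j < n; ++j) {
      dst[i * n + j] = src[j * n + i];
    }
  }
}

// Aliased version: source and destination share storage.
Mat transpose_aliased(Mat a, std::size_t n) {
  transpose_into(a, a, n);
  return a;
}

// Evaluated version: the result goes to a new matrix first and is
// assigned back afterwards.
Mat transpose_evaluated(Mat a, std::size_t n) {
  Mat tmp(a.size());
  transpose_into(a, tmp, n);
  a = tmp;
  return a;
}
```

For the 2 x 2 input {1, 2, 3, 4}, the evaluated version yields {1, 3, 2, 4}, while the aliased version yields {1, 3, 3, 4} because the element at (0, 1) is overwritten before it is read for (1, 0).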

Okay, if you can file an issue to write some docs about this (and then add the docs later), then I'm cool with approving this. We should probably have something like Eigen does if we also have the same aliasing issues.

You can either add that as a module (like in the below comment) or follow the instructions here on adding a new page

SteveBronder · 2020-03-20T19:24:36Z

test/unit/math/opencl/kernel_generator/transpose_test.cpp

+using Eigen::MatrixXd;
+using stan::math::matrix_cl;


I think it's usually good practice to put these inside the tests rather than have them floating in the global scope

SteveBronder · 2020-03-20T19:30:46Z

Also have yinz written up a paper about the kernel generator yet? This is starting to get v v v complicated, and something like a module on the stan math site would be nice to have for this

t4c1 · 2020-03-20T21:33:48Z

Yep, the paper is almost complete, but it does not go into much more detail than the design doc I wrote. I will think about adding something to the site. Remind me, where is the source for that?

SteveBronder · 2020-03-20T22:30:27Z

Remind me, where is the source for that?

The source for the site?

I think if we had something like the below it would be fine

/**
 * \ingroup opencl
 * \defgroup opencl_kernel_generator OpenCL Kernel Generator
 * [Brief intro and link to paper]
 */

SteveBronder · 2020-03-20T22:32:50Z

I think just including the paper is fine as long as you have some details on the different types of optimizations going on in here

SteveBronder · 2020-03-20T22:35:22Z

Glad we did that design doc; it made my understanding of this whole thing a lot clearer!

stan-buildbot · 2020-03-21T03:06:58Z

Name	Old Result	New Result	Ratio	Performance change( 1 - new / old )
gp_pois_regr/gp_pois_regr.stan	4.96	4.84	1.03	2.46% faster
low_dim_corr_gauss/low_dim_corr_gauss.stan	0.02	0.02	0.99	-0.79% slower
eight_schools/eight_schools.stan	0.09	0.09	1.01	1.22% faster
gp_regr/gp_regr.stan	0.22	0.22	1.01	0.62% faster
irt_2pl/irt_2pl.stan	6.43	6.46	1.0	-0.36% slower
performance.compilation	87.81	86.61	1.01	1.36% faster
low_dim_gauss_mix_collapse/low_dim_gauss_mix_collapse.stan	7.55	7.51	1.0	0.46% faster
pkpd/one_comp_mm_elim_abs.stan	20.74	20.76	1.0	-0.12% slower
sir/sir.stan	92.14	91.68	1.0	0.5% faster
gp_regr/gen_gp_data.stan	0.05	0.05	1.01	1.14% faster
low_dim_gauss_mix/low_dim_gauss_mix.stan	3.12	2.96	1.05	5.12% faster
pkpd/sim_one_comp_mm_elim_abs.stan	0.31	0.31	1.01	1.35% faster
arK/arK.stan	1.75	1.73	1.01	0.62% faster
arma/arma.stan	0.66	0.66	0.99	-0.72% slower
garch/garch.stan	0.52	0.52	1.0	-0.08% slower
Mean result: 1.0087877676

Jenkins Console Log
Blue Ocean
Commit hash: f1cbd9e

stan-buildbot · 2020-03-21T10:37:40Z

Name	Old Result	New Result	Ratio	Performance change( 1 - new / old )
gp_pois_regr/gp_pois_regr.stan	4.88	4.87	1.0	0.18% faster
low_dim_corr_gauss/low_dim_corr_gauss.stan	0.02	0.02	0.98	-2.31% slower
eight_schools/eight_schools.stan	0.09	0.09	0.99	-0.76% slower
gp_regr/gp_regr.stan	0.22	0.22	1.01	0.96% faster
irt_2pl/irt_2pl.stan	6.44	6.5	0.99	-0.9% slower
performance.compilation	87.72	86.61	1.01	1.26% faster
low_dim_gauss_mix_collapse/low_dim_gauss_mix_collapse.stan	7.52	7.53	1.0	-0.04% slower
pkpd/one_comp_mm_elim_abs.stan	20.88	20.87	1.0	0.07% faster
sir/sir.stan	96.24	90.79	1.06	5.66% faster
gp_regr/gen_gp_data.stan	0.05	0.05	0.98	-2.32% slower
low_dim_gauss_mix/low_dim_gauss_mix.stan	2.95	2.95	1.0	-0.21% slower
pkpd/sim_one_comp_mm_elim_abs.stan	0.31	0.32	0.99	-1.45% slower
arK/arK.stan	1.74	1.74	1.0	-0.12% slower
arma/arma.stan	0.67	0.67	1.0	-0.35% slower
garch/garch.stan	0.52	0.51	1.0	0.47% faster
Mean result: 1.00042266768

Jenkins Console Log
Blue Ocean
Commit hash: f1cbd9e

t4c1 · 2020-03-23T08:21:28Z

I added docs about aliasing to doxygen. Extended documentation about kernel generator will come in its own PR (I plan two more PRs before that).

SteveBronder

lgtm!

stan-buildbot · 2020-03-23T15:20:09Z

Name	Old Result	New Result	Ratio	Performance change( 1 - new / old )
gp_pois_regr/gp_pois_regr.stan	4.85	4.86	1.0	-0.39% slower
low_dim_corr_gauss/low_dim_corr_gauss.stan	0.02	0.02	0.99	-1.15% slower
eight_schools/eight_schools.stan	0.09	0.09	1.01	0.59% faster
gp_regr/gp_regr.stan	0.22	0.22	0.99	-1.03% slower
irt_2pl/irt_2pl.stan	6.45	6.44	1.0	0.21% faster
performance.compilation	87.61	86.54	1.01	1.22% faster
low_dim_gauss_mix_collapse/low_dim_gauss_mix_collapse.stan	7.58	7.52	1.01	0.76% faster
pkpd/one_comp_mm_elim_abs.stan	21.3	20.17	1.06	5.29% faster
sir/sir.stan	90.92	92.78	0.98	-2.05% slower
gp_regr/gen_gp_data.stan	0.05	0.05	0.98	-1.77% slower
low_dim_gauss_mix/low_dim_gauss_mix.stan	2.95	2.95	1.0	0.1% faster
pkpd/sim_one_comp_mm_elim_abs.stan	0.31	0.31	1.0	0.46% faster
arK/arK.stan	1.75	1.74	1.01	0.52% faster
arma/arma.stan	0.66	0.65	1.01	0.98% faster
garch/garch.stan	0.51	0.51	1.0	-0.09% slower
Mean result: 1.00272292657

Jenkins Console Log
Blue Ocean
Commit hash: 9174755

t4c1 added 2 commits March 10, 2020 09:44

added transpose to kernel_generator

b3c5a6b

removed existing transpose kernel

969c7dc

t4c1 changed the title ~~Cl kernel generator transpose~~ Add transposition to kernel generator Mar 10, 2020

t4c1 and others added 4 commits March 10, 2020 09:57

fix cpplint

b3196f5

replaced transpose with another kernel in kernel_cl_test.cpp

f931e1d

[Jenkins] auto-formatting by clang-format version 6.0.0 (tags/google/stable/2017-11-14)

7296a2e

Merge branch 'develop' into cl_kernel_generator_transpose

f101f37

SteveBronder requested changes Mar 11, 2020

t4c1 added 2 commits March 13, 2020 11:26

addressed review comments

94a4b11

Merge branch 'develop' into cl_kernel_generator_transpose

a65bb77

t4c1 and others added 12 commits March 17, 2020 14:14

Fixed the bug and enabled operation_cl to store arguments by reference

987e3d5

Merge commit 'a0e4ba0290262f0b5ce962648cd3aa50ae61d08b' into HEAD

a994e42

[Jenkins] auto-formatting by clang-format version 6.0.0 (tags/google/stable/2017-11-14)

a72d379

fix headers check

5b425f8

fix view calculation of transpose

f90ab06

Merge commit '3a66a331aecce071ebc402dcaa3213cff075feec' into HEAD

92f0b01

[Jenkins] auto-formatting by clang-format version 5.0.0-3~16.04.1 (tags/RELEASE_500/final)

4a0c836

fix cpplint

60896bc

fixed aliasing in tri_inverse

57712ee

Merge commit 'f3cbe214b3da41316694673f715a434e27a9e6d0' into HEAD

2e8796d

[Jenkins] auto-formatting by clang-format version 6.0.0 (tags/google/stable/2017-11-14)

805a293

fixed transpose test

bf60476

SteveBronder requested changes Mar 20, 2020

Fixed usings in test

f1cbd9e

added docs about aliasing

9174755

t4c1 mentioned this pull request Mar 23, 2020

adds an apply function and cleans adj_jac_apply to use it #1791

Merged

5 tasks

SteveBronder approved these changes Mar 23, 2020

t4c1 merged commit d903537 into stan-dev:develop Mar 23, 2020

SteveBronder mentioned this pull request Apr 16, 2020

Stan Math 3.2 release #1826

Closed

bbbales2 mentioned this pull request Apr 20, 2020

Stanc3 release for Cmdstan 2.23 stan-dev/stanc3#498

Closed

t4c1 deleted the cl_kernel_generator_transpose branch November 30, 2020 09:25
