Change `softmax` test to use `maxf`. #10219

Merged: MaheshRavishankar merged 1 commit into iree-org:main from MaheshRavishankar:softmax_test_fix on Aug 26, 2022.
Conversation
The e2e softmax test uses `cmpf` -> `select` for max operations. Use `maxf` instead. This allows the op to be vectorized. The TOSA to Linalg lowering has recently been updated to do the same (this test was derived from an older TOSA to Linalg lowering).
pzread approved these changes on Aug 26, 2022.
benvanik added a commit that referenced this pull request on Aug 30, 2022:
commit f62ec3b
Author: bjacob <benoitjacob@google.com>
Date: Tue Aug 30 14:51:10 2022 -0400

VMVX mmt4d ukernel (#10239)

This brings an initial (unoptimized, reference code only) mmt4d ukernel, both `f32f32f32` and `i8i8i32`. It is covered by the e2e matmul tests: if you purposefully introduce a numerical bug in the ukernel function `iree_vmvx_mmt4d_f32f32f32`, then the test `iree/tests/e2e/matmul/e2e_matmul_mmt4d_f32_small_ukernel_vmvx_local-task` fails. Ditto for `i8i8i32`.

That the whole reference code is for now in `module.c`, as opposed to being nicely isolated in `iree/builtings/ukernel`, is temporary. I have a few questions to ask about the placeholders in this directory, but it will be much more concrete to discuss after we are done reviewing this PR, so I hope it's OK to split that out as a separate code move in another PR.

A couple of nontrivial decisions in this PR:
* In `LowerLinalgMicrokernels.cpp` there was an `isUnitInnerStride` helper function that was only applied to 2D memrefs. The underlying question is how much layout generality we want ukernels to support; the existing code embodied a decision on this for 2D arrays, but mmt4d deals with 4D arrays, so the question was how to generalize from 2D to 4D arrays. I chose to generalize `isUnitInnerStride` into `areInnerDimsContiguousRowMajor`. See the comment where it is defined. The lit test, `lower_linalg_microkernels.mlir`, has testcases covering several edge cases here.
* Similar to what we decided last week for matmul in #10211, there was the question of how to deal with an accumulator that is nonzero in the general case but that we know will often be zero in practice, so that we want to retain the ability to take advantage of that. This is handled here exactly like it was for matmul in #10211. I even reused the flag symbolic constant rather than creating a separate one. Yay for weak typing.
commit ddaaaaa
Author: Jerry Wu <cheyuw@google.com>
Date: Tue Aug 30 11:10:40 2022 -0700

Generate CMake rules to download and import models (#10167)

commit 753ac4d
Author: Geoffrey Martin-Noble <gcmn@google.com>
Date: Tue Aug 30 11:04:59 2022 -0700

Remove RV32 Mobile Bert Compilation Benchmark (#10234)

Building the benchmarks is currently the critical path in CI latency, taking almost 25 minutes for just that job, after it waits 25 minutes for the TF integrations binaries (was that always so slow??). [![ci_run_graph](https://user-images.githubusercontent.com/5732088/187279027-21137775-5a3b-4ddf-ae4d-42e39051e7b2.png)](https://github.com/iree-org/iree/actions/runs/2950708667)

Of that time, 20 minutes is spent compiling this one vmfb, which we only do so we can get statistics on how long it takes to compile. I sampled the ten slowest build actions from a local build of the benchmarks:

```
1179.39 benchmark_suites/TFLite/vmfb/mobilebert-baseline-tf2-quant.tflite.mlir-22179362840f853977acc734ee75e6ce.vmfb
 216.321 benchmark_suites/TFLite/vmfb/mobilebert-baseline-tf2-quant.tflite.mlir-53b16b00b2d02162b1706d73ab6270b4.vmfb
 159.585 benchmark_suites/TFLite/vmfb/mobilebert-baseline-tf2-quant.tflite.mlir-3bcb3f959e9f123bbaa01aa4d237bab8.vmfb
 146.027 benchmark_suites/TFLite/vmfb/mobilebert-baseline-tf2-quant.tflite.mlir-73879267ae95d3551e73c7f078f4410d.vmfb
 109.922 benchmark_suites/TFLite/vmfb/mobilebert-baseline-tf2-quant.tflite.mlir-cf781c710ad5c59b5e7f205b17b3c37b.vmfb
 104.864 benchmark_suites/TFLite/vmfb/mobilebertsquad.tflite.mlir-fddd07b06a1abf9f5d4ea97225066f01.vmfb
  88.665 benchmark_suites/TFLite/vmfb/mobilebertsquad.tflite.mlir-4fe50b8684bdd4684941c8a5698d3a48.vmfb
  88.316 benchmark_suites/TFLite/vmfb/mobilebertsquad.tflite.mlir-833fba075c9cf413b8acbea9be0acade.vmfb
  87.238 benchmark_suites/TFLite/vmfb/mobilebert-baseline-tf2-float.tflite.mlir-8a916ab990bd1cb5521dce6dd6a5ac6a.vmfb
  86.905 benchmark_suites/TFLite/vmfb/mobilebert-baseline-tf2-float.tflite.mlir-e304860762a8369f86b813d45b3c699a.vmfb
```

This one is the clear winner. I don't think it's worth running this compilation only to discover what we already know (it is very slow to run this compilation).

Tested:
- `build_benchmarks` for this PR ran in 7 minutes instead of 25.
- Ran riscv benchmark pipeline.

commit a456db6
Author: CindyLiu <hcindyl@google.com>
Date: Mon Aug 29 13:57:29 2022 -0700

Add iree_bytecode_module and iree_c_module static lib support (#10231)

Check and parse the `iree-llvm-static-library-output-path` flag to add static library object support. This makes secondary functions like iree_static_linker_test cleaner.

commit 6d4b129
Author: Thomas <thomasraoux@google.com>
Date: Mon Aug 29 13:32:42 2022 -0700

Fix gcc build (#10235)

Prevent an ambiguous constructor call.

commit 6fa18e0
Author: Kojo Acquah <KoolJBlack@users.noreply.github.com>
Date: Mon Aug 29 12:20:04 2022 -0700

Implementation of GPU Shared Memory Transpose Pipeline (#10209)

Currently only `32x32` aligned 2D transposes are supported. Based on https://developer.nvidia.com/blog/efficient-matrix-transpose-cuda-cc/, this uses a fixed tile size of `32x32` and a workgroup size of `{8x32}` to perform a vectorized copy for the transpose. The tile is padded to `32x33` to reduce bank conflicts. Note that bank conflicts aren't fully eliminated due to the use of vector load/store of width 4.

Todo:
* Move beyond a single hard-coded workgroup and tile size?
* Handle non-aligned transposes
* Handle dynamically sized transposes

Related to #10005

commit 546ffcb
Author: Jerry Wu <cheyuw@google.com>
Date: Mon Aug 29 11:49:58 2022 -0700

Fix the typos of riscv names in CI (#10232)

commit da21e83
Author: Han-Chung Wang <hanchung@google.com>
Date: Tue Aug 30 02:29:41 2022 +0800

Optimize tiling sizes heuristics for elementwise dispatches. (#10179)

In the past, small numbers could be picked because we want vectorization enabled for all the kernels. This PR picks more reasonable tiling sizes and addresses tiny-dispatch issues. The peeling pipeline works in IREE, and the PR moves elementwise dispatches (and copy-only dispatches) to use the peeling approach; in this case, we're still able to vectorize the dispatches. The PR also changes the logic to limit the unroll factor when computing the vector-level tiling sizes. This avoids generating many operations, which saves considerable compilation time and binary size for quantized models. It also improves performance of the models IREE is tracking for all CPU backends.

Fixes #9660

commit 249c813
Author: Thomas <thomasraoux@google.com>
Date: Mon Aug 29 11:06:31 2022 -0700

[LLVMGPU] Move bufferization after vectorization for matmulSIMT (#10217)

Transition the matmul SIMT pipeline to do vectorization before bufferization. This relies on the alloc_tensor op to model shared memory promotion and foreach_thread for the tiling at the tensor level. Also significantly simplify the vectorization pass by removing patterns that are no longer needed. This will allow us to do more optimizations at the tensor level going forward.

commit 3f173de
Author: Geoffrey Martin-Noble <gcmn@google.com>
Date: Mon Aug 29 09:52:45 2022 -0700

Pin GitHub runner configuration to a specific commit (#10218)

This changes the startup script on the runners to fetch configuration from a specific commit, rather than directly from tip of tree on `main`. That makes it possible to actually test, canary, and roll back configuration changes, almost as if this were a real production system. There are some early-stage scripts to automate the creation of templates and managed instance group roll-outs. I've also set up functionality to have testing runner groups. Because of the way targeting runners works, that means that workflows have to explicitly specify the environment so that testing runners *don't* pick up the job. The testing group will allow testing new runner configurations on presubmit as much as possible.

Of course, for this change, I actually can't do the safe thing because I can't test adding the extra tag to the runners. I've still pushed a new template to the testing instance group and set the `build_all` job for this PR to run on it by targeting a specific instance by hostname: https://github.com/iree-org/iree/runs/8027570693. (Note that that run actually had a failure in the asan workflow, but that wasn't running on my runner and I don't think it could possibly be related.) Note that because this doesn't alter the `config/` directory, submitting it will not have any effect on the current runners.

skip-ci

Peeled out of #10133

commit c50bac3
Author: Geoffrey Martin-Noble <gcmn@google.com>
Date: Mon Aug 29 09:49:10 2022 -0700

Build Linux releases on big managed runners (#10126)

This speeds the Linux builds up a bit, bringing the time for the longest job down from ~5 hours to ~20 minutes. Note that this is *only* the Linux jobs; the Mac ones still take about 4 hours. This should still help when iterating on the release, though, and gives faster failure indicators (it was indeed helpful when I was iterating here). I ran into issues when testing because I was using a package suffix in the workflow dispatch, which evidently had never actually been tested because it was totally broken. This gave me a lot of wonderful opportunity to bash my head against bash, and I reworked a lot of the `build_linux_package.sh` script. In retrospect, I wish I'd just removed the `package_suffix` feature.

Test run: https://github.com/iree-org/iree/actions/runs/2923210349

skip-ci

commit c338ae9
Author: MaheshRavishankar <1663364+MaheshRavishankar@users.noreply.github.com>
Date: Fri Aug 26 15:25:22 2022 -0700

Cherry pick D132720 (#10227)

Cherry pick: llvm/llvm-project@a235562
Cherry pick: llvm/llvm-project@766f5d8

commit df4c96e
Author: MaheshRavishankar <1663364+MaheshRavishankar@users.noreply.github.com>
Date: Fri Aug 26 14:09:45 2022 -0700

Cherry-pick llvm/llvm-project@7744253 (#10226)

Towards landing #10177

commit b533909
Author: bjacob <benoitjacob@google.com>
Date: Fri Aug 26 15:16:39 2022 -0400

Support the i8i8i32 case in vmvx matmul ukernel. (#10222)

commit 62d2be5
Author: Scott Todd <scotttodd@google.com>
Date: Fri Aug 26 11:46:33 2022 -0700

[NFC] Slight cleanup in HAL compiler passes. (#10223)

commit 8a48e10
Author: Thomas <thomasraoux@google.com>
Date: Fri Aug 26 11:09:30 2022 -0700

Cherry-pick llvm/llvm-project@2e34599b and llvm/llvm-project@1ee0d60a (#10221)

* commit 2e34599bfd01e5b20e09bd6af590a52d6a63a64c
* commit 1ee0d60a9be5dcbe3234b81a1c93e6a206a88154

commit cf5a5d5
Author: MaheshRavishankar <1663364+MaheshRavishankar@users.noreply.github.com>
Date: Fri Aug 26 10:10:40 2022 -0700

Find root by traversing the compute ops in reverse. (#10210)

Since most of the code generation uses tile + fuse, where the consumer is tiled and the producer is fused with it, find the root by traversing the ops in reverse.

Issue #10208

commit 272ea37
Author: MaheshRavishankar <1663364+MaheshRavishankar@users.noreply.github.com>
Date: Fri Aug 26 10:00:56 2022 -0700

Change `softmax` test to use `maxf`. (#10219)

The e2e softmax test uses `cmpf` -> `select` for max operations. Use `maxf` instead. This allows the op to be vectorized. The TOSA to Linalg lowering has recently been updated to do the same (this test was derived from an older TOSA to Linalg lowering).

Related to PR #10177

commit 233795f
Author: bjacob <benoitjacob@google.com>
Date: Fri Aug 26 12:00:09 2022 -0400

Tidy the VMVX ukernels matmul interface (#10211)

This makes the VMVX ukernel interface for matmul somewhat sustainable and generalizable. It's official now that the only supported case is when all operands are row-major (more general support might be wanted in the future, but it would have to allow separate storage orders for each operand in order to be likely to be used). The only flag now is one bit telling whether to accumulate into an existing accumulator or just zero it. At the moment we always accumulate, but we could soon generate calls without the accumulate flag when compiling code where the accumulator operand is known to be zero-filled. In terms of optimized runtime code, it is nearly zero overhead to support that boolean degree of generality in the ukernel. The "reference" ukernel impl is changed to be a little more suggestive of how an optimized impl would look. The alpha and beta parameters are gone: they were hard to generalize to integer data types, and they were mostly gratuitous generality anyway (they didn't do the same as the namesake BLAS GEMM parameters).

commit 094ec6d
Author: Lei Zhang <antiagainst@google.com>
Date: Thu Aug 25 19:34:02 2022 -0400

Integrate llvm/llvm-project@71604f4c4c30 (#10204)

* Reset third_party/llvm-project: 8f45b5a7a90f24ae1dabeff161e22594039a8b0a (2022-08-24 20:26:48 +0000): RISCV: permit unaligned nop-slide padding emission
* Updated tensorflow/tensorflow@aed7775
* Updated tensorflow/mlir-hlo@3b1b023
* Fixed mhlo include paths

commit 4f0c5b1
Author: Jakub Kuderski <kubak@google.com>
Date: Thu Aug 25 19:05:01 2022 -0400

Add debug option to dump LLVMCPU/GPU pass pipeline (#10214)

This is enabled using the `--debug-only=iree-llvm-cpu-lowering-pass-pipeline` and `--debug-only=iree-llvm-gpu-lowering-pass-pipeline` flags. The SPIR-V codegen path has a similar option.
commit acb7355
Author: bjacob <benoitjacob@google.com>
Date: Thu Aug 25 16:24:21 2022 -0400

Add e2e matmul tests on vmvx+ukernels (float32-only for now) (#10193)

Types other than `float32` are blocked on vmvx ukernel support for them (#9903). I'm interested in landing float32 support early because the path to supporting other data types goes through breaking changes in the existing vmvx ukernel interface for matmul (limiting the generality of the BLAS-inspired interface, particularly the `alpha` and `beta` parameters), so I want to have e2e tests in place at the start of that process.

commit 8863f9e
Author: MaheshRavishankar <1663364+MaheshRavishankar@users.noreply.github.com>
Date: Thu Aug 25 12:02:11 2022 -0700

Cherry pick llvm/llvm-project@71604f4 (#10207)

Fixes #10194

commit 22e6bd4
Author: bjacob <benoitjacob@google.com>
Date: Thu Aug 25 15:00:18 2022 -0400

try to be compatible with more pyyaml versions (#10206)

commit da6829d
Author: Scott Todd <scotttodd@google.com>
Date: Thu Aug 25 11:36:04 2022 -0700

Replace dedicated host_tools CI job with superset build_all. (#10195)

Relates to #9855. These builds shared the same options but just built different targets. Just building the tools _is_ faster than building the tools and tests, but not by enough to justify having a separate job. The build_host_tools.sh script is still referenced by some samples, so I think it's worth keeping for a bit.

* Spell out `build-dir-gcs-artifact` and `binaries-gcs-artifact` to match other output names
* Remove host_tools.yml
* Replace host_tools_assertions with build_all. Note that this uses GCS instead of upload-artifact/download-artifact for transferring archives between jobs
* Sort jobs in `needs:` so the summary graph groups jobs as expected

Note: `${BUILD_DIR}/install` is implicit. It could be made explicit with more plumbing.

Co-authored-by: Geoffrey Martin-Noble <gcmn@google.com>

commit 38e718e
Author: bjacob <benoitjacob@google.com>
Date: Thu Aug 25 13:32:55 2022 -0400

Fix printing of matrices on test failure: was overflowing (#10202)

commit d8cabf7
Author: Kevin Gleason <gleasonk@google.com>
Date: Thu Aug 25 12:20:52 2022 -0400

Allow blank issues to be created (#10197)

Currently clicking the "Blank Issue" button loops you back to the issue chooser page because blank issues are disabled. When disabled, the following redirect is in place: https://github.com/iree-org/iree/issues/new --> https://github.com/iree-org/iree/issues/new/choose

Background: I based the StableHLO issues config off this file and noticed that blank issues are not working on that repo because they are disabled. Flipping this boolean did the trick in openxla/stablehlo.

commit 579d527
Author: Matthias Springer <springerm@google.com>
Date: Thu Aug 25 09:37:22 2022 +0200

Add CPU matmul benchmark test (#10174)

This test illustrates how a simple matmul example can be compiled with the transform dialect and then benchmarked. Parameter search will use the commands that are used in this test.

commit 85171e9
Author: Lei Zhang <antiagainst@google.com>
Date: Wed Aug 24 21:30:51 2022 -0400

Cherry-pick MHLO dependency fix to fix release (#10198)

commit 1adcebb
Merge: 8301a5c 7fe1437
Author: Ben Vanik <ben.vanik@gmail.com>
Date: Wed Aug 24 18:14:42 2022 -0700

Merge pull request #10181 from iree-org/benvanik-execute-commands

Secondary command buffers can now be executed from primary command buffers via iree_hal_command_buffer_execute_commands. During recording of nested command buffers, push descriptors can indirectly reference slots in a binding table provided with each execution request. This enables the same reusable command buffer to be executed many times with unique bindings (even with prior executions in-flight), which is a common pattern with queue-ordered allocations.

In the future we could allow indirect bindings on primary command buffers as well, but that requires more work in each backend to support, and for now making it nested-only lets us turn on the feature incrementally. For now nothing supports either nested or indirect bindings, so this is pure plumbing. The compiler has the HAL ops modeled but nothing is lowering into them yet; a pass that memoizes portions of streams and sets up the indirect binding references is required. Progress on #10144. Bumps bytecode version due to HAL changes.

commit 7fe1437
Author: Ben Vanik <ben.vanik@gmail.com>
Date: Tue Aug 23 22:10:56 2022 -0700

Disabling ASAN fully_connected.mlir test due to swiftshader issue. Same behavior as the other excluded tests from #5715.

commit dd93b3c
Author: Ben Vanik <ben.vanik@gmail.com>
Date: Tue Aug 23 16:28:16 2022 -0700

Bumping bytecode version due to breaking HAL changes.

commit 9bd7031
Author: Ben Vanik <ben.vanik@gmail.com>
Date: Tue Aug 23 10:44:30 2022 -0700

Plumbing support for nested command buffers and binding tables.

Secondary command buffers can now be executed from primary command buffers via iree_hal_command_buffer_execute_commands. During recording of nested command buffers, push descriptors can indirectly reference slots in a binding table provided with each execution request. This enables the same reusable command buffer to be executed many times with unique bindings (even with prior executions in-flight), which is a common pattern with queue-ordered allocations. In the future we could allow indirect bindings on primary command buffers as well, but that requires more work in each backend to support, and for now making it nested-only lets us turn on the feature incrementally. The compiler has the HAL ops modeled but nothing is lowering into them yet; a pass that memoizes portions of streams and sets up the indirect binding references is required. Progress on #10144.

commit 8301a5c
Author: Scott Todd <scotttodd@google.com>
Date: Wed Aug 24 16:39:27 2022 -0700

Rework build_benchmarks to reuse already built host tools. (#10190)

This should address #4662 (comment). This workflow is currently our slowest, taking ~32 minutes (half of that time is spent rebuilding `iree-compile`, and that's 30 minutes _after_ blocking on the 20-minute build_tf_integrations job). New timing is ~20 minutes (saving 10 minutes): https://github.com/iree-org/iree/runs/8004780350?check_suite_focus=true

commit cca2ff6
Author: bjacob <benoitjacob@google.com>
Date: Wed Aug 24 18:46:11 2022 -0400

Handle rank-reducing subviews in ResolveBufferDescriptors (#10192)

commit d9e6eb7
Author: CindyLiu <hcindyl@google.com>
Date: Wed Aug 24 10:54:06 2022 -0700

Update the candidate commitish value with the last green commit (#10183)

Make it consistent with the rest of the release steps.

commit 1d55c6c
Author: Thomas <thomasraoux@google.com>
Date: Wed Aug 24 10:06:44 2022 -0700

clean up workaround after upstream fix (#10188)

commit c9e9482
Author: MaheshRavishankar <1663364+MaheshRavishankar@users.noreply.github.com>
Date: Wed Aug 24 08:20:05 2022 -0700

Cherry-pick llvm/llvm-project@a7bfdc2 (#10150)

commit 00d34d1
Author: MaheshRavishankar <1663364+MaheshRavishankar@users.noreply.github.com>
Date: Wed Aug 24 07:59:13 2022 -0700

NFC: Refactoring to make extending fusion heuristics in dispatch formation easier. (#10187)

Minor refactoring to allow for extending fusion heuristics for fusing roots with producers.

commit 3c69ea9
Author: Jakub Kuderski <kubak@google.com>
Date: Wed Aug 24 10:31:08 2022 -0400

[iree-run-module] Do not abort when `Run` fails. (#10186)

commit 63d4693
Author: Jakub Kuderski <kubak@google.com>
Date: Wed Aug 24 10:30:50 2022 -0400

[iree-run-module] Clarify how to pass scalar inputs. NFC. (#10185)

Be more explicit and provide an example.

commit 2ec165b
Author: Lei Zhang <antiagainst@google.com>
Date: Wed Aug 24 00:33:04 2022 -0400

Integrate llvm/llvm-project@4332b049edf6 (#10180)

* Reset third_party/llvm-project: 4332b049edf6ccf98c9e31dcc983760a89f01d40 (2022-08-23 17:37:12 +0800): [docs] Add examples for printing asynchronous stack for coroutines
* Updated tensorflow/tensorflow@55791c2
* Updated tensorflow/mlir-hlo@184a76a
* Fixed mhlo/chlo enum split.

commit ae72b95
Author: CindyLiu <hcindyl@google.com>
Date: Tue Aug 23 15:26:08 2022 -0700

Add llvm static library linker test targets (#10149)

Add a cmake function to build/test llvm static library modules with the llvm-cpu compiler target backend, executed using the local-sync runtime HAL driver. The executable is linked against a simple runtime runner generated from a template. Add simple e2e mlir linker tests in `tests/e2e/models`.

commit e4dc88c
Author: Rob Suderman <suderman@google.com>
Date: Tue Aug 23 10:54:02 2022 -0700

Update flex ops test for the TFLite front-end test (#10164)

commit 57ec69d
Author: Thomas <thomasraoux@google.com>
Date: Tue Aug 23 09:01:23 2022 -0700

[LLVMGPU] Start transitioning to scf.foreach for second level tiling (#10166)

This will allow doing distribution at the tensor level.

commit bd33104
Merge: 35d28b9 d1ca241
Author: Ben Vanik <ben.vanik@gmail.com>
Date: Tue Aug 23 08:55:34 2022 -0700

Merge pull request #10170 from iree-org/benvanik-pipeline-layout-3

Replacing descriptor set layout usage with a flag bitfield. Descriptor sets are only used in layouts, and the usage is now always push-only. As we support things like binding tables we may want to indicate which bindings may come from tables, and if we want to carry access information (which bindings are read-only, etc.) we'll need somewhere for that too: instead of having 4 enums with 2 options each, we'll just mash them together for now.

This also adds a per-descriptor flag that can be used for indicating binding behavior. Today it's got a bit indicating whether the particular descriptor is read-only, but we could extend it to support caching behavior (non-temporal, atomics, etc.). The upstream bitfield enum has some glitchy behavior with lowercase strings (hardcoded to look for "None" instead of "none", etc.); I've got a refresh of the HAL dialect to do at some point and will normalize things then. Progress on #10144. VMFB version bumped because of a breaking type/export name change.

commit 35d28b9
Author: Matthias Springer <springerm@google.com>
Date: Tue Aug 23 15:31:15 2022 +0200

Support multiple target ops in clone_succeeding_op_into_dispatch_region (#10035)

The target ops are sorted topologically before cloning them one-by-one. This ensures that there are no dominance violations.

commit b5bf9d5
Author: Matthias Springer <springerm@google.com>
Date: Tue Aug 23 14:31:02 2022 +0200

Add clone_succeeding_op_into_dispatch_region transform op (#10022)

This op is symmetric to `clone_preceding_op_into_dispatch_region` and can be used to build heuristics for dispatch region formation.

commit 7e8c831
Author: Matthias Springer <springerm@google.com>
Date: Tue Aug 23 11:56:07 2022 +0200

Support multiple target ops in clone_preceding_op_into_dispatch_region (#10020)

The target ops are sorted topologically before cloning them one-by-one. This ensures that there are no dominance violations.

commit dc06d95
Author: Han-Chung Wang <hanchung@google.com>
Date: Tue Aug 23 14:26:26 2022 +0800

[NFC] Remove outdated method arguments from KernelConfig. (#10165)

The distribution tiling was done at flow level, and it has moved to a stage after setting kernel configurations. We no longer need the tiledLoop information when setting configurations. Also apply minor cleanups when revisiting the file: use the `.empty()` method instead of `.size() > 0`.

commit d1ca241
Author: Ben Vanik <ben.vanik@gmail.com>
Date: Mon Aug 22 21:42:35 2022 -0700

Bumping bytecode version due to breaking HAL changes.

commit a4da601
Author: Ben Vanik <ben.vanik@gmail.com>
Date: Mon Aug 22 15:50:25 2022 -0700

Replacing descriptor set layout usage with a flag bitfield.

Descriptor sets are only used in layouts, and the usage is now always push-only. As we support things like binding tables we may want to indicate which bindings may come from tables, and if we want to carry access information (which bindings are read-only, etc.) we'll need somewhere for that too: instead of having 4 enums with 2 options each, we'll just mash them together for now. This also adds a per-descriptor flag that can be used for indicating binding behavior. Today it's got a placeholder read-only value, but we can add more in the future controlling cache behavior and such. Progress on #10144.

commit 88795f5
Author: Ben Vanik <ben.vanik@gmail.com>
Date: Mon Aug 22 16:22:05 2022 -0700

Fixing deprecation warnings on mlir::OptionalParseResult.

commit d86e3a7
Author: Scott Todd <scotttodd@google.com>
Date: Mon Aug 22 18:45:18 2022 -0700

Remove ArithmeticExpandOpsPass from SPIRV and VMVX lowerings. (#10162)

Based on discussion at #10142 (comment). This "fixes" one case of `spv.IsNan` ops getting introduced while lowering `arith.minf`, but it does not generally address NaNs coming from other sources (user-space or internal to the compiler).

## Rationale

The `ArithmeticExpandOpsPass` pass (declaration [here](https://github.com/llvm/llvm-project/blob/af29db64b2c7091070dd623c81872559657e7b3d/mlir/include/mlir/Dialect/Arithmetic/Transforms/Passes.td#L31-L34) and [here](https://github.com/llvm/llvm-project/blob/af29db64b2c7091070dd623c81872559657e7b3d/mlir/include/mlir/Dialect/Arithmetic/Transforms/Passes.h#L23-L24)) is overly specific to a particular lowering to LLVM. The `minf` and `maxf` lowerings in particular generate IR like

```mlir
%8 = arith.cmpf ult, %7, %5 : vector<1x5xf32>
%9 = arith.select %8, %7, %5 : vector<1x5xi1>, vector<1x5xf32>
%10 = arith.cmpf uno, %5, %5 : vector<1x5xf32>
%11 = arith.select %10, %5, %9 : vector<1x5xi1>, vector<1x5xf32>
```

rather than tunneling down to intrinsics like [`llvm.minnum`](https://llvm.org/docs/LangRef.html#llvm-minnum-intrinsic). Digging through the history a bit, I see the min/max ops were added in https://reviews.llvm.org/D110540, which carries forward some rationale for using `select` to implement min/max. For our uses, quoting @benvanik:

> Yeah, that cmp/select/cmp/select dance is really bad as IIRC LLVM/other backends can't/don't practically ever simplify that again while retaining the same semantics. The behavior that nearly everything uses is "return the non-nan value if either value is nan" (GLSL min, OpenCL fminf, C/C++ fminf, CUDA fminf, numpy.fmin, AVX minps, etc), aka "between a NaN and a numeric value, the numeric value is chosen". We need to make sure that if that's the intent of the model (which I hope it is, as it's the only thing that makes sense) we can propagate that all the way to backends. There's some ISAs that do weird things but it'd be better to pay the cost there rather than everywhere like we do today.

So this PR removes the `ArithmeticExpandOpsPass` from our SPIRV and VMVX lowerings, allowing us to lower min/max/ceil/floor directly from `arith` to the backend dialects (e.g. `spv.GL.FMin`). The LLVM-based backends would need direct lowerings implemented for us to drop the pass there too (e.g. I see errors like `error: failed to legalize operation 'arith.maxf' that was explicitly marked illegal` if I remove it from the LLVMGPU pipeline used for CUDA).
commit 8ea0009
Author: Geoffrey Martin-Noble <gcmn@google.com>
Date: Mon Aug 22 18:19:14 2022 -0700

Add a script for deploying to PyPi (#10169)

The old Python script just downloaded the release artifacts, which can be accomplished with the GitHub CLI. We need to repair the wheels for reasons that aren't quite clear (and this step should probably be moved to the release if we can't fix it directly), but this works for now.

skip-ci

Tested: Deployed a release to PyPi with this script.
> View at:
> https://pypi.org/project/iree-tools-tf/20220811.232/
> https://pypi.org/project/iree-runtime-instrumented/20220811.232/
> https://pypi.org/project/iree-tools-tflite/20220811.232/
> https://pypi.org/project/iree-tools-xla/20220811.232/
> https://pypi.org/project/iree-compiler/20220811.232/
> https://pypi.org/project/iree-runtime/20220811.232/

commit c0fd1dc
Author: Jerry Wu <cheyuw@google.com>
Date: Mon Aug 22 17:52:46 2022 -0700

Define some IREE benchmarks as an example (#10115)

Co-authored-by: Geoffrey Martin-Noble <gcmn@google.com>

commit 0ee5c15
Author: Han-Chung Wang <hanchung@google.com>
Date: Tue Aug 23 07:39:30 2022 +0800

Fix tests for midair collision. (#10163)

commit ef27692
Author: Lei Zhang <antiagainst@google.com>
Date: Mon Aug 22 18:02:38 2022 -0400

Integrate llvm/llvm-project@72136d8ba266 (#10159)

* Reset third_party/llvm-project: 72136d8ba266eea6ce30fbc0e521c7b01a13b378 (2022-08-19 21:02:07 +0700): [Test] Add test for miscompile described in PR57247
* Update third_party/mlir-hlo to 5e324a40db4aa956f7cbf24e9417557776e7a84f
* Update tensorflow to 8a7764be0d32a72ad6d93ff3216520af184e26a0
* Renamed `Confined` to `ConfinedAttr`
* Updated `flow.dispatch.tensor.{load|store}` op assembly to use `custom<DynamicIndexList>`
* Updated `operand_segment_sizes` to `DenseI32ArrayAttr`

commit d4ba930
Author: Han-Chung Wang <hanchung@google.com>
Date: Tue Aug 23 05:26:35 2022 +0800

Add a verifier and tuning examples for CPU convolution codegen. (#10147)

commit 3263ccd
Merge: c234161 b902d33
Author: Ben Vanik <ben.vanik@gmail.com>
Date: Mon Aug 22 14:23:29 2022 -0700

Merge pull request #10158 from iree-org/benvanik-pipeline-layout-2

[NFC] Renaming "executable layout" to "pipeline layout".

commit b902d33
Author: Ben Vanik <ben.vanik@gmail.com>
Date: Mon Aug 22 13:35:57 2022 -0700

Bumping vmfb version due to break from renaming !hal.executable_layout.

commit b6afa47
Author: Ben Vanik <ben.vanik@gmail.com>
Date: Mon Aug 22 13:35:00 2022 -0700

Renaming `!hal.executable_layout` to `!hal.pipeline_layout`

And similarly the runtime side to `iree_hal_pipeline_layout`. Progress on #10144.

commit 347660c
Author: Ben Vanik <ben.vanik@gmail.com>
Date: Mon Aug 22 11:27:44 2022 -0700

Starting rename of executable_layout -> pipeline_layout. Progress on #10144.

commit c234161
Author: Ben Vanik <ben.vanik@gmail.com>
Date: Mon Aug 22 12:50:50 2022 -0700

[NFC] Merging descriptor_set_layout.h into executable_layout.h. (#10154)

Now that the layouts are only used together, keeping them in the same place will make it easier to see how they fit together and make them easier to refactor. Progress on #10144.

commit 8775cfe
Author: bjacob <benoitjacob@google.com>
Date: Mon Aug 22 15:04:26 2022 -0400

Script improvements (#10136)

Post-merge review comments from #10132.

commit 1750213
Author: Ben Vanik <ben.vanik@gmail.com>
Date: Mon Aug 22 10:40:28 2022 -0700

Removing !hal.descriptor_set/iree_hal_descriptor_set_t. (#10146)

It was never fully implemented, and the combination of push descriptors and upcoming binding tables should be sufficient for our uses. Not a breaking change, as the compiler had never emitted code using them. Progress on #10144.

commit e33c64c
Author: Thomas <thomasraoux@google.com>
Date: Mon Aug 22 08:51:08 2022 -0700

Cherry-pick mlir fix in linalg tiling (#10153)

cherry-pick commit 06c02d5dbb13f6d2a10eaa75c236f3c61cdf5b91

commit 9b092fb
Author: Marius Brehler <marius.brehler@iml.fraunhofer.de>
Date: Mon Aug 22 17:27:11 2022 +0200

Don't explicitly set MLIR_PDLL_TABLEGEN_EXE (#10151)

With llvm/llvm-project@91b6f76, the variable `MLIR_PDLL_TABLEGEN_EXE` is set as a cache variable in MLIR upstream.

commit 52e8625
Author: Han-Chung Wang <hanchung@google.com>
Date: Sat Aug 20 08:03:19 2022 +0800

Update default tiling sizes for ARM convolution configurations. (#10086)

This is the first round of tuning for ARM normal convolution codegen. The parameters are derived from experiments for 3x3 kernel cases. Benchmark file:

```mlir
util.global private @"__iree_flow_lhs" {noinline} = dense<1.0> : tensor<1x51x41x512xf32>
util.global private @"__iree_flow_rhs" {noinline} = dense<1.0> : tensor<3x3x512x512xf32>
func.func @conv_3x3filter() -> tensor<1x25x20x512xf32> {
  %lhs_ptr = util.global.address @"__iree_flow_lhs" : !util.ptr<tensor<1x51x41x512xf32>>
  %rhs_ptr = util.global.address @"__iree_flow_rhs" : !util.ptr<tensor<3x3x512x512xf32>>
  %lhs = util.global.load.indirect %lhs_ptr : !util.ptr<tensor<1x51x41x512xf32>> -> tensor<1x51x41x512xf32>
  %rhs = util.global.load.indirect %rhs_ptr : !util.ptr<tensor<3x3x512x512xf32>> -> tensor<3x3x512x512xf32>
  %cst = arith.constant 0.000000e+00 : f32
  %2 = linalg.init_tensor [1, 25, 20, 512] : tensor<1x25x20x512xf32>
  %3 = linalg.fill ins(%cst : f32) outs(%2 : tensor<1x25x20x512xf32>) -> tensor<1x25x20x512xf32>
  %4 = linalg.conv_2d_nhwc_hwcf {dilations = dense<1> : tensor<2xi64>, strides = dense<2> : tensor<2xi64>}
      ins(%lhs, %rhs : tensor<1x51x41x512xf32>, tensor<3x3x512x512xf32>)
      outs(%3 : tensor<1x25x20x512xf32>) -> tensor<1x25x20x512xf32>
  return %4 : tensor<1x25x20x512xf32>
}
```

Before:

```
# 1-threaded, taskset 80
-----------------------------------------------------------------------------------
Benchmark                                     Time             CPU   Iterations
-----------------------------------------------------------------------------------
BM_conv_3x3filter/process_time/real_time   1164 ms         1126 ms            1

# 4-threaded, taskset f0
-----------------------------------------------------------------------------------
Benchmark                                     Time             CPU   Iterations
-----------------------------------------------------------------------------------
BM_conv_3x3filter/process_time/real_time    643 ms         1764 ms            1
```

After:

```
# 1-threaded, taskset 80
-----------------------------------------------------------------------------------
Benchmark                                     Time             CPU   Iterations
-----------------------------------------------------------------------------------
BM_conv_3x3filter/process_time/real_time    160 ms          155 ms            4

# 4-threaded, taskset f0
-----------------------------------------------------------------------------------
Benchmark                                     Time             CPU   Iterations
-----------------------------------------------------------------------------------
BM_conv_3x3filter/process_time/real_time   65.6 ms          160 ms            9
```

commit 42244e7
Author: Stella Laurenzo <laurenzo@google.com>
Date: Fri Aug 19 16:20:43 2022 -0700

NFC: Convert util transforms to declarative registration. (#10143)

commit 979d6ea
Author: Thomas <thomasraoux@google.com>
Date: Fri Aug 19 12:31:23 2022 -0700

Integrate llvm-project and bump dependencies. (#10140)

* llvm-project: 619fd8c2ab505d8f79cbbbe3fd09b02f6640e1b1
* mlir-hlo: cb55a7168c1841d05287677746a39a5de7cb855f
* tensorflow: fc4021a8dd654606cd95e61a033691157853e122

Additional changes:
* rename member functions for tensor ops
* Remove reluN tosa tests
* carry patches for llvm and mhlo

commit cb0f8d4
Merge: e8ea103 65a9beb
Author: Ben Vanik <ben.vanik@gmail.com>
Date: Fri Aug 19 11:40:59 2022 -0700

Merge pull request #10141 from iree-org/benvanik-queue-barrier

Adding iree_hal_device_queue_barrier helper and fixing pool enum.
commit 65a9beb Author: Ben Vanik <ben.vanik@gmail.com> Date: Fri Aug 19 10:40:41 2022 -0700 Changing iree_hal_allocator_pool_id_t to iree_hal_allocator_pool_t. I originally intended this to be a bitfield but forgot when plumbing. commit 4c84f4a Author: Ben Vanik <ben.vanik@gmail.com> Date: Fri Aug 19 10:28:36 2022 -0700 Adding iree_hal_device_queue_barrier helper. commit e8ea103 Author: Thomas <thomasraoux@google.com> Date: Fri Aug 19 03:54:33 2022 -0700 [LLVMGPU] Add barriers when bufferization inserts shared memory copy (#10137) This is a conservative solution to avoid having race conditions when bufferization decides to emit shared memory copies.
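The last commit above guards cooperative copies into GPU shared memory with barriers. As a hedged illustration only (the names and shapes below are invented, not code from the commit), the pattern it produces looks roughly like a staging copy fenced on both sides:

```mlir
// Hypothetical sketch of the pattern the pass targets: bufferization
// materialized a staging copy into workgroup (shared) memory.
gpu.barrier  // wait until prior uses of %shared by other threads finish
memref.copy %global, %shared : memref<32x32xf32> to memref<32x32xf32, 3>
gpu.barrier  // make the copied data visible to all threads before use
```

Without the barriers, one thread could read `%shared` while another is still writing it, which is the race condition the commit message describes.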
The e2e softmax test uses `cmpf` -> `select` for max operations. Use `maxf` instead. This allows the op to be vectorized. The TOSA to Linalg lowering has been recently updated to do the same (and this test was derived from using an older TOSA to Linalg lowering).

Related to PR #10177
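A minimal sketch of the rewrite (op names as spelled in the `arith` dialect of that era; `%a` and `%b` are placeholder values, not from the actual test):

```mlir
// Before: max expressed as a compare-and-select pair, which the
// vectorizer does not recognize as a max operation.
%cmp = arith.cmpf ogt, %a, %b : f32
%max = arith.select %cmp, %a, %b : f32

// After: the dedicated max op, which can be vectorized directly.
%max = arith.maxf %a, %b : f32
```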