Telamon is a framework that searches for good combinations of optimizations for computational kernels on GPUs. It currently focuses on dense linear algebra. For more information on how it works internally, we encourage you to read our paper.
To compile Telamon, you need the nightly version of Rust 1.31 or higher installed. If you want to generate code for GPUs, you will also need a CUDA toolchain installed, with `cuda`, `curand` and `cupti` accessible in the include and library paths.
You can also view the documentation on GitHub. Telamon is compiled with the Rust 2018 edition.
Example kernels are located in the `kernels/` directory. In particular, `kernels/src/linalg.rs` contains linear algebra kernels. You can compare the code generated by Telamon to state-of-the-art implementations on GPUs by running `cargo +nightly bench --features=cuda --bench=cuda-search` in the `kernels/` directory. To see the progress of the exploration, prefix the command with `RUST_LOG=telamon::explorer=warn`, for example: `RUST_LOG=telamon::explorer=warn cargo +nightly bench --features=cuda --bench=cuda-search`.
To write a kernel, you must first define the inputs of the kernel and the context we optimize for. Here we assume we optimize for a CUDA GPU, but the process is the same for other backends.
```rust
use telamon::device::cuda;
use telamon::helper;

let _ = env_logger::init(); // Enable logging.
let executor = cuda::Executor::init(); // Set up the interface with the device.
// Build the signature and bind the inputs in the context.
let mut context = cuda::Context::new(&executor);
let array_a;
let signature = {
    let mut sig_builder = helper::SignatureBuilder::new("my_kernel", &mut context);
    // Create a signature with two arguments: a scalar `n` and an array of floats. We
    // give the value we want to optimize for to each argument.
    sig_builder.scalar("n", 1000i32);
    array_a = sig_builder.array::<f32>("a", 1000); // Creates an array of size 1000.
    sig_builder.get()
};
```
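The same helpers extend to kernels with several inputs. As a minimal sketch (the kernel name, argument names and sizes below are hypothetical, and only the `scalar` and `array` helpers shown above are used), a kernel taking a size and two float arrays could build its signature as follows:

```rust
// Hypothetical signature with one scalar and two array arguments.
let array_x;
let array_y;
let other_signature = {
    let mut sig_builder = helper::SignatureBuilder::new("my_other_kernel", &mut context);
    sig_builder.scalar("n", 1000i32);              // Size to optimize for.
    array_x = sig_builder.array::<f32>("x", 1000); // First array argument.
    array_y = sig_builder.array::<f32>("y", 1000); // Second array argument.
    sig_builder.get()
};
```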
Note: Telamon also has a nearly functional MPPA backend. While most kernels run fine, it is still buggy and relies on a number of workarounds, because the runtime we depend on is not fully satisfactory. For various reasons, Telamon must be compiled and run with the prefix `scl enable llvm-toolset-7 "MPPACL_LOCAL_SIZE=128K cargo ..."`: `scl enable llvm-toolset-7` tells cargo not to use the custom Kalray C library (although we still need it to compile kernels for the MPPA), and `MPPACL_LOCAL_SIZE=128K` is mandatory to run multithreaded kernels. For example, in `kernels/`, `scl enable llvm-toolset-7 "MPPACL_LOCAL_SIZE=128K cargo +nightly run --features=mppa --bin exec_dump gesummv gesummv.dump"` runs a dump of the given kernel (here `gesummv`) on the MPPA. Feel free to put this in an alias.

We can now describe the body of the kernel itself. Here we create a kernel that computes `x[i] = 2*i` for each `i` in `0..n`. For that, we use a builder that creates the loops and the instructions for us. The builder keeps the list of open loops and nests new instructions in them.
```rust
use telamon::ir;

let mut builder = helper::Builder::new(&signature, context.device());
// Open a loop of size n.
let size = builder.param_size("n");
let dim0 = builder.open_dim(size);
// Compute `x = 2*i` where `i` is the index on loop `dim0`.
let x = builder.mul(&dim0, &2i32);
// Store `x` in `a[i]`. For that, we first compute the address of `a[i]` and build an
// access pattern that describes the access for the performance model.
let (addr, access_pattern) = builder.tensor_access(&"a", array_a, &ir::Type::I(32), &[&dim0]);
builder.st(&addr, &x, access_pattern);
// Close the loop.
builder.close_dim(&dim0);
let search_space = builder.get();
```
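For reference, the kernel built above simply computes `a[i] = 2*i` over the open dimension. A plain-Rust equivalent of its semantics is shown below (for illustration only; the element type is written as `i32` to match the `ir::Type::I(32)` store above, and Telamon generates device code rather than this loop):

```rust
// Reference semantics of the kernel built above: a[i] = 2 * i for i in 0..n.
fn reference(a: &mut [i32]) {
    for (i, slot) in a.iter_mut().enumerate() {
        *slot = 2 * i as i32;
    }
}
```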
We are now ready to start the search space exploration to find the best candidate.
```rust
use telamon::explorer;

let best = explorer::find_best(explorer::config::read(), &context, search_space, None).unwrap();
context.device().gen_code(&best, &mut std::io::stdout());
```
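The generated code can also be written to a file instead of standard output. A minimal sketch, assuming `gen_code` accepts any writer implementing `std::io::Write` (as the `&mut std::io::stdout()` argument above suggests); the output file name is arbitrary:

```rust
use std::fs::File;

// Write the code generated for the best candidate to a file.
let mut file = File::create("my_kernel_generated.cu").expect("cannot create output file");
context.device().gen_code(&best, &mut file);
```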
Telamon is released under the Apache License (version 2.0). See LICENSE for more details.