# triton_vs_cuda

Building Triton and CUDA kernels side-by-side to create a cuBLAS-performant GEMM kernel.

Lately I've been learning Triton, its strengths, and its weaknesses. Inspired by SiBoehm's blog, I thought I would show how we can attempt to build a Triton kernel that is as performant as a near-cuBLAS CUDA kernel. In this endeavor I hope to highlight a few things about Triton:

- what are the limitations of Triton's block-level programming paradigm? (a minimal sketch of that paradigm follows this list)
- as kernel engineers, how much control do we retain in Triton to squeeze out more performance?
- where does the Triton compiler take over and attempt to fill in the gaps? How successful is it at this task? Where is work still needed at the compiler level?
- when should you actually use Triton vs. CUDA?
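
To make the first point concrete, here is a minimal sketch of what block-level programming looks like in Triton for GEMM: each program instance owns one BLOCK_M x BLOCK_N tile of the output and loops over K in BLOCK_K chunks. This is purely illustrative and is not the template or solution kernel from this repo; the block sizes, the `matmul` wrapper, and the simplified masking (only the K dimension is masked on loads, so M and N are assumed to be multiples of the block sizes) are choices made for brevity.

```python
import torch
import triton
import triton.language as tl


@triton.jit
def matmul_kernel(
    a_ptr, b_ptr, c_ptr,
    M, N, K,
    stride_am, stride_ak,
    stride_bk, stride_bn,
    stride_cm, stride_cn,
    BLOCK_M: tl.constexpr, BLOCK_N: tl.constexpr, BLOCK_K: tl.constexpr,
):
    # Each program instance computes one BLOCK_M x BLOCK_N tile of C = A @ B.
    pid_m = tl.program_id(0)
    pid_n = tl.program_id(1)

    offs_m = pid_m * BLOCK_M + tl.arange(0, BLOCK_M)
    offs_n = pid_n * BLOCK_N + tl.arange(0, BLOCK_N)
    offs_k = tl.arange(0, BLOCK_K)

    # Pointers to the first A and B blocks this program will read.
    a_ptrs = a_ptr + offs_m[:, None] * stride_am + offs_k[None, :] * stride_ak
    b_ptrs = b_ptr + offs_k[:, None] * stride_bk + offs_n[None, :] * stride_bn

    acc = tl.zeros((BLOCK_M, BLOCK_N), dtype=tl.float32)
    for k in range(0, K, BLOCK_K):
        # For brevity, loads mask only the K dimension; M and N are assumed
        # to be multiples of BLOCK_M / BLOCK_N.
        a = tl.load(a_ptrs, mask=(offs_k[None, :] + k) < K, other=0.0)
        b = tl.load(b_ptrs, mask=(offs_k[:, None] + k) < K, other=0.0)
        acc += tl.dot(a, b)  # the compiler decides how this maps to MMA hardware
        a_ptrs += BLOCK_K * stride_ak
        b_ptrs += BLOCK_K * stride_bk

    c_ptrs = c_ptr + offs_m[:, None] * stride_cm + offs_n[None, :] * stride_cn
    c_mask = (offs_m[:, None] < M) & (offs_n[None, :] < N)
    tl.store(c_ptrs, acc, mask=c_mask)


def matmul(a, b, BLOCK_M=64, BLOCK_N=64, BLOCK_K=32):
    # Hypothetical host-side launcher, one program per output tile.
    M, K = a.shape
    _, N = b.shape
    c = torch.empty((M, N), device=a.device, dtype=torch.float32)
    grid = (triton.cdiv(M, BLOCK_M), triton.cdiv(N, BLOCK_N))
    matmul_kernel[grid](
        a, b, c, M, N, K,
        a.stride(0), a.stride(1),
        b.stride(0), b.stride(1),
        c.stride(0), c.stride(1),
        BLOCK_M=BLOCK_M, BLOCK_N=BLOCK_N, BLOCK_K=BLOCK_K,
    )
    return c
```

Notice how much is left implicit: you never touch shared memory, warps, or tensor-core instructions directly; the compiler decides how the tile gets staged and computed. That trade-off between convenience and control is exactly what the questions above are about.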

## Getting Started

I've divided this project into two branches:

- `main`: template kernel files
- `solutions`: solution kernel files

I've included Dockerfiles in the /triton and /cuda directories to make environment setup quick and easy. Each of those directories has its own README.md explaining how to get going.

## In Progress

I'm actively working on a blog post on the subject; it will be published at some point on my personal website: alexkranias.com

In the meantime, you can clone this repo, work through the kernels on your own, and follow along with SiBoehm's blog.