---
title: "Intro to OpenMP for GPUs"
subtitle: "2024 ICCS Summer School"
author: "Arjen Tamerus and Chris Edsall"
format: beamer
fonttheme: "professionalfonts"
aspectratio: 169
date: 11/07/2024
---
## Zettascale Lab
- Brought to you in association with the Cambridge Open Zettascale Lab
- https://www.zettascale.hpc.cam.ac.uk/
- Partnership with Dell, Intel, UKAEA
- Launched Dawn - currently the fastest AI supercomputer in the UK
- Researching next-generation supercomputers - 1000x current performance
- Themes include programming models for heterogeneous systems
## Goals
- Get you going with OpenMP offloading
- ~ 1 hour introduction to OpenMP
- ~ 2 hours practical exercises
## Restrictions
- OpenMP is a big specification
- We'll focus on a useful subset (loop parallelism)
- Aim is to get you familiar with OpenMP
## Why go parallel
- Go back 20+ years, and most processors were serial
- Want more performance? Buy a faster CPU (or wait until next year)
- ... or go parallel: connect a bunch of servers
\pause
- Mid-2000s:
- CPU frequency scaling stops
- Physics gets in the way: power and heat become unmanageable
- Start increasing the number of cores per CPU
\pause
- ~2010 onwards:
- Nvidia generalises GPU architecture
- Realises these things are great for linear algebra
- AI happens
## Why go parallel
![Compute node evolution](images/cpu.1.png){height=70%}
## Why go parallel
![Compute node evolution](images/cpu.2.png){height=70%}
## Why go parallel
![Compute node evolution](images/cpu.3.png){height=70%}
## Why go parallel
![Compute node evolution](images/cpu.drawio.png){height=70%}
## Oh no, my computer has gone parallel
- Many programming languages are still inherently serial
- OpenMP was introduced to make parallel programming accessible:
- Without having to learn a fundamentally new programming method
- With minimal changes to existing (serial) code
\pause
- First released mid-90s for CPU-level parallelism
- Parallelism is achieved through _multithreading_
- Support for GPU acceleration added ~2015
## OpenMP
- Directive-based: annotations in your source code
- Only for: C, C++, Fortran
- Today's focus is GPU ('offload'), but we'll cover the CPU essentials
## What is a directive?
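- A special annotation in the source code: a structured comment in Fortran
(`!$omp ...`), a pragma in C/C++ (`#pragma omp ...`)
- Only acted on when the compiler is told to enable OpenMP; otherwise it is
ignored and the code compiles as plain serial code
- A minimal sketch (the program itself is just for illustration):
```fortran
program directive_demo
  implicit none
  integer :: i
  ! The next line is a directive: compiled without OpenMP support it is
  ! just a comment, and the loop runs serially
  !$omp parallel do
  do i = 1, 4
    print *, "iteration", i
  end do
  !$omp end parallel do
end program directive_demo
```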
## OpenMP - CPU
- As mentioned: OpenMP is a directive-based parallel framework
- Extends the language transparently: code unaffected unless the compiler is explicitly told to enable OpenMP
- Take `saxpy` as an example:
```fortran
do i=1,N
y(i) = a*x(i) + y(i)
end do
```
## OpenMP - CPU
::: columns
:::: column
```fortran
do i=1,N
y(i) = a*x(i) + y(i)
end do
```
::::
\pause
:::: column
```fortran
!$omp parallel do
do i=1,N
y(i) = a*x(i) + y(i)
end do
!$omp end parallel do
```
::::
:::
## OpenMP - C-style example
::: columns
:::: column
```C
for (int i = 0; i < N; i++) {
y[i] = a*x[i] + y[i];
}
```
::::
\pause
:::: column
```C
#pragma omp parallel for
for (int i = 0; i < N; i++) {
y[i] = a*x[i] + y[i];
}
```
::::
:::
## Parallel regions
- What we just saw was in fact a _compound_ directive:
\pause
- `!$omp parallel`
\pause
- Creates a parallel region
- Actually spawns threads
- All threads _run the exact same code_
\pause
- `!$omp do`
\pause
- Instructs the runtime to distribute the iterations of the loop _directly
following_ the directive over the spawned threads
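- The compound form can also be written as two separate directives (a minimal
sketch of the same saxpy loop):
```fortran
!$omp parallel
!$omp do
do i=1,N
  y(i) = a*x(i) + y(i)
end do
!$omp end do
!$omp end parallel
```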
## OpenMP - data sharing
- By default, any variable declared _before_ a parallel region is shared
between threads. Any variable declared _within_ a parallel region is private
to each thread.
- We can use _data sharing_ clauses to specify a variable's data-sharing
properties within a parallel loop
- Variables can be `shared`, `private`, `firstprivate`
- The data sharing properties are important: incorrect data sharing can lead to
painful-to-debug problems (e.g. race conditions)
## OpenMP - data sharing
::: columns
:::: column
```C
#pragma omp parallel for
for (int i = 0; i < N; i++) {
float tmp = a*x[i];
y[i] = tmp + y[i];
}
```
::::
:::: column
```C
float tmp;
#pragma omp parallel for
for (int i = 0; i < N; i++) {
tmp = a*x[i];
y[i] = tmp + y[i];
}
```
::::
:::
## OpenMP - data sharing
::: columns
:::: column
```C
#pragma omp parallel for
for (int i = 0; i < N; i++) {
float tmp = a*x[i];
y[i] = tmp + y[i];
}
```
::::
:::: column
```C
float tmp;
#pragma omp parallel for private(tmp)
for (int i = 0; i < N; i++) {
tmp = a*x[i];
y[i] = tmp + y[i];
}
```
::::
:::
## OpenMP - data sharing
- The `default` clause specifies the default sharing mode for all variables
passed into a parallel region.
- Options are `shared`, `private`, `firstprivate` and `none`
- `default(none)` means _all_ variables must have their data sharing
properties declared explicitly
- _use this_
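- For example, the saxpy loop with `default(none)` and explicit clauses
(a sketch):
```fortran
!$omp parallel do default(none) shared(x, y, a, N) private(i)
do i=1,N
  y(i) = a*x(i) + y(i)
end do
!$omp end parallel do
```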
## OpenMP - data sharing
\pause
- _use default(none)_
## Reductions
- The `reduction` clause is used to specify reduction variables and operations.
- Each thread gets a `private` copy of the reduction variable; the copies are
combined at the end
```fortran
!$omp parallel do reduction(+:sum)
do i=1,10
sum = sum + i
end do
!$omp end parallel do
write(*,*) "Sum: ", sum
```
## Moving to GPU
- That's about all you need to get going with OpenMP on a CPU.
- But the point of this course is to use OpenMP to _offload_ computation to the
GPU
## Offloading to the GPU
- So far we've been running code only on the CPU
- we call this running on the _host_
- We want to run our code on a GPU.
- OpenMP calls this _offloading_ a computation to a _device_
- We achieve this by telling OpenMP to _target_ a device for a computation
- We call the offloaded computation a _device kernel_
## The target directive
::: columns
:::: column
```fortran
!$omp parallel do
do i=1,N
y(i) = a*x(i) + y(i)
end do
!$omp end parallel do
```
::::
:::: column
::::
:::
## The target directive
::: columns
:::: column
```fortran
!$omp target parallel do
do i=1,N
y(i) = a*x(i) + y(i)
end do
!$omp end target parallel do
```
::::
\pause
:::: column
- This works!
\pause
- This is terrifically slow!
\pause
- Let's talk about GPU architectures for a bit
::::
:::
## GPU Architecture
- GPUs have a fundamentally different architecture to CPUs
- My favourite bad analogy: trucks on a motorway vs container ship
![](images/truckship.jpg)
## GPU Architecture
![Intel PVC](images/pvc_block.jpg){height=70%}
## GPU Architecture
::: columns
:::: column
![Intel PVC](images/pvc_block_intel.png){height=70%}
::::
:::: column
- Hierarchical nature
- OpenMP maps quite closely
::::
:::
## OpenMP - GPU 2
- If we want to fully utilise the GPU, we'll have to _explicitly_ tell the
OpenMP runtime to use each level of parallelism the GPU offers.
- This translates into using this beauty of a directive:
\pause
```fortran
!$omp target teams distribute parallel do simd
do i=1,N
y(i) = a*x(i) + y(i)
end do
!$omp end target teams distribute parallel do simd
```
\pause
- Let's break that down...
## OpenMP - GPU directive breakdown
```fortran
!$omp TARGET
do i=1,N
y(i) = a*x(i) + y(i)
end do
!$omp end TARGET
```
- The `target` directive instructs the compiler to generate a GPU offload
kernel for the enclosed code.
## OpenMP - GPU directive breakdown
```fortran
!$omp target PARALLEL DO
do i=1,N
y(i) = a*x(i) + y(i)
end do
!$omp end target PARALLEL DO
```
- The `parallel do` directive works largely the same as on the CPU: distribute the loop
iterations over multiple threads.
- But only _within an execution unit_
## OpenMP - GPU directive breakdown
```fortran
!$omp target TEAMS parallel do
do i=1,N
y(i) = a*x(i) + y(i)
end do
!$omp end target TEAMS parallel do
```
- The `teams` directive creates multiple _teams_ of threads, spread across
_multiple execution units_.
## OpenMP - GPU directive breakdown
```fortran
!$omp target teams DISTRIBUTE parallel do
do i=1,N
y(i) = a*x(i) + y(i)
end do
!$omp end target teams DISTRIBUTE parallel do
```
- The `distribute` keyword ensures the loop iterations are distributed _across
teams_
## OpenMP - GPU directive breakdown
```fortran
!$omp target teams distribute parallel do SIMD
do i=1,N
y(i) = a*x(i) + y(i)
end do
!$omp end target teams distribute parallel do SIMD
```
- Finally, the `simd` clause enables use of the lowest-level vector units.
## OpenMP - prescriptive vs descriptive
- What I've just shown is the original, _prescriptive_ way of declaring OpenMP
parallelism for GPU kernels.
- A recent introduction to the standard is the `loop` directive, which is a
_descriptive_ method of parallelising loops
```fortran
!$omp target teams loop
do i=1,N
y(i) = a*x(i) + y(i)
end do
!$omp end target teams loop
```
- It's much less verbose, but gives less control
- I recommend getting familiar with prescriptive directives first.
- I'll use `loop` in some examples for brevity.
## OpenMP - GPU memory
\pause
- Back to GPU architecture for a second
## OpenMP - GPU memory
::: columns
:::: column
- Offload device & CPU (usually) have separate physical memory
::::
:::: column
![Node architecture](images/node.cut.png)
::::
:::
## OpenMP - data movement
- By default, the OpenMP runtime will copy _all_ required data to and from the
GPU memory
- Can be extremely inefficient!
- Avoid unnecessary data movement whenever you can
- Can control data movement using the `map` clause
- Note _map_: the runtime keeps a _mapping_ between host and device memory
- GPU memory always has a host equivalent!
## The `map` clause
- Adding a `map` clause to an `omp target` region tells the compiler and
runtime how to move data around
- Format:
- `map(<operation>:var1[,var2,...])`
- Valid operations: `to`, `from`, `tofrom`, `alloc`
- `tofrom` is default
- Can control data movement per kernel, or create _data regions_
- We'll discuss these in order
## Controlling data movement for a compute kernel
```fortran
!$omp target teams loop
do i=1,N
c(i) = a(i) + b(i)
end do
!$omp end target teams loop
```
- By default, _all_ of `a`, `b` and `c` will be copied to the GPU at the start
of this kernel, then back to the host after it's finished.
- Let's look at how the data is actually used!
## Controlling data movement for a compute kernel
```fortran
!$omp target teams loop
do i=1,N
c(i) = a(i) + b(i)
end do
!$omp end target teams loop
```
- In this loop we are:
- Reading from `a` and `b`
- Writing the result to `c`
## Controlling data movement for a compute kernel
```fortran
!$omp target teams loop
do i=1,N
c(i) = a(i) + b(i)
end do
!$omp end target teams loop
```
- In this loop we are:
- Reading from `a` and `b` -> Their values don't change!
- Writing the result to `c` -> We're not reading, so the GPU doesn't have to
know the initial value
## Controlling data movement for a compute kernel
```fortran
!$omp target teams loop map(to:a(1:N),b(1:N)) map(from:c(1:N))
do i=1,N
c(i) = a(i) + b(i)
end do
!$omp end target teams loop
```
- Now we're only copying `a` and `b` to the GPU, and only copying the final
value of `c` back to the host.
- This halves the amount of data we're moving around!
## Keeping data resident on the GPU
- We've looked at optimising data movement _per kernel_
- Often, we'll have multiple kernels operating on the same data
- How do we avoid moving data around unnecessarily?
## Data regions
- OpenMP data regions map data to the GPU, and _keep it there_ until the region
ends.
- Can be either _structured_ or _unstructured_
## Data regions - structured
- We can create structured (or scoped) data regions with the `omp target data`
directive
- Useful when keeping data resident within a specific region of the code (e.g.
within a subroutine).
```fortran
!$omp target data map(to:a(1:N)) map(tofrom:b(1:N))
!$omp target teams loop
do i=1,N
b(i) = a(i)+b(i)
end do
!$omp end target teams loop
!$omp end target data
```
## Accessing host data within data regions
- We may want to keep data resident on the GPU most of the time, but still
occasionally need to work with it on the host.
- Nested data regions
- `update` directive
## Update directive
```fortran
!$omp target data map(to:a(1:N)) map(tofrom:b(1:N))
!$omp target teams loop
do i=1,N
b(i) = a(i) + b(i)
end do
!$omp end target teams loop
call do_cpu_work(a)
!$omp target update to(a(1:N))
!$omp target teams loop
do i=1,N
b(i) = 2*a(i) + b(i)
end do
!$omp end target teams loop
!$omp end target data
```
## Data regions - unstructured
- Sometimes a scoped data region does not fit your program
- Alternative: unstructured data region
- Use the `target enter data` and `target exit data` directives
- Mostly the same `map` clauses, except:
- `enter data` accepts `to` and `alloc`
- `exit data` accepts `from`, `delete` and `release`
- `tofrom` is not accepted
## Data regions - unstructured
\small
::: columns
:::: column
```fortran
call map_data_to_device(A)
call gpu_algorithm(A, tmp)
!$omp target exit data map(from:A) &
!$omp& map(delete:tmp)
```
::::
:::: column
```fortran
subroutine map_data_to_device(A)
!$omp target enter data map(to:A)
end subroutine
subroutine gpu_algorithm(A, tmp)
...
!$omp target enter data map(alloc:tmp)
end subroutine
```
::::
:::
## Resident variables and GPU routines
- The `declare target` directive dictates that something should be present in
device memory for the duration of a program
\pause
- This can be variables (e.g. global variables that are often used)
\pause
- But can also be used to generate offloaded versions of functions or
subroutines
- Can be called from `omp target` regions!
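- A minimal sketch of both uses (module and names are illustrative):
```fortran
module device_mod
  implicit none
  real :: scale_factor = 2.0
  ! Keep this module variable resident in device memory
  !$omp declare target(scale_factor)
contains
  function scaled(x) result(r)
    real, intent(in) :: x
    real :: r
    ! Also generate a device-callable version of this function
    !$omp declare target
    r = scale_factor * x
  end function scaled
end module device_mod
```
- `scaled` can then be called from inside an `omp target` region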
## Things to keep in mind
- Note that all loops we've used in these slides have independent iterations
- They are 'massively parallel'
- This is required to prevent race conditions (see the sketch below)
\pause
- GPUs like to be overloaded with work
- They have thousands of cores.
- To use them efficiently, you'll have to provide _many times more_ work
items than there are cores.
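- For reference, a loop that is _not_ safe to parallelise as-is (the sketch
mentioned above):
```fortran
do i=2,N
  y(i) = y(i-1) + x(i)  ! loop-carried dependence: iteration i reads i-1
end do
```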
## Final comments
- This should (hopefully) be enough to get you started with OpenMP offloading
- OpenMP has many more features:
- Tasking
- Asynchronous offload (see the sketch below)
- Synchronisation
- ...
- As a rule: any OpenMP feature that is supported on the CPU is supported on GPUs as well
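- For instance, asynchronous offload might look like this (a sketch;
`other_host_work` is an illustrative placeholder):
```fortran
!$omp target teams loop nowait map(to:x(1:N)) map(from:y(1:N))
do i=1,N
  y(i) = a*x(i)
end do
!$omp end target teams loop
call other_host_work()  ! can run on the host while the kernel executes
!$omp taskwait          ! wait for the offloaded kernel to complete
```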
## Thanks
- On to the exercises
## Logging in & getting started
- Login using:
- `ssh <user>@login.hpc.cam.ac.uk`
- Clone the repository:
- `git clone git@github.com:Cambridge-ICCS/ICCS-OpenMP.git`
## Logging in & getting started
- Get access to a compute node
- `./get_gpu_node.sh`
- Set up your environment
- `source modules`