---
title: "Intro to OpenMP for GPUs"
subtitle: "2024 ICCS Summer School"
author: "Arjen Tamerus and Chris Edsall"
format: beamer
fonttheme: "professionalfonts"
aspectratio: 169
date: 11/07/2024
---
## Zettascale Lab
- Brought to you in association with the Cambridge Open Zettascale Lab
- https://www.zettascale.hpc.cam.ac.uk/
- Partnership with Dell, Intel, UKAEA
- Launched Dawn - currently the fastest AI supercomputer in the UK
- Researching next-generation supercomputers - 1000x current performance
- Themes include programming models for heterogeneous systems
## Goals
- Get you going with OpenMP offloading
- ~ 1 hour introduction to OpenMP
- ~ 2 hours practical exercises
## Restrictions
- OpenMP is a big specification
- We'll focus on a useful subset (loop parallelism)
- Aim is to get you familiar with OpenMP
## Why go parallel
- Go back 20+ years, and most processors were serial
- Want more performance? Buy a faster CPU (or wait until next year)
- ... or go parallel: connect a bunch of servers
\pause
- Mid-2000s:
- CPU frequency scaling stops
- Physics gets in the way: power and heat become unmanageable
- Start increasing the number of cores per CPU
\pause
- ~2010 onwards:
- Nvidia generalises GPU architecture
- Realises these things are great for linear algebra
- AI happens
## Why go parallel
![Compute node evolution](images/cpu.1.png){height=70%}
## Why go parallel
![Compute node evolution](images/cpu.2.png){height=70%}
## Why go parallel
![Compute node evolution](images/cpu.3.png){height=70%}
## Why go parallel
![Compute node evolution](images/cpu.drawio.png){height=70%}
## Oh no, my computer has gone parallel
- Many programming languages are still inherently serial
- OpenMP was introduced to make parallel programming accessible:
- Without having to learn a fundamentally new programming method
- With minimal changes to existing (serial) code
\pause
- First released mid-90s for CPU-level parallelism
- Parallelism is achieved through _multithreading_
- Support for GPU acceleration added ~2015
## OpenMP
- Directive-based: annotations in your source code
- Only for: C, C++, Fortran
- Today's focus is GPU ('offload'), but we'll cover the CPU essentials
## What is a directive?
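- A special annotation in the source code: a structured comment in Fortran
(`!$omp ...`), a pragma in C/C++ (`#pragma omp ...`)
- Only acted on when the compiler is told to enable OpenMP; otherwise it is
ignored and the code compiles as plain serial code
- A minimal sketch (the program itself is just for illustration):
```fortran
program directive_demo
  implicit none
  integer :: i
  ! The next line is a directive: compiled without OpenMP support it is
  ! just a comment, and the loop runs serially
  !$omp parallel do
  do i = 1, 4
    print *, "iteration", i
  end do
  !$omp end parallel do
end program directive_demo
```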
## OpenMP - CPU
- As mentioned: OpenMP is a directive-based parallel framework
- Extends the language transparently: code unaffected unless the compiler is explicitly told to enable OpenMP
- Take `saxpy` as an example:
```fortran
do i=1,N
y(i) = a*x(i) + y(i)
end do
```
## OpenMP - CPU
::: columns
:::: column
```fortran
do i=1,N
y(i) = a*x(i) + y(i)
end do
```
::::
\pause
:::: column
```fortran
!$omp parallel do
do i=1,N
y(i) = a*x(i) + y(i)
end do
!$omp end parallel do
```
::::
:::
## OpenMP - C-style example
::: columns
:::: column
```C
for (int i = 0; i < N; i++) {
y[i] = a*x[i] + y[i];
}
```
::::
\pause
:::: column
```C
#pragma omp parallel for
for (int i = 0; i < N; i++) {
y[i] = a*x[i] + y[i];
}
```
::::
:::
## Parallel regions
- What we just saw was in fact a _compound_ directive:
\pause
- `!$omp parallel`
\pause
- Creates a parallel region
- Actually spawns threads
- All threads _run the exact same code_
\pause
- `!$omp do`
\pause
- Instructs the runtime to distribute the iterations of the loop _directly
following_ the directive over the spawned threads
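- The compound form can also be written as two separate directives (a minimal
sketch of the same saxpy loop):
```fortran
!$omp parallel
!$omp do
do i=1,N
  y(i) = a*x(i) + y(i)
end do
!$omp end do
!$omp end parallel
```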
## OpenMP - data sharing
- By default, any variable declared _before_ a parallel region is shared
between threads. Any variable declared _within_ a parallel region is private
to each thread.
- We can use _data sharing_ clauses to specify a variable's data-sharing
properties within a parallel loop
- Variables can be `shared`, `private`, `firstprivate`
- The data sharing properties are important: incorrect data sharing can lead to
painful-to-debug problems (e.g. race conditions)
## OpenMP - data sharing
::: columns
:::: column
```C
#pragma omp parallel for
for (int i = 0; i < N; i++) {
float tmp = a*x[i];
y[i] = tmp + y[i];
}
```
::::
:::: column
```C
float tmp;
#pragma omp parallel for
for (int i = 0; i < N; i++) {
tmp = a*x[i];
y[i] = tmp + y[i];
}
```
::::
:::
## OpenMP - data sharing
::: columns
:::: column
```C
#pragma omp parallel for
for (int i = 0; i < N; i++) {
float tmp = a*x[i];
y[i] = tmp + y[i];
}
```
::::
:::: column
```C
float tmp;
#pragma omp parallel for private(tmp)
for (int i = 0; i < N; i++) {
tmp = a*x[i];
y[i] = tmp + y[i];
}
```
::::
:::
## OpenMP - data sharing
- The `default` clause specifies the default sharing mode for all variables
passed into a parallel region.
- Options are `shared`, `private`, `firstprivate` and `none`
- `default(none)` means _all_ variables must have their data sharing
properties declared explicitly
- _use this_
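- For example, the saxpy loop with `default(none)` and explicit clauses
(a sketch):
```fortran
!$omp parallel do default(none) shared(x, y, a, N) private(i)
do i=1,N
  y(i) = a*x(i) + y(i)
end do
!$omp end parallel do
```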
## OpenMP - data sharing
\pause
- _use default(none)_
## Reductions
- The `reduction` clause is used to specify reduction variables and operations.
- Each thread gets a `private` copy of the reduction variable; the copies are
combined at the end
```fortran
!$omp parallel do reduction(+:sum)
do i=1,10
sum = sum + i
end do
!$omp end parallel do
write(*,*) "Sum: ", sum
```
## Moving to GPU
- That's about all you need to get going with OpenMP on a CPU.
- But the point of this course is to use OpenMP to _offload_ computation to the
GPU
## Offloading to the GPU
- So far we've been running code only on the CPU
- we call this running on the _host_
- We want to run our code on a GPU.
- OpenMP calls this _offloading_ a computation to a _device_
- We achieve this by telling OpenMP to _target_ a device for a computation
- We call the offloaded computation a _device kernel_
## The target directive
::: columns
:::: column
```fortran
!$omp parallel do
do i=1,N
y(i) = a*x(i) + y(i)
end do
!$omp end parallel do
```
::::
:::: column
::::
:::
## The target directive
::: columns
:::: column
```fortran
!$omp target parallel do
do i=1,N
y(i) = a*x(i) + y(i)
end do
!$omp end target parallel do
```
::::
\pause
:::: column
- This works!
\pause
- This is terrifically slow!
\pause
- Let's talk about GPU architectures for a bit
::::
:::
## GPU Architecture
- GPUs have a fundamentally different architecture to CPUs
- My favourite bad analogy: trucks on a motorway vs container ship
![](images/truckship.jpg)
## GPU Architecture
![Intel PVC](images/pvc_block.jpg){height=70%}
## GPU Architecture
::: columns
:::: column
![Intel PVC](images/pvc_block_intel.png){height=70%}
::::
:::: column
- Hierarchical nature
- OpenMP maps quite closely
::::
:::
## OpenMP - GPU 2
- If we want to fully utilise the GPU, we'll have to _explicitly_ tell the
OpenMP runtime to use each level of parallelism the GPU offers.
- This translates into using this beauty of a directive:
\pause
```fortran
!$omp target teams distribute parallel do simd
do i=1,N
y(i) = a*x(i) + y(i)
end do
!$omp end target teams distribute parallel do simd
```
\pause
- Let's break that down...
## OpenMP - GPU directive breakdown
```fortran
!$omp TARGET
do i=1,N
y(i) = a*x(i) + y(i)
end do
!$omp end TARGET
```
- The `target` directive instructs the compiler to generate a GPU offload
kernel for the enclosed code.
## OpenMP - GPU directive breakdown
```fortran
!$omp target PARALLEL DO
do i=1,N
y(i) = a*x(i) + y(i)
end do
!$omp end target PARALLEL DO
```
- The `parallel do` directive works largely the same as on the CPU: distribute the loop
iterations over multiple threads.
- But only _within an execution unit_
## OpenMP - GPU directive breakdown
```fortran
!$omp target TEAMS parallel do
do i=1,N
y(i) = a*x(i) + y(i)
end do
!$omp end target TEAMS parallel do
```
- The `teams` directive creates multiple _teams_ of threads, spread across
_multiple execution units_.
## OpenMP - GPU directive breakdown
```fortran
!$omp target teams DISTRIBUTE parallel do
do i=1,N
y(i) = a*x(i) + y(i)
end do
!$omp end target teams DISTRIBUTE parallel do
```
- The `distribute` keyword ensures the loop iterations are distributed _across
teams_
## OpenMP - GPU directive breakdown
```fortran
!$omp target teams distribute parallel do SIMD
do i=1,N
y(i) = a*x(i) + y(i)
end do
!$omp end target teams distribute parallel do SIMD
```
- Finally, the `simd` clause enables use of the lowest-level vector units.
## OpenMP - prescriptive vs descriptive
- What I've just shown is the original, _prescriptive_ way of declaring OpenMP
parallelism for GPU kernels.
- A recent introduction to the standard is the `loop` directive, which is a
_descriptive_ method of parallelising loops
```fortran
!$omp target teams loop
do i=1,N
y(i) = a*x(i) + y(i)
end do
!$omp end target teams loop
```
- It's much less verbose, but gives less control
- I recommend getting familiar with prescriptive directives first.
- I'll use `loop` in some examples for brevity.
## OpenMP - GPU memory
\pause
- Back to GPU architecture for a second
## OpenMP - GPU memory
::: columns
:::: column
- Offload device & CPU (usually) have separate physical memory
::::
:::: column
![Node architecture](images/node.cut.png)
::::
:::
## OpenMP - data movement
- By default, the OpenMP runtime will copy _all_ required data to and from the
GPU memory
- Can be extremely inefficient!
- Avoid unnecessary data movement whenever you can
- Can control data movement using the `map` clause
- Note _map_: the runtime keeps a _mapping_ between host and device memory
- GPU memory always has a host equivalent!
## The `map` clause
- Adding a `map` clause to an `omp target` region tells the compiler and
runtime how to move data around
- Format:
- `map(<operation>:var1[,var2,...])`
- Valid operations: `to`, `from`, `tofrom`, `alloc`
- `tofrom` is default
- Can control data movement per kernel, or create _data regions_
- We'll discuss these in order
## Controlling data movement for a compute kernel
```fortran
!$omp target teams loop
do i=1,N
c(i) = a(i) + b(i)
end do
!$omp end target teams loop
```
- By default, _all_ of `a`, `b` and `c` will be copied to the GPU at the start
of this kernel, then back to the host after it's finished.
- Let's look at how the data is actually used!
## Controlling data movement for a compute kernel
```fortran
!$omp target teams loop
do i=1,N
c(i) = a(i) + b(i)
end do
!$omp end target teams loop
```
- In this loop we are:
- Reading from `a` and `b`
- Writing the result to `c`
## Controlling data movement for a compute kernel
```fortran
!$omp target teams loop
do i=1,N
c(i) = a(i) + b(i)
end do
!$omp end target teams loop
```
- In this loop we are:
- Reading from `a` and `b` -> Their values don't change!
- Writing the result to `c` -> We're not reading, so the GPU doesn't have to
know the initial value
## Controlling data movement for a compute kernel
```fortran
!$omp target teams loop map(to:a(1:N),b(1:N)) map(from:c(1:N))
do i=1,N
c(i) = a(i) + b(i)
end do
!$omp end target teams loop
```
- Now we're only copying `a` and `b` to the GPU, and only copying the final
value of `c` back to the host.
- This halves the amount of data we're moving around!
## Keeping data resident on the GPU
- We've looked at optimising data movement _per kernel_
- Often, we'll have multiple kernels operating on the same data
- How do we avoid moving data around unnecessarily?
## Data regions
- OpenMP data regions map data to the GPU, and _keep it there_ until the region
ends.
- Can be either _structured_ or _unstructured_
## Data regions - structured
- We can create structured (or scoped) data regions with the `omp target data`
directive
- Useful when keeping data resident within a specific region of the code (e.g.
within a subroutine).
```fortran
!$omp target data map(to:a(1:N)) map(tofrom:b(1:N))
!$omp target teams loop
do i=1,N
b(i) = a(i)+b(i)
end do
!$omp end target teams loop
!$omp end target data
```
## Accessing host data within data regions
- We may want to keep data resident on the GPU most of the time, but still
occasionally need to work with it on the host.
- Nested data regions
- `update` directive
## Update directive
```fortran
!$omp target data map(to:a(1:N)) map(tofrom:b(1:N))
!$omp target teams loop
do i=1,N
b(i) = a(i) + b(i)
end do
!$omp end target teams loop
call do_cpu_work(a)
!$omp target update to(a(1:N))
!$omp target teams loop
do i=1,N
b(i) = 2*a(i) + b(i)
end do
!$omp end target teams loop
!$omp end target data
```
## Data regions - unstructured
- Sometimes a scoped data region does not fit your program
- Alternative: unstructured data region
- Use the `target enter data` and `target exit data` directives
- Mostly the same `map` clauses, except:
- `enter data` accepts `to` and `alloc`
- `exit data` accepts `from`, `delete` and `release`
- `tofrom` is not accepted
## Data regions - unstructured
\small
::: columns
:::: column
```fortran
call map_data_to_device(A)
call gpu_algorithm(A, tmp)
!$omp target exit data map(from:A) &
!$omp& map(delete:tmp)
```
::::
:::: column
```fortran
subroutine map_data_to_device(A)
!$omp target enter data map(to:A)
end subroutine
subroutine gpu_algorithm(A, tmp)
...
!$omp target enter data map(alloc:tmp)
end subroutine
```
::::
:::
## Resident variables and GPU routines
- The `declare target` directive dictates that something should be present in
device memory for the duration of a program
\pause
- This can be variables (e.g. global variables that are often used)
\pause
- But can also be used to generate offloaded versions of functions or
subroutines
- Can be called from `omp target` regions!
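- A minimal sketch of both uses (module and names are illustrative):
```fortran
module device_mod
  implicit none
  real :: scale_factor = 2.0
  ! Keep this module variable resident in device memory
  !$omp declare target(scale_factor)
contains
  function scaled(x) result(r)
    real, intent(in) :: x
    real :: r
    ! Also generate a device-callable version of this function
    !$omp declare target
    r = scale_factor * x
  end function scaled
end module device_mod
```
- `scaled` can then be called from inside an `omp target` region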
## Things to keep in mind
- Note that all loops we've used in these slides have independent iterations
- They are 'massively parallel'
- This is required to prevent race conditions (see the sketch below)
\pause
- GPUs like to be overloaded with work
- They have thousands of cores.
- To use them efficiently, you'll have to provide _many times more_ work
items than there are cores.
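- For reference, a loop that is _not_ safe to parallelise as-is (the sketch
mentioned above):
```fortran
do i=2,N
  y(i) = y(i-1) + x(i)  ! loop-carried dependence: iteration i reads i-1
end do
```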
## Final comments
- This should (hopefully) be enough to get you started with OpenMP offloading
- OpenMP has many more features:
- Tasking
- Asynchronous offload (see the sketch below)
- Synchronisation
- ...
- As a rule: any OpenMP feature that is supported on the CPU is supported on GPUs as well
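- For instance, asynchronous offload might look like this (a sketch;
`other_host_work` is an illustrative placeholder):
```fortran
!$omp target teams loop nowait map(to:x(1:N)) map(from:y(1:N))
do i=1,N
  y(i) = a*x(i)
end do
!$omp end target teams loop
call other_host_work()  ! can run on the host while the kernel executes
!$omp taskwait          ! wait for the offloaded kernel to complete
```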
## Thanks
- On to the exercises
## Logging in & getting started
- Login using:
- `ssh <user>@login.hpc.cam.ac.uk`
- Clone the repository:
- `git clone git@github.com:Cambridge-ICCS/ICCS-OpenMP.git`
## Logging in & getting started
- Get access to a compute node
- `./get_gpu_node.sh`
- Set up your environment
- `source modules`