---
author: fredrik robertsen
date: 2025-11-04
title: "exercise 7"
---

## how i solved the problem (not a task)

i'm writing this as a little documentation. this is not part of the task.

i first automated running the code on snotra by creating various `remote_`
targets in the provided makefile, so that i could easily copy my source code
over to snotra, compile and run it there, and finally check the output
against the precomputed sequential version. this was done using a mix of
`scp`, `ssh` and `srun` calls. after that, testing the code was a breeze,
and i could get on with implementing it.

after getting cuda to run, it ran at about 2x the sequential speed, i.e.
roughly 40/20 seconds. i wondered if it could go faster, so i used
`nvprof ./parallel` to profile the code on snotra, then analyzed that data
and found better solutions with an llm, [kagi ai](https://kagi.com/). we
arrived at a combined kernel that performs both the time step and the
boundary conditions. this made it easier to use cooperative groups, such
that we only had "minimal" (1M) kernel launches. this kernel takes some
arguments that are `__restrict__`-qualified, which makes the passing of
arguments more memory efficient. it works by telling the compiler that
these arguments don't overlap in memory, allowing for compile-time
optimizations. this makes sense, since each time step is disjoint in
memory. in addition to this, we pass precomputed coefficients to the
kernel.

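as a sketch, this is roughly what such a combined cooperative-groups kernel
can look like. this is not the actual exercise code: the names and the 1d
stencil are illustrative only.

```cuda
#include <cooperative_groups.h>
namespace cg = cooperative_groups;

// one launch runs all time steps; grid.sync() takes the place of
// per-step kernel launches. __restrict__ promises the compiler that
// the three buffers never alias, which holds because each time step
// lives in its own buffer.
__global__ void step_and_boundary(float *__restrict__ prv,
                                  float *__restrict__ cur,
                                  float *__restrict__ nxt,
                                  int n, int steps, float c)
{
    cg::grid_group grid = cg::this_grid();
    int i = blockIdx.x * blockDim.x + threadIdx.x;

    for (int t = 0; t < steps; t++) {
        // time step for the interior points
        if (i > 0 && i < n - 1)
            nxt[i] = 2.0f * cur[i] - prv[i]
                   + c * (cur[i - 1] - 2.0f * cur[i] + cur[i + 1]);
        // boundary conditions in the same kernel (fixed ends here)
        if (i == 0)     nxt[0] = 0.0f;
        if (i == n - 1) nxt[n - 1] = 0.0f;
        grid.sync();  // wait for the whole grid before rotating buffers
        float *tmp = prv; prv = cur; cur = nxt; nxt = tmp;
    }
}
```

a kernel containing `grid.sync()` must be launched with
`cudaLaunchCooperativeKernel` rather than the usual `<<<...>>>` syntax.
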
other than this, the solution seems to be fairly standard.

## `__global__` vs `__device__`

`__global__` functions are kernels: they are launched from the host and
spawn a grid of gpu threads. `__device__` functions can only be called from
code already running on the gpu; they execute in the caller's context and
cannot spawn new threads. functions marked with `__global__` can thus be
thought of as the main entry points of the gpu program, while `__device__`
functions are gpu helper functions that carry the gpu context, i.e. they
have access to `blockDim` and such.

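a small illustration (hypothetical helper and kernel):

```cuda
// __device__: callable only from gpu code; runs in the calling
// thread's context and spawns no new threads.
__device__ float average(float a, float b)
{
    return 0.5f * (a + b);
}

// __global__: the gpu-side entry point; launched from the host,
// spawning a grid of gridDim.x * blockDim.x threads.
__global__ void smooth(const float *in, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i > 0 && i < n - 1)
        out[i] = average(in[i - 1], in[i + 1]);
}

// host side: smooth<<<blocks, threads_per_block>>>(d_in, d_out, n);
```
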
## cuda vs mpi

mpi is spmd: multiple processes run the same program code to obtain
parallelism by utilizing many cpu cores. this is good for compute clusters
where a great number of processors are available. these programs handle
concurrency through ranks (local process ids) and message passing, and the
communication overhead can be the main issue for some problems.

cuda is similar to threaded parallelism, in that the threads share program
code and address space, but they run on an nvidia gpu instead of cpu cores.
gpus have many cores and are capable of running thousands of threads in
parallel; the problem often boils down to properly feeding the gpu, keeping
it busy.

we have seen that the previous implementations of the 2d wave equation (mpi
and threads) yielded some speed-ups, but the cuda code has given the
greatest speed-up i've seen yet: i went from some 40 seconds on snotra to
only about 8 after some optimizations. in theory, we could probably
increase the problem size, run the program on a cluster with more gpus, and
use hybrid programming to obtain even greater throughput.

another advantage of cuda is that you can, as with the threaded code,
implement everything sequentially and then swap out the bottlenecking parts
with parallel kernels. this allows for more iterative development, rather
than one-shotting it all in one go, as with mpi.

together with good profiling tools, this makes it easier to identify
bottlenecks and speed up your code, which makes cuda nice to write.

## pros and cons of cooperative groups

pros:

- grid-level synchronization without atomics or extra kernel launches
- cleaner code for hierarchical patterns

cons:

- frequent sync points serialize execution, reducing parallelism
- high overhead, not worth it for simple kernels like stencils

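for context, grid-level synchronization requires the kernel to be launched
through the cooperative launch api rather than the `<<<...>>>` syntax. a
sketch, with hypothetical kernel and variable names:

```cuda
// kernel arguments are passed as an array of pointers
void *args[] = { &d_prv, &d_cur, &d_nxt, &n, &steps, &c };
dim3 grid(16, 16), block(8, 8);

// all blocks must be resident on the gpu at once for grid.sync()
// to work, so the runtime rejects launches exceeding device capacity
cudaLaunchCooperativeKernel((void *)my_kernel, grid, block, args);
cudaDeviceSynchronize();
```
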
## gpu occupancy

`occupancy = active warps / max warps per sm`.

the theoretical occupancy printed using the provided formula consistently
gave `1.0`, suggesting optimal occupancy of the gpu would be 100%. looking
at the formula, this would mean that for the entire duration of our
program, we would need to keep all warps fed with data. however, if we use
a block size of 27x28 we obtain a theoretical occupancy of `0.75`. this
differs from the provided image, which suggests `0.979167`. that's weird;
i might have implemented the formula wrongly.

output for 8x8 block size:

```
./parallel
CUDA device count: 1
CUDA device #0:
Name: Tesla T4
Compute capability: 7.5
Multiprocessors: 40
Warp size: 32
Global memory: 14.6GiB bytes
Per-block shared memory: 48.0KiB
Per-block registers: 65536
Cooperative launch: YES
Total elapsed time: 9.008482 seconds
Grid size set to: (16, 16)
Launched blocks of size: (8, 8)
Theoretical occupancy: 1.000000
python3 compare.py data_sequential/00000.dat data/00000.dat

Data files data_sequential/00000.dat and data/00000.dat are identical within the margin of 0.0001

python3 compare.py data_sequential/00075.dat data/00075.dat

Data files data_sequential/00075.dat and data/00075.dat are identical within the margin of 0.0001

Job terminated on selbu
```

output for 27x28 block size:

```
./parallel
CUDA device count: 1
CUDA device #0:
Name: Tesla T4
Compute capability: 7.5
Multiprocessors: 40
Warp size: 32
Global memory: 14.6GiB bytes
Per-block shared memory: 48.0KiB
Per-block registers: 65536
Cooperative launch: YES
Total elapsed time: 10.004013 seconds
Grid size set to: (5, 5)
Launched blocks of size: (27, 28)
Theoretical occupancy: 0.750000
python3 compare.py data_sequential/00000.dat data/00000.dat

Data files data_sequential/00000.dat and data/00000.dat are identical within the margin of 0.0001

python3 compare.py data_sequential/00075.dat data/00075.dat

Data files data_sequential/00075.dat and data/00075.dat are identical within the margin of 0.0001

Job terminated on selbu
```