---
author: fredrik robertsen
date: 2025-11-04
title: "exercise 7"
---

## how i solved the problem (not a task)

i'm writing this as a little documentation. this is not part of the task.

i first automated running the code on snotra by creating various `remote_`
targets in the provided makefile, so that i could easily copy my source code
over to snotra, compile and run it there, and finally check the output
against the precomputed sequential version. this was done using a mix of
`scp`, `ssh` and `srun` calls. after that, testing the code was a breeze,
and i could get on with implementing it.

after getting cuda to run, it ran at about 2x the sequential speed, i.e.
roughly 40/20 seconds. i wondered if it could go faster, so i used
`nvprof ./parallel` to profile the code on snotra, then analyzed that data
and found better solutions with an llm, [kagi ai](https://kagi.com/). we
arrived at a combined kernel that performs both the time step and the
boundary conditions. this made it easier to use cooperative groups, such
that we only had "minimal" (1M) kernel launches. this kernel takes some
arguments that are `__restrict__`-qualified, which makes the passing of
arguments more memory efficient. it works by telling the compiler that
these arguments don't overlap in memory, allowing for compile-time
optimizations. this makes sense, since each time step is disjoint in
memory. in addition to this, we pass precomputed coefficients to the
kernel.

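as a sketch, this is roughly what such a combined cooperative-groups kernel
can look like. this is not the actual exercise code: the names and the 1d
stencil are illustrative only.

```cuda
#include <cooperative_groups.h>
namespace cg = cooperative_groups;

// one launch runs all time steps; grid.sync() takes the place of
// per-step kernel launches. __restrict__ promises the compiler that
// the three buffers never alias, which holds because each time step
// lives in its own buffer.
__global__ void step_and_boundary(float *__restrict__ prv,
                                  float *__restrict__ cur,
                                  float *__restrict__ nxt,
                                  int n, int steps, float c)
{
    cg::grid_group grid = cg::this_grid();
    int i = blockIdx.x * blockDim.x + threadIdx.x;

    for (int t = 0; t < steps; t++) {
        // time step for the interior points
        if (i > 0 && i < n - 1)
            nxt[i] = 2.0f * cur[i] - prv[i]
                   + c * (cur[i - 1] - 2.0f * cur[i] + cur[i + 1]);
        // boundary conditions in the same kernel (fixed ends here)
        if (i == 0)     nxt[0] = 0.0f;
        if (i == n - 1) nxt[n - 1] = 0.0f;
        grid.sync();  // wait for the whole grid before rotating buffers
        float *tmp = prv; prv = cur; cur = nxt; nxt = tmp;
    }
}
```

a kernel containing `grid.sync()` must be launched with
`cudaLaunchCooperativeKernel` rather than the usual `<<<...>>>` syntax.
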
other than this, the solution seems to be fairly standard.

## `__global__` vs `__device__`

`__global__` functions are kernels: they are launched from the host and
spawn a grid of gpu threads. `__device__` functions can only be called from
code already running on the gpu; they execute in the caller's context and
cannot spawn new threads. functions marked with `__global__` can thus be
thought of as the main entry points of the gpu program, while `__device__`
functions are gpu helper functions that carry the gpu context, i.e. they
have access to `blockDim` and such.

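a small illustration (hypothetical helper and kernel):

```cuda
// __device__: callable only from gpu code; runs in the calling
// thread's context and spawns no new threads.
__device__ float average(float a, float b)
{
    return 0.5f * (a + b);
}

// __global__: the gpu-side entry point; launched from the host,
// spawning a grid of gridDim.x * blockDim.x threads.
__global__ void smooth(const float *in, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i > 0 && i < n - 1)
        out[i] = average(in[i - 1], in[i + 1]);
}

// host side: smooth<<<blocks, threads_per_block>>>(d_in, d_out, n);
```
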
## cuda vs mpi

mpi is spmd: multiple processes run the same program code to obtain
parallelism by utilizing many cpu cores. this is good for compute clusters
where a great number of processors are available. these programs handle
concurrency through ranks (local process ids) and message passing, and the
communication overhead can be the main issue for some problems.

cuda is similar to threaded parallelism, in that the threads share program
code and address space, but they run on an nvidia gpu instead of cpu cores.
gpus have many cores and are capable of running thousands of threads in
parallel; the problem often boils down to properly feeding the gpu, keeping
it busy.

we have seen that the previous implementations of the 2d wave equation (mpi
and threads) yielded some speed-ups, but the cuda code has given the
greatest speed-up i've seen yet: i went from some 40 seconds on snotra to
only about 8 after some optimizations. in theory, we could probably
increase the problem size, run the program on a cluster with more gpus, and
use hybrid programming to obtain even greater throughput.

another advantage of cuda is that you can, as with the threaded code,
implement everything sequentially and then swap out the bottlenecking parts
with parallel kernels. this allows for more iterative development, rather
than one-shotting it all in one go, as with mpi.

together with good profiling tools, this makes it easier to identify
bottlenecks and speed up your code, which makes cuda nice to write.

## pros and cons of cooperative groups

pros:

- grid-level synchronization without atomics or extra kernel launches
- cleaner code for hierarchical patterns

cons:

- frequent sync points serialize execution, reducing parallelism
- high overhead, not worth it for simple kernels like stencils

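for context, grid-level synchronization requires the kernel to be launched
through the cooperative launch api rather than the `<<<...>>>` syntax. a
sketch, with hypothetical kernel and variable names:

```cuda
// kernel arguments are passed as an array of pointers
void *args[] = { &d_prv, &d_cur, &d_nxt, &n, &steps, &c };
dim3 grid(16, 16), block(8, 8);

// all blocks must be resident on the gpu at once for grid.sync()
// to work, so the runtime rejects launches exceeding device capacity
cudaLaunchCooperativeKernel((void *)my_kernel, grid, block, args);
cudaDeviceSynchronize();
```
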
## gpu occupancy

`occupancy = active warps / max warps per sm`.

the theoretical occupancy printed using the provided formula consistently
gave `1.0`, suggesting optimal occupancy of the gpu would be 100%. looking
at the formula, this would mean that for the entire duration of our
program, we would need to keep all warps fed with data. however, if we use
a block size of 27x28 we obtain a theoretical occupancy of `0.75`. this
differs from the provided image, which suggests `0.979167`. that's weird;
i might have implemented the formula wrongly.

output for 8x8 block size:

```
./parallel
CUDA device count: 1
CUDA device #0:
Name: Tesla T4
Compute capability: 7.5
Multiprocessors: 40
Warp size: 32
Global memory: 14.6GiB bytes
Per-block shared memory: 48.0KiB
Per-block registers: 65536
Cooperative launch: YES
Total elapsed time: 9.008482 seconds
Grid size set to: (16, 16)
Launched blocks of size: (8, 8)
Theoretical occupancy: 1.000000
python3 compare.py data_sequential/00000.dat data/00000.dat

Data files data_sequential/00000.dat and data/00000.dat are identical within the margin of 0.0001

python3 compare.py data_sequential/00075.dat data/00075.dat

Data files data_sequential/00075.dat and data/00075.dat are identical within the margin of 0.0001

Job terminated on selbu
```

output for 27x28 block size:

```
./parallel
CUDA device count: 1
CUDA device #0:
Name: Tesla T4
Compute capability: 7.5
Multiprocessors: 40
Warp size: 32
Global memory: 14.6GiB bytes
Per-block shared memory: 48.0KiB
Per-block registers: 65536
Cooperative launch: YES
Total elapsed time: 10.004013 seconds
Grid size set to: (5, 5)
Launched blocks of size: (27, 28)
Theoretical occupancy: 0.750000
python3 compare.py data_sequential/00000.dat data/00000.dat

Data files data_sequential/00000.dat and data/00000.dat are identical within the margin of 0.0001

python3 compare.py data_sequential/00075.dat data/00075.dat

Data files data_sequential/00075.dat and data/00075.dat are identical within the margin of 0.0001

Job terminated on selbu
```