| author | date | title |
|---|---|---|
| fredrik robertsen | 2025-11-04 | exercise 7 |
how i solved the problem (not a task)
i'm writing this as a little documentation. this is not part of the task.
i first automated running the code on snotra by creating various remote_ targets
in the provided makefile, so that i could easily copy my source code over to
snotra, compile and run it there, and finally check the output against the
precomputed sequential version. this was done with a mix of scp, ssh and srun
calls. after that, testing the code was a breeze, and i could get on with
implementing it.
after getting cuda to run, the code ran at about 2x the sequential speed, i.e.
roughly 40 s down to 20 s. i wondered if it could go faster, so i ran
nvprof ./parallel to profile the code on snotra, analyzed that data and
explored better solutions with an llm, kagi ai. we arrived at a combined
kernel that performs both the time step and the boundary conditions. this made
it easier to use cooperative groups, so that we only needed a "minimal" number
of kernel launches (about 1M). the kernel takes pointer arguments marked
__restrict__, which makes the passing of arguments more memory efficient: it
tells the compiler that these arguments don't overlap in memory, allowing for
compile-time optimizations. that makes sense, since the buffers for each time
step are disjoint in memory. in addition to this, we pass precomputed
coefficients to the kernel.
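to illustrate the idea, here is a minimal sketch of what such a combined kernel could look like. this is not the exact kernel from my solution: the array names, the coefficient c, the reflective boundary handling and the indexing are illustrative assumptions.

```cuda
#include <cooperative_groups.h>
namespace cg = cooperative_groups;

// one time step: stencil update and boundary conditions in a single kernel.
// the grid-wide barrier guarantees all interior points are written before
// any thread mirrors values onto the boundary.
__global__ void time_step_kernel(const float *__restrict__ prev,
                                 const float *__restrict__ curr,
                                 float       *__restrict__ next,
                                 int nx, int ny, float c)
{
    cg::grid_group grid = cg::this_grid();
    int j = blockIdx.x * blockDim.x + threadIdx.x;  // column
    int i = blockIdx.y * blockDim.y + threadIdx.y;  // row

    // interior update: 5-point stencil for the 2d wave equation,
    // with c a precomputed coefficient such as (dt*dt)/(dx*dx)
    if (i > 0 && i < ny - 1 && j > 0 && j < nx - 1) {
        next[i * nx + j] =
            2.0f * curr[i * nx + j] - prev[i * nx + j]
            + c * (curr[(i - 1) * nx + j] + curr[(i + 1) * nx + j]
                 + curr[i * nx + j - 1]  + curr[i * nx + j + 1]
                 - 4.0f * curr[i * nx + j]);
    }

    grid.sync();  // grid-wide barrier instead of a second kernel launch

    // boundary conditions (simple reflective mirroring; corners omitted)
    if (i > 0 && i < ny - 1) {
        if (j == 0)      next[i * nx]            = next[i * nx + 1];
        if (j == nx - 1) next[i * nx + nx - 1]   = next[i * nx + nx - 2];
    }
    if (j > 0 && j < nx - 1) {
        if (i == 0)      next[j]                 = next[nx + j];
        if (i == ny - 1) next[(ny - 1) * nx + j] = next[(ny - 2) * nx + j];
    }
}
```

marking prev, curr and next as __restrict__ (with the read-only buffers also const) is what tells the compiler that they don't alias. grid.sync() is only valid when the kernel is launched as a cooperative kernel and the whole grid is resident on the gpu at once; a host-side sketch of such a launch is included under the cooperative groups section below.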
other than this, the solution seems to be fairly standard.
__global__ vs __device__
__global__ kernels are called from the host and spawn a grid of gpu threads.
__device__ functions can only be called from gpu code; they execute in the
caller's context and cannot spawn new threads. functions marked with
__global__ can thus be thought of as the main entry points of the gpu program,
and __device__ functions are gpu helper functions that carry the gpu context,
i.e. they have access to blockDim and such.
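a small illustrative example (not from the exercise code):

```cuda
// __device__: helper that can only be called from gpu code and runs in the
// calling thread's context
__device__ float average4(const float *u, int idx, int nx)
{
    return 0.25f * (u[idx - 1] + u[idx + 1] + u[idx - nx] + u[idx + nx]);
}

// __global__: entry point launched from the host, spawning a grid of threads
__global__ void smooth(const float *in, float *out, int nx, int ny)
{
    int j = blockIdx.x * blockDim.x + threadIdx.x;
    int i = blockIdx.y * blockDim.y + threadIdx.y;
    if (i > 0 && i < ny - 1 && j > 0 && j < nx - 1)
        out[i * nx + j] = average4(in, i * nx + j, nx);
}
```

the host launches smooth<<<grid, block>>>(...), while average4 cannot be launched or called from the host at all.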
cuda vs mpi
mpi is spmd: multiple processes run the same program code and obtain parallelism by utilizing many cpu cores. this is good for compute clusters where a great number of processors are available. these programs handle concurrency through ranks (local process ids) and message passing, and that communication overhead may be the main issue for some problems.
cuda is similar to threaded parallelism in that the threads share program code and address space, but it runs on an nvidia gpu instead of cpu cores. gpus have many cores and are capable of running thousands of threads in parallel, so the problem often boils down to properly feeding the gpu and keeping it busy.
we have seen that the previous implementations of the 2d wave equation (mpi and threads) yielded some speed-ups, but the cuda code has given the greatest speed-up i've seen yet: i went from some 40 seconds on snotra to only about 8 after some optimizations. in theory, we could probably increase the problem size, run the program on a cluster with more gpus and use hybrid programming to obtain even greater throughput.
another advantage of cuda is that you can, as with the threaded code, implement everything sequentially and then swap out the bottlenecking parts with parallel kernels. this allows for more iterative development, rather than doing it all in one go, as with mpi.
together with good profiling tools, this makes it easier to identify bottlenecks and speed up your code, which makes cuda nice to write.
pros and cons of cooperative groups
pros:
- grid-level synchronization without atomics or extra kernel launches (see the sketch after this list)
- cleaner code for hierarchical patterns
cons:
- frequent sync points serialize execution, reducing parallelism
- high overhead that is often not worth it for simple kernels like stencils
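to make the first pro concrete, here is a small self-contained sketch of a cooperative launch. the kernel and the names are made up for illustration; cudaLaunchCooperativeKernel and grid.sync() are the actual cuda mechanisms. it assumes a device with cooperative launch support (the T4 reports "Cooperative launch: YES" below) and compilation for its architecture, e.g. nvcc -arch=sm_75.

```cuda
#include <cooperative_groups.h>
#include <cstdio>
namespace cg = cooperative_groups;

// normalizes a vector by its first element. without grid.sync() this would
// need two kernel launches: one to read v[0], one to do the division.
__global__ void normalize_by_first(float *v, int n)
{
    cg::grid_group grid = cg::this_grid();
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    float pivot = v[0];   // every thread reads v[0] first ...
    grid.sync();          // ... and no thread writes before all have read it
    if (idx < n) v[idx] /= pivot;
}

int main()
{
    int n = 1 << 10;
    float *v;
    cudaMallocManaged(&v, n * sizeof(float));
    for (int i = 0; i < n; i++) v[i] = 2.0f * (i + 1);

    // cooperative kernels must go through cudaLaunchCooperativeKernel,
    // and the whole grid must be resident on the gpu at once
    int block = 128;
    int blocks = (n + block - 1) / block;
    void *args[] = { &v, &n };
    cudaLaunchCooperativeKernel((void *)normalize_by_first,
                                dim3(blocks), dim3(block), args, 0, 0);
    cudaDeviceSynchronize();
    printf("v[0]=%.1f v[n-1]=%.1f\n", v[0], v[n - 1]);  // expect 1.0 and 1024.0

    cudaFree(v);
    return 0;
}
```

the second con is visible here too: every grid.sync() is a full grid-wide barrier, so the more of them a kernel has, the less work the gpu can overlap.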
gpu occupancy
occupancy = active warps per sm / max warps per sm.
the theoretical occupancy printed using the provided formula consistently gave
1.0, suggesting that optimal occupancy of the gpu would be 100%. looking at
the formula, that would mean that for the entire duration of our program we
would need to keep all warps fed with data. however, if we use a block size of
27x28 we obtain a theoretical occupancy of 0.75. this differs from the
provided image, which suggests 0.979167. that's weird; i might have
implemented the formula wrongly.
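as a sanity check of those numbers, assuming occupancy is limited only by the block and warp counts (ignoring registers and shared memory), the T4 (compute capability 7.5) allows at most 32 resident warps, 16 resident blocks and 1024 resident threads per sm:

- 8x8 block: 64 threads = 2 warps per block; the 16-blocks-per-sm limit caps us at 16 * 2 = 32 warps, so occupancy = 32/32 = 1.0
- 27x28 block: 756 threads = ceil(756/32) = 24 warps per block; a second block would exceed the 32-warp and 1024-thread limits, so only one block fits and occupancy = 24/32 = 0.75

so the printed 0.75 is at least consistent with the hardware limits.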
output for 8x8 block size:
./parallel
CUDA device count: 1
CUDA device #0:
Name: Tesla T4
Compute capability: 7.5
Multiprocessors: 40
Warp size: 32
Global memory: 14.6GiB bytes
Per-block shared memory: 48.0KiB
Per-block registers: 65536
Cooperative launch: YES
Total elapsed time: 9.008482 seconds
Grid size set to: (16, 16)
Launched blocks of size: (8, 8)
Theoretical occupancy: 1.000000
python3 compare.py data_sequential/00000.dat data/00000.dat
Data files data_sequential/00000.dat and data/00000.dat are identical within the margin of 0.0001
python3 compare.py data_sequential/00075.dat data/00075.dat
Data files data_sequential/00075.dat and data/00075.dat are identical within the margin of 0.0001
Job terminated on selbu
output for 27x28 block size:
./parallel
CUDA device count: 1
CUDA device #0:
Name: Tesla T4
Compute capability: 7.5
Multiprocessors: 40
Warp size: 32
Global memory: 14.6GiB bytes
Per-block shared memory: 48.0KiB
Per-block registers: 65536
Cooperative launch: YES
Total elapsed time: 10.004013 seconds
Grid size set to: (5, 5)
Launched blocks of size: (27, 28)
Theoretical occupancy: 0.750000
python3 compare.py data_sequential/00000.dat data/00000.dat
Data files data_sequential/00000.dat and data/00000.dat are identical within the margin of 0.0001
python3 compare.py data_sequential/00075.dat data/00075.dat
Data files data_sequential/00075.dat and data/00075.dat are identical within the margin of 0.0001
Job terminated on selbu