author: fredrik robertsen
date: 2025-11-04
title: exercise 7

how i solved the problem (not a task)

i'm writing this as a little documentation. this is not part of the task.

i first automated running the code on snotra by creating various remote_ targets in the provided makefile, so that i could easily copy my source code over to snotra, compile and run it there, and finally check the output against the precomputed sequential reference. this was done using a mix of scp, ssh and srun calls. after that, testing the code was a breeze, so i could get on with implementing it.

after getting cuda to run, it ran at about 2x the sequential speed, i.e. 40/20 roughly. i wondered if it could go faster, so i used nvprof ./parallel to profile the code on snotra, then analyzed that data and looked for better solutions with an llm, kagi ai. we arrived at a combined kernel that performs both the time step and the boundary conditions. this made it easier to use cooperative groups, so that we only needed a "minimal" (1M) number of kernel launches. this kernel takes some arguments that are __restrict__-qualified, which magically makes the passing of arguments more memory efficient. it works by telling the compiler that these arguments don't overlap in memory, allowing for compile-time optimizations. this makes sense, since the buffers for the different time steps are disjoint in memory. in addition to this, we pass precomputed coefficients to the kernel.
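
below is a minimal sketch of what such a combined kernel can look like. the names (step_kernel, buf_a/buf_b/buf_c, N, c, steps) are mine, and both the 5-point stencil coefficient and the dirichlet boundary are illustrative rather than the actual assignment code:

```cuda
#include <cooperative_groups.h>
namespace cg = cooperative_groups;

// Combined time step + boundary kernel: the whole simulation runs inside one
// cooperative launch, with grid.sync() replacing per-step kernel launches.
__global__ void step_kernel(float *__restrict__ buf_a,
                            float *__restrict__ buf_b,
                            float *__restrict__ buf_c,
                            int N, float c, int steps)
{
    cg::grid_group grid = cg::this_grid();
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;

    // local working pointers, rotated after every time step
    float *prev = buf_a, *curr = buf_b, *next = buf_c;

    for (int s = 0; s < steps; s++) {
        if (x > 0 && x < N - 1 && y > 0 && y < N - 1) {
            // interior update: illustrative 5-point stencil with a
            // coefficient c precomputed on the host
            next[y * N + x] = 2.0f * curr[y * N + x] - prev[y * N + x]
                + c * (curr[y * N + x - 1] + curr[y * N + x + 1]
                     + curr[(y - 1) * N + x] + curr[(y + 1) * N + x]
                     - 4.0f * curr[y * N + x]);
        } else if (x < N && y < N) {
            // boundary handled in the same kernel (illustrative dirichlet
            // condition here, not the exercise's actual boundary condition)
            next[y * N + x] = 0.0f;
        }

        grid.sync();  // every block must finish writing `next` before it is read as `curr`

        float *tmp = prev; prev = curr; curr = next; next = tmp;  // identical rotation in every thread
    }
}
```

note that grid.sync() is only valid when the kernel is started as a cooperative launch (see the launch sketch further down).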

other than this, the solution seems to be fairly standard.

__global__ vs __device__

__global__ kernels are called from the host and spawn a grid of gpu threads. __device__ functions can only be called from code already running on the gpu, execute in the caller's context and do not spawn new threads. functions marked __global__ can thus be thought of as the entry points into the gpu, while __device__ functions are gpu helper functions that carry the gpu context, i.e. they have access to blockDim, threadIdx and such.
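
a small illustrative example of the split (the names scale and scale_kernel are made up for this sketch):

```cuda
// __device__ helper: only callable from code already running on the GPU
__device__ float scale(float v, float factor)
{
    return v * factor;
}

// __global__ kernel: the entry point, launched from the host
__global__ void scale_kernel(float *data, int n, float factor)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // built-ins like blockDim are available here
    if (i < n)
        data[i] = scale(data[i], factor);           // call into the __device__ helper
}

// host side: scale_kernel<<<(n + 255) / 256, 256>>>(d_data, n, 2.0f);
```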

cuda vs mpi

mpi is spmd: it runs multiple processes on the same program code and obtains parallelism by utilizing many cpu cores. this is good for compute clusters where a large number of processors is available. these programs handle concurrency through ranks (process ids within a communicator) and message passing, and that communication overhead can be the main issue for some problems.

cuda is similar to threaded parallelism in that all the gpu threads run the same program code and share an address space, but the work runs on an nvidia gpu instead of on cpu cores. gpus have many cores and are capable of running thousands of threads in parallel, so the problem often boils down to properly feeding the gpu and keeping it busy.

we have seen that the previous implementations of the 2d wave equation (mpi and threads) yielded some speed-up, but the cuda code has given the greatest speed-up i've seen yet: i went from some 40 seconds on snotra to only about 8 after some optimizations. in theory, we could probably increase the problem size, run the program on a cluster with more gpus and use hybrid programming to obtain even greater throughput.

another advantage of cuda is that you can, as with the threaded code, implement everything sequentially and then swap out the parts that are bottlenecking you with parallel kernels. this allows for more iterative development, rather than having to restructure everything in one go, as with mpi.

together with good profiling tools, you can more easily identify bottlenecks to help speed up your code. this makes cuda nice to write.

pros and cons of cooperative groups

pros:

  • grid-level synchronization without atomics or extra kernel launches
  • cleaner code for hierarchical patterns

cons:

  • frequent sync points serialize execution, reducing parallelism
  • the extra launch machinery and synchronization overhead may not be worth it for simple kernels like stencils (see the launch sketch after this list)
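
to make the launch point concrete, here is a hedged sketch of the host side, assuming the step_kernel signature from the sketch above (run_simulation, d_prev, d_curr, d_next are illustrative names). a cooperative kernel must be started with cudaLaunchCooperativeKernel; a plain <<<...>>> launch does not permit grid.sync():

```cuda
// assumes device buffers d_prev, d_curr, d_next of size N*N are already allocated
void run_simulation(float *d_prev, float *d_curr, float *d_next,
                    int N, float c, int steps)
{
    dim3 block(8, 8);
    dim3 grid((N + block.x - 1) / block.x, (N + block.y - 1) / block.y);

    // kernel arguments are passed as an array of pointers to the argument values
    void *args[] = { &d_prev, &d_curr, &d_next, &N, &c, &steps };

    // one cooperative launch covers all time steps; grid.sync() inside the
    // kernel replaces one kernel launch per time step
    cudaLaunchCooperativeKernel((void *)step_kernel, grid, block, args, 0, 0);
    cudaDeviceSynchronize();
}
```

one practical limitation, which feeds into the cons above, is that a cooperative launch requires the entire grid to be resident on the device at once, so the grid size cannot exceed what the gpu can run concurrently.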

gpu occupancy

occupancy = active warps / max warps per sm.

the theoretical occupancy printed using the provided formula was consistently 1.0, suggesting the gpu could in principle be 100% occupied. looking at the formula, that would mean keeping all warps on every sm fed with data for the entire duration of the program. however, with a block size of 27x28 we obtain a theoretical occupancy of 0.75: a 27x28 block has 756 threads, i.e. ceil(756/32) = 24 warps, and only one such block fits within the t4's limit of 32 resident warps per sm, giving 24/32 = 0.75. this differs from the provided image, which suggests 0.979167. that's weird; i might have implemented the formula wrongly.
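
for comparison, occupancy can also be estimated with the cuda occupancy api instead of a hand-written formula. a rough sketch (this is not the formula provided in the exercise, and step_kernel refers to the earlier kernel sketch):

```cuda
// estimate theoretical occupancy for a given block size using the runtime API
double theoretical_occupancy(dim3 block)
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);

    int threads_per_block = block.x * block.y;
    int max_blocks_per_sm = 0;
    cudaOccupancyMaxActiveBlocksPerMultiprocessor(&max_blocks_per_sm, step_kernel,
                                                  threads_per_block, 0);

    int warps_per_block = (threads_per_block + prop.warpSize - 1) / prop.warpSize;
    int max_warps_per_sm = prop.maxThreadsPerMultiProcessor / prop.warpSize;

    // active warps per SM divided by the maximum number of resident warps per SM
    return (double)(max_blocks_per_sm * warps_per_block) / max_warps_per_sm;
}
```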

output for 8x8 block size:

```
./parallel
CUDA device count: 1
CUDA device #0:
        Name: Tesla T4
        Compute capability: 7.5
        Multiprocessors: 40
        Warp size: 32
        Global memory: 14.6GiB bytes
        Per-block shared memory: 48.0KiB
        Per-block registers: 65536
        Cooperative launch: YES
Total elapsed time: 9.008482 seconds
Grid size set to:             (16, 16)
Launched blocks of size:      (8, 8)
Theoretical occupancy:        1.000000
python3 compare.py data_sequential/00000.dat data/00000.dat

Data files data_sequential/00000.dat and data/00000.dat are identical within the margin of 0.0001

python3 compare.py data_sequential/00075.dat data/00075.dat

Data files data_sequential/00075.dat and data/00075.dat are identical within the margin of 0.0001

Job terminated on selbu
```

output for 27x28 block size:

```
./parallel
CUDA device count: 1
CUDA device #0:
        Name: Tesla T4
        Compute capability: 7.5
        Multiprocessors: 40
        Warp size: 32
        Global memory: 14.6GiB bytes
        Per-block shared memory: 48.0KiB
        Per-block registers: 65536
        Cooperative launch: YES
Total elapsed time: 10.004013 seconds
Grid size set to:             (5, 5)
Launched blocks of size:      (27, 28)
Theoretical occupancy:        0.750000
python3 compare.py data_sequential/00000.dat data/00000.dat

Data files data_sequential/00000.dat and data/00000.dat are identical within the margin of 0.0001

python3 compare.py data_sequential/00075.dat data/00075.dat

Data files data_sequential/00075.dat and data/00075.dat are identical within the margin of 0.0001

Job terminated on selbu
```