---
author: fredrik robertsen
date: 2025-11-04
title: "exercise 7"
---

## how i solved the problem (not a task)

i'm writing this as a little documentation. this is not part of the task.

i first automated running the code on snotra by creating various `remote_` targets in the provided makefile, so that i could easily copy my source code over to snotra, compile and run it there, and finally check it against the precomputed sequential version. this was done with a mix of `scp`, `ssh` and `srun` calls. after that, testing the code was a breeze, so i could get on with implementing it.

after getting cuda to run, it ran at about 2x the sequential speed, i.e. roughly 40s down to 20s. i wondered if it could go faster, so i used `nvprof ./parallel` to profile the code on snotra, then analyzed that data and explored better solutions with an llm, [kagi ai](https://kagi.com/). we arrived at a combined kernel that performs both the time step and the boundary conditions. this made it easier to use cooperative groups, so that we only had "minimal" (1M) kernel launches.

this kernel takes some arguments marked `__restrict__`, which makes the passing of arguments more memory efficient. it works by telling the compiler that these pointers don't overlap in memory (no aliasing), allowing for compile-time optimizations. this makes sense here, since the buffers for each time step are disjoint in memory. in addition, we pass precomputed coefficients to the kernel. other than this, the solution seems to be fairly standard.

## `__global__` vs `__device__`

functions marked `__global__` are kernels: they are called from the host and spawn a grid of gpu threads. functions marked `__device__` can only be called from device code; they execute in the caller's context and cannot spawn new threads. `__global__` functions can thus be thought of as the entry points of the gpu program, while `__device__` functions are gpu helper functions that carry the gpu context, i.e. they have access to `blockDim` and such.
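to illustrate the distinction, here is a minimal sketch (not the actual exercise code; `average` and `step_kernel` are made-up names):

```cuda
// a __device__ helper: callable only from gpu code, runs in the
// calling thread's context (it can read blockDim, threadIdx, etc.).
__device__ float average(const float *__restrict__ buf, int i, int nx) {
    return 0.25f * (buf[i - 1] + buf[i + 1] + buf[i - nx] + buf[i + nx]);
}

// a __global__ kernel: the gpu entry point, launched from the host.
// __restrict__ promises the compiler that `in` and `out` never alias.
__global__ void step_kernel(const float *__restrict__ in,
                            float *__restrict__ out, int nx, int ny) {
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x > 0 && x < nx - 1 && y > 0 && y < ny - 1)
        out[y * nx + x] = average(in, y * nx + x, nx);
}

// host side, spawning the thread grid:
// step_kernel<<<grid, block>>>(d_in, d_out, nx, ny);
```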
## cuda vs mpi

mpi is spmd: multiple processes run the same program code, obtaining parallelism by utilizing many cpu cores. this is good for compute clusters where a great number of processors are available. these programs handle concurrency through ranks (local process ids) and message passing, and that communication overhead may be the main issue for some problems.

cuda is similar to threaded parallelism, in that the threads share program code and address space, but it runs on an nvidia gpu instead of cpu cores. gpus have many cores and are capable of running thousands of threads in parallel. the problem often boils down to properly feeding the gpu, keeping it busy.

we have seen that the previous implementations of the 2d wave equation (mpi and threads) yielded some speed-ups, but the cuda code has given the greatest speed-up i've seen yet: i went from some 40 seconds on snotra to only about 8 after some optimizations. in theory, we could probably increase the problem size, run the program on a cluster with more gpus, and use hybrid programming to obtain even greater throughput.

another advantage of cuda is that you can, as with the threaded code, implement everything sequentially and then swap out the parts that are bottlenecking you with parallel kernels. this allows for more iterative development, rather than one-shotting it all, as with mpi. together with good profiling tools, you can more easily identify bottlenecks to help speed up your code. this makes cuda nice to write.

## pros and cons of cooperative groups

pros:

- grid-level synchronization without atomics or extra kernel launches
- cleaner code for hierarchical patterns

cons:

- frequent sync points serialize execution, reducing parallelism
- high overhead, not worth it for simple kernels like stencils

## gpu occupancy

`occupancy = active warps / max warps per sm`.
the theoretical occupancy printed using the provided formula consistently gave `1.0`, suggesting optimal, 100% occupancy of the gpu. looking at the formula, that would mean keeping all warps fed with data for the entire duration of the program. however, with a block size of 27x28 we obtain a theoretical occupancy of `0.75`. this differs from the provided image, which suggests `0.979167`. that's weird. i might have implemented the formula wrongly.

output for 8x8 block size:

```
./parallel
CUDA device count: 1
CUDA device #0:
  Name: Tesla T4
  Compute capability: 7.5
  Multiprocessors: 40
  Warp size: 32
  Global memory: 14.6GiB bytes
  Per-block shared memory: 48.0KiB
  Per-block registers: 65536
  Cooperative launch: YES
Total elapsed time: 9.008482 seconds
Grid size set to: (16, 16)
Launched blocks of size: (8, 8)
Theoretical occupancy: 1.000000
python3 compare.py data_sequential/00000.dat data/00000.dat
Data files data_sequential/00000.dat and data/00000.dat are identical within the margin of 0.0001
python3 compare.py data_sequential/00075.dat data/00075.dat
Data files data_sequential/00075.dat and data/00075.dat are identical within the margin of 0.0001
Job terminated on selbu
```

output for 27x28 block size:

```
./parallel
CUDA device count: 1
CUDA device #0:
  Name: Tesla T4
  Compute capability: 7.5
  Multiprocessors: 40
  Warp size: 32
  Global memory: 14.6GiB bytes
  Per-block shared memory: 48.0KiB
  Per-block registers: 65536
  Cooperative launch: YES
Total elapsed time: 10.004013 seconds
Grid size set to: (5, 5)
Launched blocks of size: (27, 28)
Theoretical occupancy: 0.750000
python3 compare.py data_sequential/00000.dat data/00000.dat
Data files data_sequential/00000.dat and data/00000.dat are identical within the margin of 0.0001
python3 compare.py data_sequential/00075.dat data/00075.dat
Data files data_sequential/00075.dat and data/00075.dat are identical within the margin of 0.0001
Job terminated on selbu
```
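one way to cross-check a hand-written occupancy formula is to ask the cuda runtime for its own estimate, which does account for register and shared-memory limits. a sketch (host-side; `step_kernel` is a placeholder for the actual kernel, `block` is the chosen block shape):

```cuda
int max_blocks;
cudaOccupancyMaxActiveBlocksPerMultiprocessor(
    &max_blocks, step_kernel, block.x * block.y, /* dynamic smem */ 0);

cudaDeviceProp prop;
cudaGetDeviceProperties(&prop, 0);

int warps_per_block = (block.x * block.y + prop.warpSize - 1) / prop.warpSize;
int max_warps = prop.maxThreadsPerMultiProcessor / prop.warpSize;
printf("theoretical occupancy: %f\n",
       (double)(max_blocks * warps_per_block) / max_warps);
```

if this disagrees with the by-hand number, a per-kernel resource limit (registers or shared memory) is probably capping the resident block count.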