diff --git a/exercise7/report.md b/exercise7/report.md
new file mode 100644
index 0000000..d704f41
--- /dev/null
+++ b/exercise7/report.md
@@ -0,0 +1,150 @@
---
author: fredrik robertsen
date: 2025-11-04
title: "exercise 7"
---

## how i solved the problem (not a task)

i'm writing this as a little documentation. it is not part of the task.

i first automated running the code on snotra by creating various `remote_`
targets in the provided makefile, so that i could easily copy my source code
over to snotra, compile and run it there, and finally check it against the
precomputed sequential version. this was done using a mix of `scp`, `ssh` and
`srun` calls. after that, testing the code was a breeze, and i could get on
with implementing it.

after getting cuda to run, the code ran at about 2x the sequential speed,
i.e. roughly 40 vs 20 seconds. i wondered if it could go faster, so i used
`nvprof ./parallel` to profile the code on snotra, then analyzed that data and
looked for better solutions with an llm, [kagi ai](https://kagi.com/). we
arrived at a combined kernel that performs both the time step and the boundary
conditions. this made it easier to use cooperative groups, so that we only
needed a "minimal" number (1M) of kernel launches. the kernel takes some
arguments that are marked `__restrict__`, which lets the compiler generate
more efficient memory accesses. it works by telling the compiler that these
pointers don't overlap in memory, allowing for compile-time optimizations.
this is safe here, since the buffers for each time step are disjoint in
memory. in addition, we pass precomputed coefficients to the kernel. a sketch
of what such a kernel could look like follows the next section.

other than this, the solution seems to be fairly standard.

## `__global__` vs `__device__`

`__global__` functions (kernels) are called from the host and spawn a gpu
thread grid. `__device__` functions are only callable from gpu code, execute
in the caller's context, and cannot spawn new threads. functions marked with
`__global__` can thus be thought of as the main entry points of the gpu
program, while `__device__` functions are gpu helper functions that carry the
context of the gpu, i.e. they have access to `blockDim` and such.
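to make the distinction concrete, here is a minimal sketch; the names `scale`
and `scale_all` are made up for illustration and are not from the exercise
code:

```cuda
// hypothetical example: a __device__ helper called from a __global__ entry point
__device__ float scale(float x, float c)   // runs in the calling thread's context
{
    return c * x;
}

__global__ void scale_all(const float *in, float *out, float c, int n)
{
    // built-ins like blockIdx/blockDim/threadIdx are available here
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = scale(in[i], c);
}
```

the host would launch `scale_all<<<blocks, threads>>>(...)`; `scale` itself
can only be reached through a kernel like this.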
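and here is a rough sketch of the combined time-step-plus-boundary kernel
described in the first section. i'm assuming a five-point stencil and buffer
names of my own choosing (`u_prev`, `u_curr`, `u_next`), so the real kernel
differs; the `__restrict__` qualifiers and the grid-wide sync are the point:

```cuda
#include <cooperative_groups.h>
namespace cg = cooperative_groups;

// hypothetical combined kernel: stencil update, then boundary conditions,
// separated by a grid-wide barrier instead of a second kernel launch.
// must be launched with cudaLaunchCooperativeKernel for grid.sync() to work.
__global__ void step_kernel(const float *__restrict__ u_prev,
                            const float *__restrict__ u_curr,
                            float *__restrict__ u_next,
                            float coeff, int nx, int ny)
{
    cg::grid_group grid = cg::this_grid();
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;

    // interior time step (classic five-point stencil)
    if (x > 0 && x < nx - 1 && y > 0 && y < ny - 1) {
        int i = y * nx + x;
        u_next[i] = 2.0f * u_curr[i] - u_prev[i]
                  + coeff * (u_curr[i - 1] + u_curr[i + 1]
                           + u_curr[i - nx] + u_curr[i + nx]
                           - 4.0f * u_curr[i]);
    }

    grid.sync();  // all blocks finish the step before boundaries are applied

    // boundary conditions would follow here (mirroring edge values etc.)
}
```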
## cuda vs mpi

mpi is spmd: multiple processes run the same program code and obtain
parallelism by utilizing many cpu cores. this is good for compute clusters
where a great number of processors are available. such programs handle
concurrency through ranks (local process ids) and message passing, and the
communication overhead may be the main issue for some problems.

cuda is more similar to threaded parallelism, in that all threads share
program code and address space, but it runs on an nvidia gpu instead of cpu
cores. gpus have many cores and are capable of running thousands of threads in
parallel, so the problem often boils down to properly feeding the gpu and
keeping it busy.

we have seen that the previous implementations of the 2d wave equation (mpi
and threads) yielded some speed-ups, but the cuda code has given the greatest
speed-up i've seen yet: i went from some 40 seconds on snotra to only about 8
after some optimizations. in theory, we could probably increase the problem
size, run the program on a cluster with more gpus, and use hybrid programming
(e.g. mpi between nodes and cuda within them) to obtain even greater
throughput.

another advantage of cuda is that you can, as with the threaded code,
implement everything sequentially and then swap out the parts that are
bottlenecking you for parallel kernels. this allows for more iterative
development, rather than one-shotting it all in one go, as with mpi.

together with good profiling tools, this makes it easier to identify
bottlenecks and speed up your code, which makes cuda nice to write.

## pros and cons of cooperative groups

pros:

- grid-level synchronization without atomics or extra kernel launches
- cleaner code for hierarchical patterns

cons:

- frequent sync points serialize execution, reducing parallelism
- high overhead that's not worth it for simple kernels like stencils

## gpu occupancy

`occupancy = active warps / max warps per sm`.

the theoretical occupancy printed using the provided formula consistently gave
`1.0`, suggesting that optimal occupancy of the gpu would be 100%. looking at
the formula, that would mean keeping all warps fed with data for the entire
duration of the program. however, if we use a block size of 27x28 we obtain a
theoretical occupancy of `0.75`. this differs from the provided image, which
suggests `0.979167`. that's weird; i might have implemented the formula wrong.
(a way to cross-check against the runtime is sketched after the outputs
below.)

output for 8x8 block size:

```
./parallel
CUDA device count: 1
CUDA device #0:
  Name: Tesla T4
  Compute capability: 7.5
  Multiprocessors: 40
  Warp size: 32
  Global memory: 14.6GiB bytes
  Per-block shared memory: 48.0KiB
  Per-block registers: 65536
  Cooperative launch: YES
Total elapsed time: 9.008482 seconds
Grid size set to: (16, 16)
Launched blocks of size: (8, 8)
Theoretical occupancy: 1.000000
python3 compare.py data_sequential/00000.dat data/00000.dat

Data files data_sequential/00000.dat and data/00000.dat are identical within the margin of 0.0001

python3 compare.py data_sequential/00075.dat data/00075.dat

Data files data_sequential/00075.dat and data/00075.dat are identical within the margin of 0.0001

Job terminated on selbu
```

output for 27x28 block size:

```
./parallel
CUDA device count: 1
CUDA device #0:
  Name: Tesla T4
  Compute capability: 7.5
  Multiprocessors: 40
  Warp size: 32
  Global memory: 14.6GiB bytes
  Per-block shared memory: 48.0KiB
  Per-block registers: 65536
  Cooperative launch: YES
Total elapsed time: 10.004013 seconds
Grid size set to: (5, 5)
Launched blocks of size: (27, 28)
Theoretical occupancy: 0.750000
python3 compare.py data_sequential/00000.dat data/00000.dat

Data files data_sequential/00000.dat and data/00000.dat are identical within the margin of 0.0001

python3 compare.py data_sequential/00075.dat data/00075.dat

Data files data_sequential/00075.dat and data/00075.dat are identical within the margin of 0.0001

Job terminated on selbu
```
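to sanity-check the hand-rolled formula, the cuda runtime's own occupancy
calculator could be used for comparison. a minimal sketch, assuming the
`step_kernel` sketched earlier is in scope (names are mine, not the
exercise's):

```cuda
#include <cstdio>

// cross-check: ask the runtime how many blocks of this size fit per sm,
// then derive occupancy as active warps / max warps per sm.
void print_occupancy(dim3 block)
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);

    int threads = block.x * block.y;
    int max_blocks_per_sm = 0;
    cudaOccupancyMaxActiveBlocksPerMultiprocessor(
        &max_blocks_per_sm, step_kernel, threads, /*dynamic smem*/ 0);

    int warps_per_block = (threads + prop.warpSize - 1) / prop.warpSize;
    int max_warps_per_sm = prop.maxThreadsPerMultiProcessor / prop.warpSize;

    double occupancy =
        (double)(max_blocks_per_sm * warps_per_block) / max_warps_per_sm;
    printf("%ux%u -> occupancy %f\n", block.x, block.y, occupancy);
}
```

if this also prints `0.75` for 27x28 (756 threads = 24 warps, and only one
such block fits in the t4's 1024 threads per sm, giving 24/32), then the
hand-rolled formula agrees with the runtime and the provided image is likely
measuring something else.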