---
author: fredrik robertsen
date: 2025-11-04
title: "exercise 7"
---

## how i solved the problem (not a task)

i'm writing this as a little documentation. this is not part of the task.

i first automated running the code on snotra by creating various `remote_`
targets in the provided makefile, so that i could easily copy my source code
over to snotra, compile and run it there, and finally check it against the
precomputed sequential version. this was done using a mix of `scp`, `ssh` and
`srun` calls. after that, testing the code was a breeze, so i could get on with
implementing it.
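
roughly, the targets looked something like this sketch (the host alias, paths
and target names here are placeholders, not the exact ones i used):

```make
# sketch of the remote targets -- host alias, paths and target names are placeholders
# (recipe lines must be indented with real tabs)
REMOTE     ?= snotra
REMOTE_DIR ?= ~/exercise7

remote_run:
	scp -r Makefile *.cu *.c compare.py $(REMOTE):$(REMOTE_DIR)
	ssh $(REMOTE) 'cd $(REMOTE_DIR) && make parallel && srun ./parallel'

remote_check: remote_run
	ssh $(REMOTE) 'cd $(REMOTE_DIR) && python3 compare.py data_sequential/00075.dat data/00075.dat'
```
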
after getting cuda to run, it ran at about 2x the sequential speed, i.e.
roughly 40 vs 20 seconds. i wondered if it could go faster, so i used
`nvprof ./parallel` to profile the code on snotra, then analyzed that data and
looked for better solutions with an llm, [kagi ai](https://kagi.com/). we
arrived at a combined kernel that performs both the time step and the boundary
conditions. this made it easier to use cooperative groups, such that we only
had "minimal" (1M) kernel launches. this kernel takes some arguments that are
marked `__restrict__`, which lets the compiler generate more efficient memory
accesses. it works by telling the compiler that these pointers don't overlap in
memory, allowing for compile-time optimizations. that makes sense here, since
the buffers of a time step are disjoint in memory. in addition to this, we pass
precomputed coefficients to the kernel.
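
a stripped-down sketch of the shape of that kernel (not the actual code: the
field names, the single coefficient `c` and the reflective boundary handling
are stand-ins):

```cuda
#include <cooperative_groups.h>
namespace cg = cooperative_groups;

__global__ void time_step_kernel(const float *__restrict__ prv,
                                 const float *__restrict__ cur,
                                 float *__restrict__ nxt,
                                 int N, int M, float c)
{
    cg::grid_group grid = cg::this_grid();

    // grid-stride loops, so a small cooperative grid can cover any domain size
    for (int i = blockIdx.y * blockDim.y + threadIdx.y + 1; i < N - 1;
         i += gridDim.y * blockDim.y) {
        for (int j = blockIdx.x * blockDim.x + threadIdx.x + 1; j < M - 1;
             j += gridDim.x * blockDim.x) {
            nxt[i * M + j] = 2.0f * cur[i * M + j] - prv[i * M + j]
                + c * (cur[(i - 1) * M + j] + cur[(i + 1) * M + j]
                     + cur[i * M + j - 1] + cur[i * M + j + 1]
                     - 4.0f * cur[i * M + j]);
        }
    }

    // every block must finish writing nxt before any block reads it below
    grid.sync();

    // boundary conditions on the freshly computed field (reflective edges;
    // corners are never read by the 5-point stencil, so they are skipped)
    for (int i = blockIdx.x * blockDim.x + threadIdx.x + 1; i < N - 1;
         i += gridDim.x * blockDim.x) {
        nxt[i * M + 0]       = nxt[i * M + 1];
        nxt[i * M + (M - 1)] = nxt[i * M + (M - 2)];
    }
    for (int j = blockIdx.x * blockDim.x + threadIdx.x + 1; j < M - 1;
         j += gridDim.x * blockDim.x) {
        nxt[0 * M + j]       = nxt[1 * M + j];
        nxt[(N - 1) * M + j] = nxt[(N - 2) * M + j];
    }
}
```

the `grid.sync()` between the two phases is what cooperative groups buy us
here: without it, the boundary update would have to be a separate kernel
launch.
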
other than this, the solution seems to be fairly standard.

## `__global__` vs `__device__`

`__global__` kernels are called from the host and spawn a gpu thread grid.
`__device__` functions can only be called from gpu code; they execute in the
caller's context and cannot spawn new threads. functions marked with
`__global__` can thus be thought of as the main entry points of the program,
while `__device__` functions are gpu helper functions that carry the context of
the gpu, i.e. they have access to `blockDim` and such.
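
a tiny illustration of the split (not from the exercise):

```cuda
// a __device__ helper called from a __global__ entry point; only the
// __global__ function can be launched from the host
__device__ float average(float a, float b)
{
    return 0.5f * (a + b);  // runs inside the calling gpu thread
}

__global__ void smooth(const float *in, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // grid context available
    if (i > 0 && i < n - 1)
        out[i] = average(in[i - 1], in[i + 1]);
}

// host side: smooth<<<blocks, threads>>>(d_in, d_out, n);
```
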
## cuda vs mpi

mpi is spmd: it runs multiple processes of the same program to obtain
parallelism by utilizing many cpu cores. this is good for compute clusters
where a great number of processors are available. these programs handle
concurrency through ranks (process ids) and message passing, and that
communication overhead may be the main issue for some problems.

cuda is similar to threaded parallelism, in that the threads share program code
and address space, but it runs on an nvidia gpu instead of cpu cores. gpus have
many cores and are capable of running thousands of threads in parallel. the
problem often boils down to properly feeding the gpu, keeping it busy.

we have seen that the previous implementations of the 2d wave equation (mpi and
threads) yielded some speed-ups, but the cuda code has given the greatest
speed-up i've seen yet. i went from some 40 seconds on snotra to only about 8
after some optimizations. in theory, we could probably increase the problem
size, run the program on a gpu cluster with more gpus, and use hybrid
programming to obtain even greater throughput.

another advantage of cuda is that you can, as with the threaded code, implement
everything sequentially and then swap out the parts that are bottlenecking you
with parallel kernels. this allows for more iterative development, rather than
one-shotting it all in one go, as with mpi.

together with good profiling tools, this makes it easier to identify
bottlenecks and speed up your code, which makes cuda nice to write.

## pros and cons of cooperative groups

pros:

- grid-level synchronization without atomics or extra kernel launches (see the
  launch sketch after this list)
- cleaner code for hierarchical patterns

cons:

- frequent sync points serialize execution, reducing parallelism
- the overhead is often not worth it for simple kernels like stencils
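
for reference, this is roughly what the cooperative launch looks like on the
host side, with one launch per time step. the kernel is the `time_step_kernel`
sketch from earlier, and the device buffers are assumed to be allocated
elsewhere; a cooperative grid must fit co-resident on the device, which is why
the block count is checked first.

```cuda
// sketch of the host-side cooperative launch -- names are placeholders
void run_simulation(float *d_prv, float *d_cur, float *d_nxt,
                    int N, int M, float c, int steps)
{
    dim3 block(8, 8);
    dim3 grid(16, 16);

    // all blocks of a cooperative launch must be resident at the same time
    int blocks_per_sm = 0;
    cudaOccupancyMaxActiveBlocksPerMultiprocessor(&blocks_per_sm,
                                                  time_step_kernel,
                                                  block.x * block.y, 0);
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);
    // grid.x * grid.y should not exceed blocks_per_sm * prop.multiProcessorCount

    for (int s = 0; s < steps; s++) {
        void *args[] = { &d_prv, &d_cur, &d_nxt, &N, &M, &c };
        cudaLaunchCooperativeKernel((void *)time_step_kernel, grid, block,
                                    args, 0, 0);
        // rotate the buffers so the new field becomes the current one
        float *tmp = d_prv; d_prv = d_cur; d_cur = d_nxt; d_nxt = tmp;
    }
    cudaDeviceSynchronize();
}
```

note that grid synchronization also requires compiling with relocatable device
code, e.g. `nvcc -rdc=true -arch=sm_75`.
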
## gpu occupancy

`occupancy = active warps / max warps per sm`.

the theoretical occupancy printed using the provided formula consistently gave
`1.0`, suggesting the optimal occupancy of the gpu would be 100%. looking at
the formula, that would mean keeping all warps fed with data for the entire
duration of the program. however, if we use a block size of 27x28 we obtain a
theoretical occupancy of `0.75`. this differs from the provided image, which
suggests `0.979167`. that's weird. i might have implemented the formula
wrongly.
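
as a sanity check on the `0.75`, here is a back-of-the-envelope version of the
formula, ignoring register and shared-memory limits and assuming the t4's
per-sm limits (1024 threads, 32 warps, 16 blocks):

```c
#include <stdio.h>

int main(void)
{
    // turing (sm_75) per-SM limits; registers and shared memory ignored here
    const int warp_size = 32, max_threads = 1024, max_blocks = 16;
    const int max_warps = max_threads / warp_size;  // 32
    const int dims[2][2] = { {8, 8}, {27, 28} };

    for (int k = 0; k < 2; k++) {
        int threads = dims[k][0] * dims[k][1];
        int warps_per_block = (threads + warp_size - 1) / warp_size;  // round up
        int blocks_per_sm = max_threads / threads;  // limited by threads per sm
        if (blocks_per_sm > max_blocks) blocks_per_sm = max_blocks;
        double occupancy = (double)(blocks_per_sm * warps_per_block) / max_warps;
        printf("%2dx%2d: %2d warps/block, %2d blocks/sm, occupancy %f\n",
               dims[k][0], dims[k][1], warps_per_block, blocks_per_sm, occupancy);
    }
    return 0;
}
```

8x8 gives 2 warps/block and 16 resident blocks/sm, i.e. 32/32 = 1.0; 27x28
gives 24 warps/block but only 1 resident block/sm (two blocks would need 1512
threads), i.e. 24/32 = 0.75, matching the printouts below.
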
output for 8x8 block size:

```
./parallel
CUDA device count: 1
CUDA device #0:
Name: Tesla T4
Compute capability: 7.5
Multiprocessors: 40
Warp size: 32
Global memory: 14.6GiB bytes
Per-block shared memory: 48.0KiB
Per-block registers: 65536
Cooperative launch: YES
Total elapsed time: 9.008482 seconds
Grid size set to: (16, 16)
Launched blocks of size: (8, 8)
Theoretical occupancy: 1.000000
python3 compare.py data_sequential/00000.dat data/00000.dat

Data files data_sequential/00000.dat and data/00000.dat are identical within the margin of 0.0001

python3 compare.py data_sequential/00075.dat data/00075.dat

Data files data_sequential/00075.dat and data/00075.dat are identical within the margin of 0.0001

Job terminated on selbu
```

output for 27x28 block size:

```
./parallel
CUDA device count: 1
CUDA device #0:
Name: Tesla T4
Compute capability: 7.5
Multiprocessors: 40
Warp size: 32
Global memory: 14.6GiB bytes
Per-block shared memory: 48.0KiB
Per-block registers: 65536
Cooperative launch: YES
Total elapsed time: 10.004013 seconds
Grid size set to: (5, 5)
Launched blocks of size: (27, 28)
Theoretical occupancy: 0.750000
python3 compare.py data_sequential/00000.dat data/00000.dat

Data files data_sequential/00000.dat and data/00000.dat are identical within the margin of 0.0001

python3 compare.py data_sequential/00075.dat data/00075.dat

Data files data_sequential/00075.dat and data/00075.dat are identical within the margin of 0.0001

Job terminated on selbu
```