---
author: fredrik robertsen
date: 2025-11-04
title: "exercise 7"
---

## how i solved the problem (not a task)

i'm writing this as a little documentation. this is not part of the task.

i first automated running the code on snotra by creating various `remote_`
targets in the provided makefile, so that i could easily copy my source code
over to snotra, compile and run it there, and finally check it against the
precomputed sequential version. this was done using a mix of `scp`, `ssh` and
`srun` calls. after that, testing the code was a breeze, so i could get on with
implementing it.
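
roughly, the targets looked something like this sketch (the host alias, paths
and target names here are placeholders, not the exact ones i used):

```make
# sketch of the remote targets -- host alias, paths and target names are placeholders
# (recipe lines must be indented with real tabs)
REMOTE     ?= snotra
REMOTE_DIR ?= ~/exercise7

remote_run:
	scp -r Makefile *.cu *.c compare.py $(REMOTE):$(REMOTE_DIR)
	ssh $(REMOTE) 'cd $(REMOTE_DIR) && make parallel && srun ./parallel'

remote_check: remote_run
	ssh $(REMOTE) 'cd $(REMOTE_DIR) && python3 compare.py data_sequential/00075.dat data/00075.dat'
```
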
after getting cuda to run, it ran at about 2x the sequential speed, i.e.
roughly 40 vs 20 seconds. i wondered if it could go faster, so i used
`nvprof ./parallel` to profile the code on snotra, then analyzed that data and
looked for better solutions with an llm, [kagi ai](https://kagi.com/). we
arrived at a combined kernel that performs both the time step and the boundary
conditions. this made it easier to use cooperative groups, such that we only
had "minimal" (1M) kernel launches. this kernel takes some arguments that are
marked `__restrict__`, which lets the compiler generate more efficient memory
accesses. it works by telling the compiler that these pointers don't overlap in
memory, allowing for compile-time optimizations. that makes sense here, since
the buffers of a time step are disjoint in memory. in addition to this, we pass
precomputed coefficients to the kernel.
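
a stripped-down sketch of the shape of that kernel (not the actual code: the
field names, the single coefficient `c` and the reflective boundary handling
are stand-ins):

```cuda
#include <cooperative_groups.h>
namespace cg = cooperative_groups;

__global__ void time_step_kernel(const float *__restrict__ prv,
                                 const float *__restrict__ cur,
                                 float *__restrict__ nxt,
                                 int N, int M, float c)
{
    cg::grid_group grid = cg::this_grid();

    // grid-stride loops, so a small cooperative grid can cover any domain size
    for (int i = blockIdx.y * blockDim.y + threadIdx.y + 1; i < N - 1;
         i += gridDim.y * blockDim.y) {
        for (int j = blockIdx.x * blockDim.x + threadIdx.x + 1; j < M - 1;
             j += gridDim.x * blockDim.x) {
            nxt[i * M + j] = 2.0f * cur[i * M + j] - prv[i * M + j]
                + c * (cur[(i - 1) * M + j] + cur[(i + 1) * M + j]
                     + cur[i * M + j - 1] + cur[i * M + j + 1]
                     - 4.0f * cur[i * M + j]);
        }
    }

    // every block must finish writing nxt before any block reads it below
    grid.sync();

    // boundary conditions on the freshly computed field (reflective edges;
    // corners are never read by the 5-point stencil, so they are skipped)
    for (int i = blockIdx.x * blockDim.x + threadIdx.x + 1; i < N - 1;
         i += gridDim.x * blockDim.x) {
        nxt[i * M + 0]       = nxt[i * M + 1];
        nxt[i * M + (M - 1)] = nxt[i * M + (M - 2)];
    }
    for (int j = blockIdx.x * blockDim.x + threadIdx.x + 1; j < M - 1;
         j += gridDim.x * blockDim.x) {
        nxt[0 * M + j]       = nxt[1 * M + j];
        nxt[(N - 1) * M + j] = nxt[(N - 2) * M + j];
    }
}
```

the `grid.sync()` between the two phases is what cooperative groups buy us
here: without it, the boundary update would have to be a separate kernel
launch.
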
other than this, the solution seems to be fairly standard.

## `__global__` vs `__device__`

`__global__` kernels are called from the host and spawn a gpu thread grid.
`__device__` functions can only be called from gpu code; they execute in the
caller's context and cannot spawn new threads. functions marked with
`__global__` can thus be thought of as the main entry points of the program,
while `__device__` functions are gpu helper functions that carry the context of
the gpu, i.e. they have access to `blockDim` and such.
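
a tiny illustration of the split (not from the exercise):

```cuda
// a __device__ helper called from a __global__ entry point; only the
// __global__ function can be launched from the host
__device__ float average(float a, float b)
{
    return 0.5f * (a + b);  // runs inside the calling gpu thread
}

__global__ void smooth(const float *in, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // grid context available
    if (i > 0 && i < n - 1)
        out[i] = average(in[i - 1], in[i + 1]);
}

// host side: smooth<<<blocks, threads>>>(d_in, d_out, n);
```
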
## cuda vs mpi

mpi is spmd: it runs multiple processes of the same program to obtain
parallelism by utilizing many cpu cores. this is good for compute clusters
where a great number of processors are available. these programs handle
concurrency through ranks (process ids) and message passing, and that
communication overhead may be the main issue for some problems.

cuda is similar to threaded parallelism, in that the threads share program code
and address space, but it runs on an nvidia gpu instead of cpu cores. gpus have
many cores and are capable of running thousands of threads in parallel. the
problem often boils down to properly feeding the gpu, keeping it busy.

we have seen that the previous implementations of the 2d wave equation (mpi and
threads) yielded some speed-ups, but the cuda code has given the greatest
speed-up i've seen yet. i went from some 40 seconds on snotra to only about 8
after some optimizations. in theory, we could probably increase the problem
size, run the program on a gpu cluster with more gpus, and use hybrid
programming to obtain even greater throughput.

another advantage of cuda is that you can, as with the threaded code, implement
everything sequentially and then swap out the parts that are bottlenecking you
with parallel kernels. this allows for more iterative development, rather than
one-shotting it all in one go, as with mpi.

together with good profiling tools, this makes it easier to identify
bottlenecks and speed up your code, which makes cuda nice to write.

## pros and cons of cooperative groups

pros:

- grid-level synchronization without atomics or extra kernel launches (see the
  launch sketch after this list)
- cleaner code for hierarchical patterns

cons:

- frequent sync points serialize execution, reducing parallelism
- the overhead is often not worth it for simple kernels like stencils
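
for reference, this is roughly what the cooperative launch looks like on the
host side, with one launch per time step. the kernel is the `time_step_kernel`
sketch from earlier, and the device buffers are assumed to be allocated
elsewhere; a cooperative grid must fit co-resident on the device, which is why
the block count is checked first.

```cuda
// sketch of the host-side cooperative launch -- names are placeholders
void run_simulation(float *d_prv, float *d_cur, float *d_nxt,
                    int N, int M, float c, int steps)
{
    dim3 block(8, 8);
    dim3 grid(16, 16);

    // all blocks of a cooperative launch must be resident at the same time
    int blocks_per_sm = 0;
    cudaOccupancyMaxActiveBlocksPerMultiprocessor(&blocks_per_sm,
                                                  time_step_kernel,
                                                  block.x * block.y, 0);
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);
    // grid.x * grid.y should not exceed blocks_per_sm * prop.multiProcessorCount

    for (int s = 0; s < steps; s++) {
        void *args[] = { &d_prv, &d_cur, &d_nxt, &N, &M, &c };
        cudaLaunchCooperativeKernel((void *)time_step_kernel, grid, block,
                                    args, 0, 0);
        // rotate the buffers so the new field becomes the current one
        float *tmp = d_prv; d_prv = d_cur; d_cur = d_nxt; d_nxt = tmp;
    }
    cudaDeviceSynchronize();
}
```

note that grid synchronization also requires compiling with relocatable device
code, e.g. `nvcc -rdc=true -arch=sm_75`.
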
## gpu occupancy

`occupancy = active warps / max warps per sm`.

the theoretical occupancy printed using the provided formula consistently gave
`1.0`, suggesting the optimal occupancy of the gpu would be 100%. looking at
the formula, that would mean keeping all warps fed with data for the entire
duration of the program. however, if we use a block size of 27x28 we obtain a
theoretical occupancy of `0.75`. this differs from the provided image, which
suggests `0.979167`. that's weird. i might have implemented the formula
wrongly.
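
as a sanity check on the `0.75`, here is a back-of-the-envelope version of the
formula, ignoring register and shared-memory limits and assuming the t4's
per-sm limits (1024 threads, 32 warps, 16 blocks):

```c
#include <stdio.h>

int main(void)
{
    // turing (sm_75) per-SM limits; registers and shared memory ignored here
    const int warp_size = 32, max_threads = 1024, max_blocks = 16;
    const int max_warps = max_threads / warp_size;  // 32
    const int dims[2][2] = { {8, 8}, {27, 28} };

    for (int k = 0; k < 2; k++) {
        int threads = dims[k][0] * dims[k][1];
        int warps_per_block = (threads + warp_size - 1) / warp_size;  // round up
        int blocks_per_sm = max_threads / threads;  // limited by threads per sm
        if (blocks_per_sm > max_blocks) blocks_per_sm = max_blocks;
        double occupancy = (double)(blocks_per_sm * warps_per_block) / max_warps;
        printf("%2dx%2d: %2d warps/block, %2d blocks/sm, occupancy %f\n",
               dims[k][0], dims[k][1], warps_per_block, blocks_per_sm, occupancy);
    }
    return 0;
}
```

8x8 gives 2 warps/block and 16 resident blocks/sm, i.e. 32/32 = 1.0; 27x28
gives 24 warps/block but only 1 resident block/sm (two blocks would need 1512
threads), i.e. 24/32 = 0.75, matching the printouts below.
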
output for 8x8 block size:

```
./parallel
CUDA device count: 1
CUDA device #0:
Name: Tesla T4
Compute capability: 7.5
Multiprocessors: 40
Warp size: 32
Global memory: 14.6GiB bytes
Per-block shared memory: 48.0KiB
Per-block registers: 65536
Cooperative launch: YES
Total elapsed time: 9.008482 seconds
Grid size set to: (16, 16)
Launched blocks of size: (8, 8)
Theoretical occupancy: 1.000000
python3 compare.py data_sequential/00000.dat data/00000.dat

Data files data_sequential/00000.dat and data/00000.dat are identical within the margin of 0.0001

python3 compare.py data_sequential/00075.dat data/00075.dat

Data files data_sequential/00075.dat and data/00075.dat are identical within the margin of 0.0001

Job terminated on selbu
```

output for 27x28 block size:

```
./parallel
CUDA device count: 1
CUDA device #0:
Name: Tesla T4
Compute capability: 7.5
Multiprocessors: 40
Warp size: 32
Global memory: 14.6GiB bytes
Per-block shared memory: 48.0KiB
Per-block registers: 65536
Cooperative launch: YES
Total elapsed time: 10.004013 seconds
Grid size set to: (5, 5)
Launched blocks of size: (27, 28)
Theoretical occupancy: 0.750000
python3 compare.py data_sequential/00000.dat data/00000.dat

Data files data_sequential/00000.dat and data/00000.dat are identical within the margin of 0.0001

python3 compare.py data_sequential/00075.dat data/00075.dat

Data files data_sequential/00075.dat and data/00075.dat are identical within the margin of 0.0001

Job terminated on selbu
```