---
author: fredrik robertsen
date: 2025-10-27
title: "exercise 6"
---

## how i solved it (not a task)

this is just my reflection on my solution, not necessarily an answer to a task.

each subtask was quite easy and only a few lines of code. the key idea is to
copy the inner-most part of the nested for-loop from `host_calculate` and use
it to create a `device_calculate` kernel. then we essentially want to spawn
threads on the gpu that each calculate one single pixel. to do this, we
`cudaMalloc` some memory to work with on the device, then launch the kernel on
a grid of thread blocks. the grid is laid out as a 2d array, dividing the
output image into chunks that each thread block calculates. a block typically
has 8x8, 16x16 or 32x32 threads. i found 16x16 to be the most performant for
this, running on snotra.
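
as a rough illustration, a minimal sketch of what such a kernel could look like
(the parameter names, the viewport and the escape-time loop are my assumptions
here, not the exact hand-out code):

```c
// sketch: one thread computes one single pixel of the output image
__global__ void device_calculate(int *out, int xsize, int ysize, int max_iter)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x >= xsize || y >= ysize)
        return; // guard against the rounded-up grid (see the trick below)

    // map the pixel to a point in the complex plane (assumed viewport)
    float cr = -2.0f + 3.0f * (float)x / (float)xsize;
    float ci = -1.5f + 3.0f * (float)y / (float)ysize;

    // classic escape-time iteration
    float zr = 0.0f, zi = 0.0f;
    int iter = 0;
    while (zr * zr + zi * zi < 4.0f && iter < max_iter) {
        float tmp = zr * zr - zi * zi + cr;
        zi = 2.0f * zr * zi + ci;
        zr = tmp;
        iter++;
    }
    out[y * xsize + x] = iter;
}
```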

the issue when dividing the workload like this is that we may have some
leftover pixels that aren't being calculated, because the integer division
between the image size and the block size truncates. to counteract this, we
want to divide and then round up, which can be done with a math trick

```{=typst}
$ ceil(a / b) = floor((a + b - 1) / b). $
```
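
in code, the host-side setup could look roughly like this (the variable names
and values are my own placeholders, and error checking is omitted):

```c
// sketch: allocate device memory, launch one thread per pixel, copy back
int xsize = 2560, ysize = 1440, max_iter = 1024;   // assumed image parameters
size_t bytes = (size_t)xsize * ysize * sizeof(int);

int *host_out = (int *)malloc(bytes);
int *d_out;
cudaMalloc((void **)&d_out, bytes);

dim3 block(16, 16);                          // 16x16 was fastest on snotra
dim3 grid((xsize + block.x - 1) / block.x,   // divide and round up, so the
          (ysize + block.y - 1) / block.y);  // grid covers every pixel

device_calculate<<<grid, block>>>(d_out, xsize, ysize, max_iter);

cudaMemcpy(host_out, d_out, bytes, cudaMemcpyDeviceToHost);
cudaFree(d_out);
```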

with this trick, the cuda code correctly calculates most pixels on an image
where the output size is not divisible by the block size.

i say most pixels, because a particular phenomenon became apparent: the gpu
may round floating-point operations slightly differently than the host (for
example when multiplies and adds are fused), so some pixels end up with a
different value than in the host code.
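
to see how many pixels differ, the two result buffers can be compared
directly. a quick sketch with assumed buffer names, not part of the hand-out:

```c
// sketch: count pixels where the host and device results disagree
// (cpu_out / gpu_out are assumed names for the two result buffers)
int differing = 0;
for (int i = 0; i < xsize * ysize; i++)
    if (cpu_out[i] != gpu_out[i])
        differing++;
printf("%d of %d pixels differ\n", differing, xsize * ysize);
```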

side note: i crafted a cool makefile target that automates running the code on
snotra. it looks like this:

```Makefile
remote_stop:
	@ssh -J $(USERNAME)@$(BASTION) $(USERNAME)@$(REMOTE) "stop-jobs -f" || true

remote_copy_src:
	@scp -J $(USERNAME)@$(BASTION) mandel.cu $(USERNAME)@$(REMOTE):~/ntnuhome/jobs

remote_compile: remote_stop remote_copy_src
	@ssh -J $(USERNAME)@$(BASTION) $(USERNAME)@$(REMOTE) 'cd ntnuhome/jobs/ && ./run_mandel.sh'
```

with this i can simply run `make remote_compile` to stop any currently running
jobs on snotra (which has never really been an issue), then copy my source file
over to snotra, and compile and run it there according to a bash script on the
server:

```sh
#!/bin/bash
# request one node, one task and one gpu on the TDT4200 partition for at most
# 10 minutes, then compile and run inside that allocation
srun -N1 -n1 --gres=gpu:1 --partition=TDT4200 --time=0:10:00 bash -c "nvcc -o mandel mandel.cu -O3 -lm && ./mandel 1"
```

i'm not familiar with slurm jobs, but this seemed similar to the command that
runs when you `connect tdt4200` from the login node. it worked fine for
compiling and running the source code on snotra, which will likely be valuable
experience for upcoming assignments.

## questions

1. i did get a massive speed-up between the host and device code, according to
   the provided benchmarking prints. this is running on snotra with a block
   size of 8x8:

   ```
   Host time: 1424.517 ms
   Device calculation: 6.196 ms
   Copy result: 58.629 ms
   ```

   which is, if we factor in the memory i/o, `1424.517 / (6.196 + 58.629) ~= 22`
   times faster. i believe the host time also includes writing the result to
   memory, so this is a fair comparison. this is huge, and it makes sense that
   a gpu is good at processing images, considering that is what they were made
   for.

2. which gpu did i use? snotra. the `nvidia-smi` output below shows a tesla t4:

````{=typst}
#text(size: 9pt)[```
selbu:~$ nvidia-smi
Mon Oct 27 20:13:23 2025
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 570.172.08             Driver Version: 570.172.08     CUDA Version: 12.8     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  Tesla T4                       On  |   00000000:88:00.0 Off |                    0 |
| N/A   36C    P8             15W /   70W |       0MiB /  15360MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI              PID   Type   Process name                        GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|  No running processes found                                                             |
+-----------------------------------------------------------------------------------------+
```]
````

3. simd vs spmd
   - simd stands for single instruction multiple data.
     - our cuda program lets us program the gpu and instruct it. this is an
       example of simd, since we are able to perform a single operation
       (kernel) across multiple data (pixels) at the hardware level.
     - cpus can also utilize vector instructions and registers to compute
       things in parallel, though at a much smaller scale than gpus.
   - spmd stands for single program multiple data.
     - mpi is an example of this, since you create multiple processes that
       execute the same program code (see the small sketch below).
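
for contrast, here is a minimal illustration of what "same program, multiple
data" looks like with mpi. this is not part of the exercise code, just a
sketch:

```c
// minimal spmd sketch with mpi: every process runs this exact program,
// but branches on its own rank to work on its own part of the data
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    // same program text, but each process picks its own slice of rows
    printf("process %d of %d would handle rows [%d, %d)\n",
           rank, size, rank * 100, (rank + 1) * 100);

    MPI_Finalize();
    return 0;
}
```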