---
author: fredrik robertsen
date: 2025-10-27
title: "exercise 6"
---

## how i solved it (not a task)

this is just my reflection on my solution, not necessarily an answer to a task.

each subtask was quite easy and only a few lines of code. the key idea is to
copy the inner-most part of the nested for-loop from `host_calculate` and use
it to create a `device_calculate` kernel. then we essentially want to spawn
threads on the gpu that each calculate one single pixel. to do this, we
`cudaMalloc` some memory to work with on the device, then launch the kernel on
a grid of thread blocks. the grid is laid out as a 2d array, dividing the
output image into chunks that each thread block calculates. a block typically
has 8x8, 16x16 or 32x32 threads. i found 16x16 to be the most performant for
this, running on snotra.
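
as a rough illustration, a minimal sketch of what such a kernel could look like
(the parameter names, the viewport and the escape-time loop are my assumptions
here, not the exact hand-out code):

```c
// sketch: one thread computes one single pixel of the output image
__global__ void device_calculate(int *out, int xsize, int ysize, int max_iter)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x >= xsize || y >= ysize)
        return; // guard against the rounded-up grid (see the trick below)

    // map the pixel to a point in the complex plane (assumed viewport)
    float cr = -2.0f + 3.0f * (float)x / (float)xsize;
    float ci = -1.5f + 3.0f * (float)y / (float)ysize;

    // classic escape-time iteration
    float zr = 0.0f, zi = 0.0f;
    int iter = 0;
    while (zr * zr + zi * zi < 4.0f && iter < max_iter) {
        float tmp = zr * zr - zi * zi + cr;
        zi = 2.0f * zr * zi + ci;
        zr = tmp;
        iter++;
    }
    out[y * xsize + x] = iter;
}
```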

the issue when dividing the workload like this is that we may have some
leftover pixels that aren't being calculated, because the integer division
between the image size and the block size truncates. to counteract this, we
want to divide and then round up, which can be done with a math trick

```{=typst}
$ ceil(a / b) = floor((a + b - 1) / b). $
```
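
in code, the host-side setup could look roughly like this (the variable names
and values are my own placeholders, and error checking is omitted):

```c
// sketch: allocate device memory, launch one thread per pixel, copy back
int xsize = 2560, ysize = 1440, max_iter = 1024;   // assumed image parameters
size_t bytes = (size_t)xsize * ysize * sizeof(int);

int *host_out = (int *)malloc(bytes);
int *d_out;
cudaMalloc((void **)&d_out, bytes);

dim3 block(16, 16);                          // 16x16 was fastest on snotra
dim3 grid((xsize + block.x - 1) / block.x,   // divide and round up, so the
          (ysize + block.y - 1) / block.y);  // grid covers every pixel

device_calculate<<<grid, block>>>(d_out, xsize, ysize, max_iter);

cudaMemcpy(host_out, d_out, bytes, cudaMemcpyDeviceToHost);
cudaFree(d_out);
```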

with this trick, the cuda code correctly calculates most pixels on an image
where the output size is not divisible by the block size.

i say most pixels, because a particular phenomenon became apparent: the gpu
may round floating-point operations slightly differently than the host (for
example when multiplies and adds are fused), so some pixels end up with a
different value than in the host code.
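
to see how many pixels differ, the two result buffers can be compared
directly. a quick sketch with assumed buffer names, not part of the hand-out:

```c
// sketch: count pixels where the host and device results disagree
// (cpu_out / gpu_out are assumed names for the two result buffers)
int differing = 0;
for (int i = 0; i < xsize * ysize; i++)
    if (cpu_out[i] != gpu_out[i])
        differing++;
printf("%d of %d pixels differ\n", differing, xsize * ysize);
```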

side note: i crafted a cool makefile target that automates running the code on
snotra. it looks like this:

```Makefile
remote_stop:
	@ssh -J $(USERNAME)@$(BASTION) $(USERNAME)@$(REMOTE) "stop-jobs -f" || true

remote_copy_src:
	@scp -J $(USERNAME)@$(BASTION) mandel.cu $(USERNAME)@$(REMOTE):~/ntnuhome/jobs

remote_compile: remote_stop remote_copy_src
	@ssh -J $(USERNAME)@$(BASTION) $(USERNAME)@$(REMOTE) 'cd ntnuhome/jobs/ && ./run_mandel.sh'
```

with this i can simply run `make remote_compile` to stop any currently running
jobs on snotra (which has never really been an issue), then copy my source file
over to snotra, and compile and run it there according to a bash script on the
server:

```sh
#!/bin/bash
# request one node, one task and one gpu on the TDT4200 partition for at most
# 10 minutes, then compile and run inside that allocation
srun -N1 -n1 --gres=gpu:1 --partition=TDT4200 --time=0:10:00 bash -c "nvcc -o mandel mandel.cu -O3 -lm && ./mandel 1"
```

i'm not familiar with slurm jobs, but this seemed similar to the command that
runs when you `connect tdt4200` from the login node. it worked fine for
compiling and running the source code on snotra, which will likely be valuable
experience for upcoming assignments.

## questions

1. i did get a massive speed-up between the host and device code, according to
   the provided benchmarking prints. this is running on snotra with a block
   size of 8x8:

   ```
   Host time: 1424.517 ms
   Device calculation: 6.196 ms
   Copy result: 58.629 ms
   ```

   which is, if we factor in the memory i/o, `1424.517 / (6.196 + 58.629) ~= 22`
   times faster. i believe the host time also includes writing the result to
   memory, so this is a fair comparison. this is huge, and it makes sense that
   a gpu is good at processing images, considering that is what they were made
   for.

2. which gpu did i use? snotra. the `nvidia-smi` output below shows a tesla t4:

````{=typst}
#text(size: 9pt)[```
selbu:~$ nvidia-smi
Mon Oct 27 20:13:23 2025
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 570.172.08             Driver Version: 570.172.08     CUDA Version: 12.8     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  Tesla T4                       On  |   00000000:88:00.0 Off |                    0 |
| N/A   36C    P8             15W /   70W |       0MiB /  15360MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI              PID   Type   Process name                        GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|  No running processes found                                                             |
+-----------------------------------------------------------------------------------------+
```]
````

3. simd vs spmd
   - simd stands for single instruction multiple data.
     - our cuda program lets us program the gpu and instruct it. this is an
       example of simd, since we are able to perform a single operation
       (kernel) across multiple data (pixels) at the hardware level.
     - cpus can also utilize vector instructions and registers to compute
       things in parallel, though at a much smaller scale than gpus.
   - spmd stands for single program multiple data.
     - mpi is an example of this, since you create multiple processes that
       execute the same program code (see the small sketch below).
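
for contrast, here is a minimal illustration of what "same program, multiple
data" looks like with mpi. this is not part of the exercise code, just a
sketch:

```c
// minimal spmd sketch with mpi: every process runs this exact program,
// but branches on its own rank to work on its own part of the data
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    // same program text, but each process picks its own slice of rows
    printf("process %d of %d would handle rows [%d, %d)\n",
           rank, size, rank * 100, (rank + 1) * 100);

    MPI_Finalize();
    return 0;
}
```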