From 7f963e562be9bd62915df1f0345d2288f5aba1c9 Mon Sep 17 00:00:00 2001
From: fredrikr79
Date: Mon, 27 Oct 2025 20:28:45 +0100
Subject: [PATCH] ex6: report

---
 exercise6/report.md | 115 ++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 115 insertions(+)
 create mode 100644 exercise6/report.md

diff --git a/exercise6/report.md b/exercise6/report.md
new file mode 100644
index 0000000..6f82a94
--- /dev/null
+++ b/exercise6/report.md
@@ -0,0 +1,115 @@
---
author: fredrik robertsen
date: 2025-10-27
title: "exercise 6"
---

## how i solved it (not a task)

this is just my reflection on my solution, not necessarily an answer to a task.

each subtask was quite easy and only a few lines of code. the key idea is to
copy the inner-most part of the nested for-loop from `host_calculate` and use
that to create a `device_calculate` kernel. we then essentially want to spawn
threads on the gpu that each calculate one single pixel. to do this, we
`cudaMalloc` some memory to work with on the device, then launch the kernel
over a grid of thread blocks. the grid is laid out as a 2d array of blocks,
dividing the output image into chunks that each thread block calculates. each
block typically has 8x8, 16x16 or 32x32 threads; i found 16x16 to be the most
performant for this, running on snotra.

the issue when dividing the workload like this is that we may have some
leftover pixels that never get calculated, because integer division truncates
when the image size is not divisible by the block size. to counteract this, we
want to divide and then round up. for integer (truncating) division this can
be done with a math trick

```{=typst}
$ ceil(a / b) = floor((a + b - 1) / b). $
```

with this trick, the cuda code correctly calculates most pixels on an image
where the output size is not divisible by the block sizes (the extra threads
in the last row and column of blocks then need a bounds check so they don't
write outside the image). a minimal sketch of the kernel and this launch logic
is included in the appendix at the end of this report.

i say most pixels, because a particular phenomenon became apparent: the gpu
does not round floating-point operations exactly like the cpu, so pixels right
on an iteration boundary can be counted differently from the host code. the
small helper sketched in the appendix is how i would count exactly how many
pixels differ.

side note: i crafted a cool makefile target that automates running the code on
snotra. it looks like this:

```Makefile
remote_stop:
	@ssh -J $(USERNAME)@$(BASTION) $(USERNAME)@$(REMOTE) "stop-jobs -f" || true
remote_copy_src:
	@scp -J $(USERNAME)@$(BASTION) mandel.cu $(USERNAME)@$(REMOTE):~/ntnuhome/jobs
remote_compile: remote_stop remote_copy_src
	@ssh -J $(USERNAME)@$(BASTION) $(USERNAME)@$(REMOTE) 'cd ntnuhome/jobs/ && ./run_mandel.sh'
```

with this i can simply run `make remote_compile` to stop any currently running
jobs on snotra (which has never really been an issue), copy my source file
over, and then compile and run it there via a bash script on the server:

```sh
#!/bin/bash
srun -N1 -n1 --gres=gpu:1 --partition=TDT4200 --time=0:10:00 bash -c "nvcc -o mandel mandel.cu -O3 -lm && ./mandel 1"
```

i'm not familiar with slurm jobs, but this seemed similar to the command that
runs when you `connect tdt4200` from the login node. it worked fine for
compiling and running the source code on snotra, which will likely be valuable
experience for upcoming assignments.

## questions

1. i did get a massive speed-up between the host and device code, according to
   the provided benchmarking prints, running on snotra. with block size 8x8 i
   get

   ```
   Host time: 1424.517 ms
   Device calculation: 6.196 ms
   Copy result: 58.629 ms
   ```

   which is, if we factor in the memory i/o, `1400 / (6 + 58) ~= 22` times
   faster.
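   spelled out with the exact numbers from the prints above:

   ```{=typst}
   $ 1424.517 / (6.196 + 58.629) = 1424.517 / 64.825 approx 21.97. $
   ```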
   i believe the host time is also doing its memory i/o, so this is a fair
   comparison. this is huge, and it makes sense that a gpu is good at
   processing images, considering that is what they were made for.
2. which gpu did i use? a tesla t4 on snotra, according to `nvidia-smi`:

    ````{=typst}
    #text(size: 9pt)[```
    selbu:~$ nvidia-smi
    Mon Oct 27 20:13:23 2025
    +-----------------------------------------------------------------------------------------+
    | NVIDIA-SMI 570.172.08             Driver Version: 570.172.08     CUDA Version: 12.8     |
    |-----------------------------------------+------------------------+----------------------+
    | GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
    | Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
    |                                         |                        |               MIG M. |
    |=========================================+========================+======================|
    |   0  Tesla T4                       On  |   00000000:88:00.0 Off |                    0 |
    | N/A   36C    P8             15W /  70W  |       0MiB / 15360MiB  |      0%      Default |
    |                                         |                        |                  N/A |
    +-----------------------------------------+------------------------+----------------------+

    +-----------------------------------------------------------------------------------------+
    | Processes:                                                                              |
    |  GPU   GI   CI              PID   Type   Process name                        GPU Memory |
    |        ID   ID                                                               Usage      |
    |=========================================================================================|
    |  No running processes found                                                             |
    +-----------------------------------------------------------------------------------------+
    ```]
    ````

3. simd vs spmd
   - simd stands for single instruction multiple data.
   - our cuda program lets us program the gpu and instruct it. this is an
     example of simd (nvidia calls its flavour simt), since we perform a
     single operation (the kernel) across multiple data (the pixels) at the
     hardware level.
   - cpus can utilize vector instructions and vector registers to compute
     things in parallel, though at a much smaller scale than gpus.
   - spmd stands for single program multiple data.
   - mpi is an example of this, since you create multiple processes that all
     execute the same program code, each on its own part of the data.
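## appendix: code sketches (not a task)

to make the grid/block reasoning above concrete, here is a minimal sketch of
how the kernel and its launch could look. this is not my exact submission: the
mandelbrot iteration itself is elided, and the signature of
`device_calculate`, the `xsize`/`ysize` parameters and the flat `int` output
buffer are assumptions for illustration.

```c
#include <cuda_runtime.h>

// one thread per pixel, mirroring the inner-most loop of host_calculate.
__global__ void device_calculate(int *out, int xsize, int ysize)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    // rounding the grid up spawns threads past the image edge; mask them off.
    if (x >= xsize || y >= ysize)
        return;
    out[y * xsize + x] = 0; // elided: the per-pixel mandelbrot iteration
}

void calculate_on_device(int *host_out, int xsize, int ysize)
{
    int *dev_out;
    size_t bytes = (size_t)xsize * ysize * sizeof(int);
    cudaMalloc((void **)&dev_out, bytes);

    dim3 block(16, 16); // 16x16 was the fastest block size on snotra
    // round up with the (a + b - 1) / b trick so no pixels are left over
    dim3 grid((xsize + block.x - 1) / block.x,
              (ysize + block.y - 1) / block.y);
    device_calculate<<<grid, block>>>(dev_out, xsize, ysize);

    // copy the finished image back to the host and free the device buffer
    cudaMemcpy(host_out, dev_out, bytes, cudaMemcpyDeviceToHost);
    cudaFree(dev_out);
}
```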
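and to quantify the "most pixels" remark from earlier, a tiny host-side helper
(hypothetical, not part of the handout) can count how many pixels the two
versions disagree on:

```c
// count pixels where the host and device images differ; with the
// floating-point rounding differences described above, a handful of
// boundary pixels typically end up with different iteration counts.
int count_mismatches(const int *host_img, const int *device_img, int n)
{
    int mismatches = 0;
    for (int i = 0; i < n; i++)
        if (host_img[i] != device_img[i])
            mismatches++;
    return mismatches;
}
```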