From 7f963e562be9bd62915df1f0345d2288f5aba1c9 Mon Sep 17 00:00:00 2001
From: fredrikr79
Date: Mon, 27 Oct 2025 20:28:45 +0100
Subject: [PATCH] ex6: report

---
 exercise6/report.md | 115 ++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 115 insertions(+)
 create mode 100644 exercise6/report.md

diff --git a/exercise6/report.md b/exercise6/report.md
new file mode 100644
index 0000000..6f82a94
--- /dev/null
+++ b/exercise6/report.md
@@ -0,0 +1,115 @@
---
author: fredrik robertsen
date: 2025-10-27
title: "exercise 6"
---

## how i solved it (not a task)

this is just my reflection on my solution, not necessarily an answer to a task.

each subtask was quite easy and only a few lines of code. the key idea is to
copy the inner-most part of the nested for-loop from `host_calculate` and use
that to create a `device_calculate` kernel. we then essentially want to spawn
threads on the gpu that each calculate one single pixel. to do this, we
`cudaMalloc` some memory to work with on the device, then launch the kernel
over a grid of thread blocks. the grid is laid out as a 2d array of blocks,
dividing the output image into chunks that each thread block calculates. each
block typically has 8x8, 16x16 or 32x32 threads; i found 16x16 to be the most
performant for this, running on snotra.

the issue when dividing the workload like this is that we may have some
leftover pixels that never get calculated, because integer division truncates
when the image size is not divisible by the block size. to counteract this, we
want to divide and then round up. for integer (truncating) division this can
be done with a math trick

```{=typst}
$ ceil(a / b) = floor((a + b - 1) / b). $
```

with this trick, the cuda code correctly calculates most pixels on an image
where the output size is not divisible by the block sizes (the extra threads
in the last row and column of blocks then need a bounds check so they don't
write outside the image). a minimal sketch of the kernel and this launch logic
is included in the appendix at the end of this report.

i say most pixels, because a particular phenomenon became apparent: the gpu
does not round floating-point operations exactly like the cpu, so pixels right
on an iteration boundary can be counted differently from the host code. the
small helper sketched in the appendix is how i would count exactly how many
pixels differ.

side note: i crafted a cool makefile target that automates running the code on
snotra. it looks like this:

```Makefile
remote_stop:
	@ssh -J $(USERNAME)@$(BASTION) $(USERNAME)@$(REMOTE) "stop-jobs -f" || true
remote_copy_src:
	@scp -J $(USERNAME)@$(BASTION) mandel.cu $(USERNAME)@$(REMOTE):~/ntnuhome/jobs
remote_compile: remote_stop remote_copy_src
	@ssh -J $(USERNAME)@$(BASTION) $(USERNAME)@$(REMOTE) 'cd ntnuhome/jobs/ && ./run_mandel.sh'
```

with this i can simply run `make remote_compile` to stop any currently running
jobs on snotra (which has never really been an issue), copy my source file
over, and then compile and run it there via a bash script on the server:

```sh
#!/bin/bash
srun -N1 -n1 --gres=gpu:1 --partition=TDT4200 --time=0:10:00 bash -c "nvcc -o mandel mandel.cu -O3 -lm && ./mandel 1"
```

i'm not familiar with slurm jobs, but this seemed similar to the command that
runs when you `connect tdt4200` from the login node. it worked fine for
compiling and running the source code on snotra, which will likely be valuable
experience for upcoming assignments.

## questions

1. i did get a massive speed-up between the host and device code, according to
   the provided benchmarking prints, running on snotra. with block size 8x8 i
   get

   ```
   Host time: 1424.517 ms
   Device calculation: 6.196 ms
   Copy result: 58.629 ms
   ```

   which is, if we factor in the memory i/o, `1400 / (6 + 58) ~= 22` times
   faster.
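   spelled out with the exact numbers from the prints above:

   ```{=typst}
   $ 1424.517 / (6.196 + 58.629) = 1424.517 / 64.825 approx 21.97. $
   ```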
   i believe the host time is also doing its memory i/o, so this is a fair
   comparison. this is huge, and it makes sense that a gpu is good at
   processing images, considering that is what they were made for.
2. which gpu did i use? a tesla t4 on snotra, according to `nvidia-smi`:

    ````{=typst}
    #text(size: 9pt)[```
    selbu:~$ nvidia-smi
    Mon Oct 27 20:13:23 2025
    +-----------------------------------------------------------------------------------------+
    | NVIDIA-SMI 570.172.08             Driver Version: 570.172.08     CUDA Version: 12.8     |
    |-----------------------------------------+------------------------+----------------------+
    | GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
    | Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
    |                                         |                        |               MIG M. |
    |=========================================+========================+======================|
    |   0  Tesla T4                       On  |   00000000:88:00.0 Off |                    0 |
    | N/A   36C    P8             15W /  70W  |       0MiB / 15360MiB  |      0%      Default |
    |                                         |                        |                  N/A |
    +-----------------------------------------+------------------------+----------------------+

    +-----------------------------------------------------------------------------------------+
    | Processes:                                                                              |
    |  GPU   GI   CI              PID   Type   Process name                        GPU Memory |
    |        ID   ID                                                               Usage      |
    |=========================================================================================|
    |  No running processes found                                                             |
    +-----------------------------------------------------------------------------------------+
    ```]
    ````

3. simd vs spmd
   - simd stands for single instruction multiple data.
   - our cuda program lets us program the gpu and instruct it. this is an
     example of simd (nvidia calls its flavour simt), since we perform a
     single operation (the kernel) across multiple data (the pixels) at the
     hardware level.
   - cpus can utilize vector instructions and vector registers to compute
     things in parallel, though at a much smaller scale than gpus.
   - spmd stands for single program multiple data.
   - mpi is an example of this, since you create multiple processes that all
     execute the same program code, each on its own part of the data.
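## appendix: code sketches (not a task)

to make the grid/block reasoning above concrete, here is a minimal sketch of
how the kernel and its launch could look. this is not my exact submission: the
mandelbrot iteration itself is elided, and the signature of
`device_calculate`, the `xsize`/`ysize` parameters and the flat `int` output
buffer are assumptions for illustration.

```c
#include <cuda_runtime.h>

// one thread per pixel, mirroring the inner-most loop of host_calculate.
__global__ void device_calculate(int *out, int xsize, int ysize)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    // rounding the grid up spawns threads past the image edge; mask them off.
    if (x >= xsize || y >= ysize)
        return;
    out[y * xsize + x] = 0; // elided: the per-pixel mandelbrot iteration
}

void calculate_on_device(int *host_out, int xsize, int ysize)
{
    int *dev_out;
    size_t bytes = (size_t)xsize * ysize * sizeof(int);
    cudaMalloc((void **)&dev_out, bytes);

    dim3 block(16, 16); // 16x16 was the fastest block size on snotra
    // round up with the (a + b - 1) / b trick so no pixels are left over
    dim3 grid((xsize + block.x - 1) / block.x,
              (ysize + block.y - 1) / block.y);
    device_calculate<<<grid, block>>>(dev_out, xsize, ysize);

    // copy the finished image back to the host and free the device buffer
    cudaMemcpy(host_out, dev_out, bytes, cudaMemcpyDeviceToHost);
    cudaFree(dev_out);
}
```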
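and to quantify the "most pixels" remark from earlier, a tiny host-side helper
(hypothetical, not part of the handout) can count how many pixels the two
versions disagree on:

```c
// count pixels where the host and device images differ; with the
// floating-point rounding differences described above, a handful of
// boundary pixels typically end up with different iteration counts.
int count_mismatches(const int *host_img, const int *device_img, int n)
{
    int mismatches = 0;
    for (int i = 0; i < n; i++)
        if (host_img[i] != device_img[i])
            mismatches++;
    return mismatches;
}
```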