| author | date | title |
|---|---|---|
| fredrik robertsen | 2025-10-27 | exercise 6 |
how i solved it (not a task)
this is just my reflection on my solution, not necessarily an answer to a task.
each subtask was quite easy and only a few lines of code. the key idea is to
copy the inner-most part of the nested for-loop from host_calculate and use
that to create a device_calculate. then we essentially want to spawn threads
on the gpu that each calculate a single pixel. to do this, we cudaMalloc
some memory to work with on the device, then launch the kernel on a grid of
thread blocks. the grid is laid out as a 2d array, dividing the output image
into chunks that each thread block computes. each block typically has 8x8,
16x16 or 32x32 threads; i found 16x16 to be the most performant for this,
running on snotra.
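to make this concrete, here is a minimal sketch of what such a kernel could look like. the signature, parameter names and the escape-time loop are my own assumptions for illustration, not the exact code from mandel.cu:

```
// one thread computes one pixel; the bounds check handles the extra
// threads that come from rounding the grid size up (see below)
__global__ void device_calculate(int *image, int width, int height,
                                 float x_min, float y_min, float dx, float dy,
                                 int max_iter)
{
    int px = blockIdx.x * blockDim.x + threadIdx.x;
    int py = blockIdx.y * blockDim.y + threadIdx.y;
    if (px >= width || py >= height)
        return;

    // map the pixel to a point in the complex plane (assumed mapping)
    float cx = x_min + px * dx;
    float cy = y_min + py * dy;

    // escape-time iteration, i.e. the inner-most part of host_calculate
    float zx = 0.0f, zy = 0.0f;
    int iter = 0;
    while (zx * zx + zy * zy <= 4.0f && iter < max_iter) {
        float tmp = zx * zx - zy * zy + cx;
        zy = 2.0f * zx * zy + cy;
        zx = tmp;
        iter++;
    }
    image[py * width + px] = iter;
}
```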
the issue when dividing the workload like this is that we may have some leftover pixels that aren't being calculated, because the integer division between the image size and the block size truncates and drops the remainder. to counteract this, we want to divide and then round up, which can be done with a math trick:
$ ceil(a / b) = floor((a + b - 1) / b). $
with this trick, the cuda code correctly calculates most pixels even when the output size is not divisible by the block size.
i say most pixels, because a particular phenomenon became apparent: the gpu and the cpu may round floating-point operations slightly differently (for example by contracting multiplies and adds into fused multiply-adds), so a few pixels end up classified differently from the host code.
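putting the allocation, the rounded-up grid and the launch together, the host side could look roughly like this. again a sketch under assumed names (run_device_calculate, h_image and the coordinate parameters are illustrative; the kernel is the one sketched above):

```
#include <cuda_runtime.h>

// run the kernel over the whole image; h_image must hold width * height ints
void run_device_calculate(int *h_image, int width, int height,
                          float x_min, float y_min, float dx, float dy,
                          int max_iter)
{
    int *d_image;
    cudaMalloc((void **)&d_image, width * height * sizeof(int));

    // 16x16 threads per block was the fastest configuration on snotra
    dim3 block(16, 16);
    // ceil(a / b) = (a + b - 1) / b, so the grid covers the whole image
    // even when width/height are not divisible by the block size
    dim3 grid((width + block.x - 1) / block.x,
              (height + block.y - 1) / block.y);

    device_calculate<<<grid, block>>>(d_image, width, height,
                                      x_min, y_min, dx, dy, max_iter);
    cudaDeviceSynchronize();

    // copy the result back and free the device buffer
    cudaMemcpy(h_image, d_image, width * height * sizeof(int),
               cudaMemcpyDeviceToHost);
    cudaFree(d_image);
}
```

comparing h_image against the host result pixel by pixel is also how the handful of differing pixels mentioned above shows up.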
side note: i crafted some cool makefile targets that automate running the code on snotra. they look like this:
```
remote_stop:
	@ssh -J $(USERNAME)@$(BASTION) $(USERNAME)@$(REMOTE) "stop-jobs -f" || true

remote_copy_src:
	@scp -J $(USERNAME)@$(BASTION) mandel.cu $(USERNAME)@$(REMOTE):~/ntnuhome/jobs

remote_compile: remote_stop remote_copy_src
	@ssh -J $(USERNAME)@$(BASTION) $(USERNAME)@$(REMOTE) 'cd ntnuhome/jobs/ && ./run_mandel.sh'
```
with this i can simply run make remote_compile to stop any currently running
jobs on snotra (which has never really been an issue); it then copies my
source file over to snotra, and compiles and runs it there via a bash script
on the server:
```
#!/bin/bash
srun -N1 -n1 --gres=gpu:1 --partition=TDT4200 --time=0:10:00 bash -c "nvcc -o mandel mandel.cu -O3 -lm && ./mandel 1"
```
i'm not familiar with slurm jobs, but this seemed similar to the command that
runs when you connect to tdt4200 from the login-node. it worked fine for
compiling and running the source code on snotra, which will likely be
valuable experience for the upcoming assignments.
questions
- i did get a massive speed-up between the host and device code, according to the provided benchmarking prints, running on snotra. with a block size of 8x8 i get
  Host time: 1424.517 ms, Device calculation: 6.196 ms, Copy result: 58.629 ms,
  which is, if we factor in the memory i/o, 1400 / (6 + 58) ~= 22 times faster. i believe the host time also includes i/o, so the comparison is fair. this is huge, and it makes sense that a gpu is good at processing images, considering that is what they were made for.
- which gpu did i use? snotra gave me a Tesla T4, as shown by nvidia-smi (see also the sketch after this list for querying the gpu from cuda directly):

#text(size: 9pt)[```
selbu:~$ nvidia-smi
Mon Oct 27 20:13:23 2025
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 570.172.08             Driver Version: 570.172.08     CUDA Version: 12.8     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  Tesla T4                       On  |   00000000:88:00.0 Off |                    0 |
| N/A   36C    P8             15W /  70W  |       0MiB / 15360MiB  |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI              PID   Type   Process name                        GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|  No running processes found                                                             |
+-----------------------------------------------------------------------------------------+
```]
- simd vs spmd
  - simd stands for single instruction, multiple data.
  - our cuda program lets us program the gpu and instruct it. this is an example of simd, since a single operation (the kernel) is performed across multiple data (pixels) at the hardware level.
  - cpus can utilize vector instructions and registers to compute things in parallel, though at a much smaller scale than gpus.
  - spmd stands for single program, multiple data.
  - mpi is an example of this, since you create multiple processes that all execute the same program code.
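as mentioned under the gpu question, the same information nvidia-smi shows can also be queried from inside a cuda program with cudaGetDeviceProperties. a minimal sketch (the printed fields are just a small selection):

```
#include <cstdio>
#include <cuda_runtime.h>

int main(void)
{
    cudaDeviceProp prop;
    // query device 0, i.e. the gpu that srun allocates for the job
    cudaGetDeviceProperties(&prop, 0);

    printf("name: %s\n", prop.name);                  // e.g. "Tesla T4"
    printf("compute capability: %d.%d\n", prop.major, prop.minor);
    printf("multiprocessors: %d\n", prop.multiProcessorCount);
    printf("global memory: %zu MiB\n", prop.totalGlobalMem / (1024 * 1024));
    return 0;
}
```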