
author: fredrik robertsen
date: 2025-10-27
title: exercise 6

how i solved it (not a task)

this is just my reflection on my solution, not necessarily an answer to a task.

each subtask was quite easy and only a few lines of code. the key idea is to copy the inner-most part of the nested for-loop in host_calculate into a new device_calculate kernel, so that we essentially spawn gpu threads that each calculate one single pixel. to do this, we cudaMalloc some memory to work with on the device, then launch the kernel over a grid of thread blocks. the grid is a 2d array of blocks that divides the output image into chunks, and each block typically holds 8x8, 16x16 or 32x32 threads. i found 16x16 to be the most performant for this, running on snotra.
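a minimal sketch of what such a kernel can end up looking like (the parameter names, viewport and iteration cap are my own assumptions here, not the exact contents of mandel.cu):

```
// sketch only: one thread per pixel of the mandelbrot image. names, viewport
// and types are assumptions, not the exact handout code.
__global__ void device_calculate(int *output, int xres, int yres, int maxiter)
{
    int px = blockIdx.x * blockDim.x + threadIdx.x;  // pixel column
    int py = blockIdx.y * blockDim.y + threadIdx.y;  // pixel row
    if (px >= xres || py >= yres) return;            // guard the leftover threads

    // map the pixel to a point in the complex plane (assumed viewport)
    double x0 = -2.5 + 3.5 * px / xres;
    double y0 = -1.25 + 2.5 * py / yres;

    double x = 0.0, y = 0.0;
    int iter = 0;
    while (x * x + y * y <= 4.0 && iter < maxiter) {
        double xtmp = x * x - y * y + x0;
        y = 2.0 * x * y + y0;
        x = xtmp;
        iter++;
    }
    output[py * xres + px] = iter;
}
```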

the issue when dividing the workload like this is that integer division truncates: if the image size is not divisible by the block size, rounding down leaves some leftover pixels at the edges that never get calculated. to counteract this, we want to divide and then round up, which can be done with a math trick

$ ceil(a / b) = (a + b - 1) / b. $

with this trick (plus a bounds check in the kernel so the extra threads in the last row and column of blocks don't write outside the image), the cuda code correctly calculates most pixels even when the output size is not divisible by the block size.
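the host side then looks roughly like this, again with assumed names rather than the exact handout code:

```
// sketch of the host side: allocate device memory, size the grid with the
// ceiling trick, launch the kernel and copy the result back.
void run_on_device(int *h_output, int xres, int yres, int maxiter)
{
    int *d_output;
    cudaMalloc((void **)&d_output, xres * yres * sizeof(int));

    dim3 block(16, 16);                          // 16x16 was the fastest on snotra
    dim3 grid((xres + block.x - 1) / block.x,    // ceil(xres / block.x)
              (yres + block.y - 1) / block.y);   // ceil(yres / block.y)

    device_calculate<<<grid, block>>>(d_output, xres, yres, maxiter);
    cudaDeviceSynchronize();

    cudaMemcpy(h_output, d_output, xres * yres * sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(d_output);
}
```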

i say most pixels above, because a particular phenomenon became apparent: the gpu does not round floating-point intermediates exactly like the host does (nvcc is happy to contract multiplies and adds into fused multiply-add instructions, for example), so a few pixels that sit right on an iteration boundary end up classified differently from the host code.
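a quick way to quantify this (not necessarily how the handout checks it) is to simply count the disagreeing pixels:

```
// hypothetical helper: count pixels where the host result and the copied-back
// device result disagree. buffer names are placeholders.
int count_differences(const int *host_result, const int *device_result, int n)
{
    int differing = 0;
    for (int i = 0; i < n; i++)
        if (host_result[i] != device_result[i])
            differing++;
    return differing;
}
```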

side note: i crafted some cool makefile targets that automate running the code on snotra. they look like this:

# stop any of my currently running jobs on snotra
remote_stop:
	@ssh -J $(USERNAME)@$(BASTION) $(USERNAME)@$(REMOTE) "stop-jobs -f" || true

# copy the cuda source over the bastion host to snotra
remote_copy_src:
	@scp -J $(USERNAME)@$(BASTION) mandel.cu $(USERNAME)@$(REMOTE):~/ntnuhome/jobs

# stop old jobs, copy the source, then compile and run it via the server-side script
remote_compile: remote_stop remote_copy_src
	@ssh -J $(USERNAME)@$(BASTION) $(USERNAME)@$(REMOTE) 'cd ntnuhome/jobs/ && ./run_mandel.sh'

with this i can simply run make remote_compile: it stops any currently running jobs on snotra (which has never really been an issue), copies my source file over, and then compiles and runs it there according to a bash script on the server:

#!/bin/bash
# request one node, one task and one gpu on the TDT4200 partition for 10 minutes,
# then compile and run inside that allocation
srun -N1 -n1 --gres=gpu:1 --partition=TDT4200 --time=0:10:00 bash -c "nvcc -o mandel mandel.cu -O3 -lm && ./mandel 1"

i'm not familiar with slurm jobs, but this seemed similar to the command that gets run when you connect to the tdt4200 partition from the login node. it worked fine for compiling and running the source code on snotra, which will likely be valuable experience for upcoming assignments.

questions

  1. i did get a massive speed-up between the host and device code, according to the provided benchmarking prints. this is running on snotra with an 8x8 block size:

    Host time:          1424.517 ms
    Device calculation:   6.196 ms
    Copy result:         58.629 ms
    

    which is, if we factor in the memory transfer, 1400 / (6 + 58) ~= 22 times faster. i believe the host timing also includes writing its result into memory, so counting the copy on the device side keeps the comparison fair. this is huge, and it makes sense that a gpu is good at producing images, considering that is what they were made for.

  2. which gpu did i use? a tesla t4, running on snotra:

    #text(size: 9pt)[```
    selbu:~$ nvidia-smi
    Mon Oct 27 20:13:23 2025
    +-----------------------------------------------------------------------------------------+
    | NVIDIA-SMI 570.172.08             Driver Version: 570.172.08     CUDA Version: 12.8     |
    |-----------------------------------------+------------------------+----------------------+
    | GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
    | Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
    |                                         |                        |               MIG M. |
    |=========================================+========================+======================|
    |   0  Tesla T4                       On  |   00000000:88:00.0 Off |                    0 |
    | N/A   36C    P8             15W /   70W |       0MiB /  15360MiB |      0%      Default |
    |                                         |                        |                  N/A |
    +-----------------------------------------+------------------------+----------------------+
    
    +-----------------------------------------------------------------------------------------+
    | Processes:                                                                              |
    |  GPU   GI   CI              PID   Type   Process name                        GPU Memory |
    |        ID   ID                                                               Usage      |
    |=========================================================================================|
    |  No running processes found                                                             |
    +-----------------------------------------------------------------------------------------+
    ```]
    
  3. simd vs spmd

    • simd stands for single instruction, multiple data.
      • our cuda kernel is a good example of this at the hardware level: the threads in a warp execute the same instruction in lockstep, each on its own pixel. nvidia calls this simt, but the idea is the same: one operation (our kernel code) applied across many data elements (pixels).
      • cpus can utilize vector instructions and vector registers to compute things in parallel, though at a much smaller scale than gpus.
    • spmd stands for single program, multiple data.
      • mpi is an example of this: you launch multiple processes that all execute the same program code, but each process works on its own part of the data and follows its own control flow (see the sketch below).
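
    to make the spmd point concrete, here is a tiny mpi sketch (hypothetical, not part of this exercise): every rank runs the same program, but uses its rank to pick its own slice of the work.

    ```
    // minimal spmd sketch with mpi (hypothetical): every process runs this same
    // program; the rank decides which rows of an image it would handle.
    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);

        int rank, size;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        const int yres = 1024;                         // assumed image height
        int rows_per_rank = (yres + size - 1) / size;  // same ceiling trick as on the gpu
        int first = rank * rows_per_rank;
        int last = first + rows_per_rank > yres ? yres : first + rows_per_rank;

        printf("rank %d of %d would compute rows %d..%d\n", rank, size, first, last - 1);

        MPI_Finalize();
        return 0;
    }
    ```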