---
author: fredrik robertsen
date: 2025-10-27
title: "exercise 6"
---

## how i solved it (not a task)

this is just my reflection on my solution, not necessarily an answer to a task.

each subtask was quite easy and only a few lines of code. the key idea is to
copy the innermost part of the nested for-loop from `host_calculate` and use
that to create a `device_calculate` kernel. then we essentially want to spawn
threads on the gpu that each calculate a single pixel. to do this, we
`cudaMalloc` some memory to work with on the device, then launch the kernel
over a grid of thread blocks. the grid is laid out as a 2d array, dividing the
output image into chunks that each thread block calculates. each block
typically has 8x8, 16x16 or 32x32 threads. i found 16x16 to be the most
performant for this, running on snotra.
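
as a sketch of what i mean (not my exact code — the output type is assumed and
the mandelbrot iteration body is a placeholder here):

```cuda
// each thread computes exactly one pixel from its grid/block coordinates
__global__ void device_calculate(int *out, int width, int height) {
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x >= width || y >= height) return; // guard the leftover threads
    // placeholder: the real body is the innermost loop from host_calculate
    out[y * width + x] = 0;
}

void launch(int *h_out, int width, int height) {
    int *d_out;
    cudaMalloc(&d_out, width * height * sizeof(int));
    dim3 block(16, 16); // 16x16 was the most performant for me
    dim3 grid((width + block.x - 1) / block.x,
              (height + block.y - 1) / block.y);
    device_calculate<<<grid, block>>>(d_out, width, height);
    cudaMemcpy(h_out, d_out, width * height * sizeof(int),
               cudaMemcpyDeviceToHost);
    cudaFree(d_out);
}
```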

the issue when dividing the workload like this is that we may have some
leftover pixels that aren't being calculated, because the integer division
between the image size and the block size truncates. to counteract this, we
want to divide, then round up. this can be done with a math trick

```{=typst}
$ ceil(a / b) = floor((a + b - 1) / b). $
```

with this trick, the cuda code correctly calculates most pixels on an image
where the output size is not divisible by the block sizes.

i say most pixels, because a particular phenomenon became apparent: the gpu
and the cpu can round floating-point operations slightly differently (for
example, nvcc likes to contract a multiply and an add into a single fused
multiply-add). thus some pixels will be counted as different from the host
code.

side note: i crafted a cool makefile target that automates running the code on
snotra. it looks like this:

```Makefile
remote_stop:
	@ssh -J $(USERNAME)@$(BASTION) $(USERNAME)@$(REMOTE) "stop-jobs -f" || true

remote_copy_src:
	@scp -J $(USERNAME)@$(BASTION) mandel.cu $(USERNAME)@$(REMOTE):~/ntnuhome/jobs

remote_compile: remote_stop remote_copy_src
	@ssh -J $(USERNAME)@$(BASTION) $(USERNAME)@$(REMOTE) 'cd ntnuhome/jobs/ && ./run_mandel.sh'
```

with this i can simply run `make remote_compile` to stop any currently running
jobs on snotra (which has never really been an issue), then copy my source
file over to snotra, where it is compiled and run according to a bash script
on the server:

```sh
#!/bin/bash
srun -N1 -n1 --gres=gpu:1 --partition=TDT4200 --time=0:10:00 bash -c "nvcc -o mandel mandel.cu -O3 -lm && ./mandel 1"
```

i'm not familiar with slurm jobs, but this seemed similar to the command that
runs when you `connect tdt4200` from the login-node. it worked fine for me to
compile and run the source code on snotra, which will likely be valuable
experience for upcoming assignments.

## questions

1. i did get a massive speed-up between the host and device code, according to
   the provided benchmarking prints. this is running on snotra. with a block
   size of 8x8 i get

   ```
   Host time: 1424.517 ms
   Device calculation: 6.196 ms
   Copy result: 58.629 ms
   ```

   which is, if we factor in the memory i/o, `1424.517 / (6.196 + 58.629) ~= 22`
   times faster. i believe the host time also includes i/o, so this is fair.
   this is huge, and it makes sense that a gpu is good at processing images,
   considering that is what they were made for.
2. which gpu did i use? snotra has a tesla t4:

````{=typst}
#text(size: 9pt)[```
selbu:~$ nvidia-smi
Mon Oct 27 20:13:23 2025
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 570.172.08             Driver Version: 570.172.08     CUDA Version: 12.8     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  Tesla T4                       On  |   00000000:88:00.0 Off |                    0 |
| N/A   36C    P8             15W /  70W  |       0MiB /  15360MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI              PID   Type   Process name                        GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|  No running processes found                                                             |
+-----------------------------------------------------------------------------------------+
```]
````

3. simd vs spmd
   - simd stands for single instruction, multiple data.
     - our cuda program lets us program the gpu and instruct it. this is an
       example of simd, since we are able to perform a single operation
       (kernel) across multiple data (pixels) at the hardware level.
     - cpus can utilize vector instructions and registers to compute things in
       parallel, though at a much smaller scale than gpus.
   - spmd stands for single program, multiple data.
     - mpi is an example of this, since you create multiple processes that
       execute the same program code.