GPU Compilers
Challenges in Learning About GPUs
- Architectural Variations
- Every vendor keeps changing its architecture, and different vendors build different hardware. Graphics programs still run on all vendors’ hardware because each vendor ships its own driver and does the final compilation on the device.
- Inconsistent Terminology
- Vendors use different terms for the same features and reuse CPU terms with somewhat different meanings. For instance, a “CUDA core” is not the same thing as a CPU core.
- Legacy Terminology
- Some terms come from the GPU’s history of graphics. For example, a shader is a program.
- Software Abstractions
- Big frameworks hide the details.
- CUDA’s Market Dominance
- Most explanations use CUDA terms since CUDA is the market leader. For instance, AMD uses ‘wave’ to mean more or less the same thing as a CUDA ‘warp’, but lots of AMD documentation says ‘warp’ anyway.
- CUDA: Language and Model
- CUDA is both a programming language and a programming model, so you can have CUDA Fortran, CUDA Python, etc.
hardware styles: NVIDIA model
- compute and graphics products share the same programmable parts; the compute products leave out the non-programmable (fixed-function graphics) features
- graphics customers pay a tax for the compute instructions
- more transistors for compute instructions
- one big die
hardware styles: AMD model
- compute has extra instructions (no tax on graphics customers to support compute)
- chiplet model (great engineering!)
- RDNA series is graphics, CDNA series is compute
programming model
both vendors use the CUDA programming model, and AMD supports a variation of the CUDA language (HIP)
machines have multiple SIMD processors; each SIMD processor can be running a different instruction, but every lane of one SIMD runs the same instruction
a lane of a SIMD is called a thread
the programming model is SIMT (single instruction multiple threads)
The user writes a scalar program; the compiler maps that program to a lane of a SIMD; many instances of the program run at once; the hardware combines copies of the scalar program into warps and schedules the warps onto the SIMD engines
programs are called kernels
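A minimal sketch of such a kernel in CUDA (the kernel name scale and its parameters are illustrative): the user writes scalar code for one thread, and the hardware runs many copies of it, one per lane.

// Scalar program, written as if for a single thread. The compiler maps it
// to one SIMD lane; the hardware runs one copy per data element.
__global__ void scale(float *data, float factor, int n) {
    // each thread computes its own global index from the built-in
    // block/thread identifiers
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)                      // guard: the grid may be larger than n
        data[i] = data[i] * factor;
}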
threading model
warps / waves / wavefronts
- threads are grouped together into warps.
- Size is fixed by hardware (usually 32), programmers know this and often make data set sizes a multiple of 32.
- since all threads in a warp are running the same instruction, there is no need for explicit synchronization
- there are a few instructions that operate across a whole warp, which break the model and give the compiler problems (see the sketch below)
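One concrete example in CUDA is the shuffle family; the sketch below sums a value across a 32-lane warp with __shfl_down_sync (the helper name warp_sum is illustrative).

// Warp-level reduction: each lane contributes val, and after the loop
// lane 0 holds the sum of all 32 lanes. The shuffle reads a register
// from another lane of the same warp, which is exactly the cross-lane
// traffic that breaks the purely scalar picture of the program.
__device__ float warp_sum(float val) {
    for (int offset = 16; offset > 0; offset /= 2)
        val += __shfl_down_sync(0xffffffff, val, offset);
    return val;   // meaningful in lane 0
}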
thread blocks: groups of warps
- threads within a block can access a fast scratchpad memory (called shared memory on NVIDIA, LDS on AMD), which violates the C memory model (see the sketch after this list)
- threads are identified by a 3d index inside a block
- threads within a block need (and have) explicit synchronization operations
- hardware schedules each block onto an execution unit; max block size is limited by the size of an execution unit
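A minimal sketch of both points, shared memory plus the block-level barrier, assuming blocks of exactly 256 threads and data sized as a multiple of 256 (the kernel name and the reversal are only illustrative).

// Each block reverses its own 256-element tile using the fast
// per-block scratchpad (shared memory in CUDA, LDS on AMD).
__global__ void reverse_in_block(int *data) {
    __shared__ int tile[256];            // scratchpad, one tile per block

    int t = threadIdx.x;                 // thread index inside the block
    int base = blockIdx.x * blockDim.x;

    tile[t] = data[base + t];            // stage into shared memory
    __syncthreads();                     // block-level barrier: all writes visible

    data[base + t] = tile[blockDim.x - 1 - t];   // read another thread's slot
}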
grids: groups of blocks
- 3d collection of blocks,
- usually hardware limits mean that all the threads in a grid do not run at the same time
- programmers need to make grids big enough to fill the hardware
- software launches a grid together with the program (kernel) to run over it (see the sketch after this list)
- threads within a grid but in different blocks do not have sync operations
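A launch sketch: dim3 gives the block and grid shapes, and the grid is sized from the problem so there are enough threads to fill the hardware (scale is the illustrative kernel from the earlier sketch, d_data a device pointer allocated elsewhere).

// Launch configuration: a grid of blocks, each block a group of warps.
// The sizes here are illustrative. Blocks in the same grid cannot
// synchronize with each other; only threads inside one block can.
int n = 1 << 20;                         // one million elements (example)
dim3 block(256);                         // 256 threads = 8 warps per block
dim3 grid((n + block.x - 1) / block.x);  // enough blocks to cover all n elements
scale<<<grid, block>>>(d_data, 2.0f, n); // one scalar program, n instances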
Heterogeneous programming
the cpu is called the host
the gpu is called the device
cpu launches grids and kernels to gpu
Computations launched on the device execute asynchronously with respect to the host, and it is the user’s responsibility to synchronize
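A sketch of that split in CUDA (error handling omitted; run and d_data are illustrative, scale is the kernel from the earlier sketch): the launch returns immediately, and the host must synchronize before it can trust the results.

#include <cuda_runtime.h>

// Host (CPU) side: launch work on the device (GPU), then wait explicitly.
// Without the synchronize, the host could look at the results before the
// kernel has finished.
void run(float *d_data, int n) {
    dim3 block(256), grid((n + 255) / 256);
    scale<<<grid, block>>>(d_data, 2.0f, n);   // asynchronous: returns at once
    cudaDeviceSynchronize();                   // block the host until the device is done
}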
memory spaces
- cpu and gpu have different memory spaces (the cpu can copy from one to the other). Some amount of combined (unified) memory exists, but it is slower than the separate memories (see the sketch after this list)
- gpu has shared/LDS memory which can be accessed by the threads within a block; passing an address in it to a different block does not work
- main gpu memory is called global and is accessible by all threads
- gpu has per-thread memory called local, scratch, or private memory; unlike C, passing an address in local memory to another thread does not work (under some restrictions AMD implements part of the C memory model)
- a few special gpu memory types: constant, texture, surface (left over from graphics)
- gpu can treat registers as fast memory
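A sketch that touches the main spaces from this list (kernel name, sizes, and the arithmetic are illustrative; assumes 256-thread blocks and data sized as a multiple of 256):

__constant__ float coeff[16];          // constant memory: read-only on the device,
                                       // filled by the host with cudaMemcpyToSymbol

__global__ void spaces_demo(const float *in, float *out) {   // in/out point into global memory
    __shared__ float tile[256];        // shared/LDS: visible only to this block
    float acc = 0.0f;                  // per-thread (private): a register, or
                                       // local/scratch memory if it spills

    int i = blockIdx.x * blockDim.x + threadIdx.x;
    tile[threadIdx.x] = in[i];         // global -> shared
    __syncthreads();                   // make the tile visible to the whole block

    // reading a neighbour's slot is legal only because both threads are in the same block
    float v = tile[(threadIdx.x + 1) % blockDim.x];
    for (int k = 0; k < 16; k++)
        acc += coeff[k] * v;           // constant * shared value, accumulated in a register
    out[i] = acc;                      // register -> global
}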
Hardware
NVIDIA
Each box here is hardware: an int32/fp32/fp64 unit can perform one operation, so here we have 32 fp32 units which can do 32 float operations in parallel, as well as 16 int32 and 16 fp64 units. There are also 8 units that can do loads or stores, and a final special function unit that can do transcendental operations like sin/cos.
under some conditions two instructions (a float and an int) from the same warp can execute at the same time
I’m not going to talk much about the tensor cores
all these units execute the same instruction: SIMT, single instruction multiple thread (not the same as SIMD, but related)
one instruction does vector work
e.g. add 32 float32 values (all coming from registers) and store the 32 results in 32 other registers
Notice no branch prediction, no out of order execution
great at switching: a warp holds the instruction and knows which registers it owns (a contiguous set, so it just needs a start and a length); switching to a different warp means changing these two numbers and the pc (this is done by the dispatch unit)
When we do a load, we need to wait for the result. A CPU might do some kind of out-of-order execution; a GPU switches to another warp.
Finally, we need to pick the warp to switch to; this is done by the warp scheduler (half of the hardware scheduler).
streaming multiprocessors (SM)
nvidia packs 4 of these execution engines into an SM (streaming multiprocessor), along with an L1 instruction cache, a special memory accelerator for tensors, and a 256 KB L1 data cache / shared memory block
pack sm’s together
AMD
AMD has a similar block with some important differences
- At the bottom of a compute core there are 4 general-purpose SIMD engines, each of length 16, and one scalar engine
Given 5 waves, this compute core can execute 4 vector instructions and one scalar instruction per clock. Two instructions from the same wave never execute at the same time.
The SIMD engines can execute different instructions
The simd sizes vary over different chips
AMD CDNA
packing sms
simd and control flow
to execute an if then else
- compute the if condition
- turn off the lanes where the condition is false
- execute the if side
- flip the lanes
- execute the else side
the time for the warp is the sum of the times for the then side and the else side
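As a sketch, a kernel where the warp diverges (names illustrative): both sides are executed with the inactive lanes masked off.

__global__ void divergent(float *x) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i % 2 == 0) {          // half the lanes of every warp take this side
        x[i] = x[i] * 2.0f;    //   runs with the odd lanes masked off
    } else {                   // then the mask is flipped
        x[i] = x[i] + 1.0f;    //   runs with the even lanes masked off
    }
    // total time for the warp ~ time(then side) + time(else side)
}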
number of warps in flight
since an entire block has to fit on one compute unit / SM, the resources used by the block limit the number of warps on an SM
if each warp needs 100 registers and there are 256 vector registers on the compute unit, then only two warps can be resident at once; the compiler controls the number of registers
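In CUDA the programmer can push on this trade-off with the __launch_bounds__ qualifier (or the coarser -maxrregcount compiler flag); a sketch, with the kernel name and body purely illustrative:

// Hint to the compiler: this kernel is launched with at most 256 threads
// per block, and we want at least 2 blocks resident per SM. The compiler
// may spill or rematerialize values to keep the per-thread register count
// low enough for that occupancy.
__global__ void __launch_bounds__(256, 2)
heavy_kernel(const float *in, float *out) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    // ... register-hungry computation would go here ...
    out[i] = in[i];
}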
latency
the gpu is a throughput machine: what counts is how many threads finish in a unit of time, not how long a single thread takes to finish
unlike the cpu, the gpu can have lots of loads in flight; the times for these loads overlap, so the compiler tries to group loads together, but this needs extra registers
- a warp issues a group of loads
- warp issues a wait for loads to finish (hardware in cpu, software in gpu)
- hardware switches to another warp (if there is one), so it is good to have a lot of warps; if all the warps are waiting for memory, the ALU units sit idle
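A sketch of what grouping loads looks like at the source level (the compiler often does this reordering itself; names are illustrative): the independent loads are issued before any of them is used, so their latencies overlap, at the cost of three values living in registers at once.

__global__ void grouped_loads(const float *a, const float *b,
                              const float *c, float *out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;

    // all three loads are issued back to back; nothing needs their
    // results yet, so the memory latencies overlap
    float x = a[i];
    float y = b[i];
    float z = c[i];

    out[i] = x * y + z;    // the uses arrive only after all loads were issued
}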
differences
- NVIDIA committed to AI and cut down on the non-tensor operations; AMD did not
- Generally you need more waves on an AMD processor
- AMD added Infinity Fabric connections so that they could use chiplets
- AMD has a scalar processor, but the programming model does not expose it, so it is up to the compiler to use it
- NVIDIA gives the compiler one architecture to target; AMD has two (RDNA and CDNA)
compiler model
- like a regular compiler for a scalar machine
- new problem: the registers used by a warp limit the number of warps in flight, so register allocation is different
- new problem: control flow is more critical
- new problem: latency hiding means grouping loads, but not too much
- new problem: arch keeps changing
thread coarsening
Suppose we are computing a matrix multiply
we could say each thread writes one result, so a 32 x 32 matrix would need 32 * 32 threads, and each thread reads one row and one column of the input
we have a lot of reuse (redundant loads of the same data)
we could instead say each thread writes 4 results, so we need 1/4 of the threads, and each thread reads one row and 4 columns (see the sketch below)
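A sketch of that 4-way coarsened multiply in CUDA (illustrative names; assumes square row-major N x N matrices with N a multiple of 4 and a grid that exactly covers the output):

// C = A * B, with each thread producing 4 adjacent elements of one row
// of C. The element of A is loaded once per k and reused for all four
// outputs: the redundant work is done once and kept in a register.
__global__ void matmul_coarse4(const float *A, const float *B,
                               float *C, int N) {
    int row  = blockIdx.y * blockDim.y + threadIdx.y;
    int col0 = (blockIdx.x * blockDim.x + threadIdx.x) * 4;   // first of 4 columns

    float acc0 = 0.f, acc1 = 0.f, acc2 = 0.f, acc3 = 0.f;     // accumulators live in registers
    for (int k = 0; k < N; k++) {
        float a = A[row * N + k];           // loaded once, used four times
        acc0 += a * B[k * N + col0 + 0];
        acc1 += a * B[k * N + col0 + 1];
        acc2 += a * B[k * N + col0 + 2];
        acc3 += a * B[k * N + col0 + 3];
    }
    C[row * N + col0 + 0] = acc0;
    C[row * N + col0 + 1] = acc1;
    C[row * N + col0 + 2] = acc2;
    C[row * N + col0 + 3] = acc3;
}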
technique
- merge multiple threads so each resulting thread writes multiple outputs
- do the redundant work once and save in registers
- use the registers for computing all the outputs
But
- needs extra registers
- increased efficiency but reduced parallelism
After thread coarsening, the computation from the merged threads can see each other's results
doing this by hand
// Hand-coarsened inner loop: this one thread produces four outputs
// (energyvalx1..4, one per x coordinate coorx1..4) instead of one.
for (atomid = 0; atomid < numatoms; atomid++) {
    // work that does not depend on which of the four outputs we are
    // computing is done once and kept in registers
    float dy = coory - atominfo[atomid].y;
    float dysqpdzsq = (dy * dy) + atominfo[atomid].z;
    // per-output work: one dx per output column
    float dx1 = coorx1 - atominfo[atomid].x;
    float dx2 = coorx2 - atominfo[atomid].x;
    float dx3 = coorx3 - atominfo[atomid].x;
    float dx4 = coorx4 - atominfo[atomid].x;
    // four accumulators live in registers for the whole loop
    energyvalx1 += atominfo[atomid].w * (1.0f / sqrtf(dx1*dx1 + dysqpdzsq));
    energyvalx2 += atominfo[atomid].w * (1.0f / sqrtf(dx2*dx2 + dysqpdzsq));
    energyvalx3 += atominfo[atomid].w * (1.0f / sqrtf(dx3*dx3 + dysqpdzsq));
    energyvalx4 += atominfo[atomid].w * (1.0f / sqrtf(dx4*dx4 + dysqpdzsq));
} …
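For contrast, a sketch of the uncoarsened counterpart of the loop above: one output per thread, with coorx and energyval standing in for the single per-thread coordinate and accumulator. Four such threads repeat the dy/dysqpdzsq work and the atominfo[atomid] loads that the coarsened version does once.

// uncoarsened (illustrative): each thread handles a single grid point
for (atomid = 0; atomid < numatoms; atomid++) {
    float dy = coory - atominfo[atomid].y;
    float dysqpdzsq = (dy * dy) + atominfo[atomid].z;
    float dx = coorx - atominfo[atomid].x;
    energyval += atominfo[atomid].w * (1.0f / sqrtf(dx*dx + dysqpdzsq));
}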