GPU programming

Parallel Computing

What

multiple units work at the same time to complete a certain job.

How

a variety of possible approaches based on:

  • type and quantity of work to do;
  • kinds and capabilities of available hardware.
Why
  • less processing time;
  • possibility to process more data.

How much faster?

Theoretical upper bound: T_L(N) = T_0/N, where T_0 is the serial runtime and N the number of units. Ideal speed-up: S_L(N) = T_0/T_L(N) = N (no upper limit to the speed-up).

Amdahl’s argument: only a fraction p of the program can be run in parallel. Upper bound for runtime:

T_A(N) = p T_0/N + (1 - p) T_0

Maximum speed-up:

S_A(N) = T_0/T_A(N) = 1 / (p/N + (1 - p))

Upper limit to the speed-up: 1/(1 - p).

Amdahl’s law/I

Upper limit to the speed-up: S_A(∞) = 1/(1 - p).

If p is small, there is not much to gain from parallelization. Even for large p, there’s such a thing as too many units:

p \ N    2      4      100    200    400    ∞
10%      1.05   1.08   1.11   1.11   1.11   1.11
25%      1.14   1.23   1.33   1.33   1.33   1.33
50%      1.33   1.60   1.98   1.99   1.99   2.00
75%      1.60   2.29   3.88   3.94   3.97   4.00
99%      1.98   3.88   50.2   66.9   80.2   100.

(Diminishing returns: doubling the number of units brings progressively less additional speed-up.)
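
A minimal C sketch (my code, not from the original slides) that reproduces the table above from Amdahl’s formula:

/* Amdahl's speed-up S_A(N) = 1/(p/N + (1-p)); prints the table above. */
#include <stdio.h>

static double amdahl(double p, double n)
{
    return 1.0 / (p / n + (1.0 - p));
}

int main(void)
{
    const double ps[] = { 0.10, 0.25, 0.50, 0.75, 0.99 };
    const double ns[] = { 2, 4, 100, 200, 400 };
    for (int i = 0; i < 5; i++) {
        printf("p=%2.0f%%:", ps[i] * 100);
        for (int j = 0; j < 5; j++)
            printf(" %7.2f", amdahl(ps[i], ns[j]));
        printf("  (limit: %.2f)\n", 1.0 / (1.0 - ps[i])); /* N -> infinity */
    }
    return 0;
}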

Amdahl’s law/II: what to optimize

Assume total work can be decomposed into multiple parts p_1, p_2, ..., p_r and each can be sped up by s_1, s_2, ..., s_r. Total speed-up:

S = 1 / (p_1/s_1 + p_2/s_2 + ... + p_r/s_r)

Should you focus where s_i is larger or where p_i is larger?

Example: p_1 = 0.25, p_2 = 0.75, but s_1 = 20, s_2 = 5.

from 1 only   from 2 only   from both
1.31          2.50          6.15

Focus on the larger part: better speed-up with less investment.
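
A small C sketch (mine, not from the slides) of the generalized formula, plugging in the example’s numbers:

/* Generalized Amdahl speed-up for parts p[] sped up by factors s[].
   s = 1 means a part is left untouched. */
#include <stdio.h>

static double speedup(const double *p, const double *s, int r)
{
    double t = 0;
    for (int i = 0; i < r; i++)
        t += p[i] / s[i];     /* runtime of part i after speed-up */
    return 1.0 / t;
}

int main(void)
{
    double p[]  = { 0.25, 0.75 };
    double s1[] = { 20, 1 };  /* optimize part 1 only */
    double s2[] = { 1, 5 };   /* optimize part 2 only */
    double sb[] = { 20, 5 };  /* optimize both */
    printf("from 1: %.2f  from 2: %.2f  from both: %.2f\n",
           speedup(p, s1, 2), speedup(p, s2, 2), speedup(p, sb, 2));
    return 0;
}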

How much more work?

Amdahl’s law focuses on maximum speed-up assuming constant work (strong scaling). How about doing more work in the same time (weak scaling)?

Gustafson’s law: if a fraction p of the work can be sped up by a factor of s, then the total amount of work that can be done in the same time will be:

S_G = W(s)/W_0 = 1 - p + s p

Example: if 80% of the work can be sped up by a factor of 20, then in the same time you can do 0.2 + 20 × 0.8 = 16.2 times more work.

Gustafson’s law/I

Beware of non-linear effects! More work done means more data processed, but how much? Depends on the order (complexity).

Example: W is O(D^3), where D is the data size. Then 16× more work does not mean 16× more data processed, but only ∛16 ≈ 2.5 times more data.
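
A small C sketch (mine, not from the slides) of both effects; assumes a C99 compiler, link with -lm for cbrt:

#include <stdio.h>
#include <math.h>

int main(void)
{
    double p = 0.8, s = 20;        /* 80% of the work sped up 20x */
    double sg = 1 - p + s * p;     /* Gustafson: 16.2x more work */
    double data = cbrt(sg);        /* W ~ D^3  =>  D ~ W^(1/3) */
    printf("work: %.1fx, data: %.2fx\n", sg, data);
    return 0;
}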

Two kinds

Kinds of parallelism

task-based

each unit carries out a different job;
example: “assembly line” model;

data-based

all units do the same work, on (different) subsets of the data;
example: domain partitioning.

(The distinction is not always clear-cut.)

Examples

In a research team, a person develops remote sensing software, another person develops simulation models, and a third one hazard assessment procedures. Task-based parallelism.

In a surveillance system each video camera monitors a different area. Data-based parallelism.

Task-based

Preferred if
  • the problem can be divided into tasks;
  • number of tasks ≈ number of units;
  • units carrying out different tasks can communicate easily;

Data-based

Preferred if
  • the work to be done is ‘simple’;
  • big data size;
  • it is easy to distribute data among units so as to minimize data exchange.

Task-based: serial

Three tasks (blue, red, green), one unit (one row), two data sets (A, B)

Task-based: two units (dependent case)

Three dependent tasks, two units (two rows), two data sets

Task-based: two units (independent case)

Three independent tasks, two units (two rows), two data sets

Task-based: two units, more work (dependent case)

Three dependent tasks, two units, more data sets

Task-based: three units, more work (dependent case)

Three dependent tasks, three units, more work

Task-based: three units, more work (independent case)

Three independent tasks, three units, more work

Data-based: serial

Data-based: two units

Data-based: three units

Data-based: etc

Hardware

Parallel hardware can be classified as

shared-memory

units can read/write the same memory;
example: multiple processors on a single computer;

distributed-memory

each unit has its own memory;
example: networked computers, nodes in a cluster;

hybrid

example: multiprocessor nodes in a cluster

Shared vs distributed memory

Shared memory

Pro
  • simpler communication;
  • less latency;
Con
  • concurrency problems;
  • limits to number of units, amount of memory.

Distributed memory

Pro
  • fewer concurrency problems;
  • (theoretically) infinite scalability;
Con
  • more complex communication;
  • higher latency.

Glossary (parenthesis)

concurrency problem

multiple units trying to access the same resource;

latency

time elapsed between data requests and data availability;

bandwidth

data transfer speed, measured in amounts of data transferred per time unit (in modern hardware, typically measured in GB/sec);

Concurrency

  1. if everybody is reading, no problem;
  2. if a unit writes while other units read, consistency problems (reads may return partially updated data);
  3. if multiple units write, concurrency problems (mixed partial updates; and if one write prevails, no way to know which one).
Solutions
  • atomic operations (see the sketch below);
  • semaphores and mutexes;
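
As an illustration of the first solution, a minimal C11 sketch (my example, not from the slides): two threads increment a shared counter, one copy with an atomic operation and one with a plain, racy write. Compile with -std=c11 -pthread.

#include <stdatomic.h>
#include <pthread.h>
#include <stdio.h>

atomic_int safe_count = 0;   /* atomic: increments never get lost */
int unsafe_count = 0;        /* plain int: data race, result undefined */

static void *worker(void *arg)
{
    for (int i = 0; i < 1000000; i++) {
        atomic_fetch_add(&safe_count, 1);
        unsafe_count++;      /* two threads may mix partial updates */
    }
    return NULL;
}

int main(void)
{
    pthread_t t[2];
    for (int i = 0; i < 2; i++) pthread_create(&t[i], NULL, worker, NULL);
    for (int i = 0; i < 2; i++) pthread_join(t[i], NULL);
    printf("atomic: %d, racy: %d\n", safe_count, unsafe_count);
    return 0;
}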

Bandwidth and latency

(Famous) extreme example

            HyperTransport   Truckload of µSD
bandwidth   50 GB/sec        > 1000 TB/sec
latency     256 ns max       hours
Solutions
  • avoid useless transfers (optimizes bandwidth use);
  • merge transfers (amortizes latency).

Glossary (closed parenthesis)

Supply and demand

Demand

In scientific, industrial and commercial contexts, huge processing power is in demand for:

monitoring

huge amounts of data to analyze in real time;

modelling

study, validation and application of models (economics, geophysics, mechanics, …);

simulations

forecasting (e.g. meteorology), scenarios for a variety of phenomena (natural disaster, infrastructure failure, …)

Supply

cluster

traditional solution; n computers (nodes) connected by high-speed networks;

GPU

recent ‘accidental’ solution: originally developed for videogames, they turned out to be excellent for general-purpose parallel computing as well;

accelerators

take off from the success of GPUs and rely on the same principles, but lack the graphics specialization (e.g. Xeon Phi).

Cluster

Solution around which distributed computing was developed.

Pro
  • classical, well-tested solution;
  • huge software ecosystem;
  • widespread know-how;
  • great flexibility.
Con
  • huge (hardware) initial investment costs;
  • huge (hardware) maintenance costs (power consumption, cooling, …);
  • high cost/performance ratio;
  • low performance/power consumption ratio;

GPU

Video cards: born for videogames, people realize it’s possible to use them for computations. GPGPU (General-purpose Programming on GPU) is born.

Pro
  • low initial (hardware) investment costs;
  • low (hardware) maintenance cost;
  • low price/performance ratio;
  • high performance/power consumption ratio;
Con
  • new solution (officially born in late 2006, early 2007);
  • less software, less tested;
  • less know-how;
  • less flexibility.

Videogames and parallel computing (parenthesis)

History of videogames

Pong (1972)
Big Top (1983)
Prince of Persia (1989)
Prince of Persia II (1994)
Prince of Persia 3D (1999)
Prince of Persia: The Sands of Time (2003)
Prince of Persia: Warrior Within (2004)
Prince of Persia: The Two Thrones (2005)
Pac-Man (1980)
Castle Wolfenstein (1981)
Wolfenstein 3D (1992)
Return to Castle Wolfenstein (2001)

History of videogames

Games evolved thanks to two factors: better software and better hardware.

Better software

Software trick example: ‘fast inverse square root’

float rsqrt(float x)
{
    float h = x/2;
    int i = *(int*)&x;         /* reinterpret the float's bits as an int */
    i = 0x5f3759df - (i>>1);   /* magic constant gives a first estimate */
    x = *(float*)&i;           /* back to float */
    return x*(1.5f - h*x*x);   /* one Newton-Raphson refinement step */
}
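
A single Newton-Raphson step refines the bit-level estimate to well under 1% error while avoiding a division and a square root call, which is what made the trick attractive for real-time 3D lighting.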

Better hardware

(pre-)History of video cards

CGA (Color Graphics Adapter), 1981
EGA (Enhanced Graphics Adapter), 1984
VGA (Video Graphics Array), 1987
IBM 8514, 1987
IBM XGA, 1990

History of video cards

The ’90s: Graphical User Interfaces become widespread. Video cards gain the capability to accelerate 2D graphics.

Videogames start exploring the third dimension. Dedicated video cards for the animation of 3D scenes see the light.

High cost, but gamers like it: cleaner graphics, fluid animations. Huge commercial success.

Impact of first 3D video cards

Tomb Raider, VGA, 15fps
Tomb Raider, 3Dfx, 30fps

OpenGL and DirectX

Access to 3D hardware is standardized through common APIs: OpenGL (cross-platform), DirectX (Windows only).

Over time, advanced functions for programmable shading are introduced, allowing on-the-fly sophisticated changes to geometries and other forms of scene post-processing.

Programmable 3D video cards are born, the term GPU (Graphics Processing Unit) is born.

Rendering with shading in OpenGL 2.0
Shading in OpenGL 2.0

From GPU to GPGPU

GPU: very powerful hardware, but specialized for 3D graphics: (projective) transformation of points in space and on the plane, linear algebra (matrices and vectors).

People start thinking: can we use it for (generic) computations? Yes!

Pre-history of GPGPU (2004): mathematical problem ⇒ transform into a graphics problem ⇒ 3D scene ⇒ render the scene to run the computations ⇒ get the resulting image ⇒ ‘decode’ it to get the solution

Complex, but efficient: particle systems with 1M particles could run at 20fps.

‘Prehistoric’ GPGPU

Flock of birds avoiding fixed and moving obstacles (2004)
Particle system with 1M particles (2004)

Birth of true GPGPU

Major hardware vendors help developers by making the GPU computing power more easily accessible.

ATI (then AMD)

CAL, Compute Abstraction Layer, provides low-level access to the shaders in their GPUs;

NVIDIA

CUDA, Compute Unified Device Architecture, provides high- and low-level interfaces to the shaders in their GPUs.

GPGPU (closed parenthesis)

GPU (bis)

Shared memory programming model (we’re talking about GPU memory, not system memory!).

GPU: several (tens of) multiprocessors (MP), each equipped with several (~100) PE/MP ⇒ thousands of operations per clock cycle (clocks < 2GHz).
Throughput: > 10 TFLOPS.

CPU: tens of cores, ~16–32 PE/core ⇒ up to ~1000 operations per clock cycle (clocks < 3GHz).
Throughput: < 5 TFLOPS.

PE: Processing Element, i.e. a GPU stream processor or a CPU SIMD lane.

Recent AMD architectures (Zen) have pushed up the number of cores per CPU.

GPU vs CPU: performance

GPU model          MPs    PE/MP    FLOPS (TFLOPS)    bandwidth (GB/s)
AMD Radeon VII     60     64       11.1              1028
NVIDIA TITAN RTX   72     64       12.4              672

CPU model                    cores    PE/core    FLOPS (GFLOPS)    bandwidth (GB/s)
AMD EPYC 7702P 2GHz          64       16         4096              143.1
Intel Xeon W-3275M 2.5GHz    28       32         3072              131.13

GPU ≈ 3× the best multicore CPU in absolute FLOPS, and roughly 20× in FLOPS per euro (see costs below).

Peak theoretical single-precision FLOPS with FMA (fused multiply-add).

Intel CPUs reduce the frequency when using AVX. Hyperthreading is not considered, as it does not improve performance when fully using the CPU and can actually have a negative impact.
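
The peak figures follow from units × PE per unit × 2 (one FMA counts as two FLOPs) × clock. A small C sketch (mine, not from the slides) for the EPYC row:

#include <stdio.h>

int main(void)
{
    double cores = 64, pe_per_core = 16, clock_hz = 2e9; /* EPYC 7702P */
    double flops = cores * pe_per_core * 2 * clock_hz;   /* FMA = 2 ops */
    printf("peak: %.0f GFLOPS\n", flops / 1e9);          /* prints 4096 */
    return 0;
}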

GPU vs CPU: costs

Model                        Price                  TDP
AMD Radeon VII               700€ (16GB HBM)        300W
NVIDIA TITAN RTX             2800€ (24GB GDDR6)     280W
AMD EPYC 7702P 2GHz          4000–5000€             200W
Intel Xeon W-3275M 2.5GHz    >7000€                 205W

GPU price includes onboard VRAM, as specified.

CPU vs GPU: summary

CPU

sophisticated, generic hardware;
supports larger problems (up to 4TiB RAM for AMD);
great flexibility, lower performance in specific cases;
(best choice for task-based parallelism);

GPU

simpler hardware, specialized for parallel computing;
fixed (smaller) memory size;
great performance in specific cases, less flexibility;
(best choice for data-based parallelism);

APU (AMD)

hybrid hardware: one or more CPU cores and one or more GPU cores, tightly integrated (HSA: Heterogeneous System Architecture).

Parallelization

transformation of a serial algorithm (sequence of steps to be executed one after the other) into a parallel algorithm.

Two extremes:

intrinsically serial problems

the serial nature is structural to the problem;

embarrassingly parallel problems

problems for which there’s an obvious way to parallelize the work (obvious, but not necessarily the most efficient).

Intrinsically serial problems

Cannot be parallelized because it’s not possible to distribute work across multiple units. Classic example:

1 oven can cook 1 cake in 40 minutes;
8 ovens cannot cook 1 cake in 5 minutes.

There still may be data-based parallelization opportunities:

8 ovens can cook 8 cakes in 40 minutes.

Embarrassingly parallel problems

They present clearly separate tasks whose execution may overlap, or total data independence.


Apparently serial problems

Class of problems for which the ‘natural’ approach is intrinsically serial, but for which there exists a different approach that can be parallelized.

Classic examples:

reduction

obtain a single datum from a collection of data; examples: maximum, minimum, sum, etc. (see the sketch after this list);

scan

prefix sums (progressive summations), data selection, and similar problems;

sorting

rearrange data so that they respect a given order relation.
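
As a sketch of how an apparently serial reduction parallelizes in practice, a minimal C example using the OpenMP standard mentioned later (compile with -fopenmp; the data values are dummies of my choosing):

#include <stdio.h>

#define N 1000000

static float data[N];

int main(void)
{
    for (int i = 0; i < N; i++)
        data[i] = 1.0f;                /* dummy data */

    float sum = 0.0f;
    /* each thread sums its own chunk; OpenMP combines
       the partial sums at the end */
    #pragma omp parallel for reduction(+:sum)
    for (int i = 0; i < N; i++)
        sum += data[i];

    printf("sum = %.0f\n", sum);       /* prints 1000000 */
    return 0;
}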

In hardware

SIMD

single instruction, multiple data: hardware instructions that operate on vectors (MMX, AltiVec, SSE, NEON, AVX, etc); see the sketch after this list;

SPMD

single program, multiple data: a sequence of instructions will be executed by multiple units at the same time; may be supported in hardware or via software;

SIMT

single instruction, multiple threads: the hardware automatically manages multiple threads that all execute the same instruction (hardware-based multi-processing).
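
A minimal SIMD illustration in C with x86 AVX intrinsics (my example, not from the slides; assumes an AVX-capable CPU and the -mavx compiler flag):

#include <immintrin.h>
#include <stdio.h>

int main(void)
{
    float a[8] = {1, 2, 3, 4, 5, 6, 7, 8};
    float b[8] = {8, 7, 6, 5, 4, 3, 2, 1};
    float c[8];

    __m256 va = _mm256_loadu_ps(a);     /* load 8 floats at once */
    __m256 vb = _mm256_loadu_ps(b);
    __m256 vc = _mm256_add_ps(va, vb);  /* one instruction, 8 additions */
    _mm256_storeu_ps(c, vc);

    for (int i = 0; i < 8; i++)
        printf("%.0f ", c[i]);          /* prints 9 eight times */
    printf("\n");
    return 0;
}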

In hardware (bis)

GPUs are typically SPMD and SIMT, (modern) CPUs are SIMD, optionally with software-controlled SPMD.

In software

Multi-processing

a single program starts multiple threads; shared memory model;
standard (on CPU): OpenMP.

Message-passing

multiple processes communicate by exchanging messages; distributed memory model;
standard: MPI (Message Passing Interface); see the sketch below.

Hybrid

multiple processes, each of which is multi-threaded.
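
A minimal message-passing sketch in C with MPI (a toy example of the standard named above, not from the slides): rank 0 sends one integer to rank 1. Build with mpicc, run with mpirun -np 2.

#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    if (size >= 2) {
        if (rank == 0) {               /* sender process */
            int msg = 42;
            MPI_Send(&msg, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
        } else if (rank == 1) {        /* receiver process */
            int msg;
            MPI_Recv(&msg, 1, MPI_INT, 0, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
            printf("rank 1 received %d\n", msg);
        }
    }

    MPI_Finalize();
    return 0;
}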

(end)