GPU programming

Parallel Computing

What

multiple units work at the same time to complete a certain job.

How

a variety of possible approaches based on:

  • type and quantity of work to do;
  • kinds and capabilities of available hardware.
Why
  • less processing time;
  • possibility to process more data.

How much faster?

Theoretical upper bound: T_L(N) = T_0/N, where T_0 is the serial runtime and N the number of units. Ideal speed-up: S_L(N) = T_0/T_L(N) = N (no upper limit to the speed-up).

Amdahl’s argument: only a fraction p of the program can be run in parallel. Upper bound for runtime:

T_A(N) = p T_0/N + (1 - p) T_0

Maximum speed-up:

S_A(N) = T_0/T_A(N) = 1 / (p/N + (1 - p))

Upper limit to the speed-up: 1/(1 - p).

Amdahl’s law/I

Upper limit to the speed-up: S_A(∞) = 1/(1 - p).

If p is small, there is not much to gain from parallelization. Even for large p, there’s such a thing as too many units:

p \ N    2      4      100    200    400    ∞
10%      1.05   1.08   1.11   1.11   1.11   1.11
25%      1.14   1.23   1.33   1.33   1.33   1.33
50%      1.33   1.60   1.98   1.99   1.99   2.00
75%      1.60   2.29   3.88   3.94   3.97   4.00
99%      1.98   3.88   50.2   66.9   80.2   100.

(Diminishing returns: doubling the number of units brings progressively less additional speed-up.)
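
A minimal C sketch (my code, not from the original slides) that reproduces the table above from Amdahl’s formula:

/* Amdahl's speed-up S_A(N) = 1/(p/N + (1-p)); prints the table above. */
#include <stdio.h>

static double amdahl(double p, double n)
{
    return 1.0 / (p / n + (1.0 - p));
}

int main(void)
{
    const double ps[] = { 0.10, 0.25, 0.50, 0.75, 0.99 };
    const double ns[] = { 2, 4, 100, 200, 400 };
    for (int i = 0; i < 5; i++) {
        printf("p=%2.0f%%:", ps[i] * 100);
        for (int j = 0; j < 5; j++)
            printf(" %7.2f", amdahl(ps[i], ns[j]));
        printf("  (limit: %.2f)\n", 1.0 / (1.0 - ps[i])); /* N -> infinity */
    }
    return 0;
}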

Amdahl’s law/II: what to optimize

Assume total work can be decomposed into multiple parts p_1, p_2, ..., p_r and each can be sped up by s_1, s_2, ..., s_r. Total speed-up:

S = 1 / (p_1/s_1 + p_2/s_2 + ... + p_r/s_r)

Should you focus where s_i is larger or where p_i is larger?

Example: p_1 = 0.25, p_2 = 0.75, but s_1 = 20, s_2 = 5.

from 1 only   from 2 only   from both
1.31          2.50          6.15

Focus on the larger part: better speed-up with less investment.
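
A small C sketch (mine, not from the slides) of the generalized formula, plugging in the example’s numbers:

/* Generalized Amdahl speed-up for parts p[] sped up by factors s[].
   s = 1 means a part is left untouched. */
#include <stdio.h>

static double speedup(const double *p, const double *s, int r)
{
    double t = 0;
    for (int i = 0; i < r; i++)
        t += p[i] / s[i];     /* runtime of part i after speed-up */
    return 1.0 / t;
}

int main(void)
{
    double p[]  = { 0.25, 0.75 };
    double s1[] = { 20, 1 };  /* optimize part 1 only */
    double s2[] = { 1, 5 };   /* optimize part 2 only */
    double sb[] = { 20, 5 };  /* optimize both */
    printf("from 1: %.2f  from 2: %.2f  from both: %.2f\n",
           speedup(p, s1, 2), speedup(p, s2, 2), speedup(p, sb, 2));
    return 0;
}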

How much more work?

Amdahl’s law focuses on maximum speed-up assuming constant work (strong scaling). How about doing more work in the same time (weak scaling)?

Gustafson’s law: if a fraction p of the work can be sped up by a factor of s, then the total amount of work that can be done in the same time will be:

S_G = W(s)/W_0 = 1 - p + s p

Example: if 80% of the work can be sped up by a factor of 20, then in the same time you can do 0.2 + 20 × 0.8 = 16.2 times more work.

Gustafson’s law/I

Beware of non-linear effects! More work done means more data processed, but how much? Depends on the order (complexity).

Example: W is O(D^3), where D is the data size. Then 16× more work does not mean 16× more data processed, but only ∛16 ≈ 2.5 times more data.
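
A small C sketch (mine, not from the slides) of both effects; assumes a C99 compiler, link with -lm for cbrt:

#include <stdio.h>
#include <math.h>

int main(void)
{
    double p = 0.8, s = 20;        /* 80% of the work sped up 20x */
    double sg = 1 - p + s * p;     /* Gustafson: 16.2x more work */
    double data = cbrt(sg);        /* W ~ D^3  =>  D ~ W^(1/3) */
    printf("work: %.1fx, data: %.2fx\n", sg, data);
    return 0;
}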

Two kinds

Kinds of parallelism

task-based

each unit carries out a different job;
example: “assembly line” model;

data-based

all units do the same work, on (different) subsets of the data;
example: domain partitioning.

(The distinction is not always clear-cut.)

Examples

In a research team, a person develops remote sensing software, another person develops simulation models, and a third one hazard assessment procedures. Task-based parallelism.

In a surveillance system each video camera monitors a different area. Data-based parallelism.

Task-based

Preferred if
  • the problem can be divided into tasks;
  • number of tasks ≈ number of units;
  • units carrying out different tasks can communicate easily;

Data-based

Preferred if
  • the work to be done is ‘simple’;
  • big data size;
  • it is easy to distribute data among units so as to minimize data exchange.

Task-based: serial

Three tasks (blue, red, green), one unit (one row), two data sets (A, B)

Task-based: two units (dependent case)

Three dependent tasks, two units (two rows), two data sets

Task-based: two units (independent case)

Three independent tasks, two units (two rows), two data sets

Task-based: two units, more work (dependent case)

Three dependent tasks, two units, more data sets

Task-based: three units, more work (dependent case)

Three dependent tasks, three units, more work

Task-based: three units, more work (independent case)

Three independent tasks, three units, more work

Data-based: serial

Data-based: two units

Data-based: three units

Data-based: etc

Hardware

Parallel hardware can be classified as

shared-memory

units can read/write the same memory;
example: multiple processors on a single computer;

distributed-memory

each unit has its own memory;
example: networked computers, nodes in a cluster;

hybrid

example: multiprocessor nodes in a cluster

Shared vs distributed memory

Shared memory

Pro
  • simpler communication;
  • less latency;
Con
  • concurrency problems;
  • limits to number of units, amount of memory.

Distributed memory

Pro
  • fewer concurrency problems;
  • (theoretically) infinite scalability;
Con
  • more complex communication;
  • higher latency.

Glossary (parenthesis)

concurrency problem

multiple units trying to access the same resource;

latency

time elapsed between data requests and data availability;

bandwidth

data transfer speed, measured in amounts of data transferred per time unit (in modern hardware, typically measured in GB/sec);

Concurrency

  1. if everybody is reading, no problem;
  2. if a unit writes while other units read, consistency problems (reads may return partially updated data);
  3. if multiple units write, concurrency problems (mixed partial updates; and if one write prevails, no way to know which one).
Solutions
  • atomic operations (see the sketch below);
  • semaphores and mutexes;
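
As an illustration of the first solution, a minimal C11 sketch (my example, not from the slides): two threads increment a shared counter, one copy with an atomic operation and one with a plain, racy write. Compile with -std=c11 -pthread.

#include <stdatomic.h>
#include <pthread.h>
#include <stdio.h>

atomic_int safe_count = 0;   /* atomic: increments never get lost */
int unsafe_count = 0;        /* plain int: data race, result undefined */

static void *worker(void *arg)
{
    for (int i = 0; i < 1000000; i++) {
        atomic_fetch_add(&safe_count, 1);
        unsafe_count++;      /* two threads may mix partial updates */
    }
    return NULL;
}

int main(void)
{
    pthread_t t[2];
    for (int i = 0; i < 2; i++) pthread_create(&t[i], NULL, worker, NULL);
    for (int i = 0; i < 2; i++) pthread_join(t[i], NULL);
    printf("atomic: %d, racy: %d\n", safe_count, unsafe_count);
    return 0;
}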

Bandwidth and latency

(Famous) extreme example

            HyperTransport   Truckload of µSD
bandwidth   50 GB/sec        > 1000 TB/sec
latency     256 ns max       hours
Solutions
  • avoid useless transfers (optimizes bandwidth use);
  • merge transfers (amortizes latency).

Glossary (closed parenthesis)

Supply and demand

Demand

In scientific, industrial and commercial contexts, huge processing power is in demand for:

monitoring

huge amounts of data to analyze in real time;

modelling

study, validation and application of models (economics, geophysics, mechanics, …);

simulations

forecasting (e.g. meteorology), scenarios for a variety of phenomena (natural disaster, infrastructure failure, …)

Supply

cluster

traditional solution; n computers (nodes) connected by high-speed networks;

GPU

recent ‘accidental’ solution: originally developed for videogames, they turned out to be excellent for general-purpose parallel computing as well;

accelerators

take off from the success of GPUs and rely on the same principles, but lack the graphics specialization (e.g. Xeon Phi).

Cluster

Solution around which distributed computing was developed.

Pro
  • classical, well-tested solution;
  • huge software ecosystem;
  • widespread know-how;
  • great flexibility.
Con
  • huge (hardware) initial investment costs;
  • huge (hardware) maintenance costs (power consumption, cooling, …);
  • high cost/performance ratio;
  • low performance/power consumption ratio;

GPU

Video cards: born for videogames, people realize it’s possible to use them for computations. GPGPU (General-purpose Programming on GPU) is born.

Pro
  • low initial (hardware) investment costs;
  • low (hardware) maintenance cost;
  • low price/performance ratio;
  • high performance/power consumption ratio;
Con
  • new solution (officially born in late 2006, early 2007);
  • less software, less tested;
  • less know-how;
  • less flexibility.

Videogames and parallel computing (parenthesis)

History of videogames

Pong (1972)
Big Top (1983)
Prince of Persia (1989)
Prince of Persia II (1994)
Prince of Persia 3D (1999)
Prince of Persia: The Sands of Time (2003)
Prince of Persia: Warrior Within (2004)
Prince of Persia: The Two Thrones (2005)
Pac-Man (1980)
Castle Wolfenstein (1981)
Wolfenstein 3D (1992)
Return to Castle Wolfenstein (2001)

History of videogames

Games evolved thanks to two factors: better software and better hardware.

Better software

Software trick example: ‘fast inverse square root’

float rsqrt(float x)
{
    float h = x/2;
    int i = *(int*)&x;         /* reinterpret the float's bits as an int */
    i = 0x5f3759df - (i>>1);   /* magic constant gives a first estimate */
    x = *(float*)&i;           /* back to float */
    return x*(1.5f - h*x*x);   /* one Newton-Raphson refinement step */
}
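
A single Newton-Raphson step refines the bit-level estimate to well under 1% error while avoiding a division and a square root call, which is what made the trick attractive for real-time 3D lighting.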

Better hardware

(pre-)History of video cards

CGA (Color Graphics Adapter), 1981
EGA (Enhanced Graphics Adapter), 1984
VGA (Video Graphics Array), 1987
IBM 8514, 1987
IBM XGA, 1990

History of video cards

The ’90s: Graphical User Interfaces become widespread. Video cards gain the capability to accelerate 2D graphics.

Videogames start exploring the third dimension. Dedicated video cards for the animation of 3D scenes see the light.

High cost, but gamers like it: cleaner graphics, fluid animations. Huge commercial success.

Impact of first 3D video cards

Tomb Raider, VGA, 15fps
Tomb Raider, 3Dfx, 30fps

OpenGL and DirectX

Access to 3D hardware is standardized through common APIs: OpenGL (cross-platform), DirectX (Windows only).

Over time, advanced functions for programmable shading are introduced, allowing on-the-fly sophisticated changes to geometries and other forms of scene post-processing.

Programmable 3D video cards are born, the term GPU (Graphics Processing Unit) is born.

Rendering with shading in OpenGL 2.0
Shading in OpenGL 2.0

From GPU to GPGPU

GPU: very powerful hardware, but specialized for 3D graphics: (projective) transformation of points in space and on the plane, linear algebra (matrices and vectors).

People start thinking: can we use it for (generic) computations? Yes!

Pre-history of GPGPU (2004): mathematical problem ⇒ transform into a graphics problem ⇒ 3D scene ⇒ render the scene to run the computations ⇒ get the resulting image ⇒ ‘decode’ it to get the solution

Complex, but efficient: particle systems with 1M particles could run at 20fps.

‘Prehistoric’ GPGPU

Flock of birds avoiding fixed and moving obstacles (2004)
Particle system with 1M particles (2004)

Birth of true GPGPU

Major hardware vendors help developers by making the GPU computing power more easily accessible.

ATI (then AMD)

CAL, Compute Abstraction Layer, provides low-level access to the shaders in their GPUs;

NVIDIA

CUDA, Compute Unified Device Architecture, provides high- and low-level interfaces to the shaders in their GPUs.

GPGPU (closed parenthesis)

GPU (bis)

Shared memory programming model (we’re talking about GPU memory, not system memory!).

GPU: several (tens of) multiprocessors (MP), each equipped with several (~100) PE/MP ⇒ thousands of operations per clock cycle (clocks < 2GHz).
Throughput: > 10 TFLOPS.

CPU: tens of cores, ~16–32 PE/core ⇒ up to ~1000 operations per clock cycle (clocks < 3GHz).
Throughput: < 5 TFLOPS.

PE: Processing Element, i.e. a GPU stream processor or a CPU SIMD lane.

Recent AMD architectures (Zen) have pushed up the number of cores per CPU.

GPU vs CPU: performance

GPU model          MPs    PE/MP    FLOPS (TFLOPS)    bandwidth (GB/s)
AMD Radeon VII     60     64       11.1              1028
NVIDIA TITAN RTX   72     64       12.4              672

CPU model                    cores    PE/core    FLOPS (GFLOPS)    bandwidth (GB/s)
AMD EPYC 7702P 2GHz          64       16         4096              143.1
Intel Xeon W-3275M 2.5GHz    28       32         3072              131.13

GPU ≈ 3× the best multicore CPU in absolute FLOPS, and roughly 20× in FLOPS per euro (see costs below).

Peak theoretical single-precision FLOPS with FMA (fused multiply-add).

Intel CPUs reduce the frequency when using AVX. Hyperthreading is not considered, as it does not improve performance when fully using the CPU and can actually have a negative impact.
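
The peak figures follow from units × PE per unit × 2 (one FMA counts as two FLOPs) × clock. A small C sketch (mine, not from the slides) for the EPYC row:

#include <stdio.h>

int main(void)
{
    double cores = 64, pe_per_core = 16, clock_hz = 2e9; /* EPYC 7702P */
    double flops = cores * pe_per_core * 2 * clock_hz;   /* FMA = 2 ops */
    printf("peak: %.0f GFLOPS\n", flops / 1e9);          /* prints 4096 */
    return 0;
}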

GPU vs CPU: costs

Model                        Price                  TDP
AMD Radeon VII               700€ (16GB HBM)        300W
NVIDIA TITAN RTX             2800€ (24GB GDDR6)     280W
AMD EPYC 7702P 2GHz          4000–5000€             200W
Intel Xeon W-3275M 2.5GHz    >7000€                 205W

GPU price includes onboard VRAM, as specified.

CPU vs GPU: summary

CPU

sophisticated, generic hardware;
supports larger problems (up to 4TiB RAM for AMD);
great flexibility, lower performance in specific cases;
(best choice for task-based parallelism);

GPU

simpler hardware, specialized for parallel computing;
fixed (smaller) memory size;
great performance in specific cases, less flexibility;
(best choice for data-based parallelism);

APU (AMD)

hybrid hardware: one or more CPU cores and one or more GPU cores, tightly integrated (HSA: Heterogeneous System Architecture).

Parallelization

transformation of a serial algorithm (sequence of steps to be executed one after the other) into a parallel algorithm.

Two extremes:

intrinsically serial problems

the serial nature is structural to the problem;

embarrassingly parallel problems

problems for which there’s an obvious way to parallelize the work (obvious, but not necessarily the most efficient).

Intrinsically serial problems

Cannot be parallelized because it’s not possible to distribute work across multiple units. Classic example:

1 oven can cook 1 cake in 40 minutes;
8 ovens cannot cook 1 cake in 5 minutes.

There still may be data-based parallelization opportunities:

8 ovens can cook 8 cakes in 40 minutes.

Embarrassingly parallel problems

They present clearly separate tasks whose execution may overlap, or total data independence.


Apparently serial problems

Class of problems for which the ‘natural’ approach is intrinsically serial, but for which there exists a different approach that can be parallelized.

Classic examples:

reduction

obtain a single datum from a collection of data; examples: maximum, minimum, sum, etc. (see the sketch after this list);

scan

prefix sums (progressive summations), data selection, and similar problems;

sorting

rearrange data so that they respect a given order relation.
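
As a sketch of how an apparently serial reduction parallelizes in practice, a minimal C example using the OpenMP standard mentioned later (compile with -fopenmp; the data values are dummies of my choosing):

#include <stdio.h>

#define N 1000000

static float data[N];

int main(void)
{
    for (int i = 0; i < N; i++)
        data[i] = 1.0f;                /* dummy data */

    float sum = 0.0f;
    /* each thread sums its own chunk; OpenMP combines
       the partial sums at the end */
    #pragma omp parallel for reduction(+:sum)
    for (int i = 0; i < N; i++)
        sum += data[i];

    printf("sum = %.0f\n", sum);       /* prints 1000000 */
    return 0;
}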

In hardware

SIMD

single instruction, multiple data: hardware instructions that operate on vectors (MMX, AltiVec, SSE, NEON, AVX, etc); see the sketch after this list;

SPMD

single program, multiple data: a sequence of instructions will be executed by multiple units at the same time; may be supported in hardware or via software;

SIMT

single instruction, multiple threads: the hardware automatically manages multiple threads that all execute the same instruction (hardware-based multi-processing).
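
A minimal SIMD illustration in C with x86 AVX intrinsics (my example, not from the slides; assumes an AVX-capable CPU and the -mavx compiler flag):

#include <immintrin.h>
#include <stdio.h>

int main(void)
{
    float a[8] = {1, 2, 3, 4, 5, 6, 7, 8};
    float b[8] = {8, 7, 6, 5, 4, 3, 2, 1};
    float c[8];

    __m256 va = _mm256_loadu_ps(a);     /* load 8 floats at once */
    __m256 vb = _mm256_loadu_ps(b);
    __m256 vc = _mm256_add_ps(va, vb);  /* one instruction, 8 additions */
    _mm256_storeu_ps(c, vc);

    for (int i = 0; i < 8; i++)
        printf("%.0f ", c[i]);          /* prints 9 eight times */
    printf("\n");
    return 0;
}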

In hardware (bis)

GPUs are typically SPMD and SIMT, (modern) CPUs are SIMD, optionally with software-controlled SPMD.

In software

Multi-processing

a single program starts multiple threads; shared memory model;
standard (on CPU): OpenMP.

Message-passing

multiple processes communicate by exchanging messages; distributed memory model;
standard: MPI (Message Passing Interface); see the sketch below.

Hybrid

multiple processes, each of which is multi-threaded.
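
A minimal message-passing sketch in C with MPI (a toy example of the standard named above, not from the slides): rank 0 sends one integer to rank 1. Build with mpicc, run with mpirun -np 2.

#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    if (size >= 2) {
        if (rank == 0) {               /* sender process */
            int msg = 42;
            MPI_Send(&msg, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
        } else if (rank == 1) {        /* receiver process */
            int msg;
            MPI_Recv(&msg, 1, MPI_INT, 0, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
            printf("rank 1 received %d\n", msg);
        }
    }

    MPI_Finalize();
    return 0;
}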

(end)