Parallel computing: multiple units work at the same time to complete a certain job.
a variety of possible approaches based on:
Theoretical upper bound on the speed-up: $S = t_s / t_p \le p$, where $t_s$ is the serial runtime, $t_p$ the parallel runtime and $p$ the number of units. Ideal speed-up: $S = p$. (No upper limit to the speed-up as $p$ grows.)
Amdahl’s argument: only a fraction $P$ of the program can be run in parallel. Lower bound for the runtime: $t_p \ge (1 - P)\,t_s + \dfrac{P\,t_s}{p}$
Maximum speed-up: $S(p) = \dfrac{1}{(1 - P) + P/p}$
Upper limit to the speed-up: $S_{\max} = \lim_{p \to \infty} S(p) = \dfrac{1}{1 - P}$.
If $P$ is small, there is not much to gain from parallelization. Even for large $P$, there’s such a thing as too many units:
$P$ \ $p$ | 2 | 4 | 100 | 200 | 400 | $\infty$
---|---|---|---|---|---|---
10% | 1.05 | 1.08 | 1.11 | 1.11 | 1.11 | 1.11
25% | 1.14 | 1.23 | 1.33 | 1.33 | 1.33 | 1.33
50% | 1.33 | 1.60 | 1.98 | 1.99 | 1.99 | 2.00
75% | 1.60 | 2.29 | 3.88 | 3.94 | 3.97 | 4.00
99% | 1.98 | 3.88 | 50.2 | 66.9 | 80.2 | 100
(Diminishing returns: doubling the number of units brings progressively less additional speed-up.)
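For reference, a minimal sketch (plain C, illustrative only) that reproduces the table above from Amdahl’s formula $S(p) = 1/((1-P) + P/p)$:

```c
#include <stdio.h>

/* Amdahl's law: speed-up with parallel fraction P on p units. */
static double amdahl(double P, double p)
{
    return 1.0 / ((1.0 - P) + P / p);
}

int main(void)
{
    const double fractions[] = { 0.10, 0.25, 0.50, 0.75, 0.99 };
    const double units[]     = { 2, 4, 100, 200, 400 };

    for (int i = 0; i < 5; ++i) {
        printf("P = %2.0f%%:", fractions[i] * 100);
        for (int j = 0; j < 5; ++j)
            printf(" %6.2f", amdahl(fractions[i], units[j]));
        /* limit for p -> infinity: 1 / (1 - P) */
        printf("   limit: %6.2f\n", 1.0 / (1.0 - fractions[i]));
    }
    return 0;
}
```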
Assume the total work can be decomposed into multiple parts, where part $i$ takes a fraction $P_i$ of the runtime ($\sum_i P_i = 1$) and can be sped up by a factor $s_i$. Total speed-up: $S = \dfrac{1}{\sum_i P_i / s_i}$
Should you focus where $P_i$ is larger or where $s_i$ is larger?
Example: part 1 ($P_1 = 25\%$) can be sped up by $s_1 = 20$, while part 2 ($P_2 = 75\%$) can only be sped up by $s_2 = 5$.
speed-up from part 1 only | from part 2 only | from both
---|---|---
1.31 | 2.5 | 6.15
Focus on the larger part: better speed-up with less investment.
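Working the numbers of the example through the formula above:

$$S_1 = \frac{1}{0.25/20 + 0.75} \approx 1.31, \qquad S_2 = \frac{1}{0.25 + 0.75/5} = 2.5, \qquad S_{1+2} = \frac{1}{0.25/20 + 0.75/5} \approx 6.15$$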
Amdahl’s law focuses on the maximum speed-up assuming constant work (strong scaling). How about doing more work in the same time (weak scaling)?
Gustafson’s law: if a fraction $P$ of the work can be sped up by a factor of $s$, then the total amount of work that can be done in the same time will be: $W' = \bigl((1 - P) + s\,P\bigr)\,W$
Example: if 80% of the work can be sped up by a factor of 20, then in the same time you can do $(1 - 0.8) + 20 \times 0.8 = 16.2$ times more work.
Beware of non-linear effects! More work done means more data processed, but how much? Depends on the order (complexity).
Example: the work grows as $O(n^2)$, where $n$ is the data size. Then 16× more work does not mean 16× more data processed, but only $\sqrt{16} = 4$ times more data.
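As a small sketch of the general relation (assuming the work scales polynomially with the data size $n$):

$$W \propto n^k \;\Longrightarrow\; \frac{n'}{n} = \left(\frac{W'}{W}\right)^{1/k}; \qquad \text{with } k = 2 \text{ and } W'/W = 16:\quad \frac{n'}{n} = 16^{1/2} = 4.$$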
Kinds of parallelism
task-based parallelism: each unit carries out a different job;
example: “assembly line” model;
data-based parallelism: all units do the same work, on (different) subsets of the data;
example: domain partitioning.
(The distinction is not always clear-cut.)
In a research team, a person develops remote sensing software, another person develops simulation models, and a third one hazard assessment procedures. Task-based parallelism.
In a surveillance system each video camera monitors a different area. Data-based parallelism.
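A minimal sketch of data-based parallelism via domain partitioning (plain C; the partition helper is hypothetical, for illustration only): each unit runs the same code on its own contiguous slice of the data.

```c
#include <stddef.h>

/* Domain partitioning: unit 'rank' out of 'p' units gets a contiguous
 * slice [begin, end) of an n-element domain; slices differ by at most
 * one element. Every unit then runs the same code on its own slice. */
static void partition(size_t n, int p, int rank, size_t *begin, size_t *end)
{
    size_t chunk = n / p;
    size_t rest  = n % p;
    *begin = (size_t)rank * chunk + ((size_t)rank < rest ? (size_t)rank : rest);
    *end   = *begin + chunk + ((size_t)rank < rest ? 1 : 0);
}
```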
Parallel hardware can be classified as:
shared memory: units can read/write the same memory;
example: multiple processors on a single computer;
distributed memory: each unit has its own memory;
example: networked computers, nodes in a cluster;
hybrid (shared + distributed); example: multiprocessor nodes in a cluster.
contention: multiple units trying to access the same resource;
latency: time elapsed between a data request and data availability;
bandwidth: data transfer speed, measured in amount of data transferred per time unit (in modern hardware, typically measured in GB/sec);
(Famous) extreme example
| | HyperTransport | Truckload of µSD |
|---|---|---|
| bandwidth | 50 GB/sec | > 1000 TB/sec |
| latency | 256 ns max | hours |
In scientific, industrial and commercial contexts, huge processing power is wanted for:
huge amounts of data to analyze in real time;
study, validation and application of models (economics, geophysics, mechanics, …);
forecasting (e.g. meteorology), scenarios for a variety of phenomena (natural disaster, infrastructure failure, …)
traditional solution: computers (nodes) connected with high-speed networks; this is the solution around which distributed computing was developed;
recent ‘accidental’ solution: GPUs, originally developed for videogames, turn out to be excellent for general-purpose parallel computing as well;
manycore accelerators take off from the success of GPUs: they rely on the same principles, but lack the graphics specialization (e.g. Xeon Phi).
Video cards: born for videogames, people realize it’s possible to use them for computations. GPGPU (General-Purpose Programming on GPU) is born.
Games evolved with:
Factors:
Software trick example: ‘fast inverse square root’
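For reference, a lightly cleaned-up sketch of the widely circulated trick (made famous by the Quake III source; the memcpy-based type punning here is a portability choice, not the original code):

```c
#include <string.h>   /* memcpy, for well-defined type punning */

/* Fast inverse square root: approximates 1/sqrt(x) without a division or
 * a sqrt call. An integer "magic constant" produces a first guess, which
 * is refined by one Newton-Raphson step; the error is a fraction of a
 * percent, good enough for lighting and vector normalization in games. */
float fast_rsqrt(float x)
{
    float half = 0.5f * x;
    unsigned int i;
    memcpy(&i, &x, sizeof i);       /* reinterpret the float's bits as an int */
    i = 0x5f3759df - (i >> 1);      /* magic constant and halved exponent */
    float y;
    memcpy(&y, &i, sizeof y);       /* back to float: rough initial guess */
    y = y * (1.5f - half * y * y);  /* one Newton-Raphson refinement step */
    return y;
}
```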
CGA (Color Graphics Adapter), 1981:
EGA (Enhanced Graphics Adapter), 1984:
VGA (Video Graphics Array), 1987:
IBM 8514, 1987:
IBM XGA, 1990:
The ’90s: Graphical User Interfaces get widespread. Video cards gain the capability to speed-up 2D graphics.
Videogames start exploring the third dimension. Dedicated video cards for the animation of 3D scenes see the light.
High cost, but gamers like them: cleaner graphics, fluid animations. Huge commercial success.
Access to 3D hardware is standardized through standard APIs: OpenGL (cross-platform), DirectX (Windows only).
Over time, advanced functions for programmable shading are introduced, allowing on-the-fly sophisticated changes to geometries and other forms of scene post-processing.
Programmable 3D video cards appear, and the term GPU (Graphics Processing Unit) is coined.
GPU: very powerful hardware, but specialized for 3D graphics: (projective) transformation of points in space and on the plane, linear algebra (matrices and vectors).
People start thinking: can we use it for (generic) computations? Yes!
Pre-history of GPGPU (2004): mathematical problem ⇒ transform into graphic problem ⇒ 3D scene ⇒ render scene to run computations ⇒ get resulting image ⇒ ‘decode’ it and get solution
Complex, but efficient: particle systems with 1M particles could run at 20fps.
(Figures: flock of birds avoiding fixed and moving obstacles, 2004; particle system with 1M particles, 2004.)
Major hardware vendors help developers by making the GPU computing power more easily accessible.
AMD/ATI: CAL, the Compute Abstraction Layer, provides low-level access to the shaders in their GPUs;
NVIDIA: CUDA, the Compute Unified Device Architecture, provides high- and low-level interfaces to the shaders in their GPUs.
Shared memory programming model (we’re talking about GPU memory, not system memory!).
GPU: several (tens of) multiprocessors (MPs), each equipped with several (~100) PE/MP ⇒ thousands of operations per clock cycle (clocks < 2 GHz).
Throughput: > 10 TFLOPS.
CPU: tens of cores, ~16–32 PE/core ⇒ up to ~1000 operations per clock cycle (clocks < 3 GHz).
Throughput: a few TFLOPS.
PE: Processing Element (a GPU stream processor, or a CPU SIMD lane).
Recent AMD architectures (Zen) have pumped up the number of cores per CPU.
Model | MPs | PE/MP | FLOPS (TFLOPS) | bandwidth (GB/s)
---|---|---|---|---
AMD Radeon VII | 60 | 64 | 11.1 | 1028
NVIDIA TITAN RTX | 72 | 64 | 12.4 | 672
Model | cores | PE/core | FLOPS (GFLOPS) | bandwidth (GB/s)
---|---|---|---|---
AMD EPYC 7702P 2GHz | 64 | 16 | 4096 | 143.1
Intel Xeon W-3275M 2.5GHz | 28 | 32 | 3072 | 131.13
GPU > 20× best multicore CPU
Peak theoretical single-precision FLOPS with FMA (fused multiply-add).
Intel CPUs reduce the frequency when using AVX. Hyperthreading is not considered, as it does not improve performance when fully using the CPU and can actually have a negative impact.
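The peak figures can be checked as units × PE × 2 (an FMA counts as two FLOPs) × clock; e.g. for the EPYC 7702P:

$$64 \text{ cores} \times 16 \text{ PE/core} \times 2 \times 2\,\text{GHz} = 4096\ \text{GFLOPS}.$$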
Model | Price | TDP
---|---|---
AMD Radeon VII | 700€ (16GB HBM) | 300W
NVIDIA TITAN RTX | 2800€ (24GB GDDR6) | 280W
AMD EPYC 7702P 2GHz | 4000–5000€ | 200W
Intel Xeon W-3275M 2.5GHz | >7000€ | 205W
GPU price includes onboard VRAM, as specified.
CPU: sophisticated, generic hardware;
supports larger problems (up to 4 TiB RAM for AMD);
great flexibility, lower performance in specific cases
(best choice for task-based parallelism).
GPU: simpler hardware, specialized for parallel computing;
fixed (smaller) memory size;
great performance in specific cases, less flexibility
(best choice for data-based parallelism).
hybrid hardware: one or more CPU cores, one or more GPU cores, tight integration (HSA: Heterogeneous System Architecture).
Parallelization: the transformation of a serial algorithm (a sequence of steps to be executed one after the other) into a parallel algorithm.
Two extremes:
intrinsically serial problems: the serial nature is structural to the problem;
embarrassingly parallel problems: there is an obvious way to parallelize the work (obvious, but not necessarily the most efficient).
Cannot be parallelized because it’s not possible to distribute work across multiple units. Classic example:
1 oven can cook 1 cake in 40 minutes;
8 ovens cannot cook 1 cake in 5 minutes.
There still may be data-based parallelization opportunities:
8 ovens can cook 8 cakes in 40 minutes.
Embarrassingly parallel problems present clearly separate tasks whose execution may overlap, or total data independence.
Examples:
Class of problems for which the ‘natural’ approach is intrinsically serial, but for which there exists a different approach that can be parallelized.
Classic examples:
reduction: obtain a single datum from a collection of data; examples: maximum, minimum, sum, etc.;
scan (prefix sum): progressive summations, data selection, and similar problems;
sorting: rearrange data so that they respect a given order relation.
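As an illustration (a sketch, not tied to any specific platform): a sum reduction looks inherently serial when written with a single accumulator, but it can be parallelized by letting each unit sum a slice and then combining the partial results, e.g. with OpenMP (one of the standards mentioned below):

```c
#include <stddef.h>

/* Serial reduction: every step depends on the previous accumulator value. */
double sum_serial(const double *x, size_t n)
{
    double acc = 0.0;
    for (size_t i = 0; i < n; ++i)
        acc += x[i];
    return acc;
}

/* Parallel reduction: each thread sums its own slice, then the partial
 * sums are combined. Relies on associativity of the operation (results
 * may differ in the last bits for floating point). Compile with -fopenmp. */
double sum_parallel(const double *x, size_t n)
{
    double acc = 0.0;
    #pragma omp parallel for reduction(+:acc)
    for (long i = 0; i < (long)n; ++i)
        acc += x[i];
    return acc;
}
```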
single instruction, multiple data: hardware instructions that operate on vectors (MMX, AltiVec, SSE, NEON, AVX, etc);
single program, multiple data: a sequence of instructions will be executed by multiple units at the same time; may be supported in hardware or via software;
single instruction, multiple threads: the hardware automatically manages multiple threads that all execute the same instruction (hardware-based multi-processing).
GPUs are typically SPMD and SIMT, (modern) CPUs are SIMD, optionally with software-controlled SPMD.
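A concrete SIMD example (a sketch assuming an x86 CPU with AVX; intrinsics from immintrin.h): one instruction adds 8 single-precision floats at a time.

```c
#include <immintrin.h>
#include <stddef.h>

/* c[i] = a[i] + b[i], 8 floats per instruction using 256-bit AVX registers;
 * the leftover elements are handled one at a time. */
void add_avx(const float *a, const float *b, float *c, size_t n)
{
    size_t i = 0;
    for (; i + 8 <= n; i += 8) {
        __m256 va = _mm256_loadu_ps(a + i);           /* unaligned load */
        __m256 vb = _mm256_loadu_ps(b + i);
        _mm256_storeu_ps(c + i, _mm256_add_ps(va, vb));
    }
    for (; i < n; ++i)                                /* scalar tail */
        c[i] = a[i] + b[i];
}
```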
a single program starts multiple threads; shared memory model;
standard (on CPU): OpenMP.
multiple processes communicate by exchanging messages; distributed memory model;
standard: MPI (Message Passing Interface).
multiple processes, each of which is multi-threaded.
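A minimal hybrid sketch (assuming an MPI implementation and OpenMP support are available; compile with mpicc and -fopenmp): each MPI process spawns OpenMP threads, so memory is shared within a process and messages are exchanged between processes.

```c
#include <mpi.h>
#include <omp.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);                 /* typically one process per node */

    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);   /* this process's id */
    MPI_Comm_size(MPI_COMM_WORLD, &size);   /* total number of processes */

    #pragma omp parallel                    /* multiple threads per process */
    printf("process %d/%d, thread %d/%d\n",
           rank, size, omp_get_thread_num(), omp_get_num_threads());

    MPI_Finalize();
    return 0;
}
```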