Parallel computing is the simultaneous execution of multiple computations to solve a problem faster or at larger scale than is practical on a single processing element. It contrasts with purely serial execution and is foundational to high-performance computing, large‑scale data analytics, and modern AI workloads, where speed and capacity are gained by coordinating many processing elements across shared‑memory nodes, distributed clusters, or accelerator-rich systems. Authoritative overviews distinguish the approach from concurrent but non‑parallel execution and describe the principal categories of hardware and software support for parallelism. Parallel processing | Britannica;
Computer science: Parallel and distributed computing | Britannica.
Historical development
The modern taxonomy of parallel architectures traces to Michael J. Flynn’s 1966 classification (SISD, SIMD, MISD, MIMD), later extended in 1972, which has remained a standard way to describe instruction and data stream parallelism. Flynn's taxonomy. Early vector and SIMD machines (e.g., Cray and Thinking Machines systems) explored data-parallel execution, while multiprocessing MIMD designs became dominant for general-purpose supercomputers.
Supercomputer | Britannica.
Two classic scaling models formalized limits and opportunities: Amdahl's law, which bounds speedup of a fixed-size workload by its serial fraction (1967), and Gustafson's law, which argues that practical workloads often scale in size with available resources, enabling linear or near-linear speedups (1988). See Gene Amdahl’s original 1967 paper and John Gustafson’s 1988 article. CERN Document Server – Amdahl (1967);
Communications of the ACM – “Reevaluating Amdahl’s Law”.
Architectural classification
- SIMD and SIMT: single-instruction, multiple-data designs apply one instruction to many data elements; modern graphics processors implement SIMT (single instruction, multiple threads), a closely related execution style (see the sketch after this list).
Flynn's taxonomy;
CUDA C++ Programming Guide.
- MIMD: Multiple instruction streams on multiple data streams cover multicore CPUs, shared-memory multiprocessors, and distributed clusters.
Parallel processing | Britannica.
- Models for algorithm design include PRAM (idealized shared memory) and BSP (bulk-synchronous parallel), which abstract communication and synchronization costs.
Parallel RAM;
A bridging model for parallel computation | Communications of the ACM;
CACM overview of BSP.
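To make the SIMD half of this classification concrete, the sketch below (an illustration assumed for this article, not drawn from the cited sources; it requires a C++ compiler with OpenMP support) marks a loop with the OpenMP `simd` directive so that one multiply-add operation is applied to many array elements at a time, the data-parallel pattern that SIMD and SIMT hardware accelerates.

```cpp
// simd_axpy.cpp: data-parallel (SIMD-style) update y[i] += a * x[i].
// Build with an OpenMP-capable compiler, e.g.: g++ -fopenmp -O2 simd_axpy.cpp
#include <cstdio>
#include <vector>

void axpy(float a, const std::vector<float>& x, std::vector<float>& y) {
    // One logical operation (multiply-add) applied across many data elements;
    // the compiler is free to emit vector (SIMD) instructions for this loop.
    #pragma omp simd
    for (std::size_t i = 0; i < x.size(); ++i) {
        y[i] += a * x[i];
    }
}

int main() {
    std::vector<float> x(1000, 1.0f), y(1000, 2.0f);
    axpy(3.0f, x, y);
    std::printf("y[0] = %f\n", y[0]);  // expected 5.0
    return 0;
}
```

A MIMD version of the same computation would instead run independent instruction streams (threads or processes), each working on its own portion of the data.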
Memory and communication models
Parallel machines are commonly organized as:
- Shared memory: multiple processors/cores access a single address space, requiring synchronization to avoid races and typically using cache coherence; programming is comparatively straightforward, but scalability is limited by contention and coherence traffic (a minimal illustration follows this list).
HPC @ LLNL tutorial;
Shared memory (overview).
- Distributed memory: each node has private memory; data is exchanged by explicit message passing over an interconnect. This scales to very large systems but requires explicit communication design.
HPC @ LLNL tutorial;
Shared vs Distributed Memory (Supercomputing Wales). Hybrid systems (distributed clusters of shared‑memory nodes) are standard in contemporary supercomputers.
Computer science: Parallel and distributed computing | Britannica.
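As a minimal illustration of the shared-memory model above (see the note in that bullet), the sketch below assumes a C++ compiler with OpenMP support and is not taken from the cited tutorials: several threads accumulate into one variable in a single address space, and the `reduction` clause supplies the synchronization needed to avoid the data race the text warns about.

```cpp
// shared_sum.cpp: shared-memory parallel sum with OpenMP.
// Build with, e.g.: g++ -fopenmp -O2 shared_sum.cpp
#include <omp.h>
#include <cstdio>

int main() {
    const int n = 1000000;
    double sum = 0.0;  // lives in the single shared address space

    // reduction(+ : sum) gives each thread a private partial sum and combines
    // them safely at the end; unsynchronized writes to `sum` from all threads
    // would otherwise be a data race.
    #pragma omp parallel for reduction(+ : sum)
    for (int i = 0; i < n; ++i) {
        sum += 1.0 / (i + 1.0);
    }

    std::printf("harmonic sum using up to %d threads: %f\n",
                omp_get_max_threads(), sum);
    return 0;
}
```

On a distributed-memory system the same reduction would instead be expressed with explicit messages, as in the MPI sketch later in this article.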
Programming models and languages
- Message Passing Interface (MPI) is the de facto standard for message passing on distributed-memory systems (a minimal sketch appears after this list); the MPI Forum released MPI 5.0 on June 5, 2025, extending the standard’s capabilities for large-scale, heterogeneous systems.
MPI Forum.
- OpenMP provides compiler directives and runtime APIs for shared-memory parallelism, including threads, tasks, SIMD, and device offload; OpenMP 6.0 reference guides were published in November 2024.
OpenMP Reference Guides;
OpenMP 4.0 announcement (accelerators/SIMD).
- Accelerator programming includes CUDA for NVIDIA GPUs and portable heterogeneous frameworks such as SYCL (a Khronos standard) that target CPUs, GPUs, and FPGAs via a single‑source C++ model.
CUDA C++ Programming Guide;
Khronos SYCL 2020 specification;
SYCL overview (Khronos);
Intel SYCL 101.
- Data‑parallel frameworks for large clusters include MapReduce, which automates partitioning, scheduling, failure handling, and inter‑machine communication for many key–value computations.
Google Research: MapReduce (OSDI 2004);
USENIX OSDI 2004 paper entry.
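The forward reference in the MPI bullet points here: a minimal message-passing sketch in C++ (it assumes an installed MPI implementation such as MPICH or Open MPI and is illustrative only, not an excerpt from the standard). Each process holds a private value in its own memory, and an explicit collective operation combines the values across the machine.

```cpp
// mpi_sum.cpp: distributed-memory reduction with MPI.
// Build and run with, e.g.: mpic++ mpi_sum.cpp -o mpi_sum && mpirun -np 4 ./mpi_sum
#include <mpi.h>
#include <cstdio>

int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);

    int rank = 0, size = 1;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);  // this process's identifier
    MPI_Comm_size(MPI_COMM_WORLD, &size);  // total number of processes

    // Each process computes on private data; combining results requires
    // explicit communication, here a collective reduction onto rank 0.
    double local = rank + 1.0;  // stand-in for a locally computed partial result
    double total = 0.0;
    MPI_Reduce(&local, &total, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);

    if (rank == 0) {
        std::printf("sum over %d ranks: %f\n", size, total);
    }

    MPI_Finalize();
    return 0;
}
```

The pattern scales from a workstation to a cluster because the program never assumes a shared address space; only the explicit messages (here, the reduction) move data between nodes.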
Performance concepts and limits
Speedup S(p) and efficiency E(p) are common metrics for p processors, with S(p)=T(1)/T(p) and E(p)=S(p)/p. Amdahl’s law shows that the serial fraction of work limits speedup for fixed problems, while Gustafson’s law highlights scaling of problem size at fixed time. Practical performance is further bounded by communication, synchronization, load imbalance, and memory hierarchy effects; models such as BSP incorporate cost parameters for computation and communication to guide algorithm design. CERN Document Server – Amdahl (1967);
Communications of the ACM – Gustafson (1988);
A bridging model for parallel computation | CACM. For accessible introductions to speedup/efficiency and practical considerations, see the LLNL tutorial.
HPC @ LLNL tutorial.
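For reference, the quantities in this paragraph can be written out explicitly. With f denoting the serial fraction (measured against the serial run for Amdahl's law and against the scaled parallel run for Gustafson's law), the standard forms are:

```latex
% Speedup and efficiency on p processors
S(p) = \frac{T(1)}{T(p)}, \qquad E(p) = \frac{S(p)}{p}

% Amdahl's law (fixed problem size), f = serial fraction of the work
S(p) \le \frac{1}{f + \frac{1 - f}{p}}

% Gustafson's law (problem size scaled with p), f = serial fraction of the
% time observed on the p-processor run
S_{\text{scaled}}(p) = f + (1 - f)\,p = p - f\,(p - 1)
```

For example, with f = 0.05 and p = 64, Amdahl's bound gives a speedup of at most about 15.4, while Gustafson's scaled speedup is 60.85, which illustrates why scaling the problem size matters in practice.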
Systems, scales, and benchmarks
The TOP500 project ranks supercomputers by HPL (dense linear algebra) performance and also tracks HPCG and mixed‑precision HPL‑MxP results. As of the June 2025 list (65th edition), three U.S. Department of Energy systems exceeded one exaflop on HPL: El Capitan (1.742 EFlop/s, No. 1), Frontier (1.353 EFlop/s, No. 2), and Aurora (1.012 EFlop/s, No. 3). The list details architectures combining many‑core CPUs, advanced GPUs, high‑speed interconnects, and large memory hierarchies. TOP500 – June 2025 list;
TOP500 – June 2025 highlights.
Applications
Parallel computing enables large‑scale numerical simulation (climate, CFD, materials), data analysis and search, cryptography, graphics and visualization, and training/inference for deep neural networks, where GPUs and other accelerators expose massive data parallelism. Introductory and industry resources summarize the role of parallelism across scientific and commercial domains. Computer science: Parallel and distributed computing | Britannica;
IBM – What is parallel computing?.
Selected theoretical and instructional models
Idealized models help design and analyze parallel algorithms. PRAM abstracts a shared‑memory machine with concurrent processors (with EREW/CREW/CRCW variants for read/write conflicts), while BSP structures computation into supersteps separated by global synchronization, exposing communication and latency parameters that influence cost. These inform practical designs implemented with MPI, OpenMP, and accelerator frameworks. Parallel RAM;
A bridging model for parallel computation | CACM;
HPC @ LLNL tutorial.
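As one concrete instance of how BSP exposes these parameters, the commonly quoted cost of a single superstep (notation varies slightly across presentations of Valiant's model) combines local computation, communication volume, and the closing barrier:

```latex
% Cost of one BSP superstep:
%   w : maximum local computation performed by any processor
%   h : maximum number of messages sent or received by any processor
%   g : communication throughput parameter (cost per unit message)
%   l : cost of the global barrier synchronization
T_{\text{superstep}} = w + h \cdot g + l
```

Summing this cost over all supersteps gives the predicted running time of a BSP algorithm, which is what makes the model useful for comparing algorithmic choices before implementation.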
Further reading
- Hennessy and Patterson’s textbook surveys architectural support for instruction-level parallelism (ILP) and multicore/manycore designs relevant to parallel systems. (book://John L. Hennessy & David A. Patterson|Computer Architecture: A Quantitative Approach|Morgan Kaufmann|2017)
- Grama et al. provide algorithms, models, and performance analysis methods widely used in teaching and practice. (book://Ananth Grama, Anshul Gupta, George Karypis, Vipin Kumar|Introduction to Parallel Computing (2nd ed.)|Addison‑Wesley|2003)
- Valiant’s BSP papers formalize bridging models for scalable computation.
A bridging model for parallel computation | CACM;
A bridging model for parallel computation, communication, and I/O | ACM Computing Surveys.
