Medium Pulse: News And Articles To Read

Devices & Accelerators for ML/DL and New Compute

Machine learning (ML) and deep learning (DL) workloads have driven a seismic shift in hardware design: from general-purpose CPUs to a rich ecosystem of specialized devices and accelerators optimized for throughput, memory bandwidth, and energy per operation. This article surveys device technologies (CMOS and beyond), accelerator microarchitectures and dataflows, memory and interconnect innovations, packaging and system integration, software/compilers, evaluation metrics, deployment tradeoffs, and the near- to long-term roadmap for new compute (analog/neuromorphic/photonic/quantum). It’s written for designers, architects, product managers, and researchers who must select or build hardware for ML workloads.

1. Workload characteristics that drive hardware design

Modern ML/DL workloads share properties that define hardware requirements:

  • Massive parallelism: MAC-heavy kernels (convolutions, MLPs, attention) that can be parallelized across many processing elements.

  • Data-movement dominance: Energy and latency are often dominated by moving weights and activations between memory and compute.

  • Diverse precision needs: From FP32/FP16 to mixed-precision (BF16, FP8, INT8, INT4) and sparsity-exploiting formats.

  • Irregularity: Sparse models, dynamic activation patterns, and conditional execution complicate pipeline utilization.

  • Memory capacity & bandwidth pressure: Large models (LLMs) require massive on-device memory or sophisticated offloading.

  • Latency vs throughput tradeoffs: Inference often needs low latency; training values throughput and reproducibility.

Design decisions must optimize performance per watt, programmability, and cost for the target workload (cloud training, edge inference, embedded real-time).
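
To make the data-movement point concrete, a quick roofline-style check compares a kernel's arithmetic intensity against a device's compute-to-bandwidth ratio. The peak throughput and bandwidth figures in this sketch are illustrative assumptions, not the specs of any particular part.

```python
# Roofline-style check: is a GEMM compute-bound or bandwidth-bound?
# Peak compute and bandwidth below are illustrative assumptions only.

def gemm_arithmetic_intensity(M, N, K, bytes_per_elem=2):
    """FLOPs per byte of off-chip traffic for C[MxN] = A[MxK] @ B[KxN],
    assuming each operand is moved on/off chip exactly once (ideal reuse)."""
    flops = 2 * M * N * K                                 # one multiply + one add per MAC
    traffic = bytes_per_elem * (M * K + K * N + M * N)    # read A, read B, write C
    return flops / traffic

peak_flops = 100e12   # assumed peak low-precision throughput, FLOP/s (100 TFLOP/s)
peak_bw = 2e12        # assumed off-chip memory bandwidth, bytes/s
machine_balance = peak_flops / peak_bw   # FLOPs the device can do per byte moved

for shape in [(4096, 4096, 4096), (1, 4096, 4096)]:   # large GEMM vs. GEMV-like layer
    ai = gemm_arithmetic_intensity(*shape)
    bound = "compute-bound" if ai > machine_balance else "bandwidth-bound"
    print(f"M,N,K={shape}: intensity {ai:.1f} FLOP/B -> {bound}")
```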

2. Device building blocks

2.1 Advanced CMOS at the heart

Most accelerators today are built on advanced CMOS logic (FinFET, gate-all-around) to maximize compute density and power efficiency. Device-level choices affect:

  • transistor drive current, leakage, and operating voltage (hence energy efficiency),

  • SRAM and standard-cell density (PE array packing),

  • IO speed and analog/mixed-signal peripheral performance.

2.2 Emerging device technologies (enabling new compute)

  • RRAM/PCM/FeFET: nonvolatile memories enabling high-density analog storage and in-memory compute (crossbar MACs).

  • MRAM / STT-MRAM: fast, high-endurance nonvolatile memory for weight storage or caches.

  • Photonic devices: modulators, detectors, and integrated waveguides for ultra-high bandwidth links and analog photonic MACs.

  • Spintronics / skyrmionics: potential for nonvolatile logic/in-memory primitives.

  • Quantum devices: specialized qubit hardware for quantum accelerators; not applicable to general DL yet.

These emerging devices introduce new tradeoffs: analog noise, device-to-device variability, limited endurance, cryogenic requirements, and integration difficulty.

3. Accelerator microarchitectures & dataflows

3.1 Systolic arrays & matrix engines

  • Classic approach (e.g., many matrix engines and tensor cores): 2D arrays of processing elements (PEs) performing MACs with local neighbor-to-neighbor communication. Excellent for dense GEMM and convolutions; a minimal functional sketch follows this list.

  • Pros: regular control, high utilization on regular dense kernels.

  • Cons: less efficient for sparsity and irregular workloads.
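
The sketch below is a functional model only, with array size and data chosen arbitrarily: weights stay pinned to their PEs while activation rows stream through and partial sums accumulate. It mirrors the weight-stationary dataflow order but does not model pipelining or skewing.

```python
import numpy as np

def weight_stationary_matmul(A, W):
    """Functional model of a weight-stationary PE array computing A @ W.

    Each PE (k, n) pins one weight W[k, n]; activation rows stream past it
    and partial sums for output column n accumulate across the array.
    """
    M, K = A.shape
    K2, N = W.shape
    assert K == K2
    C = np.zeros((M, N), dtype=np.float32)
    for m in range(M):                 # stream one activation row at a time
        for k in range(K):             # activation element visits PE row k
            for n in range(N):         # each PE in that row owns W[k, n]
                C[m, n] += A[m, k] * W[k, n]   # local MAC, partial sum flows on
    return C

rng = np.random.default_rng(0)
A = rng.standard_normal((8, 16)).astype(np.float32)
W = rng.standard_normal((16, 4)).astype(np.float32)
assert np.allclose(weight_stationary_matmul(A, W), A @ W, atol=1e-4)
```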

3.2 SIMD/Vector + Tensor cores

  • Tensor units embedded in GPUs (tensor cores) accelerate mixed-precision matrix operations, and they remain flexible and programmable thanks to established toolchains.

3.3 Dataflow & spatial architectures

  • Architectures that schedule compute to minimize data traffic (weight-stationary, output-stationary, row-stationary dataflows). Examples: the Google TPU (weight-stationary systolic array) and specialized ASICs that move compute to the data. Energy savings are large when the mapping matches the workload; the toy traffic count below illustrates why the mapping matters.
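
The toy model counts how many tile fetches a tiled GEMM incurs under different loop orders, assuming only one tile of each operand can be buffered on chip. The tile counts are arbitrary assumptions; only the relative traffic between loop orders matters.

```python
from itertools import product

def tile_traffic(order, Mt, Nt, Kt):
    """Count tile fetches for C[i,j] += A[i,k] @ B[k,j] with a single-tile
    buffer per operand. `order` is a permutation of 'ijk' giving the loop
    nest from outermost to innermost."""
    ranges = {'i': range(Mt), 'j': range(Nt), 'k': range(Kt)}
    buffered = {'A': None, 'B': None, 'C': None}
    loads = {'A': 0, 'B': 0, 'C': 0}
    for idx in product(*(ranges[v] for v in order)):
        point = dict(zip(order, idx))
        tiles = {'A': (point['i'], point['k']),
                 'B': (point['k'], point['j']),
                 'C': (point['i'], point['j'])}
        for op, tile in tiles.items():
            if buffered[op] != tile:      # buffer miss: fetch the new tile
                loads[op] += 1
                buffered[op] = tile
    return loads

Mt, Nt, Kt = 8, 8, 8                      # number of tiles per dimension (assumed)
for order in ('ijk', 'kij', 'ikj'):       # output-stationary vs. weight-reuse orders
    print(order, tile_traffic(order, Mt, Nt, Kt))
```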

3.4 Near-memory and in-memory compute

  • Near-memory: logic tightly coupled to DRAM/HBM to reduce off-chip traffic (HBM-PIM, logic-in-package).

  • In-memory (analog/digital): compute inside memory arrays (RRAM crossbars for analog MACs; DRAM/SRAM bitwise logic). Offers extreme energy reduction by eliminating data movement; accuracy, ADC/DAC cost, and device nonidealities are the main challenges.
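
To get a feel for those nonidealities, the sketch below models a crossbar matrix-vector product with quantized conductances, Gaussian read noise, and a finite-resolution ADC, then measures the error against the exact result. All bit widths and noise levels are assumed for illustration.

```python
import numpy as np

def crossbar_matvec(W, x, g_bits=4, adc_bits=8, noise_sigma=0.02, seed=0):
    """Toy model of an analog crossbar computing W @ x.

    Weights are quantized to g_bits conductance levels, each column current
    is perturbed by Gaussian read noise, and the result is digitized by an
    adc_bits ADC. All parameters are illustrative assumptions.
    """
    rng = np.random.default_rng(seed)
    w_max = np.max(np.abs(W))
    levels = 2 ** g_bits - 1
    Wq = np.round(W / w_max * levels) / levels * w_max        # conductance quantization
    y = Wq @ x
    y += noise_sigma * np.abs(y).max() * rng.standard_normal(y.shape)  # read noise
    y_max = np.abs(y).max() + 1e-12
    adc_levels = 2 ** adc_bits - 1
    return np.round(y / y_max * adc_levels) / adc_levels * y_max       # ADC quantization

rng = np.random.default_rng(1)
W = rng.standard_normal((64, 64)) * 0.1
x = rng.standard_normal(64)
exact = W @ x
approx = crossbar_matvec(W, x)
rel_err = np.linalg.norm(approx - exact) / np.linalg.norm(exact)
print(f"relative error of analog MAC model: {rel_err:.3%}")
```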

3.5 Sparse & conditional compute accelerators

  • Hardware that exploits weight/activation sparsity (zero skipping, compressed formats) to skip computation and memory movement. Effective speedups require support in dataflow and memory controllers.
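
A small sketch of zero skipping, assuming a 90%-pruned weight matrix stored in CSR form: only nonzero weights trigger MACs, and the script reports how much of the dense work is actually issued.

```python
import numpy as np

def to_csr(W):
    """Compress a weight matrix to CSR (values, column indices, row pointers)."""
    vals, cols, rowptr = [], [], [0]
    for row in W:
        nz = np.nonzero(row)[0]
        vals.extend(row[nz])
        cols.extend(nz)
        rowptr.append(len(vals))
    return np.array(vals), np.array(cols), np.array(rowptr)

def csr_matvec(vals, cols, rowptr, x):
    """Zero-skipping matvec: only nonzero weights trigger a MAC."""
    y = np.zeros(len(rowptr) - 1)
    macs = 0
    for r in range(len(y)):
        for idx in range(rowptr[r], rowptr[r + 1]):
            y[r] += vals[idx] * x[cols[idx]]
            macs += 1
    return y, macs

rng = np.random.default_rng(0)
W = rng.standard_normal((128, 128))
W[rng.random(W.shape) < 0.9] = 0.0          # assume 90% of weights pruned
x = rng.standard_normal(128)

vals, cols, rowptr = to_csr(W)
y, macs = csr_matvec(vals, cols, rowptr, x)
assert np.allclose(y, W @ x)
print(f"MACs issued: {macs} vs dense {W.size} ({macs / W.size:.0%} of dense work)")
```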

3.6 Reconfigurable hardware: FPGAs & CGRAs

  • FPGAs offer flexibility for evolving models; CGRAs provide an energy-efficient middle ground for domain-specific mapping. Useful for prototyping, edge customization, and latency-critical inference.

3.7 Neuromorphic & event-driven accelerators

  • Spiking neural network accelerators and asynchronous event-driven chips target ultra-low-power inference for sparse, event-based sensors.

4. Memory hierarchies & storage

4.1 On-chip SRAM, caches, and scratchpads

  • SRAM scratchpads co-designed with compiler/static scheduling give deterministic access and energy efficiency for mapped layers.

4.2 HBM and wide I/O DRAM

  • HBM stacked memory dramatically increases bandwidth (and power density); critical for training and large model inference.

4.3 NVM and storage-class memory

  • NVM (PCM, RRAM) used for persistent weight storage, near-memory compute, or tiered memory. Endurance and write energy are constraints.

4.4 Memory compression & tiling

  • Quantization, pruning, activation recomputation, and tiling reduce on-chip storage needs; hardware-friendly compression schemes matter.
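
As a minimal example of a hardware-friendly compression scheme, the sketch below applies symmetric per-tensor INT8 quantization and reports the storage saving and induced error; per-channel scales and calibration are omitted for brevity.

```python
import numpy as np

def quantize_int8(x):
    """Symmetric per-tensor quantization: x ~= scale * q, with q in [-127, 127]."""
    scale = np.max(np.abs(x)) / 127.0
    q = np.clip(np.round(x / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

rng = np.random.default_rng(0)
w = rng.standard_normal((1024, 1024)).astype(np.float32)

q, scale = quantize_int8(w)
w_hat = dequantize(q, scale)
print(f"bytes: fp32 {w.nbytes} -> int8 {q.nbytes} (4x smaller)")
print(f"max abs error: {np.max(np.abs(w - w_hat)):.4f} (~scale/2 = {scale/2:.4f})")
```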

5. Interconnect, packaging & system integration

5.1 Chiplets & heterogeneous packaging

  • Disaggregating compute, memory, and IO into chiplets (chiplet fabrics, 2.5D interposers, 3D stacking) allows mixing process nodes and scaling memory bandwidth without monolithic die cost.

5.2 High-speed interconnects

  • NVLink, CXL, PCIe generations, and silicon photonics provide rack-scale fabrics. Interconnect topologies (fat-tree, torus, all-reduce-optimized) shape distributed training efficiency.
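
A back-of-the-envelope model shows how link bandwidth shapes distributed training: a ring all-reduce moves roughly 2(p−1)/p of the tensor per device. The gradient size and the bandwidth tiers below are assumptions for illustration, not vendor figures.

```python
def ring_allreduce_time(tensor_bytes, num_devices, link_bw_bytes_per_s, latency_s=5e-6):
    """Estimate ring all-reduce time: each device sends/receives about
    2*(p-1)/p of the tensor, across 2*(p-1) latency-bound steps."""
    p = num_devices
    volume = 2 * (p - 1) / p * tensor_bytes
    return volume / link_bw_bytes_per_s + 2 * (p - 1) * latency_s

grad_bytes = 7e9 * 2        # assumed: 7B-parameter model, 2-byte (BF16) gradients
for bw, name in [(50e9, "slow link"), (450e9, "mid link"), (900e9, "fast link")]:
    t = ring_allreduce_time(grad_bytes, num_devices=8, link_bw_bytes_per_s=bw)
    print(f"{name:>9}: ~{t*1e3:.1f} ms per all-reduce of {grad_bytes/1e9:.0f} GB")
```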

5.3 DPUs/SmartNICs & offload fabrics

  • Offloading communication, security, and data orchestration to DPUs frees CPU cycles and improves cluster utilization for model parallel training.

6. Analog vs Digital tradeoffs

6.1 Analog compute (crossbars, photonics)

  • Pros: potentially orders-of-magnitude energy efficiency and parallelism for MACs.

  • Cons: limited precision, drift and variability, costly periphery (high-speed ADCs/DACs), and difficulty supporting training (in-situ weight updates).

6.2 Digital compute

  • Pros: deterministic precision, mature toolchains, easier training support.

  • Cons: higher energy for data movement.

Mixed approaches (analog compute for dense MACs + digital correction) are promising practical compromises.

7. Software stack and programmability

7.1 Compiler and runtime

  • ML compilers (XLA, TVM, MLIR) map models to accelerator primitives and orchestrate data movement. Hardware-friendly transformations (operator fusion, layout transforms, quantization-aware mapping) are critical.
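
Operator fusion is easiest to see as traffic accounting: folding the bias-add and ReLU into the preceding matmul avoids writing and re-reading intermediate tensors. The NumPy sketch below only models that accounting; real compilers such as XLA and TVM perform the transformation on their IRs.

```python
import numpy as np

def unfused(x, w, b):
    """Three separate ops: each intermediate tensor is materialized in memory."""
    t1 = x @ w                 # intermediate 1 written, then re-read
    t2 = t1 + b                # intermediate 2 written, then re-read
    return np.maximum(t2, 0.0), t1.nbytes + t2.nbytes

def fused(x, w, b):
    """Conceptually fused schedule: bias-add and ReLU applied while the GEMM
    output tile is still on chip, so no intermediates round-trip to DRAM.
    (NumPy still builds temporaries; this function only models the accounting.)"""
    return np.maximum(x @ w + b, 0.0), 0

rng = np.random.default_rng(0)
x = rng.standard_normal((512, 1024)).astype(np.float32)
w = rng.standard_normal((1024, 1024)).astype(np.float32)
b = rng.standard_normal(1024).astype(np.float32)

y1, extra1 = unfused(x, w, b)
y2, extra2 = fused(x, w, b)
assert np.allclose(y1, y2)
print(f"extra intermediate traffic avoided by fusion: {extra1/1e6:.1f} MB")
```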

7.2 Operator libraries and autotuners

  • Optimized kernel libraries (cuBLAS, cuDNN, oneDNN) and autotuners select the best dataflow and memory tile sizes for each hardware target.

7.3 Model-aware hardware features

  • HW features such as sparsity flags, compressed tensor formats, and mixed-precision atomic operations require compiler support.

7.4 Debugging, profiling, and validation

  • Performance counters, RTL visibility, and model-in-the-loop testing ensure correctness and performance — particularly for approximate/analog devices.

8. Metrics and benchmarking

Key metrics to evaluate accelerators:

  • Throughput (TOPS, TFLOPS) per device and per watt.

  • Energy per inference/train step (Joule/sample).

  • Latency (tail percentiles) for interactive inference.

  • Utilization / efficiency (how well HW is used on real workloads).

  • Model accuracy vs quantization/error tradeoffs.

  • Cost per training/inference (TCO) — includes HW amortization and energy cost.

Benchmarks: MLPerf (training and inference) and domain-specific workloads (NLP, vision, recommender systems). Use realistic end-to-end pipelines (tokenization, beam search), not just raw kernel throughput.
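
The sketch below turns raw measurements into the headline metrics above (sustained TOPS, TOPS/W, joules per sample, tail latency). Every input number here is hypothetical; in practice they come from performance counters and power telemetry.

```python
import numpy as np

def accelerator_metrics(ops_per_sample, samples, wall_time_s, avg_power_w, latencies_ms):
    """Derive throughput, efficiency, and latency metrics from raw measurements."""
    ops_per_s = ops_per_sample * samples / wall_time_s
    return {
        "sustained TOPS": ops_per_s / 1e12,
        "TOPS/W": ops_per_s / avg_power_w / 1e12,
        "J/sample": avg_power_w * wall_time_s / samples,
        "p50 latency (ms)": float(np.percentile(latencies_ms, 50)),
        "p99 latency (ms)": float(np.percentile(latencies_ms, 99)),
    }

# Hypothetical run: 1e12 ops per inference, 10k samples in 25 s at 300 W average power.
rng = np.random.default_rng(0)
latencies = rng.gamma(shape=8.0, scale=0.3, size=10_000)   # synthetic latency samples
for k, v in accelerator_metrics(1e12, 10_000, 25.0, 300.0, latencies).items():
    print(f"{k:>18}: {v:.3f}")
```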

9. Design challenges & practical tradeoffs

9.1 Energy & thermal limits

Packing high compute density stresses cooling and the power delivery network (PDN). Liquid cooling, backside power delivery (BSPDN), and thermal-aware floorplanning are increasingly necessary.

9.2 Variability & robustness (analog/NVM)

Analog crossbars show drift and device-to-device variation; algorithms must be robust (noise-aware training, calibration).

9.3 Programmability vs efficiency

More specialized hardware yields energy gains but reduces flexibility. Reconfigurable fabrics and software layers can mitigate obsolescence risk.

9.4 Scaling for large models

Scaling LLMs requires model parallelism, memory offloading, and communication-efficient algorithms (tensor/pipeline parallelism, ZeRO, activation recomputation). Hardware must support efficient collective ops and large memory capacities.
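
A simple memory budget makes the capacity pressure concrete: with mixed-precision Adam, each parameter typically carries about 16 bytes of weights, gradients, and optimizer state, and ZeRO-style sharding divides parts of that across devices. The model size and device count below are assumptions; activations are not counted.

```python
def training_memory_gb(n_params, n_devices=1, zero_stage=0,
                       bytes_weights=2, bytes_grads=2, bytes_optim=12):
    """Rough per-device memory for mixed-precision Adam training.

    bytes_optim = 12 assumes fp32 master weights plus two Adam moments.
    ZeRO stage 1 shards optimizer state, stage 2 also shards gradients,
    stage 3 also shards the weights themselves.
    """
    w = bytes_weights / (n_devices if zero_stage >= 3 else 1)
    g = bytes_grads   / (n_devices if zero_stage >= 2 else 1)
    o = bytes_optim   / (n_devices if zero_stage >= 1 else 1)
    return n_params * (w + g + o) / 1e9

n_params = 13e9                      # assumed 13B-parameter model
for stage in (0, 1, 2, 3):
    gb = training_memory_gb(n_params, n_devices=8, zero_stage=stage)
    print(f"ZeRO stage {stage}: ~{gb:.0f} GB per device (8 devices)")
```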

9.5 Security & privacy

Accelerators must support encrypted memory, secure enclaves, and hardware roots-of-trust for sensitive model IP and private data.

10. Use-case focused device choices

  • Cloud training (max throughput): GPU clusters with HBM and high-bandwidth interconnects, or TPU-like systolic arrays; chiplet integration and liquid cooling.

  • On-prem inference (throughput & cost): Custom ASICs with INT8/INT4 support and high memory bandwidth; DPUs for orchestration.

  • Edge inference (low power, latency): NPU/ASIC with SRAM scratchpads, DVFS, subthreshold cores, or tiny analog accelerators.

  • Mobile on-device models: DNN accelerators integrated in SoC (NPUs) and optimized compilers (quantization, pruning).

  • Specialized analytics (graph search, similarity): PIM architectures or content-addressable memory replacements.

11. Roadmap & near-term trends (next 3–7 years)

  • Wider adoption of mixed-precision and adaptive precision flows (FP8, BF16+INT8 combos).

  • Chiplet ecosystems enabling faster iteration and cheaper die mixes.

  • HBM adoption beyond GPUs (ASICs, NPUs) and HBM-PIM experimentation.

  • More in-memory compute products for inference accelerators on edge and in-datacenter niche workloads.

  • Photonic interconnects reaching maturity for rack-scale fabrics and ultra-low latency links.

  • Analog + digital hybrid accelerators in commercial products (with quantized, noise-aware algorithms).

12. Long-term directions (7+ years)

  • Analog training accelerators with on-chip weight updates and improved device reliability.

  • Neuromorphic and event-driven systems for ultra-low-power, real-time perception.

  • Integrated photonic compute for linear algebra primitives at scale.

  • Quantum accelerators for specialized ML kernels (sampling, optimization) if/when fault-tolerant hardware scales.

  • AI-native device-level co-design where materials, devices, and models are jointly optimized.

13. Practical recommendations for teams

  1. Profile target workloads first. Pick hardware that matches the dominant computational pattern (dense GEMM, sparse workloads, low-latency).

  2. Prioritize memory bandwidth and locality. For many models, extra compute is wasted if memory is the bottleneck.

  3. Leverage software stack maturity. GPUs/TPUs have mature ecosystems; for custom ASICs, budget for compiler and kernel engineering.

  4. Plan packaging & cooling early. Compute density decisions drive mechanical and PDN needs.

  5. Design for quantization & sparsity. Include hardware hooks for compressed tensors and irregular execution.

  6. Prototype on reconfigurable platforms. Use FPGAs/CGRAs to validate models and mappings before committing to ASIC NRE.

  7. Build verification & monitoring. Counters, telemetry, and model-in-the-loop tests are essential for analog/approx hardware.

The hardware landscape for ML/DL is now a broad, multi-dimensional design space where devices, microarchitectures, memory systems, packaging, and software co-design determine success. No single solution fits all needs: cloud training favors throughput-oriented HBM+GPU/TPU clusters, on-device inference favors energy-efficient NPUs or analog primitives, and new compute paradigms (CIM, photonics, neuromorphic, quantum) will gradually carve niche advantages or grow into mainstream components as device maturity, toolchains, and ecosystems evolve. Practical innovation happens at the seams — where algorithmic needs meet device capabilities and system integration.

VLSI Expert India: Dr. Pallavi Agrawal, Ph.D., M.Tech, B.Tech (MANIT Bhopal) – Electronics and Telecommunications Engineering