NVIDIA Blackwell: The Architecture Behind the AI Infrastructure Boom

NVIDIA Blackwell GPU architecture

NVIDIA announced the Blackwell architecture at GTC 2024 on March 18, 2024. It is now in production and being deployed across hyperscale data centres globally. For anyone building on or buying AI infrastructure, understanding what Blackwell actually delivers — and how it differs from its predecessor — is a practical necessity.

The Core Architecture

According to NVIDIA's official Blackwell architecture documentation, Blackwell GPUs pack 208 billion transistors, manufactured using a custom TSMC 4NP process — an enhancement of the 4N node used for the previous Hopper generation. The architecture uses two reticle-limited dies connected by a 10 TB/s chip-to-chip interconnect (NV-High Bandwidth Interface), operating as a single unified GPU.

NVIDIA CEO Jensen Huang stated in a CNBC interview that the company spent approximately $10 billion in R&D for the NV-HBI die interconnect alone. The dual-die approach was necessary because Hopper had nearly hit TSMC's reticle limit — the maximum die size that lithography machines can physically produce.

// Key Specs (Source: NVIDIA Official Documentation)
  • 208 billion transistors; TSMC 4NP custom process
  • Two reticle-limited dies connected by 10 TB/s chip-to-chip link
  • 20 petaFLOPS FP4 AI compute per GPU
  • NVLink 5.0: 1.8 TB/s per GPU bandwidth; supports up to 576 GPUs in a single domain
  • Second-generation Transformer Engine with FP4 and FP6 precision support
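
Combining the per-GPU FP4 figure above with the 72-GPU GB200 NVL72 rack configuration discussed later gives a quick back-of-envelope sense of aggregate rack-level compute (a rough sanity check on the headline numbers, assuming the 20 petaFLOPS figure applies uniformly to every GPU):

```python
# Back-of-envelope aggregate compute from the published per-GPU figure.
# Assumes the 20 petaFLOPS FP4 number applies uniformly to every GPU.
FP4_PFLOPS_PER_GPU = 20      # per NVIDIA's published spec
GPUS_PER_NVL72_RACK = 72     # GB200 NVL72 configuration

rack_exaflops = FP4_PFLOPS_PER_GPU * GPUS_PER_NVL72_RACK / 1000
print(f"GB200 NVL72 FP4 compute: {rack_exaflops:.2f} exaFLOPS")  # 1.44 exaFLOPS
```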

The Second-Generation Transformer Engine

The Transformer Engine — first introduced with Hopper — dynamically adjusts precision between FP8 and FP16 during training to maximise throughput without sacrificing accuracy. Blackwell's second-generation version adds support for FP4 (MXFP4) and FP6 (MXFP6) precision formats via new micro-tensor scaling.
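
The mechanics of block scaling can be illustrated in a few lines. The sketch below follows the OCP Microscaling (MX) convention that MXFP4 builds on: blocks of 32 values share a single power-of-two scale, and each element is rounded to the small set of magnitudes representable in FP4 (E2M1). This is a numerical illustration of the idea only, not NVIDIA's hardware implementation.

```python
import numpy as np

# Block-scaled ("micro-tensor") fake-quantization in the spirit of MXFP4.
# Each block of 32 values shares one power-of-two scale (the OCP MX block
# size); each element is rounded to the nearest FP4 (E2M1) magnitude.

FP4_GRID = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])  # |E2M1| values
BLOCK = 32

def mx_quantize(x: np.ndarray) -> np.ndarray:
    """Fake-quantize a 1-D float array whose length is a multiple of 32."""
    out = np.empty_like(x, dtype=np.float64)
    for i in range(0, len(x), BLOCK):
        blk = x[i:i + BLOCK]
        amax = np.abs(blk).max()
        # Shared power-of-two scale chosen so the block max fits the grid.
        scale = 2.0 ** np.ceil(np.log2(amax / FP4_GRID[-1])) if amax > 0 else 1.0
        scaled = blk / scale
        # Round each element to the nearest representable FP4 magnitude.
        idx = np.abs(np.abs(scaled)[:, None] - FP4_GRID[None, :]).argmin(axis=1)
        out[i:i + BLOCK] = np.sign(scaled) * FP4_GRID[idx] * scale
    return out
```

Because each block carries its own scale, a block of large activations and a block of small ones each use the full 4-bit range — the property that per-block (rather than per-tensor) scaling buys.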

According to NVIDIA's official Blackwell platform announcement, the second-generation Transformer Engine contributes to the GB200 NVL72 system delivering up to 30x faster real-time LLM inference for trillion-parameter models than an equivalent H100-based system. That figure describes the full rack-scale configuration, not the performance of an individual chip.

NVLink 5.0 and Multi-GPU Scaling

Blackwell's fifth-generation NVLink delivers 1.8 TB/s of bandwidth per GPU — doubled from Hopper's NVLink 4.0 — and supports up to 576 GPUs operating as a single logical domain. The ability to operate hundreds of GPUs as one coherent domain is architecturally significant for training frontier models, where inter-GPU communication bandwidth is often the limiting factor at scale.
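
Why per-GPU link bandwidth matters can be sketched with the standard ring all-reduce cost model: each GPU moves roughly 2·(N−1)/N bytes per byte of gradient, so synchronisation time is bounded by each GPU's own link bandwidth. The bandwidth figures below are the published NVLink numbers; the gradient size (a trillion 16-bit parameters) is an assumed example, and latency and compute overlap are ignored.

```python
# Bandwidth term of a ring all-reduce: each GPU moves ~2*(N-1)/N bytes
# per byte of data, limited by its own link bandwidth. Latency and
# overlap with compute are ignored; this isolates the bandwidth effect.
def ring_allreduce_seconds(data_bytes: float, n_gpus: int, bw_bytes_per_s: float) -> float:
    return 2 * (n_gpus - 1) / n_gpus * data_bytes / bw_bytes_per_s

GRAD_BYTES = 1e12 * 2        # assumed example: 1T parameters at 2 bytes each
N_GPUS = 576                 # maximum NVLink 5.0 domain
BW_NVLINK4 = 0.9e12          # Hopper, NVLink 4.0: 0.9 TB/s per GPU
BW_NVLINK5 = 1.8e12          # Blackwell, NVLink 5.0: 1.8 TB/s per GPU

t_hopper = ring_allreduce_seconds(GRAD_BYTES, N_GPUS, BW_NVLINK4)
t_blackwell = ring_allreduce_seconds(GRAD_BYTES, N_GPUS, BW_NVLINK5)
print(f"Per-step sync: {t_hopper:.2f}s (Hopper) vs {t_blackwell:.2f}s (Blackwell)")
```

Doubling per-GPU bandwidth halves the bandwidth-bound synchronisation time, a saving that compounds over the many optimiser steps of a long training run.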

The GB200 NVL72 — NVIDIA's rack-scale product combining 36 Grace CPUs and 72 Blackwell GPUs — is the primary system configuration being deployed by hyperscalers. By November 2024, Morgan Stanley reported that "the entire 2025 production" of Blackwell silicon was "already sold out," per the timeline documented on Wikipedia.

The Supply and Yield Story

The path to production was not smooth. In October 2024, it was reported that a design flaw affecting yields had been identified and fixed in collaboration with TSMC. Jensen Huang acknowledged the issue publicly, describing the flaw as "functional" and noting it "caused the yields to be low." The rework delayed the production ramp — a reminder that leading-edge semiconductor development carries inherent schedule risk.

Competitive Context

AMD's MI300X remains the most credible alternative in data centre AI acceleration. Google's TPU v5 series is competitive on specific Google-optimised workloads. However, NVIDIA's software ecosystem — particularly CUDA's deep integration across every major AI framework — creates a practical advantage that raw hardware comparisons understate. Companies that have standardised on NVIDIA infrastructure face switching costs that go well beyond chip performance.

The Blackwell Ultra variant, announced in August 2025 per NVIDIA's Technical Blog, extends the architecture with improved attention acceleration and additional AI compute — continuing the cadence of incremental enhancements within the Blackwell generation.