Today, we are unveiling the next Fairwater site of Azure AI datacenters in Atlanta, Georgia. This purpose-built datacenter is connected to our first Fairwater site in Wisconsin, prior generations of AI supercomputers and the broader Azure global datacenter footprint to create the world’s first planet-scale AI superfactory. By packing computing power more densely than ever before, each Fairwater site is built to efficiently meet unprecedented demand for AI compute, push the frontiers of model intelligence and empower every person and organization on the planet to achieve more.
To meet this demand, we have reinvented how we design AI datacenters and the systems we run inside of them. Fairwater is a departure from the traditional cloud datacenter model and uses a single flat network that can integrate hundreds of thousands of the latest NVIDIA GB200 and GB300 GPUs into a massive supercomputer. These innovations are a product of decades of experience designing datacenters and networks, as well as learnings from supporting some of the largest AI training jobs on the planet.
While the Fairwater datacenter design is well suited for training the next generation of frontier models, it is also built with fungibility in mind. Training has evolved from a single monolithic job into a range of workloads with different requirements (such as pre-training, fine-tuning, reinforcement learning and synthetic data generation). Microsoft has deployed a dedicated AI WAN backbone to integrate each Fairwater site into a broader elastic system that enables dynamic allocation of diverse AI workloads and maximizes GPU utilization of the combined system.
Below, we walk through some of the exciting technical innovations that support Fairwater, from the way we build datacenters to the networking within and across the sites.
Maximum density of compute
Modern AI infrastructure is increasingly constrained by the laws of physics. The speed of light is now a key bottleneck in our ability to tightly integrate accelerators, compute and storage with performant latency. Fairwater is designed to maximize the density of compute to minimize latency within and across racks and maximize system performance.
One of the key levers for driving density is improving cooling at scale. AI servers in the Fairwater datacenters are connected to a facility-wide cooling system designed for longevity, with a closed-loop approach that reuses the liquid continuously after the initial fill with no evaporation. The water used in the initial fill is equivalent to what 20 homes consume in a year and is only replaced if water chemistry indicates it is needed (it is designed for 6-plus years), making it extremely efficient and sustainable.
Liquid-based cooling also provides much higher heat transfer, enabling us to maximize rack and row-level power (~140kW per rack, 1,360 kW per row) to pack compute as densely as possible inside the datacenter. State-of-the-art cooling also helps us maximize utilization of this dense compute in steady-state operations, enabling large training jobs to run performantly at high scale. After cycling through a system of cold plate paths across the GPU fleet, heat is dissipated by one of the largest chiller plants on the planet.
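As a rough sense of what those density figures imply, the short sketch below (Python, illustrative numbers only) divides the quoted row budget by the per-rack figure and restates the result as heat the liquid loop must carry away.

```python
# Quick arithmetic on the quoted density figures: ~140 kW per rack and
# ~1,360 kW per row. Essentially all of that power ends up as heat the
# liquid cooling loop must remove. Illustrative only.

RACK_KW = 140
ROW_KW = 1_360

racks_per_row = ROW_KW / RACK_KW
print(f"~{racks_per_row:.1f} racks per row at full density")
print(f"Heat to remove per row:  ~{ROW_KW / 1_000:.2f} MW")
print(f"Heat to remove per rack: ~{RACK_KW} kW (roughly an order of magnitude above a typical air-cooled rack)")
```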

Another way we are driving compute density is with a two-story datacenter building design. Many AI workloads are very sensitive to latency, which means cable run lengths can meaningfully impact cluster performance. Every GPU in Fairwater is connected to every other GPU, so the two-story datacenter building approach allows for placement of racks in three dimensions to minimize cable lengths, which in turn improves latency, bandwidth, reliability and cost.
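To make the latency argument concrete, here is a minimal back-of-envelope sketch in Python. It assumes signals propagate at roughly two-thirds the speed of light in fiber or copper (about 5 ns per meter); the cable lengths are illustrative, not Fairwater measurements.

```python
# Back-of-envelope: how cable length translates into one-way propagation delay.
# Assumes signals travel at roughly 2/3 the speed of light in fiber/copper,
# i.e. ~5 ns per meter. Cable lengths are illustrative, not Fairwater figures.

C = 299_792_458             # speed of light in a vacuum, m/s
PROPAGATION_FACTOR = 2 / 3  # typical fraction of c in fiber or copper

def one_way_delay_ns(cable_length_m: float) -> float:
    """One-way propagation delay in nanoseconds for a given cable run."""
    velocity = C * PROPAGATION_FACTOR       # m/s
    return cable_length_m / velocity * 1e9  # seconds -> nanoseconds

for length_m in (10, 50, 100):              # hypothetical rack-to-rack runs
    print(f"{length_m:>4} m cable  ->  {one_way_delay_ns(length_m):5.1f} ns one way")
```

Shaving tens of meters off a run saves only a few hundred nanoseconds, but across millions of GPU-to-GPU exchanges in a tightly synchronized training step, those nanoseconds compound.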

High-availability, low-cost power
We are pushing the envelope in serving this compute with cost-efficient, reliable power. The Atlanta site was selected with resilient utility power in mind and is capable of achieving 4×9 (99.99%) availability at 3×9 cost. By securing highly available grid power, we can also forgo traditional resiliency approaches for the GPU fleet (such as on-site generation, UPS systems and dual-corded distribution), driving cost savings for customers and faster time-to-market for Microsoft.
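For readers less familiar with the "nines" shorthand, the small sketch below converts an availability target into allowed downtime per year; it is a generic conversion, not a statement about the Atlanta site's measured uptime.

```python
# What "N nines" of availability means in allowed downtime per year.
# A simple conversion, not a claim about any specific facility.

MINUTES_PER_YEAR = 365.25 * 24 * 60

def downtime_minutes_per_year(nines: int) -> float:
    availability = 1 - 10 ** (-nines)   # e.g. 3 nines -> 0.999
    return (1 - availability) * MINUTES_PER_YEAR

for nines in (3, 4):
    print(f"{nines} nines ({1 - 10**(-nines):.2%}): "
          f"~{downtime_minutes_per_year(nines):,.0f} minutes of downtime per year")
```

Moving from three nines to four nines cuts allowable downtime from roughly nine hours a year to under an hour, which is what makes it practical to drop on-site generation and UPS coverage for the GPU fleet.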
We have also worked with our industry partners to codevelop power-management solutions to mitigate power oscillations created by large scale jobs, a growing challenge in maintaining grid stability as AI demand scales. This includes a software-driven solution that introduces supplementary workloads during periods of reduced activity, a hardware-driven solution where the GPUs enforce their own power thresholds and an on-site energy storage solution to further mask power fluctuations without utilizing excess power.
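The sketch below illustrates the general shape of such a control loop: keep the draw the grid sees inside a band by scheduling filler work on dips and capping GPU power on spikes. The thresholds, names and logic are hypothetical placeholders, not Microsoft's implementation.

```python
# A minimal sketch of the software-side idea: hold the power draw the grid
# sees inside a band by launching supplementary work on dips and capping
# GPUs on spikes. Thresholds, names and the control loop are hypothetical.

FLOOR_KW = 100_000    # hypothetical lower bound of the acceptable band
CEILING_KW = 140_000  # hypothetical upper bound

def smooth_power(measured_kw: float) -> str:
    if measured_kw < FLOOR_KW:
        # Large jobs pause between steps; schedule low-priority
        # "supplementary" kernels so the draw does not fall off a cliff.
        return "launch_filler_workload"
    if measured_kw > CEILING_KW:
        # Hardware path: have GPUs enforce their own power threshold
        # until the spike passes; on-site storage can absorb the rest.
        return "apply_gpu_power_cap"
    return "steady_state"

for sample_kw in (95_000, 120_000, 150_000):   # synthetic telemetry samples
    print(sample_kw, "->", smooth_power(sample_kw))
```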
Cutting-edge accelerators and networking systems
Fairwater’s world-class datacenter design is powered by purpose-built servers, cutting-edge AI accelerators and novel networking systems. Each Fairwater datacenter runs a single, coherent cluster of interconnected NVIDIA Blackwell GPUs, with an advanced network architecture that can scale reliably beyond traditional Clos network limits with current-gen switches (hundreds of thousands of GPUs on a single flat network). This required innovation across scale-up networking, scale-out networking and network protocols.
In terms of scale-up, each rack of AI accelerators houses up to 72 NVIDIA Blackwell GPUs, connected via NVLink for ultra-low-latency communication within the rack. Blackwell accelerators provide the highest compute density available today, with support for low-precision number formats like FP4 to increase total FLOPS and enable efficient memory use. Each rack provides 1.8 TB/s of GPU-to-GPU bandwidth, with over 14 TB of pooled memory available to each GPU.
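Taking those per-rack figures at face value, a little arithmetic shows what the scale-up domain looks like from a single GPU's point of view; the numbers below are derived only from the figures quoted above.

```python
# Back-of-envelope numbers for one rack, using the figures quoted above:
# 72 GPUs, ~1.8 TB/s of NVLink bandwidth per GPU, ~14 TB of pooled memory
# reachable by every GPU. Purely arithmetic, no measured data.

GPUS_PER_RACK = 72
NVLINK_TB_PER_S_PER_GPU = 1.8
POOLED_MEMORY_TB = 14

per_gpu_hbm_gb = POOLED_MEMORY_TB * 1_000 / GPUS_PER_RACK
aggregate_nvlink_tb_per_s = GPUS_PER_RACK * NVLINK_TB_PER_S_PER_GPU
time_to_stream_pool_s = POOLED_MEMORY_TB / NVLINK_TB_PER_S_PER_GPU

print(f"HBM contributed per GPU:         ~{per_gpu_hbm_gb:.0f} GB")
print(f"Aggregate NVLink bandwidth:      ~{aggregate_nvlink_tb_per_s:.0f} TB/s per rack")
print(f"One GPU streaming the full pool: ~{time_to_stream_pool_s:.1f} s at line rate")
```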

These racks then use scale-out networking to create pods and clusters that enable all GPUs to function as a single supercomputer with minimal hop counts. We achieve this with a two-tier, ethernet-based backend network that supports massive cluster sizes with 800 Gbps GPU-to-GPU connectivity. Relying on a broad ethernet ecosystem and SONiC (Software for Open Networking in the Cloud, our own operating system for network switches) also helps us avoid vendor lock-in and manage cost, as we can use commodity hardware instead of proprietary solutions.
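To see why "beyond traditional Clos network limits" matters, the sketch below computes the textbook ceiling of a non-blocking two-tier leaf/spine fabric for a given switch radix. The radix values are illustrative of current 800G-class Ethernet switches, not the switches Azure actually deploys.

```python
# Why a plain two-tier (leaf/spine) Clos tops out: with switch radix R, a
# non-blocking design supports at most R/2 endpoints per leaf and R leaves,
# i.e. R**2 / 2 endpoints in total. Radix values are illustrative only.

def max_endpoints_two_tier(radix: int) -> int:
    leaves = radix                    # each spine port connects one leaf
    endpoints_per_leaf = radix // 2   # half the ports face GPUs, half face spines
    return leaves * endpoints_per_leaf

for radix in (64, 128):
    print(f"radix {radix:>3}: up to {max_endpoints_two_tier(radix):,} endpoints "
          "in a textbook non-blocking two-tier fabric")
```

Reaching hundreds of thousands of GPUs on a flat network with current-generation switches therefore requires rethinking topology, routing and transport rather than simply adding tiers.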
We have also worked with partners like OpenAI and NVIDIA to define a breakthrough custom networking protocol — Multi-Path Reliable Connected (MRC) — to enable deeper control and optimization of network routes. Improvements across packet trimming, packet spray and high-frequency telemetry are core components of our optimized AI network. Together, these technologies deliver advanced congestion control, rapid detection and retransmission and agile load balancing, ensuring ultra-reliable, low-latency performance for modern AI workloads.
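The post does not spell out MRC's wire format, but the toy example below illustrates the general idea behind packet spraying: packets of one flow are spread across several equal-cost paths rather than pinned to a single one, so congestion on any one link slows only a slice of the traffic. This is an illustration of the general technique, not the MRC protocol itself.

```python
# Toy illustration of packet spraying: one flow's packets are distributed
# round-robin across several equal-cost paths instead of hashed onto one.
# Not the MRC protocol -- just the general multi-path idea.

from collections import defaultdict
from itertools import cycle

def spray(packets: list[int], paths: list[str]) -> dict[str, list[int]]:
    """Assign packet sequence numbers to paths in round-robin order."""
    assignment: dict[str, list[int]] = defaultdict(list)
    path_cycle = cycle(paths)
    for seq in packets:
        assignment[next(path_cycle)].append(seq)
    return dict(assignment)

flow = list(range(12))                       # 12 packets of one flow
routes = ["path-A", "path-B", "path-C", "path-D"]
for path, seqs in spray(flow, routes).items():
    print(path, seqs)
```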
Planet scale
Even with these innovations, compute demands for large training jobs (now measured in trillions of parameters) are quickly outpacing the power and space constraints of a single facility. To serve these needs, we have built a dedicated AI WAN optical network to extend Fairwater’s scale-up and scale-out networks. Leveraging our scale and decades of hyperscale expertise, we delivered over 120,000 new fiber miles across the US last year — expanding AI network reach and reliability nationwide.
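Physics still applies across the WAN. The quick estimate below shows how route length translates into propagation delay, assuming roughly 5 microseconds per kilometer of fiber (light in glass travels at about two-thirds of c); the route lengths are placeholders, not the actual inter-site paths.

```python
# Rough one-way propagation delay over a long-haul fiber route, assuming
# ~5 microseconds per kilometer. Route lengths below are placeholders,
# not the real Wisconsin-Atlanta fiber path.

US_PER_KM_IN_FIBER = 5.0   # ~1 / (0.2 km per microsecond)

def one_way_ms(route_km: float) -> float:
    return route_km * US_PER_KM_IN_FIBER / 1_000

for route_km in (500, 1_500, 3_000):        # hypothetical fiber route lengths
    print(f"{route_km:>5} km route: ~{one_way_ms(route_km):.1f} ms one way, "
          f"~{2 * one_way_ms(route_km):.1f} ms round trip")
```

Delays of a few to tens of milliseconds are fine for coordinating coarse-grained work across sites, which is why the AI WAN complements, rather than replaces, the nanosecond-scale scale-up and scale-out networks inside each facility.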
With this high-performance, high-resiliency backbone, we can directly connect different generations of supercomputers into an AI superfactory that exceeds the capabilities of a single site across geographically diverse locations. This empowers AI developers to tap our broader network of Azure AI datacenters, segmenting traffic based on their needs across scale-up and scale-out networks within a site, as well as across sites via the continent-spanning AI WAN.
This is a meaningful departure from the past, where all traffic had to ride the scale-out network regardless of the requirements of the workload. Not only does it provide customers with fit-for-purpose networking at a more granular level, it also helps create fungibility to maximize the flexibility and utilization of our infrastructure.
Putting it all together
The new Fairwater site in Atlanta represents the next leap in the Azure AI infrastructure and reflects our experience running the largest AI training jobs on the planet. It combines breakthrough innovations in compute density, sustainability and networking systems to efficiently serve the massive demand for computational power we are seeing. It also integrates deeply with other AI datacenters and the broader Azure platform to form the world’s first AI superfactory. Together, these innovations provide a flexible, fit-for-purpose infrastructure that can serve the full spectrum of modern AI workloads and empower every person and organization on the planet to achieve more. For our customers, this means easier integration of AI into every workflow and the ability to create innovative AI solutions that were previously unattainable.
Find out more about how Microsoft Azure can help you integrate AI to streamline and strengthen development lifecycles.
Scott Guthrie is responsible for hyperscale cloud computing solutions and services including Azure, Microsoft’s cloud computing platform, generative AI solutions, data platforms and information and cybersecurity. These platforms and services help organizations worldwide solve urgent challenges and drive long-term transformation.
