According to a TrendForce report, Nvidia may have to postpone the production ramp of its next-generation AI servers based on the B200 and GB200 platforms because of overheating, high power consumption, and the need to optimize interconnects. The market research firm believes that peak mass production and shipments of Blackwell machines will occur in mid-2025, which would mean a delay of about six months. Nvidia has not yet confirmed or denied this claim.
As expected, Nvidia and its partners will only be able to ship a limited number of Blackwell-based servers in 2024, as they will need to use lower-yield B200s for their servers. Dell, however, is already shipping Blackwell server racks. Although a sophisticated version of Nvidia’s B200 processor will go into production in October and be in the company’s hands in January, TrendForce doesn’t expect shipments of Blackwell-based servers to surge any time soon. According to the firm, volume production and peak shipments of the B200 and GB200 will only take place between the second and third quarters of 2025, due to heat, power consumption, and high-speed interconnect requirements.
Just a few months ago, it was reported that an Nvidia NVL72 rack based on the GB200 platform with 72 B200 GPUs would draw 120 kW of power, which is already significantly higher than current AI server racks (typical high-density racks draw up to 20 kW, while H100-based racks are reported to consume around 40 kW). TrendForce claims that Nvidia has updated the specifications of the device, bringing the power consumption to 140 kW. This is more than a typical data center can feed into a single rack.
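To put those figures side by side, here is a rough back-of-the-envelope sketch in Python. It uses only the rack numbers quoted above; the dictionary keys and function names are made up for illustration. Dividing rack power by 72 gives an average share per GPU position that also includes CPUs, switches, and other rack components, so it should not be read as a GPU power rating.

```python
# Rack power figures quoted in the article, in kilowatts.
# The key names are illustrative labels, not official product terms.
RACK_POWER_KW = {
    "typical_high_density": 20,
    "h100_based": 40,
    "nvl72_reported": 120,
    "nvl72_updated": 140,
}
GPUS_PER_NVL72 = 72


def power_ratio(rack: str, baseline: str) -> float:
    """How many times more power `rack` draws than `baseline`."""
    return RACK_POWER_KW[rack] / RACK_POWER_KW[baseline]


def average_kw_per_gpu(rack: str) -> float:
    """Whole-rack power divided by GPU count (an average share, not a GPU TDP)."""
    return RACK_POWER_KW[rack] / GPUS_PER_NVL72


if __name__ == "__main__":
    for baseline in ("typical_high_density", "h100_based"):
        ratio = power_ratio("nvl72_updated", baseline)
        print(f"140 kW rack vs {baseline}: {ratio:.1f}x")
    for rack in ("nvl72_reported", "nvl72_updated"):
        print(f"{rack}: {average_kw_per_gpu(rack):.2f} kW per GPU position")
```

On these numbers, a 140 kW rack draws seven times the power of a typical high-density rack and three and a half times that of an H100-based rack.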
Nvidia’s Blackwell GPUs are reportedly prone to overheating in servers with 72 processors, which draw up to 120 kW per rack. This problem has forced Nvidia to repeatedly modify its server rack design, as overheating not only reduces GPU performance but also risks damaging the hardware. Moving to 140 kW per rack requires further changes to the server design, which could lead to setbacks.
Increased power consumption means additional cooling requirements. Liquid cooling is essential for Blackwell servers, but modern sidecar cooling distribution units (CDUs) can only handle 60 kW to 80 kW of heat output. To close that gap, cooling system providers are optimizing the cold plate design and aiming to double or triple the capacity of the CDU. TrendForce expects the capacity of liquid-to-liquid spigot CDUs to exceed 1.3 MW, and further advances are possible, so heat dissipation will eventually become less of a problem.
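The gap between today's sidecar CDUs and a 140 kW rack can be checked with simple arithmetic. The sketch below is only an illustration: it assumes that essentially all rack power ends up as heat the CDU must remove, which is a simplification, and the function names are hypothetical.

```python
import math

# Figures quoted in the article, in kilowatts.
RACK_HEAT_KW = 140         # updated NVL72 rack power, assumed to all become heat
SIDECAR_CDU_KW = (60, 80)  # low and high end of current sidecar CDU capacity


def cdus_needed(rack_kw: float, cdu_kw: float) -> int:
    """Number of CDUs of a given capacity needed to cover one rack."""
    return math.ceil(rack_kw / cdu_kw)


def scaled_range(factor: int) -> tuple[int, int]:
    """Sidecar CDU capacity range after multiplying capacity by `factor`."""
    low, high = SIDECAR_CDU_KW
    return (low * factor, high * factor)


if __name__ == "__main__":
    for cdu_kw in SIDECAR_CDU_KW:
        count = cdus_needed(RACK_HEAT_KW, cdu_kw)
        print(f"{cdu_kw} kW CDU: {count} units per rack")
    for factor in (2, 3):
        low, high = scaled_range(factor)
        print(f"{factor}x capacity: {low} kW to {high} kW")
```

Under these assumptions, a single current sidecar CDU cannot cover one 140 kW rack. Doubling capacity only clears 140 kW at the high end of the range (160 kW), while tripling clears it across the whole range (180 kW to 240 kW).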
However, according to the report, power consumption and thermal management are not the only problems that Nvidia and its partners must solve. TrendForce claims that Nvidia needs to optimize its interconnects, but does not elaborate on which interconnects need to be optimized.
What do the claimed initial issues with Nvidia’s B200 and GB200 servers mean for the launch timing and availability of the B200A, which is based on simplified Blackwell processors, and the B300 and GB300 machines, which feature refreshed Blackwell GPUs? It is not yet clear whether there will be an impact. The B200A features significantly lower power consumption compared to the B200/GB200, while the refreshed B300 series Blackwell GPUs feature more memory and higher compute performance, which is typically achieved with higher power. These products could consume even more power, over 140 kW per rack, and require more advanced components and cooling.