Reports of Nvidia’s GB200 NVL72 server racks overheating are said to be exaggerated. According to Business Insider, Blackwell’s cooling design flaws have already been addressed. Dylan Patel, principal analyst at SemiAnalysis, reportedly told the publication that the design issues, which have existed for several months, have been largely resolved and that the overheating problem has been overblown.
Five analysts at SemiAnalysis, which monitors the semiconductor industry, reported that the cooling system issue, which prompted “rework” from multiple suppliers, amounted to a “minor” change. The cooling failure was particularly problematic in Nvidia’s massive 72-chip server racks, which can draw up to 120 kW. Flaws in the rack design caused the GPUs inside to overheat, forcing Nvidia to re-evaluate the design multiple times. These problems are reported to be delaying shipments of Nvidia’s GB200 hardware, and the required design changes are expected to cause further delays.
Nvidia’s B200 GPU is its most powerful chip for AI workloads, and it draws power to match. The GB200 superchip, for example, has a configurable TDP in the thousands of watts, with a maximum rating of 2,700 watts. Power figures this high make air cooling virtually impossible within the constraints of standard rack-mount form factors.
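To put those figures in perspective, here is a rough back-of-envelope sketch in Python. It uses only the numbers quoted above (a 72-chip rack, up to 120 kW per rack, and a 2,700-watt maximum per GB200 superchip), plus one assumption not stated in this article: that each GB200 superchip carries two B200 GPUs. The function name is purely illustrative, and the result is an upper-bound estimate rather than a measured value.

```python
# Back-of-envelope rack power estimate from the figures quoted in the article.
RACK_POWER_W = 120_000        # rack draw cited in the article (up to 120 kW)
GPUS_PER_RACK = 72            # "72-chip" rack cited in the article
SUPERCHIP_MAX_TDP_W = 2_700   # GB200 superchip maximum TDP cited in the article

# Assumption (not stated in the article): two B200 GPUs per GB200 superchip.
GPUS_PER_SUPERCHIP = 2


def rack_power_summary():
    """Return rough power figures for one rack under the assumptions above."""
    superchips = GPUS_PER_RACK // GPUS_PER_SUPERCHIP
    superchip_total_w = superchips * SUPERCHIP_MAX_TDP_W
    return {
        "superchips": superchips,
        "superchip_total_w": superchip_total_w,
        # Fraction of the quoted rack budget consumed by superchips at max TDP.
        "share_of_rack": superchip_total_w / RACK_POWER_W,
        # Rack budget spread evenly across GPUs (includes all other components).
        "avg_w_per_gpu": RACK_POWER_W / GPUS_PER_RACK,
    }


if __name__ == "__main__":
    for name, value in rack_power_summary().items():
        print(f"{name}: {value}")
```

Under these assumptions, the superchips alone would account for roughly 97 kW at maximum TDP, or about 81% of the quoted 120 kW rack figure, which helps explain why air cooling is not practical at this density.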
This physical constraint led Nvidia to require liquid cooling for its latest Blackwell GPUs. Data centers, in turn, will need to retrofit their server farms with the infrastructure that liquid-cooled servers require.
Nvidia could sidestep this problem by building slower, air-cooled GPUs, and it still does so with products such as the H200 NVL. But to stay at the forefront of the AI GPU arms race, Nvidia is prioritizing performance regardless of cost. The company therefore chose to give up air cooling in order to produce GPUs that draw thousands of watts.
The good news is that the cooling issues in Nvidia’s 72-chip Blackwell racks are apparently minor and have already been largely resolved. In addition, only Nvidia’s flagship 72-chip server racks are affected.