NVIDIA today announced that xAI’s Colossus supercomputer cluster of 100,000 NVIDIA Hopper GPUs in Memphis, Tennessee, achieved this massive scale by using the NVIDIA Spectrum-X™ Ethernet networking platform. The platform is designed to deliver superior performance to multi-tenant, hyperscale AI factories using standards-based Ethernet for its remote direct memory access (RDMA) network.
Colossus, the world’s largest AI supercomputer, is used to train xAI’s Grok family of large language models, with chatbots offered as a feature for X Premium subscribers. xAI plans to double the size of Colossus to a total of 200,000 NVIDIA Hopper GPUs.
The supporting facility and state-of-the-art supercomputer were built by xAI and NVIDIA in just 122 days, compared to the many months or years that systems of this scale typically take. It took 19 days from the time the first rack rolled onto the floor until training began.
Colossus achieves unprecedented network performance while training the extremely large Grok models. Across all three tiers of the network fabric, the system has experienced no application latency degradation or packet loss due to flow collisions. It has maintained 95% data throughput, enabled by Spectrum-X congestion control.
This level of performance cannot be achieved at scale with standard Ethernet, which creates thousands of flow collisions while delivering only 60% data throughput.
“AI is becoming mission-critical, requiring improvements in performance, security, scalability, and cost efficiency,” said Gilad Shainer, senior vice president of networking at NVIDIA. “The NVIDIA Spectrum-X Ethernet networking platform is designed to give innovators like xAI faster processing, analysis, and execution of AI workloads, which in turn accelerates the development, deployment, and time-to-market of AI solutions.”
Elon Musk said on X, “Colossus is the world’s most powerful training system. It’s a great job by the xAI team, NVIDIA, and our many partners/suppliers.”
“xAI has built the world’s largest and most powerful supercomputer,” an xAI spokesperson said. “NVIDIA’s Hopper GPUs and Spectrum-X allow us to push the boundaries of training AI models at scale and build ultra-accelerated and optimized AI factories based on Ethernet standards.”
At the heart of the Spectrum-X platform is the Spectrum SN5600 Ethernet switch, which supports port speeds up to 800Gb/s and is based on the Spectrum-4 switch ASIC. xAI chose to pair Spectrum-X SN5600 switches with NVIDIA BlueField-3® SuperNICs for unprecedented performance.
Spectrum-X Ethernet networking for AI brings advanced capabilities that provide highly efficient, scalable bandwidth with low latency and short tail latency, previously available only with InfiniBand. These features include adaptive routing with NVIDIA Direct Data Placement technology, congestion control, and enhanced AI fabric visibility and performance isolation, all of which are key requirements in multi-tenant production AI clouds and large enterprise environments.