No one will be surprised. Nvidia still fields the fastest AI and HPC accelerators across all MLPerf benchmarks. And while Google submitted results this round, AMD did not participate.
After all, MLPerf is the industry’s best benchmark suite for AI workloads. Do people look at these results and decide what to buy? Probably not. However, they remain a useful indicator of the pace of improvement in AI hardware and software performance across successive releases. And of course, benchmarks give Nvidia bragging rights.
AMD submitted a single AI inference benchmark last quarter, and it actually performed quite well. But this time I was surprised that the company was unable or unwilling to test its hardware. Perhaps it is too busy preparing the MI325 for shipment to customers.
Meanwhile, Google submitted some results for its TPU V5p and TPU Trillium platforms, which are used to train Apple Intelligence models. Let’s take a look.
First of all, Nvidia Hopper is still the fastest GPU available
While Blackwell is ramping up production and appears as a “preview” entry in MLPerf parlance, Hopper is currently Nvidia’s “available” GPU. The chief beneficiary of the AI wave that hit the market last year showed off new software that boosts performance by up to 30%, and submitted benchmarks on clusters scaled to over 11,000 GPUs using NVLink, NVSwitch, ConnectX-7, and Quantum X400 InfiniBand networking. Clearly, the H100 brought in most of Nvidia’s revenue this quarter, and Nvidia has demonstrated that this platform is no slouch.
Dell submitted results for both the H100 and H200, showing an improvement of about 15% on the larger-memory H200. However, these benchmarks do not cover the larger real-world models that should benefit most from the H200’s added memory. Perhaps that is another reason AMD did not participate: the AMD MI300 has more HBM than its Nvidia counterpart, and these benchmarks would not show off that potential advantage.
Blackwell: Twice Hopper’s training performance
Nvidia demonstrated in last quarter’s round of inference benchmarks that Blackwell delivered four times the performance of Hopper, but roughly half of that gain was probably due to the use of 4-bit arithmetic on the B100 versus 8-bit on the H100. Since the B100 comprises two GPU dies, and training runs in 8-bit floating point on both parts, one should expect Blackwell to complete training jobs in about half the time of its predecessor, the H100. And that is exactly what the MLPerf benchmarks show: up to 2x better performance.
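Here is that back-of-the-envelope reasoning as a minimal sketch; the assumption that FP4 roughly doubles throughput versus FP8 is mine, not an Nvidia-published figure:

```python
# Sketch: decomposing Blackwell's ~4x inference gain into an expected
# ~2x training gain. The 2x precision factor is an assumption.
inference_speedup = 4.0   # Blackwell vs. Hopper, last quarter's MLPerf Inference
precision_factor = 2.0    # assumed throughput gain from FP4 (B100) vs. FP8 (H100)

# The remaining factor comes from the hardware itself (two GPU dies, etc.)
hardware_factor = inference_speedup / precision_factor

# Training runs in FP8 on both parts, so only the hardware factor carries over
expected_training_speedup = hardware_factor
print(f"Expected training speedup: ~{expected_training_speedup:.0f}x")  # ~2x
```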
Let’s look at this from a historical perspective. The Nvidia A100 was introduced four and a half years ago, and Blackwell is 12x faster than the A100: an order-of-magnitude improvement in training performance. The difference stems from seven improvements made over those four-plus years, as outlined in the slides below. As Nvidia keeps telling us, its AI offering is a full-stack solution, with everything from software like the Transformer Engine to overlapping compute and communication contributing to this impressive result. Moore’s Law alone, by comparison, would yield less than a 4x gain over the same period, which is why a GPU platform behaves so differently from a CPU chip. Achieving this level of innovation and performance requires a fully optimized system, software, rack, and data center architecture.
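To put numbers on that comparison, here is a quick sketch; the doubling cadences below are my assumptions, chosen to bracket the “less than 4x” figure:

```python
# Sanity check on the Moore's Law comparison. Cadences are assumed; the
# "less than 4x" figure implies a cadence slower than the classic two years.
years = 4.5  # A100 launch to Blackwell, per the article

for cadence_years in (2.0, 2.5, 3.0):  # years per transistor-density doubling
    gain = 2 ** (years / cadence_years)
    print(f"doubling every {cadence_years} years -> {gain:.1f}x over {years} years")
# 2.0 years -> ~4.8x, 2.5 years -> ~3.5x, 3.0 years -> ~2.8x
# For comparison, MLPerf Training shows Blackwell at 12x the A100.
```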
Google TPU
Honestly, I had a hard time understanding Google’s benchmark results, and the company did not have a spokesperson at the MLCommons briefing. Google provided four Trillium TPU results for clusters ranging from 512 to 3,072 accelerators. The 102-minute result for the 512-accelerator cluster looks good until you see that Nvidia Blackwell completed the same task in just over 193 minutes using only eight GPUs. Normalizing, always a dangerous and imprecise mathematical exercise, makes Blackwell over 30 times faster per accelerator. With 512 GPUs, Blackwell could probably complete the training run in just over three minutes. If someone at Google wants to contact me, I will be happy to correct any mistakes in this analysis.
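For transparency, here is that normalization spelled out, using the figures above and assuming perfect linear scaling, which is admittedly optimistic:

```python
# Per-accelerator normalization of the two MLPerf results cited above.
trillium_accels, trillium_minutes = 512, 102
blackwell_accels, blackwell_minutes = 8, 193

# Total accelerator-minutes consumed by each run
trillium_work = trillium_accels * trillium_minutes      # 52,224
blackwell_work = blackwell_accels * blackwell_minutes   # 1,544

ratio = trillium_work / blackwell_work
print(f"Blackwell per-accelerator advantage: ~{ratio:.0f}x")  # ~34x

# Hypothetical 512-GPU Blackwell run, assuming perfect scaling
projected_minutes = blackwell_work / 512
print(f"Projected 512-GPU Blackwell time: ~{projected_minutes:.1f} minutes")  # ~3.0
```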
Where is this going?
Clearly, Nvidia’s leadership in chips, software, networking, and infrastructure delivers. Nobody else comes close in these results, and most of the industry refuses to provide the transparency needed to properly evaluate alternatives. Many of these alternative vendors simply say that Nvidia controls MLPerf and that trying to compete there is a waste of time. I’m not sure about the “control” part of that argument, but I agree that Nvidia will throw as many engineers as it takes at winning these bake-offs. And it does. And it will keep winning.
“We want to be able to double or triple performance at scale every year for the next 10 years,” Nvidia CEO Jensen Huang said on an episode of the AI-focused podcast “No Priors.”