Nvidia, Oracle, Google, Dell, and 13 other companies reported how long it takes their computers to train the major neural networks in use today. Among the results were a first glimpse of Nvidia’s next-generation GPU, the B200, and of Google’s upcoming accelerator, called Trillium. The B200 posted double the performance of today’s flagship Nvidia chip, the H100, on some tests. And Trillium delivered a nearly 4x performance improvement over the chips Google tested in 2023.
The benchmark, called MLPerf v4.1, consists of six tasks: recommendation, pretraining of the large language models (LLMs) GPT-3 and BERT-large, fine-tuning of the Llama 2 70B large language model, object detection, graph node classification, and image generation.
Training GPT-3 is such a huge undertaking that it would be impractical to do the whole thing just to deliver a benchmark. Instead, the test consists of training the model to a point that experts have determined means it is likely to reach the goal if training continued. For Llama 2 70B, the goal is not to train the LLM from scratch but to take an already trained model and fine-tune it to specialize in a particular expertise, in this case government documents. Graph node classification is a type of machine learning used in fraud detection and drug discovery.
As what’s important in AI has evolved, mostly toward generative AI, so has the suite of tests. This latest version of MLPerf marks a complete changeover in what is tested since the benchmarking effort began. “At this point, all of the original benchmarks have been phased out,” says David Kanter, who leads the benchmarking effort at MLCommons. By the last round, some of those benchmarks took only a few seconds to run.
The performance of the best machine-learning systems on various benchmarks has exceeded what would be expected from gains due to Moore’s Law (blue line) alone. Solid lines represent current benchmarks; dashed lines represent benchmarks that have been retired because they are no longer industrially relevant. MLCommons
According to MLPerf’s calculations, AI training on the new suite of benchmarks is improving at roughly twice the rate one would expect from Moore’s Law. As the years have gone on, though, results have plateaued more quickly than they did at the beginning of MLPerf’s reign. Kanter attributes this largely to the fact that companies have figured out how to run the benchmark tests on very large systems. Over time, Nvidia, Google, and others have developed software and networking technology that allows for near-linear scaling: doubling the number of processors cuts training time roughly in half.
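To make that scaling claim concrete, here is a minimal sketch, with made-up numbers rather than MLPerf results, of how one might quantify how close a system comes to perfectly linear scaling:

```python
# Minimal sketch (not MLPerf data) of what "near-linear scaling" means:
# with perfect scaling, doubling the chip count halves the training time.

def scaling_efficiency(chips_small, time_small, chips_large, time_large):
    """Fraction of the ideal speedup achieved when growing from the small
    system to the large one (1.0 means perfectly linear scaling)."""
    ideal_speedup = chips_large / chips_small
    actual_speedup = time_small / time_large
    return actual_speedup / ideal_speedup

# Hypothetical example: 4x the chips drops training time from 40 to 11 minutes.
print(scaling_efficiency(1024, 40.0, 4096, 11.0))  # ~0.91, i.e. 91% of linear
```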
First Nvidia Blackwell training results
This round marked the first training tests for Nvidia’s upcoming GPU architecture, called Blackwell. On the GPT-3 training and LLM fine-tuning tasks, the Blackwell (B200) roughly doubled the per-GPU performance of the H100. The gains were somewhat less robust but still substantial for recommender systems and image generation, at 64 percent and 62 percent, respectively.
The Blackwell architecture, embodied in the Nvidia B200 GPU, continues an ongoing trend toward using less and less precise numbers to speed up AI. For certain parts of transformer neural networks such as ChatGPT, Llama 2, and Stable Diffusion, the Nvidia H100 and H200 use 8-bit floating-point numbers. The B200 brings that down to just 4 bits.
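As an illustration of what 4-bit floating point implies, here is a minimal sketch that assumes an E2M1-style format (1 sign bit, 2 exponent bits, 1 mantissa bit), which has only 16 representable values; the round-to-nearest scaling scheme shown is a generic one, not Nvidia’s actual implementation:

```python
# Illustrative sketch only, not Nvidia's implementation. Assumes an
# E2M1-style 4-bit float (1 sign, 2 exponent, 1 mantissa bit), which can
# represent just 16 values, and quantizes weights with one shared scale.

FP4_MAGNITUDES = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]
FP4_VALUES = sorted({s * m for m in FP4_MAGNITUDES for s in (1.0, -1.0)})

def quantize_fp4(weights):
    """Round each weight to the nearest representable FP4 value after
    scaling so the largest magnitude lands on the largest FP4 value (6.0)."""
    scale = max(abs(w) for w in weights) / 6.0
    quantized = [min(FP4_VALUES, key=lambda v: abs(w / scale - v)) for w in weights]
    # Dequantize to see the coarse values the hardware would actually multiply.
    return [q * scale for q in quantized], scale

weights = [0.012, -0.33, 0.07, 0.91, -0.002]
dequant, scale = quantize_fp4(weights)
print(scale)    # shared scale factor (~0.152)
print(dequant)  # coarse 16-level approximation of the original weights
```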
First results for Google’s 6th-generation hardware
Google showed the first results for its 6th-generation TPU, called Trillium, which it announced just last month, as well as second-round results for its 5th-generation variant, the Cloud TPU v5p. In the 2023 edition, the search giant had entered the v5e, a different version of its 5th-generation TPU designed more for efficiency than performance. Compared with the v5e, Trillium improves performance on the GPT-3 training task by a factor of 3.8.
But stacked up against everyone’s main rival, Nvidia, things weren’t so rosy. A system made up of 6,144 TPU v5ps reached the GPT-3 training checkpoint in 11.77 minutes, placing second behind an 11,616-GPU Nvidia H100 system, which completed the task in about 3.44 minutes. Google’s top TPU system was only about 25 seconds faster than an H100 computer half its size.
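One rough way to compare systems of such different sizes is to normalize by chip count. The back-of-the-envelope sketch below uses only the figures quoted above and a crude chip-minutes metric, so it ignores power, cost, and interconnect differences:

```python
# Rough per-chip view of the two GPT-3 results quoted above. Chip-minutes is
# a crude proxy that ignores power, cost, and interconnect, so treat it as
# illustration only.

systems = {
    "Google TPU v5p": {"chips": 6144, "minutes": 11.77},
    "Nvidia H100": {"chips": 11616, "minutes": 3.44},
}

for name, s in systems.items():
    chip_minutes = s["chips"] * s["minutes"]
    print(f"{name}: {chip_minutes:,.0f} chip-minutes to the checkpoint")
# Google TPU v5p: ~72,000 chip-minutes; Nvidia H100: ~40,000 chip-minutes
```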
In the closest direct comparison between the v5p and Trillium, in which each system was configured with 2,048 TPUs, the upcoming Trillium cut GPT-3 training time by 2 minutes, an improvement of nearly 8 percent on the v5p’s 29.6 minutes. Another difference between the Trillium and v5p entries is that Trillium was paired with AMD Epyc CPUs rather than the v5p’s Intel Xeons.
Google also trained the image generator Stable Diffusion on the Cloud TPU v5p. At 2.6 billion parameters, Stable Diffusion is a light enough lift that MLPerf contestants are required to train it to convergence, rather than just to a checkpoint as with GPT-3. The 1,024-TPU system placed second, finishing the job in 2 minutes and 26 seconds, about a minute behind a similarly sized system built around Nvidia H100s.
Training power consumption is still unclear
The enormous energy cost of training neural networks has long been a source of concern, and MLPerf is only just beginning to measure it. Dell Technologies was the only entrant in the energy category, with an eight-server system containing 64 Nvidia H100 GPUs and 16 Intel Xeon Platinum CPUs. The only measurement was made on the LLM fine-tuning task (Llama 2 70B). The system consumed 16.4 megajoules during its roughly 5-minute run, for an average power draw of about 55 kilowatts. That works out to roughly 75 cents’ worth of electricity at average U.S. rates.
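Those figures are easy to sanity-check from the reported energy and run time alone; in the sketch below, the electricity rate is an assumed average U.S. price of about 16.5 cents per kilowatt-hour, not something reported by Dell or MLPerf:

```python
# Sanity-checking the Dell energy figures from the reported numbers. The
# electricity rate below is an assumed average U.S. price, not a figure
# reported by Dell or MLPerf.

energy_joules = 16.4e6     # reported energy for the fine-tuning run
run_seconds = 5 * 60       # "5-minute run" (approximate)
price_per_kwh = 0.165      # assumed average U.S. rate, in dollars per kWh

average_power_kw = energy_joules / run_seconds / 1_000
energy_kwh = energy_joules / 3.6e6
cost_dollars = energy_kwh * price_per_kwh

print(f"average power: {average_power_kw:.0f} kW")  # ~55 kW
print(f"energy: {energy_kwh:.2f} kWh")              # ~4.56 kWh
print(f"cost: ${cost_dollars:.2f}")                 # ~$0.75
```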
While that single result doesn’t say much on its own, it does provide a ballpark for the power consumption of similar systems. Oracle, for example, reported a close performance result, 4 minutes and 45 seconds, using the same number and types of CPUs and GPUs.