In the AI hardware world, almost everyone is talking about inference.
Nvidia Chief Financial Officer Colette Kress said on the company’s Wednesday earnings call that inference accounted for about 40% of the company’s $26.3 billion in second-quarter data center revenue. AWS CEO Matt Garman recently said on the No Priors podcast that half of the work done on AI computing servers today is inference. And that share could grow, drawing in competitors eager to dethrone Nvidia.
Indeed, many of the companies looking to take market share from Nvidia are starting with inference.
Groq, founded by former Google employees and focused on inference hardware, raised $640 million in August at a valuation of $2.8 billion.
In December 2023, Positron AI unveiled an inference chip that it claims can perform the same calculations as Nvidia’s H100, but at a fifth of the price. Amazon is developing both training and inference chips, which it has named Trainium and Inferentia, respectively.
“I think the more diversity we have, the better for us,” Garman said on the same podcast.
And Cerebras, a California company known for its oversized AI training chips, announced last week that it has built what CEO Andrew Feldman says is the fastest inference offering on the market.
Not all inference chips are created equal
Chips designed for artificial intelligence workloads are typically optimized for one of two jobs: training or inference.
Training is the first phase of AI tool development: inputting labeled and annotated data into the model so that it learns how to produce accurate and useful results. Inference is the act of generating those outputs after the model has been trained.
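To make the distinction concrete, here is a minimal PyTorch sketch of the two phases. The model, data, and sizes are toy placeholders invented for illustration, not anything a chipmaker actually runs:

```python
# Minimal sketch of training vs. inference using PyTorch.
# The model and data below are synthetic stand-ins, purely illustrative.
import torch
import torch.nn as nn

model = nn.Linear(16, 2)                      # toy "model"
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
loss_fn = nn.CrossEntropyLoss()

# Training: feed labeled examples and update the model's weights.
for _ in range(100):
    x = torch.randn(32, 16)                   # stand-in for annotated data
    y = torch.randint(0, 2, (32,))            # stand-in for labels
    loss = loss_fn(model(x), y)
    optimizer.zero_grad()
    loss.backward()                           # gradient math: the compute-heavy part
    optimizer.step()

# Inference: weights are frozen; the model only generates outputs.
model.eval()
with torch.no_grad():
    prediction = model(torch.randn(1, 16)).argmax(dim=1)
```

Training loops over the data many times and updates weights on every pass; inference is a single forward pass per request, repeated for every user query once the product ships.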
Training chips tend to optimize for raw computing power. Inference chips need less of it; in fact, some inference can run on a conventional CPU. Chip designers targeting inference are more concerned with latency, because the difference between an addictive AI tool and an annoying one often comes down to speed. That speed is exactly what Cerebras CEO Andrew Feldman is betting on.
The company claims that the Cerebras chip has 7,000 times the memory bandwidth of Nvidia’s H100, which allows for what Feldman calls “incredible speeds.”
The company, which has begun the IPO process, is also rolling out inference as a service with multiple tiers, including a free tier.
“Inference is a memory bandwidth problem,” Feldman told Business Insider.
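A rough back-of-the-envelope calculation shows why. Generating each token of a large language model requires streaming essentially all of the model’s weights through the chip once, so single-stream output speed is capped by memory bandwidth divided by model size. The sketch below uses made-up, illustrative numbers (a 70-billion-parameter model, an assumed ~3,000 GB/s baseline, and the article’s 7,000x multiplier), not vendor specifications:

```python
# Why inference speed is bandwidth-bound: each generated token reads every
# weight once, so tokens/sec is roughly bandwidth / model size in bytes.
# All figures here are illustrative assumptions, not measured specs.

def tokens_per_second(bandwidth_gb_s: float, params_billions: float,
                      bytes_per_param: float = 2.0) -> float:
    """Rough upper bound on single-stream decode speed for a dense model."""
    model_bytes = params_billions * 1e9 * bytes_per_param
    return bandwidth_gb_s * 1e9 / model_bytes

# A 70B-parameter model stored in 16-bit weights:
for name, bw_gb_s in [("assumed ~3,000 GB/s GPU", 3_000),
                      ("7,000x that bandwidth", 3_000 * 7_000)]:
    print(f"{name}: ~{tokens_per_second(bw_gb_s, 70):,.0f} tokens/sec ceiling")
```

Under those assumptions the baseline chip tops out at roughly 20 tokens per second per stream, while a 7,000-fold bandwidth increase raises the ceiling by the same factor, which is the kind of gap Feldman is pointing to.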
Making AI pay off means scaling up inference workloads
Choosing to optimize a chip design for training or inference is not just a technical decision, but also a market decision: Most companies building AI tools will need both at some point, but the bulk of their needs will likely be concentrated in one area or the other, depending on where the company is in the build cycle.
Massive training workloads can be thought of as the research and development phase of AI. When companies shift primarily to inference, it means that, at least in theory, the products they built are working for their end customers.
As more AI projects and startups mature, inference is expected to become a larger share of computing tasks. In fact, AWS’s Garman said that shift is necessary to realize the returns on the hundreds of billions of dollars being poured into AI infrastructure.
“Inference workloads need to dominate, otherwise the investments in these large-scale models are not going to pay off,” Garman told No Priors.
But the simple dichotomy between training and inference for chip designers may not last forever.
“Some of the clusters in our data centers are being used by customers for both purposes,” said Raul Martynek, CEO of data center operator DataBank.
Nvidia’s recent move to acquire Run:ai may support Martynek’s prediction that the wall between inference and training could soon come down.
Nvidia agreed to acquire Israeli company Run:ai in April, but the deal has not yet closed and is under investigation by the Department of Justice, according to Politico. Run:ai’s technology makes GPUs work more efficiently, allowing them to do more work with fewer chips.
“I think most companies will merge. There will be clusters doing training and inference,” Martynek said.
Nvidia declined to comment on the report.