- ReDrafter delivers 2.7x more tokens per second than traditional autoregression
- ReDrafter has the potential to reduce latency for users while using fewer GPUs
- No word on when it will come to competing AI GPUs
Apple has announced that it is working with Nvidia to accelerate large language model (LLM) inference using Apple's open-source technique, Recurrent Drafter (ReDrafter for short).
The partnership aims to address the computational cost of autoregressive token generation, a key bottleneck for efficiency and latency in real-time LLM applications.
Introduced by Apple in November 2024, ReDrafter takes a speculative decoding approach, combining a recurrent neural network (RNN) draft model with beam search and dynamic tree attention. According to Apple's benchmarks, the method generates 2.7x more tokens per second than traditional autoregression.
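To illustrate the idea, here is a minimal, hypothetical sketch of the draft-and-verify loop that speculative decoding is built on. This is not Apple's implementation: the RNN draft head, beam search, and dynamic tree attention that distinguish ReDrafter are omitted, and `target` and `draft` are stand-ins for any pair of compatible models.

```python
# Conceptual sketch of speculative decoding (greedy variant, batch size 1).
# Not ReDrafter itself: the RNN draft head, beam search, and dynamic tree
# attention are omitted; `target` and `draft` are hypothetical models that
# map a token sequence [1, T] to logits [1, T, vocab_size].
import torch

@torch.no_grad()
def speculative_step(target, draft, tokens, k=4):
    """Propose k tokens with the cheap draft model, then verify them all
    with a single forward pass of the expensive target model."""
    proposal = tokens
    for _ in range(k):  # autoregressive drafting (cheap per step)
        logits = draft(proposal)
        next_tok = logits[:, -1].argmax(dim=-1, keepdim=True)
        proposal = torch.cat([proposal, next_tok], dim=-1)

    # One target pass scores every drafted position at once.
    target_logits = target(proposal)
    accepted = tokens
    for i in range(tokens.shape[1] - 1, proposal.shape[1] - 1):
        target_tok = target_logits[:, i].argmax(dim=-1, keepdim=True)
        drafted_tok = proposal[:, i + 1 : i + 2]
        if not torch.equal(target_tok, drafted_tok):
            # First mismatch: keep the target's own token and stop.
            accepted = torch.cat([accepted, target_tok], dim=-1)
            break
        accepted = torch.cat([accepted, drafted_tok], dim=-1)
    # At least one new token per target pass, often several.
    return accepted
```

ReDrafter's refinement, per the paper's description, is to replace this single greedy draft with an RNN head that proposes a tree of candidate continuations via beam search, so more drafted tokens survive each verification pass.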
Can it scale beyond Nvidia?
Through its integration into Nvidia's TensorRT-LLM framework, ReDrafter now brings faster LLM inference to Nvidia GPUs, which are widely used in production environments.
To accommodate ReDrafter's algorithms, Nvidia has introduced new operators and fine-tuned existing ones within TensorRT-LLM, making the technique available to developers who want to optimize the performance of large models.
In addition to the speed improvements, Apple says ReDrafter has the potential to reduce latency for users while requiring fewer GPUs. This efficiency not only lowers computational costs but also cuts power consumption, a critical factor for organizations managing large-scale AI deployments.
For now, the collaboration is focused on Nvidia's infrastructure, but similar performance benefits could eventually be extended to competing GPUs from AMD or Intel.
Such breakthroughs can help improve the efficiency of machine learning. As Nvidia states, “This collaboration has made TensorRT-LLM more powerful and more flexible, enabling the LLM community to innovate more sophisticated models and easily deploy them with TensorRT-LLM to achieve unparalleled performance on Nvidia GPUs. These new features open up exciting possibilities, and we eagerly anticipate the next generation of advanced models from the community that leverage TensorRT-LLM capabilities, driving further improvements to LLM workloads.”
For more information about the collaboration, visit the Nvidia Developer Technical Blog.