Accelerating LLM inference is an important challenge in ML research, as autoregressive token generation is computationally expensive and relatively slow, and improving inference efficiency can reduce latency for users. In addition to our continued efforts to accelerate inference on Apple silicon, we have recently made significant progress in accelerating LLM inference on NVIDIA GPUs, which are widely used for production applications across the industry.
Earlier this year, we published and open sourced Recurrent Drafter (ReDrafter), a novel approach to speculative decoding that achieves state-of-the-art performance. ReDrafter uses an RNN draft model and combines beam search with dynamic tree attention to speed up LLM token generation, achieving up to 3.5 tokens per generation step for open source models and outperforming prior speculative decoding techniques.
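To make the mechanics concrete, the sketch below illustrates the generic draft-and-verify loop that speculative decoding builds on. It is a minimal illustration under simplifying assumptions: the `draft_next_token` and `target_next_token` callables are hypothetical stand-ins, and ReDrafter itself drafts with an RNN conditioned on the target model's hidden states and verifies a beam of candidates with dynamic tree attention, rather than a single greedy sequence as shown here.

```python
# Minimal sketch of the draft-and-verify loop underlying speculative
# decoding. The two callables are hypothetical stand-ins for a fast,
# approximate draft model and a slow, high-quality target model.
from typing import Callable, List

def speculative_decode(
    target_next_token: Callable[[List[int]], int],  # target model (slow)
    draft_next_token: Callable[[List[int]], int],   # draft model (fast)
    prompt: List[int],
    max_new_tokens: int = 32,
    draft_len: int = 4,
) -> List[int]:
    tokens = list(prompt)
    generated = 0
    while generated < max_new_tokens:
        # 1) Draft: cheaply propose draft_len candidate tokens.
        draft, ctx = [], list(tokens)
        for _ in range(draft_len):
            t = draft_next_token(ctx)
            draft.append(t)
            ctx.append(t)
        # 2) Verify: the target model checks each candidate in order;
        # a real system scores all candidates in one batched forward pass.
        accepted = 0
        for i, t in enumerate(draft):
            if target_next_token(tokens + draft[:i]) == t:
                accepted += 1
            else:
                break
        tokens += draft[:accepted]
        # 3) Always emit one token from the target model, so progress is
        # guaranteed even when every draft token is rejected.
        tokens.append(target_next_token(tokens))
        generated += accepted + 1
    return tokens
```

Each accepted draft token is a token the target model did not have to generate step by step, which is the source of the multi-token-per-step speedup.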
Productionizing ReDrafter with NVIDIA TensorRT-LLM
While this research showed strong results, its larger impact comes from being applied in production to accelerate LLM inference. To operationalize this advancement on NVIDIA GPUs, we collaborated with NVIDIA to integrate ReDrafter into the NVIDIA TensorRT-LLM inference acceleration framework.
Although TensorRT-LLM supports a number of open source LLMs and the Medusa speculative decoding method, ReDrafter's beam search and dynamic tree attention algorithms rely on operators that had not been used in previous applications. To enable the integration of ReDrafter, NVIDIA added new operators or exposed existing ones, considerably expanding TensorRT-LLM's ability to accommodate sophisticated models and decoding methods. ML developers using NVIDIA GPUs can now easily benefit from ReDrafter's accelerated token generation for their production LLM applications with TensorRT-LLM.
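As an illustration of the kind of operator involved, consider dynamic tree attention: drafted candidates form a tree in which each token should attend only to its own ancestors, so that many candidate continuations can be scored in a single forward pass. The sketch below is an illustrative assumption, not the TensorRT-LLM operator or ReDrafter's implementation; it builds such a mask from a hypothetical parent-index representation of the candidate tree.

```python
import numpy as np

# Illustrative sketch of a tree-attention mask (not the actual
# TensorRT-LLM operator). Draft tokens form a tree: parents[i] is the
# index of token i's parent, or -1 for a root attached to the committed
# prefix. Token i may attend only to itself and its ancestors, which
# lets every branch of the candidate tree be verified in one pass.
def tree_attention_mask(parents: list[int]) -> np.ndarray:
    n = len(parents)
    mask = np.zeros((n, n), dtype=bool)
    for i in range(n):
        j = i
        while j != -1:        # walk up to the root, marking ancestors
            mask[i, j] = True
            j = parents[j]
    return mask

# Example: two candidate branches sharing a common first draft token.
#   token 0 -> token 1 -> token 2   (branch A)
#   token 0 -> token 3              (branch B)
print(tree_attention_mask([-1, 0, 1, 0]).astype(int))
```

Because tokens on different branches are masked off from one another, the two branches above can be verified in a single batched forward pass instead of one pass per branch.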
Using the NVIDIA TensorRT-LLM inference acceleration framework with ReDrafter, we benchmarked a tens-of-billions parameter production model on NVIDIA GPUs and measured a 2.7x speed-up in generated tokens per second for greedy decoding (see Figure 1). These benchmark results indicate this technology could significantly reduce the latency users may experience, while also using fewer GPUs and consuming less power.
For more information, see this post on the NVIDIA Developer Blog.
Conclusion
LLMs are increasingly being used to power production applications, and improving inference efficiency can both impact computational costs and reduce latency for users. With ReDrafter's novel approach to speculative decoding integrated into the NVIDIA TensorRT-LLM framework, developers can now benefit from faster token generation on NVIDIA GPUs for their production LLM applications.
Acknowledgment
Many people contributed to this project, including Aonan Zhang, Xuanyu Zhang, Yunfei Cheng, Chong Wang, Yi Wang, Abhishek Udupa, Dhaval Doshi, and collaborators at NVIDIA.