Running AI involves many costs, but one of the most fundamental is providing the GPU power needed for inference.
Until now, organizations that needed to deliver AI inference had to run long-running cloud instances or provision hardware on-premises. Today, Google Cloud is previewing a new approach that could revolutionize the landscape for deploying AI applications: its serverless Cloud Run service now integrates Nvidia L4 GPUs, enabling organizations to run serverless inference.
The advantage of serverless is that the service runs only when needed and users pay only for what they use, in contrast to typical cloud instances, which run as an always-available service regardless of demand. In this case, GPUs for inference are started and used only when they are needed.
Serverless inference can be deployed with Nvidia NIM as well as other frameworks such as vLLM, PyTorch and Ollama. The addition of Nvidia L4 GPUs is currently in preview.
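To make that concrete, here is a minimal sketch of what such a container's entrypoint might look like, using vLLM's Python API behind a small FastAPI app. The model name, route and parameters are illustrative assumptions rather than a Google reference configuration; the only Cloud Run-specific detail is listening on the port supplied via the PORT environment variable.

```python
# Minimal sketch of a GPU inference container entrypoint (assumed setup, not
# Google's reference configuration): vLLM loads an open model once at startup,
# FastAPI exposes a simple /generate route, and the server binds to $PORT as
# Cloud Run expects.
import os

from fastapi import FastAPI
from pydantic import BaseModel
from vllm import LLM, SamplingParams

app = FastAPI()

# Load model weights once so every request reuses the GPU-resident model.
llm = LLM(model="google/gemma-2b-it")  # any open model that fits in 24GB VRAM


class Prompt(BaseModel):
    text: str
    max_tokens: int = 256


@app.post("/generate")
def generate(prompt: Prompt):
    params = SamplingParams(temperature=0.7, max_tokens=prompt.max_tokens)
    outputs = llm.generate([prompt.text], params)
    return {"completion": outputs[0].outputs[0].text}


if __name__ == "__main__":
    import uvicorn

    # Cloud Run injects the port the container must listen on via PORT.
    uvicorn.run(app, host="0.0.0.0", port=int(os.environ.get("PORT", 8080)))
```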
“As more customers adopt AI, they want to run AI workloads, such as inference, on a platform they’re familiar with,” Sagar Randive, product manager for Google Cloud Serverless, told VentureBeat. “Cloud Run users appreciate the efficiency and flexibility of the platform and have been asking Google to add GPU support.”
Bringing AI to a Serverless World
Google’s fully managed serverless platform, Cloud Run, has become a favorite among developers because it simplifies the deployment and management of containers. But the growing demand for AI workloads, especially those requiring real-time processing, has highlighted the need for more robust compute resources.
Integrated GPU support enables a broader range of use cases for Cloud Run developers, including:
- Create highly responsive custom chatbots and on-the-fly document summarization tools with real-time inference using lightweight open models like Gemma 2B/7B and Llama 3 (8B) (see the sketch after this list).
- Deliver custom, fine-tuned generative AI models, such as brand-specific image generation applications that scale on demand.
- Accelerate compute-intensive services like image recognition, video transcoding, and 3D rendering, and scale to zero when not in use.
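For the chatbot and summarization use case, the calling side can stay simple: send a prompt to the deployed service and let Cloud Run scale GPU instances up from zero as traffic arrives. The URL and route below are hypothetical placeholders that match the server sketch above.

```python
# Sketch of a client calling a (hypothetical) Cloud Run inference service for
# on-the-fly document summarization. The URL, route and prompt format are
# assumptions matching the earlier server sketch.
import requests

SERVICE_URL = "https://my-inference-service-example.a.run.app"  # hypothetical


def summarize(document: str) -> str:
    prompt = f"Summarize the following document in three sentences:\n\n{document}"
    resp = requests.post(
        f"{SERVICE_URL}/generate",
        json={"text": prompt, "max_tokens": 200},
        timeout=120,  # leave headroom for a possible cold start
    )
    resp.raise_for_status()
    return resp.json()["completion"]


if __name__ == "__main__":
    print(summarize("Cloud Run now offers Nvidia L4 GPUs for serverless inference..."))
```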
Serverless performance scales to meet your AI inference needs
A common concern with serverless is performance: after all, if a service isn't running all the time, the first request has to wait for the service to spin up, a delay known as a cold start.
Google Cloud aims to allay these performance concerns with impressive metrics for its new GPU-enabled Cloud Run instances. According to Google, cold-start times range from 11 to 35 seconds across a range of models, including Gemma 2B, Gemma 2 9B, Llama 2 7B/13B and Llama 3.1 8B, demonstrating the responsiveness of the platform.
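Those figures are Google's; a team evaluating the preview could sanity-check them for its own model with a rough timing script like the sketch below, where the service URL is again a hypothetical placeholder.

```python
# Rough sketch for observing cold-start versus warm latency against a
# scale-to-zero service. The URL is hypothetical; the 11-35 second figures
# quoted above are Google's numbers, not something this script asserts.
import time

import requests

SERVICE_URL = "https://my-inference-service-example.a.run.app/generate"  # hypothetical


def timed_request(prompt: str) -> float:
    start = time.perf_counter()
    requests.post(SERVICE_URL, json={"text": prompt, "max_tokens": 32}, timeout=180)
    return time.perf_counter() - start


if __name__ == "__main__":
    # The first call after the service has scaled to zero includes the cold start.
    print(f"first request:  {timed_request('ping'):.1f}s")
    # Immediately following calls should land on a warm instance.
    for i in range(3):
        print(f"warm request {i + 1}: {timed_request('ping'):.1f}s")
```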
Each Cloud Run instance can be equipped with one Nvidia L4 GPU with up to 24GB of VRAM, enough for many common AI inference tasks. Google Cloud also aims to be model agnostic in terms of what can be run, though it does offer some cautious guidance.
“There are no restrictions on which LLMs users can run, so they can run any model they like,” Randive said, “but for best performance we recommend running models under 13B parameters.”
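A back-of-the-envelope calculation suggests why roughly 13B parameters is a sensible ceiling for a single 24GB L4: the weights alone take about two bytes per parameter at 16-bit precision, before counting the KV cache and activations. This is a rough sizing sketch, not Google's stated rationale.

```python
# Back-of-the-envelope VRAM estimate: weight footprint is roughly
# parameters x bytes-per-parameter, before KV cache and activations.
# Rough illustration only, not Google's sizing guidance.

GPU_VRAM_GB = 24  # one Nvidia L4


def weight_footprint_gb(params_billions: float, bytes_per_param: float) -> float:
    return params_billions * bytes_per_param


for name, params in [("Gemma 2B", 2), ("Llama 3.1 8B", 8), ("Llama 2 13B", 13)]:
    fp16 = weight_footprint_gb(params, 2)  # 16-bit weights
    int8 = weight_footprint_gb(params, 1)  # 8-bit quantized weights
    print(
        f"{name}: ~{fp16:.0f}GB at FP16, ~{int8:.0f}GB at 8-bit "
        f"(GPU has {GPU_VRAM_GB}GB, minus KV cache and activations)"
    )
```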
Is it cheaper to run serverless AI inference?
A major benefit of serverless is better utilization of hardware, which should also reduce costs.
Whether it is actually cheaper for an organization to run AI inference on a serverless platform or on a long-running server is a somewhat nuanced question.
“This will vary depending on your application and expected traffic patterns,” Randive said. “We plan to update our pricing calculator to reflect Cloud Run’s new GPU pricing, at which point customers will be able to compare their total operational costs across different platforms.”
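In the meantime, the basic break-even logic is easy to sketch: serverless tends to win at low utilization because idle hours cost nothing, while an always-on instance wins once the GPU is busy most of the month. The rates below are placeholders, not Google's published pricing.

```python
# Illustrative break-even sketch: pay-per-use GPU versus an always-on GPU
# instance. All prices are hypothetical placeholders; substitute real rates
# from the pricing calculator once published.

HOURS_PER_MONTH = 730


def monthly_cost_serverless(busy_hours: float, price_per_gpu_hour: float) -> float:
    # Pay only for the hours the GPU is actually serving traffic.
    return busy_hours * price_per_gpu_hour


def monthly_cost_always_on(price_per_gpu_hour: float) -> float:
    # Pay for every hour, whether or not requests arrive.
    return HOURS_PER_MONTH * price_per_gpu_hour


if __name__ == "__main__":
    serverless_rate = 1.00  # hypothetical $/GPU-hour (on-demand, scale-to-zero)
    always_on_rate = 0.60   # hypothetical $/GPU-hour for a long-running instance
    for utilization in (0.05, 0.25, 0.50, 0.80):
        busy = HOURS_PER_MONTH * utilization
        s = monthly_cost_serverless(busy, serverless_rate)
        a = monthly_cost_always_on(always_on_rate)
        print(f"{utilization:>4.0%} utilization: serverless ${s:,.0f}/mo vs always-on ${a:,.0f}/mo")
```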