Running AI involves many costs, but one of the most fundamental is providing the GPU power needed for inference.
Until now, organizations that needed to deliver AI inference had to run long-running cloud instances or provision hardware on-premises. Today, Google Cloud is previewing a new approach that could revolutionize the landscape for deploying AI applications: its serverless Cloud Run service now integrates Nvidia L4 GPUs, enabling organizations to run serverless inference.
The advantage of serverless is that the service runs only when needed and users pay only for what they use, in contrast to typical cloud instances, which run as an always-available service regardless of demand. In this case, GPUs for inference are started and used only when they are needed.
Serverless inference can be deployed with Nvidia NIM as well as other frameworks such as vLLM, PyTorch and Ollama. The addition of Nvidia L4 GPUs is currently in preview.
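To make that concrete, here is a minimal sketch of what such a container's entrypoint might look like, using vLLM's Python API behind a small FastAPI app. The model name, route and parameters are illustrative assumptions rather than a Google reference configuration; the only Cloud Run-specific detail is listening on the port supplied via the PORT environment variable.

```python
# Minimal sketch of a GPU inference container entrypoint (assumed setup, not
# Google's reference configuration): vLLM loads an open model once at startup,
# FastAPI exposes a simple /generate route, and the server binds to $PORT as
# Cloud Run expects.
import os

from fastapi import FastAPI
from pydantic import BaseModel
from vllm import LLM, SamplingParams

app = FastAPI()

# Load model weights once so every request reuses the GPU-resident model.
llm = LLM(model="google/gemma-2b-it")  # any open model that fits in 24GB VRAM


class Prompt(BaseModel):
    text: str
    max_tokens: int = 256


@app.post("/generate")
def generate(prompt: Prompt):
    params = SamplingParams(temperature=0.7, max_tokens=prompt.max_tokens)
    outputs = llm.generate([prompt.text], params)
    return {"completion": outputs[0].outputs[0].text}


if __name__ == "__main__":
    import uvicorn

    # Cloud Run injects the port the container must listen on via PORT.
    uvicorn.run(app, host="0.0.0.0", port=int(os.environ.get("PORT", 8080)))
```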
“As more customers adopt AI, they want to run AI workloads, such as inference, on a platform they’re familiar with,” Sagar Randive, product manager for Google Cloud Serverless, told VentureBeat. “Cloud Run users appreciate the efficiency and flexibility of the platform and have been asking Google to add GPU support.”
Bringing AI to a Serverless World
Google’s fully managed serverless platform, Cloud Run, has become a favorite among developers because it simplifies the deployment and management of containers. But the growing demand for AI workloads, especially those requiring real-time processing, has highlighted the need for more robust compute resources.
Integrated GPU support enables a broader range of use cases for Cloud Run developers, including:
- Create highly responsive custom chatbots and on-the-fly document summarization tools with real-time inference using lightweight open models like Gemma 2B/7B and Llama 3 (8B) (see the sketch after this list).
- Deliver custom, fine-tuned generative AI models, such as brand-specific image generation applications that scale on demand.
- Accelerate compute-intensive services like image recognition, video transcoding, and 3D rendering, and scale to zero when not in use.
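For the chatbot and summarization use case, the calling side can stay simple: send a prompt to the deployed service and let Cloud Run scale GPU instances up from zero as traffic arrives. The URL and route below are hypothetical placeholders that match the server sketch above.

```python
# Sketch of a client calling a (hypothetical) Cloud Run inference service for
# on-the-fly document summarization. The URL, route and prompt format are
# assumptions matching the earlier server sketch.
import requests

SERVICE_URL = "https://my-inference-service-example.a.run.app"  # hypothetical


def summarize(document: str) -> str:
    prompt = f"Summarize the following document in three sentences:\n\n{document}"
    resp = requests.post(
        f"{SERVICE_URL}/generate",
        json={"text": prompt, "max_tokens": 200},
        timeout=120,  # leave headroom for a possible cold start
    )
    resp.raise_for_status()
    return resp.json()["completion"]


if __name__ == "__main__":
    print(summarize("Cloud Run now offers Nvidia L4 GPUs for serverless inference..."))
```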
Serverless performance scales to meet your AI inference needs
A common concern with serverless is performance: after all, if a service isn't running all the time, the first request has to wait for the service to spin up, a delay known as a cold start.
Google Cloud aims to allay these performance concerns with impressive metrics for its new GPU-enabled Cloud Run instances. According to Google, cold-start times range from 11 to 35 seconds across a range of models, including Gemma 2B, Gemma 2 9B, Llama 2 7B/13B and Llama 3.1 8B, demonstrating the responsiveness of the platform.
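Those figures are Google's; a team evaluating the preview could sanity-check them for its own model with a rough timing script like the sketch below, where the service URL is again a hypothetical placeholder.

```python
# Rough sketch for observing cold-start versus warm latency against a
# scale-to-zero service. The URL is hypothetical; the 11-35 second figures
# quoted above are Google's numbers, not something this script asserts.
import time

import requests

SERVICE_URL = "https://my-inference-service-example.a.run.app/generate"  # hypothetical


def timed_request(prompt: str) -> float:
    start = time.perf_counter()
    requests.post(SERVICE_URL, json={"text": prompt, "max_tokens": 32}, timeout=180)
    return time.perf_counter() - start


if __name__ == "__main__":
    # The first call after the service has scaled to zero includes the cold start.
    print(f"first request:  {timed_request('ping'):.1f}s")
    # Immediately following calls should land on a warm instance.
    for i in range(3):
        print(f"warm request {i + 1}: {timed_request('ping'):.1f}s")
```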
Each Cloud Run instance can be equipped with one Nvidia L4 GPU with up to 24GB of VRAM, enough for many common AI inference tasks. Google Cloud also aims to be model agnostic in terms of what can be run, though it does offer some cautious guidance.
“There are no restrictions on which LLMs users can run, so they can run any model they like,” Randive said, “but for best performance we recommend running models under 13B parameters.”
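A back-of-the-envelope calculation suggests why roughly 13B parameters is a sensible ceiling for a single 24GB L4: the weights alone take about two bytes per parameter at 16-bit precision, before counting the KV cache and activations. This is a rough sizing sketch, not Google's stated rationale.

```python
# Back-of-the-envelope VRAM estimate: weight footprint is roughly
# parameters x bytes-per-parameter, before KV cache and activations.
# Rough illustration only, not Google's sizing guidance.

GPU_VRAM_GB = 24  # one Nvidia L4


def weight_footprint_gb(params_billions: float, bytes_per_param: float) -> float:
    return params_billions * bytes_per_param


for name, params in [("Gemma 2B", 2), ("Llama 3.1 8B", 8), ("Llama 2 13B", 13)]:
    fp16 = weight_footprint_gb(params, 2)  # 16-bit weights
    int8 = weight_footprint_gb(params, 1)  # 8-bit quantized weights
    print(
        f"{name}: ~{fp16:.0f}GB at FP16, ~{int8:.0f}GB at 8-bit "
        f"(GPU has {GPU_VRAM_GB}GB, minus KV cache and activations)"
    )
```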
Is it cheaper to run serverless AI inference?
A major benefit of serverless is better utilization of hardware, which should also reduce costs.
Whether it is actually cheaper for an organization to run AI inference on a serverless platform or on a long-running server is a somewhat nuanced question.
“This will vary depending on your application and expected traffic patterns,” Randive said. “We plan to update our pricing calculator to reflect Cloud Run’s new GPU pricing, at which point customers will be able to compare their total operational costs across different platforms.”
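In the meantime, the basic break-even logic is easy to sketch: serverless tends to win at low utilization because idle hours cost nothing, while an always-on instance wins once the GPU is busy most of the month. The rates below are placeholders, not Google's published pricing.

```python
# Illustrative break-even sketch: pay-per-use GPU versus an always-on GPU
# instance. All prices are hypothetical placeholders; substitute real rates
# from the pricing calculator once published.

HOURS_PER_MONTH = 730


def monthly_cost_serverless(busy_hours: float, price_per_gpu_hour: float) -> float:
    # Pay only for the hours the GPU is actually serving traffic.
    return busy_hours * price_per_gpu_hour


def monthly_cost_always_on(price_per_gpu_hour: float) -> float:
    # Pay for every hour, whether or not requests arrive.
    return HOURS_PER_MONTH * price_per_gpu_hour


if __name__ == "__main__":
    serverless_rate = 1.00  # hypothetical $/GPU-hour (on-demand, scale-to-zero)
    always_on_rate = 0.60   # hypothetical $/GPU-hour for a long-running instance
    for utilization in (0.05, 0.25, 0.50, 0.80):
        busy = HOURS_PER_MONTH * utilization
        s = monthly_cost_serverless(busy, serverless_rate)
        a = monthly_cost_always_on(always_on_rate)
        print(f"{utilization:>4.0%} utilization: serverless ${s:,.0f}/mo vs always-on ${a:,.0f}/mo")
```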