Google Cloud Run embraces Nvidia GPUs for serverless AI inference


There are a number of costs associated with running AI, and one of the most fundamental is providing the GPU power needed for inference.

To date, organizations that need to provide AI inference have had to run long-running cloud instances or provision hardware on-premises. Today, Google Cloud is previewing a new approach, one that could reshape the landscape of AI application deployment. The Google Cloud Run serverless offering is now integrating Nvidia L4 GPUs, effectively enabling organizations to run serverless inference.

The promise of serverless is that a service only runs when needed and users only pay for what is used. That’s in contrast to a typical cloud instance, which runs for a set amount of time as a persistent service and is always available. A serverless service, in this case a GPU for inference, fires up and is used only when needed.

Serverless inference can be deployed as an Nvidia NIM, as well as with other frameworks such as vLLM, PyTorch and Ollama. The addition of Nvidia L4 GPUs is currently in preview.
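
To make that concrete, here is a minimal sketch of calling such a deployment from a client, assuming a vLLM server has already been pushed to Cloud Run. The service URL and model name are hypothetical placeholders; the request shape follows vLLM's OpenAI-compatible HTTP API.

```python
import requests

# Hypothetical URL of a Cloud Run service running vLLM's
# OpenAI-compatible server; substitute your own deployment.
SERVICE_URL = "https://my-vllm-service-abc123-uc.a.run.app"

def chat(prompt: str) -> str:
    """Send one chat-completion request to the serverless endpoint."""
    resp = requests.post(
        f"{SERVICE_URL}/v1/chat/completions",
        json={
            "model": "meta-llama/Meta-Llama-3.1-8B-Instruct",  # example model
            "messages": [{"role": "user", "content": prompt}],
            "max_tokens": 256,
        },
        timeout=120,  # generous timeout leaves headroom for a cold start
    )
    resp.raise_for_status()
    return resp.json()["choices"][0]["message"]["content"]

print(chat("Explain serverless GPU inference in one paragraph."))
```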

“As customers increasingly adopt AI, they are seeking to run AI workloads like inference on platforms they are familiar with and start up on,” Sagar Randive, Product Manager, Google Cloud Serverless, told VentureBeat. “Cloud Run users prefer the efficiency and flexibility of the platform and have been asking for Google to add GPU support.”

Bringing AI into the serverless world

Cloud Run, Google’s fully managed serverless platform, has been popular with developers thanks to its ability to simplify container deployment and management. However, the escalating demands of AI workloads, particularly those requiring real-time processing, have highlighted the need for more robust computational resources.

The integration of GPU support opens up a wide array of use cases for Cloud Run developers including:

  • Real-time inference with lightweight open models such as Gemma 2B/7B or Llama 3 (8B), enabling the creation of responsive custom chatbots and on-the-fly document summarization tools (a minimal sketch follows this list).
  • Serving custom fine-tuned generative AI models, including brand-specific image generation applications that can scale based on demand.
  • Accelerating compute-intensive services like image recognition, video transcoding, and 3D rendering, with the ability to scale to zero when not in use.
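
For the chatbot and summarization case in the first bullet, a client sketch might look like the following, assuming an Ollama container serving Gemma 2B has been deployed to Cloud Run. The service URL is a hypothetical placeholder; the /api/generate request shape is Ollama's standard API.

```python
import requests

# Hypothetical Cloud Run URL for a container running Ollama with Gemma 2B.
OLLAMA_URL = "https://my-ollama-service-abc123-uc.a.run.app"

def summarize(text: str) -> str:
    """Run one non-streaming generation against the Ollama API."""
    resp = requests.post(
        f"{OLLAMA_URL}/api/generate",
        json={
            "model": "gemma:2b",
            "prompt": f"Summarize the following in three bullet points:\n{text}",
            "stream": False,
        },
        timeout=120,
    )
    resp.raise_for_status()
    return resp.json()["response"]

print(summarize("Cloud Run now supports Nvidia L4 GPUs in preview..."))
```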

Serverless performance can scale to meet AI inference needs

A common concern with serverless is performance. After all, if a service is not always running, there is often a performance penalty just to get it going from a so-called cold start.

Google Cloud is aiming to allay any such performance fears, citing some impressive metrics for the new GPU-enabled Cloud Run instances. According to Google, cold start times range from 11 to 35 seconds for various models, including Gemma 2B, Gemma 2 9B, Llama 2 7B/13B and Llama 3.1 8B, showcasing the platform’s responsiveness.
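
Cold-start behavior is straightforward to observe for yourself. The sketch below times a first request against an idle service, which has to spin up a GPU-backed instance, and then a warm follow-up; it reuses the hypothetical vLLM endpoint from the earlier example.

```python
import time
import requests

SERVICE_URL = "https://my-vllm-service-abc123-uc.a.run.app"  # hypothetical

def timed_request(prompt: str) -> float:
    """Return wall-clock seconds for one short chat-completion round trip."""
    start = time.monotonic()
    resp = requests.post(
        f"{SERVICE_URL}/v1/chat/completions",
        json={
            "model": "meta-llama/Meta-Llama-3.1-8B-Instruct",  # example model
            "messages": [{"role": "user", "content": prompt}],
            "max_tokens": 16,
        },
        timeout=300,  # a cold start can add tens of seconds
    )
    resp.raise_for_status()
    return time.monotonic() - start

cold = timed_request("ping")  # may include instance startup and model load
warm = timed_request("ping")  # instance is already running
print(f"cold: {cold:.1f}s, warm: {warm:.1f}s")
```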

Each Cloud Run instance can be equipped with one Nvidia L4 GPU, with up to 24GB of VRAM, providing a solid level of resources for many common AI inference tasks. Google Cloud is also aiming to be model-agnostic in terms of what can run, though it is hedging its bets somewhat.

“We do not restrict any LLMs, users can run any models they want,” Randive said. “However for best performance, it is recommended that they run models under 13B parameters.”

Will running serverless AI inference be cheaper?

A key promise of serverless is better utilization of hardware, which should also translate into lower costs.

Whether it is actually cheaper for an organization to provision AI inference as serverless rather than as a long-running server is a somewhat nuanced question.

“This depends on the application and the traffic pattern expected,” Randive said. “We will be updating our pricing calculator to reflect the new GPU prices with Cloud Run at which point customers will be able to compare their total cost of operations on various platforms.”


