Azure Container Apps Serverless GPUs, accelerated by NVIDIA, are now generally available. Serverless GPUs let you seamlessly run AI workloads on demand with automatic scaling, optimized cold start, per-second billing, and scale to zero when not in use, reducing the operational overhead of real-time custom model inferencing and other GPU-accelerated workloads.
Serverless GPUs accelerate AI development by letting teams focus on their core AI code rather than on managing infrastructure when using GPUs. They provide an excellent middle-layer option between Azure AI Model Catalog's serverless APIs and hosting custom models on managed compute. Customers can build their own serverless API endpoints for inferencing AI models, including custom models, provision on-demand GPU-powered Jupyter Notebooks, or run other compute-intensive AI workloads that are ephemeral in nature. Serverless GPUs also provide full data governance: your data never leaves the container boundary, while you still get a managed, serverless platform from which to build your applications.
This GA release of Serverless GPUs also adds support for NVIDIA NIM microservices. NVIDIA NIM™, part of NVIDIA AI Enterprise, is a set of easy-to-use microservices designed for secure, reliable deployment of high-performance AI model inferencing at scale. Supporting a wide range of AI models, including open-source community and NVIDIA AI Foundation models, NIM delivers seamless, scalable AI inferencing through industry-standard APIs.
Create an Azure Container App with Serverless GPUs and run Stable Diffusion; a CLI sketch follows the key benefits list below.
Key benefits of serverless GPUs
- Scale-to-zero GPUs: Support for serverless scaling of NVIDIA A100 and T4 GPUs.
- Per-second billing: Pay only for the GPU compute you use.
- Built-in data governance: Your data never leaves the container boundary.
- Flexible compute options: Choose between NVIDIA A100 and T4 GPUs.
- Middle-layer for AI development: Bring your own model on a managed, serverless compute platform and easily run your AI applications alongside your existing apps.
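As a concrete starting point, here is a minimal sketch using the Azure CLI that creates an environment, adds a serverless T4 GPU workload profile, and deploys a GPU-backed inference app. The resource names, the image, and the port are placeholders for illustration, not values from the product documentation:

```bash
# Minimal sketch (Azure CLI); all names below are placeholders.

# Create a Container Apps environment in a supported region.
az containerapp env create \
  --name my-gpu-env \
  --resource-group my-rg \
  --location westus3

# Add a serverless (consumption) NVIDIA T4 GPU workload profile.
az containerapp env workload-profile add \
  --name my-gpu-env \
  --resource-group my-rg \
  --workload-profile-name gpu-t4 \
  --workload-profile-type Consumption-GPU-NC8as-T4

# Deploy a hypothetical Stable Diffusion inference image onto the GPU
# profile, scaling to zero when idle. The image and port depend on the
# inference server you package.
az containerapp create \
  --name stable-diffusion-app \
  --resource-group my-rg \
  --environment my-gpu-env \
  --workload-profile-name gpu-t4 \
  --image myregistry.azurecr.io/stable-diffusion-inference:latest \
  --ingress external \
  --target-port 8000 \
  --min-replicas 0 \
  --max-replicas 1
```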
Scenarios
Our customers have been running a wide range of workloads on serverless GPUs. Below are some common use cases.
NVIDIA T4
- Real-time and batch inferencing: Using custom open-source models with fast startup times, automatic scaling, and per-second billing, serverless GPUs are ideal for dynamic applications that don't already have a serverless API in the model catalog (see the example call after this list).
NVIDIA A100
- Compute intensive machine learning scenarios: Significantly speed up applications that implement fine-tuned custom generative AI models, deep learning, or neural networks.
- High-performance computing (HPC) and data analytics: Applications that require complex calculations or simulations, such as scientific computing and financial modeling, as well as accelerated processing and analysis of massive datasets.
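As a minimal illustration of the real-time inferencing scenario, an app deployed like the earlier sketch can be called like any other Container Apps endpoint. The /generate route and JSON payload below are hypothetical and depend entirely on the inference server in your image:

```bash
# Look up the app's ingress FQDN (app and resource group names from the
# earlier sketch), then call a hypothetical /generate inference route.
APP_FQDN=$(az containerapp show \
  --name stable-diffusion-app \
  --resource-group my-rg \
  --query properties.configuration.ingress.fqdn -o tsv)

curl -s "https://${APP_FQDN}/generate" \
  -H "Content-Type: application/json" \
  -d '{"prompt": "a watercolor lighthouse at dusk"}'
```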
Serverless GPUs with NVIDIA NIM
Serverless GPUs now support NVIDIA NIM microservices, which simplify and accelerate the development of AI applications and agentic AI workflows with pre-packaged, scalable, and performance-tuned models that can be deployed as secure inference endpoints on Azure Container Apps.
To get started with NVIDIA NIM, go to NVIDIA's API catalog (Try NVIDIA NIM APIs) and select a NIM offered with the 'Run Anywhere' type. You will need to set your NGC_API_KEY as an environment variable when deploying to Azure Container Apps. For a full set of instructions on adding a NIM to your container app, follow the instructions here.
(Note: Each NIM has specific hardware requirements. Azure Container Apps serverless GPUs support A100 and T4 GPUs, so make sure the NIM you select is supported by that hardware.)
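As an illustration, a NIM deploys like any other container image, with your NGC API key passed both as a registry credential and as an environment variable. The sketch below is assumption-laden: the Llama image tag, app name, and an A100 profile named gpu-a100 (added as shown under "Get started" below) are placeholders, and NGC_API_KEY is assumed to be set in your shell:

```bash
# Sketch: deploy a NIM from nvcr.io onto a serverless A100 workload profile.
# Assumes NGC_API_KEY is exported and a 'gpu-a100' profile
# (type Consumption-GPU-NC24-A100) already exists on the environment.
az containerapp create \
  --name llama-nim \
  --resource-group my-rg \
  --environment my-gpu-env \
  --workload-profile-name gpu-a100 \
  --image nvcr.io/nim/meta/llama-3.1-8b-instruct:latest \
  --registry-server nvcr.io \
  --registry-username '$oauthtoken' \
  --registry-password "$NGC_API_KEY" \
  --secrets "ngc-api-key=$NGC_API_KEY" \
  --env-vars NGC_API_KEY=secretref:ngc-api-key \
  --ingress external \
  --target-port 8000 \
  --min-replicas 0
```

NIMs typically expose an OpenAI-compatible API on port 8000, so existing OpenAI client code can usually be pointed at the app's FQDN with little change.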
Quota changes for GA
With GA, we are introducing default GPU quotas for enterprise agreement and pay-as-you-go customers. All enterprise agreement customers receive quota for both A100 and T4 GPUs.
Serverless GPUs are supported in West US 3, Australia East, and Sweden Central.
Get started with serverless GPUs
From the portal, you can enable GPUs for your Consumption app on the Container tab when creating your Container App or Container App Job.
Note: To achieve the best performance with serverless GPUs, use an Azure Container Registry (ACR) with artifact streaming enabled for your image tag. Follow the steps here to enable artifact streaming on your ACR; a CLI sketch follows.
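As a sketch, assuming a Premium-tier registry, artifact streaming can be enabled from the CLI; the registry and repository names below are placeholders:

```bash
# Enable automatic artifact streaming for new images pushed to a repository
# (requires a Premium ACR); names are placeholders.
az acr artifact-streaming update \
  --name myregistry \
  --repository stable-diffusion-inference \
  --enable-streaming true

# Or generate a streaming artifact for one existing tag on demand.
az acr artifact-streaming create \
  --name myregistry \
  --image stable-diffusion-inference:latest
```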
To learn more about getting started with serverless GPUs, see our quickstart.
You can also add a new consumption GPU workload profile to an existing Container App environment through the workload profiles UX in the portal or through the CLI commands for managing workload profiles, as sketched below.
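For example, the commands below (with placeholder names) list the workload profile types supported in a region and add a consumption A100 profile to an existing environment:

```bash
# List the workload profile types (including consumption GPU profiles)
# supported in a region.
az containerapp env workload-profile list-supported --location westus3

# Add a serverless A100 profile to an existing environment.
az containerapp env workload-profile add \
  --name my-existing-env \
  --resource-group my-rg \
  --workload-profile-name gpu-a100 \
  --workload-profile-type Consumption-GPU-NC24-A100
```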
Learn more about serverless GPUs and NIMs
With serverless GPUs, Azure Container Apps now simplifies the development of your AI applications by providing scale-to-zero compute, pay-as-you-go pricing, reduced infrastructure management, and more. To learn more, visit: