As AI becomes more deeply integrated into applications, managing and governing Large Language Models (LLMs) is more important than ever. Today, we’re excited to announce several major updates to AI Gateway in Azure API Management, including the general availability of LLM policies, expanded real-time API support, new integrations for semantic caching and content safety, and a streamlined UI experience to make it even easier to get started. Plus, you can now opt in to early access for the latest AI Gateway features. Let’s dive into what’s new!
General Availability of LLM Policies
AI Gateway has already proven essential in addressing the unique challenges of managing and governing access to Large Language Models (LLMs), such as Azure OpenAI, specifically around tracking and controlling token usage. Last year, we expanded AI Gateway’s capabilities by introducing a preview of new LLM-focused policies for models available through the Azure AI Model Inference API.
Today, we’re excited to announce the general availability of these LLM policies: llm-token-limit, llm-emit-token-metric, llm-content-safety, and the semantic caching policies. With this release, you can now apply these policies not only to models listed in the Azure AI Foundry model catalog, but also to OpenAI-compatible models served through third-party inference providers, or even to models hosted in your own infrastructure.
These capabilities give you greater flexibility, improved governance, and deeper insights across your LLM deployments, regardless of where your models are running.
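To make this concrete, here’s a minimal sketch of a policy document that combines token limiting with token metrics in the inbound section. The counter key, limit value, metric namespace, and dimensions below are illustrative choices, not required values; adapt them to your own API and monitoring setup:

```xml
<policies>
    <inbound>
        <base />
        <!-- Cap token consumption per subscription; callers that exceed
             the limit receive a 429 until the window resets.
             The limit value here is illustrative. -->
        <llm-token-limit counter-key="@(context.Subscription.Id)"
                         tokens-per-minute="5000"
                         estimate-prompt-tokens="true"
                         remaining-tokens-header-name="x-remaining-tokens" />
        <!-- Emit token-usage metrics, split by custom dimensions,
             for monitoring and chargeback. -->
        <llm-emit-token-metric namespace="llm-usage">
            <dimension name="Subscription ID" value="@(context.Subscription.Id)" />
            <dimension name="API ID" value="@(context.Api.Id)" />
        </llm-emit-token-metric>
    </inbound>
    <backend>
        <base />
    </backend>
    <outbound>
        <base />
    </outbound>
    <on-error>
        <base />
    </on-error>
</policies>
```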
Integration with Azure AI Content Safety
As LLMs are increasingly integrated into customer-facing applications, ensuring content safety has become essential for maintaining ethical and legal standards and for safeguarding users from harmful, offensive, or misleading content. With the integration of Azure AI Content Safety into AI Gateway, you can now automatically moderate all incoming user requests by configuring the llm-content-safety policy with specific sensitive categories. Learn more about the llm-content-safety policy.
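As a rough sketch, the configuration could look like the following. The backend-id value is a placeholder for a backend entity you’ve created that points to your Azure AI Content Safety resource, and the categories and severity thresholds shown are illustrative, not prescriptive:

```xml
<inbound>
    <base />
    <!-- Screen each incoming prompt with Azure AI Content Safety before it
         reaches the model; flagged requests are blocked at the gateway,
         so unsafe prompts never consume model tokens. -->
    <llm-content-safety backend-id="content-safety-backend" shield-prompt="true">
        <categories output-type="EightSeverityLevels">
            <category name="Hate" threshold="4" />
            <category name="SelfHarm" threshold="4" />
            <category name="Sexual" threshold="4" />
            <category name="Violence" threshold="4" />
        </categories>
    </llm-content-safety>
</inbound>
```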
Track and limit token consumption for the Azure OpenAI Realtime API
The Azure OpenAI GPT-4o Realtime API for speech and audio is part of the GPT-4o model family and supports low-latency, "speech in, speech out" conversational interactions. That makes it a great fit for use cases involving live interactions between a user and a model, such as customer support agents, voice assistants, and real-time translators.
We’re extending the llm-token-limit and llm-emit-token-metric policies to support the WebSocket-based Realtime API. This enhancement allows you to safely provide access to the GPT-4o Realtime API for your developers while maintaining full visibility and control over token consumption across your organization.
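For illustration, the same policies used for request/response LLM APIs can be attached to an API fronting the Realtime endpoint. This is a sketch with illustrative values, assuming you’ve onboarded the WebSocket API into API Management; the exact policy scope and placement for WebSocket APIs may differ in your setup:

```xml
<inbound>
    <base />
    <!-- Budget Realtime API token usage per subscription, just as for
         request/response completions APIs. The limit is illustrative. -->
    <llm-token-limit counter-key="@(context.Subscription.Id)"
                     tokens-per-minute="10000" />
    <!-- Record Realtime token consumption per subscription. -->
    <llm-emit-token-metric>
        <dimension name="Subscription ID" value="@(context.Subscription.Id)" />
    </llm-emit-token-metric>
</inbound>
```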
General Availability of Semantic Caching
After months of working closely with customers during the preview, we’re excited to announce the general availability of semantic caching capabilities in AI Gateway. We’ve seen a wide range of real-world usage scenarios, and this feedback has helped us refine the experience.
With this GA release, we’re also extending support to Azure Managed Redis (currently in preview), which can now be used as the backend for semantic caching in AI Gateway. Learn more about the llm-semantic-cache-lookup and llm-semantic-cache-store policies.
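As a sketch, semantic caching is configured as a pair of policies: a lookup in the inbound section and a store in the outbound section. The embeddings-backend-id below is a placeholder for a backend pointing at your embeddings deployment, and the score threshold and cache duration are illustrative:

```xml
<policies>
    <inbound>
        <base />
        <!-- Look for a semantically similar prompt in the cache; on a hit,
             the cached completion is returned without calling the model. -->
        <llm-semantic-cache-lookup
            score-threshold="0.05"
            embeddings-backend-id="embeddings-backend"
            embeddings-backend-auth="system-assigned">
            <vary-by>@(context.Subscription.Id)</vary-by>
        </llm-semantic-cache-lookup>
    </inbound>
    <outbound>
        <base />
        <!-- Cache the model’s response for 60 seconds so that similar
             prompts can be answered from the cache. -->
        <llm-semantic-cache-store duration="60" />
    </outbound>
</policies>
```

On a cache hit, the stored completion is served directly, saving both latency and token spend, while the vary-by expression keeps cache entries partitioned per subscription.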
Azure OpenAI Import Improvements
As we continue to add more powerful capabilities to AI Gateway in Azure API Management, we recognize that not everyone is deeply familiar with configuring policies manually. To simplify adoption, we previously introduced a one-click import experience that allows you to bring in existing Azure OpenAI endpoints with basic configuration for token limiting and token tracking policies – all directly from the UI.
Today, we’re making that experience even more seamless. The import workflow now supports automatic configuration of semantic caching policies: simply select a compatible Redis cache and an embeddings model from a dropdown, and Azure API Management will configure the semantic caching policies for you.
AI Gateway Update Group
Keep an eye out as we roll out these updates across all regions where Azure API Management is available.
Don’t want to wait for the latest AI Gateway features to reach your instance? You can now configure service update settings for your Azure API Management instances. By selecting the AI Gateway Early (GenAI Release) update group, you’ll get early access to the newest AI Gateway features and improvements – before they reach the broader update groups. Learn more about update groups.