Introducing Azure AI Models: The Practical, Hands-On Course for Real Azure AI Skills
Hello everyone, today I'm excited to share something close to my heart. After watching so many developers, myself included, get lost in a maze of scattered docs and endless tutorials, I knew there had to be a better way to learn Azure AI. So I decided to build a guide from scratch, with the goal of breaking things down step by step and making it easy for beginners to get started with Azure. My aim was to remove the guesswork and create a resource where anyone could jump in, follow along, and actually see results without feeling overwhelmed.

Introducing the Azure AI Models Guide. This is a brand new, solo-built, open-source repo aimed at making Azure AI accessible for everyone, whether you're just getting started or want to build real, production-ready apps using Microsoft's latest AI tools. The idea is simple: bring all the essentials into one place. You'll find clear lessons, hands-on projects, and sample code in Python, JavaScript, C#, and REST, all structured so you can learn step by step, at your own pace. I wanted this to be the resource I wish I'd had when I started: straightforward, practical, and friendly to beginners and pros alike.

It's early days for the project, but I'm excited to see it grow. If you're curious, check out the repo at https://github.com/DrHazemAli/Azure-AI-Models. Your feedback, and maybe even your contributions, will help shape where it goes next!

Deploy Open Web UI on Azure VM via Docker: A Step-by-Step Guide with Custom Domain Setup
Introduction

Open Web UI (often referred to as "Ollama Web UI" in the context of LLM frameworks like Ollama) is an open-source, self-hostable interface designed to simplify interactions with large language models (LLMs) such as GPT-4, Llama 3, Mistral, and others. It provides a user-friendly, browser-based environment for deploying, managing, and experimenting with AI models, making advanced language model capabilities accessible to developers, researchers, and enthusiasts without requiring deep technical expertise. This article walks through the step-by-step configuration for hosting Open Web UI on Azure.

Requirements:
- Azure portal account (students can claim USD 100 of Azure cloud credits from this URL)
- Azure Virtual Machine running any Linux distribution
- Domain name and domain host
- Caddy
- Open WebUI image

Step One: Deploy a Linux (Ubuntu) VM from the Azure Portal

Search for "Virtual Machine" in the Azure portal search bar and create a new VM by clicking the "+ Create" button > "Azure Virtual Machine". Fill out the form and select any Linux distribution image; in this demo, we will deploy Open WebUI on Ubuntu Pro 24.04. Click "Review + Create" > "Create" to create the virtual machine.

Tip: If you plan to download and host open-source AI models locally via Open WebUI on your VM, you can save time by increasing the size of the OS disk or attaching a larger disk to the VM. You may also need a higher-performance VM size, since running a large language model (LLM) locally requires significant resources.

Once the VM has been created successfully, click the "Go to resource" button. You will be redirected to the VM's overview page. Jot down the public IP address and access the VM using the SSH credentials you set up during creation.

Step Two: Deploy Open WebUI on the VM via Docker

Once you are logged into the VM via SSH, run the Docker command below:

```
docker run -d --name open-webui --network=host --add-host=host.docker.internal:host-gateway -e PORT=8080 -v open-webui:/app/backend/data --restart always ghcr.io/open-webui/open-webui:dev
```

This command downloads the Open WebUI image onto the VM and listens for Open WebUI traffic on port 8080. Wait a few minutes and the web UI should be up and running. If you have set up an inbound network security group (NSG) rule on Azure to allow port 8080 on your VM from the public internet, you can access it by entering [PUBLIC_IP_ADDRESS]:8080 in the browser.
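Before moving on to the custom domain, it can help to confirm that the container is actually serving traffic. The snippet below is a minimal, optional sketch using Python's requests library; the IP address is a placeholder for your VM's public IP, and it assumes port 8080 has been opened in the NSG as described above.

```python
import requests

# Placeholder: replace with your VM's public IP address
OPEN_WEBUI_URL = "http://<PUBLIC_IP_ADDRESS>:8080"

try:
    # Open WebUI serves its web interface at the root path once the container is ready
    response = requests.get(OPEN_WEBUI_URL, timeout=10)
    print(f"Status code: {response.status_code}")
    if response.ok:
        print("Open WebUI is reachable.")
    else:
        print("The VM responded, but Open WebUI may still be starting up.")
except requests.exceptions.RequestException as exc:
    print(f"Could not reach Open WebUI yet: {exc}")
```

You can run the same check from the VM itself against localhost:8080 if you prefer not to open port 8080 publicly.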
Step Three: Set up a custom domain using Caddy

Now we can set up a reverse proxy to map a custom domain to [PUBLIC_IP_ADDRESS]:8080 using Caddy. The reason Caddy is useful here is that it provides automated HTTPS; you don't have to worry about expiring SSL certificates anymore, and it's free.

First, download Caddy's dependencies and set up its package repository, then install it with these commands:

```
sudo apt install -y debian-keyring debian-archive-keyring apt-transport-https
curl -1sLf 'https://dl.cloudsmith.io/public/caddy/stable/gpg.key' | sudo gpg --dearmor -o /usr/share/keyrings/caddy-stable-archive-keyring.gpg
curl -1sLf 'https://dl.cloudsmith.io/public/caddy/stable/debian.deb.txt' | sudo tee /etc/apt/sources.list.d/caddy-stable.list
sudo apt update && sudo apt install caddy
```

Once Caddy is installed, edit Caddy's configuration file at /etc/caddy/Caddyfile, delete everything else in the file, and add the following lines:

```
yourdomainname.com {
    reverse_proxy localhost:8080
}
```

Restart Caddy using this command:

```
sudo systemctl restart caddy
```

Next, create an A record on your DNS host and point it to the public IP of the server.

Step Four: Update the Network Security Group (NSG)

To allow public access to the VM via HTTPS, you need to ensure the NSG/firewall of the VM allows ports 80 and 443. Let's add these rules in Azure by heading to the VM resource page you created for Open WebUI. Under the "Networking" section > "Network Settings" > "+ Create port rule" > "Inbound port rule", type 443 in the "Destination port ranges" field and click "Add". Repeat these steps for port 80.

Additionally, to enhance security, you should prevent external users from directly reaching Open WebUI's port (port 8080) by adding an inbound deny rule for it. With that, you should be able to access Open WebUI from the domain name you set up earlier.

Conclusion

And just like that, you've turned a blank Azure VM into a sleek, secure home for your Open Web UI, no magic required! By combining Docker's simplicity with Caddy's "set it and forget it" HTTPS handling, you've not only made your app accessible via a custom domain but also locked down security by closing off risky ports and keeping traffic encrypted. Azure's cloud muscle handles the heavy lifting, while you get to enjoy the perks of a pro setup without the headache. If you are interested in using AI models deployed on Azure AI Foundry in Open WebUI via API, see my other article: Step-by-step: Integrate Ollama Web UI to use Azure Open AI API with LiteLLM Proxy.

Create Stunning AI Videos with Sora on Azure AI Foundry!
Special credit to Rory Preddy for creating the GitHub resource that enables us to learn more about Azure Sora. Reach out to him on LinkedIn to say thanks.

Introduction

Artificial Intelligence (AI) is revolutionizing content creation, and video generation is at the forefront of this transformation. OpenAI's Sora, a groundbreaking text-to-video model, allows creators to generate high-quality videos from simple text prompts. When paired with the powerful infrastructure of Azure AI Foundry, you can harness Sora's capabilities with scalability and efficiency, whether on a local machine or a remote setup. In this blog post, I'll walk you through the process of generating AI videos using Sora on Azure AI Foundry. We'll cover the setup for both local and remote environments.

Requirements:
- Azure AI Foundry with Sora model access
- A Linux machine/VM with the following packages already installed:
  - Java JRE 17 (recommended) or later
  - Maven

Step Zero: Deploy the Azure Sora model on AI Foundry

Navigate to the Azure AI Foundry portal and head to the "Models + Endpoints" section (found on the left side of the Azure AI Foundry portal) > click the "Deploy Model" button > "Deploy base model" > search for Sora > click "Confirm". Give the deployment a name and specify the deployment type, then click "Deploy" to finalize the configuration. You should receive an API endpoint and key after successfully deploying Sora on Azure AI Foundry. Store these in a safe place because we will be using them in the next steps.

Step One: Set up the Sora Video Generator on the local/remote machine

Clone the roryp/sora repository on your machine by running the commands below:

```
git clone https://github.com/roryp/sora.git
cd sora
```

Then edit the application.properties file in the src/main/resources/ folder to include your Azure OpenAI credentials. Change the configuration below:

```
azure.openai.endpoint=https://your-openai-resource.cognitiveservices.azure.com
azure.openai.api-key=your_api_key_here
```

If port 8080 is used by another application and you want to change the port on which the web app runs, change the "server.port" configuration to the desired port. Grant execute permission to the "mvnw" script file:

```
chmod +x mvnw
```

Run the application:

```
./mvnw spring-boot:run
```

Open your browser and type your localhost/remote host IP (format: [host-ip:port]) into the browser search bar. If you are running a remote host, do not forget to update your firewall/NSG to allow inbound connections to the configured port. You should see the web app for generating videos with Sora AI using the API provided on Azure AI Foundry.

Now, let's generate a video with the Sora Video Generator. Enter a prompt in the first text field, choose the video pixel resolution, and set the video duration (due to a technical limitation, Sora can only generate videos of up to 20 seconds). Click the "Generate Video" button to proceed. The cost to generate the video is displayed below the "Generate Video" button for transparency, and you can click the "View Breakdown" button to learn more about the cost breakdown.

The video should be ready to download within a maximum of 5 minutes. You can check the status of the video by clicking the "Check Status" button in the web app; the page refreshes every 10 seconds to fetch real-time updates from Sora, and the web app will inform you once the download is ready. Once it is ready, click the "Download Video" button to download the video.
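If you prefer to script the generation and polling flow rather than use the web app, the overall pattern looks roughly like the sketch below. Treat it purely as an illustration: the REST path, request fields, and API version for a Sora deployment are assumptions here, so take the real values from your Azure AI Foundry deployment page and the service documentation.

```python
import time
import requests

# Placeholders/assumptions: take the real endpoint, path, and key from your Sora deployment
ENDPOINT = "https://<your-resource>.openai.azure.com"
JOBS_PATH = "/openai/v1/video/generations/jobs"  # assumption: verify the actual path in the docs
API_KEY = "<your-api-key>"

headers = {"api-key": API_KEY, "Content-Type": "application/json"}

# Field names are illustrative; the actual request schema may differ
payload = {"model": "sora", "prompt": "A drone shot over a neon-lit city at night", "n_seconds": 10}

job = requests.post(f"{ENDPOINT}{JOBS_PATH}", headers=headers, json=payload).json()
job_id = job.get("id")

# Poll every 10 seconds, mirroring the web app's refresh interval
while True:
    status = requests.get(f"{ENDPOINT}{JOBS_PATH}/{job_id}", headers=headers).json()
    print("Current status:", status.get("status"))
    if status.get("status") in ("succeeded", "failed"):
        break
    time.sleep(10)
```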
Conclusion

Generating AI videos with Sora on Azure AI Foundry is a game-changer for content creators, marketers, and developers. By following the steps outlined in this guide, you can set up your environment, integrate Sora, and start creating stunning AI-generated videos. Experiment with different prompts, optimize your workflow, and let your imagination run wild! Have you tried generating AI videos with Sora or Azure AI Foundry? Share your experiences or questions in the comments below. Don't forget to subscribe for more AI and cloud computing tutorials!

Mastering Model Context Protocol (MCP): Building Multi Server MCP with Azure OpenAI
Create complex multi-server MCP agentic AI applications. This deep dive covers a multi-server MCP implementation, connecting both local custom and ready-made MCP servers in a single client session through a custom chatbot interface.

Introduction to OCR Free Vision RAG using Colpali For Complex Documents
Explore the cutting-edge world of document retrieval with "From Pixels to Intelligence: Introduction to OCR Free Vision RAG using ColPali for Complex Documents." This blog post delves into how ColPali revolutionizes the way we interact with documents by leveraging Vision Language Models (VLMs) to enhance Retrieval-Augmented Generation (RAG) processes.

AI Automation in Azure Foundry through turnkey MCP Integration and Computer Use Agent Models
The Fashion Trends Discovery Scenario

In this walkthrough, we'll explore a sample application that demonstrates the power of combining Computer Use (CUA) models with Playwright browser automation to autonomously compile trend information from the internet, while leveraging MCP integration to intelligently catalog and store insights in Azure Blob Storage.

The User Experience

A fashion analyst simply provides a query like "latest trends in sustainable fashion" to our command-line interface. What happens next showcases the power of agentic AI: the system requires no further human intervention to perform
- Autonomous Web Navigation: The agent launches Pinterest, intelligently locates search interfaces, and performs targeted queries
- Intelligent Content Discovery: Systematically identifies and interacts with trend images, navigating to detailed pages
- Advanced Content Analysis: Applies computer vision to analyze fashion elements, colors, patterns, and design trends
- Intelligent Compilation: Consolidates findings into comprehensive, professionally formatted markdown reports
- Contextual Storage: Recognizes the value of preserving insights and autonomously offers cloud storage options

Technical capabilities leveraged

Behind this seamless experience lies a coordination of AI models:
- Pinterest Navigation: The CUA model visually understands Pinterest's interface layout, identifying search boxes and navigation elements with pixel-perfect precision
- Search Results Processing: Rather than relying on traditional DOM parsing, our agent uses visual understanding to identify trend images and calculate precise interaction coordinates
- Content Analysis: Each discovered trend undergoes detailed analysis using GPT-4o's advanced vision capabilities, extracting insights about fashion elements, seasonal trends, and style patterns
- Autonomous Decision Making: The agent contextually understands when information should be preserved and automatically engages with cloud storage systems

Technology Stack Overview

At the heart of this solution lies an orchestration of several AI technologies, each serving a specific purpose in creating a truly autonomous agent.

The architecture used:

```
┌─────────────────────────────────────────────────────────────────┐
│                         Azure AI Foundry                        │
│  ┌─────────────────────────────────────────────────────────┐    │
│  │                      Responses API                      │    │
│  │  ┌─────────────┐  ┌─────────────┐  ┌─────────────────┐  │    │
│  │  │  CUA Model  │  │   GPT-4o    │  │  Built-in MCP   │  │    │
│  │  │ (Interface) │  │  (Content)  │  │     Client      │  │    │
│  │  └─────────────┘  └─────────────┘  └─────────────────┘  │    │
│  └─────────────────────────────────────────────────────────┘    │
└─────────────────────────────────────────────────────────────────┘
                                │
                                ▼
               ┌─────────────────────────────────────────┐
               │         Function Calling Layer          │
               │        (Workflow Orchestration)         │
               └─────────────────────────────────────────┘
                                │
                                ▼
      ┌─────────────────┐             ┌──────────────────┐
      │   Playwright    │◄──────────► │ Trends Compiler  │
      │   Automation    │             │      Engine      │
      └─────────────────┘             └──────────────────┘
                                │
                                ▼
                     ┌─────────────────────┐
                     │     Azure Blob      │
                     │    Storage (MCP)    │
                     └─────────────────────┘
```

Azure OpenAI Responses API

At the core of the agentic architecture in this solution, the Responses API provides intelligent decision-making capabilities that determine when to invoke Computer Use models for web crawling versus when to engage MCP servers for data persistence. This API serves as the brain of our agent, contextually understanding user intent and autonomously choosing the appropriate tools to fulfill complex multi-step workflows.
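To make the role of the Responses API concrete, here is a minimal sketch of the kind of call the orchestrator makes. It is illustrative only: the endpoint, deployment name, and tool values are placeholders, and the real application wraps this logic in its own ai_client helper shown later in this post.

```python
from openai import OpenAI

# Placeholder client setup; point this at your Azure AI Foundry / Azure OpenAI deployment
client = OpenAI(
    base_url="https://<your-resource>.openai.azure.com/openai/v1/",  # assumption: v1-compatible endpoint
    api_key="<your-api-key>",
)

tools = [
    {   # Custom function the agent can call to run the Playwright-based trends compiler
        "type": "function",
        "name": "compile_fashion_trends",
        "description": "Compile fashion trend insights for a user query using browser automation.",
        "parameters": {
            "type": "object",
            "properties": {"user_query": {"type": "string"}},
            "required": ["user_query"],
        },
    },
    {   # Built-in MCP client: the Responses API calls the MCP server directly
        "type": "mcp",
        "server_label": "azure-storage-mcp-server",
        "server_url": "https://<your-mcp-server>/mcp",
        "require_approval": "never",
    },
]

response = client.responses.create(
    model="<your-model-deployment>",
    instructions="Compile trends for the user's query, then offer to store the report via MCP tools.",
    input="latest trends in sustainable fashion",
    tools=tools,
)

# The application inspects the output items and dispatches function calls itself;
# MCP tool calls are executed by the service's built-in MCP client.
for output in response.output:
    print(output.type)
```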
Computer Use (CUA) Model

Our specialized CUA model excels at visual understanding of web interfaces, providing precise coordinate mapping for browser interactions, layout analysis, and navigation planning. Unlike general-purpose language models, the CUA model is specifically trained to understand web page structures, identify interactive elements, and provide actionable coordinates for automated browser control.

Playwright Browser Automation

Acting as the hands of our agent, Playwright executes the precise actions determined by the CUA model. This robust automation framework translates AI insights into real-world browser interactions, handling everything from clicking and typing to screenshot capture and page navigation with pixel-perfect accuracy.

GPT-4o Vision Model for Content Analysis

While the CUA model handles interface understanding, GPT-4o provides domain-specific content reasoning. This powerful vision model analyzes fashion trends, extracts meaningful insights from images, and provides rich semantic understanding of visual content, capabilities that complement rather than overlap with the CUA model's interface-focused expertise.

Model Context Protocol (MCP) Integration

The application showcases the power of agentic AI through its autonomous decision-making around data persistence. The agent intelligently recognizes when compiled information needs to be stored and automatically engages with Azure Blob Storage through MCP integration, without requiring explicit user instruction for each storage operation. Unlike traditional function calling patterns where custom applications must relay MCP calls through client libraries, the Responses API includes a built-in MCP client that directly communicates with MCP servers. This eliminates the need for complex relay logic, making MCP integration as simple as defining tool configurations.

Function Calling Orchestration

Function calling orchestrates the complex workflow between CUA model insights and Playwright actions. Each step is verified and validated before proceeding, ensuring robust autonomous operation without human intervention throughout the entire trend discovery and analysis process.

Let me walk you through the code used in the application.
Agentic Decision Making in Action

Let's examine how our application demonstrates true agentic behavior through the main orchestrator in `app.py`:

```python
async def main() -> str:
    """Main entry point demonstrating agentic decision making."""
    conversation_history = []
    generated_reports = []

    while True:
        user_query = input("Enter your query for fashion trends:-> ")

        # Add user input to conversation context
        new_user_message = {
            "role": "user",
            "content": [{"type": "input_text", "text": user_query}],
        }
        conversation_history.append(new_user_message)

        # The agent analyzes context and decides on appropriate actions
        response = ai_client.create_app_response(
            instructions=instructions,
            conversation_history=conversation_history,
            mcp_server_url=config.mcp_server_url,
            available_functions=available_functions,
        )

        # Process autonomous function calls and MCP tool invocations
        for output in response.output:
            if output.type == "function_call":
                # Agent decides to compile trends
                function_to_call = available_functions[output.name]
                function_args = json.loads(output.arguments)
                function_response = await function_to_call(**function_args)
            elif output.type == "mcp_tool_call":
                # Agent decides to use MCP tools for storage
                print(f"MCP tool call: {output.name}")
                # MCP calls handled automatically by Responses API
```

Key Agentic Behaviors Demonstrated:
- Contextual Analysis: The agent examines conversation history to understand whether the user wants trend compilation or storage operations
- Autonomous Tool Selection: Based on context, the agent chooses between function calls (for trend compilation) and MCP tools (for storage)
- State Management: The agent maintains conversation context across multiple interactions, enabling sophisticated multi-turn workflows

Function Calling Orchestration: Autonomous Web Intelligence

The `TrendsCompiler` class in `compiler.py` demonstrates sophisticated autonomous workflow orchestration:

```python
class TrendsCompiler:
    """Autonomous trends compilation with multi-step verification."""

    async def compile_trends(self, user_query: str) -> str:
        """Main orchestration loop with autonomous step progression."""
        async with LocalPlaywrightComputer() as computer:
            state = {"trends_compiled": False}
            step = 0
            while not state["trends_compiled"]:
                try:
                    if step == 0:
                        # Step 1: Autonomous Pinterest navigation
                        await self._launch_pinterest(computer)
                        step += 1
                    elif step == 1:
                        # Step 2: CUA-driven search and coordinate extraction
                        coordinates = await self._search_and_get_coordinates(
                            computer, user_query
                        )
                        if coordinates:
                            step += 1
                    elif step == 2:
                        # Step 3: Autonomous content analysis and compilation
                        await self._process_image_results(
                            computer, coordinates, user_query
                        )
                        markdown_report = await self._generate_markdown_report(
                            user_query
                        )
                        state["trends_compiled"] = True
                except Exception as e:
                    print(f"Autonomous error handling in step {step}: {e}")
                    state["trends_compiled"] = True
        return markdown_report
```

Autonomous Operation Highlights:
- Self-Verifying Steps: Each step validates completion before advancing
- Error Recovery: Autonomous error handling without human intervention
- State-Driven Progression: The agent maintains its own execution state
- No User Prompts: Complete automation from query to final report
Pinterest's Unique Challenge: Visual Coordinate Intelligence

One of the most impressive demonstrations of CUA model capabilities lies in solving Pinterest's hidden URL challenge:

```python
async def _detect_search_results(self, computer) -> List[Tuple[int, int, int, int]]:
    """Use CUA model to extract image coordinates from search results."""
    # Take screenshot for CUA analysis
    screenshot_bytes = await computer.screenshot()
    screenshot_b64 = base64.b64encode(screenshot_bytes).decode()

    # CUA model analyzes visual layout and identifies image boundaries
    prompt = """
    Analyze this Pinterest search results page and identify all trend/fashion images displayed.
    For each image, provide the exact bounding box coordinates in the format:
    <click>x1,y1,x2,y2</click>
    Focus on the main content images, not navigation or advertisement elements.
    """

    response = await self.ai_client.create_cua_response(
        prompt=prompt,
        screenshot_b64=screenshot_b64
    )

    # Extract coordinates using specialized parser
    coordinates = self.coordinate_parser.extract_coordinates(response.content)
    print(f"CUA model identified {len(coordinates)} image regions")
    return coordinates
```

The Coordinate Calculation:

```python
def calculate_centers(self, coordinates: List[Tuple[int, int, int, int]]) -> List[Tuple[int, int]]:
    """Calculate center coordinates for precise clicking."""
    centers = []
    for x1, y1, x2, y2 in coordinates:
        center_x = (x1 + x2) // 2
        center_y = (y1 + y2) // 2
        centers.append((center_x, center_y))
    return centers
```

Key takeaways with this approach:
- No DOM Dependency: Pinterest's hover-based URL revelation becomes irrelevant
- Visual Understanding: The CUA model sees what humans see, namely image boundaries
- Pixel-Perfect Targeting: Calculated center coordinates ensure reliable clicking
- Robust Navigation: Works regardless of Pinterest's frontend implementation changes

Model Specialization: The Right AI for the Right Job

Our solution demonstrates sophisticated AI model specialization:

```python
async def _analyze_trend_page(self, computer, user_query: str) -> Dict[str, Any]:
    """Use GPT-4o for domain-specific content analysis."""
    # Capture the detailed trend page
    screenshot_bytes = await computer.screenshot()
    screenshot_b64 = base64.b64encode(screenshot_bytes).decode()

    # GPT-4o analyzes fashion content semantically
    analysis_prompt = f"""
    Analyze this fashion trend page for the query: "{user_query}"
    Provide detailed analysis of:
    1. Fashion elements and style characteristics
    2. Color palettes and patterns
    3. Seasonal relevance and trend timing
    4. Target demographics and style categories
    5. Design inspiration and cultural influences
    Format as structured markdown with clear sections.
    """

    # Note: Using GPT-4o instead of CUA model for content reasoning
    response = await self.ai_client.create_vision_response(
        model=self.config.vision_model_name,  # GPT-4o
        prompt=analysis_prompt,
        screenshot_b64=screenshot_b64
    )

    return {
        "analysis": response.content,
        "timestamp": datetime.now().isoformat(),
        "query_context": user_query
    }
```

Model Selection Rationale:
- CUA Model: Perfect for understanding "where to click" and "how to navigate"
- GPT-4o: Excels at "what does this mean" and "how is this relevant"
- Specialized Strengths: Each model operates in its domain of expertise
- Complementary Intelligence: Combined capabilities exceed individual model limitations
Compilation and Consolidation

```python
async def _generate_markdown_report(self, user_query: str) -> str:
    """Consolidate all analyses into a comprehensive markdown report."""
    if not self.image_analyses:
        return "No trend data collected for analysis."

    # Intelligent report structuring
    report_sections = [
        f"# Fashion Trends Analysis: {user_query}",
        f"*Generated on {datetime.now().strftime('%B %d, %Y')}*",
        "",
        "## Executive Summary",
        await self._generate_executive_summary(),
        "",
        "## Detailed Trend Analysis"
    ]

    # Process each analyzed trend with intelligent categorization
    for idx, analysis in enumerate(self.image_analyses, 1):
        trend_section = [
            f"### Trend Item {idx}",
            analysis.get('analysis', 'No analysis available'),
            f"*Analysis timestamp: {analysis.get('timestamp', 'Unknown')}*",
            ""
        ]
        report_sections.extend(trend_section)

    # Add intelligent trend synthesis
    report_sections.extend([
        "## Trend Synthesis and Insights",
        await self._generate_trend_synthesis(),
        "",
        "## Recommendations",
        await self._generate_recommendations()
    ])

    return "\n".join(report_sections)
```

Intelligent Compilation Features:
- Automatic Structuring: Creates professional report formats automatically
- Content Synthesis: Combines individual analyses into coherent insights
- Temporal Context: Maintains timestamp and query context
- Executive Summaries: Generates high-level insights from detailed data

Autonomous Storage Intelligence

Note that there is no MCP client code that needs to be implemented here. The integration is completely turnkey, through configuration alone.

```python
# In app_client.py - MCP tool configuration
def create_app_tools(self, mcp_server_url: str, available_functions: Dict[str, Any]) -> List[Dict[str, Any]]:
    """Configure tools with automatic MCP integration."""
    tools = [
        {
            "type": "mcp",
            "server_label": "azure-storage-mcp-server",
            "server_url": mcp_server_url,
            "require_approval": "never",  # Autonomous operation
            "allowed_tools": ["create_container", "list_containers", "upload_blob"],
        }
    ]
    return tools

# Agent instructions demonstrate contextual intelligence
instructions = f"""
Step1: Compile trends based on user query using computer use agent.
Step2: Prompt user to store trends report in Azure Blob Storage.
Use MCP Server tools to perform this action autonomously.
IMPORTANT: Maintain context of previously generated reports.
If user asks to store a report, use the report generated in this session.
"""
```

Turnkey MCP Integration:
- Direct API Calls: MCP tools are called directly by the Responses API
- No Relay Logic: No custom MCP client implementation required
- Autonomous Tool Selection: The agent chooses appropriate MCP tools based on context
- Contextual Storage: The agent understands what to store and when

Demo and Code Reference

Here is the GitHub repo of the application described in this post. See a demo of this application in action in the original post.

Conclusion: Entering the Age of Practical Agentic AI

The Fashion Trends Compiler Agent represents agentic AI applications that work autonomously in real-world scenarios. By combining Azure AI Foundry's turnkey MCP integration with specialized AI models and robust automation frameworks, we've created an agent that doesn't just follow instructions but intelligently navigates complex multi-step workflows with minimal human oversight. Ready to build your own agentic AI solutions? Start exploring Azure AI Foundry's MCP integration and Computer Use capabilities to create the next generation of intelligent automation.

RAFT: A new way to teach LLMs to be better at RAG
In this article, we will look at the limitations of RAG and domain-specific fine-tuning to adapt LLMs to existing knowledge, and how a team of UC Berkeley researchers, Tianjun Zhang and Shishir G. Patil, may have just discovered a better approach.

Building custom AI Speech models with Phi-3 and Synthetic data
Introduction

In today's landscape, speech recognition technologies play a critical role across various industries—improving customer experiences, streamlining operations, and enabling more intuitive interactions. With Azure AI Speech, developers and organizations can easily harness powerful, fully managed speech functionalities without requiring deep expertise in data science or speech engineering. Core capabilities include:
- Speech to Text (STT)
- Text to Speech (TTS)
- Speech Translation
- Custom Neural Voice
- Speaker Recognition

Azure AI Speech supports over 100 languages and dialects, making it ideal for global applications. Yet, for certain highly specialized domains—such as industry-specific terminology, specialized technical jargon, or brand-specific nomenclature—off-the-shelf recognition models may fall short. To achieve the best possible performance, you'll likely need to fine-tune a custom speech recognition model. This fine-tuning process typically requires a considerable amount of high-quality, domain-specific audio data, which can be difficult to acquire.

The Data Challenge

When training datasets lack sufficient diversity or volume—especially in niche domains or underrepresented speech patterns—model performance can degrade significantly. This not only impacts transcription accuracy but also hinders the adoption of speech-based applications. For many developers, sourcing enough domain-relevant audio data is one of the most challenging aspects of building high-accuracy, real-world speech solutions.

Addressing Data Scarcity with Synthetic Data

A powerful solution to data scarcity is the use of synthetic data: audio files generated artificially using TTS models rather than recorded from live speakers. Synthetic data helps you quickly produce large volumes of domain-specific audio for model training and evaluation. By leveraging Microsoft's Phi-3.5 model and Azure's pre-trained TTS engines, you can generate target-language, domain-focused synthetic utterances at scale—no professional recording studio or voice actors needed.

What is Synthetic Data?

Synthetic data is artificial data that replicates patterns found in real-world data without exposing sensitive details. It's especially beneficial when real data is limited, protected, or expensive to gather. Use cases include:
- Privacy Compliance: Train models without handling personal or sensitive data.
- Filling Data Gaps: Quickly create samples for rare scenarios (e.g., specialized medical terms, unusual accents) to improve model accuracy.
- Balancing Datasets: Add more samples to underrepresented classes, enhancing fairness and performance.
- Scenario Testing: Simulate rare or costly conditions (e.g., edge cases in autonomous driving) for more robust models.

By incorporating synthetic data, you can fine-tune custom STT (speech-to-text) models even when your access to real-world domain recordings is limited. Synthetic data allows models to learn from a broader range of domain-specific utterances, improving accuracy and robustness.

Overview of the Process

This blog post provides a step-by-step guide—supported by code samples—to quickly generate domain-specific synthetic data with Phi-3.5 and Azure AI Speech TTS, then use that data to fine-tune and evaluate a custom speech-to-text model.
We will cover steps 1–4 of the high-level architecture: the end-to-end custom speech-to-text fine-tuning process using synthetic data. Hands-on labs are available in the GitHub repository.

Step 0: Environment Setup

First, configure a .env file based on the provided sample.env template to suit your environment. You'll need to:
- Deploy the Phi-3.5 model as a serverless endpoint on Azure AI Foundry.
- Provision Azure AI Speech and an Azure Storage account.

Below is a sample configuration focusing on creating a custom Italian model:

```
# this is a sample for keys used in this code repo.
# Please rename it to .env before you can use it

# Azure Phi3.5
AZURE_PHI3.5_ENDPOINT=https://aoai-services1.services.ai.azure.com/models
AZURE_PHI3.5_API_KEY=xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
AZURE_PHI3.5_DEPLOYMENT_NAME=Phi-3.5-MoE-instruct

# Azure AI Speech
AZURE_AI_SPEECH_REGION=xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
AZURE_AI_SPEECH_API_KEY=xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
# https://learn.microsoft.com/en-us/azure/ai-services/speech-service/language-support?tabs=stt
CUSTOM_SPEECH_LANG=Italian
CUSTOM_SPEECH_LOCALE=it-IT
# https://speech.microsoft.com/portal?projecttype=voicegallery
TTS_FOR_TRAIN=it-IT-BenignoNeural,it-IT-CalimeroNeural,it-IT-CataldoNeural,it-IT-FabiolaNeural,it-IT-FiammaNeural
TTS_FOR_EVAL=it-IT-IsabellaMultilingualNeural

# Azure Storage Account
AZURE_STORAGE_ACCOUNT_NAME=xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
AZURE_STORAGE_ACCOUNT_KEY=xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
AZURE_STORAGE_CONTAINER_NAME=stt-container
```

Key Settings Explained:
- AZURE_PHI3.5_ENDPOINT / AZURE_PHI3.5_API_KEY / AZURE_PHI3.5_DEPLOYMENT_NAME: Access credentials and the deployment name for the Phi-3.5 model.
- AZURE_AI_SPEECH_REGION: The Azure region hosting your Speech resources.
- CUSTOM_SPEECH_LANG / CUSTOM_SPEECH_LOCALE: Specify the language and locale for the custom model.
- TTS_FOR_TRAIN / TTS_FOR_EVAL: Comma-separated voice names (from the Voice Gallery) for generating synthetic speech for training and evaluation.
- AZURE_STORAGE_ACCOUNT_NAME / KEY / CONTAINER_NAME: Configuration for your Azure Storage account, where training/evaluation data will be stored.

Step 1: Generating Domain-Specific Text Utterances with Phi-3.5

Use the Phi-3.5 model to generate custom textual utterances in your target language and English. These utterances serve as a seed for synthetic speech creation. By adjusting your prompts, you can produce text tailored to your domain (such as call center Q&A for a tech brand).

Code snippet (illustrative):

````python
topic = f"""
Call center QnA related expected spoken utterances for {CUSTOM_SPEECH_LANG} and English languages.
"""

question = f"""
create 10 lines of jsonl of the topic in {CUSTOM_SPEECH_LANG} and english.
jsonl format is required. use 'no' as number and '{CUSTOM_SPEECH_LOCALE}', 'en-US' keys for the languages.
only include the lines as the result. Do not include ```jsonl, ``` and blank line in the result.
"""

response = client.complete(
    messages=[
        SystemMessage(content="""
        Generate plain text sentences of #topic# related text to improve the recognition of domain-specific words and phrases.
        Domain-specific words can be uncommon or made-up words, but their pronunciation must be straightforward to be recognized.
        Use text data that's close to the expected spoken utterances.
        The number of utterances per line should be 1.
        """),
        UserMessage(content=f"""
        #topic#: {topic}
        Question: {question}
        """),
    ],
    ...
)

content = response.choices[0].message.content
print(content)  # Prints the generated JSONL with no, locale, and content keys
````
Sample Output (Contoso Electronics in Italian):

```
{"no":1,"it-IT":"Come posso risolvere un problema con il mio televisore Contoso?","en-US":"How can I fix an issue with my Contoso TV?"}
{"no":2,"it-IT":"Qual è la garanzia per il mio smartphone Contoso?","en-US":"What is the warranty for my Contoso smartphone?"}
{"no":3,"it-IT":"Ho bisogno di assistenza per il mio tablet Contoso, chi posso contattare?","en-US":"I need help with my Contoso tablet, who can I contact?"}
{"no":4,"it-IT":"Il mio laptop Contoso non si accende, cosa posso fare?","en-US":"My Contoso laptop won't turn on, what can I do?"}
{"no":5,"it-IT":"Posso acquistare accessori per il mio smartwatch Contoso?","en-US":"Can I buy accessories for my Contoso smartwatch?"}
{"no":6,"it-IT":"Ho perso la password del mio router Contoso, come posso recuperarla?","en-US":"I forgot my Contoso router password, how can I recover it?"}
{"no":7,"it-IT":"Il mio telecomando Contoso non funziona, come posso sostituirlo?","en-US":"My Contoso remote control isn't working, how can I replace it?"}
{"no":8,"it-IT":"Ho bisogno di assistenza per il mio altoparlante Contoso, chi posso contattare?","en-US":"I need help with my Contoso speaker, who can I contact?"}
{"no":9,"it-IT":"Il mio smartphone Contoso si surriscalda, cosa posso fare?","en-US":"My Contoso smartphone is overheating, what can I do?"}
{"no":10,"it-IT":"Posso acquistare una copia di backup del mio smartwatch Contoso?","en-US":"Can I buy a backup copy of my Contoso smartwatch?"}
```

These generated lines give you a domain-oriented textual dataset, ready to be converted into synthetic audio.

Step 2: Creating the Synthetic Audio Dataset

Using the generated utterances from Step 1, you can now produce synthetic speech WAV files using Azure AI Speech's TTS service. This bypasses the need for real recordings and allows quick generation of numerous training samples.

Core Function:

```python
def get_audio_file_by_speech_synthesis(text, file_path, lang, default_tts_voice):
    ssml = f"""<speak version='1.0' xmlns="https://www.w3.org/2001/10/synthesis" xml:lang='{lang}'>
        <voice name='{default_tts_voice}'>
            {html.escape(text)}
        </voice>
    </speak>"""
    speech_sythesis_result = speech_synthesizer.speak_ssml_async(ssml).get()
    stream = speechsdk.AudioDataStream(speech_sythesis_result)
    stream.save_to_wav_file(file_path)
```

Execution: For each generated text line, the code produces multiple WAV files (one per specified TTS voice). It also creates a manifest.txt for reference and a zip file containing all the training data.
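The repository handles this loop for you, but to make the flow concrete, a minimal sketch of turning each JSONL line into per-voice WAV files could look like the following. Variable names such as synthetic_text_file and output_dir are assumptions that mirror the configuration above, so treat this as illustrative rather than the repo's verbatim code.

```python
import datetime
import json
import os

# Assumptions: synthetic_text_file holds the JSONL generated in Step 1, output_dir is the WAV target folder,
# and TTS_FOR_TRAIN / CUSTOM_SPEECH_LOCALE come from the .env configuration shown earlier.
train_tts_voices = TTS_FOR_TRAIN.split(',')

for tts_voice in train_tts_voices:
    with open(synthetic_text_file, 'r', encoding='utf-8') as f:
        for line in f:
            expression = json.loads(line)
            no = expression['no']
            text = expression[CUSTOM_SPEECH_LOCALE]
            timestamp = datetime.datetime.now().strftime("%Y%m%d%H%M%S")
            file_name = f"{no}_{CUSTOM_SPEECH_LOCALE}_{timestamp}.wav"
            # One WAV per (utterance, voice) pair, using the core function above
            get_audio_file_by_speech_synthesis(text, os.path.join(output_dir, file_name), CUSTOM_SPEECH_LOCALE, tts_voice)
            # Record the transcript in the manifest expected by Custom Speech training
            with open(os.path.join(output_dir, 'manifest.txt'), 'a', encoding='utf-8') as manifest_file:
                manifest_file.write(f"{file_name}\t{text}\n")
```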
Note: If DELETE_OLD_DATA = True, the training_dataset folder is reset on each run. If you're mixing synthetic data with real recorded data, set DELETE_OLD_DATA = False to retain previously curated samples.

Code snippet (illustrative):

```python
import zipfile
import shutil

DELETE_OLD_DATA = True

train_dataset_dir = "train_dataset"
if not os.path.exists(train_dataset_dir):
    os.makedirs(train_dataset_dir)

if DELETE_OLD_DATA:
    for file in os.listdir(train_dataset_dir):
        os.remove(os.path.join(train_dataset_dir, file))

timestamp = datetime.datetime.now().strftime("%Y%m%d%H%M%S")
zip_filename = f'train_{lang}_{timestamp}.zip'
with zipfile.ZipFile(zip_filename, 'w') as zipf:
    for file in files:
        zipf.write(os.path.join(output_dir, file), file)

print(f"Created zip file: {zip_filename}")
shutil.move(zip_filename, os.path.join(train_dataset_dir, zip_filename))
print(f"Moved zip file to: {os.path.join(train_dataset_dir, zip_filename)}")

train_dataset_path = {os.path.join(train_dataset_dir, zip_filename)}
%store train_dataset_path
```

You'll also create evaluation data in a similar way, using a different TTS voice than the one used for training, to ensure a meaningful evaluation scenario.

Example snippet to create the synthetic evaluation data:

```python
import datetime

print(TTS_FOR_EVAL)
languages = [CUSTOM_SPEECH_LOCALE]
eval_output_dir = "synthetic_eval_data"
DELETE_OLD_DATA = True

if not os.path.exists(eval_output_dir):
    os.makedirs(eval_output_dir)

if DELETE_OLD_DATA:
    for file in os.listdir(eval_output_dir):
        os.remove(os.path.join(eval_output_dir, file))

eval_tts_voices = TTS_FOR_EVAL.split(',')

for tts_voice in eval_tts_voices:
    with open(synthetic_text_file, 'r', encoding='utf-8') as f:
        for line in f:
            try:
                expression = json.loads(line)
                no = expression['no']
                for lang in languages:
                    text = expression[lang]
                    timestamp = datetime.datetime.now().strftime("%Y%m%d%H%M%S")
                    file_name = f"{no}_{lang}_{timestamp}.wav"
                    get_audio_file_by_speech_synthesis(text, os.path.join(eval_output_dir, file_name), lang, tts_voice)
                    with open(f'{eval_output_dir}/manifest.txt', 'a', encoding='utf-8') as manifest_file:
                        manifest_file.write(f"{file_name}\t{text}\n")
            except json.JSONDecodeError as e:
                print(f"Error decoding JSON on line: {line}")
                print(e)
```

Step 3: Creating and Training a Custom Speech Model

To fine-tune and evaluate your custom model, you'll interact with Azure's Speech-to-Text APIs:
1. Upload your dataset (the zip file created in Step 2) to your Azure Storage container.
2. Register your dataset as a Custom Speech dataset.
3. Create a Custom Speech model using that dataset.
4. Create evaluations using that custom model, polling with asynchronous calls until they complete.

You can also use UI-based approaches to customize a speech model with fine-tuning in the Azure AI Foundry portal, but in this hands-on we'll use the Azure Speech-to-Text REST APIs to drive the entire process.

Key APIs & References:
- Azure Speech-to-Text REST APIs (v3.2)
- The provided common.py in the hands-on repo abstracts API calls for convenience.
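The upload step itself is handled by the repo's upload_dataset_to_storage helper. Under the hood it amounts to pushing the training zip into your Blob Storage container, roughly like the minimal sketch below with the azure-storage-blob SDK; the zip path shown is hypothetical, the connection values come from the .env file, and the helper's real implementation may differ.

```python
import os
from azure.storage.blob import BlobServiceClient

# Assumptions: these values come from the .env settings shown in Step 0
account_name = os.getenv("AZURE_STORAGE_ACCOUNT_NAME")
account_key = os.getenv("AZURE_STORAGE_ACCOUNT_KEY")
container_name = os.getenv("AZURE_STORAGE_CONTAINER_NAME")

blob_service = BlobServiceClient(
    account_url=f"https://{account_name}.blob.core.windows.net",
    credential=account_key,
)
container_client = blob_service.get_container_client(container_name)

# Hypothetical path to the training zip produced in Step 2
zip_path = "train_dataset/train_it-IT_20250101000000.zip"
blob_name = os.path.basename(zip_path)

with open(zip_path, "rb") as data:
    container_client.upload_blob(name=blob_name, data=data, overwrite=True)

print(f"Uploaded {blob_name} to container {container_name}")
```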
Example snippet to create the training dataset:

```python
uploaded_files, url = upload_dataset_to_storage(data_folder, container_name, account_name, account_key)

kind = "Acoustic"
display_name = "acoustic dataset(zip) for training"
description = f"[training] Dataset for fine-tuning the {CUSTOM_SPEECH_LANG} base model"

zip_dataset_dict = {}
for display_name in uploaded_files:
    zip_dataset_dict[display_name] = create_dataset(base_url, headers, project_id, url[display_name], kind, display_name, description, CUSTOM_SPEECH_LOCALE)
```

You can monitor training progress using the monitor_training_status function, which polls the model's status and notifies you once training completes.

Core Function:

```python
def monitor_training_status(custom_model_id):
    with tqdm(total=3, desc="Running Status", unit="step") as pbar:
        status = get_custom_model_status(base_url, headers, custom_model_id)
        if status == "NotStarted":
            pbar.update(1)
        while status != "Succeeded" and status != "Failed":
            if status == "Running" and pbar.n < 2:
                pbar.update(1)
            print(f"Current Status: {status}")
            time.sleep(10)
            status = get_custom_model_status(base_url, headers, custom_model_id)
        while pbar.n < 3:
            pbar.update(1)
        print("Training Completed")
```

Step 4: Evaluate the Trained Custom Speech Model

After training, create an evaluation job using your synthetic evaluation dataset. With the custom model now trained, compare its performance (measured by Word Error Rate, WER) against the base model's WER.

Key Steps:
- Use the create_evaluation function to evaluate the custom model against your test set.
- Compare evaluation metrics between the base and custom models.
- Check WER to quantify accuracy improvements.

After evaluation, you can view the results for the base model and the fine-tuned model on the evaluation dataset created in the 1_text_data_generation.ipynb notebook, in either Speech Studio or the AI Foundry fine-tuning section, depending on the resource location you specified in the configuration file.

Example snippet to create an evaluation:

```python
description = f"[{CUSTOM_SPEECH_LOCALE}] Evaluation of the {CUSTOM_SPEECH_LANG} base and custom model"

evaluation_ids = {}
for display_name in uploaded_files:
    evaluation_ids[display_name] = create_evaluation(base_url, headers, project_id, dataset_ids[display_name], base_model_id, custom_model_with_acoustic_id, f'vi_eval_base_vs_custom_{display_name}', description, CUSTOM_SPEECH_LOCALE)
```

You can also pull out a simple Word Error Rate (WER) summary, as in the code below from 4_evaluate_custom_model.ipynb.

Example snippet to create a WER DataFrame:

```python
# Collect WER results for each dataset
wer_results = []
eval_title = "Evaluation Results for base model and custom model: "
for display_name in uploaded_files:
    eval_info = get_evaluation_results(base_url, headers, evaluation_ids[display_name])
    eval_title = eval_title + display_name + " "
    wer_results.append({
        'Dataset': display_name,
        'WER_base_model': eval_info['properties']['wordErrorRate1'],
        'WER_custom_model': eval_info['properties']['wordErrorRate2'],
    })

# Create a DataFrame to display the results
print(eval_info)
wer_df = pd.DataFrame(wer_results)
print(eval_title)
print(wer_df)
```

About WER:
- WER is computed as (Insertions + Deletions + Substitutions) / Total Words.
- A lower WER signifies better accuracy.
- Synthetic data can help reduce WER by introducing more domain-specific terms during training.
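To make the metric tangible, here is a small, self-contained sketch of how WER can be computed for a single reference/hypothesis pair using word-level edit distance; it is illustrative only and not part of the hands-on repo.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / number of reference words."""
    ref_words = reference.lower().split()
    hyp_words = hypothesis.lower().split()

    # Word-level Levenshtein distance via dynamic programming
    d = [[0] * (len(hyp_words) + 1) for _ in range(len(ref_words) + 1)]
    for i in range(len(ref_words) + 1):
        d[i][0] = i
    for j in range(len(hyp_words) + 1):
        d[0][j] = j
    for i in range(1, len(ref_words) + 1):
        for j in range(1, len(hyp_words) + 1):
            cost = 0 if ref_words[i - 1] == hyp_words[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution or match
    return d[len(ref_words)][len(hyp_words)] / max(len(ref_words), 1)


# Example: one substituted word out of five reference words gives a WER of 0.2
print(word_error_rate("qual è la garanzia Contoso", "qual è la garanzia Contosa"))
```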
You'll also similarly create a WER results markdown file using the md_table_scoring_result method below.

Core Function:

```python
# Create a markdown file for table scoring results
md_table_scoring_result(base_url, headers, evaluation_ids, uploaded_files)
```

Implementation Considerations

The provided code and instructions serve as a baseline for automating the creation of synthetic data and fine-tuning Custom Speech models. The WER numbers you get from model evaluation will also vary depending on the actual domain. Real-world scenarios may require adjustments, such as incorporating real data or customizing the training pipeline for specific domain needs. Feel free to extend or modify this baseline to better match your use case and improve model performance.

Conclusion

By combining Microsoft's Phi-3.5 model with Azure AI Speech TTS capabilities, you can overcome data scarcity and accelerate the fine-tuning of domain-specific speech-to-text models. Synthetic data generation makes it possible to:
- Rapidly produce large volumes of specialized training and evaluation data.
- Substantially reduce the time and cost associated with recording real audio.
- Improve speech recognition accuracy for niche domains by augmenting your dataset with diverse synthetic samples.

As you continue exploring Azure's AI and speech services, you'll find more opportunities to leverage generative AI and synthetic data to build powerful, domain-adapted speech solutions—without the overhead of large-scale data collection efforts. 🙂

References:
- Azure AI Speech Overview
- Microsoft Phi-3 Cookbook
- Text to Speech Overview
- Speech to Text Overview
- Custom Speech Overview
- Customize a speech model with fine-tuning in the Azure AI Foundry
- Scaling Speech-Text Pre-Training with Synthetic Interleaved Data (arXiv)
- Training TTS Systems from Synthetic Data: A Practical Approach for Accent Transfer (arXiv)
- Generating Data with TTS and LLMs for Conversational Speech Recognition (arXiv)

Configure Embedding Models on Azure AI Foundry with Open Web UI
Introduction

Let's take a closer look at an exciting development in the AI space. Embedding models are the key to transforming complex data into usable insights, driving innovations like smarter chatbots and tailored recommendations. With Azure AI Foundry, Microsoft's powerful platform, you've got the tools to build and scale these models effortlessly. Add in Open Web UI, an intuitive interface for engaging with AI systems, and you've got a winning combo that's hard to beat. In this article, we'll explore how embedding models on Azure AI Foundry, paired with Open Web UI, are paving the way for accessible and impactful AI solutions for developers and businesses. Let's dive in!

To configure an embedding model from Azure AI Foundry in Open Web UI, first complete the requirements below.

Requirements:
- Set up an Azure AI Foundry hub/project.
- Deploy Open Web UI; refer to my previous article on how you can deploy Open Web UI on an Azure VM.
- Optional: Deploy LiteLLM with Azure AI Foundry models to work with Open Web UI; refer to my previous article on how you can do this as well.

Deploying Embedding Models on Azure AI Foundry

Navigate to the Azure AI Foundry site and deploy an embedding model from the "Models + Endpoints" section. For the purpose of this demonstration, we will deploy the "text-embedding-3-large" model by OpenAI. You should receive a URL endpoint and API key for the embedding model once it is deployed. Take note of those credentials because we will use them in Open Web UI.

Configuring the Embedding Model on Open Web UI

Now head to the Open Web UI admin settings page > Documents and select Azure OpenAI as the embedding model engine. Copy and paste the base URL, the API key, the embedding model deployed on Azure AI Foundry, and the API version (not the model version) into the corresponding fields, then click "Save" to apply the changes.

Expected Output

Now let us compare the scenarios with and without the embedding model configured on Open Web UI:
- Without an embedding model configured.
- With the Azure OpenAI embedding model configured.
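As an optional sanity check, you can call the deployment directly with the openai Python SDK before or after wiring it into Open Web UI. This is a minimal sketch; the endpoint, API key, API version, and deployment name are placeholders for the values you noted from Azure AI Foundry.

```python
from openai import AzureOpenAI

# Placeholders: use the endpoint, key, API version, and deployment name from your Azure AI Foundry project
client = AzureOpenAI(
    azure_endpoint="https://<your-resource>.openai.azure.com",
    api_key="<your-api-key>",
    api_version="<your-api-version>",
)

response = client.embeddings.create(
    model="text-embedding-3-large",  # your deployment name
    input=["Open Web UI document chunk used for retrieval"],
)

vector = response.data[0].embedding
print(f"Received an embedding with {len(vector)} dimensions")
```

If this returns a vector, the same base URL, key, and API version should work in the Open Web UI settings shown above.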
Conclusion

And there you have it! Embedding models on Azure AI Foundry, combined with the seamless interaction offered by Open Web UI, are truly revolutionizing how we approach AI solutions. This powerful duo not only simplifies the process of building and deploying intelligent systems but also makes cutting-edge technology more accessible to developers and businesses of all sizes. As we move forward, it's clear that such integrations will continue to drive innovation, breaking down barriers and unlocking new possibilities in the AI landscape. So, whether you're a seasoned developer or just stepping into this exciting field, now's the time to explore what Azure AI Foundry and Open Web UI can do for you. Let's keep pushing the boundaries of what's possible!

A Recap of the Build AI Agents with Custom Tools Live Session

Artificial Intelligence is evolving, and so are the ways we build intelligent agents. On a recent Microsoft YouTube Live session, developers and AI enthusiasts gathered to explore the power of custom tools in AI agents using Azure AI Studio. The session walked through concepts, use cases, and a live demo that showed how integrating custom tools can bring a new level of intelligence and adaptability to your applications.

Watch the full session here: https://www.youtube.com/live/MRpExvcdxGs?si=X03wsQxQkkshEkOT

What Are AI Agents with Custom Tools?

AI agents are essentially smart workflows that can reason, plan, and act, powered by large language models (LLMs). While built-in tools like search, calculators, or web APIs are helpful, custom tools allow developers to tailor agents to business-specific needs. For example:
- Calling internal APIs
- Accessing private databases
- Triggering backend operations like ticket creation or document generation

Learn Module Overview: Build Agents with Custom Tools

To complement the session, Microsoft offers a self-paced Microsoft Learn module that gives step-by-step guidance (explore the module). Key learning objectives:
- Understand why and when to use custom tools in agents
- Learn how to define, integrate, and test tools using Azure AI Studio
- Build an end-to-end agent scenario using custom capabilities

Hands-On Exercise: The module includes a guided lab where you:
- Define a tool schema
- Register the tool within Azure AI Studio
- Build an AI agent that uses your custom logic
- Test and validate the agent's response

Highlights from the Live Session

Here are some gems from the session:
- Real-World Use Cases: Automating customer support, connecting to CRMs, and more
- Tool Manifest Creation: Learn how to describe a tool in a machine-understandable way
- Live Azure Demo: See exactly how to register tools and invoke them from an AI agent
- Tips & Troubleshooting: Best practices and common pitfalls when designing agents

Want to Get Started?

If you're a developer, AI enthusiast, or product builder looking to elevate your agent's capabilities, custom tools are the next step. Start building your own AI agents by combining the power of the Microsoft Learn module and the YouTube Live session.

Final Thoughts

The future of AI isn't just about smart responses; it's about intelligent actions. Custom tools enable your AI agent to do things, not just say things. With Azure AI Studio, building a practical, action-oriented AI assistant is more accessible than ever.

Learn More and Join the Community

Learn more about AI agents with the open-source course at https://aka.ms/ai-agents-beginners and Building Agents. Join the Azure AI Foundry Discord channel and continue the discussion and learning: https://aka.ms/AI/discord. Have questions or want to share what you're building? Let's connect on LinkedIn or drop a comment under the YouTube video!