
Azure Storage Blog

Building a Scalable Web Crawling and Indexing Pipeline with Azure Storage and AI Search

navprsingh
Microsoft
Apr 17, 2025

This blog post covers a customer implementation of knowledge base ingestion into Azure Blob Storage and an AI Search-based indexing pipeline, along with links to the source code. A future post will cover how the curated knowledge base and search index are combined into a Retrieval Augmented Generation (RAG) template for natural-language-based enterprise knowledge retrieval and response.

A leading telecommunications sector customer in Europe was facing challenges with the existing search functionality on their public website. End users could search for information, but the results were often scattered across multiple links, lacked summarization, and did not consistently provide relevant or accurate results. This inefficiency resulted in a poor user experience, making it difficult for customers to quickly find the required information. Additionally, the organization's knowledge assets were spread across multiple sources, including public websites, on-premises SharePoint repositories, and specific internal locations containing various document types such as PDFs, Word documents, and text files. The lack of a centralized search mechanism hindered efficient information retrieval and prevented the organization from leveraging its data effectively.
 
To address these challenges, we developed an advanced search solution built upon knowledge base data ingestion and consolidation into Azure Storage, and automated blob-trigger based indexing and vectorization with Azure AI Search. 

Implementation of the search solution 

Data Ingestion:  

Data Ingestion involves collecting structured and unstructured data from multiple sources, including the public website, on-premises SharePoint, and designated internal locations. The following diagram explains how the different types of data (PDF / text / HTML) are scraped from public websites using an Azure Function app and written to Blob Storage.


Figure: Azure function app for crawling and ingesting data into an Azure storage container
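The crawl-and-store flow above can be sketched in a few lines. This is a simplified, stdlib-only illustration, not the customer's actual Function code (which uses Scrapy, per the repository); the `LinkExtractor` and `extract_links` names are hypothetical, and the Blob Storage upload is shown only as a comment because it requires the azure-storage-blob SDK and real credentials.

```python
from html.parser import HTMLParser

class LinkExtractor(HTMLParser):
    """Collects href targets from anchor tags in a crawled page."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def extract_links(html: str) -> list:
    """Return the outgoing links of a page, in document order."""
    parser = LinkExtractor()
    parser.feed(html)
    return parser.links

# Uploading a crawled page to Blob Storage would then use the
# azure-storage-blob SDK (assumed installed in the Function App):
#
#   from azure.storage.blob import BlobServiceClient
#   client = BlobServiceClient.from_connection_string(conn_str)
#   client.get_blob_client("crawled-data", blob_name).upload_blob(
#       page_bytes, overwrite=True)
```

From each fetched page, the function follows the extracted links (up to a depth limit) and writes the raw PDF/HTML/text bytes to the container that the AI Search indexer watches.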

Data Processing & Indexing:  

This stage involves transforming the data ingested into Azure Storage, generating vector embeddings, and indexing it to enhance searchability, while ensuring that the most relevant and up-to-date results appear first. For this purpose, we built an AI Search indexer pipeline integrated with Azure Blob Storage. The pipeline runs on automated blob triggers, so that any update to the ingested knowledge base re-runs the pipeline, keeping the index fresh and accurate.

Figure: AI Search Indexer - Data processing

AI Search Indexer Pipeline integration with Azure Storage 

Azure Blob Storage provides a scalable and cost-effective storage platform to manage unstructured data ingested from the web crawler. Blob Storage integrates with Azure AI Search - a powerful cognitive search service that transforms unstructured data into searchable content. The indexer pipeline leverages a blob-triggered Azure Function to update an Azure AI Search indexer with a skillset, enabling the processing of diverse file types stored in Azure Blob Storage. This implementation choice enables the following outcomes:

  • Triggering an Azure Function whenever a new file is uploaded, modified, or deleted in Blob Storage.
  • Updating the Azure AI Search indexer dynamically so the latest data is always available for AI-driven search and retrieval.
  • Applying cognitive skills (OCR, entity recognition, translation, etc.) to extract valuable insights from unstructured files.

Figure: Data processing and indexing workflow
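To make the blob-trigger-to-indexer hand-off concrete, here is a minimal sketch of the function body. The "run indexer" REST operation (`POST {endpoint}/indexers/{name}/run?api-version=...`) is a documented Azure AI Search API; the service name, indexer name, and API version shown are assumptions, and the Azure Functions wiring is shown only as a comment since it needs the azure-functions runtime.

```python
import urllib.request

SEARCH_API_VERSION = "2023-11-01"  # a GA REST API version; adjust to yours

def build_indexer_run_url(endpoint: str, indexer_name: str) -> str:
    """URL for the on-demand 'run indexer' operation of Azure AI Search."""
    return f"{endpoint}/indexers/{indexer_name}/run?api-version={SEARCH_API_VERSION}"

def run_indexer(endpoint: str, indexer_name: str, api_key: str) -> int:
    """Kick off an indexer run; Azure AI Search answers 202 Accepted."""
    req = urllib.request.Request(
        build_indexer_run_url(endpoint, indexer_name),
        data=b"",
        method="POST",
        headers={"api-key": api_key, "Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return resp.status

# In the blob-triggered Azure Function (azure-functions SDK assumed),
# the trigger body would simply call run_indexer on each blob event:
#
#   import azure.functions as func
#
#   def main(blob: func.InputStream):
#       run_indexer("https://<service>.search.windows.net",
#                   "crawled-data-indexer", os.environ["SEARCH_API_KEY"])
```

Because the indexer itself tracks change state against the data source, the function only has to request a run; it never re-uploads or diffs documents.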

Solution Details - Key Components 

Web Crawler with Scrapy 

  • Scrapy, a popular Python web crawling framework, is used to extract unstructured data from public websites. 
  • Configurable spiders allow users to customize what data is scraped and how it's stored in Azure Blob Storage.
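As a rough illustration of how such a spider might be configured (the project name, pipeline class, domain, and URL patterns below are all placeholders, not the repository's actual values), the Scrapy pieces look roughly like this; the spider class is left commented out because it needs the Scrapy package at runtime:

```python
# settings.py (fragment) -- project-level Scrapy configuration
BOT_NAME = "kb_crawler"
ROBOTSTXT_OBEY = True          # respect the target site's robots.txt
DEPTH_LIMIT = 3                # keep the crawl bounded
ITEM_PIPELINES = {
    # hypothetical pipeline that writes each scraped page to Blob Storage
    "kb_crawler.pipelines.AzureBlobPipeline": 300,
}

# spider.py (fragment) -- a minimal CrawlSpider following PDF/HTML links
# from scrapy.spiders import CrawlSpider, Rule
# from scrapy.linkextractors import LinkExtractor
#
# class KnowledgeBaseSpider(CrawlSpider):
#     name = "kb"
#     allowed_domains = ["example.com"]          # placeholder domain
#     start_urls = ["https://example.com/"]
#     rules = [Rule(LinkExtractor(allow=(r"\.pdf$", r"\.html?$", "/docs/")),
#                   callback="parse_item", follow=True)]
#
#     def parse_item(self, response):
#         yield {"url": response.url, "body": response.body}
```

Tuning the `allow` patterns and `DEPTH_LIMIT` is the main lever for controlling what lands in the Blob container, and therefore what the indexer processes.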

Azure Blob Storage account 

  • Acts as the primary storage for the crawled unstructured data. 
  • Ensures scalability, secure storage and access of unstructured data. 

Blob Trigger with Azure Functions 

  • Monitors changes in the data stored in Blob (create, update, delete). 
  • Triggers an action to re-index the data in Azure AI Search whenever a change is detected. 

Azure AI Search Indexer 

  • Indexes the crawled data to make it searchable. 
  • Automatically syncs with the Blob data source, ensuring up-to-date content in the search index. 
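The data-source-to-indexer wiring described above boils down to two small REST payloads. This is a hedged sketch using the documented Azure AI Search definition shapes; the resource names ("kb-datasource", "kb-index", "kb-indexer"), the container name, and the hourly schedule are illustrative choices, not values from the customer deployment.

```python
def build_blob_datasource(name: str, connection_string: str,
                          container: str) -> dict:
    """Data source definition pointing Azure AI Search at a blob container."""
    return {
        "name": name,
        "type": "azureblob",
        "credentials": {"connectionString": connection_string},
        "container": {"name": container},
    }

def build_indexer(name: str, datasource: str, index: str) -> dict:
    """Indexer definition that syncs the data source into the index."""
    return {
        "name": name,
        "dataSourceName": datasource,
        "targetIndexName": index,
        # run hourly; on-demand runs can still be triggered via REST
        "schedule": {"interval": "PT1H"},
        "parameters": {
            "configuration": {"dataToExtract": "contentAndMetadata"}
        },
    }
```

Each payload is PUT to the service's `/datasources` or `/indexers` endpoint; once both exist, the indexer keeps the index in sync with the container without any custom copy code.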

Step-by-Step Implementation

  1. Data ingestion: Azure Function that crawls the public website
    • Crawling: Set up an Azure Function for periodic crawling based on the starting page. Parse different file types (PDF / Word / HTML / text) and place them in an Azure Blob Storage account for further indexing by AI Search.
    • Code example: Function App


  2. Data processing: Setting up a blob trigger for crawled data
    • Configuration: Set up a blob trigger so the AI Search indexer fetches data from Blob Storage into a search index. Whenever a new file is created in Blob Storage, the trigger fires to update the index.
    • Data indexing: Create the AI Search index, indexer, data source, and skillset.
    • Create an index and schema. The index serves as the schema for the documents that Azure AI Search processes, and includes the fields you want to make searchable, such as title, content, and metadata. Create a data source connecting the Blob Storage account with the index, and create a skillset for the required skill types - for example, text extraction, language detection, OCR, or entity recognition. Configure the indexer to run on demand or on a schedule.
    • Code example for creating the index, indexer, skillset, and data source: Index Code, Data Source, Indexer
    • The final piece of the indexing pipeline is the configuration of the blob-trigger based indexer: its connection to the Blob Storage account, the indexer skillsets that extract relevant information, and AI Search's built-in Azure OpenAI embedding skill, which generates and stores vector embeddings in AI Search.
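For orientation, here is a minimal sketch of what the index schema and embedding skillset definitions might look like. The field names, 1536-dimension default, and deployment name are assumptions; the `#Microsoft.Skills.Text.AzureOpenAIEmbeddingSkill` OData type is the documented built-in embedding skill. A complete index would also need a `vectorSearch` section defining the referenced profile, omitted here for brevity.

```python
def build_index_schema(name: str, dimensions: int = 1536) -> dict:
    """Minimal index schema: searchable content plus a vector field."""
    return {
        "name": name,
        "fields": [
            {"name": "id", "type": "Edm.String", "key": True},
            {"name": "title", "type": "Edm.String", "searchable": True},
            {"name": "content", "type": "Edm.String", "searchable": True},
            {"name": "contentVector", "type": "Collection(Edm.Single)",
             "searchable": True, "dimensions": dimensions,
             "vectorSearchProfile": "default-profile"},
        ],
    }

def build_embedding_skillset(name: str, openai_uri: str,
                             deployment: str) -> dict:
    """Skillset using the built-in Azure OpenAI embedding skill."""
    return {
        "name": name,
        "skills": [{
            "@odata.type": "#Microsoft.Skills.Text.AzureOpenAIEmbeddingSkill",
            "resourceUri": openai_uri,   # e.g. https://<aoai>.openai.azure.com
            "deploymentId": deployment,  # embedding model deployment name
            "inputs": [{"name": "text", "source": "/document/content"}],
            "outputs": [{"name": "embedding"}],
        }],
    }
```

Attaching this skillset to the indexer is what lets the pipeline produce and store embeddings without any external call-out to OpenAI from custom code.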


Key Benefits 

This solution streamlines data ingestion and indexing by automating the process through an Azure Function with blob triggers. Unlike conventional approaches, which require manually connecting each service and uploading data to the index, this method significantly reduces effort while keeping both the data and the index fresh.

Conventional vs. AI Search Indexer with Blob Trigger Approach 

| Feature | Conventional Azure Function Approach | AI Search Indexer / Blob Trigger Approach |
| --- | --- | --- |
| Architecture complexity | High, with multiple Azure Functions and API calls | Low, with integrated data source and skillset in Azure AI Search |
| Document processing | Manual triggering with custom error handling | Automated with skillset-based processing and error handling |
| Scalability | Limited by Azure Function execution times and scaling rules | Highly scalable with built-in support in Azure AI Search |
| Skillset and enrichment | Requires external API calls to OpenAI | Built-in skillset in Azure AI Search |
| Code maintenance | High, due to separate workflows for OpenAI and AI Search calls | Minimal, with configuration-driven skillset and indexing |
| Cost efficiency | Potentially high due to multiple API calls | Cost-effective with a single integrated indexing and enrichment flow |
More specifically, the solution delivers the following outcomes:  

  1. Real-Time Indexing and Updates 
    • The automated pipeline efficiently crawls public websites, extracts content, and stores it in Azure Blob Storage. Any modifications to the data source trigger updates in Azure AI Search, ensuring real-time indexing. This enables seamless content discovery and retrieval with minimal latency. 
  2. Enhanced Security & Resilience 
    • This solution enhances security and resilience by operating entirely within Azure’s internal services, eliminating the need for external function app calls. By leveraging Azure’s built-in security controls, it minimizes exposure to external threats while ensuring high availability. 
  3. Cost-Efficient & Scalable 
    • The solution is designed to optimize costs while providing scalability. Azure AI Search’s pay-as-you-go pricing ensures cost-effectiveness, and the architecture supports automatic scaling to handle varying workloads without manual intervention. 
  4. Simplified Configuration and Management 
    • With a publicly available GitHub repository containing the necessary code for Azure AI Search and deployment pipelines, this solution is easy to configure and deploy. It is ideal for organizations managing dynamic content, as it reduces operational complexity and accelerates production time. 
  5. Built-in AI-Powered Enrichment 
    • The indexing pipeline incorporates AI-powered enrichment capabilities, such as OCR, entity recognition, language detection, sentiment analysis and even the Azure OpenAI embedding skill. These skills simplify implementation and enable deeper insights into the knowledge base. 

Conclusion 

Using a blob trigger with an AI Search indexer is a modern, efficient, and scalable approach to dynamically indexing data. It eliminates the limitations of traditional indexing methods by automating data ingestion, enriching content with AI capabilities, and ensuring secure, cost-effective, and resilient operations. This approach is particularly suited for scenarios with dynamic or high-volume data requirements, offering significant advantages to storage admins and developers alike.

 

Links & Reference for detailed implementation 

  • GitHub Repository Details: 
    The complete implementation can be found in the GitHub repository. 
    Key files include: 
    • spider.py: Defines the Scrapy spiders for web crawling used within the Azure Function.
    • azure_function.py: Azure Function that pushes crawled data to Blob Storage, which fires the blob trigger.
    • indexer.py, datasource.py, skillset.py: Python scripts for defining the Azure AI Search index, data source, and skillset.
    • Bicep/main/main.json: Infrastructure as Code (IaC) template for deploying the entire solution.
    • Instruction Manual: End-to-end instructions covering the infrastructure resources created via the Bicep template, the Azure Function deployment, and the setup of the Azure AI Search indexer.
        

#AzureSearch #CognitiveSearch #AIIndexer #AzureBlobStorage #DocumentIndexing #AIIntegration #AzureStorage #Storage

Updated Apr 03, 2025
Version 1.0

