AI-Powered PDF Processing in C#: Summarize, Extract, and Analyze Documents with IronPDF
PDFs have become the universal language of business documentation, legal contracts, financial reports, and academic research. Yet for decades, the wealth of information locked inside these documents remained largely inaccessible to automated systems. Traditional PDF processing could extract raw text, but understanding what that text meant—identifying key clauses in contracts, extracting structured financial data, or answering complex questions about document content—required painstaking manual review.
The AI revolution has changed everything. Modern large language models (LLMs) can now read, understand, and analyze PDFs with human-like comprehension. Anthropic's newly released Claude Opus 4.6 (February 2026) introduces agent teams and a 1M-token context window for processing entire document libraries in a single session. Combined with GPT-5's established 272,000-token context window and unified reasoning capabilities, these models can identify contract obligations, summarize PDF documents into executive briefings, extract data into JSON format, and answer nuanced questions about document content with unprecedented accuracy.
IronPDF rises to this challenge by providing seamless integration between C# applications and AI services like OpenAI and Azure OpenAI. Through the IronPdf.Extensions.AI package, developers can add document summarization, intelligent querying, and structured data extraction to their .NET applications with just a few lines of code. Built on Microsoft Semantic Kernel, this integration handles the complexity of preparing PDFs for AI consumption, managing context windows, and orchestrating multi-step AI workflows—letting you focus on building powerful document intelligence features rather than wrestling with infrastructure.
In this tutorial, we'll explore how to harness AI for PDF processing in C#. You'll learn to summarize documents, extract structured data, build question-answering systems, implement RAG (Retrieval-Augmented Generation) patterns for long documents, and process document libraries at scale. Whether you're building legal discovery tools, financial analysis systems, or research automation platforms, you'll discover how IronPDF and the latest AI models can transform your document workflows.
TL;DR: Quickstart Guide
Transform any PDF into an actionable summary after initializing the AI engine. IronPDF's AI extension connects seamlessly to Azure OpenAI through Microsoft Semantic Kernel, allowing you to extract key insights from documents.
Get started with NuGet now:
Install IronPDF with NuGet Package Manager
Copy and run this code snippet.
await IronPdf.AI.PdfAIEngine.Summarize("contract.pdf", "summary.txt", azureEndpoint, azureApiKey);
Deploy to test on your live environment
Before using AI features, you must initialize the Semantic Kernel with your Azure OpenAI credentials (see Initializing the AI Engine below). After you've purchased or signed up for a 30-day trial of IronPDF, add your license key at the start of your application.
IronPdf.License.LicenseKey = "KEY";
Imports IronPdf
IronPdf.License.LicenseKey = "KEY"
- TL;DR: Quickstart Guide
- The AI + PDF Opportunity
- IronPDF's Built-in AI Integration
- Document Summarization
- Intelligent Data Extraction
- Question-Answering Over Documents
- Batch AI Processing
- Real-World Use Cases
- Troubleshooting & Technical Support
The AI + PDF Opportunity
Why PDFs Are the Biggest Untapped Data Source
PDFs represent one of the largest repositories of structured business knowledge in the modern enterprise. Professional documents—contracts, financial statements, compliance reports, legal briefs, and research papers—are predominantly stored in PDF format. These documents contain critical business intelligence: contract terms that define obligations and liabilities, financial metrics that drive investment decisions, regulatory requirements that ensure compliance, and research findings that guide strategy.
Yet traditional approaches to PDF processing have been severely limited. Basic text extraction tools can pull raw characters from a page, but they lose crucial context: table structures collapse into jumbled text, multi-column layouts become nonsensical, and the semantic relationships between sections disappear.
The breakthrough comes from AI's ability to understand context and structure. Modern LLMs don't just see words—they comprehend document organization, recognize patterns like contract clauses or financial tables, and can extract meaning even from complex layouts. GPT-5's unified reasoning system with its real-time router and Claude Sonnet 4.5's enhanced agentic capabilities both demonstrate significantly reduced hallucination rates compared to earlier models, making them reliable for professional document analysis.
How LLMs Understand Document Structure
Large language models bring sophisticated natural language processing capabilities to PDF analysis. GPT-5's hybrid architecture features multiple sub-models (main, mini, thinking, nano) with a real-time router that dynamically selects the optimal variant based on task complexity—simple questions route to faster models while complex reasoning tasks engage the full model.
Claude Opus 4.6 excels particularly at long-running agentic tasks, with agent teams that coordinate directly on segmented jobs and a 1M-token context window that can process entire document libraries without chunking.

This contextual understanding enables LLMs to perform tasks that require genuine comprehension. When analyzing a contract, an LLM can identify not just clauses containing the word "termination," but understand the specific conditions under which termination is permitted, the notice requirements involved, and the liabilities that result. The technical foundation enabling this capability is the transformer architecture that powers modern LLMs, with GPT-5's context window supporting up to 272,000 input tokens and Claude Sonnet 4.5's 200K token window providing comprehensive document coverage.
IronPDF's Built-in AI Integration
Installing IronPDF and AI Extensions
Getting started with AI-powered PDF processing requires the core IronPDF library, the AI extensions package, and Microsoft Semantic Kernel dependencies.
Install IronPDF with NuGet Package Manager:
PM > Install-Package IronPdf
PM > Install-Package IronPdf.Extensions.AI
PM > Install-Package Microsoft.SemanticKernel
PM > Install-Package Microsoft.SemanticKernel.Plugins.Memory
These packages work together to provide a complete solution. IronPDF handles all PDF-related operations—text extraction, page rendering, format conversion—while the AI extension manages the integration with language models through Microsoft Semantic Kernel.
Add <NoWarn>$(NoWarn);SKEXP0001;SKEXP0010;SKEXP0050</NoWarn> to your .csproj PropertyGroup to suppress the experimental-API compiler warnings raised by Semantic Kernel.
Configuring Your OpenAI/Azure API Key
Before you can leverage AI features, you need to configure access to an AI service provider. IronPDF's AI extension supports both OpenAI and Azure OpenAI. Azure OpenAI is often preferred for enterprise applications because it provides enhanced security features, compliance certifications, and the ability to keep data within specific geographic regions.
To configure Azure OpenAI, you'll need your Azure endpoint URL, API key, and deployment names for both chat and embedding models from the Azure portal.
Initializing the AI Engine
IronPDF's AI extension uses Microsoft Semantic Kernel under the hood. Before using any AI features, you must initialize the kernel with your Azure OpenAI credentials and configure the memory store for document processing.
// Initialize IronPDF AI with Azure OpenAI credentials
using IronPdf;
using IronPdf.AI;
using Microsoft.SemanticKernel;
using Microsoft.SemanticKernel.Memory;
using Microsoft.SemanticKernel.Connectors.OpenAI;
// Azure OpenAI configuration
string azureEndpoint = "https://your-resource.openai.azure.com/";
string apiKey = "your-azure-api-key";
string chatDeployment = "gpt-4o";
string embeddingDeployment = "text-embedding-ada-002";
// Initialize Semantic Kernel with Azure OpenAI
var kernel = Kernel.CreateBuilder()
.AddAzureOpenAITextEmbeddingGeneration(embeddingDeployment, azureEndpoint, apiKey)
.AddAzureOpenAIChatCompletion(chatDeployment, azureEndpoint, apiKey)
.Build();
// Create memory store for document embeddings
var memory = new MemoryBuilder()
.WithMemoryStore(new VolatileMemoryStore())
.WithAzureOpenAITextEmbeddingGeneration(embeddingDeployment, azureEndpoint, apiKey)
.Build();
// Initialize IronPDF AI
IronDocumentAI.Initialize(kernel, memory);
Console.WriteLine("IronPDF AI initialized successfully with Azure OpenAI");
Imports IronPdf
Imports IronPdf.AI
Imports Microsoft.SemanticKernel
Imports Microsoft.SemanticKernel.Memory
Imports Microsoft.SemanticKernel.Connectors.OpenAI
' Azure OpenAI configuration
Dim azureEndpoint As String = "https://your-resource.openai.azure.com/"
Dim apiKey As String = "your-azure-api-key"
Dim chatDeployment As String = "gpt-4o"
Dim embeddingDeployment As String = "text-embedding-ada-002"
' Initialize Semantic Kernel with Azure OpenAI
Dim kernel = Kernel.CreateBuilder() _
.AddAzureOpenAITextEmbeddingGeneration(embeddingDeployment, azureEndpoint, apiKey) _
.AddAzureOpenAIChatCompletion(chatDeployment, azureEndpoint, apiKey) _
.Build()
' Create memory store for document embeddings
Dim memory = New MemoryBuilder() _
.WithMemoryStore(New VolatileMemoryStore()) _
.WithAzureOpenAITextEmbeddingGeneration(embeddingDeployment, azureEndpoint, apiKey) _
.Build()
' Initialize IronPDF AI
IronDocumentAI.Initialize(kernel, memory)
Console.WriteLine("IronPDF AI initialized successfully with Azure OpenAI")
The initialization creates two key components:
- Kernel: Handles chat completions and text embedding generation through Azure OpenAI
- Memory: Stores document embeddings for semantic search and retrieval operations
Once initialized with IronDocumentAI.Initialize(), you can use AI features throughout your application. For production applications, storing credentials in environment variables or Azure Key Vault is strongly recommended.
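As a minimal sketch of the environment-variable approach (the helper and the variable names are illustrative assumptions, not part of the IronPDF API):

```csharp
using System;

// Resolve a required credential from the environment instead of
// hard-coding it in source control.
static string RequireEnv(string name) =>
    Environment.GetEnvironmentVariable(name)
    ?? throw new InvalidOperationException($"{name} is not set");

// Example usage (names are illustrative; match your own configuration):
// string azureEndpoint = RequireEnv("AZURE_OPENAI_ENDPOINT");
// string apiKey = RequireEnv("AZURE_OPENAI_API_KEY");
```

Failing fast with a clear message when a variable is missing is usually preferable to passing an empty string through to the Azure client, where the error surfaces later and more cryptically.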
How IronPDF Prepares PDFs for AI Context
One of the most challenging aspects of AI-powered PDF processing is preparing documents for consumption by language models. While GPT-5 supports up to 272,000 input tokens and Claude Opus 4.6 now offers a 1M token context window, a single legal contract or financial report can still easily exceed older models' limits.
IronPDF's AI extension handles this complexity through intelligent document preparation. When you call an AI method, IronPDF first extracts text from the PDF while preserving structural information—identifying paragraphs, preserving table structures, and maintaining the relationships between sections.
For documents that exceed context limits, IronPDF implements strategic chunking at semantic breakpoints—natural divisions in document structure like section headers, page breaks, or paragraph boundaries.
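IronPDF's chunking implementation is internal to the library, but the general idea can be sketched as follows (the paragraph-boundary heuristic and the roughly-4-characters-per-token estimate are assumptions for illustration, not IronPDF code):

```csharp
using System;
using System.Collections.Generic;
using System.Text;

// Rough heuristic: about 4 characters per token for English text (an assumption).
static int EstimateTokens(string text) => text.Length / 4;

// Split extracted text at paragraph boundaries (blank lines) so each chunk
// stays under a token budget. A fuller implementation would also split at
// section headers and page breaks.
static List<string> ChunkAtParagraphs(string text, int maxTokens)
{
    var chunks = new List<string>();
    var current = new StringBuilder();
    foreach (string paragraph in text.Split(new[] { "\n\n" }, StringSplitOptions.RemoveEmptyEntries))
    {
        // Start a new chunk when adding this paragraph would exceed the budget.
        if (current.Length > 0 &&
            EstimateTokens(current.ToString()) + EstimateTokens(paragraph) > maxTokens)
        {
            chunks.Add(current.ToString().Trim());
            current.Clear();
        }
        current.Append(paragraph).Append("\n\n");
    }
    if (current.Length > 0)
        chunks.Add(current.ToString().Trim());
    return chunks;
}
```

Splitting at semantic boundaries rather than at fixed character offsets keeps each chunk self-contained, so the model never sees a sentence or table row cut in half.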
Document Summarization
Single Document Summaries
Document summarization delivers immediate value by condensing lengthy documents into digestible insights. The Summarize method handles the entire workflow: extracting text, preparing it for AI consumption, requesting a summary from the language model, and saving the results.
Input
The code loads a PDF using PdfDocument.FromFile() and calls pdf.Summarize() to generate a concise summary, then saves the result to a text file.
// Summarize a PDF document using IronPDF AI
using IronPdf;
using IronPdf.AI;
using Microsoft.SemanticKernel;
using Microsoft.SemanticKernel.Memory;
using Microsoft.SemanticKernel.Connectors.OpenAI;
// Azure OpenAI configuration
string azureEndpoint = "https://your-resource.openai.azure.com/";
string apiKey = "your-azure-api-key";
string chatDeployment = "gpt-4o";
string embeddingDeployment = "text-embedding-ada-002";
// Initialize Semantic Kernel
var kernel = Kernel.CreateBuilder()
.AddAzureOpenAITextEmbeddingGeneration(embeddingDeployment, azureEndpoint, apiKey)
.AddAzureOpenAIChatCompletion(chatDeployment, azureEndpoint, apiKey)
.Build();
var memory = new MemoryBuilder()
.WithMemoryStore(new VolatileMemoryStore())
.WithAzureOpenAITextEmbeddingGeneration(embeddingDeployment, azureEndpoint, apiKey)
.Build();
IronDocumentAI.Initialize(kernel, memory);
// Load and summarize PDF
var pdf = PdfDocument.FromFile("sample-report.pdf");
string summary = await pdf.Summarize();
Console.WriteLine("Document Summary:");
Console.WriteLine(summary);
File.WriteAllText("report-summary.txt", summary);
Console.WriteLine("\nSummary saved to report-summary.txt");
Imports IronPdf
Imports IronPdf.AI
Imports Microsoft.SemanticKernel
Imports Microsoft.SemanticKernel.Memory
Imports Microsoft.SemanticKernel.Connectors.OpenAI
' Azure OpenAI configuration
Dim azureEndpoint As String = "https://your-resource.openai.azure.com/"
Dim apiKey As String = "your-azure-api-key"
Dim chatDeployment As String = "gpt-4o"
Dim embeddingDeployment As String = "text-embedding-ada-002"
' Initialize Semantic Kernel
Dim kernel = Kernel.CreateBuilder() _
.AddAzureOpenAITextEmbeddingGeneration(embeddingDeployment, azureEndpoint, apiKey) _
.AddAzureOpenAIChatCompletion(chatDeployment, azureEndpoint, apiKey) _
.Build()
Dim memory = New MemoryBuilder() _
.WithMemoryStore(New VolatileMemoryStore()) _
.WithAzureOpenAITextEmbeddingGeneration(embeddingDeployment, azureEndpoint, apiKey) _
.Build()
IronDocumentAI.Initialize(kernel, memory)
' Load and summarize PDF
Dim pdf = PdfDocument.FromFile("sample-report.pdf")
Dim summary As String = Await pdf.Summarize()
Console.WriteLine("Document Summary:")
Console.WriteLine(summary)
File.WriteAllText("report-summary.txt", summary)
Console.WriteLine(vbCrLf & "Summary saved to report-summary.txt")
Console Output

The summarization process uses sophisticated prompting to ensure high-quality results. Both GPT-5 and Claude Sonnet 4.5 in 2026 feature significantly improved instruction following capabilities, ensuring summaries capture essential information while remaining concise and readable.
For a more detailed explanation of document summarization techniques and advanced options, please refer to our how-to guide.
Multi-Document Synthesis
Many real-world scenarios require synthesizing information across multiple documents. A legal team might need to identify common clauses across a portfolio of contracts, or a financial analyst might want to compare metrics across quarterly reports.
The approach to multi-document synthesis involves processing each document individually to extract key information, then aggregating these insights for final synthesis.
This example iterates through multiple PDFs, calling pdf.Summarize() on each, then uses pdf.Query() with the combined summaries to generate a unified synthesis.
// Synthesize insights across multiple related documents (e.g., quarterly reports into annual summary)
using IronPdf;
using IronPdf.AI;
using Microsoft.SemanticKernel;
using Microsoft.SemanticKernel.Memory;
using Microsoft.SemanticKernel.Connectors.OpenAI;
// Azure OpenAI configuration
string azureEndpoint = "https://your-resource.openai.azure.com/";
string apiKey = "your-azure-api-key";
string chatDeployment = "gpt-4o";
string embeddingDeployment = "text-embedding-ada-002";
// Initialize Semantic Kernel
var kernel = Kernel.CreateBuilder()
.AddAzureOpenAITextEmbeddingGeneration(embeddingDeployment, azureEndpoint, apiKey)
.AddAzureOpenAIChatCompletion(chatDeployment, azureEndpoint, apiKey)
.Build();
var memory = new MemoryBuilder()
.WithMemoryStore(new VolatileMemoryStore())
.WithAzureOpenAITextEmbeddingGeneration(embeddingDeployment, azureEndpoint, apiKey)
.Build();
IronDocumentAI.Initialize(kernel, memory);
// Define documents to synthesize
string[] documentPaths = {
"Q1-report.pdf",
"Q2-report.pdf",
"Q3-report.pdf",
"Q4-report.pdf"
};
var documentSummaries = new List<string>();
// Summarize each document
foreach (string path in documentPaths)
{
var pdf = PdfDocument.FromFile(path);
string summary = await pdf.Summarize();
documentSummaries.Add($"=== {Path.GetFileName(path)} ===\n{summary}");
Console.WriteLine($"Processed: {path}");
}
// Combine and synthesize across all documents
string combinedSummaries = string.Join("\n\n", documentSummaries);
// Use one of the source documents as the query context; the combined summaries travel in the prompt
var synthesisDoc = PdfDocument.FromFile(documentPaths[0]);
string synthesisQuery = @"Based on the quarterly summaries below, provide an annual synthesis:
1. Overall trends across quarters
2. Key achievements and challenges
3. Year-over-year patterns
Summaries:
" + combinedSummaries;
string synthesis = await synthesisDoc.Query(synthesisQuery);
Console.WriteLine("\n=== Annual Synthesis ===");
Console.WriteLine(synthesis);
File.WriteAllText("annual-synthesis.txt", synthesis);
Imports IronPdf
Imports IronPdf.AI
Imports Microsoft.SemanticKernel
Imports Microsoft.SemanticKernel.Memory
Imports Microsoft.SemanticKernel.Connectors.OpenAI
Imports System.IO
' Azure OpenAI configuration
Dim azureEndpoint As String = "https://your-resource.openai.azure.com/"
Dim apiKey As String = "your-azure-api-key"
Dim chatDeployment As String = "gpt-4o"
Dim embeddingDeployment As String = "text-embedding-ada-002"
' Initialize Semantic Kernel
Dim kernel = Kernel.CreateBuilder() _
.AddAzureOpenAITextEmbeddingGeneration(embeddingDeployment, azureEndpoint, apiKey) _
.AddAzureOpenAIChatCompletion(chatDeployment, azureEndpoint, apiKey) _
.Build()
Dim memory = New MemoryBuilder() _
.WithMemoryStore(New VolatileMemoryStore()) _
.WithAzureOpenAITextEmbeddingGeneration(embeddingDeployment, azureEndpoint, apiKey) _
.Build()
IronDocumentAI.Initialize(kernel, memory)
' Define documents to synthesize
Dim documentPaths As String() = {
"Q1-report.pdf",
"Q2-report.pdf",
"Q3-report.pdf",
"Q4-report.pdf"
}
Dim documentSummaries = New List(Of String)()
' Summarize each document
For Each path As String In documentPaths
Dim pdf = PdfDocument.FromFile(path)
Dim summary As String = Await pdf.Summarize()
documentSummaries.Add($"=== {Path.GetFileName(path)} ==={vbCrLf}{summary}")
Console.WriteLine($"Processed: {path}")
Next
' Combine and synthesize across all documents
Dim combinedSummaries As String = String.Join(vbCrLf & vbCrLf, documentSummaries)
Dim synthesisDoc = PdfDocument.FromFile(documentPaths(0))
Dim synthesisQuery As String = "Based on the quarterly summaries below, provide an annual synthesis:" & vbCrLf &
"1. Overall trends across quarters" & vbCrLf &
"2. Key achievements and challenges" & vbCrLf &
"3. Year-over-year patterns" & vbCrLf & vbCrLf &
"Summaries:" & vbCrLf & combinedSummaries
Dim synthesis As String = Await synthesisDoc.Query(synthesisQuery)
Console.WriteLine(vbCrLf & "=== Annual Synthesis ===")
Console.WriteLine(synthesis)
File.WriteAllText("annual-synthesis.txt", synthesis)
This pattern scales effectively to large document sets. By processing documents in parallel and managing intermediate results, you can analyze hundreds or thousands of documents while maintaining coherent synthesis.
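The parallel processing is something you wire up yourself; one hedged sketch uses Task.WhenAll with a SemaphoreSlim throttle. Here the summarizeAsync delegate stands in for the per-document work (e.g. PdfDocument.FromFile followed by Summarize), and the maxConcurrency default is an assumption you should tune against your Azure rate limits:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Threading;
using System.Threading.Tasks;

// Summarize many documents concurrently while capping in-flight API calls.
static async Task<IReadOnlyList<string>> SummarizeAllAsync(
    IEnumerable<string> paths,
    Func<string, Task<string>> summarizeAsync,
    int maxConcurrency = 4)
{
    using var gate = new SemaphoreSlim(maxConcurrency);
    var tasks = paths.Select(async path =>
    {
        await gate.WaitAsync();           // block when maxConcurrency tasks are in flight
        try { return await summarizeAsync(path); }
        finally { gate.Release(); }
    });
    return await Task.WhenAll(tasks);     // results come back in input order
}
```

Task.WhenAll preserves the order of the input sequence, so each summary can be matched back to its source path by index.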
Executive Summary Generation
Executive summaries require a different approach than standard summarization. Rather than simply condensing content, an executive summary must identify the most business-critical information, highlight key decisions or recommendations, and present findings in a format suitable for leadership review.
The code uses pdf.Query() with a structured prompt requesting key decisions, critical findings, financial impact, and risk assessment in business language.
// Generate executive summary from strategic documents for C-suite leadership
using IronPdf;
using IronPdf.AI;
using Microsoft.SemanticKernel;
using Microsoft.SemanticKernel.Memory;
using Microsoft.SemanticKernel.Connectors.OpenAI;
// Azure OpenAI configuration
string azureEndpoint = "https://your-resource.openai.azure.com/";
string apiKey = "your-azure-api-key";
string chatDeployment = "gpt-4o";
string embeddingDeployment = "text-embedding-ada-002";
// Initialize Semantic Kernel
var kernel = Kernel.CreateBuilder()
.AddAzureOpenAITextEmbeddingGeneration(embeddingDeployment, azureEndpoint, apiKey)
.AddAzureOpenAIChatCompletion(chatDeployment, azureEndpoint, apiKey)
.Build();
var memory = new MemoryBuilder()
.WithMemoryStore(new VolatileMemoryStore())
.WithAzureOpenAITextEmbeddingGeneration(embeddingDeployment, azureEndpoint, apiKey)
.Build();
IronDocumentAI.Initialize(kernel, memory);
var pdf = PdfDocument.FromFile("strategic-plan.pdf");
string executiveQuery = @"Create an executive summary for C-suite leadership. Include:
**Key Decisions Required:**
- List any decisions needing executive approval
**Critical Findings:**
- Top 3-5 most important findings (bullet points)
**Financial Impact:**
- Revenue/cost implications if mentioned
**Risk Assessment:**
- High-priority risks identified
**Recommended Actions:**
- Immediate next steps
Keep under 500 words. Use business language appropriate for board presentation.";
string executiveSummary = await pdf.Query(executiveQuery);
File.WriteAllText("executive-summary.txt", executiveSummary);
Console.WriteLine("Executive summary saved to executive-summary.txt");
Imports IronPdf
Imports IronPdf.AI
Imports Microsoft.SemanticKernel
Imports Microsoft.SemanticKernel.Memory
Imports Microsoft.SemanticKernel.Connectors.OpenAI
' Azure OpenAI configuration
Dim azureEndpoint As String = "https://your-resource.openai.azure.com/"
Dim apiKey As String = "your-azure-api-key"
Dim chatDeployment As String = "gpt-4o"
Dim embeddingDeployment As String = "text-embedding-ada-002"
' Initialize Semantic Kernel
Dim kernel = Kernel.CreateBuilder() _
.AddAzureOpenAITextEmbeddingGeneration(embeddingDeployment, azureEndpoint, apiKey) _
.AddAzureOpenAIChatCompletion(chatDeployment, azureEndpoint, apiKey) _
.Build()
Dim memory = New MemoryBuilder() _
.WithMemoryStore(New VolatileMemoryStore()) _
.WithAzureOpenAITextEmbeddingGeneration(embeddingDeployment, azureEndpoint, apiKey) _
.Build()
IronDocumentAI.Initialize(kernel, memory)
Dim pdf = PdfDocument.FromFile("strategic-plan.pdf")
Dim executiveQuery As String = "Create an executive summary for C-suite leadership. Include:" & vbCrLf & vbCrLf &
"**Key Decisions Required:**" & vbCrLf &
"- List any decisions needing executive approval" & vbCrLf & vbCrLf &
"**Critical Findings:**" & vbCrLf &
"- Top 3-5 most important findings (bullet points)" & vbCrLf & vbCrLf &
"**Financial Impact:**" & vbCrLf &
"- Revenue/cost implications if mentioned" & vbCrLf & vbCrLf &
"**Risk Assessment:**" & vbCrLf &
"- High-priority risks identified" & vbCrLf & vbCrLf &
"**Recommended Actions:**" & vbCrLf &
"- Immediate next steps" & vbCrLf & vbCrLf &
"Keep under 500 words. Use business language appropriate for board presentation."
Dim executiveSummary As String = Await pdf.Query(executiveQuery)
File.WriteAllText("executive-summary.txt", executiveSummary)
Console.WriteLine("Executive summary saved to executive-summary.txt")The resulting executive summary prioritizes actionable information over comprehensive coverage, delivering exactly what decision-makers need without overwhelming detail.
Intelligent Data Extraction
Extracting Structured Data to JSON
One of the most powerful applications of AI-powered PDF processing is extracting structured data from unstructured documents. The key to successful structured extraction in 2026 is using JSON schemas with structured output modes. GPT-5 introduces improved structured outputs, while Claude Sonnet 4.5 offers enhanced tool orchestration for reliable data extraction.
Input
The code calls pdf.Query() with a JSON schema prompt, then uses JsonSerializer.Deserialize() to parse and validate the extracted invoice data.
// Extract structured invoice data as JSON from PDF
using IronPdf;
using IronPdf.AI;
using Microsoft.SemanticKernel;
using Microsoft.SemanticKernel.Memory;
using Microsoft.SemanticKernel.Connectors.OpenAI;
using System.Text.Json;
// Azure OpenAI configuration
string azureEndpoint = "https://your-resource.openai.azure.com/";
string apiKey = "your-azure-api-key";
string chatDeployment = "gpt-4o";
string embeddingDeployment = "text-embedding-ada-002";
// Initialize Semantic Kernel
var kernel = Kernel.CreateBuilder()
.AddAzureOpenAITextEmbeddingGeneration(embeddingDeployment, azureEndpoint, apiKey)
.AddAzureOpenAIChatCompletion(chatDeployment, azureEndpoint, apiKey)
.Build();
var memory = new MemoryBuilder()
.WithMemoryStore(new VolatileMemoryStore())
.WithAzureOpenAITextEmbeddingGeneration(embeddingDeployment, azureEndpoint, apiKey)
.Build();
IronDocumentAI.Initialize(kernel, memory);
var pdf = PdfDocument.FromFile("sample-invoice.pdf");
// Define JSON schema for extraction
string extractionQuery = @"Extract invoice data and return as JSON with this exact structure:
{
""invoiceNumber"": ""string"",
""invoiceDate"": ""YYYY-MM-DD"",
""dueDate"": ""YYYY-MM-DD"",
""vendor"": {
""name"": ""string"",
""address"": ""string"",
""taxId"": ""string or null""
},
""customer"": {
""name"": ""string"",
""address"": ""string""
},
""lineItems"": [
{
""description"": ""string"",
""quantity"": number,
""unitPrice"": number,
""total"": number
}
],
""subtotal"": number,
""taxRate"": number,
""taxAmount"": number,
""total"": number,
""currency"": ""string""
}
Return ONLY valid JSON, no additional text.";
string jsonResponse = await pdf.Query(extractionQuery);
// Parse and save JSON
try
{
var invoiceData = JsonSerializer.Deserialize<JsonElement>(jsonResponse);
string formattedJson = JsonSerializer.Serialize(invoiceData, new JsonSerializerOptions { WriteIndented = true });
Console.WriteLine("Extracted Invoice Data:");
Console.WriteLine(formattedJson);
File.WriteAllText("invoice-data.json", formattedJson);
}
catch (JsonException)
{
Console.WriteLine("Unable to parse JSON response");
File.WriteAllText("invoice-raw-response.txt", jsonResponse);
}
Imports IronPdf
Imports IronPdf.AI
Imports Microsoft.SemanticKernel
Imports Microsoft.SemanticKernel.Memory
Imports Microsoft.SemanticKernel.Connectors.OpenAI
Imports System.Text.Json
' Azure OpenAI configuration
Dim azureEndpoint As String = "https://your-resource.openai.azure.com/"
Dim apiKey As String = "your-azure-api-key"
Dim chatDeployment As String = "gpt-4o"
Dim embeddingDeployment As String = "text-embedding-ada-002"
' Initialize Semantic Kernel
Dim kernel = Kernel.CreateBuilder() _
.AddAzureOpenAITextEmbeddingGeneration(embeddingDeployment, azureEndpoint, apiKey) _
.AddAzureOpenAIChatCompletion(chatDeployment, azureEndpoint, apiKey) _
.Build()
Dim memory = New MemoryBuilder() _
.WithMemoryStore(New VolatileMemoryStore()) _
.WithAzureOpenAITextEmbeddingGeneration(embeddingDeployment, azureEndpoint, apiKey) _
.Build()
IronDocumentAI.Initialize(kernel, memory)
Dim pdf = PdfDocument.FromFile("sample-invoice.pdf")
' Define JSON schema for extraction
Dim extractionQuery As String = "Extract invoice data and return as JSON with this exact structure:" & vbCrLf & _
"{" & vbCrLf & _
" ""invoiceNumber"": ""string""," & vbCrLf & _
" ""invoiceDate"": ""YYYY-MM-DD""," & vbCrLf & _
" ""dueDate"": ""YYYY-MM-DD""," & vbCrLf & _
" ""vendor"": {" & vbCrLf & _
" ""name"": ""string""," & vbCrLf & _
" ""address"": ""string""," & vbCrLf & _
" ""taxId"": ""string or null""" & vbCrLf & _
" }," & vbCrLf & _
" ""customer"": {" & vbCrLf & _
" ""name"": ""string""," & vbCrLf & _
" ""address"": ""string""" & vbCrLf & _
" }," & vbCrLf & _
" ""lineItems"": [" & vbCrLf & _
" {" & vbCrLf & _
" ""description"": ""string""," & vbCrLf & _
" ""quantity"": number," & vbCrLf & _
" ""unitPrice"": number," & vbCrLf & _
" ""total"": number" & vbCrLf & _
" }" & vbCrLf & _
" ]," & vbCrLf & _
" ""subtotal"": number," & vbCrLf & _
" ""taxRate"": number," & vbCrLf & _
" ""taxAmount"": number," & vbCrLf & _
" ""total"": number," & vbCrLf & _
" ""currency"": ""string""" & vbCrLf & _
"}" & vbCrLf & _
vbCrLf & _
"Return ONLY valid JSON, no additional text."
Dim jsonResponse As String = Await pdf.Query(extractionQuery)
' Parse and save JSON
Try
Dim invoiceData = JsonSerializer.Deserialize(Of JsonElement)(jsonResponse)
Dim formattedJson As String = JsonSerializer.Serialize(invoiceData, New JsonSerializerOptions With {.WriteIndented = True})
Console.WriteLine("Extracted Invoice Data:")
Console.WriteLine(formattedJson)
File.WriteAllText("invoice-data.json", formattedJson)
Catch ex As JsonException
Console.WriteLine("Unable to parse JSON response")
File.WriteAllText("invoice-raw-response.txt", jsonResponse)
End Try
Partial Screenshot of Generated JSON File

Modern AI models support structured output modes that constrain responses to a provided JSON schema, which largely eliminates the need for error handling around malformed responses. The try/catch fallback shown above remains a sensible safety net when structured output mode is not enabled.
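When structured output mode is not available, models sometimes wrap JSON in markdown code fences, which breaks JsonSerializer.Deserialize. A small defensive cleanup before parsing can look like this (an illustrative sketch, not an IronPDF API):

```csharp
using System;

// Strip markdown code fences (```json ... ```) that some models wrap
// around JSON output; plain JSON passes through unchanged.
static string StripCodeFences(string response)
{
    string s = response.Trim();
    if (s.StartsWith("```"))
    {
        int firstNewline = s.IndexOf('\n');   // end of the ```json header line
        int lastFence = s.LastIndexOf("```"); // start of the closing fence
        if (firstNewline >= 0 && lastFence > firstNewline)
            s = s.Substring(firstNewline + 1, lastFence - firstNewline - 1).Trim();
    }
    return s;
}
```

Running the model's response through a cleanup step like this before JsonSerializer.Deserialize makes the parse path considerably more robust.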
Contract Clause Identification
Legal contracts contain specific types of clauses that carry particular importance: termination provisions, liability limitations, indemnification requirements, intellectual property assignments, and confidentiality obligations. AI-powered clause identification automates this analysis while maintaining high accuracy.
This example uses pdf.Query() with a clause-focused JSON schema to extract contract type, parties, critical dates, and individual clauses with risk levels.
// Analyze contract clauses and identify key terms, risks, and critical dates
using IronPdf;
using IronPdf.AI;
using Microsoft.SemanticKernel;
using Microsoft.SemanticKernel.Memory;
using Microsoft.SemanticKernel.Connectors.OpenAI;
using System.Text.Json;
// Azure OpenAI configuration
string azureEndpoint = "https://your-resource.openai.azure.com/";
string apiKey = "your-azure-api-key";
string chatDeployment = "gpt-4o";
string embeddingDeployment = "text-embedding-ada-002";
// Initialize Semantic Kernel
var kernel = Kernel.CreateBuilder()
.AddAzureOpenAITextEmbeddingGeneration(embeddingDeployment, azureEndpoint, apiKey)
.AddAzureOpenAIChatCompletion(chatDeployment, azureEndpoint, apiKey)
.Build();
var memory = new MemoryBuilder()
.WithMemoryStore(new VolatileMemoryStore())
.WithAzureOpenAITextEmbeddingGeneration(embeddingDeployment, azureEndpoint, apiKey)
.Build();
IronDocumentAI.Initialize(kernel, memory);
var pdf = PdfDocument.FromFile("contract.pdf");
// Define JSON schema for contract analysis
string clauseQuery = @"Analyze this contract and identify key clauses. Return JSON:
{
""contractType"": ""string"",
""parties"": [""string""],
""effectiveDate"": ""string"",
""clauses"": [
{
""type"": ""Termination|Liability|Indemnification|Confidentiality|IP|Payment|Warranty|Other"",
""title"": ""string"",
""summary"": ""string"",
""riskLevel"": ""Low|Medium|High"",
""keyTerms"": [""string""]
}
],
""criticalDates"": [
{
""description"": ""string"",
""date"": ""string""
}
],
""overallRiskAssessment"": ""Low|Medium|High"",
""recommendations"": [""string""]
}
Focus on: termination rights, liability caps, indemnification, IP ownership, confidentiality, payment terms.
Return ONLY valid JSON.";
string analysisJson = await pdf.Query(clauseQuery);
try
{
var analysis = JsonSerializer.Deserialize<JsonElement>(analysisJson);
string formatted = JsonSerializer.Serialize(analysis, new JsonSerializerOptions { WriteIndented = true });
Console.WriteLine("Contract Clause Analysis:");
Console.WriteLine(formatted);
File.WriteAllText("contract-analysis.json", formatted);
// Display high-risk clauses
Console.WriteLine("\n=== High Risk Clauses ===");
foreach (var clause in analysis.GetProperty("clauses").EnumerateArray())
{
if (clause.GetProperty("riskLevel").GetString() == "High")
{
Console.WriteLine($"- {clause.GetProperty("type")}: {clause.GetProperty("summary")}");
}
}
}
catch (JsonException)
{
Console.WriteLine("Unable to parse contract analysis");
File.WriteAllText("contract-analysis-raw.txt", analysisJson);
}
Imports IronPdf
Imports IronPdf.AI
Imports Microsoft.SemanticKernel
Imports Microsoft.SemanticKernel.Memory
Imports Microsoft.SemanticKernel.Connectors.OpenAI
Imports System.Text.Json
' Azure OpenAI configuration
Dim azureEndpoint As String = "https://your-resource.openai.azure.com/"
Dim apiKey As String = "your-azure-api-key"
Dim chatDeployment As String = "gpt-4o"
Dim embeddingDeployment As String = "text-embedding-ada-002"
' Initialize Semantic Kernel
Dim kernel = Kernel.CreateBuilder() _
.AddAzureOpenAITextEmbeddingGeneration(embeddingDeployment, azureEndpoint, apiKey) _
.AddAzureOpenAIChatCompletion(chatDeployment, azureEndpoint, apiKey) _
.Build()
Dim memory = New MemoryBuilder() _
.WithMemoryStore(New VolatileMemoryStore()) _
.WithAzureOpenAITextEmbeddingGeneration(embeddingDeployment, azureEndpoint, apiKey) _
.Build()
IronDocumentAI.Initialize(kernel, memory)
Dim pdf = PdfDocument.FromFile("contract.pdf")
' Define JSON schema for contract analysis
Dim clauseQuery As String = "Analyze this contract and identify key clauses. Return JSON:
{
""contractType"": ""string"",
""parties"": [""string""],
""effectiveDate"": ""string"",
""clauses"": [
{
""type"": ""Termination|Liability|Indemnification|Confidentiality|IP|Payment|Warranty|Other"",
""title"": ""string"",
""summary"": ""string"",
""riskLevel"": ""Low|Medium|High"",
""keyTerms"": [""string""]
}
],
""criticalDates"": [
{
""description"": ""string"",
""date"": ""string""
}
],
""overallRiskAssessment"": ""Low|Medium|High"",
""recommendations"": [""string""]
}
Focus on: termination rights, liability caps, indemnification, IP ownership, confidentiality, payment terms.
Return ONLY valid JSON."
Dim analysisJson As String = Await pdf.Query(clauseQuery)
Try
Dim analysis = JsonSerializer.Deserialize(Of JsonElement)(analysisJson)
Dim formatted As String = JsonSerializer.Serialize(analysis, New JsonSerializerOptions With {.WriteIndented = True})
Console.WriteLine("Contract Clause Analysis:")
Console.WriteLine(formatted)
File.WriteAllText("contract-analysis.json", formatted)
' Display high-risk clauses
Console.WriteLine(vbCrLf & "=== High Risk Clauses ===")
For Each clause In analysis.GetProperty("clauses").EnumerateArray()
If clause.GetProperty("riskLevel").GetString() = "High" Then
Console.WriteLine($"- {clause.GetProperty("type")}: {clause.GetProperty("summary")}")
End If
Next
Catch ex As JsonException
Console.WriteLine("Unable to parse contract analysis")
File.WriteAllText("contract-analysis-raw.txt", analysisJson)
End Try
This capability transforms contract review from a sequential, manual process into an automated, scalable workflow. Legal teams can quickly identify high-risk provisions across hundreds of contracts.
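The per-contract JSON produced above rolls up naturally into a portfolio-level risk report. The sketch below is a minimal illustration, assuming each contract has already been analyzed: the hardcoded sample JSON stands in for real `pdf.Query()` output, and it simply groups high-risk clauses by type.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.Json;

// Sample analysis JSON per contract, standing in for real pdf.Query() output
var analyses = new Dictionary<string, string>
{
    ["vendor-msa.pdf"] = """
        {"clauses":[{"type":"Liability","riskLevel":"High"},
                    {"type":"Payment","riskLevel":"Low"}]}
        """,
    ["nda.pdf"] = """
        {"clauses":[{"type":"Confidentiality","riskLevel":"Medium"},
                    {"type":"IP","riskLevel":"High"}]}
        """
};

// Count high-risk clauses per clause type across the whole portfolio
var highRiskByType = new Dictionary<string, int>();
foreach (var (file, json) in analyses)
{
    using var doc = JsonDocument.Parse(json);
    foreach (var clause in doc.RootElement.GetProperty("clauses").EnumerateArray())
    {
        if (clause.GetProperty("riskLevel").GetString() == "High")
        {
            string type = clause.GetProperty("type").GetString() ?? "Other";
            highRiskByType[type] = highRiskByType.GetValueOrDefault(type) + 1;
            Console.WriteLine($"{file}: high-risk {type} clause");
        }
    }
}
Console.WriteLine($"Total high-risk clauses: {highRiskByType.Values.Sum()}");
```

In a real pipeline, the loop body would run once per analyzed contract, and the aggregate could feed a dashboard or triage queue.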
Financial Data Parsing
Financial documents contain critical quantitative data embedded in complex narratives and tables. AI-powered parsing excels at financial documents because it understands context—distinguishing between historical results and forward projections, identifying whether numbers are in thousands or millions, and understanding relationships between different metrics.
The code uses pdf.Query() with a financial JSON schema to extract income statement data, balance sheet metrics, and forward guidance into structured output.
:path=/static-assets/pdf/content-code-examples/tutorials/ai-powered-pdf-processing-csharp/financial-data-extraction.cs// Extract financial metrics from annual reports and earnings documents
using IronPdf;
using IronPdf.AI;
using Microsoft.SemanticKernel;
using Microsoft.SemanticKernel.Memory;
using Microsoft.SemanticKernel.Connectors.OpenAI;
using System.Text.Json;
// Azure OpenAI configuration
string azureEndpoint = "https://your-resource.openai.azure.com/";
string apiKey = "your-azure-api-key";
string chatDeployment = "gpt-4o";
string embeddingDeployment = "text-embedding-ada-002";
// Initialize Semantic Kernel
var kernel = Kernel.CreateBuilder()
.AddAzureOpenAITextEmbeddingGeneration(embeddingDeployment, azureEndpoint, apiKey)
.AddAzureOpenAIChatCompletion(chatDeployment, azureEndpoint, apiKey)
.Build();
var memory = new MemoryBuilder()
.WithMemoryStore(new VolatileMemoryStore())
.WithAzureOpenAITextEmbeddingGeneration(embeddingDeployment, azureEndpoint, apiKey)
.Build();
IronDocumentAI.Initialize(kernel, memory);
var pdf = PdfDocument.FromFile("annual-report.pdf");
// Define JSON schema for financial extraction (numbers in millions)
string financialQuery = @"Extract financial metrics from this document. Return JSON:
{
""reportPeriod"": ""string"",
""company"": ""string"",
""currency"": ""string"",
""incomeStatement"": {
""revenue"": number,
""costOfRevenue"": number,
""grossProfit"": number,
""operatingExpenses"": number,
""operatingIncome"": number,
""netIncome"": number,
""eps"": number
},
""balanceSheet"": {
""totalAssets"": number,
""totalLiabilities"": number,
""shareholdersEquity"": number,
""cash"": number,
""totalDebt"": number
},
""keyMetrics"": {
""revenueGrowthYoY"": ""string"",
""grossMargin"": ""string"",
""operatingMargin"": ""string"",
""netMargin"": ""string"",
""debtToEquity"": number
},
""guidance"": {
""nextQuarterRevenue"": ""string"",
""fullYearRevenue"": ""string"",
""notes"": ""string""
}
}
Use null for unavailable data. Numbers in millions unless stated.
Return ONLY valid JSON.";
string financialJson = await pdf.Query(financialQuery);
try
{
var financials = JsonSerializer.Deserialize<JsonElement>(financialJson);
string formatted = JsonSerializer.Serialize(financials, new JsonSerializerOptions { WriteIndented = true });
Console.WriteLine("Extracted Financial Data:");
Console.WriteLine(formatted);
File.WriteAllText("financial-data.json", formatted);
}
catch (JsonException)
{
Console.WriteLine("Unable to parse financial data");
File.WriteAllText("financial-raw.txt", financialJson);
}
Imports IronPdf
Imports IronPdf.AI
Imports Microsoft.SemanticKernel
Imports Microsoft.SemanticKernel.Memory
Imports Microsoft.SemanticKernel.Connectors.OpenAI
Imports System.Text.Json
' Azure OpenAI configuration
Dim azureEndpoint As String = "https://your-resource.openai.azure.com/"
Dim apiKey As String = "your-azure-api-key"
Dim chatDeployment As String = "gpt-4o"
Dim embeddingDeployment As String = "text-embedding-ada-002"
' Initialize Semantic Kernel
Dim kernel = Kernel.CreateBuilder() _
.AddAzureOpenAITextEmbeddingGeneration(embeddingDeployment, azureEndpoint, apiKey) _
.AddAzureOpenAIChatCompletion(chatDeployment, azureEndpoint, apiKey) _
.Build()
Dim memory = New MemoryBuilder() _
.WithMemoryStore(New VolatileMemoryStore()) _
.WithAzureOpenAITextEmbeddingGeneration(embeddingDeployment, azureEndpoint, apiKey) _
.Build()
IronDocumentAI.Initialize(kernel, memory)
Dim pdf = PdfDocument.FromFile("annual-report.pdf")
' Define JSON schema for financial extraction (numbers in millions)
Dim financialQuery As String = "Extract financial metrics from this document. Return JSON:
{
""reportPeriod"": ""string"",
""company"": ""string"",
""currency"": ""string"",
""incomeStatement"": {
""revenue"": number,
""costOfRevenue"": number,
""grossProfit"": number,
""operatingExpenses"": number,
""operatingIncome"": number,
""netIncome"": number,
""eps"": number
},
""balanceSheet"": {
""totalAssets"": number,
""totalLiabilities"": number,
""shareholdersEquity"": number,
""cash"": number,
""totalDebt"": number
},
""keyMetrics"": {
""revenueGrowthYoY"": ""string"",
""grossMargin"": ""string"",
""operatingMargin"": ""string"",
""netMargin"": ""string"",
""debtToEquity"": number
},
""guidance"": {
""nextQuarterRevenue"": ""string"",
""fullYearRevenue"": ""string"",
""notes"": ""string""
}
}
Use null for unavailable data. Numbers in millions unless stated.
Return ONLY valid JSON."
Dim financialJson As String = Await pdf.Query(financialQuery)
Try
Dim financials = JsonSerializer.Deserialize(Of JsonElement)(financialJson)
Dim formatted As String = JsonSerializer.Serialize(financials, New JsonSerializerOptions With {.WriteIndented = True})
Console.WriteLine("Extracted Financial Data:")
Console.WriteLine(formatted)
File.WriteAllText("financial-data.json", formatted)
Catch ex As JsonException
Console.WriteLine("Unable to parse financial data")
File.WriteAllText("financial-raw.txt", financialJson)
End Try
The extracted structured data can feed directly into financial models, time-series databases, or analytics platforms, enabling automated tracking of metrics across reporting periods.
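Once the JSON is on disk, it can be bound to plain C# types for downstream analysis. This is a minimal sketch, assuming the extraction succeeded: the record shapes mirror a subset of the prompt's schema, and the sample JSON stands in for real model output.

```csharp
using System;
using System.Text.Json;

// Sample extraction output (figures in millions), standing in for real model output
string json = """
    {"company":"Contoso","currency":"USD",
     "incomeStatement":{"revenue":5200.0,"grossProfit":3120.0,"netIncome":780.0}}
    """;

var report = JsonSerializer.Deserialize<FinancialReport>(json,
    new JsonSerializerOptions { PropertyNameCaseInsensitive = true })!;

// Derived metrics are straightforward once the data is typed
double grossMargin = report.IncomeStatement.GrossProfit / report.IncomeStatement.Revenue;
double netMargin = report.IncomeStatement.NetIncome / report.IncomeStatement.Revenue;
Console.WriteLine($"{report.Company}: gross margin {grossMargin:P1}, net margin {netMargin:P1}");

// Record shapes mirror a subset of the extraction schema above
public record IncomeStatement(double Revenue, double GrossProfit, double NetIncome);
public record FinancialReport(string Company, string Currency, IncomeStatement IncomeStatement);
```

Typed binding also catches schema drift early: if the model returns a malformed field, deserialization fails loudly instead of propagating bad numbers into a financial model.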
Custom Extraction Prompts
Many organizations have unique extraction requirements based on their specific domain, document formats, or business processes. IronPDF's AI integration fully supports custom extraction prompts, allowing you to define exactly what information should be extracted and how it should be structured.
This example demonstrates pdf.Query() with a research-focused schema extracting methodology, key findings with confidence levels, and limitations from academic papers.
:path=/static-assets/pdf/content-code-examples/tutorials/ai-powered-pdf-processing-csharp/custom-research-extraction.cs// Extract structured research metadata from academic papers
using IronPdf;
using IronPdf.AI;
using Microsoft.SemanticKernel;
using Microsoft.SemanticKernel.Memory;
using Microsoft.SemanticKernel.Connectors.OpenAI;
using System.Text.Json;
// Azure OpenAI configuration
string azureEndpoint = "https://your-resource.openai.azure.com/";
string apiKey = "your-azure-api-key";
string chatDeployment = "gpt-4o";
string embeddingDeployment = "text-embedding-ada-002";
// Initialize Semantic Kernel
var kernel = Kernel.CreateBuilder()
.AddAzureOpenAITextEmbeddingGeneration(embeddingDeployment, azureEndpoint, apiKey)
.AddAzureOpenAIChatCompletion(chatDeployment, azureEndpoint, apiKey)
.Build();
var memory = new MemoryBuilder()
.WithMemoryStore(new VolatileMemoryStore())
.WithAzureOpenAITextEmbeddingGeneration(embeddingDeployment, azureEndpoint, apiKey)
.Build();
IronDocumentAI.Initialize(kernel, memory);
var pdf = PdfDocument.FromFile("research-paper.pdf");
// Define JSON schema for research paper extraction
string researchQuery = @"Extract structured information from this research paper. Return JSON:
{
""title"": ""string"",
""authors"": [""string""],
""institution"": ""string"",
""publicationDate"": ""string"",
""abstract"": ""string"",
""researchQuestion"": ""string"",
""methodology"": {
""type"": ""Quantitative|Qualitative|Mixed Methods"",
""approach"": ""string"",
""sampleSize"": ""string"",
""dataCollection"": ""string""
},
""keyFindings"": [
{
""finding"": ""string"",
""significance"": ""string"",
""confidence"": ""High|Medium|Low""
}
],
""limitations"": [""string""],
""futureWork"": [""string""],
""keywords"": [""string""]
}
Focus on extracting verifiable claims and noting uncertainty.
Return ONLY valid JSON.";
string extractionResult = await pdf.Query(researchQuery);
try
{
var research = JsonSerializer.Deserialize<JsonElement>(extractionResult);
string formatted = JsonSerializer.Serialize(research, new JsonSerializerOptions { WriteIndented = true });
Console.WriteLine("Research Paper Extraction:");
Console.WriteLine(formatted);
File.WriteAllText("research-extraction.json", formatted);
// Display key findings with confidence levels
Console.WriteLine("\n=== Key Findings ===");
foreach (var finding in research.GetProperty("keyFindings").EnumerateArray())
{
string confidence = finding.GetProperty("confidence").GetString() ?? "Unknown";
Console.WriteLine($"[{confidence}] {finding.GetProperty("finding")}");
}
}
catch (JsonException)
{
Console.WriteLine("Unable to parse research extraction");
File.WriteAllText("research-raw.txt", extractionResult);
}
Imports IronPdf
Imports IronPdf.AI
Imports Microsoft.SemanticKernel
Imports Microsoft.SemanticKernel.Memory
Imports Microsoft.SemanticKernel.Connectors.OpenAI
Imports System.Text.Json
' Azure OpenAI configuration
Dim azureEndpoint As String = "https://your-resource.openai.azure.com/"
Dim apiKey As String = "your-azure-api-key"
Dim chatDeployment As String = "gpt-4o"
Dim embeddingDeployment As String = "text-embedding-ada-002"
' Initialize Semantic Kernel
Dim kernel = Kernel.CreateBuilder() _
.AddAzureOpenAITextEmbeddingGeneration(embeddingDeployment, azureEndpoint, apiKey) _
.AddAzureOpenAIChatCompletion(chatDeployment, azureEndpoint, apiKey) _
.Build()
Dim memory = New MemoryBuilder() _
.WithMemoryStore(New VolatileMemoryStore()) _
.WithAzureOpenAITextEmbeddingGeneration(embeddingDeployment, azureEndpoint, apiKey) _
.Build()
IronDocumentAI.Initialize(kernel, memory)
Dim pdf = PdfDocument.FromFile("research-paper.pdf")
' Define JSON schema for research paper extraction
Dim researchQuery As String = "Extract structured information from this research paper. Return JSON:
{
""title"": ""string"",
""authors"": [""string""],
""institution"": ""string"",
""publicationDate"": ""string"",
""abstract"": ""string"",
""researchQuestion"": ""string"",
""methodology"": {
""type"": ""Quantitative|Qualitative|Mixed Methods"",
""approach"": ""string"",
""sampleSize"": ""string"",
""dataCollection"": ""string""
},
""keyFindings"": [
{
""finding"": ""string"",
""significance"": ""string"",
""confidence"": ""High|Medium|Low""
}
],
""limitations"": [""string""],
""futureWork"": [""string""],
""keywords"": [""string""]
}
Focus on extracting verifiable claims and noting uncertainty.
Return ONLY valid JSON."
Dim extractionResult As String = Await pdf.Query(researchQuery)
Try
Dim research = JsonSerializer.Deserialize(Of JsonElement)(extractionResult)
Dim formatted As String = JsonSerializer.Serialize(research, New JsonSerializerOptions With {.WriteIndented = True})
Console.WriteLine("Research Paper Extraction:")
Console.WriteLine(formatted)
File.WriteAllText("research-extraction.json", formatted)
' Display key findings with confidence levels
Console.WriteLine(vbCrLf & "=== Key Findings ===")
For Each finding In research.GetProperty("keyFindings").EnumerateArray()
Dim confidence As String = If(finding.GetProperty("confidence").GetString(), "Unknown")
Console.WriteLine($"[{confidence}] {finding.GetProperty("finding")}")
Next
Catch ex As JsonException
Console.WriteLine("Unable to parse research extraction")
File.WriteAllText("research-raw.txt", extractionResult)
End Try
Custom prompts transform AI-powered extraction from a generic tool into a specialized solution tailored to your specific needs.
Question-Answering Over Documents
Building a PDF Q&A System
Question-answering systems enable users to interact with PDF documents conversationally, asking questions in natural language and receiving accurate, contextual answers. The simplest pattern extracts text from the PDF, combines it with the user's question in a prompt, and requests an answer from the AI; IronPDF's Memorize() and Query() methods build on this pattern with semantic retrieval.
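To make the basic pattern concrete, here is a hedged sketch of just the prompt-assembly step. The BuildQaPrompt helper is illustrative, not an IronPDF API; the grounding instruction at the top is what keeps the model anchored to the supplied text.

```csharp
using System;

// Illustrative helper: combine extracted document text with a user question,
// instructing the model to answer only from the supplied text
static string BuildQaPrompt(string documentText, string question) =>
    $"""
     Answer the question using ONLY the document below.
     If the answer is not in the document, say "Not found in document."

     --- DOCUMENT ---
     {documentText}
     --- END DOCUMENT ---

     Question: {question}
     """;

string prompt = BuildQaPrompt(
    "Clause 9: Either party may terminate with 30 days written notice.",
    "What is the notice period for termination?");
Console.WriteLine(prompt);
```

The prompt string would then go to any chat completion endpoint. Memorize() and Query() automate this assembly and add retrieval, so only relevant passages occupy the context window.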
The code calls pdf.Memorize() to index the document for semantic search, then enters an interactive loop using pdf.Query() to answer user questions.
:path=/static-assets/pdf/content-code-examples/tutorials/ai-powered-pdf-processing-csharp/pdf-question-answering.cs// Interactive Q&A system for querying PDF documents
using IronPdf;
using IronPdf.AI;
using Microsoft.SemanticKernel;
using Microsoft.SemanticKernel.Memory;
using Microsoft.SemanticKernel.Connectors.OpenAI;
// Azure OpenAI configuration
string azureEndpoint = "https://your-resource.openai.azure.com/";
string apiKey = "your-azure-api-key";
string chatDeployment = "gpt-4o";
string embeddingDeployment = "text-embedding-ada-002";
// Initialize Semantic Kernel
var kernel = Kernel.CreateBuilder()
.AddAzureOpenAITextEmbeddingGeneration(embeddingDeployment, azureEndpoint, apiKey)
.AddAzureOpenAIChatCompletion(chatDeployment, azureEndpoint, apiKey)
.Build();
var memory = new MemoryBuilder()
.WithMemoryStore(new VolatileMemoryStore())
.WithAzureOpenAITextEmbeddingGeneration(embeddingDeployment, azureEndpoint, apiKey)
.Build();
IronDocumentAI.Initialize(kernel, memory);
var pdf = PdfDocument.FromFile("sample-legal-document.pdf");
// Memorize document to enable persistent querying
await pdf.Memorize();
Console.WriteLine("PDF Q&A System - Type 'exit' to quit\n");
Console.WriteLine($"Document loaded and memorized: {pdf.PageCount} pages\n");
// Interactive Q&A loop
while (true)
{
Console.Write("Your question: ");
string? question = Console.ReadLine();
if (string.IsNullOrWhiteSpace(question) || question.ToLower() == "exit")
break;
string answer = await pdf.Query(question);
Console.WriteLine($"\nAnswer: {answer}\n");
Console.WriteLine(new string('-', 50) + "\n");
}
Console.WriteLine("Q&A session ended.");
Imports IronPdf
Imports IronPdf.AI
Imports Microsoft.SemanticKernel
Imports Microsoft.SemanticKernel.Memory
Imports Microsoft.SemanticKernel.Connectors.OpenAI
' Azure OpenAI configuration
Dim azureEndpoint As String = "https://your-resource.openai.azure.com/"
Dim apiKey As String = "your-azure-api-key"
Dim chatDeployment As String = "gpt-4o"
Dim embeddingDeployment As String = "text-embedding-ada-002"
' Initialize Semantic Kernel
Dim kernel = Kernel.CreateBuilder() _
.AddAzureOpenAITextEmbeddingGeneration(embeddingDeployment, azureEndpoint, apiKey) _
.AddAzureOpenAIChatCompletion(chatDeployment, azureEndpoint, apiKey) _
.Build()
Dim memory = New MemoryBuilder() _
.WithMemoryStore(New VolatileMemoryStore()) _
.WithAzureOpenAITextEmbeddingGeneration(embeddingDeployment, azureEndpoint, apiKey) _
.Build()
IronDocumentAI.Initialize(kernel, memory)
Dim pdf = PdfDocument.FromFile("sample-legal-document.pdf")
' Memorize document to enable persistent querying
Await pdf.Memorize()
Console.WriteLine("PDF Q&A System - Type 'exit' to quit" & vbCrLf)
Console.WriteLine($"Document loaded and memorized: {pdf.PageCount} pages" & vbCrLf)
' Interactive Q&A loop
While True
Console.Write("Your question: ")
Dim question As String = Console.ReadLine()
If String.IsNullOrWhiteSpace(question) OrElse question.ToLower() = "exit" Then
Exit While
End If
Dim answer As String = Await pdf.Query(question)
Console.WriteLine($"{vbCrLf}Answer: {answer}{vbCrLf}")
Console.WriteLine(New String("-"c, 50) & vbCrLf)
End While
Console.WriteLine("Q&A session ended.")
The key to effective Q&A in 2026 is constraining the AI to answer based solely on document content. GPT-5's "safe completions" training method and Claude Sonnet 4.5's improved alignment substantially reduce hallucination rates.
Chunking Long Documents for Context Windows
Many real-world documents exceed AI context windows, making effective chunking essential. Chunking divides a document into segments small enough to fit within the model's context window while preserving semantic coherence.
This code iterates through pdf.Pages, creating DocumentChunk objects with configurable maxChunkTokens and overlapTokens for context continuity.
:path=/static-assets/pdf/content-code-examples/tutorials/ai-powered-pdf-processing-csharp/semantic-document-chunking.cs// Split long documents into overlapping chunks for RAG systems
using IronPdf;
var pdf = PdfDocument.FromFile("long-document.pdf");
// Chunking configuration
int maxChunkTokens = 4000; // Leave room for prompts and responses
int overlapTokens = 200; // Overlap for context continuity
int approxCharsPerToken = 4; // Rough estimate for tokenization
int maxChunkChars = maxChunkTokens * approxCharsPerToken;
int overlapChars = overlapTokens * approxCharsPerToken;
var chunks = new List<DocumentChunk>();
var currentChunk = new System.Text.StringBuilder();
int chunkStartPage = 1;
int currentPage = 1;
for (int i = 0; i < pdf.PageCount; i++)
{
string pageText = pdf.Pages[i].Text;
currentPage = i + 1;
if (currentChunk.Length + pageText.Length > maxChunkChars && currentChunk.Length > 0)
{
chunks.Add(new DocumentChunk
{
Text = currentChunk.ToString(),
StartPage = chunkStartPage,
EndPage = currentPage - 1,
ChunkIndex = chunks.Count
});
// Create overlap with previous chunk for continuity
string overlap = currentChunk.Length > overlapChars
? currentChunk.ToString().Substring(currentChunk.Length - overlapChars)
: currentChunk.ToString();
currentChunk.Clear();
currentChunk.Append(overlap);
chunkStartPage = currentPage - 1;
}
currentChunk.AppendLine($"\n--- Page {currentPage} ---\n");
currentChunk.Append(pageText);
}
if (currentChunk.Length > 0)
{
chunks.Add(new DocumentChunk
{
Text = currentChunk.ToString(),
StartPage = chunkStartPage,
EndPage = currentPage,
ChunkIndex = chunks.Count
});
}
Console.WriteLine($"Document chunked into {chunks.Count} segments");
foreach (var chunk in chunks)
{
Console.WriteLine($" Chunk {chunk.ChunkIndex + 1}: Pages {chunk.StartPage}-{chunk.EndPage} ({chunk.Text.Length} chars)");
}
// Save chunk metadata for RAG indexing
File.WriteAllText("chunks-metadata.json", System.Text.Json.JsonSerializer.Serialize(
chunks.Select(c => new { c.ChunkIndex, c.StartPage, c.EndPage, Length = c.Text.Length }),
new System.Text.Json.JsonSerializerOptions { WriteIndented = true }
));
public class DocumentChunk
{
public string Text { get; set; } = "";
public int StartPage { get; set; }
public int EndPage { get; set; }
public int ChunkIndex { get; set; }
}
Imports IronPdf
Imports System.Text
Imports System.Text.Json
Imports System.IO
' Split long documents into overlapping chunks for RAG systems
Dim pdf = PdfDocument.FromFile("long-document.pdf")
' Chunking configuration
Dim maxChunkTokens As Integer = 4000 ' Leave room for prompts and responses
Dim overlapTokens As Integer = 200 ' Overlap for context continuity
Dim approxCharsPerToken As Integer = 4 ' Rough estimate for tokenization
Dim maxChunkChars As Integer = maxChunkTokens * approxCharsPerToken
Dim overlapChars As Integer = overlapTokens * approxCharsPerToken
Dim chunks As New List(Of DocumentChunk)()
Dim currentChunk As New StringBuilder()
Dim chunkStartPage As Integer = 1
Dim currentPage As Integer = 1
For i As Integer = 0 To pdf.PageCount - 1
Dim pageText As String = pdf.Pages(i).Text
currentPage = i + 1
If currentChunk.Length + pageText.Length > maxChunkChars AndAlso currentChunk.Length > 0 Then
chunks.Add(New DocumentChunk With {
.Text = currentChunk.ToString(),
.StartPage = chunkStartPage,
.EndPage = currentPage - 1,
.ChunkIndex = chunks.Count
})
' Create overlap with previous chunk for continuity
Dim overlap As String = If(currentChunk.Length > overlapChars,
currentChunk.ToString().Substring(currentChunk.Length - overlapChars),
currentChunk.ToString())
currentChunk.Clear()
currentChunk.Append(overlap)
chunkStartPage = currentPage - 1
End If
currentChunk.AppendLine(vbCrLf & "--- Page " & currentPage & " ---" & vbCrLf)
currentChunk.Append(pageText)
Next
If currentChunk.Length > 0 Then
chunks.Add(New DocumentChunk With {
.Text = currentChunk.ToString(),
.StartPage = chunkStartPage,
.EndPage = currentPage,
.ChunkIndex = chunks.Count
})
End If
Console.WriteLine($"Document chunked into {chunks.Count} segments")
For Each chunk In chunks
Console.WriteLine($" Chunk {chunk.ChunkIndex + 1}: Pages {chunk.StartPage}-{chunk.EndPage} ({chunk.Text.Length} chars)")
Next
' Save chunk metadata for RAG indexing
File.WriteAllText("chunks-metadata.json", JsonSerializer.Serialize(
chunks.Select(Function(c) New With {Key .ChunkIndex = c.ChunkIndex, Key .StartPage = c.StartPage, Key .EndPage = c.EndPage, Key .Length = c.Text.Length}),
New JsonSerializerOptions With {.WriteIndented = True}
))
Public Class DocumentChunk
Public Property Text As String = ""
Public Property StartPage As Integer
Public Property EndPage As Integer
Public Property ChunkIndex As Integer
End Class
Overlapping chunks provide continuity across boundaries, ensuring the AI has sufficient context even when relevant information spans chunk boundaries.
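The chars-per-token constant in the chunking code is only a rough heuristic. A slightly more careful estimate blends the two common rules of thumb, roughly 0.75 words per token and 4 characters per token; these figures are approximations, not tokenizer output, so use the model's real tokenizer when exact budgets matter.

```csharp
using System;

// Heuristic token estimate: average of the words-based (~0.75 words/token)
// and chars-based (~4 chars/token) rules of thumb. An approximation only;
// a real tokenizer is needed for exact context-window budgeting.
static int EstimateTokens(string text)
{
    if (string.IsNullOrWhiteSpace(text)) return 0;
    int words = text.Split((char[]?)null, StringSplitOptions.RemoveEmptyEntries).Length;
    double byWords = words / 0.75;
    double byChars = text.Length / 4.0;
    return (int)Math.Ceiling((byWords + byChars) / 2.0);
}

string sample = "Overlapping chunks provide continuity across boundaries.";
Console.WriteLine($"~{EstimateTokens(sample)} tokens for {sample.Length} chars");
```

Plugging such an estimator into the chunker (in place of the flat `approxCharsPerToken` multiplication) tightens chunk sizing for prose-heavy documents.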
RAG (Retrieval-Augmented Generation) Patterns
Retrieval-Augmented Generation represents a powerful pattern for AI-powered document analysis in 2026. Rather than feeding entire documents to the AI, RAG systems first retrieve only the relevant portions for a given query, then use those portions as context for generating answers.
The RAG workflow has three main phases: document preparation (chunking and creating embeddings), retrieval (searching for relevant chunks), and generation (using retrieved chunks as context for AI responses).
The code indexes multiple PDFs by calling pdf.Memorize() on each, then uses pdf.Query() to retrieve answers from the combined document memory.
:path=/static-assets/pdf/content-code-examples/tutorials/ai-powered-pdf-processing-csharp/rag-system-implementation.cs// Retrieval-Augmented Generation (RAG) system for querying across multiple indexed documents
using IronPdf;
using IronPdf.AI;
using Microsoft.SemanticKernel;
using Microsoft.SemanticKernel.Memory;
using Microsoft.SemanticKernel.Connectors.OpenAI;
// Azure OpenAI configuration
string azureEndpoint = "https://your-resource.openai.azure.com/";
string apiKey = "your-azure-api-key";
string chatDeployment = "gpt-4o";
string embeddingDeployment = "text-embedding-ada-002";
// Initialize Semantic Kernel
var kernel = Kernel.CreateBuilder()
.AddAzureOpenAITextEmbeddingGeneration(embeddingDeployment, azureEndpoint, apiKey)
.AddAzureOpenAIChatCompletion(chatDeployment, azureEndpoint, apiKey)
.Build();
var memory = new MemoryBuilder()
.WithMemoryStore(new VolatileMemoryStore())
.WithAzureOpenAITextEmbeddingGeneration(embeddingDeployment, azureEndpoint, apiKey)
.Build();
IronDocumentAI.Initialize(kernel, memory);
// Index all documents in folder
string[] documentPaths = Directory.GetFiles("documents/", "*.pdf");
Console.WriteLine($"Indexing {documentPaths.Length} documents...\n");
// Memorize each document (creates embeddings for retrieval)
foreach (string path in documentPaths)
{
var pdf = PdfDocument.FromFile(path);
await pdf.Memorize();
Console.WriteLine($"Indexed: {Path.GetFileName(path)} ({pdf.PageCount} pages)");
}
Console.WriteLine("\n=== RAG System Ready ===\n");
// Query across all indexed documents
string query = "What are the key compliance requirements for data retention?";
Console.WriteLine($"Query: {query}\n");
// Querying through any memorized document draws on the shared memory store,
// which holds embeddings for all indexed documents
var searchPdf = PdfDocument.FromFile(documentPaths[0]);
string answer = await searchPdf.Query(query);
Console.WriteLine($"Answer: {answer}");
// Interactive query loop
Console.WriteLine("\n--- Enter questions (type 'exit' to quit) ---\n");
while (true)
{
Console.Write("Question: ");
string? userQuery = Console.ReadLine();
if (string.IsNullOrWhiteSpace(userQuery) || userQuery.ToLower() == "exit")
break;
string response = await searchPdf.Query(userQuery);
Console.WriteLine($"\nAnswer: {response}\n");
}
Imports IronPdf
Imports IronPdf.AI
Imports Microsoft.SemanticKernel
Imports Microsoft.SemanticKernel.Memory
Imports Microsoft.SemanticKernel.Connectors.OpenAI
Imports System.IO
' Azure OpenAI configuration
Dim azureEndpoint As String = "https://your-resource.openai.azure.com/"
Dim apiKey As String = "your-azure-api-key"
Dim chatDeployment As String = "gpt-4o"
Dim embeddingDeployment As String = "text-embedding-ada-002"
' Initialize Semantic Kernel
Dim kernel = Kernel.CreateBuilder() _
.AddAzureOpenAITextEmbeddingGeneration(embeddingDeployment, azureEndpoint, apiKey) _
.AddAzureOpenAIChatCompletion(chatDeployment, azureEndpoint, apiKey) _
.Build()
Dim memory = New MemoryBuilder() _
.WithMemoryStore(New VolatileMemoryStore()) _
.WithAzureOpenAITextEmbeddingGeneration(embeddingDeployment, azureEndpoint, apiKey) _
.Build()
IronDocumentAI.Initialize(kernel, memory)
' Index all documents in folder
Dim documentPaths As String() = Directory.GetFiles("documents/", "*.pdf")
Console.WriteLine($"Indexing {documentPaths.Length} documents..." & vbCrLf)
' Memorize each document (creates embeddings for retrieval)
For Each path As String In documentPaths
Dim pdf = PdfDocument.FromFile(path)
Await pdf.Memorize()
Console.WriteLine($"Indexed: {Path.GetFileName(path)} ({pdf.PageCount} pages)")
Next
Console.WriteLine(vbCrLf & "=== RAG System Ready ===" & vbCrLf)
' Query across all indexed documents
Dim query As String = "What are the key compliance requirements for data retention?"
Console.WriteLine($"Query: {query}" & vbCrLf)
' Querying through any memorized document draws on the shared memory store,
' which holds embeddings for all indexed documents
Dim searchPdf = PdfDocument.FromFile(documentPaths(0))
Dim answer As String = Await searchPdf.Query(query)
Console.WriteLine($"Answer: {answer}")
' Interactive query loop
Console.WriteLine(vbCrLf & "--- Enter questions (type 'exit' to quit) ---" & vbCrLf)
While True
Console.Write("Question: ")
Dim userQuery As String = Console.ReadLine()
If String.IsNullOrWhiteSpace(userQuery) OrElse userQuery.ToLower() = "exit" Then
Exit While
End If
Dim response As String = Await searchPdf.Query(userQuery)
Console.WriteLine(vbCrLf & $"Answer: {response}" & vbCrLf)
End While
RAG systems excel at handling large document collections—legal case databases, technical documentation libraries, research archives. By retrieving only relevant portions, they maintain response quality while scaling to effectively unlimited document sizes.
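Under the hood, the retrieval phase is nearest-neighbor search over embedding vectors. The toy sketch below uses hand-made 3-dimensional vectors to keep the arithmetic visible; real embeddings have hundreds or thousands of dimensions and come from the embedding model, and the memory store performs this ranking for you.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Cosine similarity between two embedding vectors
static double Cosine(double[] a, double[] b)
{
    double dot = 0, na = 0, nb = 0;
    for (int i = 0; i < a.Length; i++)
    {
        dot += a[i] * b[i];
        na += a[i] * a[i];
        nb += b[i] * b[i];
    }
    return dot / (Math.Sqrt(na) * Math.Sqrt(nb));
}

// Toy index: chunk text paired with a hand-made embedding
var index = new List<(string Text, double[] Vec)>
{
    ("Data retention policy requires 7-year archival.", new[] { 0.9, 0.1, 0.0 }),
    ("Office holiday schedule for 2026.",               new[] { 0.0, 0.2, 0.9 }),
    ("Backups must be encrypted at rest.",              new[] { 0.7, 0.4, 0.1 }),
};

// Query embedding (would come from the same embedding model as the index)
double[] queryVec = { 0.85, 0.2, 0.05 };

// Rank chunks by similarity and keep the top 2 as generation context
var topChunks = index
    .OrderByDescending(c => Cosine(queryVec, c.Vec))
    .Take(2)
    .ToList();

foreach (var c in topChunks)
    Console.WriteLine($"{Cosine(queryVec, c.Vec):F3}  {c.Text}");
```

Only the top-ranked chunks reach the chat model, which is why RAG scales: context size stays constant no matter how many documents are indexed.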
Citing Sources from PDF Pages
For professional applications, AI answers must be verifiable. The citation approach involves maintaining metadata about chunk origins during chunking and retrieval. Each chunk stores not just text content but also its source page numbers, section headers, and position in the document.
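One way to make such citations reliable is to carry provenance into the prompt itself: tag each retrieved chunk with its page range before the model sees it, so the model can only cite pages it was actually shown. A minimal sketch, with chunk data hardcoded for illustration:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Retrieved chunks with provenance metadata (hardcoded for illustration)
var retrieved = new List<SourcedChunk>
{
    new("Either party may terminate with 30 days written notice.", 12, 12),
    new("Termination for cause takes effect immediately.", 14, 15),
};

// Tag each chunk with its page range so the model can only cite what it saw
string context = string.Join("\n\n", retrieved.Select(c =>
    $"[Source: Pages {c.StartPage}-{c.EndPage}]\n{c.Text}"));

Console.WriteLine(context);

// A chunk of retrieved text plus where it came from
record SourcedChunk(string Text, int StartPage, int EndPage);
```

Citations extracted from the answer can then be checked against the page ranges that were actually supplied, which is exactly what the verification loop in the example below does with the original PDF pages.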
The code uses pdf.Query() with citation instructions, then calls ExtractCitedPages() with regex to parse page references and verify sources using pdf.Pages[pageNum - 1].Text.
:path=/static-assets/pdf/content-code-examples/tutorials/ai-powered-pdf-processing-csharp/answer-with-citations.cs// Answer questions with page citations and source verification
using IronPdf;
using IronPdf.AI;
using Microsoft.SemanticKernel;
using Microsoft.SemanticKernel.Memory;
using Microsoft.SemanticKernel.Connectors.OpenAI;
using System.Text.RegularExpressions;
// Azure OpenAI configuration
string azureEndpoint = "https://your-resource.openai.azure.com/";
string apiKey = "your-azure-api-key";
string chatDeployment = "gpt-4o";
string embeddingDeployment = "text-embedding-ada-002";
// Initialize Semantic Kernel
var kernel = Kernel.CreateBuilder()
.AddAzureOpenAITextEmbeddingGeneration(embeddingDeployment, azureEndpoint, apiKey)
.AddAzureOpenAIChatCompletion(chatDeployment, azureEndpoint, apiKey)
.Build();
var memory = new MemoryBuilder()
.WithMemoryStore(new VolatileMemoryStore())
.WithAzureOpenAITextEmbeddingGeneration(embeddingDeployment, azureEndpoint, apiKey)
.Build();
IronDocumentAI.Initialize(kernel, memory);
var pdf = PdfDocument.FromFile("sample-legal-document.pdf");
await pdf.Memorize();
string question = "What are the termination conditions in this agreement?";
// Request citations in query
string citationQuery = $@"{question}
IMPORTANT: Include specific page citations in your answer using the format (Page X) or (Pages X-Y).
Only cite information that appears in the document.";
string answerWithCitations = await pdf.Query(citationQuery);
Console.WriteLine("Question: " + question);
Console.WriteLine("\nAnswer with Citations:");
Console.WriteLine(answerWithCitations);
// Extract cited page numbers using regex
var citedPages = ExtractCitedPages(answerWithCitations);
Console.WriteLine($"\nCited pages: {string.Join(", ", citedPages)}");
// Verify citations with page excerpts
Console.WriteLine("\n=== Source Verification ===");
foreach (int pageNum in citedPages.Take(3))
{
if (pageNum <= pdf.PageCount && pageNum > 0)
{
string pageText = pdf.Pages[pageNum - 1].Text;
string excerpt = pageText.Length > 200 ? pageText.Substring(0, 200) + "..." : pageText;
Console.WriteLine($"\nPage {pageNum} excerpt:\n{excerpt}");
}
}
// Extract page numbers from citation format (Page X) or (Pages X-Y)
List<int> ExtractCitedPages(string text)
{
var pages = new HashSet<int>();
var matches = Regex.Matches(text, @"\(Pages?\s*(\d+)(?:\s*-\s*(\d+))?\)", RegexOptions.IgnoreCase);
foreach (Match match in matches)
{
int startPage = int.Parse(match.Groups[1].Value);
pages.Add(startPage);
if (match.Groups[2].Success)
{
int endPage = int.Parse(match.Groups[2].Value);
for (int p = startPage; p <= endPage; p++)
pages.Add(p);
}
}
return pages.OrderBy(p => p).ToList();
}
Imports IronPdf
Imports IronPdf.AI
Imports Microsoft.SemanticKernel
Imports Microsoft.SemanticKernel.Memory
Imports Microsoft.SemanticKernel.Connectors.OpenAI
Imports System.Text.RegularExpressions
' Azure OpenAI configuration
Dim azureEndpoint As String = "https://your-resource.openai.azure.com/"
Dim apiKey As String = "your-azure-api-key"
Dim chatDeployment As String = "gpt-4o"
Dim embeddingDeployment As String = "text-embedding-ada-002"
' Initialize Semantic Kernel
Dim kernel = Kernel.CreateBuilder() _
.AddAzureOpenAITextEmbeddingGeneration(embeddingDeployment, azureEndpoint, apiKey) _
.AddAzureOpenAIChatCompletion(chatDeployment, azureEndpoint, apiKey) _
.Build()
Dim memory = New MemoryBuilder() _
.WithMemoryStore(New VolatileMemoryStore()) _
.WithAzureOpenAITextEmbeddingGeneration(embeddingDeployment, azureEndpoint, apiKey) _
.Build()
IronDocumentAI.Initialize(kernel, memory)
Dim pdf = PdfDocument.FromFile("sample-legal-document.pdf")
Await pdf.Memorize()
Dim question As String = "What are the termination conditions in this agreement?"
' Request citations in query
Dim citationQuery As String = question & vbCrLf & _
"IMPORTANT: Include specific page citations in your answer using the format (Page X) or (Pages X-Y)." & vbCrLf & _
"Only cite information that appears in the document."
Dim answerWithCitations As String = Await pdf.Query(citationQuery)
Console.WriteLine("Question: " & question)
Console.WriteLine(vbCrLf & "Answer with Citations:")
Console.WriteLine(answerWithCitations)
' Extract cited page numbers using regex
Dim citedPages = ExtractCitedPages(answerWithCitations)
Console.WriteLine(vbCrLf & "Cited pages: " & String.Join(", ", citedPages))
' Verify citations with page excerpts
Console.WriteLine(vbCrLf & "=== Source Verification ===")
For Each pageNum As Integer In citedPages.Take(3)
If pageNum <= pdf.PageCount AndAlso pageNum > 0 Then
Dim pageText As String = pdf.Pages(pageNum - 1).Text
Dim excerpt As String = If(pageText.Length > 200, pageText.Substring(0, 200) & "...", pageText)
Console.WriteLine(vbCrLf & "Page " & pageNum & " excerpt:" & vbCrLf & excerpt)
End If
Next
' Extract page numbers from citation format (Page X) or (Pages X-Y)
Function ExtractCitedPages(text As String) As List(Of Integer)
Dim pages = New HashSet(Of Integer)()
Dim matches = Regex.Matches(text, "\((Pages?)\s*(\d+)(?:\s*-\s*(\d+))?\)", RegexOptions.IgnoreCase)
For Each match As Match In matches
Dim startPage As Integer = Integer.Parse(match.Groups(2).Value)
pages.Add(startPage)
If match.Groups(3).Success Then
Dim endPage As Integer = Integer.Parse(match.Groups(3).Value)
For p As Integer = startPage To endPage
pages.Add(p)
Next
End If
Next
Return pages.OrderBy(Function(p) p).ToList()
End Function

Citations transform AI-generated answers from opaque outputs into transparent, verifiable information. Users can review source material to validate answers and build confidence in AI-assisted analysis.
Batch AI Processing
Processing Document Libraries at Scale
Enterprise document processing often involves thousands or even millions of PDFs. The foundation of scalable batch processing is parallelization: IronPDF is thread-safe, so multiple PDFs can be processed concurrently without interference.
This code uses a SemaphoreSlim with a configurable maxConcurrency to process PDFs in parallel, calling pdf.Summarize() on each file while collecting results in a ConcurrentBag.
:path=/static-assets/pdf/content-code-examples/tutorials/ai-powered-pdf-processing-csharp/batch-document-processing.cs
// Process multiple documents in parallel with rate limiting
using IronPdf;
using IronPdf.AI;
using Microsoft.SemanticKernel;
using Microsoft.SemanticKernel.Memory;
using Microsoft.SemanticKernel.Connectors.OpenAI;
using System.Collections.Concurrent;
using System.Text;
// Azure OpenAI configuration
string azureEndpoint = "https://your-resource.openai.azure.com/";
string apiKey = "your-azure-api-key";
string chatDeployment = "gpt-4o";
string embeddingDeployment = "text-embedding-ada-002";
// Initialize Semantic Kernel
var kernel = Kernel.CreateBuilder()
.AddAzureOpenAITextEmbeddingGeneration(embeddingDeployment, azureEndpoint, apiKey)
.AddAzureOpenAIChatCompletion(chatDeployment, azureEndpoint, apiKey)
.Build();
var memory = new MemoryBuilder()
.WithMemoryStore(new VolatileMemoryStore())
.WithAzureOpenAITextEmbeddingGeneration(embeddingDeployment, azureEndpoint, apiKey)
.Build();
IronDocumentAI.Initialize(kernel, memory);
// Configure parallel processing with rate limiting
int maxConcurrency = 3;
string inputFolder = "documents/";
string outputFolder = "summaries/";
Directory.CreateDirectory(outputFolder);
string[] pdfFiles = Directory.GetFiles(inputFolder, "*.pdf");
Console.WriteLine($"Processing {pdfFiles.Length} documents...\n");
var results = new ConcurrentBag<ProcessingResult>();
var semaphore = new SemaphoreSlim(maxConcurrency);
var tasks = pdfFiles.Select(async filePath =>
{
await semaphore.WaitAsync();
var result = new ProcessingResult { FilePath = filePath };
try
{
var stopwatch = System.Diagnostics.Stopwatch.StartNew();
var pdf = PdfDocument.FromFile(filePath);
string summary = await pdf.Summarize();
string outputPath = Path.Combine(outputFolder,
Path.GetFileNameWithoutExtension(filePath) + "-summary.txt");
await File.WriteAllTextAsync(outputPath, summary);
stopwatch.Stop();
result.Success = true;
result.ProcessingTime = stopwatch.Elapsed;
result.OutputPath = outputPath;
Console.WriteLine($"[OK] {Path.GetFileName(filePath)} ({stopwatch.ElapsedMilliseconds}ms)");
}
catch (Exception ex)
{
result.Success = false;
result.ErrorMessage = ex.Message;
Console.WriteLine($"[ERROR] {Path.GetFileName(filePath)}: {ex.Message}");
}
finally
{
semaphore.Release();
results.Add(result);
}
}).ToArray();
await Task.WhenAll(tasks);
// Generate processing report
var successful = results.Where(r => r.Success).ToList();
var failed = results.Where(r => !r.Success).ToList();
var report = new StringBuilder();
report.AppendLine("=== Batch Processing Report ===");
report.AppendLine($"Successful: {successful.Count}");
report.AppendLine($"Failed: {failed.Count}");
if (successful.Any())
{
var avgTime = TimeSpan.FromMilliseconds(successful.Average(r => r.ProcessingTime.TotalMilliseconds));
report.AppendLine($"Average processing time: {avgTime.TotalSeconds:F1}s");
}
if (failed.Any())
{
report.AppendLine("\nFailed documents:");
foreach (var fail in failed)
report.AppendLine($" - {Path.GetFileName(fail.FilePath)}: {fail.ErrorMessage}");
}
string reportText = report.ToString();
Console.WriteLine($"\n{reportText}");
File.WriteAllText(Path.Combine(outputFolder, "processing-report.txt"), reportText);
class ProcessingResult
{
public string FilePath { get; set; } = "";
public bool Success { get; set; }
public TimeSpan ProcessingTime { get; set; }
public string OutputPath { get; set; } = "";
public string ErrorMessage { get; set; } = "";
}
Imports IronPdf
Imports IronPdf.AI
Imports Microsoft.SemanticKernel
Imports Microsoft.SemanticKernel.Memory
Imports Microsoft.SemanticKernel.Connectors.OpenAI
Imports System.Collections.Concurrent
Imports System.Text
' Azure OpenAI configuration
Dim azureEndpoint As String = "https://your-resource.openai.azure.com/"
Dim apiKey As String = "your-azure-api-key"
Dim chatDeployment As String = "gpt-4o"
Dim embeddingDeployment As String = "text-embedding-ada-002"
' Initialize Semantic Kernel
Dim kernel = Kernel.CreateBuilder() _
.AddAzureOpenAITextEmbeddingGeneration(embeddingDeployment, azureEndpoint, apiKey) _
.AddAzureOpenAIChatCompletion(chatDeployment, azureEndpoint, apiKey) _
.Build()
Dim memory = New MemoryBuilder() _
.WithMemoryStore(New VolatileMemoryStore()) _
.WithAzureOpenAITextEmbeddingGeneration(embeddingDeployment, azureEndpoint, apiKey) _
.Build()
IronDocumentAI.Initialize(kernel, memory)
' Configure parallel processing with rate limiting
Dim maxConcurrency As Integer = 3
Dim inputFolder As String = "documents/"
Dim outputFolder As String = "summaries/"
Directory.CreateDirectory(outputFolder)
Dim pdfFiles As String() = Directory.GetFiles(inputFolder, "*.pdf")
Console.WriteLine($"Processing {pdfFiles.Length} documents..." & vbCrLf)
Dim results = New ConcurrentBag(Of ProcessingResult)()
Dim semaphore = New SemaphoreSlim(maxConcurrency)
Dim tasks = pdfFiles.Select(Async Function(filePath)
Await semaphore.WaitAsync()
Dim result = New ProcessingResult With {.FilePath = filePath}
Try
Dim stopwatch = System.Diagnostics.Stopwatch.StartNew()
Dim pdf = PdfDocument.FromFile(filePath)
Dim summary As String = Await pdf.Summarize()
Dim outputPath = Path.Combine(outputFolder, Path.GetFileNameWithoutExtension(filePath) & "-summary.txt")
Await File.WriteAllTextAsync(outputPath, summary)
stopwatch.Stop()
result.Success = True
result.ProcessingTime = stopwatch.Elapsed
result.OutputPath = outputPath
Console.WriteLine($"[OK] {Path.GetFileName(filePath)} ({stopwatch.ElapsedMilliseconds}ms)")
Catch ex As Exception
result.Success = False
result.ErrorMessage = ex.Message
Console.WriteLine($"[ERROR] {Path.GetFileName(filePath)}: {ex.Message}")
Finally
semaphore.Release()
results.Add(result)
End Try
End Function).ToArray()
Await Task.WhenAll(tasks)
' Generate processing report
Dim successful = results.Where(Function(r) r.Success).ToList()
Dim failed = results.Where(Function(r) Not r.Success).ToList()
Dim report = New StringBuilder()
report.AppendLine("=== Batch Processing Report ===")
report.AppendLine($"Successful: {successful.Count}")
report.AppendLine($"Failed: {failed.Count}")
If successful.Any() Then
Dim avgTime = TimeSpan.FromMilliseconds(successful.Average(Function(r) r.ProcessingTime.TotalMilliseconds))
report.AppendLine($"Average processing time: {avgTime.TotalSeconds:F1}s")
End If
If failed.Any() Then
report.AppendLine(vbCrLf & "Failed documents:")
For Each fail In failed
report.AppendLine($" - {Path.GetFileName(fail.FilePath)}: {fail.ErrorMessage}")
Next
End If
Dim reportText As String = report.ToString()
Console.WriteLine(vbCrLf & reportText)
File.WriteAllText(Path.Combine(outputFolder, "processing-report.txt"), reportText)
Class ProcessingResult
Public Property FilePath As String = ""
Public Property Success As Boolean
Public Property ProcessingTime As TimeSpan
Public Property OutputPath As String = ""
Public Property ErrorMessage As String = ""
End Class

Robust error handling is critical at scale. Production systems implement retry logic with exponential backoff, separate error logging for failed documents, and resumable processing.
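The retry-with-exponential-backoff pattern can be sketched as a small helper. This is a minimal illustration: the helper name and delay values are our own choices, not part of the IronPDF API.

```csharp
using System;
using System.Threading.Tasks;

// Hedged sketch of retry with exponential backoff around an AI call.
async Task<T> RetryWithBackoffAsync<T>(Func<Task<T>> operation,
                                       int maxAttempts = 4,
                                       int baseDelayMs = 1000)
{
    for (int attempt = 1; ; attempt++)
    {
        try
        {
            return await operation();
        }
        catch (Exception) when (attempt < maxAttempts)
        {
            // Delay doubles each attempt (1s, 2s, 4s, ...) plus jitter
            // to avoid synchronized retries across parallel workers.
            int delayMs = baseDelayMs * (1 << (attempt - 1))
                          + Random.Shared.Next(0, baseDelayMs);
            await Task.Delay(delayMs);
        }
    }
}

// In the batch loop above, the per-document AI call could be wrapped as:
// string summary = await RetryWithBackoffAsync(() => pdf.Summarize());
```

The exception filter (`when (attempt < maxAttempts)`) lets the final failure propagate unchanged, so the existing `catch` block in the batch loop still records the error message.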
Cost Management and Token Usage
AI API costs are typically charged per token. In 2026, GPT-5 is priced at $1.25 per million input tokens and $10 per million output tokens, while Claude Sonnet 4.5 costs $3 per million input tokens and $15 per million output tokens. The primary cost optimization strategy is minimizing unnecessary token usage.
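A back-of-the-envelope estimator makes these prices concrete. The ~4-characters-per-token ratio is a common heuristic, not a real tokenizer, and the prices are the per-million-token figures quoted above, which will change over time:

```csharp
using System;

// Rough per-document cost estimate using a characters-per-token heuristic.
double EstimateCostUsd(int inputChars, int outputTokens,
                       double inputPricePerMTok, double outputPricePerMTok)
{
    double inputTokens = inputChars / 4.0;  // heuristic, not a real tokenizer
    return inputTokens / 1_000_000.0 * inputPricePerMTok
         + outputTokens / 1_000_000.0 * outputPricePerMTok;
}

// Example: a 200,000-character filing summarized into ~1,000 output tokens
// at the GPT-5 rates above ($1.25 in, $10 out per million tokens)
Console.WriteLine($"${EstimateCostUsd(200_000, 1_000, 1.25, 10.0):F4}");  // prints "$0.0725"
```

At these rates a single summary costs fractions of a cent, so the dominant costs at scale come from volume and from output-heavy tasks such as detailed extraction.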
OpenAI's Batch API provides 50% discounts on token costs in exchange for longer processing times (up to 24 hours). For overnight processing or periodic analysis, batch processing delivers substantial savings.
The code extracts text using pdf.ExtractAllText(), creates JSONL batch requests, uploads via HttpClient to the OpenAI files endpoint, and submits to the batch API.
:path=/static-assets/pdf/content-code-examples/tutorials/ai-powered-pdf-processing-csharp/batch-api-processing.cs
// Use OpenAI Batch API for 50% cost savings on large-scale document processing
using IronPdf;
using System.Text.Json;
using System.Net.Http.Headers;
string openAiApiKey = "your-openai-api-key";
string inputFolder = "documents/";
// Prepare batch requests in JSONL format
var batchRequests = new List<string>();
string[] pdfFiles = Directory.GetFiles(inputFolder, "*.pdf");
Console.WriteLine($"Preparing batch for {pdfFiles.Length} documents...\n");
foreach (string filePath in pdfFiles)
{
var pdf = PdfDocument.FromFile(filePath);
string pdfText = pdf.ExtractAllText();
// Truncate to stay within batch API limits
if (pdfText.Length > 100000)
pdfText = pdfText.Substring(0, 100000) + "\n[Truncated...]";
var request = new
{
custom_id = Path.GetFileNameWithoutExtension(filePath),
method = "POST",
url = "/v1/chat/completions",
body = new
{
model = "gpt-4o",
messages = new[]
{
new { role = "system", content = "Summarize the following document concisely." },
new { role = "user", content = pdfText }
},
max_tokens = 1000
}
};
batchRequests.Add(JsonSerializer.Serialize(request));
}
// Create JSONL file
string batchFilePath = "batch-requests.jsonl";
File.WriteAllLines(batchFilePath, batchRequests);
Console.WriteLine($"Created batch file with {batchRequests.Count} requests");
// Upload file to OpenAI
using var httpClient = new HttpClient();
httpClient.DefaultRequestHeaders.Authorization = new AuthenticationHeaderValue("Bearer", openAiApiKey);
using var fileContent = new MultipartFormDataContent();
fileContent.Add(new ByteArrayContent(File.ReadAllBytes(batchFilePath)), "file", "batch-requests.jsonl");
fileContent.Add(new StringContent("batch"), "purpose");
var uploadResponse = await httpClient.PostAsync("https://api.openai.com/v1/files", fileContent);
var uploadResult = JsonSerializer.Deserialize<JsonElement>(await uploadResponse.Content.ReadAsStringAsync());
string fileId = uploadResult.GetProperty("id").GetString()!;
Console.WriteLine($"Uploaded file: {fileId}");
// Create batch job (24-hour completion window for 50% discount)
var batchJobRequest = new
{
input_file_id = fileId,
endpoint = "/v1/chat/completions",
completion_window = "24h"
};
var batchResponse = await httpClient.PostAsync(
"https://api.openai.com/v1/batches",
new StringContent(JsonSerializer.Serialize(batchJobRequest), System.Text.Encoding.UTF8, "application/json")
);
var batchResult = JsonSerializer.Deserialize<JsonElement>(await batchResponse.Content.ReadAsStringAsync());
string batchId = batchResult.GetProperty("id").GetString()!;
Console.WriteLine($"\nBatch job created: {batchId}");
Console.WriteLine("Job will complete within 24 hours");
Console.WriteLine($"Check status: GET https://api.openai.com/v1/batches/{batchId}");
File.WriteAllText("batch-job-id.txt", batchId);
Console.WriteLine("\nBatch ID saved to batch-job-id.txt");
Imports IronPdf
Imports System.Text.Json
Imports System.Net.Http.Headers
Imports System.IO
Module Program
Async Function Main() As Task
Dim openAiApiKey As String = "your-openai-api-key"
Dim inputFolder As String = "documents/"
' Prepare batch requests in JSONL format
Dim batchRequests As New List(Of String)()
Dim pdfFiles As String() = Directory.GetFiles(inputFolder, "*.pdf")
Console.WriteLine($"Preparing batch for {pdfFiles.Length} documents..." & vbCrLf)
For Each filePath As String In pdfFiles
Dim pdf = PdfDocument.FromFile(filePath)
Dim pdfText As String = pdf.ExtractAllText()
' Truncate to stay within batch API limits
If pdfText.Length > 100000 Then
pdfText = pdfText.Substring(0, 100000) & vbCrLf & "[Truncated...]"
End If
Dim request = New With {
.custom_id = Path.GetFileNameWithoutExtension(filePath),
.method = "POST",
.url = "/v1/chat/completions",
.body = New With {
.model = "gpt-4o",
.messages = New Object() {
New With {.role = "system", .content = "Summarize the following document concisely."},
New With {.role = "user", .content = pdfText}
},
.max_tokens = 1000
}
}
batchRequests.Add(JsonSerializer.Serialize(request))
Next
' Create JSONL file
Dim batchFilePath As String = "batch-requests.jsonl"
File.WriteAllLines(batchFilePath, batchRequests)
Console.WriteLine($"Created batch file with {batchRequests.Count} requests")
' Upload file to OpenAI
Using httpClient As New HttpClient()
httpClient.DefaultRequestHeaders.Authorization = New AuthenticationHeaderValue("Bearer", openAiApiKey)
Using fileContent As New MultipartFormDataContent()
fileContent.Add(New ByteArrayContent(File.ReadAllBytes(batchFilePath)), "file", "batch-requests.jsonl")
fileContent.Add(New StringContent("batch"), "purpose")
Dim uploadResponse = Await httpClient.PostAsync("https://api.openai.com/v1/files", fileContent)
Dim uploadResult = JsonSerializer.Deserialize(Of JsonElement)(Await uploadResponse.Content.ReadAsStringAsync())
Dim fileId As String = uploadResult.GetProperty("id").GetString()
Console.WriteLine($"Uploaded file: {fileId}")
' Create batch job (24-hour completion window for 50% discount)
Dim batchJobRequest = New With {
.input_file_id = fileId,
.endpoint = "/v1/chat/completions",
.completion_window = "24h"
}
Dim batchResponse = Await httpClient.PostAsync(
"https://api.openai.com/v1/batches",
New StringContent(JsonSerializer.Serialize(batchJobRequest), System.Text.Encoding.UTF8, "application/json")
)
Dim batchResult = JsonSerializer.Deserialize(Of JsonElement)(Await batchResponse.Content.ReadAsStringAsync())
Dim batchId As String = batchResult.GetProperty("id").GetString()
Console.WriteLine(vbCrLf & $"Batch job created: {batchId}")
Console.WriteLine("Job will complete within 24 hours")
Console.WriteLine($"Check status: GET https://api.openai.com/v1/batches/{batchId}")
File.WriteAllText("batch-job-id.txt", batchId)
Console.WriteLine(vbCrLf & "Batch ID saved to batch-job-id.txt")
End Using
End Using
End Function
End Module

Monitoring token usage in production is essential. Many organizations discover that 80% of their documents can be processed with smaller, cheaper models, reserving expensive models only for complex cases.
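A tiered routing scheme can be sketched with a simple heuristic. The deployment names and the 50,000-character threshold here are illustrative assumptions, not IronPDF settings; production routing would use better signals (document type, table density, prior failure rates):

```csharp
using System;

// Hedged sketch of tiered model routing: short, simple documents go to a
// cheaper deployment; long or table-heavy ones use the larger model.
string ChooseModel(string documentText)
{
    bool isLong = documentText.Length > 50_000;
    bool looksTabular = documentText.Contains('\t');
    return (isLong || looksTabular) ? "gpt-4o" : "gpt-4o-mini";
}

Console.WriteLine(ChooseModel("short internal memo"));  // prints "gpt-4o-mini"
```

The returned deployment name would then be passed to the Semantic Kernel builder (or chosen between two pre-built kernels) before processing each document.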
Caching and Incremental Processing
For document collections that update incrementally, intelligent caching and incremental processing strategies dramatically reduce costs. Document-level caching stores results along with a hash of the source PDF, preventing unnecessary reprocessing of unchanged documents.
The DocumentCacheManager class uses ComputeFileHash() with SHA256 to detect changes, storing results in CacheEntry objects with LastAccessed timestamps.
:path=/static-assets/pdf/content-code-examples/tutorials/ai-powered-pdf-processing-csharp/incremental-caching.cs
// Cache AI processing results using file hashes to avoid reprocessing unchanged documents
using IronPdf;
using IronPdf.AI;
using Microsoft.SemanticKernel;
using Microsoft.SemanticKernel.Memory;
using Microsoft.SemanticKernel.Connectors.OpenAI;
using System.Security.Cryptography;
using System.Text.Json;
// Azure OpenAI configuration
string azureEndpoint = "https://your-resource.openai.azure.com/";
string apiKey = "your-azure-api-key";
string chatDeployment = "gpt-4o";
string embeddingDeployment = "text-embedding-ada-002";
// Initialize Semantic Kernel
var kernel = Kernel.CreateBuilder()
.AddAzureOpenAITextEmbeddingGeneration(embeddingDeployment, azureEndpoint, apiKey)
.AddAzureOpenAIChatCompletion(chatDeployment, azureEndpoint, apiKey)
.Build();
var memory = new MemoryBuilder()
.WithMemoryStore(new VolatileMemoryStore())
.WithAzureOpenAITextEmbeddingGeneration(embeddingDeployment, azureEndpoint, apiKey)
.Build();
IronDocumentAI.Initialize(kernel, memory);
// Configure caching
string cacheFolder = "ai-cache/";
string documentsFolder = "documents/";
Directory.CreateDirectory(cacheFolder);
var cacheManager = new DocumentCacheManager(cacheFolder);
// Process documents with caching
string[] pdfFiles = Directory.GetFiles(documentsFolder, "*.pdf");
int cached = 0, processed = 0;
foreach (string filePath in pdfFiles)
{
string fileName = Path.GetFileName(filePath);
string fileHash = cacheManager.ComputeFileHash(filePath);
var cachedResult = cacheManager.GetCachedResult(fileName, fileHash);
if (cachedResult != null)
{
Console.WriteLine($"[CACHE HIT] {fileName}");
cached++;
continue;
}
Console.WriteLine($"[PROCESSING] {fileName}");
var pdf = PdfDocument.FromFile(filePath);
string summary = await pdf.Summarize();
cacheManager.CacheResult(fileName, fileHash, summary);
processed++;
}
Console.WriteLine($"\nProcessing complete: {cached} cached, {processed} newly processed");
Console.WriteLine($"Cost savings: {(cached * 100.0 / Math.Max(1, cached + processed)):F1}% served from cache");
// Hash-based cache manager with JSON index
class DocumentCacheManager
{
private readonly string _cacheFolder;
private readonly string _indexPath;
private Dictionary<string, CacheEntry> _index;
public DocumentCacheManager(string cacheFolder)
{
_cacheFolder = cacheFolder;
_indexPath = Path.Combine(cacheFolder, "cache-index.json");
_index = LoadIndex();
}
private Dictionary<string, CacheEntry> LoadIndex()
{
if (File.Exists(_indexPath))
{
string json = File.ReadAllText(_indexPath);
return JsonSerializer.Deserialize<Dictionary<string, CacheEntry>>(json) ?? new();
}
return new Dictionary<string, CacheEntry>();
}
private void SaveIndex()
{
string json = JsonSerializer.Serialize(_index, new JsonSerializerOptions { WriteIndented = true });
File.WriteAllText(_indexPath, json);
}
// SHA256 hash to detect file changes
public string ComputeFileHash(string filePath)
{
using var sha256 = SHA256.Create();
using var stream = File.OpenRead(filePath);
byte[] hash = sha256.ComputeHash(stream);
return Convert.ToHexString(hash);
}
public string? GetCachedResult(string fileName, string currentHash)
{
if (_index.TryGetValue(fileName, out var entry))
{
if (entry.FileHash == currentHash && File.Exists(entry.CachePath))
{
entry.LastAccessed = DateTime.UtcNow;
SaveIndex();
return File.ReadAllText(entry.CachePath);
}
}
return null;
}
public void CacheResult(string fileName, string fileHash, string result)
{
string cachePath = Path.Combine(_cacheFolder, $"{Path.GetFileNameWithoutExtension(fileName)}-{fileHash[..8]}.txt");
File.WriteAllText(cachePath, result);
_index[fileName] = new CacheEntry
{
FileHash = fileHash,
CachePath = cachePath,
CreatedAt = DateTime.UtcNow,
LastAccessed = DateTime.UtcNow
};
SaveIndex();
}
}
class CacheEntry
{
public string FileHash { get; set; } = "";
public string CachePath { get; set; } = "";
public DateTime CreatedAt { get; set; }
public DateTime LastAccessed { get; set; }
}
Imports IronPdf
Imports IronPdf.AI
Imports Microsoft.SemanticKernel
Imports Microsoft.SemanticKernel.Memory
Imports Microsoft.SemanticKernel.Connectors.OpenAI
Imports System.Security.Cryptography
Imports System.Text.Json
' Azure OpenAI configuration
Dim azureEndpoint As String = "https://your-resource.openai.azure.com/"
Dim apiKey As String = "your-azure-api-key"
Dim chatDeployment As String = "gpt-4o"
Dim embeddingDeployment As String = "text-embedding-ada-002"
' Initialize Semantic Kernel
Dim kernel = Kernel.CreateBuilder() _
.AddAzureOpenAITextEmbeddingGeneration(embeddingDeployment, azureEndpoint, apiKey) _
.AddAzureOpenAIChatCompletion(chatDeployment, azureEndpoint, apiKey) _
.Build()
Dim memory = New MemoryBuilder() _
.WithMemoryStore(New VolatileMemoryStore()) _
.WithAzureOpenAITextEmbeddingGeneration(embeddingDeployment, azureEndpoint, apiKey) _
.Build()
IronDocumentAI.Initialize(kernel, memory)
' Configure caching
Dim cacheFolder As String = "ai-cache/"
Dim documentsFolder As String = "documents/"
Directory.CreateDirectory(cacheFolder)
Dim cacheManager = New DocumentCacheManager(cacheFolder)
' Process documents with caching
Dim pdfFiles As String() = Directory.GetFiles(documentsFolder, "*.pdf")
Dim cached As Integer = 0, processed As Integer = 0
For Each filePath As String In pdfFiles
Dim fileName As String = Path.GetFileName(filePath)
Dim fileHash As String = cacheManager.ComputeFileHash(filePath)
Dim cachedResult = cacheManager.GetCachedResult(fileName, fileHash)
If cachedResult IsNot Nothing Then
Console.WriteLine($"[CACHE HIT] {fileName}")
cached += 1
Continue For
End If
Console.WriteLine($"[PROCESSING] {fileName}")
Dim pdf = PdfDocument.FromFile(filePath)
Dim summary As String = Await pdf.Summarize()
cacheManager.CacheResult(fileName, fileHash, summary)
processed += 1
Next
Console.WriteLine(vbCrLf & $"Processing complete: {cached} cached, {processed} newly processed")
Console.WriteLine($"Cost savings: {(cached * 100.0 / Math.Max(1, cached + processed)):F1}% served from cache")
' Hash-based cache manager with JSON index
Class DocumentCacheManager
Private ReadOnly _cacheFolder As String
Private ReadOnly _indexPath As String
Private _index As Dictionary(Of String, CacheEntry)
Public Sub New(cacheFolder As String)
_cacheFolder = cacheFolder
_indexPath = Path.Combine(cacheFolder, "cache-index.json")
_index = LoadIndex()
End Sub
Private Function LoadIndex() As Dictionary(Of String, CacheEntry)
If File.Exists(_indexPath) Then
Dim json As String = File.ReadAllText(_indexPath)
Return If(JsonSerializer.Deserialize(Of Dictionary(Of String, CacheEntry))(json), New Dictionary(Of String, CacheEntry)())
End If
Return New Dictionary(Of String, CacheEntry)()
End Function
Private Sub SaveIndex()
Dim json As String = JsonSerializer.Serialize(_index, New JsonSerializerOptions With {.WriteIndented = True})
File.WriteAllText(_indexPath, json)
End Sub
' SHA256 hash to detect file changes
Public Function ComputeFileHash(filePath As String) As String
Using sha256 = SHA256.Create()
Using stream = File.OpenRead(filePath)
Dim hash As Byte() = sha256.ComputeHash(stream)
Return Convert.ToHexString(hash)
End Using
End Using
End Function
Public Function GetCachedResult(fileName As String, currentHash As String) As String
Dim entry As CacheEntry = Nothing
If _index.TryGetValue(fileName, entry) Then
If entry.FileHash = currentHash AndAlso File.Exists(entry.CachePath) Then
entry.LastAccessed = DateTime.UtcNow
SaveIndex()
Return File.ReadAllText(entry.CachePath)
End If
End If
Return Nothing
End Function
Public Sub CacheResult(fileName As String, fileHash As String, result As String)
Dim cachePath As String = Path.Combine(_cacheFolder, $"{Path.GetFileNameWithoutExtension(fileName)}-{fileHash.Substring(0, 8)}.txt")
File.WriteAllText(cachePath, result)
_index(fileName) = New CacheEntry With {
.FileHash = fileHash,
.CachePath = cachePath,
.CreatedAt = DateTime.UtcNow,
.LastAccessed = DateTime.UtcNow
}
SaveIndex()
End Sub
End Class
Class CacheEntry
Public Property FileHash As String = ""
Public Property CachePath As String = ""
Public Property CreatedAt As DateTime
Public Property LastAccessed As DateTime
End Class

GPT-5 and Claude Sonnet 4.5 in 2026 also feature automatic prompt caching, which can reduce effective token consumption by 50-90% for repeated prompt prefixes, a significant cost saving for large-scale operations.
Real-World Use Cases
Legal Discovery and Contract Analysis
Legal discovery traditionally required armies of junior attorneys manually reviewing hundreds of thousands of pages. AI-powered discovery transforms this process, enabling rapid identification of relevant documents, automated privilege review, and extraction of key evidentiary facts.
IronPDF's AI integration supports sophisticated legal workflows: privilege detection, relevance scoring, issue identification, and key-date extraction. Law firms report reducing discovery review times by 70-80%, enabling them to handle larger cases with smaller teams.
With GPT-5 and Claude Sonnet 4.5's improved accuracy and reduced hallucination rates in 2026, legal professionals can trust AI-assisted analysis for increasingly critical decisions.
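As a hedged sketch, a privilege-review pass might combine the same `pdf.Query()` call shown earlier with a structured classification prompt. The file name, JSON fields, and prompt wording below are illustrative assumptions, and the snippet presumes `IronDocumentAI.Initialize(...)` has been called as in the previous examples:

```csharp
using IronPdf;
using IronPdf.AI;

// Illustrative privilege/relevance classification for one discovery document.
// "discovery/exhibit-0042.pdf" and the JSON schema are hypothetical.
var pdf = PdfDocument.FromFile("discovery/exhibit-0042.pdf");
string review = await pdf.Query(@"Classify this document for legal discovery. Return JSON:
{
  ""privileged"": boolean,
  ""privilegeBasis"": ""string"",
  ""relevanceScore"": number,
  ""keyDates"": [""string""]
}
relevanceScore is 0-10 against the case issues. Return ONLY valid JSON.");
Console.WriteLine(review);
```

Parsed results can then drive triage: privileged documents route to attorney review, low-relevance documents are deprioritized, and key dates feed a case chronology.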
Financial Report Analysis
Financial analysts spend enormous time extracting data from earnings reports, SEC filings, and analyst presentations. AI-powered financial document processing automates this extraction, enabling analysts to focus on interpretation rather than data collection.
This example processes multiple 10-K filings, using pdf.Query() with a CompanyFinancials JSON schema to extract and compare revenue, margins, and risk factors across companies.
:path=/static-assets/pdf/content-code-examples/tutorials/ai-powered-pdf-processing-csharp/financial-sector-analysis.cs
// Compare financial metrics across multiple company filings for sector analysis
using IronPdf;
using IronPdf.AI;
using Microsoft.SemanticKernel;
using Microsoft.SemanticKernel.Memory;
using Microsoft.SemanticKernel.Connectors.OpenAI;
using System.Text.Json;
using System.Text;
// Azure OpenAI configuration
string azureEndpoint = "https://your-resource.openai.azure.com/";
string apiKey = "your-azure-api-key";
string chatDeployment = "gpt-4o";
string embeddingDeployment = "text-embedding-ada-002";
// Initialize Semantic Kernel
var kernel = Kernel.CreateBuilder()
.AddAzureOpenAITextEmbeddingGeneration(embeddingDeployment, azureEndpoint, apiKey)
.AddAzureOpenAIChatCompletion(chatDeployment, azureEndpoint, apiKey)
.Build();
var memory = new MemoryBuilder()
.WithMemoryStore(new VolatileMemoryStore())
.WithAzureOpenAITextEmbeddingGeneration(embeddingDeployment, azureEndpoint, apiKey)
.Build();
IronDocumentAI.Initialize(kernel, memory);
// Analyze company filings
string[] companyFilings = {
"filings/company-a-10k.pdf",
"filings/company-b-10k.pdf",
"filings/company-c-10k.pdf"
};
var sectorData = new List<CompanyFinancials>();
foreach (string filing in companyFilings)
{
Console.WriteLine($"Analyzing: {Path.GetFileName(filing)}");
var pdf = PdfDocument.FromFile(filing);
// Define JSON schema for 10-K extraction (numbers in millions USD)
string extractionQuery = @"Extract key financial metrics from this 10-K filing. Return JSON:
{
""companyName"": ""string"",
""fiscalYear"": ""string"",
""revenue"": number,
""revenueGrowth"": number,
""grossMargin"": number,
""operatingMargin"": number,
""netIncome"": number,
""eps"": number,
""totalDebt"": number,
""cashPosition"": number,
""employeeCount"": number,
""keyRisks"": [""string""],
""guidance"": ""string""
}
Numbers in millions USD. Growth/margins as percentages.
Return ONLY valid JSON.";
string result = await pdf.Query(extractionQuery);
try
{
var financials = JsonSerializer.Deserialize<CompanyFinancials>(result);
if (financials != null)
sectorData.Add(financials);
}
catch
{
Console.WriteLine($" Warning: Could not parse financials for {filing}");
}
}
// Generate sector comparison report
var report = new StringBuilder();
report.AppendLine("=== Sector Analysis Report ===\n");
report.AppendLine("Revenue Comparison (millions USD):");
foreach (var company in sectorData.OrderByDescending(c => c.Revenue))
report.AppendLine($" {company.CompanyName}: ${company.Revenue:N0} ({company.RevenueGrowth:+0.0;-0.0}% YoY)");
report.AppendLine("\nProfitability Margins:");
foreach (var company in sectorData.OrderByDescending(c => c.OperatingMargin))
report.AppendLine($" {company.CompanyName}: {company.GrossMargin:F1}% gross, {company.OperatingMargin:F1}% operating");
report.AppendLine("\nFinancial Health (Debt vs Cash):");
foreach (var company in sectorData)
{
double netDebt = company.TotalDebt - company.CashPosition;
string status = netDebt < 0 ? "Net Cash" : "Net Debt";
report.AppendLine($" {company.CompanyName}: {status} ${Math.Abs(netDebt):N0}M");
}
string reportText = report.ToString();
Console.WriteLine($"\n{reportText}");
File.WriteAllText("sector-analysis-report.txt", reportText);
// Save full JSON data
string outputJson = JsonSerializer.Serialize(sectorData, new JsonSerializerOptions { WriteIndented = true });
File.WriteAllText("sector-analysis.json", outputJson);
Console.WriteLine("Analysis saved to sector-analysis.json and sector-analysis-report.txt");
class CompanyFinancials
{
public string CompanyName { get; set; } = "";
public string FiscalYear { get; set; } = "";
public double Revenue { get; set; }
public double RevenueGrowth { get; set; }
public double GrossMargin { get; set; }
public double OperatingMargin { get; set; }
public double NetIncome { get; set; }
public double Eps { get; set; }
public double TotalDebt { get; set; }
public double CashPosition { get; set; }
public int EmployeeCount { get; set; }
public List<string> KeyRisks { get; set; } = new();
public string Guidance { get; set; } = "";
}
Imports IronPdf
Imports IronPdf.AI
Imports Microsoft.SemanticKernel
Imports Microsoft.SemanticKernel.Memory
Imports Microsoft.SemanticKernel.Connectors.OpenAI
Imports System.Text.Json
Imports System.Text
Imports System.IO
' Azure OpenAI configuration
Dim azureEndpoint As String = "https://your-resource.openai.azure.com/"
Dim apiKey As String = "your-azure-api-key"
Dim chatDeployment As String = "gpt-4o"
Dim embeddingDeployment As String = "text-embedding-ada-002"
' Initialize Semantic Kernel
Dim kernel = Kernel.CreateBuilder() _
.AddAzureOpenAITextEmbeddingGeneration(embeddingDeployment, azureEndpoint, apiKey) _
.AddAzureOpenAIChatCompletion(chatDeployment, azureEndpoint, apiKey) _
.Build()
Dim memory = New MemoryBuilder() _
.WithMemoryStore(New VolatileMemoryStore()) _
.WithAzureOpenAITextEmbeddingGeneration(embeddingDeployment, azureEndpoint, apiKey) _
.Build()
IronDocumentAI.Initialize(kernel, memory)
' Analyze company filings
Dim companyFilings As String() = {
"filings/company-a-10k.pdf",
"filings/company-b-10k.pdf",
"filings/company-c-10k.pdf"
}
Dim sectorData = New List(Of CompanyFinancials)()
For Each filing As String In companyFilings
Console.WriteLine($"Analyzing: {Path.GetFileName(filing)}")
Dim pdf = PdfDocument.FromFile(filing)
' Define JSON schema for 10-K extraction (numbers in millions USD)
Dim extractionQuery As String = "Extract key financial metrics from this 10-K filing. Return JSON:" & vbCrLf & _
"{" & vbCrLf & _
" ""companyName"": ""string""," & vbCrLf & _
" ""fiscalYear"": ""string""," & vbCrLf & _
" ""revenue"": number," & vbCrLf & _
" ""revenueGrowth"": number," & vbCrLf & _
" ""grossMargin"": number," & vbCrLf & _
" ""operatingMargin"": number," & vbCrLf & _
" ""netIncome"": number," & vbCrLf & _
" ""eps"": number," & vbCrLf & _
" ""totalDebt"": number," & vbCrLf & _
" ""cashPosition"": number," & vbCrLf & _
" ""employeeCount"": number," & vbCrLf & _
" ""keyRisks"": [""string""]," & vbCrLf & _
" ""guidance"": ""string""" & vbCrLf & _
"}" & vbCrLf & _
"Numbers in millions USD. Growth/margins as percentages." & vbCrLf & _
"Return ONLY valid JSON."
Dim result As String = Await pdf.Query(extractionQuery)
Try
Dim financials = JsonSerializer.Deserialize(Of CompanyFinancials)(result)
If financials IsNot Nothing Then
sectorData.Add(financials)
End If
Catch
Console.WriteLine($" Warning: Could not parse financials for {filing}")
End Try
Next
' Generate sector comparison report
Dim report = New StringBuilder()
report.AppendLine("=== Sector Analysis Report ===" & vbCrLf)
report.AppendLine("Revenue Comparison (millions USD):")
For Each company In sectorData.OrderByDescending(Function(c) c.Revenue)
report.AppendLine($" {company.CompanyName}: ${company.Revenue:N0} ({company.RevenueGrowth:+0.0;-0.0}% YoY)")
Next
report.AppendLine(vbCrLf & "Profitability Margins:")
For Each company In sectorData.OrderByDescending(Function(c) c.OperatingMargin)
report.AppendLine($" {company.CompanyName}: {company.GrossMargin:F1}% gross, {company.OperatingMargin:F1}% operating")
Next
report.AppendLine(vbCrLf & "Financial Health (Debt vs Cash):")
For Each company In sectorData
Dim netDebt As Double = company.TotalDebt - company.CashPosition
Dim status As String = If(netDebt < 0, "Net Cash", "Net Debt")
report.AppendLine($" {company.CompanyName}: {status} ${Math.Abs(netDebt):N0}M")
Next
Dim reportText As String = report.ToString()
Console.WriteLine(vbCrLf & reportText)
File.WriteAllText("sector-analysis-report.txt", reportText)
' Save full JSON data
Dim outputJson As String = JsonSerializer.Serialize(sectorData, New JsonSerializerOptions With {.WriteIndented = True})
File.WriteAllText("sector-analysis.json", outputJson)
Console.WriteLine("Analysis saved to sector-analysis.json and sector-analysis-report.txt")
Public Class CompanyFinancials
Public Property CompanyName As String = ""
Public Property FiscalYear As String = ""
Public Property Revenue As Double
Public Property RevenueGrowth As Double
Public Property GrossMargin As Double
Public Property OperatingMargin As Double
Public Property NetIncome As Double
Public Property Eps As Double
Public Property TotalDebt As Double
Public Property CashPosition As Double
Public Property EmployeeCount As Integer
Public Property KeyRisks As List(Of String) = New List(Of String)()
Public Property Guidance As String = ""
End Class
Investment firms use AI-powered analysis to process thousands of documents daily, enabling analysts to monitor broader market coverage and respond faster to emerging opportunities.
Research Paper Summarization
Academic research generates millions of papers annually. AI-powered summarization helps researchers quickly assess paper relevance, understand key findings, and identify papers warranting detailed reading. Effective research summarization must identify the research question, explain methodology, summarize key findings with appropriate caveats, and place results in context.
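The four elements above map naturally onto a structured prompt. The sketch below is plain C# with no IronPDF dependency: the `PaperSummary` record, the prompt wording, and the sample JSON response are all hypothetical illustrations, and in a real pipeline the prompt would be sent through `pdf.Query()` as in the earlier examples, with the model's reply parsed the same way.

```csharp
using System;
using System.Text.Json;

// Prompt covering the four elements of an effective research summary:
// research question, methodology, findings with caveats, and context.
string prompt =
    "Summarize this paper. Return ONLY valid JSON:\n" +
    "{ \"researchQuestion\": \"string\", \"methodology\": \"string\",\n" +
    "  \"keyFindings\": \"string (include caveats)\", \"context\": \"string\" }";
// In a real pipeline: string response = await pdf.Query(prompt);

// A response the model might return (fabricated for illustration)
string response = @"{
  ""researchQuestion"": ""Does X improve Y?"",
  ""methodology"": ""Randomized trial, n=120"",
  ""keyFindings"": ""X improved Y by 12%; small-sample caveat applies"",
  ""context"": ""Consistent with prior observational studies""
}";

var opts = new JsonSerializerOptions { PropertyNameCaseInsensitive = true };
var summary = JsonSerializer.Deserialize<PaperSummary>(response, opts);
Console.WriteLine($"Q: {summary!.ResearchQuestion}");
Console.WriteLine($"Findings: {summary.KeyFindings}");

record PaperSummary(string ResearchQuestion, string Methodology,
                    string KeyFindings, string Context);
```

Parsing into a typed record rather than inspecting raw JSON makes the summary usable downstream, for example when filtering a knowledge base by methodology or research area.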
Research institutions use AI summarization to maintain institutional knowledge bases, automatically processing new publications as they appear. With GPT-5's improved scientific reasoning and Claude Sonnet 4.5's enhanced analytical capabilities, academic summarization in 2026 reaches new levels of accuracy.
Government Document Processing
Government agencies produce massive document collections—regulations, public comments, environmental impact statements, court filings, audit reports. AI-powered document processing makes government information actionable through regulatory compliance analysis, environmental impact assessment, and legislative tracking.
Public comment analysis presents unique challenges—major regulatory proposals may receive hundreds of thousands of comments. AI systems can categorize comments by topic, identify common themes, detect coordinated campaigns, and extract substantive arguments warranting agency response.
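The categorization step described above can be prototyped without any AI calls. In the sketch below (plain C#; the `CommentTag` record, the field names, and the sample response are all hypothetical), a JSON reply of the kind a model might return from a schema-style prompt, like the 10-K example earlier, is parsed and aggregated into a per-topic tally:

```csharp
using System;
using System.Linq;
using System.Text.Json;

// A categorization the model might return for a batch of public comments
// (fabricated sample; in practice this would come from pdf.Query with a
// "Return ONLY valid JSON" schema prompt)
string response = @"[
  { ""commentId"": ""C-001"", ""topic"": ""emissions"", ""substantive"": true },
  { ""commentId"": ""C-002"", ""topic"": ""emissions"", ""substantive"": false },
  { ""commentId"": ""C-003"", ""topic"": ""permitting"", ""substantive"": true }
]";

var opts = new JsonSerializerOptions { PropertyNameCaseInsensitive = true };
var comments = JsonSerializer.Deserialize<CommentTag[]>(response, opts)!;

// Aggregate themes and flag which ones carry substantive arguments
// warranting an agency response
foreach (var group in comments.GroupBy(c => c.Topic)
                              .OrderByDescending(g => g.Count()))
{
    int substantive = group.Count(c => c.Substantive);
    Console.WriteLine($"{group.Key}: {group.Count()} comment(s), {substantive} substantive");
}

record CommentTag(string CommentId, string Topic, bool Substantive);
```

Running the AI categorization per comment batch and aggregating locally like this keeps each prompt small, which matters when a proposal draws hundreds of thousands of comments.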
The 2026 generation of AI models brings unprecedented capabilities to government document processing, supporting democratic transparency and informed policymaking.
Troubleshooting & Technical Support
Quick Fixes for Common Errors
- Slow first render? Normal. Chrome initializes in 2–3s, then speeds up.
- Cloud issues? Use at least Azure B1 or equivalent resources.
- Missing assets? Set base paths or embed as base64.
- Missing elements? Add RenderDelay for JavaScript execution.
- Memory issues? Update to latest IronPDF version for performance fixes.
- Form field issues? Ensure unique names and update to latest version.
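Two of the fixes above, render delay for JavaScript-heavy pages and base paths for assets, are one-line rendering options. A minimal sketch (API names as in recent IronPDF versions; check your installed version's documentation, since older releases expose `RenderingOptions.RenderDelay` as a property instead):

```csharp
using IronPdf;

var renderer = new ChromePdfRenderer();

// Give client-side JavaScript time to run before capture
renderer.RenderingOptions.WaitFor.RenderDelay(500); // milliseconds

// Resolve relative image/CSS URLs against a base path so assets aren't missing
var pdf = renderer.RenderHtmlAsPdf("<img src='logo.png'>", @"C:\site\assets\");
pdf.SaveAs("output.pdf");
```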
Get Help From The Engineers Who Built IronPDF, 24/7
IronPDF offers 24/7 engineer support. Having trouble with HTML to PDF conversion or AI integration? Contact us:
- Comprehensive troubleshooting guide
- Performance optimization strategies
- Engineering support requests
- Quick troubleshooting checklist
Next Steps
Now that you understand AI-powered PDF processing, the next step is exploring IronPDF's broader capabilities. The OpenAI integration guide provides deeper coverage of summarization, querying, and memorization patterns, while the text and image extraction tutorial shows how to preprocess PDFs before AI analysis. For document assembly workflows, learn how to merge and split PDFs for batch processing.
When you're ready to expand beyond AI features, the complete PDF editing tutorial covers watermarks, headers, footers, forms, and annotations. For alternative AI integration approaches, the ChatGPT C# tutorial demonstrates different patterns. Production deployment is covered in the Azure deployment guide for WebApps and Functions, and the C# PDF creation tutorial covers generating PDFs from HTML, URLs, and raw content.
Ready to get started? Begin your free 30-day trial to test in production without watermarks, with flexible licensing that scales with your team. For questions about AI integration or any IronPDF features, our engineering support team is available to help.
Frequently Asked Questions
What are the benefits of using AI for PDF processing in C#?
AI-powered PDF processing in C# allows for advanced capabilities such as document summarization, data extraction to JSON, and building Q&A systems. It enhances efficiency and accuracy in handling large volumes of documents.
How does IronPDF integrate AI for summarizing documents?
IronPDF integrates AI by leveraging models like GPT-5 and Claude, which can analyze and summarize documents, making it easier to derive insights and comprehend large texts quickly.
What is the role of RAG patterns in AI-powered PDF processing?
RAG (Retrieval-Augmented Generation) patterns split long PDFs into chunks, embed and index them, and retrieve only the most relevant passages as context for the model. This keeps prompts within the context window and grounds answers in the actual document text, allowing for more accurate and contextually relevant document analysis.
How can structured data be extracted from PDFs using IronPDF?
IronPDF enables the extraction of structured data from PDFs into formats like JSON, facilitating seamless data integration and analysis across different applications and systems.
Can IronPDF handle large document libraries with AI?
Yes, IronPDF can process large document libraries efficiently by using AI models to automate tasks such as summarization and data extraction, which scales well with OpenAI and Azure OpenAI integrations.
What AI models are supported by IronPDF for PDF processing?
IronPDF supports advanced AI models like GPT-5 and Claude, which are used for tasks such as document summarization and Q&A system building, enhancing the overall processing capabilities.
How does IronPDF facilitate the building of Q&A systems?
IronPDF aids in building Q&A systems by processing and analyzing documents to extract relevant information, which can then be used to generate accurate responses to user queries.
What are the primary use cases for AI-powered PDF processing in C#?
Primary use cases include document summarization, structured data extraction, Q&A system development, and handling large-scale document processing tasks using AI integrations like OpenAI.
Is it possible to use IronPDF with Azure OpenAI for document processing?
Yes, IronPDF can be integrated with Azure OpenAI to enhance document processing tasks, providing scalable solutions for summarizing, extracting, and analyzing PDF documents.
How does IronPDF improve document analysis with AI?
IronPDF improves document analysis by utilizing AI models to automate and enhance tasks such as summarization, data extraction, and information retrieval, leading to more efficient and accurate document handling.