How to Extract Data from PDF in C#

Introduction

Extracting data from PDFs programmatically saves the time otherwise spent on manual data entry. This article explains how developers can use the IronPDF library to extract text and images from PDF documents.

IronPDF: C# PDF Library

IronPDF is a .NET library for creating, editing, and converting PDF files. It exposes an easy-to-use API that developers can call from their own applications, and it is widely used for these tasks. With IronPDF you can build a straightforward PDF workflow quickly: text can be customized for each document, layouts can be arranged for easy reading, and graphics can be produced from the accompanying .NET program.

The IronPDF library also includes features for extracting data from PDF files, which is the focus of this article. Before any data can be extracted, a C# project needs to be created or opened, as described in the next section.

Create or Open a C# Project in Visual Studio

This tutorial recommends using the latest version of Visual Studio.

Once Visual Studio is opened, follow the steps below to create a new C# Project. If there is an existing project that you would like to use, then skip these next steps and proceed to the next section directly.

  • Open Visual Studio
  • Click on the "Create a new project" button.

Figure 1: Visual Studio opening UI

  • Select "C# Console Application" from the templates.

Figure 2: Create a new project

  • Give the project a name and click on the Next button.

  • Select a .NET Framework version according to your project's requirements and click on the Create button.

Figure 3: .NET Framework selection

Visual Studio will now generate a new C# .NET project.

Install the IronPDF Library

The IronPDF library can be installed in multiple ways.

Using Package Manager Console

  • Open the Package Manager Console by going to Tools > NuGet Package Manager > Package Manager Console.
  • Run the following command:
Install-Package IronPdf

Figure 4: Installation progress in the Package Manager Console tab

After installation, you will see the IronPDF dependency in the dependencies section of the Solution Explorer, as shown below.

Figure 5: Reference IronPdf package in Solution Explorer

Using the NuGet Package Manager

Another way to install the IronPDF library is by using Visual Studio's integrated NuGet Package Manager UI.

  • Go to Tools in the main menu. Hover over "NuGet Package Manager" in the drop-down menu and select "Manage NuGet Packages for Solution".

Figure 6: Navigate to NuGet Package Manager

  • This will open the NuGet Package Manager window. Go to the Browse tab, type IronPdf in the search box, and press Enter.
  • Select IronPdf from the search results and click on the "Install" button to begin the installation.

Figure 7: Install the IronPdf package from the NuGet Package Manager

Extract Data from PDF Files

The following code shows how to extract data using IronPDF:

// Extract text and image content from a PDF document
using IronPdf;

// Open a password-protected (128-bit encrypted) PDF
using PdfDocument pdf = PdfDocument.FromFile("encrypted.pdf", "password");

// Get all text, for example to feed a search index
string allText = pdf.ExtractAllText();

// Get all embedded images (var keeps the code independent of the image type IronPDF returns)
var allImages = pdf.ExtractAllImages();

// Or extract the text and images of each page individually
for (var index = 0; index < pdf.PageCount; index++)
{
    int pageNumber = index + 1; // human-readable page number (index is zero-based)
    string text = pdf.ExtractTextFromPage(index);
    var images = pdf.ExtractImagesFromPage(index);
    // ...
}

First, the FromFile method loads the input PDF document into the program. The file in this example is encrypted, so a password is passed along with the file path. Next, the ExtractAllText method pulls all of the document's text into a string variable. From here, the extracted text can be used in many ways: printed as plain text, written to a TXT file, stored in a database, and so on.
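As a minimal sketch of the "write it to a TXT file" option mentioned above, the following program saves the extracted text to disk. It uses only the IronPDF calls already shown in this article plus the standard File.WriteAllText method; the file names are placeholder assumptions, not part of the original example.

```csharp
using System;
using System.IO;
using IronPdf;

// Hypothetical input and output paths; replace them with your own files.
const string inputPath = "encrypted.pdf";
const string outputPath = "extracted.txt";

// Load the password-protected PDF, as in the example above.
using PdfDocument pdf = PdfDocument.FromFile(inputPath, "password");

// Pull the text of every page into a single string.
string allText = pdf.ExtractAllText();

// Write the text to disk (File.WriteAllText uses UTF-8 by default).
File.WriteAllText(outputPath, allText);

Console.WriteLine($"Wrote {allText.Length} characters to {outputPath}");
```

The same string could just as easily be passed to a database command or a search indexer instead of a file.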

IronPDF can extract text from PDF tables for inclusion in one or more CSV files.
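One possible way to turn extracted table text into CSV is sketched below. This is not an IronPDF feature: it is a plain C# heuristic that assumes each table row arrives as one line of text and that columns are separated by a tab or by two or more spaces, which may not hold for every PDF. The TableText class, its ToCsv method, and the sample text are hypothetical names and data introduced for illustration.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

// Sample text standing in for the result of pdf.ExtractAllText().
// Assumption: columns are separated by a tab or by two or more spaces.
string tableText = "Item    Qty    Price\nWidget, large    2    9.50\nGadget    10    1.25";

Console.WriteLine(TableText.ToCsv(tableText));

static class TableText
{
    // A column gap is a run of two or more whitespace characters, or a single tab.
    private static readonly Regex ColumnGap = new Regex(@"\s{2,}|\t");

    // Converts whitespace-aligned rows into comma-separated rows.
    public static string ToCsv(string text)
    {
        var rows = text
            .Split('\n')
            .Select(line => line.Trim())            // also strips a trailing '\r'
            .Where(line => line.Length > 0)         // skip blank lines
            .Select(line => string.Join(",", ColumnGap.Split(line).Select(Escape)));
        return string.Join("\n", rows);
    }

    // Quotes a field when it contains a comma, a quote, or a line break.
    private static string Escape(string field)
    {
        bool needsQuotes = field.IndexOfAny(new[] { ',', '"', '\n', '\r' }) >= 0;
        return needsQuotes ? "\"" + field.Replace("\"", "\"\"") + "\"" : field;
    }
}
```

For the sample text, the program prints three rows: `Item,Qty,Price`, then `"Widget, large",2,9.50`, then `Gadget,10,1.25`. Real tables with single-space column gaps or wrapped cells would need a more careful parser.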

The ExtractAllImages method then extracts all the embedded images from the PDF document.

IronPDF can also extract content from specific PDF pages. The remaining lines of code in the example above demonstrate how to use the ExtractTextFromPage and ExtractImagesFromPage methods to fetch the text and images page by page. Both methods accept an integer argument that represents the zero-based index of the desired page.
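To illustrate the zero-based page index, here is a small sketch that reports which pages contain a given keyword. It relies only on PageCount and ExtractTextFromPage from the example above; the keyword and file name are placeholder assumptions.

```csharp
using System;
using System.Collections.Generic;
using IronPdf;

// Hypothetical search term; replace it with the value you need to locate.
const string keyword = "Invoice";

using PdfDocument pdf = PdfDocument.FromFile("encrypted.pdf", "password");

var matchingPages = new List<int>();
for (int index = 0; index < pdf.PageCount; index++)
{
    // ExtractTextFromPage takes a zero-based page index.
    string pageText = pdf.ExtractTextFromPage(index);

    // Case-insensitive search within this page's text.
    if (pageText.IndexOf(keyword, StringComparison.OrdinalIgnoreCase) >= 0)
    {
        matchingPages.Add(index + 1); // store the human-readable page number
    }
}

Console.WriteLine(matchingPages.Count == 0
    ? $"\"{keyword}\" was not found."
    : $"\"{keyword}\" appears on page(s): {string.Join(", ", matchingPages)}");
```

Extracting page by page rather than calling ExtractAllText keeps the page number attached to each match, which a single combined string would lose.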

Conclusion

IronPDF allows developers to extract text and images from PDF files in as little as one line of code, using ExtractAllText and ExtractAllImages to extract a PDF file's entire contents. Alternatively, calling ExtractTextFromPage or ExtractImagesFromPage will fetch the text or images from one particular page. The sample code above showed how to use both methods in a loop to read text and images from a range of pages.

IronPDF is also capable of rendering charts in PDFs, adding barcodes, enhancing security with passwords and watermarking, and handling PDF forms programmatically.

IronPDF is free to use for development. Commercial use requires a paid license, although a free trial is available for trying it in production without payment.

Purchase the full suite of Iron Software's document libraries for the price of two IronPDF Lite licenses.

Download IronPDF to start extracting data from PDFs today!