Extract Text Example

Text extraction works on both newly rendered and existing PDF documents. Use the ExtractAllText method to extract the embedded text from the document. The method returns a single string containing all the text in the PDF, with pages separated by four consecutive newline characters. The example below uses a sample PDF rendered from the Wikipedia website.

:path=/static-assets/pdf/content-code-examples/how-to/extract-text-and-images-extract-text.cs
using IronPdf;
using System.IO;

// This code snippet demonstrates how to extract text from a PDF file and save it to a .txt file.

// Instantiate a PdfDocument object by loading a PDF file from the specified path.
PdfDocument pdf = PdfDocument.FromFile("sample.pdf");

// Extract all text from the PDF document.
string text = pdf.ExtractAllText();

// Write the extracted text to a file named "extractedText.txt".
// If the file already exists, it will be overwritten.
File.WriteAllText("extractedText.txt", text);

Imports IronPdf
Imports System.IO

' This code snippet demonstrates how to extract text from a PDF file and save it to a .txt file.
Module Program
    Sub Main()
        ' Instantiate a PdfDocument object by loading a PDF file from the specified path.
        Dim pdf As PdfDocument = PdfDocument.FromFile("sample.pdf")

        ' Extract all text from the PDF document.
        Dim text As String = pdf.ExtractAllText()

        ' Write the extracted text to a file named "extractedText.txt".
        ' If the file already exists, it will be overwritten.
        File.WriteAllText("extractedText.txt", text)
    End Sub
End Module
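
Because ExtractAllText returns one string for the whole document, recovering per-page text means splitting on the page separator described above. The following is a minimal sketch, assuming the separator is exactly four newline characters and that it may arrive as either "\n" or "\r\n" (the article does not say which). The SplitPages helper and the sample string are hypothetical stand-ins; in real use you would pass in the result of pdf.ExtractAllText().

```csharp
using System;

// Stand-in for the string returned by pdf.ExtractAllText() (hypothetical sample data).
string allText = "Text of page one\n\n\n\nText of page two\n\n\n\nText of page three";

string[] pages = SplitPages(allText);

for (int i = 0; i < pages.Length; i++)
{
    Console.WriteLine($"--- Page {i + 1} ---");
    Console.WriteLine(pages[i]);
}

// Hypothetical helper: splits extracted text into pages on four consecutive newlines.
// Line endings are normalized first so that "\r\n" and "\n" are treated alike.
static string[] SplitPages(string text)
{
    string normalized = text.Replace("\r\n", "\n");
    return normalized.Split(new[] { "\n\n\n\n" }, StringSplitOptions.None);
}
```

A page whose own text contains four or more consecutive newlines would be split in the wrong place by this approach. When exact page boundaries matter, the page-specific extraction methods covered later in this article avoid that ambiguity.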

Extract Text by Line and Character

Within each PDF page, you can retrieve the coordinates of individual text lines and characters. First, select a page from the PDF, then read its Lines and Characters properties. The position of each line or character is given as Top, Right, Bottom, and Left values.

:path=/static-assets/pdf/content-code-examples/how-to/extract-text-and-images-extract-text-by-line-character.cs
using IronPdf;
using System.IO;
using System.Linq;

// Create a PDF document from the file located at "sample.pdf"
PdfDocument pdf = PdfDocument.FromFile("sample.pdf");

// Extract the lines of text from the first page of the PDF
var lines = pdf.Pages[0].Lines;

// Extract the individual characters of text from the first page
var characters = pdf.Pages[0].Characters;

// Write the extracted line information to "lines.txt".
// This includes the vertical position (Y coordinate) and the content of each line.
// The Y coordinate is formatted to two decimal places.
File.WriteAllLines("lines.txt", lines.Select(l => $"at Y={l.BoundingBox.Top:F2}: {l.Contents}"));

// Note: If you intend to use character extraction, you should handle the 'characters' variable accordingly.
// The 'characters' variable holds detailed character information, useful for operations like text search or
// individual character analysis.

Imports IronPdf
Imports System.IO
Imports System.Linq

Module Program
    Sub Main()
        ' Create a PDF document from the file located at "sample.pdf"
        Dim pdf As PdfDocument = PdfDocument.FromFile("sample.pdf")

        ' Extract the lines of text from the first page of the PDF
        Dim lines = pdf.Pages(0).Lines

        ' Extract the individual characters of text from the first page
        Dim characters = pdf.Pages(0).Characters

        ' Write the extracted line information to "lines.txt".
        ' This includes the vertical position (Y coordinate) and the content of each line.
        ' The Y coordinate is formatted to two decimal places.
        File.WriteAllLines("lines.txt", lines.Select(Function(l) $"at Y={l.BoundingBox.Top:F2}: {l.Contents}"))

        ' Note: If you intend to use character extraction, you should handle the 'characters' variable accordingly.
        ' The 'characters' variable holds detailed character information, useful for operations like text search or
        ' individual character analysis.
    End Sub
End Module
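
The example above writes only the Top value. To inspect the full position of each line, something like the following sketch may help. It assumes that BoundingBox also exposes Right, Bottom, and Left members matching the four values named in the text; only Top and Contents are confirmed by the example above, so check the member names against your IronPdf version.

```csharp
using System;
using System.Linq;
using IronPdf;

PdfDocument pdf = PdfDocument.FromFile("sample.pdf");

// Lines of text on the first page, as in the example above.
var lines = pdf.Pages[0].Lines;

// Order the lines by their Top coordinate and print all four edges of each bounding box.
// Assumption: BoundingBox exposes Right, Bottom and Left alongside the Top value used above.
foreach (var line in lines.OrderBy(l => l.BoundingBox.Top))
{
    var box = line.BoundingBox;
    Console.WriteLine(
        $"Top={box.Top:F2} Right={box.Right:F2} Bottom={box.Bottom:F2} Left={box.Left:F2} | {line.Contents}");
}
```

Whether ascending Top values correspond to top-to-bottom reading order depends on the coordinate origin the library uses, which this article does not state. Compare the output against a known page before relying on the ordering.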

Extract Images Example

Use the ExtractAllImages method to extract all images embedded in the document. The method returns the images as a list of AnyBitmap objects. The example below uses the same document as the previous example and saves the extracted images to the 'images' folder.

:path=/static-assets/pdf/content-code-examples/how-to/extract-text-and-images-extract-image.cs
// Import the necessary IronPdf namespace for handling PDF documents
using IronPdf;

// Load the PDF document from a file
PdfDocument pdf = PdfDocument.FromFile("sample.pdf");

// Extract all images contained within the PDF document
var images = pdf.ExtractAllImages();

// Check if the "images" directory exists and create it if it does not
string directoryPath = "images";
if (!System.IO.Directory.Exists(directoryPath))
{
    System.IO.Directory.CreateDirectory(directoryPath);
}

// Loop through each extracted image and save it as a PNG file
for (int i = 0; i < images.Count; i++)
{
    // Save the extracted image to a specified directory with a unique file name
    // Use a numbered sequence to name each image based on its position in the list
    images[i].SaveAs($"{directoryPath}/image{i}.png");
}

' Import the necessary IronPdf namespace for handling PDF documents
Imports IronPdf

Module Program
    Sub Main()
        ' Load the PDF document from a file
        Dim pdf As PdfDocument = PdfDocument.FromFile("sample.pdf")

        ' Extract all images contained within the PDF document
        Dim images = pdf.ExtractAllImages()

        ' Check if the "images" directory exists and create it if it does not
        Dim directoryPath As String = "images"
        If Not System.IO.Directory.Exists(directoryPath) Then
            System.IO.Directory.CreateDirectory(directoryPath)
        End If

        ' Loop through each extracted image and save it as a PNG file
        For i As Integer = 0 To images.Count - 1
            ' Save the extracted image to a specified directory with a unique file name
            ' Use a numbered sequence to name each image based on its position in the list
            images(i).SaveAs($"{directoryPath}/image{i}.png")
        Next i
    End Sub
End Module

In addition to the ExtractAllImages method shown above, you can use the ExtractAllBitmaps and ExtractAllRawImages methods to extract image data from the document. ExtractAllBitmaps returns a list of AnyBitmap objects, just as the code example does, whereas ExtractAllRawImages returns every image in the PDF as raw data in the form of byte arrays (byte[]).
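
As a rough sketch of how the raw data might be used, the snippet below writes each byte array to its own file. It assumes the result of ExtractAllRawImages can be enumerated as byte[] items, as the description above suggests. The "raw-images" folder name and the ".bin" extension are arbitrary choices made for this illustration, since the article does not say how the raw data is encoded.

```csharp
using System.IO;
using IronPdf;

PdfDocument pdf = PdfDocument.FromFile("sample.pdf");

// Assumption: the result can be enumerated as byte[] items, per the description above.
var rawImages = pdf.ExtractAllRawImages();

// CreateDirectory does nothing if the folder already exists.
Directory.CreateDirectory("raw-images");

int index = 0;
foreach (byte[] data in rawImages)
{
    // The encoding of the raw data is not stated in the article, so a neutral extension is used.
    File.WriteAllBytes(Path.Combine("raw-images", $"image{index}.bin"), data);
    index++;
}
```

Working with byte arrays can be convenient when the images are passed straight to another API or stored in a database, with no need for AnyBitmap objects in between.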


Extract Text and Images on Specific Pages

Both text and image extraction can be performed on single or multiple specified pages. Use the ExtractTextFromPage and ExtractTextFromPages methods to extract text from a single page or multiple pages, respectively. For extracting images, use the ExtractImagesFromPage and ExtractImagesFromPages methods.

:path=/static-assets/pdf/content-code-examples/how-to/extract-text-and-images-extract-text-single-multiple.cs
using System;
using IronPdf;

// This program demonstrates how to extract text from specific pages of a PDF document using the IronPdf library.
//
// Note: Ensure that the IronPdf library is installed and referenced in your project.

// Load a PDF document from a given file path
PdfDocument pdf = PdfDocument.FromFile("sample.pdf");

// Extract text from the first page (Note: page index is zero-based, so index 0 refers to the first page)
string textFromPage1 = pdf.ExtractTextFromPage(0);

// Specify pages to extract text from, using zero-based indices
// In this example, we specify the first page (index 0) and the third page (index 2)
int[] pages = new[] { 0, 2 };

// Extract text from the specified pages (in this case, pages 1 and 3)
string textFromPage1_3 = pdf.ExtractTextFromPages(pages);

// (Optional) Output extracted texts to console for verification
Console.WriteLine("Text from Page 1:");
Console.WriteLine(textFromPage1);

Console.WriteLine("\nText from Pages 1 and 3:");
Console.WriteLine(textFromPage1_3);

Imports System
Imports Microsoft.VisualBasic
Imports IronPdf

' This program demonstrates how to extract text from specific pages of a PDF document using the IronPdf library.
'
' Note: Ensure that the IronPdf library is installed and referenced in your project.
Module Program
    Sub Main()
        ' Load a PDF document from a given file path
        Dim pdf As PdfDocument = PdfDocument.FromFile("sample.pdf")

        ' Extract text from the first page (Note: page index is zero-based, so index 0 refers to the first page)
        Dim textFromPage1 As String = pdf.ExtractTextFromPage(0)

        ' Specify pages to extract text from, using zero-based indices
        ' In this example, we specify the first page (index 0) and the third page (index 2)
        Dim pages() As Integer = { 0, 2 }

        ' Extract text from the specified pages (in this case, pages 1 and 3)
        Dim textFromPage1_3 As String = pdf.ExtractTextFromPages(pages)

        ' (Optional) Output extracted texts to console for verification
        Console.WriteLine("Text from Page 1:")
        Console.WriteLine(textFromPage1)

        Console.WriteLine(vbLf & "Text from Pages 1 and 3:")
        Console.WriteLine(textFromPage1_3)
    End Sub
End Module
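
The example above covers text only. Image extraction from specific pages might look like the following sketch. It assumes that ExtractImagesFromPage and ExtractImagesFromPages take zero-based page indices in the same way as their text counterparts, and that both return a collection of AnyBitmap objects with the SaveAs method used earlier. The article names these methods but does not show their signatures, so verify them against the API reference.

```csharp
using System.IO;
using IronPdf;

PdfDocument pdf = PdfDocument.FromFile("sample.pdf");

// CreateDirectory does nothing if the folder already exists.
Directory.CreateDirectory("images");

// Assumption: takes a zero-based page index, like ExtractTextFromPage.
var imagesFromPage1 = pdf.ExtractImagesFromPage(0);

// Assumption: accepts a collection of zero-based page indices, like ExtractTextFromPages.
var imagesFromPages1And3 = pdf.ExtractImagesFromPages(new[] { 0, 2 });

// Save the images found on the first page.
int i = 0;
foreach (var image in imagesFromPage1)
{
    image.SaveAs($"images/page1-image{i}.png");
    i++;
}

// Save the images found on pages 1 and 3 combined.
int j = 0;
foreach (var image in imagesFromPages1And3)
{
    image.SaveAs($"images/pages1and3-image{j}.png");
    j++;
}
```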

Frequently Asked Questions

How can I extract text from a PDF using IronPdf?

To extract text from a PDF using IronPdf, use the ExtractAllText method after loading the PDF document with PdfDocument.FromFile. This will return all the text as a string.

What method should I use to extract images from a PDF?

Use the ExtractAllImages method to extract images from a PDF document. This will return a list of AnyBitmap objects representing the images.

Can I extract text and images from specific pages of a PDF?

Yes, you can use the ExtractTextFromPage and ExtractImagesFromPage methods to extract text and images from specific pages of a PDF.

How do I save extracted images to disk using IronPdf?

After extracting images using the ExtractAllImages method, iterate through the list of images and use the SaveAs method to save each image to disk in your desired format.

What are the coordinates of text lines used for?

The coordinates of text lines (Top, Right, Bottom, Left) help determine the position of text within a PDF page, which can be useful for precise text extraction and layout analysis.

Is it possible to extract raw image data from a PDF?

Yes, you can use the ExtractAllRawImages method to extract raw image data from a PDF document, which returns images as Byte Arrays.

What is the purpose of the ExtractAllBitmaps method?

The ExtractAllBitmaps method extracts images from a PDF and returns them as a list of AnyBitmap objects, similar to the ExtractAllImages method.

How can I start using IronPdf for text and image extraction?

To start using IronPdf, download the IronPdf C# Library from NuGet, and follow the provided steps to extract text and images from your PDF documents.

What are the advantages of extracting text and images from PDFs?

Extracting text and images from PDFs allows users to repurpose content for editing, searching, conversion to other formats, and image reuse or analysis.

What other methods are available for text and image extraction?

Besides ExtractAllText and ExtractAllImages, IronPdf offers methods like ExtractTextFromPages and ExtractImagesFromPages for extracting content from multiple pages at once.

Chaknith
Software Engineer
Chaknith is the Sherlock Holmes of developers. It first occurred to him that he might have a future in software engineering when he was doing code challenges for fun. His focus is on IronXL and IronBarcode, but he takes pride in helping customers with every product. Chaknith draws on what he learns from talking directly with customers to help improve the products themselves. His anecdotal feedback goes beyond Jira tickets and supports product development, documentation, and marketing, improving customers' overall experience. When he isn't in the office, he can be found learning about machine learning, coding, and hiking.