How to Extract Embedded Text and Images from PDFs
Extracting embedded text and images means retrieving the textual content and graphical elements stored inside a PDF document. The extracted text can then be edited, searched, or converted to other formats, and the images can be saved for reuse or analysis.
To extract text and images from a PDF, use IronPdf. An extracted image can be saved to disk, or converted to another image format and embedded in a newly rendered document.
How to Extract Embedded Text and Images from PDFs
- Download the IronPdf C# Library
- Prepare the PDF document for text and image extraction
- Use the ExtractAllText method to extract text
- Use the ExtractAllImages method to extract images
- Specify the particular pages from which to extract text and images
Extract Text Example
Text extraction can be performed on both newly rendered and existing PDF documents. Use the ExtractAllText method to extract the embedded text from the document. The method returns a string containing all the text in the given PDF, with pages separated by four consecutive newline characters. The example below uses a sample PDF rendered from the Wikipedia website.
using IronPdf;
using System.IO;
// This code snippet demonstrates how to extract text from a PDF file and save it to a .txt file.
// Instantiate a PdfDocument object by loading a PDF file from the specified path.
PdfDocument pdf = PdfDocument.FromFile("sample.pdf");
// Extract all text from the PDF document.
string text = pdf.ExtractAllText();
// Write the extracted text to a file named "extractedText.txt".
// If the file already exists, it will be overwritten.
File.WriteAllText("extractedText.txt", text);
Imports IronPdf
Imports System.IO
' This code snippet demonstrates how to extract text from a PDF file and save it to a .txt file.
' Instantiate a PdfDocument object by loading a PDF file from the specified path.
Dim pdf As PdfDocument = PdfDocument.FromFile("sample.pdf")
' Extract all text from the PDF document.
Dim text As String = pdf.ExtractAllText()
' Write the extracted text to a file named "extractedText.txt".
' If the file already exists, it will be overwritten.
File.WriteAllText("extractedText.txt", text)
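Because the pages in the returned string are separated by four consecutive newline characters, the text can be split back into one string per page. The following is a minimal sketch rather than part of the IronPdf API: SplitIntoPages is a hypothetical helper, the sample string is invented and stands in for the result of ExtractAllText, and accepting both "\n" and "\r\n" as the newline character is an assumption.

```csharp
using System;

// Invented sample standing in for the string returned by pdf.ExtractAllText().
string text = "Page one, line one\nPage one, line two\n\n\n\nPage two\r\n\r\n\r\n\r\nPage three";

// Hypothetical helper (not an IronPdf method): normalizes Windows line endings,
// then splits the text wherever four consecutive newlines occur.
static string[] SplitIntoPages(string allText)
{
    string normalized = allText.Replace("\r\n", "\n");
    return normalized.Split(new[] { "\n\n\n\n" }, StringSplitOptions.None);
}

string[] pageTexts = SplitIntoPages(text);

// Print each page under a one-based page heading.
for (int i = 0; i < pageTexts.Length; i++)
{
    Console.WriteLine($"--- Page {i + 1} ---");
    Console.WriteLine(pageTexts[i]);
}
```

Empty pages are kept as empty strings so that the array index still corresponds to the page position. When only a few pages are needed, the page-specific extraction methods described later in this article are likely a better fit than splitting the full text.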

Extract Text by Line and Character
Within each PDF page, it is possible to retrieve the coordinates of text lines and characters. First, select a page from the PDF and access the Lines and Characters properties. The coordinates are laid out as Top, Right, Bottom, and Left values, representing the position of the text.
using IronPdf;
using System.IO;
using System.Linq;
// Create a PDF document from the file located at "sample.pdf"
PdfDocument pdf = PdfDocument.FromFile("sample.pdf");
// Extract the lines of text from the first page of the PDF
var lines = pdf.Pages[0].Lines;
// Extract the individual characters of text from the first page
var characters = pdf.Pages[0].Characters;
// Write the extracted line information to "lines.txt".
// This includes the vertical position (Y coordinate) and the content of each line.
// The Y coordinate is formatted to two decimal places.
File.WriteAllLines("lines.txt", lines.Select(l => $"at Y={l.BoundingBox.Top:F2}: {l.Contents}"));
// Note: If you intend to use character extraction, you should handle the 'characters' variable accordingly.
// The 'characters' variable holds detailed character information, useful for operations like text search or
// individual character analysis.
Imports IronPdf
Imports System.IO
Imports System.Linq
' Create a PDF document from the file located at "sample.pdf"
Dim pdf As PdfDocument = PdfDocument.FromFile("sample.pdf")
' Extract the lines of text from the first page of the PDF
Dim lines = pdf.Pages(0).Lines
' Extract the individual characters of text from the first page
Dim characters = pdf.Pages(0).Characters
' Write the extracted line information to "lines.txt".
' This includes the vertical position (Y coordinate) and the content of each line.
' The Y coordinate is formatted to two decimal places.
File.WriteAllLines("lines.txt", lines.Select(Function(l) $"at Y={l.BoundingBox.Top:F2}: {l.Contents}"))
' Note: If you intend to use character extraction, you should handle the 'characters' variable accordingly.
' The 'characters' variable holds detailed character information, useful for operations like text search or
' individual character analysis.
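Line coordinates become useful once they are filtered or grouped. As a rough illustration, the sketch below keeps only the lines whose Top value falls inside a vertical band, for example to isolate a heading area. Everything here is an assumption made for the example: LinesInBand is a hypothetical helper, the tuples stand in for values read from BoundingBox.Top and Contents, and the numbers are invented. The article does not state the unit or origin of the coordinates, so check them against your own document before choosing a band.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Invented sample data standing in for values read from pdf.Pages[0].Lines
// (l.BoundingBox.Top and l.Contents in the example above).
var sampleLines = new List<(double Top, string Contents)>
{
    (720.50, "Annual Report"),
    (700.25, "Prepared by the finance team"),
    (410.00, "Revenue grew in the third quarter."),
    (52.75, "Page 1 of 12"),
};

// Hypothetical helper: keeps the lines whose Top value lies inside
// [minTop, maxTop] and returns their text in the original order.
static List<string> LinesInBand(
    IEnumerable<(double Top, string Contents)> source, double minTop, double maxTop)
{
    return source
        .Where(l => l.Top >= minTop && l.Top <= maxTop)
        .Select(l => l.Contents)
        .ToList();
}

List<string> band = LinesInBand(sampleLines, 690.0, 730.0);
foreach (string line in band)
{
    Console.WriteLine(line);
}
```

The same filter could be applied to the Left, Right, or Bottom values to select a column or a footer region.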

Extract Images Example
Use the ExtractAllImages method to extract all images embedded in the document. The method returns the images as a list of AnyBitmap objects. The example below uses the same document as the previous example, extracts its images, and exports them to the 'images' folder.
// Import the necessary IronPdf namespace for handling PDF documents
using IronPdf;
// Load the PDF document from a file
PdfDocument pdf = PdfDocument.FromFile("sample.pdf");
// Extract all images contained within the PDF document
var images = pdf.ExtractAllImages();
// Check if the "images" directory exists and create it if it does not
string directoryPath = "images";
if (!System.IO.Directory.Exists(directoryPath))
{
    System.IO.Directory.CreateDirectory(directoryPath);
}
// Loop through each extracted image and save it as a PNG file
for (int i = 0; i < images.Count; i++)
{
    // Save the extracted image to a specified directory with a unique file name
    // Use a numbered sequence to name each image based on its position in the list
    images[i].SaveAs($"{directoryPath}/image{i}.png");
}
' Import the necessary IronPdf namespace for handling PDF documents
Imports IronPdf
' Load the PDF document from a file
Dim pdf As PdfDocument = PdfDocument.FromFile("sample.pdf")
' Extract all images contained within the PDF document
Dim images = pdf.ExtractAllImages()
' Check if the "images" directory exists and create it if it does not
Dim directoryPath As String = "images"
If Not System.IO.Directory.Exists(directoryPath) Then
    System.IO.Directory.CreateDirectory(directoryPath)
End If
' Loop through each extracted image and save it as a PNG file
For i As Integer = 0 To images.Count - 1
    ' Save the extracted image to a specified directory with a unique file name
    ' Use a numbered sequence to name each image based on its position in the list
    images(i).SaveAs($"{directoryPath}/image{i}.png")
Next i

In addition to the ExtractAllImages method shown above, you can use the ExtractAllBitmaps and ExtractAllRawImages methods to extract image information from the document. ExtractAllBitmaps returns a List of AnyBitmap, as in the code example, while ExtractAllRawImages returns every image in the PDF as raw data in the form of byte arrays (byte[]).
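Raw byte arrays carry no file name or extension, so code that writes them to disk has to pick one. The sketch below shows one possible approach: inspect the leading bytes for a PNG or JPEG signature and fall back to ".bin" otherwise. GuessExtension is a hypothetical helper, the byte arrays are invented header-only stand-ins for the output of ExtractAllRawImages, and whether the real raw data begins with a recognizable file signature at all is an assumption this article does not confirm.

```csharp
using System;
using System.Collections.Generic;
using System.IO;

// Hypothetical helper: guesses a file extension from the leading signature bytes.
static string GuessExtension(byte[] data)
{
    // PNG files start with 0x89 'P' 'N' 'G'.
    if (data.Length >= 4 && data[0] == 0x89 && data[1] == 0x50 && data[2] == 0x4E && data[3] == 0x47)
        return ".png";
    // JPEG files start with 0xFF 0xD8 0xFF.
    if (data.Length >= 3 && data[0] == 0xFF && data[1] == 0xD8 && data[2] == 0xFF)
        return ".jpg";
    // Unknown content: keep the bytes but do not claim a format.
    return ".bin";
}

// Invented stand-ins for the byte arrays returned by pdf.ExtractAllRawImages().
// They contain signatures only and are not valid images.
var rawImages = new List<byte[]>
{
    new byte[] { 0x89, 0x50, 0x4E, 0x47, 0x0D, 0x0A, 0x1A, 0x0A },
    new byte[] { 0xFF, 0xD8, 0xFF, 0xE0 },
    new byte[] { 0x00, 0x01, 0x02 },
};

// Write into a demo folder under the temp directory (an arbitrary choice for this sketch).
string outputDir = Path.Combine(Path.GetTempPath(), "raw-images-demo");
Directory.CreateDirectory(outputDir);

var written = new List<string>();
for (int i = 0; i < rawImages.Count; i++)
{
    string path = Path.Combine(outputDir, $"image{i}{GuessExtension(rawImages[i])}");
    File.WriteAllBytes(path, rawImages[i]);
    written.Add(path);
}

foreach (string path in written)
{
    Console.WriteLine(path);
}
```

If the images are only going to be saved in a standard format, the AnyBitmap objects and SaveAs call shown earlier are the simpler route.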
Extract Text and Images on Specific Pages
Both text and image extraction can be performed on single or multiple specified pages. Use the ExtractTextFromPage and ExtractTextFromPages methods to extract text from a single page or multiple pages, respectively. For extracting images, use the ExtractImagesFromPage and ExtractImagesFromPages methods.
using System;
using IronPdf;
// This program demonstrates how to extract text from specific pages of a PDF document using IronPdf library.
//
// Note: Ensure that the IronPdf library is installed and referenced in your project.
// Load a PDF document from a given file path
PdfDocument pdf = PdfDocument.FromFile("sample.pdf");
// Extract text from the first page (Note: page index is zero-based, so index 0 refers to the first page)
string textFromPage1 = pdf.ExtractTextFromPage(0);
// Specify pages to extract text from, using zero-based indices
// In this example, we specify the first page (index 0) and the third page (index 2)
int[] pages = new[] { 0, 2 };
// Extract text from the specified pages (in this case, pages 1 and 3)
string textFromPage1_3 = pdf.ExtractTextFromPages(pages);
// (Optional) Output extracted texts to console for verification
Console.WriteLine("Text from Page 1:");
Console.WriteLine(textFromPage1);
Console.WriteLine("\nText from Pages 1 and 3:");
Console.WriteLine(textFromPage1_3);
Imports Microsoft.VisualBasic
Imports IronPdf
' This program demonstrates how to extract text from specific pages of a PDF document using IronPdf library.
'
' Note: Ensure that the IronPdf library is installed and referenced in your project.
' Load a PDF document from a given file path
Dim pdf As PdfDocument = PdfDocument.FromFile("sample.pdf")
' Extract text from the first page (Note: page index is zero-based, so index 0 refers to the first page)
Dim textFromPage1 As String = pdf.ExtractTextFromPage(0)
' Specify pages to extract text from, using zero-based indices
' In this example, we specify the first page (index 0) and the third page (index 2)
Dim pages() As Integer = { 0, 2 }
' Extract text from the specified pages (in this case, pages 1 and 3)
Dim textFromPage1_3 As String = pdf.ExtractTextFromPages(pages)
' (Optional) Output extracted texts to console for verification
Console.WriteLine("Text from Page 1:")
Console.WriteLine(textFromPage1)
Console.WriteLine(vbLf & "Text from Pages 1 and 3:")
Console.WriteLine(textFromPage1_3)
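Since the page-specific methods take zero-based indices while people usually think in one-based page numbers, a small conversion step can prevent off-by-one mistakes. The sketch below is one way to do that and is not part of IronPdf: ToPageIndices is a hypothetical helper, and the page count is hard-coded here, whereas in real code it would presumably be read from the loaded document.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical helper: converts one-based page numbers to zero-based indices,
// dropping numbers outside 1..pageCount, removing duplicates, and sorting.
static int[] ToPageIndices(IEnumerable<int> pageNumbers, int pageCount)
{
    return pageNumbers
        .Where(n => n >= 1 && n <= pageCount)
        .Select(n => n - 1)
        .Distinct()
        .OrderBy(i => i)
        .ToArray();
}

// Assumed page count for the example; a real program would take it from the document.
int pageCount = 5;

// Pages 3, 1, 3 (duplicate) and 9 (out of range) were requested.
int[] indices = ToPageIndices(new[] { 3, 1, 3, 9 }, pageCount);

Console.WriteLine(string.Join(", ", indices)); // prints "0, 2"
```

The resulting array has the same shape as the pages array in the example above, so it could be passed to ExtractTextFromPages in its place. Silently dropping out-of-range numbers is a design choice for this sketch; throwing an exception may suit stricter applications better.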
Frequently Asked Questions
How can I extract text from a PDF using IronPdf?
To extract text from a PDF using IronPdf, use the ExtractAllText method after loading the PDF document with PdfDocument.FromFile. This will return all the text as a string.
What method should I use to extract images from a PDF?
Use the ExtractAllImages method to extract images from a PDF document. This will return a list of AnyBitmap objects representing the images.
Can I extract text and images from specific pages of a PDF?
Yes. Use the ExtractTextFromPage and ExtractImagesFromPage methods for a single page, or the ExtractTextFromPages and ExtractImagesFromPages methods for several pages. Page indices are zero-based.
How do I save extracted images to disk using IronPdf?
After extracting images using the ExtractAllImages method, iterate through the list of images and use the SaveAs method to save each image to disk in your desired format.
What are the coordinates of text lines used for?
The coordinates of text lines (Top, Right, Bottom, Left) help determine the position of text within a PDF page, which can be useful for precise text extraction and layout analysis.
Is it possible to extract raw image data from a PDF?
Yes, you can use the ExtractAllRawImages method to extract raw image data from a PDF document, which returns images as Byte Arrays.
What is the purpose of the ExtractAllBitmaps method?
The ExtractAllBitmaps method extracts images from a PDF and returns them as a list of AnyBitmap objects, similar to the ExtractAllImages method.
How can I start using IronPdf for text and image extraction?
To start using IronPdf, download the IronPdf C# Library from NuGet, and follow the provided steps to extract text and images from your PDF documents.
What are the advantages of extracting text and images from PDFs?
Extracting text and images from PDFs allows users to repurpose content for editing, searching, conversion to other formats, and image reuse or analysis.
What other methods are available for text and image extraction?
Besides ExtractAllText and ExtractAllImages, IronPdf offers methods like ExtractTextFromPages and ExtractImagesFromPages for extracting content from multiple pages at once.