How to Extract Embedded Text and Images from PDFs

Extracting embedded text and images means retrieving the textual content and graphical elements stored within a PDF document. The extracted text can then be edited, searched, or converted to other formats, and the images can be saved for reuse or analysis.

To extract text and images from a PDF, use IronPdf. Extracted images can be saved to disk, or converted to another image format and embedded in a newly rendered document.

Download the C# library to extract embedded text and images
Prepare the PDF document for text and image extraction
Use the ExtractAllText method to extract text
Use the ExtractAllImages method to extract images
Specify the particular pages from which to extract text and images

Extract Text Example

Text extraction can be performed on both newly rendered and existing PDF documents. Use the ExtractAllText method to extract the embedded text from the document. The method returns a string containing all the text in the given PDF, with pages separated by four consecutive Environment.NewLine sequences. Let's use a sample PDF that I have rendered from the Wikipedia website.

using IronPdf;
using System.IO;

PdfDocument pdf = PdfDocument.FromFile("sample.pdf");

// Extract text
string text = pdf.ExtractAllText();

// Export the extracted text to a text file
File.WriteAllText("extractedText.txt", text);

' VB.NET equivalent of the C# example above
Imports IronPdf
Imports System.IO

Dim pdf As PdfDocument = PdfDocument.FromFile("sample.pdf")

' Extract text
Dim text As String = pdf.ExtractAllText()

' Export the extracted text to a text file
File.WriteAllText("extractedText.txt", text)

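Because pages in the returned string are separated by four consecutive Environment.NewLine sequences, the text can be split back into one string per page. The following is a minimal sketch under that assumption. The helper name SplitIntoPages and the sample string are hypothetical; the sample stands in for the value returned by ExtractAllText.

```csharp
using System;

// Stand-in for: string text = pdf.ExtractAllText();
string separator = string.Concat(
    Environment.NewLine, Environment.NewLine, Environment.NewLine, Environment.NewLine);
string text = "First page" + separator + "Second page";

string[] pages = PageSplitter.SplitIntoPages(text);

for (int i = 0; i < pages.Length; i++)
{
    Console.WriteLine($"Page {i + 1}: {pages[i].Length} characters");
}

public static class PageSplitter
{
    // Assumed separator: four consecutive Environment.NewLine sequences,
    // as described for ExtractAllText above.
    private static readonly string PageSeparator = string.Concat(
        Environment.NewLine, Environment.NewLine, Environment.NewLine, Environment.NewLine);

    // Hypothetical helper: splits the full extracted text into one string per page.
    public static string[] SplitIntoPages(string allText)
    {
        return allText.Split(new[] { PageSeparator }, StringSplitOptions.None);
    }
}
```

If a page's own text happens to contain the same run of blank lines, this split produces extra entries, so extracting pages individually may be the more dependable route when exact page boundaries matter.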

Extract Text by Line and Character

Within each PDF page, it is possible to retrieve the coordinates of text lines and characters. First, select a page from the PDF and access its Lines and Characters properties. The coordinates are laid out as Top, Right, Bottom, and Left values, representing the position of the text.

using IronPdf;
using System.IO;
using System.Linq;

// Open PDF from file
PdfDocument pdf = PdfDocument.FromFile("sample.pdf");

// Extract text by lines
var lines = pdf.Pages[0].Lines;

// Extract text by characters
var characters = pdf.Pages[0].Characters;

// Write each line's bottom coordinate and contents to a text file
File.WriteAllLines("lines.txt", lines.Select(l => $"at Y={l.Bottom:F2}: {l.Contents}"));

' VB.NET equivalent of the C# example above
Imports IronPdf
Imports System.IO
Imports System.Linq

' Open PDF from file
Dim pdf As PdfDocument = PdfDocument.FromFile("sample.pdf")

' Extract text by lines
Dim lines = pdf.Pages(0).Lines

' Extract text by characters
Dim characters = pdf.Pages(0).Characters

' Write each line's bottom coordinate and contents to a text file
File.WriteAllLines("lines.txt", lines.Select(Function(l) $"at Y={l.Bottom:F2}: {l.Contents}"))


Extract Images Example

Use the ExtractAllImages method to extract all images embedded in the document. The method returns the images as a list of AnyBitmap objects. Using the same document as in the previous example, the code below extracts the images and exports them to the 'images' folder.

using IronPdf;
using System.IO;

PdfDocument pdf = PdfDocument.FromFile("sample.pdf");

// Extract images
var images = pdf.ExtractAllImages();

// Make sure the output folder exists before saving
Directory.CreateDirectory("images");

for (int i = 0; i < images.Count; i++)
{
    // Export the extracted images
    images[i].SaveAs($"images/image{i}.png");
}

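' VB.NET equivalent of the C# example above
Imports IronPdf
Imports System.IO

Dim pdf As PdfDocument = PdfDocument.FromFile("sample.pdf")

' Extract images
Dim images = pdf.ExtractAllImages()

' Make sure the output folder exists before saving
Directory.CreateDirectory("images")

For i As Integer = 0 To images.Count - 1
	' Export the extracted images
	images(i).SaveAs($"images/image{i}.png")
Next i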


In addition to the ExtractAllImages method shown above, the ExtractAllBitmaps and ExtractAllRawImages methods can also extract image information from the document. ExtractAllBitmaps returns a List of AnyBitmap, as in the code example, while ExtractAllRawImages returns every image in the PDF document as raw data in the form of byte arrays (byte[]).
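The byte arrays returned by ExtractAllRawImages can be written straight to disk. The sketch below shows one way to do that. The helper name SaveRawImages, the output folder name, and the .bin extension are assumptions made for illustration, since the encoding of the raw data is not stated here. The sample byte arrays stand in for the method's return value.

```csharp
using System.Collections.Generic;
using System.IO;

// Stand-in data; with IronPdf this would be: var rawImages = pdf.ExtractAllRawImages();
var rawImages = new List<byte[]>
{
    new byte[] { 0x01, 0x02, 0x03 },
    new byte[] { 0x04, 0x05 }
};

List<string> savedPaths = RawImageWriter.SaveRawImages(rawImages, "raw-images");

public static class RawImageWriter
{
    // Hypothetical helper: writes each byte array to its own file and returns the paths.
    public static List<string> SaveRawImages(IEnumerable<byte[]> rawImages, string folder)
    {
        // Create the output folder if it does not exist yet
        Directory.CreateDirectory(folder);

        var paths = new List<string>();
        int index = 0;
        foreach (byte[] data in rawImages)
        {
            // The .bin extension is a placeholder; the real image format is not known here
            string path = Path.Combine(folder, $"image{index}.bin");
            File.WriteAllBytes(path, data);
            paths.Add(path);
            index++;
        }
        return paths;
    }
}
```

Working with raw bytes avoids re-encoding the images, which may be preferable when the data is passed on to another library rather than saved as a specific image format.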

Extract Text and Images on Specific Pages

Both text and image extraction can be performed on single or multiple specified pages. Use the ExtractTextFromPage and ExtractTextFromPages methods to extract text from a single page or multiple pages, respectively. For extracting images, use the ExtractImagesFromPage and ExtractImagesFromPages methods. Page indexes are zero-based, so index 0 refers to page 1.

using IronPdf;

PdfDocument pdf = PdfDocument.FromFile("sample.pdf");

// Extract text from page 1
string textFromPage1 = pdf.ExtractTextFromPage(0);

int[] pages = new[] { 0, 2 };

// Extract text from pages 1 & 3
string textFromPage1_3 = pdf.ExtractTextFromPages(pages);

' VB.NET equivalent of the C# example above
Imports IronPdf

Dim pdf As PdfDocument = PdfDocument.FromFile("sample.pdf")

' Extract text from page 1
Dim textFromPage1 As String = pdf.ExtractTextFromPage(0)

Dim pages() As Integer = { 0, 2 }

' Extract text from pages 1 & 3
Dim textFromPage1_3 As String = pdf.ExtractTextFromPages(pages)

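The example above covers text only. The image methods named in this section can be sketched the same way. This assumes that ExtractImagesFromPage and ExtractImagesFromPages accept the same page-index arguments as their text counterparts and return a collection of AnyBitmap objects like ExtractAllImages; neither detail is confirmed here, so check the IronPdf API reference before relying on it. The output file names are illustrative.

```csharp
using IronPdf;
using System.IO;

PdfDocument pdf = PdfDocument.FromFile("sample.pdf");

// Make sure the output folder exists before saving
Directory.CreateDirectory("images");

// Extract images from page 1 (index 0); argument shape assumed to mirror ExtractTextFromPage
var imagesFromPage1 = pdf.ExtractImagesFromPage(0);

// Extract images from pages 1 & 3; argument shape assumed to mirror ExtractTextFromPages
var imagesFromPage1_3 = pdf.ExtractImagesFromPages(new[] { 0, 2 });

// Save the images found on pages 1 & 3
int index = 0;
foreach (var image in imagesFromPage1_3)
{
    image.SaveAs($"images/page1_3_image{index}.png");
    index++;
}
```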

Chaknith Bin

Software Engineer

Chaknith is the Sherlock Holmes of developers. It first occurred to him that he might have a future in software engineering when he was doing code challenges for fun. His focus is on IronXL and IronBarcode, but he takes pride in helping customers with every product. Chaknith leverages his knowledge from talking directly with customers to help further improve the products themselves. His anecdotal feedback goes beyond Jira tickets and supports product development, documentation, and marketing, to improve customers’ overall experience. When he isn’t in the office, he can be found learning about machine learning, coding, and hiking.

Ready to get started? Version: 2024.4 just released
