Read PDF Files in C#
Extracting text and images can facilitate data migration when transitioning from one document format to another. Extracted content can be preserved in a more accessible and editable format, reducing the risk of data loss.
Embedded images and text can be extracted from the PDF document and used on their own. The extracted text is returned as an ordinary string, while the extracted images are returned as image buffers that can then be exported or processed further.
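Because the extracted text is an ordinary string, standard .NET string and file APIs are enough to post-process it. The following is a minimal sketch, not part of IronPDF: it uses a hard-coded sample string in place of real extraction output, and the class name ExtractedTextDemo, the helper NormalizeWhitespace, and the output file name extracted_text.txt are hypothetical names chosen for illustration.

```csharp
using System;
using System.IO;
using System.Text;

static class ExtractedTextDemo
{
    // Hypothetical helper: collapses runs of whitespace inside each line
    // and drops lines that are empty after trimming.
    public static string NormalizeWhitespace(string text)
    {
        var builder = new StringBuilder();
        foreach (string line in text.Replace("\r\n", "\n").Split('\n'))
        {
            // A null separator array makes Split break on any whitespace.
            string[] words = line.Split((char[])null, StringSplitOptions.RemoveEmptyEntries);
            if (words.Length > 0)
            {
                builder.AppendLine(string.Join(" ", words));
            }
        }
        return builder.ToString();
    }

    static void Main()
    {
        // Stand-in for text extracted from a PDF (assumed sample data).
        string sample = "Invoice   No. 42\r\n\r\n  Total:\t100 EUR ";

        string cleaned = NormalizeWhitespace(sample);

        // Persist the cleaned text in an editable plain-text file.
        File.WriteAllText("extracted_text.txt", cleaned, Encoding.UTF8);
        Console.Write(cleaned);
    }
}
```

Keeping the cleanup in a separate helper means the same function could be applied to the string returned by the extraction call without changing anything else.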
Use the extractText method to extract text, and the extractRawImages method to extract images from a PDF document.
Here is a commented example of how you might do this:
using IronPdf;
using System;
using System.IO;

class PdfExtractor
{
    static void Main(string[] args)
    {
        // Load the PDF document from a file
        var pdfDocument = new PdfDocument("path/to/your/document.pdf");

        // Extract text from the PDF document
        string extractedText = pdfDocument.ExtractText();
        Console.WriteLine("Extracted Text:");
        Console.WriteLine(extractedText);

        // Extract images from the PDF document
        var images = pdfDocument.ExtractImages();
        Console.WriteLine("\nExtracted Images:");

        // Iterate over the extracted images
        for (int i = 0; i < images.Count; i++)
        {
            // Each image is stored as a byte array
            var image = images[i];

            // Define a file name for the extracted image
            string imageFilePath = $"extracted_image_{i + 1}.png";

            // Write the image to a file
            File.WriteAllBytes(imageFilePath, image.Bytes);
            Console.WriteLine($"Image {i + 1} saved to {imageFilePath}");
        }
    }
}
The equivalent code in VB.NET:
Imports Microsoft.VisualBasic
Imports IronPdf
Imports System
Imports System.IO

Friend Class PdfExtractor
    Shared Sub Main(ByVal args() As String)
        ' Load the PDF document from a file
        Dim pdfDocument As New PdfDocument("path/to/your/document.pdf")

        ' Extract text from the PDF document
        Dim extractedText As String = pdfDocument.ExtractText()
        Console.WriteLine("Extracted Text:")
        Console.WriteLine(extractedText)

        ' Extract images from the PDF document
        Dim images = pdfDocument.ExtractImages()
        Console.WriteLine(vbLf & "Extracted Images:")

        ' Iterate over the extracted images
        For i As Integer = 0 To images.Count - 1
            ' Each image is stored as a byte array
            Dim image = images(i)

            ' Define a file name for the extracted image
            Dim imageFilePath As String = $"extracted_image_{i + 1}.png"

            ' Write the image to a file
            File.WriteAllBytes(imageFilePath, image.Bytes)
            Console.WriteLine($"Image {i + 1} saved to {imageFilePath}")
        Next i
    End Sub
End Class
In the above C# code:
- We use the IronPDF library to load a PDF document.
- The ExtractText() method is invoked to retrieve text from the PDF. This text is output to the console.
- The ExtractImages() method is used to extract images, which are stored in byte arrays. Each image is then saved to the file system with a specified file name.
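The example saves every image with a .png extension regardless of what the byte array actually contains. Assuming the extracted buffers hold encoded image files (the source does not say which encoding they use), one option is to inspect the leading signature bytes before choosing a file name. The sketch below is independent of IronPDF; the class ImageFormatSniffer and the helper GuessExtension are hypothetical names, and the sample buffer is fabricated purely to exercise the code.

```csharp
using System;
using System.IO;

static class ImageFormatSniffer
{
    // Hypothetical helper: guesses a file extension from the well-known
    // signature ("magic") bytes at the start of an image buffer.
    public static string GuessExtension(byte[] data)
    {
        if (StartsWith(data, 0x89, 0x50, 0x4E, 0x47)) return ".png"; // 0x89 'P' 'N' 'G'
        if (StartsWith(data, 0xFF, 0xD8, 0xFF)) return ".jpg";       // JPEG start-of-image marker
        if (StartsWith(data, 0x47, 0x49, 0x46, 0x38)) return ".gif"; // "GIF8"
        if (StartsWith(data, 0x42, 0x4D)) return ".bmp";             // "BM"
        return ".bin";                                               // unknown: keep the raw bytes
    }

    // Returns true when the buffer begins with the given byte sequence.
    private static bool StartsWith(byte[] data, params byte[] prefix)
    {
        if (data == null || data.Length < prefix.Length) return false;
        for (int i = 0; i < prefix.Length; i++)
        {
            if (data[i] != prefix[i]) return false;
        }
        return true;
    }

    static void Main()
    {
        // Fabricated buffer that only carries a JPEG signature (not a real image).
        byte[] sampleBuffer = { 0xFF, 0xD8, 0xFF, 0xE0, 0x00 };

        string imageFilePath = "extracted_image_1" + GuessExtension(sampleBuffer);
        File.WriteAllBytes(imageFilePath, sampleBuffer);
        Console.WriteLine(imageFilePath); // prints "extracted_image_1.jpg"
    }
}
```

Inside the extraction loop, the same helper could replace the fixed ".png" suffix when building the file name. Buffers with an unrecognized signature fall back to ".bin" so that no data is discarded.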
For more detailed instructions on how to use these methods, visit the IronPDF Documentation.