Read PDF Files in C#

The PdfDocument.ExtractAllText method from the IronPDF C# PDF library returns the text of an entire PDF document as a single string, which suits straightforward text-reading tasks. The method accounts for whitespace and encoding discrepancies within the source PDF.
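
As a minimal sketch of the whole-document case, the snippet below loads a file and prints everything ExtractAllText returns. The file name "example.pdf" and the class name AllTextExample are placeholders chosen for illustration, and the snippet assumes the IronPdf NuGet package is installed.

```csharp
using System;
using IronPdf;

class AllTextExample
{
    static void Main()
    {
        // "example.pdf" is a placeholder path; point it at a real file
        PdfDocument pdf = PdfDocument.FromFile("example.pdf");

        // Read the text of every page into a single string
        string allText = pdf.ExtractAllText();

        Console.WriteLine(allText);
    }
}
```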

PdfDocument.ExtractTextFromPage reads the text from a single page of a PDF. In the example below, it is called in a loop to retrieve the text content of a specific range of pages.

using System;
using IronPdf;

class PdfTextExtractor
{
    static void Main(string[] args)
    {
        // Load a PDF document
        PdfDocument pdf = PdfDocument.FromFile("example.pdf");

        // Iterate over pages 1 to 3
        for (int i = 1; i <= 3; i++)
        {
            // Extract text from the current page.
            // IronPDF page indexes are zero-based, so page 1 is index 0.
            string text = pdf.ExtractTextFromPage(i - 1);

            // Print the text
            Console.WriteLine($"Text from page {i}:");
            Console.WriteLine(text);
        }
    }
}
The equivalent VB.NET code:

Imports System
Imports IronPdf

Friend Class PdfTextExtractor
	Shared Sub Main(ByVal args() As String)
		' Load a PDF document
		Dim pdf As PdfDocument = PdfDocument.FromFile("example.pdf")

		' Iterate over pages 1 to 3
		For i As Integer = 1 To 3
			' Extract text from the current page.
			' IronPDF page indexes are zero-based, so page 1 is index 0.
			Dim text As String = pdf.ExtractTextFromPage(i - 1)

			' Print the text
			Console.WriteLine($"Text from page {i}:")
			Console.WriteLine(text)
		Next i
	End Sub
End Class

IronPDF can also extract raw images from PDFs. For this, use any of the following methods of the PdfDocument class:

  • ExtractAllImages: returns all images embedded in a PDF as IronSoftware.Drawing.AnyBitmap objects.
  • ExtractAllRawImages: retrieves all embedded images as a list of raw bytes (byte[]).
  • ExtractImagesFromPage: extracts the images contained on a single page, identified by its page index.
  • ExtractImagesFromPages: same as ExtractImagesFromPage, but for a specific page range or a list of individual pages.
  • ExtractRawImagesFromPage and ExtractRawImagesFromPages: work the same as the previous two methods, but return the extracted images as byte arrays instead of as IronSoftware.Drawing.AnyBitmap objects.
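
A short sketch of the first two methods in the list might look like the following. It assumes the IronPdf NuGet package is installed, that AnyBitmap.SaveAs accepts a file path, and that the return values of ExtractAllImages and ExtractAllRawImages can be enumerated with foreach. The file name "example.pdf", the "images" output folder, the output file names, and the class name PdfImageExtractor are all hypothetical.

```csharp
using System;
using System.IO;
using IronPdf;
using IronSoftware.Drawing;

class PdfImageExtractor
{
    static void Main()
    {
        // "example.pdf" and the "images" folder are placeholder names
        PdfDocument pdf = PdfDocument.FromFile("example.pdf");
        Directory.CreateDirectory("images");

        // Save each embedded image as a PNG via its AnyBitmap wrapper
        int bitmapCount = 0;
        foreach (AnyBitmap image in pdf.ExtractAllImages())
        {
            image.SaveAs(Path.Combine("images", $"image-{bitmapCount}.png"));
            bitmapCount++;
        }

        // Write the raw bytes of each embedded image unchanged.
        // The ".bin" extension is a neutral choice because the
        // encoding of the raw data is not assumed here.
        int rawCount = 0;
        foreach (byte[] rawImage in pdf.ExtractAllRawImages())
        {
            File.WriteAllBytes(Path.Combine("images", $"raw-{rawCount}.bin"), rawImage);
            rawCount++;
        }

        Console.WriteLine($"Saved {bitmapCount} bitmaps and {rawCount} raw images.");
    }
}
```

The AnyBitmap route is convenient when the images need to be converted or saved in a known format, while the raw-bytes route keeps the embedded data untouched for further processing.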