How to Read PDF Documents in C# using iTextSharp

In today's digital age, Portable Document Format (PDF) files have become a standard for document exchange due to their platform-independent nature and consistent formatting. For developers working with C#, iTextSharp is a powerful library for working with PDFs. In this article, we will walk through reading PDF files with iTextSharp in C#, covering the essential steps from installing the library to extracting text.

How to Read PDF Documents in C# using iTextSharp

  1. Open or Create a Visual Studio Project.
  2. Install the iTextSharp Library.
  3. Add the Necessary Namespace.
  4. Select the PDF File to Read.
  5. Create an Instance of a PDF Reader.
  6. Create an Instance of a PDF Document.
  7. Loop through Each Page of the Document to Extract Text.
  8. Print the Extracted Text on the Console.

What is iTextSharp?

iTextSharp is the name of the original .NET port of the iText library. Its successor, iText 7, is a powerful and versatile Java and .NET library for creating, manipulating, and extracting content from PDF documents, and it is the version installed and used in this article. It provides a comprehensive set of features, including text and image handling, form filling, digital signatures, and watermarking. Whether you’re generating invoices, reports, or interactive forms, iText 7 empowers developers to work with PDFs efficiently.

Reading PDF Files

Let's look at some examples of reading PDF files in C#. To get started, you’ll need to add the iTextSharp library to your project.

Install iTextSharp PDF Library

Open your C# project in Visual Studio. In the top menu, go to "Tools", then "NuGet Package Manager", and select "Package Manager Console." This opens the Package Manager Console at the bottom of the Visual Studio window.

In the Package Manager Console, ensure that the "Default project" dropdown is set to the project where you want to install the iTextSharp package.

Run the following command to install the library, which is published on NuGet as the itext7 package:

Install-Package itext7

This command fetches the latest version of the itext7 package from the NuGet package repository and installs it in your project. Wait for the installation to complete. The Package Manager Console displays information about the installation progress.

Figure 1 - Install iTextSharp from the NuGet Package Manager Console in Visual Studio with the command "Install-Package itext7".

Reading PDF Documents using iTextSharp PDF reader

I will use the following PDF document as input for this example.

Figure 2 - Original PDF Document

Before you begin, add the following namespaces:

C#:

using System;
using System.IO;
using System.Text;
using iText.Kernel.Pdf;
using iText.Kernel.Pdf.Canvas.Parser;

VB.NET:

Imports System
Imports System.IO
Imports System.Text
Imports iText.Kernel.Pdf
Imports iText.Kernel.Pdf.Canvas.Parser

The following code will read the above PDF file, extract the content, and print the extracted content to the console.

C#:

public static void Main(string[] args)
{
    StringBuilder text = new StringBuilder();
    string fileName = @"D:/What_is_pdf.pdf";
    if (File.Exists(fileName))
    {
        // PdfReader opens the file; PdfDocument exposes its pages.
        using (PdfReader pdfReader = new PdfReader(fileName))
        using (PdfDocument pdfDocument = new PdfDocument(pdfReader))
        {
            // iText page numbers start at 1.
            for (int page = 1; page <= pdfDocument.GetNumberOfPages(); page++)
            {
                string currentText = PdfTextExtractor.GetTextFromPage(pdfDocument.GetPage(page));
                // Round-trips the text through the platform default encoding to UTF-8.
                currentText = Encoding.UTF8.GetString(ASCIIEncoding.Convert(Encoding.Default, Encoding.UTF8, Encoding.Default.GetBytes(currentText)));
                text.Append(currentText);
            }
        }
    }
    Console.WriteLine(text.ToString());
}

VB.NET:

Public Shared Sub Main(ByVal args() As String)
	Dim text As New StringBuilder()
	Dim fileName As String = "D:/What_is_pdf.pdf"
	If File.Exists(fileName) Then
		' PdfReader opens the file; PdfDocument exposes its pages.
		Using pdfReader As New PdfReader(fileName)
			Using pdfDocument As New PdfDocument(pdfReader)
				' iText page numbers start at 1.
				Dim page As Integer = 1
				Do While page <= pdfDocument.GetNumberOfPages()
					Dim currentText As String = PdfTextExtractor.GetTextFromPage(pdfDocument.GetPage(page))
					' Round-trips the text through the platform default encoding to UTF-8.
					currentText = Encoding.UTF8.GetString(ASCIIEncoding.Convert(Encoding.Default, Encoding.UTF8, Encoding.Default.GetBytes(currentText)))
					text.Append(currentText)
					page += 1
				Loop
			End Using
		End Using
	End If
	Console.WriteLine(text.ToString())
End Sub

The above source code reads a PDF file, extracts the text from each page, converts it to UTF-8, and then prints the entire text content to the console. It’s a basic example of how to extract text from a PDF file using the iTextSharp library in C#.

Code Explanation

1. File Path and Initialization

The code starts by declaring a StringBuilder named text to accumulate the extracted text from the PDF. It also defines a string variable fileName with the path of the document location. In this case, the PDF file is located at "D:/What_is_pdf.pdf".

2. File Existence Check

The if (File.Exists(fileName)) condition checks whether the specified file exists. If the file exists, the subsequent code block is executed.
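
As written, the sample prints only an empty line when the file is missing, because the if block is simply skipped. Below is a minimal sketch of a more explicit check that uses only the .NET base class library. The PdfPathCheck class, the IsReadable method, and the message texts are hypothetical names chosen for illustration and are not part of iText.

```csharp
using System;
using System.IO;

public static class PdfPathCheck
{
    // Hypothetical helper: returns true when the path points to a non-empty file.
    // On failure, 'problem' holds a short explanation; on success it is empty.
    public static bool IsReadable(string fileName, out string problem)
    {
        problem = "";
        if (string.IsNullOrWhiteSpace(fileName))
        {
            problem = "No file name was given.";
            return false;
        }
        if (!File.Exists(fileName))
        {
            problem = "File not found: " + fileName;
            return false;
        }
        if (new FileInfo(fileName).Length == 0)
        {
            problem = "File is empty: " + fileName;
            return false;
        }
        return true;
    }
}

public static class Program
{
    public static void Main(string[] args)
    {
        // Defaults to the path used in the article; pass another path as the first argument.
        string fileName = args.Length > 0 ? args[0] : @"D:/What_is_pdf.pdf";
        string problem;
        if (PdfPathCheck.IsReadable(fileName, out problem))
            Console.WriteLine("Ready to open: " + fileName);
        else
            Console.WriteLine(problem);
    }
}
```

Reporting the reason makes a wrong path easy to spot, whereas an empty console line could be mistaken for a PDF that contains no text.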

3. PDF Document Processing

Inside the if block, the code opens the PDF file using a PdfReader object. Then, it creates a PdfDocument instance from the PdfReader. The for loop iterates through each page of the PDF document, starting at page 1, because iText numbers pages from 1 rather than 0.

4. Text Extraction from PDF file

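For each PDF page, it extracts the text content using the PdfTextExtractor class's GetTextFromPage(pdfDocument.GetPage(page)) method. The method returns an ordinary .NET string, which is already Unicode text.

It then round-trips the text from the default encoding to UTF-8 using Encoding.UTF8.GetString(ASCIIEncoding.Convert(Encoding.Default,Encoding.UTF8,Encoding.Default.GetBytes(currentText))). This step is not required for a .NET string. On modern .NET, where Encoding.Default is UTF-8, it changes nothing; on .NET Framework, where Encoding.Default is the system ANSI code page, it can replace characters that the code page cannot represent with "?". The resulting text is then appended to the text StringBuilder.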
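
To see what such a round trip does to a string, here is a small sketch that uses only System.Text. It passes the "default" encoding in as a parameter so that the result is the same on every platform; ISO-8859-1 is chosen purely as an illustrative stand-in for a legacy single-byte default encoding. The EncodingRoundTrip class and the sample sentence are hypothetical.

```csharp
using System;
using System.Text;

public static class EncodingRoundTrip
{
    // Mirrors the conversion in the article's sample, with the source
    // encoding passed in instead of taken from Encoding.Default.
    public static string RoundTrip(string input, Encoding legacy)
    {
        byte[] legacyBytes = legacy.GetBytes(input);
        byte[] utf8Bytes = Encoding.Convert(legacy, Encoding.UTF8, legacyBytes);
        return Encoding.UTF8.GetString(utf8Bytes);
    }
}

public static class Program
{
    public static void Main()
    {
        Encoding latin1 = Encoding.GetEncoding("ISO-8859-1");
        string sample = "Café costs 5 € today";

        string viaLatin1 = EncodingRoundTrip.RoundTrip(sample, latin1);
        string viaUtf8 = EncodingRoundTrip.RoundTrip(sample, Encoding.UTF8);

        // The euro sign is not part of ISO-8859-1, so it is replaced with '?'.
        Console.WriteLine(viaLatin1 == sample); // False
        // UTF-8 can represent every character, so the text is unchanged.
        Console.WriteLine(viaUtf8 == sample);   // True
    }
}
```

In other words, the round trip either leaves the text as it was or loses characters, which is why appending the string returned by the extractor directly is usually sufficient.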

5. Displaying the Accumulated Text

Finally, it prints the accumulated text using the Console.WriteLine() method.

Output

The extracted PDF text output is as follows:

Figure 3 - Console Output: Extracting the text from the PDF document "What_is_pdf.pdf" using iTextSharp and displaying it as plain text in the console.

In this way, we can read the content of the PDF file. This approach takes several steps: you create a reader and a document object, loop over the pages yourself, and accumulate the text. Let's explore an alternative that is more user-friendly and needs less code.

Introduction to IronPDF

IronPDF is a versatile and efficient C# library designed to simplify and enhance the creation, manipulation, and rendering of PDF documents within .NET applications. IronPDF enables developers to seamlessly integrate PDF-related functionalities into their projects with a focus on ease of use and feature-rich capabilities. The library supports a wide range of PDF operations, including the creation of PDF documents from scratch, the conversion of HTML content to PDF, and the extraction of text and images from existing PDF files. IronPDF's intuitive API provides developers with a user-friendly experience, allowing them to generate dynamic and interactive PDFs effortlessly. Whether it's adding watermarks, annotations, or encrypting documents, IronPDF empowers developers to tailor PDFs to their specific requirements. As a reliable solution, IronPDF proves instrumental in applications ranging from report generation and document management to web development, offering a comprehensive set of tools to streamline PDF-related tasks in the .NET environment.

Install IronPDF Library

Using Package Manager Console in Visual Studio

Download IronPDF into your project using the NuGet Package Manager Console with the following command.

Install-Package IronPdf

This command will download and install the IronPDF NuGet package, along with its dependencies, into your project.

Figure 4 - Install the IronPDF library from the NuGet Package Manager Console with the command "Install-Package IronPdf".

Using Manage NuGet Packages for Solution

In the Browse tab of the NuGet Package Manager, search for the "IronPDF" library and click Install.

Figure 5 - Install IronPDF using Manage NuGet Packages for Solution: search for "IronPDF" in the search bar of the NuGet Package Manager, select the project, and click the Install button.

Reading a PDF file using IronPDF

Now, let's read the same PDF file using IronPDF. The following code will extract text from the input PDF document.

C#:

using System;
using IronPdf;

public static void Main(string[] args)
{
    // Load the PDF and extract all of its text in one call.
    var pdfDocument = PdfDocument.FromFile(@"D:/What_is_pdf.pdf");
    string text = pdfDocument.ExtractAllText();
    Console.WriteLine(text);
}

VB.NET:

Imports System
Imports IronPdf

Public Shared Sub Main(ByVal args() As String)
	' Load the PDF and extract all of its text in one call.
	Dim pdfDocument = PdfDocument.FromFile("D:/What_is_pdf.pdf")
	Dim text As String = pdfDocument.ExtractAllText()
	Console.WriteLine(text)
End Sub

The above code reads a PDF file named “What_is_pdf.pdf,” extracts all the text content from it, and displays the extracted text in the console.

Code Explanation

1. Loading the PDF Document

The code starts by loading a PDF document from a file named "What_is_pdf.pdf". It uses the PdfDocument.FromFile() method to create a PdfDocument object from the specified file.

2. Extracting All Text

Next, it extracts all the text content from the loaded PDF document. The pdfDocument.ExtractAllText() method returns the entire text from the PDF as a single string.
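
Because the result is a plain string, ordinary string operations apply to it. As one possible follow-up, the sketch below finds the lines that mention a keyword. It operates on a stand-in string rather than on a real PDF, and the TextSearch class and FindLines method are hypothetical names that are not part of IronPDF.

```csharp
using System;
using System.Collections.Generic;

public static class TextSearch
{
    // Returns the lines of 'text' that contain 'keyword', ignoring case.
    public static List<string> FindLines(string text, string keyword)
    {
        var matches = new List<string>();
        string[] lines = text.Split(new[] { "\r\n", "\n" }, StringSplitOptions.RemoveEmptyEntries);
        foreach (string line in lines)
        {
            if (line.IndexOf(keyword, StringComparison.OrdinalIgnoreCase) >= 0)
                matches.Add(line.Trim());
        }
        return matches;
    }
}

public static class Program
{
    public static void Main()
    {
        // Stand-in for the string returned by the extraction call.
        string extracted = "What is PDF?\nIt is a document format.\nA pdf file keeps its layout.";
        foreach (string line in TextSearch.FindLines(extracted, "pdf"))
            Console.WriteLine(line);
    }
}
```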

3. Displaying the Extracted Text

Finally, the extracted text, which is stored in the text variable, is printed to the console using the Console.WriteLine(text) method.

Output

Figure 6 - Console Output: Using IronPDF to extract the text from the PDF document "What_is_pdf.pdf" and display it as plain text in the console.

IronPDF also provides a way to extract text from a PDF file page by page.

Read the PDF file, Page by Page

The following code will read a PDF document, page by page using IronPDF.

C#:

using System;
using System.Text;
using IronPdf;

public static void Main(string[] args)
{
    StringBuilder sb = new StringBuilder();
    using PdfDocument pdf = PdfDocument.FromFile(@"D:/What_is_pdf.pdf");
    // Page indexes start at 0 here.
    for (int index = 0; index < pdf.PageCount; index++)
    {
        sb.Append(pdf.ExtractTextFromPage(index));
    }
    Console.WriteLine(sb.ToString());
}

VB.NET:

Imports System
Imports System.Text
Imports IronPdf

Public Shared Sub Main(ByVal args() As String)
	Dim sb As New StringBuilder()
	Using pdf As PdfDocument = PdfDocument.FromFile("D:/What_is_pdf.pdf")
		' Page indexes start at 0 here.
		For index As Integer = 0 To pdf.PageCount - 1
			sb.Append(pdf.ExtractTextFromPage(index))
		Next index
		Console.WriteLine(sb.ToString())
	End Using
End Sub

The above code reads a PDF file named “What_is_pdf.pdf,” extracts the text content from each page, and prints the combined text to the console.

Code Explanation

1. Initialization and PDF Loading

A StringBuilder named sb is created to accumulate the extracted text from the PDF. A PdfDocument object named pdf is created by loading a PDF file from the path "D:/What_is_pdf.pdf" using the PdfDocument.FromFile method. The using statement ensures that the document's resources are disposed of properly.

2. Text Extraction from Pages

The for loop iterates through each page of the loaded PDF document. For each page (indexed by index), it extracts the text content using pdf.ExtractTextFromPage(index). The extracted text is appended to the StringBuilder using sb.Append().
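
Because each page's text is appended directly, the last word of one page can run into the first word of the next. Below is a minimal sketch of joining page texts with a visible separator. It works on plain strings, so the page texts could come from either library; the PageJoiner class and the "--- Page N ---" format are assumptions made for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

public static class PageJoiner
{
    // Joins per-page text, writing a "--- Page N ---" header before each page.
    public static string Join(IEnumerable<string> pageTexts)
    {
        var sb = new StringBuilder();
        int pageNumber = 1;
        foreach (string pageText in pageTexts)
        {
            sb.Append("--- Page ").Append(pageNumber).Append(" ---\n");
            sb.Append(pageText.TrimEnd()).Append('\n');
            pageNumber++;
        }
        return sb.ToString();
    }
}

public static class Program
{
    public static void Main()
    {
        // Stand-in strings; in the loop described above these would be
        // the values extracted for each page.
        var pages = new List<string> { "First page text", "Second page text" };
        Console.Write(PageJoiner.Join(pages));
    }
}
```

Keeping the page boundaries in the output also makes it easier to trace a piece of text back to the page it came from.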

3. Displaying the Accumulated Text

Finally, the accumulated text is converted to a single string using sb.ToString(). The entire extracted text is printed to the console using the Console.WriteLine() method.

Conclusion

In conclusion, this article showed two ways to extract text from a PDF file in C#. The first example, using iTextSharp, opens the file with a PdfReader, wraps it in a PdfDocument, and extracts the text page by page. The second, using IronPDF, offers a simpler and more concise method: a single ExtractAllText() call, or a short loop when the text is needed page by page. In both cases the library handles PDF internals such as the cross-reference table, page dictionaries, and indirect references, so your code works with pages and strings rather than raw bytes.

For developers seeking to explore IronPDF, customer satisfaction is at the forefront of its offerings, ensuring that developers find value and efficiency in their PDF-related tasks and making it a compelling choice for those in search of a reliable and feature-packed PDF library.

For more information on how to use IronPDF, please refer to the IronPDF documentation.