Extract Text From PDF in C# Using iTextSharp VS IronPDF

In digital document management, extracting data from PDF files is a foundational task behind many applications. Text extraction supports data analysis, content indexing, commercial use, and text manipulation. Among the available tools, iTextSharp, a widely used C# library, is a capable option for extracting text from PDF files.

In this article, we look at how iTextSharp, a versatile PDF parsing library, lets developers extract textual content from PDF documents in C#. We cover the essential methods, a sample program, and practical tips for using iTextSharp for text extraction. We also discuss IronPDF, another powerful PDF library, and compare the two.

How to Extract Text from PDF in C#

  1. Install a C# library that can extract text from a PDF (iTextSharp in the steps below).
  2. Load an existing PDF by instantiating a PdfReader object.
  3. Extract the text of each page from the PdfReader with the PdfTextExtractor.GetTextFromPage method.
  4. Use a foreach loop to iterate through the lines of the extracted text.
  5. Write each line to the output file with the WriteLine method.

What is IronPDF?

IronPDF is a feature-rich .NET library for PDF generation and manipulation. It integrates into C# applications and lets developers create, modify, and render PDF documents. Its API can generate high-quality PDFs from HTML, images, and other content. In this article, we explore IronPDF's key features and show how it handles PDF-related tasks in C#.

iTextSharp Library

iTextSharp, a renowned and powerful library in the domain of PDF manipulation using C#, has revolutionized the way developers handle PDF documents. It stands as a versatile and robust tool that facilitates the creation, modification, and extraction of content from PDF files. iTextSharp empowers developers to generate sophisticated PDFs, extract images, manipulate existing documents, and extract data, making it a go-to solution for a wide range of applications. In this article, we will delve into the capabilities and features of iTextSharp, exploring how it can be effectively utilized to manage and manipulate PDFs within the C# programming environment.

Install IronPDF

Installing IronPDF is straightforward. Follow these steps to install and integrate IronPDF in your C# project.

  1. Open Visual Studio and create a new project or open an existing project.
  2. Go to Tools and open the NuGet Package Manager submenu.
  3. In the submenu, select Manage NuGet Packages for Solution.

    Figure 1 - NuGet Package Manager

  4. In the "NuGet Package Manager" window, select the "Browse" tab.
  5. In the search bar, type "IronPDF" and press Enter.
  6. A list of matching packages will appear. Select IronPDF, choose the latest version, and press Install.

Figure 2 - IronPDF Installation

Just like that, IronPDF is installed and ready to use in your C# project.

Install iTextSharp Library

Installing the iTextSharp PDF library works the same way as installing IronPDF. Repeat the steps above, but search for "iTextSharp" instead of "IronPDF" in the Browse tab, select the package from the list, and click Install to add iTextSharp to your project.

Figure 3 - iTextSharp

Extract Text from PDF file Using IronPDF

IronPDF can extract text from a specific page of a PDF or from every page of the document. The code example below extracts text from a specific page of a sample PDF document.

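using IronPdf;
using System;

// Load the PDF; the using declaration disposes the document when it goes out of scope.
using PdfDocument PDF = PdfDocument.FromFile("Watermarked.pdf");
// The page index is zero-based, so index 1 refers to the second page.
string Text = PDF.ExtractTextFromPage(1);
Console.Write(Text);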

The code above uses IronPDF to extract text from a PDF file and display it in the console. First, it imports the necessary namespaces, IronPdf and System. It then loads a PDF file named "Watermarked.pdf" into a PdfDocument object using the FromFile method. Next, it extracts text from the second page of the PDF using ExtractTextFromPage (the method takes a zero-based page index, so index 1 refers to the second page) and stores it in a string variable named Text. Finally, it writes the extracted text to the console using Console.Write.

Figure 4 - Output
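
The single-page example can be extended to cover the whole document. The following is a minimal sketch, not a definitive implementation: the class IronPdfTextSketch and its methods PageLabel and PrintText are hypothetical names introduced here for illustration, and only PdfDocument with its FromFile, ExtractAllText, ExtractTextFromPage, and PageCount members comes from IronPDF. It assumes the IronPdf NuGet package is installed and that the path passed in points to an existing PDF.

```csharp
using System;
using IronPdf;

// Hypothetical helper class for illustration; it is not part of IronPDF.
public static class IronPdfTextSketch
{
    // Builds the heading printed above each page's text.
    // Page indexes are zero-based, so index 1 is labeled "Page 2".
    public static string PageLabel(int pageIndex) => $"--- Page {pageIndex + 1} ---";

    public static void PrintText(string pdfPath)
    {
        // The using declaration disposes the document at the end of the method.
        using PdfDocument pdf = PdfDocument.FromFile(pdfPath);

        // Option 1: the text of the whole document in one call.
        Console.WriteLine(pdf.ExtractAllText());

        // Option 2: one page at a time, using zero-based page indexes.
        for (int index = 0; index < pdf.PageCount; index++)
        {
            Console.WriteLine(PageLabel(index));
            Console.WriteLine(pdf.ExtractTextFromPage(index));
        }
    }
}
```

Calling IronPdfTextSketch.PrintText("Watermarked.pdf") would print the full text first and then the same text page by page. In practice you would pick one of the two options, depending on whether you need to know which page each piece of text came from.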

Extract Text from PDF file Using iTextSharp Library

You can also extract text from PDF files using iTextSharp. Here is an example of the library in use.

using System;
using System.Text;
using iTextSharp.text.pdf;
using iTextSharp.text.pdf.parser;

namespace PDFApp2
{
    class Program
    {
        static void Main(string[] args)
        {
            string filePath = @"C:\Users\buttw\OneDrive\Desktop\highlighted PDF.pdf";
            string outPath = @"C:\Users\buttw\OneDrive\Desktop\name.txt";
            int pagesToScan = 2;

            PdfReader reader = null;
            try
            {
                reader = new PdfReader(filePath);
                // Never request more pages than the document contains.
                int lastPage = Math.Min(pagesToScan, reader.NumberOfPages);

                // Open the output file once, in append mode, and write it as UTF-8.
                // GetTextFromPage returns an ordinary .NET string, so the text needs
                // no byte-level conversion before it is written.
                using (System.IO.StreamWriter file = new System.IO.StreamWriter(outPath, true, new UTF8Encoding(false)))
                {
                    // iTextSharp page numbers start at 1.
                    for (int page = 1; page <= lastPage; page++)
                    {
                        // Use a fresh strategy per page so text does not accumulate across pages.
                        ITextExtractionStrategy its = new LocationTextExtractionStrategy();
                        string strText = PdfTextExtractor.GetTextFromPage(reader, page, its);

                        foreach (string line in strText.Split('\n'))
                        {
                            file.WriteLine(line);
                        }
                    }
                }
            }
            catch (Exception ex)
            {
                Console.Write(ex);
            }
            finally
            {
                // Close the reader even if extraction fails.
                if (reader != null) reader.Close();
            }
        }
    }
}

The code above is a C# program that uses the iTextSharp library to extract text from specific pages of a PDF document and save it to a text file. First, it imports the necessary namespaces, including System.Text, iTextSharp.text.pdf, and iTextSharp.text.pdf.parser. The program defines the input PDF file path, the output text file path, and the number of pages to scan. It then uses iTextSharp's PdfReader to read the PDF file. For each page, it creates a new LocationTextExtractionStrategy and extracts the page text with PdfTextExtractor.GetTextFromPage. The extracted text is split into lines, and each line is appended to the output file as UTF-8 text with WriteLine. Any exceptions encountered during the process are caught and displayed in the console. The program concludes by closing the PdfReader.

Figure 5 - Extract Text Using iTextSharp
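
If the extracted text is needed in memory rather than in a file, the page loop can be wrapped in a reusable method. The sketch below is one possible arrangement, not the article's original code: PdfTextHelper and its ExtractAllText method are hypothetical names introduced here, while PdfReader, NumberOfPages, PdfTextExtractor.GetTextFromPage, and LocationTextExtractionStrategy are the iTextSharp members used in the program above. It collects the text of every page in a StringBuilder and assumes the caller opens and closes the PdfReader.

```csharp
using System;
using System.Text;
using iTextSharp.text.pdf;
using iTextSharp.text.pdf.parser;

// Hypothetical helper class for illustration; it is not part of iTextSharp.
public static class PdfTextHelper
{
    // Returns the text of every page in the PDF as a single string.
    // The caller owns the reader and is responsible for closing it.
    public static string ExtractAllText(PdfReader reader)
    {
        var text = new StringBuilder();

        // iTextSharp numbers pages from 1 to NumberOfPages inclusive.
        for (int page = 1; page <= reader.NumberOfPages; page++)
        {
            // A fresh strategy per page keeps each page's text separate.
            ITextExtractionStrategy strategy = new LocationTextExtractionStrategy();
            text.AppendLine(PdfTextExtractor.GetTextFromPage(reader, page, strategy));
        }

        return text.ToString();
    }
}
```

With this helper, the body of Main could shrink to creating the PdfReader, calling PdfTextHelper.ExtractAllText(reader), passing the result to System.IO.File.WriteAllText, and closing the reader in a finally block. Separating extraction from file output also makes the extraction step easier to reuse and test.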

Conclusion

iTextSharp is a powerful and versatile C# library for PDF manipulation that supports content creation, modification, and extraction. Its features make it a common choice for developers who need to generate sophisticated PDFs and manage textual content within them. IronPDF, another prominent library in the .NET domain, offers a comprehensive suite of tools for PDF generation and manipulation, letting developers create, modify, and render high-quality PDFs from various sources. When comparing these two PDF libraries, IronPDF takes the lead thanks to its well-documented, easy-to-use API, which performs text extraction in just a few lines of code. With iTextSharp, on the other hand, you have to write longer and more complex code, which requires in-depth knowledge of the library and of C#.

To learn more about IronPDF and its features, visit this link. The complete tutorial for extracting text using IronPDF can be found at this link. For a complete tutorial on IronPDF and iTextSharp, please visit the following link.