Published November 15, 2023
Extract Text From PDF in C# Using iTextSharp VS IronPDF
Extracting text from PDF files is a foundational task in digital document management, and many applications depend on it. Text extraction supports data analysis, content indexing, commercial use, and text manipulation. Among the available tools, iTextSharp, a well-regarded C# library, is a capable solution for extracting text from PDF files.
In this article, we will look at how iTextSharp's parser lets developers extract textual content from PDF documents in C#. We will cover the essential methods, sample techniques, and best practices needed to use iTextSharp effectively for text extraction. We will also compare it with IronPDF, a full-featured PDF library for .NET.
How to Extract Text from PDF in C#
- Download the C# library for extracting text from a PDF.
- Load an existing PDF by instantiating the PdfReader object.
- Extract text from each page using the PdfTextExtractor.GetTextFromPage method.
- Use a foreach loop to iterate through the lines.
- Write the lines into the file using the WriteLine method.
What is IronPDF?
IronPDF is a feature-rich .NET library for generating and manipulating PDFs. It integrates into C# applications and supports creating, modifying, and rendering PDF documents through an intuitive API, including generating high-quality PDFs from HTML, images, and other content. In this article, we'll explore IronPDF's key features and demonstrate how it handles PDF-related tasks in C#.
iTextSharp Library
iTextSharp is a well-known C# library for PDF manipulation. It supports creating, modifying, and extracting content from PDF files: developers can generate sophisticated PDFs, extract images, manipulate existing documents, and extract data, which makes it a go-to solution for a wide range of applications. In this article, we will explore how to use iTextSharp to manage and manipulate PDFs in C#.
Install IronPDF
Installing IronPDF is a straightforward process. Here are the steps to install and integrate IronPDF in your C# project.
- Open Visual Studio and create a new project or open an existing project.
- Go to Tools and select NuGet Package Manager from the drop-down menu.
- In the side menu that opens, select Manage NuGet Packages for Solution.
- In the "NuGet Package Manager" window, select the "Browse" tab.
- In the search bar, type "IronPDF" and press Enter.
- A list of matching packages will appear. Select IronPDF, choose the latest version, and press Install.
Once the installation finishes, IronPDF is ready to use in your C# project.
Install iTextSharp Library
Installing the iTextSharp PDF library works the same way as installing IronPDF. Repeat the steps above, but search for "iTextSharp" instead of "IronPDF" in the Browse tab, select it from the list of packages, and click Install to add iTextSharp to your project.
Extract Text from PDF file Using IronPDF
IronPDF can extract text from a specific page of a PDF or from the entire document. The code example below extracts text from a single page of a sample PDF document.
using IronPdf;
using System;
using PdfDocument PDF = PdfDocument.FromFile("Watermarked.pdf");
string Text = PDF.ExtractTextFromPage(1);
Console.Write(Text);
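Imports IronPdf
Imports System
Module Program
    Sub Main()
        ' Load the PDF and release it automatically at the end of the Using block.
        Using PDF As PdfDocument = PdfDocument.FromFile("Watermarked.pdf")
            Dim Text As String = PDF.ExtractTextFromPage(1)
            Console.Write(Text)
        End Using
    End Sub
End Module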
The above code uses the IronPDF library to extract text from a PDF file and display it in the console. First, the necessary namespaces are imported, including IronPdf and System. The code then loads a PDF document named "Watermarked.pdf" into a PdfDocument object using the FromFile method. Next, it extracts text from the second page of the PDF by calling ExtractTextFromPage(1) (the page index is zero-based, so 1 selects the second page) and stores the result in a string variable named Text. Finally, the extracted text is displayed in the console using Console.Write.
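As noted above, IronPDF can also extract text from the whole document rather than a single page. The following is a minimal sketch of that case, not a definitive implementation. It assumes the same "Watermarked.pdf" sample file, that ExtractAllText and PageCount are available in your installed IronPDF version, and that page indexes are zero-based, as in the example above. The variable names pdf, allText, and pageText are chosen for illustration only.

```csharp
using System;
using IronPdf;

// "Watermarked.pdf" is the sample file name used earlier in this article.
using PdfDocument pdf = PdfDocument.FromFile("Watermarked.pdf");

// Extract the text of every page as one string (assumed API: ExtractAllText).
string allText = pdf.ExtractAllText();
Console.WriteLine(allText);

// Alternatively, walk the pages one at a time.
// Assumption: page indexes are zero-based, so valid indexes are 0 .. PageCount - 1.
for (int i = 0; i < pdf.PageCount; i++)
{
    string pageText = pdf.ExtractTextFromPage(i);
    Console.WriteLine($"--- Page {i + 1}: {pageText.Length} characters ---");
}
```

Extracting page by page is useful when you need to know which page a piece of text came from; extracting everything at once is simpler when you only need the full content.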
Extract Text from PDF file Using iTextSharp Library
You can also extract text from PDF files using iTextSharp. Here is an example of the iTextSharp library in use.
using System;
using System.IO;
using System.Text;
using iTextSharp.text.pdf;
using iTextSharp.text.pdf.parser;

namespace PDFApp2
{
    class Program
    {
        static void Main(string[] args)
        {
            string filePath = @"C:\Users\buttw\OneDrive\Desktop\highlighted PDF.pdf";
            string outPath = @"C:\Users\buttw\OneDrive\Desktop\name.txt";
            int pagesToScan = 2;
            PdfReader reader = null;
            try
            {
                reader = new PdfReader(filePath);
                // Never request more pages than the document contains.
                int lastPage = Math.Min(pagesToScan, reader.NumberOfPages);
                // Open the output file once, in append mode, as UTF-8 without a byte-order mark.
                using (StreamWriter writer = new StreamWriter(outPath, true, new UTF8Encoding(false)))
                {
                    // iTextSharp page numbers start at 1.
                    for (int page = 1; page <= lastPage; page++)
                    {
                        // Use a fresh extraction strategy for each page.
                        ITextExtractionStrategy its = new LocationTextExtractionStrategy();
                        string strText = PdfTextExtractor.GetTextFromPage(reader, page, its);
                        // .NET strings are already Unicode, so no encoding round-trip is needed.
                        foreach (string line in strText.Split('\n'))
                        {
                            writer.WriteLine(line);
                        }
                    }
                }
            }
            catch (Exception ex)
            {
                Console.Write(ex);
            }
            finally
            {
                // Close the reader even if extraction failed.
                if (reader != null) reader.Close();
            }
        }
    }
}
Imports Microsoft.VisualBasic
Imports System
Imports System.Text
Imports iTextSharp.text.pdf
Imports iTextSharp.text.pdf.parser

Namespace PDFApp2
    Friend Class Program
        Shared Sub Main(ByVal args() As String)
            Dim filePath As String = "C:\Users\buttw\OneDrive\Desktop\highlighted PDF.pdf"
            Dim outPath As String = "C:\Users\buttw\OneDrive\Desktop\name.txt"
            Dim pagesToScan As Integer = 2
            Dim reader As PdfReader = Nothing
            Try
                reader = New PdfReader(filePath)
                ' Never request more pages than the document contains.
                Dim lastPage As Integer = Math.Min(pagesToScan, reader.NumberOfPages)
                ' Open the output file once, in append mode, as UTF-8 without a byte-order mark.
                Using writer As New System.IO.StreamWriter(outPath, True, New UTF8Encoding(False))
                    ' iTextSharp page numbers start at 1.
                    For page As Integer = 1 To lastPage
                        Dim its As ITextExtractionStrategy = New LocationTextExtractionStrategy()
                        Dim strText As String = PdfTextExtractor.GetTextFromPage(reader, page, its)
                        For Each line As String In strText.Split(ControlChars.Lf)
                            writer.WriteLine(line)
                        Next line
                    Next page
                End Using
            Catch ex As Exception
                Console.Write(ex)
            Finally
                ' Close the reader even if extraction failed.
                If reader IsNot Nothing Then reader.Close()
            End Try
        End Sub
    End Class
End Namespace
The code above is a C# program that uses the iTextSharp library to extract text from the first pages of a PDF document and save it to a text file. First, the necessary namespaces are imported, including System.Text, iTextSharp.text.pdf, and iTextSharp.text.pdf.parser. The program specifies the input PDF file path, the output text file path, and the number of pages to scan. It then uses iTextSharp's PdfReader to read the PDF file. For each page, it creates a new LocationTextExtractionStrategy and passes it to PdfTextExtractor.GetTextFromPage to extract that page's text. The extracted text is split into lines, and each line is appended to the output file as UTF-8 text with WriteLine. Any exceptions encountered during the process are caught and displayed in the console. The program concludes by closing the PdfReader.
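The line-splitting step described above can be isolated into a small helper that works on the text returned by either library. The sketch below is illustrative only: ExtractedText and ToLines are hypothetical names that do not belong to iTextSharp or IronPDF, and trimming whitespace and dropping blank lines are assumptions about what a caller might want, not something the original program does.

```csharp
using System;
using System.Collections.Generic;

public static class ExtractedText
{
    // Hypothetical helper: splits extracted page text into trimmed, non-empty lines.
    public static List<string> ToLines(string pageText)
    {
        var lines = new List<string>();
        if (string.IsNullOrEmpty(pageText))
        {
            return lines;
        }

        foreach (string raw in pageText.Split('\n'))
        {
            // Trim removes surrounding spaces and any trailing '\r' left by "\r\n" endings.
            string line = raw.Trim();
            if (line.Length > 0)
            {
                lines.Add(line);
            }
        }
        return lines;
    }
}

public static class Program
{
    public static void Main()
    {
        // Stand-in for text returned by a PDF library.
        string sample = "Invoice 42\r\n\r\n  Total: 10.00  \nThank you";
        foreach (string line in ExtractedText.ToLines(sample))
        {
            Console.WriteLine(line);
        }
    }
}
```

Keeping this step separate from the PDF library calls makes it possible to test the cleanup logic with plain strings, without needing a PDF file on disk.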
Conclusion
iTextSharp is a powerful and versatile C# library for PDF manipulation that supports content creation, modification, and extraction. Its robust features make it a go-to solution for developers who need to generate sophisticated PDFs and manage textual content within them. IronPDF, another prominent library in the .NET domain, offers a comprehensive suite of tools for PDF generation and image manipulation, letting developers create, modify, and render high-quality PDFs from various sources. Comparing the two libraries for text extraction, IronPDF takes the lead: its API is well documented and easy to use, and it extracts text in just a few lines of code. With iTextSharp, on the other hand, you have to write longer and more complex code, which requires in-depth knowledge of both the library and C#.
To learn more about IronPDF and its features, visit this link. The complete tutorial for extracting text using IronPDF can be found at this link. For a complete tutorial on IronPDF and iTextSharp, please visit the following link.