In digital document management, extracting data from PDF files is a foundational task that underpins many applications. Text extraction is vital for purposes such as data analysis, content indexing, commercial use, and text manipulation. Among the available tools, iTextSharp, a highly regarded C# library, is a capable solution for extracting text from PDF files.
In this article, we will look at how iTextSharp's parser lets developers extract textual content from PDF documents using the C# programming language. We will cover the essential methods, a sample program, and best practices, so that developers can use iTextSharp effectively for text extraction. We will also compare iTextSharp with IronPDF, another PDF library for .NET.
In short, extracting text with iTextSharp involves the following steps:
Open the PDF file with a PdfReader object.
Extract the text from the PdfDocument object using the GetTextFromPage method.
Use a foreach loop to iterate through the lines.
Write each line out with the WriteLine method.
IronPDF is a prominent, feature-rich library for PDF generation and manipulation in .NET. It provides a comprehensive suite of tools that integrate into C# applications for creating, modifying, and rendering PDF documents. With its intuitive API, the library can generate high-quality PDFs from HTML, images, and other content. In this article, we'll explore the capabilities of IronPDF, look at its key features, and demonstrate how it can be used to handle PDF-related tasks in C#.
iTextSharp is a well-known and powerful library for PDF manipulation in C#. It is a versatile tool for creating, modifying, and extracting content from PDF files. With iTextSharp, developers can generate sophisticated PDFs, extract images, manipulate existing documents, and extract data, which makes it a go-to solution for a wide range of applications. In this article, we will look at how its features can be used to manage and manipulate PDFs in C#.
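Installing IronPDF is a straightforward process. Here are the steps to install and integrate IronPDF in your C# project:
Open your project in Visual Studio and click Tools in the top menu.
In the new side menu, select NuGet Package Manager, then Manage NuGet Packages for Solution.
In the Browse tab, search for "IronPDF", select it from the list of packages, and click Install.
Just like that, IronPDF is installed and ready to use in your C# project.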
Installing the iTextSharp PDF library works the same way as installing IronPDF. Repeat the steps explained above, but search for "iTextSharp" instead of "IronPDF" in the Browse tab, select it from the list of packages, and click Install to integrate the iTextSharp PDF library in your project.
IronPDF can extract text from specific pages of a PDF file or from the whole document. In the code example below, we will see how to extract text from a specific page of a sample PDF document.
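using IronPdf;
using System;

// Load the sample PDF; the using declaration disposes the document when it goes out of scope
using PdfDocument PDF = PdfDocument.FromFile("Watermarked.pdf");
// Page indexes are zero-based, so index 1 is the second page
string Text = PDF.ExtractTextFromPage(1);
// Print the extracted text to the console
Console.Write(Text);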
Imports IronPdf
Imports System
Module Program
    Sub Main()
        Using PDF As PdfDocument = PdfDocument.FromFile("Watermarked.pdf")
            Dim Text As String = PDF.ExtractTextFromPage(1)
            Console.Write(Text)
        End Using
    End Sub
End Module
The above code uses the IronPDF library in C# to extract text from a PDF file and display it in the console. First, the IronPdf and System namespaces are imported. The code then loads the PDF document named "Watermarked.pdf" into a PdfDocument object using the FromFile method. Next, it extracts the text of the second page with ExtractTextFromPage (page indexes are zero-based, so index 1 is the second page) and stores it in a string variable named Text. Finally, the extracted text is written to the console using Console.Write.
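Since IronPDF can also extract text from every page rather than a single one, the following is a minimal sketch of that variant. It assumes the ExtractAllText method and PageCount property of PdfDocument and reuses the sample file name from above; ExtractPages and AllPagesExample are hypothetical names introduced only for this illustration.

```csharp
using System;
using System.Collections.Generic;
using IronPdf;

public static class AllPagesExample
{
    // Hypothetical helper: returns the text of every page, in page order.
    public static List<string> ExtractPages(PdfDocument pdf)
    {
        var pages = new List<string>();
        // Page indexes are zero-based, so index 0 is the first page.
        for (int i = 0; i < pdf.PageCount; i++)
        {
            pages.Add(pdf.ExtractTextFromPage(i));
        }
        return pages;
    }

    public static void Main()
    {
        // "Watermarked.pdf" is the sample file used in the article.
        using PdfDocument pdf = PdfDocument.FromFile("Watermarked.pdf");

        // Option 1: the text of the whole document as a single string.
        Console.WriteLine(pdf.ExtractAllText());

        // Option 2: one string per page, printed with a page header.
        List<string> pages = ExtractPages(pdf);
        for (int i = 0; i < pages.Count; i++)
        {
            Console.WriteLine($"--- Page {i + 1} ---");
            Console.WriteLine(pages[i]);
        }
    }
}
```

The per-page variant is useful when the page a piece of text came from matters, for example when building a search index; the single-string variant is enough when only the content is needed.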
You can also extract text from PDF files using iTextSharp. Here is an example of the iTextSharp library at work.
using System;
using System.Text;
using iTextSharp.text.pdf;
using iTextSharp.text.pdf.parser;

namespace PDFApp2
{
    class Program
    {
        static void Main(string[] args)
        {
            // Input PDF and output text file
            string filePath = @"C:\Users\buttw\OneDrive\Desktop\highlighted PDF.pdf";
            string outPath = @"C:\Users\buttw\OneDrive\Desktop\name.txt";
            // Number of pages to read, starting from page 1
            int pagesToScan = 2;
            string strText = string.Empty;
            try
            {
                PdfReader reader = new PdfReader(filePath);
                // Never read past the last page of the document
                int lastPage = Math.Min(pagesToScan, reader.NumberOfPages);
                // Open the output file once, in append mode
                using (System.IO.StreamWriter file = new System.IO.StreamWriter(outPath, true))
                {
                    for (int page = 1; page <= lastPage; page++)
                    {
                        // Use a fresh extraction strategy for each page
                        ITextExtractionStrategy its = new LocationTextExtractionStrategy();
                        strText = PdfTextExtractor.GetTextFromPage(reader, page, its);
                        // Convert the text from the system default encoding to UTF-8
                        strText = Encoding.UTF8.GetString(Encoding.Convert(Encoding.Default, Encoding.UTF8, Encoding.Default.GetBytes(strText)));
                        // Write the page text to the file line by line
                        string[] lines = strText.Split('\n');
                        foreach (string line in lines)
                        {
                            file.WriteLine(line);
                        }
                    }
                }
                reader.Close();
            }
            catch (Exception ex)
            {
                // Report any failure (missing file, unreadable PDF, I/O error)
                Console.Write(ex);
            }
        }
    }
}
Imports Microsoft.VisualBasic
Imports System
Imports System.Text
Imports iTextSharp.text.pdf
Imports iTextSharp.text.pdf.parser

Namespace PDFApp2
    Friend Class Program
        Shared Sub Main(ByVal args() As String)
            ' Input PDF and output text file
            Dim filePath As String = "C:\Users\buttw\OneDrive\Desktop\highlighted PDF.pdf"
            Dim outPath As String = "C:\Users\buttw\OneDrive\Desktop\name.txt"
            ' Number of pages to read, starting from page 1
            Dim pagesToScan As Integer = 2
            Dim strText As String = String.Empty
            Try
                Dim reader As New PdfReader(filePath)
                ' Never read past the last page of the document
                Dim lastPage As Integer = Math.Min(pagesToScan, reader.NumberOfPages)
                ' Open the output file once, in append mode
                Using file As New System.IO.StreamWriter(outPath, True)
                    For page As Integer = 1 To lastPage
                        ' Use a fresh extraction strategy for each page
                        Dim its As ITextExtractionStrategy = New LocationTextExtractionStrategy()
                        strText = PdfTextExtractor.GetTextFromPage(reader, page, its)
                        ' Convert the text from the system default encoding to UTF-8
                        strText = Encoding.UTF8.GetString(Encoding.Convert(Encoding.Default, Encoding.UTF8, Encoding.Default.GetBytes(strText)))
                        ' Write the page text to the file line by line
                        Dim lines() As String = strText.Split(ControlChars.Lf)
                        For Each line As String In lines
                            file.WriteLine(line)
                        Next line
                    Next page
                End Using
                reader.Close()
            Catch ex As Exception
                ' Report any failure (missing file, unreadable PDF, I/O error)
                Console.Write(ex)
            End Try
        End Sub
    End Class
End Namespace
The provided code is a C# program that uses the iTextSharp library to extract text from specific pages of a PDF document and save it to a text file. First, the necessary namespaces are imported, including System.Text, iTextSharp.text.pdf, and iTextSharp.text.pdf.parser. The program specifies the input PDF file path, the output text file path, and the number of pages to scan. It then uses iTextSharp's PdfReader to read the PDF file. For each specified page, it creates a new LocationTextExtractionStrategy to extract the text and converts the encoding to UTF-8. The extracted text is split into lines, and each line is appended to the output text file with a StreamWriter. Once all pages are processed, the PdfReader is closed. Any exceptions encountered during the process are caught and displayed in the console.
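As a variation on the program above, the following sketch reads every page of the document instead of a fixed number of pages, and returns the lines so the caller decides where to write them. It uses the same iTextSharp 5 calls as the example (PdfReader, PdfTextExtractor.GetTextFromPage, LocationTextExtractionStrategy) plus the reader's NumberOfPages property; ExtractLines and AllPagesToText are hypothetical names, and the input and output paths are assumed to be passed on the command line.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using iTextSharp.text.pdf;
using iTextSharp.text.pdf.parser;

public static class AllPagesToText
{
    // Hypothetical helper: extracts every page and returns the text as lines.
    public static List<string> ExtractLines(string pdfPath)
    {
        var lines = new List<string>();
        PdfReader reader = new PdfReader(pdfPath);
        try
        {
            // iTextSharp page numbers start at 1; NumberOfPages is the last valid page.
            for (int page = 1; page <= reader.NumberOfPages; page++)
            {
                // A fresh strategy per page keeps text from one page out of the next.
                ITextExtractionStrategy strategy = new LocationTextExtractionStrategy();
                string pageText = PdfTextExtractor.GetTextFromPage(reader, page, strategy);
                lines.AddRange(pageText.Split('\n'));
            }
        }
        finally
        {
            // Close the reader even if extraction throws.
            reader.Close();
        }
        return lines;
    }

    public static void Main(string[] args)
    {
        if (args.Length < 2)
        {
            Console.WriteLine("Usage: <input.pdf> <output.txt>");
            return;
        }
        // File.WriteAllLines writes UTF-8 by default and overwrites the target file.
        File.WriteAllLines(args[1], ExtractLines(args[0]));
    }
}
```

This sketch skips the explicit encoding conversion: .NET strings are already Unicode, and File.WriteAllLines encodes them as UTF-8 when writing. Closing the reader in a finally block also means the file handle is released when a page fails to parse.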
iTextSharp is a powerful and versatile C# library for creating, modifying, and extracting content from PDF files, which makes it a common choice for generating sophisticated PDFs and managing the text inside them. IronPDF, another prominent library in the .NET ecosystem, offers a comprehensive suite of tools for creating, modifying, and rendering high-quality PDFs from sources such as HTML and images. Comparing the two for text extraction, IronPDF takes the lead because of its well-documented, easy-to-use API: the extraction takes just a few lines of code, whereas the iTextSharp example requires longer, more complex code and deeper knowledge of the library and C#.
To learn more about IronPDF and its features, visit the official webpage. The complete tutorial for extracting text using IronPDF can be found in the IronPDF Text Extraction Tutorial. For a full comparison of the two libraries, see the IronPDF vs iTextSharp Comparison.