C# Extract Text From PDF (Code Example Tutorial)

PDF (Portable Document Format) files are used across many industries, and developers often deliver documents to their clients in this format. Modern libraries make creating PDFs straightforward. To decide which commercial library best fits a project, we need to evaluate how well it handles tasks such as creating, reading, and converting documents.

1. IronPDF Features

IronPDF is a robust PDF converter that can render nearly anything a browser can. It is a .NET library that makes creating, reading, and manipulating PDF documents simple for developers. IronPDF converts HTML to PDF using the Chrome engine and supports HTML, ASPX, Razor HTML, and MVC View, among other web components. It works in Microsoft .NET applications (both ASP.NET web applications and traditional Windows applications) and can be used to create visually appealing PDF documents.

With IronPDF, we can create a PDF document from HTML5, JavaScript, CSS, and images, and the files can have headers and footers. IronPDF also makes it easy to read an existing PDF document. It combines a comprehensive PDF conversion engine with a powerful HTML-to-PDF converter.

  • Sources that can be used to create a PDF file include HTML, HTML5, ASPX, and Razor/MVC View. Not only can we convert HTML files to PDF, but we can also convert image files to PDF.
  • IronPDF allows you to build interactive PDF documents, fill out and submit interactive forms, merge and split a PDF document, extract text and images from PDF files, search text in a PDF file, rasterize PDF pages to images, convert PDF to HTML, and print PDF documents.
  • IronPDF is capable of creating a document from a URL. It also supports custom network login credentials, user agents, proxies, cookies, HTTP headers, and form variables for login behind HTML login forms.
  • IronPDF is a library that lets you read and fill out PDF documents.
  • Images can be extracted from documents using IronPDF.
  • IronPDF allows us to customize documents with headers, footers, text, images, bookmarks, watermarks, and more.
  • We can join and split pages in a new or existing document using IronPDF.
  • Documents can be loaded as PDF objects without an Acrobat viewer installed.
  • CSS styling is applied when a document is converted to PDF.
  • CSS media types are supported during conversion.
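As a minimal sketch of the HTML-to-PDF conversion described above, the following console program renders a small HTML string and saves it as result.pdf, the file name that the extraction examples later in this tutorial read from. It assumes a recent IronPdf release that exposes the ChromePdfRenderer class; the BuildHtml helper and the sample text are hypothetical and exist only for illustration.

```csharp
using System.Net;
using IronPdf;

public static class CreateSamplePdf
{
    // Hypothetical helper: wraps a title and a paragraph in minimal HTML,
    // encoding both so that special characters do not break the markup.
    public static string BuildHtml(string title, string body)
    {
        return "<h1>" + WebUtility.HtmlEncode(title) + "</h1>"
             + "<p>" + WebUtility.HtmlEncode(body) + "</p>";
    }

    public static void Main()
    {
        // Assumption: ChromePdfRenderer is available in the installed IronPdf version.
        var renderer = new ChromePdfRenderer();

        // Render the HTML string into an in-memory PDF document.
        using PdfDocument pdf = renderer.RenderHtmlAsPdf(
            BuildHtml("Sample document", "This text can be extracted later."));

        // Write the PDF to disk next to the executable.
        pdf.SaveAs("result.pdf");
    }
}
```

Running this once gives the later examples a known file to read, which makes it easier to check that the extracted text matches what was rendered.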

2. Creating a New Project in Visual Studio

Open Visual Studio and go to the File menu. Select "New Project", and then select "Console Application". In this article, we use a console application to extract text from PDF documents.

Figure 1: Create a new project in Visual Studio

Enter the project name and select the file path in the appropriate text box. Then, click the Create button and select the required .NET Framework, as in the screenshot below.

Figure 2: Configure new project in Visual Studio

Visual Studio will now generate the structure for the selected application type. If you selected a Console, Windows, or Web Application, it opens the Program.cs file, where you can enter the code and then build and run the application.

Figure 3: Selecting .NET Core

Next, we can add the library to test the code.

3. Install the IronPDF Library

The IronPDF Library can be downloaded and installed in four ways.

These are:

  • Using Visual Studio.
  • Using the Visual Studio Command-Line.
  • Direct download from the NuGet website.
  • Direct download from the IronPDF website.

3.1 Using Visual Studio

Visual Studio provides the NuGet Package Manager to install the package directly into the solution. The screenshot below shows how to open the NuGet Package Manager.

Figure 4: Visual Studio program.cs file

The package manager provides a search box that lists packages from the NuGet website. Search for the keyword "IronPdf", as in the screenshot below.

Figure 5: NuGet Package Manager

In the above image, we can see the list of the related search items. We need to select the required option to install the package to the solution.

3.2 Using the Visual Studio Command-Line

In Visual Studio, go to Tools > NuGet Package Manager > Package Manager Console.

Enter the following line in the package manager console tab:

Install-Package IronPdf

Now the package will download/install to the current project and be ready to use.

Figure 6: IronPdf library in NuGet Package Manager

3.3 Direct download from the NuGet website

The third way is to download the NuGet package directly from the website.

  • Navigate to the IronPdf package page on the NuGet website.
  • Select the download package option from the menu on the right-hand side.
  • Double-click the downloaded package. It will be installed automatically.
  • Next, reload the solution and start using it in the project.

3.4 Direct download from the IronPDF website

Visit our IronPDF website to download the latest package directly from the website. Once downloaded, follow the steps below to add the package to the project.

  • Right-click the project from the solution window.
  • Then, select the Reference option and browse to the location of the downloaded file.
  • Next, click OK to add the reference.

4. Extract Text Using IronPDF

The IronPDF library allows us to load a PDF file as a PDF object and extract text from it. The following is an example of how to use IronPDF to read an existing PDF.

The first approach is to extract text from a PDF and the sample code snippet is below.

// C#
// Load the existing PDF file into a PdfDocument object.
var pdfDocument = IronPdf.PdfDocument.FromFile("result.pdf");
// Extract the text of all pages into a single string.
string allText = pdfDocument.ExtractAllText();

' VB.NET equivalent
Dim pdfDocument = IronPdf.PdfDocument.FromFile("result.pdf")
Dim allText As String = pdfDocument.ExtractAllText()

As the code above shows, the static FromFile method loads the PDF document from an existing file and returns a PdfDocument object. Through this object, we can read the text and images on the PDF pages. Its ExtractAllText method extracts the text of the whole PDF document and returns it as a single string, which we can then process further.
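To illustrate what processing the extracted string might look like, here is a minimal sketch that counts how often a keyword appears in the text returned by ExtractAllText. The CountOccurrences helper and the keyword "invoice" are hypothetical and not part of IronPDF; only FromFile and ExtractAllText come from the example above.

```csharp
using System;
using IronPdf;

public static class KeywordSearch
{
    // Hypothetical helper: counts non-overlapping, case-insensitive
    // occurrences of keyword in text.
    public static int CountOccurrences(string text, string keyword)
    {
        if (string.IsNullOrEmpty(text) || string.IsNullOrEmpty(keyword))
        {
            return 0;
        }

        int count = 0;
        int position = 0;
        while ((position = text.IndexOf(keyword, position, StringComparison.OrdinalIgnoreCase)) >= 0)
        {
            count++;
            position += keyword.Length; // Continue searching after this match.
        }
        return count;
    }

    public static void Main()
    {
        // "result.pdf" is the sample file name used throughout this tutorial.
        using PdfDocument pdf = PdfDocument.FromFile("result.pdf");
        string allText = pdf.ExtractAllText();

        // "invoice" is an assumed keyword chosen only for illustration.
        int hits = CountOccurrences(allText, "invoice");
        Console.WriteLine($"Found {hits} occurrence(s) of \"invoice\".");
    }
}
```

Keeping the search logic in a separate method that only takes strings means it can be checked without a PDF file or the library being present.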

Below is the code example for the second method that we can use to extract text from a PDF file, page by page.

// C#
using IronPdf;

// Load the PDF; the using declaration disposes it when it goes out of scope.
using PdfDocument pdf = PdfDocument.FromFile("result.pdf");
for (var index = 0; index < pdf.PageCount; index++)
{
    // Page indexes are zero-based; add 1 for a human-readable page number.
    int pageNumber = index + 1;
    // Extract the text of the current page only.
    string text = pdf.ExtractTextFromPage(index);
}

' VB.NET equivalent
Imports IronPdf

Using pdf As PdfDocument = PdfDocument.FromFile("result.pdf")
	For index = 0 To pdf.PageCount - 1
		Dim pageNumber As Integer = index + 1
		Dim text As String = pdf.ExtractTextFromPage(index)
	Next index
End Using

The code above first loads the whole PDF document into a PdfDocument object. It then reads the PageCount property to get the total number of pages in the loaded document. Inside the for loop, ExtractTextFromPage takes the zero-based page index as a parameter and returns the text of that page, which is stored in a string variable. In this way, a "for" or "for each" loop extracts text from the PDF page by page.
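One reason to extract page by page is to find out where something appears in the document. The sketch below collects the text of each page and reports the page numbers that contain a keyword. The PagesContaining helper and the keyword "invoice" are hypothetical names introduced for illustration; PageCount and ExtractTextFromPage are used exactly as in the example above.

```csharp
using System;
using System.Collections.Generic;
using IronPdf;

public static class PageSearch
{
    // Hypothetical helper: returns the 1-based numbers of the pages
    // whose text contains keyword (case-insensitive).
    public static List<int> PagesContaining(IReadOnlyList<string> pageTexts, string keyword)
    {
        var matches = new List<int>();
        for (int index = 0; index < pageTexts.Count; index++)
        {
            // The null-conditional call skips pages with no text at all.
            if (pageTexts[index]?.IndexOf(keyword, StringComparison.OrdinalIgnoreCase) >= 0)
            {
                matches.Add(index + 1);
            }
        }
        return matches;
    }

    public static void Main()
    {
        using PdfDocument pdf = PdfDocument.FromFile("result.pdf");

        // Extract the text of every page, keeping the page order.
        var pageTexts = new List<string>();
        for (int index = 0; index < pdf.PageCount; index++)
        {
            pageTexts.Add(pdf.ExtractTextFromPage(index));
        }

        // "invoice" is an assumed keyword chosen only for illustration.
        List<int> pages = PagesContaining(pageTexts, "invoice");
        Console.WriteLine(pages.Count == 0
            ? "No matches."
            : "Matches on page(s): " + string.Join(", ", pages));
    }
}
```

The helper converts the zero-based loop index to a 1-based page number, matching the numbering a reader sees in a PDF viewer.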

5. Conclusion

IronPDF is one of the most commonly used PDF libraries. It does not depend on any other third-party library and does not need Adobe Reader to be installed on your machine. It works on multiple platforms. Pricing for IronPDF starts at $749. There is an option to pay a one-year fee for product support and updates, and IronPDF also provides royalty-free redistribution coverage at an extra cost. For more details, please visit our pricing page.