How to Read PDF Table in C#

A PDF, which stands for Portable Document Format, is a digital document widely used over the internet to send and receive data. Its primary advantage is that it preserves the formatting and logical structure of the document. Working with PDF files in software applications has become necessary nowadays. The generated data needs to be stored permanently or sent in the form of an invoice or receipt. PDF serves this purpose as it accurately transfers the data to its destination.

Extracting data from a PDF file is quite a challenge in C#. The data can be in the form of text, images, charts, graphs, tables, etc. Sometimes, business analysts need to extract data to perform data analysis and make decisions based on those results. The IronPDF C# PDF Library is a great solution for extracting data from PDF files.

In this article, I'll demonstrate how to extract table data from PDF documents in C# using the IronPDF Library.

IronPDF - C# PDF Library

IronPDF is C# .NET Library, that helps developers read, create, and edit PDF documents easily in their software applications. Its Chromium Engine renders PDF documents with accuracy and speed. It allows developers to convert from different formats to PDF and vice versa seamlessly. It supports the latest .NET 7 Framework, as well as .NET Framework 6, 5, 4, .NET Core, and Standard.

Moreover, the IronPDF .NET API also enables developers to manipulate and edit PDFs, add headers and footers, and extract text, images, and tables from PDFs with ease.

Some Important Features include

  • Load and Create PDF files (HTML to PDF, Images to PDF)
  • Save and Print PDF files
  • Merge and Split PDF files
  • Extract Data (Text, Images, Table) from a PDF file

Steps to Extract Table Data in C# using IronPDF Library

To extract table data from PDF documents, we need the following components installed on the local computer system:

  1. Visual Studio - VS 2022 is the official IDE for C# development and it must be installed on the computer. Please download and install it from the Visual Studio website.
  2. Create Project - Create a console app for extracting data. Follow the steps below to create a project:

    • Open Visual Studio 2022 and then click on create a new project

    • Next, Select C# Console Application and click on next

    • Next, type the name of your project "ReadPDFTable" and click next

    • Choose ".NET Framework 6 long-term support" for your project.

    • Click create, and the console project will be created. Now, we are ready to extract table data from PDF documents programmatically.
  3. Install IronPDF - There are 3 different methods to install the IronPDF library. They are as follows:

    • Using Visual Studio. Visual Studio contains the NuGet Package Manager which helps to install all NuGet packages in C# applications.

      • Click Tools in the top menu, or
      • Right-click the project in the Solution Explorer

        Tools > Manage NuGet Packages

        Tools > Manage NuGet Packages

      • Once NuGet Package Manager is opened, browse for IronPDF and click install, as shown below:

        Tools > Manage NuGet Packages

        Browse and Install IronPDF from NuGet Package

    • Download NuGet Package directly. Another easy way to download and install IronPDF is by visiting its page on the NuGet website.
    • Download IronPDF .DLL Library. IronPDF can also be downloaded from the IronPDF website. Click on: IronPDF DLL download to download and install it. Remember you will have to reference the .DLL in your project in order to use it.

Create a PDF Document with Table Data

Before creating anything, we need to add the IronPDF namespace and set the license key to use the Extract Text methods from the IronPDF Library.

using IronPdf;

License.LicenseKey = "YOUR-TRIAL/PURCHASED-LICENSE-KEY";
using IronPdf;

License.LicenseKey = "YOUR-TRIAL/PURCHASED-LICENSE-KEY";
Imports IronPdf

License.LicenseKey = "YOUR-TRIAL/PURCHASED-LICENSE-KEY"
VB   C#

Here, we will create a PDF document from an HTML string containing a table and then extract that data using IronPDF. The HTML is stored in a string variable, and the code is as follows:

string HTML = "<html>" +
        "<style>" +
            "table, th, td {" +
                "border:1px solid black;" +
            "}" +
        "</style>" +
        "<body>" +
            "<h1>A Simple table example</h2>" +
            "<table>" +
                "<tr>" +
                    "<th>Company</th>" +
                    "<th>Contact</th>" +
                    "<th>Country</th>" +
                "</tr>" +
                "<tr>" +
                    "<td>Alfreds Futterkiste</td>" +
                    "<td>Maria Anders</td>" +
                    "<td>Germany</td>" +
                "</tr>" +
                "<tr>" +
                    "<td>Centro comercial Moctezuma</td>" +
                    "<td>Francisco Chang</td>" +
                    "<td>Mexico</td>" +
                "</tr>" +
            "</table>" +
            "<p>To understand the example better, we have added borders to the table.</p>" +
        "</body>" +
     "</html>";
string HTML = "<html>" +
        "<style>" +
            "table, th, td {" +
                "border:1px solid black;" +
            "}" +
        "</style>" +
        "<body>" +
            "<h1>A Simple table example</h2>" +
            "<table>" +
                "<tr>" +
                    "<th>Company</th>" +
                    "<th>Contact</th>" +
                    "<th>Country</th>" +
                "</tr>" +
                "<tr>" +
                    "<td>Alfreds Futterkiste</td>" +
                    "<td>Maria Anders</td>" +
                    "<td>Germany</td>" +
                "</tr>" +
                "<tr>" +
                    "<td>Centro comercial Moctezuma</td>" +
                    "<td>Francisco Chang</td>" +
                    "<td>Mexico</td>" +
                "</tr>" +
            "</table>" +
            "<p>To understand the example better, we have added borders to the table.</p>" +
        "</body>" +
     "</html>";
Dim HTML As String = "<html>" & "<style>" & "table, th, td {" & "border:1px solid black;" & "}" & "</style>" & "<body>" & "<h1>A Simple table example</h2>" & "<table>" & "<tr>" & "<th>Company</th>" & "<th>Contact</th>" & "<th>Country</th>" & "</tr>" & "<tr>" & "<td>Alfreds Futterkiste</td>" & "<td>Maria Anders</td>" & "<td>Germany</td>" & "</tr>" & "<tr>" & "<td>Centro comercial Moctezuma</td>" & "<td>Francisco Chang</td>" & "<td>Mexico</td>" & "</tr>" & "</table>" & "<p>To understand the example better, we have added borders to the table.</p>" & "</body>" & "</html>"
VB   C#

Next, we will use ChromePdfRenderer to create a PDF from an HTML string. The code is as follows:

ChromePdfRenderer renderer = new ChromePdfRenderer();
PdfDocument pdfDocument = renderer.RenderHtmlAsPdf(HTML);
pdfDocument.SaveAs("table_example.pdf");
ChromePdfRenderer renderer = new ChromePdfRenderer();
PdfDocument pdfDocument = renderer.RenderHtmlAsPdf(HTML);
pdfDocument.SaveAs("table_example.pdf");
Dim renderer As New ChromePdfRenderer()
Dim pdfDocument As PdfDocument = renderer.RenderHtmlAsPdf(HTML)
pdfDocument.SaveAs("table_example.pdf")
VB   C#

The SaveAs method will save the PdfDocument object to a PDF file named "table_example.pdf". The saved file is shown below:

Extract Table Data from PDF Documents using IronPDF

To extract data from PDF tables, we first need to open the document using the PdfDocument object and then use the ExtractAllText method to retrieve the data for further analysis. The following code demonstrates how to achieve this task:

pdfDocument = new PdfDocument("table_example.pdf");
string text = pdfDocument.ExtractAllText();
pdfDocument = new PdfDocument("table_example.pdf");
string text = pdfDocument.ExtractAllText();
pdfDocument = New PdfDocument("table_example.pdf")
Dim text As String = pdfDocument.ExtractAllText()
VB   C#

The above code analyzes the entire PDF document using the ExtractAllText method and returns the extracted data, including the tabular data, in a string variable. The value of the variable can then be displayed or stored in a file for later use. The following code displays it on the screen:

Console.WriteLine("The extracted Text is:\n" + text);
Console.WriteLine("The extracted Text is:\n" + text);
Imports Microsoft.VisualBasic

Console.WriteLine("The extracted Text is:" & vbLf & text)
VB   C#

From the above screenshot, it can be seen that the table data formatting and logical structure are preserved in the Console.WriteLine method output. You can find more details on how to extract data from PDF documents using IronPDF in this code example.

Extracting Tabular Data from Extracted Text Content

C# provides a String.Split method which helps to split the string on the basis of a delimiter. The following code will help you limit the output to table data only.

string[] textList = text.Split("\n");
foreach (string textItem in textList)
{
    if (textItem.Contains("."))
    {
        continue;
    }
    else
    {
        Console.WriteLine(textItem);
    }
}
string[] textList = text.Split("\n");
foreach (string textItem in textList)
{
    if (textItem.Contains("."))
    {
        continue;
    }
    else
    {
        Console.WriteLine(textItem);
    }
}
Imports Microsoft.VisualBasic

Dim textList() As String = text.Split(vbLf)
For Each textItem As String In textList
	If textItem.Contains(".") Then
		Continue For
	Else
		Console.WriteLine(textItem)
	End If
Next textItem
VB   C#

This simple code example helps to extract only table cell data from the extracted text. First, the text lines are split and saved in a string array. Then, each array element is iterated and the ones with a full stop "." at the end are skipped. In most cases, only the tabular data is retrieved from the extracted data, although it may retrieve other lines as well. The output is as follows:

The output can also be saved to a CSV file which can be later formatted and edited for more data analysis. The code is as follows:

using (StreamWriter file = new StreamWriter("table_example.csv", false))
{
    string[] textList = text.Split("\n");
    foreach (string textItem in textList)
    {
        if (textItem.Contains("."))
        {
            continue;
        }
        else
        {
            file.WriteLine(textItem);
        }
    }
}
using (StreamWriter file = new StreamWriter("table_example.csv", false))
{
    string[] textList = text.Split("\n");
    foreach (string textItem in textList)
    {
        if (textItem.Contains("."))
        {
            continue;
        }
        else
        {
            file.WriteLine(textItem);
        }
    }
}
Imports Microsoft.VisualBasic

Using file As New StreamWriter("table_example.csv", False)
	Dim textList() As String = text.Split(vbLf)
	For Each textItem As String In textList
		If textItem.Contains(".") Then
			Continue For
		Else
			file.WriteLine(textItem)
		End If
	Next textItem
End Using
VB   C#

The output will be saved in a CSV file where each textItem will be one column.

Summary

In this article, we closely examined how to extract data and tables from a PDF document using IronPDF. IronPDF offers several useful options for extracting text from PDF files. It provides the ExtractTextFromPage method, which allows for the extraction of data from a specific page. IronPDF also facilitates the conversion of different formats to PDF and from PDF to different formats. This makes it easy for developers to integrate PDF functionality into the application development process. Additionally, it does not require Adobe Acrobat Reader to view and edit PDF documents.

IronPDF is free for development and can be licensed for commercial use. It provides a free trial license to test out the full functionality of the library. You can find more detailed information on this link.