Test in production without watermarks.
Works wherever you need it to.
Get 30 days of fully functional product.
Have it up and running in minutes.
Full access to our support engineering team during your product trial
In many industries, PDF files are the go-to format for sharing structured documents like reports, invoices, and data tables. However, extracting data from PDFs, especially when it comes to tables, can be challenging due to the nature of the PDF format. Unlike structured data formats, PDFs are designed primarily for presentation, not data extraction.
However, with IronPDF, a powerful C# PDF .NET library, you can easily extract structured data like tables directly from PDFs and process them in your .NET applications. This article will guide you step-by-step on how to extract tabular data from PDF files using IronPDF.
Tables are a handy way of structuring and displaying your data, whether carrying out inventory management, data entry, recording data such as rainfall, etc. Thus, there may also be many reasons for needing to extract tables and table data from PDF documents. Some of the most common use cases include:
The PDF file format does not offer any native ability to store data in structured formats like tables. The table we use in today's example was created in HTML, before being converted to PDF format. Tables are rendered as text and lines, so extracting table data often requires some parsing and interpreting of content unless you are using OCR software, such as IronOCR.
Before we explore how IronPDF can tackle this task, let's first explore an online tool capable of handling PDF extraction. To extract a table from a PDF document using an online PDF tool, follow the steps outlined below:
Today, we will be using Docsumo as our online PDF tool example. Docsumo is an online PDF document AI that offers a free PDF table extraction tool.
Now, click the "Upload File" button to upload your PDF file for extraction. The tool will immediately begin to process your PDF.
Once Docsumo has finished processing the PDF, it will display the extracted table. You can then make adjustments to the table structure such as adding and removing rows. Here, you can download the table as either another PDF, XLS, JSON, or Text.
IronPDF allows you to extract data, text, and graphics from PDFs, which can then be used to reconstruct tables programmatically. To do this, you will first need to extract the textual content from the table in the PDF and then use that text to parse the table into rows and columns. Before we start extracting tables, let's take a look at how IronPDF's ExtractAllText() method works by extracting the data within a table:
using IronPDF;
class Program
{
static void Main(string[] args)
{
// Load the PDF document
PdfDocument pdf = PdfDocument.FromFile("example.pdf");
// Extract all text from the PDF
string text = pdf.ExtractAllText();
// Output the extracted text to the console
Console.WriteLine(text);
}
}
using IronPDF;
class Program
{
static void Main(string[] args)
{
// Load the PDF document
PdfDocument pdf = PdfDocument.FromFile("example.pdf");
// Extract all text from the PDF
string text = pdf.ExtractAllText();
// Output the extracted text to the console
Console.WriteLine(text);
}
}
Imports IronPDF
Friend Class Program
Shared Sub Main(ByVal args() As String)
' Load the PDF document
Dim pdf As PdfDocument = PdfDocument.FromFile("example.pdf")
' Extract all text from the PDF
Dim text As String = pdf.ExtractAllText()
' Output the extracted text to the console
Console.WriteLine(text)
End Sub
End Class
In this example, we have loaded the PDF document using the PdfDocument class, and then used the ExtractAllText() method to extract all the text within the document, before finally displaying the text on the console.
After extracting text from the PDF, the table will appear as a series of rows and columns in plain text. You can split this text based on line breaks (\n) and then further split rows into columns based on consistent spacing or delimiters such as commas or tabs. Here is a basic example of how to parse the table from the text:
using IronPDF;
using System;
using System.Linq;
class Program
{
static void Main(string[] args)
{
// Load the PDF document
PdfDocument pdf = PdfDocument.FromFile("table.pdf");
// Extract all text from the PDF
string text = pdf.ExtractAllText();
// Split the text into lines (rows)
string[] lines = text.Split('\n');
foreach (string line in lines)
{
// Split the line into columns using the tab character
string[] columns = line.Split('\t').Where(col => !string.IsNullOrWhiteSpace(col)).ToArray();
Console.WriteLine("Row:");
foreach (string column in columns)
{
Console.WriteLine(" " + column); // Output each column in the row
}
}
}
}
using IronPDF;
using System;
using System.Linq;
class Program
{
static void Main(string[] args)
{
// Load the PDF document
PdfDocument pdf = PdfDocument.FromFile("table.pdf");
// Extract all text from the PDF
string text = pdf.ExtractAllText();
// Split the text into lines (rows)
string[] lines = text.Split('\n');
foreach (string line in lines)
{
// Split the line into columns using the tab character
string[] columns = line.Split('\t').Where(col => !string.IsNullOrWhiteSpace(col)).ToArray();
Console.WriteLine("Row:");
foreach (string column in columns)
{
Console.WriteLine(" " + column); // Output each column in the row
}
}
}
}
Imports Microsoft.VisualBasic
Imports IronPDF
Imports System
Imports System.Linq
Friend Class Program
Shared Sub Main(ByVal args() As String)
' Load the PDF document
Dim pdf As PdfDocument = PdfDocument.FromFile("table.pdf")
' Extract all text from the PDF
Dim text As String = pdf.ExtractAllText()
' Split the text into lines (rows)
Dim lines() As String = text.Split(ControlChars.Lf)
For Each line As String In lines
' Split the line into columns using the tab character
Dim columns() As String = line.Split(ControlChars.Tab).Where(Function(col) Not String.IsNullOrWhiteSpace(col)).ToArray()
Console.WriteLine("Row:")
For Each column As String In columns
Console.WriteLine(" " & column) ' Output each column in the row
Next column
Next line
End Sub
End Class
In this example, we followed the same steps as before for loading our PDF document and extracting the text. Then, using text.Split('\n') we split the extracted text into rows based on line breaks and store the results in the lines array. A foreach loop is then used to loop through the rows in the array, where line.Split('\t') is used to further split the rows into columns using the tab character '\t' as the delimiter. The next part of the columns array, Where(col => !string.IsNullOrWhiteSpace(col)).ToArray() filters out empty columns that may arise due to extra spaces, and then adds the columns to the column array.
Finally, we write text to the console output window with basic row and column structuring.
Now that we've covered how to extract tables from PDF files, let's take a look at what we can do with that extracted data. Exporting the exported table as a CSV file is one useful way of handling table data and automating tasks such as data entry. For this example, we have filled a table with simulated data, in this case, the daily rainfall amount in a week, extracted the table from the PDF, and then exported it to a CSV file.
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;
using IronPDF;
class Program
{
static void Main(string[] args)
{
string pdfPath = "table.pdf";
string csvPath = "output.csv";
// Extract and parse table data
var tableData = ExtractTableDataFromPdf(pdfPath);
// Write the extracted data to a CSV file
WriteDataToCsv(tableData, csvPath);
Console.WriteLine($"Data extracted and saved to {csvPath}");
}
static List<string[]> ExtractTableDataFromPdf(string pdfPath)
{
var pdf = PdfDocument.FromFile(pdfPath);
// Extract text from the first page
var text = pdf.ExtractTextFromPage(0);
var rows = new List<string[]>();
// Split text into lines (rows)
var lines = text.Split('\n');
// Variable to hold column values temporarily
var tempColumns = new List<string>();
foreach (var line in lines)
{
var trimmedLine = line.Trim();
// Check for empty lines or lines that don't contain table data
if (string.IsNullOrEmpty(trimmedLine) || trimmedLine.Contains("Header"))
{
continue;
}
// Split line into columns. Adjust this based on how columns are separated.
var columns = trimmedLine.Split(new[] { ' ', '\t' }, StringSplitOptions.RemoveEmptyEntries);
if (columns.Length > 0)
{
// Add columns to temporary list
tempColumns.AddRange(columns);
rows.Add(tempColumns.ToArray());
tempColumns.Clear(); // Clear temporary list after adding to rows
}
}
return rows;
}
static void WriteDataToCsv(List<string[]> data, string csvPath)
{
using (var writer = new StreamWriter(csvPath))
{
foreach (var row in data)
{
// Join columns with commas and quote each field to handle commas within data
var csvRow = string.Join(",", row.Select(field => $"\"{field.Replace("\"", "\"\"")}\""));
writer.WriteLine(csvRow);
}
}
}
}
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;
using IronPDF;
class Program
{
static void Main(string[] args)
{
string pdfPath = "table.pdf";
string csvPath = "output.csv";
// Extract and parse table data
var tableData = ExtractTableDataFromPdf(pdfPath);
// Write the extracted data to a CSV file
WriteDataToCsv(tableData, csvPath);
Console.WriteLine($"Data extracted and saved to {csvPath}");
}
static List<string[]> ExtractTableDataFromPdf(string pdfPath)
{
var pdf = PdfDocument.FromFile(pdfPath);
// Extract text from the first page
var text = pdf.ExtractTextFromPage(0);
var rows = new List<string[]>();
// Split text into lines (rows)
var lines = text.Split('\n');
// Variable to hold column values temporarily
var tempColumns = new List<string>();
foreach (var line in lines)
{
var trimmedLine = line.Trim();
// Check for empty lines or lines that don't contain table data
if (string.IsNullOrEmpty(trimmedLine) || trimmedLine.Contains("Header"))
{
continue;
}
// Split line into columns. Adjust this based on how columns are separated.
var columns = trimmedLine.Split(new[] { ' ', '\t' }, StringSplitOptions.RemoveEmptyEntries);
if (columns.Length > 0)
{
// Add columns to temporary list
tempColumns.AddRange(columns);
rows.Add(tempColumns.ToArray());
tempColumns.Clear(); // Clear temporary list after adding to rows
}
}
return rows;
}
static void WriteDataToCsv(List<string[]> data, string csvPath)
{
using (var writer = new StreamWriter(csvPath))
{
foreach (var row in data)
{
// Join columns with commas and quote each field to handle commas within data
var csvRow = string.Join(",", row.Select(field => $"\"{field.Replace("\"", "\"\"")}\""));
writer.WriteLine(csvRow);
}
}
}
}
Imports Microsoft.VisualBasic
Imports System
Imports System.Collections.Generic
Imports System.IO
Imports System.Linq
Imports IronPDF
Friend Class Program
Shared Sub Main(ByVal args() As String)
Dim pdfPath As String = "table.pdf"
Dim csvPath As String = "output.csv"
' Extract and parse table data
Dim tableData = ExtractTableDataFromPdf(pdfPath)
' Write the extracted data to a CSV file
WriteDataToCsv(tableData, csvPath)
Console.WriteLine($"Data extracted and saved to {csvPath}")
End Sub
Private Shared Function ExtractTableDataFromPdf(ByVal pdfPath As String) As List(Of String())
Dim pdf = PdfDocument.FromFile(pdfPath)
' Extract text from the first page
Dim text = pdf.ExtractTextFromPage(0)
Dim rows = New List(Of String())()
' Split text into lines (rows)
Dim lines = text.Split(ControlChars.Lf)
' Variable to hold column values temporarily
Dim tempColumns = New List(Of String)()
For Each line In lines
Dim trimmedLine = line.Trim()
' Check for empty lines or lines that don't contain table data
If String.IsNullOrEmpty(trimmedLine) OrElse trimmedLine.Contains("Header") Then
Continue For
End If
' Split line into columns. Adjust this based on how columns are separated.
Dim columns = trimmedLine.Split( { " "c, ControlChars.Tab }, StringSplitOptions.RemoveEmptyEntries)
If columns.Length > 0 Then
' Add columns to temporary list
tempColumns.AddRange(columns)
rows.Add(tempColumns.ToArray())
tempColumns.Clear() ' Clear temporary list after adding to rows
End If
Next line
Return rows
End Function
Private Shared Sub WriteDataToCsv(ByVal data As List(Of String()), ByVal csvPath As String)
Using writer = New StreamWriter(csvPath)
For Each row In data
' Join columns with commas and quote each field to handle commas within data
Dim csvRow = String.Join(",", row.Select(Function(field) $"""{field.Replace("""", """""")}"""))
writer.WriteLine(csvRow)
Next row
End Using
End Sub
End Class
As you can see, we have successfully exported the PDF table to CSV. First, we loaded the PDF containing the table and created a new CSV file path. After this, we extracted the table using the var tableData = ExtractTableDataFromPdf(pdfPath) line, which is called the ExtractTableDataFromPdf() method. This method extracts all of the text on the PDF page that the table resides on, storing it in the text variable.
Then, we split the text into lines and columns. Finally, after returning the result from this splitting process, we call the method static void WriteDataToCsv() which takes the extracted, split-up text and writes it to our CSV file using StreamWriter.
When working with PDF tables, following some basic best practices can help to ensure you minimize the chance of running into any errors or issues.
IronPDF offers different licensing options, allowing you to try out all the powerful features IronPDF has to offer for yourself before committing to a license.
Extracting tables from PDFs using IronPDF is a powerful way to automate data extraction, facilitate analysis, and convert documents into more accessible formats. Whether dealing with simple tables or complex, irregular formats, IronPDF provides the tools needed to extract and process table data efficiently.
With IronPDF, you can streamline workflows such as automated data entry, document conversion, and data analysis. The flexibility and advanced features offered by IronPDF make it a valuable tool for handling various PDF-based tasks.
IronPDF is a powerful C# PDF .NET library that allows you to extract structured data like tables directly from PDF files and process them within .NET applications.
Extracting tables from PDF documents is useful for automating data entry, data analysis, document conversion, and ensuring auditing and compliance by converting tabular data into more accessible formats like Excel or CSV.
PDF tables are rendered as text and lines, and the PDF format itself does not support structured data formats like tables. This requires parsing and interpreting content for data extraction.
To extract table data using IronPDF, you can load the PDF document, extract all text, and then parse the text into rows and columns programmatically.
Best practices include pre-processing PDFs for consistent formatting, validating extracted data, implementing error handling, and optimizing performance for large PDFs.
After extracting and parsing the table data from a PDF, you can write the data to a CSV file using a StreamWriter to handle the conversion.
Common use cases include automating data entry, facilitating data analysis, converting documents to more accessible formats, and aiding in auditing and compliance processes.
Yes, IronPDF provides tools to efficiently extract and process table data from both simple and complex, irregular formats.
IronPDF offers different licensing options, allowing you to try out its features before committing to a full license.