Out-of-Order Text Extraction from Tightly Packed Tables

Text extracted from tightly packed table or form layouts appears scrambled: characters from adjacent rows are interleaved. For example, "Unique Number 123456789" is extracted as "U123456789nique Number".

This is a known limitation of page.Lines extraction for densely packed layouts. page.Lines groups characters using a built-in vertical tolerance that is not configurable. When two visible rows in a table cell sit only a point or two apart, the grouper collapses both rows into one logical line and sorts characters left-to-right, interleaving them. The ExtractTextFromPage(i) method defaults to content-stream order, which may not match visual reading order in form PDFs.
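The interleaving can be reproduced with a small, self-contained model. This is only an illustrative sketch: `GroupIntoLines`, the fragment tuples, and the coordinates are hypothetical and do not represent IronPDF's internal grouping logic. It shows how a vertical tolerance wider than the row gap merges two rows, after which a left-to-right sort interleaves them.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical fragments: (text, left edge, top edge) in PDF points.
// The label row sits at top = 700.0 and the value row only 1.5 pt below it.
var fragments = new List<(string Text, double Left, double Top)>
{
    ("U", 50.0, 700.0),
    ("nique Number", 58.0, 700.0),
    ("123456789", 52.0, 698.5),
};

// Simplified stand-in for a line grouper (an assumption, not the library's code):
// a fragment joins the current line when its top edge is within `tolerance`
// points of the first fragment on that line.
static List<string> GroupIntoLines(
    IEnumerable<(string Text, double Left, double Top)> items, double tolerance)
{
    var lines = new List<List<(string Text, double Left, double Top)>>();
    foreach (var item in items.OrderByDescending(f => f.Top))
    {
        var current = lines.LastOrDefault();
        if (current != null && Math.Abs(current[0].Top - item.Top) <= tolerance)
            current.Add(item);
        else
            lines.Add(new List<(string Text, double Left, double Top)> { item });
    }

    // Within each line, read strictly left to right.
    return lines
        .Select(l => string.Concat(l.OrderBy(f => f.Left).Select(f => f.Text)))
        .ToList();
}

// Tolerance wider than the 1.5 pt row gap: both rows collapse into one line.
Console.WriteLine(string.Join(" | ", GroupIntoLines(fragments, 3.0)));
// prints: U123456789nique Number

// Tolerance narrower than the row gap: the rows stay separate.
Console.WriteLine(string.Join(" | ", GroupIntoLines(fragments, 1.0)));
// prints: Unique Number | 123456789
```

In this model the only difference between the scrambled and the correct output is the tolerance, which is why a configurable grouping step helps with dense layouts.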

Solution

Each TextChunk exposes its BoundingBox, which allows grouping by vertical position and sorting within each row by horizontal position:

using IronPdf.Pages;
using System;
using System.Linq;
using System.Text;

static string ExtractTextUsingTextChunks(IPdfPage page, double rowTolerance = 2.0)
{
    // Bucket chunks into rows by snapping each Top coordinate to a multiple of rowTolerance.
    // Descending order reads top to bottom, assuming larger Top values are higher on the page.
    var rows = page.TextChunks
        .Where(c => !string.IsNullOrWhiteSpace(c.Contents))
        .GroupBy(c => Math.Round(c.BoundingBox.Top / rowTolerance) * rowTolerance)
        .OrderByDescending(g => g.Key);

    var extractedText = new StringBuilder();
    foreach (var row in rows)
    {
        // Within a row, order chunks left to right and join them with single spaces.
        var line = string.Join(" ",
            row.OrderBy(c => c.BoundingBox.Left)
               .Select(c => c.Contents.Trim())
               .Where(text => !string.IsNullOrWhiteSpace(text)));
        extractedText.AppendLine(line);
    }
    return extractedText.ToString();
}
Imports IronPdf.Pages
Imports System.Text

Private Shared Function ExtractTextUsingTextChunks(page As IPdfPage, Optional rowTolerance As Double = 2.0) As String
    Dim rows = page.TextChunks _
        .Where(Function(c) Not String.IsNullOrWhiteSpace(c.Contents)) _
        .GroupBy(Function(c) Math.Round(c.BoundingBox.Top / rowTolerance) * rowTolerance) _
        .OrderByDescending(Function(g) g.Key)

    Dim extractedText As New StringBuilder()
    For Each row In rows
        Dim line = String.Join(" ", row.OrderBy(Function(c) c.BoundingBox.Left) _
                                   .Select(Function(c) c.Contents.Trim()) _
                                   .Where(Function(text) Not String.IsNullOrWhiteSpace(text)))
        extractedText.AppendLine(line)
    Next
    Return extractedText.ToString()
End Function

Start with rowTolerance = 2.0. If rows still merge, lower it to 1.0 or 0.5. For finer control, substitute page.Characters for page.TextChunks.
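To see why lowering rowTolerance separates merged rows, the bucket formula from the GroupBy key can be evaluated on its own. The sketch below is self-contained and does not use IronPDF. `RowKey` is a hypothetical helper that copies the formula, and the coordinates are made-up values for two rows 1.2 pt apart.

```csharp
using System;

// Same bucket formula as the GroupBy key in ExtractTextUsingTextChunks.
static double RowKey(double top, double rowTolerance) =>
    Math.Round(top / rowTolerance) * rowTolerance;

// Hypothetical top edges of two visually distinct rows, 1.2 pt apart.
double upperRow = 700.4, lowerRow = 699.2;

foreach (var tolerance in new[] { 2.0, 1.0, 0.5 })
{
    // Rows merge whenever both top edges snap to the same bucket key.
    bool merged = RowKey(upperRow, tolerance) == RowKey(lowerRow, tolerance);
    Console.WriteLine($"rowTolerance={tolerance}: merged={merged}");
}
// merged is True for 2.0 and False for 1.0 and 0.5
```

Because the buckets are fixed multiples of rowTolerance, two chunks that straddle a bucket boundary can land in different rows even when they are closer together than the tolerance (for example, top edges of 699.1 and 698.9 at rowTolerance = 2.0). If a single visual row splits unexpectedly, trying a slightly different tolerance is a reasonable first step.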

Alternative: Visual-order extraction

For PDFs where only the content-stream order is wrong, request visual-order extraction instead:

pdf.ExtractTextFromPage(i, TextExtractionOrder.VisualOrder);
pdf.ExtractTextFromPage(i, TextExtractionOrder.VisualOrder)

For PDFs with rows that are visually very close together, the TextChunks approach handles tight tolerances better.

Curtis Chau
Technical Writer

Curtis Chau holds a Bachelor’s degree in Computer Science (Carleton University) and specializes in front-end development with expertise in Node.js, TypeScript, JavaScript, and React. Passionate about crafting intuitive and aesthetically pleasing user interfaces, Curtis enjoys working with modern frameworks and creating well-structured, visually appealing manuals.
