Out-of-Order Text Extraction from Tightly Packed Tables
Text extracted from tightly packed table or form layouts appears scrambled, with characters from adjacent rows interleaved. For example, "Unique Number 123456789" is extracted as "U123456789nique Number".
This is a known limitation of page.Lines extraction for densely packed layouts. page.Lines groups characters using a built-in vertical tolerance that is not configurable. When two visible rows in a table cell sit only a point or two apart, the grouper collapses both rows into one logical line and sorts characters left-to-right, interleaving them. The ExtractTextFromPage(i) method defaults to content-stream order, which may not match visual reading order in form PDFs.
Solution
Recommended: Extract using TextChunks with a tunable tolerance
Each TextChunk exposes its BoundingBox, which allows grouping by vertical position and sorting within each row by horizontal position:
using System;
using System.Linq;
using System.Text;
using IronPdf.Pages;

static string ExtractTextUsingTextChunks(IPdfPage page, double rowTolerance = 2.0)
{
    // Group chunks into rows by snapping each chunk's top edge
    // to the nearest multiple of rowTolerance.
    var rows = page.TextChunks
        .Where(c => !string.IsNullOrWhiteSpace(c.Contents))
        .GroupBy(c => Math.Round(c.BoundingBox.Top / rowTolerance) * rowTolerance)
        .OrderByDescending(g => g.Key); // rows with the largest Top value come first

    var extractedText = new StringBuilder();
    foreach (var row in rows)
    {
        // Within a row, order chunks left to right and join them with spaces.
        var line = string.Join(" ",
            row.OrderBy(c => c.BoundingBox.Left)
               .Select(c => c.Contents.Trim())
               .Where(text => !string.IsNullOrWhiteSpace(text)));
        extractedText.AppendLine(line);
    }
    return extractedText.ToString();
}
Imports System
Imports System.Linq
Imports System.Text
Imports IronPdf.Pages

Private Shared Function ExtractTextUsingTextChunks(page As IPdfPage, Optional rowTolerance As Double = 2.0) As String
    ' Group chunks into rows by snapping each chunk's top edge
    ' to the nearest multiple of rowTolerance.
    Dim rows = page.TextChunks _
        .Where(Function(c) Not String.IsNullOrWhiteSpace(c.Contents)) _
        .GroupBy(Function(c) Math.Round(c.BoundingBox.Top / rowTolerance) * rowTolerance) _
        .OrderByDescending(Function(g) g.Key)

    Dim extractedText As New StringBuilder()
    For Each row In rows
        ' Within a row, order chunks left to right and join them with spaces.
        Dim line = String.Join(" ", row.OrderBy(Function(c) c.BoundingBox.Left) _
            .Select(Function(c) c.Contents.Trim()) _
            .Where(Function(text) Not String.IsNullOrWhiteSpace(text)))
        extractedText.AppendLine(line)
    Next
    Return extractedText.ToString()
End Function
Start with rowTolerance = 2.0. If rows still merge, lower it to 1.0 or 0.5. For finer control, substitute page.Characters for page.TextChunks.
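To see why lowering rowTolerance separates merged rows, the grouping rule can be tried on made-up coordinates without a PDF. The sketch below is illustrative only: Chunk, RowGroupingDemo, and the sample coordinates are hypothetical stand-ins, not IronPdf types, and the only thing carried over from the method above is the Math.Round(Top / rowTolerance) * rowTolerance bucketing.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical stand-in for a text chunk: its text plus the left and top
// coordinates of its bounding box. Not an IronPdf type.
public record Chunk(string Text, double Left, double Top);

public static class RowGroupingDemo
{
    // Two visual rows whose top edges are only 1.2 points apart (invented values).
    public static readonly Chunk[] Sample =
    {
        new Chunk("Unique", 10.0, 700.4),
        new Chunk("Number", 50.0, 700.4),
        new Chunk("123456789", 12.0, 699.2),
    };

    // Same bucketing rule as the extraction method: snap Top to the nearest
    // multiple of rowTolerance, then order each row left to right.
    public static List<string> GroupRows(IEnumerable<Chunk> chunks, double rowTolerance)
    {
        return chunks
            .GroupBy(c => Math.Round(c.Top / rowTolerance) * rowTolerance)
            .OrderByDescending(g => g.Key)
            .Select(g => string.Join(" ", g.OrderBy(c => c.Left).Select(c => c.Text)))
            .ToList();
    }
}
```

With rowTolerance = 2.0, both top values snap to 700 and GroupRows returns the single interleaved row "Unique 123456789 Number". With rowTolerance = 1.0, they snap to 700 and 699 and the result is two rows, "Unique Number" followed by "123456789". Because the rule snaps to fixed buckets, two chunks of the same visual row can still land in different buckets if their top edges fall on either side of a bucket boundary, so a tolerance that works on one document may need adjusting on another.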
Alternative: Visual-order extraction
For PDFs where only the content-stream order is wrong, request visual reading order explicitly:
pdf.ExtractTextFromPage(i, TextExtractionOrder.VisualOrder);
pdf.ExtractTextFromPage(i, TextExtractionOrder.VisualOrder)
For PDFs whose rows sit visually very close together, the TextChunks approach is the better choice, because its row tolerance can be tuned.

