How to View PDF Files in C++

Published August 2, 2023

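PDF is a widely used document exchange format because it preserves formatting across platforms. Many applications therefore need to read the contents of PDF files programmatically.

In this article, we will learn how to use the Xpdf command-line tools to view the text of a PDF file from C++. Xpdf provides a set of command-line utilities and a C++ library for working with PDF files, including text extraction. By integrating Xpdf into our C++ PDF viewer program, we can read the text content of PDF files and process it programmatically.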

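Xpdf - C++ Library and Command-Line Tools

Xpdf is an open-source software suite that provides a range of tools and libraries for working with PDF files. It includes command-line utilities and a C++ library covering PDF-related tasks such as parsing, rendering, printing, and text extraction. Xpdf's command-line tools also offer a way to view PDF files directly from the terminal.

One of Xpdf's key components is pdftotext, which is primarily used to extract text content from PDF files. Combined with other tools such as pdftops and pdfimages, Xpdf lets users view PDF content in different ways. The pdftotext tool is valuable for extracting text from PDFs for further processing or analysis, and it offers options to specify which pages to extract text from.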
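To make the page-selection options concrete, here is a minimal sketch of how a command restricted to a page range might be assembled. It relies on pdftotext's -f (first page) and -l (last page) options; the helper name buildPageRangeCommand and the example page numbers are assumptions made for illustration, not part of Xpdf.

```cpp
#include <string>

// Builds a pdftotext command line limited to a page range.
// -f and -l are pdftotext's first-page and last-page options.
// buildPageRangeCommand is a hypothetical helper, not an Xpdf API.
std::string buildPageRangeCommand(const std::string& pdfPath,
                                  const std::string& outputPath,
                                  int firstPage, int lastPage) {
    return "pdftotext -f " + std::to_string(firstPage) +
           " -l " + std::to_string(lastPage) +
           " " + pdfPath + " " + outputPath;
}
```

Calling buildPageRangeCommand("input.pdf", "output.txt", 2, 3) yields the string pdftotext -f 2 -l 3 input.pdf output.txt, which could then be handed to system in the same way as the full-document command used later in this article.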

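Prerequisites

Before we begin, make sure the following prerequisites are in place:

A C++ compiler such as GCC or Clang is installed on your system. We will use the Code::Blocks IDE for this.
The Xpdf command-line tools are installed and accessible from the command line. Download Xpdf and install the version suited to your environment. Then add Xpdf's bin directory to the system PATH environment variable so the tools can be run from any location in the file system.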

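Creating the PDF Viewer Project

Open Code::Blocks: Launch the Code::Blocks IDE on your computer.
Create a new project: Click "File" in the top menu, select "New" from the drop-down menu, then click "Project" in the submenu.
Choose the project type: In the "New from template" window, select "Console application" and click "Go". Then select the language "C/C++" and click "Next".
Enter the project details: In the "Project title" field, name your project (for example, "PDFViewer"). Choose where to save the project files and click "Next".
Select the compiler: Choose the compiler to use for your project. By default, Code::Blocks detects the compilers available on your system. If it does not, select a suitable compiler from the list and click "Finish".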

Steps to View PDF Text in C++

Include the Necessary Headers

First, let's add the required header files to main.cpp:

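#include <cstdlib>   // system
#include <cstdio>    // remove
#include <fstream>   // ifstream
#include <iostream>  // cout
#include <string>    // string

using namespace std; // the snippets below use unqualified names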

Set the Input and Output Paths

string pdfPath = "input.pdf";
string outputFilePath = "output.txt";

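In the main function, we declare two strings: pdfPath and outputFilePath. pdfPath stores the path of the input PDF file, and outputFilePath stores the path where the extracted text will be saved as a plain-text file.

The input file looks like this:

How to View PDF Files in C++: Figure 1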

Run the pdftotext Command

string command = "pdftotext " + pdfPath + " " + outputFilePath;
int status = system(command.c_str());

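Here we build the pdftotext command from the pdfPath and outputFilePath variables so that the tool writes the PDF's text to the output file. We then call the system function to run the command and store its return value in the status variable.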
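The command is assembled by plain concatenation, so a path containing spaces would be split into several arguments by the shell. One possible way to guard against this is sketched below. quoteArg and buildCommand are hypothetical helpers, and the sketch assumes the paths themselves contain no double-quote characters.

```cpp
#include <string>

// Wraps an argument in double quotes so that spaces survive the shell.
// Assumption: the argument contains no double-quote characters.
std::string quoteArg(const std::string& arg) {
    return "\"" + arg + "\"";
}

// Builds the pdftotext command with both paths quoted.
std::string buildCommand(const std::string& pdfPath,
                         const std::string& outputFilePath) {
    return "pdftotext " + quoteArg(pdfPath) + " " + quoteArg(outputFilePath);
}
```

With this in place, the call would become system(buildCommand(pdfPath, outputFilePath).c_str()). Quoting this way is only a light safeguard; it does not make untrusted input safe to pass to a shell.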

Check the Text Extraction Status

if (status == 0) 
{
    cout << "Text extraction successful." << endl;
} else 
{ 
    cout << "Text extraction failed." << endl; 
}

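We check the status variable to see whether the pdftotext command ran successfully. If status equals 0, the text extraction succeeded and we print a success message. If status is nonzero, an error occurred and we print an error message.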
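A nonzero status can be made more informative. The sketch below maps the exit codes listed in the Xpdf pdftotext manual page to short messages; describeExitCode is a hypothetical helper, and other builds of pdftotext may not use the same codes.

```cpp
#include <string>

// Maps a pdftotext exit code to a short description.
// The codes follow the Xpdf pdftotext manual page.
// describeExitCode is a hypothetical helper, not an Xpdf API.
std::string describeExitCode(int code) {
    switch (code) {
        case 0:  return "No error";
        case 1:  return "Error opening the PDF file";
        case 2:  return "Error opening the output file";
        case 3:  return "Error related to PDF permissions";
        case 99: return "Other error";
        default: return "Unknown exit code";
    }
}
```

On POSIX systems, the value returned by system is a wait status rather than the raw exit code, so it would need to be unpacked with WEXITSTATUS from <sys/wait.h> before being passed to this helper. The simple status == 0 check used above works in either case.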

Read and Display the Extracted Text

ifstream outputFile(outputFilePath);
if (outputFile.is_open()) { 
    string textContent;
    string line;
    while (getline(outputFile, line)) {
        textContent += line + "\n";
    }
    outputFile.close();
    cout << "Text content extracted from PDF:" << endl;
    cout << textContent << endl;
} else {
    cout << "Failed to open output file." << endl;
}

In the example above, we open the output file (the generated text file), read its contents line by line, and store them in the textContent string. Finally, we close the file and print the extracted text to the console.
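As an alternative to the line-by-line loop, the whole file can be read in one step by streaming its buffer into a string stream. This is a minimal sketch of that variant; readTextFile is a hypothetical helper name.

```cpp
#include <fstream>
#include <sstream>
#include <string>

// Reads an entire text file into content.
// Returns false if the file cannot be opened.
bool readTextFile(const std::string& path, std::string& content) {
    std::ifstream in(path);
    if (!in.is_open()) {
        return false;
    }
    std::ostringstream buffer;
    buffer << in.rdbuf();   // copy the whole file into the buffer
    content = buffer.str();
    return true;
}
```

Unlike the loop above, this keeps the file's line endings exactly as read and does not append an extra newline after the last line.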

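Delete the Output File

If you do not need the editable output text file, or want to free disk space, delete it at the end of the program, just before the main function returns, with the following call: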

remove(outputFilePath.c_str());
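The call above ignores the result of the deletion. Since remove returns 0 on success and a nonzero value on failure, a small wrapper can report whether the file was actually removed. deleteFile is a hypothetical helper name used only for this sketch.

```cpp
#include <cstdio>
#include <string>

// Deletes a file and reports whether the deletion succeeded.
// std::remove returns 0 on success and a nonzero value on failure.
bool deleteFile(const std::string& path) {
    return std::remove(path.c_str()) == 0;
}
```

A caller could print a warning when deleteFile(outputFilePath) returns false, for example when the file was never created because the extraction failed.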

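Compile and Run the Program

Build the code with the "Ctrl+F9" shortcut. After a successful build, running the executable extracts the text content from the specified PDF document and displays it in the console. The output looks like this:

How to View PDF Files in C++: Figure 2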
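Putting the fragments together, the whole flow might look like the following sketch. The function name viewPdfText is hypothetical, the paths are quoted on the assumption that they contain no double-quote characters, and the text is printed line by line rather than collected into a single string first.

```cpp
#include <cstdio>
#include <cstdlib>
#include <fstream>
#include <iostream>
#include <string>

// Runs pdftotext on pdfPath, prints the extracted text, then deletes the
// intermediate text file. Returns true if extraction and reading succeeded.
// viewPdfText is a hypothetical name used only for this sketch.
bool viewPdfText(const std::string& pdfPath,
                 const std::string& outputFilePath) {
    const std::string command =
        "pdftotext \"" + pdfPath + "\" \"" + outputFilePath + "\"";

    // A nonzero result means the command failed or could not be started.
    if (std::system(command.c_str()) != 0) {
        std::cout << "Text extraction failed." << std::endl;
        return false;
    }

    std::ifstream outputFile(outputFilePath);
    if (!outputFile.is_open()) {
        std::cout << "Failed to open output file." << std::endl;
        return false;
    }

    std::cout << "Text content extracted from PDF:" << std::endl;
    std::string line;
    while (std::getline(outputFile, line)) {
        std::cout << line << '\n';
    }
    outputFile.close();

    // Clean up the intermediate file; the result is not checked here.
    std::remove(outputFilePath.c_str());
    return true;
}
```

A main function would then only need to call viewPdfText("input.pdf", "output.txt") and turn the result into an exit code.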

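Viewing PDF Files in C#

IronPDF is a powerful .NET C# PDF library that lets users view PDF files in their C# applications. IronPDF uses the Chromium browser engine to render and display PDF content accurately, including images, fonts, and complex formatting. With its approachable interface and broad feature set, developers can integrate IronPDF into their C# projects so that users can view PDF documents efficiently and interactively. Whether for displaying reports, invoices, or any other PDF content, IronPDF provides a solution for building feature-rich PDF viewers in C#.

To install the IronPdf NuGet package in Visual Studio, follow these steps:

Open Visual Studio: Launch Visual Studio or any other IDE you prefer.
Create or open your project: Create a new C# project or open an existing one in which to install the IronPDF package.
Open the NuGet Package Manager: In Visual Studio, go to "Tools" > "NuGet Package Manager" > "Manage NuGet Packages for Solution". Alternatively, right-click the solution in Solution Explorer and select "Manage NuGet Packages for Solution".
Search for IronPDF: In the "NuGet Package Manager" window, click the "Browse" tab and search for "IronPDF" in the search bar. Alternatively, visit the IronPDF package page on NuGet and download the latest version directly.
Select the IronPDF package: Find the "IronPDF" package and click it to select it for your project.
Install IronPdf: Click the "Install" button to install the selected package.
Alternatively, you can install IronPdf from the NuGet Package Manager Console with the following command: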

    Install-Package IronPdf

With IronPDF, we can extract text and images from a PDF document and display them in the console for viewing. The following code shows how:

using IronPdf;
using IronSoftware.Drawing;
using System.Collections.Generic;

// Extracting Image and Text content from Pdf Documents

// open a 128-bit encrypted PDF
var pdf = PdfDocument.FromFile("encrypted.pdf", "password");

// Get all text to put in a search index
string text = pdf.ExtractAllText();

// Get all Images
var allImages = pdf.ExtractAllImages();

// Or even find the precise text and images for each page in the document
for (var index = 0 ; index < pdf.PageCount ; index++)
{
    int pageNumber = index + 1;
    text = pdf.ExtractTextFromPage(index);
    List<AnyBitmap> images = pdf.ExtractBitmapsFromPage(index);
    //...
}

For more details on IronPDF, see the IronPDF documentation.

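Conclusion

In this article, we learned how to use the Xpdf command-line tools to extract and view the contents of a PDF document in C++. This approach lets us process and analyze the extracted text within a C++ application.

IronPDF is free to use for development purposes, but the generated PDF files carry a watermark. To remove the watermark and use IronPDF commercially, you can purchase a license.

A free trial license is also available for commercial testing.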

Jordi Bardia

Software Engineer

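Jordi is most proficient in Python, C#, and C++, and when he is not applying his skills at Iron Software, he is programming games. As one of the people responsible for product testing, product development, and research, Jordi adds great value to continual product improvement. The varied experience keeps him challenged and engaged, and he says it is one of his favorite aspects of working at Iron Software. Jordi grew up in Miami, Florida, and studied Computer Science and Statistics at the University of Florida.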