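How to Read PDF Files in C++

Kannapat Udonpant

July 5, 2023

PDF (Portable Document Format) files are widely used for document exchange, and being able to read their contents programmatically is valuable in many applications. Libraries commonly used to work with PDFs in C++ include Poppler, MuPDF, Haru Free PDF Library, Xpdf, and QPDF.

In this article, we will explore how to read PDF files in C++ using the Xpdf command-line tools. Xpdf provides a set of tools for working with PDF files, including extracting their text content. By integrating Xpdf into a C++ program, we can extract text from PDF files and process it programmatically.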

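Xpdf - Command-Line Tools

Xpdf is an open-source software suite that provides a set of tools and libraries for working with PDF (Portable Document Format) files. The suite includes several command-line tools and C++ libraries that cover PDF-related tasks such as parsing, rendering, and text extraction. Some of Xpdf's key components are pdftotext, pdftops, pdfinfo, and pdfimages. Here, we will use pdftotext to read PDF documents.

pdftotext is a command-line tool that extracts the text content of a PDF file and outputs it as plain text. It is particularly useful when text needs to be pulled out of a PDF for further processing or analysis. With its -f (first page) and -l (last page) options, you can also restrict extraction to a specific page or range of pages.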
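As a rough illustration of page selection, the sketch below assembles a pdftotext command line that uses the -f and -l options. The helper name buildPageRangeCommand is hypothetical (it is not part of Xpdf), and the double-quote quoting assumes the paths themselves contain no double quotes.

```cpp
#include <cassert>
#include <string>

// Build a pdftotext command that extracts only pages firstPage..lastPage.
// -f and -l are pdftotext's "first page" and "last page" options (1-based).
// buildPageRangeCommand is a hypothetical helper name used for illustration.
std::string buildPageRangeCommand(const std::string& pdfPath,
                                  const std::string& textPath,
                                  int firstPage, int lastPage) {
    // Page numbers start at 1 and the range must not be reversed.
    assert(firstPage >= 1 && lastPage >= firstPage);
    return "pdftotext -f " + std::to_string(firstPage) +
           " -l " + std::to_string(lastPage) +
           " \"" + pdfPath + "\" \"" + textPath + "\"";
}
```

The returned string can be passed to std::system in the same way as any other command string, for example std::system(buildPageRangeCommand("input.pdf", "output.txt", 2, 3).c_str()).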

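Prerequisites

To build a PDF reader project that extracts text, we need the following prerequisites:

A C++ compiler, such as GCC or Clang, installed on your system. You can use any IDE that supports C++ development.
The Xpdf command-line tools installed on your system. Xpdf is a set of PDF tools that can be downloaded from the Xpdf website. Add Xpdf's bin directory to the PATH environment variable so that the command-line tools can be run from any location.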
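To confirm that the PATH prerequisite is met before invoking the tool, a program could search the PATH directories itself. The following is a minimal sketch assuming C++17 (for std::filesystem and std::optional); findOnPath and findPdftotext are hypothetical helper names, and the check only tests that a regular file with the right name exists, not that it is executable.

```cpp
#include <cassert>
#include <cstdlib>
#include <filesystem>
#include <optional>
#include <sstream>
#include <string>

namespace fs = std::filesystem;

// Search each directory of a PATH-style string for a file called fileName.
// Returns the full path of the first match, or std::nullopt if none exists.
// The separator is ':' on Unix-like systems and ';' on Windows.
std::optional<fs::path> findOnPath(const std::string& fileName,
                                   const std::string& pathValue,
                                   char separator) {
    assert(!fileName.empty());
    std::istringstream dirs(pathValue);
    std::string dir;
    while (std::getline(dirs, dir, separator)) {
        if (dir.empty()) continue;  // skip empty PATH entries
        std::error_code ec;         // avoid exceptions for unreadable directories
        fs::path candidate = fs::path(dir) / fileName;
        if (fs::is_regular_file(candidate, ec)) {
            return candidate;
        }
    }
    return std::nullopt;
}

// Convenience wrapper that reads the real PATH environment variable.
std::optional<fs::path> findPdftotext() {
    const char* pathValue = std::getenv("PATH");
    if (pathValue == nullptr) return std::nullopt;
#ifdef _WIN32
    return findOnPath("pdftotext.exe", pathValue, ';');
#else
    return findOnPath("pdftotext", pathValue, ':');
#endif
}
```

If findPdftotext() returns std::nullopt, the program can print a hint about the PATH setting instead of failing later with a less obvious error.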

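Steps to Read a PDF File in C++

Step 1: Include the Necessary Header Files

First, let's add the necessary header files at the top of the main.cpp file:

#include <cstdlib>
#include <cstdio>
#include <iostream>
#include <fstream>
#include <string>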

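Step 2: Write the C++ Code

Let's write the C++ code that invokes the Xpdf command-line tool to extract the text content of a PDF document. We will use the following input.pdf file:

How to Read PDF Files in C++: Figure 1

The code example is as follows: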

// Standard library headers
#include <cstdlib>   // system
#include <cstdio>    // remove
#include <fstream>   // ifstream
#include <iostream>  // cout, cerr
#include <string>    // string, getline

using namespace std;

int main() {
    string pdfPath = "input.pdf";          // PDF file to read
    string outputFilePath = "output.txt";  // text file that pdftotext will write

    // Quote both paths so that file names containing spaces reach pdftotext intact
    string command = "pdftotext \"" + pdfPath + "\" \"" + outputFilePath + "\"";
    int status = system(command.c_str());

    // A return value of 0 means the command completed successfully
    if (status == 0) {
        cout << "Text extraction successful." << endl;
    } else {
        cerr << "Text extraction failed." << endl;
        return 1;
    }

    // Read the generated text file back, one line at a time
    ifstream outputFile(outputFilePath);
    if (outputFile.is_open()) {
        string textContent;
        string line;
        while (getline(outputFile, line)) {
            textContent += line + "\n";
        }
        outputFile.close();

        cout << "Text content extracted from PDF document:" << endl;
        cout << textContent << endl;
    } else {
        cerr << "Failed to open output file." << endl;
        return 1;
    }

    return 0;
}

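Code Explanation

In the code above, we define the pdfPath variable to hold the path of the input PDF file. Make sure to replace it with the actual path of your input PDF file.

We also define the outputFilePath variable to hold the path of the output text file that Xpdf will generate.

The code runs the pdftotext command with the system function, passing the input PDF path and the output text file path as command-line arguments. The status variable captures the value returned by system, which reflects the command's exit status and is 0 on success.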
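Because system hands the whole command string to a shell, paths that contain spaces, quotes, or other shell metacharacters need to be quoted. The sketch below shows one possible approach, assuming the command runs through a POSIX shell (sh or bash); cmd.exe on Windows follows different quoting rules and is not covered. shellQuote and buildCommand are hypothetical helper names.

```cpp
#include <string>

// Quote one argument for a POSIX shell: wrap it in single quotes and rewrite
// each embedded single quote as '\'' so the shell treats it literally.
// Assumption: the command is executed by sh/bash, not by cmd.exe.
std::string shellQuote(const std::string& arg) {
    std::string quoted = "'";
    for (char c : arg) {
        if (c == '\'') {
            quoted += "'\\''";  // close quote, escaped quote, reopen quote
        } else {
            quoted += c;
        }
    }
    quoted += "'";
    return quoted;
}

// Assemble a pdftotext command with both paths safely quoted.
std::string buildCommand(const std::string& pdfPath,
                         const std::string& textPath) {
    return "pdftotext " + shellQuote(pdfPath) + " " + shellQuote(textPath);
}
```

Quoting matters most when the PDF path comes from user input, since an unquoted path could otherwise be interpreted as additional shell commands.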

If pdftotext runs successfully (indicated by a status of 0), we open the output text file with ifstream. We then read the text content line by line and store it in the textContent string.
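When the text does not need to be handled line by line, the whole file can also be read in one step. This is a minimal alternative sketch assuming C++17 (for std::optional); readTextFile is a hypothetical helper name and not part of the program in this article.

```cpp
#include <fstream>
#include <optional>
#include <sstream>
#include <string>

// Read an entire text file into a single string.
// Returns std::nullopt if the file cannot be opened.
std::optional<std::string> readTextFile(const std::string& path) {
    std::ifstream file(path);
    if (!file) {
        return std::nullopt;
    }
    std::ostringstream buffer;
    buffer << file.rdbuf();  // copy the whole stream into the buffer
    return buffer.str();
}
```

Returning std::optional keeps the "file could not be opened" case separate from the "file is empty" case, which a plain empty string would not distinguish.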

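Finally, we print the extracted text content from the generated output file to the console. If you do not need to keep the output text file, or want to free up disk space, you can delete it before the program ends with the following call: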

remove(outputFilePath.c_str());
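If the intermediate text file is not needed at all, another option is to have pdftotext write to standard output and capture that output directly. pdftotext treats a text-file argument of "-" as standard output. The sketch below assumes a POSIX system where popen and pclose are available (Windows offers _popen and _pclose instead); runAndCapture and extractTextToString are hypothetical helper names, and the double-quote quoting assumes the path contains no double quotes.

```cpp
#include <array>
#include <cstdio>
#include <stdexcept>
#include <string>

// Run a shell command and return everything it writes to standard output.
// Throws std::runtime_error if the command cannot be started or fails.
// Assumption: POSIX popen/pclose are available on this platform.
std::string runAndCapture(const std::string& command) {
    FILE* pipe = popen(command.c_str(), "r");
    if (pipe == nullptr) {
        throw std::runtime_error("popen failed for: " + command);
    }
    std::string output;
    std::array<char, 4096> buffer{};
    std::size_t bytesRead = 0;
    while ((bytesRead = std::fread(buffer.data(), 1, buffer.size(), pipe)) > 0) {
        output.append(buffer.data(), bytesRead);
    }
    int status = pclose(pipe);
    if (status != 0) {
        throw std::runtime_error("command returned a non-zero status: " + command);
    }
    return output;
}

// Passing "-" as the text file asks pdftotext to write to standard output.
std::string extractTextToString(const std::string& pdfPath) {
    return runAndCapture("pdftotext \"" + pdfPath + "\" -");
}
```

With this approach there is no output file to clean up, at the cost of relying on a platform-specific function instead of the portable system call.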

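Step 3: Compile and Run the Program

Compile the C++ code and run the executable. As long as the directory containing pdftotext has been added to the PATH environment variable, the program can locate and run the command. The program generates the output text file and extracts the text content of the PDF file. The extracted text is then displayed on the console.

The output is as follows:

How to Read PDF Files in C++: Figure 2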

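Reading PDF Files in C#

The IronPDF Library

IronPDF is a popular C# PDF library that provides powerful features for working with PDF documents. It lets developers create, edit, modify, and read PDF files programmatically.

Reading PDF documents with the IronPDF library is a straightforward process. The library provides various methods and properties that let developers extract text, images, metadata, and other data from PDF pages. The extracted information can then be used for further processing, analysis, or display within an application.

The following code example uses IronPDF to read a PDF file: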

// Extract text and images from a PDF document with IronPDF
using IronPdf;
using IronSoftware.Drawing;
using System.Collections.Generic;

// Open a password-protected (128-bit encrypted) PDF
var pdf = PdfDocument.FromFile("encrypted.pdf", "password");

// Get all text to put in a search index
string text = pdf.ExtractAllText();

// Get all Images
var allImages = pdf.ExtractAllImages();

// Or even find the precise text and images for each page in the document
for (var index = 0 ; index < pdf.PageCount ; index++)
{
    int pageNumber = index + 1;
    text = pdf.ExtractTextFromPage(index);
    List<AnyBitmap> images = pdf.ExtractBitmapsFromPage(index);
    //...
}

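For more details on how to read PDF documents, see the IronPDF C# PDF reading guide.

Conclusion

In this article, we learned how to read the contents of a PDF file in C++ using the Xpdf command-line tools. By integrating Xpdf into a C++ program, we can quickly extract text content from PDF files programmatically. This approach lets us process and analyze the extracted text within our C++ applications.

IronPDF is a powerful C# library that makes it easy to read and manipulate PDF files. Its extensive features, ease of use, and reliable rendering engine make it a popular choice for developers working with PDF documents in C# projects.

IronPDF is free during development and offers a free trial for commercial use. Beyond that, it must be licensed for commercial purposes.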

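Kannapat Udonpant

Software Engineer

Before becoming a software engineer, Kannapat completed a PhD in Environmental Resources at Hokkaido University in Japan. While pursuing his degree, Kannapat also became a member of the Vehicle Robotics Laboratory, which is part of the Department of Bioproduction Engineering. In 2022, he used his C# skills to join Iron Software's engineering team, where he focuses on IronPDF. Kannapat values his job because he gets to learn directly from the developer who writes most of the code used in IronPDF. In addition to peer learning, Kannapat enjoys the social side of working at Iron Software. When he is not writing code or documentation, Kannapat can usually be found gaming on his PS5 or rewatching The Last of Us.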
