How to Extract Data from PDFs in C#

Extracting data from PDFs programmatically saves the time otherwise spent on manual data entry. This article explains how developers can use the IronPDF library to extract text and images from PDF documents.

IronPDF: C# PDF Library

IronPDF is a .NET library for creating, editing, and converting PDF files. It provides an easy-to-use API that developers can call from their own applications.

The IronPDF library includes methods for extracting text and images from PDF files, and this article shows how to use them. First, a C# project needs to be created or opened.

Create or Open a C# Project in Visual Studio

This tutorial recommends using the latest version of Visual Studio.

Once Visual Studio is opened, follow the steps below to create a new C# Project. If there is an existing project that you would like to use, then skip these next steps and proceed to the next section directly.

  • Open Visual Studio
  • Click on the "Create a new project" button.

Figure 1: Visual Studio opening UI

  • Select the "C# Console Application" from the templates.

Figure 2: Create a new project

  • Give the project a name and click on the Next button.
  • Select a .NET version according to your project's requirements and click on the Create button.

Figure 3: .NET Framework selection

Visual Studio will now generate a new C# .NET project.

Install the IronPDF Library

The IronPDF library can be installed in multiple ways.

Using Package Manager Console

  • Open the Package Manager Console by going to Tools > NuGet Package Manager > Package Manager Console.
  • Run the following command to install the IronPDF library:
Install-Package IronPdf

Figure 4: Installation progress in the Package Manager Console tab

After installation, you will see the IronPDF dependency in the dependencies section of the Solution Explorer, as shown below.

Figure 5: Reference IronPdf package in Solution Explorer

Using the NuGet Package Manager

Another way to install the IronPDF library is by using Visual Studio's integrated NuGet Package Manager UI.

  • Open the Tools menu, hover over "NuGet Package Manager", and select "Manage NuGet Packages for Solution...".

Figure 6: Navigate to NuGet Package Manager

  • This will open the NuGet Package Manager window. Go to the Browse tab, type IronPdf in the search box, and press Enter.
  • Select IronPdf from the search results and click on the "Install" button to begin the installation.

Figure 7: Install the IronPdf package from the NuGet Package Manager

Extract Data from PDF Files

The following code shows how to extract text and images from a PDF using IronPDF:

// Import necessary namespaces
using System;
using IronPdf;

public class PDFExtractor
{
    public void ExtractDataFromPDF()
    {
        // Open a 128-bit encrypted PDF file by providing the filename and password
        using PdfDocument pdf = PdfDocument.FromFile("encrypted.pdf", "password");

        // Extract all text from the PDF document
        string allText = pdf.ExtractAllText();

        // Extract all images from the PDF document
        // (var avoids hard-coding the image type, which may differ between IronPDF versions)
        var allImages = pdf.ExtractAllImages();

        // Iterate over each page in the PDF document (page indexes are zero-based)
        for (var index = 0; index < pdf.PageCount; index++)
        {
            int pageNumber = index + 1;

            // Extract text from the specific page
            string text = pdf.ExtractTextFromPage(index);

            // Extract images from the specific page
            var images = pdf.ExtractImagesFromPage(index);

            // Code to process the extracted text and images
            Console.WriteLine($"Page {pageNumber}: {text.Length} characters");
            //...
        }
    }
}

The equivalent code in VB.NET:

' Import necessary namespaces
Imports IronPdf
Imports System.Collections.Generic
Imports System.Drawing

Public Class PDFExtractor
	Public Sub ExtractDataFromPDF()
		' Open a 128-bit encrypted PDF file by providing the filename and password
		Using pdf As PdfDocument = PdfDocument.FromFile("encrypted.pdf", "password")
	
			' Extract all text from the PDF document
			Dim allText As String = pdf.ExtractAllText()
	
			' Extract all images from the PDF document
			' (the type is inferred because the image type may differ between IronPDF versions)
			Dim allImages = pdf.ExtractAllImages()
	
			' Iterate over each page in the PDF document
			For index = 0 To pdf.PageCount - 1
				Dim pageNumber As Integer = index + 1
	
				' Extract text from the specific page
				Dim text As String = pdf.ExtractTextFromPage(index)
	
				' Extract images from the specific page
				Dim images = pdf.ExtractImagesFromPage(index)
	
				' Code to process the extracted text and images
				'...
			Next index
		End Using
	End Sub
End Class

In this code example:

  1. The FromFile method is used to load the input PDF document, which is encrypted and requires a password.
  2. The ExtractAllText method extracts all textual content from the PDF.
  3. The ExtractAllImages method fetches all embedded images.
  4. A loop iterates over each page of the document to extract text and images from that specific page using ExtractTextFromPage and ExtractImagesFromPage.
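
As one possible way to fill in the "process the extracted text" step, the sketch below writes each page's text to its own file. The class name PageTextExporter, the method ExportPages, and the page-N.txt naming scheme are hypothetical and chosen only for illustration; the IronPDF calls are the same ones used above, and the PDF is assumed not to need a password.

```csharp
using System.IO;
using IronPdf;

public static class PageTextExporter
{
    // Hypothetical helper: writes the text of each page to "<outputDir>/page-<n>.txt".
    public static void ExportPages(string pdfPath, string outputDir)
    {
        // Make sure the target folder exists before writing into it
        Directory.CreateDirectory(outputDir);

        // Assumes the PDF is not password protected
        using PdfDocument pdf = PdfDocument.FromFile(pdfPath);

        for (int index = 0; index < pdf.PageCount; index++)
        {
            // Page indexes are zero-based; file names use one-based page numbers
            string text = pdf.ExtractTextFromPage(index);
            string target = Path.Combine(outputDir, $"page-{index + 1}.txt");
            File.WriteAllText(target, text);
        }
    }
}
```

Calling PageTextExporter.ExportPages("report.pdf", "out") would then leave one text file per page in the out folder, which is convenient for diffing or feeding into other tools.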

Conclusion

IronPDF allows developers to extract text and images from PDF files with ease. Using ExtractAllText and ExtractAllImages, the entire contents of a PDF file can be extracted with one call each. To extract content from a specific page, use ExtractTextFromPage and ExtractImagesFromPage instead. The code above demonstrated both approaches: whole-document extraction and page-by-page extraction.

Additionally, IronPDF offers features like rendering charts, adding barcodes, enhancing security with passwords, watermarking, and handling PDF forms programmatically.

IronPDF is free to use during development; commercial use requires a paid license. A free trial is also available for evaluating IronPDF in production.

Purchase the full suite of Iron Software's document libraries for the cost of two IronPDF Lite Licenses.

Download IronPDF now to start extracting data from PDFs today!

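Frequently Asked Questions

How can I extract text from a PDF file using C#?

You can use IronPDF's ExtractAllText method to extract all text from a PDF file. The method gives you direct access to the document's text content.

How can I extract images from a PDF using C#?

With IronPDF, you can extract images from a PDF file using the ExtractAllImages method, which retrieves all images embedded in the document.

How do I install a PDF processing library in a C# project?

To install IronPDF in a C# project, run the command Install-Package IronPdf in the Package Manager Console, or install the package through the NuGet Package Manager UI in Visual Studio.

Can C# handle encrypted PDF files?

Yes. IronPDF lets you open and work with encrypted PDF files using the FromFile method; you access the content by providing the file name and password.

Can I extract data from specific pages of a PDF in C#?

Yes. IronPDF lets you iterate over each page of a PDF file and extract data from a specific page using methods such as ExtractTextFromPage and ExtractImagesFromPage.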
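
As a rough illustration of page-level extraction, the sketch below returns the page numbers whose text contains a given keyword. The class PageSearch and the method FindPagesContaining are hypothetical names introduced for this example, and the PDF is assumed not to need a password.

```csharp
using System;
using System.Collections.Generic;
using IronPdf;

public static class PageSearch
{
    // Hypothetical helper: returns the 1-based numbers of pages whose text contains the keyword.
    public static List<int> FindPagesContaining(string pdfPath, string keyword)
    {
        var matches = new List<int>();

        // Assumes the PDF is not password protected
        using PdfDocument pdf = PdfDocument.FromFile(pdfPath);

        for (int index = 0; index < pdf.PageCount; index++)
        {
            string text = pdf.ExtractTextFromPage(index);

            // Case-insensitive search; IndexOf is used so the code also compiles on .NET Framework
            if (text.IndexOf(keyword, StringComparison.OrdinalIgnoreCase) >= 0)
            {
                matches.Add(index + 1);
            }
        }

        return matches;
    }
}
```

Extracting page by page rather than calling ExtractAllText keeps the page number attached to each match, which a whole-document string cannot provide.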

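What other features does the C# PDF library offer?

Beyond data extraction, IronPDF offers features such as chart rendering, adding barcodes, securing documents with passwords, watermarking, and handling PDF forms programmatically.

How do I convert HTML to PDF in C#?

You can use IronPDF's RenderHtmlAsPdf method to convert an HTML string to a PDF, which is especially useful for creating PDF documents from web content.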
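
A minimal sketch of that conversion follows. It assumes the ChromePdfRenderer class from current IronPDF releases, which this article does not itself mention; the HTML string and the output file name are placeholders.

```csharp
using System;
using IronPdf;

public static class HtmlToPdfExample
{
    public static void Main()
    {
        // Assumption: ChromePdfRenderer is the renderer class that exposes RenderHtmlAsPdf
        var renderer = new ChromePdfRenderer();

        // Render a placeholder HTML string into an in-memory PDF document
        using PdfDocument pdf = renderer.RenderHtmlAsPdf("<h1>Invoice 1001</h1><p>Total: 42.00</p>");

        // Save the result to disk (placeholder file name)
        pdf.SaveAs("invoice.pdf");

        // The rendered document supports the same extraction methods as a loaded one
        Console.WriteLine(pdf.ExtractAllText());
    }
}
```

Rendering a known HTML string and extracting its text again is also a quick way to check that the library is installed and working.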

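Is there a trial version of the C# PDF library?

IronPDF is free to use during development, so you can test its features. Production use requires a commercial license, but a free trial is also available.

How do I get started extracting data from PDFs with a C# library?

To get started with data extraction in IronPDF, download the library, create or open a C# project in Visual Studio, install IronPDF, and then follow the code example to extract text and images from PDFs.

.NET 10 compatibility: can I use IronPDF's data extraction features in .NET 10?

Yes. IronPDF fully supports .NET 10, including its data extraction features such as extracting text and images. You can use IronPDF in .NET 10 projects without special configuration. It supports .NET 10, .NET 9, .NET 8, and earlier versions, as well as .NET Standard and .NET Framework. (ironpdf.com)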

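Curtis Chau
Technical Writer

Curtis Chau holds a bachelor's degree in computer science from Carleton University and specializes in front-end development, with expertise in Node.js, TypeScript, JavaScript, and React. Passionate about crafting intuitive and aesthetically pleasing user interfaces, Curtis enjoys working with modern frameworks and creating well-structured, visually appealing manuals.

Beyond development, Curtis has a strong interest in the Internet of Things (IoT), exploring innovative ways to integrate hardware and software. In his free time, he enjoys gaming and building Discord bots, combining his love of technology with creativity.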