
Html Agility Pack C# (How It Works For Developers)

The need to dynamically manage and manipulate document content is widespread in the world of C# development. Developers commonly rely on robust libraries to automate activities like creating PDF reports and extracting data from web pages. This article explores the straightforward integration of IronPDF and HTML Agility Pack in C# and provides code examples to demonstrate how these libraries can be used to effortlessly create PDF documents and read HTML text.

IronPDF is a feature-rich .NET library for working with PDF files. As IronPDF allows developers to dynamically generate PDF files from HTML content, URLs, or raw data, it serves as a valuable tool for document creation, reporting, and data visualization.

To streamline document generation in .NET applications, we will look at how to connect IronPDF with HTML Agility Pack in this post. Combining these technologies allows programmers to work with remote systems, generate dynamic PDF pages, and get data via network connectivity, all while increasing productivity and scalability in their programs.

How to Use HtmlAgilityPack in C#

  1. Create a new C# Project.
  2. Install the library HtmlAgilityPack.
  3. Import the namespace. Create an object.
  4. Import data from URL and parse the HTML.
  5. Query the parsed document and extract the required data.

Introduction to HtmlAgilityPack

HTML Agility Pack (HAP) is a versatile HTML parsing library for .NET developers. Its API lets developers navigate, alter, and extract data from HTML documents, which makes working with HTML content programmatically easier regardless of experience level.

What sets HTML Agility Pack apart is its ability to gracefully handle HTML that is badly organized or faulty. Because it uses a forgiving parsing algorithm that tolerates malformed markup, it is well suited to web scraping, where the quality of HTML can vary.

Features of HtmlAgilityPack

HTML Parsing

With the powerful HTML parsing features offered by HTML Agility Pack, developers may load HTML documents from a variety of sources, including files, URLs, and strings. Due to its lenient parsing approach, it can gracefully handle poorly formatted or incorrect HTML, making it suitable for web scraping activities where the HTML markup quality can vary.
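As a minimal sketch of the lenient parsing described above, the snippet below loads a deliberately malformed string (the list items are never closed) and still queries it. The markup is invented for illustration; `LoadHtml` parses a string, while `HtmlDocument.Load` and `HtmlWeb.Load` cover files and URLs respectively.

```csharp
using System;
using System.Linq;
using HtmlAgilityPack;

// Invented, deliberately malformed markup: none of the <li> tags are closed.
string html = "<ul><li>One<li>Two<li>Three</ul>";

var doc = new HtmlDocument();
doc.LoadHtml(html); // parse from a string; no exception is thrown for bad markup

// ParseErrors lists the problems the parser noticed while still building a tree.
int errorCount = doc.ParseErrors.Count();
Console.WriteLine($"Parse errors reported: {errorCount}");

// The document can still be queried. SelectNodes returns null when nothing matches.
HtmlNodeCollection items = doc.DocumentNode.SelectNodes("//li");
Console.WriteLine($"List items found: {items?.Count ?? 0}");
```

How the unclosed items are nested in the resulting tree depends on the library version, so the sketch only counts the nodes rather than relying on their exact structure.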

DOM Manipulation

For exploring, browsing, and working with the HTML Document Object Model (DOM) structure, HAP offers a user-friendly API. HTML elements, attributes, and text nodes can all be added, removed, or modified programmatically by developers, allowing for dynamic HTML content manipulation.
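The following sketch illustrates the add, remove, and modify operations mentioned above on a small in-memory fragment. The markup, the `id` and class values, and the example.com link are all made up for the demonstration.

```csharp
using System;
using HtmlAgilityPack;

// Invented fragment used only for this demonstration.
var doc = new HtmlDocument();
doc.LoadHtml("<div id='main'><p class='old'>Hello</p><span>remove me</span></div>");

HtmlNode main = doc.DocumentNode.SelectSingleNode("//div[@id='main']");
HtmlNode paragraph = main.SelectSingleNode("p");

// Modify: change an attribute and replace the element's content.
paragraph.SetAttributeValue("class", "new");
paragraph.InnerHtml = "Hello, world";

// Remove: detach the <span> from its parent.
main.SelectSingleNode("span").Remove();

// Add: build a node from an HTML snippet and append it.
main.AppendChild(HtmlNode.CreateNode("<a href='https://example.com'>link</a>"));

// Print the modified markup.
Console.WriteLine(main.OuterHtml);
```

The changes only affect the in-memory document; writing them back out is a separate step (for example by reading `OuterHtml` as above).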

XPath and LINQ Support

For choosing and querying HTML components, HTML Agility Pack supports LINQ (Language Integrated Query) as well as XPath syntax searches. To choose items in an HTML document according to their attributes, tags, or hierarchy, XPath expression queries provide a strong and easy-to-understand syntax. For developers used to working with LINQ in C#, LINQ queries offer a familiar querying syntax that facilitates smooth integration with other .NET components.
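To make the comparison concrete, here is a small sketch that runs the same query twice, once with XPath and once with LINQ, over an invented list of links. Both select the anchors whose `href` starts with `https://`.

```csharp
using System;
using System.Linq;
using HtmlAgilityPack;

// Invented markup: three links, two of them using https.
string html =
    "<ul>" +
    "<li><a href='https://a.example'>A</a></li>" +
    "<li><a href='http://b.example'>B</a></li>" +
    "<li><a href='https://c.example'>C</a></li>" +
    "</ul>";

var doc = new HtmlDocument();
doc.LoadHtml(html);

// XPath: filter on the attribute inside the query string.
HtmlNodeCollection viaXPath =
    doc.DocumentNode.SelectNodes("//a[starts-with(@href, 'https://')]");

// LINQ: enumerate descendants and filter with ordinary C# predicates.
var viaLinq = doc.DocumentNode.Descendants("a")
    .Where(a => a.GetAttributeValue("href", "")
                 .StartsWith("https://", StringComparison.Ordinal))
    .ToList();

Console.WriteLine($"XPath matches: {viaXPath?.Count ?? 0}");
Console.WriteLine($"LINQ matches:  {viaLinq.Count}");
```

One practical difference: `SelectNodes` returns null when nothing matches, whereas the LINQ query yields an empty list, so the LINQ form needs no null check.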

Getting Started with HtmlAgilityPack

Setting Up HtmlAgilityPack in C# Projects

HtmlAgilityPack is distributed as a single NuGet package that you install into your C# project, for example by running Install-Package HtmlAgilityPack in the Package Manager Console. It provides an HTML parser and XPath-based node selection for HTML loaded from files, strings, and URLs.

Implementing HtmlAgilityPack in Windows Console and Forms

HtmlAgilityPack can be used in many C# application types, such as Windows Forms (WinForms) and Windows console applications. Although project setup differs from one application type to another, the way you use the library stays the same.

Html Agility Pack C# (How It Works For Developers): Figure 1 - Search for HtmlAgilityPack using NuGet Package Manager and install it

HtmlAgilityPack C# Example

HTML Agility Pack (HAP) is one of the most useful tools in the C# developer's toolbox for navigating, processing, and working with HTML documents. Its API exposes a parsed page as a tree of nodes, which simplifies data extraction from HTML pages. Let's examine a straightforward code example to demonstrate how to use it.

using System;
using HtmlAgilityPack;

// Download the page and parse it into an HtmlDocument
HtmlWeb web = new HtmlWeb();
HtmlDocument doc = web.Load("https://ironpdf.com/");

// Select the h1 elements whose class attribute matches this exact string
HtmlNodeCollection nodes = doc.DocumentNode.SelectNodes("//h1[@class='product-homepage-header product-homepage-header--ironpdf']");

// SelectNodes returns null when nothing matches, so guard before iterating
if (nodes == null)
{
    Console.WriteLine("No matching nodes found.");
}
else
{
    // Print the text content of each selected node
    foreach (HtmlNode node in nodes)
    {
        Console.WriteLine(node.InnerText);
    }
}
Console.ReadKey();
Imports System
Imports HtmlAgilityPack

Module Program
	Sub Main()
		' Download the page and parse it into an HtmlDocument
		Dim web As New HtmlWeb()
		Dim doc As HtmlDocument = web.Load("https://ironpdf.com/")

		' Select the h1 elements whose class attribute matches this exact string
		Dim nodes As HtmlNodeCollection = doc.DocumentNode.SelectNodes("//h1[@class='product-homepage-header product-homepage-header--ironpdf']")

		' SelectNodes returns Nothing when nothing matches, so guard before iterating
		If nodes IsNot Nothing Then
			For Each node As HtmlNode In nodes
				Console.WriteLine(node.InnerText)
			Next node
		End If
		Console.ReadKey()
	End Sub
End Module

In this example, HtmlWeb downloads the page at the given URL and parses it into an HtmlDocument, which is stored in doc. Starting from the document's root node (doc.DocumentNode), an XPath query selects every h1 element whose class attribute is exactly product-homepage-header product-homepage-header--ironpdf, and the inner text of each selected node is printed to the console.
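The XPath in the example compares the whole class attribute against one string, so it stops matching if the page adds, removes, or reorders a class. The sketch below, which uses invented markup in place of the live page, contrasts that exact comparison with a token-based test and also shows that `SelectNodes` returns null rather than an empty collection when nothing matches.

```csharp
using System;
using HtmlAgilityPack;

// Invented markup standing in for a downloaded page.
string html =
    "<h1 class='product-homepage-header product-homepage-header--ironpdf'>IronPDF</h1>" +
    "<h1 class='product-homepage-header'>Other product</h1>";

var doc = new HtmlDocument();
doc.LoadHtml(html);

// Exact comparison: matches only when the attribute equals this full string.
HtmlNodeCollection exact = doc.DocumentNode.SelectNodes(
    "//h1[@class='product-homepage-header product-homepage-header--ironpdf']");

// Token comparison: matches any h1 whose class list contains the given class.
HtmlNodeCollection byToken = doc.DocumentNode.SelectNodes(
    "//h1[contains(concat(' ', normalize-space(@class), ' '), ' product-homepage-header ')]");

// No match: the result is null, so callers should check before iterating.
HtmlNodeCollection missing = doc.DocumentNode.SelectNodes("//h1[@class='no-such-class']");

Console.WriteLine($"Exact matches: {exact?.Count ?? 0}");
Console.WriteLine($"Token matches: {byToken?.Count ?? 0}");
Console.WriteLine($"Missing result is null: {missing == null}");
```

The `concat`/`normalize-space` idiom is a common XPath 1.0 workaround for matching one class among several; whether it is preferable depends on how stable the target page's markup is.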

Html Agility Pack C# (How It Works For Developers): Figure 2 - Extracted text from retrieving the inner text of the product-homepage-header class

HtmlAgilityPack Operations

HTML Transformation

Developers can perform several transformations and manipulations to HTML texts using the HTML Agility Pack. This covers operations like adding, deleting, or changing text nodes, elements, and attributes in addition to reorganizing the DOM hierarchy of the HTML document.
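As one illustration of reorganizing the DOM hierarchy, the sketch below sorts the items of an invented list alphabetically by detaching each `li` and appending it again in the new order.

```csharp
using System;
using System.Linq;
using HtmlAgilityPack;

// Invented list used only for this demonstration.
var doc = new HtmlDocument();
doc.LoadHtml("<ul><li>banana</li><li>apple</li><li>cherry</li></ul>");

HtmlNode list = doc.DocumentNode.SelectSingleNode("//ul");

// Snapshot the items in the desired order before touching the tree.
var sorted = list.Elements("li")
    .OrderBy(li => li.InnerText, StringComparer.Ordinal)
    .ToList();

// Detach each item from the list, then append it back in sorted order.
foreach (HtmlNode item in sorted)
{
    item.Remove();
}
foreach (HtmlNode item in sorted)
{
    list.AppendChild(item);
}

string order = string.Join(",", list.Elements("li").Select(li => li.InnerText));
Console.WriteLine(order);
```

Taking the `ToList()` snapshot first matters: modifying a node's children while enumerating them directly would invalidate the enumeration.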

Extensibility

Because HAP is meant to be expandable, programmers can add new features and behaviors to increase its functionality. Using the supplied API, developers can design their own HTML parsers, filters, or manipulators to customize HAP to their unique needs and use cases.

Performance and Efficiency

Large HTML texts can be handled well by the algorithms and data structures of HTML Agility Pack, which is tuned for speed and effectiveness. It ensures quick and responsive HTML content parsing and manipulation by reducing memory utilization and processing overhead.

Integrating HtmlAgilityPack with IronPdf

Using IronPDF with HtmlAgilityPack

Combining HTML Agility Pack with IronPDF opens up many options for document management and report creation. By using HTML Agility Pack for HTML parsing and IronPDF for PDF conversion, developers can automate the creation of PDF documents from dynamic online material.

Install IronPDF

  • Launch the Visual Studio project.
  • Select "Tools" > "NuGet Package Manager" > "Package Manager Console".
  • Input this command into the Package Manager Console:
Install-Package IronPdf
  • As an alternative, you can use NuGet Package Manager for Solutions to install IronPDF.
  • Search for the IronPDF package, select it in the results, and click the "Install" button. Visual Studio will take care of the download and installation for you.

    Html Agility Pack C# (How It Works For Developers): Figure 3 - Install IronPDF using the Manage NuGet Package for Solution by searching IronPdf in the search bar of NuGet Package Manager, then select the project and click on the Install button.

  • The IronPDF package and any dependencies needed for your project will be installed by NuGet.
  • IronPDF can be used for your project after installation.

Install via the NuGet Website

To find out more about IronPDF's features, compatibility, and other download options, see the IronPDF package page on the NuGet website.

Install Using the DLL

As an alternative, you can integrate IronPDF straight into your project using its DLL file. Download the ZIP file containing the DLL from the IronPDF website. After unzipping it, add a reference to the DLL in your project.

Implementation Logic

HTML Agility Pack (HAP) and IronPDF can be combined in C# to read HTML content and produce PDF documents on the fly. The steps for implementation are listed below, followed by sample code that walks through them:

  1. Load HTML Content using HTML Agility Pack: To load HTML material from a source, such as a file, string, or URL, use the HTML Agility Pack. In this phase, the HTML document is parsed and a manipulable HTML document object is created.
  2. Extract Desired Content: To choose and extract particular content from the HTML document, use the HTML Agility Pack in conjunction with XPath or LINQ queries. This could entail choosing elements according to their properties, tags, or hierarchical structure.
  3. Convert HTML to PDF using IronPDF: To create a PDF document from the retrieved HTML content, use IronPDF. IronPDF converts HTML material to PDF format with ease while maintaining style and layout.
  4. Optional: Customize PDF Output: Use IronPDF to add headers, footers, page numbering, and other dynamic components to customize the PDF output as needed. This step improves the resulting PDF document's appearance and usability.
  5. Save or Stream PDF Document: The created PDF document can be streamed straight to the client or browser for download, or it can be saved to a file. IronPDF offers ways to save PDF files to different output streams.
using HtmlAgilityPack;
using IronPdf;
using System;
using System.Text;

class Program
{
    static void Main()
    {
        StringBuilder htmlContent = new StringBuilder();

        // Download the page and parse it into an HtmlDocument
        HtmlWeb web = new HtmlWeb();
        HtmlDocument doc = web.Load("https://ironpdf.com/");

        // Select specific elements using XPath
        HtmlNodeCollection nodes = doc.DocumentNode.SelectNodes("//h1[@class='product-homepage-header product-homepage-header--ironpdf']");

        // SelectNodes returns null when nothing matches, so stop early in that case
        if (nodes == null)
        {
            Console.WriteLine("No matching nodes found; no PDF was generated.");
            return;
        }

        // Collect the markup of each selected node and print its text
        foreach (HtmlNode node in nodes)
        {
            htmlContent.Append(node.OuterHtml);
            Console.WriteLine(node.InnerText);
        }

        // Convert the collected HTML to PDF using IronPDF
        // (ChromePdfRenderer replaces the older HtmlToPdf class in current IronPDF releases)
        var renderer = new ChromePdfRenderer();
        var pdf = renderer.RenderHtmlAsPdf(htmlContent.ToString());

        // Save PDF to file
        pdf.SaveAs("output.pdf");
        Console.WriteLine("PDF generated successfully!");
        Console.ReadKey();
    }
}
Imports HtmlAgilityPack
Imports IronPdf
Imports System
Imports System.Text

Friend Class Program
	Shared Sub Main()
		Dim htmlContent As New StringBuilder()

		' Download the page and parse it into an HtmlDocument
		Dim web As New HtmlWeb()
		Dim doc As HtmlDocument = web.Load("https://ironpdf.com/")

		' Select specific elements using XPath
		Dim nodes As HtmlNodeCollection = doc.DocumentNode.SelectNodes("//h1[@class='product-homepage-header product-homepage-header--ironpdf']")

		' SelectNodes returns Nothing when nothing matches, so stop early in that case
		If nodes Is Nothing Then
			Console.WriteLine("No matching nodes found; no PDF was generated.")
			Return
		End If

		' Collect the markup of each selected node and print its text
		For Each node As HtmlNode In nodes
			htmlContent.Append(node.OuterHtml)
			Console.WriteLine(node.InnerText)
		Next node

		' Convert the collected HTML to PDF using IronPDF
		' (ChromePdfRenderer replaces the older HtmlToPdf class in current IronPDF releases)
		Dim renderer As New ChromePdfRenderer()
		Dim pdf = renderer.RenderHtmlAsPdf(htmlContent.ToString())

		' Save PDF to file
		pdf.SaveAs("output.pdf")
		Console.WriteLine("PDF generated successfully!")
		Console.ReadKey()
	End Sub
End Class

See the IronPDF documentation on HTML-to-PDF conversion to learn more about the code example.
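Steps 4 and 5 of the list above (customizing the output and writing it somewhere other than a file) are not covered by the sample. The following is a rough sketch of how they might look, assuming a current IronPDF release in which `ChromePdfRenderer`, `RenderingOptions.TextHeader`/`TextFooter`, `TextHeaderFooter`, and `PdfDocument.BinaryData` are available; check these names against the version you have installed. The HTML string, header text, and file name are placeholders.

```csharp
using System;
using System.IO;
using IronPdf;

// Placeholder markup standing in for content extracted with HTML Agility Pack.
string extractedHtml = "<h1>Sample heading</h1><p>Sample body text.</p>";

var renderer = new ChromePdfRenderer();

// Step 4 (optional): add a text header and a page-numbered footer.
// Assumption: {page} and {total-pages} are the footer merge fields.
renderer.RenderingOptions.TextHeader = new TextHeaderFooter
{
    CenterText = "Extracted content report"
};
renderer.RenderingOptions.TextFooter = new TextHeaderFooter
{
    RightText = "Page {page} of {total-pages}"
};

PdfDocument pdf = renderer.RenderHtmlAsPdf(extractedHtml);

// Step 5: save to disk, or take the raw bytes and write them to any stream.
pdf.SaveAs("report.pdf");
byte[] pdfBytes = pdf.BinaryData;
using (var output = new MemoryStream())
{
    output.Write(pdfBytes, 0, pdfBytes.Length);
    Console.WriteLine($"Wrote {output.Length} bytes to the stream.");
}
```

In a web application the `MemoryStream` would typically be replaced by the HTTP response stream, so the PDF reaches the browser without touching the disk.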

Html Agility Pack C# (How It Works For Developers): Figure 4 - IronPDF homepage

The execution output is shown below:

Example output from the code above

Conclusion

Whether parsing HTML data or creating PDF reports, developers can manage and alter document material with ease thanks to the smooth integration of HTML Agility Pack and IronPDF in C#. Developers can easily and precisely automate operations connected to documents by combining the PDF production features of IronPDF with the parsing capabilities of HTML Agility Pack. The combination of these two libraries provides a strong C# document management solution, regardless of whether you're building dynamic reports or pulling data from web pages.

The $799 Lite bundle includes a perpetual license, a year of software maintenance, and a library upgrade. IronPDF also offers free licensing with time and redistribution limits, and during the trial period users can evaluate the library without a watermark. See IronPDF's licensing information to learn more about cost and license terms.

Learn more about Iron Software libraries.

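Frequently Asked Questions

How do I convert HTML to PDF in C#?

You can use IronPDF's RenderHtmlAsPdf method to convert an HTML string to PDF. You can also use RenderHtmlFileAsPdf to convert an HTML file to PDF.

What is the purpose of using HtmlAgilityPack in a C# project?

HtmlAgilityPack is used in C# projects to parse and manipulate HTML documents. It can handle poorly formed HTML, which makes it well suited to web scraping and data extraction tasks.

How do I set up HtmlAgilityPack in a C# application?

To set up HtmlAgilityPack, install it through the NuGet Package Manager in Visual Studio. Once it is installed, import the necessary namespace and start parsing HTML content in your application.

Can IronPDF and HtmlAgilityPack be used together for document creation?

Yes. IronPDF and HtmlAgilityPack can be combined to create dynamic PDF documents from HTML content. HtmlAgilityPack extracts and manipulates the HTML data, which IronPDF can then convert to PDF.

What are IronPDF's main features for .NET developers?

IronPDF's features include converting HTML to PDF, merging PDFs, and adding text or images to PDFs. It supports a broad range of functionality for robust PDF document management in .NET applications.

How does HtmlAgilityPack help extract data from web pages?

HtmlAgilityPack lets developers load HTML documents and use XPath or LINQ queries to navigate them and extract data based on specific nodes or attributes, which simplifies web data extraction.

What are the benefits of integrating a PDF library with HtmlAgilityPack?

Using IronPDF together with HtmlAgilityPack improves document automation by converting dynamic HTML content into PDF reports, which simplifies document generation in .NET applications.

Can IronPDF be used in a console application?

Yes. IronPDF can be used in many C# application types, including Windows console applications, for versatile document processing and PDF generation.

What kinds of HTML operations can be performed with HtmlAgilityPack?

HtmlAgilityPack supports adding, removing, and modifying HTML nodes and elements, as well as restructuring the DOM, which makes it a versatile tool for HTML document processing.

Does IronPDF offer a free trial for developers?

IronPDF offers a free license with certain limitations that lets developers evaluate the library without a watermark during the trial period, giving them a chance to test its features before purchasing.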

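Curtis Chau
Technical Writer

Curtis Chau holds a bachelor's degree in computer science from Carleton University and specializes in front-end development, with expertise in Node.js, TypeScript, JavaScript, and React. Passionate about crafting intuitive and aesthetically pleasing user interfaces, Curtis enjoys working with modern frameworks and creating well-structured, visually appealing manuals.

Beyond development, Curtis has a strong interest in the Internet of Things (IoT) and explores innovative ways to integrate hardware and software. In his free time, he enjoys gaming and building Discord bots, combining his love of technology with creativity.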