Html Agility Pack C# (How It Works For Developers)

Published June 4, 2024
Introduction

C# developers often need to manage and manipulate document content dynamically, and they commonly rely on libraries to automate tasks such as extracting data from web pages and creating PDF reports. This article shows how to integrate HTML Agility Pack and IronPDF in C#, with code examples that read HTML content and create PDF documents.

IronPDF is a feature-rich .NET library for working with PDF files. Because it can generate PDF files dynamically from HTML content, URLs, or raw data, it is a useful tool for document creation, reporting, and data visualization.

This post looks at how to connect IronPDF with HtmlAgilityPack to streamline document generation in .NET applications. Combined, the two libraries let a program fetch pages over the network, extract data from them, and generate PDF documents from the result.

How to Use HtmlAgilityPack in C#

  1. Create a new C# Project.
  2. Install the library HtmlAgilityPack.
  3. Import the namespace and create an HtmlWeb or HtmlDocument object.
  4. Load the HTML from a URL and parse it.
  5. Extract the required data from the parsed document.

Introduction to HtmlAgilityPack

HTML Agility Pack is a versatile and powerful HTML parsing library for .NET developers. With the help of its extensive collection of APIs, developers can easily navigate, alter, and extract data from HTML documents. HTML Agility Pack makes working with HTML content programmatically easier for all developers, regardless of experience level.

What sets HTML Agility Pack apart is its ability to handle badly structured or invalid HTML gracefully. Its forgiving parsing algorithm can process even poorly constructed markup, which makes it well suited to web scraping, where the quality of the HTML varies.

Features of HtmlAgilityPack

HTML Parsing

With the powerful HTML parsing features offered by HTML Agility Pack, developers may load HTML documents from a variety of sources, including files, URLs, and strings. Due to its lenient parsing approach, it can gracefully handle poorly formatted or incorrect HTML, making it suitable for web scraping activities where the HTML markup quality can vary.
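
To make the lenient parsing concrete, here is a minimal sketch that loads deliberately malformed markup from a string. The sample HTML and the ParsingSketch helper class are made up for illustration; LoadHtml, ParseErrors, and Descendants are HtmlAgilityPack members. For files and URLs, HtmlDocument.Load and HtmlWeb.Load play the same role.

```csharp
using System;
using System.Linq;
using HtmlAgilityPack;

// Deliberately malformed markup: unquoted attribute, unclosed <li> and <p> tags.
string html = "<ul class=menu><li>Home<li>Docs</ul><p>Unclosed paragraph";

HtmlDocument doc = ParsingSketch.Parse(html);

// ParseErrors lists any problems the parser recorded; loading does not throw on bad markup.
foreach (HtmlParseError error in doc.ParseErrors)
{
    Console.WriteLine($"Line {error.Line}: {error.Reason}");
}

// The document can still be queried.
Console.WriteLine($"List items found: {ParsingSketch.CountElements(doc, "li")}");

// Hypothetical helper class, not part of HtmlAgilityPack.
static class ParsingSketch
{
    public static HtmlDocument Parse(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);
        return doc;
    }

    public static int CountElements(HtmlDocument doc, string tagName) =>
        doc.DocumentNode.Descendants(tagName).Count();
}
```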

DOM Manipulation

HTML Agility Pack (HAP) offers a user-friendly API for traversing and working with the HTML Document Object Model (DOM). Developers can add, remove, or modify HTML elements, attributes, and text nodes programmatically, which allows HTML content to be changed dynamically.
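
As a rough illustration of modifying, removing, and adding nodes, the sketch below edits a small fragment. The sample markup and the DomSketch class are hypothetical; the calls used (SelectSingleNode, SetAttributeValue, InnerHtml, Remove, HtmlNode.CreateNode, AppendChild) are HtmlAgilityPack members.

```csharp
using System;
using HtmlAgilityPack;

string html = "<div id='main'><p class='old'>Hello</p><span>remove me</span></div>";
Console.WriteLine(DomSketch.Rewrite(html));

// Hypothetical helper class, not part of HtmlAgilityPack.
static class DomSketch
{
    public static string Rewrite(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        HtmlNode div = doc.DocumentNode.SelectSingleNode("//div[@id='main']");
        if (div == null) return html; // nothing to change

        // Modify: change an attribute and the content of the first <p>.
        HtmlNode p = div.SelectSingleNode("./p");
        if (p != null)
        {
            p.SetAttributeValue("class", "new");
            p.InnerHtml = "Hello, world";
        }

        // Remove: detach the <span> from the tree if it exists.
        div.SelectSingleNode("./span")?.Remove();

        // Add: build a node from an HTML fragment and append it to the <div>.
        div.AppendChild(HtmlNode.CreateNode("<a href='https://example.com'>More</a>"));

        return doc.DocumentNode.OuterHtml;
    }
}
```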

XPath and LINQ Support

HTML Agility Pack supports both XPath expressions and LINQ (Language Integrated Query) for selecting and querying HTML elements. XPath provides a compact, expressive syntax for selecting elements by attribute, tag, or position in the hierarchy. LINQ queries offer a syntax that C# developers already know and that integrates smoothly with other .NET components.
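
The two query styles can be compared side by side. This sketch selects the same links once with XPath and once with LINQ; the sample markup and the QuerySketch class are invented for the example.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using HtmlAgilityPack;

string html = @"<ul>
  <li><a class='nav' href='/docs'>Docs</a></li>
  <li><a class='nav' href='/blog'>Blog</a></li>
  <li><a href='/login'>Login</a></li>
</ul>";

var doc = new HtmlDocument();
doc.LoadHtml(html);

foreach (string href in QuerySketch.HrefsByXPath(doc)) Console.WriteLine(href);
foreach (string href in QuerySketch.HrefsByLinq(doc)) Console.WriteLine(href);

// Hypothetical helper class, not part of HtmlAgilityPack.
static class QuerySketch
{
    // XPath: anchors whose class attribute equals 'nav'.
    public static List<string> HrefsByXPath(HtmlDocument doc)
    {
        HtmlNodeCollection nodes = doc.DocumentNode.SelectNodes("//a[@class='nav']");
        // SelectNodes returns null, not an empty collection, when nothing matches.
        if (nodes == null) return new List<string>();
        return nodes.Select(a => a.GetAttributeValue("href", "")).ToList();
    }

    // LINQ: the same selection expressed over Descendants().
    public static List<string> HrefsByLinq(HtmlDocument doc) =>
        doc.DocumentNode.Descendants("a")
           .Where(a => a.GetAttributeValue("class", "") == "nav")
           .Select(a => a.GetAttributeValue("href", ""))
           .ToList();
}
```

XPath tends to be shorter for structural queries, while LINQ is convenient when the filter needs ordinary C# logic.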

Getting Started with HtmlAgilityPack

Setting Up HtmlAgilityPack in C# Projects

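HtmlAgilityPack is distributed as a single NuGet package. Install it through the NuGet Package Manager, or run Install-Package HtmlAgilityPack in the Package Manager Console, and it can then be used in the C# project. It provides an HTML parser with XPath and LINQ selection for HTML loaded from files, strings, and URLs. CSS selector support is not part of the core package and requires a separate extension library.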

Using HtmlAgilityPack in Windows Console and Forms Applications

HtmlAgilityPack can be used in many C# application types, such as Windows Forms (WinForms) and Windows Console applications. The surrounding code differs between application types, but the way the library is used stays the same.

Figure 1 - Search for HtmlAgilityPack using NuGet Package Manager and install it

HtmlAgilityPack C# Example

HTML Agility Pack (HAP) is one of the most useful tools a C# developer has for navigating, processing, and working with HTML documents. Its API exposes the document as a tree of nodes, which makes extracting data from HTML pages straightforward. Let's examine a simple code example to demonstrate how to use it.

using System;
using HtmlAgilityPack;

// Download the page and parse it into an HtmlDocument
HtmlWeb web = new HtmlWeb();
HtmlDocument doc = web.Load("https://ironpdf.com/");

// Select the h1 elements whose class attribute matches this value exactly
HtmlNodeCollection nodes = doc.DocumentNode.SelectNodes("//h1[@class='product-homepage-header product-homepage-header--ironpdf']");

// SelectNodes returns null when nothing matches, so check before iterating
if (nodes == null)
{
    Console.WriteLine("No matching nodes found.");
}
else
{
    // Print the text content of each selected node
    foreach (HtmlNode node in nodes)
    {
        Console.WriteLine(node.InnerText);
    }
}
Console.ReadKey();

The same example in VB.NET:

Imports System
Imports HtmlAgilityPack

Module Program
	Sub Main()
		' Download the page and parse it into an HtmlDocument
		Dim web As New HtmlWeb()
		Dim doc As HtmlDocument = web.Load("https://ironpdf.com/")

		' Select the h1 elements whose class attribute matches this value exactly
		Dim nodes As HtmlNodeCollection = doc.DocumentNode.SelectNodes("//h1[@class='product-homepage-header product-homepage-header--ironpdf']")

		' SelectNodes returns Nothing when nothing matches, so check before iterating
		If nodes IsNot Nothing Then
			For Each node As HtmlNode In nodes
				Console.WriteLine(node.InnerText)
			Next
		End If
		Console.ReadKey()
	End Sub
End Module

In this example, HTML Agility Pack downloads the HTML from a URL and parses it into the doc variable. To extract content, the program starts from the document's root node (DocumentNode) and targets specific nodes with an XPath query. The query above selects h1 elements whose class attribute is exactly "product-homepage-header product-homepage-header--ironpdf", and the inner text of each selected node is printed to the console.

Figure 2 - Extracted text from retrieving the inner text of the product-homepage-header class
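
Because the XPath in the example compares the whole class attribute, it stops matching as soon as the page adds, removes, or reorders a class. A common XPath 1.0 idiom matches a single class token instead. The sketch below shows that idiom; the sample markup and the ClassMatchSketch helper are assumptions made for the example.

```csharp
using System;
using System.Collections.Generic;
using HtmlAgilityPack;

string html = "<h1 class='hero  product-homepage-header'>IronPDF</h1><h1 class='other'>Skip</h1>";
var doc = new HtmlDocument();
doc.LoadHtml(html);

foreach (string text in ClassMatchSketch.TextsWithClass(doc, "h1", "product-homepage-header"))
{
    Console.WriteLine(text);
}

// Hypothetical helper class, not part of HtmlAgilityPack.
static class ClassMatchSketch
{
    public static List<string> TextsWithClass(HtmlDocument doc, string tag, string cssClass)
    {
        // Pad the normalized class list with spaces so that only whole tokens match.
        string xpath = $"//{tag}[contains(concat(' ', normalize-space(@class), ' '), ' {cssClass} ')]";
        HtmlNodeCollection nodes = doc.DocumentNode.SelectNodes(xpath);

        var result = new List<string>();
        if (nodes == null) return result; // SelectNodes returns null when nothing matches

        foreach (HtmlNode node in nodes)
        {
            // Decode entities such as &amp; and trim surrounding whitespace.
            result.Add(HtmlEntity.DeEntitize(node.InnerText).Trim());
        }
        return result;
    }
}
```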

HtmlAgilityPack Operations

HTML Transformation

Developers can apply a range of transformations to HTML documents using the HTML Agility Pack. These include adding, deleting, or changing text nodes, elements, and attributes, as well as reorganizing the DOM hierarchy of the document.
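
One transformation that is often useful before converting HTML to another format is stripping script and style elements. The following is a minimal sketch of that idea, with a hypothetical CleanupSketch helper and made-up input.

```csharp
using System;
using System.Linq;
using HtmlAgilityPack;

string html = "<div><script>alert('x')</script><style>p{color:red}</style><p>Keep</p></div>";
Console.WriteLine(CleanupSketch.StripScriptsAndStyles(html));

// Hypothetical helper class, not part of HtmlAgilityPack.
static class CleanupSketch
{
    public static string StripScriptsAndStyles(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // Copy the matches to a list first so that removing nodes
        // does not disturb the enumeration of the tree.
        var unwanted = doc.DocumentNode.Descendants()
            .Where(n => n.Name == "script" || n.Name == "style")
            .ToList();

        foreach (HtmlNode node in unwanted)
        {
            node.Remove();
        }

        return doc.DocumentNode.OuterHtml;
    }
}
```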

Extensibility

HAP is designed to be extensible, so programmers can add new features and behaviors on top of it. Using the supplied API, developers can build their own filters or manipulators to adapt HAP to their specific needs and use cases.
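
One lightweight way to add behavior in C# is an extension method on HtmlNode. The TextOrDefault method below is a hypothetical helper written for this article, not part of the library; it wraps the null check that SelectSingleNode otherwise requires.

```csharp
using System;
using HtmlAgilityPack;

var doc = new HtmlDocument();
doc.LoadHtml("<h1> Fish &amp; Chips </h1>");

Console.WriteLine(doc.DocumentNode.TextOrDefault("//h1", "(no title)"));
Console.WriteLine(doc.DocumentNode.TextOrDefault("//h2", "(no subtitle)"));

// Hypothetical extension class, not part of HtmlAgilityPack.
static class HtmlNodeExtensions
{
    // Returns the decoded, trimmed text of the first node matching xpath,
    // or the fallback when no node matches.
    public static string TextOrDefault(this HtmlNode root, string xpath, string fallback = "")
    {
        HtmlNode node = root.SelectSingleNode(xpath);
        return node == null ? fallback : HtmlEntity.DeEntitize(node.InnerText).Trim();
    }
}
```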

Performance and Efficiency

HTML Agility Pack's algorithms and data structures are designed to handle large HTML documents efficiently, keeping memory use and processing overhead low so that parsing and manipulation stay responsive.

Integrating HtmlAgilityPack with IronPDF

Using IronPDF with HtmlAgilityPack

Combining HTML Agility Pack and IronPDF opens up many options for document management and report creation. With HTML Agility Pack handling HTML parsing and IronPDF handling PDF conversion, developers can automate the creation of PDF documents from dynamic online content. For details, see the IronPDF documentation.

Install IronPDF

  • Launch the Visual Studio project.
  • Select "Tools" > "NuGet Package Manager" > "Package Manager Console".
  • Input this command into the Package Manager Console:
Install-Package IronPdf
  • As an alternative, you can use NuGet Package Manager for Solutions to install IronPDF.
  • Browse the search results for the IronPDF package, select it, and click the "Install" button. Visual Studio handles the download and installation.

    Figure 3 - Install IronPDF using the Manage NuGet Package for Solution by searching IronPdf in the search bar of NuGet Package Manager, then select the project and click on the Install button.

  • NuGet installs the IronPDF package and any dependencies your project needs.
  • After installation, IronPDF can be used in your project.

Install Through the NuGet Website

To find out more about the features, compatibility, and other download choices of IronPDF, see its page at https://www.nuget.org/packages/IronPdf on the NuGet website.

Install Using the DLL

As an alternative, you can integrate IronPDF directly into your project using its DLL file. Download the ZIP file containing the DLL, unzip it, and add a reference to the DLL in your project.

Implementing Logic

HTML Agility Pack (HAP) and IronPDF can be combined in C# to read HTML content and produce PDF documents on the fly. The implementation steps are listed below, followed by sample code that walks through them:

  1. Load HTML Content using HTML Agility Pack: To load HTML material from a source, such as a file, string, or URL, use the HTML Agility Pack. In this phase, the HTML document is parsed and a manipulable HTML document object is created.
  2. Extract Desired Content: To choose and extract particular content from the HTML document, use the HTML Agility Pack in conjunction with XPath or LINQ queries. This could entail choosing elements according to their properties, tags, or hierarchical structure.
  3. Convert HTML to PDF using IronPDF: To create a PDF document from the retrieved HTML content, use IronPDF. IronPDF converts HTML material to PDF format with ease while maintaining style and layout.
  4. Optional: Customize PDF Output: Use IronPDF to add headers, footers, page numbering, and other dynamic components to customize the PDF output as needed. This step improves the resulting PDF document's appearance and usability.
  5. Save or Stream PDF Document: The created PDF document can be streamed straight to the client or browser for download, or it can be saved to a file. IronPDF offers ways to save PDF files to different output streams.
using System;
using System.Text;
using HtmlAgilityPack;

StringBuilder htmlContent = new StringBuilder();

// Download the page and parse it into an HtmlDocument
HtmlWeb web = new HtmlWeb();
HtmlDocument doc = web.Load("https://ironpdf.com/");

// Select specific elements using XPath
HtmlNodeCollection nodes = doc.DocumentNode.SelectNodes("//h1[@class='product-homepage-header product-homepage-header--ironpdf']");

// SelectNodes returns null when nothing matches, so stop early in that case
if (nodes == null)
{
    Console.WriteLine("No matching nodes found.");
    return;
}

// Collect the markup of each selected node and print its text
foreach (HtmlNode node in nodes)
{
    htmlContent.Append(node.OuterHtml);
    Console.WriteLine(node.InnerText);
}

// Convert the collected HTML to PDF using IronPDF
// (ChromePdfRenderer is the current renderer class; HtmlToPdf is the older name)
var renderer = new IronPdf.ChromePdfRenderer();
var pdf = renderer.RenderHtmlAsPdf(htmlContent.ToString());

// Save PDF to file
pdf.SaveAs("output.pdf");
Console.WriteLine("PDF generated successfully!");
Console.ReadKey();

The same example in VB.NET:

Imports System
Imports System.Text
Imports HtmlAgilityPack

Module Program
	Sub Main()
		Dim htmlContent As New StringBuilder()

		' Download the page and parse it into an HtmlDocument
		Dim web As New HtmlWeb()
		Dim doc As HtmlDocument = web.Load("https://ironpdf.com/")

		' Select specific elements using XPath
		Dim nodes As HtmlNodeCollection = doc.DocumentNode.SelectNodes("//h1[@class='product-homepage-header product-homepage-header--ironpdf']")

		' SelectNodes returns Nothing when nothing matches, so stop early in that case
		If nodes Is Nothing Then
			Console.WriteLine("No matching nodes found.")
			Return
		End If

		' Collect the markup of each selected node and print its text
		For Each node As HtmlNode In nodes
			htmlContent.Append(node.OuterHtml)
			Console.WriteLine(node.InnerText)
		Next

		' Convert the collected HTML to PDF using IronPDF
		Dim renderer = New IronPdf.ChromePdfRenderer()
		Dim pdf = renderer.RenderHtmlAsPdf(htmlContent.ToString())

		' Save PDF to file
		pdf.SaveAs("output.pdf")
		Console.WriteLine("PDF generated successfully!")
		Console.ReadKey()
	End Sub
End Module

Figure 4 - IronPDF homepage

The execution output is shown below:

Example output from the code above
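
The optional customization and streaming steps (4 and 5) can be sketched as follows. This is an illustration under assumptions rather than a definitive implementation: the TextHeaderFooter property names, the {page} and {total-pages} placeholders, and the BinaryData property are used here as I understand them from IronPDF's documentation and should be checked against the installed version. The HTML string and file name are made up.

```csharp
using System;
using System.IO;
using IronPdf;

var renderer = new ChromePdfRenderer();

// Step 4 (assumed API): add a text footer with page numbering.
renderer.RenderingOptions.TextFooter = new TextHeaderFooter
{
    CenterText = "Page {page} of {total-pages}",
    FontSize = 10
};

// Placeholder HTML; in practice this would be the content extracted with HtmlAgilityPack.
PdfDocument pdf = renderer.RenderHtmlAsPdf("<h1>Report</h1><p>Generated from extracted HTML.</p>");

// Step 5a: save the document to disk.
pdf.SaveAs("report.pdf");

// Step 5b (assumed API): get the raw bytes and write them to any stream,
// for example an HTTP response body. A MemoryStream stands in for it here.
byte[] bytes = pdf.BinaryData;
using (var output = new MemoryStream())
{
    output.Write(bytes, 0, bytes.Length);
    Console.WriteLine($"Wrote {output.Length} bytes to the stream.");
}
```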

Conclusion

Together, HTML Agility Pack and IronPDF let C# developers parse HTML data and create PDF reports with little friction. HTML Agility Pack supplies the parsing, IronPDF supplies the PDF generation, and combining them makes it possible to automate document tasks precisely, whether you are building dynamic reports or pulling data from web pages.

The $749 Lite bundle includes a perpetual license, a year of software maintenance, and a library upgrade. IronPDF also offers a free trial license with time and redistribution limits; during the trial period, users can evaluate the library without a watermark. See IronPDF's license page for details on pricing and licensing, and the Iron Software website for information on its other libraries.
