Published July 12, 2023
How to Read PDF Files in Node.js
1.0 Introduction
The Portable Document Format (PDF), developed by Adobe, preserves the layout and content of text-rich, visually polished documents when they are shared. Opening a PDF file usually requires a dedicated program. PDF files are now standard for many important publications, and many businesses use them for professional documents and invoices. Developers also frequently use PDF generation libraries to meet specific customer needs, and modern libraries have simplified the process of creating PDFs. When choosing a library for a project that works with PDFs, consider its build, read, and conversion capabilities to ensure smooth integration and good performance. In this article, we create a simple PDF viewer using Node.js.
2.0 Setting Up an Express Project
2.1 Package Requirements
- Visual Studio Code
- Node.js
- A package manager for installing packages; you can use npm or Yarn for this.
For development we are going to use Visual Studio Code. To set up the project, first create a new directory and then install the required packages. To create the directory, use the command:
mkdir pdfviewer
Next, move into the new directory and install the pdfreader package, which reads PDF files from disk:
cd pdfviewer
npm install pdfreader
2.3 Initialize the pdfreader Server
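// server.js: print every text entry found in pdf/demo.pdf
import { PdfReader } from "pdfreader";

new PdfReader().parseFileItems("pdf/demo.pdf", (err, item) => {
  if (err) console.error("error:", err);        // parsing failed
  else if (!item) console.warn("end of file");  // no more items to read
  else if (item.text) console.log(item.text);   // a text entry
});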
The pdfreader module parses and reads text from PDF files. It provides automatic column recognition, rule-based processing, and support for tabular data. The module exposes the PdfReader class, which can be instantiated. To log debugging information, which is helpful when resolving issues, pass the option debug: true to the constructor.
We use the parseFileItems method to read text from PDF files (as shown above). We pass this method a callback function that is called each time the instance finds a PDF item. The module extracts text entries from PDF files. It does not support photographed (image-only) text.
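The callback contract described above (an error, a falsy item at the end of the file, or an item with a text property) can be wrapped in a small helper. The sketch below is only an illustration and is not part of pdfreader: createTextCollector and its onDone parameter are hypothetical names, and the item stream is simulated rather than read from a real PDF.

```javascript
// Hypothetical helper (not part of pdfreader): builds an (err, item) callback
// that gathers every text entry it receives and reports once at the end.
function createTextCollector(onDone) {
  const texts = [];
  let finished = false;

  return function handleItem(err, item) {
    if (finished) return;        // ignore anything after completion
    if (err) {
      finished = true;
      onDone(err, texts);        // report the error with what was read so far
    } else if (!item) {
      finished = true;
      onDone(null, texts);       // a falsy item marks the end of the file
    } else if (item.text) {
      texts.push(item.text);     // keep text entries, skip other items
    }
  };
}

// Simulated item stream standing in for a real PDF file.
let result = null;
const handleItem = createTextCollector((err, texts) => {
  result = { err, joined: texts.join(" ") };
});
[{ page: 1 }, { text: "Hello" }, { text: "world" }, null].forEach((item) =>
  handleItem(null, item)
);
console.log(result.joined); // prints "Hello world"
```

With a real file, a callback built by createTextCollector could be passed as the second argument of parseFileItems in place of the inline arrow function.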
Some notes about the items passed to the callback:
- null, when parsing is finished or an error occurs.
- File metadata, {file:{path:string}}, when a PDF file is opened. This is always the first item.
- Page metadata, {page:integer, width:float, height:float}, when a new page is being parsed. It provides the page number, starting at 1, and effectively serves as a carriage return for the coordinates of the text elements that follow.
- Text elements, {text:string, x:float, y:float, w:float, ...}, which you can think of as basic objects with a text attribute and floating-point 2D AABB (axis-aligned bounding box) coordinates on the page.
The PdfReader has automatic column detection, which allows it to extract data from tables. Your callback is responsible for converting these items into the data structure of your choice and for handling any errors passed to it. PdfReader can also read a raw PDF from a buffer instead of a file.
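Because page items reset the coordinate space and text items carry x and y positions, one common way to rebuild lines of text is to group text items by their y value and sort each group by x. The following is a minimal sketch under that assumption. createRowGrouper is a hypothetical helper and the sample items are invented for illustration. Real PDFs may place text from one visual line at slightly different y values, which this sketch does not handle.

```javascript
// Hypothetical helper: groups text items into lines, page by page.
function createRowGrouper() {
  const pages = [];   // one entry per page: { page, rows: Map of y -> items }
  let current = null;

  function onItem(item) {
    if (!item) return;                         // end of file: nothing to add
    if (item.page) {
      // A page item starts a fresh coordinate space.
      current = { page: item.page, rows: new Map() };
      pages.push(current);
    } else if (item.text && current) {
      const row = current.rows.get(item.y) || [];
      row.push(item);
      current.rows.set(item.y, row);
    }
  }

  function toLines() {
    return pages.map(({ page, rows }) => ({
      page,
      lines: [...rows.entries()]
        .sort(([y1], [y2]) => y1 - y2)         // top to bottom
        .map(([, items]) =>
          items.sort((a, b) => a.x - b.x)      // left to right
            .map((i) => i.text)
            .join(" ")
        ),
    }));
  }

  return { onItem, toLines };
}

// Invented sample items in the shape described above.
const grouper = createRowGrouper();
[
  { page: 1, width: 40, height: 50 },
  { text: "Price", x: 10.5, y: 2, w: 3 },
  { text: "Item", x: 1.5, y: 2, w: 3 },
  { text: "Title", x: 1.5, y: 1, w: 3 },
  { page: 2, width: 40, height: 50 },
  { text: "Total", x: 1.5, y: 1, w: 3 },
  null,
].forEach((item) => grouper.onItem(item));

const lines = grouper.toLines();
console.log(JSON.stringify(lines));
// prints [{"page":1,"lines":["Title","Item Price"]},{"page":2,"lines":["Total"]}]
```

Keeping the grouping separate from the parser means the same helper can be fed from a file, from a buffer, or from test data.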
2.4 Rule-based Data Extraction
When parsing a PDF file, data extraction rules can be defined and processed using the Rule class.
Rule instances expose "accumulators": methods that specify the data extraction strategy to be used for each rule.
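import { PdfReader, Rule } from "pdfreader";

// Illustrative handlers (assumed here; any function that accepts the
// extracted value or table can be used instead).
const displayValue = (value) => console.log("extracted value:", value);
const displayTable = (table) => console.log("extracted table:", table);

// Each rule pairs a regular expression with an accumulator and a handler.
const processItem = Rule.makeItemProcessor([
  Rule.on(/^Hello \"(.*)\"$/)
    .extractRegexpValues()
    .then(displayValue),
  Rule.on(/^Value\:/)
    .parseNextItemValue()
    .then(displayValue),
  Rule.on(/^c1$/).parseTable(3).then(displayTable),
  Rule.on(/^Values\:/)
    .accumulateAfterHeading()
    .then(displayValue),
]);

// Feed every parsed item to the rule processor.
new PdfReader().parseFileItems("test/sample.pdf", (err, item) => {
  if (err) console.error(err);
  else processItem(item);
});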
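To make the idea of rules and accumulators more concrete, here is a small standalone sketch of the pattern: each rule pairs a regular expression with a strategy for collecting a value. This is not pdfreader's implementation. makeSimpleProcessor and the mode values are hypothetical names that only loosely mimic what the method names extractRegexpValues and parseNextItemValue suggest, and the items are invented sample data.

```javascript
// Hypothetical mini rule processor (not pdfreader's Rule class).
function makeSimpleProcessor(rules) {
  let pending = null; // handler waiting for the text item after a match

  return function processSimple(item) {
    if (!item || !item.text) return;     // skip end-of-file and metadata items
    if (pending) {
      const handler = pending;
      pending = null;
      handler(item.text);                // "next item" strategy: deliver this text
      return;
    }
    for (const rule of rules) {
      const match = rule.regexp.exec(item.text);
      if (!match) continue;
      if (rule.mode === "regexpValues") {
        rule.then(match.slice(1));       // deliver the captured groups
      } else if (rule.mode === "nextItem") {
        pending = rule.then;             // deliver the following text item
      }
      return;                            // first matching rule wins
    }
  };
}

const extracted = [];
const processSimple = makeSimpleProcessor([
  { regexp: /^Hello "(.*)"$/, mode: "regexpValues", then: (v) => extracted.push(v) },
  { regexp: /^Value:/, mode: "nextItem", then: (v) => extracted.push(v) },
]);

// Invented sample items standing in for parsed PDF content.
[{ text: 'Hello "world"' }, { text: "Value:" }, { text: "4" }, null].forEach(
  (item) => processSimple(item)
);
console.log(JSON.stringify(extracted)); // prints [["world"],"4"]
```

The point of the pattern is that the matching logic and the collection strategy stay declarative, while the callback passed to the parser only forwards items.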
2.5 To Run Application
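After adding the code above to a file named server.js, you can launch it by entering the following command in the project's root directory. Because the code uses ES module import syntax, the project's package.json should contain "type": "module" (or the file should use the .mjs extension).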
node server.js
Make sure the PDF file is not password-protected.
To learn more about the Node.js PDF reader, refer to the pdfreader package documentation.
3.0 IronPDF
IronPDF was created to make generating, browsing, and modifying PDF files in contemporary browsers easy. It functions as a strong PDF converter and offers a rich API for creating, editing, and modifying PDF files. IronPDF is compatible with Xamarin, Blazor, Unity, HoloLens apps, Windows Forms, HTML, ASPX, Razor HTML, .NET Core, and ASP.NET.
To convert HTML to PDF, IronPDF uses the Chrome engine. It supports both traditional Windows programs and online ASP.NET apps on Microsoft .NET and .NET Core, and it can create PDFs from HTML5, JavaScript, and CSS.
Developers can read and edit PDF files without Acrobat Reader by using the IronPDF library. They can also extract images, split and combine pages in new or existing PDF documents, and add text, graphics, bookmarks, watermarks, headers, and footers.
Additionally, CSS and CSS media files can be used to create PDF documents. Old PDF forms and new office documents can both be created, uploaded, and edited with IronPDF.
- IronPDF can produce PDF files from a variety of sources, including HTML, HTML5, ASPX, and Razor/MVC views. It can also create PDF files from HTML pages and images.
- The tools in the IronPDF library cover many tasks, including creating interactive PDFs, completing and submitting interactive forms, merging and splitting PDF files, extracting text and images, searching text within PDF files, rasterizing PDFs to images, changing font size, and converting PDF files.
- IronPDF can work with pages behind HTML login forms by supporting user-agents, proxies, cookies, HTTP headers, and form variables.
- IronPDF uses usernames and passwords to allow access to secured documents.
The IronPDF library turns PDF pages into PDF objects and enables text extraction from PDF files. The example that follows shows how to read an existing PDF using IronPDF.
3.1 Create a New Project in Visual Studio
Open Visual Studio and choose "New Project" from the File menu. We'll be using a console app in this article.
Enter the project name and file location in the appropriate text boxes.
Next, click the Next button and pick the required .NET Framework version. After filling in the additional information, click the Create button to create the new project.
The IronPDF library must then be added to the solution. You can install the package by entering the following command in the Package Manager Console:
Install-Package IronPdf
The "IronPDF" package can also be found and downloaded using the NuGet Package Manager. Dependency management in your project is made simple with the NuGet Package Manager.
3.2 Extract ALL Text from PDF Files
The first method involves extracting text from a PDF; a sample of the code is provided below.
var pdfDocument = IronPdf.PdfDocument.FromFile("Demo.pdf");
string AllText = pdfDocument.ExtractAllText();
The code above uses the FromFile method to load an existing PDF file into a PdfDocument object. With this object, we can read the text and images on the PDF pages. The ExtractAllText method of the PdfDocument class extracts all the text in the PDF document and returns it as a string that can be processed further.
The second method extracts text from a PDF file page by page; a code example is provided below.
3.3 Extract Text from Individual Pages
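using IronPdf;

// Load the PDF, then extract the text of each page in turn.
PdfDocument PDF = PdfDocument.FromFile("result.pdf");
for (var index = 0; index < PDF.PageCount; index++)
{
    int PageNumber = index + 1;                    // human-readable page number
    string Text = PDF.ExtractTextFromPage(index);  // text of the page at this index
}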
The code above loads the entire PDF file into a PdfDocument object. The PageCount property then gives the total number of pages in the loaded document. Inside the for loop, the page index is passed as a parameter to the ExtractTextFromPage method to extract the text of that page, which is stored in a string variable. Either a for or a foreach loop can be used to extract content from the PDF page by page.
For more tutorials on IronPDF, refer to the IronPDF documentation.
4.0 Conclusion
It is important to remember that the Node.js code shown above has the potential to be abused and may present security problems when used by others. The risks of unauthorized access and data security flaws must be taken into account when integrating such code into a console application, web application, or website. It's also important to consider compatibility issues with various operating systems and browsers, including outdated browsers.
The IronPDF library, in contrast, offers strong security features to reduce potential dangers. It is not specialized to any one browser and is compatible with all popular ones. IronPDF allows programmers to easily produce and read PDF files with just a few lines of code.
The IronPDF library provides a range of licensing options, including a free developer license and extra development licenses that are available for purchase, to meet the needs of different developers.
The $749 Lite package includes a perpetual license, a 30-day money-back guarantee, a year of software maintenance, and upgrade options. There are no further fees beyond the initial purchase. These licenses can be used in development, staging, and production environments. In addition, IronPDF offers free licenses with some restrictions on duration and redistribution. See the IronPDF licensing page for more information on pricing and licensing.