How to Read PDF Files in Java

PDF is one of the most widely used document formats for transferring data, mainly because it preserves formatting and presents content exactly as it was sent. Loading, opening, and viewing a PDF document requires a PDF reader. Many PDF readers are available, but opening a PDF file programmatically from your own application requires a suitable class library.

This article looks at one such library, which opens and reads a PDF file by filename in a Java program.

IronPDF

IronPDF for Java is built on top of the already successful IronPDF for .NET. This makes IronPDF a versatile tool for working with PDF documents compared to other class libraries such as Apache PDFBox. It can extract and parse content, including text and images. It also provides options to customize PDF pages, such as page layout, margins, headers and footers, and page orientation.

In addition, IronPDF supports conversion from other file formats, password protection, digital signing, and merging and splitting of PDF documents.

How to Read PDF Files in Java

Prerequisites

To build a Java PDF reader with IronPDF, first make sure the following components are installed on the computer:

  1. JDK - The Java Development Kit is required for building and running Java programs. If it is not installed, download it from the Oracle website.
  2. IDE - An Integrated Development Environment is software that helps you write, edit, and debug a program. Download any IDE for Java, e.g. Eclipse, NetBeans, or IntelliJ IDEA.
  3. Maven - Maven is a build automation tool that helps download libraries from the Central Repository. Download it from the Apache Maven website.
  4. IronPDF - Finally, IronPDF is required to read PDF files in Java. Add it as a dependency in your Java Maven project by including the IronPDF artifact, along with the slf4j dependency, in the pom.xml file as shown in the example below:

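    <dependency>
       <groupId>com.ironsoftware</groupId>
       <artifactId>ironpdf</artifactId>
       <version>2024.3.1</version>
    </dependency>
    <dependency>
       <groupId>org.slf4j</groupId>
       <artifactId>slf4j-simple</artifactId>
       <version>2.0.3</version>
    </dependency>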

Adding Necessary Imports

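First, add the following import at the top of the Java source file to reference the required IronPDF classes. No import from an org package is needed for this example. The later snippets also use java.nio.file.Paths and com.ironsoftware.ironpdf.edit.PageSelection, which the complete listing at the end imports explicitly.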

import com.ironsoftware.ironpdf.*;

Next, configure IronPDF with a valid license key so that its methods can be used. Call License.setLicenseKey in the main method.

License.setLicenseKey("Your license key");

Note: You can get a free trial license key to create, read and print PDFs.

Read an Existing PDF File in Java

Reading a PDF requires an existing file, or we can create one. Here we use an existing PDF file. Extracting the text from the document is a simple two-step process.

PdfDocument pdf = PdfDocument.fromFile(Paths.get("assets/sample.pdf"));
String text = pdf.extractAllText();
System.out.println(text);

In the above code, fromFile opens the PDF document. The Paths.get method converts the path string into a Path object that locates the file. Then, extractAllText reads all the text in the document.
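Because Paths.get only builds a Path object and does not touch the file system, a missing file surfaces only when the document is loaded. The following is a minimal sketch of checking the path up front using only the Java standard library. The class name PdfPathCheck and the helper requireReadablePdf are hypothetical names chosen for illustration and are not part of IronPDF.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical helper, not part of IronPDF: validates a path before it is
// handed to a PDF loader such as PdfDocument.fromFile.
final class PdfPathCheck {

    // Returns the absolute, normalized path if it points to a readable regular file.
    static Path requireReadablePdf(String location) throws IOException {
        Path path = Paths.get(location).toAbsolutePath().normalize();
        if (!Files.isRegularFile(path) || !Files.isReadable(path)) {
            throw new IOException("PDF not found or not readable: " + path);
        }
        return path;
    }

    public static void main(String[] args) {
        // Paths.get succeeds even when the file does not exist.
        Path candidate = Paths.get("assets/sample.pdf");
        System.out.println("Path object created: " + candidate);

        try {
            Path checked = requireReadablePdf("assets/sample.pdf");
            System.out.println("Ready to load: " + checked);
        } catch (IOException e) {
            System.out.println(e.getMessage());
        }
    }
}
```

The returned Path could then be passed to fromFile in place of the inline Paths.get call, so that a wrong location is reported with a clear message before any PDF processing starts.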

The output is below:

Figure 1: Reading PDF Text Output. The output generated from retrieving all the text from a PDF file.

Read Text from a Specific Page

IronPDF can also read content from a specific page in a PDF. The extractTextFromPage method accepts a PageSelection object that specifies the page or pages from which text will be read.

In the following example, we extract the text from the second page of the PDF document. PageSelection.singlePage takes the zero-based index of the page to extract, so index 1 selects the second page.

PdfDocument pdf = PdfDocument.fromFile(Paths.get("assets/sample.pdf"));
String text = pdf.extractTextFromPage(PageSelection.singlePage(1));
System.out.println(text);

Figure 2: Reading PDF Text Output. The output generated from retrieving the text from the second page of the sample PDF file.
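The example passes 1 to select the second page, which implies zero-based page indexes. As a small sketch under that assumption, a helper can convert the page numbers people usually use (starting at 1) into those indexes. PageIndex and toZeroBased are hypothetical names for illustration, not IronPDF API.

```java
// Hypothetical helper, not part of IronPDF: maps a 1-based page number
// to the zero-based index that the example above passes to singlePage.
final class PageIndex {

    // Validates the page number against the page count, then shifts it by one.
    static int toZeroBased(int pageNumber, int pageCount) {
        if (pageNumber < 1 || pageNumber > pageCount) {
            throw new IllegalArgumentException(
                "Page " + pageNumber + " is outside 1.." + pageCount);
        }
        return pageNumber - 1;
    }

    public static void main(String[] args) {
        // Page 2 of a 5-page document corresponds to index 1.
        System.out.println(toZeroBased(2, 5)); // prints 1
    }
}
```

Validating the page number first keeps an off-by-one mistake from silently selecting the wrong page.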

Other methods available in the PageSelection class which can be used to extract text from various pages include: [firstPage](/java/object-reference/api/com/ironsoftware/ironpdf/edit/PageSelection.html#firstPage()), [lastPage](/java/object-reference/api/com/ironsoftware/ironpdf/edit/PageSelection.html#lastPage()), pageRange, and [allPages](/java/object-reference/api/com/ironsoftware/ironpdf/edit/PageSelection.html#allPages()).

Read Text from a Newly-Generated PDF File

We can also extract text from a PDF newly generated from either an HTML file or a URL. The following sample code generates a PDF from a URL and extracts all the text from the web page.

PdfDocument pdf = PdfDocument.renderUrlAsPdf("https://unsplash.com/");
String text = pdf.extractAllText();
System.out.println("Text extracted from the website: " + text);

Figure 3: Read from a New File. Reading text from a new PDF file.
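Once the text is available as a String, it can be searched with plain Java. The sketch below uses only the standard library and returns the lines that contain a keyword, ignoring case. TextSearch and findLines are hypothetical names, and the sample string merely stands in for the value returned by extractAllText.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical helper, not part of IronPDF: searches extracted text line by line.
final class TextSearch {

    // Returns every line of the text that contains the keyword, ignoring case.
    static List<String> findLines(String text, String keyword) {
        List<String> matches = new ArrayList<>();
        String needle = keyword.toLowerCase(Locale.ROOT);
        // "\\R" matches any line break sequence (Java 8 and later).
        for (String line : text.split("\\R")) {
            if (line.toLowerCase(Locale.ROOT).contains(needle)) {
                matches.add(line.trim());
            }
        }
        return matches;
    }

    public static void main(String[] args) {
        // Placeholder text; in practice this would be the extracted PDF text.
        String text = "Free stock photos\nBrowse premium images\nPhotos for everyone";
        System.out.println(findLines(text, "photos"));
        // prints [Free stock photos, Photos for everyone]
    }
}
```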

IronPDF can also be used to extract images from PDF files.

The complete code is as follows:

import com.ironsoftware.ironpdf.License;
import com.ironsoftware.ironpdf.PdfDocument;
import com.ironsoftware.ironpdf.edit.PageSelection;

import java.io.IOException;
import java.nio.file.Paths;

public class Main {
    public static void main(String[] args) throws IOException {

        // Set the IronPDF license key
        License.setLicenseKey("YOUR LICENSE KEY HERE");

        // Load an existing PDF and print the text of its second page (index 1)
        PdfDocument pdf = PdfDocument.fromFile(Paths.get("assets/sample.pdf"));
        String text = pdf.extractTextFromPage(PageSelection.singlePage(1));
        System.out.println(text);

        // Render a web page as a PDF and print all of its text
        pdf = PdfDocument.renderUrlAsPdf("https://unsplash.com/");
        text = pdf.extractAllText();
        System.out.println("Text extracted from the website: " + text);

    }
}

Summary

In this article, we looked at how we can open and read PDFs in Java using IronPDF.

IronPDF makes it easy to create PDFs from HTML or a URL and to convert from other file formats. It also helps get PDF tasks done quickly and easily.

Try IronPDF for 30 days and find out how well it works for you in production. Commercial licenses start from $749.