Published May 9, 2023
How to Parse PDFs in Java (Developer Tutorial)
Portable Document Format (PDF) is a widely used format for sharing documents over the internet. It preserves formatting, gives users better control over the content, and is well suited to printing. Java programs often need to read data from a PDF file, and building a PDF parser from scratch to extract text from different sections of a document is tedious. With the emergence of numerous libraries, creating a Java PDF parser and extracting text has become much easier.
In this article, we will create a PDF parser in Java using the IronPDF Library.
IronPDF - Java PDF Library
IronPDF is a Java PDF library that enables the creation, reading, and manipulation of PDF documents with ease and accuracy. It is built on the success of IronPDF for .NET and provides efficient functionality across different platforms. IronPDF for Java utilizes the IronPdfEngine, which is fast and optimized for performance.
With IronPDF, you can parse PDF pages and extract text, images, and other objects from PDF files. It also enables the creation of PDFs from HTML strings, files, URLs, and images, as well as conversion between different file formats. Additionally, you can easily add new content, stamp signatures, and add metadata to existing PDF documents. IronPDF is designed specifically for Java 8+, Scala, and Kotlin, and is compatible with Windows, Linux, and Cloud platforms.
How to Parse a PDF File in Java
Create PDF File Parser using IronPDF in Java Program
Prerequisites
To make a PDF Parsing project in Java, you will need the following tools:
- Java IDE: You can use any IDE that supports Java. There are multiple Java IDEs available for development, such as NetBeans and Eclipse. Here we will be using IntelliJ.
- Maven Project: Maven is a dependency manager and allows control over the Java project. Maven for Java can be downloaded from here. IntelliJ has built-in support for Maven.
- IronPDF: You can download and install IronPDF for Java in multiple ways.
  - Add the IronPDF dependency to the pom.xml file in a Maven project:
    <dependency>
        <groupId>com.ironsoftware</groupId>
        <artifactId>ironpdf</artifactId>
        <version>2023.9.2</version>
    </dependency>
  - Visit the Maven website and download the latest IronPDF package for Java; it can be downloaded here.
  - Download it directly from the IronPDF website through this link.
  - Manually install IronPDF using the JAR file in your simple Java application.
- Slf4j-Simple: This dependency is also required to stamp content to an existing document. It can be added using the Maven dependencies manager in IntelliJ, or it can be directly downloaded from the Maven website. Add the following dependency to the pom.xml file:
    <dependency>
        <groupId>org.slf4j</groupId>
        <artifactId>slf4j-simple</artifactId>
        <version>2.0.5</version>
    </dependency>
Adding the Necessary Imports
Once all the prerequisites are installed, we need to import the necessary IronPDF packages to work with a PDF document. Add the following code at the top of the Main.java file:
import com.ironsoftware.ironpdf.*;
import java.io.IOException;
import java.nio.file.Paths;
License Key
Some methods available in IronPDF require a license to be used. You can purchase a license or try IronPDF free in a 30-day trial. You can set the key as follows:
License.setLicenseKey("YOUR-KEY");
Step 1: Parse an Existing PDF document
To parse an existing document for content extraction, the PdfDocument class is used. Its static fromFile method loads a PDF file from a given path in a Java program. The code is as follows:
PdfDocument parsedDocument = PdfDocument.fromFile(Paths.get("sample.pdf"));
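Before handing a path to fromFile, it can be useful to confirm that the file exists and at least starts like a PDF. The following is a minimal sketch that uses only the JDK. The class PdfPreflight and its method looksLikePdf are hypothetical helpers introduced for illustration and are not part of IronPDF. The check relies on the fact that a well-formed PDF file begins with the ASCII marker "%PDF-".

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;

// Hypothetical helper (not part of IronPDF): a cheap sanity check before parsing.
class PdfPreflight {
    // A well-formed PDF file starts with the ASCII bytes "%PDF-".
    private static final byte[] MAGIC = "%PDF-".getBytes(StandardCharsets.US_ASCII);

    // Returns true only if the path is a readable regular file whose first bytes match the marker.
    static boolean looksLikePdf(Path path) throws IOException {
        if (!Files.isRegularFile(path) || !Files.isReadable(path)) {
            return false;
        }
        try (InputStream in = Files.newInputStream(path)) {
            byte[] head = new byte[MAGIC.length];
            int total = 0;
            // Keep reading until the header buffer is full or the file ends.
            while (total < head.length) {
                int n = in.read(head, total, head.length - total);
                if (n < 0) {
                    break;
                }
                total += n;
            }
            return total == head.length && Arrays.equals(head, MAGIC);
        }
    }
}
```

A caller could run PdfPreflight.looksLikePdf(Paths.get("sample.pdf")) first and only call fromFile when it returns true. This does not validate the rest of the document; it only filters out missing files and obviously wrong file types early.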
Step 2: Extract Text Data from Parsed PDF file
IronPDF for Java provides an easy method for extracting text from PDF documents. The following code snippet extracts all text data from the parsed PDF file:
String extracted_text = parsedDocument.extractAllText();
The above code produces the output given below:
Step 3: Extract Text Data from URLs or HTML String
IronPDF for Java is not restricted to existing PDFs; it can also create a new PDF and parse it to extract content. Here, we will create a PDF file from a URL and then extract content from it. The following example shows how to achieve this task:
public class Main {
    public static void main(String[] args) throws IOException {
        License.setLicenseKey("YOUR-KEY");
        PdfDocument parsedDocument = PdfDocument.renderUrlAsPdf("https://ironpdf.com/java/");
        String extracted_text = parsedDocument.extractAllText();
        System.out.println("Text Extracted from URL:\n" + extracted_text);
    }
}
The output is as follows:
Step 4: Extract Images from Parsed PDF Document
IronPDF also provides an easy option to extract all images from the parsed document. Here we will reuse the previous example to see how easily images are extracted from PDF files.
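// Additional imports needed at the top of Main.java
import java.awt.image.BufferedImage;
import java.nio.file.Files;
import java.util.List;
import javax.imageio.ImageIO;

public static void main(String[] args) throws IOException {
    License.setLicenseKey("YOUR-KEY");
    PdfDocument parsedDocument = PdfDocument.renderUrlAsPdf("https://ironpdf.com/java/");
    try {
        // extractAllImages returns every image in the document as a BufferedImage
        List<BufferedImage> images = parsedDocument.extractAllImages();
        System.out.println("Number of images extracted from the website: " + images.size());
        // Make sure the output folder exists before writing into it
        Files.createDirectories(Paths.get("assets"));
        int i = 0;
        for (BufferedImage image : images) {
            // Save each image as assets/extracted_1.png, assets/extracted_2.png, ...
            ImageIO.write(image, "PNG", Paths.get("assets/extracted_" + ++i + ".png").toFile());
        }
    } catch (Exception exception) {
        System.out.println("Failed to extract images from the website");
        exception.printStackTrace();
    }
}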
The extractAllImages method returns a list of BufferedImage objects. Each BufferedImage can then be saved as a PNG file at a chosen location using the ImageIO.write method. There are 34 images in the parsed PDF file, and every image is extracted.
Step 5: Extract Data from Table in PDF Files
Extracting content from a table in a PDF file takes a single line of code with the extractAllText method. The following code snippet demonstrates how to extract text from a table in a PDF file:
PdfDocument parsedDocument = PdfDocument.fromFile(Paths.get("table.pdf"));
String extracted_text = parsedDocument.extractAllText();
System.out.println(extracted_text);
The output is as follows:
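Because extractAllText returns the table as plain text, rows and cells still have to be recovered from that string. The following is a minimal sketch of one way to do this with the JDK only. It assumes that each table row ends up on its own line and that cells are separated by a tab or by two or more spaces. That spacing is an assumption made for illustration, not something the library guarantees, so the pattern will likely need adjusting for a real document. The class TableTextSplitter is a hypothetical helper, not part of IronPDF.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper (not part of IronPDF): splits extracted table text into rows and cells.
class TableTextSplitter {
    // Assumption: one table row per line, cells separated by 2+ whitespace characters or a tab.
    static List<List<String>> toRows(String extractedText) {
        List<List<String>> rows = new ArrayList<>();
        for (String line : extractedText.split("\\R")) {
            String trimmed = line.trim();
            if (trimmed.isEmpty()) {
                continue; // skip blank lines between rows
            }
            rows.add(Arrays.asList(trimmed.split("\\s{2,}|\\t")));
        }
        return rows;
    }

    public static void main(String[] args) {
        // Stand-in for the string returned by extractAllText()
        String sample = "Name   Qty   Price\nWidget   2   9.50\n\nGadget   10   3.25\n";
        for (List<String> row : toRows(sample)) {
            System.out.println(row);
        }
    }
}
```

Cells that contain double spaces themselves, or tables whose columns collapse to single spaces during extraction, will not split correctly with this pattern.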
Conclusion
In this article, we have learned how to parse an existing PDF document, or create a new PDF from a URL and parse it, to extract data in Java using IronPDF. After opening the file, we can extract tabular data, images, and text from the PDF. We can also write the extracted text to a text file for later use.
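Saving the extracted text to a file needs nothing beyond the JDK. The following is a minimal sketch, in which the class ExtractedTextWriter, its save method, and the output path are hypothetical names chosen for illustration. The placeholder string stands in for the value returned by extractAllText.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical helper (not part of IronPDF): writes extracted text to a UTF-8 file.
class ExtractedTextWriter {
    // Creates missing parent folders, then writes the text, replacing any existing file.
    static Path save(String text, Path target) throws IOException {
        Path parent = target.toAbsolutePath().getParent();
        if (parent != null) {
            Files.createDirectories(parent);
        }
        return Files.write(target, text.getBytes(StandardCharsets.UTF_8));
    }

    public static void main(String[] args) throws IOException {
        // Placeholder; in the tutorial this would be parsedDocument.extractAllText()
        String extractedText = "Text extracted from a PDF";
        Path written = save(extractedText, Paths.get("output", "extracted.txt"));
        System.out.println("Saved to " + written.toAbsolutePath());
    }
}
```

Writing with an explicit UTF-8 charset avoids depending on the platform default encoding, which matters when the PDF contains non-ASCII characters.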
For more detailed information on how to work with PDF files programmatically in Java, please visit this link.
The IronPDF for Java library is free for development purposes with a 30-day free trial. However, for commercial use it can be licensed, starting at $749.