
How to Parse PDF in Java (Developer Tutorial)

This article shows how to build a PDF parser in Java efficiently using the IronPDF library.

IronPDF - Java PDF Library

IronPDF for Java is a Java PDF library that enables the creation, reading, and manipulation of PDF documents with ease and accuracy. It builds on the success of IronPDF for .NET and provides efficient functionality across different platforms. IronPDF for Java uses the IronPdfEngine, which is fast and optimized for performance.

With IronPDF, you can extract text and images from PDF files and create PDFs from various sources, including HTML strings, files, URLs, and images. You can also add new content, insert signatures, and embed metadata into PDF documents. IronPDF is designed for Java 8+, Scala, and Kotlin, and is compatible with Windows, Linux, and cloud platforms.

Create a PDF File Parser Using IronPDF in a Java Program

Prerequisites

To build a PDF parsing project in Java, you will need the following tools:

  1. Java IDE: Any IDE that supports Java will work, such as NetBeans or Eclipse. This tutorial uses IntelliJ IDEA.
  2. Maven Project: Maven is a dependency manager and build tool for Java projects. It can be downloaded from the official Maven website, and IntelliJ has built-in support for it.
  3. IronPDF: You can download and install IronPDF for Java in multiple ways.

    • Add the IronPDF dependency to the pom.xml file in a Maven project.

      <dependency>
       <groupId>com.ironsoftware</groupId>
       <artifactId>ironpdf</artifactId>
       <version>[LATEST_VERSION]</version>
      </dependency>
    • Visit the Maven repository website for the latest IronPDF package for Java.
    • Download it directly from the Iron Software official download page.
    • Install IronPDF manually using the JAR file in a simple Java application.
  4. Slf4j-Simple: This dependency is also required by the project; it provides a simple SLF4J logging binding. It can be added using the Maven dependency manager in IntelliJ, or it can be downloaded directly from the Maven website. Add the following dependency to the pom.xml file:

    <dependency>
        <groupId>org.slf4j</groupId>
        <artifactId>slf4j-simple</artifactId>
        <version>2.0.5</version>
    </dependency>

Adding the Necessary Imports

Once all the prerequisites are installed, the first step is to import the IronPDF packages needed to work with a PDF document. Add the following code at the top of the Main.java file:

import com.ironsoftware.ironpdf.*;
import java.io.IOException;
import java.nio.file.Paths;

License Key

Some methods in IronPDF require a license. You can purchase a license or try IronPDF with a free trial. You can set the key as follows:

License.setLicenseKey("YOUR-KEY");
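Hardcoding the key is fine for a quick test, but in shared code it is often kept outside the source. The following is a minimal sketch of one way to do that, assuming the key is stored in an environment variable. The LicenseKeys class and the IRONPDF_LICENSE_KEY variable name are hypothetical and not part of IronPDF; the License.setLicenseKey call appears only as a comment so the sketch runs without the library.

```java
import java.util.Map;
import java.util.Optional;

// Hypothetical helper (not part of IronPDF): looks up a license key
// in a map of environment variables instead of hardcoding it.
final class LicenseKeys {
    // Assumed variable name; pick whatever fits your deployment.
    static final String ENV_NAME = "IRONPDF_LICENSE_KEY";

    // Returns the trimmed key, or an empty Optional when it is missing or blank.
    static Optional<String> fromEnvironment(Map<String, String> env) {
        String value = env.get(ENV_NAME);
        if (value == null || value.trim().isEmpty()) {
            return Optional.empty();
        }
        return Optional.of(value.trim());
    }

    public static void main(String[] args) {
        Optional<String> key = fromEnvironment(System.getenv());
        if (key.isPresent()) {
            // In the tutorial's program, the key would be applied here:
            // License.setLicenseKey(key.get());
            System.out.println("License key found in " + ENV_NAME);
        } else {
            System.out.println(ENV_NAME + " is not set");
        }
    }
}
```

Passing the environment in as a Map keeps the lookup easy to test without touching the real process environment.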

Step 1: Parse an Existing PDF document

To parse an existing document for content extraction, use the PdfDocument class. Its static fromFile method loads a PDF file from a given path in a Java program. The code is as follows:

PdfDocument parsedDocument = PdfDocument.fromFile(Paths.get("sample.pdf"));

Figure 1: Parsed document
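Before handing a path to fromFile, it can help to confirm that the file exists and actually looks like a PDF. PDF files normally begin with the marker %PDF-, so a rough pre-flight check is possible with the JDK alone. The sketch below is an illustration under that assumption; PdfPreflight and its methods are hypothetical names, not IronPDF API, and the fromFile call is left as a comment.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical pre-flight check (not part of IronPDF).
final class PdfPreflight {
    private static final byte[] MAGIC = "%PDF-".getBytes(StandardCharsets.US_ASCII);

    // True when the given bytes begin with the "%PDF-" marker.
    static boolean hasPdfHeader(byte[] head) {
        if (head == null || head.length < MAGIC.length) {
            return false;
        }
        for (int i = 0; i < MAGIC.length; i++) {
            if (head[i] != MAGIC[i]) {
                return false;
            }
        }
        return true;
    }

    // Reads the first bytes of the file; returns false if it is missing or unreadable.
    static boolean looksLikePdf(Path path) {
        if (!Files.isRegularFile(path)) {
            return false;
        }
        try (InputStream in = Files.newInputStream(path)) {
            byte[] head = new byte[MAGIC.length];
            int total = 0;
            while (total < head.length) {
                int n = in.read(head, total, head.length - total);
                if (n < 0) {
                    break; // file is shorter than the marker
                }
                total += n;
            }
            return total == head.length && hasPdfHeader(head);
        } catch (IOException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        Path path = Paths.get("sample.pdf");
        if (looksLikePdf(path)) {
            // PdfDocument parsedDocument = PdfDocument.fromFile(path);
            System.out.println(path + " looks like a PDF");
        } else {
            System.out.println(path + " is missing or is not a PDF");
        }
    }
}
```

This check is deliberately strict: a file with extra bytes before the marker is reported as not a PDF, even though some readers tolerate that.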

Step 2: Extract Text Data from Parsed PDF file

IronPDF for Java provides an easy method for extracting text from PDF documents. The following code snippet extracts the text data from a PDF file:

String extractedText = parsedDocument.extractAllText();

The above code produces the output given below:

Figure 2: Output
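Once the text has been extracted, it is often stored for later use. The following is a minimal sketch of saving it to a UTF-8 text file with the standard library, using only Java 8 APIs. ExtractedTextFile and the extracted.txt file name are hypothetical, and a placeholder string stands in for the value returned by extractAllText so the example runs on its own.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical helper (not part of IronPDF) for storing extracted text.
final class ExtractedTextFile {
    // Writes the text as UTF-8, creating parent folders and replacing any existing file.
    static void save(String text, Path target) {
        try {
            Path parent = target.toAbsolutePath().getParent();
            if (parent != null) {
                Files.createDirectories(parent);
            }
            Files.write(target, text.getBytes(StandardCharsets.UTF_8));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Reads the whole file back as a UTF-8 string.
    static String load(Path source) {
        try {
            return new String(Files.readAllBytes(source), StandardCharsets.UTF_8);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // Stand-in for: String extractedText = parsedDocument.extractAllText();
        String extractedText = "Sample text standing in for extractAllText() output";
        Path target = Paths.get("extracted.txt");
        save(extractedText, target);
        System.out.println("Saved " + load(target).length() + " characters to " + target);
    }
}
```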

Step 3: Extract Text Data from URLs or HTML String

IronPDF for Java is not limited to existing PDFs; it can also create a new PDF and then parse it to extract content. Here, the tutorial creates a PDF file from a URL and extracts its text. The following example shows how to achieve this:

public class Main {
    public static void main(String[] args) throws IOException {
        License.setLicenseKey("YOUR-KEY");

        PdfDocument parsedDocument = PdfDocument.renderUrlAsPdf("https://ironpdf.com/java/");
        String extractedText = parsedDocument.extractAllText();
        System.out.println("Text Extracted from URL:\n" + extractedText);
    }
}

The output is as follows:

Figure 3: Output
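When the address comes from user input or a configuration file rather than a literal, a quick sanity check before rendering can give a clearer error message. The sketch below is one possible check using java.net.URI, assuming only http and https addresses should be rendered. UrlCheck is a hypothetical helper, not IronPDF API, and the renderUrlAsPdf call is shown only as a comment.

```java
import java.net.URI;
import java.net.URISyntaxException;

// Hypothetical helper (not part of IronPDF).
final class UrlCheck {
    // True when the string parses as a URI with an http or https scheme and a host.
    static boolean isHttpUrl(String candidate) {
        if (candidate == null) {
            return false;
        }
        try {
            URI uri = new URI(candidate.trim());
            String scheme = uri.getScheme();
            boolean web = "http".equalsIgnoreCase(scheme) || "https".equalsIgnoreCase(scheme);
            return web && uri.getHost() != null;
        } catch (URISyntaxException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        String url = "https://ironpdf.com/java/";
        if (isHttpUrl(url)) {
            // PdfDocument parsedDocument = PdfDocument.renderUrlAsPdf(url);
            System.out.println("Ready to render " + url);
        } else {
            System.out.println("Not a usable web address: " + url);
        }
    }
}
```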

Step 4: Extract Images from Parsed PDF Document

IronPDF also provides an easy way to extract all images from a parsed document. The tutorial reuses the previous example to show how images are extracted from the PDF file:

import com.ironsoftware.ironpdf.*;

import javax.imageio.ImageIO;
import java.awt.image.BufferedImage;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.List;

public class Main {
    public static void main(String[] args) throws IOException {
        License.setLicenseKey("YOUR-KEY");

        PdfDocument parsedDocument = PdfDocument.renderUrlAsPdf("https://ironpdf.com/java/");

        try {
            List<BufferedImage> images = parsedDocument.extractAllImages();
            System.out.println("Number of images extracted from the website: " + images.size());

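            // Create the output folder if it does not exist yet
            Files.createDirectories(Paths.get("assets"));

            int i = 0;
            for (BufferedImage image : images) {
                // Writing to a File lets ImageIO open and close the stream itself
                ImageIO.write(image, "PNG", Paths.get("assets/extracted_" + ++i + ".png").toFile());
            }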
        } catch (Exception exception) {
            System.out.println("Failed to extract images from the website");
            exception.printStackTrace();
        }
    }
}

The [extractAllImages](/java/object-reference/api/com/ironsoftware/ironpdf/PdfDocument.html#extractAllImages()) method returns a list of BufferedImage objects. Each BufferedImage can then be saved as a PNG file at a chosen location using the ImageIO.write method. In this example, the parsed PDF file contains 34 images, and all of them are extracted.

Figure 4: Extracted images
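A page rendered from a website can contain many small icons alongside the larger pictures, so it may be useful to filter the extracted list before saving. The following is a small sketch of such a filter, assuming that pixel size is a good enough criterion. ImageFilter and the 100-pixel threshold are hypothetical choices, and blank images stand in for the list returned by extractAllImages.

```java
import java.awt.image.BufferedImage;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper (not part of IronPDF).
final class ImageFilter {
    // Returns the images whose width and height both reach the given minimums.
    static List<BufferedImage> atLeast(List<BufferedImage> images, int minWidth, int minHeight) {
        List<BufferedImage> kept = new ArrayList<>();
        for (BufferedImage image : images) {
            if (image.getWidth() >= minWidth && image.getHeight() >= minHeight) {
                kept.add(image);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        // Stand-in for: List<BufferedImage> images = parsedDocument.extractAllImages();
        List<BufferedImage> images = Arrays.asList(
                new BufferedImage(16, 16, BufferedImage.TYPE_INT_RGB),
                new BufferedImage(640, 480, BufferedImage.TYPE_INT_RGB));

        List<BufferedImage> large = atLeast(images, 100, 100);
        System.out.println(large.size() + " of " + images.size() + " images kept");
    }
}
```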

Step 5: Extract Data from Table in PDF Files

Text inside tables in a PDF file can be extracted with a single line of code using the [extractAllText method](/java/object-reference/api/com/ironsoftware/ironpdf/PdfDocument.html#extractAllText()). The following code snippet demonstrates how to extract text from a table in a PDF file:

Figure 5: Table in PDF

PdfDocument parsedDocument = PdfDocument.fromFile(Paths.get("table.pdf"));
String extractedText = parsedDocument.extractAllText();
System.out.println(extractedText);

The output is as follows:

Figure 6: Output
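The extracted table arrives as plain text, so turning it back into rows and cells takes a little post-processing. The sketch below shows one possible approach, under the assumption that each table row comes out as one line and that cells are separated by a tab or by two or more spaces. The real layout depends on the PDF, so the separator rule may need adjusting. TableText is a hypothetical helper, and the sample string is made up for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper (not part of IronPDF).
final class TableText {
    // Splits text into rows (one per non-blank line) and cells
    // (separated by a tab or by a run of two or more whitespace characters).
    static List<List<String>> parseRows(String text) {
        List<List<String>> rows = new ArrayList<>();
        for (String line : text.split("\\R")) {
            String trimmed = line.trim();
            if (trimmed.isEmpty()) {
                continue; // skip blank lines between rows
            }
            rows.add(Arrays.asList(trimmed.split("\\s*\\t\\s*|\\s{2,}")));
        }
        return rows;
    }

    public static void main(String[] args) {
        // Stand-in for: String extractedText = parsedDocument.extractAllText();
        String extractedText = "Item      Qty   Price\nWidget A  3     9.99\nGadget    12    4.50\n";
        for (List<String> row : parseRows(extractedText)) {
            System.out.println(String.join(" | ", row));
        }
    }
}
```

Single spaces inside a cell, as in "Widget A", are preserved because only tabs and longer whitespace runs count as separators.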

Conclusion

This article demonstrated how to parse an existing PDF document, or a new PDF rendered from a URL, to extract data in Java using IronPDF. After opening the file, IronPDF can extract tabular data, images, and text from the PDF, and the extracted text can then be saved to a text file for later use.

For more detailed information on how to work with PDF files programmatically in Java, please visit these PDF file creation examples.

The IronPDF for Java library is free for development purposes, and a free trial is available. For commercial use, it can be licensed through Iron Software, starting at $799.

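Frequently Asked Questions

How do I create a PDF parser in Java?

To create a PDF parser in Java, you can use the IronPDF library. First download and install IronPDF, then load your PDF document with the fromFile method. You can extract text and images using the extractAllText and extractAllImages methods, respectively.

Can IronPDF be used with Java 8+?

Yes, IronPDF is compatible with Java 8 and above, as well as Scala and Kotlin. It supports multiple platforms, including Windows, Linux, and cloud environments.

What are the key steps for parsing a PDF with IronPDF in Java?

The key steps are setting up a Maven project, adding the IronPDF dependency, loading the PDF document with fromFile, extracting text with extractAllText, and extracting images with extractAllImages.

How do I convert a URL to a PDF in Java?

You can use IronPDF's renderUrlAsPdf method to convert a URL to a PDF in Java. This lets you render a web page efficiently as a PDF document.

Is IronPDF suitable for cloud-based Java applications?

Yes, IronPDF is designed to be versatile and supports cloud-based environments, making it well suited to Java applications that need PDF functionality in the cloud.

How do I manage dependencies for a Java PDF parsing project?

You can use Maven to manage the dependencies of a Java project. Add the IronPDF library to your project's pom.xml file as a dependency.

What licensing options does IronPDF offer?

IronPDF offers a free trial for development purposes. Commercial use, however, requires a license, which provides access to all features and priority support.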

Darrius Serrant
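Full Stack Software Engineer (WebOps)

Darrius Serrant holds a bachelor's degree in computer science from the University of Miami and works as a Full Stack WebOps Marketing Engineer at Iron Software. Drawn to coding from a young age, he saw computing as both mysterious and accessible, making it the perfect medium for creativity and problem-solving.

At Iron Software, Darrius enjoys creating new things and simplifying complex concepts to make them more understandable. As one of our resident developers, he has also volunteered to teach students, sharing his expertise with the next generation.

For Darrius, his work is fulfilling because it is valued and has a real impact.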