Extract Text

IronPDF for Python can extract the text content of a PDF document — either from the entire document at once or from individual pages — using the ExtractAllText and ExtractTextFromPage methods.

Getting Started

Load a PdfDocument from a file using PdfDocument.FromFile, then call the appropriate extraction method. Both methods return a Python string containing the extracted text.
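
A minimal sketch of that flow, assuming IronPDF for Python is installed and importable as `ironpdf`. The helper name `read_pdf_text` and the file name `sample.pdf` are illustrative and not part of the library. The import sits inside the function so the snippet can be loaded even where the package is absent.

```python
def read_pdf_text(path):
    """Load a PDF from disk and return the text of every page as one string."""
    # Assumption: the package is importable under the name `ironpdf`.
    from ironpdf import PdfDocument

    pdf = PdfDocument.FromFile(path)  # open the existing document
    return pdf.ExtractAllText()       # text of all pages, in page order


# Example call (needs a real file and an installed IronPDF):
# print(read_pdf_text("sample.pdf"))
```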

Understanding the Code

  • PdfDocument.FromFile(path): Opens an existing PDF document from the specified file path. For password-protected files, pass the password as a second argument.
  • ExtractAllText(): Returns a single string containing the text extracted from all pages of the document, in page order.
  • ExtractTextFromPage(pageIndex): Returns the text content of a single page. The pageIndex argument is zero-based — pass 0 for the first page, 1 for the second, and so on.
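
The zero-based index and the optional password can be combined in one small helper. This is a sketch only: `read_page_text` and its parameters are hypothetical names, and it assumes the password is accepted as the second positional argument of `PdfDocument.FromFile`, as described above.

```python
def read_page_text(path, page_index, password=None):
    """Return the text of one zero-based page, optionally unlocking the file."""
    if page_index < 0:
        raise ValueError("page_index is zero-based and must not be negative")

    # Assumption: the package is importable under the name `ironpdf`.
    from ironpdf import PdfDocument

    # Pass the password as the second argument only when one is supplied.
    if password is None:
        pdf = PdfDocument.FromFile(path)
    else:
        pdf = PdfDocument.FromFile(path, password)

    return pdf.ExtractTextFromPage(page_index)


# Example calls (need real files and an installed IronPDF):
# first_page = read_page_text("sample.pdf", 0)             # first page
# second_page = read_page_text("locked.pdf", 1, "secret")  # second page
```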

Use Cases

Text extraction is useful for:

  • Search indexing — making PDF content searchable in a database or search engine.
  • Data parsing — extracting structured data such as invoice numbers, dates, or addresses from PDF reports.
  • Content verification — programmatically checking that generated PDFs contain the expected text.
  • Accessibility — converting PDF content to plain text for screen readers or downstream processing.
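
Because the extraction methods return ordinary strings, the data-parsing and content-verification cases need only the standard library once the text is in hand. The sketch below works on a hard-coded sample string in place of real extracted text. The invoice-number format and both function names are hypothetical and should be adapted to the documents being processed.

```python
import re

# Hypothetical invoice-number format, e.g. "INV-2024-0042".
INVOICE_PATTERN = re.compile(r"\bINV-\d{4}-\d{4}\b")


def find_invoice_numbers(text):
    """Return invoice numbers found in the text, in order, without duplicates."""
    seen = []
    for match in INVOICE_PATTERN.findall(text):
        if match not in seen:
            seen.append(match)
    return seen


def missing_phrases(text, expected):
    """Return the expected phrases that do not occur in the text (case-insensitive)."""
    lowered = text.lower()
    return [phrase for phrase in expected if phrase.lower() not in lowered]


# Stand-in for a string returned by ExtractAllText().
sample = (
    "Invoice INV-2024-0042\n"
    "Total due: 120.00\n"
    "Ref INV-2024-0042, see also INV-2023-0007"
)

print(find_invoice_numbers(sample))                            # ['INV-2024-0042', 'INV-2023-0007']
print(missing_phrases(sample, ["Total due", "Payment terms"]))  # ['Payment terms']
```

Text extracted from a PDF may not preserve the visual layout, so patterns that tolerate variable whitespace and line breaks tend to be more robust than ones that rely on exact spacing.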
