Published February 17, 2023
How to Convert PDF to Text in Python (Tutorial)
PDF is one of the most widely used formats for sharing documents across the world. PDF stands for Portable Document Format and uses the .pdf extension. It is used to present and exchange documents reliably, independent of software, hardware, or operating systems.
In this article, we will create a simple PDF to text converter in Python. There are a lot of online applications available for this purpose, but how cool would it be if you could create your own PDF to text converter using a simple Python script?
Let's get started!
Convert PDF to Text in Python
What is Python?
Python is a programming language used to build websites and software, automate tasks, and conduct data analysis. Here we are going to use this high-level language to convert and extract text from PDF documents.
Steps to Extract Text from a PDF Document
To follow these steps, you need Python 3 installed on your computer. You can download and install Python from the official website.
1: Create a PDF File
- Open a new Word Document.
- Type anything in the Word Document.
- Now go to File -> Print -> Save.
- Save the PDF file as "PDF_to_Text_Python.pdf" in the same location as the Python script file.
You can also use an existing PDF file as an alternative to creating a new one using the steps above.
For this example, we are going to use the following PDF File:

This tutorial's sample PDF File
2: Install PyPDF2 Python Library
Firstly, we will download and install an external package named PyPDF2.
PyPDF2 is a pure Python PDF library that you can use for splitting, merging, and manipulating PDFs.
To install the PyPDF2 package, open your Windows command prompt or Windows PowerShell and use the following pip command to install PyPDF2.
PS C:\Users\hp> pip3 install PyPDF2
Pip is the standard package installer for Python. It downloads and installs third-party packages that provide features and functionality not found in the Python standard library.
Note: Python must be added to the PATH to execute this command from anywhere on the command line. The pip3 command runs the pip that belongs to your Python 3 installation, which avoids installing into Python 2 by accident on systems that have both.
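To confirm that the install succeeded before moving on, a short check like the following can help. It is a minimal sketch that uses only the standard library (importlib.metadata, available in Python 3.8 and later); the helper name installed_version is an illustrative choice, not part of pip or PyPDF2.

```python
from importlib import metadata


def installed_version(package_name):
    """Return the installed version of a package, or None if it is missing."""
    try:
        return metadata.version(package_name)
    except metadata.PackageNotFoundError:
        return None


version = installed_version("PyPDF2")
if version is None:
    print("PyPDF2 is not installed; run: pip3 install PyPDF2")
else:
    print("PyPDF2 version:", version)
```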
3: Open a New Python File
Open the Python IDLE application and press Ctrl + N. A new editor window opens. You can also use any text editor you prefer.
Text Editor
- Save the file as pdftotext.py, in the same location as the PDF file that we will be converting.
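Keeping the script and the PDF in the same folder matters because a bare file name is resolved against the folder you run the command from, not the folder the script lives in. If you run the script from elsewhere, one option is to build the path from the script's own location. The sketch below assumes nothing beyond the standard library; sibling_path is a hypothetical helper name.

```python
from pathlib import Path


def sibling_path(script_file, file_name):
    """Build the path of a file that sits in the same folder as the script."""
    return Path(script_file).parent / file_name


# Inside pdftotext.py you would pass __file__ as the first argument.
pdf_path = sibling_path("pdftotext.py", "PDF_to_Text_Python.pdf")
print(pdf_path, "exists:", pdf_path.exists())
```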
Now that everything is ready, let's write the code to convert PDF to text using Python.
Type the following code in your text editor:
# importing required modules
import PyPDF2
# create a pdf file object
pdfFileObj = open('PDF_to_Text_Python.pdf', 'rb')
# create a pdf reader object
pdfReader = PyPDF2.PdfReader(pdfFileObj)
# print number of pages in the pdf file
print("Page Number:", len(pdfReader.pages))
# create a page object for the first page
pageObj = pdfReader.pages[0]
# extract text from page
text = pageObj.extract_text()
# display just the text
print(text)
# save to a text file for later use
# use the path of the folder where the script and pdf are placed
file1 = open(r"C:\Users\hp\Downloads\convertedtext.txt", "a")
file1.write(text)
# closing the pdf file object
pdfFileObj.close()
# closing the text file object
file1.close()

Commandline output after executing the pdftotext.py Python script
The output is also saved in the text file:

Text File Output
Let's have a look at the code line by line to understand:
pdfFileObj = open('PDF_to_Text_Python.pdf', 'rb')
We open PDF_to_Text_Python.pdf in binary mode and store the file object in the variable pdfFileObj.
pdfReader = PyPDF2.PdfReader(pdfFileObj)
Here, we create an object of the PdfReader class of the PyPDF2 module and pass it the PDF file object. The constructor returns a PdfReader object. Older tutorials use the names PdfFileReader, numPages, getPage, and extractText, which PyPDF2 3.0 removed.
print("Page Number:", len(pdfReader.pages))
The pages property is a list-like sequence of the pages in the PDF file, so its length gives the number of pages.
pageObj = pdfReader.pages[0]
Now, we get an object of the PageObject class of the PyPDF2 module. Indexing pages with a page number (starting at 0) returns the page object for that page.
text = pageObj.extract_text()
The page object has the extract_text method, which extracts text from the PDF page.
file1 = open(r"C:\Users\hp\Downloads\convertedtext.txt", "a")
file1.write(text)
Next, we open convertedtext.txt in append mode and save our text to it using the write method.
pdfFileObj.close()
file1.close()
Finally, we close the PDF file object and the text file object.
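The script above reads only the first page (index 0). A minimal sketch for covering every page follows. It assumes a PyPDF2 release that provides the PdfReader class, its pages sequence, and the extract_text method; the helper names join_page_texts and pdf_to_text are illustrative and not part of PyPDF2.

```python
def join_page_texts(pages):
    """Join the text of every page, separated by blank lines."""
    # The `or ""` guard covers a page that yields no text at all.
    return "\n\n".join((page.extract_text() or "") for page in pages)


def pdf_to_text(pdf_path, txt_path):
    """Extract text from all pages of pdf_path and write it to txt_path."""
    import PyPDF2  # imported here so join_page_texts works without PyPDF2

    # The with blocks close both files automatically, even on errors.
    with open(pdf_path, "rb") as pdf_file:
        reader = PyPDF2.PdfReader(pdf_file)
        text = join_page_texts(reader.pages)
    # "w" overwrites the file; the tutorial script uses "a" to append instead.
    with open(txt_path, "w", encoding="utf-8") as txt_file:
        txt_file.write(text)
    return text
```

Calling pdf_to_text("PDF_to_Text_Python.pdf", "convertedtext.txt") from the folder that holds the PDF would write the text of all pages to convertedtext.txt in that folder.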
This script only converts text-based PDFs to text. To extract text from image-based PDFs, such as scanned documents, you'll need Optical Character Recognition (OCR), for example pytesseract for OCR and OpenCV for image pre-processing. That is beyond the scope of this article, as it involves a machine-learning approach.
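One practical way to tell the two kinds of PDF apart is to check whether extraction returned any visible characters. The following is a heuristic sketch only; the function name looks_image_based and the min_chars threshold are assumptions made for illustration, not PyPDF2 features.

```python
def looks_image_based(page_texts, min_chars=1):
    """Guess whether a PDF is image-based from its extracted page texts.

    Returns True when the pages together contain fewer than min_chars
    non-whitespace characters, which suggests OCR is needed.
    """
    visible = sum(len("".join((text or "").split())) for text in page_texts)
    return visible < min_chars


print(looks_image_based(["", "  \n"]))   # True
print(looks_image_based(["Hello PDF"]))  # False
```

Here page_texts would be the list of strings returned by extract_text for each page. A scanned PDF with a hidden text layer would still pass this check, so treat the result as a hint rather than a guarantee.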
The IronPDF Library
IronPDF is a useful tool for generating PDF documents in .NET projects. A common use of this library is “HTML to PDF” rendering, where HTML is used as the design language for rendering a PDF document.
IronPDF uses a .NET Chromium engine to render HTML pages to PDF files. With HTML to PDF conversion, there is no need to use complex APIs to position or design PDFs. IronPDF also supports all standard web page technologies: HTML, ASPX, JS, CSS, and images.
It also enables you to create PDF documents in .NET using HTML5, CSS, JavaScript, and images. You can edit, stamp, and add headers and footers to a PDF effortlessly. Furthermore, it makes it very easy to read PDF text and extract images.
In many cases, you can extract embedded text from PDFs directly. The following code helps you extract text from a PDF:
using IronPdf;
using PdfDocument PDF = PdfDocument.FromFile("your_pdf_filename.pdf", "password");
//Get all text
string Text = PDF.ExtractAllText();
Imports IronPdf
Using PDF As PdfDocument = PdfDocument.FromFile("your_pdf_filename.pdf", "password")
    'Get all text
    Dim Text As String = PDF.ExtractAllText()
End Using
If that doesn't work, the text is probably embedded in an image. You can use the IronOCR library with IronPDF to recognize text that is stored as an image rather than as plain text.
The following code will help you achieve this task:
using IronOcr;
using IronPdf;
// Create the OCR engine used by Ocr.Read below
var Ocr = new IronTesseract();
using (var Input = new OcrInput())
{
    // OCR entire document
    Input.AddPdf("your_pdf_filename.pdf", "password");
    // Image Quality
    Input.Deskew();
    Input.DeNoise();
    var Result = Ocr.Read(Input);
    Console.WriteLine(Result.Text);
    var Barcodes = Result.Barcodes;
    var Text = Result.Text;
}
Imports IronOcr
Imports IronPdf
' Create the OCR engine used by Ocr.Read below
Dim Ocr = New IronTesseract()
Using Input = New OcrInput()
    ' OCR entire document
    Input.AddPdf("your_pdf_filename.pdf", "password")
    ' Image Quality
    Input.Deskew()
    Input.DeNoise()
    Dim Result = Ocr.Read(Input)
    Console.WriteLine(Result.Text)
    Dim Barcodes = Result.Barcodes
    Dim Text = Result.Text
End Using
You can see from the above code that it's quite simple and clean. Very few lines of code are needed to extract text from an image-based PDF. It's a fast, reliable, and time-saving solution with accurate results.
Download IronPDF and try it for free with a 30-day trial.