In this example, I will show you how to extract text from images in a Python program. Text extracted from images can serve various purposes, for example data mining for machine learning projects or further processing in your applications.
To extract text from an image, I am going to use the Python library pytesseract. Python-tesseract is an optical character recognition (OCR) tool for Python. That is, it will recognize and “read” the text embedded in images.
Python-tesseract is a wrapper for Google’s Tesseract-OCR Engine. It is also useful as a stand-alone invocation script to tesseract, as it can read all image types supported by the Pillow and Leptonica imaging libraries, including jpeg, png, gif, bmp, tiff, and others. Additionally, if used as a script, Python-tesseract will print the recognized text instead of writing it to a file.
Python 3.9.5 – 3.9.7, Tesseract Installer
Download Tesseract and install it on your system. On Windows, the path to the executable would be something like C:\Program Files\Tesseract-OCR\tesseract.
Next, install pytesseract using the command
pip install pytesseract
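Because the location of the Tesseract executable differs between systems, it can help to look it up before running any OCR. The following is a minimal sketch, not part of pytesseract itself: the helper name find_tesseract and the WINDOWS_DEFAULT constant are hypothetical, and the ".exe" suffix on the Windows path is an assumption about a default install.

```python
import shutil
from pathlib import Path

# Assumed default install location on Windows; adjust for your system.
WINDOWS_DEFAULT = r"C:\Program Files\Tesseract-OCR\tesseract.exe"


def find_tesseract(fallbacks=(WINDOWS_DEFAULT,)):
    """Return the path to the tesseract executable, or None if not found."""
    # First check whether "tesseract" is reachable through the PATH variable.
    on_path = shutil.which("tesseract")
    if on_path:
        return on_path
    # Otherwise try each known install location in order.
    for candidate in fallbacks:
        if Path(candidate).is_file():
            return candidate
    return None


if __name__ == "__main__":
    print(find_tesseract() or "Tesseract executable not found")
```

If the helper returns a path, that value can be assigned to pytesseract.pytesseract.tesseract_cmd instead of a hard-coded string.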
Create a project root directory called python-extract-text-from-image in a location of your choice.
I may not mention the project’s root directory name in the subsequent sections, but I will assume that I am creating files with respect to the project’s root directory.
Python Script – Extract Text From Image
Now create a Python script file python-extract-text-from-image.py and write the following code into the script file.
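import pytesseract

# Point pytesseract at the installed Tesseract executable
pytesseract.pytesseract.tesseract_cmd = r'C:\Program Files\Tesseract-OCR\tesseract'

# Run OCR on the image file and print the recognized text
print(pytesseract.image_to_string('1.png'))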
I have imported the pytesseract library in the Python script. Next, I have set pytesseract's tesseract_cmd to the path of the installed Tesseract executable.
Finally, I have used pytesseract's image_to_string() function to print the text of the image.
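The same call can be wrapped in a small function that reports a missing image file before Tesseract is invoked. This is only a sketch under a few assumptions: the function name extract_text is hypothetical, and "eng" assumes the English language data is installed with Tesseract.

```python
from pathlib import Path


def extract_text(image_path, lang="eng"):
    """Return the text recognized in the image at image_path."""
    path = Path(image_path)
    # Fail early with a clear message if the image does not exist.
    if not path.is_file():
        raise FileNotFoundError(f"Image not found: {path}")
    # Imported here so the path check above runs even when
    # pytesseract is not installed.
    import pytesseract
    return pytesseract.image_to_string(str(path), lang=lang)


if __name__ == "__main__":
    try:
        print(extract_text("1.png"))
    except FileNotFoundError as err:
        print(err)
```

Returning the string instead of printing it makes the result easier to reuse in other parts of an application.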
Image used in this example:
Testing Text Extraction From Image
Execute the Python script and you will see the following output on the command line.
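For further processing, the raw string may need some tidying, since OCR output can contain blank lines, stray whitespace and a trailing form-feed character. A minimal sketch of such a cleanup step follows; the helper name clean_ocr_text is hypothetical and uses only the Python standard library.

```python
def clean_ocr_text(raw):
    """Strip surrounding whitespace from each line and drop empty lines."""
    # str.splitlines() also treats the form feed character as a line
    # boundary, so a trailing page separator ends up as an empty line.
    stripped = (line.strip() for line in raw.splitlines())
    return "\n".join(line for line in stripped if line)


if __name__ == "__main__":
    sample = "Hello  \n\n  World\n\x0c"
    print(clean_ocr_text(sample))
```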