Pytesseract OCR Python | Extracting Text From Images

Introduction

As a developer, you may need to extract text from images. We can write a Python program to pull this textual information out of any image. In Python, we can use the Pytesseract package for this OCR (Optical Character Recognition) process.

Adding Libraries

To get started with the Python script, we need to install a few required libraries.

Pytesseract

To install pytesseract, run the following command:

pip install pytesseract

Pillow

The Pillow library opens and manipulates images and provides the image-processing capabilities we need.

To install Pillow, run the following command:

pip install pillow

OpenCV-Python

OpenCV provides image-processing tools for the media files (images) we want to recognize text from.

To install opencv-python, run the following command:

pip install opencv-python
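
Note that two of these packages are imported under a different name than the one used with pip (pillow is imported as PIL, and opencv-python as cv2). As a quick sanity check after installing, a small sketch like the following reports which packages are still missing. The helper name missing_packages is made up for this illustration; it relies only on the standard library's importlib.

```python
import importlib.util

# Map each import name to the pip package that provides it.
REQUIRED = {
    "pytesseract": "pytesseract",
    "PIL": "pillow",
    "cv2": "opencv-python",
    "numpy": "numpy",
}


def missing_packages(required=REQUIRED):
    """Return the pip names of packages whose import name cannot be found."""
    return [
        pip_name
        for import_name, pip_name in required.items()
        if importlib.util.find_spec(import_name) is None
    ]


missing = missing_packages()
if missing:
    print("Still missing, install with: pip install " + " ".join(missing))
else:
    print("All required packages are installed.")
```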

Creating a Python tesseract script to Read Text From Images

Importing Libraries

We’re almost ready to read text from images. Before that, though, you need to import the Pytesseract, Pillow, OpenCV, and NumPy libraries.

# Import libraries
from PIL import Image
import pytesseract
import cv2
import numpy as np
from pytesseract import Output
import os

The following line specifies the path to the Tesseract engine executable (installation is covered in the Setting up Tesseract section). If you installed Tesseract OCR in a different location, update the path accordingly.

pytesseract.pytesseract.tesseract_cmd = r'C:/Program Files/Tesseract-OCR/tesseract.exe'
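
If you would rather not hard-code the path, one option is to look for Tesseract on the system PATH first and only fall back to a fixed location. The sketch below uses only the standard library; the function name find_tesseract is hypothetical, and the fallback is simply the path used in this article.

```python
import os
import shutil

# Fallback location taken from this article; adjust it for your machine.
DEFAULT_WINDOWS_PATH = r"C:/Program Files/Tesseract-OCR/tesseract.exe"


def find_tesseract(fallback=DEFAULT_WINDOWS_PATH):
    """Return the path to the tesseract executable, or None if not found."""
    on_path = shutil.which("tesseract")  # searches the PATH variable
    if on_path:
        return on_path
    if os.path.isfile(fallback):
        return fallback
    return None


tesseract_path = find_tesseract()
if tesseract_path is None:
    print("Tesseract was not found; install it or set the path manually.")
else:
    print("Using Tesseract at:", tesseract_path)
```

The returned path can then be assigned to pytesseract.pytesseract.tesseract_cmd in place of the hard-coded string.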

Extract text from images

# Simply extracting text from image
image = Image.open("A:/Freelance/Projects/demo.jpg")
image = image.resize((300,150))
custom_config = r'-l eng --oem 3 --psm 6' 
text = pytesseract.image_to_string(image,config=custom_config)
print(text)

The first step in reading text from an image is to open the image. You can do so with the open() function of the Pillow library’s Image module. Next, pass the image object you just opened to the Pytesseract module’s image_to_string() function, which converts the text in the image into a Python string that you can use however you like. Finally, we print the string to the screen with print(). To read the text from your own image, run the script above.

Note: You’ll need to update the path of the image to match the location of the image you want to convert to string.
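
The script above resizes the image to a fixed 300x150 pixels. If your image has different proportions, this stretches the text, which may affect recognition. As a hedged alternative, the sketch below computes a new size that keeps the aspect ratio; scaled_size is a made-up helper name and the numbers are only examples.

```python
def scaled_size(width, height, target_width):
    """Return (target_width, new_height) with the aspect ratio preserved."""
    if width <= 0 or height <= 0 or target_width <= 0:
        raise ValueError("all dimensions must be positive")
    # Scale the height by the same factor as the width, never below 1 pixel.
    new_height = max(1, round(height * target_width / width))
    return target_width, new_height


print(scaled_size(1200, 800, 600))  # (600, 400)
```

With Pillow, this could be applied as image = image.resize(scaled_size(*image.size, 600)), since Image.size is a (width, height) tuple.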

custom_config = r'-l eng --oem 3 --psm 6'

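In the custom configuration, “-l eng” selects the English language, so Tesseract recognizes English text. You can specify multiple languages by joining their codes with “+” (for example, “-l eng+fra”). “--psm” sets the page segmentation mode, which controls how Tesseract splits the page into blocks of text before recognizing characters; mode 6 assumes a single uniform block of text. “--oem” selects the OCR engine mode; 3 is the default, which uses whichever engine is available.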
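
To make the three options easier to vary, the configuration string can be assembled by a small helper. This is only a sketch: build_tesseract_config is a hypothetical name, and the range checks assume Tesseract's documented values (engine modes 0 to 3, page segmentation modes 0 to 13).

```python
def build_tesseract_config(languages=("eng",), oem=3, psm=6):
    """Build a config string such as '-l eng --oem 3 --psm 6'."""
    if not languages:
        raise ValueError("at least one language code is required")
    if oem not in range(4):
        raise ValueError("oem must be between 0 and 3")
    if psm not in range(14):
        raise ValueError("psm must be between 0 and 13")
    # Multiple language codes are joined with '+', e.g. 'eng+fra'.
    return f"-l {'+'.join(languages)} --oem {oem} --psm {psm}"


print(build_tesseract_config())                       # -l eng --oem 3 --psm 6
print(build_tesseract_config(("eng", "fra"), psm=3))  # -l eng+fra --oem 3 --psm 3
```

The result can be passed as the config argument of image_to_string(), exactly like custom_config in the script.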

filename = "A:/demo.txt"
os.makedirs(os.path.dirname(filename), exist_ok=True)
with open(filename, "w") as f:
    f.write(text)

To save the text, we write it to a file called demo.txt. All the text recognized in the image is saved in this file, so we can reuse it elsewhere.
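
The same saving step can also be written with pathlib. The sketch below additionally sets the encoding to UTF-8 explicitly, because open() otherwise uses the platform default, which may not be able to represent every character the OCR returns. The helper name save_text and the sample string are assumptions for illustration, and a temporary directory stands in for the A:/ drive.

```python
from pathlib import Path
import tempfile


def save_text(text, filename):
    """Write text to filename as UTF-8, creating parent folders if needed."""
    path = Path(filename)
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(text, encoding="utf-8")
    return path


# Demonstration in a temporary directory instead of a fixed drive letter.
with tempfile.TemporaryDirectory() as tmp:
    saved = save_text("Sample OCR text", Path(tmp) / "output" / "demo.txt")
    print(saved.read_text(encoding="utf-8"))
```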

Now, let’s test our code with the image below as input.

input image for text recognition using Pytesseract python

Complete code for extracting text from an image in Python

# Import libraries
from PIL import Image
import pytesseract
import cv2
import numpy as np
from pytesseract import Output
import os

# Specify the path to the Tesseract executable
pytesseract.pytesseract.tesseract_cmd = r'C:/Program Files/Tesseract-OCR/tesseract.exe'

# Simply extracting text from image
image = Image.open("A:/Freelance/Projects/demo.jpg")
image = image.resize((300,150))
custom_config = r'-l eng --oem 3 --psm 6'
text = pytesseract.image_to_string(image,config=custom_config)
print(text)

# Save the text to a text file
filename = "A:/demo.txt"
os.makedirs(os.path.dirname(filename), exist_ok=True)
with open(filename, "w") as f:
    f.write(text)

Output:

The output shows that the characters from the passage have been read correctly. However, an extra ♀ symbol is also printed in the output.

So although Tesseract OCR can read text from an image, it is not 100% accurate. An accuracy rate below 100% is typical of all OCR engines, so don’t let this discourage you.
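
One likely explanation for the stray ♀ (an assumption, since the output itself is not shown here) is that Tesseract ends each page of output with a form feed character (\x0c), which some Windows consoles display as ♀. If that is the case, it can be removed before printing or saving. The helper name clean_ocr_text and the sample string below are made up for illustration.

```python
def clean_ocr_text(text):
    """Remove form feed characters and surrounding whitespace from OCR output."""
    return text.replace("\x0c", "").strip()


raw = "Hello world\n\x0c"  # made-up stand-in for image_to_string() output
print(repr(clean_ocr_text(raw)))  # 'Hello world'
```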

You can search for “OpenCV put text” to learn more about working with text and images using Python’s OpenCV library.

Setting up Tesseract

OCR (Optical Character Recognition) is a technique that recognizes and extracts text from images. It can be used to transform handwritten or printed text, even text that is difficult to read, into machine-readable form.

You must install and set up Tesseract on your PC before you can use OCR.

Install and Run Tesseract for Windows in 4 Easy Steps

Step-1. Download the Tesseract OCR executable for Windows

Once you open the executable file, you’ll have to first select a language.

Click the “Next” button on the following dialog box.

Step-2. Configure Installation

Installer Language

Click the “OK” button on the following dialog box.

License Agreement

You’ll be presented with a license agreement, as shown below. Click the “I Agree” button if you agree to the terms.

Tesseract OCR Setup

Select the components to install for Tesseract.

Choose a Start Menu folder for Tesseract.

Step-3. Add Tesseract OCR for Windows Installation Directory to Environment Variables

Add the Tesseract installation directory to the Path environment variable so that the tesseract command is available from the command line.

Step-4. After the installation is completed, run the tesseract -v command to confirm that Tesseract OCR is installed.
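
The same check can be run from Python, which is handy at the start of a script. The following is a minimal sketch using only the standard library; tesseract_version is a hypothetical helper name, and it assumes the tesseract command is on your PATH as set up in Step-3.

```python
import subprocess


def tesseract_version(command="tesseract"):
    """Return the first line printed by `tesseract -v`, or None if unavailable."""
    try:
        result = subprocess.run(
            [command, "-v"], capture_output=True, text=True, check=False
        )
    except OSError:
        # Raised when the command cannot be found or executed.
        return None
    # Some versions print the version to stderr rather than stdout.
    output = (result.stdout or result.stderr).strip()
    return output.splitlines()[0] if output else None


version = tesseract_version()
if version is None:
    print("Tesseract is not installed or not on the PATH.")
else:
    print("Found:", version)
```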

Conclusion

We learned how to install Tesseract, a text extraction program. Then we took a picture, retrieved the text from it, and stored the text in a file. For more complicated pictures, OpenCV’s image-processing methods can help prepare the image before extraction. I hope you enjoyed this tutorial and found it helpful.
