Forem: Yuvan Shankar

Implementing Tamil OCR Using Python and Tesseract

Yuvan Shankar — Thu, 12 Feb 2026 03:47:16 +0000

INTRODUCTION:

Optical Character Recognition (OCR) is a technology that converts images containing text into machine-readable digital text. In this project, I implemented a Tamil OCR system using Python and the Tesseract OCR engine. The goal was to test how accurately the system detects text from two different sources:
Handwritten text on white paper
Printed text from a newspaper
This blog explains the complete setup process and how the system works.

Part 1: Installing Python
Step 1: Download Python

First, download Python from the official website:
https://www.python.org/downloads/

While installing, it is very important to check the box:
“Add Python to PATH”
This allows Python to be accessed from the Command Prompt.
After installation, verify it by opening Command Prompt and typing:

python --version

If Python is installed correctly, it will display the installed version number.

Part 2: Installing Tesseract OCR
Python alone cannot perform OCR. We need an OCR engine, which is Tesseract.

Step 2: Download Tesseract for Windows
Download the Windows installer from:
https://github.com/UB-Mannheim/tesseract/wiki

Install it in the default location:

C:\Program Files\Tesseract-OCR
After installation, verify it by typing in Command Prompt:

tesseract --version

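If the version details are displayed, Tesseract is installed correctly. If Command Prompt reports that the command is not recognized, add C:\Program Files\Tesseract-OCR to the PATH environment variable or run tesseract.exe using its full path.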
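The same check can be done from Python. The sketch below is only one possible approach: find_tesseract is a hypothetical helper name, and the fallback path assumes the default install location used in this post.

```python
import shutil
from pathlib import Path

# Default install location used in this post (an assumption on other machines).
DEFAULT_TESSERACT = Path(r"C:\Program Files\Tesseract-OCR\tesseract.exe")


def find_tesseract(fallback=DEFAULT_TESSERACT):
    """Return the path to the tesseract executable as a string, or None if not found."""
    on_path = shutil.which("tesseract")  # looks the command up on PATH
    if on_path:
        return on_path
    if Path(fallback).is_file():
        return str(fallback)
    return None


print("Tesseract found at:", find_tesseract())
```

If this prints None, Tesseract is neither on PATH nor in the default folder.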

Part 3: Adding Tamil Language Support

To detect Tamil text, we must ensure that the Tamil trained data file is available.

Go to:

C:\Program Files\Tesseract-OCR\tessdata

Check if the file:

tam.traineddata
exists.

If not, download it from:

https://github.com/tesseract-ocr/tessdata

and place it inside the tessdata folder.
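A small script can confirm that the language file is in place. This is a minimal sketch, assuming the default tessdata folder shown above; has_language is a hypothetical helper, not part of Tesseract or pytesseract.

```python
from pathlib import Path

# Folder from this post; adjust it if Tesseract is installed elsewhere.
TESSDATA_DIR = Path(r"C:\Program Files\Tesseract-OCR\tessdata")


def has_language(lang, tessdata_dir=TESSDATA_DIR):
    """Return True if <lang>.traineddata exists in the tessdata folder."""
    return (Path(tessdata_dir) / f"{lang}.traineddata").is_file()


if has_language("tam"):
    print("Tamil language data is installed.")
else:
    print("tam.traineddata is missing from", TESSDATA_DIR)
```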

Part 4: Installing Required Python Libraries

Open Command Prompt and install the required libraries:

pip install pytesseract opencv-python pillow

These libraries are used for:
pytesseract → Connecting Python with Tesseract

opencv-python → Image processing

pillow → Image handling
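To confirm that the installation worked, it may help to check that each library can be found by Python. Note that the name used in an import statement can differ from the pip package name (opencv-python is imported as cv2, pillow as PIL). The sketch below uses only the standard library; missing_packages is a hypothetical helper name.

```python
import importlib.util

# pip package name -> module name used in "import" statements
REQUIRED = {
    "pytesseract": "pytesseract",
    "opencv-python": "cv2",
    "pillow": "PIL",
}


def missing_packages(required=REQUIRED):
    """Return the pip names of packages whose modules cannot be found."""
    return [
        pip_name
        for pip_name, module_name in required.items()
        if importlib.util.find_spec(module_name) is None
    ]


missing = missing_packages()
print("Missing packages:", ", ".join(missing) if missing else "none")
```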

Part 5: Project Setup
Create a project folder named:

OCR_Project
Inside the folder, create:
ocr_test.py (Python file)
test.jpg (Input image)

Part 6: Python OCR Code
Below is the Python code used for Tamil text detection:

import cv2
import pytesseract

# Specify the Tesseract path
pytesseract.pytesseract.tesseract_cmd = r"C:\Program Files\Tesseract-OCR\tesseract.exe"

# Load the image (cv2.imread returns None if the file cannot be read)
img = cv2.imread("test.jpg")
if img is None:
    raise FileNotFoundError("test.jpg could not be read")

# Convert to grayscale
gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)

# Apply thresholding: pixels brighter than 150 become white, the rest black
_, thresh = cv2.threshold(gray, 150, 255, cv2.THRESH_BINARY)

# Perform OCR in Tamil
text = pytesseract.image_to_string(thresh, lang='tam')

print("Detected Text:")
print(text)

Part 7: Running the Program
Navigate to the project folder in Command Prompt:

cd Desktop\OCR_Project
Run the program:

python ocr_test.py
The detected Tamil text will be printed in the console.

HOW THE OCR SYSTEM WORKS INTERNALLY:

The system follows these steps:

* The image is loaded.
* It is converted to grayscale to simplify processing.
* Thresholding is applied to separate text from background.
* Tesseract detects text regions.
* The Tamil language model recognizes characters.
* The final detected text is returned as output.
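
The grayscale and thresholding steps can be illustrated on a tiny hand-made image without OpenCV. This is a simplified pure-Python sketch, not what OpenCV runs internally: to_gray and binarize are hypothetical helpers, the pixel values are made up, and the threshold of 150 is the one used in the code above.

```python
def to_gray(pixel):
    """Convert one (B, G, R) pixel to a gray level with the usual luminance weights."""
    b, g, r = pixel
    return round(0.299 * r + 0.587 * g + 0.114 * b)


def binarize(gray_rows, threshold=150, max_value=255):
    """Pixels brighter than the threshold become white, the rest black."""
    return [
        [max_value if value > threshold else 0 for value in row]
        for row in gray_rows
    ]


# Hypothetical 2x3 image: dark "ink" pixels on a light "paper" background.
image = [
    [(250, 250, 250), (30, 30, 30), (240, 245, 250)],
    [(20, 25, 30), (200, 200, 200), (90, 90, 90)],
]

gray = [[to_gray(p) for p in row] for row in image]
binary = binarize(gray)

for row in binary:
    print(row)
```

After this step every pixel is either 0 (ink) or 255 (paper), which gives the OCR engine a cleaner input when the lighting is even.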

Accuracy Testing: White Paper vs Newspaper

WHITE PAPER TEST:

Clean background
Clear handwriting
Good contrast
Result:
Accuracy is usually high (around 80–95%) because the text is clearly separated from the background.

NEWSPAPER TEST:

Small font size
Multiple columns
Images and advertisements
Background noise
Result:
Accuracy decreases (around 60–80%) because of complex layout and noise.
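
One common way to put a number on accuracy is to type the expected text by hand and compare it with the OCR output character by character. The sketch below shows that idea using edit distance; it is not necessarily how the percentages above were obtained. The function names and sample strings are hypothetical, and the comparison counts Unicode code points, so a Tamil letter with a vowel sign counts as more than one character.

```python
def edit_distance(a, b):
    """Levenshtein distance between two strings, counted in Unicode code points."""
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution or match
            ))
        previous = current
    return previous[-1]


def char_accuracy(reference, ocr_output):
    """Share of reference characters recovered, as a percentage (0 to 100)."""
    if not reference:
        return 100.0 if not ocr_output else 0.0
    errors = edit_distance(reference, ocr_output)
    return max(0.0, 1 - errors / len(reference)) * 100


# Hypothetical example strings, not results from the tests above.
reference = "தமிழ் மொழி"
ocr_output = "தமிழ் மொழ"
print(f"Character accuracy: {char_accuracy(reference, ocr_output):.1f}%")
```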

EXPLORING OCR MODEL AND BACKEND SUPPORT IN PYTHON

Yuvan Shankar — Wed, 11 Feb 2026 03:48:10 +0000

Optical Character Recognition (OCR) is a technology that converts images, scanned documents, or PDFs into machine-readable text. In Python, there are many powerful OCR libraries and models available that support different backends and use cases.
In this blog, I explore the most popular OCR modules available in Python and how they are used in real-world applications.

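TESSERACT OCR (pytesseract):

Backend: Tesseract engine (C++ based)

Python Wrapper: pytesseract
Tesseract is one of the most widely used open-source OCR engines. It was originally developed at Hewlett-Packard, was later sponsored by Google, and supports over 100 languages. pytesseract only wraps the engine, so Tesseract itself must be installed separately.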

HOW TO INSTALL:

pip install pytesseract

SAMPLE CODE:

import pytesseract

from PIL import Image

img = Image.open("sample.png")
text = pytesseract.image_to_string(img, lang="eng")
print(text)

EASYOCR:

Backend: PyTorch (Deep Learning based)

EasyOCR is a deep learning-based OCR library. It works well with complex images and multiple languages.

HOW TO INSTALL:
pip install easyocr

SAMPLE CODE:

import easyocr

# Load the English and Tamil models
reader = easyocr.Reader(['en','ta'])
result = reader.readtext('sample.png')

# Each result is (bounding box, text, confidence)
for r in result:
    print(r[1])

PADDLE OCR:

Backend: PaddlePaddle (Deep Learning Framework)

PaddleOCR is a powerful industrial-level OCR toolkit developed by Baidu.

HOW TO INSTALL:
pip install paddleocr paddlepaddle

SAMPLE CODE:
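from paddleocr import PaddleOCR

# Written for the PaddleOCR 2.x API; newer releases may use different arguments and result formats
ocr = PaddleOCR(use_angle_cls=True, lang='en')
result = ocr.ocr('sample.png')

# Each entry is [box, (text, confidence)]; result[0] can be None when no text is found
for line in result[0] or []:
    print(line[1][0])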

KERAS OCR:

Backend: TensorFlow / Keras
Keras-OCR is built using deep learning models and provides both text detection and recognition.

HOW TO INSTALL:
pip install keras-ocr

SAMPLE CODE:
import keras_ocr

pipeline = keras_ocr.pipeline.Pipeline()
images = [keras_ocr.tools.read('sample.png')]
prediction = pipeline.recognize(images)

print(prediction)

HOW IT WORKS IN BACKEND:

Image / PDF
↓
Preprocessing (OpenCV)
↓
Text Detection (DL/CNN)
↓
Text Recognition (CRNN / Transformer)
↓
Output Text (JSON / String)
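
The last stage of this pipeline, turning recognition results into a string or JSON, can be sketched as follows. The input is assumed to be a list of (box, text, confidence) tuples, shaped like EasyOCR's readtext() output; to_outputs, the sample data, and the 0.5 confidence cutoff are illustrative assumptions.

```python
import json


def to_outputs(results, min_confidence=0.5):
    """Turn (box, text, confidence) tuples into a plain string and a JSON document."""
    kept = [
        {
            "text": text,
            "confidence": round(float(conf), 3),
            # Convert coordinates to plain floats so json can serialize them.
            "box": [[float(x), float(y)] for x, y in box],
        }
        for box, text, conf in results
        if conf >= min_confidence
    ]
    plain = "\n".join(item["text"] for item in kept)
    as_json = json.dumps(kept, ensure_ascii=False, indent=2)
    return plain, as_json


# Hypothetical recognition results, shaped like EasyOCR's readtext() output.
sample_results = [
    ([[10, 10], [120, 10], [120, 40], [10, 40]], "Hello", 0.98),
    ([[10, 50], [140, 50], [140, 80], [10, 80]], "வணக்கம்", 0.91),
    ([[10, 90], [60, 90], [60, 110], [10, 110]], "~~", 0.12),
]

plain, as_json = to_outputs(sample_results)
print(plain)
print(as_json)
```

Passing ensure_ascii=False keeps Tamil text readable in the JSON instead of escaping it.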

AVAILABLE MODULES:

1.pytesseract

2.easyocr

3.paddleocr

4.opencv-python

5.pillow

6.pdf2image

OCR SOFTWARE AND TOOLS

Yuvan Shankar — Tue, 10 Feb 2026 12:28:47 +0000

What is OCR?
OCR (Optical Character Recognition) is a technology that reads text from images or scanned documents and converts it into editable and searchable text.
It helps us change paper documents or image files into digital text that we can copy, edit, and store easily.

PAID OCR SOFTWARE:

ABBYY FineReader PDF
ABBYY FineReader is a professional OCR software mainly used for document digitization. It can convert scanned PDFs and images into editable formats while keeping the original layout, tables, and formatting intact.
Adobe Acrobat Pro DC (OCR-Enabled)
Adobe Acrobat Pro DC includes OCR functionality that allows users to recognize text in scanned PDFs and export them into formats like Word, Excel, or PowerPoint. It is widely used in offices and enterprises.
Nanonets OCR
Nanonets OCR is an AI-based OCR software used for automated document processing. It is commonly used in business workflows such as invoice processing, form extraction, and data automation.

COMMON DRAWBACKS OF PAID OCR SOFTWARE:

1.High cost: Paid OCR software usually requires monthly or yearly subscriptions, which can be expensive.

2.Complex to use: Advanced features make the software powerful, but also increase the learning curve for new users.

3.Resource heavy: Some paid OCR tools need powerful systems or cloud credits to work efficiently.

FREE OCR SOFTWARE:

Tesseract OCR
Tesseract OCR is a free and open-source OCR engine. It is used to extract text from images and PDFs and is commonly integrated into custom applications and student projects.
OCRFeeder
OCRFeeder is a free OCR software that provides a graphical interface and works as a front-end for OCR engines like Tesseract. It helps users manage and process scanned documents more easily.
GOCR
GOCR is a free GNU OCR tool mainly used for basic text extraction from simple images. It is an older OCR technology but still useful for simple OCR tasks.

COMMON DRAWBACKS OF FREE OCR TOOLS:

1.Lower accuracy: Free OCR tools usually struggle with complex layouts and handwritten text compared to paid software.
2.Limited support: Free tools have less technical support and fewer updates than commercial products.
3.Fewer advanced features: Features like batch processing and table extraction are limited or missing.
4.Technical setup required: Some free OCR tools do not have a proper graphical interface and require technical knowledge to use.
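
Some of these gaps can be closed with a little scripting. As an example, batch processing can be built around the Tesseract command line, which takes an image, an output base name, and a language (tesseract image outputbase -l lang). The sketch below only builds the commands and does not run them; build_commands and the folder name scans are hypothetical.

```python
from pathlib import Path

IMAGE_SUFFIXES = {".png", ".jpg", ".jpeg", ".tif", ".tiff"}


def build_commands(folder, lang="eng"):
    """Build one `tesseract <image> <output base> -l <lang>` command per image."""
    commands = []
    for image in sorted(Path(folder).iterdir()):
        if image.suffix.lower() in IMAGE_SUFFIXES:
            output_base = image.with_suffix("")  # Tesseract appends .txt itself
            commands.append(["tesseract", str(image), str(output_base), "-l", lang])
    return commands


# "scans" is a hypothetical folder name; nothing is executed here.
scans = Path("scans")
if scans.is_dir():
    for command in build_commands(scans, lang="tam"):
        print(" ".join(command))
```

Each command list can then be passed to subprocess.run(command, check=True) to produce one text file per image.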

A journey through code

Yuvan Shankar — Mon, 09 Feb 2026 10:19:27 +0000

This blog is a reflection of my learning journey in the tech field. I write about the projects I work on, the concepts I’m improving, and practical lessons gained through daily coding. Each post is driven by real experience, as I continue to learn step by step and grow through consistent practice.