Free Indic OCR

From WorkOutWiki2009

Proposer

Debayan Banerjee

debayanin <att> gmail <dott> com

Purpose

After some development effort, we have an OCR engine that works well: http://code.google.com/p/tesseractindic. The aim now is to add support for as many languages as possible. A GUI and a web frontend to the OCR system are also desirable, so that end users can use it without having to deal with the command line.

Pre-requisites

Knowledge of any one of the following three will do.

 Python / PyGTK
 C++
 Django

Going through http://hacking-tesseract.blogspot.com/ will help.

Getting and compiling the code

Read the instructions at http://code.google.com/p/tesseractindic.

Links to overall design/architecture

 http://code.google.com/p/tesseract-ocr/w/list
 http://tesseract-ocr.googlecode.com/files/TesseractOSCON.pdf


Tasks

 1> Add training data for new languages. So far, only Hindi and Bengali are supported. Read http://code.google.com/p/tesseractindic/wiki/TrainerGUI.
 2> Integrate TesseractIndic with OCRFeeder (http://live.gnome.org/OCRFeeder).
 3> Create a web frontend for the OCR engine using Django.
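
Both the GUI and the web frontend would need to drive the command-line engine on the user's behalf. The following is a minimal sketch of such a wrapper, not the project's actual code. It assumes the engine is invoked like stock Tesseract (tesseract <image> <output base> -l <language code>) and that the binary is called "tesseract"; the function names and the "hin" language code are illustrative assumptions, so check the TesseractIndic instructions for the real values.

```python
import subprocess
from pathlib import Path


def build_ocr_command(image_path, output_base, lang="hin", binary="tesseract"):
    """Return the argument list for one OCR run.

    Assumes the classic Tesseract invocation:
        tesseract <image> <output base> -l <language code>
    The binary name and language code are assumptions, not project facts.
    """
    return [binary, str(image_path), str(output_base), "-l", lang]


def run_ocr(image_path, output_base, lang="hin", binary="tesseract"):
    """Run the OCR engine and return the recognised text.

    Tesseract writes its result to <output base>.txt, so the text is
    read back from that file once the process has finished.
    """
    command = build_ocr_command(image_path, output_base, lang, binary)
    subprocess.run(command, check=True)  # raises CalledProcessError on failure
    return Path(str(output_base) + ".txt").read_text(encoding="utf-8")
```

A PyGTK button handler or a Django view could then call run_ocr() with the path of an uploaded image and a temporary output base, and display the returned text. Keeping the command construction in its own function makes it easy to check without the engine installed.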

Existing work

 http://code.google.com/p/tesseractindic/downloads/list
 http://crblpocr.blogspot.com/


Getting in touch

 Indlinux mailing list
 #indlinux and #sarai on irc.freenode.net

Participants

 Santhosh Thottingal
 Sayamindu Dasgupta