Free Indic OCR
From WorkOutWiki2009
Proposer
Debayan Banerjee
debayanin <att> gmail <dott> com
Purpose
After some development effort, we have an OCR engine that works well: http://code.google.com/p/tesseractindic. The aim now is to add support for as many languages as possible. A GUI and a web frontend to the OCR system are also desirable, so that end users can use it without having to deal with the command line.
Pre-requisites
Knowledge of any one of the following three will do:
Python / PyGTK
C++
Django
Going through http://hacking-tesseract.blogspot.com/ will help.
Getting and compiling the code
Read the instructions at http://code.google.com/p/tesseractindic.
Links to overall design/architecture
http://code.google.com/p/tesseract-ocr/w/list
http://tesseract-ocr.googlecode.com/files/TesseractOSCON.pdf
Tasks
1> Add training data for new languages. So far, only Hindi and Bengali are supported. Read http://code.google.com/p/tesseractindic/wiki/TrainerGUI.
2> Integrate TesseractIndic with OCRFeeder http://live.gnome.org/OCRFeeder.
3> Create a web framework for the OCR using Django.
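Both the GUI integration and the Django frontend would probably need a thin wrapper around the command-line engine. The following is a minimal sketch of such a wrapper, not the project's actual code. It assumes the engine is installed as `tesseract` and accepts the standard Tesseract invocation `tesseract <image> <output base> -l <lang>`. The function names and the language code `hin` are hypothetical and should be checked against the TesseractIndic instructions.

```python
import subprocess
from pathlib import Path


def build_tesseract_command(image_path, output_base, lang="hin"):
    """Return the argument list for one command-line OCR run.

    Assumption: the binary is called `tesseract` and `lang` matches the
    name of an installed training data set ("hin" is only a placeholder).
    """
    # Standard Tesseract CLI form: tesseract <image> <output base> -l <lang>
    return ["tesseract", str(image_path), str(output_base), "-l", lang]


def run_ocr(image_path, output_base, lang="hin"):
    """Run the OCR engine and return the recognized text."""
    cmd = build_tesseract_command(image_path, output_base, lang)
    # check=True raises CalledProcessError if the engine exits with an error
    subprocess.run(cmd, check=True)
    # Tesseract writes its result to <output base>.txt
    return Path(f"{output_base}.txt").read_text(encoding="utf-8")
```

A Django view or a PyGTK callback could save the uploaded or selected image to a temporary file, call `run_ocr` on it, and display the returned text. Passing the command as a list, without a shell, avoids quoting problems with user-supplied file names.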
Existing work
http://code.google.com/p/tesseractindic/downloads/list
http://crblpocr.blogspot.com/
Getting in touch
Indlinux mailing list
#indlinux and #sarai on irc.freenode.net
Participants
# Santhosh Thottingal
# Sayamindu Dasgupta

