Sorting in Indic locales/Indian language spell-checking enhancements

From WorkOutWiki2008


An outline of the proposals made to FOSS.IN/2008 is on this IndLinux Wiki page. So that work can continue in the future, this page will also be reproduced there. You are most welcome to stay involved in the longer term; if you plan to do so, please join the indlinux-group mailing list, or email me at gora AT sarai DOT net. Notes from the workout are here.

Currently, IndLinux efforts in Indian language computing are under way in at least 15 Indian languages, namely, Assamese, Bengali, Chhattisgarhi, Gujarati, Hindi, Kannada, Kashmiri, Maithili, Malayalam, Marathi, Nepali, Oriya, Punjabi, Telugu, and Tamil. This workout aims to clean up various aspects of Indian language computing that have been in a state of partial completion for far too long. Little technical knowledge is needed to take part, as the technical side is largely handled by the back-end software. However, we can have an advanced discussion on the technologies involved. There are three main issues that we will attack in this workout:

  • Cleaning up Indian language locales for glibc, focusing on sorting:
    • Work to be done: A single file should be used to define the sorting order for all Indian languages, including English. Each language locale can then simply include this file, and sorting will work even for a multilingual document involving several Indian languages.
      • Such sorting for multiple languages is already defined.
      • Most of the work here will be in verifying the sorting order for various languages.
      • Preparation of test cases, including sorting of single characters, and a sample word list, for each Indian language.
      • A discussion of advanced sorting needs, i.e., things that cannot currently be handled by the LC_COLLATE section of glibc.
      • Creation of locales for languages where these are lacking, at least for Chhattisgarhi, Kashmiri, and Maithili.
      • Submission of formal patches for all work here to glibc.
    • How to contribute, and pre-requisites:
      • We urgently need people with knowledge of each Indian language listed above. It would be best if you could come prepared with defined sorting order rules for your language, or at least bring along a dictionary from which these could be inferred.
      • Required files:
      • Technical aspects here will consist of a brief explanation of how sorting is handled in glibc, and a description of the approach used. People able to improve on this approach, and/or document it, are also most welcome.
    • Useful links for background material:
      • Documentation on LC_COLLATE is here, from the Single UNIX Specification.
      • The sorting in indicsort should broadly work now, though it has been tested only for Hindi and Oriya.
      • Pravin Satpute has also worked on sorting in Marathi, and has views on this topic here.
      • See also the following page on collation data.
      • Tables showing examples of single-character sorting in each language will be uploaded.
    • Longer-term work, after FOSS.IN/2008:
      • Resolve any differences in Indic sorting between this work and the Unicode CLDR.
      • Get package maintainers for Linux distributions to push changes upstream.
  • Enhancing Indic spell-checking:
    • Work to be done: aspell and Hunspell allow the incorporation of advanced rules for spell-checking. Perhaps the most important aspect for Indian languages is to add phonetic rules. So far this has been done for Hindi, Oriya, and Punjabi, though tested only for Hindi. Also, affix rules can be added for deriving words from bases in the dictionary. Apparently, with Hunspell, it is also possible to add rules to handle agglutinative languages like Malayalam.
    • How to contribute, and pre-requisites:
      • Native speakers for each Indian language listed above will be needed. Some familiarity with spell-checking issues will be of help.
      • Required files:
      • Experience in working with advanced rules for Hunspell and aspell will be of immense help.
      • Technical ideas for advanced spell-checking in Indian languages will be appreciated.
      • Help in putting together a web interface for reviewing dictionaries.
    • Useful links for background material:
    • Longer-term work, after FOSS.IN/2008:
      • Finish porting aspell dictionaries to Hunspell, including advanced rules.
      • Finish up work on a common spell-checking interface, and look at language-neutral interfaces.
      • Consider adding spell-checking plugins for various applications.
  • Documentation for Indian languages in Linux distributions (if time permits): While most FOSS desktops and the underlying operating systems work fairly well with Indian languages, an Indian language desktop is still far from being as comfortably usable as an English desktop. The main reasons for this are: (a) a lack of comprehensive packaging, (b) insufficient attention to final finishing touches, and (c) a lack of documentation. We plan to approach these issues in a rigorous manner, and an outline of the plans at Sarai is on this page. We hope to have a discussion that will get people involved in developing these ideas further.
  • Cutting-edge technology issues in Indian language computing (if time permits): Several interesting projects at the cutting edge of technology have been mooted on the IndLinux mailing lists. We will try to lay a base for a coordinated effort on each of these. The topics will include:
    • Optical character recognition (OCR)
    • Machine translation (MT)
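
The sorting test cases mentioned above (single characters and a sample word list per language) could be checked with a small script along the following lines. This is only a sketch, not part of the workout's tooling: the function names are hypothetical, the locale name hi_IN.UTF-8 is an assumption about what is installed on the test machine, and the four Devanagari consonants are merely a placeholder for a real, verified test list. It relies on Python's standard locale module, whose strxfrm function exposes the LC_COLLATE order of the active locale.

```python
import locale
from typing import Callable, List, Optional, Tuple


def collation_key(locale_name: str) -> Callable[[str], str]:
    """Return a sort key that follows the LC_COLLATE rules of locale_name.

    Raises locale.Error if that locale is not installed. Note that
    setlocale changes process-wide state.
    """
    locale.setlocale(locale.LC_COLLATE, locale_name)
    return locale.strxfrm


def first_misordered(
    expected: List[str], key: Callable[[str], str]
) -> Optional[Tuple[str, str]]:
    """Return the first adjacent pair that `key` orders wrongly, else None.

    `expected` is a test case: items listed in the order a native
    speaker says they should sort.
    """
    for a, b in zip(expected, expected[1:]):
        if key(a) > key(b):
            return (a, b)
    return None


def code_points(text: str) -> str:
    """Render a string as code points, handy when a terminal lacks fonts."""
    return " ".join(f"U+{ord(ch):04X}" for ch in text)


if __name__ == "__main__":
    # Placeholder test case: Devanagari KA, KHA, GA, GHA in expected order.
    sample = ["\u0915", "\u0916", "\u0917", "\u0918"]
    try:
        key = collation_key("hi_IN.UTF-8")  # assumed locale name
        source = "hi_IN.UTF-8 collation"
    except locale.Error:
        key = lambda s: s  # fall back to plain code-point order
        source = "code-point order (locale not installed)"
    bad = first_misordered(sample, key)
    if bad is None:
        print(f"{source}: sample is in the expected order")
    else:
        print(f"{source}: {code_points(bad[0])} sorts after {code_points(bad[1])}")
```

A test list per language, kept as a plain text file with one item per line in the expected order, could be fed to first_misordered in the same way. Any pair it reports would then point at a rule to review in the shared collation file.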