Saturday, September 22, 2007

Burmese Language Project at OpenOffice.org

I have created the "Burmese Language Project" at OpenOffice.org. First, we need to enable the Burmese language in the I18N (internationalization) framework, the part of OpenOffice that supports using Burmese in the suite. I18N includes:
  • Word, line and sentence break
  • Search and replace
  • Paragraph numbering
  • Transliteration
  • Character classification
  • Number Formats
  • Calendars
  • Collation
  • Locale data
I have put together a team to work on it. Our team members are not working for the money, but they need funding to be able to work, so I am looking for a research grant. If you would like to contribute, whether with money or with work, please contact me at wunnakoko at gmail dot com.

Saturday, September 15, 2007

Syllable Segmentation & Line Breaking Software

I hope we can publish the beta version of the Syllable Segmentation Software in the near future. Although it is a beta version, we have already been testing it for some time. The software will do the following:
1. Phonologic segmentation of Burmese Syllables
2. Orthographic segmentation of Burmese Syllables
3. Line Breaking or Word Wrapping
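
As a rough illustration of the third item, line breaking on top of syllable segmentation can be sketched as a greedy wrap that only ever breaks between syllables. This is a minimal sketch, not the project's actual code: the function name wrap_syllables is hypothetical, the input is assumed to be already segmented into syllables, and the line width is counted in code points, which is only a crude stand-in for rendered width.

```python
def wrap_syllables(syllables, width):
    """Greedily pack syllables into lines of at most `width` code points.

    Assumption: `syllables` is the output of some syllable segmenter, so a
    break is only ever placed between two syllables, never inside one.
    A single syllable longer than `width` is kept whole on its own line.
    """
    lines, current = [], ""
    for syl in syllables:
        if current and len(current) + len(syl) > width:
            # The next syllable does not fit: close the line and start a new one.
            lines.append(current)
            current = syl
        else:
            current += syl
    if current:
        lines.append(current)
    return lines
```

Called as wrap_syllables(list_of_syllables, 40), it returns a list of lines. A real implementation would measure glyph widths from the font rather than counting code points.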

It can handle three types of documents:
1. Text documents (.txt files)
2. XML documents (.xml files)
3. MS Word documents (.doc files)

By handling XML documents, we hope it will also be useful for segmenting other types of documents, such as spreadsheet files, database files, etc.
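
To give an idea of what segmenting an XML document could look like, here is a minimal sketch that walks the text nodes of an XML tree and inserts a break opportunity between syllables while leaving the markup untouched. Everything in it is an assumption made for illustration: the function name mark_breaks is hypothetical, the segmenter is passed in as a plain callable, and ZERO WIDTH SPACE (U+200B) is used as the break marker because it is a common choice, not because the software is known to work this way.

```python
import xml.etree.ElementTree as ET

ZWSP = "\u200b"  # ZERO WIDTH SPACE, used here as an invisible break opportunity


def mark_breaks(xml_source, segment):
    """Return the XML with ZWSP inserted between the syllables of every text node.

    `segment` is any callable that maps a string to a list of syllables
    (an assumption of this sketch; no particular segmenter is implied).
    """
    root = ET.fromstring(xml_source)
    for node in root.iter():
        # Element text comes before the first child, tail comes after the element.
        if node.text:
            node.text = ZWSP.join(segment(node.text))
        if node.tail:
            node.tail = ZWSP.join(segment(node.tail))
    return ET.tostring(root, encoding="unicode")
```

Because only text and tail strings are changed, element names and attributes are untouched, which is what would make the same idea applicable to other XML-based document formats.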

I hope all of our friends and the Burmese community will help us test it.

Wednesday, September 12, 2007

Syllable Segmentation

Syllable segmentation is the process of determining syllable boundaries in a piece of text. Burmese is a tonal, analytic language and its writing system is syllabic, so syllables are the fundamental building blocks of the language. Two types of syllable boundary can be determined in Burmese script: 1) the phonological boundary of a syllable, and 2) the orthographic boundary of a syllable. Since Burmese script is a phonetic script, phonological segmentation is the basic segmentation. The phonological boundary of a syllable is defined, as the name suggests, by the phonology, whereas the orthographic boundary is defined by the orthography. An orthographic syllable need not correspond exactly to a phonological syllable: it is a combination of one or more phonological syllables that the non-breaking rules keep together. Example: the word မႏၱေလး has 3 phonological syllables, မန္, တ and ေလး, but only 2 orthographic syllables, မႏၱ and ေလး.
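
A rule-based orthographic segmenter can be sketched in a few lines. This is only an illustration under stated assumptions, not the algorithm used in the software: it assumes the text is in standard Unicode order (which may differ from the encoding of the example above), it knows only the basic consonants U+1000 to U+1021, and the name orthographic_syllables is hypothetical. The single rule is that a consonant starts a new syllable unless it is stacked under the previous consonant or closes the current syllable.

```python
import re

VIRAMA = "\u1039"  # MYANMAR SIGN VIRAMA, stacks the next consonant underneath
ASAT = "\u103a"    # MYANMAR SIGN ASAT, marks a syllable-final consonant
CONSONANT = "[\u1000-\u1021]"  # basic consonants only (a simplification)

# A consonant opens a syllable unless it follows VIRAMA (it is stacked)
# or is followed by ASAT or VIRAMA (it closes the current syllable).
_SYLLABLE_START = re.compile(f"(?<!{VIRAMA}){CONSONANT}(?![{ASAT}{VIRAMA}])")


def orthographic_syllables(text):
    """Split Unicode-ordered Burmese text into orthographic syllables."""
    starts = [m.start() for m in _SYLLABLE_START.finditer(text)]
    if not starts or starts[0] != 0:
        starts.insert(0, 0)  # keep any leading material as its own chunk
    bounds = starts + [len(text)]
    return [text[a:b] for a, b in zip(bounds, bounds[1:]) if a < b]


# "Mandalay" in Unicode order: ma, na, virama, ta, la, e-vowel, visarga
mandalay = "\u1019\u1014\u1039\u1010\u101c\u1031\u1038"
print(orthographic_syllables(mandalay))
```

For the Unicode spelling of the example word, this keeps the stacked pair together and yields two orthographic syllables, in line with the example above. Independent vowels, digits, punctuation and phonological segmentation are not handled and would need further rules.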

References

So that you all have a reference for future NLP work, I have uploaded some of my academic publications.

Master's Thesis at Nagaoka University of Technology

Languages of Myanmar In Cyberspace

Thursday, September 6, 2007

Word Break

A word is a unit of language that carries meaning and consists of one or more morphemes. Typically, a word will consist of a root or stem and zero or more affixes. Words can be combined to create phrases, clauses and sentences (Wikipedia). Although English separates words with spaces, Myanmar is written without spaces between words.

So, word boundaries must be detected for the Myanmar language. Myanmar is a tonal, analytic language, and its writing system is syllabic, so syllables are the fundamental building blocks of the language. Syllable boundaries can be determined with a rule-based approach.

A word is formed from one or more syllables.

Syllable breaking, which can be used for sorting, searching, text-to-speech and transliteration, can also serve as the basis for word breaking methods.

Word breaking, in turn, can be used for spell checking, grammar checking, translation, line breaking, etc.
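
One simple way to build word breaking on top of syllable breaking is dictionary-based longest matching over syllables. The sketch below shows that technique only as an example. The post does not say which word breaking method is intended, and the function name words_from_syllables, the lexicon and the max_len limit are all hypothetical.

```python
def words_from_syllables(syllables, lexicon, max_len=4):
    """Group syllables into words by greedy longest match against `lexicon`.

    Assumptions of this sketch: `syllables` is already segmented, `lexicon`
    is a set of known words written as concatenated syllables, and no word
    is longer than `max_len` syllables. An unknown syllable becomes a
    one-syllable word.
    """
    words, i = [], 0
    while i < len(syllables):
        # Try the longest candidate first and shrink until something matches.
        for n in range(min(max_len, len(syllables) - i), 0, -1):
            candidate = "".join(syllables[i:i + n])
            if n == 1 or candidate in lexicon:
                words.append(candidate)
                i += n
                break
    return words
```

Working on syllables rather than single characters means a word boundary can never fall inside a syllable. Longest matching is easy to implement but depends entirely on the quality of the lexicon and cannot resolve ambiguous groupings.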