Thursday, September 6, 2007

Word Break

A word is a unit of language that carries meaning and consists of one or more morphemes. Although English uses spaces to separate words, Myanmar text is written without spaces between words. Typically, a word consists of a root or stem and zero or more affixes. Words can be combined to create phrases, clauses, and sentences (Wikipedia).

So word boundaries must be detected for the Myanmar language. Myanmar is a tonal and analytic language, and its writing system is syllabic, so the fundamental building blocks of the written language are syllables. Syllable boundaries can be determined with a rule-based approach.
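As a rough illustration of what a rule-based syllable breaker might look like, here is a minimal Python sketch. It is not the method of any particular tool. It assumes the input is standard Unicode Myanmar text, and it applies one simplified rule: a consonant starts a new syllable unless it is stacked under the previous consonant or it closes the current syllable. The names `break_syllables` and `SYLLABLE_START` are made up for this example, and independent vowels, digits, and punctuation are not handled.

```python
import re

# Assumption: input is standard Unicode Myanmar text.
CONSONANT = "\u1000-\u1021"  # consonant letters KA .. A
VIRAMA = "\u1039"            # stacking sign: the next consonant is subjoined
ASAT = "\u103A"              # killer sign: marks a syllable-final consonant

# A consonant starts a new syllable unless it is preceded by VIRAMA
# (stacked) or followed by ASAT or VIRAMA (it closes the syllable).
SYLLABLE_START = re.compile(
    f"(?<!{VIRAMA})[{CONSONANT}](?![{ASAT}{VIRAMA}])"
)


def break_syllables(text):
    """Split Myanmar text into syllables (hypothetical helper, simplified rule)."""
    marked = SYLLABLE_START.sub(lambda m: "\n" + m.group(0), text)
    return [s for s in marked.split("\n") if s]


# "Myanmar writing", spelled with escapes: MA, medial RA, NA, ASAT, MA, AA, CA, AA
sample = "\u1019\u103C\u1014\u103A\u1019\u102C\u1005\u102C"
syllables = break_syllables(sample)  # three syllables
```

A real breaker would need more rules than this one, but the sketch shows why the task suits a rule base: the decision depends only on a few neighbouring characters.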

A word can be formed from one or more syllables.

Syllable breaking, which can be used for sorting, searching, text-to-speech, and transliteration, can also serve as the basis for word-breaking methods.
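One way to build word breaking on top of syllables is greedy longest matching against a word list. The post does not name a specific method, so the Python sketch below is only an illustration of that one common technique. The function `break_words`, the `lexicon` set, and the `max_len` limit are all assumptions made for this example; the input is a list of syllables that has already been produced by some syllable breaker.

```python
def break_words(syllables, lexicon, max_len=4):
    """Greedy longest-match word breaking over a list of syllables.

    lexicon: a set of known words (hypothetical word list).
    max_len: the longest word, in syllables, that will be tried (assumed limit).
    """
    words = []
    i = 0
    while i < len(syllables):
        # Try the longest candidate first, then shrink one syllable at a time.
        for n in range(min(max_len, len(syllables) - i), 0, -1):
            candidate = "".join(syllables[i:i + n])
            # A single syllable is always accepted as a fallback.
            if n == 1 or candidate in lexicon:
                words.append(candidate)
                i += n
                break
    return words


# Three syllables, written with escapes, and a one-word lexicon ("Myanmar").
syllables = ["\u1019\u103C\u1014\u103A", "\u1019\u102C", "\u1005\u102C"]
lexicon = {"\u1019\u103C\u1014\u103A\u1019\u102C"}
words = break_words(syllables, lexicon)  # two words: "Myanmar" + "writing"
```

Greedy matching is simple but can pick the wrong split when words overlap, so practical word breakers often add statistics or extra rules on top of it.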

Word breaking can be used for spell checking, grammar checking, translation, line breaking, etc.

3 comments:

Aung Kyaw said...

This process is also known in the NLP community as tokenization.

mark.soemin said...

Here is my Word Breaker, for your reference.

http://www.burglish.com/wbreaker.htm

mark@burglish.com

UchihaMadara said...

Great point! Did you develop any Burmese word segmentation tool?