Applying morphological segmentation to machine translation of low-resource and morphologically complex languages: (The) case of English-Tigrinya

Machine Translation (MT) has seen substantial advances in recent years, but for most language pairs it remains unexplored and an open challenge. In this thesis, we present the development of a Statistical Machine Translation (SMT) system between English and a lesser-known Semitic language, Tigrinya. To the best of our knowledge, this is the first study of machine translation involving the Tigrinya language. Two of the most important factors that affect the performance of SMT systems for a given language pair are: (1) the volume of parallel data available, and (2) the linguistic difference between the two languages. In this regard, English and Tigrinya make a particularly difficult pair for SMT. English is deeply studied and has a wealth of resources, whereas Tigrinya is much less studied and has severely limited computational resources. Moreover, the two languages differ markedly in syntax and morphology, particularly in word structure. Tigrinya is an agglutinative language with highly derivational and inflectional morphology that proliferates vocabulary and necessitates sub-word translation. Although natural languages differ markedly in how words are formed, standard SMT approaches treat surface words as the smallest unit of translation. These techniques work fairly well for languages with simple morphology and a relatively small vocabulary, such as English. However, they perform suboptimally when languages with rich morphology and a huge vocabulary are involved, owing to poor phrase alignment, data sparsity, and a high rate of out-of-vocabulary words. In this empirical study, we build the necessary corpora from scratch and study the effects of both rule-based and unsupervised morphological segmentation of Tigrinya words as remedial measures. In addition, we augment the system with an additional bilingual lexicon to ameliorate the out-of-vocabulary problem.
With these measures, we achieve cumulative BLEU scores of 23.3 and 27.14 points for English-to-Tigrinya and Tigrinya-to-English translation, respectively. Finally, the system is published online for public use, and the dataset, which comprises a parallel corpus of 30.6k sentences and a monolingual corpus of 913k sentences, is made publicly available to researchers.
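The abstract's argument that segmenting words into morphemes reduces the out-of-vocabulary rate can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the affix inventory, the toy words, and the helper names `segment` and `oov_rate` are hypothetical, the words are not real Tigrinya, and the simple affix-stripping rule is not the segmenter used in the thesis.

```python
# Toy illustration only: hypothetical affixes and made-up words (not Tigrinya).
PREFIXES = ("ki", "te")
SUFFIXES = ("na", "ka")
MIN_STEM = 3  # assumed minimum stem length, to avoid over-splitting


def segment(word):
    """Strip at most one known prefix and one known suffix, marking boundaries with '+'."""
    pieces = []
    for p in PREFIXES:
        if word.startswith(p) and len(word) - len(p) >= MIN_STEM:
            pieces.append(p + "+")
            word = word[len(p):]
            break
    suffix = None
    for s in SUFFIXES:
        if word.endswith(s) and len(word) - len(s) >= MIN_STEM:
            suffix = "+" + s
            word = word[:-len(s)]
            break
    pieces.append(word)
    if suffix is not None:
        pieces.append(suffix)
    return pieces


def oov_rate(train_tokens, test_tokens):
    """Fraction of test tokens that never occur in the training tokens."""
    vocab = set(train_tokens)
    unseen = [t for t in test_tokens if t not in vocab]
    return len(unseen) / len(test_tokens)


# Made-up surface forms: the test words recombine morphemes seen in training.
train_words = ["kisebarna", "tesebar", "mesakka", "kimesak"]
test_words = ["kisebarka", "temesakna"]

train_morphs = [m for w in train_words for m in segment(w)]
test_morphs = [m for w in test_words for m in segment(w)]

word_level_oov = oov_rate(train_words, test_words)
morph_level_oov = oov_rate(train_morphs, test_morphs)
print(f"word-level OOV rate:     {word_level_oov:.2f}")
print(f"morpheme-level OOV rate: {morph_level_oov:.2f}")
```

In this toy setup every test word is unseen as a surface form, yet each of its morphemes occurs in the training data, which is the effect the abstract attributes to sub-word translation.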
Advisors
Rho, Jae Jeung (노재정)
Description
Korea Advanced Institute of Science and Technology (KAIST): Global IT Technology Graduate Program
Publisher
Korea Advanced Institute of Science and Technology (KAIST)
Issue Date
2017
Identifier
325007
Language
eng
Description

Thesis (Master's) - Korea Advanced Institute of Science and Technology (KAIST): Global IT Technology Graduate Program, 2017.8, [vi, 67 p.]

Keywords

Natural Language Processing; Machine Translation; Morphological Segmentation; Corpora Building

URI
http://hdl.handle.net/10203/242759
Link
http://library.kaist.ac.kr/search/detail/view.do?bibCtrlNo=718756&flag=dissertation
Appears in Collection
ITP-Theses_Master (Master's theses)
Files in This Item
There are no files associated with this item.
