GeezSwitch: Language Identification in Typologically Related Low-resourced East African Languages

DC Field | Value | Language
dc.contributor.author | Gaim, Fitsum | ko
dc.contributor.author | Yang, Wonsuk | ko
dc.contributor.author | Park, Jong-Cheol | ko
dc.date.accessioned | 2022-11-10T23:01:24Z | -
dc.date.available | 2022-11-10T23:01:24Z | -
dc.date.created | 2022-11-10 | -
dc.date.issued | 2022-06-21 | -
dc.identifier.citation | The 13th Conference on Language Resources and Evaluation (LREC 2022), pp. 6578-6584 | -
dc.identifier.uri | http://hdl.handle.net/10203/299503 | -
dc.description.abstract | Language identification is one of the fundamental tasks in natural language processing that is a prerequisite to data processing and numerous applications. Low-resourced languages with similar typologies are generally confused with each other in real-world applications such as machine translation, affecting the user’s experience. In this work, we present a language-identification dataset for five typologically and phylogenetically related low-resourced East African languages that use the Ge’ez script as a writing system; namely Amharic, Blin, Ge’ez, Tigre, and Tigrinya. The dataset is built automatically from selected data sources, but we also performed a manual evaluation to assess its quality. Our approach to constructing the dataset is cost-effective and applicable to other low-resource languages. We integrated the dataset into an existing language-identification tool and also fine-tuned several Transformer-based language models, achieving very strong results in all cases. While the task of language identification is easy for the informed person, such datasets can make a difference in real-world deployments and also serve as part of a benchmark for language understanding in the target languages. The data and models are made available at https://github.com/fgaim/geezswitch. | -
dc.language | English | -
dc.publisher | European Language Resources Association (ELRA) | -
dc.title | GeezSwitch: Language Identification in Typologically Related Low-resourced East African Languages | -
dc.type | Conference | -
dc.identifier.wosid | 000889371706075 | -
dc.identifier.scopusid | 2-s2.0-85144457780 | -
dc.type.rims | CONF | -
dc.citation.beginningpage | 6578 | -
dc.citation.endingpage | 6584 | -
dc.citation.publicationname | The 13th Conference on Language Resources and Evaluation (LREC 2022) | -
dc.identifier.conferencecountry | FR | -
dc.identifier.conferencelocation | Marseille | -
dc.contributor.localauthor | Park, Jong-Cheol | -
Appears in Collection
CS-Conference Papers
Files in This Item
There are no files associated with this item.
