DC Field | Value | Language |
---|---|---|
dc.contributor.author | Gaim, Fitsum | ko |
dc.contributor.author | Yang, Wonsuk | ko |
dc.contributor.author | Park, Jong-Cheol | ko |
dc.date.accessioned | 2022-11-10T23:01:24Z | - |
dc.date.available | 2022-11-10T23:01:24Z | - |
dc.date.created | 2022-11-10 | - |
dc.date.issued | 2022-06-21 | - |
dc.identifier.citation | The 13th Conference on Language Resources and Evaluation (LREC 2022), pp.6578 - 6584 | - |
dc.identifier.uri | http://hdl.handle.net/10203/299503 | - |
dc.description.abstract | Language identification is one of the fundamental tasks in natural language processing that is a prerequisite to data processing and numerous applications. Low-resourced languages with similar typologies are generally confused with each other in real-world applications such as machine translation, affecting the user’s experience. In this work, we present a language-identification dataset for five typologically and phylogenetically related low-resourced East African languages that use the Ge’ez script as a writing system; namely Amharic, Blin, Ge’ez, Tigre, and Tigrinya. The dataset is built automatically from selected data sources, but we also performed a manual evaluation to assess its quality. Our approach to constructing the dataset is cost-effective and applicable to other low-resource languages. We integrated the dataset into an existing language-identification tool and also fine-tuned several Transformer-based language models, achieving very strong results in all cases. While the task of language identification is easy for the informed person, such datasets can make a difference in real-world deployments and also serve as part of a benchmark for language understanding in the target languages. The data and models are made available at https://github.com/fgaim/geezswitch. | - |
dc.language | English | - |
dc.publisher | European Language Resources Association (ELRA) | - |
dc.title | GeezSwitch: Language Identification in Typologically Related Low-resourced East African Languages | - |
dc.type | Conference | - |
dc.identifier.wosid | 000889371706075 | - |
dc.identifier.scopusid | 2-s2.0-85144457780 | - |
dc.type.rims | CONF | - |
dc.citation.beginningpage | 6578 | - |
dc.citation.endingpage | 6584 | - |
dc.citation.publicationname | The 13th Conference on Language Resources and Evaluation (LREC 2022) | - |
dc.identifier.conferencecountry | FR | - |
dc.identifier.conferencelocation | Marseille | - |
dc.contributor.localauthor | Park, Jong-Cheol | - |