MaghBERT: pre-trained language models for the Maghrebi dialects

Masked language models with random masking have brought significant performance gains to a wide range of natural language processing tasks. However, their performance is bounded by the domain of the raw pre-training corpus: target domains that shift notably from the source domain generally perform poorly, even when the two share a considerable amount of semantics. A particular case of this paradigm is Modern Standard Arabic and the Arabic dialects. Although both share a large proportion of semantics, a model pre-trained on the standardized variant fails to perform adequately on the dialects. A typical solution to this problem is to expose the pre-trained model to the target domain through another round of pre-training, a process known as domain adaptation. However, recent domain-adaptation techniques fail to deal with noisy target data, which limits what they can learn while also harming the representation of the source domain. To address these issues, we propose a semi-supervised masking strategy that leverages a relatively small set of supervised signals to extract various term-weighting schemes, such as Information Gain and Odds Ratio. During domain-adaptive pre-training, sentence-level weights are merged using an ensemble ranking approach and then used to pick masking candidates over a non-uniform distribution. Furthermore, we show that at the inference level, a pre-trained model and a target test corpus can be used effectively to find adequate collection frequencies before any domain adaptation or pre-training. The overall effectiveness of our approach is further reflected in various downstream tasks, where it is compared against multiple pre-trained dialectal models as well as current domain-adaptation strategies.
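The non-uniform candidate selection described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the thesis's implementation: the function name, the toy tokens, and the per-token weights (standing in for ensemble-ranked term-weighting scores) are all assumptions made for the example.

```python
import random

def pick_masking_candidates(tokens, weights, mask_ratio=0.15, seed=0):
    """Sample ~mask_ratio of the token positions without replacement,
    biased toward positions with higher term weights (non-uniform
    distribution), instead of masking uniformly at random."""
    rng = random.Random(seed)
    n_mask = max(1, round(len(tokens) * mask_ratio))
    positions = list(range(len(tokens)))
    chosen = []
    for _ in range(n_mask):
        # Weighted draw over the remaining positions.
        total = sum(weights[p] for p in positions)
        r = rng.uniform(0, total)
        acc = 0.0
        for p in positions:
            acc += weights[p]
            if acc >= r:
                chosen.append(p)
                positions.remove(p)
                break
    return sorted(chosen)

# Illustrative sentence: dialect-salient terms get larger weights,
# so they are more likely to be selected for masking.
tokens = ["the", "model", "masks", "salient", "dialectal", "terms", "first"]
weights = [0.1, 0.9, 0.3, 1.5, 2.0, 1.2, 0.1]
print(pick_masking_candidates(tokens, weights))
```

In actual MLM pre-training the selected positions would be replaced by the `[MASK]` token before the cloze prediction step; the sketch only covers the sampling decision.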
Description
Korea Advanced Institute of Science and Technology: School of Computing
Publisher
Korea Advanced Institute of Science and Technology (KAIST)
Issue Date
2022
Identifier
325007
Language
eng
Description

Thesis (Ph.D.) - Korea Advanced Institute of Science and Technology: School of Computing, 2022.8, [vi, 59 p.]

Keywords

Arabic dialects; Language modeling; Domain adaptation; Term weighting; Cloze task

URI
http://hdl.handle.net/10203/309254
Link
http://library.kaist.ac.kr/search/detail/view.do?bibCtrlNo=1021112&flag=dissertation
Appears in Collection
CS-Theses_Ph.D. (Doctoral theses)
Files in This Item
There are no files associated with this item.
