MaghBERT: pre-trained language models for the Maghrebi dialects

Masked language models with random masking have brought significant performance gains to a wide range of natural language processing tasks. However, their performance is bounded by the domain of the raw pre-training corpus: target domains that shift notably from the source domain generally perform poorly, even when the two share a considerable amount of semantics. A particular case of this paradigm is Modern Standard Arabic and the Arabic dialects. Although both share a large proportion of semantics, a model pre-trained on the standardized variant fails to perform adequately on the dialects. A typical solution to this problem is to expose the pre-trained model to the target domain through another round of pre-training, a process known as domain adaptation. However, recent domain-adaptation techniques fail to deal with noisy target data, which limits what they can learn while also harming the representation of the source domain. To address these issues, we propose a semi-supervised masking strategy that leverages a relatively small set of supervised signals to extract various term-weighting schemes, such as Information Gain and Odds Ratio. During domain-adaptive pre-training, sentence-level weights are merged using an ensemble ranking approach and then used to pick masking candidates over a non-uniform distribution. Furthermore, we show that at the inference level, a pre-trained model and a target test corpus can be used effectively to find adequate collection frequencies before any domain adaptation or pre-training. The overall effectiveness of our approach is further reflected in various downstream tasks, where it is compared against multiple pre-trained dialectal models as well as current domain-adaptation strategies.
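The non-uniform candidate selection described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the thesis's implementation: the function name, the toy tokens, and the per-token weights (standing in for ensemble-ranked term-weighting scores) are all assumptions made for the example.

```python
import random

def pick_masking_candidates(tokens, weights, mask_ratio=0.15, seed=0):
    """Sample ~mask_ratio of the token positions without replacement,
    biased toward positions with higher term weights (non-uniform
    distribution), instead of masking uniformly at random."""
    rng = random.Random(seed)
    n_mask = max(1, round(len(tokens) * mask_ratio))
    positions = list(range(len(tokens)))
    chosen = []
    for _ in range(n_mask):
        # Weighted draw over the remaining positions.
        total = sum(weights[p] for p in positions)
        r = rng.uniform(0, total)
        acc = 0.0
        for p in positions:
            acc += weights[p]
            if acc >= r:
                chosen.append(p)
                positions.remove(p)
                break
    return sorted(chosen)

# Illustrative sentence: dialect-salient terms get larger weights,
# so they are more likely to be selected for masking.
tokens = ["the", "model", "masks", "salient", "dialectal", "terms", "first"]
weights = [0.1, 0.9, 0.3, 1.5, 2.0, 1.2, 0.1]
print(pick_masking_candidates(tokens, weights))
```

In actual MLM pre-training the selected positions would be replaced by the `[MASK]` token before the cloze prediction step; the sketch only covers the sampling decision.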
Description
Korea Advanced Institute of Science and Technology: School of Computing
Publisher
Korea Advanced Institute of Science and Technology (KAIST)
Issue Date
2022
Identifier
325007
Language
eng
Description

Thesis (Ph.D.) - Korea Advanced Institute of Science and Technology: School of Computing, 2022.8, [vi, 59 p.]

Keywords

Arabic dialects; Language modeling; Domain adaptation; Term weighting; Cloze task

URI
http://hdl.handle.net/10203/309254
Link
http://library.kaist.ac.kr/search/detail/view.do?bibCtrlNo=1021112&flag=dissertation
Appears in Collection
CS-Theses_Ph.D. (Doctoral theses)
Files in This Item
There are no files associated with this item.
