Biomedical named entity recognition with a deep learning-based label-label transition model

As the volume of textual information in biology and medicine grows rapidly, so does the demand for making use of ever-evolving knowledge on the Internet and in the literature. To offer structured and organized information, the various relationships among biomedical entities must be mined from diverse perspectives. In the information extraction pipeline, relation extraction is preceded by the discovery of biomedical entities, which must be highly accurate. Hence, the performance of biomedical named entity recognition (BioNER) is crucial to automated biomedical knowledge acquisition.

To improve BioNER performance, we have to overcome two problems. The first is the unique naming conventions of the biomedical domain. Biomedical entity names have the following characteristics: (i) descriptive naming conventions, (ii) diverse names for a single entity, (iii) abbreviations, and (iv) conjunctions and disjunctions. The second is the scarcity of annotated data. Because acquiring labels is costly, the amount of labeled data available for training improved models in a supervised manner is still limited. These two problems remain obstacles to the advance of BioNER.

In this dissertation, we address these problems by taking advantage of the notion of co-training. Co-training essentially comprises multiple learners, each given its own view of the data. Once the learners are trained, the knowledge they have formed is complementary and enhances model performance. We employ deep learning for representation learning in an end-to-end manner. This also motivates us to propose a novel co-training framework that incorporates deep learning, because previous co-training methods rely on manually split feature sets.
In the end, we solve the two problems in BioNER by proposing a novel co-training framework and relevant algorithms.

For the first problem, we present DTranNER, a CRF-based co-training framework that incorporates deep learning-based models. Conditional random fields (CRFs) are widely used for BioNER, which is treated as a sequence labeling problem. A CRF yields structured label outputs by examining correlations between neighboring labels. Hence, DTranNER employs two CRF-based sequence learners, namely Unary-CRF and Pairwise-CRF. They are differentiated by two types of deep neural networks, namely Unary-Network and Pairwise-Network. The former is dedicated to learning representations for individual labeling, while the latter aims to model correlations between labels in a fine-grained manner. As a result, Unary-Network and Pairwise-Network each offer complementary knowledge that the other lacks in prediction. In the end, we obtain representations sufficient to cope with the non-standardized naming conventions in BioNER. We performed experiments on five benchmark BioNER corpora. In comparison with current state-of-the-art methods, DTranNER achieved the best performance in four of the tests. In the ablation study, we also observed that Unary-Network and Pairwise-Network learn distinctive contextual clues that enhance BioNER.

For the second problem, we present a novel co-training algorithm, called “co-paced learning,” for BioNER that aims to leverage unlabeled data. The proposed algorithm is based on the co-training framework introduced above. Hence, co-paced learning is given the two sequence learners, namely Unary-CRF and Pairwise-CRF. Each is led to learn its own representation according to its potential type (i.e., unary or pairwise). By exploiting this complementary relationship, we present a robust pseudo-labeling approach in which each unlabeled sample is temporarily annotated with the learners' agreed prediction.
Next, each pseudo-labeled sample is examined to determine whether it is learnable, using the sample selection strategy that we propose. That is, the sample selection strategy rules out easy samples and offers informative samples to each learner. Thus, the proposed approach reflects the recent learning paradigms of curriculum learning and self-paced learning: the selection criterion gradually admits more complex samples as learning progresses. Consequently, Unary-CRF and Pairwise-CRF leverage each other to enhance their learning. The experiments show that co-paced learning outperforms current state-of-the-art semi-supervised learning methods.

The strength of the dissertation lies in the novel CRF-based co-training framework and the semi-supervised learning algorithm for the two aforementioned problems (i.e., (i) the unique naming conventions and (ii) the scarcity of annotated data). We expect that the study can be a stepping stone toward further progress in biomedical literature mining.
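
The abstract describes CRF learners built on unary (per-token) and pairwise (label-to-label transition) potentials. As a minimal sketch of how such scores can be combined at decoding time in a linear-chain CRF, the Python below runs Viterbi decoding over position-dependent pairwise scores. It is not the dissertation's implementation: the function name `viterbi_decode`, the score layout, and all numbers are assumptions made for illustration, and the hand-written score tables merely stand in for what networks such as Unary-Network and Pairwise-Network would produce.

```python
from typing import List, Sequence


def viterbi_decode(
    unary: Sequence[Sequence[float]],
    pairwise: Sequence[Sequence[Sequence[float]]],
) -> List[int]:
    """Return the highest-scoring label sequence of a linear-chain CRF.

    unary[t][y]       -- score of label y at token t (assumed layout)
    pairwise[t][p][y] -- score of moving from label p at token t to
                         label y at token t + 1 (assumed layout);
                         len(pairwise) must equal len(unary) - 1
    """
    n_tokens = len(unary)
    if n_tokens == 0:
        return []
    n_labels = len(unary[0])

    # best[y] = score of the best path ending in label y at the current token
    best = list(unary[0])
    backpointers: List[List[int]] = []

    for t in range(1, n_tokens):
        new_best: List[float] = []
        pointers: List[int] = []
        for y in range(n_labels):
            # Score of reaching label y from every possible previous label p
            candidates = [best[p] + pairwise[t - 1][p][y] for p in range(n_labels)]
            p_star = max(range(n_labels), key=candidates.__getitem__)
            new_best.append(candidates[p_star] + unary[t][y])
            pointers.append(p_star)
        best = new_best
        backpointers.append(pointers)

    # Trace the best path backwards from the best final label
    y = max(range(n_labels), key=best.__getitem__)
    path = [y]
    for pointers in reversed(backpointers):
        y = pointers[y]
        path.append(y)
    path.reverse()
    return path


# Toy example with labels O=0, B=1, I=2 (all scores are made up).
unary = [
    [0.1, 2.0, 0.0],  # token 0 looks like B
    [1.0, 0.0, 0.9],  # token 1: O narrowly beats I on unary scores alone
    [2.0, 0.0, 0.0],  # token 2 looks like O
]
transition = [
    [0.0, 0.0, -5.0],  # O -> I is strongly discouraged
    [0.0, 0.0, 1.0],   # B -> I is encouraged
    [0.0, 0.0, 0.5],   # I -> I is mildly encouraged
]
pairwise = [transition, transition]  # same table reused at both positions

print(viterbi_decode(unary, pairwise))  # [1, 2, 0], i.e. B, I, O
```

Picking the best label per token from the unary scores alone would give B, O, O in this toy case; the pairwise scores tip the second token to I, which shows how the two kinds of potential can complement each other.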
Advisors
Lee, Jae-Gil (이재길)
Description
Korea Advanced Institute of Science and Technology (KAIST): Graduate School of Knowledge Service Engineering
Publisher
Korea Advanced Institute of Science and Technology (KAIST)
Issue Date
2020
Identifier
325007
Language
eng
Description

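Doctoral dissertation (Ph.D.) - Korea Advanced Institute of Science and Technology (KAIST): Graduate School of Knowledge Service Engineering, 2020.2, [vi, 75 p.]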

Keywords

Bioinformatics; Data Mining; Information Extraction; Machine Learning; Semi-Supervised Learning; Natural Language Processing; Named Entity Recognition; Deep Learning; Sequence Labeling

URI
http://hdl.handle.net/10203/284561
Link
http://library.kaist.ac.kr/search/detail/view.do?bibCtrlNo=947936&flag=dissertation
Appears in Collection
KSE-Theses_Ph.D.(박사논문)
Files in This Item
There are no files associated with this item.
