DC Field | Value | Language |
---|---|---|
dc.contributor.author | Jeong, Soyeong | - |
dc.date.accessioned | 2023-06-26T19:31:27Z | - |
dc.date.available | 2023-06-26T19:31:27Z | - |
dc.date.issued | 2022 | - |
dc.identifier.uri | http://library.kaist.ac.kr/search/detail/view.do?bibCtrlNo=1000338&flag=dissertation | en_US |
dc.identifier.uri | http://hdl.handle.net/10203/309531 | - |
dc.description | Master's thesis - Korea Advanced Institute of Science and Technology (KAIST): School of Computing, 2022.2, [iv, 35 p.] | - |
dc.description.abstract | One of the challenges in information retrieval (IR) is the vocabulary mismatch problem: the failure to retrieve a query-relevant document when the terms in the query and the document are lexically different but semantically similar. Recent work has tackled the problem either by expanding sparse representations with additional relevant terms or by embedding the representations in a learnable dense space. However, both expansion and dense models generally require a large volume of labeled query-document pairs for training, and human-annotated pairs are often difficult to acquire. This thesis focuses on augmenting document representations, either at the document text level or at the training dataset level, without requiring additional labeled query-document pairs, for both sparse and dense retrieval models. For the sparse retrieval model, we propose Unsupervised Document Expansion with Generation (UDEG), which generates diverse supplementary sentences for the original document without using query-document labels for training. When generating sentences, we further stochastically perturb their embeddings to produce more diverse sentences for document expansion. We validate UDEG on two standard IR benchmark datasets; the results show that it significantly outperforms relevant expansion baselines. For the dense retrieval model, we propose Document Augmentation for dense Retrieval (DAR), which augments document representations with interpolation and perturbation. We validate DAR on retrieval tasks with two benchmark datasets, showing that it significantly outperforms relevant baselines on dense retrieval of both seen and unseen documents. We believe that UDEG and DAR contribute to sparse and dense retrievers by augmenting document representations without annotating additional query-document pairs. | - |
dc.language | eng | - |
dc.publisher | Korea Advanced Institute of Science and Technology (KAIST) | - |
dc.subject | Natural language understanding; Information retrieval; Data augmentation; Document expansion; Interpolation; Perturbation | - |
dc.subject | 자연 언어 이해; 정보 검색; 데이터 증강; 문서 확장; 보간; 섭동 | - |
dc.title | Information retrieval by augmenting document representation | - |
dc.title.alternative | 문서 표현 증강을 통한 정보 검색 | - |
dc.type | Thesis(Master) | - |
dc.identifier.CNRN | 325007 | - |
dc.description.department | Korea Advanced Institute of Science and Technology (KAIST): School of Computing | - |
dc.contributor.alternativeauthor | 정소영 | - |