DSpace at KOASAS: Structural optimization of a full-text n-gram index using relational normalization

Structural optimization of a full-text n-gram index using relational normalization

Authors: Kim, Min-Soo / Whang, Kyu-Young / Lee, Jae-Gil / Lee, Min-Jae

As the amount of text data grows explosively, an efficient index structure for large text databases becomes ever more important. The n-gram inverted index (simply, the n-gram index) has been widely used in information retrieval and in approximate string matching because of two major advantages: it is language-neutral and error-tolerant. Nevertheless, the n-gram index also has drawbacks: its size tends to be very large, and its query performance tends to be poor. In this paper, we propose the two-level n-gram inverted index (simply, the n-gram/2L index), which significantly reduces the size and improves the query performance by using relational normalization theory. We first identify that, in the (full-text) n-gram index, there is redundancy in the position information caused by a non-trivial multivalued dependency. The proposed index eliminates this redundancy by constructing the index in two levels: the front-end index and the back-end index. We formally prove that this two-level construction is identical to the relational normalization process. We call this process structural optimization of the n-gram index. The n-gram/2L index has excellent properties: (1) it significantly reduces the size and improves the performance compared with the n-gram index, with these improvements becoming more marked as the database size gets larger; (2) the query processing time increases only very slightly as the query length gets longer. Experimental results using real databases of 1 GB show that the n-gram/2L index is smaller than the n-gram index by a factor of up to 1.9-2.4 and, at the same time, improves query performance by up to 13.1 times. We also compare the n-gram/2L index with Mäkinen's compact suffix array (CSA) (Proc. 11th Annual Symposium on Combinatorial Pattern Matching, pp. 305-319, 2000) stored on disk. Experimental results show that the n-gram/2L index outperforms the CSA when the query length is short (i.e., less than 15-20), and that the CSA is similar to or better than the n-gram/2L index when the query length is long (i.e., more than 15-20).
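The abstract does not give construction details, so the following Python sketch is only an illustration of the general idea, not the authors' implementation. It assumes that documents are cut into fixed-length subsequences of length `m` that overlap by `n - 1` characters, that the back-end index maps each n-gram to offsets inside distinct subsequences, and that the front-end index maps each subsequence to its positions in documents. All function names and the parameters `n` and `m` are hypothetical.

```python
from collections import defaultdict


def build_ngram_index(docs, n):
    """Plain n-gram inverted index: n-gram -> {doc_id: [offsets in document]}."""
    index = defaultdict(lambda: defaultdict(list))
    for doc_id, text in enumerate(docs):
        for off in range(len(text) - n + 1):
            index[text[off:off + n]][doc_id].append(off)
    return index


def split_subsequences(text, m, n):
    """Cut text into pieces of length m overlapping by n - 1 characters
    (assumed scheme), so every n-gram lies wholly inside one piece."""
    if m < n:
        raise ValueError("m must be at least n")
    step = m - (n - 1)
    pieces, off = [], 0
    while True:
        pieces.append((off, text[off:off + m]))
        if off + m >= len(text):
            break
        off += step
    return pieces


def build_two_level_index(docs, n, m):
    """Front-end: subsequence id -> [(doc_id, offset of subsequence)].
    Back-end: n-gram -> {subsequence id: [offsets inside subsequence]}."""
    sub_ids = {}
    front = defaultdict(list)
    back = defaultdict(lambda: defaultdict(list))
    for doc_id, text in enumerate(docs):
        for off, piece in split_subsequences(text, m, n):
            if piece not in sub_ids:
                # A repeated subsequence is indexed in the back-end only once.
                sid = len(sub_ids)
                sub_ids[piece] = sid
                for k in range(len(piece) - n + 1):
                    back[piece[k:k + n]][sid].append(k)
            front[sub_ids[piece]].append((doc_id, off))
    return front, back


def occurrences_plain(index, gram):
    """All (doc_id, offset) pairs of gram, read from the plain index."""
    per_doc = index.get(gram, {})
    return sorted((d, o) for d, offs in per_doc.items() for o in offs)


def occurrences_two_level(front, back, gram):
    """Same answer, obtained by joining the back-end and front-end postings."""
    hits = set()
    for sid, offsets in back.get(gram, {}).items():
        for doc_id, base in front[sid]:
            for k in offsets:
                hits.add((doc_id, base + k))
    return sorted(hits)


def count_postings_plain(index):
    return sum(len(offs) for per_doc in index.values() for offs in per_doc.values())


def count_postings_two_level(front, back):
    front_n = sum(len(p) for p in front.values())
    back_n = sum(len(o) for per_sub in back.values() for o in per_sub.values())
    return front_n + back_n


if __name__ == "__main__":
    docs = ["abcdefgh" * 20] * 3  # deliberately repetitive toy corpus
    plain = build_ngram_index(docs, n=2)
    front, back = build_two_level_index(docs, n=2, m=9)
    print("plain postings:    ", count_postings_plain(plain))
    print("two-level postings:", count_postings_two_level(front, back))
```

In this toy setup, position information that the plain index repeats for every occurrence of an n-gram is stored once per distinct subsequence, which is the kind of redundancy the abstract attributes to a multivalued dependency. How much is saved depends on how often subsequences repeat, so the sketch says nothing about the sizes reported in the paper.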

Publisher: SPRINGER

Issue Date: 2008-11

Language: English

Article Type: Article; Proceedings Paper

Citation: VLDB JOURNAL, v.17, pp.1485 - 1507

ISSN: 1066-8888

DOI: 10.1007/s00778-007-0082-x

URI: http://hdl.handle.net/10203/93008

Appears in Collection: CS-Journal Papers; IE-Journal Papers

Files in This Item: There are no files associated with this item.

Knowledge Service Development Team, KAIST 291 Daehak-ro, Yuseong-gu, Daejeon 34141, Republic of Korea. T. 82-42-350-4493 Email. koasas@kaist.ac.kr
Copyright © 2016. Korea Advanced Institute of Science and Technology. All Rights Reserved.
