DSpace at KOASAS: (An) approach to building a massively-parallel search engine using DBMSs and integrating distributed memory

DSpace at KOASAS

College of Engineering(공과대학)School of Computing(전산학부)CS-Theses_Ph.D.(박사논문)

(An) approach to building a massively-parallel search engine using DBMSs and integrating distributed memoryDBMS를 사용하고 분산 메모리를 통합한 대용량 병렬 검색 엔진을 구축하는 방법

Cited 0 time in webofscience

Cited 0 time in scopus

Hit : 617
Download : 0

Export

Yun, Tae-Seob / 윤태섭

A massively-parallel Web search engine consists of hundreds of thousands of nodes to store large-scale data in a distributed manner and process queries in parallel based on “NoSQL” systems. We focus on two facts that 1) NoSQL systems have very simple and primitive functionality while DBMSs effectively provide high-level functionalities such as SQL, schemas, or indexes, 2) even though we have a large amount of collective memory space over multiple nodes of a large-scale search engine, there is no work reported in the academic world that coordinates memories of multiple nodes as if we had one integrated memory storing the entire index. In this dissertation, therefore, we propose an approach to building a massively-parallel search engine using DBMSs and integrating distributed memory. In the first part of the dissertation, we claim that building a massively-parallel search engine using a parallel DBMS can be an attractive alternative since it supports a higher-level (i.e., SQL-level) interface than that of a distributed file system for easy and less error-prone application development while providing scalability. For this purpose, we evaluate and estimate the performance of ODYS[65], which is a massively-parallel search engine using a DB-IR tightly integrated parallel DBMS, to verify its commercial-level scalability and efficiency. We estimate the performance based on a hybrid (i.e., analytic and experimental) performance model for the parallel search engine. We show that the estimation error between the model and the actual experiment is less than 2.13% by observing that the bulk of the query processing time is spent at the slave (vs. at the master and network) and by estimating the time spent at the slave based on actual measurement. Using our model, we demonstrate a commercial-level scalability and performance of our architecture. Our proposed system ODYS is capable of handling 1 billion queries per day for 30 billion Web pages by using only 43,472 nodes with an average query response time of 194 ms. By using twice as many (86,944) nodes, ODYS can provide an average query response time of 148 ms. These results show that building a massively-parallel search engine using a parallel DBMS is a viable approach with advantages of supporting the high-level (i.e., DBMS-level), SQL-like programming interface. In the second part of the dissertation, we propose two-dimensional indexing---a novel in-memory indexing architecture that operates over distributed memory of a massively-parallel search engine. The goal of two-dimensional indexing is to provide a one-integrated-memory view as in a single node system using one large integrated memory. In two-dimensional indexing, we partition the entire index into n × m fragments and distribute them over the memories of multiple nodes in such a way that each fragment is entirely stored in main memory of one node. The proposed architecture is not only scalable as it uses a scaled-out shared-nothing architecture but also is capable of achieving low query response time as it processes queries in main memory. We also propose the concept of the one-memory point, which is the amount of the memory space required to completely store the entire index in main memory providing a one-integrated-memory view. We investigate the one-memory point by analyzing an actual dataset and compare the result with the experiments. We first prove the effectiveness of two-dimensional indexing with single-keyword queries, and then, extend the notion so as to be able to handle multiple-keyword queries. In experiments using the real-life search query set over a database consisting of 100 million Web documents crawled, we show that two-dimensional indexing can effectively provide a one-integrated-memory view without too much of additional memory compared with the single node system using one large integrated memory. This ratio of additional memory is expected to decrease as we use a larger size of the database. We also show that, with a six-node prototype, in an ideal case, it significantly improves the query processing performance over a disk-based search engine with an equivalent amount of in-memory buffer but without two-dimensional indexing by 105.05 times when queuing effects are excluded. This improvement is expected to get larger as the system is scaled-out with a larger number of machines. We summarize the contributions of this dissertation as follows. First, we have shown that a massively parallel search engine capable of processing real-world scale data and query loads can be implemented using a DB-IR tightly integrated parallel DBMS. We have presented the detailed implementation of such a system, ODYS, that provides higher-level interface and functionalities for easy application development and maintenance while providing superior performance and scalability. Second, we have proposed a notion of two-dimensional indexing that simulates collective distributed memory as one integrated memory for a massively-parallel search engine. For two-dimensional indexing, we partition the entire index into a two-dimensional array of multiple index fragments and completely store them into main memories of the multiple machines. We also have implemented the two-dimensional indexing architecture in ODYS, namely, ODYS/2D-Indexing. Finally, through extensive experiments using real-world scale data and query loads, we have estimated the performance of ODYS and have shown the effectiveness of two-dimensional indexing.

Advisors: Whang, Kyu-Young researcher; 황규영 researcher

Description: 한국과학기술원 :전산학부,

Publisher: 한국과학기술원

Issue Date: 2016

Identifier: 325007

Language: eng

Description: 학위논문(박사) - 한국과학기술원 : 전산학부, 2016.2 ,[vii, 76 p. :]

Keywords: Massively-Parallel Search Engines; Performance Modeling; Parallel-DBMS; Distributed Memory; Two-Dimensional Indexing; 대형 병렬 검색 엔진; 성능 모델; 병렬 DBMS; 분산 메모리; 이차원 색인

URI: http://hdl.handle.net/10203/222426

Link: http://library.kaist.ac.kr/search/detail/view.do?bibCtrlNo=648282&flag=dissertation

Appears in Collection: CS-Theses_Ph.D.(박사논문)

Files in This Item: There are no files associated with this item.

Display Full Item Record

qr_code

트윗하기

KOASAS

Knowledge Service Development Team, KAIST 291 Daehak-ro, Yuseong-gu, Daejeon 34141, Republic of Korea. T. 82-42-350-4493 Email. koasas@kaist.ac.kr
Copyright © 2016. Korea Advanced Institute of Science and Technology. All Rights Reserved.

KOASAS

KOASAS

Browse

(An) approach to building a massively-parallel search engine using DBMSs and integrating distributed memoryDBMS를 사용하고 분산 메모리를 통합한 대용량 병렬 검색 엔진을 구축하는 방법

KOASAS

Communities & Collections