Optimizing the aggregate throughput of concurrent deep learning jobs on a shared cluster

DC Field | Value | Language
dc.contributor.advisor | Park, Kyoung Soo | -
dc.contributor.advisor | 박경수 | -
dc.contributor.author | Son, Kyuho | -
dc.date.accessioned | 2019-09-04T02:41:38Z | -
dc.date.available | 2019-09-04T02:41:38Z | -
dc.date.issued | 2018 | -
dc.identifier.uri | http://library.kaist.ac.kr/search/detail/view.do?bibCtrlNo=828576&flag=dissertation | en_US
dc.identifier.uri | http://hdl.handle.net/10203/266783 | -
dc.description | Thesis (Master's) - Korea Advanced Institute of Science and Technology (KAIST), School of Electrical Engineering (Interdisciplinary Program in Semiconductors), 2018.8, [3, 27 p.] | -
dc.description.abstract | The explosive popularity of deep learning (DL) has led to the evolution of deep learning frameworks. Unfortunately, despite the need to run multiple deep learning jobs on a shared GPU cluster, current cloud schedulers are often insufficient to schedule them efficiently. Managing resources for deep learning models without enough information or expertise leads to poor scalability and adversely affects overall cluster performance. In this paper, we present Max-Speedup, a scheduling policy for multi-tenant deep learning jobs on a shared GPU cluster. We address two main challenges: 1) precise estimation of training throughput to analyze the resource-performance trade-off of a deep learning model, and 2) an efficient scheduling policy for multi-tenant deep learning jobs on a shared GPU cluster. We tackle these problems by estimating the finish time of parameter synchronization and maximizing the aggregate speedup through the performance-resource trade-offs of DL jobs. Our evaluation shows that Max-Speedup improves the average job completion time by 3x over SRTF while reducing makespan by up to 26.9x. | -
dc.language | eng | -
dc.publisher | Korea Advanced Institute of Science and Technology (KAIST) | -
dc.subject | Job scheduler; deep learning; performance estimation; GPU cluster; resource management | -
dc.subject | 작업 스케줄러; 딥러닝; 성능 예측; GPU 클러스터; 자원 관리 | -
dc.title | Optimizing the aggregate throughput of concurrent deep learning jobs on a shared cluster | -
dc.title.alternative | 공유 클러스터에서 동시 딥 러닝 작업의 총 처리성능 최적화 | -
dc.type | Thesis (Master) | -
dc.identifier.CNRN | 325007 | -
dc.description.department | Korea Advanced Institute of Science and Technology (KAIST), School of Electrical Engineering (Interdisciplinary Program in Semiconductors) | -
dc.contributor.alternativeauthor | 손규호 | -
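
The abstract above describes Max-Speedup as maximizing the aggregate speedup of DL jobs by exploiting their performance-resource trade-offs. As a reading aid only, the sketch below shows one way such an allocator could work: a greedy loop that hands each GPU to the job with the largest marginal speedup gain, given per-job speedup estimates. This is an assumption-laden illustration, not the thesis implementation; the Job class, speedup_curve field, and allocate_gpus function are hypothetical names, and the thesis derives its estimates from the parameter-synchronization finish-time model rather than taking them as input.

```python
"""Illustrative sketch (not the thesis implementation) of a greedy
"maximize aggregate speedup" GPU allocator. It assumes each job exposes
an estimated speedup curve: speedup with g GPUs relative to 1 GPU."""

from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Job:
    name: str
    # Estimated speedup for 1..max GPUs, e.g. [1.0, 1.8, 2.3, 2.5].
    # In the thesis this would come from the throughput /
    # synchronization-time estimator; here it is simply given as input.
    speedup_curve: List[float]


def allocate_gpus(jobs: List[Job], total_gpus: int) -> Dict[str, int]:
    """Greedily hand out GPUs one at a time to the job whose aggregate
    speedup increases the most, until GPUs run out or no job benefits."""
    alloc = {job.name: 0 for job in jobs}
    for _ in range(total_gpus):
        best_job, best_gain = None, 0.0
        for job in jobs:
            g = alloc[job.name]
            if g >= len(job.speedup_curve):
                continue  # no estimate beyond this point; stop growing
            current = job.speedup_curve[g - 1] if g > 0 else 0.0
            gain = job.speedup_curve[g] - current
            if gain > best_gain:
                best_job, best_gain = job, gain
        if best_job is None:
            break  # no remaining allocation improves aggregate speedup
        alloc[best_job.name] += 1
    return alloc


if __name__ == "__main__":
    jobs = [
        Job("resnet", [1.0, 1.9, 2.6, 3.0]),   # scales well
        Job("lstm",   [1.0, 1.3, 1.4, 1.45]),  # communication-bound
    ]
    print(allocate_gpus(jobs, total_gpus=4))
    # -> {'resnet': 3, 'lstm': 1}
```

In this toy run the communication-bound job stops receiving GPUs once its marginal speedup falls below that of the well-scaling job, which is the kind of performance-resource trade-off the abstract refers to.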
Appears in Collection
EE-Theses_Master (Master's Theses)
Files in This Item
There are no files associated with this item.
