Zico: Efficient GPU Memory Sharing for Concurrent DNN Training

Cited 10 times in Web of Science; cited 0 times in Scopus
DC Field | Value | Language
dc.contributor.author | Lim, Gangmuk | ko
dc.contributor.author | Ahn, Jeongseob | ko
dc.contributor.author | Xiao, Wencong | ko
dc.contributor.author | Kwon, Youngjin | ko
dc.contributor.author | Jeon, Myeongjae | ko
dc.date.accessioned | 2021-11-02T06:47:47Z | -
dc.date.available | 2021-11-02T06:47:47Z | -
dc.date.created | 2021-10-26 | -
dc.date.issued | 2021-07 | -
dc.identifier.citation | USENIX Annual Technical Conference / 15th USENIX Symposium on Operating Systems Design and Implementation (OSDI), pp. 523-536 | -
dc.identifier.uri | http://hdl.handle.net/10203/288556 | -
dc.description.abstract | GPUs are the workhorse of modern server infrastructure, fueling advances in a number of compute-intensive workloads such as deep neural network (DNN) training. Several recent works propose solutions for sharing GPU resources across multiple concurrent DNN training jobs, but none of them addresses the rapidly increasing memory footprint introduced by such job co-locations, which greatly limits the effectiveness of sharing GPU resources. In this paper, we present Zico, the first DNN system that aims at reducing the system-wide memory consumption for concurrent training. Zico keeps track of the memory usage pattern of each individual training job by monitoring its progress on GPU computations and makes memory reclaimed from the job globally sharable. Based on this memory management scheme, Zico automatically decides a strategy to share memory among concurrent jobs with minimum delay on training while not exceeding a given memory budget such as GPU memory capacity. Our evaluation shows that Zico outperforms existing GPU sharing approaches and delivers benefits over a variety of job co-location scenarios. | -
dc.language | English | -
dc.publisher | USENIX ASSOC | -
dc.title | Zico: Efficient GPU Memory Sharing for Concurrent DNN Training | -
dc.type | Conference | -
dc.identifier.wosid | 000696708600034 | -
dc.type.rims | CONF | -
dc.citation.beginningpage | 523 | -
dc.citation.endingpage | 536 | -
dc.citation.publicationname | USENIX Annual Technical Conference / 15th USENIX Symposium on Operating Systems Design and Implementation (OSDI) | -
dc.identifier.conferencecountry | US | -
dc.identifier.conferencelocation | ELECTR NETWORK | -
dc.contributor.localauthor | Kwon, Youngjin | -
dc.contributor.nonIdAuthor | Lim, Gangmuk | -
dc.contributor.nonIdAuthor | Ahn, Jeongseob | -
dc.contributor.nonIdAuthor | Xiao, Wencong | -
dc.contributor.nonIdAuthor | Jeon, Myeongjae | -
Appears in Collection
CS-Conference Papers