A Case for Two-stage Inference with Knowledge Caching

Abstract
Real-world intelligent services employing deep learning technology typically adopt a two-tier system architecture: a dumb front-end device and smart back-end cloud servers. The front-end device simply forwards a human query, while the back-end servers run a complex deep model to resolve the query and respond to the device. While simple and effective, this architecture not only increases the load on the servers but also risks harming user privacy. In this paper, we present knowledge caching, which exploits the front-end device as a smart cache of a generalized deep model. The cache locally resolves a subset of popular or privacy-sensitive queries and forwards the rest to the back-end cloud servers. We discuss the feasibility of knowledge caching as well as the technical challenges around deep model specialization and compression. We present a prototype two-stage inference system that populates a front-end cache with 10 voice commands out of 35. We demonstrate that our specialization and compression techniques reduce the cached model size by 17.4x relative to the original model while improving inference accuracy by 1.8x.
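To make the two-stage flow concrete, the Python sketch below illustrates the cache-or-forward decision described in the abstract. It is a minimal illustration, not the paper's implementation: the placeholder models, the command labels, and the 0.8 confidence threshold are all assumptions; a real deployment would run the specialized, compressed model on the device and the generalized model in the cloud.

    # Minimal sketch of two-stage inference with a front-end knowledge cache.
    # All names and values here are illustrative assumptions, not the authors' code.
    import numpy as np

    CACHED_COMMANDS = [f"command_{i}" for i in range(10)]  # 10 cached voice commands (hypothetical labels)
    OTHER = "other"                                        # fallback class for uncached queries
    CONFIDENCE_THRESHOLD = 0.8                             # assumed cache-hit threshold

    def cache_predict(features):
        """Small specialized on-device model (placeholder for the compressed DNN)."""
        scores = np.random.dirichlet(np.ones(len(CACHED_COMMANDS) + 1))  # fake class probabilities
        labels = CACHED_COMMANDS + [OTHER]
        idx = int(np.argmax(scores))
        return labels[idx], float(scores[idx])

    def cloud_predict(features):
        """Generalized back-end model covering all 35 commands (placeholder)."""
        return "cloud_resolved_command"

    def two_stage_inference(features):
        """Resolve the query locally when the cache is confident; otherwise forward it."""
        label, confidence = cache_predict(features)
        if label != OTHER and confidence >= CONFIDENCE_THRESHOLD:
            return label                 # cache hit: resolved on the device
        return cloud_predict(features)   # cache miss: forwarded to the cloud servers

    # Example: a query represented by an acoustic feature vector (e.g., MFCCs)
    print(two_stage_inference(np.zeros(40)))

In this sketch, only queries that the specialized on-device model recognizes with high confidence are answered locally; everything else is forwarded, mirroring the cache hit/miss behavior described in the abstract.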
Publisher
ACM SIGMOBILE
Issue Date
2019-06-19
Language
English
Citation
The 3rd International Workshop on Deep Learning for Mobile Systems and Applications (EMDL 2019)
DOI
10.1145/3325413.3329789
URI
http://hdl.handle.net/10203/263393
Appears in Collection
EE-Conference Papers