Compressing neural networks is an effective way to meet the requirements of memory-constrained edge devices. We propose a novel array microarchitecture that exploits compressed neural networks with nonlinearly quantized weights and supports variable bit widths for both activations and compressed weights. Computation is made more efficient by first accumulating all activations that share the same weight and only then multiplying the accumulated sum by that weight.
The design has been fabricated in TSMC 28 nm technology. It achieves 3.4 TOPS/W with 16b activations and 16b weights (4b compressed) and 3.7 TOPS/W on the convolutional layers of AlexNet (8b activations, 4b compressed weights) with the ImageNet dataset, consuming 15.6 mW at 44 fps. This is comparable to state-of-the-art chip implementations, while offering greater flexibility with a simple array structure.
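The accumulate-before-multiply idea can be illustrated in software with a minimal sketch. It assumes that each 4b compressed weight is an index into a 16-entry table of quantized weight values; this codebook layout, the function names, and the sample data are hypothetical and are not taken from the fabricated design.

```python
# Illustrative sketch only: the codebook layout, names, and data are assumptions.

def dot_direct(activations, weight_indices, codebook):
    """Baseline: decode every weight and multiply once per activation."""
    return sum(a * codebook[i] for a, i in zip(activations, weight_indices))


def dot_accumulate_first(activations, weight_indices, codebook):
    """Add up activations that share a weight index, then multiply once per entry."""
    bins = [0] * len(codebook)          # one accumulator per distinct weight value
    for a, i in zip(activations, weight_indices):
        bins[i] += a                    # additions only in the inner loop
    # At most len(codebook) multiplications, independent of the input length.
    return sum(b * w for b, w in zip(bins, codebook))


# Hypothetical 16-entry codebook with nonlinearly spaced integer values.
codebook = [-4096, -2048, -1024, -512, -256, -128, -64, -32,
            32, 64, 128, 256, 512, 1024, 2048, 4096]

# Deterministic sample data: 64 activations with 4-bit weight indices.
activations = [(7 * k) % 251 for k in range(64)]
weight_indices = [(5 * k + 3) % 16 for k in range(64)]

# Both orderings give the same dot product for integer inputs.
assert dot_direct(activations, weight_indices, codebook) == \
    dot_accumulate_first(activations, weight_indices, codebook)
```

In this sketch the direct form performs 64 multiplications, while the accumulate-first form performs 64 additions followed by 16 multiplications. The saving grows with the number of activations per output, since the multiplication count is bounded by the number of distinct weight values.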