Recent deep convolutional neural networks (CNNs) outperform conventional hand-crafted algorithms in a wide variety of intelligent vision tasks, but they require billions of operations and hundreds of millions of weights. To process large-scale CNNs energy-efficiently, this dissertation designs three generations of CNN hardware. The first two generations are CNN processors based on the conventional von Neumann architecture, while the third generation is based on an in-DRAM processing framework that departs from the von Neumann architecture. The first-generation primitive CNN processor integrates dual-range multiply-accumulate (MAC) blocks that exploit the statistics of input feature values to reduce the energy consumption of MAC operations. A tile-based computing method is also proposed in the primitive CNN processor. As a result, it achieves 1.42 TOPS/W energy efficiency on the LeNet-5 CNN model. The second-generation advanced CNN processor operates at near-threshold voltage (NTV) to further reduce energy consumption. It also features a newly proposed enhanced output-stationary (EOS) dataflow and a two-stage big-and-small on-chip memory architecture, achieving up to 1.15 TOPS/W energy efficiency on the VGG-16 model. Finally, the third-generation in-DRAM processing binary CNN hardware handles the dominant convolution operations by serially cascading in-DRAM bulk bitwise operations. To this end, we first identify that bitcount operations implemented with only bulk bitwise AND/OR/NOT incur significant delay overhead as kernel sizes grow. We then not only optimize performance by efficiently allocating inputs and kernels to DRAM banks for both convolutional and fully-connected layers through design-space exploration, but also mitigate the overhead of bitcount operations by splitting kernels into multiple parts.
Partial-sum accumulations and the tasks of the remaining layers, such as max-pooling and normalization, are processed in the peripheral area of the DRAM with negligible overhead. As a result, our in-DRAM binary CNN processing framework achieves 19x-36x performance and 9x-14x EDP improvements for convolutional layers, and 9x-17x performance and 1.4x-4.5x EDP improvements for fully-connected layers over a previous PIM technique on four large-scale CNN models. It also achieves 3.796 TOPS/W energy efficiency on the AlexNet CNN model.
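The reduction of binary convolution to bitwise operations plus a split bitcount can be illustrated with a minimal functional sketch. This is not the dissertation's hardware design: the function name, the packing of +1/-1 values into integer bits, and the chunk width are illustrative assumptions; the sketch only shows why splitting a kernel into parts keeps each partial bitcount small while partial sums are accumulated separately.

```python
# Illustrative sketch (not the dissertation's implementation): a binary
# dot product of +1/-1 vectors packed as integer bits, computed as
#   dot = n - 2 * popcount(a XOR w),
# with the bitcount split into fixed-width chunks, mimicking how a large
# kernel can be divided into parts whose partial counts are accumulated.
def xnor_popcount_dot(a_bits, w_bits, n_bits, chunk=8):
    """Bit i set means +1, clear means -1; `chunk` is the split width."""
    x = a_bits ^ w_bits                    # 1-bits mark mismatching positions
    mask = (1 << chunk) - 1
    mismatches = 0
    for shift in range(0, n_bits, chunk):  # kernel splitting: one part per chunk
        part = (x >> shift) & mask
        mismatches += bin(part).count("1") # cheap per-part bitcount
    return n_bits - 2 * mismatches         # matches minus mismatches

# Example: a = [+1,+1,-1,-1], w = [+1,-1,+1,-1] gives a zero dot product.
print(xnor_popcount_dot(0b1100, 0b1010, 4))   # 0
```

In hardware, the XNOR itself would be composed from bulk AND/OR/NOT row operations, and each per-part count stays shallow regardless of total kernel size, which is the point of the split.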