Video Super-Resolution Based on 3D-CNNs with Consideration of Scene Change

In video super-resolution, the spatio-temporal coherence between and among frames must be exploited appropriately for accurate prediction of the high-resolution frames. Although 2D-CNNs are powerful at modelling images, 3D-CNNs are more suitable for spatio-temporal feature extraction because they preserve temporal information. To this end, we propose an effective 3D-CNN for video super-resolution that does not require motion alignment as preprocessing. The proposed 3DSRnet maintains the temporal depth of the spatio-temporal feature maps to maximally capture the temporally nonlinear characteristics between the low- and high-resolution frames, and adopts residual learning in conjunction with sub-pixel outputs. It outperforms the state-of-the-art method by 0.45 dB and 0.36 dB in average PSNR for scales 3 and 4, respectively, on the Vidset4 benchmark. Our 3DSRnet is also the first to address the performance drop caused by scene change, an issue that is important in practice but has not been previously considered.
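The abstract pairs residual learning with sub-pixel outputs: the network predicts a residual at low resolution across r² channels, which is then rearranged into high-resolution pixels (depth-to-space) and added to an upsampled base image. As an illustration only (not the authors' code; the function name and shapes are assumptions), a minimal NumPy sketch of this sub-pixel rearrangement is:

```python
import numpy as np

def pixel_shuffle(x, r):
    """Rearrange a (C*r*r, H, W) tensor into (C, H*r, W*r).

    Each group of r*r channels fills an r-by-r block of
    high-resolution pixels (the 'sub-pixel' output step).
    """
    c_rr, h, w = x.shape
    c = c_rr // (r * r)
    x = x.reshape(c, r, r, h, w)
    # Interleave the r-by-r channel blocks with the spatial dims.
    x = x.transpose(0, 3, 1, 4, 2)  # -> (c, h, r, w, r)
    return x.reshape(c, h * r, w * r)

# Hypothetical usage: 9 residual channels, scale r = 3,
# on a 2x2 low-resolution grid -> one 6x6 residual map.
residual_lr = np.arange(9 * 2 * 2, dtype=np.float64).reshape(9, 2, 2)
residual_hr = pixel_shuffle(residual_lr, r=3)  # shape (1, 6, 6)
# With residual learning, the final output would be
# bicubic_upsample(center_frame) + residual_hr.
```

The residual formulation lets the 3D-CNN concentrate its capacity on high-frequency detail, since the coarse content is supplied by the upsampled input frame.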
Institute of Electrical and Electronics Engineers (IEEE)
Issue Date

2019 IEEE International Conference on Image Processing (ICIP), pp.2381 - 2384

Appears in Collection
EE-Conference Papers
