This paper addresses multi-target tracking using a monocular vision sensor. To overcome the fundamental observability issue of monocular vision, a convolutional neural network (CNN)-based method is proposed. The method integrates CNN-based multi-target detection into a model-based multi-target tracking framework. While previous applications of CNNs to image-based object recognition and tracking focused on predicting regions of interest (RoIs), the proposed method predicts the three-dimensional position of the moving objects of interest. This is achieved by constructing a network tailored to moving-object tracking problems in which objects may be occluded. In addition, a cubature Kalman filter integrated with a data association scheme is adopted to track the nonlinear motion of the objects effectively, using the measurement information from the learned network. A virtual simulator that generates the trajectories of the target motions and a sequence of images of the scene has been developed and used to test and verify the proposed CNN scheme. Simulation case studies demonstrate that the proposed CNN substantially improves position accuracy in the depth direction.
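The observability issue mentioned above can be illustrated with a minimal sketch. It assumes an ideal pinhole camera, which is not specified in this abstract; the function name `project` and the intrinsic values (focal length `f`, principal point `cx`, `cy`) are hypothetical and chosen only for illustration. Two points on the same viewing ray but at different depths map to the same pixel, so a single image alone does not determine depth.

```python
def project(point, f=800.0, cx=320.0, cy=240.0):
    """Project a 3D point (X, Y, Z) in the camera frame onto the image plane.

    Assumes an ideal pinhole model with hypothetical intrinsics:
    focal length f (pixels) and principal point (cx, cy).
    """
    X, Y, Z = point
    if Z <= 0:
        raise ValueError("point must lie in front of the camera (Z > 0)")
    u = f * X / Z + cx
    v = f * Y / Z + cy
    return (u, v)


# Two targets on the same viewing ray: the second is twice as far away.
near = (1.0, 0.5, 5.0)
far = (2.0, 1.0, 10.0)

pixel_near = project(near)
pixel_far = project(far)

# Both targets land on the same pixel, so depth is lost in the projection.
same_pixel = pixel_near == pixel_far
```

Under this assumed model, any scaling of a point along its viewing ray leaves the pixel coordinates unchanged, which is why additional information, such as a learned depth estimate or a motion model, is needed to recover the full three-dimensional position.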