This paper describes a framework for direct visual simultaneous localization and mapping (SLAM) combining a monocular camera with sparse depth information from Light Detection and Ranging (LiDAR). To ensure real-time performance while maintaining high accuracy in motion estimation, we present (i) a sliding window-based tracking method, (ii) strict pose marginalization for accurate pose-graph SLAM and (iii) depth-integrated frame matching for largescale mapping. Unlike conventional feature-based visual and LiDAR mapping, the proposed approach is direct, eliminating the visual feature in the objective function. We evaluated results using our portable camera-LiDAR system as well as KITTI odometry benchmark datasets. The experimental results prove that the characteristics of two complementary sensors are very effective in improving real-time performance and accuracy. Via validation, we achieved low drift error of 0.98% in the KITTI benchmark including various environments such as a highway and residential areas.