Estimating hand pose from images of hand-object interaction can help replicate natural hand behavior in practical applications of virtual reality and robotics. However, the intricacy of hand-object interaction, combined with mutual occlusion and the need for physical plausibility, makes the problem challenging. Deep learning has emerged as a powerful technique for many problems, including hand pose estimation. This paper provides a comprehensive survey of state-of-the-art deep learning-based approaches for estimating hand pose (joints and shape) in the context of hand-object interaction. We discuss deep learning-based approaches to image-based hand tracking, covering both hand joint estimation and hand shape estimation. In addition, we review the hand-object interaction benchmark datasets that are widely used by hand joint and shape estimation methods. Finally, we discuss the remaining challenges and the future research directions they suggest.