Modern deep learning models that power high-performance artificial intelligence applications typically require a large amount of computation for both training and inference, and GPUs are commonly used to reduce the time these take. Unfortunately, even state-of-the-art deep learning platforms often waste a substantial amount of GPU resources due to inefficient scheduling of computational operations, which inflates the completion time of the applications. We tackle this issue by designing efficient schedulers and implementing systems optimized for them at two levels: (1) a job-level scheduler for a shared GPU cluster in which multi-tenant users independently train their own deep learning models, and (2) an operator-level scheduler for one or more GPUs that cooperate to run training or inference of a single model in parallel. We present the design and full implementation of the proposed systems and show that they substantially reduce the overall execution time of real-world deep learning workloads compared with existing state-of-the-art systems.