Recently, deep learning has achieved remarkable success in a wide range of recognition and classification tasks. In particular, gesture recognition, which is essential for human-computer interaction, has become far more accurate than before, but practical use still requires higher accuracy across more diverse situations. Just as humans rely not only on vision but also on hearing and other senses when recognizing gestures, many studies have used multimodal information to improve gesture recognition performance. This paper introduces a network that uses spatial and temporal attention maps for better multimodal gesture recognition. The proposed network is evaluated on the ChaLearn gesture dataset, and the results show that multimodal gesture recognition performance is improved.
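To make the notion of spatial and temporal attention maps concrete, the sketch below shows one common way such maps can reweight video features: a spatial map highlights informative regions within each frame, and a temporal map weights informative frames within a clip. The module names, feature shapes, and single-modality setup are illustrative assumptions for this sketch, not the paper's actual architecture.

```python
import torch
import torch.nn as nn

class SpatialAttention(nn.Module):
    """Produces a (B, 1, H, W) attention map and reweights the feature map."""
    def __init__(self, in_channels):
        super().__init__()
        self.conv = nn.Conv2d(in_channels, 1, kernel_size=1)

    def forward(self, x):                           # x: (B, C, H, W)
        attn = torch.sigmoid(self.conv(x))          # (B, 1, H, W) spatial map
        return x * attn                             # broadcast over channels

class TemporalAttention(nn.Module):
    """Scores each frame of a clip and pools the per-frame features."""
    def __init__(self, feat_dim):
        super().__init__()
        self.score = nn.Linear(feat_dim, 1)

    def forward(self, x):                           # x: (B, T, D)
        attn = torch.softmax(self.score(x), dim=1)  # (B, T, 1) temporal map
        return (x * attn).sum(dim=1)                # (B, D) attended clip feature

# Toy usage: a batch of 2 clips, 16 frames each, with 64x7x7 frame features.
frames = torch.randn(2, 16, 64, 7, 7)
spatial = SpatialAttention(64)
temporal = TemporalAttention(64)
per_frame = spatial(frames.flatten(0, 1))                # (B*T, C, H, W)
per_frame = per_frame.mean(dim=(2, 3)).view(2, 16, 64)   # pool to (B, T, D)
clip_feat = temporal(per_frame)                          # (B, D)
print(clip_feat.shape)  # torch.Size([2, 64])
```

In a multimodal setting, one such attention branch would typically be applied per modality (e.g., RGB and depth streams) before the attended features are fused for classification.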