Interest in robotics and deep learning has grown rapidly in recent years. Robots are being developed to assist people in every part of daily life, and deep learning is essential for improving their performance. Most robots rely on vision and depth data to perform tasks in various fields, including gesture recognition. However, with a vision sensor alone, it is hard to discriminate between similar gestures or situations. Misrecognizing a gesture can be dangerous, especially in an emergency. We therefore attempt to detect emergency situations in real time using audio data. To this end, we propose a novel 3D convolutional neural network that uses Mel spectrograms as input features. Our results show that a detector built on the proposed network can detect specific situations such as emergencies and, in turn, can classify gestures using sound data.