In addition to visual features, video contains temporal information that contributes to the semantic meaning of the relationships between objects and scenes. There have been many attempts to describe spatial and temporal relationships in video, but simple encoder-decoder models are not sufficient to capture long-range relationships in video clips because of the local operations inherent in recurrent models. In other fields, including visual question answering (VQA) and action recognition, researchers have begun to take an interest in describing visual relations between objects. In this paper, we introduce a video captioning method that captures temporal long-range dependencies with a non-local block. The proposed model uses both local and non-local features. We evaluate our approach on the Microsoft Video Description corpus (MSVD, YouTube2Text) and the Microsoft Research Video-to-Text (MSR-VTT) dataset. The experimental results show that a non-local block applied along the temporal axis can compensate for the long-range dependency limitations of the LSTM on video captioning datasets.
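For concreteness, the sketch below illustrates one way a non-local block can be applied along the temporal axis of per-frame features, following the embedded-Gaussian formulation of Wang et al. (2018). The class name, dimensions, and placement before the LSTM decoder are illustrative assumptions, not the paper's exact implementation.

```python
import torch
import torch.nn as nn

class TemporalNonLocalBlock(nn.Module):
    """Self-attention over the time axis of per-frame video features."""

    def __init__(self, dim, inner_dim=None):
        super().__init__()
        inner_dim = inner_dim or dim // 2
        self.theta = nn.Linear(dim, inner_dim)  # query projection
        self.phi = nn.Linear(dim, inner_dim)    # key projection
        self.g = nn.Linear(dim, inner_dim)      # value projection
        self.out = nn.Linear(inner_dim, dim)    # restore channel dimension

    def forward(self, x):
        # x: (batch, time, dim), one feature vector per sampled frame
        q, k, v = self.theta(x), self.phi(x), self.g(x)
        # Embedded-Gaussian affinity: softmax over pairwise frame similarities,
        # so every frame can attend to every other frame regardless of distance.
        attn = torch.softmax(q @ k.transpose(1, 2), dim=-1)  # (batch, time, time)
        # Residual connection preserves the original local features.
        return x + self.out(attn @ v)

# Example: refine 26 per-frame CNN features before an LSTM decoder.
feats = torch.randn(4, 26, 512)            # (batch, frames, channels)
refined = TemporalNonLocalBlock(512)(feats)
print(refined.shape)                       # torch.Size([4, 26, 512])
```

Because the attention matrix relates all frame pairs directly, the block's receptive field spans the whole clip in a single operation, which is exactly the long-range coverage that step-by-step recurrent updates struggle to provide.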