Understanding the relationships between objects in an image is an important problem in computer vision. Recently, methods that model such relationships have been proposed for many vision tasks, but few studies address them in the semantic-visual embedding problem. In this paper, we first propose a new dataset called R-CLEVR to concentrate on the relations between objects in semantic-visual problems, and we introduce an Object Phase Module (OPM) that focuses on the relative locations of objects in an image. Experiments demonstrate that our proposed network with the Object Phase Module achieves the highest performance on cross-modal retrieval and phrase grounding tasks on the R-CLEVR dataset. Furthermore, our model achieves meaningful performance on the MS-COCO dataset, which contains a relatively small number of object relations.