Visual information is rich in content, and robots rely on computer vision techniques to encode images into representations they can use. Robot vision transforms an image into descriptors using predefined patterns, whether handcrafted or learned. However, such image descriptors are not interpretable by humans, which limits human-robot interaction in vision tasks. On the other hand, recent studies have introduced an efficient and extensible way of transforming an image into natural language. With vision transformers, the context of an image can be translated into a natural-language representation. To create an image representation understandable to both humans and artificial intelligence, in this paper we present a method that uses a language-image model as a natural representation for robotic place recognition tasks.