Where is your player: Deep pixel-wise visual localization on baseball game data via text-phrase
This paper presents a network, referred to as p-LocalNet, for pixel-accurate localization of the object referred to by a given input text-phrase. Given an image and a text-phrase describing an object of interest, the network localizes the region of the referred object with pixel accuracy. To achieve this, p-LocalNet associates visual representations with linguistic representations according to spatial area. The input text-phrase is fed into a long short-term memory (LSTM) network to generate local and global weights, which are associated with the spatially local and global visual representations of the input image, respectively. These visual representations are extracted from multi-level feature maps of a convolutional neural network (CNN). To associate each visual representation with its corresponding weight, two-stream feature-wise linear modulation (FiLM) modules are employed. To evaluate p-LocalNet, a small subset of the MSCOCO dataset related only to baseball was collected and manually labeled; we refer to this dataset as the Baseball Game Dataset (BG-Dataset). The images were manually selected, and each image is described in detail and annotated with a binary map highlighting the object. The experimental results demonstrate that BG-Dataset is well organized for localizing objects based on text-phrases, and that p-LocalNet localizes the referred object with high pixel accuracy.
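The FiLM conditioning described above can be illustrated with a minimal NumPy sketch. This is not the p-LocalNet implementation: the function name, tensor shapes, and the constant gamma/beta values are illustrative stand-ins for the per-channel weights that the paper's LSTM would produce from the text-phrase.

```python
import numpy as np

def film_modulate(feature_map, gamma, beta):
    """Feature-wise Linear Modulation (FiLM): scale and shift each channel
    of a CNN feature map using weights conditioned on the text-phrase.

    feature_map: (C, H, W) visual features from one CNN level.
    gamma, beta: (C,) per-channel scale and shift (in p-LocalNet these
    would come from the LSTM over the input phrase; here they are fixed).
    """
    # Broadcast the per-channel weights over the spatial dimensions.
    return gamma[:, None, None] * feature_map + beta[:, None, None]

# Toy example: a 4-channel 8x8 feature map with stand-in weights.
feat = np.random.randn(4, 8, 8)
gamma = np.full(4, 2.0)  # hypothetical learned scale
beta = np.zeros(4)       # hypothetical learned shift
out = film_modulate(feat, gamma, beta)
assert out.shape == feat.shape
```

In a two-stream setup, one such modulation would be applied to the spatially local features with the local weights and another to the global features with the global weights, before fusing the two streams.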