Image-text retrieval (ITR) applications based on cross-modal retrieval reduce latency by extracting the feature embeddings of images and text offline. However, our analysis of ITR workloads on a GPU shows that the similarity search remains the application's bottleneck, so serving ITR online is still not feasible. In this paper, we propose a novel software-hardware design that accelerates the similarity search, and we implement it on a Xilinx Alveo U280 card. We reduce the dataset size by 92.4% by quantizing the embedding dataset from 32-bit floating point to 8-bit fixed point and by reconstructing the sparse text embedding matrices as dense ones. Our search algorithm over the reconstructed dataset is implemented as a 4-stage pipeline and uses a custom dataflow that minimizes off-chip data transfer. On the MS-COCO 5K dataset, our design is up to 214.5x and 8.3x faster and up to 264.2x and 41.7x more energy-efficient than the baseline and the optimized GPU design, respectively.
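The two dataset reductions named in the abstract can be illustrated with a minimal sketch. It is not the paper's implementation. It assumes that the sparsity of a text embedding matrix comes from all-zero rows (for example, padding), that the fixed-point format uses 6 fractional bits, and that no index metadata is stored; the function names `quantize_q8`, `compact_rows`, `compress`, and `size_reduction` are hypothetical. Going from 32-bit to 8-bit values alone shrinks the data by 75%, so under these assumptions the rest of the reported 92.4% would have to come from the dense reconstruction.

```python
def quantize_q8(value, frac_bits=6):
    """Convert a float to signed 8-bit fixed point, saturating at the int8 range.

    frac_bits is an assumed format parameter, not taken from the paper.
    """
    scaled = round(value * (1 << frac_bits))
    return max(-128, min(127, scaled))


def compact_rows(matrix):
    """Keep only rows with at least one non-zero entry (assumed sparsity pattern)."""
    return [row for row in matrix if any(x != 0.0 for x in row)]


def compress(matrix, frac_bits=6):
    """Drop all-zero rows, then quantize the remaining entries to 8 bits."""
    return [[quantize_q8(x, frac_bits) for x in row] for row in compact_rows(matrix)]


def size_reduction(matrix, compressed):
    """Fraction of bytes saved: 4 bytes per float32 entry versus 1 byte per int8 entry."""
    original_bytes = 4 * sum(len(row) for row in matrix)
    compressed_bytes = sum(len(row) for row in compressed)
    return 1.0 - compressed_bytes / original_bytes


if __name__ == "__main__":
    # Toy text embedding matrix: two real rows and two all-zero (padding) rows.
    text_matrix = [[0.5, -0.25], [0.0, 0.0], [1.0, 0.125], [0.0, 0.0]]
    packed = compress(text_matrix)
    print(packed)
    print(size_reduction(text_matrix, packed))
```

In this toy case, half of the rows are dropped and each remaining value takes one byte instead of four, so the two effects multiply. A real design would also need to record how many rows each matrix retains so that the search stage can address the packed data.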