This paper describes a novel approach to Visual Question Answering (VQA).
Our proposed network addresses the open-ended task by recommending candidate answers, which are generated solely from the given question.
We then combine the score from our question-aware prediction module with the score from the candidate answer recommendation module to determine the final composite score.
Our approach uses the bag-of-words (BOW) framework to understand questions instead of a complex neural-network-based module; therefore, no additional dataset is required to pre-train a language model.
Moreover, we show that the BOW framework is capable of extracting the keywords from the question.
Although our proposed approach does not achieve state-of-the-art performance overall, it performs best on certain types of questions with a small amount of training data.
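
As a rough illustration of the two ideas summarized above, the following Python sketch encodes a question as a bag-of-words count vector and blends two per-answer scores into a composite score. The function names, the toy vocabulary, the example scores, and the weighted-sum combination rule are all assumptions made for illustration; the actual modules and combination rule used in this work may differ.

```python
import re
from collections import Counter


def bag_of_words(question, vocabulary):
    """Encode a question as word counts over a fixed vocabulary (hypothetical helper)."""
    tokens = re.findall(r"[a-z0-9]+", question.lower())
    counts = Counter(tokens)
    # Words outside the vocabulary are ignored; a missing word counts as 0.
    return [counts[word] for word in vocabulary]


def combine_scores(prediction_scores, recommendation_scores, weight=0.5):
    """Blend two per-answer score dicts with a weighted sum.

    The weighted sum is an assumed combination rule, not necessarily
    the one used by the proposed network.
    """
    answers = set(prediction_scores) | set(recommendation_scores)
    return {
        answer: weight * prediction_scores.get(answer, 0.0)
        + (1.0 - weight) * recommendation_scores.get(answer, 0.0)
        for answer in answers
    }


# Toy vocabulary and question (illustrative values only).
vocabulary = ["what", "color", "is", "the", "how", "many"]
bow = bag_of_words("What color is the umbrella?", vocabulary)

# Made-up scores standing in for the outputs of the two modules.
prediction = {"red": 0.6, "two": 0.3, "yes": 0.1}
recommendation = {"red": 0.5, "blue": 0.5}

composite = combine_scores(prediction, recommendation)
best = max(composite, key=composite.get)
print(bow, best)
```

In this toy setup the recommendation scores depend only on the question, so they act as a prior over plausible answers that the question-aware prediction score then refines.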