This study proposes a novel approach to vision-based navigation using aerial images that are semantically segmented by a convolutional neural network. Vision-based navigation provides a position solution by matching an aerial image to a georeferenced database, and it has been increasingly studied for global navigation satellite system-denied environments. Aerial images contain a vast amount of information from which the position where they were taken can be inferred. However, they also include features that degrade the estimation accuracy. Progress in convolutional neural networks offers a promising way to extract only the features that are helpful for this purpose. The segmented images are therefore modeled as Gaussian mixture models, and the L2 distance is established as a quantitative measure of the discrepancy between two images. This allows two images to be compared quickly and with improved accuracy. In addition, a particle filter framework is applied to estimate the position using an inertial navigation system. The filter employs the L2 distance as a measurement, and the particles tend to converge to the true position. Flight test experiments were conducted and verified that the proposed approach achieved a distance error of less than 10 m.
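
To make the image comparison step concrete, the following is a minimal sketch of the squared L2 distance between two 2-D Gaussian mixture models. It uses the standard closed form for the integral of a product of two Gaussians, not necessarily the exact formulation adopted in this study. The function names (`gauss2d`, `cross_term`, `gmm_l2_distance`) and the representation of a mixture as a list of `(weight, mean, covariance)` tuples are assumptions made for illustration; how the mixtures are fitted to the segmented images is not shown.

```python
import math


def gauss2d(d, cov):
    """Density of a zero-mean 2-D Gaussian with covariance `cov` at offset `d`."""
    (a, b), (c, e) = cov
    det = a * e - b * c
    dx, dy = d
    # Quadratic form d^T cov^{-1} d, using the closed-form inverse of a 2x2 matrix.
    q = (e * dx * dx - (b + c) * dx * dy + a * dy * dy) / det
    return math.exp(-0.5 * q) / (2.0 * math.pi * math.sqrt(det))


def cross_term(gmm_p, gmm_q):
    """Integral of p(x) * q(x) over the plane for two Gaussian mixtures.

    Each mixture is a list of (weight, mean, covariance) tuples (assumed layout).
    The product of two Gaussians integrates to a Gaussian density evaluated at
    the difference of the means, with the two covariances added.
    """
    total = 0.0
    for wp, mp, cp in gmm_p:
        for wq, mq, cq in gmm_q:
            cov = [[cp[i][j] + cq[i][j] for j in range(2)] for i in range(2)]
            diff = (mp[0] - mq[0], mp[1] - mq[1])
            total += wp * wq * gauss2d(diff, cov)
    return total


def gmm_l2_distance(gmm_p, gmm_q):
    """Squared L2 distance: integral of (p - q)^2 = <p,p> - 2<p,q> + <q,q>."""
    return (cross_term(gmm_p, gmm_p)
            - 2.0 * cross_term(gmm_p, gmm_q)
            + cross_term(gmm_q, gmm_q))


# Illustrative mixtures only; these values do not come from the study.
identity = [[1.0, 0.0], [0.0, 1.0]]
image_a = [(1.0, (0.0, 0.0), identity)]
image_b = [(1.0, (2.0, 0.0), identity)]
print(gmm_l2_distance(image_a, image_a))
print(gmm_l2_distance(image_a, image_b))
```

Because every term is evaluated analytically, the cost grows with the product of the numbers of mixture components rather than with the number of pixels, which is consistent with the fast comparison described above. In a particle filter, a distance of this kind could be evaluated once per particle and mapped to a weight; that mapping is not specified here and would be a further assumption.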