It is common to store and manage many personal photos on a personal computer (PC) owing to the wide use of digital cameras. The more photos the PC holds, the harder it is to find a specific one among them. We suggest an attractive way to search photos: a spoken query. If a speech segment corresponding to the input query is included in any of the voice documents attached to the photos, the retrieval system returns a list of the relevant photos stored on the PC. For this speech-based content retrieval system, we propose two approaches that are based not on speech-to-text conversion but on speech-to-speech matching. The first uses phoneme recognition techniques for the matching, and the second uses traditional techniques such as vector quantization and dynamic time warping.
For the phoneme recognition approach, we consider two methods. One uses phoneme-occurrence information alone, and the other additionally uses phoneme-sequence information. Both methods use a phoneme recognizer as the baseline process that produces a phoneme sequence from the speech input. The pattern of the phoneme sequence in the query is compared with those in the recorded files, and a similarity is calculated that represents how closely the query matches each recorded file.
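The two phoneme-based measures can be illustrated with a minimal Python sketch. The text above does not specify the similarity functions, so everything here is an assumption made for illustration: cosine similarity over phoneme counts stands in for the occurrence measure, cosine similarity over phoneme bigrams stands in for the sequential measure, and the names `occurrence_similarity`, `sequential_similarity`, `combined_similarity`, and the `weight` parameter are hypothetical. The phoneme sequences are taken to be the output of a phoneme recognizer.

```python
from collections import Counter
from math import sqrt


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def occurrence_similarity(query, doc) -> float:
    """Assumed occurrence measure: compare how often each phoneme appears."""
    return cosine(Counter(query), Counter(doc))


def sequential_similarity(query, doc, n: int = 2) -> float:
    """Assumed sequential measure: compare counts of phoneme n-grams."""
    def ngrams(seq):
        return Counter(tuple(seq[i:i + n]) for i in range(len(seq) - n + 1))
    return cosine(ngrams(query), ngrams(doc))


def combined_similarity(query, doc, weight: float = 0.5) -> float:
    """Occurrence score with sequential information added (hypothetical weighting)."""
    return ((1 - weight) * occurrence_similarity(query, doc)
            + weight * sequential_similarity(query, doc))


def rank(query, docs: dict) -> list:
    """Return recorded-file names ordered from most to least similar to the query."""
    return sorted(docs, key=lambda name: combined_similarity(query, docs[name]), reverse=True)
```

Under these assumptions, the sequences `k a t` and `t a k` are indistinguishable by occurrence counts alone but share no bigram, which shows why order information can help separate files that contain the same phonemes in a different arrangement.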
In the method using vector quantization (VQ) and dynamic time warping (DTW), the speech feature vectors are clustered by vector quantization, and dynamic time warping is used to calculate the similarities between the clustered patterns of the query and those of the recorded files. Because dynamic time warping is computationally expensive, an alternative scheme is used to reduce the computation. First, the frame sequence is split into two sequences: one consists of the even-numbered frames of the original frame sequence and the other consists of the odd-numbered frames. Each sequence is compared with the odd or even...
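The building blocks of this approach can be sketched in Python as follows. This is an illustration under stated assumptions, not the authors' implementation: the codebook is assumed to be already trained, Euclidean distance between codewords is assumed as the local DTW cost, frames are assumed to be numbered from 0, and the function names `quantize`, `dtw_distance`, and `split_even_odd` are hypothetical.

```python
from math import dist, inf


def quantize(frames, codebook):
    """VQ step: map each feature vector to the index of its nearest codeword."""
    return [min(range(len(codebook)), key=lambda k: dist(f, codebook[k]))
            for f in frames]


def dtw_distance(a, b, codebook):
    """Classic DTW cost between two codeword-index sequences.

    The local cost is the Euclidean distance between codewords (an assumption).
    """
    n, m = len(a), len(b)
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = dist(codebook[a[i - 1]], codebook[b[j - 1]])
            # Extend the cheapest of the three allowed predecessor cells.
            cost[i][j] = d + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    return cost[n][m]


def split_even_odd(seq):
    """Split a frame sequence into its even-numbered and odd-numbered frames."""
    return seq[0::2], seq[1::2]
```

The DTW table has one cell per pair of frames, so comparing two half-length sequences fills roughly a quarter of the cells needed for the full sequences. How the even and odd subsequences are paired for comparison is not recoverable from the text above, so the sketch leaves that choice to the caller.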