Top-down selective attention with a deep neural network and confidence measure for automatic speech recognition = 심층 신경망에의 하향식 주의 집중과 신뢰도 측정을 이용한 자동 음성 인식 연구

In cognitive science, the top-down selective attention (TDSA) mechanism of humans has been studied for decades and is known to be controlled by "objects" in the mind via feedback processes. This cognitive process enhances the perceptual saliency of the response to the object of interest and filters out irrelevant responses. Engineering models using TDSA have been proposed for out-of-vocabulary rejection and isolated word recognition. In this work, we apply the TDSA mechanism to the N-best rescoring framework to provide attentional information about confusing words within competing hypotheses. The TDSA mechanism adapts a test input feature for each of several confusing words. The attentional information required to rescore the hypotheses is then derived from the probability of the adapted feature and the amount of feature deformation.

Recently, numerous neural network models with attention have been developed and successfully applied to diverse tasks. The sequence-to-sequence learning framework with attention has become especially popular for sequence labeling tasks such as neural machine translation, image caption generation, and speech recognition. Whereas previous attention works predict a soft window over the input sequence corresponding to each output target, our attention approach adapts a test input feature "directly," using a gradient to maximize the probability of the feature given the target words. Therefore, our system provides the most probable feature for the target words without the need to train extra attention networks.

We propose N-best rescoring and utterance verification systems that integrate, into a conventional speech recognition system, attentional information for locally confusing words extracted from alternative hypotheses. The attentional information is derived by adapting a test input feature for the word of interest, motivated by the top-down selective attention mechanism of the brain.
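The gradient-based feature adaptation described above can be sketched as follows. A toy linear-softmax model stands in for the acoustic model, and the names `adapt_feature`, `W`, `b`, and the step sizes are illustrative assumptions, not the thesis implementation:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def adapt_feature(x, W, b, target, lr=0.01, steps=200):
    """Adapt the input feature x by gradient ascent on log p(target | x).

    Toy linear-softmax model standing in for a DNN acoustic model
    (illustrative assumption). Returns the adapted feature and the
    amount of feature deformation used by the rescoring step.
    """
    x0, x = x, x.copy()
    for _ in range(steps):
        p = softmax(W @ x + b)
        # d/dx log softmax_target(W x + b) = W[target] - p @ W
        x += lr * (W[target] - p @ W)
    deformation = np.linalg.norm(x - x0)
    return x, deformation
```

After adaptation, the probability of the target under the model has increased, and `deformation` records how far the feature had to move; both quantities feed the confidence measure used for rescoring.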
To rescore the competing hypotheses, we define a new confidence measure that combines the conventional posterior probability with the attentional information for the confusing words. In addition, a neural network is designed to provide per-utterance weights within the confidence measure; the network is then optimized to minimize the word error rate. Tests on the WSJ and Aurora4 speech recognition tasks were conducted, and our best rescoring results achieve word error rates of 3.83% and 11.09%, relative reductions of 5.20% and 2.55% over the respective baselines.
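A minimal sketch of such a rescoring step, assuming a log-linear combination of the posterior, the adapted-feature probability, and the feature deformation. The field names, the combination form, and the fixed weight vector are illustrative; in the thesis the weights are produced per utterance by a trained network:

```python
import numpy as np

def rescore_nbest(hypotheses, weights):
    """Pick the best hypothesis under a weighted confidence measure.

    Each hypothesis is a dict with illustrative fields:
      log_posterior - conventional recognizer score
      log_p_adapted - log-probability of the TDSA-adapted feature
      deformation   - amount of feature deformation (penalized)
    """
    best_score, best_words = -np.inf, None
    for hyp in hypotheses:
        feats = np.array([hyp["log_posterior"],
                          hyp["log_p_adapted"],
                          -hyp["deformation"]])
        score = float(weights @ feats)
        if score > best_score:
            best_score, best_words = score, hyp["words"]
    return best_words, best_score
```

A hypothesis whose adapted feature is probable and required little deformation is rewarded even if its conventional posterior is slightly lower, which is how the attentional information can overturn the first-pass ranking.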
Lee, Soo-Young (이수영)
Korea Advanced Institute of Science and Technology: School of Electrical Engineering
Issue Date
2018.2

Thesis (Ph.D.) - Korea Advanced Institute of Science and Technology: School of Electrical Engineering, 2018.2, [vii, 103 p.]


top-down selective attention; confidence measure; N-best rescoring; parameter optimization; utterance verification; automatic speech recognition




