DC Field | Value | Language |
---|---|---|
dc.contributor.advisor | 노용만 | - |
dc.contributor.author | Kim, Minsu | - |
dc.contributor.author | 김민수 | - |
dc.date.accessioned | 2024-08-08T19:31:42Z | - |
dc.date.available | 2024-08-08T19:31:42Z | - |
dc.date.issued | 2024 | - |
dc.identifier.uri | http://library.kaist.ac.kr/search/detail/view.do?bibCtrlNo=1100091&flag=dissertation | en_US |
dc.identifier.uri | http://hdl.handle.net/10203/322185 | - |
dc.description | 학위논문(박사) - 한국과학기술원 : 전기및전자공학부, 2024.2, [vi, 63 p.] | - |
dc.description.abstract | When humans communicate with each other, they naturally utilize multimodal information such as visual, audio, and text information. This multimodal information allows humans to better understand the intent and content of an ongoing conversation, because the human brain excels at modeling the relationships among different modalities. We explore how a machine can be developed to understand these relationships. However, since the modalities take different data forms, building a module for each modality is not straightforward. For example, audio is a continuous-time signal, video and images are 2-dimensional signals that may include optional temporal information, and text is a discrete signal devoid of temporal characteristics. To extract common representations from the audio speech, visual speech, and text modalities, we explore a discretized speech representation, namely the speech unit. A speech unit is obtained by clustering (i.e., discretizing) speech features extracted from a pre-trained self-supervised speech model. Because it is discretized, continuous audio and visual signals can be expressed with discrete representations; moreover, it retains the phonetic information of speech. By employing these two characteristics of speech units, phonetic and discrete, we show that we can improve three multimodal translation systems: visual speech-to-text translation, speech-to-speech translation, and text-to-speech translation. First, in visual speech-to-text translation, we show that speech units allow learning general visual speech knowledge without depending on a specific language, improving Visual Speech Recognition (VSR) performance for languages with few VSR resources. Second, in speech-to-speech translation and text-to-speech translation, the discrete characteristic of speech units lets us train a machine translation system in the same way text-based systems are trained. That is, we treat speech units as pseudo text and show that speech-to-speech translation across multiple languages is possible. The effectiveness of the proposed methods is evaluated with extensive experiments, including comparisons with state-of-the-art methods, ablation studies, and qualitative analysis. | - |
dc.language | eng | - |
dc.publisher | 한국과학기술원 | - |
dc.subject | 멀티모달 음성 처리; 멀티모달 처리; 이산화된 자기 감독 표현; 음성 유닛; 음성 토큰; 시각적 음성 인식; 음성 대 음성 번역; 문자 대 음성 번역 | - |
dc.subject | Multimodal speech processing; multimodal processing; discretized self-supervised representation; speech unit; visual speech recognition; speech-to-speech translation; text-to-speech translation | - |
dc.title | Multimodal Language Processing by Employing Phonetic and Discrete Characteristics of Speech Unit | - |
dc.title.alternative | 음성유닛의 발음적 및 이산적 특성을 통한 멀티모달 언어 처리 및 학습 | - |
dc.type | Thesis(Ph.D) | - |
dc.identifier.CNRN | 325007 | - |
dc.description.department | 한국과학기술원 : 전기및전자공학부 | - |
dc.contributor.alternativeauthor | Ro, Yong Man | - |
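The abstract describes obtaining speech units by clustering (discretizing) frame-level features extracted from a pre-trained self-supervised speech model. The discretization step can be sketched as plain k-means over feature vectors; this is a minimal illustration, assuming toy 2-D vectors in place of real self-supervised features (the feature values, cluster count, and initialization here are assumptions, not the thesis's actual configuration):

```python
def kmeans_units(features, k, iters=20):
    """Cluster frame-level feature vectors into k discrete 'speech units'.

    Returns (centroids, unit_ids), where unit_ids[i] is the cluster index
    assigned to features[i]. A real system would cluster features from a
    pre-trained self-supervised speech model; this toy version initializes
    centroids from the first k frames, which is fine for illustration.
    """
    centroids = [list(features[i]) for i in range(k)]
    unit_ids = [0] * len(features)
    for _ in range(iters):
        # Assign every frame to its nearest centroid (the discretization step).
        unit_ids = [min(range(k), key=lambda j: _dist2(f, centroids[j]))
                    for f in features]
        # Move each centroid to the mean of its assigned frames.
        for c in range(k):
            members = [f for f, u in zip(features, unit_ids) if u == c]
            if members:
                dim = len(members[0])
                centroids[c] = [sum(m[d] for m in members) / len(members)
                                for d in range(dim)]
    return centroids, unit_ids

def _dist2(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

# Toy "features": two well-separated groups of 2-D frames standing in for
# continuous speech representations. After clustering, each frame becomes
# a discrete unit ID, so a continuous signal becomes a token sequence.
frames = [(0.0, 0.1), (0.1, 0.0), (0.05, 0.05),
          (5.0, 5.1), (5.1, 4.9), (4.95, 5.05)]
centroids, units = kmeans_units(frames, k=2)
assert units[0] == units[1] == units[2]   # first group -> one unit ID
assert units[3] == units[4] == units[5]   # second group -> the other unit ID
assert units[0] != units[3]
```

The resulting unit IDs can then be treated as pseudo text, which is what lets the discrete speech representation plug into text-style translation training as the abstract describes.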