When humans communicate with each other, they naturally utilize multimodal information such as visual, audio, and textual cues. This multimodal information allows humans to better understand the intent and content of an ongoing conversation, because the human brain is adept at modeling the relationships among different modalities. We explore how machines can be developed to understand these cross-modal relationships. However, since the modalities take different data forms, it is difficult to build a separate module for each one: audio is a continuous-time signal, video and images are 2-dimensional signals that may additionally carry temporal information, and text is a discrete signal without temporal characteristics.

To extract common representations from the audio speech, visual speech, and text modalities, we explore a discretized speech representation, namely the speech unit. Speech units are obtained by clustering (i.e., discretizing) features extracted from a pre-trained self-supervised speech model. Because they are discrete, continuous audio and visual signals can now be expressed with discrete representations. Moreover, speech units retain the phonetic information of speech. By exploiting these two characteristics, being phonetic and discrete, we show that speech units can improve three multimodal translation systems: visual speech-to-text translation, speech-to-speech translation, and text-to-speech translation.

First, in visual speech-to-text translation, we show that speech units allow us to learn general visual speech knowledge that does not depend on a specific language, improving Visual Speech Recognition (VSR) performance for languages with scarce VSR resources. Second, in speech-to-speech translation and text-to-speech translation, the discrete nature of speech units allows us to train translation systems in the same way as text-based systems.
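The unit-extraction idea described above can be illustrated with a minimal sketch. This is not the thesis implementation: the features here are random stand-ins for frame-level outputs of a self-supervised speech model (e.g., the thesis uses features from a pre-trained model; the dimensions, cluster count, and helper names below are all illustrative assumptions), and a small k-means written in NumPy plays the role of the clustering step.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for frame-level features from a pre-trained self-supervised
# speech model: (num_frames, feature_dim). Real features would come from
# such a model; these are random for illustration only.
features = rng.normal(size=(500, 16))

def kmeans(x, k, iters=20, seed=0):
    """Minimal k-means: returns cluster centroids and per-frame cluster IDs."""
    r = np.random.default_rng(seed)
    centroids = x[r.choice(len(x), size=k, replace=False)]
    for _ in range(iters):
        # Assign each frame to its nearest centroid; the index is its
        # discrete "speech unit" ID.
        dists = np.linalg.norm(x[:, None, :] - centroids[None, :, :], axis=-1)
        units = dists.argmin(axis=1)
        # Move each centroid to the mean of its assigned frames.
        for j in range(k):
            if np.any(units == j):
                centroids[j] = x[units == j].mean(axis=0)
    return centroids, units

# Discretize: the continuous feature sequence becomes a sequence of unit IDs.
centroids, units = kmeans(features, k=8)
print(units[:10])  # a discrete sequence of integers in [0, 8)
```

After this step, an utterance of continuous audio (or visual) features is represented as a sequence of integers, which is what makes text-style modeling applicable.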
That is, we treat speech units as pseudo text and show that speech-to-speech translation across multiple languages is possible. The effectiveness of the proposed methods is evaluated through extensive experiments, including comparisons with state-of-the-art methods, ablation studies, and qualitative analyses.
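The pseudo-text view can be sketched as follows. This is an illustrative assumption, not the thesis pipeline: the unit sequence is made up, consecutive duplicate units are collapsed (a common preprocessing step for unit sequences), and each unit is rendered as a token string that a standard text translation system could consume.

```python
from itertools import groupby

# Hypothetical speech-unit sequence for one utterance.
units = [7, 7, 7, 12, 12, 3, 3, 3, 3, 12, 5]

# Collapse runs of identical units to shorten the sequence.
deduped = [u for u, _ in groupby(units)]
print(deduped)  # [7, 12, 3, 12, 5]

# Render each unit as a token, yielding a "sentence" of pseudo text that
# can be fed to a text-style machine translation model.
pseudo_text = " ".join(f"<unit_{u}>" for u in deduped)
print(pseudo_text)  # <unit_7> <unit_12> <unit_3> <unit_12> <unit_5>
```

Because the units now look like a token sequence, sequence-to-sequence machinery built for text translation applies directly, which is what enables speech-to-speech translation to be trained "as the text system has done."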