When humans communicate with each other, they naturally utilize multimodal information such as visual, audio, and textual cues. This multimodal information allows humans to better understand the intent and content of an ongoing conversation, because the human brain is adept at modeling the relationships among different modalities. We explore how machines can be developed to understand these cross-modal relationships. However, since the modalities take different data forms, it is difficult to build a separate module for each one: audio is a continuous-time signal, video and images are 2-dimensional signals that may additionally carry temporal information, and text is a discrete signal without temporal characteristics.

To extract common representations from the audio speech, visual speech, and text modalities, we explore a discretized speech representation, namely the speech unit. Speech units are obtained by clustering (i.e., discretizing) features extracted from a pre-trained self-supervised speech model. Because they are discrete, continuous audio and visual signals can now be expressed with discrete representations. Moreover, speech units retain the phonetic information of speech. By exploiting these two characteristics, being phonetic and discrete, we show that speech units can improve three multimodal translation systems: visual speech-to-text translation, speech-to-speech translation, and text-to-speech translation.

First, in visual speech-to-text translation, we show that speech units allow us to learn general visual speech knowledge that does not depend on a specific language, improving Visual Speech Recognition (VSR) performance for languages with scarce VSR resources. Second, in speech-to-speech translation and text-to-speech translation, the discrete nature of speech units allows us to train translation systems in the same way as text-based systems.
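The unit-extraction idea described above can be illustrated with a minimal sketch. This is not the thesis implementation: the features here are random stand-ins for frame-level outputs of a self-supervised speech model (e.g., the thesis uses features from a pre-trained model; the dimensions, cluster count, and helper names below are all illustrative assumptions), and a small k-means written in NumPy plays the role of the clustering step.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for frame-level features from a pre-trained self-supervised
# speech model: (num_frames, feature_dim). Real features would come from
# such a model; these are random for illustration only.
features = rng.normal(size=(500, 16))

def kmeans(x, k, iters=20, seed=0):
    """Minimal k-means: returns cluster centroids and per-frame cluster IDs."""
    r = np.random.default_rng(seed)
    centroids = x[r.choice(len(x), size=k, replace=False)]
    for _ in range(iters):
        # Assign each frame to its nearest centroid; the index is its
        # discrete "speech unit" ID.
        dists = np.linalg.norm(x[:, None, :] - centroids[None, :, :], axis=-1)
        units = dists.argmin(axis=1)
        # Move each centroid to the mean of its assigned frames.
        for j in range(k):
            if np.any(units == j):
                centroids[j] = x[units == j].mean(axis=0)
    return centroids, units

# Discretize: the continuous feature sequence becomes a sequence of unit IDs.
centroids, units = kmeans(features, k=8)
print(units[:10])  # a discrete sequence of integers in [0, 8)
```

After this step, an utterance of continuous audio (or visual) features is represented as a sequence of integers, which is what makes text-style modeling applicable.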
That is, we treat speech units as pseudo text and show that speech-to-speech translation across multiple languages is possible. The effectiveness of the proposed methods is evaluated through extensive experiments, including comparisons with state-of-the-art methods, ablation studies, and qualitative analyses.
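The pseudo-text view can be sketched as follows. This is an illustrative assumption, not the thesis pipeline: the unit sequence is made up, consecutive duplicate units are collapsed (a common preprocessing step for unit sequences), and each unit is rendered as a token string that a standard text translation system could consume.

```python
from itertools import groupby

# Hypothetical speech-unit sequence for one utterance.
units = [7, 7, 7, 12, 12, 3, 3, 3, 3, 12, 5]

# Collapse runs of identical units to shorten the sequence.
deduped = [u for u, _ in groupby(units)]
print(deduped)  # [7, 12, 3, 12, 5]

# Render each unit as a token, yielding a "sentence" of pseudo text that
# can be fed to a text-style machine translation model.
pseudo_text = " ".join(f"<unit_{u}>" for u in deduped)
print(pseudo_text)  # <unit_7> <unit_12> <unit_3> <unit_12> <unit_5>
```

Because the units now look like a token sequence, sequence-to-sequence machinery built for text translation applies directly, which is what enables speech-to-speech translation to be trained "as the text system has done."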