G2PU: Grapheme-To-Phoneme Transducer with Speech Units

Most phoneme transcripts are generated using forced alignment: typically, a grapheme-to-phoneme transducer (G2P) is applied to text sequences to generate candidate phoneme transcripts, which are then time-aligned to the waveform using an acoustic model. This paper demonstrates, for the first time, simultaneous optimization of the G2P, the acoustic model, and the acoustic alignment to a corpus. To this end, we propose G2PU, a joint CTC-attention model consisting of an encoder-decoder G2P network and an encoder-CTC unit-to-phoneme (U2P) network, where the units are extracted from speech. We demonstrate that the G2P and U2P, operating in parallel, produce lower phone error rates than state-of-the-art open-source G2P and forced alignment systems. Furthermore, although the G2P and U2P are trained using parallel speech and text, their synergy can be generalized to text-only test corpora if we also train a grapheme-to-unit (G2U) network that generates speech units from text in the absence of parallel speech. Our G2PU model is trained using phoneme transcripts generated by a teacher G2P tool. Our experiments on Chinese and Japanese show that G2PU reduces the phoneme error rate by a relative 7% to 29% compared to its teacher. Finally, we include case studies to provide insight into the system's workings.
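To make the "relative" reduction reported above concrete, the following is a minimal sketch of how a phoneme error rate (PER) and a relative reduction between a teacher and a student system are commonly computed. It assumes PER is the total Levenshtein edit distance divided by the total number of reference phonemes, which is the usual convention but is not confirmed by this record. The function names and the toy phoneme sequences are hypothetical and are not taken from the paper.

```python
from typing import List, Sequence


def edit_distance(ref: Sequence[str], hyp: Sequence[str]) -> int:
    """Levenshtein distance between a reference and a hypothesis sequence."""
    prev = list(range(len(hyp) + 1))  # distances from the empty reference prefix
    for i, r in enumerate(ref, start=1):
        curr = [i]  # aligning i reference symbols to an empty hypothesis
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def phoneme_error_rate(refs: List[Sequence[str]],
                       hyps: List[Sequence[str]]) -> float:
    """Assumed definition: total edit distance / total reference phonemes."""
    errors = sum(edit_distance(r, h) for r, h in zip(refs, hyps))
    total = sum(len(r) for r in refs)
    return errors / total


def relative_reduction(teacher_per: float, student_per: float) -> float:
    """Fraction of the teacher's error rate removed by the student."""
    return (teacher_per - student_per) / teacher_per


# Hypothetical toy data, not from the paper.
refs = [["a", "b", "c", "d"]]
hyps = [["a", "x", "c"]]  # one substitution and one deletion
toy_per = phoneme_error_rate(refs, hyps)
```

Under this definition, a teacher PER of 10% and a student PER of 8% would give a relative reduction of about 20%, even though the absolute difference is only 2 percentage points.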
Publisher
2024 IEEE International Conference on Acoustics, Speech and Signal Processing
Issue Date
2024-04
Language
English
Citation

2024 IEEE International Conference on Acoustics, Speech and Signal Processing

URI
http://hdl.handle.net/10203/323310
Appears in Collection
EE-Conference Papers