SPIRE: Signal Processing Interpretation and Representation Lab

Natural Non-Native English Speech Synthesis

About

A typical text-to-speech (TTS) synthesis system is trained on an individual's voice data so that the synthesized voice mimics that of the training subject. Although state-of-the-art TTS systems produce speech of nearly natural quality, in a number of applications a user whose native language is L1 may have difficulty understanding synthesized speech in language L2 because the training subject's accent is unfamiliar. For example, TTS for eLearning has been shown to help students learn, and several commercial TTS products are available for eLearning. Depending on the choice of training subject and the quality of the TTS system, difficulty in understanding the synthesized voice can become critical in such applications. The same problem arises in many voice-based applications where the listener and the speaker have different native languages.

This project aims to serve such users by synthesizing English speech in an accent suited to the listener's native language, without altering the content or the speaker's characteristics. This is referred to as non-native English speech synthesis. Such synthesis would also be useful for Computer-Assisted Language Learning (CALL) applications, where learners could listen to reference speech in their own voices. It could also be used to generate speech-to-speech translation output in the input speaker's voice.

Objectives:

Analyze the difficulty that listeners of Indian nativities face in understanding native English speech. This will be carried out with subjects whose native languages span multiple Indian languages.
Analyze and model Indian nativity-specific articulation in contrast to that of native English speech. This will help identify the parameters that need to be modified in non-native English speech synthesis.
Develop a speech synthesis framework in which nativity-specific parameters are explicitly modeled and adapted for non-native English speech synthesis.
Investigate the minimum amount of non-native voice data needed for adaptation to generate non-native synthesized speech.
Evaluate the quality of the non-native synthesized English speech for multiple Indian languages.
