Achuth Rao's PhD
Probabilistic source-filter model of speech
The human respiratory system plays a crucial role in breathing and swallowing. However, it also plays an essential role in speech production, which is unique to humans. Speech production involves expelling air from the lungs. As the air flows from the lungs to the lips, some kinetic energy gets converted to sound. Different structures modulate the generated sound, which is finally radiated out of the lips. The speech consists of various information such as linguistic content, speaker identity, emotional
state, accent, etc. Apart from speech, there are various scenarios where the sound is generated in the human respiratory system. These could be due to abnormalities in the muscles, motor control unit, or the lungs, which can directly affect generated speech as well. A variety of sounds are also generated by these structures while breathing including snoring, Stridor, Dysphagia, and Cough. The source filter (SF) model of speech is one of the earlier models of speech production. It assumes that speech is a result of filtering an excitation or source signal by a linear filter. The source and filter are assumed to be independent. Even though the SF model represents the speech production mechanism, there needs to be a tractable way of estimating the excitation and the filter. The estimation of both of them given speech falls under the general category of signal deconvolution problem, and, hence, there is no unique solution. There are several variations of the source-filter model in the literature by assuming different structures on the source/filter. There are various ways to estimate the parameters of the source and the filter. The estimated parameters are used in various speech applications such as automatic speech recognition, text to speech, speech enhancement etc. Even though the SF model is a model of speech production, it is used in applications including Parkinson's Disease classification, asthma classification.
The existing source filter models show much success in various applications, however, we believe that the models mainly lack two respects. The first limitation is that these models lack the connection to the physics of sound generation or propagation. The second limitation of the current models is that they are not fully probabilistic. The inherent nature of the airflow is stochastic because of the presence of turbulence. Hence, probabilistic modeling is necessary to model the stochastic process. The probabilistic models come with several other advantages: 1) systematically inducing the prior knowledge into the models through probabilistic priors, 2) the estimation of the uncertainty of the model parameters, 3) allows sampling of new data points 4) evaluation of the likelihood of the observed speech.
We start with the governing equation of sound generation and use a simplified geometry of the vocal folds. We show that the sound generated by the vocal folds consists of two parts. The first part is because of the difference between the subglottal and supra glottal pressure difference. The second part is because of the sound generated by turbulence. The first kind is dominant in the voiced sounds, and the second part is dominant in the unvoiced sounds. We further assume the plane wave propagation in the vocal tract, and there is no feedback from the vocal tract on the vocal folds. The resulting model is the excitation passing through an all-pole filter, and the excitation is the sum of two signals. The first signal is quasi-periodic, and the shape of each cycle depends on the time-varying area of the glottis. The second part is stochastic because the turbulence is modeled as a white noise passed through a filter. We further convert the model into a probabilistic one by assuming the following distribution on the excitations and filters. We model the excitation using a Bernoulli Gaussian distribution. Filter coefficients are modeled using the Gaussian distribution. The noise distribution is also Gaussian. Given these distributions, the likelihood of the speech can be derived as a closed-form expression. Similarly, we impose an appropriate prior to the model’s parameters and make a maximum a posteriori estimation of the parameters. But the model assumption can be changed/approximated with respect to the application and resulting in different estimation procedures. To validate the model, we apply this model to seven applications as follows:
1. Analysis and Synthesis: This application is to understand the representation power of the model.
2. Robust GCI detection: This shows the usefulness of estimated excitation, and the probabilistic modeling helps to incorporate the second-order statistics for robust the excitation estimation.
3. Probabilistic glottal inverse filtering: This application shows the usefulness of the prior distribution on filters.
4. Neural speech synthesis: We show that the model’s reformulation with the neural network results in a computationally efficient neural speech synthesis.
5. Prosthetic esophageal (PE) to normal speech conversion: We use the probabilistic model for detecting the impulses in the noisy signal to convert the PE speech to normal speech.
6. Robust essential vocal tremor classification: The usefulness of robust excitation estimation in pathological speech such as essential vocal tremor.
7. Snorer group classification: Based on the analogy between voiced speech production and snore production, the derived model is applicable for snore signals. We also use the parameter of the model to classify the snorer groups.
Google Scholar Research Guides : Dr. Prasanta Kumar Ghosh