
Aravind Illa's PhD

Acoustic-Articulatory Mapping: Analysis and Improvements with Neural Network Learning Paradigms

Human speech is one of many acoustic signals we perceive, and it carries both linguistic and paralinguistic (e.g., speaker identity, emotional state) information. Speech acoustics are produced as a result of different temporally overlapping gestures of the speech articulators (such as the lips, tongue tip, tongue body, tongue dorsum, velum, and larynx), each of which regulates constriction in different parts of the vocal tract. Estimating speech acoustic representations from articulatory movements is known as articulatory-to-acoustic forward (AAF) mapping, i.e., articulatory speech synthesis, while estimating articulatory movements from the speech acoustics is known as acoustic-to-articulatory inverse (AAI) mapping. These acoustic-articulatory mapping functions are known to be complex and nonlinear. The complexity of this mapping depends on a number of factors, including the kind of representations used in the acoustic and articulatory spaces. Typically, these representations capture both linguistic and paralinguistic aspects of speech, and how each of these aspects contributes to the complexity of the mapping is unknown. These representations and, in turn, the acoustic-articulatory mapping are affected by the speaking rate as well. The nature and quality of the mapping vary across speakers. Thus, the complexity of the mapping also depends on the amount of data from a speaker as well as the number of speakers used in learning the mapping function. Further, how language variations impact the mapping requires detailed investigation. This thesis analyzes a few of these factors in detail and develops neural network based models to learn mapping functions robust to many of these factors.

Electromagnetic articulography (EMA) sensor data has been used directly in the past as articulatory representations (ARs) for learning the acoustic-articulatory mapping function. In this thesis, we address the problem of optimal EMA sensor placement such that the air-tissue boundaries, as seen in the mid-sagittal plane of real-time magnetic resonance imaging (rtMRI), are reconstructed with minimum error. Following the optimal sensor placement work, acoustic-articulatory data was collected using EMA from 41 subjects with speech stimuli in English and Indian native languages (Hindi, Kannada, Tamil and Telugu), resulting in a total of ~23 hours of data used in this thesis.

Representations are also learnt from the raw waveform for the AAI task using convolutional and bidirectional long short-term memory neural networks (CNN-BLSTM), where the learned CNN filters are found to be similar to those used for computing Mel-frequency cepstral coefficients (MFCCs), which are typically used for the AAI task. In order to examine the extent to which a representation carrying only linguistic information can recover ARs, we replace the MFCC vectors with one-hot encoded vectors representing phonemes, which are further modified to remove the time duration of each phoneme and keep only the phoneme sequence. Experiments with the phoneme sequence using an attention network achieve an AAI performance identical to that using phonemes with timing information, while there is a drop in performance compared to that using MFCCs. Experiments examining variation in speaking rate reveal that the errors in estimating the vertical motion of the tongue articulators from acoustics at a fast speaking rate are significantly higher than those at a slow speaking rate. In order to reduce the demand for data from a speaker, low-resource AAI is proposed using a transfer learning approach. Further, we show that AAI can be modeled to learn the acoustic-articulatory mappings of multiple speakers through a single AAI model rather than building separate speaker-specific models. This is achieved by conditioning the AAI model with speaker embeddings, which benefits AAI in both seen and unseen speaker evaluations.

Finally, we show the benefit of the estimated ARs in a voice conversion application. Experiments reveal that ARs estimated by speaker-independent AAI preserve linguistic information and suppress speaker-dependent factors. These ARs (from an unseen speaker and language) are used to drive a target-speaker-specific AAF model to synthesize speech, which preserves the linguistic information and the target speaker's voice characteristics.
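To make the speaker-conditioned AAI model described above concrete, the following is a minimal sketch in PyTorch, assuming MFCC input frames, a fixed-dimensional speaker embedding, and 12 EMA-derived articulatory trajectories as targets; the layer sizes and dimensions are illustrative assumptions, not the configuration used in the thesis.

    # Minimal sketch of a speaker-conditioned BLSTM acoustic-to-articulatory
    # inversion (AAI) model. All dimensions are illustrative assumptions.
    import torch
    import torch.nn as nn

    class SpeakerConditionedAAI(nn.Module):
        def __init__(self, n_mfcc=13, spk_dim=64, hidden=128, n_artic=12):
            super().__init__()
            # The speaker embedding is concatenated to every acoustic frame.
            self.blstm = nn.LSTM(input_size=n_mfcc + spk_dim,
                                 hidden_size=hidden, num_layers=2,
                                 batch_first=True, bidirectional=True)
            self.proj = nn.Linear(2 * hidden, n_artic)  # articulatory trajectories

        def forward(self, mfcc, spk_emb):
            # mfcc: (batch, frames, n_mfcc); spk_emb: (batch, spk_dim)
            spk = spk_emb.unsqueeze(1).expand(-1, mfcc.size(1), -1)
            h, _ = self.blstm(torch.cat([mfcc, spk], dim=-1))
            return self.proj(h)  # (batch, frames, n_artic)

    model = SpeakerConditionedAAI()
    mfcc = torch.randn(4, 200, 13)   # 4 utterances, 200 frames each
    spk_emb = torch.randn(4, 64)     # one embedding per utterance
    artic = model(mfcc, spk_emb)     # estimated articulatory trajectories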


Google Scholar

Research Guides: Dr. Prasanta Kumar Ghosh

Abinay Reddy's M.Tech by research

Speaker verification using whispered speech

Like neutral speech, whispered speech is one of the natural modes of speech production, and it is often used by speakers in their day-to-day life. For some people, such as laryngectomees, whispered speech is the only mode of communication. Despite the absence of voicing in whispered speech and its differences in characteristics compared to neutral speech, previous works in the literature have demonstrated that whispered speech contains adequate information about the content and the speaker.

In recent times, virtual assistants have become more natural and widespread. This has led to an increase in scenarios where the device has to detect speech and verify the speaker even when the speaker whispers. Due to its noise-like characteristics, detecting whispered speech is a challenge. On the other hand, a typical speaker verification system, where neutral speech is used for enrolling the speakers but whispered speech is used for testing, often performs poorly due to the difference in acoustic characteristics between whispered and neutral speech. Hence, the aim of this thesis is two-fold: 1) to develop a robust whisper activity detector specifically for the speaker verification task, and 2) to improve whispered-speech-based speaker verification performance.

The contributions of this thesis lie in whisper activity detection as well as whispered-speech-based speaker verification. It is shown how attention-based average pooling in a speaker verification model can be used to detect whispered speech regions in noisy audio more accurately than the best of the available baseline schemes. For improving speaker verification using whispered speech, we propose features based on formant gaps and show that these features are more invariant to the mode of speech than the best of the existing features. We also propose two feature mapping methods to convert whispered features to neutral features for speaker verification. In the first method, we introduce a novel objective function based on cosine similarity for training the DNN used for feature mapping. In the second method, we iteratively optimize the feature mapping model using the cosine similarity based objective function and the total variability space likelihood in an i-vector based background model. The proposed optimization provides a more reliable mapping from whispered features to neutral features, resulting in a 44.8% relative improvement in speaker verification equal error rate over an existing DNN based feature mapping scheme.
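As a rough illustration of the cosine similarity based objective mentioned above for the whispered-to-neutral feature mapping DNN, here is a minimal sketch in PyTorch; the network architecture and the 39-dimensional feature size are assumptions made for the example, not the settings used in the thesis.

    # Sketch of a whispered-to-neutral feature mapping DNN trained with a
    # cosine-similarity objective. Layer sizes and the 39-d feature dimension
    # are illustrative assumptions.
    import torch
    import torch.nn as nn

    mapper = nn.Sequential(
        nn.Linear(39, 256), nn.ReLU(),
        nn.Linear(256, 256), nn.ReLU(),
        nn.Linear(256, 39),
    )

    def cosine_loss(pred, target):
        # Maximize cosine similarity between mapped whispered features and the
        # time-aligned neutral features, i.e. minimize (1 - cos).
        cos = nn.functional.cosine_similarity(pred, target, dim=-1)
        return (1.0 - cos).mean()

    opt = torch.optim.Adam(mapper.parameters(), lr=1e-3)
    whisper = torch.randn(32, 39)   # batch of whispered-speech frames
    neutral = torch.randn(32, 39)   # corresponding neutral-speech frames
    opt.zero_grad()
    loss = cosine_loss(mapper(whisper), neutral)
    loss.backward()
    opt.step()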


Google Scholar

Research Guides: Dr. Prasanta Kumar Ghosh

Publications

Achuth Rao's PhD

Probabilistic source-filter model of speech

The human respiratory system plays a crucial role in breathing and swallowing. However, it also plays an essential role in speech production, which is unique to humans. Speech production involves expelling air from the lungs. As the air flows from the lungs to the lips, some of its kinetic energy gets converted to sound. Different structures modulate the generated sound, which is finally radiated out of the lips. Speech carries various kinds of information, such as linguistic content, speaker identity, emotional state, accent, etc. Apart from speech, there are various scenarios where sound is generated in the human respiratory system. These could be due to abnormalities in the muscles, the motor control unit, or the lungs, which can directly affect the generated speech as well. A variety of sounds are also generated by these structures while breathing, including snoring, stridor, dysphagia, and cough.

The source-filter (SF) model of speech is one of the earliest models of speech production. It assumes that speech is the result of filtering an excitation or source signal by a linear filter, with the source and the filter assumed to be independent. Even though the SF model represents the speech production mechanism, there needs to be a tractable way of estimating the excitation and the filter. Estimating both of them given the speech signal falls under the general category of signal deconvolution problems and, hence, has no unique solution. There are several variations of the source-filter model in the literature, obtained by assuming different structures on the source and/or the filter, and there are various ways to estimate the parameters of the source and the filter. The estimated parameters are used in various speech applications such as automatic speech recognition, text-to-speech, speech enhancement, etc. Even though the SF model is a model of speech production, it is also used in applications such as Parkinson's disease classification and asthma classification.

Existing source-filter models have shown much success in various applications; however, we believe that these models are lacking in two main respects. The first limitation is that they lack a connection to the physics of sound generation and propagation. The second limitation is that they are not fully probabilistic. The airflow is inherently stochastic because of the presence of turbulence; hence, probabilistic modeling is necessary to capture this stochastic process. Probabilistic models come with several other advantages: 1) systematically incorporating prior knowledge into the model through probabilistic priors, 2) estimating the uncertainty of the model parameters, 3) allowing sampling of new data points, and 4) evaluating the likelihood of the observed speech.

We start with the governing equation of sound generation and use a simplified geometry of the vocal folds. We show that the sound generated by the vocal folds consists of two parts. The first part is due to the difference between the subglottal and supraglottal pressures, and the second part is due to the sound generated by turbulence. The first is dominant in voiced sounds, while the second is dominant in unvoiced sounds. We further assume plane-wave propagation in the vocal tract and no feedback from the vocal tract onto the vocal folds. In the resulting model, the excitation passes through an all-pole filter, and the excitation is the sum of two signals: the first is quasi-periodic, with the shape of each cycle depending on the time-varying area of the glottis, and the second is stochastic, with the turbulence modeled as white noise passed through a filter. We further convert the model into a probabilistic one by assuming the following distributions on the excitation and the filter: the excitation is modeled using a Bernoulli-Gaussian distribution, the filter coefficients are modeled using a Gaussian distribution, and the noise distribution is also Gaussian.

Given these distributions, the likelihood of the speech can be derived as a closed-form expression. Similarly, we impose appropriate priors on the model's parameters and perform maximum a posteriori estimation of the parameters. The model assumptions can be changed or approximated depending on the application, resulting in different estimation procedures. To validate the model, we apply it to the following seven applications:
1. Analysis and synthesis: this application is intended to understand the representation power of the model.
2. Robust glottal closure instant (GCI) detection: this shows the usefulness of the estimated excitation, and the probabilistic modeling helps incorporate second-order statistics for robust excitation estimation.
3. Probabilistic glottal inverse filtering: this application shows the usefulness of the prior distribution on the filters.
4. Neural speech synthesis: we show that reformulating the model with a neural network results in computationally efficient neural speech synthesis.
5. Prosthetic esophageal (PE) to normal speech conversion: we use the probabilistic model for detecting the impulses in the noisy signal to convert PE speech to normal speech.
6. Robust essential vocal tremor classification: this demonstrates the usefulness of robust excitation estimation for pathological speech such as essential vocal tremor.
7. Snorer group classification: based on the analogy between voiced speech production and snore production, the derived model is applicable to snore signals, and we use the model parameters to classify snorer groups.
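As a minimal numerical sketch of this generative view, i.e., a quasi-periodic plus stochastic excitation driving an all-pole filter, the following NumPy/SciPy snippet synthesizes a short voiced-like segment; the pitch, pole locations, and noise level are arbitrary illustrative choices, not parameters estimated in the thesis.

    # Deterministic (quasi-periodic) plus stochastic excitation passed through
    # an all-pole filter. All numeric values are illustrative assumptions.
    import numpy as np
    from scipy.signal import lfilter

    fs = 16000                       # sampling rate (Hz)
    f0 = 120                         # pitch of the quasi-periodic part (Hz)
    n = fs // 2                      # half a second of signal

    # Quasi-periodic excitation: one impulse per pitch period.
    excitation = np.zeros(n)
    excitation[::fs // f0] = 1.0

    # Stochastic excitation: white noise shaped by a simple one-pole filter,
    # standing in for the turbulence component.
    noise = lfilter([1.0], [1.0, -0.9], 0.05 * np.random.randn(n))

    # All-pole vocal-tract filter with two arbitrary resonances.
    freqs = np.array([500, 1500])                     # resonance frequencies (Hz)
    poles = 0.97 * np.exp(1j * 2 * np.pi * freqs / fs)
    a = np.real(np.poly(np.concatenate([poles, poles.conj()])))

    speech = lfilter([1.0], a, excitation + noise)    # synthesized segment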


Google Scholar

Research Guides: Dr. Prasanta Kumar Ghosh

Publications

Mannem Renuka's M.Tech by research

Speech task-specific representation learning using acoustic-articulatory data

Human speech production involves modulation of the air stream by the vocal tract shape, which is determined by the articulatory configuration. Articulatory gestures are often used to represent speech units, and it has been shown that articulatory representations contain information complementary to the acoustics. Thus, a speech task could benefit from representations derived from both acoustic and articulatory data. A typical acoustic representation consists of spectral and temporal characteristics, e.g., Mel Frequency Cepstral Coefficients (MFCCs), Line Spectral Frequencies (LSFs), and Discrete Wavelet Transform (DWT) coefficients. On the other hand, articulatory representations vary depending on how the articulatory movements are captured. For example, when Electro-Magnetic Articulography (EMA) is used, the raw recorded movements of the EMA sensors placed on the tongue, jaw, upper lip, and lower lip, as well as tract variables derived from them, have often been used as articulatory representations. Similarly, when real-time Magnetic Resonance Imaging (rtMRI) is used, articulatory representations are derived primarily based on the Air-Tissue Boundaries (ATBs) in the rtMRI video.

The low resolution and low SNR of the rtMRI video make ATB segmentation challenging. Therefore, we propose various supervised ATB segmentation algorithms, including semantic segmentation and object contour detection using deep convolutional networks. The proposed approaches predict ATBs better than the existing baselines, namely the Maeda grid and Fisher discriminant measure based schemes. We also propose a deep fully-connected neural network based ATB correction scheme as a post-processing step to improve upon the predicted ATBs. However, unlike the speech recording, articulatory data is not directly available in practice. Thus, we also consider articulatory representations derived from the acoustics using an Acoustic-to-Articulatory Inversion (AAI) method.
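As an illustration of how a deep network can perform pixel-wise ATB segmentation of an rtMRI frame, here is a minimal encoder-decoder sketch in PyTorch; the architecture, number of classes, and 68x68 frame size are assumptions made for the example, not the networks proposed in the thesis.

    # Minimal encoder-decoder CNN for pixel-wise air-tissue boundary (ATB)
    # segmentation of rtMRI frames. Architecture and sizes are illustrative
    # assumptions.
    import torch
    import torch.nn as nn

    class ATBSegmenter(nn.Module):
        def __init__(self, n_classes=3):   # e.g. background and two ATB regions
            super().__init__()
            self.encoder = nn.Sequential(
                nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
                nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            )
            self.decoder = nn.Sequential(
                nn.ConvTranspose2d(32, 16, 2, stride=2), nn.ReLU(),
                nn.ConvTranspose2d(16, n_classes, 2, stride=2),
            )

        def forward(self, x):                      # x: (batch, 1, H, W) frames
            return self.decoder(self.encoder(x))   # per-pixel class scores

    frames = torch.randn(2, 1, 68, 68)   # two grayscale rtMRI frames
    scores = ATBSegmenter()(frames)      # shape (2, n_classes, 68, 68)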


LinkedIn

Research Guides: Dr. Prasanta Kumar Ghosh

Publications

Chiranjeevi's PhD

Pronunciation assessment and semi-supervised feedback prediction for spoken English tutoring

Spoken English pronunciation quality is often influenced by the nativity of a learner for whom English is a second language. Typically, the pronunciation quality of a learner depends on the degree of the following four sub-qualities: 1) phoneme quality, 2) syllable stress quality, 3) intonation quality, and 4) fluency. In order to achieve good pronunciation quality, learners need to minimize the influence of their nativity on each of the four sub-qualities, which can be achieved with effective spoken English tutoring methods. However, these methods are expensive, as they require highly proficient English experts. In cases where a cost-effective solution is required, it is useful to have a tutoring system that assesses a learner's pronunciation and provides feedback on each of the four sub-qualities to minimize nativity influences, in a manner similar to that of a human expert. Such systems are also useful for learners who cannot access high-quality tutoring due to demographic and physical constraints. In this thesis, several methods are developed to assess pronunciation quality and provide feedback for such a spoken English tutoring system for Indian learners.


Google Scholar

Research Guides: Dr. Prasanta Kumar Ghosh

Publications

Nisha Meenakshi's PhD

Analysis of whispered speech and its conversion to neutral speech

Whispering is an indispensable form of communication that emerges in private conversations as well as in pathological situations. In conditions such as partial or total laryngectomy, spasmodic dysphonia, etc., alaryngeal speech (such as esophageal and tracheo-esophageal speech) and hoarse whispered speech are common. Whispered speech is primarily characterized by the lack of vocal fold vibrations and, hence, of pitch. In recent times, applications such as voice activity detection, speaker identification and verification, and speech recognition have been extended to whispered speech as well. Several efforts have also been undertaken to convert the less intelligible whispered speech into more natural-sounding neutral speech. Despite this body of work, research towards gaining a better understanding of whispered speech itself remains largely unexplored. Hence, the aim of this thesis is two-fold: 1) to analyze different characteristics of whispered speech using both speech and articulatory data, and 2) to perform whispered-speech-to-neutral-speech conversion using state-of-the-art modelling techniques.


Google Scholar

Research Guides: Dr. Prasanta Kumar Ghosh

Publications

Karthik Girija Ramesh's M.Tech by research

Binaural source localization using subband reliability and interaural time difference patterns

Machine localization of sound sources is necessary for a wide range of applications, including human-robot interaction, surveillance, and hearing aids. Sound source localization algorithms for robots have been proposed using microphone arrays with varying numbers of microphones. Adding more microphones helps increase localization performance, as more spatial cues can be obtained depending on the number and arrangement of the microphones. However, humans have a remarkable ability to accurately localize and attend to target sound sources even in adverse noise conditions. The perceptual organization of sounds in complex auditory scenes relies on various cues that help us group and segregate sounds. Among these, two major spatial cues are the interaural time difference (ITD) and the interaural level/intensity difference (ILD/IID). An algorithm inspired by human binaural localization would extract these features from the input signals. Popular algorithms for binaural source localization model the distributions of ITD and ILD in each frequency subband (typically in the range of 80 Hz to 5 kHz for speech sources) using Gaussian Mixture Models (GMMs) and perform likelihood integration across the time-frequency plane to estimate the direction of arrival (DoA) of the sources.

In this thesis, we show that the localization performance of a GMM based scheme varies across subbands. We propose a weighted subband likelihood scheme in order to exploit subband reliability for localization. The weights are computed by applying a non-linear warping function to the subband reliabilities. Source localization results demonstrate that the proposed weighted scheme performs better than uniformly weighting all subbands. In particular, the best set of weights closely corresponds to the case of selecting only the most reliable subband. We also propose a new binaural localization technique in which templates that capture direction-specific interaural time difference patterns are used to localize sources. These templates are obtained using histograms of ITDs in each subband. The DoA is estimated using a template matching scheme, which is experimentally found to perform better than the GMM based scheme. The concept of matching interaural time difference patterns is also extended to binaural localization of multiple speech sources.
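The weighted subband likelihood idea can be sketched as follows in Python with scikit-learn, assuming per-direction, per-subband GMMs trained on ITD/ILD features and a set of precomputed subband reliability weights; all names, shapes, and the toy data are illustrative assumptions.

    # Weighted subband-likelihood DoA estimation: frame-wise log-likelihoods
    # under each direction's subband GMMs are scaled by subband reliability
    # weights and summed; the highest-scoring direction is chosen.
    import numpy as np
    from sklearn.mixture import GaussianMixture

    def estimate_doa(features, gmms, weights, candidate_doas):
        # features[b]: (frames, 2) ITD/ILD array for subband b
        # gmms[d][b]: GaussianMixture for direction d, subband b
        # weights[b]: reliability weight of subband b
        scores = []
        for d, _ in enumerate(candidate_doas):
            total = 0.0
            for b, feats in enumerate(features):
                total += weights[b] * gmms[d][b].score_samples(feats).sum()
            scores.append(total)
        return candidate_doas[int(np.argmax(scores))]

    # Toy usage with random data: 2 candidate directions, 3 subbands.
    rng = np.random.default_rng(0)
    doas = [-30, 30]
    gmms = [[GaussianMixture(n_components=2).fit(rng.normal(d / 30.0, 1.0, (50, 2)))
             for _ in range(3)] for d in doas]
    features = [rng.normal(1.0, 1.0, (20, 2)) for _ in range(3)]
    weights = np.array([0.2, 0.5, 0.3])
    print(estimate_doa(features, gmms, weights, doas))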


Google Scholar

Research Guides: Dr. Prasanta Kumar Ghosh

Publications

Pavan Subhaschandra Karjol's M.Tech by research

Speech enhancement using deep mixture of experts

Speech enhancement is at the heart of many applications such as speech communication, automatic speech recognition, hearing aids, etc. In this work, we consider speech enhancement under the framework of a multiple deep neural network (DNN) system. DNNs have been extensively used in speech enhancement due to their ability to capture complex variations in the input data. As a natural extension, researchers have used variants of networks with multiple DNNs for speech enhancement. The input data can be clustered to train each DNN separately, or all the DNNs can be trained jointly without any clustering. In this work, we propose clustering methods for training multiple-DNN systems and their variants for speech enhancement. One of the proposed approaches involves grouping phonemes into broad classes and training a separate DNN for each class. Such an approach is found to perform better than single-DNN based speech enhancement. However, it relies on phoneme information, which may not be available for all corpora. Hence, we propose a hard expectation-maximization (EM) based task-specific clustering method, which automatically determines clusters without relying on knowledge of speech units. The idea is to redistribute the data points among the multiple DNNs so as to enable better speech enhancement. The experimental results show that the hard EM based clustering performs better than single-DNN based speech enhancement and provides results similar to those of the broad phoneme class based approach.
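A minimal sketch of the hard-EM style training loop for such a multiple-DNN (mixture of experts) enhancement system is given below in PyTorch; the expert architectures, feature dimension, and assignment criterion shown are illustrative assumptions, not the exact method in the thesis.

    # Hard-EM training of multiple expert DNNs for speech enhancement: each
    # noisy frame is assigned to the expert that currently enhances it best
    # (E-step), then each expert is updated on its own partition (M-step).
    # Network sizes and the 257-d feature dimension are assumptions.
    import torch
    import torch.nn as nn

    n_experts, feat_dim = 3, 257
    experts = [nn.Sequential(nn.Linear(feat_dim, 512), nn.ReLU(),
                             nn.Linear(512, feat_dim)) for _ in range(n_experts)]
    opts = [torch.optim.Adam(e.parameters(), lr=1e-3) for e in experts]
    mse = nn.MSELoss(reduction='none')

    def hard_em_step(noisy, clean):
        # Hard E-step: per-frame assignment to the lowest-error expert.
        with torch.no_grad():
            errors = torch.stack([mse(e(noisy), clean).mean(dim=1) for e in experts])
            assign = errors.argmin(dim=0)            # (frames,)
        # M-step: update each expert on the frames assigned to it.
        for k, (e, opt) in enumerate(zip(experts, opts)):
            idx = (assign == k)
            if idx.any():
                opt.zero_grad()
                mse(e(noisy[idx]), clean[idx]).mean().backward()
                opt.step()

    noisy = torch.randn(64, feat_dim)   # batch of noisy feature frames
    clean = torch.randn(64, feat_dim)   # corresponding clean targets
    hard_em_step(noisy, clean)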


LinkedIn

Research Guides: Dr. Prasanta Kumar Ghosh

Publications