SPIRE LAB


Signal Processing Interpretation and REpresentation LABoratory


LAB EVENTS > TALKS

16th January 2026 : End-to-End Spoken Grammatical Error Correction by Pallavi Singh

Talk summary:

  • Grammatical Error Correction (GEC) and feedback play a vital role in supporting second language (L2) learners, educators, and examiners. While written GEC is well-established, spoken GEC (SGEC), aiming to provide feedback based on learners’ speech, poses additional challenges due to disfluencies, transcription errors, and the lack of structured input. SGEC systems typically follow a cascaded pipeline consisting of Automatic Speech Recognition (ASR), disfluency detection, and GEC, making them vulnerable to error propagation across modules. This work examines an End-to-End (E2E) framework for SGEC and feedback generation, highlighting challenges and possible solutions when developing these systems. Cascaded, partial-cascaded and E2E architectures are compared, all built on the Whisper foundation model. A challenge for E2E systems is the scarcity of GEC labeled spoken data. To address this, an automatic pseudo-labeling framework is examined, increasing the training data from 77 to over 2500 hours. To improve the accuracy of the SGEC system, additional contextual information, exploiting the ASR output, is investigated. Feedback to candidates on their mistakes is an essential step to improving performance. In E2E systems the SGEC output must be compared with an estimate of the fluent transcription to obtain the feedback. To improve the precision of this feedback, a novel reference alignment process is proposed that aims to remove hypothesised edits that result from fluent transcription errors. Finally, these approaches are combined with an edit confidence estimation approach, to exclude low-confidence edits. Experiments on the in-house Linguaskill (LNG) corpora and the publicly available Speak&Improve (S&I) corpus show that the proposed approaches significantly boost E2E SGEC performance.

9th January 2026 : Dimensions of Structure and Variability in the Human Vocal Tract: From Manual Measurement to Few-Shot Deep Learning rtMRI Analysis by Chetan Sharma

Talk summary:

  • A defining characteristic of the human vocal tract is the complex interaction between anatomical structure and inter-speaker variability. Across a population, vocal tract dimensions vary substantially, yet variation in one dimension is rarely independent of others, raising the question of whether invariant morphological relationships exist. In this work, we present a data-driven investigation of vocal tract dimensions using multi-speaker real-time magnetic resonance imaging (rtMRI) data, replicating prior findings that reveal distinct sub-populations of speakers who share common patterns of co-variation among vocal tract parameters. In addition to reproducing the original manual measurement framework, we extend this methodology by introducing an automated pipeline based on deep learning–driven Air–Tissue Boundary segmentation using the SegNet architecture. The segmentation models are pretrained on multi-speaker rtMRI data and subsequently adapted to new speakers using a small number of annotated frames, enabling reliable extraction of anatomical parameters with minimal manual effort. We show that the automatically derived measurements recover the core structure–variability relationships reported in the original study, demonstrating that large-scale vocal tract morphology analysis can be performed accurately and efficiently using automatic segmentation and few-shot model adaptation.

2nd December 2025 : Implementation of RAG pipeline with government schemes dataset by Kamali Ramesh

Talk summary:

  • Implementation of a RAG pipeline over a dataset of government schemes to assist in form-filling, restricting irrelevant queries while handling all types of queries across Hindi, English, and Hinglish, together with an evaluation of its responses across different metrics.
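
A minimal sketch of the retrieval step such a pipeline rests on, assuming the sentence-transformers package; the model name, the schemes list, and the similarity threshold are illustrative placeholders, not the talk's actual setup:

    # retrieval sketch: embed scheme descriptions once, then match queries
    from sentence_transformers import SentenceTransformer, util

    model = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")  # covers Hindi/English

    schemes = [
        "PM-KISAN: income support for farmer families",
        "Sukanya Samriddhi Yojana: savings scheme for the girl child",
    ]
    scheme_emb = model.encode(schemes, convert_to_tensor=True)

    def retrieve(query, k=1, threshold=0.3):
        """Return top-k matching schemes; an empty list flags an irrelevant query."""
        q_emb = model.encode(query, convert_to_tensor=True)
        scores = util.cos_sim(q_emb, scheme_emb)[0]
        ranked = sorted(zip(scores.tolist(), schemes), reverse=True)
        return [s for score, s in ranked[:k] if score >= threshold]

    print(retrieve("kisan yojana ka form kaise bharein?"))  # Hinglish query

Thresholding the similarity score is one simple way to restrict irrelevant queries, as the summary describes; the retrieved scheme text would then be passed to a generator model.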

26th December 2025 : “Why Does CTC Become Peaky? An Analysis of Alignment Collapse in End-to-End ASR” by K Venu Reddy

Talk summary:

  • Connectionist Temporal Classification (CTC) has become a foundational loss function for end-to-end automatic speech recognition (ASR) due to its ability to learn alignments between variable-length acoustic sequences and label sequences without explicit frame-level supervision. However, a well-known yet under-analyzed phenomenon in CTC-based models is the emergence of peaky posterior distributions, where most non-blank symbols receive high probability mass only at a small number of time steps, while the blank symbol dominates elsewhere. In this talk, we analyze the origins of this peaky behavior from both probabilistic and optimization perspectives. We show how the CTC marginalization over alignments, combined with conditional independence assumptions and softmax normalization, encourages over-confident frame-level predictions. This leads to sharp, spiky alignments that are brittle under noise, limit gradient propagation, and adversely affect downstream decoding and representation learning—especially in low-resource or noisy conditions. We further discuss the practical consequences of peakiness, including poor uncertainty modeling, over-reliance on blanks, and reduced robustness in streaming and multi-task ASR settings. Finally, we review and categorize existing mitigation strategies—such as label smoothing, alignment regularization, auxiliary losses, and hybrid attention-CTC formulations—and provide insights into how these approaches alter the training dynamics to produce smoother and more stable alignments. This analysis aims to deepen understanding of CTC’s training behavior and motivate principled design choices for more robust end-to-end ASR systems.
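
For reference, the CTC objective whose alignment marginalization the talk analyzes (standard formulation, with B the collapse function that removes repeats and then blanks):

    p(y \mid x) = \sum_{\pi \in B^{-1}(y)} \prod_{t=1}^{T} p(\pi_t \mid x),
    \qquad
    L_{\mathrm{CTC}} = -\log p(y \mid x)

The frame-wise factorization embodies exactly the conditional independence assumption, and the per-frame softmax the normalization, that the talk connects to over-confident, blank-dominated posteriors.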

12th December 2025 : Nested Learning: A new ML paradigm for continual learning by Seshan S

Talk summary:

  • Over the past decades, improving neural architectures and their training algorithms has been central to machine learning research. Despite advances, particularly in large language models, fundamental challenges remain in continual learning, self-improvement, and effective problem-solving. We introduce Nested Learning (NL), a paradigm that models a system as nested, multi-level, and/or parallel optimization problems, each with its own context flow. NL frames existing deep learning methods as compressing their context flow, naturally giving rise to in-context learning, and offers a path toward higher-order in-context and continual learning. We demonstrate NL through three contributions: (1) Expressive Optimizers that extend traditional gradient-based methods with deep memory, (2) a Self-Modifying Learning Module that learns its own update rules, and (3) a Continuum Memory System, generalizing short- and long-term memory. Combining these, our Hope module shows promising results in language modeling, few-shot generalization, continual learning, and long-context reasoning.

12th December 2025 : PINNs for Vocal Tract acoustic modelling by Atharva Jeevannavar

Talk summary:

  • Traditional mesh-based numerical methods for acoustic modeling are computationally expensive, while standard Physics-Informed Neural Networks (PINNs) often struggle to stably enforce boundary conditions (BCs). This paper presents a robust PINN framework for 1-D acoustic wave propagation that incorporates a trial solution formulation, enforcing BCs by construction to eliminate hyperparameter tuning and stabilize optimization. The framework is developed for both the Helmholtz and Webster’s horn equations, with tailored formulations for Dirichlet and mixed (Robin/impedance) BCs. A key extension to complex-valued fields enables accurate modeling of physically realistic, frequency-dependent impedance. The method is validated on three cases: a uniform duct, a conical frustum, and an MRI-derived human vocal tract geometry, across frequencies from 500 to 2000 Hz. Results show excellent agreement with analytical solutions for all scenarios, including complex radiation effects. The proposed trial-solution PINN thus provides an accurate, mesh-free, and highly stable alternative for solving complex 1-D acoustic problems.

28th November 2025 : Domain-Adversarial Training of Neural Networks by Anshuman Mishra

Talk summary:

  • We introduce a new representation learning approach for domain adaptation, in which data at training and test time come from similar but different distributions. Our approach is directly inspired by the theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on features that cannot discriminate between the training (source) and test (target) domains. The approach implements this idea in the context of neural network architectures that are trained on labeled data from the source domain and unlabeled data from the target domain (no labeled target-domain data is necessary). As the training progresses, the approach promotes the emergence of features that are (i) discriminative for the main learning task on the source domain and (ii) indiscriminate with respect to the shift between the domains. We show that this adaptation behaviour can be achieved in almost any feed-forward model by augmenting it with few standard layers and a new gradient reversal layer. The resulting augmented architecture can be trained using standard backpropagation and stochastic gradient descent, and can thus be implemented with little effort using any of the deep learning packages. We demonstrate the success of our approach for two distinct classification problems (document sentiment analysis and image classification), where state-of-the-art domain adaptation performance on standard benchmarks is achieved. We also validate the approach for descriptor learning task in the context of person re-identification application.
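
A minimal PyTorch sketch of the gradient reversal layer the abstract refers to; the lambda value and tensor shapes are illustrative:

    import torch

    class GradReverse(torch.autograd.Function):
        @staticmethod
        def forward(ctx, x, lambd):
            ctx.lambd = lambd
            return x.view_as(x)              # identity in the forward pass

        @staticmethod
        def backward(ctx, grad_output):
            # negate (and scale) gradients into the feature extractor, so
            # features are trained to confuse the domain classifier
            return -ctx.lambd * grad_output, None

    def grad_reverse(x, lambd=1.0):
        return GradReverse.apply(x, lambd)

    features = torch.randn(8, 128, requires_grad=True)
    domain_logits = torch.nn.Linear(128, 2)(grad_reverse(features, lambd=0.5))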

14th November 2025 : Dynamic Time Warping by Neelapuja Haritha

Talk summary:

  • Dynamic Time Warping (DTW) is a foundational algorithm in speech recognition used to align two time-dependent sequences—typically a spoken input and a reference template—by minimizing the cumulative distance between them. It allows for flexible matching even when the sequences differ in length or speed, by warping the time axis to find the optimal alignment path. DTW operates under constraints like monotonicity and continuity, ensuring realistic temporal mapping, and is solved efficiently using dynamic programming. This technique was especially vital in early template-based systems for isolated word recognition.
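
A compact NumPy version of the DTW recursion described above, using the standard match/insertion/deletion step pattern that enforces monotonicity and continuity:

    import numpy as np

    def dtw_distance(x, y):
        """Cumulative DTW cost between 1-D sequences x and y."""
        n, m = len(x), len(y)
        D = np.full((n + 1, m + 1), np.inf)
        D[0, 0] = 0.0
        for i in range(1, n + 1):
            for j in range(1, m + 1):
                cost = abs(x[i - 1] - y[j - 1])        # local distance
                D[i, j] = cost + min(D[i - 1, j - 1],  # match
                                     D[i - 1, j],      # insertion
                                     D[i, j - 1])      # deletion
        return D[n, m]

    print(dtw_distance([1, 2, 3, 4], [1, 1, 2, 3, 3, 4]))  # 0.0: same shape, different speed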

24th October 2025 : Robust Language Identification using Phonotactics by Shubham Sharma

Talk summary:

  • This work presents methods for robust language identification using phoneme-level features and adversarial domain adaptation. The model uses a Conformer architecture trained on phoneme posterior representations. A gradient reversal layer allows it to learn domain-invariant features across datasets and speaking styles. Experiments on Ekstep, IIT Mandi, VAANI, and IndicVoice show that multi-target adversarial training improves recognition accuracy, especially in unseen and cross-domain settings. Single-target adaptation gives mixed results, but multi-target training reduces domain mismatch and boosts generalization. This approach provides an effective way to build LID systems for low-resource, multilingual environments.

17th October 2025 : Lip and Jaw Kinematics in Bilabial Stop Consonant Production by Shreya Karkun

Talk summary:

  • This paper presents two experiments on bilabial stop production. The first examined lip closure using kinematic, air pressure, and contact force data, revealing high lip velocity and mechanical interactions. The second explored lip–jaw coordination, variability, and effects of voicing and vowel context, finding no consistent voicing influence but confirming high-velocity closure. Overall, results support the idea that the lips aim for a negative lip aperture to ensure a tight seal despite contextual variations.

10th October 2025 : Improved Goodness of Pronunciation by Varshadhare K

Talk summary:

  • Goodness of pronunciation (GoP) is typically formulated with Gaussian mixture model-hidden Markov model (GMM-HMM) based acoustic models considering HMM state transition probabilities (STPs) and GMM likelihoods of context dependent phonemes. On the other hand, deep neural network (DNN)-HMM based acoustic models employed sub-phonemic (senone) posteriors instead of GMM likelihoods along with STPs. However, each senone is shared across many states; thus, there is no one-to-one correspondence between them. In order to circumvent this, most of the existing works have proposed modifications to the GoP formulation considering only posteriors, neglecting the STPs. In this work, we derive a formulation for the GoP and it results in a formulation involving both senone posteriors and STPs. Further, we illustrate the steps to implement the proposed GoP formulation in Kaldi, a state-of-the-art automatic speech recognition toolkit. Experiments are conducted on English data collected from Indian speakers using acoustic models trained with native English data from LibriSpeech and Fisher-English corpora. The highest improvement in the correlation coefficient between the scores from the formulations and the expert ratings is found to be 14.89% (relative) better with the proposed approach compared to the best of the existing formulations that don’t include STPs.
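
For context, one common statement of the classical GoP of a phone p over its acoustic segment (after Witt and Young), which formulations like the one above refine; NF(p) is the number of frames aligned to p and Q the phone set. The talk's own derivation, which retains the STPs, is not reproduced here:

    \mathrm{GoP}(p) = \frac{1}{NF(p)}
    \left| \log \frac{P\left(O^{(p)} \mid p\right) P(p)}
                     {\max_{q \in Q} P\left(O^{(p)} \mid q\right) P(q)} \right|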

26th September 2025 : Speech Intelligibility Prediction by Amartyaveer

Talk summary:

  • Speech intelligibility is a measure of how well a listener can understand spoken language. It is strongly influenced by environmental factors such as noise or transmission through different processing systems, which degrade the clarity of spoken communication in real-world conditions. Conventional approaches typically depend on subjective human listening tests, which are costly, time-intensive, and lack scalability. As a result, objective evaluation techniques have gained attention, as they enable automated prediction of intelligibility without human involvement. Among these, intrusive methods require access to clean reference signals, which are rarely available in practical applications. In contrast, non-intrusive methods employ neural network models to estimate quality directly from degraded speech, eliminating the need for reference signals. These techniques are designed to deliver consistent, reliable, and real-time assessments. In this work, we examine the STOI algorithm and several non-intrusive approaches and introduce our proposed method, demonstrating its performance across both seen and unseen noisy environments.

19th September 2025 : Intro to Hough Transform by Murali

Talk summary:

  • The Hough Transform is a fundamental technique in computer vision for detecting geometric shapes like lines and circles in images. It works by transforming image points into a "parameter space" where potential shapes are represented. For lines, each edge point (x,y) in the image contributes votes to all possible lines that could pass through it in a 𝜌-θ parameter space. An accumulator array collects these votes, and strong peaks in this array correspond to the most prominent lines in the original image. This voting mechanism makes the Hough Transform highly robust to noise, gaps, and partial occlusions. While computationally intensive for complex shapes, its reliability and interpretability make it valuable for applications such as lane detection, document analysis, and medical imaging.
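
A bare-bones Hough line transform over a binary edge map, voting in the ρ-θ accumulator exactly as described; the resolution and the toy input are illustrative:

    import numpy as np

    def hough_lines(edges, n_theta=180):
        h, w = edges.shape
        diag = int(np.ceil(np.hypot(h, w)))
        thetas = np.deg2rad(np.arange(n_theta))        # 0..179 degrees
        acc = np.zeros((2 * diag + 1, n_theta), dtype=np.int32)
        ys, xs = np.nonzero(edges)                     # edge pixels
        for x, y in zip(xs, ys):
            # each point votes for every line rho = x cos(theta) + y sin(theta)
            rhos = np.round(x * np.cos(thetas) + y * np.sin(thetas)).astype(int)
            acc[rhos + diag, np.arange(n_theta)] += 1
        return acc                                     # peaks = prominent lines

    edges = np.eye(64, dtype=bool)                     # a 45-degree line
    acc = hough_lines(edges)
    print(np.unravel_index(acc.argmax(), acc.shape))   # peak at theta = 135 degrees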

12th September 2025 : Summary of PINNs Work by Veerababu

Talk summary:

  • This talk summarizes the work carried out on PINNs in the SPIRE lab.

2nd September 2025 : Speech Time-Scale Modification With GANs by Jayin Khanna

Talk summary:

  • The aim of Speech Time-Scale Modification (TSM) is to alter the speaking rate of audio while preserving naturalness and intelligibility. Recent approaches such as ScalerGAN have shown promise by learning an unsupervised mapping between original and time-scaled signals, without requiring explicit duration labels. We have successfully reproduced the results of ScalerGAN on the LJSpeech and SPIRE Lab datasets, validating its ability to perform content-agnostic scaling through adversarial spectrum learning and consistency decoding. However, a key limitation of ScalerGAN is that it uniformly stretches or compresses the signal, without regard for phonetic or linguistic content. To address this, our ongoing work explores incorporating content-aware mechanisms inspired by FastSpeech and WaveGlow, specifically leveraging length regulators and duration predictors to replace the interpolator operators within the ScalerGAN architecture. This hybrid framework aims to selectively rescale speech segments based on linguistic content (e.g., vowels vs. stop consonants), thereby improving perceptual quality and naturalness. The goal of this project is to develop a model that bridges the gap between content-agnostic and content-aware TSM, with the potential to advance state-of-the-art methods in flexible and intelligible speech rate modification.

15th August 2025 : Phoneme Classification using CNNs and Mel Spectrograms by Shahanaj Khan

Talk summary:

  • This project uses a CNN to classify phonemes from audio by converting speech into Mel spectrograms. These spectrograms capture time and frequency details, enabling the CNN to learn patterns for distinguishing 12 phoneme classes. The model achieves good accuracy and generalizes well across data splits, showing the effectiveness of deep learning in phoneme recognition tasks.

25th July 2025 : Exploration of VAD Architectures for Short Pause Detection by Abhinith D

Talk summary:

  • Voice Activity Detection (VAD) is a critical component in speech processing systems. In the context of language learning applications, accurately identifying short pauses is essential for providing feedback on user articulation, rhythm, and fluency. This report details the exploration and comparative analysis of three distinct deep learning architectures for this specific task: a Convolutional Neural Network (CNN), a Convolutional Long Short-Term Memory Deep Neural Network (CLDNN), and an attention-based LSTM network. The models were trained and evaluated on a custom dataset of English speech recordings, with performance measured using standard classification metrics including accuracy, precision, recall, and F1-score. This study presents the results of this comparison, highlighting the relative strengths and weaknesses of each architecture. The findings provide crucial insights for selecting a robust and accurate model suitable for integration into our English learning platform.

18th July 2025 : Project Report on Automation of Spirometric Coefficient Prediction by Bhavya Gaur

Talk summary:

  • Spirometry is a standard but equipment-dependent test; in remote or resource-limited settings it is inaccessible. Breath sounds carry biomarkers that might correlate with pulmonary function. Goal: can we predict spirometric values (FEV1, FVC) from audio alone?

18th July 2025 : Design and evaluation of audio annotation and classification pipeline using CNN and Random Forests models by Stuti Vats

Talk summary:

  • This project presents a comprehensive pipeline for annotating and classifying spoken words in audio recordings. It integrates a PyQt5-based GUI and Audacity for efficient manual and semi-automated word-level annotation, followed by the development of machine learning models for classification. Convolutional Neural Networks (CNNs) and Random Forest classifiers were trained on three datasets with varying user and word-class distributions. The study compares their performance under different data regimes and analyzes how factors like silence trimming, class imbalance, and speaker variability impact model accuracy. While Random Forests perform better on small, low-class datasets, CNNs demonstrate scalability and are better suited for complex, high-class tasks.

18th July 2025 : Interference in Acoustic Waves: An acoustic-optical analogue to YDSE by Aditya Prakash

Talk summary:

  • This experiment aims to verify Young’s Double Slit Experiment (YDSE) using an acoustic and signal-processing approach. Instead of conventional slits, two speakers are positioned behind a thin reflective film to create controlled acoustic disturbances. A laser beam is directed onto the reflective film, and the reflected light is detected by a photoresistor. As the interference pattern forms on the reflective surface due to the acoustic modulation, variations in light intensity are captured over time by the photoresistor. The recorded signals are then subjected to digital signal processing techniques to extract interference characteristics. The presence of constructive and destructive interference in the data confirms the fundamental principles of wave interference as outlined by YDSE, thus validating the experimental setup and providing a novel method of verification.

18th July 2025 : Sensitivity analysis of sentence embedding vector due to position dependent insertion, deletion and substitution errors in Hindi language by Sarth Santosh Shah

Talk summary:

  • Sentence embeddings are critical for enabling semantic understanding across tasks such as translation, paraphrase detection, and cross-lingual information retrieval. Multilingual embedding models like LaBSE (Language-agnostic BERT Sentence Embedding) aim to project sentences from diverse languages—including low-resource ones like Hindi—into a shared semantic space. While LaBSE achieves strong performance on clean benchmark datasets, its robustness to real-world noise remains underexplored, particularly in morphologically rich and syntactically flexible languages. In this study, we systematically evaluate how LaBSE’s Hindi embeddings respond to various types of synthetic noise, including word insertions, deletions, substitutions, reordering, and character-level corruption. Through cosine similarity analysis and embedding-space visualizations, we quantify semantic drift under distortion and assess the model’s resilience. Our findings reveal both vulnerabilities and strengths, offering practical guidance for improving embedding robustness in noisy multilingual settings.
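
A sketch of the basic sensitivity measurement, assuming the sentence-transformers LaBSE checkpoint; the Hindi sentence and the deletion position are illustrative:

    from sentence_transformers import SentenceTransformer, util

    model = SentenceTransformer("sentence-transformers/LaBSE")

    clean = "मैं कल बाजार जाऊँगा"               # "I will go to the market tomorrow"
    words = clean.split()
    noisy = " ".join(words[:1] + words[2:])    # position-dependent word deletion

    emb = model.encode([clean, noisy], convert_to_tensor=True)
    drift = 1.0 - util.cos_sim(emb[0], emb[1]).item()
    print(f"semantic drift: {drift:.3f}")      # larger = more sensitive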

18th July 2025 : Sensitivity analysis of sentence embedding vector due to position dependent insertion, deletion and substitution errors in Telugu language by SNEHITH KUMAR MATTE

Talk summary:

  • Multilingual models like LaBSE map sentences from different languages—including low-resource ones like Telugu—into a shared meaning space. While effective on clean data, LaBSE’s robustness to noise is unclear. This study tests how Telugu embeddings respond to synthetic noise (e.g., word edits, character corruption). Using cosine similarity and visualizations, we measure semantic drift and highlight LaBSE’s strengths and weaknesses, offering insights to improve its performance in noisy settings.

18th July 2025 : ATB-Contour Based Speaker Segmentation by Guhan Balaji

Talk summary:

  • This study presents a frame-wise speaker classification framework using Air-Tissue Boundary (ATB) contours derived from real-time MRI videos of the vocal tract. Each video frame’s contours are treated as an independent sample, enabling the analysis to focus solely on static anatomical configurations. The approach aims to classify speakers based on distinctive morphological features encoded in these static contours and further identifies which local regions or segments in the contours hold maximal speaker-discriminative potential. This methodology highlights the utility of structural vocal tract cues, independent of temporal information, for robust biometric speaker identification and provides new insights into the anatomical correlates of speaker individuality.

18th July 2025 : Residual Based Adaptive Refinement for PINNs by Gautam Sivakumar

Talk summary:

  • Residual-based adaptive refinement (RAR) for PINNs iteratively adds collocation points in regions of highest PDE residual, concentrating training effort where the model performs worst. This simple, dynamic resampling sharply accelerates convergence and cuts the number of epochs and points needed compared with uniform sampling.
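
A minimal sketch of one RAR round in PyTorch; the model, the pde_residual function, and the pool sizes are placeholders for whatever PDE the PINN is solving:

    import torch

    def rar_step(model, pde_residual, x_train, n_pool=10000, n_add=100):
        """Append the n_add random candidates with the largest |PDE residual|."""
        x_pool = torch.rand(n_pool, x_train.shape[1], requires_grad=True)
        r = pde_residual(model, x_pool).abs().squeeze()  # autograd supplies u_x, u_xx, ...
        worst = torch.topk(r, n_add).indices             # hardest points
        return torch.cat([x_train, x_pool[worst].detach()], dim=0)

After each round, training continues on the enlarged collocation set, so effort concentrates where the residual is largest.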

11th July 2025 : A Generative Unsupervised Approach to Voice Activity Detection via Threshold Variance by Aditya Pandey

Talk summary:

  • This study proposes an unsupervised voice activity detection (VAD) framework based on a generative U‑Net that jointly optimizes reconstruction loss and a novel multi‑threshold dissimilarity loss. The latter leverages multiple weak VAD estimators—each defined by soft thresholding at mean ± τ·σ with τ∈{0.05,0.10,0.15}—and measures output divergence via polynomial XOR approximation. A U‑Net with 16 base features is trained for 100 epochs, decaying the reconstruction‑to‑dissimilarity weight α from 1 to 0.5, and ablation studies show that increasing model capacity and lowering α’s floor further tighten speech/non‑speech clustering. Embedding the pre‑trained latent space into a 50/50 semi‑supervised schedule yields consistent gains (Test ACC 96.6%, F1 97.0%) over both fully supervised (ACC 95.6%, F1 96.5%) and semi‑supervised baselines without custom loss.
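
A sketch of the weak estimators behind the multi-threshold dissimilarity loss as the summary describes them; the sigmoid soft threshold (upper thresholds only) and the sharpness beta are assumptions, while a + b - 2ab is the usual polynomial relaxation of XOR:

    import torch

    def weak_vad_masks(energy, taus=(0.05, 0.10, 0.15), beta=10.0):
        """energy: (T,) frame envelope; returns one soft VAD mask per tau."""
        mu, sigma = energy.mean(), energy.std()
        return [torch.sigmoid(beta * (energy - (mu + tau * sigma))) for tau in taus]

    def poly_xor(a, b):
        return a + b - 2 * a * b     # equals XOR on {0,1}, smooth in between

    energy = torch.rand(100)
    masks = weak_vad_masks(energy)
    dissim = sum(poly_xor(m1, m2).mean()
                 for i, m1 in enumerate(masks) for m2 in masks[i + 1:])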

11th July 2025 : Vowel consonant vowel sequence classification based on air tissue boundary segmentation by Kisalay Srivastav

Talk summary:

  • This work presents an automated system for classifying Vowel-Consonant-Vowel (VCV) speech sequences using features derived from Air-Tissue Boundary (ATB) segmentation in real-time Magnetic Resonance Imaging (rtMRI) data. The study utilizes the USC-TIMIT corpus, comprising rtMRI videos and synchronized audio from 75 speakers, focusing on 40 subjects for detailed analysis. Each VCV sequence is represented by seven frames, with anatomical landmarks and contours extracted per frame.

3rd July 2025 : Final Presentation of SRFP 25 by Marita Jimmy

Talk summary:

  • A summary of my internship at the SPIRE lab: an evaluation of the effect of silence on the intelligibility of speech using Short-Time Objective Intelligibility (STOI).

4th July 2025 : Accelerating RT-MRI Annotation Through Similarity-Based Contour Matching by Adarsh V V

Talk summary:

  • This talk presents a semi-automated method for segmenting real-time MRI (RT-MRI) data of speech production. The approach leverages structural similarity (SSIM) to identify visually similar annotated frames and transfers their anatomical contours to unannotated targets. To assess accuracy, Average Hausdorff Distance (AHD) is used to compare the predicted contours against ground truth, revealing a strong inverse correlation between SSIM and AHD across thousands of frame pairs. The method achieves consistently low contour error while reducing manual annotation time by over 80%, offering a scalable and interpretable framework for articulatory segmentation in RT-MRI studies.
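
A sketch of the matching step, assuming scikit-image; the annotated store and frame sizes are placeholders:

    import numpy as np
    from skimage.metrics import structural_similarity as ssim

    def transfer_contour(target, annotated):
        """Copy the contour of the annotated frame most SSIM-similar to target."""
        best = max(
            annotated,
            key=lambda i: ssim(target, annotated[i][0],
                               data_range=target.max() - target.min()),
        )
        return annotated[best][1]      # contour reused for the new frame

    annotated = {i: (np.random.rand(68, 68), np.random.rand(50, 2)) for i in range(5)}
    contour = transfer_contour(np.random.rand(68, 68), annotated)

The reported inverse correlation between SSIM and AHD is what justifies using SSIM as the selection criterion.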

13th June 2025 : Pitch Annotation using ToBI And AuToBI by Mrinal Naveen

Talk summary:

  • Pitch annotation helps analyze how speech sounds rise and fall, affecting meaning and emotion. ToBI is a system used to label these patterns, but it’s slow and needs expert input. AuToBI and its Python version AuToBI_py automate this process, making pitch and prosody annotation faster, scalable, and useful for speech analysis tasks like emotion detection and conversational AI.

30th May 2025 : Speech-Based ALS Classification Using CNN-LSTM, leveraging Wav2Vec2.0 Features by Deepthi S B

Talk summary:

  • This study explores further pretraining of the Wav2Vec 2.0 speech model and using the pretrained model as a feature extractor for a classification task to tell the difference between healthy voices and those affected by ALS. It uses large audio datasets recorded on different devices. The process includes preparing the audio, creating manifests, and training the model using Fairseq. To check how well the model works, it uses advanced techniques to extract features from the audio and a deep learning method (CNN-LSTM) to classify the voices. The goal is to help with reliable speech analysis and tracking of ALS progression.

28th March 2025 : LIMMITS’25: Multilingual Streaming TTS With Neural Codecs for Indian Languages by Jesuraj

Talk summary:

  • This work provides a summary of the Multilingual streaming TTS with neural codecs for Indian languages challenge (LIMMITS’25), organized as part of the ICASSP 2025 signal processing grand challenge. Towards this, 278 hours of TTS data in four Indian languages (Gujarati, Indian English, Bhojpuri, and Kannada) were released. The challenge focuses on advancing research in neural codec-based and streaming TTS systems. The top teams in the challenge attained high subjective scores on naturalness and similarity, thus contributing to the progress in text-to-speech generation systems.

28th March 2025 : Improving Dialect Identification in Indian Languages Using Multimodal Features from Dialect Informed ASR by Saurabh Kumar

Talk summary:

  • Dialect identification (DID) addresses the challenge of recognizing regional variations within a language. The current deep learning approaches focus on audio-only, text-only, or multi-task setups combining automatic speech recognition (ASR) with DID. This work introduces a novel multimodal architecture that leverages speech and text features to enhance DID performance. Our method integrates ASR-generated speech representations with text embeddings derived from ASR hypotheses using a RoBERTa-based encoder. Additionally, we perform a layer-wise analysis of the IndicWav2Vec model to identify the layers most effective for extracting dialectal features. We evaluate our approach on a subset of the RESPIN dataset featuring eight Indian languages and 33 dialects. Experimental results show that our proposed multimodal DID system achieves an average DID accuracy of 79.81%, consistently outperforming baseline methods. This study is the first to comprehensively analyse DID in Indian languages, providing new insights into their dialectal diversity.

28th March 2025 : Physics-Informed Neural Networks for Predicting Acoustic Pressure Inside Ducts by Akanksha Singh

Talk summary:

  • We present a novel demonstration of physics-informed neural networks (PINNs) for accurately predicting acoustic pressure fields in ducts without any reliance on ground-truth measurement data. By embedding the Helmholtz equation directly into a neural network training process, we showcase how knowledge of the physical properties of the duct—such as boundary conditions and material parameters—alone suffices to generate robust acoustic pressure predictions. Our method eliminates the need for experimentally recorded datasets, dramatically reducing costs and logistical complexity. In this demonstration, we train small-scale networks over just a few iterations for a range of frequencies. The results highlight the remarkable convergence and accuracy that can be achieved solely through the governing equation and domain specifications, even with minimal computational overhead. We further extend this approach to more complex duct geometries, including shapes that are varying in the length direction such as in vocal tracts. This enables direct modeling of acoustic wave propagation within anatomically relevant structures—a powerful tool for speech science and related audio processing applications. Attendees will be able to see how rapidly and effectively our PINN-based models adapt to different duct configurations, revealing the strong potential for future real-world uses. Our interactive session features step-by-step explanations of the training process, a display of the network configurations, and visualizations of the resulting pressure fields. Participants will gain firsthand insight into how a physics-based loss function can guide neural networks to learn the underlying physics without requiring extensive labeled data.

28th March 2025 : Role of the Pretraining and the Adaptation data sizes for low-resource real-time MRI video segmentation by Vinayaka Hegde

Talk summary:

  • Real-time Magnetic Resonance Imaging (rtMRI) is frequently used in speech production studies as it provides a complete view of the vocal tract during articulation. This study investigates the effectiveness of rtMRI in analyzing vocal tract movements by employing the SegNet and UNet models for Air-Tissue Boundary (ATB) segmentation tasks. We conducted pretraining of a few base models using increasing numbers of subjects and videos, to assess performance on two datasets. The first consists of unseen subjects with unseen videos from the same data source, on which the pretrained models perform 0.33% and 0.91% better than the matched condition (Pixel-wise Classification Accuracy (PCA) and Dice Coefficient, respectively). The second comprises unseen videos from a new data source, where we obtain 99.63% and 98.09% (PCA and Dice Coefficient, respectively) of the matched-condition performance. Here, matched-condition performance refers to the performance of a model trained only on the test subjects, which was set as a benchmark for the other models. Our findings highlight the significance of fine-tuning and adapting models with limited data. Notably, we demonstrated that effective model adaptation can be achieved with as few as 15 rtMRI frames from any new dataset.

20th December 2024 : Denoising Diffusion Probabilistic Models by Jesuraj Bandekar

Talk summary:

  • We present high quality image synthesis results using diffusion probabilistic models, a class of latent variable models inspired by considerations from nonequilibrium thermodynamics. Our best results are obtained by training on a weighted variational bound designed according to a novel connection between diffusion probabilistic models and denoising score matching with Langevin dynamics, and our models naturally admit a progressive lossy decompression scheme that can be interpreted as a generalization of autoregressive decoding. On the unconditional CIFAR10 dataset, we obtain an Inception score of 9.46 and a state-of-the-art FID score of 3.17. On 256x256 LSUN, we obtain sample quality similar to ProgressiveGAN.
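
For reference, the forward process and the simplified training objective from the paper:

    q(x_t \mid x_{t-1}) = \mathcal{N}\left(x_t;\ \sqrt{1-\beta_t}\,x_{t-1},\ \beta_t I\right),
    \qquad
    \bar\alpha_t = \prod_{s=1}^{t} (1-\beta_s)

    L_{\mathrm{simple}} = \mathbb{E}_{t,\,x_0,\,\epsilon}
    \left[\left\| \epsilon - \epsilon_\theta\left(\sqrt{\bar\alpha_t}\,x_0
          + \sqrt{1-\bar\alpha_t}\,\epsilon,\ t\right)\right\|^2\right]

Training thus reduces to predicting the noise added at a random step t; sampling runs the learned reverse process from pure noise.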

13th December 2024 : Automatic Assessment of Speech Quality by Amartyaveer

Talk summary:

  • Automatic Assessment of Speech Quality (AASQ) is a critical task in evaluating the intelligibility and overall quality of speech signals, especially in noisy environments or after processing through various systems. Traditional methods often rely on subjective human ratings, which are time-consuming and not scalable. Therefore, objective assessment methods, which predict speech quality automatically without requiring human input, are becoming popular. However, intrusive methods still require clean reference signals, often unavailable in real-world scenarios. In contrast, non-intrusive methods leverage neural networks to predict speech quality without using clean reference signals. These methods aim to provide consistent, reliable, and real-time assessments. We explore various non-intrusive AASQ methods and present our method along with its performance on both seen and unseen data in different noisy conditions.

6th December 2024 : Intro to Bottleneck adapters and LoRA by Murali

Talk summary:

  • Finetuning ASR models from pretrained checkpoints has been effective in cases where substantial data is available. Are there other techniques that use fewer resources while giving similarly performing models? Many types of adapters offer a workaround. This talk is a very brief introduction to two types of adapters.
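
A minimal LoRA-style adapter around a frozen linear layer; the rank and scaling are illustrative choices, not a prescribed recipe:

    import torch
    import torch.nn as nn

    class LoRALinear(nn.Module):
        def __init__(self, base: nn.Linear, r: int = 8, alpha: float = 16.0):
            super().__init__()
            self.base = base
            for p in self.base.parameters():
                p.requires_grad_(False)                  # freeze pretrained weights
            self.A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
            self.B = nn.Parameter(torch.zeros(base.out_features, r))  # no-op at init
            self.scale = alpha / r

        def forward(self, x):
            # frozen path plus low-rank update: W x + scale * (B A) x
            return self.base(x) + self.scale * (x @ self.A.T @ self.B.T)

    layer = LoRALinear(nn.Linear(512, 512))
    out = layer(torch.randn(4, 512))    # only A and B receive gradients

Only r * (d_in + d_out) extra parameters are trained per layer, which is the resource saving such adapters are after.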

29th November 2024 : Applying Physics-Informed Neural Networks to Duct Acoustics by Akanksha Singh

Talk summary:

  • Recent advances in neural network architectures have led to significant breakthroughs across various domains, one of which includes physical modeling using Physics-Informed Neural Networks (PINNs). This presentation delves into the application of PINNs to solve the complex problem of sound wave propagation in uniform ducts. Unlike traditional neural networks that learn solely from data, PINNs incorporate physical laws, in this case, the Helmholtz equation, as part of their loss function, thus adhering to established physical principles. The talk will outline the problem statement of duct acoustics, elucidate the types of boundary conditions relevant to this physical system, and discuss their significance. It will introduce two solutions. First approach using the λ-method (Lagrange Multiplier Method) to enforce boundary conditions, discuss its limitations, and present the second approach called Trial Solution method that seeks to overcome these limitations.
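
A sketch of the trial-solution idea for the 1-D Helmholtz problem, in the style of Lagaris et al.; the talk's exact construction may differ. With Dirichlet conditions p(0) = p0 and p(L) = pL, the network output N(x; θ) is wrapped in a form that satisfies the boundary conditions identically, so no boundary-loss weighting (as in the λ-method) is needed:

    \frac{d^2 p}{dx^2} + k^2 p = 0, \qquad x \in [0, L]

    \hat{p}(x) = p_0\,\frac{L-x}{L} + p_L\,\frac{x}{L} + x\,(L-x)\,N(x;\theta)

The loss then contains only the PDE residual at collocation points, which is what makes training more stable than penalty-based boundary enforcement.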

15th November 2024 : Dialect Identification in Indian Languages by Saurabh Kumar

Talk summary:

  • Dialect identification (DID) addresses the challenge of recognizing regional variations within a language. The current deep learning approaches focus on audio-only, text-only, or multi-task setups combining automatic speech recognition (ASR) with DID. This work aims to utilize multimodal architecture that leverages speech and text features to enhance DID performance.

20th May 2004 : Gumbel-Softmax Trick by Atharva Jeevannavar

Talk summary:

  • The Gumbel-Softmax Trick is a technique used in machine learning to approximate discrete categorical samples in a differentiable manner, which enables the use of gradient-based optimization methods, like backpropagation. This technique is particularly useful when dealing with categorical data that is inherently non-differentiable, such as in natural language processing or reinforcement learning tasks. By introducing randomness into the logits through Gumbel noise and adjusting the distribution using a temperature parameter, the Gumbel-Softmax trick allows models to generate diverse predictions while maintaining the differentiability required for training neural networks. It addresses challenges in working with discrete variables by reparameterizing them, making it a crucial tool for optimization in models with categorical variables. The Gumbel-Softmax technique has demonstrated effectiveness in tasks such as generative modeling using variational autoencoders (VAEs) and semi-supervised classification on datasets like MNIST. It also improves performance in environments with discrete action spaces, such as reinforcement learning.
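
The trick straight from its definition, in PyTorch (the library also ships torch.nn.functional.gumbel_softmax); the temperature value is illustrative:

    import torch
    import torch.nn.functional as F

    def gumbel_softmax_sample(logits, tau=0.5):
        """Differentiable approximation to sampling from Categorical(logits)."""
        u = torch.rand_like(logits)
        g = -torch.log(-torch.log(u + 1e-20) + 1e-20)   # Gumbel(0, 1) noise
        return F.softmax((logits + g) / tau, dim=-1)    # approaches one-hot as tau -> 0

    logits = torch.tensor([1.0, 0.5, -1.0], requires_grad=True)
    y = gumbel_softmax_sample(logits)
    y[0].backward()     # gradients flow through the relaxed sample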

9th August 2024 : Website development of English Gyaani Audio Fetching Dashboard by Darsh Kumar

Talk summary:

  • During my internship at IISc Bangalore, I contributed to the English-Gyaani project by developing an audio fetching dashboard using React.js and Firebase. I enhanced my skills in web development, backend integration, and communication, overcoming challenges with Firebase limits and React.js state management through self-learning and online resources.

26th July 2024 : Quantification of co-articulation for alveolar consonant production in VCV using rtMRI data by SHIEKH MAHAMMAD ARZU

Talk summary:

  • In this work, annotations are carried out for t and d (alveolar consonants) paired with the vowels a, e and u. The Euclidean distance is estimated from the point of contact (the point where the tongue touches the alveolar ridge) in the consonant frame with respect to the point of contact in the remaining frames. Two fixed points, the tip of the nose and the projection of the velum, are considered for normalizing the contours. The results undergo statistical tests such as the t-test, and it is observed that there is no significant difference between the -3 and +3 frames, and similarly for the -2 and +2 and the -1 and +1 frames. The distance from +1 to +3 is significantly increasing, and similarly for the -1 to -3 frames. Also, the last two frames (+2 and +3) after the consonant frame are statistically the same.

12th July 2024 : Evaluating Machine Translation Models for English-Hindi Language Pairs: A Comparative Analysis by Ahan P Shetty

Talk summary:

  • Machine translation has become a critical tool in bridging linguistic gaps, especially between languages as diverse as English and Hindi. This project comprehensively evaluates various machine translation models for translating between English and Hindi. We assess the performance of these models using a diverse set of automatic evaluation metrics, both lexical and machine learning-based metrics. Our evaluation leverages an English-Hindi parallel corpus of over 18,000 entries and a custom FAQ dataset comprising questions from government websites. The study aims to provide insights into the effectiveness of different machine translation approaches in handling both general and specialized language domains. Results indicate varying performance levels across different metrics, highlighting strengths and areas for improvement in current translation systems.

21st June 2024 : Pause and Speech Pattern Analysis in ALS and Parkinson Disease Affected Patients by Nisha Johnson

Talk summary:

  • This study is significant as it addresses a critical gap in the objective analysis of speech and pause characteristics among normal individuals, ALS patients, and Parkinson’s patients. By identifying distinct patterns in pause segments and speech segments, the research has the potential to revolutionize diagnostic and monitoring practices for these neurodegenerative diseases. The findings will contribute to the development of precise, non-invasive tools for early detection and tracking disease progression. Moreover, the study’s evaluation of model accuracy in classifying different groups and determination of pause ratios will enhance our understanding of speech biomarkers, leading to improved clinical interventions and better patient outcomes.

2nd February 2024 : Correction of both APA, AKA files and measurement of the DTW between each of the 74 contours by Lalit Singh Kharayat

Talk summary:

  • Speech Recognition in Agriculture and Finance for the Poor in India

12th January 2024 : My experience as an intern by A Shanmukha Priya

Talk summary:

  • Abstract: Dive into the dynamic realm of AI and ML, witnessing their impact on Automatic Speech Recognition (ASR) and Natural Language Processing (NLP). In my role at Project Vaani, supported by Google, we address the linguistic tapestry of India’s 780 languages. I share the challenges faced in categorizing 6300 Hindi transcribed files, from API constraints to a Python-based strategy. Leveraging models like IndicBert and the universal sentence encoder, discover our exploration of sentence embeddings and clustering techniques.

29th December 2023 : Speech to Speech Translation - An Overview by Saurabh Kumar

Talk summary:

  • In a country like India, where a majority of languages exist primarily in spoken forms, the development of reliable Speech-to-Speech Translation (S2ST) systems becomes crucial. This presentation seeks to offer a concise overview of the latest advancements in S2ST systems and highlight the significant challenges associated with their development for Indian languages.

15th December 2023 : Achieving stable convergence of neural networks for estimating acoustic field in uniform ducts by Ashwin A Raikar

Talk summary:

  • The ability of neural networks as function approximators can be exploited to solve several governing differential equations. In this work, the 1-D Helmholtz equation is solved to predict the acoustic pressure in a uniform duct. Solving the Helmholtz equation across a range of frequencies, especially at higher frequencies, is challenging as the loss function destabilizes the training process, thereby preventing it from converging to the true solution with the desired accuracy. To overcome this issue, a dynamic learning rate technique is proposed that helps to stabilize the training process and improve the overall accuracy of the network. The efficiency of the method is demonstrated by comparing the results with a static learning rate method and the analytical solutions. A good agreement is observed between the predicted solution with dynamic learning rate and the analytical solution up to 2000 Hz. Without the dynamic learning rate, the relative errors are observed to be 2% and 58% at 500 and 2000 Hz, respectively, whereas they reduce to 0.6% and 0.1%, respectively, with the dynamic learning rate at the same frequencies. The proposed dynamic learning rate method is found to be effective for different types of boundary conditions.

6th October 2023 : Human and Face Detection in Images by Mauli Mehulkumar Patel

Talk summary:

  • Many of the images available on the internet today contain people and various other details that can be exploited. The motivation behind detecting recognizable humans and faces in images is to exclude any personal details, making the images open-source compatible.

29th September 2023 : EXPLORING SYLLABLE DISCRIMINABILITY WITH INCREASING DYSARTHRIA SEVERITY FOR PATIENTS WITH AMYOTROPHIC LATERAL SCLEROSIS by Neelesh Samptur and Anirudh Chakravarty

Talk summary:

  • Dysarthria due to Amyotrophic Lateral Sclerosis (ALS) progressively impairs speech production compromising the discriminability among different speech sounds. The form and extent of the compromise vary with the dysarthria severity level. Though the changes in the discriminability among different vowels and fricatives with increasing severity have been studied in the literature, the effects on syllables are yet unexplored. In this work, we perform manual and automatic classification of three syllables - /pa/, /ta/ and /ka/, at varied severity levels. Manual classification is performed through listening tests, whereas, spectral features and self-supervised speech representations obtained from pretrained models are explored along with different neural network classifiers for the purpose of automatic classification. Experiments with 100 ALS patients and 35 healthy subjects indicate that, though both manual and automatic classification accuracies decline with increasing severity, automatic methods significantly outperform humans at all severity levels achieving 5.56% and 24.45% higher classification accuracies than humans on utterances from healthy subjects and the most severe patients, respectively. This might indicate that though it becomes perceptually difficult to differentiate the syllables with increasing dysarthria severity, discriminative acoustic cues persist in the utterances which data-driven methods can capture.

22nd September 2023 : Speaker verification for ALS patients by Akash Kaushik

Talk summary:

  • Speaker verification for ALS patients using PLDA scoring and x-vector embeddings.

8th September 2023 : Spontaneous Indian English Speech Corpus by Charu Samir Shah

Talk summary:

  • This presentation discusses the work I have done on the corpus paper, which has been submitted to COCOSDA.

7th July 2023 : Estimating articulatory movements in speech production using neural networks by Kirann Mahendran

Talk summary:

  • Estimating articulatory movements from speech acoustic features is known as acoustic-to-articulatory inversion (AAI). A large amount of parallel data of speech and articulatory motion is required for training an AAI model. Electromagnetic articulograph (EMA) is a promising technology for recording such parallel data: it measures the positions and movements of the articulators, such as the tongue, lips, and jaw, during speech production. EMA data can provide more accurate and detailed information about the articulatory movements than acoustic features alone.

7th April 2023 : Beam Search by Amartyaveer

Talk summary:

  • Beam Search is a greedy algorithm that explores a graph by expanding the most promising nodes within a limited set (beam size B). It always expands the B best nodes at each level, progressing level by level and moving downward only from those nodes. It is generally used in encoder-decoder models, such as machine translation, to find the most probable sequence.
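
A toy implementation of the level-by-level expansion just described; step_log_probs stands in for a decoder that scores the next token given a prefix:

    def beam_search(step_log_probs, vocab, steps, B=3):
        beams = [([], 0.0)]                      # (sequence, cumulative log-prob)
        for _ in range(steps):
            candidates = [
                (seq + [w], score + step_log_probs(seq, w))
                for seq, score in beams
                for w in vocab
            ]
            # keep only the B best partial sequences at each level
            beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:B]
        return beams

    # toy scorer that prefers consecutive letters
    scorer = lambda seq, w: -abs(ord(w) - ord(seq[-1]) - 1) if seq else 0.0
    print(beam_search(scorer, vocab="abc", steps=3, B=2))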

7th April 2023 : Language Modelling by Shreya Som

Talk summary:

  • A language model learns the probability of word occurrence from a text dataset. This project involves preprocessing data from many webpages and cleaning it accordingly. An NLTK-based n-gram model is then trained with smoothing. Finally, how well the model fits the test data is estimated by computing perplexity.
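
A sketch of that NLTK pipeline: padded bigrams, add-one (Laplace) smoothing, and perplexity on held-out text; the toy corpus is illustrative:

    from nltk.lm import Laplace
    from nltk.lm.preprocessing import padded_everygram_pipeline, pad_both_ends
    from nltk.util import bigrams

    train_sents = [["the", "cat", "sat"], ["the", "dog", "sat"]]
    train, vocab = padded_everygram_pipeline(2, train_sents)

    lm = Laplace(2)                    # bigram model with add-one smoothing
    lm.fit(train, vocab)

    test = list(bigrams(pad_both_ends(["the", "cat", "sat"], n=2)))
    print(lm.perplexity(test))         # lower = better fit on held-out text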

24th March 2023 : Internship Completion Talk by Pranayak Uniyal

Talk summary:

  • Modelling wave equations in Python: the individual has demonstrated proficiency in reading and understanding MATLAB code, summarizing research articles, and implementing MATLAB code in Python. They also possess knowledge of the multilayer perceptron architecture and can create a function model that computes the outputs of the deep learning model. Analytical problem-solving: the individual has demonstrated the ability to solve second-order differential equations using complementary and particular solutions and understand how to find analytical solutions to problems involving nonlinear partial differential equations. Coding and plotting solutions: the individual can implement analytical solutions of ODE equations using Python, use the bvp Python library function to solve boundary value problems, and use the matplotlib library for data representation and plotting the solutions of wave equations. Understanding COMSOL Physics View software: the individual has demonstrated proficiency in using the software to solve physics-based problems, including the absorptive muffler problem. They also understand how to apply physics-based solutions to real-world engineering problems

24th February 2023 : Decoding graph construction in Kaldi by Saurabh Kumar

Talk summary:

  • In this session, we will briefly discuss decoding graph creation in Kaldi.

9th December 2022 : An efficient method to solve transcendental equations by Veerababu Dharanalakota

Talk summary:

  • A transcendental function is defined as a function that is not algebraic and cannot be expressed in terms of a finite sequence of algebraic operations. You will encounter transcendental equations in many engineering problems. The best example of a transcendental equation is sin(x) = x. There are no standard methods to solve equations of this category. Most of the researchers use iterative methods like Bisection, Newton Raphson, etc. In this talk we are going to see an alternate and efficient method to solve transcendental equations.
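
For context, one of the standard iterative baselines the talk contrasts with, applied to x = cos(x) (a transcendental equation with a non-trivial root); the talk's own alternative method is not reproduced here:

    import math

    def bisect(f, lo, hi, tol=1e-10):
        """Bisection: halve a sign-changing bracket until it is tol wide."""
        while hi - lo > tol:
            mid = 0.5 * (lo + hi)
            if f(lo) * f(mid) <= 0:
                hi = mid               # root lies in [lo, mid]
            else:
                lo = mid               # root lies in [mid, hi]
        return 0.5 * (lo + hi)

    print(bisect(lambda x: x - math.cos(x), 0.0, 1.0))   # 0.7390851332...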

25th November 2022 : Attention in End to end speech recognition systems by Sathvik

Talk summary:

  • Attention was first used for end-to-end speech recognition in 2015, and since then it has been a prominent formulation for learning alignments. I will first introduce the task of speech recognition and then go over how attention is incorporated.

18th November 2022 : Detection of Overlapped Speech by V Kartikeya

Talk summary:

  • Automatic Speech Recognition systems, Speaker Diarization systems, etc., do not respond well when more than one speaker is speaking at the same time. In this project, a neural network is built to classify overlapped and non-overlapped speech segments. The dataset was prepared by mixing single-speaker audio that had been separated based on labels. MFCCs were extracted for frames of different lengths and input to the model. A CNN model was trained on the MFCCs with accuracy and precision as metrics, as sketched below.
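
A hedged sketch of the described pipeline, with MFCC extraction via librosa and a small PyTorch CNN; the filename, shapes, and layer sizes are illustrative assumptions rather than the project's exact configuration:

```python
import librosa
import torch
import torch.nn as nn

y, sr = librosa.load("mixture.wav", sr=16000)        # hypothetical mixed audio
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)   # shape (13, n_frames)

class OverlapCNN(nn.Module):
    """Binary classifier: overlapped vs. non-overlapped speech."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(32, 2))
    def forward(self, x):
        return self.net(x)

model = OverlapCNN()
x = torch.tensor(mfcc).unsqueeze(0).unsqueeze(0).float()   # (1, 1, 13, T)
logits = model(x)
```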

28th October 2022 : Speech based age classification by MAREDDY SAI KRISHNA REDDY

Talk summary:

  • Using librosa, the train and test audio files were converted into mel-spectrogram images. These images were then given to a ResNet-34 neural network; the feature-extraction step is sketched below.
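
A minimal sketch of the mel-spectrogram step with librosa (assuming 16 kHz mono files; the ResNet-34 training itself is omitted):

```python
import librosa
import numpy as np

y, sr = librosa.load("speech.wav", sr=16000)              # hypothetical file
mel = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=128)
mel_db = librosa.power_to_db(mel, ref=np.max)             # log-scaled "image"
```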

16th September 2022 : Multilingual Sentence Representation & its applications by Abhayjeet Singh

Talk summary:

  • Discussing the need for and applications of multilingual sentence representations, and the techniques and learning methods used to learn these language-independent sentence representations.

9th September 2022 : Sequence to Sequence Model by POOJA V H

Talk summary:

  • The growing digitization of information storage and processing in our daily lives has increased the demand for digitization in many areas, including investigative processes. Over the past years, machine translation, text summarization, and image captioning have become topics of research, with various Natural Language Processing (NLP) techniques enabling researchers to generate efficient results for a wide spectrum of documents. A Seq2Seq architecture with RNNs is used to perform the aforementioned machine translation, text summarization, and similar tasks. Sequence-to-sequence models are best suited for tasks that revolve around generating new sentences from a given input, such as summarization, translation, or generative question answering.

26th August 2022 : Mock presentation for INTERSPEECH by Sathvik

Talk summary:

  • A 15-minute presentation of our accepted INTERSPEECH 2022 paper, "Streaming model for Acoustic to Articulatory Inversion with transformer networks".

12th August 2022 : Working for English Gyani Project during Summer Internship by Chinmoi Das

Talk summary:

  • The English Gyani App is designed to help a user learn English grammar and pronunciation. To teach grammar, lessons for various English topics are developed with the help of a teacher. This large body of content, corresponding to various topics, is created in Google Docs and is to be incorporated into the app backend. Since the format and structure of the lessons in the docs are free-flowing and do not always adhere to a particular structure, a consistent format is needed across all topics. The content is therefore formatted into 4 fixed types in Excel sheets, which enables convenient conversion of the grammar content into JSON format and ultimately into the backend of the app. This talk is a brief overview of the process involved and the things learnt during it.

5th August 2022 : Neurolinguistics: Aphasia: The Growing Understanding by Sharmistha Chakrabarti

29th July 2022 : Speaker Diarization by Kushagra Parmeshwar

Talk summary:

  • Speaker diarization is the process of splitting an audio stream or clip into a number of homogeneous segments, governed by the number of people involved in the conversation. In a nutshell, it answers the question "who spoke when?". We have tried to implement a simple diarizer that would help us avoid manual transcription of conversations involving two people, conducted in many different languages spoken nationwide. A typical diarizer usually comprises the following steps: 1) voice activity detection, 2) speech segmentation, 3) feature extraction, 4) clustering followed by labeling (ideally, the number of clusters should equal the number of speakers), and 5) finally, transcribing the conversation. The diarizer we implemented uses various modules that have already been built by others; a large part of it relies on the speechbrain library. A small amount of code has been added to evaluate the diarization error rate of the built system, given the manual transcription in a CSV file.
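
As a toy illustration of the evaluation step, a frame-level diarization error can be computed as below; a real diarization error rate additionally handles optimal speaker mapping, overlaps, and collars:

```python
import numpy as np

# hypothetical per-frame speaker labels for a two-speaker conversation
reference = np.array([0, 0, 0, 1, 1, 1, 0, 0])    # manual annotation
hypothesis = np.array([0, 0, 1, 1, 1, 0, 0, 0])   # diarizer output

error_rate = np.mean(reference != hypothesis)     # fraction of misassigned frames
print(f"frame-level error: {error_rate:.2%}")
```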

22nd July 2022 : Application of Digital Games for Speech Therapy in Children: A Systematic Review of Features and Challenges by Betty Kurian

Talk summary:

  • Treatment of speech disorders during childhood is essential. Many technologies can help speech and language pathologists (SLPs) to practice speech skills, one of which is digital games. This study aimed to systematically investigate the games developed to treat speech disorders in children and their challenges. Articles in which a digital game was developed to treat speech disorders in children were included in the study, and the features of the designed games and their challenges were extracted from these studies. In these articles, 59.25% of the games had been developed in the English language, and children with hearing impairments had received much more attention from researchers than other patients. Also, the Mel-Frequency Cepstral Coefficients (MFCC) algorithm and the PocketSphinx speech recognition engine had been used more than any other speech recognition algorithm or tool. The evaluation of the games showed a positive effect on children's satisfaction, motivation, and attention during speech therapy exercises. The biggest barriers and challenges mentioned in the studies included a sense of frustration and low self-esteem after several failures in playing the games, environmental noise, mismatch between game levels and the target group's needs, and problems related to speech recognition. The results of this study showed that the games positively affect children's motivation to continue speech therapy and can also be used as aids for SLPs. Before designing these tools, the obstacles and challenges should be considered and solutions suggested.

15th July 2022 : Age classification of ALS/PD patients from their dysarthric speech by Stuti Prasad

Talk summary:

  • ALS and PD are neurologically degenerative diseases whose study offers an opportunity to find better ways to safely detect, treat, or prevent ALS/PD, and therefore hope for individuals now and in the future. This work focuses on determining a patient's age based on their speech: since the disease leads to locomotive issues, the resulting dysarthric speech exhibits distinctive features to be studied. The use of i-vectors along with an advanced PLDA has produced results with an accuracy of 74.63%. As future scope, by quantifying the extent of damage to the voice box, a tool can be developed to help people with severely damaged or absent voice boxes by converting their whisper-like speech into normal-sounding speech.

8th July 2022 : Hindi Language Modelling by Ayush Raj

Talk summary:

  • Next-word prediction for the Hindi language: a presentation of the work done, from dataset preparation and preprocessing to modelling.

1st July 2022 : Automatic detection of mismatched audio/text pair in Marathi by Nancy Meshram

Talk summary:

  • The purpose of this project was to detect mismatches between audio/text pairs. A 1D CNN model was used to classify whether each audio/text pair is matched or not.

24th June 2022 : Error Correction of Hindi OCR by Atul Raj

Talk summary:

  • Optical Character Recognition (OCR) is the process of translating paper-based books into digital e-books. Output from OCR systems is erroneous and inaccurate, as they produce misspellings in the recognized text, especially when the source document is of low printing quality. Our problem is bifurcated into two major issues: i) error detection and ii) error correction. Our focus language for this project is Hindi. The majority of Indian scripts are composed in two dimensions, which distinguishes them from English script; as a result, the methods developed for Roman scripts do not apply directly to Indian scripts. We face two major problems: poor image quality, and the inability of the OCR to extract the correct text due to the highly inflectional character of the language. We used a lookup dictionary for error detection and BERT for error correction, as sketched below.
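
A hedged sketch of BERT-based correction via masked-language-model scoring, using the Hugging Face fill-mask pipeline; the model name is an assumption for illustration, not necessarily the one used in the project:

```python
from transformers import pipeline

fill = pipeline("fill-mask", model="bert-base-multilingual-cased")

# A word flagged by the lookup dictionary is replaced with the mask token,
# and BERT proposes in-vocabulary corrections ranked by score.
sentence = "मैं किताब [MASK] हूँ"          # hypothetical OCR line with one flagged word
for cand in fill(sentence)[:3]:
    print(cand["token_str"], cand["score"])
```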

17th June 2022 : Analysis of Pause Pattern in Kannada and Bengali Subjects having Amyotrophic Lateral Sclerosis or Parkinson’s Disease by Tanuka Bhattacharjee

Talk summary:

  • We analyze the differences in the speaking and pausing pattern among Amyotrophic Lateral Sclerosis (ALS) patients, Parkinson’s disease (PD) patients and healthy subjects during spontaneous speech and image description utterances in native Kannada and Bengali languages. To the best of our knowledge, no such analysis has yet been conducted in the Indian language scenario. Moreover, ALS and PD speech-pause pattern comparison has not been done in any language whatsoever. Another contribution of this work lies in inter-language comparison of the speech-pause pattern for the three subject groups. All these analyses consider cognitive, prosodic and breath pauses. Particularly, four features are studied - pause time fraction, transition frequency, pause duration and speech segment duration.
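
For illustration, the four features named above can be computed from voice-activity segments roughly as follows (toy segments and simple definitions, assumed rather than taken from the paper):

```python
# hypothetical (start, end) speech segments in seconds for one utterance
speech = [(0.0, 1.2), (1.8, 3.0), (3.5, 5.0)]
total = speech[-1][1] - speech[0][0]

pauses = [b[0] - a[1] for a, b in zip(speech, speech[1:])]
pause_time_fraction = sum(pauses) / total
transition_frequency = len(pauses) / total           # speech-to-pause switches per second
mean_pause_duration = sum(pauses) / len(pauses)
mean_speech_duration = sum(e - s for s, e in speech) / len(speech)
print(pause_time_fraction, transition_frequency, mean_pause_duration, mean_speech_duration)
```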

10th June 2022 : Blind word error rate prediction using ASR features by Kunal Sah

Talk summary:

  • An automatic speech recognition (ASR) system decodes any speech signal, and measuring the quality of ASR systems is critical for creating user-satisfying voice-driven applications. Word Error Rate (WER) has traditionally been used to evaluate ASR quality; it also gives an idea of the correctness of the text predicted by an ASR system (the higher the WER, the lower the confidence that the predicted text is correct). Computation of WER requires the target transcription; hence, for unlabeled data, WER cannot be computed directly. Our goal in this project is to predict audio samples' WER without their target transcription. Using a pre-trained ASR system, different features of the audio samples can be computed (for example, the N-best hypotheses, the word-level confidences of the most probable hypothesis, or the posterior distribution of phonemes); these features can then be further processed to get a close approximation of the WER. We used the Root Mean Square Error (RMSE) between actual and predicted WER as the loss function, and the correlation coefficient (CC) and Mean Absolute Error (MAE) between predicted and actual WER as evaluation metrics for the model's performance. We worked with three different ASR systems trained on clean, noisy, and reverberant speech data, and two ASR features, namely the N-best hypotheses and the word-level confidence of the most probable hypothesis, to predict the WER using different deep learning methods.
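
A hedged sketch of the evaluation side described above: ground-truth WER via the jiwer package, and RMSE, MAE, and the correlation coefficient via numpy (toy numbers, not the project's data):

```python
import numpy as np
import jiwer

print(jiwer.wer("the cat sat on the mat", "the cat sit on mat"))  # ground-truth WER

actual = np.array([0.10, 0.25, 0.40])       # toy per-utterance WERs
predicted = np.array([0.12, 0.20, 0.45])    # model output
rmse = np.sqrt(np.mean((actual - predicted) ** 2))
mae = np.mean(np.abs(actual - predicted))
cc = np.corrcoef(actual, predicted)[0, 1]
print(rmse, mae, cc)
```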

3rd June 2022 : Selection of acoustically similar sentences based on phone error rate in the context of ASR by Saurabh Kumar

Talk summary:

  • For many languages, state-of-the-art ASR systems are reported to perform poorly due to the lack of acoustically and phonetically rich speech data available for system building. Even for resource-rich languages such as English, little effort has been made to find an efficient method for selecting training data similar to the testing conditions. Instead, state-of-the-art ASR systems are data hungry and require lots of speech data for training. Therefore, data selection plays a crucial role in the development of robust and computationally efficient ASR systems. In the last few years, several methods have been reported that ensure both acoustic and phonetic richness of the speech data. In this study, several recently reported data selection methods have been explored and efforts have been made to improve them.

27th May 2022 : WER–BERT: Automatic WER Estimation with BERT in a Balanced Ordinal Classification Paradigm by Abhishek Kumar

Talk summary:

  • Automatic Speech Recognition (ASR) systems are evaluated using Word Error Rate (WER), which is calculated by comparing the number of errors between the ground truth and the transcription of the ASR system. This calculation, however, requires manual transcription of the speech signal to obtain the ground truth. Since transcribing audio signals is a costly process, Automatic WER Evaluation (e-WER) methods have been developed to automatically predict the WER of a speech system by only relying on the transcription and the speech signal features. While WER is a continuous variable, previous works have shown that positing e-WER as a classification problem is more effective than regression. However, while converting to a classification setting, these approaches suffer from heavy class imbalance. In this paper, we propose a new balanced paradigm for e-WER in a classification setting. Within this paradigm, we also propose WER-BERT, a BERT based architecture with speech features for e-WER. Furthermore, we introduce a distance loss function to tackle the ordinal nature of e-WER classification. The proposed approach and paradigm are evaluated on the Librispeech dataset and a commercial (black box) ASR system, Google Cloud’s Speech-to-Text API. The results and experiments demonstrate that WER-BERT establishes a new state-of-the-art in automatic WER estimation.

20th May 2022 : Unsupervised representation learning for speaker verification by Prajesh Rana

Talk summary:

  • The objective of speaker verification is the authentication of a claimed identity from measurements of the voice signal. For speaker verification I am exploring contrastive-loss-based self-supervised learning (SSL). My work consists of two parts: in the first part I trained the self-supervised model, and in the second part I use the pretrained model as a feature extractor and train a PLDA backend. I will compare my results with the TDNN-PLDA and TDNN-ECAPA algorithms.

13th May 2022 : Extracting features using Self Supervised learning using ASR by Abhishek

Talk summary:

  • Automatic Speech Recognition, or ASR for short, is a technique for transcribing speech; in simple terms, ASR is also called speech-to-text conversion. Here we aim to learn features that not only utilize data but are also robust to noise. To support this argument for our target feature, I have evaluated the performance of MFCC and F-bank features against features learnt using wav2vec, a self-supervised representation learning method, for ASR. The metrics used for comparison are PER and CER.

6th May 2022 : A Stage Match For Query-By-Example Spoken Term Detection Based On Structure Information Of Query by Deekshitha G

Talk summary:

  • State-of-the-art query-by-example spoken term detection (QbE-STD) strategies are usually based on segmental dynamic time warping (S-DTW). However, the sliding window in S-DTW may split the signal of a word into different segments and produce many illegal candidates that must be compared with the query, which significantly reduces the accuracy and efficiency of detection. This paper proposes a stage-match strategy based on the structure information of the query, represented by the unvoiced-voiced attributes of its portions. The strategy first locates potential candidates in the utterances with a similar structure to the query, and then matches the query with Type-Location DTW (TL-DTW), a modified DTW with constraints on the pronunciation types and relative positions of paired frames in the voiced sub-segments. Experiments on the AISHELL-1 corpus showed that the proposed approach achieved a relative improvement over S-DTW and sped up retrieval.
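
For reference, the classic DTW distance that both S-DTW and the TL-DTW variant constrain can be sketched in a few lines of numpy:

```python
import numpy as np

def dtw(x, y):
    """Dynamic-programming DTW distance between two feature sequences."""
    n, m = len(x), len(y)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = np.linalg.norm(x[i - 1] - y[j - 1])     # local frame distance
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[n, m]

query = np.random.randn(20, 13)      # toy MFCC-like sequences
segment = np.random.randn(25, 13)
print(dtw(query, segment))
```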

29th April 2022 : Paper review by Sathvik Udupa

Talk summary:

  • 1. Understanding the Role of Self Attention for Efficient Speech Recognition: Transformer neural networks are increasingly used in automatic speech recognition (ASR). This work investigates the inner workings of such networks in ASR and introduces techniques to reduce recognition latency. 2. Chunked Autoregressive GAN for Conditional Waveform Synthesis: Generative adversarial network (GAN) based neural vocoders have performed well in speech synthesis in recent years. The authors show that these networks are unable to generate accurate pitch and periodicity, and introduce an autoregressive GAN-based vocoder to tackle these issues.

15th April 2022 : Broadcasted Residual Learning for Efficient Keyword Spotting by Siddarth

Talk summary:

  • We present a broadcasted residual learning method that achieves high accuracy with small model size and computational load. Our method configures most of the residual functions as 1D temporal convolutions while still allowing 2D convolution, using a broadcasted residual connection that expands the temporal output to the frequency-temporal dimension.

25th March 2022 : Wave Equation and Fundamentals by Veerababu Dharanalakota

Talk summary:

  • It is necessary to know the mathematical description of speech signals (sound waves), not just because the field of study exists, but because such a description has the potential to mimic reality under certain conditions. Doing so requires the derivation of the wave equation and its underlying assumptions. This talk covers the derivation of the wave equation from the fundamental equations of fluid dynamics: the continuity, momentum, and energy equations, which in turn are derived from natural laws. Further, the talk covers the general terminology used in the study of sound.
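
For reference, the end point of such a derivation in standard acoustics (a textbook sketch, not necessarily the talk's exact route): linearizing the continuity and momentum equations for small perturbations about a medium at rest, and eliminating the particle velocity, yields the wave equation for the acoustic pressure p',

\[
\frac{\partial^2 p'}{\partial t^2} = c^2 \, \nabla^2 p',
\]

where c is the speed of sound in the medium.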

11th March 2022 : Pnoi: Development and Challenges by Syed Fahad

Talk summary:

  • Discussion of the developments regarding the creation of a specialized digital stethoscope called Pnoi for capturing lung and breathing sounds. These sounds can be used for medical diagnosis at a substantially cheaper cost than the current standards.

4th March 2022 : Large Text Corpus Creation using Web Scraping for Language Modelling by Hemantha Krishna Bharadwaj

Talk summary:

  • The collection of large datasets for training language models requires the use of techniques that extract data from the world wide web in a systematic manner. Collectively known as web scraping, these techniques have been well established by previous research, but there is little research on their use for the collection of data other than that in the English language. This talk will detail improved methods of extracting domain-specific non-English language data from the internet using a combination of HTML parsing libraries and frameworks in Python. The proposed methodology can be utilized to provide large non-English language text datasets in an automated fashion.
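
A minimal scraping sketch with requests and BeautifulSoup (illustrative; the talk's actual frameworks and target sites are not specified here):

```python
import requests
from bs4 import BeautifulSoup

url = "https://example.com/articles"            # hypothetical page
html = requests.get(url, timeout=10).text
soup = BeautifulSoup(html, "html.parser")

# collect paragraph text for the language-model corpus
paragraphs = [p.get_text(strip=True) for p in soup.find_all("p")]
with open("corpus.txt", "a", encoding="utf-8") as f:
    f.write("\n".join(paragraphs))
```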

18th February 2022 : Attention and Transformers by Abhayjeet Singh

Talk summary:

  • Intuitive and mathematical understanding of Attention and Transformer Networks

11th February 2022 : wav2vec 2.0 by Siddarth C

Talk summary:

  • We show for the first time that learning powerful representations from speech audio alone followed by fine-tuning on transcribed speech can outperform the best semi-supervised methods while being conceptually simpler. wav2vec 2.0 masks the speech input in the latent space and solves a contrastive task defined over a quantization of the latent representations which are jointly learned.

28th January 2022 : An error correction scheme for improved air-tissue boundary in real-time MRI video for speech production by Anwesha

Talk summary:

  • The best performance in Air-tissue boundary (ATB) segmentation of real-time Magnetic Resonance Imaging (rtMRI) videos in speech production is known to be achieved by a 3-dimensional convolutional neural network (3D-CNN) model. However, the evaluation for this model, as well as other ATB segmentation techniques reported in the literature, is done using Dynamic Time Warping (DTW) distance between the entire original and predicted contours. Such an evaluation measure may not capture local errors in the entire contour. Careful analysis of predicted contours reveals errors in regions like the velum part of contour1 and tongue base section of contour2, which are not captured in a global evaluation metric like DTW distance. In this work, we automatically detect such errors and propose a correction scheme for the same. We also propose two new evaluation metrics for ATB segmentation separately in contour1 and contour2 to explicitly capture two types of errors in these contours.

14th January 2022 : Data Visualization by Jeevan

7th January 2022 : Sociolinguistics: Language Variation and Dialect by Sharmistha

31st December 2021 : A brief introduction to muscle synergies in speech by Navaneetha

24th December 2021 : Vocal and Non-vocal segmentation based on the analysis of formant structure by Pranaswi

17th December 2021 : Overlapped Speech Detection using CNN Architectures by Pooja

Talk summary:

  • The ability to estimate the overlapped sentences spoken by an individual over a certain period of time is valuable in language acquisition, healthcare, and assessing language development. However, establishing a robust automatic framework that achieves high accuracy is non-trivial in realistic/naturalistic scenarios due to various factors, such as different styles of conversation or the types of noise that appear in audio recordings, especially in multi-party conversations. Therefore, overlapping speech detection has become an important front-end triage step for speech technology applications. This is crucial for large-scale datasets where manual labeling is not possible. A block-based CNN architecture is proposed to model overlapping speech in audio streams with frames as short as 25 ms. The architecture is robust to both: (i) shifts in the distribution of network activations due to changes in network parameters during training, and (ii) local variations in the input features caused by feature extraction, environmental noise, or room interference.

10th December 2021 : Introduction to G2P Systems by Priyanshi

Talk summary:

  • The orthography of a language does not always have a predictable relationship with its pronunciation. Certain languages have predictable and consistent relationships; however, for languages like English, which have multiple inconsistencies and loan words from other languages, mapping this relationship becomes challenging. The ability to map this relationship can help in producing better-performing ASR and TTS systems. Grapheme-to-phoneme (G2P) conversion systems are used to find the pronunciation of a word given its written form. We look at where G2P plays a role in the aforementioned systems, what challenges are involved, and one approach to the task.
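
As a quick illustration of what a G2P system does, here is the g2p_en package (an English G2P tool chosen for illustration; not necessarily the approach covered in the talk):

```python
from g2p_en import G2p

g2p = G2p()
print(g2p("knight"))   # graphemes in, phonemes out: ['N', 'AY1', 'T']
```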

3rd December 2021 : Acoustic Modeling and Analysis of Pulmonary System by Veerababu

Talk summary:

  • The pulmonary system is a network of organs and tissues that help us to breathe. A typical human pulmonary system consists of the lungs, larynx, trachea, bronchi, bronchioles, alveoli, and thoracic diaphragm. Inspiratory sounds measured simultaneously over the extrathoracic trachea and at the chest surface contain highly unique regional information. The characteristic patterns in the recorded data are associated with conditions affecting airway patency, such as asthma and obstructive sleep apnea. There is potential for the recorded sounds to be used in clinical practice for the diagnosis and monitoring of various respiratory conditions. In the proposed research work, an acoustic model of the pulmonary system will be developed by treating the tracheobronchial tree and lungs as a flexible branched duct system and plenums, respectively.

26th November 2021 : Hindi Language Modelling using text data from domains of agriculture, finance, healthcare and general by Sneha

Talk summary:

  • Data corresponding to Hindi text is collected from four different domains: general, agriculture, healthcare, and finance. Different statistics, such as word frequencies and the total numbers of sentences and words, are determined from the combined cleaned data of all the domains. To validate the data, a decision tree algorithm is used for text classification, classifying an unknown text into pre-determined groups, as sketched below.
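
A sketch of decision-tree text classification over TF-IDF features with scikit-learn, on toy data (the project's actual corpus and preprocessing are not shown):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.tree import DecisionTreeClassifier
from sklearn.pipeline import make_pipeline

texts = ["wheat crop yield", "bank loan interest", "flu vaccine dose"]
labels = ["agriculture", "finance", "healthcare"]

clf = make_pipeline(TfidfVectorizer(), DecisionTreeClassifier())
clf.fit(texts, labels)
print(clf.predict(["fertilizer prices for wheat"]))   # likely ['agriculture']
```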

19th November 2021 : Bengali text data classification of different domains by Sanchari Chatterjee

Talk summary:

  • Given Bengali texts from 4 different domains (general, agriculture, healthcare, and finance), a decision tree algorithm is used to separate the classes and predict the domain of any unknown Bengali text. It is an approach by which any unknown text can easily be classified according to the domain it belongs to.

5th November 2021 : Selection of acoustically and phonetically rich sentences in the context of ASR by Saurabh Kumar

28th October 2021 : Neural speech synthesis models by Navneet Kaur

Talk summary:

  • The most recent advancements in the field of speech synthesis have been brought by deep learning. In current state-of-the-art models, the task of text-to-speech conversion is accomplished in two steps: i) conversion of text to a lower-resolution intermediate representation, generally a mel-spectrogram, using a seq2seq model (frontend), and ii) generation of the speech waveform from the mel-spectrogram using generative models (backend). In this talk, I will discuss and compare different techniques and models for both the frontend and the backend. Specifically, for the seq2seq model I will cover Tacotron-2, Fastspeech-2, Transformer-TTS, and GlowTTS; among generative models, I will discuss WaveNet, WaveGlow, and MelGAN.

22nd October 2021 : Analysis of vocal sounds in asthmatic patients by Shivani

15th October 2021 : Diffusion probabilistic models in speech synthesis by Sathvik

Talk summary:

  • In recent years, there has been progress in a type of generative modelling known as diffusion probabilistic models. The latent features are learnt through a 'diffusion' process, which iteratively adds noise to the data to transform it into a noise distribution. During inference, this process can be reversed to generate data samples from noise. This learning technique has been applied in problem statements in speech synthesis by modifying it to a conditional generative process.
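
For reference, the textbook formulation of the two processes mentioned above (standard diffusion-model notation, assumed rather than taken from the talk): the forward process gradually corrupts the data with Gaussian noise under a schedule beta_t,

\[
q(\mathbf{x}_t \mid \mathbf{x}_{t-1}) = \mathcal{N}\!\left(\mathbf{x}_t;\ \sqrt{1-\beta_t}\,\mathbf{x}_{t-1},\ \beta_t \mathbf{I}\right),
\]

while the learned reverse process generates data from noise,

\[
p_\theta(\mathbf{x}_{t-1} \mid \mathbf{x}_t) = \mathcal{N}\!\left(\mathbf{x}_{t-1};\ \mu_\theta(\mathbf{x}_t, t),\ \Sigma_\theta(\mathbf{x}_t, t)\right).
\]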

13th October 2021 : Analysis of vocal sounds in asthmatic patients by Shivani

24th September 2021 : Study of ALS/PD classification using slurred speech by Aayushman

Talk summary:

  • Monitoring disease progression in patients with Amyotrophic Lateral Sclerosis (ALS) and Parkinson’s disease (PD) can be done by analyzing their speech waveforms. Many works in the past have used different acoustic features for classifying patients with ALS and PD against healthy controls (HC). In this project, I studied a data-driven approach to learn representations from the raw speech waveform. The model comprises a 1-D Convolutional Neural Network (CNN) layer to extract representations from raw speech, followed by bi-directional Long Short-Term Memory (BLSTM) layers for the classification tasks. Three different classification tasks (ALS vs HC, PD vs HC, and ALS vs PD) were considered. The model performs the classification using four different speech stimuli, namely image description (IMAG), spontaneous speech (SPON), diadochokinetic rate (DIDK), and sustained phoneme production (PHON). Experiments were performed with 90 ALS, 90 PD, and 90 HC subjects; a model sketch follows.
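
A hedged PyTorch sketch of the described architecture (layer sizes are illustrative assumptions, not the project's exact configuration):

```python
import torch
import torch.nn as nn

class CNNBLSTM(nn.Module):
    """1-D CNN front end on raw waveform, BLSTM back end for classification."""
    def __init__(self, n_classes=2):
        super().__init__()
        self.cnn = nn.Sequential(
            nn.Conv1d(1, 32, kernel_size=80, stride=16), nn.ReLU())
        self.blstm = nn.LSTM(32, 64, batch_first=True, bidirectional=True)
        self.fc = nn.Linear(2 * 64, n_classes)

    def forward(self, wav):                  # wav: (batch, samples)
        h = self.cnn(wav.unsqueeze(1))       # (batch, 32, frames)
        out, _ = self.blstm(h.transpose(1, 2))
        return self.fc(out[:, -1])           # classify from the last time step

model = CNNBLSTM()
logits = model(torch.randn(4, 16000))        # four 1-second clips at 16 kHz
```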

17th September 2021 : Speech Synergies by Chirag Vasist

3rd September 2021 : TaLNet: Voice Reconstruction from Tongue and Lip Articulation with Transfer Learning from Text-to-Speech Synthesis by Siddarth

Talk summary:

  • This paper presents TaLNet, a model for voice reconstruction with ultrasound tongue and optical lip videos as inputs. TaLNet is based on an encoder-decoder architecture. Separate encoders are dedicated to processing the tongue and lip data streams. The decoder predicts acoustic features conditioned on the encoder outputs and speaker codes. To mitigate the relatively small amount of dual articulatory-acoustic data available for training, and since our task shares with text-to-speech (TTS) the common goal of speech generation, we propose a novel transfer learning strategy to exploit the much larger amounts of acoustic-only data available for training TTS models. For this, a Tacotron 2 TTS model is first trained, and the parameters of its decoder are then transferred to the TaLNet decoder.

20th August 2021 : Kannada language modeling using text data from the domains of agriculture, finance, healthcare and general by Karthik S Vasisht

Talk summary:

  • Language modelling involves decomposing texts into smaller sections: sentences and words, and statistically analyzing them to make accurate predictions of phrases and sentences. The N-GRAM model is a statistical analysis tool that predicts the likelihood of certain words combining to form a meaningful sentence based on the conditional probabilities of each of the words in the sentence given the occurrence of the others. This presentation would discuss the work done during the period of 2 months towards link collection, web-scraping, text cleaning and validation of collected data for building a Kannada language model.

6th August 2021 : AAI and Palate contour estimation by Anish

Talk summary:

  • AAI is a model which maps MFCCs to EMA points. EMA data correspond to the movement of articulators in the mouth. Predicting EMA points helps us visualize the movement of the mouth while we speak, but predicting the palate contour along with the EMA points helps us understand that movement even better. The presentation focuses on the different models that were trained, along with the different preprocessing techniques employed to predict the palate contours.

23rd July 2021 : Language Identification of ALS patients using X-Vector model by Yasaswini

Talk summary:

  • Amyotrophic lateral sclerosis (ALS) is a rare neurological disease that primarily affects the nerve cells (neurons) responsible for controlling voluntary muscle movements like chewing, walking, and talking. As ALS hampers speech a great deal, speech recognition techniques become predominant, and in order to build a model to recognize patients' speech, language identification is the foremost crucial part.

16th July 2021 : Accent conversion using Cotatron by Chinmay

Talk summary:

  • Accent conversion (AC) aims to make non-native speech sound as if the speaker has a certain native accent. Typical AC methods attempt to convert only the native speaker's voice to that of a non-native speaker, leaving the basic content and pronunciation unchanged. This hinders their practical use in real-world applications, because native-accented utterances are required at the conversion stage. People who acquire a second language after the "critical age" often speak it with the influence of their mother tongue, which can lead to low intelligibility, and such speakers may face discriminatory situations. Therefore, students who communicate with native speakers have much to gain by improving their pronunciation. The presentation discusses the work done during the period of 2 months towards testing, comparing, and improving the existing methods for accent conversion.

9th July 2021 : Age Estimation for ALS Patients Speech Utterance Based on LSTM by Lavanya

Talk summary:

  • Speaker age is part of the non-verbal information contained in speech. Age estimation consists of automatically determining the age of a speaker from a given segment of a speech utterance. Age estimation from speech has recently received increased interest as it is useful for many applications, such as user profiling, targeted marketing, or personalized call routing. These kinds of applications need to quickly estimate the age of the speaker and might greatly benefit from real-time capabilities. Long short-term memory (LSTM) networks have been shown to outperform state-of-the-art approaches in related speech-based tasks, such as language identification or voice activity detection, especially when an accurate real-time response is required.

2nd July 2021 : Speech-based classification of ALS patients and Healthy subjects by Sonakshi

Talk summary:

  • Amyotrophic Lateral Sclerosis (ALS) is a rare neurological disease that affects the motor neurons, causing loss of the ability to speak, eat, move, and breathe. There is no cure for ALS yet. Early detection is crucial so that therapeutic measures can be started at an early stage, which helps prolong life expectancy and quality of living for the patients. Unfortunately, the diagnosis of the disease is difficult and time-consuming; hence, there is a need to develop an automatic device/app that can detect the disease, which would allow an earlier start of therapy and thus greater life expectancy. The presentation discusses the work done during the period of 6 months towards testing, comparing, and improving the existing methods for this classification task.

18th June 2021 : Segnet based ATB segmentation in rtMRI videos by Jelwin

11th June 2021 : Brain Stroke Segmentation using Deep Learning by Nikhil

Talk summary:

  • Stroke is one of the main causes of adult deaths around the globe, impacting 6.2 million people per annum; over the past 20 years, there has been a 26 percent increase in stroke deaths worldwide, and stroke is the second leading cause of death across the world. In recent years, machine and deep learning algorithms have created a huge impact on research challenges in several domains, including healthcare, natural language processing, speech processing, and more. The medical field also greatly benefits from the utilization of improving deep learning models, which save time and produce accurate results. Typically, the manual segmentation of strokes is done by expert radiologists or doctors who excel in this field. Manual segmentation is time-consuming (it takes nearly three to four hours to diagnose the problem) and also introduces inter- and intra-rater variability among radiologists; this impacts stroke-affected patients if careful clinical decisions are not made quickly. To augment a radiologist's or doctor's effort, deep learning algorithms can be used effectively for segmenting clinical brain images and can be a valuable tool for this work.

21st May 2021 : A Scalable Deep Learning Model for Arbitrary Transmitter Configurations in Inverse Scattering by Karthik

7th May 2021 : Non-Attentive Tacotron: Robust and Controllable Neural TTS Synthesis Including Unsupervised Duration Modeling by Abhayjeet

30th April 2021 : Learning Rate Warmups and the Variance of Adaptive Learning Rates by Bheshaj

23rd April 2021 : ISTA-Net: Interpretable Optimization-Inspired Deep Network for Image Compressive Sensing by Bhargava

9th April 2021 : Speech to EMG mapping by Navaneetha

19th March 2021 : A literature survey on audio recording device identification by Bhavuk

12th March 2021 : A brief tutorial on Android app development and design patterns by Shankar

5th March 2021 : Fourier Features Let Networks Learn High Frequency Functions in Low Dimensional Domains by

26th February 2021 : Fourier Features Let Networks Learn High Frequency Functions in Low Dimensional Domains by

19th February 2021 : Principal Component Analysis (PCA), Kernel PCA and Independent Component Analysis (ICA) by Anwesha

13th February 2021 : Graph Convolutional Networks by Manthan

5th February 2021 : Tutorial on Equivariant Networks by

29th January 2021 : A Brief Introduction to Density Based Spatial Clustering for Applications with Noise by Priyanshi

22nd January 2021 : Graph Neural Networks for solving PDEs by Karthik

15th January 2021 : Capsule Networks by Siddharth

8th January 2021 : Whisper to neutral speech conversion by Subhadeep, Pritam and Debojyoti

1st January 2021 : A closer look at loss functions by Achuth

24th December 2020 : Approximate inference methods by Sathvik

14th December 2020 : Acoustic-Articulatory Mapping: Analysis and Improvements with Neural Network Learning Paradigms by Aravind

4th December 2020 : Fader Network by Tanuka

27th November 2020 : An unsupervised segmentation of vocal breath sounds by Shivani

19th November 2020 : Pulmonary function test graph digitizer by Sandhya

13th November 2020 : Overview of Microphones by Jeevan and Shaique

6th November 2020 : Feasibility of Learning(Continued) by Karthik

30th October 2020 : Feasibility of Learning by Karthik

17th October 2020 : Inverse scattering using two stage networks by Karthik

9th October 2020 : Introduction to Phonetics by Sharmistha

11th September 2020 : Introduction to Git and Docker by Sanjeev

27th August 2020 : On the power of curriculum learning in training deep networks by Siddharth

21st August 2020 : Generative models based on normalizing flows by Achuth

14th August 2020 : Generative models based on normalizing flows by Achuth

7th August 2020 : Electrical Impedance Tomography: A portable low cost setup for biomedical imaging and other applications by Karthik

31st July 2020 : Neural Networks and Differential Equations by Avni

24th July 2020 : Regularization for deep learning by Aravind

17th July 2020 : Thesis defence by Renuka

9th July 2020 : Speech task-specific representation learning using acoustic-articulatory data by Renuka

3rd July 2020 : Neural Turing Machines by Chiranjeevi

26th June 2020 : An overview of gradient descent optimization algorithms by Abinay

19th June 2020 : Speech rate estimation using representations learned from speech with convolutional neural network by Renuka

12th June 2020 : Acoustic-to-articulatory inversion of dysarthric speech by utilizing cross-corpus acoustic-articulatory data

5th June 2020 : Temporal decomposition of Speech by Tilak

29th May 2020 : Quantum Computing for breaking encryption by Pavan Kumar J

13th March 2020 : Linguistics: An Introduction by Sharmistha

6th March 2020 : AUTOMATIC CLASSIFICATION OF VOLUMES OF WATER USING SWALLOW SOUNDS FROM CERVICAL AUSCULTATION by Siddharth

28th February 2020 : Web Interface for acoustic feature analysis by Heena and Vaibhav

21st February 2020 : Deep Canonical Correlation Analysis by Sanjeev

Talk summary:

  • Introduction to correlation and to linear algebra concepts like eigendecomposition, SVD, and PCA, followed by how Canonical Correlation Analysis (CCA) uses all of these to find ideal transformations.
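
A quick CCA example with scikit-learn (illustrative, not the talk's code): two views sharing a latent signal are projected so that their projections are maximally correlated:

```python
import numpy as np
from sklearn.cross_decomposition import CCA

rng = np.random.default_rng(0)
latent = rng.normal(size=(100, 1))                  # shared signal
X = np.hstack([latent + 0.1 * rng.normal(size=(100, 1)) for _ in range(3)])
Y = np.hstack([latent + 0.1 * rng.normal(size=(100, 1)) for _ in range(2)])

cca = CCA(n_components=1)
X_c, Y_c = cca.fit_transform(X, Y)
print(np.corrcoef(X_c[:, 0], Y_c[:, 0])[0, 1])      # close to 1 here
```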

14th February 2020 : Voice-based classification of patients with ALS, Parkinson's disease and healthy controls with CNN-LSTM using transfer learning by Jhansi

7th February 2020 : Hypothesis Testing by Shivani

31st January 2020 : Improving fundamental frequency generation in EMG-to-Speech Conversion using a Quantization Approach by Tejas

17th January 2020 : Comparison of interpolation schemes for the perception of speech in the presence of missing samples by Amit

3rd January 2020 : Inverse Scattering by Mahima

1st January 2020 : Out-of-Pronunciation Distribution Detection: An Unsupervised Approach by Parth

27th December 2019 : Data driven analysis of critical articulators in speech production by Anusuya

13th December 2019 : Multichannel Acoustic Source Localization by Tarun

6th December 2019 : Variational Methods by Achuth

29th November 2019 : Computational wave scattering by Karthik

22nd November 2019 : Basics of Graph Signal Processing by Aravind

11th October 2019 : Medical image segmentation on GPUs – A comprehensive review by Divya

4th October 2019 : Dynamic Programming: An Overview and Some Optimization Techniques by Shankar

27th September 2019 : Origins of Fourier Series by Pavan Kumar

Talk summary:

  • Outline: introduction, the heat equation, solution of the PDE, and Fourier series.

2nd September 2019 : Comparison of automatic syllable stress detection quality with time-aligned boundaries and context dependencies by Manoj

2nd September 2019 : A comparative study of noise robustness of goodness of pronunciation (GoP) measures and its modifications based on teacher's utterance by Manoj

2nd September 2019 : Whisper to neutral mapping using cosine similarity maximization in i-vector space for speaker verification by Abinay

2nd September 2019 : An investigation on speaker specific articulatory synthesis with speaker independent articulatory inversion by Aravind

31st August 2019 : Low resource automatic intonation classification using gated recurrent unit (GRU) networks pre-trained with synthesized pitch patterns by Atreyee

23rd August 2019 : Achievements and the goals of the lab by Prasanta Kumar Ghosh

23rd August 2019 : ASR inspired syllable stress detection for pronunciation evaluation without using a supervised classifier and syllable level features by Manoj

23rd August 2019 : An improved goodness of pronunciation (GoP) measure for pronunciation evaluation with DNN-HMM system considering HMM transition probabilities. by Manoj

16th August 2019 : Acoustic and articulatory feature based speech rate estimation using a convolutional dense neural network by Renuka

Talk summary:

  • In this paper, we propose a speech rate estimation approach using a convolutional dense neural network (CDNN). The CDNN based approach uses the acoustic and articulatory features for speech rate estimation. The Mel Frequency Cepstral Coefficients (MFCCs) are used as acoustic features and the articulograms representing time-varying vocal tract profile are used as articulatory features. The articulogram is computed from a real-time magnetic resonance imaging (rtMRI) video in the midsagittal plane of a subject while speaking. However, in practice, the articulogram features are not directly available, unlike acoustic features from speech recording. Thus, we use an Acoustic-to-Articulatory Inversion method using a bidirectional long-short-term memory network which estimates the articulogram features from the acoustics. The proposed CDNN based approach using estimated articulatory features requires both acoustic and articulatory features during training but it requires only acoustic data during testing. Experiments are conducted using rtMRI videos from four subjects each speaking 460 sentences. The Pearson correlation coefficient is used to evaluate the speech rate estimation. It is found that the CDNN based approach gives a better correlation coefficient than the temporal and selected sub-band correlation (TCSSBC) based baseline scheme by 81.58% and 73.68% (relative) in seen and unseen subject conditions respectively.

2nd August 2019 : Rethinking Model Scaling for Convolutional Neural Networks by Aparna

23rd July 2019 : Breath cycle segmentation by Shruthi

Talk summary:

  • Segmentation of individual inhale and exhale from an audio recording of continuous breaths using two approaches - Spectral entropy approach and Parzen window-based approach.

19th July 2019 : Performance characterization of microphones by Suhas

Talk summary:

  • We look at the parameters that affect the recording from a microphone, what to look for in a specification sheet, and also how to assess different microphones' performance qualitatively.

16th July 2019 : Construction of an anthropomorphic thorax phantom using CT scan segmentation and 3D printing by Srishti

Talk summary:

  • Phantoms that mimic human physiology have long been used for designing and testing various diagnostic medical/imaging techniques. The most important advantage of using phantoms is easy access to ground truth information which in most cases cannot be obtained from a human subject. The objective of this project is to construct an anthropomorphic thorax phantom that can be used to develop a system for multi-channel active/passive acoustic characterisation of lungs. However, to construct anthropomorphic phantoms we need a suitable way to capture anthropomorphic parameters and replicate them in the form of a phantom. To this end, we first obtain the anthropomorphic parameters of a human thorax using a CT scan. The CT scan is then segmented into various regions which would be finally printed using a 3D printer.

12th July 2019 : The task of Sound Event Detection by Shoureen

Talk summary:

  • The task of Sound Event Detection can be broadly classified into two categories, namely classification and localization: the former caters to simple audio tagging, while the latter requires the additional task of specifying the onset and offset times of each event taking place in the given audio stream. The main challenge in audio tagging is the lack of frame-wise ground truths, which essentially turns this into a Multiple Instance Learning problem. In my work, I have tested multiple pooling functions, incorporating them at various stages, in order to maximize the F-score of the audio tagging system.

12th July 2019 : Call recording app with some additional features. by Utkarsh

12th July 2019 : Trend Statistics Network and Channel invariant EEG Network for sleep arousal study by Achuth

Talk summary:

  • Sleep is a very important part of life, and lack of sleep or a sleep disorder can negatively impact day-to-day life and have serious long-term consequences. In this work, we propose an end-to-end trainable neural network for automated arousal scoring. The network consists of two main parts: first, a trend statistics network that computes the moving average of the filtered signals at different scales; second, a channel-invariant EEG network to detect EEG arousals in any channel. Finally, we combine the features from the various channels through a convolutional network and a bi-directional long short-term memory to predict the probability of an arousal. Further, we propose an objective function that uses only respiratory effort related arousal (RERA) and non-arousal regions to optimize the network, as well as a method to estimate the respiratory disturbance index (RDI) from the probability predicted by the network. Evaluation on the PhysioNet Challenge 2018 database shows that the proposed method detects RERAs with an area under the precision-recall curve (AUPRC) of 0.50 in a 10-fold cross-validation setup. The mean absolute error of RDI prediction is 6.11, while a two-class RDI severity prediction yields a specificity of 75% and a sensitivity of 83%.

5th July 2019 : Effect of consonant context in TIMIT vcv sequences on pitch trend by Vaibhav

Talk summary:

  • This analysis describes how the pitch trend in VCV sequences depends on the voicing characteristics of the consonant, in the vowel region that follows the consonant.

5th July 2019 : An exhaustive study on the involvement of articulators in the production of plosives by Minulakshmi

Talk summary:

  • The study focuses on the occurrence and duration of constriction for bilabial and laminal-alveolar plosives across the vowels /a,e,i,o,u/ in a symmetric VCV sequence.

28th June 2019 : Quantitative Trading by Sanjeev

Talk summary:

  • Exploring methods used by Quants in trading - Risk Model, Alpha Model, and strategies.

28th June 2019 : A Comparison of Different Methods for Audio Declipping by Sandhiya

Talk summary:

  • A deep dive into state-of-the-art algorithms for audio declipping: Constrained Blind Amplitude Reconstruction, Constrained Orthogonal Matching Pursuit Reconstruction, and two variants of Sparse Audio Declippers.

20th June 2019 : An acoustic investigation on the effect of consonant context and speaking rate on vowel space and coarticulation in Toda VCV sequences by Nayan

Talk summary:

  • This study analyzes the effect of consonant context and speaking rate on vowel space and coarticulation in Toda vowel-consonant-vowel (VCV) sequences. The vowels /a/, /e/, /i/, /o/, /u/ and two intervocalic consonants, /p/ (labial) and /t/ (alveolar), are considered, forming asymmetrical VCV sequences at slow and very fast speaking rates. Results from these acoustic analyses indicate that there are differences in the way rate and consonant context affect the coarticulatory organization.

20th June 2019 : Acoustic analysis of swallow sounds in individuals with head and neck cancer by Divya

Talk summary:

  • This paper describes the effect of the volume of water swallowed by healthy controls on the acoustic signals captured by means of cervical auscultation. The study indicates that the peak intensity of the second swallow segment is the best parameter to differentiate volumes, since it changes significantly across the three volumes of water considered in this study.

14th June 2019 : A study on the problem of heart-rate estimation from facial videos by Vishay

7th June 2019 : Unsupervised syllable stress detection by Manoj

Talk summary:

  • Estimating stress markings in an automatic speech recognition (ASR) framework involving finite-state transducers (FSTs), without using annotated stress markings or segmental information.

6th May 2019 : AIR-TISSUE BOUNDARY SEGMENTATION IN REAL TIME MAGNETIC RESONANCE IMAGING VIDEO USING A CONVOLUTIONAL ENCODER-DECODER NETWORK by Renuka

6th May 2019 : AN IMPROVED AIR TISSUE BOUNDARY SEGMENTATION TECHNIQUE FOR REAL TIME MAGNETIC RESONANCE IMAGING VIDEO USING SEGNET by Renuka

26th April 2019 : Representation learning using convolution neural network for acoustic-to-articulatory inversion by Aravind

26th April 2019 : A Study on Robustness of Articulatory Features for Automatic Speech Recognition of Neutral and Whispered Speech by Gokul

19th April 2019 : FORMANT-GAPS FEATURES FOR SPEAKER VERIFICATION USING WHISPERED SPEECH by Abhinay

12th April 2019 : K-SVD by Karthik

15th March 2019 : Methods to work with class imbalanced datasets by Shivani Yadav

Talk summary:

  • What class imbalance is, how it affects classifier performance, and methods to handle it.

8th March 2019 : Initial value problem by Prasanta Kumar Ghosh

Talk summary:

  • Solving initial value problems in the context of ordinary differential equations (ODEs) using numerical methods, which are often required when ODEs are not analytically solvable. In this regard, both the theory and Matlab coding of the Runge-Kutta family of methods will be discussed (a sketch of one such method follows below).
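
The talk's coding examples were in Matlab; purely for illustration, here is a minimal Python sketch of the classical fourth-order Runge-Kutta (RK4) method for y' = f(t, y), y(t0) = y0:

```python
import numpy as np

def rk4(f, t0, y0, t_end, h):
    """Integrate y' = f(t, y) from t0 to t_end with the classical RK4 method."""
    t, y = t0, y0
    while t < t_end - 1e-12:
        k1 = f(t, y)
        k2 = f(t + h / 2, y + h * k1 / 2)
        k3 = f(t + h / 2, y + h * k2 / 2)
        k4 = f(t + h, y + h * k3)
        y = y + (h / 6) * (k1 + 2 * k2 + 2 * k3 + k4)
        t += h
    return y

# y' = -2y with y(0) = 1 has the exact solution exp(-2t)
print(rk4(lambda t, y: -2 * y, 0.0, 1.0, 1.0, 0.01), np.exp(-2))
```

With step size h, RK4 has local error O(h^5) and global error O(h^4), which is why it is the workhorse of the family.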

22nd February 2019 : Font and Background Color Independent Text Binarization by Vishay

Talk summary:

  • Starting with the motivation for binarization of text images, followed by a discussion of global and adaptive thresholding techniques, and ending with a discussion of a novel approach for text binarization.

22nd February 2019 : Prediction of articulatory motion at different rates by Abhay

Talk summary:

  • Predicting articulatory trajectories in speech production from neutral to fast or slow rates using an encoder-decoder based model with some alterations.

15th February 2019 : A SegNet Based Image Enhancement Technique for Air-Tissue Boundary Segmentation in Real-Time Magnetic Resonance Imaging Video by Renuka

8th February 2019 : Weighted Finite-State Transducers in Speech Recognition by Manoj

Talk summary:

  • Different operations on WFSTs.
    • How WFSTs are used in decoding an utterance.

1st February 2019 : Articulatory Phonology by Renuka

Talk summary:

  • Description of articulatory phonology.
    • The gestural computational model.
    • Gestural analysis.

25th January 2019 : Learning deep features for one-class classification by Shankar

Talk summary:

  • We take a look at the problem of one-class classification and a deep learning-based solution for feature learning for one-class classification

11th January 2019 : Solving a system of linear equations by Karthik

28th December 2018 : On Utility of Multi-taper Modified Group Delay Function Representations for Speaker and Language Recognition by Narendra

Talk summary:

  • The abstract can be found in the attached document.

14th December 2018 : Visualizing High Dimensional Data using t-SNE by Aravind

Talk summary:

  • Basic concepts of information theory.
    • Introduction to visualization.
    • t-SNE (t-distributed stochastic neighbor embedding); a usage sketch follows below.
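
For reference, a minimal t-SNE visualization with scikit-learn (an illustrative sketch; the dataset and parameter values are placeholders, not from the talk):

```python
from sklearn.datasets import load_digits
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt

# Embed the 64-dimensional digits data into 2-D with t-SNE
X, y = load_digits(return_X_y=True)
emb = TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(X)

plt.scatter(emb[:, 0], emb[:, 1], c=y, s=5, cmap="tab10")
plt.title("t-SNE embedding of handwritten digits")
plt.show()
```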

7th December 2018 : Overview of ASR by Avni

Talk summary:

  • The mathematical formulation of ASR.
    • Overview of HMM-GMM ASR.
    • Viterbi decoding.
    • Advantages of WFSTs over tree-based structures.

30th November 2018 : How Netflix & Amazon Prime recommend movies by Sweekar

Talk summary:

  • The talk is about how matrix factorization and gradient descent work together to suggest the best content possible for the viewer (a minimal sketch follows below).
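
A minimal sketch of that idea: approximate the ratings matrix R as U·Vᵀ and fit the factors by gradient descent on the observed entries only (a toy illustration, not any production recommender):

```python
import numpy as np

# Toy ratings matrix (0 = unrated); rows are users, columns are movies
R = np.array([[5, 3, 0, 1],
              [4, 0, 0, 1],
              [1, 1, 0, 5],
              [0, 1, 5, 4]], dtype=float)
mask = R > 0

rng = np.random.default_rng(0)
k, lr, reg = 2, 0.01, 0.02          # latent dimension, learning rate, L2 penalty
U = rng.normal(scale=0.1, size=(R.shape[0], k))
V = rng.normal(scale=0.1, size=(R.shape[1], k))

for _ in range(5000):
    E = mask * (R - U @ V.T)        # error on the observed ratings only
    U += lr * (E @ V - reg * U)     # descend the regularized squared loss
    V += lr * (E.T @ U - reg * V)

print(np.round(U @ V.T, 1))         # predictions, including the unrated cells
```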

9th November 2018 : Introduction to MIR, Audio licensing and blockchain technology by Suhas

Talk summary:

  • In this talk, we look at what music information retrieval is, why audio licensing is required and how audio watermarking and blockchains make data secure, accurate and reliable

2nd November 2018 : Attention in neural networks by Achuth

Talk summary:

  • We will see how basic attention works in neural networks and understand how attention is used in general seq2seq mapping problems, including ASR, TTS, machine translation and image captioning.

12th October 2018 : Necessity for cloud computing by Valliappan

12th October 2018 : We’re creating a dystopia of misinformation and emotional manipulation by Aparna

5th October 2018 : Learning better models to sparsify yellow marks in manuscripts by Nisha

Talk summary:

  • Faulty writing practices leading to "yellowing" of submitted drafts.
    • Factors in writing that are inversely proportional to the dimension of the "yellow marks" subspace.
    • A "Check-List" algorithm to improve writing.

28th September 2018 : Fisher Linear Discriminant by Chiranjeevi

24th August 2018 : Subband Weighting for Binaural Speech Source Localization by Karthik

24th August 2018 : Speech Enhancement Using Deep Mixture of Experts Based on Hard Expectation Maximization by Pavan

24th August 2018 : Relating Articulatory Motions in Different Speaking Rates by Astha

24th August 2018 : Whispered Speech to Neutral Speech Conversion Using Bidirectional LSTMs by Nisha

17th August 2018 : Air-Tissue Boundary Segmentation in Real-Time Magnetic Resonance Imaging Video Using Semantic Segmentation with Fully Convolutional Networks by Valliappan

17th August 2018 : Automatic visual augmentation for concatenation based synthesized articulatory videos from real-time MRI data for spoken language training by Chiranjeevi

17th August 2018 : Inferring speaker identity from articulatory motion during speech by Aravind

17th August 2018 : Low resource acoustic-to-articulatory inversion using bi-directional long short-term memory by Aravind

3rd August 2018 : Responsive Website Tool to Rate the Pronunciation Quality by Abhishek Gaonkar

3rd August 2018 : Richer convolutional features for edge detection by Renuka

27th July 2018 : Glottal Segmentation GUI in Python by Varun

27th July 2018 : STUDY OF USE OF ARTICULATORY INFORMATION FOR ASR OF NEUTRAL AND WHISPERED SPEECH by Gokul

24th July 2018 : Interpretability in Machine Learning (ML) by Deep

Talk summary:

  • We have been deploying ML algorithms ("black box models") on various problems (e.g., classification tasks), and as a result it has become imperative that we develop tools for the interpretability of these "black boxes" so as to enable their deployment in real-life applications. My aim is to give a brief overview of this science of interpretability.

20th July 2018 : A study on acoustic-to-articulatory inversion for understanding inter-speaker dependency by Siddant

20th July 2018 : Intonation classification using temporal structures in pitch contour by Atreyee

13th July 2018 : Rendering head gestures based on MO-CAP (OptiTrack) data by Varshini

13th July 2018 : Prediction of the air-tissue boundary in the upper airway of the vocal tract by Avinash

13th July 2018 : Implementation of frame selective dynamic programming based pitch estimation by Aswin

13th July 2018 : A Maximum Likelihood Formulation to Exploit Heart Rate Variability for Robust Heart Rate Estimation from Facial Video by Raseena

6th July 2018 : Detection and Delineation of P and T waves in an ECG signal by Prakhar

6th July 2018 : Broad Phoneme Class Specific Deep Neural Network Based Speech Enhancement by Pavan

6th July 2018 : Classification between story-telling and poem recitation using head gesture of the talker by Anurag

6th July 2018 : Comparison of Cough, Wheeze and Sustained Phonations for Automatic Classification between Healthy subjects and Asthmatic patients by Shivani

29th June 2018 : A Maximum Likelihood Formulation to Exploit Heart Rate Variability for Robust Heart Rate Estimation from Facial Video by Raseena

Talk summary:

  • Motivation for non-contact heart rate measurement.
    • Challenges in non-contact heart rate measurement.
    • The proposed maximum likelihood approach.
    • Experiments and results.

29th June 2018 : Comparison of Cough, Wheeze and Sustained Phonations for Automatic Classification between Healthy subjects and Asthmatic patients by Shivani

Talk summary:

  • Introduction, motivation, proposed method, dataset, experimental setup, results, conclusion and future work.

22nd June 2018 : Automatic visual augmentation for articulatory videos from real-time MRI data by Chandana

15th June 2018 : A brief introduction to lung anatomy, physiology, pathology and pulmonary function tests by Shivani

Talk summary:

  • Lung anatomy and physiology: explanation and overview of lung function tests. Pulmonary diseases: overview of two papers related to sound-based analysis or detection of pulmonary diseases.

4th May 2018 : Git and GitHub by Nitin

Talk summary:

  • What are Git and GitHub?
    • Demonstration of the tool.

23rd Feb 2018 : Introduction to Bootstrap by Kausthubha

Talk summary:

  • Introduction to Bootstrap.
    • File structure.
    • Typography.
    • Different classes of buttons used in Bootstrap.

10th Feb 2018 : Joint Learning of Phonetic Units and Word Pronunciations for ASR by Avni

Talk summary:

  • Problem statement.
    • Background.
    • Model formulation for the underlying problem.
    • Discussion.

2nd Feb 2018 : Why Deep learning? by Valliappan

Talk summary:

  • DNNs vs linear classifiers.
    • Back-propagation.
    • Understanding back-propagation for the batch normalisation layer.
    • Introduction to CNNs.
    • GPUs.
    • Speech and CNNs.

5th Dec 2018 : Enhanced voice user interface employing spatial filtration of signals from acoustic vector sensor by Abinay Reddy

Talk summary:

  • One of the current challenges in automatic speech recognition (ASR) is robust recognition in noisy conditions. We will discuss the idea of using an acoustic vector sensor to improve ASR in noisy conditions.

29th Dec 2017 : Connectionist temporal classification (CTC) by Achuth Rao

Talk summary:

  • CTC is one of the key components in recent state-of-the-art automatic speech recognition systems, such as those from Google and Deep Speech 2. We will discuss the key ideas and motivation involved in developing CTC (a minimal usage sketch follows below).
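
As a pointer to how this looks in practice, PyTorch ships a CTC loss; a minimal sketch with random tensors standing in for a real network's outputs (illustrative only):

```python
import torch
import torch.nn as nn

# T time steps, batch of N, C classes (index 0 is the CTC blank)
T, N, C = 50, 2, 20
log_probs = torch.randn(T, N, C, requires_grad=True).log_softmax(dim=2)
targets = torch.randint(1, C, (N, 10), dtype=torch.long)   # label sequences
input_lengths = torch.full((N,), T, dtype=torch.long)
target_lengths = torch.full((N,), 10, dtype=torch.long)

ctc = nn.CTCLoss(blank=0)
loss = ctc(log_probs, targets, input_lengths, target_lengths)
loss.backward()   # gradients flow through the alignment-free loss
print(loss.item())
```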

22nd Dec 2017 : Time Scaling of Articulatory Motion in Speech Production by Astha Singh

Talk summary:

  • Introduction.
    • Problem statement.
    • Approaches: interpolation, affine-invariant DTW and interpolation.
    • Some results.

15th Dec 2017 : Arc-cosine kernels and neural networks by Pavan Karjol

Talk summary:

  • Kernel functions.
    • Arc-cosine kernels.
    • Neural networks.
    • Conclusions.

8th Dec 2017 : On the importance/unimportance of phase in speech signal processing by Prasanta

Talk summary:

  • Definition of phase.
    • Key results (perception).
    • The role of phase in speech enhancement, watermarking, synthesis and recognition.

1st Dec 2017 : The Quantum Bit (Qubit) by Karthik

Talk summary:

  • States of a qubit.
    • Information and measurement of a qubit state.
    • Single-qubit gates.
    • Multi-qubit gates.
    • Bell states / EPR pairs.
    • Quantum entanglement (a small worked example follows below).
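
These ideas fit in a few lines of numpy: state vectors for qubits, unitary matrices for gates, and a Bell state built from a Hadamard followed by a CNOT (an illustrative sketch):

```python
import numpy as np

ket0 = np.array([1, 0], dtype=complex)                        # |0>
H = np.array([[1, 1], [1, -1]], dtype=complex) / np.sqrt(2)   # Hadamard gate
CNOT = np.array([[1, 0, 0, 0],
                 [0, 1, 0, 0],
                 [0, 0, 0, 1],
                 [0, 0, 1, 0]], dtype=complex)                # controlled-NOT

# Hadamard on the first qubit, then CNOT, yields (|00> + |11>)/sqrt(2)
state = np.kron(H @ ket0, ket0)
bell = CNOT @ state
print(np.round(bell, 3))                  # [0.707 0 0 0.707]

# Measurement probabilities are squared amplitudes: only |00> and |11> occur,
# perfectly correlated -- the signature of entanglement
print(np.round(np.abs(bell) ** 2, 3))     # [0.5 0 0 0.5]
```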

10th Nov 2017 : The Wagner-Fischer string-to-string correction algorithm and its optimality by Chiranjeevi Yarra

Talk summary:

  • Problem definition.
    • The Wagner-Fischer algorithm.
    • Objective function.
    • Modifications to the objective function.
    • Optimality (a reference implementation sketch follows below).
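
For reference, a compact Python rendering of the Wagner-Fischer dynamic program for edit distance (the standard textbook formulation, not the speaker's code):

```python
def edit_distance(a: str, b: str) -> int:
    """Minimum number of insertions, deletions and substitutions
    needed to turn string a into string b (Wagner-Fischer)."""
    m, n = len(a), len(b)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i                               # delete all of a[:i]
    for j in range(n + 1):
        d[0][j] = j                               # insert all of b[:j]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + cost) # substitution
    return d[m][n]

print(edit_distance("intention", "execution"))    # 5
```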

3rd Nov 2017 : The impact of speaking rate on acoustic-to-articulatory inversion by Aravind Illa

Talk summary:

  • Speech production.
    • Acoustic-to-articulatory inversion.
    • The effect of rate on inversion.

13th Oct 2017 : A simple introduction to Blind Source Separation by Karthik

Talk summary:

  • Introduction to blind source separation.
    • Ambiguities due to permutation, scaling and Gaussianity.
    • The principle of Independent Component Analysis (ICA).
    • A maximum-likelihood based algorithm for ICA (a toy separation example follows below).
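
As a quick illustration of the separation problem, scikit-learn's FastICA can unmix two synthetic sources (a toy sketch under ideal instantaneous-mixing assumptions):

```python
import numpy as np
from sklearn.decomposition import FastICA

# Two independent sources: a sinusoid and a square wave
t = np.linspace(0, 8, 2000)
S = np.c_[np.sin(2 * t), np.sign(np.sin(3 * t))]
S += 0.05 * np.random.default_rng(0).normal(size=S.shape)

A = np.array([[1.0, 0.5], [0.4, 1.0]])   # unknown mixing matrix
X = S @ A.T                              # observed mixtures

ica = FastICA(n_components=2, random_state=0)
S_hat = ica.fit_transform(X)             # recovered sources, up to the
print(S_hat.shape)                       # permutation/scaling ambiguities
```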

6th Oct 2017 : Partial Least Squares Regression (contd.) by Nisha Meenakshi

Talk summary:

  • Issues in multiple linear regression.
    • The Nonlinear Iterative Partial Least Squares (NIPALS) algorithm.
    • Discussion.

9th June 2017 : Audio-Visual Keyword Spotting by Astha Singh

Talk summary:

  • Introduction.
    • The idea for implementing AV-KWS.
    • Feature extraction: audio and visual.
    • HMM overview.
    • Fusion strategy for the audio and visual modality outputs.
    • Expected results.

2nd June 2017 : Audio-Visual Speech Enhancement by Ajay Mahender Singh

Talk summary:

  • Introduction to the problem statement.
    • Motivation.
    • Initial work: just speech.
    • Feature extraction.
    • The Menpo Project.
    • Visual features.
    • Enhancement techniques.
    • Conclusion and future work.

26th May 2017 : Illumination Variation-Resistant Video-Based Heart Rate Measurement Using Joint Blind Source Separation and Ensemble Empirical Mode Decomposition by Raseena KT

Talk summary:

  • Photoplethysmography.
    • Joint blind source separation.
    • Estimating heart rate from face video.

19th May 2017 : Non-ASR based keyword spotting by Samik Sadhu

Talk summary:

  • Motivation.
    • Recap of Poisson process models in keyword spotting.
    • Discriminative training of Poisson process models in keyword spotting.
    • Unsupervised online learning of Poisson process models.
    • Posteriorgram-filtering based keyword spotting.
    • Future scope of work.

5th May 2017 : Variational RNN by Pavan Karjol

Talk summary:

  • Dynamic Bayesian networks.
    • Recurrent neural networks.
    • Variational recurrent neural networks.
    • Experiments.

28th April 2017 : Finite State Transducers and their Application in KALDI by Avni Rajpal

Talk summary:

  • Motivation.
    • Basic terms and definitions.
    • Operations: particularly composition and determinization.
    • Speech recognition using FSTs.

21st April 2017 : WaveNet: A Generative Model for Raw Audio by Achuth Rao

Talk summary:

  • A recap of generative and discriminative models.
    • Generative models used in speech.
    • Why modeling raw audio directly is difficult.
    • How WaveNet overcomes these difficulties.
    • How WaveNet combines features of both generative and discriminative models.
    • How a single model can be used to solve four different problems in speech: (a) TTS, (b) multi-speaker speech generation, (c) music generation, (d) speech recognition.

14th April 2017 : Video Editing with Blender by Gaurav Fotedar

Talk summary:

  • Introduction to the Blender VSE.
    • Extracting audio from video.
    • Cutting/cropping videos.
    • Replacing audio in a video with audio from another source.
    • Changing frame rates.
    • Adding subtitles.
    • Video overlay.
    • Making compilation videos.

31st March 2017 : Hypothesis Testing by Prasanta Ghosh

Talk summary:

  • Definition.
    • Null and alternative hypotheses.
    • Test procedure.
    • Errors in hypothesis testing.
    • Significance level.
    • Tests about a population mean.
    • Tests concerning a population proportion.
    • P-value (a worked example follows below).
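
As a minimal worked example of these ideas, a one-sample t-test about a population mean using SciPy (illustrative numbers only):

```python
import numpy as np
from scipy import stats

# H0: the population mean is 5.0; H1: it is not
rng = np.random.default_rng(0)
sample = rng.normal(loc=5.3, scale=1.0, size=40)

t_stat, p_value = stats.ttest_1samp(sample, popmean=5.0)
alpha = 0.05   # significance level
print(f"t = {t_stat:.2f}, p = {p_value:.4f}")
print("Reject H0" if p_value < alpha else "Fail to reject H0")
```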

24th March 2017 : Hypothesis Testing by Prasanta Ghosh

Talk summary:

  • Definition.
    • Null and alternative hypotheses.
    • Test procedure.
    • Errors in hypothesis testing.
    • Significance level.
    • Tests about a population mean.
    • Tests concerning a population proportion.
    • P-value.

10th March 2017 : Variational Autoencoder by Pavan Karjol

Talk summary:

  • Introduction.
    • The Stochastic Gradient Variational Bayes (SGVB) estimator.
    • Experiments and conclusion.

24th February 2017 : Mock Presentations

Talk summary:

  • Pitch Prediction from Mel-frequency Cepstral Coefficients Using Sparse Spectrum Recovery by Achuth Rao.
    • A Comparative Study on the Effect of Different Codecs on Speech Recognition Accuracy Using Various Acoustic Modeling Techniques by Nisha Meenakshi.
    • Classification of Healthy Subjects and Patients with Essential Vocal Tremor using Empirical Mode Decomposition of High-Resolution Pitch Contour by Mekhala H S.

17th February 2017 : Automatic detection of syllable stress using sonority based prominence features by Chiranjeevi Yarra

Talk summary:

  • How is sonority useful?
    • Existing works on measuring sonority.
    • Proposed approach.
    • Results.
    • Conclusion.

10th February 2017 : A COMPARATIVE STUDY OF ACOUSTIC-TO-ARTICULATORY INVERSION FOR NEUTRAL AND WHISPERED SPEECH by Aravind Illa

Talk summary:

  • Introduction.
    • Data collection.
    • Experimental set-up.
    • Results.
    • Conclusion.

3rd February 2017 : Automatic detection and diagnosis of phoneme pronunciation quality: a review by Chiranjeevi Yarra

Talk summary:

  • Introduction.
    • Mispronunciation detection.
    • Error diagnosis.
    • Conclusion.

20th January 2017 : Classification of Voluntary Cough Airflow Patterns for Prediction of Abnormal Spirometry by Shivani Yadav

Talk summary:

  • Introduction: what spirometry and its variables are, and the need for automatic classification using cough flow patterns.
    • Study design.
    • Method used.
    • Result.

13th January 2017 : Poisson Process Based Keyword Spotting and its Variants by Samik Sadhu

Talk summary:

  • Introduction.
    • Generating events.
    • Modeling phonetic events.
    • Keyword searching with Poisson process models.
    • A Bayesian approach to training.
    • A better phonetic event selection technique.
    • One of our works on PPMs: discriminative training of PPMs.
    • Receiver Operating Curves (ROC) and Figure of Merit (FOM).
    • Conclusion.

6th January 2017 : The Task Dynamic Model of Speech Production by Nisha Meenakshi

Talk summary:

  • What is articulatory phonology? How do you model the movement of articulators? Applications of the task dynamic model.

30th December 2016 : Degenerate Unmixing Estimation Technique (DUET) by Girija Ramesan Karthik

Talk summary:

  • What does that even mean?
    • W-disjoint orthogonality.
    • Approximate W-disjoint orthogonality of speech.
    • ML parameter estimation for 2-mixture speech separation.
    • The phase wrapping problem.
    • Weighted-histogram based estimators for 2-mixture speech separation.

16th December 2016 : Why does deep and cheap learning work so well? -- Part II by Achuth Rao

Talk summary:

  • We focus our attention on the approximation of radial functions [f(|x|)]. We construct a simple radial function and show how it can be approximated by a three-layer network with complexity poly(d), whereas a two-layer network requires exp(d) units (d is the dimension of the input). We then show that this result generalizes to any radial function.

2nd December 2016 : Associative Networks by Karthik Ramesan

Talk summary:

  • What is association?
    • Types of associative networks.
    • Linear and nonlinear associators.
    • The energy function.
    • Conclusion.

25th November 2016 : Mail server by Kausthubha

Talk summary:

  • Introduction.
    • How a mail server works.
    • SMTP/POP3.
    • Summary.

18th November 2016 : Keyword Spotting in Continuous Speech; An Overview of Different Approaches to Keyword Spotting by Samik Sadhu

Talk summary:

  • Keyword spotting: what is so special about that?
    • Going deep! DNNs, CNNs.
    • Going sparse! Dictionary learning.
    • Going semi(pseudo!)-unsupervised! Query by example.
    • Going completely unsupervised!

11th November 2016 : Automatic Prosodic Event Detection by Vijayakrishna

Talk summary:

  • Introduction.
    • Explanation of the ToBI convention.
    • Existing methods to tackle the prosodic event detection problem.
    • Conclusion and future work.

28th October 2016 : Why does deep and cheap learning work so well? by Achuth Rao

Talk summary:

  • An introduction to neural networks.
    • An overview of the proof of universal approximation by neural networks.
    • A visual proof.
    • How depth helps.

21st October 2016 : Return of the Savitzky-Golay (SG) filters by Nisha Meenakshi

Talk summary:

  • Differentiation filters.
    • The moment preservation property of SG filters.
    • When is an SG filter an optimal filter?

14th October 2016 : Savitzky-Golay (SG) filters by Nisha Meenakshi

Talk summary:

  • Filter formulation.
    • Properties of the SG filter with illustrative examples.
    • Exemplary application: ECG denoising (a usage sketch follows below).
    • Conclusions.
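
For reference, SciPy implements SG smoothing and differentiation directly; a minimal sketch in the spirit of the ECG-denoising application (illustrative signal and parameter values):

```python
import numpy as np
from scipy.signal import savgol_filter

# Noisy signal: a smooth waveform plus white noise
t = np.linspace(0, 1, 500)
x = np.sin(2 * np.pi * 5 * t) + 0.2 * np.random.default_rng(0).normal(size=t.size)

# Local least-squares polynomial smoothing over a sliding window
x_smooth = savgol_filter(x, window_length=31, polyorder=3)

# The same machinery used as a differentiation filter (deriv=1)
dx = savgol_filter(x, window_length=31, polyorder=3, deriv=1, delta=t[1] - t[0])
print(x_smooth[:3], dx[:3])
```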

7th October 2016 : Generalized Triangular Decomposition in Transform coding by Aravind Illa

Talk summary:

  • Theorem definition.
    • The Karhunen-Loeve Transform (KLT).
    • The prediction-based lower triangular transform (PLT).
    • The GTD transform coder.
    • Conclusions.

30th September 2016 : The Perron-Frobenius theorem and its applications -- Part 2 by Pavan Karjol

Talk summary:

  • Theorem definition.
    • Proof of the theorem.
    • Applications involving Markov chains and the PageRank algorithm.
    • Conclusions.

23rd September 2016 : The Perron-Frobenius theorem and its applications by Pavan Karjol

Talk summary:

  • Theorem definition.
    • Proof of the theorem.
    • Applications involving Markov chains and the PageRank algorithm (a toy sketch follows below).
    • Conclusions.
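
The PageRank connection can be sketched concisely: damping makes the Google matrix strictly positive, so the Perron-Frobenius theorem guarantees a unique dominant eigenvector, which power iteration recovers (a toy illustration):

```python
import numpy as np

# Column-stochastic link matrix of a 4-page web
L = np.array([[0,   0,   1, 0.5],
              [1/3, 0,   0, 0  ],
              [1/3, 0.5, 0, 0.5],
              [1/3, 0.5, 0, 0  ]])
d, n = 0.85, L.shape[0]
G = d * L + (1 - d) / n * np.ones((n, n))   # strictly positive => PF applies

r = np.ones(n) / n
for _ in range(100):                        # power iteration
    r = G @ r
    r /= r.sum()
print(np.round(r, 3))                       # the stationary PageRank vector
```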

16th September 2016 : Allpass modeling of phase spectrum of speech signal by Prasanta Ghosh

Talk summary:

  • The significance of the phase spectrum in speech signal processing.
    • Allpass modeling of the phase spectrum of speech.
    • Applications of allpass modeling, including formant tracking and GCI identification.
    • Conclusions.

2nd September 2016 : A General Regression Neural Network by Aravind Illa

Talk summary:

  • Introduction to GRNN.
    • Advantages.
    • Limitations.
    • Conclusion.

19th August 2016 : Linear regression fit under Laplacian noise by Chiranjeevi Yarra

Talk summary:

  • Problem definition.
    • Problem formulation.
    • Solutions under special cases: when C=0, when m=0.
    • Generic solution: alternating minimization, sample-based solution (an illustrative fit follows below).
    • Experimental results.
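
For intuition: under Laplacian noise the maximum-likelihood fit minimizes absolute rather than squared residuals. One standard way to compute such a fit is iteratively reweighted least squares (an illustrative sketch, not the alternating-minimization solution from the talk):

```python
import numpy as np

def lad_fit(x, y, iters=50, eps=1e-6):
    """Least-absolute-deviations line fit (ML under Laplacian noise)
    via iteratively reweighted least squares."""
    Xb = np.c_[x, np.ones(len(x))]           # design matrix with intercept
    w = np.ones(len(y))
    for _ in range(iters):
        W = np.diag(w)
        beta = np.linalg.solve(Xb.T @ W @ Xb, Xb.T @ W @ y)
        w = 1.0 / np.maximum(np.abs(y - Xb @ beta), eps)
    return beta                               # [slope, intercept]

rng = np.random.default_rng(0)
x = np.linspace(0, 10, 200)
y = 2.0 * x + 1.0 + rng.laplace(scale=1.0, size=x.size)
print(np.round(lad_fit(x, y), 2))             # close to [2.0, 1.0]
```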

12th August 2016 : Automatic recognition of social roles using long term role transitions in small group interactions by Gaurav Fotedar

Talk summary:

  • Introduction to roles.
    • Data.
    • Proposed method.
    • Experiments & results.
    • Conclusion & future work.

5th August 2016 : Distributed Maximum Likelihood estimation of GMM parameters by Varsha Satish

Talk summary:

  • Introduction.
    • Basic EM.
    • Distributed EM.
    • Applications of distributed EM.
    • Two distributed EM algorithms.
    • Implementation of the EM algorithm in C.
    • Problems faced while implementing it in C.

29th July 2016 : Classification of Healthy Subjects and Patients with Essential vocal tremor using Empirical Mode Decomposition of High Resolution Pitch Contour by Mekhala

Talk summary:

  • Introduction.
    • Obtaining high-resolution pitch using Glottal Closure Instants (GCIs).
    • Pitch Oscillation Characteristics (POC) extraction using Empirical Mode Decomposition.
    • Experimentation: baseline and evaluation metric.
    • Results and discussion.

29th July 2016 : Audio Visual Synthesis by Valliappan

Talk summary:

  • Introduction to the problem.
    • Dataset (the PRAV corpus).
    • Approaches (dynamic programming and LSTM-RNN).
    • Results.
    • Conclusion and further work.

22nd July 2016 : Music Reconstruction, Separation and Synthesis by Anurendra

Talk summary:

  • Problem formulation.
    • Background.
    • Our new model.
    • Theoretical derivations and solutions.
    • Implementation issues.

15th July 2016 : Finding a relation between acoustic features of speech and head motion of the speaker by Pranav

Talk summary:

  • Introduction.
    • A brief idea about HMMs.
    • Data acquisition and preparation.
    • Different methods previously adopted to cluster head motion.
    • The different tests we performed to check for a relation between speech and head motion.
    • Scope for further research.

11th July 2016 : Implementing an intonation practice environment for voisTUTOR by Anand

Talk summary:

  • Why intonation is important in speech.
    • The methodology undertaken in creating the stylisation (intonation practice).
    • Results.
    • Further work.

8th July 2016 : Sparse modelling of Residual for Vocal tract estimation at high pitch by Chaithya

Talk summary:

  • Introduction.
    • Problem statement.
    • Earlier methods.
    • Sparse modelling of the residual.
    • Properties.
    • The GCI corrective algorithm.

8th July 2016 : Detection and delineation of SLEEP APNEA and HYPOPNEA from EDR-ECG Derived Respiratory Signal by Salma B

Talk summary:

  • Initial work: understanding the ECG, the link between respiration and the ECG, and its mathematical implementation.

4th July 2016 : Identification and labelling of prosodic groups in utterances using ToBI by Vaidhya

Talk summary:

  • What is ToBI?
    • What are its applications?
    • A brief explanation of the four tiers.
    • The tone tier in detail.

4th July 2016 : Carnatic Music app in Android by Priyadarshini S

Talk summary:

  • Different components of Carnatic music.
    • Implementation of various aspects of Carnatic music training.
    • Online feedback for practice sessions.

30th June 2016 : Comparative study of the pulse rate estimation from facial video under different video compression schemes by Paridhi Maheshwari

Talk summary:

  • Prior methods.
    • The sparse spectral peak tracking algorithm.
    • Motivation.
    • Database & recording setups I & II.
    • Results.
    • Conclusion.

30th June 2016 : Glottal source modelling for improving text-to-speech (TTS) systems by Tom Francis

Talk summary:

  • Introduction.
    • A biologically inspired glottal model for TTS.
    • A novel parameterization of the glottal waveform using the beta distribution.
    • Conclusion.

24th June 2016 : Rank-sparsity incoherence for matrix decomposition by Pavan Karjol

Talk summary:

  • Introduction.
    • Problem formulation.
    • Conditions for unique decomposition.
    • Results and conclusions.

17th June 2016 : Speaker Verification by Achuth Rao

Talk summary:

  • Introduction to speaker verification.
    • Models for handling inter-speaker variability.
    • Models for handling inter-session variability.
    • Models that can handle both: the i-vector.

3rd June 2016 : Face and Body Gesture Recognition and Analysis by Dr. Tanaya Guha

Talk summary:

  • We will cover two aspects of gesture understanding - recognition and analysis. In the first part, we will discuss sparse representation-based classification algorithms for recognizing face and body gestures in videos. In the second part, we'll concentrate on analyzing facial gestures of children with autism using motion capture (mocap) data.

27th May 2016 : SPIRE-ABC: An online tool for acoustic-unit boundary correction (ABC) via crowdsourcing by Kausthubha N K

Talk summary:

  • Introduction to annotation.
    • Motivation for annotation.
    • An online tool for annotation (wavesurfer.js) and its limitations.
    • Modifications made to achieve the proposed system.
    • A hands-on session using the online tool.

20th May 2016 : Highs in my Life and my Selling Mishaps by Sanjeev Mittal

Talk summary:

  • SPIRE recipe for super speaking skills to sell your ideas from the stage.
    • Showcase videos: two great public talks based on a similar recipe.
    • Open discussion on the recipe.
    • Open platform: Opportunity for anyone wishing to practice their ideas to utilize the occasion to deliver a flash talk based on the introduced framework.

29th April 2016 : Comparison of acoustic to articulatory inversion of ALS patients and healthy controls by Neha Koundal

Talk summary:

  • Comparison of acoustic to articulatory inversion of ALS patients and healthy controls

16th April 2016 : Acoustic based speech rate estimation using data driven approaches by Chiranjeevi Yarra

Talk summary:

  • Introduction.
    • NMF-based speech rate estimation.
    • A mode-shape based peak detection strategy for speech rate estimation.
    • Experimental results.
    • Conclusions.

17th March 2016 : ICASSP poster talks by Chiranjeevi, Navaneet, Prasanta

Talk summary:

  • A robust speech rate estimation based on the activation profile from the selected acoustic unit dictionary.
    • Multiple spectral peak tracking for heart rate monitoring from the photoplethysmography signal during intensive physical exercise.
    • Better acoustic normalization in subject-independent acoustic-to-articulatory inversion: benefit to recognition.

4th March 2016 : Spatial Hearing by Karthik Ramesan

Talk summary:

  • Introduction to spatial hearing.
    • Cues that help in spatial hearing.
    • Structural approximations for binaural hearing.
    • Conclusions.

19th February 2016 : A model selection approach to audio segmentation via the Bayesian Information Criterion (BIC) by Nisha Meenakshi

Talk summary:

  • Problem 1: audio segmentation.
    • Problem 2: model selection.
    • BIC for model selection (a worked example follows below).
    • How can audio segmentation be viewed as a model selection problem?
    • Literature: BIC in audio segmentation.
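
A minimal worked example of BIC-driven model selection, choosing the number of Gaussian mixture components with scikit-learn (illustrative, not the audio segmentation system itself):

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Data drawn from two well-separated Gaussian clusters
rng = np.random.default_rng(0)
X = np.r_[rng.normal(-2, 1, (300, 1)), rng.normal(3, 1, (300, 1))]

# BIC = -2 log-likelihood + (number of parameters) * log(number of samples);
# pick the model order that minimizes it
bics = {k: GaussianMixture(n_components=k, random_state=0).fit(X).bic(X)
        for k in range(1, 6)}
print(bics)
print("Selected order:", min(bics, key=bics.get))   # expect 2
```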

12th February 2016 : Phase Processing for Single-Channel Speech Enhancement by Pavan Karjol

Talk summary:

  • Introduction.
    • Iterative algorithms for phase estimation.
    • Sinusoidal model-based phase estimation.
    • Group delay and transient processing.
    • The relation between phase and magnitude estimation.
    • Results and conclusion.

5th February 2016 : Speaker Verification methods by Achuth Rao

Talk summary:

  • Introduction to speaker verification.
    • GMM-based methods.
    • GMM-UBM based methods (MAP adaptation).
    • EMAP adaptation.
    • Eigenvoices.

29th January 2016 : HTML & CSS: Building a Static Website by Gaurav Fotedar

Talk summary:

  • HTML structure.
    • HTML basic elements.
    • HTML forms.
    • Basic CSS syntax (inline and file).
    • HTML5 elements.
    • CSS3 elements.

15th January 2016 : SIGNAL SUBSPACE APPROACH FOR SPEECH ENHANCEMENT by Pavan Karjol

Talk summary:

  • Speech enhancement overview.
    • Signal and noise models.
    • Signal and noise subspaces.
    • Linear estimators (TDC and LDC).
    • Results and conclusions.

1st January 2016 : Speech Analysis/Synthesis Based on a Sinusoidal Representation by Aravind Illa

Talk summary:

  • The sinusoidal speech model.
    • Estimation of speech parameters.
    • Frame-to-frame peak matching.
    • The synthesis system.
    • Extension to harmonic models.

11th December 2015 : Voice Conversion by Achuth Rao

Talk summary:

  • Overview.
    • GMM-based voice conversion.
    • Modifications to GMMs for voice conversion.
    • Frequency warping and amplitude scaling for voice conversion.

11th December 2015 : On generative and discriminative models by Prasanta Kumar Ghosh

Talk summary:

  • What is a generative model (including examples)?
    • What is a discriminative model (including examples)?
    • Asymptotic performance of generative and discriminative models.
    • Discriminative training of generative models: blending generative and discriminative models.

20th November 2015 : Blind Source Separation Using Wigner-Ville Distribution (WVD) by Chiranjeevi Yarra

Talk summary:

  • Problem statement.
    • Problem formulation with the WVD.
    • Joint diagonalization.
    • Simulation results.

5th November 2015 : Language models and an introduction to the IRSTLM toolkit by Nisha Meenakshi

Talk summary:

  • What are language models?
    • Where are they used?
    • How does the IRSTLM toolkit perform language modeling?
    • A few examples of IRSTLM implementation.

30th October 2015 : vi basics & survival skills by Sanjeev Mittal

Talk summary:

  • Various states and transitions in vi.
    • Quick basic survival commands.
    • Day-to-day commands.
    • Advanced commands.
    • Pitfalls and troubleshooting.

22nd October 2015 : Understanding OOPS Concepts using Java by Gaurav Fotedar

Talk summary:

  • Classes and objects, polymorphism, inheritance, method and operator override, interfaces and abstract classes, arrays and string class, generics.

16th October 2015 : Robust real-time pulse rate estimation from facial video using sparse spectral peak tracking by Aditya Gaonkar

Talk summary:

  • Estimating pulse rate of subjects from facial videos. Usage of "Independent Component Analysis" in Biomedical Signal Processing. An overview of the proposed method. Discussion on the obtained results.

9th October 2015 : Video Lecture

Talk summary:

  • Finite state transducers.
    • Speech recognition using finite state transducers.

1st October 2015 : Low Rank and Sparse Matrix Decomposition by Jitendra Kumar Dhiman

Talk summary:

  • Type of the problem: Inverse problem. Problem formulation for low-rank and sparse matrix decomposition. Problem solution using the "Augmented Lagrangian Method of Multipliers." Application to musical noise removal for speech signal. Abstract: Inverse problems arise in many applications of science and engineering. There have been several approaches to solving such problems. We will discuss one particular type of inverse problem: "Low-rank and sparse matrix decomposition." In this problem, we are given data in the form of a matrix, which exhibits the property of being the sum of two unknown matrices—one of which is low-rank, and the other is sparse. The goal is to achieve such decomposition of the given matrix. This problem can be solved in an optimization framework. Although many algorithms are available to solve this problem, we will focus on one particular optimization algorithm (Augmented Lagrangian Method). Finally, we will apply the algorithm for musical noise separation from speech signals. In order to separate musical noise from the denoised speech signal, the algorithm exploits the structure of musical noise, which is sparse in the time-frequency domain, and speech, which is low-rank.
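
A compact numpy sketch of the inexact augmented Lagrangian iteration for this decomposition (singular value thresholding for the low-rank part, soft thresholding for the sparse part); the parameter choices follow common defaults and are illustrative only:

```python
import numpy as np

def rpca(D, lam=None, iters=200, tol=1e-7):
    """Decompose D into low-rank L plus sparse S via an inexact
    augmented Lagrangian method (robust PCA)."""
    m, n = D.shape
    lam = lam or 1.0 / np.sqrt(max(m, n))
    S = np.zeros_like(D)
    Y = np.zeros_like(D)
    mu = 1.25 / np.linalg.norm(D, 2)          # 2-norm = largest singular value
    shrink = lambda M, t: np.sign(M) * np.maximum(np.abs(M) - t, 0)
    for _ in range(iters):
        # Singular value thresholding updates the low-rank term
        U, s, Vt = np.linalg.svd(D - S + Y / mu, full_matrices=False)
        L = (U * np.maximum(s - 1.0 / mu, 0)) @ Vt
        # Soft thresholding updates the sparse term
        S = shrink(D - L + Y / mu, lam / mu)
        R = D - L - S                         # residual drives the multiplier
        Y += mu * R
        mu *= 1.5
        if np.linalg.norm(R) / np.linalg.norm(D) < tol:
            break
    return L, S

# Toy test: a rank-1 matrix plus a few large sparse corruptions
rng = np.random.default_rng(0)
L0 = np.outer(rng.normal(size=50), rng.normal(size=40))
S0 = np.zeros((50, 40))
S0[rng.random((50, 40)) < 0.05] = 10.0
L_hat, S_hat = rpca(L0 + S0)
print(np.linalg.norm(L_hat - L0) / np.linalg.norm(L0))  # should be small
```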

25th September 2015 : Estimation of the air-tissue boundaries (ATBs) of the vocal tract in the mid-sagittal plane from electromagnetic articulograph (EMA) data by Pattem Ashok Kumar

Talk summary:

  • Introduction to EMA and real-time magnetic resonance imaging (rtMRI).
    • Co-registration of the EMA data and the ATBs in the rtMRI.
    • Estimation of the ATBs from registered EMA.
    • Results.
    • Discussion on the quality of the estimated ATBs.

18th September 2015 : Speech Beyond Speech - IS2015? by Prasanta Ghosh

Talk summary:

  • SPIRE lab's paper presentations.
    • Some good/significant works.
    • Latest trends.
    • Challenges.
    • New tools/datasets.
    • Dresden.

11th September 2015 : Fundamentals of HMM-based speech synthesis by Achuth Rao

Talk summary:

  • Vocoding techniques: Speech parameter modeling and generation algorithm, spectrum parameter, F0 parameter, context-clustering, advantages, and disadvantages.

4th September 2015 : A discriminative analysis within and across voiced and unvoiced consonants in neutral and whispered speech in multiple Indian languages by Nisha Meenakshi

Talk summary:

  • Typically, voiced consonants are voiceless when whispered, as whispered speech lacks vocal cord vibrations. Therefore, we ask the following questions: Is the discrimination between voiced and unvoiced (V-UV) consonants still preserved in whispered speech? Is the variation of the acoustics from neutral to whispered speech consonant specific? Does language affect V-UV consonant discrimination?

29th August 2015 : Speech recognition using HTK by Amber Afshan

Talk summary:

  • A brief introduction to HMMs and Viterbi recognition.
    • Tutorial on building a basic recognition system using HTK: preparing data, training, recognition.
    • The HLDA transform.
    • Speaker adaptation.
    • GMMs using HTK.

28th August 2015 : Video Lecture

Talk summary:

  • A brief history of speech recognition.
    • The probabilistic approach.
    • Feature extraction, acoustic modelling, language modelling, search.
    • Where we stand.

21st August 2015 : Android App Development by Ataur Rehman

Talk summary:

  • What is Android?
    • Android internals.
    • A briefing about application building blocks.
    • Steps to develop a Hello Android app in Android Studio.
    • Demo of some crazy Android apps.

31st July 2015 : Implementation of Automatic Gender Classification using Normal and Whispered speech in Android by Shashidhar Prabhu

Talk summary:

  • Finding the pitch in neutral speech
    • Recording whisper and finding the MFCC features of the recorded speech
    • SVM modeling
    • Results

31st July 2015 : Classification of Voiced and Unvoiced frames in speech using Periodicity Transforms by Shashidhar Prabhu

Talk summary:

  • Introduction to periodicity transforms
    • Choosing the best algorithm in periodicity transforms for the classification
    • Feature extraction and SVM modeling
    • Results

31st July 2015 : Part-1: Insight into real time magnetic resonance imaging (rtMRI) and its applications towards the understanding of human speech.
Part-2: Automatic classification of eating conditions from speech using acoustic feature selection and a set of hierarchical support vector machine classifiers
by Abhay Prasad

Talk summary:

  • Introduction to rtMRI.
    • Voice activity detection using rtMRI.
    • Broad-class phonetic recognition with rtMRI.
    • Extraction of variant and invariant components of speech and its application towards speaker identification using rtMRI.

3rd July 2015 : An introduction to the nature of whispered speech and an overview of LTLEV as a feature for whisper activity detection (WAD) by Nisha Meenakshi

Talk summary:

  • The differences between the nature of whispered speech and neutral speech will be discussed, with a few illustrations to aid understanding.
    • The use of the newly developed signal characteristic, Long Term Logarithmic Energy Variation (LTLEV), will be explained.
    • The performance assessment of this feature and four other baseline schemes for WAD in the presence of eight different noises.

26th June 2015 : Light on Photoplethysmography signal which requires processing by Vijitha Periyasamy

Talk summary:

  • Insight into Photoplethysmography (PPG)
    • Challenges in processing PPG
    • Paths to analyze PPG
    • Opportunities for contribution in PPG

22nd June 2015 : Exploring the use of discriminative dictionary learning for enhancement of audio in additive magnetic resonance imaging noise by Ataur Rehman

Talk summary:

  • Introduction to MRI recording: advantages and problems (noisy speech recording) with MRI recording.
    • Techniques to enhance the noisy speech recording: NMF-, PLCA-, and DDL-based audio enhancement.
    • Discuss the consequences of the aforementioned algorithms.
    • Conclusions and future work.