


25th Interspeech 2024: Kos, Greece
- Itshak Lapidot, Sharon Gannot:
25th Annual Conference of the International Speech Communication Association, Interspeech 2024, Kos, Greece, September 1-5, 2024. ISCA 2024
Keynote 1 ISCA Medallist
- Isabel Trancoso:
Towards Responsible Speech Processing.
L2 Speech, Bilingualism and Code-Switching
- Sarah Wesolek, Piotr Gulgowski, Joanna Blaszczak, Marzena Zygis:
The influence of L2 accent strength and different error types on personality trait ratings.
- Jie Chi, Electra Wallington, Peter Bell:
Characterizing code-switching: Applying Linguistic Principles for Metric Assessment and Development.
- Wei Xue, Ivan Yuen, Bernd Möbius:
Towards a better understanding of receptive multilingualism: listening conditions and priming effects.
- Debasish Ray Mohapatra, Victor Zappi, Sidney Fels:
2.5D Vocal Tract Modeling: Bridging Low-Dimensional Efficiency with 3D Accuracy.
Speaker Diarization 1
- Anurag Chowdhury, Abhinav Misra, Mark C. Fuhs, Monika Woszczyna:
Investigating Confidence Estimation Measures for Speaker Diarization.
- Xiang Li, Vivek Govindan, Rohit Paturi, Sundararajan Srinivasan:
Speakers Unembedded: Embedding-free Approach to Long-form Neural Diarization.
- Yiling Huang, Weiran Wang, Guanlong Zhao, Hank Liao, Wei Xia, Quan Wang:
On the Success and Limitations of Auxiliary Network Based Word-Level End-to-End Neural Speaker Diarization.
- Marc Härkönen, Samuel J. Broughton, Lahiru Samarakoon:
EEND-M2F: Masked-attention mask transformers for speaker diarization.
- Yongkang Yin, Xu Li, Ying Shan, Yuexian Zou:
AFL-Net: Integrating Audio, Facial, and Lip Modalities with a Two-step Cross-attention for Robust Speaker Diarization in the Wild.
- Arunav Arya, Murtiza Ali, Karan Nathwani:
Exploiting Wavelet Scattering Transform for an Unsupervised Speaker Diarization in Deep Neural Network Framework.
Speech and Audio Analysis and Representations
- Hang Zhao, Yifei Xin, Zhesong Yu, Bilei Zhu, Lu Lu, Zejun Ma:
MINT: Boosting Audio-Language Model via Multi-Target Pre-Training and Instruction Tuning.
- Daisuke Niizumi, Daiki Takeuchi, Yasunori Ohishi, Noboru Harada, Masahiro Yasuda, Shunsuke Tsubaki, Keisuke Imoto:
M2D-CLAP: Masked Modeling Duo Meets CLAP for Learning General-purpose Audio-Language Representation.
- Yusuke Fujita, Tatsuya Komatsu:
Audio Fingerprinting with Holographic Reduced Representations.
- David Meyer, Eitan Abecassis, Clara Fernandez-Labrador, Christopher Schroers:
RAST: A Reference-Audio Synchronization Tool for Dubbed Content.
- Xuefei Li, Hao Huang, Ying Hu, Liang He, Jiabao Zhang, Yuyi Wang:
YOLOPitch: A Time-Frequency Dual-Branch YOLO Model for Pitch Estimation.
- Asad Ullah, Alessandro Ragano, Andrew Hines:
Reduce, Reuse, Recycle: Is Perturbed Data Better than Other Language Augmentation for Low Resource Self-Supervised Speech Models.
- Jaden Pieper, Stephen Voran:
AlignNet: Learning dataset score alignment functions to enable better training of speech quality estimators.
Acoustic Event Detection and Classification 2
- Dawei Liang, Alice Zhang, David Harwath, Edison Thomaz:
Improving Audio Classification with Low-Sampled Microphone Input: An Empirical Study Using Model Self-Distillation.
- Da Mu, Zhicheng Zhang, Haobo Yue:
MFF-EINV2: Multi-scale Feature Fusion across Spectral-Spatial-Temporal Domains for Sound Event Localization and Detection.
- Hyeonuk Nam, Seong-Hu Kim, Deokki Min, Junhyeok Lee, Yong-Hwa Park:
Diversifying and Expanding Frequency-Adaptive Convolution Kernels for Sound Event Detection.
- Tuan Vu Ho, Kota Dohi, Yohei Kawaguchi:
Stream-based Active Learning for Anomalous Sound Detection in Machine Condition Monitoring.
- Anbai Jiang, Bing Han, Zhiqiang Lv, Yufeng Deng, Wei-Qiang Zhang, Xie Chen, Yanmin Qian, Jia Liu, Pingyi Fan:
AnoPatch: Towards Better Consistency in Machine Anomalous Sound Detection.
- Zeyu Xie, Baihan Li, Xuenan Xu, Zheng Liang, Kai Yu, Mengyue Wu:
FakeSound: Deepfake General Audio Detection.
- Shabnam Ghaffarzadegan, Luca Bondi, Wei-Cheng Lin, Abinaya Kumar, Ho-Hsiang Wu, Hans-Georg Horst, Samarjit Das:
Sound of Traffic: A Dataset for Acoustic Traffic Identification and Counting.
Detection and Classification of Bioacoustic Signals
- Sahil Kumar, Jialu Li, Youshan Zhang:
Vision Transformer Segmentation for Visual Bird Sound Denoising.
- Xin Jing, Luyang Zhang, Jiangjian Xie, Alexander Gebhard, Alice Baird, Björn W. Schuller:
DB3V: A Dialect Dominated Dataset of Bird Vocalisation for Cross-corpus Bird Species Recognition.
- Jules Cauzinille, Benoît Favre, Ricard Marxer, Dena J. Clink, Abdul Hamid Ahmad, Arnaud Rey:
Investigating self-supervised speech models' ability to classify animal vocalizations: The case of gibbon's vocal signatures.
- Xihang Qiu, Lixian Zhu, Zikai Song, Zeyu Chen, Haojie Zhang, Kun Qian, Ye Zhang, Bin Hu, Yoshiharu Yamamoto, Björn W. Schuller:
Study Selectively: An Adaptive Knowledge Distillation based on a Voting Network for Heart Sound Classification.
- Jie Lin, Xiuping Yang, Li Xiao, Xinhong Li, Weiyan Yi, Yuhong Yang, Weiping Tu, Xiong Chen:
SimuSOE: A Simulated Snoring Dataset for Obstructive Sleep Apnea-Hypopnea Syndrome Evaluation during Wakefulness.
Acoustic Echo Cancellation
- Premanand Nayak, Kamini Sabu, M. Ali Basha Shaik:
Multi-mic Echo Cancellation Coalesced with Beamforming for Real World Adverse Acoustic Conditions.
- Vahid Khanagha, Dimitris Koutsaidis, Kaustubh Kalgaonkar, Sriram Srinivasan:
Interference Aware Training Target for DNN based joint Acoustic Echo Cancellation and Noise Suppression.
- Yi Gao, Xiang Su:
Low Complexity Echo Delay Estimator Based on Binarized Feature Matching.
- Ye Ni, Cong Pang, Chengwei Huang, Cairong Zou:
MSA-DPCRN: A Multi-Scale Asymmetric Dual-Path Convolution Recurrent Network with Attentional Feature Fusion for Acoustic Echo Cancellation.
- Ofer Schwartz, Sharon Gannot:
Efficient Joint Beamforming and Acoustic Echo Cancellation Structure for Conference Call Scenarios.
- Fei Zhao, Jinjiang Liu, Xueliang Zhang:
SDAEC: Signal Decoupling for Advancing Acoustic Echo Cancellation.
Speech Synthesis: Voice Conversion 1
- Kentaro Seki, Shinnosuke Takamichi, Norihiro Takamune, Yuki Saito, Kanami Imamura, Hiroshi Saruwatari:
Spatial Voice Conversion: Voice Conversion Preserving Spatial Information and Non-target Signals.
- Alan Baade, Puyuan Peng, David Harwath:
Neural Codec Language Models for Disentangled and Textless Voice Conversion.
- Max Morrison, Cameron Churchwell, Nathan Pruyne, Bryan Pardo:
Fine-Grained and Interpretable Neural Speech Editing.
- Takuhiro Kaneko, Hirokazu Kameoka, Kou Tanaka, Yuto Kondo:
FastVoiceGrad: One-step Diffusion-Based Voice Conversion with Adversarial Conditional Diffusion Distillation.
- Ziqian Ning, Shuai Wang, Pengcheng Zhu, Zhichao Wang, Jixun Yao, Lei Xie, Mengxiao Bi:
DualVC 3: Leveraging Language Model Generated Pseudo Context for End-to-end Low Latency Streaming Voice Conversion.
- Tianhua Qi, Shiyan Wang, Cheng Lu, Yan Zhao, Yuan Zong, Wenming Zheng:
Towards Realistic Emotional Voice Conversion using Controllable Emotional Intensity.
Neural Network Architectures for ASR 2
- Yu Nakagome, Michael Hentschel:
InterBiasing: Boost Unseen Word Recognition through Biasing Intermediate Predictions.
- Qinglin Meng, Min Liu, Kaixun Huang, Kun Wei, Lei Xie, Zongfeng Quan, Weihong Deng, Quan Lu, Ning Jiang, Guoqing Zhao:
SEQ-former: A context-enhanced and efficient automatic speech recognition framework.
- Robert Flynn, Anton Ragni:
How Much Context Does My Attention-Based ASR System Need?
- Vincenzo Norman Vitale, Loredana Schettino, Francesco Cutugno:
Rich speech signal: exploring and exploiting end-to-end automatic speech recognizers' ability to model hesitation phenomena.
- Tian-Hao Zhang, Xinyuan Qian, Feng Chen, Xu-Cheng Yin:
Transmitted and Aggregated Self-Attention for Automatic Speech Recognition.
- Darshan Prabhu, Yifan Peng, Preethi Jyothi, Shinji Watanabe:
MULTI-CONVFORMER: Extending Conformer with Multiple Convolution Kernels.
- Koichi Miyazaki, Yoshiki Masuyama, Masato Murata:
Exploring the Capability of Mamba in Speech Applications.
- Genshun Wan, Mengzhi Wang, Tingzhi Mao, Hang Chen, Zhongfu Ye:
Lightweight Transducer Based on Frame-Level Criterion.
- Ankit Gupta, George Saon, Brian Kingsbury:
Exploring the limits of decoder-only models trained on public speech recognition corpora.
- Xun Gong, Anqi Lv, Zhiming Wang, Yanmin Qian:
Contextual Biasing Speech Recognition in Speech-enhanced Large Language Model.
Decoding Algorithms
- Tianzi Wang, Xurong Xie, Zhaoqing Li, Shoukang Hu, Zengrui Jin, Jiajun Deng, Mingyu Cui, Shujie Hu, Mengzhe Geng, Guinan Li, Helen Meng, Xunying Liu:
Towards Effective and Efficient Non-autoregressive Decoding Using Block-based Attention Mask.
- Kun Zou, Fengyun Tan, Ziyang Zhuang, Chenfeng Miao, Tao Wei, Shaodan Zhai, Zijian Li, Wei Hu, Shaojun Wang, Jing Xiao:
E-Paraformer: A Faster and Better Parallel Transformer for Non-autoregressive End-to-End Mandarin Speech Recognition.
- Martino Ciaperoni, Athanasios Katsamanis, Aristides Gionis, Panagiotis Karras:
Beam-search SIEVE for low-memory speech recognition.
- Daniel Galvez, Vladimir Bataev, Hainan Xu, Tim Kaldewey:
Speed of Light Exact Greedy Decoding for RNN-T Speech Recognition Models on GPU.
- Weiran Wang, Zelin Wu, Diamantino Caseiro, Tsendsuren Munkhdalai, Khe Chai Sim, Pat Rondon, Golan Pundak, Gan Song, Rohit Prabhavalkar, Zhong Meng, Ding Zhao, Tara Sainath, Yanzhang He, Pedro Moreno Mengibar:
Contextual Biasing with the Knuth-Morris-Pratt Matching Algorithm.
- Tatsunari Takagi, Yukoh Wakabayashi, Atsunori Ogawa, Norihide Kitaoka:
Text-only Domain Adaptation for CTC-based Speech Recognition through Substitution of Implicit Linguistic Information in the Search Space.
Pronunciation Assessment
- Xintong Wang, Mingqian Shi, Ye Wang:
Pitch-Aware RNN-T for Mandarin Chinese Mispronunciation Detection and Diagnosis.
- Yu-Wen Chen, Zhou Yu, Julia Hirschberg:
MultiPA: A Multi-task Speech Pronunciation Assessment Model for Open Response Scenarios.
- Xinwei Cao, Zijian Fan, Torbjørn Svendsen, Giampiero Salvi:
A Framework for Phoneme-Level Pronunciation Assessment Using CTC.
- Mostafa Shahin, Beena Ahmed:
Phonological-Level Mispronunciation Detection and Diagnosis.
- Heejin Do, Wonjun Lee, Gary Geunbae Lee:
Acoustic Feature Mixup for Balanced Multi-aspect Pronunciation Assessment.
- Nhan Phan, Anna von Zansen, Maria Kautonen, Ekaterina Voskoboinik, Tamás Grósz, Raili Hildén, Mikko Kurimo:
Automated content assessment and feedback for Finnish L2 learners in a picture description speaking task.
Spoken Language Processing
- Zhenyu Wang, Shuyu Kong, Li Wan, Biqiao Zhang, Yiteng Huang, Mumin Jin, Ming Sun, Xin Lei, Zhaojun Yang:
Query-by-Example Keyword Spotting Using Spectral-Temporal Graph Attentive Pooling and Multi-Task Learning.
- Youngmoon Jung, Seungjin Lee, Joon-Young Yang, Jaeyoung Roh, Chang Woo Han, Hoonyoung Cho:
Relational Proxy Loss for Audio-Text based Keyword Spotting.
- Sichen Jin, Youngmoon Jung, Seungjin Lee, Jaeyoung Roh, Changwoo Han, Hoonyoung Cho:
CTC-aligned Audio-Text Embedding for Streaming Open-vocabulary Keyword Spotting.
- Haoyu Li, Baochen Yang, Yu Xi, Linfeng Yu, Tian Tan, Hao Li, Kai Yu:
Text-aware Speech Separation for Multi-talker Keyword Spotting.
- Hao Yen, Pin-Jui Ku, Sabato Marco Siniscalchi, Chin-Hui Lee:
Language-Universal Speech Attributes Modeling for Zero-Shot Multilingual Spoken Keyword Recognition.
- Raul Monteiro:
Adding User Feedback To Enhance CB-Whisper.
- Yifan Peng, Jinchuan Tian, William Chen, Siddhant Arora, Brian Yan, Yui Sudo, Muhammad Shakeel, Kwanghee Choi, Jiatong Shi, Xuankai Chang, Jee-weon Jung, Shinji Watanabe:
OWSM v3.1: Better and Faster Open Whisper-Style Speech Models based on E-Branchformer.
Spoken Machine Translation 2
- Nan Chen, Yonghe Wang, Feilong Bao:
Parameter-Efficient Adapter Based on Pre-trained Models for Speech Translation.
- Badr M. Abdullah, Mohammed Maqsood Shaik, Dietrich Klakow:
Wave to Interlingua: Analyzing Representations of Multilingual Speech Transformers for Spoken Language Translation.
- Nan Chen, Yonghe Wang, Feilong Bao:
Knowledge-Preserving Pluggable Modules for Multilingual Speech Translation Tasks.
- Rastislav Rabatin, Frank Seide, Ernie Chang:
Navigating the Minefield of MT Beam Search in Cascaded Streaming Speech Translation.
- Peidong Wang, Jian Xue, Jinyu Li, Junkun Chen, Aswin Shanmugam Subramanian:
Soft Language Identification for Language-Agnostic Many-to-One End-to-End Speech Translation.
- Dan Oneata, Herman Kamper:
Translating speech with just images.
- Sameer Khurana, Chiori Hori, Antoine Laurent, Gordon Wichern, Jonathan Le Roux:
ZeroST: Zero-Shot Speech Translation.
Biosignal-enabled Spoken Communication
- Jinyu Li, Leonardo Lancia:
A multimodal approach to study the nature of coordinative patterns underlying speech rhythm.
- Peter Wu, Ryan Kaveh, Raghav Nautiyal, Christine Zhang, Albert Guo, Anvitha Kachinthaya, Tavish Mishra, Bohan Yu, Alan W. Black, Rikky Muller, Gopala Krishna Anumanchipalli:
Towards EMG-to-Speech with Necklace Form Factor.
- Chris Bras, Tanvina Patel, Odette Scharenborg:
Using articulated speech EEG signals for imagined speech decoding.
- Jinuk Kwon, David Harwath, Debadatta Dash, Paul Ferrari, Jun Wang:
Direct Speech Synthesis from Non-Invasive, Neuromagnetic Signals.
- Yudong Yang, Rongfeng Su, Rukiye Ruzi, Manwa L. Ng, Shaofeng Zhao, Nan Yan, Lan Wang:
Optical Flow Guided Tongue Trajectory Generation for Diffusion-based Acoustic to Articulatory Inversion.
- Rishi Jain, Bohan Yu, Peter Wu, Tejas S. Prabhune, Gopala Anumanchipalli:
Multimodal Segmentation for Vocal Tract Modeling.
- Jesuraj Bandekar, Sathvik Udupa, Prasanta Kumar Ghosh:
Articulatory synthesis using representations learnt through phonetic label-aware contrastive loss.
- Yujie Yan, Xiran Xu, Haolin Zhu, Pei Tian, Zhongshu Ge, Xihong Wu, Jing Chen:
Auditory Attention Decoding in Four-Talker Environment with EEG.
- Zijie Lin, Tianyu He, Siqi Cai, Haizhou Li:
ASA: An Auditory Spatial Attention Dataset with Multiple Speaking Locations.
- Saurav Pahuja, Gabriel Ivucic, Pascal Himmelmann, Siqi Cai, Tanja Schultz, Haizhou Li:
Leveraging Graphic and Convolutional Neural Networks for Auditory Attention Detection with EEG.
Individual and Social Factors in Phonetics
- Tillmann Pistor, Adrian Leemann:
Echoes of Implicit Bias: Exploring Aesthetics and Social Meanings of Swiss German Dialect Features.
- Vivian Guo Li:
In search of structure and correspondence in intra-speaker trial-to-trial variability.
- Irene Smith, Morgan Sonderegger, The Spade Consortium:
Modelled Multivariate Overlap: A method for measuring vowel merger.
- Keiko Ochi, Koji Inoue, Divesh Lala, Tatsuya Kawahara:
Entrainment Analysis and Prosody Prediction of Subsequent Interlocutor's Backchannels in Dialogue.
- James Tanner, Morgan Sonderegger, Jane Stuart-Smith, Tyler Kendall, Jeff Mielke, Robin Dodsworth, Erik Thomas:
Exploring the anatomy of articulation rate in spontaneous English speech: relationships between utterance length effects and social factors.
- Katelyn Taylor, Amelia Jane Gully, Helena Daffern:
Familiar and Unfamiliar Speaker Identification in Speech and Singing.
Paralinguistics
- Luis Felipe Parra-Gallego, Tilak Purohit, Bogdan Vlasenko, Juan Rafael Orozco-Arroyave, Mathew Magimai-Doss:
Cross-transfer Knowledge between Speech and Text Encoders to Evaluate Customer Satisfaction.
- Manila Kodali, Sudarsana Reddy Kadiri, Paavo Alku:
Fine-tuning of Pre-trained Models for Classification of Vocal Intensity Category from Speech Signals.
- Alexander Kathan, Martin Bürger, Andreas Triantafyllopoulos, Sabrina Milkus, Jonas Hohmann, Pauline Muderlak, Jürgen Schottdorf, Richard Musil, Björn W. Schuller, Shahin Amiriparian:
Real-world PTSD Recognition: A Cross-corpus and Cross-linguistic Evaluation.
- Debasmita Bhattacharya, Eleanor Lin, Run Chen, Julia Hirschberg:
Switching Tongues, Sharing Hearts: Identifying the Relationship between Empathy and Code-switching in Speech.
Speaker Recognition: Adversarial and Spoofing Attacks
- Eros Rosello, Angel M. Gomez, Iván López-Espejo, Antonio M. Peinado, Juan M. Martín-Doñas:
Anti-spoofing Ensembling Model: Dynamic Weight Allocation in Ensemble Models for Improved Voice Biometrics Security.
- Lin Zhang, Xin Wang, Erica Cooper, Mireia Díez, Federico Landini, Nicholas W. D. Evans, Junichi Yamagishi:
Spoof Diarization: "What Spoofed When" in Partially Spoofed Audio.
- Haochen Wu, Wu Guo, Zhentao Zhang, Wenting Zhao, Shengyu Peng, Jie Zhang:
Spoofing Speech Detection by Modeling Local Spectro-Temporal and Long-term Dependency.
- Jingze Lu, Yuxiang Zhang, Zhuo Li, Zengqiang Shang, Wenchao Wang, Pengyuan Zhang:
Improving Copy-Synthesis Anti-Spoofing Training Method with Rhythm and Speaker Perturbation.
- Yip Keng Kan, Ke Xu, Hao Li, Jie Shi:
VoiceDefense: Protecting Automatic Speaker Verification Models Against Black-box Adversarial Attacks.
- Xuanjun Chen, Jiawei Du, Haibin Wu, Jyh-Shing Roger Jang, Hung-yi Lee:
Neural Codec-based Adversarial Sample Detection for Speaker Verification.
- Sizhou Chen, Yibo Bai, Jiadi Yao, Xiao-Lei Zhang, Xuelong Li:
Textual-Driven Adversarial Purification for Speaker Verification.
- Zhuhai Li, Jie Zhang, Wu Guo, Haochen Wu:
Boosting the Transferability of Adversarial Examples with Gradient-Aligned Ensemble Attack for Speaker Recognition.
- Duc-Tuan Truong, Ruijie Tao, Tuan Nguyen, Hieu-Thi Luong, Kong Aik Lee, Eng Siong Chng:
Temporal-Channel Modeling in Multi-head Self-Attention for Synthetic Speech Detection.
Audio Event Detection and Classification 1
- Tiantian Feng, Dimitrios Dimitriadis, Shrikanth S. Narayanan:
Can Synthetic Audio From Generative Foundation Models Assist Audio Recognition and Speech Modeling?
- Heinrich Dinkel, Zhiyong Yan, Yongqing Wang, Junbo Zhang, Yujun Wang, Bin Wang:
Scaling up masked audio encoder learning for general audio classification.
- Sarthak Yadav, Zheng-Hua Tan:
Audio Mamba: Selective State Spaces for Self-Supervised Audio Representations.
- Pengfei Cai, Yan Song, Kang Li, Haoyu Song, Ian McLoughlin:
MAT-SED: A Masked Audio Transformer with Masked-Reconstruction Based Pre-training for Sound Event Detection.
- Janek Ebbers, François G. Germain, Gordon Wichern, Jonathan Le Roux:
Sound Event Bounding Boxes.
- Yanxiong Li, Jiaxin Tan, Guoqing Chen, Jialong Li, Yongjie Si, Qianhua He:
Low-Complexity Acoustic Scene Classification Using Parallel Attention-Convolution Network.
Source Separation 2
- Hassan Taherian, Vahid Ahmadi Kalkhorani, Ashutosh Pandey, Daniel Wong, Buye Xu, DeLiang Wang:
Towards Explainable Monaural Speaker Separation with Auditory-based Training.
- Iva Ewert, Marvin Borsdorf, Haizhou Li, Tanja Schultz:
Does the Lombard Effect Matter in Speech Separation? Introducing the Lombard-GRID-2mix Dataset.
- Zexu Pan, Gordon Wichern, François G. Germain, Kohei Saijo, Jonathan Le Roux:
PARIS: Pseudo-AutoRegressIve Siamese Training for Online Speech Separation.
- Yiru Zhang, Linyu Yao, Qun Yang:
OR-TSE: An Overlap-Robust Speaker Encoder for Target Speech Extraction.
- Tsun-An Hsieh, Heeyoul Choi, Minje Kim:
Multimodal Representation Loss Between Timed Text and Audio for Regularized Speech Separation.
- Jingru Lin, Meng Ge, Junyi Ao, Liqun Deng, Haizhou Li:
SA-WavLM: Speaker-Aware Self-Supervised Pre-training for Mixture Speech.
- Yiwen Wang, Xihong Wu:
TSE-PI: Target Sound Extraction under Reverberant Environments with Pitch Information.
- Kohei Saijo, Gordon Wichern, François G. Germain, Zexu Pan, Jonathan Le Roux:
Enhanced Reverberation as Supervision for Unsupervised Speech Separation.
Noise Reduction, Dereverberation, and Echo Cancellation
- Fei Zhao, Chenggang Zhang, Shulin He, Jinjiang Liu, Xueliang Zhang:
Deep Echo Path Modeling for Acoustic Echo Cancellation.
- Hongmei Guo, Yijiang Chen, Xiaolei Zhang, Xuelong Li:
Graph Attention Based Multi-Channel U-Net for Speech Dereverberation With Ad-Hoc Microphone Arrays.
- Louis Bahrman, Mathieu Fontaine, Jonathan Le Roux, Gaël Richard:
Speech dereverberation constrained on room impulse response characteristics.
- Kuang Yuan, Shuo Han, Swarun Kumar, Bhiksha Raj:
DeWinder: Single-Channel Wind Noise Reduction using Ultrasound Sensing.
- Alexander Barnhill, Elmar Nöth, Andreas K. Maier, Christian Bergler:
ANIMAL-CLEAN - A Deep Denoising Toolkit for Animal-Independent Signal Enhancement.
- Premanand Nayak, M. Ali Basha Shaik:
Elucidating Clock-drift Using Real-world Audios In Wireless Mode For Time-offset Insensitive End-to-End Asynchronous Acoustic Echo Cancellation.
- Shilin Wang, Haixin Guan, Yanhua Long:
QMixCAT: Unsupervised Speech Enhancement Using Quality-guided Signal Mixing and Competitive Alternating Model Training.
Computationally-Efficient Speech Enhancement
- Hanbin Bae, Pavel Andreev, Azat Saginbaev, Nicholas Babaev, Won-Jun Lee, Hosang Sung, Hoon-Young Cho:
Speech Boosting: Low-Latency Live Speech Enhancement for TWS Earbuds.
- Behnam Gholami, Mostafa El-Khamy, Kee-Bong Song:
Knowledge Distillation for Tiny Speech Enhancement with Latent Feature Augmentation.
- Yuewei Zhang, Huanbin Zou, Jie Zhu:
Sub-PNWR: Speech Enhancement Based on Signal Sub-Band Splitting and Pseudo Noisy Waveform Reconstruction Loss.
- Zugang Zhao, Jinghong Zhang, Yonghui Liu, Jianbing Liu, Kai Niu, Zhiqiang He:
Streamlining Speech Enhancement DNNs: an Automated Pruning Method Based on Dependency Graph with Advanced Regularized Loss Strategies.
- Zehua Zhang, Xuyi Zhuang, Yukun Qian, Mingjiang Wang:
Lightweight Dynamic Sparse Transformer for Monaural Speech Enhancement.
- Zizhen Lin, Xiaoting Chen, Junyu Wang:
MUSE: Flexible Voiceprint Receptive Fields and Multi-Path Fusion Enhanced Taylor Transformer for U-Net-based Speech Enhancement.
- Longbiao Cheng, Ashutosh Pandey, Buye Xu, Tobi Delbruck, Shih-Chii Liu:
Dynamic Gated Recurrent Neural Network for Compute-efficient Speech Enhancement.
Zero-shot TTS
- Jinlong Xue, Yayue Deng, Yicheng Han, Yingming Gao, Ya Li:
Improving Audio Codec-based Zero-Shot Text-to-Speech Synthesis with Multi-Modal Context and Large Language Model.
- Xiaofei Wang, Sefik Emre Eskimez, Manthan Thakker, Hemin Yang, Zirun Zhu, Min Tang, Yufei Xia, Jinzhu Li, Sheng Zhao, Jinyu Li, Naoyuki Kanda:
An Investigation of Noise Robustness for Flow-Matching-Based Zero-Shot TTS.
- Kenichi Fujita, Takanori Ashihara, Marc Delcroix, Yusuke Ijima:
Lightweight Zero-shot Text-to-Speech with Mixture of Adapters.
- Vikentii Pankov, Valeria Pronina, Alexander Kuzmin, Maksim Borisov, Nikita Usoltsev, Xingshan Zeng, Alexander Golubkov, Nikolai Ermolenko, Aleksandra Shirshova, Yulia Matveeva:
DINO-VITS: Data-Efficient Zero-Shot TTS with Self-Supervised Speaker Verification Loss for Noise Robustness.
Noise Robustness, Far-Field, and Multi-Talker ASR
- Zengrui Jin, Yifan Yang, Mohan Shi, Wei Kang, Xiaoyu Yang, Zengwei Yao, Fangjun Kuang, Liyong Guo, Lingwei Meng, Long Lin, Yong Xu, Shi-Xiong Zhang, Daniel Povey:
LibriheavyMix: A 20,000-Hour Dataset for Single-Channel Reverberant Multi-Talker Speech Separation, ASR and Speaker Diarization.
- Xujiang Xing, Mingxing Xu, Thomas Fang Zheng:
A Joint Noise Disentanglement and Adversarial Training Framework for Robust Speaker Verification.
- Ying Shi, Lantian Li, Shi Yin, Dong Wang, Jiqing Han:
Serialized Output Training by Learned Dominance.
- Xianrui Zheng, Guangzhi Sun, Chao Zhang, Philip C. Woodland:
SOT Triggered Neural Clustering for Speaker Attributed ASR.
- Yoshiaki Bando, Tomohiko Nakamura, Shinji Watanabe:
Neural Blind Source Separation and Diarization for Distant Speech Recognition.
- Ryo Masumura, Naoki Makishima, Tomohiro Tanaka, Mana Ihori, Naotaka Kawata, Shota Orihashi, Kazutoshi Shinoda, Taiga Yamane, Saki Mizuno, Keita Suzuki, Satoshi Suzuki, Nobukatsu Hojo, Takafumi Moriya, Atsushi Ando:
Unified Multi-Talker ASR with and without Target-speaker Enrollment.
Contextual Biasing and Adaptation
- Aviv Shamsian, Aviv Navon, Neta Glazer, Gill Hetz, Joseph Keshet:
Keyword-Guided Adaptation of Automatic Speech Recognition.
- Nguyen Manh Tien Anh, Thach Ho Sy:
Improving Speech Recognition with Prompt-based Contextualized ASR and LLM-based Re-predictor.
- Peng Wang, Yifan Yang, Zheng Liang, Tian Tan, Shiliang Zhang, Xie Chen:
Incorporating Class-based Language Model for Named Entity Recognition in Factorized Neural Transducer.
- Chengxu Yang, Lin Zheng, Sanli Tian, Gaofeng Cheng, Sujie Xiao, Ta Li:
Contextual Biasing with Confidence-based Homophone Detector for Mandarin End-to-End Speech Recognition.
- Ruizhe Huang, Mahsa Yarmohammadi, Sanjeev Khudanpur, Daniel Povey:
Improving Neural Biasing for Contextual Speech Recognition by Early Context Injection and Text Perturbation.
- Andrei Andrusenko, Aleksandr Laptev, Vladimir Bataev, Vitaly Lavrukhin, Boris Ginsburg:
Fast Context-Biasing for CTC and Transducer ASR models with CTC-based Word Spotter.
- Xizi Wei, Stephen McGregor:
Prompt Tuning for Speech Recognition on Unknown Spoken Name Entities.
- Junzhe Liu, Jianwei Yu, Xie Chen:
Improved Factorized Neural Transducer Model For Text-only Domain Adaptation.
- Pin-Yen Liu, Jen-Tzung Chien:
Modality Translation Learning for Joint Speech-Text Model.
- Qiuming Zhao, Guangzhi Sun, Chao Zhang, Mingxing Xu, Thomas Fang Zheng:
SAML: Speaker Adaptive Mixture of LoRA Experts for End-to-End ASR.
- Atsushi Ando, Takafumi Moriya, Shota Horiguchi, Ryo Masumura:
Factor-Conditioned Speaking-Style Captioning.
- Yerbolat Khassanov, Zhipeng Chen, Tianfeng Chen, Tze Yuang Chong, Wei Li, Jun Zhang, Lu Lu, Yuxuan Wang:
Dual-Pipeline with Low-Rank Adaptation for New Language Integration in Multilingual ASR.
- Bolaji Yusuf, Murali Karthick Baskar, Andrew Rosenberg, Bhuvana Ramabhadran:
Speculative Speech Recognition by Audio-Prefixed Low-Rank Adaptation of Language Models.
- Junghun Kim, Ka Hyun Park, Hoyoung Yoon, U Kang:
Domain-Aware Data Selection for Speech Classification via Meta-Reweighting.
Spoken Language Understanding
- Hayato Futami, Siddhant Arora, Yosuke Kashiwagi, Emiru Tsunoo, Shinji Watanabe:
Finding Task-specific Subnetworks in Multi-task Spoken Language Understanding Model.
- Dejan Porjazovski, Anssi Moisio, Mikko Kurimo:
Out-of-distribution generalisation in spoken language understanding.
- Gaëlle Laperrière, Sahar Ghannay, Bassam Jabaian, Yannick Estève:
A dual task learning approach to fine-tune a multilingual semantic speech encoder for Spoken Language Understanding.
- Beomseok Lee, Ioan Calapodescu, Marco Gaido, Matteo Negri, Laurent Besacier:
Speech-MASSIVE: A Multilingual Speech Dataset for SLU and Beyond.
- Yuang Li, Jiawei Yu, Min Zhang, Mengxin Ren, Yanqing Zhao, Xiaofeng Zhao, Shimin Tao, Jinsong Su, Hao Yang:
Using Large Language Model for End-to-End Chinese ASR and NER.
- Alkis Koudounas, Flavio Giobergia, Eliana Pastor, Elena Baralis:
A Contrastive Learning Approach to Mitigate Bias in Speech Models.
Spoken Machine Translation 1
- Chao-Wei Huang, Hui Lu, Hongyu Gong, Hirofumi Inaguma, Ilia Kulikov, Ruslan Mavlyutov, Sravya Popuri:
Investigating Decoder-only Large Language Models for Speech-to-text Translation.
- Nameer Hirschkind, Xiao Yu, Mahesh Kumar Nandwana, Joseph Liu, Eloi du Bois, Dao Le, Nicolas Thiebaut, Colin Sinclair, Kyle Spence, Charles Shang, Zoë Abrams, Morgan McGuire:
Diffusion Synthesizer for Efficient Multilingual Speech to Speech Translation.
- Nan Chen, Yonghe Wang, Feilong Bao:
Sign Value Constraint Decomposition for Efficient 1-Bit Quantization of Speech Translation Tasks.
- Jaesong Lee, Soyoon Kim, Hanbyul Kim, Joon Son Chung:
Lightweight Audio Segmentation for Long-form Speech Translation.
- Haotian Tan, Sakriani Sakti:
Contrastive Feedback Mechanism for Simultaneous Speech Translation.
- Cécile Macaire, Chloé Dion, Didier Schwab, Benjamin Lecouteux, Emmanuelle Esperança-Rodier:
Towards Speech-to-Pictograms Translation.
Hearing Disorders
- Seonwoo Lee, Sunhee Kim, Minhwa Chung:
Automatic Assessment of Speech Production Skills for Children with Cochlear Implants Using Wav2Vec2.0 Acoustic Embeddings.
- Youngjin Ahn, Jungwoo Park, Sangha Park, Jonghyun Choi, Kee-Eung Kim:
SyncVSR: Data-Efficient Visual Speech Recognition with End-to-End Crossmodal Audio Token Synchronization.
- Mark A. Huckvale, Gaston Hilkhuysen:
Evaluating a 3-factor listener model for prediction of speech intelligibility to hearing-impaired listeners.
- Sophie Fagniart, Brigitte Charlier, Véronique Delvaux, Bernard Harmegnies, Anne Huberlant, Myriam Piccaluga, Kathy Huet:
Production of fricative consonants in French-speaking children with cochlear implants and typical hearing: acoustic and phonological analyses.
- Toshio Irino, Shintaro Doan, Minami Ishikawa:
Signal processing algorithm effective for sound quality of hearing loss simulators.
- Yixiang Niu, Ning Chen, Hongqing Zhu, Zhiying Zhu, Guangqiang Li, Yibo Chen:
Auditory Spatial Attention Detection Based on Feature Disentanglement and Brain Connectivity-Informed Graph Neural Networks.
- Jessica Monaghan, Arun Sebastian, Nicky Chong-White, Vicky Zhang, Vijayalakshmi Easwar, Pádraig Kitterick:
Automatic Detection of Hearing Loss from Children's Speech using wav2vec 2.0 Features.
Speech Disorders 2
- Vrushank Changawala, Frank Rudzicz:
Whister: Using Whisper's representations for Stuttering detection.
- Yan Xiong, Visar Berisha, Julie Liss, Chaitali Chakrabarti:
Improving Speech-Based Dysarthria Detection using Multi-task Learning with Gradient Projection.
- Guanlin Chen, Yun Jin:
Cascaded Transfer Learning Strategy for Cross-Domain Alzheimer's Disease Recognition through Spontaneous Speech.
- Loukas Ilias, Dimitris Askounis:
A Cross-Attention Layer coupled with Multimodal Fusion Methods for Recognizing Depression from Spontaneous Speech.
- Si-Ioi Ng, Lingfeng Xu, Kimberly D. Mueller, Julie Liss, Visar Berisha:
Segmental and Suprasegmental Speech Foundation Models for Classifying Cognitive Risk Factors: Evaluating Out-of-the-Box Performance.
- Katerina Papadimitriou, Gerasimos Potamianos:
Multimodal Continuous Fingerspelling Recognition via Visual Alignment Learning.
- Tomás Arias-Vergara, Paula Andrea Pérez-Toro, Xiaofeng Liu, Fangxu Xing, Maureen Stone, Jiachen Zhuo, Jerry L. Prince, Maria Schuster, Elmar Nöth, Jonghye Woo, Andreas K. Maier:
Contrastive Learning Approach for Assessment of Phonological Precision in Patients with Tongue Cancer Using MRI Data.
- Haojie Zhang, Tao Zhang, Ganjun Liu, Dehui Fu, Xiaohui Hou, Ying Lv:
DysArinVox: DYSphonia & DYSarthria mandARIN speech corpus.
- Xuanru Zhou, Anshul Kashyap, Steve Li, Ayati Sharma, Brittany Morin, David Baquirin, Jet Vonk, Zoe Ezzes, Zachary Miller, Maria Luisa Gorno-Tempini, Jiachen Lian, Gopala Anumanchipalli:
YOLO-Stutter: End-to-end Region-Wise Speech Dysfluency Detection.
- Gábor Gosztolya, Veronika Svindt, Judit Bóna, Ildikó Hoffmann:
Automatic Longitudinal Investigation of Multiple Sclerosis Subjects.
TAUKADIAL Challenge: Speech-Based Cognitive Assessment in Chinese and English (Special Session)
- Saturnino Luz, Sofia de la Fuente Garcia, Fasih Haider, Davida Fromm, Brian MacWhinney, Alyssa Lanzi, Ya-Ning Chang, Chia-Ju Chou, Yi-Chien Liu:
Connected Speech-Based Cognitive Assessment in Chinese and English.
- David Ortiz-Perez, José García Rodríguez, David Tomás:
Cognitive Insights Across Languages: Enhancing Multimodal Interview Analysis.
- Gábor Gosztolya, László Tóth:
Combining Acoustic Feature Sets for Detecting Mild Cognitive Impairment in the Interspeech'24 TAUKADIAL Challenge.
- Junwen Duan, Fangyuan Wei, Hong-Dong Li, Jin Liu:
Pre-trained Feature Fusion and Matching for Mild Cognitive Impairment Detection.
- Benjamin Barrera-Altuna, Daeun Lee, Zaima Zarnaz, Jinyoung Han, Seungbae Kim:
The Interspeech 2024 TAUKADIAL Challenge: Multilingual Mild Cognitive Impairment Detection with Multimodal Approach.
- Anna Favaro, Tianyu Cao, Najim Dehak, Laureano Moro-Velázquez:
Leveraging Universal Speech Representations for Detecting and Assessing the Severity of Mild Cognitive Impairment Across Languages.
- Bao Hoang, Yijiang Pang, Hiroko H. Dodge, Jiayu Zhou:
Translingual Language Markers for Cognitive Assessment from Spontaneous Speech.
- Paula Andrea Pérez-Toro, Tomás Arias-Vergara, Philipp Klumpp, Tobias Weise, Maria Schuster, Elmar Nöth, Juan Rafael Orozco-Arroyave, Andreas K. Maier:
Multilingual Speech and Language Analysis for the Assessment of Mild Cognitive Impairment: Outcomes from the Taukadial Challenge.
Show and Tell 1
- Takayuki Arai, Ryohei Suzuki, Chandler Earp, Shinya Tsuji, Keiko Ochi:
Production of phrases by mechanical models of the human vocal tract. - Vishal Gourav, Ankit Tyagi, Phanindra Mankale:
Faster Vocoder: a multi threading approach to achieve low latency during TTS Inference. - Aanchan Mohan, Monideep Chakraborti, Katelyn Eng, Nailia Kushaeva, Mirjana Prpa, Jordan Lewis, Tianyi Zhang, Vince Geisler, Carol Geisler:
A powerful and modern AAC composition tool for impaired speakers. - Grzegorz P. Mika, Konrad Zielinski, Pawel Cyrta, Marek Grzelec:
VoxFlow AI: wearable voice converter for atypical speech. - Sai Akarsh C, Vamshiraghusimha Narasinga, Anil Kumar Vuppala:
Stress transfer in speech-to-speech machine translation. - Takuma Okamoto, Yamato Ohtani, Hisashi Kawai:
Mobile PresenTra: NICT fast neural text-to-speech system on smartphones with incremental inference of MS-FC-HiFi-GAN for low-latency synthesis. - Alex Peiró Lilja, José Giraldo, Martí Llopart-Font, Carme Armentano-Oller, Baybars Külebi, Mireia Farrús:
Multi-speaker and multi-dialectal Catalan TTS models for video gaming. - Juliana Francis, Éva Székely, Joakim Gustafson:
ConnecTone: a modular AAC system prototype with contextual generative text prediction and style-adaptive conversational TTS. - Mahdin Rohmatillah, Bryan Gautama Ngo, Willianto Sulaiman, Po-Chuan Chen, Jen-Tzung Chien:
Reliable dialogue system for facilitating student-counselor communication. - Harm Lameris, Joakim Gustafson, Éva Székely:
CreakVC: a voice conversion tool for modulating creaky voice. - Yu-Sheng Tsao, Yung-Chang Hsu, Jiun-Ting Li, Siang-Hong Weng, Tien-Hong Lo, Berlin Chen:
EZTalking: English assessment platform for teachers and students.
Keynote 2
- Shoko Araki:
Frontier of Frontend for Conversational Speech Processing.
Phonetics and Phonology of Second Language Acquisition
- Paige Tuttösí, H. Henny Yeung, Yue Wang, Fenqi Wang, Guillaume Denis, Jean-Julien Aucouturier, Angelica Lim:
Mmm whatcha say? Uncovering distal and proximal context effects in first and second-language word perception using psychophysical reverse correlation. - Anisia Popescu, Lori Lamel, Ioana Vasilescu, Laurence Devillers:
Automatic Speech Recognition with parallel L1 and L2 acoustic phone models to evaluate /l/ allophony in L2 English speech production. - Kevin Huang, Jack Goldberg, Louis Goldstein, Shrikanth Narayanan:
Analysis of articulatory setting for L1 and L2 English speakers using MRI data. - Ioana Colgiu, Laura Spinu, Rajiv Rao, Yasaman Rafat:
Bilingual Rhotic Production Patterns: A Generational Comparison of Spanish-English Bilingual Speakers in Canada. - Sylvain Coulange, Tsuneo Kato, Solange Rossato, Monica Masperi:
Exploring Impact of Pausing and Lexical Stress Patterns on L2 English Comprehensibility in Real Time. - Qi Wu:
Mandarin T3 Production by Chinese and Japanese Native Speakers.
Corpora-based Approaches in Automatic Emotion Recognition
- Sumit Ranjan, Rupayan Chakraborty, Sunil Kumar Kopparapu:
Reinforcement Learning based Data Augmentation for Noise Robust Speech Emotion Recognition. - Pravin Mote, Berrak Sisman, Carlos Busso:
Unsupervised Domain Adaptation for Speech Emotion Recognition using K-Nearest Neighbors Voice Conversion. - Jincen Wang, Yan Zhao, Cheng Lu, Hailun Lian, Hongli Chang, Yuan Zong, Wenming Zheng:
Confidence-aware Hypothesis Transfer Networks for Source-Free Cross-Corpus Speech Emotion Recognition. - Yuxuan Xi, Yan Song, Lirong Dai, Haoyu Song, Ian McLoughlin:
An Effective Local Prototypical Mapping Network for Speech Emotion Recognition. - Yuan Gao, Hao Shi, Chenhui Chu, Tatsuya Kawahara:
Speech Emotion Recognition with Multi-level Acoustic and Semantic Information Extraction and Interaction.
Analysis of Speaker States and Traits
- Oliver Niebuhr, Nafiseh Taghva:
How rhythm metrics are linked to produced and perceived speaker charisma. - Zhu Li, Xiyuan Gao, Yuqing Zhang, Shekhar Nayak, Matt Coler:
A Functional Trade-off between Prosodic and Semantic Cues in Conveying Sarcasm. - John Murzaku, Adil Soubki, Owen Rambow:
Multimodal Belief Prediction. - Run Chen, Haozhe Chen, Anushka Kulkarni, Eleanor Lin, Linda Pang, Divya Tadimeti, Jun Shin, Julia Hirschberg:
Detecting Empathy in Speech. - Dehua Tao, Tan Lee, Harold Chui, Sarah Luk:
Learning Representation of Therapist Empathy in Counseling Conversation Using Siamese Hierarchical Attention Network. - Han Kunmei:
Modelling Lexical Characteristics of the Healthy Aging Population: A Corpus-Based Study. - Maurice Gerczuk, Shahin Amiriparian, Justina Lutz, Wolfgang Strube, Irina Papazova, Alkomiet Hasan, Björn W. Schuller:
Exploring Gender-Specific Speech Patterns in Automatic Suicide Risk Assessment.
Spoofing and Deepfake Detection
- Nicholas Klein, Tianxiang Chen, Hemlata Tak, Ricardo Casal, Elie Khoury:
Source Tracing of Audio Deepfake Systems. - Tianchi Liu, Lin Zhang, Rohan Kumar Das, Yi Ma, Ruijie Tao, Haizhou Li:
How Do Neural Spoofing Countermeasures Detect Partially Spoofed Audio? - Xin Wang, Tomi Kinnunen, Kong Aik Lee, Paul-Gauthier Noé, Junichi Yamagishi:
Revisiting and Improving Scoring Fusion for Spoofing-aware Speaker Verification Using Compositional Data Analysis. - Oguzhan Baser, Kaan Kale, Sandeep P. Chinchali:
SecureSpectra: Safeguarding Digital Identity from Deep Fake Threats via Intelligent Signatures. - Menglu Li, Xiao-Ping Zhang:
Interpretable Temporal Class Activation Representation for Audio Spoofing Detection. - Zirui Ge, Xinzhou Xu, Haiyan Guo, Tingting Wang, Zhen Yang, Björn W. Schuller:
DGPN: A Dual Graph Prototypical Network for Few-Shot Speech Spoofing Algorithm Recognition.
Audio Captioning, Tagging, and Audio-Text Retrieval
- Jianyuan Sun, Wenwu Wang, Mark D. Plumbley:
PFCA-Net: Pyramid Feature Fusion and Cross Content Attention Network for Automated Audio Captioning. - Jizhong Liu, Gang Li, Junbo Zhang, Heinrich Dinkel, Yongqing Wang, Zhiyong Yan, Yujun Wang, Bin Wang:
Enhancing Automated Audio Captioning via Large Language Models with Optimized Audio Encoding. - Yifei Xin, Zhihong Zhu, Xuxin Cheng, Xusheng Yang, Yuexian Zou:
Audio-text Retrieval with Transformer-based Hierarchical Alignment and Disentangled Cross-modal Representation. - Heinrich Dinkel, Zhiyong Yan, Yongqing Wang, Junbo Zhang, Yujun Wang, Bin Wang:
Streaming Audio Transformers for Online Audio Tagging. - Aryan Chaudhary, Arshdeep Singh, Vinayak Abrol, Mark D. Plumbley:
Efficient CNNs with Quaternion Transformations and Pruning for Audio Tagging. - Xin Jing, Andreas Triantafyllopoulos, Björn W. Schuller:
ParaCLAP - Towards a general language-audio model for computational paralinguistic tasks. - Xuenan Xu, Haohe Liu, Mengyue Wu, Wenwu Wang, Mark D. Plumbley:
Efficient Audio Captioning with Encoder-Level Knowledge Distillation.
Generative Speech Enhancement
- Robin Scheibler, Yusuke Fujita, Yuma Shirahata, Tatsuya Komatsu:
Universal Score-based Speech Enhancement with High Content Preservation. - Haici Yang, Jiaqi Su, Minje Kim, Zeyu Jin:
Genhancer: High-Fidelity Speech Enhancement via Generative Modeling on Discrete Codec Tokens. - Ante Jukic, Roman Korostik, Jagadeesh Balam, Boris Ginsburg:
Schrödinger Bridge for Generative Speech Enhancement. - Thanapat Trachu, Chawan Piansaddhayanon, Ekapol Chuangsuwanich:
Thunder: Unified Regression-Diffusion Speech Enhancement with a Single Reverse Step using Brownian Bridge. - Yiyuan Yang, Niki Trigoni, Andrew Markham:
Pre-training Feature Guided Diffusion Model for Speech Enhancement. - Dail Kim, Da-Hee Yang, Donghyun Kim, Joon-Hyuk Chang, Jeonghwan Choi, Moa Lee, Jaemo Yang, Han-gil Moon:
Guided conditioning with predictive network on score-based diffusion model for speech enhancement.
Speech Synthesis: Evaluation
- Chun Yin, Tai-Shih Chi, Yu Tsao, Hsin-Min Wang:
SVSNet+: Enhancing Speaker Voice Similarity Assessment Models with Representations from Speech Foundation Models. - Srija Anand, Praveen Srinivasa Varadhan, Ashwin Sankar, Giri Raju, Mitesh M. Khapra:
Enhancing Out-of-Vocabulary Performance of Indian TTS Systems for Practical Applications through Low-Effort Data Strategies. - Jens Edlund, Christina Tånnander, Sébastien Le Maguer, Petra Wagner:
Assessing the impact of contextual framing on subjective TTS quality. - Adaeze Adigwe, Sarenne Wallbridge, Simon King:
What do people hear? Listeners' Perception of Conversational Speech. - Hui Wang, Shiwan Zhao, Jiaming Zhou, Xiguang Zheng, Haoqin Sun, Xuechen Wang, Yong Qin:
Uncertainty-Aware Mean Opinion Score Prediction. - Félix Saget, Meysam Shamsi, Marie Tahon:
Lifelong Learning MOS Prediction for Synthetic Speech Quality Evaluation.
Multilingual ASR
- Kwok Chin Yuen, Jia Qi Yip, Eng Siong Chng:
Continual Learning Optimizations for Auto-regressive Decoder of Multilingual ASR systems. - Jiatong Shi, Shih-Heng Wang, William Chen, Martijn Bartelds, Vanya Bannihatti Kumar, Jinchuan Tian, Xuankai Chang, Dan Jurafsky, Karen Livescu, Hung-yi Lee, Shinji Watanabe:
ML-SUPERB 2.0: Benchmarking Multilingual Speech Models Across Modeling Constraints, Languages, and Datasets. - Andrés Piñeiro Martín, Carmen García-Mateo, Laura Docío Fernández, Maria del Carmen Lopez Perez, Georg Rehm:
Weighted Cross-entropy for Low-Resource Languages in Multilingual Speech Recognition. - A. F. M. Saif, Lisha Chen, Xiaodong Cui, Songtao Lu, Brian Kingsbury, Tianyi Chen:
M2ASR: Multilingual Multi-task Automatic Speech Recognition via Multi-objective Optimization. - Song Li, Yongbin You, Xuezhi Wang, Zhengkun Tian, Ke Ding, Guanglu Wan:
MSR-86K: An Evolving, Multilingual Corpus with 86,300 Hours of Transcribed Audio for Speech Recognition Research. - Brady Houston, Omid Sadjadi, Zejiang Hou, Srikanth Vishnubhotla, Kyu J. Han:
Improving Multilingual ASR Robustness to Errors in Language Input.
General Topics in ASR
- Jiwon Suh, Injae Na, Woohwan Jung:
Improving Domain-Specific ASR with LLM-Generated Contextual Descriptions. - Yuang Li, Min Zhang, Chang Su, Yinglu Li, Xiaosong Qiao, Mengxin Ren, Miaomiao Ma, Daimeng Wei, Shimin Tao, Hao Yang:
A Multitask Training Approach to Enhance Whisper with Open-Vocabulary Keyword Spotting. - Mario Zusag, Laurin Wagner, Bernhard Thallinger:
CrisperWhisper: Accurate Timestamps on Verbatim Speech Transcriptions. - Péter Mihajlik, Yan Meng, Mate S. Kadar, Julian Linke, Barbara Schuppler, Katalin Mády:
On Disfluency and Non-lexical Sound Labeling for End-to-end Automatic Speech Recognition. - Dena F. Mujtaba, Nihar R. Mahapatra, Megan Arney, J. Scott Yaruss, Caryn Herring, Jia Bin:
Inclusive ASR for Disfluent Speech: Cascaded Large-Scale Self-Supervised Learning with Targeted Fine-Tuning and Data Augmentation. - Hao Tan, Xiaochen Liu, Huan Zhang, Junjian Zhang, Yaguan Qian, Zhaoquan Gu:
DualPure: An Efficient Adversarial Purification Method for Speech Command Recognition. - Jan Lehecka, Josef V. Psutka, Lubos Smídl, Pavel Ircing, Josef Psutka:
A Comparative Analysis of Bilingual and Trilingual Wav2Vec Models for Automatic Speech Recognition in Multilingual Oral History Archives. - Antón de la Fuente, Dan Jurafsky:
A layer-wise analysis of Mandarin and English suprasegmentals in SSL speech models. - Spyretta Leivaditi, Tatsunari Matsushima, Matt Coler, Shekhar Nayak, Vass Verkhodanova:
Fine-Tuning Strategies for Dutch Dysarthric Speech Recognition: Evaluating the Impact of Healthy, Disease-Specific, and Speaker-Specific Data. - I-Ting Hsieh, Chung-Hsien Wu:
Dysarthric Speech Recognition Using Curriculum Learning and Articulatory Feature Embedding. - Shiyao Wang, Shiwan Zhao, Jiaming Zhou, Aobo Kong, Yong Qin:
Enhancing Dysarthric Speech Recognition for Unseen Speakers via Prototype-Based Adaptation. - Naijun Zheng, Xucheng Wan, Kai Liu, Ziqing Du, Huan Zhou:
An efficient text augmentation approach for contextualized Mandarin speech recognition. - Sheng Li, Chen Chen, Kwok Chin Yuen, Chenhui Chu, Eng Siong Chng, Hisashi Kawai:
Investigating ASR Error Correction with Large Language Model and Multilingual 1-best Hypotheses. - Lun Wang, Om Thakkar, Zhong Meng, Nicole Rafidi, Rohit Prabhavalkar, Arun Narayanan:
Efficiently Train ASR Models that Memorize Less and Perform Better with Per-core Clipping.
Spoken Language Understanding
- Emmy Phung, Harsh Deshpande, Ahmad Emami, Kanishk Singh:
AR-NLU: A Framework for Enhancing Natural Language Understanding Model Robustness against ASR Errors. - Mohan Li, Simon Keizer, Rama Doddipatla:
Prompting Whisper for QA-driven Zero-shot End-to-end Spoken Language Understanding. - Tuyen Tran, Khanh Le, Ngoc Dang Nguyen, Minh Vu, Huyen Ngo, Woomyoung Park, Thi Thu Trang Nguyen:
VN-SLU: A Vietnamese Spoken Language Understanding Dataset. - Shunsuke Kando, Yusuke Miyao, Jason Naradowsky, Shinnosuke Takamichi:
Textless Dependency Parsing by Labeled Sequence Prediction. - Yaoyao Yue, Michael Proctor, Luping Zhou, Rijul Gupta, Tharinda Piyadasa, Amelia Gully, Kirrie Ballard, Craig T. Jin:
Towards Speech Classification from Acoustic and Vocal Tract data in Real-time MRI. - Alexander Johnson, Peter Plantinga, Pheobe Sun, Swaroop Gadiyaram, Abenezer Girma, Ahmad Emami:
Efficient SQA from Long Audio Contexts: A Policy-driven Approach.
Speech and Multimodal Resources
- Jan Pesán, Vojtech Jurík, Martin Karafiát, Jan Cernocký:
BESST Dataset: A Multimodal Resource for Speech-based Stress Detection and Analysis. - Arnon Turetzky, Or Tal, Yael Segal, Yehoshua Dissen, Ella Zeldes, Amit Roth, Eyal Cohen, Yosi Shrem, Bronya Roni Chernyak, Olga Seleznova, Joseph Keshet, Yossi Adi:
HebDB: a Weakly Supervised Dataset for Hebrew Speech Processing. - Wenbin Wang, Yang Song, Sanjay Jha:
GLOBE: A High-quality English Corpus with Global Accents for Zero-shot Speaker Adaptive Text-to-Speech. - Yuexuan Kong, Viet-Anh Tran, Romain Hennequin:
STraDa: A Singer Traits Dataset. - Katharina Anderer, Andreas Reich, Matthias Wölfel:
MaViLS, a Benchmark Dataset for Video-to-Slide Alignment, Assessing Baseline Accuracy with a Multimodal Alignment Algorithm Leveraging Speech, OCR, and Visual Features. - Kim Sung-Bin, Lee Chae-Yeon, Gihun Son, Oh Hyun-Bin, Janghoon Ju, Suekyeong Nam, Tae-Hyun Oh:
MultiTalk: Enhancing 3D Talking Head Generation Across Languages with Multilingual Video Dataset. - Irina-Elena Veliche, Zhuangqun Huang, Vineeth Ayyat Kochaniyan, Fuchun Peng, Ozlem Kalinli, Michael L. Seltzer:
Towards measuring fairness in speech recognition: Fair-Speech dataset. - Yi Lu, Yuankun Xie, Ruibo Fu, Zhengqi Wen, Jianhua Tao, Zhiyong Wang, Xin Qi, Xuefei Liu, Yongwei Li, Yukun Liu, Xiaopeng Wang, Shuchen Shi:
Codecfake: An Initial Dataset for Detecting LLM-based Deepfake Audio. - Mohamed Osman, Daniel Z. Kaplan, Tamer Nadeem:
SER Evals: In-domain and Out-of-domain benchmarking for speech emotion recognition.
Pathological Speech Analysis 1
- Vidar Freyr Gudmundsson, Keve Márton Gönczi, Malin Svensson Lundmark, Donna Erickson, Oliver Niebuhr:
The MARRYS helmet: A new device for researching and training "jaw dancing". - Moreno La Quatra, Maria Francesca Turco, Torbjørn Svendsen, Giampiero Salvi, Juan Rafael Orozco-Arroyave, Sabato Marco Siniscalchi:
Exploiting Foundation Models and Speech Enhancement for Parkinson's Disease Detection from Speech in Real-World Operative Conditions. - Andreas Triantafyllopoulos, Anton Batliner, Wolfgang Mayr, Markus Fendler, Florian B. Pokorny, Maurice Gerczuk, Shahin Amiriparian, Thomas M. Berghaus, Björn W. Schuller:
Sustained Vowels for Pre- vs Post-Treatment COPD Classification. - Mahdi Amiri, Ina Kodrasi:
Adversarial Robustness Analysis in Automatic Pathological Speech Detection Approaches. - Gahye Kim, Yunjung Eom, Selina S. Sung, Seunghee Ha, Tae-Jin Yoon, Jungmin So:
Automatic Children Speech Sound Disorder Detection with Age and Speaker Bias Mitigation.
Speech and Language in Health: from Remote Monitoring to Medical Conversations - 1 (Special Session)
- Mojtaba Kadkhodaie Elyaderani, John Glover, Thomas Schaaf:
Reference-Free Estimation of the Quality of Clinical Notes Generated from Doctor-Patient Conversations. - Jihyun Mun, Sunhee Kim, Minhwa Chung:
Developing an End-to-End Framework for Predicting the Social Communication Severity Scores of Children with Autism Spectrum Disorder. - Vladimir Despotovic, Abir Elbéji, Petr V. Nazarov, Guy Fagherazzi:
Multimodal Fusion for Vocal Biomarkers Using Vector Cross-Attention. - Stefano Goria, Roseline Polle, Salvatore Fara, Nicholas Cummins:
Revealing Confounding Biases: A Novel Benchmarking Approach for Aggregate-Level Performance Metrics in Health Assessments. - Anaïs Rameau, Satrajit Ghosh, Alexandros Sigaras, Olivier Elemento, Jean-Christophe Bélisle-Pipon, Vardit Ravitsky, Maria Powell, Alistair Johnson, David Dorr, Philip R. O. Payne, Micah Boyer, Stephanie Watts, Ruth Bahr, Frank Rudzicz, Jordan Lerner-Ellis, Shaheen Awan, Don Bolser, Yael Bensoussan:
Developing Multi-Disorder Voice Protocols: A team science approach involving clinical expertise, bioethics, standards, and DEI. - Sri Harsha Dumpala, Katerina Dikaios, Abraham Nunes, Frank Rudzicz, Rudolf Uher, Sageev Oore:
Self-Supervised Embeddings for Detecting Individual Symptoms of Depression. - Daryush D. Mehta, Jarrad H. Van Stan, Hamzeh Ghasemzadeh, Robert E. Hillman:
Comparing ambulatory voice measures during daily life with brief laboratory assessments in speakers with and without vocal hyperfunction. - Jennifer Williams, Eike Schneiders, Henry Card, Tina Seabrooke, Beatrice Pakenham-Walsh, Tayyaba Azim, Lucy Valls-Reed, Ganesh Vigneswaran, John Robert Bautista, Rohan Chandra, Arya Farahi:
Predicting Acute Pain Levels Implicitly from Vocal Features. - Kubilay Can Demir, Belén Lojo Rodríguez, Tobias Weise, Andreas K. Maier, Seung Hee Yang:
Towards Intelligent Speech Assistants in Operating Rooms: A Multimodal Model for Surgical Workflow Analysis. - Gowtham Premananth, Yashish M. Siriwardena, Philip Resnik, Sonia Bansal, Deanna L. Kelly, Carol Y. Espy-Wilson:
A Multimodal Framework for the Assessment of the Schizophrenia Spectrum.
Speech and Brain
- Yuzhe Wang, Anna Favaro, Thomas Thebaud, Jesús Villalba, Najim Dehak, Laureano Moro-Velázquez:
Exploring the Complementary Nature of Speech and Eye Movements for Profiling Neurological Disorders. - Hengyu Li, Kangdi Mei, Zhaoci Liu, Yang Ai, Liping Chen, Jie Zhang, Zhenhua Ling:
Refining Self-supervised Learnt Speech Representation using Brain Activations. - Yuejiao Wang, Xianmin Gong, Lingwei Meng, Xixin Wu, Helen Meng:
Large Language Model-based FMRI Encoding of Language Functions for Subjects with Neurocognitive Disorder. - Kumar Neelabh, Vishnu Sreekumar:
From Sound to Meaning in the Auditory Cortex: A Neuronal Representation and Classification Analysis. - Sheng Feng, Heyang Liu, Yu Wang, Yanfeng Wang:
Towards an End-to-End Framework for Invasive Brain Signal Decoding with Large Language Models. - Jihwan Lee, Aditya Kommineni, Tiantian Feng, Kleanthis Avramidis, Xuan Shi, Sudarsana Reddy Kadiri, Shrikanth Narayanan:
Toward Fully-End-to-End Listened Speech Decoding from EEG Signals.
Innovative Methods in Phonetics and Phonology
- Emily P. Ahn, Eleanor Chodroff, Myriam Lapierre, Gina-Anne Levow:
The Use of Phone Categories and Cross-Language Modeling for Phone Alignment of Panãra. - Sneha Ray Barman, Shakuntala Mahanta, Neeraj Kumar Sharma:
Deciphering Assamese Vowel Harmony with Featural InfoWaveGAN. - Harsha Veena Tadavarthy, Austin Jones, Margaret E. L. Renwick:
Phonological Feature Detection for US English using the Phonet Library. - Constantijn Kaland, Jeremy Steffman, Jennifer Cole:
K-means and hierarchical clustering of f0 contours. - Rotem Rousso, Eyal Cohen, Joseph Keshet, Eleanor Chodroff:
Tradition or Innovation: A Comparison of Modern ASR Methods for Forced Alignment. - Lila Kim, Cédric Gendrot:
Using wav2vec 2.0 for phonetic classification tasks: methodological aspects. - Michael Lambropoulos, Frantz Clermont, Shunichi Ishihara:
The sub-band cepstrum as a tool for locating local spectral regions of phonetic sensitivity: A first attempt with multi-speaker vowel data. - Woo-Jin Chung, Hong-Goo Kang:
Speaker-Independent Acoustic-to-Articulatory Inversion through Multi-Channel Attention Discriminator. - Tobias Weise, Philipp Klumpp, Kubilay Can Demir, Paula Andrea Pérez-Toro, Maria Schuster, Elmar Nöth, Björn Heismann, Andreas K. Maier, Seung Hee Yang:
Speaker- and Text-Independent Estimation of Articulatory Movements and Phoneme Alignments from Speech. - Anna Oura, Hideaki Kikuchi, Tetsunori Kobayashi:
Preprocessing for acoustic-to-articulatory inversion using real-time MRI movies of Japanese speech.
Voice, Tones and F0
- Chenyu Li, Jalal Al-Tamimi:
Impact of the tonal factor on diphthong realizations in Standard Mandarin with Generalized Additive Mixed Models. - Xiaowang Liu, Jinsong Zhang:
A Study on the Information Mechanism of the 3rd Tone Sandhi Rule in Mandarin Disyllabic Words. - Melanie Weirich, Daniel Duran, Stefanie Jannedy:
Gender and age based f0-variation in the German Plapper Corpus. - Chenzi Xu, Jessica Wormald, Paul Foulkes, Philip Harrison, Vincent Hughes, Poppy Welch, Finnian Kelly, David van der Vloed:
Voice quality in telephone speech: Comparing acoustic measures between VoIP telephone and high-quality recordings. - Iona Gessinger, Bistra Andreeva, Benjamin R. Cowan:
The Use of Modifiers and f0 in Remote Referential Communication with Human and Computer Partners.
Emotion Recognition: Resources and Benchmarks
- Ziyang Ma, Mingjie Chen, Hezhao Zhang, Zhisheng Zheng, Wenxi Chen, Xiquan Li, Jiaxin Ye, Xie Chen, Thomas Hain:
EmoBox: Multilingual Multi-corpus Speech Emotion Recognition Toolkit and Benchmark. - Andreas Triantafyllopoulos, Anton Batliner, Simon David Noel Rampp, Manuel Milling, Björn W. Schuller:
INTERSPEECH 2009 Emotion Challenge Revisited: Benchmarking 15 Years of Progress in Speech Emotion Recognition. - Adham Ibrahim, Shady Shehata, Ajinkya Kulkarni, Mukhtar Mohamed, Muhammad Abdul-Mageed:
What Does it Take to Generalize SER Model Across Datasets? A Comprehensive Benchmark. - Abinay Reddy Naini, Lucas Goncalves, Mary A. Kohler, Donita Robinson, Elizabeth Richerson, Carlos Busso:
WHiSER: White House Tapes Speech Emotion Recognition Corpus. - Siddique Latif, Raja Jurdak, Björn W. Schuller:
Evaluating Transformer-Enhanced Deep Reinforcement Learning for Speech Emotion Recognition. - Jincen Wang, Yan Zhao, Cheng Lu, Chuangao Tang, Sunan Li, Yuan Zong, Wenming Zheng:
Boosting Cross-Corpus Speech Emotion Recognition using CycleGAN with Contrastive Learning.
Speaker and Language Identification and Diarization
- Bilal Rahou, Hervé Bredin:
Multi-latency look-ahead for streaming speaker segmentation. - Christoph Boeddeker, Tobias Cord-Landwehr, Reinhold Haeb-Umbach:
Once more Diarization: Improving meeting transcription systems through segment-level speaker reassignment. - Théo Mariotte, Anthony Larcher, Silvio Montrésor, Jean-Hugh Thomas:
ASoBO: Attentive Beamformer Selection for Distant Speaker Diarization in Meetings. - Gabriel Pirlogeanu, Octavian Pascu, Alexandru-Lucian Georgescu, Horia Cucu:
Hybrid-Diarization System with Overlap Post-Processing for the DISPLACE 2024 Challenge. - Shareef Babu Kalluri, Prachi Singh, Pratik Roy Chowdhuri, Apoorva Kulkarni, Shikha Baghel, Pradyoth Hegde, Swapnil Sontakke, Deepak K. T., S. R. Mahadeva Prasanna, Deepu Vijayasenan, Sriram Ganapathy:
The Second DISPLACE Challenge: DIarization of SPeaker and LAnguage in Conversational Environments. - Joonas Kalda, Tanel Alumäe, Martin Lebourdais, Hervé Bredin, Séverin Baroudi, Ricard Marxer:
TalTech-IRIT-LIS Speaker and Language Diarization Systems for DISPLACE 2024. - Yaqian Hao, Chenguang Hu, Yingying Gao, Shilei Zhang, Junlan Feng:
Exploring Energy-Based Models for Out-of-Distribution Detection in Dialect Identification. - Martina Valente, Fabio Brugnara, Giovanni Morrone, Enrico Zovato, Leonardo Badino:
Exploring Spoken Language Identification Strategies for Automatic Transcription of Multilingual Broadcast and Institutional Speech. - Rohit Paturi, Xiang Li, Sundararajan Srinivasan:
AG-LSEC: Audio Grounded Lexical Speaker Error Correction. - Hang Su, Yuxiang Kong, Lichun Fan, Peng Gao, Yujun Wang, Zhiyong Wu:
Speaker Change Detection with Weighted-sum Knowledge Distillation based on Self-supervised Pre-trained Models. - Naoki Makishima, Naotaka Kawata, Mana Ihori, Tomohiro Tanaka, Shota Orihashi, Atsushi Ando, Ryo Masumura:
SOMSRED: Sequential Output Modeling for Joint Multi-talker Overlapped Speech Recognition and Speaker Diarization. - Hokuto Munakata, Ryo Terashima, Yusuke Fujita:
Song Data Cleansing for End-to-End Neural Singer Diarization Using Neural Analysis and Synthesis Framework.
Audio-Text Retrieval
- Yifei Xin, Xuxin Cheng, Zhihong Zhu, Xusheng Yang, Yuexian Zou:
DiffATR: Diffusion-based Generative Modeling for Audio-Text Retrieval. - Zhiyong Yan, Heinrich Dinkel, Yongqing Wang, Jizhong Liu, Junbo Zhang, Yujun Wang, Bin Wang:
Bridging Language Gaps in Audio-Text Retrieval. - Soham Deshmukh, Rita Singh, Bhiksha Raj:
Domain Adaptation for Contrastive Audio-Language Models. - Francesco Paissan, Elisabetta Farella:
tinyCLAP: Distilling Contrastive Language-Audio Pretrained Models. - June-Woo Kim, Miika Toikkanen, Yera Choi, Seoung-Eun Moon, Ho-Young Jung:
BTS: Bridging Text and Sound Modalities for Metadata-Aided Respiratory Sound Classification. - Yuwu Tang, Ziang Ma, Haitao Zhang:
Enhanced Feature Learning with Normalized Knowledge Distillation for Audio Tagging.
Speech Enhancement
- Mingshuai Liu, Zhuangqi Chen, Xiaopeng Yan, Yuanjun Lv, Xianjun Xia, Chuanzeng Huang, Yijian Xiao, Lei Xie:
RaD-Net 2: A causal two-stage repairing and denoising speech enhancement network with knowledge distillation and complex axial self-attention. - Xi Liu, John H. L. Hansen:
DNN-based monaural speech enhancement using alternate analysis windows for phase and magnitude modification. - Li Li, Shogo Seki:
Improved Remixing Process for Domain Adaptation-Based Speech Enhancement by Mitigating Data Imbalance in Signal-to-Noise Ratio. - Yixuan Zhang, Hao Zhang, Meng Yu, Dong Yu:
Neural Network Augmented Kalman Filter for Robust Acoustic Howling Suppression. - Jizhen Li, Xinmeng Xu, Weiping Tu, Yuhong Yang, Rong Zhu:
Improving Speech Enhancement by Integrating Inter-Channel and Band Features with Dual-branch Conformer. - Qiquan Zhang, Hongxu Zhu, Xinyuan Qian, Eliathamby Ambikairajah, Haizhou Li:
An Exploration of Length Generalization in Transformer-Based Speech Enhancement. - Haixin Guan, Wei Dai, Guangyong Wang, Xiaobin Tan, Peng Li, Jiaen Liang:
Reducing Speech Distortion and Artifacts for Speech Enhancement by Loss Function. - Candy Olivia Mawalim, Shogo Okada, Masashi Unoki:
Are Recent Deep Learning-Based Speech Enhancement Methods Ready to Confront Real-World Noisy Environments? - Wangyou Zhang, Kohei Saijo, Jee-weon Jung, Chenda Li, Shinji Watanabe, Yanmin Qian:
Beyond Performance Plateaus: A Comprehensive Study on Scalability in Speech Enhancement.
Speech Coding
- Jinghong Zhang, Zugang Zhao, Yonghui Liu, Jianbing Liu, Zhiqiang He, Kai Niu:
TD-PLC: A Semantic-Aware Speech Encoding for Improved Packet Loss Concealment. - Zihan Zhang, Xianjun Xia, Chuanzeng Huang, Yijian Xiao, Lei Xie:
BS-PLCNet 2: Two-stage Band-split Packet Loss Concealment Network with Intra-model Knowledge Distillation. - Kishan Gupta, Nicola Pia, Srikanth Korse, Andreas Brendel, Guillaume Fuchs, Markus Multrus:
On Improving Error Resilience of Neural End-to-End Speech Coders. - Thomas Muller, Stéphane Ragot, Laetitia Gros, Pierrick Philippe, Pascal Scalart:
Speech quality evaluation of neural audio codecs. - Yang Ai, Ye-Xin Lu, Xiao-Hang Jiang, Zheng-Yan Sheng, Rui-Chen Zheng, Zhen-Hua Ling:
A Low-Bitrate Neural Audio Codec Framework with Bandwidth Reduction and Recovery for High-Sampling-Rate Waveforms. - Haibin Wu, Yuan Tseng, Hung-yi Lee:
CodecFake: Enhancing Anti-Spoofing Models Against Deepfake Audios from Codec-Based Speech Synthesis Systems.
Speech Synthesis: Expressivity and Emotion
- Zehua Kcriss Li, Meiying Melissa Chen, Yi Zhong, Pinxin Liu, Zhiyao Duan:
GTR-Voice: Articulatory Phonetics Informed Controllable Expressive Speech Synthesis. - Donghyun Seong, Hoyoung Lee, Joon-Hyuk Chang:
TSP-TTS: Text-based Style Predictor with Residual Vector Quantization for Expressive Text-to-Speech. - Weiqin Li, Peiji Yang, Yicheng Zhong, Yixuan Zhou, Zhisheng Wang, Zhiyong Wu, Xixin Wu, Helen Meng:
Spontaneous Style Text-to-Speech Synthesis with Controllable Spontaneous Behaviors Based on Language Models. - Dake Guo, Xinfa Zhu, Liumeng Xue, Yongmao Zhang, Wenjie Tian, Lei Xie:
Text-aware and Context-aware Expressive Audiobook Speech Synthesis. - Thomas Bott, Florian Lux, Ngoc Thang Vu:
Controlling Emotion in Text-to-Speech with Natural Language Prompts. - Jinlong Xue, Yayue Deng, Yingming Gao, Ya Li:
Retrieval Augmented Generation in Prompt-based Text-to-Speech Synthesis with Context-Aware Contrastive Language-Audio Pretraining. - Pavan Kalyan, Preeti Rao, Preethi Jyothi, Pushpak Bhattacharyya:
Emotion Arithmetic: Emotional Speech Synthesis via Weight Space Interpolation. - Deok-Hyeon Cho, Hyung-Seok Oh, Seung-Bin Kim, Sang-Hoon Lee, Seong-Whan Lee:
EmoSphere-TTS: Emotional Style and Intensity Modeling via Spherical Emotion Vector for Controllable Emotional Text-to-Speech. - Xuyuan Li, Zengqiang Shang, Peiyang Shi, Hua Hua, Ta Li, Pengyuan Zhang:
Expressive paragraph text-to-speech synthesis with multi-step variational autoencoder. - Chin-Yun Yu, György Fazekas:
Differentiable Time-Varying Linear Prediction in the Context of End-to-End Analysis-by-Synthesis.
Speech Synthesis: Tools and Data
- Yuki Saito, Takuto Igarashi, Kentaro Seki, Shinnosuke Takamichi, Ryuichi Yamamoto, Kentaro Tachibana, Hiroshi Saruwatari:
SRC4VC: Smartphone-Recorded Corpus for Voice Conversion Benchmark. - Praveen Srinivasa Varadhan, Ashwin Sankar, Giri Raju, Mitesh M. Khapra:
Rasa: Building Expressive Speech Synthesis Systems for Indian Languages in Low-resource Settings. - Min Ma, Yuma Koizumi, Shigeki Karita, Heiga Zen, Jason Riesa, Haruko Ishikawa, Michiel Bacchiani:
FLEURS-R: A Restored Multilingual Speech Corpus for Generation Tasks. - Linhan Ma, Dake Guo, Kun Song, Yuepeng Jiang, Shuai Wang, Liumeng Xue, Weiming Xu, Huan Zhao, Binbin Zhang, Lei Xie:
WenetSpeech4TTS: A 12,800-hour Mandarin TTS Corpus for Large Speech Generation Model Benchmark. - Qian Yang, Jialong Zuo, Zhe Su, Ziyue Jiang, Mingze Li, Zhou Zhao, Feiyang Chen, Zhefeng Wang, Baoxing Huai:
MSceneSpeech: A Multi-Scene Speech Dataset For Expressive Speech Synthesis. - Masaya Kawamura, Ryuichi Yamamoto, Yuma Shirahata, Takuya Hasumi, Kentaro Tachibana:
LibriTTS-P: A Corpus with Speaking Style and Speaker Identity Prompts for Text-to-Speech and Style Captioning. - Sewade Ogun, Abraham Toluwase Owodunni, Tobi Olatunji, Eniola Alese, Babatunde Oladimeji, Tejumade Afonja, Kayode Olaleye, Naome A. Etori, Tosin P. Adewumi:
1000 African Voices: Advancing inclusive multi-speaker multi-accent speech synthesis. - Osamu Take, Shinnosuke Takamichi, Kentaro Seki, Yoshiaki Bando, Hiroshi Saruwatari:
SaSLaW: Dialogue Speech Corpus with Audio-visual Egocentric Information Toward Environment-adaptive Dialogue Speech Synthesis.
Speech Synthesis: Singing Voice Synthesis
- Semin Kim, Myeonghun Jeong, Hyeonseung Lee, Minchan Kim, Byoung Jin Choi, Nam Soo Kim:
MakeSinger: A Semi-Supervised Training Method for Data-Efficient Singing Voice Synthesis via Classifier-free Diffusion Guidance. - Takuma Okamoto, Yamato Ohtani, Sota Shimizu, Tomoki Toda, Hisashi Kawai:
Challenge of Singing Voice Synthesis Using Only Text-To-Speech Corpus With FIRNet Source-Filter Neural Vocoder. - Taewoo Kim, Choongsang Cho, Young Han Lee:
Period Singer: Integrating Periodic and Aperiodic Variational Autoencoders for Natural-Sounding End-to-End Singing Voice Synthesis. - Jiatong Shi, Yueqian Lin, Xinyi Bai, Keyi Zhang, Yuning Wu, Yuxun Tang, Yifeng Yu, Qin Jin, Shinji Watanabe:
Singing Voice Data Scaling-up: An Introduction to ACE-Opencpop and ACE-KiSing. - Ji-Sang Hwang, HyeongRae Noh, Yoonseok Hong, Insoo Oh:
X-Singer: Code-Mixed Singing Voice Synthesis via Cross-Lingual Learning. - Shuochen Gao, Shun Lei, Fan Zhuo, Hangyu Liu, Feng Liu, Boshi Tang, Qiaochu Huang, Shiyin Kang, Zhiyong Wu:
An End-to-End Approach for Chord-Conditioned Song Generation.
LLM in ASR
- Murali Karthick Baskar, Andrew Rosenberg, Bhuvana Ramabhadran, Neeraj Gaur, Zhong Meng:
Speech Prefix-Tuning with RNNT Loss for Improving LLM Predictions. - Frank Seide, Yangyang Shi, Morrie Doulaty, Yashesh Gaur, Junteng Jia, Chunyang Wu:
Speech ReaLLM - Real-time Speech Recognition with Multimodal Language Models by Teaching the Flow of Time. - Yangze Li, Xiong Wang, Songjun Cao, Yike Zhang, Long Ma, Lei Xie:
A Transcription Prompt-based Efficient Audio Large Language Model for Robust Speech Recognition. - Zhiyuan Tang, Dong Wang, Shen Huang, Shidong Shang:
Pinyin Regularization in Error Correction for Chinese Speech Recognition with Large Language Models.
Vision and Speech
- Jongsuk Kim, Jiwon Shin, Junmo Kim:
AVCap: Leveraging Audio-Visual Features as Text Tokens for Captioning. - Sreyan Ghosh, Sonal Kumar, Ashish Seth, Purva Chiniya, Utkarsh Tyagi, Ramani Duraiswami, Dinesh Manocha:
LipGER: Visually-Conditioned Generative Error Correction for Robust Automatic Speech Recognition. - Guinan Li, Jiajun Deng, Youjun Chen, Mengzhe Geng, Shujie Hu, Zhe Li, Zengrui Jin, Tianzi Wang, Xurong Xie, Helen Meng, Xunying Liu:
Joint Speaker Features Learning for Audio-visual Multichannel Speech Separation and Recognition. - Chen Chen, Zehua Liu, Xiaolou Li, Lantian Li, Dong Wang:
CNVSRC 2023: The First Chinese Continuous Visual Speech Recognition Challenge.
Spoken Document Summarization
- Margaret Kroll, Kelsey Kraus:
Optimizing the role of human evaluation in LLM-based spoken document summarization systems. - Sangwon Ryu, Heejin Do, Yunsu Kim, Gary Geunbae Lee, Jungseul Ok:
Key-Element-Informed sLLM Tuning for Document Summarization. - Kohei Matsuura, Takanori Ashihara, Takafumi Moriya, Masato Mimura, Takatomo Kano, Atsunori Ogawa, Marc Delcroix:
Sentence-wise Speech Summarization: Task, Datasets, and End-to-End Modeling with LM Knowledge Distillation. - Hengchao Shang, Zongyao Li, Jiaxin Guo, Shaojun Li, Zhiqiang Rao, Yuanchang Luo, Daimeng Wei, Hao Yang:
An End-to-End Speech Summarization Using Large Language Model. - Wonjune Kang, Deb Roy:
Prompting Large Language Models with Audio for General-Purpose Speech Summarization. - Khai Le-Duc, Khai-Nguyen Nguyen, Long Vo-Dang, Truong-Son Hy:
Real-time Speech Summarization for Medical Conversations.
Speech and Language in Health: from Remote Monitoring to Medical Conversations - 2 (Special Sessions)
- Daniel Escobar-Grisales, Cristian David Ríos-Urrego, Ilja Baumann, Korbinian Riedhammer, Elmar Nöth, Tobias Bocklet, Adolfo M. García, Juan Rafael Orozco-Arroyave:
It's Time to Take Action: Acoustic Modeling of Motor Verbs to Detect Parkinson's Disease. - Malo Maisonneuve, Corinne Fredouille, Muriel Lalain, Alain Ghio, Virginie Woisard:
Towards objective and interpretable speech disorder assessment: a comparative analysis of CNN and transformer-based models. - Catarina Botelho, John Mendonça, Anna Pompili, Tanja Schultz, Alberto Abad, Isabel Trancoso:
Macro-descriptors for Alzheimer's disease detection using large language models. - Franziska Braun, Sebastian P. Bayerl, Florian Hönig, Hartmut Lehfeld, Thomas Hillemacher, Tobias Bocklet, Korbinian Riedhammer:
Infusing Acoustic Pause Context into Text-Based Dementia Assessment. - Oliver Roesler, Jackson Liscombe, Michael Neumann, Hardik Kothare, Abhishek Hosamath, Lakshmi Arbatti, Doug Habberstad, Christiane Suendermann-Oeft, Meredith Bartlett, Cathy Zhang, Nikhil Sukhdev, Kolja Wilms, Anusha Badathala, Sandrine Istas, Steve Ruhmel, Bryan Hansen, Madeline Hannan, David Henley, Arthur Wallace, Ira Shoulson, David Suendermann-Oeft, Vikram Ramanarayanan:
Towards Scalable Remote Assessment of Mild Cognitive Impairment Via Multimodal Dialog. - Mara Barberis, Pieter De Clercq, Bastiaan Tamm, Hugo Van hamme, Maaike Vandermosten:
Automatic recognition and detection of aphasic natural speech. - Giulia Sanguedolce, Sophie Brook, Dragos C. Gruia, Patrick A. Naylor, Fatemeh Geranmayeh:
When Whisper Listens to Aphasia: Advancing Robust Post-Stroke Speech Recognition. - Liming Wang, Yuan Gong, Nauman Dawalatabad, Marco Vilela, Katerina Placek, Brian Tracey, Yishu Gong, Alan Premasiri, Fernando Vieira, James R. Glass:
Automatic Prediction of Amyotrophic Lateral Sclerosis Progression using Longitudinal Speech Transformer. - Hardik Kothare, Michael Neumann, Cathy Zhang, Jackson Liscombe, Jordi W. J. van Unnik, Lianne C. M. Botman, Leonard H. van den Berg, Ruben P. A. van Eijk, Vikram Ramanarayanan:
How Consistent are Speech-Based Biomarkers in Remote Tracking of ALS Disease Progression Across Languages? A Case Study of English and Dutch. - Anika A. Spiesberger, Andreas Triantafyllopoulos, Alexander Kathan, Anastasia Semertzidou, Caterina Gawrilow, Tilman Reinelt, Wolfgang A. Rauch, Björn W. Schuller:
"So... my child..." - How Child ADHD Influences the Way Parents Talk. - Judith Dineley, Ewan Carr, Lauren L. White, Catriona Lucas, Zahia Rahman, Tian Pan, Faith Matcham, Johnny Downs, Richard J. B. Dobson, Thomas F. Quatieri, Nicholas Cummins:
Variability of speech timing features across repeated recordings: a comparison of open-source extraction techniques. - Yanis Labrak, Adel Moumen, Richard Dufour, Mickael Rouvier:
Zero-Shot End-To-End Spoken Question Answering In Medical Domain. - Yicong Jiang, Tianzi Wang, Xurong Xie, Juan Liu, Wei Sun, Nan Yan, Hui Chen, Lan Wang, Xunying Liu, Feng Tian:
Perceiver-Prompt: Flexible Speaker Adaptation in Whisper for Chinese Disordered Speech Recognition.
Show and Tell 2
- Kesavaraj V, Charan Devarkonda, Vamshiraghusimha Narasinga, Anil Kumar Vuppala:
Custom wake word detection. - Song Chen, Mandar Gogate, Kia Dashtipour, Jasper Kirton-Wingate, Adeel Hussain, Faiyaz Doctor, Tughrul Arslan, Amir Hussain:
Edge-based audio-visual speech enhancement demonstrator. - Arif Reza Anway, Bryony Buck, Mandar Gogate, Kia Dashtipour, Michael Akeroyd, Amir Hussain:
Real-Time Gaze-directed speech enhancement for audio-visual hearing-aids. - Abhishek Kumar, Srikanth Konjeti, Jithendra Vepa:
Detection of background agents speech in contact centers. - Bramhendra Koilakuntla, Prajesh Rana, Paras Ahuja, Srikanth Konjeti, Jithendra Vepa:
Leveraging large language models for post-transcription correction in contact centers. - Leonie Schade, Nico Dallmann, Olcay Türk, Stefan Lazarov, Petra Wagner:
Understanding "understanding": presenting a richly annotated multimodal corpus of dyadic interaction. - João Vítor Possamai de Menezes, Arne-Lukas Fietkau, Tom Diener, Steffen Kürbis, Peter Birkholz:
A demonstrator for articulation-based command word recognition. - Nigel G. Ward, Andres Segura:
Pragmatically similar utterance finder demonstration. - Kai Liu, Ziqing Du, Huan Zhou, Xucheng Wan, Naijun Zheng:
Real-time scheme for rapid extraction of speaker embeddings in challenging recording conditions. - Szu-Yu Chen, Tien-Hong Lo, Yao-Ting Sung, Ching-Yu Tseng, Berlin Chen:
TEEMI: a speaking practice tool for L2 English learners.
Prosody
- Na Hu, Hugo Schnack, Amalia Arvaniti:
Automatic pitch accent classification through image classification. - Tianqi Geng, Hui Feng:
Form and Function in Prosodic Representation: In the Case of 'ma' in Tianjin Mandarin. - Joyshree Chakraborty, Leena Dihingia, Priyankoo Sarmah, Rohit Sinha:
On Comparing Time- and Frequency-Domain Rhythm Measures in Classifying Assamese Dialects. - Chiara Riegger, Tina Bögel, George Walkden:
The prosody of the verbal prefix ge-: historical and experimental evidence. - Hongchen Wu, Jiwon Yun:
Influences of Morphosyntax and Semantics on the Intonation of Mandarin Chinese Wh-indeterminates. - Benazir Mumtaz, Miriam Butt:
Urdu Alternative Questions: A Hat Pattern.
Foundational Models for Deepfake and Spoofed Speech Detection
- Hoan My Tran, David Guennec, Philippe Martin, Aghilas Sini, Damien Lolive, Arnaud Delhay, Pierre-François Marteau:
Spoofed Speech Detection with a Focus on Speaker Embedding. - Juan M. Martín-Doñas, Aitor Álvarez, Eros Rosello, Angel M. Gomez, Antonio M. Peinado:
Exploring Self-supervised Embeddings and Synthetic Data Augmentation for Robust Audio Deepfake Detection. - Zihan Pan, Tianchi Liu, Hardik B. Sailor, Qiongqiong Wang:
Attentive Merging of Hidden Embeddings from Pre-trained Speech Model for Anti-spoofing Detection. - Haochen Wu, Wu Guo, Shengyu Peng, Zhuhai Li, Jie Zhang:
Adapter Learning from Pre-trained Model for Robust Spoof Speech Detection. - Kexu Liu, Yuanxin Wang, Shengchen Li, Xi Shao:
Speech Formants Integration for Generalized Detection of Synthetic Speech Spoofing Attacks. - Thien-Phuc Doan, Long Nguyen-Vu, Kihun Hong, Souhwan Jung:
Balance, Multiple Augmentation, and Re-synthesis: A Triad Training Strategy for Enhanced Audio Deepfake Detection.
Speaker Recognition 1
- Shengyu Peng, Wu Guo, Haochen Wu, Zuoliang Li, Jie Zhang:
Fine-tune Pre-Trained Models with Multi-Level Feature Fusion for Speaker Verification. - En-Lun Yu, Kuan-Hsun Ho, Jeih-weih Hung, Shih-Chieh Huang, Berlin Chen:
Speaker Conditional Sinc-Extractor for Personal VAD. - Shiu-Hsiang Liou, Po-Cheng Chan, Chia-Ping Chen, Tzu-Chieh Lin, Chung-Li Lu, Yu-Han Cheng, Hsiang-Feng Chuang, Wei-Yu Chen:
Enhancing ECAPA-TDNN with Feature Processing Module and Attention Mechanism for Speaker Verification. - Seung-bin Kim, Chan-yeong Lim, Jungwoo Heo, Ju-ho Kim, Hyun-seo Shin, Kyo-Won Koo, Ha-Jin Yu:
MR-RawNet: Speaker verification system with multiple temporal resolutions for variable duration utterances using raw waveforms. - Kihyun Nam, Hee-Soo Heo, Jee-Weon Jung, Joon Son Chung:
Disentangled Representation Learning for Environment-agnostic Speaker Recognition. - Ladislav Mosner, Romain Serizel, Lukás Burget, Oldrich Plchot, Emmanuel Vincent, Junyi Peng, Jan Cernocký:
Multi-Channel Extension of Pre-trained Models for Speaker Verification. - Yishuang Li, Wenhao Guan, Hukai Huang, Shiyu Miao, Qi Su, Lin Li, Qingyang Hong:
Efficient Integrated Features Based on Pre-trained Models for Speaker Verification. - Tianhao Wang, Lantian Li, Dong Wang:
SE/BN Adapter: Parametric Efficient Domain Adaptation for Speaker Recognition. - Wei-Lin Xie, Yu-Xuan Xi, Yan Song, Jian-Tao Zhang, Hao-Yu Song, Ian McLoughlin:
DB-PMAE: Dual-Branch Prototypical Masked AutoEncoder with locality for domain robust speaker verification. - Matthew Maciejewski, Dominik Klement, Ruizhe Huang, Matthew Wiesner, Sanjeev Khudanpur:
Evaluating the Santa Barbara Corpus: Challenges of the Breadth of Conversational Spoken Language. - Zhenyu Zhou, Shibiao Xu, Shi Yin, Lantian Li, Dong Wang:
A Comprehensive Investigation on Speaker Augmentation for Speaker Recognition.
Source Separation 1
- Helin Wang, Jesús Villalba, Laureano Moro-Velázquez, Jiarui Hai, Thomas Thebaud, Najim Dehak:
Noise-robust Speech Separation with Fast Generative Correction. - Roland Hartanto, Sakriani Sakti, Koichi Shinoda:
MSDET: Multitask Speaker Separation and Direction-of-Arrival Estimation Training. - Jacob Kealey, John R. Hershey, François Grondin:
Unsupervised Improved MVDR Beamforming for Sound Enhancement. - Ke Chen, Jiaqi Su, Taylor Berg-Kirkpatrick, Shlomo Dubnov, Zeyu Jin:
Improving Generalization of Speech Separation in Real-World Scenarios: Strategies in Simulation, Optimization, and Evaluation. - Jihyun Kim, Stijn Kindt, Nilesh Madhu, Hong-Goo Kang:
Enhanced Deep Speech Separation in Clustered Ad Hoc Distributed Microphone Environments. - Jia Qi Yip, Shengkui Zhao, Dianwen Ng, Eng Siong Chng, Bin Ma:
Towards Audio Codec-based Speech Separation.
Audio-Visual and Generative Speech Enhancement
- Zhengxiao Li, Nakamasa Inoue:
Locally Aligned Rectified Flow Model for Speech Enhancement Towards Single-Step Diffusion. - Pu Wang, Junhui Li, Jialu Li, Liangdong Guo, Youshan Zhang:
Diffusion Gaussian Mixture Audio Denoise. - Bunlong Lay, Timo Gerkmann:
An Analysis of the Variance of Diffusion-based Speech Enhancement. - Chaeyoung Jung, Suyeon Lee, Ji-Hoon Kim, Joon Son Chung:
FlowAVSE: Efficient Audio-Visual Speech Enhancement with Conditional Flow Matching. - Honglie Chen, Rodrigo Mira, Stavros Petridis, Maja Pantic:
RT-LA-VocE: Real-Time Low-SNR Audio-Visual Speech Enhancement. - Junhui Li, Pu Wang, Jialu Li, Youshan Zhang:
Complex Image-Generative Diffusion Transformer for Audio Denoising. - Yuchen Hu, Chen Chen, Ruizhe Li, Qiushi Zhu, Eng Siong Chng:
Noise-aware Speech Enhancement using Diffusion Probabilistic Model.
Speech Privacy and Bandwidth Expansion
- Mohammad Hassan Vali, Tom Bäckström:
Privacy PORCUPINE: Anonymization of Speaker Attributes Using Occurrence Normalization for Space-Filling Vector Quantization. - Mayank Kumar Singh, Naoya Takahashi, Wei-Hsiang Liao, Yuki Mitsufuji:
SilentCipher: Deep Audio Watermarking. - Cunhang Fan, Shunbo Dong, Jun Xue, Yujie Chen, Jiangyan Yi, Zhao Lv:
Frequency-mix Knowledge Distillation for Fake Speech Detection. - Nicolas M. Müller, Piotr Kawa, Shen Hu, Matthias Neu, Jennifer Williams, Philip Sperl, Konstantin Böttinger:
A New Approach to Voice Authenticity. - Junzuo Zhou, Jiangyan Yi, Tao Wang, Jianhua Tao, Ye Bai, Chu Yuan Zhang, Yong Ren, Zhengqi Wen:
TraceableSpeech: Towards Proactively Traceable Text-to-Speech with Watermarking. - Liwei Liu, Huihui Wei, Dongya Liu, Zhonghua Fu:
HarmoNet: Partial DeepFake Detection Network based on Multi-scale HarmoF0 Feature Fusion. - Denise Moussa, Sandra Bergmann, Christian Riess:
Unmasking Neural Codecs: Forensic Identification of AI-compressed Speech. - Yin-Tse Lin, Shreya G. Upadhyay, Bo-Hao Su, Chi-Chun Lee:
SWiBE: A Parameterized Stochastic Diffusion Process for Noise-Robust Bandwidth Expansion. - Ye-Xin Lu, Yang Ai, Zheng-Yan Sheng, Zhen-Hua Ling:
MultiStage Speech Bandwidth Extension with Flexible Sampling Rate Control. - Xu Li, Qirui Wang, Xiaoyu Liu:
MaskSR: Masked Language Model for Full-band Speech Restoration.
Speech Synthesis: Prosody
- Yuliya Korotkova, Ilya Kalinovskiy, Tatiana Vakhrusheva:
Word-level Text Markup for Prosody Control in Speech Synthesis. - Shivam Mehta, Harm Lameris, Rajiv Punmiya, Jonas Beskow, Éva Székely, Gustav Eje Henter:
Should you use a probabilistic duration model in TTS? Probably! Especially for spontaneous speech. - Sefik Emre Eskimez, Xiaofei Wang, Manthan Thakker, Chung-Hsien Tsai, Canrun Li, Zhen Xiao, Hemin Yang, Zirun Zhu, Min Tang, Jinyu Li, Sheng Zhao, Naoyuki Kanda:
Total-Duration-Aware Duration Modeling for Text-to-Speech Systems. - Himanshu Maurya, Atli Sigurgeirsson:
A Human-in-the-Loop Approach to Improving Cross-Text Prosody Transfer. - Yuepeng Jiang, Tao Li, Fengyu Yang, Lei Xie, Meng Meng, Yujun Wang:
Towards Expressive Zero-Shot Speech Synthesis with Hierarchical Prosody Modeling. - Jinzuomu Zhong, Yang Li, Hui Huang, Korin Richmond, Jie Liu, Zhiba Su, Jing Guo, Benlai Tang, Fengjie Zhu:
Multi-Modal Automatic Prosody Annotation with Contrastive Pretraining of Speech-Silence and Word-Punctuation.
Accented Speech, Prosodic Features, Dialect, Emotion, Sound Classification
- Darshan Prabhu, Abhishek Gupta, Omkar Nitsure, Preethi Jyothi, Sriram Ganapathy:
Improving Self-supervised Pre-training using Accent-Specific Codebooks. - Tejumade Afonja, Tobi Olatunji, Sewade Ogun, Naome A. Etori, Abraham Toluwase Owodunni, Moshood Yekini:
Performant ASR Models for Medical Entities in Accented Speech. - Tahir Javed, Janki Nawale, Sakshi Joshi, Eldho Ittan George, Kaushal Santosh Bhogale, Deovrat Mehendale, Mitesh M. Khapra:
LAHAJA: A Robust Multi-accent Benchmark for Evaluating Hindi ASR Systems. - Haechan Kim, Junho Myung, Seoyoung Kim, Sungpah Lee, Dongyeop Kang, Juho Kim:
LearnerVoice: A Dataset of Non-Native English Learners' Spontaneous Speech. - Jiayan Lin, Shenghui Lu, Hukai Huang, Wenhao Guan, Binbin Xu, Hui Bu, Qingyang Hong, Lin Li:
MinSpeech: A Corpus of Southern Min Dialect for Automatic Speech Recognition. - Ying Hu, Huamin Yang, Hao Huang, Liang He:
Cross-modal Features Interaction-and-Aggregation Network with Self-consistency Training for Speech Emotion Recognition. - Arnav Goel, Medha Hira, Anubha Gupta:
Exploring Multilingual Unseen Speaker Emotion Recognition: Leveraging Co-Attention Cues in Multitask Learning. - Hazim T. Bukhari, Soham Deshmukh, Hira Dhamyal, Bhiksha Raj, Rita Singh:
SELM: Enhancing Speech Emotion Recognition for Out-of-Domain Scenarios. - Martijn Bentum, Louis ten Bosch, Tom Lentz:
The Processing of Stress in End-to-End Automatic Speech Recognition Models. - Tuan Nguyen, Huy Dat Tran:
LingWav2Vec2: Linguistic-augmented wav2vec 2.0 for Vietnamese Mispronunciation Detection. - Rhiannon Mogridge, Anton Ragni:
Learning from memory-based models. - Meiling Chen, Pengjie Liu, Heng Yang, Haofeng Wang:
Towards End-to-End Unified Recognition for Mandarin and Cantonese.
Neural Network Adaptation
- Thomas Rolland, Alberto Abad:
Shared-Adapters: A Novel Transformer-based Parameter Efficient Transfer Learning Approach For Children's Automatic Speech Recognition. - Zhouyuan Huo, Dongseong Hwang, Gan Song, Khe Chai Sim, Weiran Wang:
AdaRA: Adaptive Rank Allocation of Residual Adapters for Speech Foundation Model. - Kyuhong Shim, Jinkyu Lee, Hyunjae Kim:
Leveraging Adapter for Parameter-Efficient ASR Encoder. - Ji-Hun Kang, Jae-Hong Lee, Mun-Hak Lee, Joon-Hyuk Chang:
Whisper Multilingual Downstream Task Tuning Using Task Vectors. - Shaojun Li, Daimeng Wei, Hengchao Shang, Jiaxin Guo, Zongyao Li, Zhanglin Wu, Zhiqiang Rao, Yuanchang Luo, Xianghui He, Hao Yang:
Speaker-Smoothed kNN Speaker Adaptation for End-to-End ASR. - Jinming Chen, Jingyi Fang, Yuanzhong Zheng, Yaoxuan Wang, Haojun Fei:
Qifusion-Net: Layer-adapted Stream/Non-stream Model for End-to-End Multi-Accent Speech Recognition.
ASR and LLMs
- Ji Won Yoon, Beom Jun Woo, Nam Soo Kim:
HuBERT-EE: Early Exiting HuBERT for Efficient Speech Recognition. - Guanrou Yang, Ziyang Ma, Fan Yu, Zhifu Gao, Shiliang Zhang, Xie Chen:
MaLa-ASR: Multimedia-Assisted LLM-Based ASR. - HyunJung Choi, Muyeol Choi, Yohan Lim, Minkyu Lee, Seon Hui Kim, Seung Yun, Donghyun Kim, Sang-Hun Kim:
Spoken-to-written text conversion with Large Language Model. - Zhiqi Ai, Zhiyong Chen, Shugong Xu:
MM-KWS: Multi-modal Prompts for Multilingual User-defined Keyword Spotting. - Andrew Rouditchenko, Yuan Gong, Samuel Thomas, Leonid Karlinsky, Hilde Kuehne, Rogério Feris, James Glass:
Whisper-Flamingo: Integrating Visual Features into Whisper for Audio-Visual Speech Recognition and Translation. - K. R. Prajwal, Triantafyllos Afouras, Andrew Zisserman:
Speech Recognition Models are Strong Lip-readers.
Pathological Speech Analysis 3
- Ilja Baumann, Dominik Wagner, Maria Schuster, Korbinian Riedhammer, Elmar Nöth, Tobias Bocklet:
Towards Self-Attention Understanding for Automatic Articulatory Processes Analysis in Cleft Lip and Palate Speech. - Yin-Long Liu, Rui Feng, Jia-Hong Yuan, Zhen-Hua Ling:
Clever Hans Effect Found in Automatic Detection of Alzheimer's Disease through Speech. - Yeh-Sheng Lin, Shu-Chuan Tseng, Jyh-Shing Roger Jang:
Leveraging Phonemic Transcription and Whisper toward Clinically Significant Indices for Automatic Child Speech Assessment. - Shaoxiang Dang, Tetsuya Matsumoto, Yoshinori Takeuchi, Takashi Tsuboi, Yasuhiro Tanaka, Daisuke Nakatsubo, Satoshi Maesawa, Ryuta Saito, Masahisa Katsuno, Hiroaki Kudo:
Developing vocal system impaired patient-aimed voice quality assessment approach using ASR representation-included multiple features. - Wei-Tung Hsu, Chin-Po Chen, Yun-Shao Lin, Chi-Chun Lee:
A Cluster-based Personalized Federated Learning Strategy for End-to-End ASR of Dementia Patients. - Stefan Kalabakov, Monica González Machorro, Florian Eyben, Björn W. Schuller, Bert Arnrich:
A Comparative Analysis of Federated Learning for Speech-Based Cognitive Decline Detection. - Michael Neumann, Hardik Kothare, Jackson Liscombe, Emma C. L. Leschly, Oliver Roesler, Vikram Ramanarayanan:
Multimodal Digital Biomarkers for Longitudinal Tracking of Speech Impairment Severity in ALS: An Investigation of Clinically Important Differences.
Speech Disorders 3
- Ming Gao, Hang Chen, Jun Du, Xin Xu, Hongxiao Guo, Hui Bu, Jianxing Yang, Ming Li, Chin-Hui Lee:
Enhancing Voice Wake-Up for Dysarthria: Mandarin Dysarthria Speech Corpus Release and Customized System Design. - Neil Kumar Shah, Shirish S. Karande, Vineet Gandhi:
Towards Improving NAM-to-Speech Synthesis Intelligibility using Self-Supervised Speech Models. - Seyun Um, Doyeon Kim, Hong-Goo Kang:
PARAN: Variational Autoencoder-based End-to-End Articulation-to-Speech System for Speech Intelligibility. - Si Chen, Bruce Xiao Wang, Yitian Hong, Fang Zhou, Angel Chan, Po-yi Tang, Bin Li, Chunyi Wen, James Cheung, Yan Liu, Zhuoming Chen:
Acoustic changes in speech prosody produced by children with autism after robot-assisted speech training. - Xiuwen Zheng, Bornali Phukon, Mark Hasegawa-Johnson:
Fine-Tuning Automatic Speech Recognition for People with Parkinson's: An Effective Strategy for Enhancing Speech Technology Accessibility. - Pan-Pan Jiang, Jimmy Tobin, Katrin Tomanek, Robert L. MacDonald, Katie Seaver, Richard Cave, Marilyn A. Ladewig, Rus Heywood, Jordan R. Green:
Learnings from curating a trustworthy, well-annotated, and useful dataset of disordered English speech. - Wing-Zin Leung, Mattias Cross, Anton Ragni, Stefan Goetze:
Training Data Augmentation for Dysarthric Automatic Speech Recognition by Text-to-Dysarthric-Speech Synthesis. - Gábor Gosztolya, Mercedes Vetráb, Veronika Svindt, Judit Bóna, Ildikó Hoffmann:
Wav2vec 2.0 Embeddings Are No Swiss Army Knife - A Case Study for Multiple Sclerosis.
Speech Recognition with Large Pretrained Speech Models for Under-represented Languages (Special Session)
- Yi-Jen Shih, David Harwath:
Interface Design for Self-Supervised Speech Models. - Yaoxun Xu, Shi-Xiong Zhang, Jianwei Yu, Zhiyong Wu, Dong Yu:
Comparing Discrete and Continuous Space LLMs for Speech Recognition. - Jinpeng Li, Yu Pu, Qi Sun, Wei-Qiang Zhang:
Improving Whisper's Recognition Performance for Under-Represented Language Kazakh Leveraging Unpaired Speech and Text. - Kaushal Santosh Bhogale, Deovrat Mehendale, Niharika Parasa, Sathish Kumar Reddy G, Tahir Javed, Pratyush Kumar, Mitesh M. Khapra:
Empowering Low-Resource Language ASR via Large-Scale Pseudo Labeling. - Zhengyang Li, Patrick Blumenberg, Jing Liu, Thomas Graave, Timo Lohrenz, Siegfried Kunzmann, Tim Fingscheidt:
Interleaved Audio/Audiovisual Transfer Learning for AV-ASR in Low-Resourced Languages. - Sathvik Udupa, Jesuraj Bandekar, Saurabh Kumar, Deekshitha G, Sandhya Badiger, Abhayjeet Singh, Savitha Murthy, Priyanka Pai, Srinivasa Raghavan K. M., Raoul Nanavati, Prasanta Kumar Ghosh:
Adapter pre-training for improved speech recognition in unseen domains using low resource adapter tuning of self-supervised models. - Tianyi Xu, Kaixun Huang, Pengcheng Guo, Yu Zhou, Longtao Huang, Hui Xue, Lei Xie:
Towards Rehearsal-Free Multilingual ASR: A LoRA-based Case Study on Whisper. - Yaroslav Getman, Tamás Grósz, Katri Hiovain-Asikainen, Mikko Kurimo:
Exploring adaptation techniques of large speech foundation models for low-resource ASR: a case study on Northern Sámi. - Mengjie Qian, Siyuan Tang, Rao Ma, Kate M. Knill, Mark J. F. Gales:
Learn and Don't Forget: Adding a New Language to ASR Foundation Models.
Speech Processing Using Discrete Speech Units (Special Session)
- Yuning Wu, Chunlei Zhang, Jiatong Shi, Yuxun Tang, Shan Yang, Qin Jin:
TokSing: Singing Voice Synthesis based on Discrete Tokens. - Pooneh Mousavi, Jarod Duret, Salah Zaiem, Luca Della Libera, Artem Ploujnikov, Cem Subakan, Mirco Ravanelli:
How Should We Extract Discrete Audio Tokens from Self-Supervised Models? - Xuankai Chang, Jiatong Shi, Jinchuan Tian, Yuning Wu, Yuxun Tang, Yihan Wu, Shinji Watanabe, Yossi Adi, Xie Chen, Qin Jin:
The Interspeech 2024 Challenge on Speech Processing Using Discrete Units. - Yuxun Tang, Yuning Wu, Jiatong Shi, Qin Jin:
SingOMD: Singing Oriented Multi-resolution Discrete Representation Construction from Speech Models. - Jiatong Shi, Xutai Ma, Hirofumi Inaguma, Anna Sun, Shinji Watanabe:
MMM: Multi-Layer Multi-Residual Multi-Stream Discrete Speech Representation from Self-supervised Learning Model. - Kunal Dhawan, Nithin Rao Koluguri, Ante Jukic, Ryan Langman, Jagadeesh Balam, Boris Ginsburg:
Codec-ASR: Training Performant Automatic Speech Recognition Systems with Discrete Speech Representations.
Keynote 3
- Elmar Nöth:
Analysis of Pathological Speech - Pitfalls along the Way.
Databases and Progress in Methodology
- Junseok Ahn, Youkyum Kim, Yeunju Choi, Doyeop Kwak, Ji-Hoon Kim, Seongkyu Mun, Joon Son Chung:
VoxSim: A perceptual voice similarity dataset. - Mewlude Nijat, Chen Chen, Dong Wang, Askar Hamdulla:
UY/CH-CHILD - A Public Chinese L2 Speech Database of Uyghur Children. - Prakash Kumar, Ye Tian, Yongwan Lim, Sophia X. Cui, Christina Hagedorn, Dani Byrd, Uttam K. Sinha, Shrikanth Narayanan, Krishna S. Nayak:
State-of-the-art speech production MRI protocol for new 0.55 Tesla scanners. - Mingyue Shi, Huali Zhou, Qinglin Meng, Nengheng Zheng:
DBD-CI: Doubling the Band Density for Bilateral Cochlear Implants. - Huihang Zhong, Yanlu Xie, ZiJin Yao:
Leveraging Large Language Models to Refine Automatic Feedback Generation at Articulatory Level in Computer Aided Pronunciation Training. - Bin Zhao, Mingxuan Huang, Chenlu Ma, Jinyi Xue, Aijun Li, Kunyu Xu:
Decoding Human Language Acquisition: EEG Evidence for Predictive Probabilistic Statistics in Word Segmentation.
Articulation, Convergence and Perception
- Jérémy Giroud, Jessica Lei, Kirsty Phillips, Matthew H. Davis:
Behavioral evidence for higher speech rate convergence following natural than artificial time altered speech. - Qingye Shen, Leonardo Lancia, Noël Nguyen:
A novel experimental design for the study of listener-to-listener convergence in phoneme categorization. - Hao Li, Yuan Fang, Xueliang Zhang, Fei Chen, Guanglai Gao:
Cross-Attention-Guided WaveNet for EEG-to-MEL Spectrogram Reconstruction. - Nicolò Loddo, Francisca Pessanha, Almila Akdag Salah:
What if HAL breathed? Enhancing Empathy in Human-AI Interactions with Breathing Speech Synthesis. - Malin Svensson Lundmark:
Magnitude and timing of acceleration peaks in stressed and unstressed syllables.
Speech Emotion Recognition
- Shahin Amiriparian, Filip Packan, Maurice Gerczuk, Björn W. Schuller:
ExHuBERT: Enhancing HuBERT Through Block Extension and Fine-Tuning on 37 Emotion Datasets. - Fabian Ritter Gutierrez, Kuan-Po Huang, Jeremy H. M. Wong, Dianwen Ng, Hung-yi Lee, Nancy F. Chen, Eng Siong Chng:
Dataset-Distillation Generative Model for Speech Emotion Recognition. - Jialong Mai, Xiaofen Xing, Weidong Chen, Xiangmin Xu:
DropFormer: A Dynamic Noise-Dropping Transformer for Speech Emotion Recognition. - Minxue Niu, Mimansa Jaiswal, Emily Mower Provost:
From Text to Emotion: Unveiling the Emotion Annotation Capabilities of LLMs.
Self-Supervised Models in Speaker Recognition
- Ju-ho Kim, Hee-Soo Heo, Bong-Jin Lee, Youngki Kwon, Minjae Lee, Ha-Jin Yu:
Self-supervised speaker verification with relational mask prediction. - Victor Miara, Théo Lepage, Réda Dehak:
Towards Supervised Performance on Speaker Verification with Self-Supervised Learning by Leveraging Large-Scale ASR Models. - Chan-yeong Lim, Hyun-seo Shin, Ju-ho Kim, Jungwoo Heo, Kyo-Won Koo, Seung-bin Kim, Ha-Jin Yu:
Improving Noise Robustness in Self-supervised Pre-trained Model for Speaker Verification. - Abderrahim Fathan, Xiaolin Zhu, Jahangir Alam:
On the impact of several regularization techniques on label noise robustness of self-supervised speaker verification systems. - Zhe Li, Man-Wai Mak, Hung-yi Lee, Helen Meng:
Parameter-efficient Fine-tuning of Speaker-Aware Dynamic Prompts for Speaker Verification. - Yiyang Zhao, Shuai Wang, Guangzhi Sun, Zehua Chen, Chao Zhang, Mingxing Xu, Thomas Fang Zheng:
Whisper-PMFA: Partial Multi-Scale Feature Aggregation for Speaker Verification using Whisper Models.
Speech Quality Assessment
- Cheng-Hung Hu, Yusuke Yasuda, Tomoki Toda:
Embedding Learning for Preference-based Speech Quality Assessment. - Sathvik Udupa, Soumi Maiti, Prasanta Kumar Ghosh:
IndicMOS: Multilingual MOS Prediction for 7 Indian languages. - Dan Wells, Andrea Lorena Aldana Blanco, Cassia Valentini-Botinhao, Erica Cooper, Aidan Pine, Junichi Yamagishi, Korin Richmond:
Experimental evaluation of MOS, AB and BWS listening test designs. - Bao Thang Ta, Minh Tu Le, Van Hai Do, Huynh Thi Thanh Binh:
Enhancing No-Reference Speech Quality Assessment with Pairwise, Triplet Ranking Losses, and ASR Pretraining.
Privacy and Security in Speech Communication 1
- Nicolas M. Müller, Nicholas W. D. Evans, Hemlata Tak, Philip Sperl, Konstantin Böttinger:
Harder or Different? Understanding Generalization of Audio Deepfake Detection. - Hideyuki Oiso, Yuto Matsunaga, Kazuya Kakizaki, Taiki Miyagawa:
Prompt Tuning for Audio Deepfake Detection: Computationally Efficient Test-time Domain Adaptation with Limited Target Dataset. - David Looney, Nikolay D. Gaubitch:
Robust spread spectrum speech watermarking using linear prediction and deep spectral shaping. - Yujie Chen, Jiangyan Yi, Jun Xue, Chenglong Wang, Xiaohui Zhang, Shunbo Dong, Siding Zeng, Jianhua Tao, Zhao Lv, Cunhang Fan:
RawBMamba: End-to-End Bidirectional State Space Model for Audio Deepfake Detection. - Ailin Liu, Pepijn Vunderink, Jose Vargas Quiros, Chirag Raman, Hayley Hung:
How Private is Low-Frequency Speech Audio in the Wild? An Analysis of Verbal Intelligibility by Humans and Machines. - Ching-Yu Yang, Shreya G. Upadhyay, Ya-Tse Wu, Bo-Hao Su, Chi-Chun Lee:
RW-VoiceShield: Raw Waveform-based Adversarial Attack on One-shot Voice Conversion.
Speech Synthesis: Voice Conversion 2
- Aleksei Gusev, Anastasia Avdeeva:
Improvement Speaker Similarity for Zero-Shot Any-to-Any Voice Conversion of Whispered and Regular Speech. - Ji Sub Um, Hoirin Kim:
Utilizing Adaptive Global Response Normalization and Cluster-Based Pseudo Labels for Zero-Shot Voice Conversion. - Linhan Ma, Xinfa Zhu, Yuanjun Lv, Zhichao Wang, Ziqian Wang, Wendi He, Hongbin Zhou, Lei Xie:
Vec-Tok-VC+: Residual-enhanced Robust Zero-shot Voice Conversion with Progressive Constraints in a Dual-mode Training Strategy. - Takuto Igarashi, Yuki Saito, Kentaro Seki, Shinnosuke Takamichi, Ryuichi Yamamoto, Kentaro Tachibana, Hiroshi Saruwatari:
Noise-Robust Voice Conversion by Conditional Denoising Training Using Latent Variables of Recording Quality and Environment. - Hiroki Kanagawa, Takafumi Moriya, Yusuke Ijima:
Pre-training Neural Transducer-based Streaming Voice Conversion for Faster Convergence and Alignment-free Training. - Le Xu, Jiangyan Yi, Tao Wang, Yong Ren, Rongxiu Zhong, Zhengqi Wen, Jianhua Tao:
Residual Speaker Representation for One-Shot Voice Conversion. - Nicolas Gengembre, Olivier Le Blouch, Cédric Gendrot:
Disentangling prosody and timbre embeddings via voice conversion. - Shihao Chen, Yu Gu, Jie Zhang, Na Li, Rilin Chen, Liping Chen, Lirong Dai:
LDM-SVC: Latent Diffusion Model Based Zero-Shot Any-to-Any Singing Voice Conversion with Singer Guidance.
Speech Synthesis: Text Processing
- Amit Roth, Arnon Turetzky, Yossi Adi:
A Language Modeling Approach to Diacritic-Free Hebrew TTS. - Avihu Dekel, Raul Fernandez:
Exploring the Benefits of Tokenization of Discrete Acoustic Units. - Markéta Rezácková, Daniel Tihelka, Jindrich Matousek:
Homograph Disambiguation with Text-to-Text Transfer Transformer. - Kiyoshi Kurihara, Masanori Sano:
Enhancing Japanese Text-to-Speech Accuracy with a Novel Combination Transformer-BERT-based G2P: Integrating Pronunciation Dictionaries and Accent Sandhi. - Yuma Shirahata, Byeongseon Park, Ryuichi Yamamoto, Kentaro Tachibana:
Audio-conditioned phonemic and prosodic annotation for building text-to-speech models from unlabeled speech data. - Xingxing Yang:
G2PA: G2P with Aligned Audio for Mandarin Chinese. - Siqi Sun, Korin Richmond:
Learning Pronunciation from Other Accents via Pronunciation Knowledge Transfer. - Deepanshu Gupta, Javier Latorre:
Positional Description for Numerical Normalization. - Christina Tånnander, Shivam Mehta, Jonas Beskow, Jens Edlund:
Beyond graphemes and phonemes: continuous phonological features in neural text-to-speech synthesis.
Training Methods, Self-Supervised Learning, Adaptation
- Adriana Fernandez-Lopez, Honglie Chen, Pingchuan Ma, Lu Yin, Qiao Xiao, Stavros Petridis, Shiwei Liu, Maja Pantic:
MSRS: Training Multimodal Speech Recognition Models from Scratch with Sparse Mask Optimization. - Amrutha Prasad, Srikanth R. Madikeri, Driss Khalil, Petr Motlícek, Christof Schüpbach:
Speech and Language Recognition with Low-rank Adaptation of Pretrained Models. - Kwangyoun Kim, Suwon Shon, Yi-Te Hsu, Prashant Sridhar, Karen Livescu, Shinji Watanabe:
Convolution-Augmented Parameter-Efficient Fine-Tuning for Speech Recognition. - Amit Meghanani, Thomas Hain:
LASER: Learning by Aligning Self-supervised Representations of Speech for Improving Content-related Tasks. - Robert Flynn, Anton Ragni:
Self-Train Before You Transcribe. - Steven Vander Eeckt, Hugo Van hamme:
Unsupervised Online Continual Learning for Automatic Speech Recognition. - Hao Shi, Tatsuya Kawahara:
Dual-path Adaptation of Pretrained Feature Extraction Module for Robust Automatic Speech Recognition. - Nahomi Kusunoki, Yosuke Higuchi, Tetsuji Ogawa, Tetsunori Kobayashi:
Hierarchical Multi-Task Learning with CTC and Recursive Operation. - Keigo Hojo, Yukoh Wakabayashi, Kengo Ohta, Atsunori Ogawa, Norihide Kitaoka:
Boosting CTC-based ASR using inter-layer attention-based CTC loss. - Hyung Yong Kim, Byeong-Yeol Kim, Yunkyu Lim, Jihwan Park, Shukjae Choi, Yooncheol Ju, Jinseok Park, Youshin Lim, Seung Woo Yu, Hanbin Lee, Shinji Watanabe:
Self-training ASR Guided by Unsupervised ASR Teacher. - Yue Gu, Zhihao Du, Shiliang Zhang, Jiqing Han, Yongjun He:
Personality-memory Gated Adaptation: An Efficient Speaker Adaptation for Personalized End-to-end Automatic Speech Recognition. - George Joseph, Arun Baby:
Speaker Personalization for Automatic Speech Recognition using Weight-Decomposed Low-Rank Adaptation. - Jae-Hong Lee, Sang-Eon Lee, Dong-Hyun Kim, Do-Hee Kim, Joon-Hyuk Chang:
Online Subloop Search via Uncertainty Quantization for Efficient Test-Time Adaptation. - Vishwanath Pratap Singh, Federico Malato, Ville Hautamäki, Md. Sahidullah, Tomi Kinnunen:
ROAR: Reinforcing Original to Augmented Data Ratio Dynamics for Wav2vec2.0 Based ASR. - Jeehye Lee, Hyeji Seo:
Online Knowledge Distillation of Decoder-Only Large Language Models for Efficient Speech Recognition.
Novel Architectures for ASR
- Tomoki Honda, Shinsuke Sakai, Tatsuya Kawahara:
Efficient and Robust Long-Form Speech Recognition with Hybrid H3-Conformer. - Yosuke Kashiwagi, Hayato Futami, Emiru Tsunoo, Siddhant Arora, Shinji Watanabe:
Rapid Language Adaptation for Multilingual E2E Speech Recognition Using Encoder Prompting. - Virat Shejwalkar, Om Thakkar, Arun Narayanan:
Quantifying Unintended Memorization in BEST-RQ ASR Encoders. - Woo Hyun Kang, Srikanth Vishnubhotla, Rudolf Braun, Yogesh Virkar, Raghuveer Peri, Kyu J. Han:
SWAN: SubWord Alignment Network for HMM-free word timing estimation in end-to-end automatic speech recognition.
Multimodality and Foundation Models
- Ziyun Cui, Chang Lei, Wen Wu, Yinan Duan, Diyang Qu, Ji Wu, Runsen Chen, Chao Zhang:
Spontaneous Speech-Based Suicide Risk Detection Using Whisper and Large Language Models. - Mohammad Amaan Sayeed, Hanan Aldarmaki:
Spoken Word2Vec: Learning Skipgram Embeddings from Speech. - Pawel Bujnowski, Bartlomiej Kuzma, Bartlomiej Paziewski, Jacek Rutkowski, Joanna Marhula, Zuzanna Bordzicka, Piotr Andruszkiewicz:
SAMSEMO: New dataset for multilingual and multimodal emotion recognition. - Bonian Jia, Huiyao Chen, Yueheng Sun, Meishan Zhang, Min Zhang:
LLM-Driven Multimodal Opinion Expression Identification. - Xiaolou Li, Zehua Liu, Chen Chen, Lantian Li, Li Guo, Dong Wang:
Zero-Shot Fake Video Detection by Audio-Visual Consistency. - Han EunGi, Oh Hyun-Bin, Kim Sung-Bin, Corentin Nivelet Etcheberry, Suekyeong Nam, Janghoon Ju, Tae-Hyun Oh:
Enhancing Speech-Driven 3D Facial Animation with Audio-Visual Guidance from Lip Reading Expert.
Spoken Dialogue Systems and Conversational Analysis 1
- Matthew McNeill, Rivka Levitan:
Autoregressive cross-interlocutor attention scores meaningfully capture conversational dynamics. - Conor Atkins, Ian D. Wood, Mohamed Ali Kâafar, Hassan Asghar, Nardine Basta, Michal Kepkowski:
ConvoCache: Smart Re-Use of Chatbot Responses. - Livia Qian, Gabriel Skantze:
Joint Learning of Context and Feedback Embeddings in Spoken Dialogue. - Neha Sahipjohn, Ashishkumar Gudmalwar, Nirmesh Shah, Pankaj Wasnik, Rajiv Ratn Shah:
DubWise: Video-Guided Speech Duration Control in Multimodal LLM-based Text-to-Speech for Dubbing. - Siyang Wang, Éva Székely, Joakim Gustafson:
Contextual Interactive Evaluation of TTS Models in Dialogue Systems. - Min-Han Shih, Ho-Lam Chung, Yu-Chi Pai, Ming-Hao Hsu, Guan-Ting Lin, Shang-Wen Li, Hung-yi Lee:
GSQA: An End-to-End Model for Generative Spoken Question Answering.
Speech Technology
- Mattias Nilsson, Riccardo Miccini, Clement Laroche, Tobias Piechowiak, Friedemann Zenke:
Resource-Efficient Speech Quality Prediction through Quantization Aware Training and Binary Activation Maps. - Maryam Naderi, Enno Hermann, Alexandre Nanchen, Sevada Hovsepyan, Mathew Magimai-Doss:
Towards interfacing large language models with ASR systems using confidence measures and prompting. - Zhong Meng, Zelin Wu, Rohit Prabhavalkar, Cal Peyser, Weiran Wang, Nanxin Chen, Tara N. Sainath, Bhuvana Ramabhadran:
Text Injection for Neural Contextual Biasing. - Minglin Wu, Jing Xu, Xixin Wu, Helen Meng:
Prompting Large Language Models with Mispronunciation Detection and Diagnosis Abilities. - Haitong Sun, Jaehyun Choi, Nobuaki Minematsu, Daisuke Saito:
Acceleration of Posteriorgram-based DTW by Distilling the Class-to-class Distances Encoded in the Classifier Used to Calculate Posteriors. - Ashishkumar Gudmalwar, Nirmesh Shah, Sai Akarsh, Pankaj Wasnik, Rajiv Ratn Shah:
VECL-TTS: Voice identity and Emotional style controllable Cross-Lingual Text-to-Speech. - Jonathan Svirsky, Uri Shaham, Ofir Lindenbaum:
Sparse Binarization for Fast Keyword Spotting.
Pathological Speech Analysis 2
- Bence Mark Halpern, Thomas Tienkamp, Wen-Chin Huang, Lester Phillip Violeta, Teja Rebernik, Sebastiaan A. H. J. de Visscher, Max J. H. Witjes, Martijn Wieling, Defne Abur, Tomoki Toda:
Quantifying the effect of speech pathology on automatic and human speaker verification. - Bubai Maji, Rajlakshmi Guha, Aurobinda Routray, Shazia Nasreen, Debabrata Majumdar:
Investigation of Layer-Wise Speech Representations in Self-Supervised Learning Models: A Cross-Lingual Study in Detecting Depression. - Tanya Talkar, Sherman Charles, Chelsea Krantsevich, Kan Kawabata:
Detection of Cognitive Impairment And Alzheimer's Disease Using a Speech- and Language-Based Protocol. - Nana Lin, Youxiang Zhu, Xiaohui Liang, John A. Batsis, Caroline Summerour:
Analyzing Multimodal Features of Spontaneous Voice Assistant Commands for Mild Cognitive Impairment Detection. - Dominika Woszczyk, Ranya Aloufi, Soteris Demetriou:
Prosody-Driven Privacy-Preserving Dementia Detection. - Alkis Koudounas, Gabriele Ciravegna, Marco Fantini, Erika Crosetti, Giovanni Succo, Tania Cerquitelli, Elena Baralis:
Voice Disorder Analysis: a Transformer-based Approach.
Speech Science, Speech Technology, and Gender (Special Session)
- Martha Schubert, Daniel Duran, Ingo Siegert:
Challenges of German Speech Recognition: A Study on Multi-ethnolectal Speech Among Adolescents. - Atli Sigurgeirsson, Eddie L. Ungless:
Just Because We Camp, Doesn't Mean We Should: The Ethics of Modelling Queer Voices. - Valentin Pelloin, Lena Dodson, Émile Chapuis, Nicolas Hervé, David Doukhan:
Automatic Classification of News Subjects in Broadcast News: Application to a Gender Bias Representation Analysis. - David Doukhan, Lena Dodson, Manon Conan, Valentin Pelloin, Aurélien Clamouse, Mélina Lepape, Géraldine Van Hille, Cécile Méadel, Marlène Coulomb-Gully:
Gender Representation in TV and Radio: Automatic Information Extraction methods versus Manual Analyses. - Cliodhna Hughes, Guy J. Brown, Ning Ma, Nicola Dibben:
Acoustic Effects of Facial Feminisation Surgery on Speech and Singing: A Case Study. - Éva Székely, Maxwell Hope:
An inclusive approach to creating a palette of synthetic voices for gender diversity. - Robin Netzorg, Alyssa Cote, Sumi Koshin, Klo Vivienne Garoute, Gopala Krishna Anumanchipalli:
Speech After Gender: A Trans-Feminine Perspective on Next Steps for Speech Science and Technology. - Li-Fang Lai, Nicole R. Holliday:
Voice Quality Variation in AAE: An Additional Challenge for Addressing Bias in ASR Models? - Benjamin Elie, David Doukhan, Rémi Uro, Lucas Ondel Yang, Albert Rilliard, Simon Devauchelle:
Articulatory Configurations across Genders and Periods in French Radio and TV archives. - Aravind Krishnan, Badr M. Abdullah, Dietrich Klakow:
On the Encoding of Gender in Transformer-based ASR Representations.
Speech Production and Perception
- Chaofei Fan, Jaimie M. Henderson, Chris Manning, Francis R. Willett:
Towards a Quantitative Analysis of Coarticulation with a Phoneme-to-Articulatory Model. - Chetan Sharma, Vaishnavi Chandwanshi, Prasanta Kumar Ghosh:
A comparative study of the impact of voiceless alveolar and palato-alveolar sibilants in English on lip aperture and protrusion during VCV production. - Peter Birkholz, Patrick Häsner:
Measurement and simulation of pressure losses due to airflow in vocal tract models. - Qiang Fang:
On The Performance of EMA-synchronized Speech and Stand-alone Speech in Acoustic-to-articulatory Inversion. - Marc Freixes, Marc Arnela, Joan Claudi Socoró, Luis Joglar-Ongay, Oriol Guasch, Francesc Alías Pujol:
Glottal inverse filtering and vocal tract tuning for the numerical simulation of vowel /a/ with different levels of vocal effort. - Daniel Friedrichs, Monica Lancheros, Sam Kirkham, Lei He, Andrew Clark, Clemens Lutz, Volker Dellwo, Steven Moran:
Temporal Co-Registration of Simultaneous Electromagnetic Articulography and Electroencephalography for Precise Articulatory and Neural Data Alignment.
Phonetics and Phonology: Segmentals and Suprasegmentals
- Zuzanna Miodonska, Michal Krecichwost, Ewa Kwasniok, Agata Sage, Pawel Badura:
Frication noise features of Polish voiceless dental fricative and affricate produced by children with and without speech disorder. - Yiying Hu, Hui Feng:
Key Acoustic Cues for the Realization of Metrical Prominence in Tone Languages: A Cross-Dialect Study. - Michaela Watkins, Paul Boersma, Silke Hamann:
Revisiting Pitch Jumps: F0 Ratio in Seoul Korean. - Lorenzo Maselli, Véronique Delvaux:
Aerodynamics of Sakata labial-velar oral stops. - Donna Erickson, Albert Rilliard, Malin Svensson Lundmark, Adelaide Silva, Leticia Rebollo Couto, Oliver Niebuhr, João Antônio de Moraes:
Collecting Mandible Movement in Brazilian Portuguese. - May Pik Yu Chan, Jianjing Kuang:
Pitch-driven adjustments in tongue positions: Insights from ultrasound imaging.
Topics in Paralinguistics
- Suhas BN, Amanda Rebar, Saeed Abdullah:
Speaking of Health: Leveraging Large Language Models to assess Exercise Motivation and Behavior of Rehabilitation Patients. - Wen Wu, Chao Zhang, Philip C. Woodland:
Confidence Estimation for Automatic Detection of Depression and Alzheimer's Disease Based on Clinical Interviews. - Hitoshi Suda, Aya Watanabe, Shinnosuke Takamichi:
Who Finds This Voice Attractive? A Large-Scale Experiment Using In-the-Wild Data. - Ryo Setoguchi, Yoshiko Arimoto:
Acoustical analysis of the initial phones in speech-laugh. - Yaqian Hao, Chenguang Hu, Yingying Gao, Shilei Zhang, Junlan Feng:
On Calibration of Speech Classification Models: Insights from Energy-Based Model Investigations. - Rui Liu, Zening Ma:
Emotion-Aware Speech Self-Supervised Representation Learning with Intensity Knowledge.
Emotion Recognition: Fairness, Variability, Uncertainty
- Jingyao Wu, Ting Dang, Vidhyasaharan Sethu, Eliathamby Ambikairajah:
Dual-Constrained Dynamical Neural ODEs for Ambiguity-aware Continuous Emotion Prediction. - Hsing-Hang Chou, Woan-Shiuan Chien, Ya-Tse Wu, Chi-Chun Lee:
An Inter-Speaker Fairness-Aware Speech Emotion Regression Framework. - James Tavernor, Yara El-Tawil, Emily Mower Provost:
The Whole Is Bigger Than the Sum of Its Parts: Modeling Individual Annotators to Capture Emotional Variability. - Haoqin Sun, Shiwan Zhao, Xiangyu Kong, Xuechen Wang, Hui Wang, Jiaming Zhou, Yong Qin:
Iterative Prototype Refinement for Ambiguous Speech Emotion Recognition. - Woan-Shiuan Chien, Chi-Chun Lee:
An Investigation of Group versus Individual Fairness in Perceptually Fair Speech Emotion Recognition. - Oliver Schrüfer, Manuel Milling, Felix Burkhardt, Florian Eyben, Björn W. Schuller:
Are you sure? Analysing Uncertainty Quantification Approaches for Real-world Speech Emotion Recognition. - Ricardo García, Rodrigo Mahú, Nicolás Grágeda, Alejandro Luzanto, Nicolas Bohmer, Carlos Busso, Néstor Becerra Yoma:
Speech emotion recognition with deep learning beamforming on a distant human-robot interaction scenario.
Speaker Verification
- Themos Stafylakis, Anna Silnova, Johan Rohdin, Oldrich Plchot, Lukás Burget:
Challenging margin-based speaker embedding extractors by using the variational information bottleneck. - Jen-Tzung Chien, I-Ping Yeh, Man-Wai Mak:
Collaborative Contrastive Learning for Hypothesis Domain Adaptation. - Imen Ben Amor, Jean-François Bonastre, Salima Mdhaffar:
Extraction of interpretable and shared speaker-specific speech attributes through binary auto-encoder. - Ivan Yakovlev, Rostislav Makarov, Andrei Balykin, Pavel Malov, Anton Okhotnikov, Nikita Torgashov:
Reshape Dimensions Network for Speaker Recognition. - Jee-weon Jung, Xin Wang, Nicholas W. D. Evans, Shinji Watanabe, Hye-jin Shim, Hemlata Tak, Siddhant Arora, Junichi Yamagishi, Joon Son Chung:
To what extent can ASV systems naturally defend against spoofing attacks? - Yafeng Chen, Siqi Zheng, Hui Wang, Luyao Cheng, Qian Chen, Shiliang Zhang, Junjie Li:
ERes2NetV2: Boosting Short-Duration Speaker Verification Performance with Computational Efficiency.
Spatial Audio and Acoustics
- Yuri Y. Khokhlov, Tatiana Prisyach, Anton Mitrofanov, Dmitry Dutov, Igor Agafonov, Tatiana Timofeeva, Aleksei Romanenko, Maxim Korenevsky:
Classification of Room Impulse Responses and its application for channel verification and diarization. - Liam Kelley, Diego Di Carlo, Aditya Arie Nugraha, Mathieu Fontaine, Yoshiaki Bando, Kazuyoshi Yoshii:
RIR-in-a-Box: Estimating Room Acoustics from 3D Mesh Data through Shoebox Approximation. - Byeongjoo Ahn, Karren D. Yang, Brian Hamilton, Jonathan Sheaffer, Anurag Ranjan, Miguel Sarabia, Oncel Tuzel, Jen-Hao Rick Chang:
Novel-view Acoustic Synthesis From 3D Reconstructed Rooms. - Liang Tao, Maoshen Jia, Yonggang Hu, Changchun Bao:
Spatial Acoustic Enhancement Using Unbiased Relative Harmonic Coefficients. - Alireza Bayestehtashk, Amit Kumar, Mike Wurtz:
Design of Feedback Active Noise Cancellation Filter Using Nested Recurrent Neural Networks. - Sidi Yaya Arnaud Yarga, Sean U. N. Wood:
Neuromorphic Keyword Spotting with Pulse Density Modulation MEMS Microphones. - Jacob Bitterman, Daniel Levi, Hilel Hagai Diamandi, Sharon Gannot, Tal Rosenwein:
RevRIR: Joint Reverberant Speech and Room Impulse Response Embedding using Contrastive Learning with Application to Room Shape Classification.
Generative Models for Speech and Audio
- Yatong Bai, Trung Dang, Dung N. Tran, Kazuhito Koishida, Somayeh Sojoudi:
ConsistencyTTA: Accelerating Diffusion-Based Text-to-Audio Generation with Consistency Distillation. - Francesco Paissan, Luca Della Libera, Zhepei Wang, Paris Smaragdis, Mirco Ravanelli, Cem Subakan:
Audio Editing with Non-Rigid Text Prompts. - Shubham Gupta, Mirco Ravanelli, Pascal Germain, Cem Subakan:
Phoneme Discretized Saliency Maps for Explainable Detection of AI-Generated Voice. - Vasileios Moschopoulos, Thanasis Kotsiopoulos, Pablo Peso Parada, Konstantinos Nikiforidis, Alexandros Stergiadis, Gerasimos Papakostas, Md Asif Jalal, Jisi Zhang, Anastasios Drosou, Karthikeyan Saravanan:
Exploring compressibility of transformer based text-to-music (TTM) models. - Jaewon Kim, Won-Gook Choi, Seyun Ahn, Joon-Hyuk Chang:
Sound of Vision: Audio Generation from Visual Text Embedding through Training Domain Discriminator. - Ho-Young Choi, Won-Gook Choi, Joon-Hyuk Chang:
Retrieval-Augmented Classifier Guidance for Audio Generation. - Umberto Cappellazzo, Daniele Falavigna, Alessio Brutti:
Efficient Fine-tuning of Audio Spectrogram Transformers via Soft Mixture of Adapters. - Soham Deshmukh, Dareen Alharthi, Benjamin Elizalde, Hannes Gamper, Mahmoud Al Ismail, Rita Singh, Bhiksha Raj, Huaming Wang:
PAM: Prompting Audio-Language Models for Audio Quality Assessment.
Speech and Audio Modelling
- Yingying Gao, Shilei Zhang, Chao Deng, Junlan Feng:
GenDistiller: Distilling Pre-trained Language Models based on an Autoregressive Generative Model. - Séverine Guillaume, Maxime Fily, Alexis Michaud, Guillaume Wisniewski:
Gender and Language Identification in Multilingual Models of Speech: Exploring the Genericity and Robustness of Speech Representations. - Zhaoyu Wang, Haohe Liu, Harry Coppock, Björn W. Schuller, Mark D. Plumbley:
Neural Compression Augmentation for Contrastive Audio Representation Learning. - Sai Harshitha Aluru, Jhansi Mallela, Chiranjeevi Yarra:
Post-Net: A linguistically inspired sequence-dependent transformed neural architecture for automatic syllable stress detection.
Multi-Channel Speech Enhancement
- Marvin Tammen, Tsubasa Ochiai, Marc Delcroix, Tomohiro Nakatani, Shoko Araki, Simon Doclo:
Array Geometry-Robust Attention-Based Neural Beamformer for Moving Speakers. - Zhongweiyang Xu, Ali Aroudi, Ke Tan, Ashutosh Pandey, Jung-Suk Lee, Buye Xu, Francesco Nesta:
FoVNet: Configurable Field-of-View Speech Enhancement with Low Computation and Distortion for Smart Glasses. - Shiran Aziz, Yossi Adi, Shmuel Peleg:
Audio Enhancement from Multiple Crowdsourced Recordings: A Simple and Effective Baseline. - Dongheon Lee, Jung-Woo Choi:
DeFTAN-AA: Array Geometry Agnostic Multichannel Speech Enhancement. - Nan Zhou, Youhai Jiang, Jialin Tan, Chongmin Qi:
PLDNet: PLD-Guided Lightweight Deep Network Boosted by Efficient Attention for Handheld Dual-Microphone Speech Enhancement.
Speech Synthesis: Paradigms and Methods 1
- Charles McGhee, Kate M. Knill, Mark J. F. Gales:
Highly Intelligible Speaker-Independent Articulatory Synthesis. - Masato Murata, Koichi Miyazaki, Tomoki Koriyama:
An Attribute Interpolation Method in Speech Synthesis by Model Merging. - Miku Nishihara, Dan Wells, Korin Richmond, Aidan Pine:
Low-dimensional Style Token Control for Hyperarticulated Speech Synthesis. - Hanzhao Li, Liumeng Xue, Haohan Guo, Xinfa Zhu, Yuanjun Lv, Lei Xie, Yunlin Chen, Hao Yin, Zhifei Li:
Single-Codec: Single-Codebook Speech Codec towards High-Performance Speech Generation. - Trung Dang, David Aponte, Dung N. Tran, Kazuhito Koishida:
LiveSpeech: Low-Latency Zero-shot Text-to-Speech via Autoregressive Modeling of Audio Discrete Codes. - Changhwan Kim:
ClariTTS: Feature-ratio Normalization and Duration Stabilization for Code-mixed Multi-speaker Speech Synthesis. - John Janiczek, Dading Chong, Dongyang Dai, Arlo Faria, Chao Wang, Tao Wang, Yuzong Liu:
Multi-modal Adversarial Training for Zero-Shot Voice Cloning. - Chung-Ming Chien, Andros Tjandra, Apoorv Vyas, Matt Le, Bowen Shi, Wei-Ning Hsu:
Learning Fine-Grained Controllability on Speech Generation via Efficient Fine-Tuning. - Rongshuai Wu, Debasish Ray Mohapatra, Sidney Fels:
Modeling Vocal Tract Like Acoustic Tubes Using the Immersed Boundary Method.
Speech Synthesis: Paradigms and Methods 2
- Théodor Lemerle, Nicolas Obin, Axel Roebel:
Small-E: Small Language Model with Linear Attention for Efficient Speech Synthesis. - Paarth Neekhara, Shehzeen Hussain, Subhankar Ghosh, Jason Li, Boris Ginsburg:
Improving Robustness of LLM-based Speech Synthesis by Learning Monotonic Alignment. - Shijie Lai, Minglu He, Zijing Zhao, Kai Wang, Hao Huang, Jichen Yang:
Synthesizing Long-Form Speech merely from Sentence-Level Corpus with Content Extrapolation and LLM Contextual Enrichment. - Rui Liu, Jiatian Xi, Ziyue Jiang, Haizhou Li:
FluentEditor: Text-based Speech Editing by Considering Acoustic and Prosody Consistency. - Kun Zhou, Shengkui Zhao, Yukun Ma, Chong Zhang, Hao Wang, Dianwen Ng, Chongjia Ni, Trung Hieu Nguyen, Jia Qi Yip, Bin Ma:
Phonetic Enhanced Language Modeling for Text-to-Speech Synthesis. - Joun Yeop Lee, Myeonghun Jeong, Minchan Kim, Ji-Hyun Lee, Hoon-Young Cho, Nam Soo Kim:
High Fidelity Text-to-Speech Via Discrete Tokens Using Token Transducer and Group Masked Language Model. - Martin Lenglet, Olivier Perrotin, Gérard Bailly:
FastLips: an End-to-End Audiovisual Text-to-Speech System with Lip Features Prediction for Virtual Avatars.
Neural Network Architectures for ASR 1
- Tien-Ju Yang, Andrew Rosenberg, Bhuvana Ramabhadran:
Contemplative Mechanism for Speech Recognition: Speech Encoders can Think. - Titouan Parcollet, Rogier van Dalen, Shucong Zhang, Sourav Bhattacharya:
SummaryMixing: A Linear-Complexity Alternative to Self-Attention for Speech Recognition and Understanding. - Takafumi Moriya, Takanori Ashihara, Masato Mimura, Hiroshi Sato, Kohei Matsuura, Ryo Masumura, Taichi Asami:
Boosting Hybrid Autoregressive Transducer-based ASR with Internal Acoustic Model Training and Dual Blank Thresholding. - Arnav Kundu, Prateeth Nayak, Priyanka Padmanabhan, Devang Naik:
RepCNN: Micro-sized, Mighty Models for Wakeword Detection. - Matthijs Van Keirsbilck, Alexander Keller:
Conformer without Convolutions. - Shucong Zhang, Titouan Parcollet, Rogier van Dalen, Sourav Bhattacharya:
Linear-Complexity Self-Supervised Learning for Speech Processing.
Error Correction and Rescoring
- Ashish R. Mittal, Darshan Prabhu, Sunita Sarawagi, Preethi Jyothi:
SALSA: Speedy ASR-LLM Synchronous Aggregation. - Eunseop Yoon, Hee Suk Yoon, John B. Harvill, Mark Hasegawa-Johnson, Chang D. Yoo:
LI-TTA: Language Informed Test-Time Adaptation for Automatic Speech Recognition. - Yiwei Wang, Ke-Han Lu, Kuan-Yu Chen:
HypR: A comprehensive study for ASR hypothesis revising with a reference corpus. - Yuchun Shu, Bo Hu, Yifeng He, Hao Shi, Longbiao Wang, Jianwu Dang:
Error Correction by Paying Attention to Both Acoustic and Confidence References for Automatic Speech Recognition. - Iwen E. Kang, Christophe Van Gysel, Man-Hung Siu:
Transformer-based Model for ASR N-Best Rescoring and Rewriting. - Hao Yang, Min Zhang, Minghan Wang, Jiaxin Guo:
RASU: Retrieval Augmented Speech Understanding through Generative Modeling.
Spoken Language Understanding
- Ainikaerjiang Aimaiti, Di Wu, Liting Jiang, Gulinigeer Abudouwaili, Hao Huang, Wushour Silamu:
An Uyghur Extension to the MASSIVE Multi-lingual Spoken Language Understanding Corpus with Comprehensive Evaluations. - Lukas Christ, Shahin Amiriparian, Friederike Hawighorst, Ann-Kathrin Schill, Angelo Boutalikakis, Lorenz Graf-Vlachy, Andreas König, Björn W. Schuller:
This Paper Had the Smartest Reviewers - Flattery Detection Utilising an Audio-Textual Transformer-Based Approach. - Eunice Akani, Frédéric Béchet, Benoît Favre, Romain Gemignani:
Unified Framework for Spoken Language Understanding and Summarization in Task-Based Human Dialog processing. - Grant Anderson, Emma Hart, Dimitra Gkatzia, Ian Beaver:
Automated Human-Readable Label Generation in Open Intent Discovery. - Jeremy Chang, Kuan-Yu Chen, Chung-Hsien Wu:
Applying Reinforcement Learning and Multi-Generators for Stage Transition in an Emotional Support Dialogue System.
Spoken Dialogue Systems and Conversational Analysis 2
- Tuochao Chen, Qirui Wang, Bohan Wu, Malek Itani, Sefik Emre Eskimez, Takuya Yoshioka, Shyamnath Gollakota:
Target conversation extraction: Source separation using turn-taking dynamics. - Sara Ng, Gina-Anne Levow, Mari Ostendorf, Richard A. Wright:
Investigating the Influence of Stance-Taking on Conversational Timing of Task-Oriented Speech. - Rémi Uro, Marie Tahon, David Doukhan, Antoine Laurent, Albert Rilliard:
Detecting the terminality of speech-turn boundary for spoken interactions in French TV and Radio content. - Yu Watanabe, Koichiro Ito, Shigeki Matsubara:
Utilization of Text Data for Response Timing Detection in Attentive Listening. - Yo-Han Park, Wencke Liermann, Yong-Seok Choi, Seung Hi Kim, Jeong-Uk Bang, Seung Yun, Kong Joo Lee:
Backchannel prediction, based on who, when and what. - Mathilde Hutin, Junfei Hu, Liesbeth Degand:
Uh, um and mh: Are filled pauses prone to conversational convergence? - Masaya Ohagi, Tomoya Mizumoto, Katsumasa Yoshikawa:
Investigation of look-ahead techniques to improve response time in spoken dialogue system.
Computational Models of Human Language Acquisition, Perception, and Production (Special Session)
- Annika Heuser, Jianjing Kuang:
Information-theoretic hypothesis generation of relative cue weighting for the voicing contrast. - Sevada Hovsepyan, Mathew Magimai-Doss:
Neurocomputational model of speech recognition for pathological speech detection: a case study on Parkinson's disease speech detection. - Angelo Ortiz Tandazo, Thomas Schatz, Thomas Hueber, Emmanuel Dupoux:
Simulating articulatory trajectories with phonological feature interpolation. - Kentaro Onda, Joonyong Park, Nobuaki Minematsu, Daisuke Saito:
A Pilot Study of GSLM-based Simulation of Foreign Accentuation Only Using Native Speech Corpora. - Guillem Bonafos, Clara Bourot, Pierre Pudlo, Jean-Marc Freyermuth, Laurence Reboul, Samuel Tronçon, Arnaud Rey:
Dirichlet process mixture model based on topologically augmented signal representation for clustering infant vocalizations. - Benjamin Elie, Juraj Simko, Alice Turk:
A data-driven model of acoustic speech intelligibility for optimization-based models of speech production. - Joseph Coffey, Okko Räsänen, Camila Scaff, Alejandrina Cristià:
The Difficulty and Importance of Estimating the Lower and Upper Bounds of Infant Speech Exposure. - Benjamin van Niekerk, Julian Zaïdi, Marc-André Carbonneau, Herman Kamper:
Spoken-Term Discovery using Discrete Speech Units. - Mukhtar Mohamed, Oli Danyi Liu, Hao Tang, Sharon Goldwater:
Orthogonality and isotropy of speaker and phonetic information in self-supervised speech representations.
Show and Tell 3
- Elena Ryumina, Dmitry Ryumin, Alexey Karpov:
OCEAN-AI: open multimodal framework for personality traits assessment and HR-processes automatization. - Paridhi Mundra, Manik Sharma, Yashwardhan Chaudhuri, Orchid Chetia Phukan, Arun Balaji Buduru:
VoxMed: one-step respiratory disease classifier using digital stethoscope sounds. - Sarthak Sharma, Orchid Chetia Phukan, Drishti Singh, Arun Balaji Buduru, Rajesh Sharma:
AVR: synergizing foundation models for audio-visual humor detection. - Yashwardhan Chaudhuri, Paridhi Mundra, Arnesh Batra, Orchid Chetia Phukan, Arun Balaji Buduru:
ASGIR: audio spectrogram transformer guided classification and information retrieval for birds. - Devyani Koshal, Orchid Chetia Phukan, Sarthak Jain, Arun Balaji Buduru, Rajesh Sharma:
PERSONA: an application for emotion recognition, gender recognition and age estimation. - Mohd Mujtaba Akhtar, Girish, Orchid Chetia Phukan, Muskaan Singh:
NeuRO: an application for code-switched autism detection in children. - Orchid Chetia Phukan, Sarthak Jain, Shubham Singh, Muskaan Singh, Arun Balaji Buduru, Rajesh Sharma:
ComFeAT: combination of neural and spectral features for improved depression detection. - Sarthak Jain, Orchid Chetia Phukan, Arun Balaji Buduru, Rajesh Sharma:
The reasonable effectiveness of speaker embeddings for violence detection. - Dmitrii Obukhov, Marcel de Korte, Andrey Adaschik:
ATTEST: an analytics tool for the testing and evaluation of speech technologies. - Margot Masson, Erfan A. Shams, Iona Gessinger, Julie Carson-Berndsen:
PhoneViz: exploring alignments at a glance. - Clément Pages, Hervé Bredin:
Gryannote open-source speaker diarization labeling tool. - Giovanni Morrone, Enrico Zovato, Fabio Brugnara, Enrico Sartori, Leonardo Badino:
A toolkit for joint speaker diarization and identification with application to speaker-attributed ASR.
Phonetics, Phonology and Prosody
- Tomi H. Kinnunen, Rosa González Hautamäki, Xin Wang, Junichi Yamagishi:
Speaker Detection by the Individual Listener and the Crowd: Parametric Models Applicable to Bonafide and Deepfake Speech. - Alessandro De Luca, Andrew Clark, Volker Dellwo:
NumberLie: a game-based experiment to understand the acoustics of deception and truthfulness. - Federico Lo Iacono, Valentina Colonna, Antonio Romano:
Preservation, conservation and phonetic study of the voices of Italian poets: A study on the seven years of the VIP archive. - Nicolas Audibert, Cécile Fougeron, Christine Meunier:
Do Speaker-dependent Vowel Characteristics depend on Speech Style? - Suyuan Liu, Molly Babel, Jian Zhu:
A comparison of voice similarity through acoustics, human perception and deep neural network (DNN) speaker verification systems. - Austin Jones, Margaret E. L. Renwick:
Evaluating Italian Vowel Variation with the Recurrent Neural Network Phonet. - Kira Tulchynska, Sylvanus Job, Alena Witzlack-Makarevich, Margaret Zellers:
Prosodic marking of syntactic boundaries in Khoekhoe.
Segmentals
- Johanna Cronenberg, Ioana Chitoran, Lori Lamel, Ioana Vasilescu:
Crosslinguistic Comparison of Acoustic Variation in the Vowel Sequences /ia/ and /io/ in Four Romance Languages. - Jenifer Vega Rodríguez, Nathalie Vallée, Christophe Savariaux, Silvain Gerber:
Nasal Air Flow During Speech Production In Korebaju. - Ted Kye:
Affricates in Lushootseed. - Viyazonuo Terhiija, Priyankoo Sarmah:
Voiced and voiceless laterals in Angami. - Minmin Yang, Rachid Ridouane:
Intrusive schwa within French stop-liquid clusters: An acoustic analysis.
New Avenues in Emotion Recognition
- Ya-Tse Wu, Jingyao Wu, Vidhyasaharan Sethu, Chi-Chun Lee:
Can Modelling Inter-Rater Ambiguity Lead To Noise-Robust Continuous Emotion Predictions? - Ziping Zhao, Tian Gao, Haishuai Wang, Björn W. Schuller:
MFDR: Multiple-stage Fusion and Dynamically Refined Network for Multimodal Emotion Recognition. - Xiaohan Shi, Xingfeng Li, Tomoki Toda:
Multimodal Fusion of Music Theory-Inspired and Self-Supervised Representations for Improved Emotion Recognition. - Andreas Triantafyllopoulos, Björn W. Schuller:
Enrolment-based personalisation for improving individual-level fairness in speech emotion recognition. - Seong-Gyun Leem, Daniel Fulford, Jukka-Pekka Onnela, David Gard, Carlos Busso:
Keep, Delete, or Substitute: Frame Selection Strategy for Noise-Robust Speech Emotion Recognition. - Cheng Lu, Yuan Zong, Yan Zhao, Hailun Lian, Tianhua Qi, Björn W. Schuller, Wenming Zheng:
Hierarchical Distribution Adaptation for Unsupervised Cross-corpus Speech Emotion Recognition.
Speaker Diarization 2
- Chenyuan Zhang, Linkai Luo, Hong Peng, Wei Wen:
Variable Segment Length and Domain-Adapted Feature Optimization for Speaker Diarization. - Jeong-Hwan Choi, Ye-Rin Jeoung, Ilseok Kim, Joon-Hyuk Chang:
Efficient Speaker Embedding Extraction Using a Twofold Sliding Window Algorithm for Speaker Diarization. - Quan Wang, Yiling Huang, Guanlong Zhao, Evan Clark, Wei Xia, Hank Liao:
DiarizationLM: Speaker Diarization Post-Processing with Large Language Models. - Alexander Blatt, Aravind Krishnan, Dietrich Klakow:
Joint vs Sequential Speaker-Role Detection and Automatic Speech Recognition for Air-traffic Control. - Alexis Plaquet, Hervé Bredin:
On the calibration of powerset speaker diarization models. - Séverin Baroudi, Thomas Pellegrini, Hervé Bredin:
Specializing Self-Supervised Speech Representations for Speaker Segmentation.
Speaker Recognition 2
- Erfan Loweimi, Mengjie Qian, Kate M. Knill, Mark J. F. Gales:
On the Usefulness of Speaker Embeddings for Speaker Retrieval in the Wild: A Comparative Study of x-vector and ECAPA-TDNN Models. - Zezhong Jin, Youzhi Tu, Man-Wai Mak:
W-GVKT: Within-Global-View Knowledge Transfer for Speaker Verification. - Yao Shen, Yingying Gao, Yaqian Hao, Chenguang Hu, Fulin Zhang, Junlan Feng, Shilei Zhang:
CEC: A Noisy Label Detection Method for Speaker Recognition. - Fengrun Zhang, Wangjin Zhou, Yiming Liu, Wang Geng, Yahui Shan, Chen Zhang:
Disentangling Age and Identity with a Mutual Information Minimization for Cross-Age Speaker Verification. - Zuoliang Li, Wu Guo, Bin Gu, Shengyu Peng, Jie Zhang:
Contrastive Learning and Inter-Speaker Distribution Alignment Based Unsupervised Domain Adaptation for Robust Speaker Verification. - Minh Nguyen, Franck Dernoncourt, Seunghyun Yoon, Hanieh Deilamsalehy, Hao Tan, Ryan A. Rossi, Quan Hung Tran, Trung Bui, Thien Huu Nguyen:
Identifying Speakers in Dialogue Transcripts: A Text-based Approach Using Pretrained Language Models. - Bhasi K. C., Rajeev Rajan, Noumida Abdul Kareem:
Attention-augmented X-vectors for the Evaluation of Mimicked Speech Using Sparse Autoencoder-LSTM framework.
Speech and Audio Analysis
- Antonio Almudévar, Théo Mariotte, Alfonso Ortega Giménez, Marie Tahon, Luis Vicente, Antonio Miguel, Eduardo Lleida:
Predefined Prototypes for Intra-Class Separation and Disentanglement. - Tomoki Koriyama:
VAE-based Phoneme Alignment Using Gradient Annealing and SSL Acoustic Features. - Biswajit Karan, Joshua Jansen van Vüren, Febe de Wet, Thomas Niesler:
A Transformer-Based Voice Activity Detector. - Sri Harsha Dumpala, Dushyant Sharma, Chandramouli Shama Sastry, Stanislav Yu. Kruchinin, James Fosburgh, Patrick A. Naylor:
XANE: eXplainable Acoustic Neural Embeddings. - Jhansi Mallela, Sai Harshitha Aluru, Chiranjeevi Yarra:
A comparative analysis of sequential models that integrate syllable dependency for automatic syllable stress detection. - Jiahao Li, Miao Liu, Shu Yang, Jing Wang, Xiang Xie:
Motion Based Audio-Visual Segmentation.
Speech Quality and Intelligibility: Prediction and Enhancement
- Paul Best, Santiago Cuervo, Ricard Marxer:
Transfer Learning from Whisper for Microscopic Intelligibility Prediction. - Ryandhimas E. Zezario, Fei Chen, Chiou-Shann Fuh, Hsin-Min Wang, Yu Tsao:
Non-Intrusive Speech Intelligibility Prediction for Hearing Aids using Whisper and Metadata. - Haolan Wang, Amin Edraki, Wai-Yip Chan, Iván López-Espejo, Jesper Jensen:
No-Reference Speech Intelligibility Prediction Leveraging a Noisy-Speech ASR Pre-Trained Model. - Danilo de Oliveira, Simon Welker, Julius Richter, Timo Gerkmann:
The PESQetarian: On the Relevance of Goodhart's Law for Speech Enhancement. - Bao Thang Ta, Van Hai Do, Huynh Thi Thanh Binh:
Enhancing Non-Matching Reference Speech Quality Assessment through Dynamic Weight Adaptation. - Hongyang Chen, Yuhong Yang, Zhongyuan Wang, Weiping Tu, Haojun Ai, Cedar Lin:
Exploring Sentence Type Effects on the Lombard Effect and Intelligibility Enhancement: A Comparative Study of Natural and Grid Sentences.
Speech Synthesis: Vocoders
- Yuanjun Lv, Hai Li, Ying Yan, Junhui Liu, Danming Xie, Lei Xie:
FreeV: Free Lunch For Vocoders Through Pseudo Inversed Mel Filter. - Aryan Chaudhary, Vinayak Abrol:
QGAN: Low Footprint Quaternion Neural Vocoder for Speech Synthesis. - Hyunjae Cho, Junhyeok Lee, Wonbin Jung:
JenGAN: Stacked Shifted Filters in GAN-Based Speech Synthesis. - Rubing Shen, Yanzhen Ren, Zongkun Sun:
FA-GAN: Artifacts-free and Phase-aware High-fidelity GAN-based Vocoder. - Shaowen Chen, Tomoki Toda:
QHM-GAN: Neural Vocoder based on Quasi-Harmonic Modeling. - Hui-Peng Du, Ye-Xin Lu, Yang Ai, Zhen-Hua Ling:
BiVocoder: A Bidirectional Neural Vocoder Integrating Feature Extraction and Waveform Generation.
ASR Model Training Methods
- Tina Raissi, Christoph Lüscher, Simon Berger, Ralf Schlüter, Hermann Ney:
Investigating the Effect of Label Topology and Training Criterion on ASR Performance and Alignment Quality. - Neeraj Gaur, Rohan Agrawal, Gary Wang, Parisa Haghani, Andrew Rosenberg, Bhuvana Ramabhadran:
ASTRA: Aligning Speech and Text Representations for Asr without Sampling. - Muhammad Shakeel, Yui Sudo, Yifan Peng, Shinji Watanabe:
Contextualized End-to-end Automatic Speech Recognition with Intermediate Biasing Loss. - Sheng-Chieh Chiu, Chia-Hua Wu, Jih-Kang Hsieh, Yu Tsao, Hsin-Min Wang:
Learnable Layer Selection and Model Fusion for Speech Self-Supervised Learning Models. - Devang Kulshreshtha, Nikolaos Pappas, Brady Houston, Saket Dingliwal, Srikanth Ronanki:
Sequential Editing for Lifelong Training of Speech Recognition Models. - Chia-Kai Yeh, Chih-Chun Chen, Ching-Hsien Hsu, Jen-Tzung Chien:
Cross-Modality Diffusion Modeling and Sampling for Speech Recognition.
Cross-Lingual and Multilingual Processing
- Wei Liu, Jingyong Hou, Dong Yang, Muyong Cao, Tan Lee:
A Parameter-efficient Language Extension Framework for Multilingual ASR. - Zheshu Song, Jianheng Zhuo, Yifan Yang, Ziyang Ma, Shixiong Zhang, Xie Chen:
LoRA-Whisper: Parameter-Efficient and Extensible Multilingual ASR. - Marcely Zanon Boito, Vivek Iyer, Nikolaos Lagos, Laurent Besacier, Ioan Calapodescu:
mHuBERT-147: A Compact Multilingual HuBERT Model. - Vasista Sai Lodagala, Abhishek Biswas, Shoutrik Das, Jordan Fernandes, Srinivasan Umesh:
All Ears: Building Self-Supervised Learning based ASR models for Indian Languages at scale. - Nikhil Jakhar, Sudhanshu Srivastava, Arun Baby:
A Unified Approach to Multilingual Automatic Speech Recognition with Improved Language Identification for Indic Languages. - Ling Dong, Zhengtao Yu, Wenjun Wang, Yuxin Huang, Shengxiang Gao, Guojiang Zhou:
Integrating Speech Self-Supervised Learning Models and Large Language Models for ASR. - Jinchuan Tian, Yifan Peng, William Chen, Kwanghee Choi, Karen Livescu, Shinji Watanabe:
On the Effects of Heterogeneous Data Sources on Speech-to-Text Foundation Models. - Krishna C. Puvvada, Piotr Zelasko, He Huang, Oleksii Hrinchuk, Nithin Rao Koluguri, Kunal Dhawan, Somshubra Majumdar, Elena Rastorgueva, Zhehuai Chen, Vitaly Lavrukhin, Jagadeesh Balam, Boris Ginsburg:
Less is More: Accurate Speech Recognition & Translation without Web-Scale Data. - Georgios Paraskevopoulos, Chara Tsoukala, Athanasios Katsamanis, Vassilis Katsouros:
The Greek podcast corpus: Competitive speech models for low-resourced languages with weakly supervised data. - Socrates Vakirtzian, Chara Tsoukala, Stavros Bompolas, Katerina Mouzou, Vivian Stamou, Georgios Paraskevopoulos, Antonios Dimakis, Stella Markantonatou, Angela Ralli, Antonios Anastasopoulos:
Speech Recognition for Greek Dialects: A Challenging Benchmark. - Wei Liu, Jingyong Hou, Dong Yang, Muyong Cao, Tan Lee:
LUPET: Incorporating Hierarchical Information Path into Multilingual ASR. - Per Egil Kummervold, Javier de la Rosa, Freddy Wetjen, Rolv-Arild Braaten, Per Erik Solberg:
Whispering in Norwegian: Navigating Orthographic and Dialectic Challenges. - Tejes Srivastava, Jiatong Shi, William Chen, Shinji Watanabe:
EFFUSE: Efficient Self-Supervised Feature Fusion for E2E ASR in Low Resource and Multilingual Scenarios. - Amir Hussein, Desh Raj, Matthew Wiesner, Daniel Povey, Paola García, Sanjeev Khudanpur:
Enhancing Neural Transducer for Multilingual ASR with Synchronized Language Diarization. - Shuaishuai Ye, Shunfei Chen, Xinhui Hu, Xinkang Xu:
SC-MoE: Switch Conformer Mixture of Experts for Unified Streaming and Non-streaming Code-Switching ASR.
Speech Assessment
- Chung-Wen Wu, Berlin Chen:
Optimizing Automatic Speech Assessment: W-RankSim Regularization and Hybrid Feature Fusion Strategies. - Jian Cheng:
Context-Aware Speech Recognition Using Prompts for Language Learners. - Raj Gothi, Rahul Kumar, Mildred Pereira, Nagesh Nayak, Preeti Rao:
A Dataset and Two-pass System for Reading Miscue Detection. - Tin Mei Lun, Ekaterina Voskoboinik, Ragheb Al-Ghezi, Tamás Grósz, Mikko Kurimo:
Oversampling, Augmentation and Curriculum Learning for Speaking Assessment with Limited Training Data. - Yu Tomita, Yingxiang Gao, Nobuaki Minematsu, Noriko Nakanishi, Daisuke Saito:
Analysis and Visualization of Directional Diversity in Listening Fluency of World Englishes Speakers in the Framework of Mutual Shadowing. - Sean Robertson, Gerald Penn, Ewan Dunbar:
Quantifying the Role of Textual Predictability in Automatic Speech Recognition.
Question Answering from Speech and Spoken Dialogue Systems
- Tonmoy Rajkhowa, Amartya Roy Chowdhury, Sankalp Nagaonkar, Achyut Mani Tripathi, S. R. Mahadeva Prasanna:
TM-PATHVQA: 90000+ Textless Multilingual Questions for Medical Visual Question Answering. - Orchid Chetia Phukan, Priyabrata Mallick, Swarup Ranjan Behera, Aalekhya Satya Narayani, Arun Balaji Buduru, Rajesh Sharma:
Towards Multilingual Audio-Visual Question Answering. - Minh Nguyen, Toàn Quoc Nguyên, Kishan KC, Zeyu Zhang, Thuy Vu:
Reinforcement Learning from Answer Reranking Feedback for Retrieval-Augmented Answer Generation. - Vahid Noroozi, Zhehuai Chen, Somshubra Majumdar, Steve Huang, Jagadeesh Balam, Boris Ginsburg:
Instruction Data Generation and Unsupervised Adaptation for Speech Language Models. - Martina Di Bratto, Maria Di Maro, Antonio Origlia:
On the Use of Plausible Arguments in Explainable Conversational AI. - Muhammad Yeza Baihaqi, Angel F. Garcia Contreras, Seiya Kawano, Koichiro Yoshino:
Rapport-Driven Virtual Agent: Rapport Building Dialogue Strategy for Improving User Experience at First Meeting. - Lifeng Zhou, Yuke Li, Rui Deng, Yuting Yang, Haoqi Zhu:
Cross-Modal Denoising: A Novel Training Paradigm for Enhancing Speech-Image Retrieval.
Spoken Dialogue Systems and Conversational Analysis 3
- Zilong Huang, Man-Wai Mak, Kong Aik Lee:
MM-NodeFormer: Node Transformer Multimodal Fusion for Emotion Recognition in Conversation. - Haoxiang Shi, Ziqi Liang, Jun Yu:
Emotional Cues Extraction and Fusion for Multi-modal Emotion Prediction and Recognition in Conversation. - Keita Suzuki, Nobukatsu Hojo, Kazutoshi Shinoda, Saki Mizuno, Ryo Masumura:
Participant-Pair-Wise Bottleneck Transformer for Engagement Estimation from Video Conversation. - Johannah O'Mahony, Catherine Lai, Éva Székely:
Well, what can you do with messy data? Exploring the prosody and pragmatic function of the discourse marker "well" with found data and speech synthesis. - Kazutoshi Shinoda, Nobukatsu Hojo, Saki Mizuno, Keita Suzuki, Satoshi Kobashikawa, Ryo Masumura:
Learning from Multiple Annotator Biased Labels in Multimodal Conversation. - Jakub Hoscilowicz, Adam Wiacek, Jan Chojnacki, Adam Cieslak, Leszek Michon, Artur Janicki:
Non-Linear Inference Time Intervention: Improving LLM Truthfulness. - Zhe Liu, Suyoun Kim, Ozlem Kalinli:
Evaluating Speech Recognition Performance Towards Large Language Model Based Voice Assistants.
Dysarthric Speech Assessment
- Fathima Zaheera, Supritha Shetty, Gayadhar Pradhan, Deepak K. T:
Automatic Assessment of Dysarthria using Speech and synthetically generated Electroglottograph signal. - Yan Wan, Mengyi Sun, Xinchen Kang, Jingting Li, Pengfei Guo, Ming Gao, Su-Jing Wang:
CDSD: Chinese Dysarthria Speech Database. - Neelesh Samptur, Tanuka Bhattacharjee, Anirudh Chakravarty K, Seena Vengalil, Yamini Belur, Atchayaram Nalini, Prasanta Kumar Ghosh:
Exploring Syllable Discriminability during Diadochokinetic Task with Increasing Dysarthria Severity for Patients with Amyotrophic Lateral Sclerosis. - Matthew Perez, Aneesha Sampath, Minxue Niu, Emily Mower Provost:
Beyond Binary: Multiclass Paraphasia Detection with Generative Pretrained Transformers and End-to-End Models. - Khalid Daoudi, Solange Milhé de Saint Victor, Alexandra Foubert-Samier, Margherita Fabbri, Anne Pavy-Le Traon, Olivier Rascol, Virginie Woisard, Wassilios G. Meissner:
Electroglottography for the assessment of dysphonia in Parkinson's disease and multiple system atrophy. - Xueyuan Chen, Dongchao Yang, Dingdong Wang, Xixin Wu, Zhiyong Wu, Helen Meng:
CoLM-DSR: Leveraging Neural Codec Language Modeling for Multi-Modal Dysarthric Speech Reconstruction.
Spoken Language Models for Universal Speech Processing (Special Session)
- Bohan Li, Feiyu Shen, Yiwei Guo, Shuai Wang, Xie Chen, Kai Yu:
On the Effectiveness of Acoustic BPE in Decoder-Only TTS. - Kai-Wei Chang, Ming-Hao Hsu, Shang-Wen Li, Hung-yi Lee:
Exploring In-Context Learning of Textless Speech Language Model for Speech Classification Tasks. - Chun-Yi Kuan, Wei-Ping Huang, Hung-yi Lee:
Understanding Sounds, Missing the Questions: The Challenge of Object Hallucination in Large Audio-Language Models. - Changli Tang, Wenyi Yu, Guangzhi Sun, Xianzhao Chen, Tian Tan, Wei Li, Jun Zhang, Lu Lu, Zejun Ma, Yuxuan Wang, Chao Zhang:
Can Large Language Models Understand Spatial Audio? - Suwon Shon, Kwangyoun Kim, Yi-Te Hsu, Prashant Sridhar, Shinji Watanabe, Karen Livescu:
DiscreteSLU: A Large Language Model with Self-Supervised Discrete Speech Units for Spoken Language Understanding. - Ke-Han Lu, Zhehuai Chen, Szu-Wei Fu, He Huang, Boris Ginsburg, Yu-Chiang Frank Wang, Hung-yi Lee:
DeSTA: Enhancing Speech Language Models through Descriptive Speech-Text Alignment. - Jing Pan, Jian Wu, Yashesh Gaur, Sunit Sivasankaran, Zhuo Chen, Shujie Liu, Jinyu Li:
COSMIC: Data Efficient Instruction-tuning For Speech In-Context Learning. - Shoval Messica, Yossi Adi:
NAST: Noise Aware Speech Tokenization for Speech Language Models. - Slava Shechtman, Avihu Dekel:
Low Bitrate High-Quality RVQGAN-based Discrete Speech Tokenizer.
Keynote 4
- Barbara Tillmann:
Perception of music and speech: Focus on rhythm processing.
L1/L2 Acquisition and Cross-Linguistic Factors
- Hyun Kyung Hwang, Manami Hirayama:
Acquisition of high vowel devoicing in Japanese: A production experiment with three and four year olds. - Zimeng Li, Zhongxuan Mao, Shengting Shen, Ivan Yuen, Ping Tang:
The Production of Contrastive Focus by 7 to 13-year-olds Learning Mandarin Chinese. - Iuliia Zaitova, Irina Stenger, Wei Xue, Tania Avgustinova, Bernd Möbius, Dietrich Klakow:
Cross-Linguistic Intelligibility of Non-Compositional Expressions in Spoken Context. - Alexis DeMaere, Nicole van Rootselaar, Fangfang Li, Robbin Gibb, Claudia L. R. Gonzalez:
On the relationship between speech production and vocabulary size in 3-5 year olds. - Tim Polzehl, Tim Herzig, Friedrich Wicke, Kathleen Wermke, Razieh Khamsehashari, Michiko Dahlem, Sebastian Möller:
Towards Classifying Mother Tongue from Infant Cries - Findings Substantiating Prenatal Learning Theory. - Aijun Li, Jun Gao, Zhiwei Wang:
Effect of Complex Boundary Tones on Tone Identification: An Experimental Study with Mandarin-speaking Preschool Children. - Thanh Lan Truong, Andrea Weber:
Ethnolinguistic Identification of Vietnamese-German Heritage Speech.
Speaker Stance, Emotion and Language-External Factors
- Dirk Eike Hoffner, Jana Roßbach, Bernd T. Meyer:
Joint prediction of subjective listening effort and speech intelligibility based on end-to-end learning. - Xinyi Wu, Changqing Xu, Nan Li, Rongfeng Su, Lan Wang, Nan Yan:
Depression Enhances Internal Inconsistency between Spoken and Semantic Emotion: Evidence from the Analysis of Emotion Expression in Conversation. - Olympia Simantiraki, Martin Cooke:
Listeners' F0 preferences in quiet and stationary noise. - Nao Hodoshima:
Effects of talker and playback rate of reverberation-induced speech on speech intelligibility of older adults.
Experimental Phonetics and Laboratory Phonology
- Bingliang Zhao, Jiangping Kong, Xiyu Wu:
Age-related Differences in Acoustic Cues for the Perception of Checked Syllables in Shengzhou Wu. - Constantijn Kaland, Maria Lialiou:
Quantity-sensitivity affects recall performance of word stress. - Zuheyra Tokac, Jennifer Cole:
Phonological Symmetry Does Not Predict Generalization of Perceptual Adaptation to Vowels. - Ariëlle Reitsema, Chenxin Li, Leanne van Lambalgen, Laura Preining, Saskia Galindo Jong, Qing Yang, Xinyi Wen, Yiya Chen:
Perceptual Learning in Lexical Tone: Phonetic Similarity vs. Phonological Categories. - Anna Stein, Kevin Tang:
Modeling probabilistic reduction across domains with Naive Discriminative Learning. - Jonathan Him Nok Lee, Mark Liberman, Martin Salzmann:
Do we EXPECT TO find phonetic traces for syntactic traces?
Speaker recognition evaluation and resources
- Yuke Lin, Ming Cheng, Fulin Zhang, Yingying Gao, Shilei Zhang, Ming Li:
VoxBlink2: A 100K+ Speaker Recognition Corpus and the Open-Set Speaker-Identification Benchmark. - Wiebke Hutiri, Tanvina Patel, Aaron Yi Ding, Odette Scharenborg:
As Biased as You Measure: Methodological Pitfalls of Bias Evaluations in Speaker Verification Research. - Shuai Wang, Ke Zhang, Shaoxiong Lin, Junjie Li, Xuefei Wang, Meng Ge, Jianwei Yu, Yanmin Qian, Haizhou Li:
WeSep: A Scalable and Flexible Toolkit Towards Generalizable Target Speaker Extraction. - Jee-weon Jung, Wangyou Zhang, Jiatong Shi, Zakaria Aldeneh, Takuya Higuchi, Alex Gichamba, Barry-John Theobald, Ahmed Hussen Abdelaziz, Shinji Watanabe:
ESPnet-SPK: full pipeline speaker embedding toolkit with reproducible recipes, self-supervised front-ends, and off-the-shelf models. - Xinghao Huang, Weiwei Jiang, Long Rao, Wei Xu, Wenqing Cheng:
Active Speaker Detection in Fisheye Meeting Scenes with Scene Spatial Spectrums. - Vu Hoang, Viet-Thanh Pham, Hoa Nguyen Xuan, Pham Nhi, Phuong Dat, Thi Thu Trang Nguyen:
VSASV: a Vietnamese Dataset for Spoofing-Aware Speaker Verification.
Speech Type Classification
- Liuxian Ma, Lin Shen, Ruobing Li, Haojie Zhang, Kun Qian, Bin Hu, Björn W. Schuller, Yoshiharu Yamamoto:
E-ODN: An Emotion Open Deep Network for Generalised and Adaptive Speech Emotion Recognition. - Joseph Liu, Mahesh Kumar Nandwana, Janne Pylkkönen, Hannes Heikinheimo, Morgan McGuire:
Enhancing Multilingual Voice Toxicity Detection with Speech-Text Alignment. - Youssef Nafea, Shady Shehata, Zeerak Talat, Ahmed Aboeitta, Ahmed Sharshar, Preslav Nakov:
AraOffence: Detecting Offensive Speech Across Dialects in Arabic Media. - Jiali Cheng, Mohamed Elgaar, Nidhi Vakil, Hadi Amiri:
CogniVoice: Multimodal and Multilingual Fusion Networks for Mild Cognitive Impairment Assessment from Spontaneous Speech. - Fangjing Niu, Xiaozhe Qi, Xinya Chen, Liang He:
Speech Topic Classification Based on Multi-Scale and Graph Attention Networks. - Liangwei Chen, Xiren Zhou, Qiang Tu, Huanhuan Chen:
Enhancing Speech and Music Discrimination Through the Integration of Static and Dynamic Features.
Target Speaker Extraction
- Hanyu Meng, Qiquan Zhang, Xiangyu Zhang, Vidhyasaharan Sethu, Eliathamby Ambikairajah:
Binaural Selective Attention Model for Target Speaker Extraction. - Ashutosh Pandey, Sanha Lee, Juan Azcarreta, Daniel Wong, Buye Xu:
All Neural Low-latency Directional Speech Extraction. - Woon-Haeng Heo, Joongyu Maeng, Yoseb Kang, Namhyun Cho:
Centroid Estimation with Transformer-Based Speaker Embedder for Robust Target Speaker Extraction. - Vidya Srinivas, Malek Itani, Tuochao Chen, Sefik Emre Eskimez, Takuya Yoshioka, Shyamnath Gollakota:
Knowledge boosting during low-latency inference. - Tianci Wu, Shulin He, Jiahui Pan, Haifeng Huang, Zhijian Mo, Xueliang Zhang:
Unified Audio Visual Cues for Target Speaker Extraction. - Yun Liu, Xuechen Liu, Xiaoxiao Miao, Junichi Yamagishi:
Target Speaker Extraction with Curriculum Learning.
Speech Synthesis: Voice Conversion 3
- Bingsong Bai, Fengping Wang, Yingming Gao, Ya Li:
SPA-SVC: Self-supervised Pitch Augmentation for Singing Voice Conversion. - Ali N. Salman, Zongyang Du, Shreeram Suresh Chandra, Ismail Rasim Ülgen, Carlos Busso, Berrak Sisman:
Towards Naturalistic Voice Conversion: NaturalVoices Dataset with an Automatic Processing Pipeline. - Kou Tanaka, Hirokazu Kameoka, Takuhiro Kaneko, Yuto Kondo:
PRVAE-VC2: Non-Parallel Voice Conversion by Distillation of Speech Representations. - Xinlei Niu, Jing Zhang, Charles Patrick Martin:
HybridVC: Efficient Voice Style Conversion with Text and Audio Prompts. - Jiarui Hai, Karan Thakkar, Helin Wang, Zengyi Qin, Mounya Elhilali:
DreamVoice: Text-Guided Voice Conversion. - Jaejun Lee, Yoori Oh, Injune Hwang, Kyogu Lee:
Hear Your Face: Face-based voice conversion with F0 estimation. - Yashish M. Siriwardena, Nathan Swedlow, Audrey Howard, Evan Gitterman, Dan Darcy, Carol Y. Espy-Wilson, Andrea Fanelli:
Accent Conversion with Articulatory Representations. - Jen-Hung Huang, Wei-Tsung Lee, Chung-Hsien Wu:
USD-AC: Unsupervised Speech Disentanglement for Accent Conversion. - Hiroki Kanagawa, Yusuke Ijima:
Knowledge Distillation from Self-Supervised Representation Learning Model with Discrete Speech Units for Any-to-Any Streaming Voice Conversion.
Speech Synthesis: Paradigms and Methods 3
- Dongchao Yang, Dingdong Wang, Haohan Guo, Xueyuan Chen, Xixin Wu, Helen Meng:
SimpleSpeech: Towards Simple and Efficient Text-to-Speech with Scalar Latent Transformer Diffusion Models. - Justin Lovelace, Soham Ray, Kwangyoun Kim, Kilian Q. Weinberger, Felix Wu:
Sample-Efficient Diffusion for Text-To-Speech Synthesis. - Jingyi Feng, Yusuke Yasuda, Tomoki Toda:
Exploring the Robustness of Text-to-Speech Synthesis Based on Diffusion Probabilistic Models to Heavily Noisy Transcriptions. - Heeseung Kim, Sang-gil Lee, Jiheum Yeom, Che Hyun Lee, Sungwon Kim, Sungroh Yoon:
VoiceTailor: Lightweight Plug-In Adapter for Diffusion-Based Personalized Text-to-Speech. - Tasnima Sadekova, Mikhail A. Kudinov, Vadim Popov, Assel Yermekova, Artem Khrapov:
PitchFlow: adding pitch control to a Flow-matching based TTS model. - Jinhyeok Yang, Junhyeok Lee, Hyeong-Seok Choi, Seunghoon Ji, Hyeongju Kim, Juheon Lee:
DualSpeech: Enhancing Speaker-Fidelity and Text-Intelligibility Through Dual Classifier-Free Guidance. - Zhengyang Chen, Xuechen Liu, Erica Cooper, Junichi Yamagishi, Yanmin Qian:
Generating Speakers by Prompting Listener Impressions for Pre-trained Multi-Speaker Text-to-Speech Systems. - Yakun Song, Zhuo Chen, Xiaofei Wang, Ziyang Ma, Guanrou Yang, Xie Chen:
TacoLM: GaTed Attention Equipped Codec Language Model are Efficient Zero-Shot Text to Speech Synthesizers.
Privacy and Security in Speech Communication 2
- Suhita Ghosh, Mélanie Jouaiti, Arnab Das, Yamini Sinha, Tim Polzehl, Ingo Siegert, Sebastian Stober:
Anonymising Elderly and Pathological Speech: Voice Conversion Using DDSP and Query-by-Example. - Rui Wang, Liping Chen, Kong Aik Lee, Zhen-Hua Ling:
Asynchronous Voice Anonymization Using Adversarial Perturbation On Speaker Embedding. - Sarina Meyer, Florian Lux, Ngoc Thang Vu:
Probing the Feasibility of Multilingual Speaker Anonymization. - Fan Huang, Kun Zeng, Wei Zhu:
DiffVC+: Improving Diffusion-based Voice Conversion for Speaker Anonymization.
Streaming ASR
- Yuting Yang, Guodong Ma, Yuke Li, Binbin Du, Haoqi Zhu, Liang Ruan:
Learning from Back Chunks: Acquiring More Future Knowledge for Streaming ASR Models via Self Distillation. - Emiru Tsunoo, Hayato Futami, Yosuke Kashiwagi, Siddhant Arora, Shinji Watanabe:
Decoder-only Architecture for Streaming End-to-end Speech Recognition. - Peikun Chen, Sining Sun, Changhao Shan, Qing Yang, Lei Xie:
Streaming Decoder-Only Automatic Speech Recognition with Discrete Speech Units: A Pilot Study. - Jens Heitkaemper, Joe Caroselli, Arun Narayanan, Nathan Howard:
TfCleanformer: A streaming, array-agnostic, full- and sub-band modeling front-end for robust ASR. - Khanh Le, Duc Chau:
Improving Streaming Speech Recognition With Time-Shifted Contextual Attention And Dynamic Right Context Masking. - Haoyu Wang, Guoqiang Hu, Guodong Lin, Wei-Qiang Zhang, Jian Li:
Simul-Whisper: Attention-Guided Streaming Whisper with Truncation Detection.
Computational Resource Constrained ASR
- Qiao Xiao, Pingchuan Ma, Adriana Fernandez-Lopez, Boqian Wu, Lu Yin, Stavros Petridis, Mykola Pechenizkiy, Maja Pantic, Decebal Constantin Mocanu, Shiwei Liu:
Dynamic Data Pruning for Automatic Speech Recognition. - Dong-Hyun Kim, Joon-Hyuk Chang:
Mitigating Overfitting in Structured Pruning of ASR Models with Gradient-Guided Parameter Regularization. - Tianteng Gu, Bei Liu, Hang Shao, Yanmin Qian:
SparseWAV: Fast and Accurate One-Shot Unstructured Pruning for Large Speech Foundation Models. - Zhaoqing Li, Haoning Xu, Tianzi Wang, Shoukang Hu, Zengrui Jin, Shujie Hu, Jiajun Deng, Mingyu Cui, Mengzhe Geng, Xunying Liu:
One-pass Multiple Conformer and Foundation Speech Systems Compression and Quantization Using An All-in-one Neural Model. - Oleg Rybakov, Dmitriy Serdyuk, Chengjian Zheng:
USM RNN-T model weights binarization. - Tzu-Quan Lin, Hung-yi Lee, Hao Tang:
DAISY: Data Adaptive Self-Supervised Early Exit for Speech Representation Models. - Eunik Park, Daehyun Ahn, Hyungjun Kim:
RepTor: Re-parameterizable Temporal Convolution for Keyword Spotting via Differentiable Kernel Search. - Shuai Wang, Dehao Zhang, Kexin Shi, Yuchen Wang, Wenjie Wei, Jibin Wu, Malu Zhang:
Global-Local Convolution with Spiking Neural Networks for Energy-efficient Keyword Spotting. - Zeyang Song, Qianhui Liu, Qu Yang, Yizhou Peng, Haizhou Li:
ED-sKWS: Early-Decision Spiking Neural Networks for Rapid, and Energy-Efficient Keyword Spotting. - Tongtao Ling, Yutao Lai, Lei Chen, Shilei Huang, Yi Liu:
A Small and Fast BERT for Chinese Medical Punctuation Restoration.
Evaluation of Speech Technology Systems
- Annika Heuser, Tyler Kendall, Miguel Del Rio, Quinn McNamara, Nishchal Bhandari, Corey Miller, Migüel Jetté:
Quantification of stylistic differences in human- and ASR-produced transcripts of African American English. - Korbinian Kuhn, Verena Kersken, Gottfried Zimmermann:
Beyond Levenshtein: Leveraging Multiple Algorithms for Robust Word Error Rate Computations And Granular Error Classifications. - Maria Teleki, Xiangjue Dong, Soohwan Kim, James Caverlee:
Comparing ASR Systems in the Context of Speech Disfluencies. - Xiang-Li Lu, Yi-Fen Liu:
Deep Prosodic Features in Tandem with Perceptual Judgments of Word Reduction for Tone Recognition in Conversed Speech. - Zitha Sasindran, Harsha Yelchuri, T. Venkata Prabhakar:
SeMaScore: A new evaluation metric for automatic speech recognition tasks.
Neural Network Training for Speech Recognition
- Jingjing Xu, Wei Zhou, Zijian Yang, Eugen Beck, Ralf Schlüter:
Dynamic Encoder Size Based on Data-Driven Layer-wise Pruning for Speech Recognition. - Zejiang Hou, Goeric Huybrechts, Anshu Bhatia, Daniel Garcia-Romero, Kyu J. Han, Katrin Kirchhoff:
Revisiting Convolution-free Transformer for Speech Recognition. - Zhiqi Huang, Diamantino Caseiro, Kandarp Joshi, Christopher Li, Pat Rondon, Zelin Wu, Petr Zadrazil, Lillian Zhou:
Optimizing Large-Scale Context Retrieval for End-to-End ASR. - Kwanghee Choi, Ankita Pasad, Tomohiko Nakamura, Satoru Fukayama, Karen Livescu, Shinji Watanabe:
Self-Supervised Speech Representations are More Phonetic than Semantic. - Shiyi Han, Mingbin Xu, Zhihong Lei, Zhen Huang, Xingyu Na:
Enhancing CTC-based speech recognition with diverse modeling units. - Eungbeom Kim, Hantae Kim, Kyogu Lee:
Guiding Frame-Level CTC Alignments Using Self-knowledge Distillation.
Leveraging Large Language Models and Contextual Features for Phonetic Analysis (Special Session)
- Marianne de Heer Kloots, Willem H. Zuidema:
Human-like Linguistic Biases in Neural Speech Models: Phonetic Categorization and Phonotactic Constraints in Wav2Vec2.0. - Yuqin Lin, Longbiao Wang, Jianwu Dang, Nobuaki Minematsu:
Exploring Pre-trained Speech Model for Articulatory Feature Extraction in Dysarthric Speech Using ASR. - Yun Hao, Reihaneh Amooie, Wietse de Vries, Thomas Tienkamp, Rik van Noord, Martijn Wieling:
Exploring Self-Supervised Speech Representations for Cross-lingual Acoustic-to-Articulatory Inversion. - Erfan A. Shams, Iona Gessinger, Patrick Cormac English, Julie Carson-Berndsen:
Are Articulatory Feature Overlaps Shrouded in Speech Embeddings? - Patrick Cormac English, John D. Kelleher, Julie Carson-Berndsen:
Searching for Structure: Appraising the Organisation of Speech Features in wav2vec 2.0 Embeddings.
Responsible Speech Foundation Models (Special Session)
- Daniela A. Wiepert, Rene L. Utianski, Joseph R. Duffy, John L. Stricker, Leland R. Barnard, David T. Jones, Hugo Botha:
Speech foundation models in healthcare: Effect of layer selection on pathological speech feature prediction. - Dominik Wagner, Ilja Baumann, Korbinian Riedhammer, Tobias Bocklet:
Outlier Reduction with Gated Attention for Improved Post-training Quantization in Large Sequence-to-sequence Speech Foundation Models. - Ajinkya Kulkarni, Atharva Kulkarni, Miguel Couceiro, Isabel Trancoso:
Unveiling Biases while Embracing Sustainability: Assessing the Dual Challenges of Automatic Speech Recognition Systems. - Yi-Cheng Lin, Haibin Wu, Huang-Cheng Chou, Chi-Chun Lee, Hung-yi Lee:
Emo-bias: A Large Scale Evaluation of Social Bias on Speech Emotion Recognition. - Yi-Cheng Lin, Tzu-Quan Lin, Hsi-Che Lin, Andy T. Liu, Hung-yi Lee:
On the social bias of speech self-supervised models. - Kalvin Chang, Yi-Hui Chou, Jiatong Shi, Hsuan-Ming Chen, Nicole Holliday, Odette Scharenborg, David R. Mortensen:
Self-supervised Speech Representations Still Struggle with African American Vernacular English. - Zakaria Aldeneh, Takuya Higuchi, Jee-weon Jung, Skyler Seto, Tatiana Likhomanenko, Stephen Shum, Ahmed Hussen Abdelaziz, Shinji Watanabe, Barry-John Theobald:
Can you Remove the Downstream Model for Speaker Recognition with Self-Supervised Speech Features? - Lingwei Meng, Jiawen Kang, Yuejiao Wang, Zengrui Jin, Xixin Wu, Xunying Liu, Helen Meng:
Empowering Whisper as a Joint Multi-Talker and Target-Talker Speech Recognition System.
Multimodal Paralinguistics
- Yunrui Cai, Zhiyong Wu, Jia Jia, Helen Meng:
LoRA-MER: Low-Rank Adaptation of Pre-Trained Speech Models for Multimodal Emotion Recognition Using Mutual Information. - Qifei Li, Yingming Gao, Yuhua Wen, Cong Wang, Ya Li:
Enhancing Modal Fusion by Alignment and Label Matching for Multimodal Emotion Recognition. - Kang Zhu, Cunhang Fan, Jianhua Tao, Zhao Lv:
Prompt Link Multimodal Fusion in Multimodal Sentiment Analysis. - Kexin Wang, Carlos Ishi, Ryoko Hayashi:
A multimodal analysis of different types of laughter expression in conversational dialogues. - Georgios Chochlakis, Chandrashekhar Lavania, Prashant Mathur, Kyu J. Han:
Tackling Missing Modalities in Audio-Visual Representation Learning Using Masked Autoencoders. - Jehyun Kyung, Serin Heo, Joon-Hyuk Chang:
Enhancing Multimodal Emotion Recognition through ASR Error Compensation and LLM Fine-Tuning. - Lucas Goncalves, Donita Robinson, Elizabeth Richerson, Carlos Busso:
Bridging Emotions Across Languages: Low Rank Adaptation for Multilingual Speech Emotion Recognition.
Automatic Emotion Recognition
- Shreya G. Upadhyay, Carlos Busso, Chi-Chun Lee:
A Layer-Anchoring Strategy for Enhancing Cross-Lingual Speech Emotion Recognition. - Orchid Chetia Phukan, Gautam Siddharth Kashyap, Arun Balaji Buduru, Rajesh Sharma:
Are Paralinguistic Representations all that is needed for Speech Emotion Recognition? - Haiyang Sun, Fulin Zhang, Yingying Gao, Shilei Zhang, Zheng Lian, Junlan Feng:
MFSN: Multi-perspective Fusion Search Network For Pre-training Knowledge in Speech Emotion Recognition. - Bulat Khaertdinov, Pedro Jeuris, Annanda Sousa, Enrique Hortal:
Exploring Self-Supervised Multi-view Contrastive Learning for Speech Emotion Recognition with Limited Annotations.
Self and Weakly-Labelled Speaker Verification
- Junxu Wang, Zhihua Fang, Liang He:
Self-Supervised Speaker Verification with Mini-Batch Prediction Correction. - Yue Li, Xinsheng Wang, Li Zhang, Lei Xie:
SCDNet: Self-supervised Learning Feature based Speaker Change Detection. - Zezhong Jin, Youzhi Tu, Man-Wai Mak:
Self-Supervised Learning with Multi-Head Multi-Mode Knowledge Distillation for Speaker Verification. - Anith Selvakumar, Homa Fashandi:
Getting More for Less: Using Weak Labels and AV-Mixup for Robust Audio-Visual Speaker Verification.
Acoustic Event Detection, Segmentation and Classification
- Swarup Ranjan Behera, Abhishek Dhiman, Karthik Gowda, Aalekhya Satya Narayani:
FastAST: Accelerating Audio Spectrogram Transformer via Token Merging and Cross-Model Knowledge Distillation. - Li Xiao, Lucheng Fang, Yuhong Yang, Weiping Tu:
LungAdapter: Efficient Adapting Audio Spectrogram Transformer for Lung Sound Classification. - Jiu Feng, Mehmet Hamza Erol, Joon Son Chung, Arda Senocak:
ElasticAST: An Audio Spectrogram Transformer for All Length and Resolutions. - Taisei Omine, Kenta Akita, Reiji Tsuruno:
Robust Laughter Segmentation with Automatic Diverse Data Synthesis. - Martin Lebourdais, Théo Mariotte, Antonio Almudévar, Marie Tahon, Alfonso Ortega Giménez:
Explainable by-design Audio Segmentation through Non-Negative Matrix Factorization and Probing. - Gasser Elbanna, Zohreh Mostaani, Mathew Magimai-Doss:
Predicting Heart Activity from Speech using Data-driven and Knowledge-based features. - Natalia Morozova
, Guanghao You, Sabine Stoll, Adrian Bangerter:
Measuring acoustic dissimilarity of hierarchical markers in task-oriented dialogue with MFCC-based dynamic time warping. - Sai Srujana Buddi, Satyam Kumar, Utkarsh Oggy Sarawgi, Vineet Garg, Shivesh Ranjan, Ognjen Rudovic, Ahmed Hussen Abdelaziz, Saurabh Adya:
Comparative Analysis of Personalized Voice Activity Detection Systems: Assessing Real-World Effectiveness. - Zhiyong Wang, Ruibo Fu, Zhengqi Wen, Yuankun Xie, Yukun Liu, Xiaopeng Wang, Xuefei Liu, Yongwei Li, Jianhua Tao, Xin Qi, Yi Lu, Shuchen Shi:
Generalized Fake Audio Detection via Deep Stable Learning. - Shruti Palaskar, Ognjen Rudovic, Sameer Dharur, Florian Pesce, Gautam Krishna, Aswin Sivaraman, Jack Berkowitz, Ahmed Hussen Abdelaziz, Saurabh Adya, Ahmed H. Tewfik:
Multimodal Large Language Models with Fusion Low Rank Adaptation for Device Directed Speech Detection. - Yongyi Zang, Jiatong Shi, You Zhang, Ryuichi Yamamoto, Jionghao Han, Yuxun Tang, Shengyuan Xu, Wenxiao Zhao, Jing Guo, Tomoki Toda, Zhiyao Duan:
CtrSVDD: A Benchmark Dataset and Baseline Analysis for Controlled Singing Voice Deepfake Detection. - Yongjie Si, Yanxiong Li, Jialong Li, Jiaxin Tan, Qianhua He:
Fully Few-shot Class-incremental Audio Classification Using Expandable Dual-embedding Extractor. - Noumida Abdul Kareem, Rajeev Rajan:
Multi-label Bird Species Classification from Field Recordings using Mel_Graph-GCN Framework.
Speech and Audio Modelling
- Baihan Li, Zeyu Xie, Xuenan Xu, Yiwei Guo, Ming Yan, Ji Zhang, Kai Yu, Mengyue Wu:
DiveSound: LLM-Assisted Automatic Taxonomy Construction for Diverse Audio Generation. - Hualei Wang, Jianguo Mao, Zhifang Guo, Jiarui Wan, Hong Liu, Xiangdong Wang:
Leveraging Language Model Capabilities for Sound Event Detection. - Xuenan Xu, Pingyue Zhang, Ming Yan, Ji Zhang, Mengyue Wu:
Enhancing Zero-shot Audio Classification using Sound Attribute Knowledge from Large Language Models. - Wenhao Guan, Kaidi Wang, Wangjin Zhou, Yang Wang, Feng Deng, Hui Wang, Lin Li, Qingyang Hong, Yong Qin:
LAFMA: A Latent Flow Matching Model for Text-to-Audio Generation. - Fredrik Cumlin, Xinyu Liang, Victor Ungureanu, Chandan K. A. Reddy, Christian Schüldt, Saikat Chatterjee:
DNSMOS Pro: A Reduced-Size DNN for Probabilistic MOS of Speech. - Veranika Boukun, Jakob Drefs, Jörg Lücke:
Blind Zero-Shot Audio Restoration: A Variational Autoencoder Approach for Denoising and Inpainting.
Fake Audio Detection
- Octavian Pascu, Adriana Stan, Dan Oneata, Elisabeta Oneata, Horia Cucu:
Towards generalisable and calibrated audio deepfake detection with self-supervised representations. - Yuankun Xie, Ruibo Fu, Zhengqi Wen, Zhiyong Wang, Xiaopeng Wang, Haonan Cheng, Long Ye, Jianhua Tao:
Generalized Source Tracing: Detecting Novel Audio Deepfake Algorithm with Real Emphasis and Fake Dispersion Strategy. - Jiafeng Zhong, Bin Li, Jiangyan Yi:
Enhancing Partially Spoofed Audio Localization with Boundary-aware Attention Mechanism. - Xuanjun Chen, Haibin Wu, Roger Jang, Hung-yi Lee:
Singing Voice Graph Modeling for SingFake Detection. - Xiaopeng Wang, Ruibo Fu, Zhengqi Wen, Zhiyong Wang, Yuankun Xie, Yukun Liu, Jianhua Tao, Xuefei Liu, Yongwei Li, Xin Qi, Yi Lu, Shuchen Shi:
Genuine-Focused Learning using Mask AutoEncoder for Generalized Fake Audio Detection. - Hyun Myung Kim, Kangwook Jang, Hoirin Kim:
One-class learning with adaptive centroid shift for audio deepfake detection.
Deep Learning-Based Speech Enhancement: Approaches, Scalability, and Evaluation
- Rui Cao, Tianrui Wang, Meng Ge, Andong Li, Longbiao Wang, Jianwu Dang, Yungang Jia:
VoiCor: A Residual Iterative Voice Correction Framework for Monaural Speech Enhancement. - Tanel Pärnamaa, Ando Saabas:
Personalized Speech Enhancement Without a Separate Speaker Embedding Model. - Wangyou Zhang, Robin Scheibler, Kohei Saijo, Samuele Cornell, Chenda Li, Zhaoheng Ni, Jan Pirklbauer, Marvin Sach, Shinji Watanabe, Tim Fingscheidt, Yanmin Qian:
URGENT Challenge: Universality, Robustness, and Generalizability For Speech Enhancement. - Julius Richter, Yi-Chiao Wu, Steven Krenn, Simon Welker, Bunlong Lay, Shinji Watanabe, Alexander Richard, Timo Gerkmann:
EARS: An Anechoic Fullband Speech Dataset Benchmarked for Speech Enhancement and Dereverberation.
Speech Synthesis: Other Topics 1
- Zhenxiong Tan, Xinyin Ma, Gongfan Fang, Xinchao Wang:
LiteFocus: Accelerated Diffusion Inference for Long Audio Synthesis. - Miseul Kim, Soo-Whan Chung, Youna Ji, Hong-Goo Kang, Min-Seok Choi:
Speak in the Scene: Diffusion-based Acoustic Scene Transfer toward Immersive Speech Generation. - Shuhua Li, Qirong Mao, Jiatong Shi:
PL-TTS: A Generalizable Prompt-based Diffusion TTS Augmented by Large Language Model. - Louis Abel, Vincent Colotte, Slim Ouni:
Towards realtime co-speech gestures synthesis using STARGATE. - Shuchen Shi, Ruibo Fu, Zhengqi Wen, Jianhua Tao, Tao Wang, Chunyu Qiang, Yi Lu, Xin Qi, Xuefei Liu, Yukun Liu, Yongwei Li, Zhiyong Wang, Xiaopeng Wang:
PPPR: Portable Plug-in Prompt Refiner for Text to Audio Generation. - Jaeuk Lee, Sohee Jang, Joon-Hyuk Chang:
Neural ATSM: Fully Neural Network-based Adaptive Time-Scale Modification Using Sentence-Specific Dynamic Control. - Yinlin Guo, Yening Lv, Jinqiao Dou, Yan Zhang, Yuehai Wang:
FLY-TTS: Fast, Lightweight and High-Quality End-to-End Text-to-Speech Synthesis. - Marie Kunesová, Jan Lehecka, Josef Michálek, Jindrich Matousek, Jan Svec:
Zero-shot Out-of-domain is No Joke: Lessons Learned in the VoiceMOS 2023 MOS Prediction Challenge. - Nigel G. Ward, Andres Segura, Alejandro Ceballos, Divette Marco:
Towards a General-Purpose Model of Perceived Pragmatic Similarity.
Speech Synthesis: Other Topics 2
- Liisa Rätsep, Rasmus Lellep, Mark Fishel:
Enabling Conversational Speech Synthesis using Noisy Spontaneous Data. - Dong Yang, Tomoki Koriyama, Yuki Saito:
Frame-Wise Breath Detection with Self-Training: An Exploration of Enhancing Breath Naturalness in Text-to-Speech. - Donghyun Seong, Joon-Hyuk Chang:
H4C-TTS: Leveraging Multi-Modal Historical Context for Conversational Text-to-Speech. - Huai-Zhe Yang, Chia-Ping Chen, Shan-Yun He, Cheng-Ruei Li:
Bilingual and Code-switching TTS Enhanced with Denoising Diffusion Model and GAN. - Takaaki Saeki, Soumi Maiti, Shinnosuke Takamichi, Shinji Watanabe, Hiroshi Saruwatari:
SpeechBERTScore: Reference-Aware Automatic Evaluation of Speech Generation Leveraging NLP Evaluation Metrics. - Juhwan Yoon, WooSeok Ko, Seyun Um, Sungwoong Hwang, Soojoong Hwang, Changhwan Kim, Hong-Goo Kang:
UNIQUE : Unsupervised Network for Integrated Speech Quality Evaluation. - Minyoung Lee, Eunil Park, Sungeun Hong:
FVTTS : Face Based Voice Synthesis for Text-to-Speech.
Speech synthesis: Cross-lingual and multilingual aspects
- Florian Lux, Sarina Meyer, Lyonel Behringer, Frank Zalkow, Phat Do, Matt Coler, Emanuël A. P. Habets, Ngoc Thang Vu:
Meta Learning Text-to-Speech Synthesis in over 7000 Languages. - Cheng Gong, Erica Cooper, Xin Wang, Chunyu Qiang, Mengzhe Geng, Dan Wells, Longbiao Wang, Jianwu Dang, Marc Tessier, Aidan Pine, Korin Richmond, Junichi Yamagishi:
An Initial Investigation of Language Adaptation for TTS Systems under Low-resource Scenarios. - Jing Wu, Ting Chen, Minchuan Chen, Wei Hu, Shaojun Wang, Jing Xiao:
Improving Multilingual Text-to-Speech with Mixture-of-Language-Experts and Accent Disentanglement. - Jing Xu, Minglin Wu, Xixin Wu, Helen Meng:
Seamless Language Expansion: Enhancing Multilingual Mastery in Self-Supervised Models. - Edresson Casanova, Kelly Davis, Eren Gölge, Görkem Göknar, Iulian Gulea, Logan Hart, Aya Aljafari, Joshua Meyer, Reuben Morais, Samuel Olayemi, Julian Weber:
XTTS: a Massively Multilingual Zero-Shot Text-to-Speech Model. - Houjian Guo, Chaoran Liu, Carlos Toshinori Ishi, Hiroshi Ishiguro:
X-E-Speech: Joint Training Framework of Non-Autoregressive Cross-lingual Emotional Text-to-Speech and Voice Conversion.
Noise, Far-Field, Multi-Talker, Enhancement, Audio Classification
- Yiwen Shao, Shi-Xiong Zhang, Dong Yu:
RIR-SF: Room Impulse Response Based Spatial Feature for Target Speech Recognition in Multi-Channel Multi-Speaker Scenarios. - Yiwen Shao, Shi-Xiong Zhang, Yong Xu, Meng Yu, Dong Yu, Daniel Povey, Sanjeev Khudanpur:
Multi-Channel Multi-Speaker ASR Using Target Speaker's Solo Segment. - William Ravenscroft, George Close, Stefan Goetze, Thomas Hain, Mohammad Soleymanpour, Anurag Chowdhury, Mark C. Fuhs:
Transcription-Free Fine-Tuning of Speech Separation Models for Noisy and Reverberant Multi-Speaker Automatic Speech Recognition. - Alon Vinnikov, Amir Ivry, Aviv Hurvitz, Igor Abramovski, Sharon Koubi, Ilya Gurvich, Shai Peer, Xiong Xiao, Benjamin Martinez Elizalde, Naoyuki Kanda, Xiaofei Wang, Shalev Shaer, Stav Yagev, Yossi Asher, Sunit Sivasankaran, Yifan Gong, Min Tang, Huaming Wang, Eyal Krupka:
NOTSOFAR-1 Challenge: New Datasets, Baseline, and Tasks for Distant Meeting Transcription. - Yehoshua Dissen, Shiry Yonash, Israel Cohen, Joseph Keshet:
Enhanced ASR Robustness to Packet Loss with a Front-End Adaptation Network. - Daniel Haider, Felix Perfler, Vincent Lostanlen, Martin Ehler, Peter Balazs:
Hold Me Tight: Stable Encoder-Decoder Design for Speech Enhancement. - Wenjun Wang, Shangbin Mo, Ling Dong, Zhengtao Yu, Junjun Guo, Yuxin Huang:
DGSRN: Noise-Robust Speech Recognition Method with Dual-Path Gated Spectral Refinement Network. - Riyansha Singh, Parinita Nema, Vinod K. Kurmi:
Towards Robust Few-shot Class Incremental Learning in Audio Classification using Contrastive Representation. - Muhammad Umer Sheikh, Hassan Abid, Bhuiyan Sanjid Shafique, Asif Hanif, Muhammad Haris Khan:
Bird Whisperer: Leveraging Large Pre-trained Acoustic Model for Bird Call Classification. - Hiroshi Sato, Takafumi Moriya, Masato Mimura, Shota Horiguchi, Tsubasa Ochiai, Takanori Ashihara, Atsushi Ando, Kentaro Shinayama, Marc Delcroix:
SpeakerBeam-SS: Real-time Target Speaker Extraction with Lightweight Conv-TasNet and State Space Modeling. - Marvin Borsdorf, Zexu Pan, Haizhou Li, Tanja Schultz:
wTIMIT2mix: A Cocktail Party Mixtures Database to Study Target Speaker Extraction for Normal and Whispered Speech.
Self-Supervised Learning for ASR
- Yaroslav Getman, Tamás Grósz, Mikko Kurimo:
What happens in continued pre-training? Analysis of self-supervised speech models with continued pre-training for colloquial Finnish ASR. - Akihiro Kato, Hiroyuki Nagano, Kohei Chike, Masaki Nose:
Self-Supervised Learning for ASR Pre-Training with Uniquely Determined Target Labels and Controlling Cepstrum Truncation for Speech Augmentation. - Hemant Yadav, Sunayana Sitaram, Rajiv Ratn Shah:
MS-HuBERT: Mitigating Pre-training and Inference Mismatch in Masked Language Modelling methods for learning Speech Representations. - Mun-Hak Lee, Jae-Hong Lee, Do-Hee Kim, Ye-Eun Ko, Joon-Hyuk Chang:
Balanced-Wav2Vec: Enhancing Stability and Robustness of Representation Learning Through Sample Reweighting Techniques.
Spoken Term Detection and Speech Retrieval
- Junming Yuan, Ying Shi, Lantian Li, Dong Wang, Askar Hamdulla:
Few-Shot Keyword Spotting from Mixed Speech. - Bolaji Yusuf, Jan Honza Cernocký, Murat Saraçlar:
Pretraining End-to-End Keyword Search with Automatically Discovered Acoustic Units. - Jiajun He, Tomoki Toda:
2DP-2MRC: 2-Dimensional Pointer-based Machine Reading Comprehension Method for Multimodal Moment Retrieval. - Yuxin Xie, Zhihong Zhu, Xianwei Zhuang, Liming Liang, Zhichang Wang, Yuexian Zou:
GPA: Global and Prototype Alignment for Audio-Text Retrieval. - Ilseok Kim, Ju-Seok Seong, Joon-Hyuk Chang:
Few-Shot Keyword-Incremental Learning with Total Calibration. - Allahsera Tapo, Éric Le Ferrand, Zoey Liu, Christopher Homan, Emily Prud'hommeaux:
Leveraging Speech Data Diversity to Document Indigenous Heritage and Culture.
Speech Disorders 1
- Payal Mohapatra, Shamika Likhite, Subrata Biswas, Bashima Islam, Qi Zhu:
Missingness-resilient Video-enhanced Multimodal Disfluency Detection. - Rong Gong, Hongfei Xue, Lezhi Wang, Xin Xu, Qisheng Li, Lei Xie, Hui Bu, Shaomei Wu, Jiaming Zhou, Yong Qin, Binbin Zhang, Jun Du, Jia Bin, Ming Li:
AS-70: A Mandarin stuttered speech dataset for automatic speech recognition and stuttering event detection. - Wazeer Zulfikar, Nishat Protyasha, Camila Canales, Heli Patel, James Williamson, Laura Sarnie, Lisa Nowinski, Nataliya Kosmyna, Paige Townsend, Sophia Yuditskaya, Tanya Talkar, Utkarsh Oggy Sarawgi, Christopher J. McDougle, Thomas F. Quatieri, Pattie Maes, Maria Mody:
Analyzing Speech Motor Movement using Surface Electromyography in Minimally Verbal Adults with Autism Spectrum Disorder. - Cong Zhang, Tong Li, Gayle DeDe, Christos Salis:
Prosody of speech production in latent post-stroke aphasia. - Liangyu Nie, Sudarsana Reddy Kadiri, Ruchit Agrawal:
MMSD-Net: Towards Multi-modal Stuttering Detection. - Dominik Wagner, Sebastian P. Bayerl, Ilja Baumann, Elmar Nöth, Korbinian Riedhammer, Tobias Bocklet:
Large Language Models for Dysfluency Detection in Stuttered Speech.
Connecting Speech-science and Speech-technology for Children's Speech (Special Session)
- Carly Demopoulos, Linnea Lampinen, Cristian Preciado, Hardik Kothare, Vikram Ramanarayanan:
Preliminary Investigation of Psychometric Properties of a Novel Multimodal Dialog Based Affect Production Task in Children and Adolescents with Autism. - Delphine Charuau, Andrea Briglia, Erika Godde, Gérard Bailly:
Training speech-breathing coordination in computer-assisted reading. - Prad Kadambi, Tristan J. Mahr, Lucas Annear, Henry Nomeland, Julie Liss, Katherine C. Hustad
, Visar Berisha:
How Does Alignment Error Affect Automated Pronunciation Scoring in Children's Speech? - Nina R. Benway, Jonathan L. Preston, Carol Y. Espy-Wilson:
Examining Vocal Tract Coordination in Childhood Apraxia of Speech with Acoustic-to-Articulatory Speech Inversion Feature Sets. - Vrunda N. Sukhadia, Shammur Absar Chowdhury:
Children's Speech Recognition through Discrete Token Enhancement. - Yujia Wang, Hexin Liu, Leibny Paola Garcia:
Bridging Child-Centered Speech Language Identification and Language Diarization via Phonetics. - Lingyun Gao, Cristian Tejedor García, Helmer Strik, Catia Cucchiarini:
Reading Miscue Detection in Primary School through Automatic Speech Recognition. - Ilja Baumann, Nicole Unger, Dominik Wagner, Korbinian Riedhammer, Tobias Bocklet:
Automatic Evaluation of a Sentence Memory Test for Preschool Children. - Jialu Li, Mark Hasegawa-Johnson, Karrie Karahalios:
Enhancing Child Vocalization Classification with Phonetically-Tuned Embeddings for Assisting Autism Diagnosis. - Lucas Block Medin, Thomas Pellegrini, Lucile Gelin:
Self-Supervised Models for Phoneme Recognition: Applications in Children's Speech for Reading Learning. - Ruchao Fan, Natarajan Balaji Shankar, Abeer Alwan:
Benchmarking Children's ASR with Supervised and Self-supervised Speech Foundation Models. - Thomas Rolland, Alberto Abad:
Introduction To Partial Fine-tuning: A Comprehensive Evaluation Of End-to-end Children's Automatic Speech Recognition Adaptation. - Yuanyuan Zhang, Zhengjun Yue, Tanvina Patel, Odette Scharenborg:
Improving child speech recognition with augmented child-like speech. - Thomas Graave, Zhengyang Li, Timo Lohrenz, Tim Fingscheidt:
Mixed Children/Adult/Childrenized Fine-Tuning for Children's ASR: How to Reduce Age Mismatch and Speaking Style Mismatch. - Anfeng Xu, Kevin Huang, Tiantian Feng, Lue Shen, Helen Tager-Flusberg, Shrikanth Narayanan:
Exploring Speech Foundation Models for Speaker Diarization in Child-Adult Dyadic Interactions.
Show and Tell 4
- Meenakshi Sirigiraju, Arjun Rajasekar, Abhishikth Meejuri, Chiranjeevi Yarra:
IIITH Ucchar e-Sudharak: an automatic English pronunciation corrector for school-going children with a teacher in the loop. - Boon Peng Yap, Kok Liang Tan, Zhenghao Li, Rong Tong:
Speech enabled visual acuity test. - Mayuko Aiba, Daisuke Saito, Nobuaki Minematsu:
A ChatGPT-based oral Q&A practice system for first-time student participants in international conferences. - Karthik Venkat Sridaran, Raja Praveen, Reuben T. Varghese, Ajish K. Abraham, Shankar R, Winnie Rachel Cherian:
Visual scene display application for augmentative and alternative communication. - Ikuyo Masuda-Katsuse, Ayako Shirose:
CALL system using pitch-accent feature representations reflecting listeners' subjective adequacy. - Jonathan L. Preston, Nina R. Benway, Nathan R. Prestopnik, Nathan Preston:
The speech motor chaining web app for speech motor learning. - Charlotte Yoder, Karrie Karahalios, Mark Hasegawa-Johnson, Shreyansh Agrawal:
Visualization for improving foreign language pronunciation. - Nhan Phan, Anna von Zansen, Maria Kautonen, Tamás Grósz, Mikko Kurimo:
CaptainA self-study mobile app for practising speaking: task completion assessment and feedback with generative AI.
