26th Interspeech 2025: Rotterdam, The Netherlands
- Odette Scharenborg, Catharine Oertel, Khiet Truong: 26th Annual Conference of the International Speech Communication Association, Interspeech 2025, Rotterdam, The Netherlands, 17-21 August 2025. ISCA 2025.
Keynote 1
- Roger K. Moore: From Talking and Listening Devices to Intelligent Communicative Machines.
Spoken Machine Translation 1
- Luca Ducceschi, Greta H. Franzini: Speech transcription from South Tyrolean Dialect to Standard German with Whisper.
- Aswin Shanmugam Subramanian, Harveen Singh Chadha, Vikas Joshi, Shubham Bansal, Jian Xue, Rupeshkumar Mehta, Jinyu Li: Length Aware Speech Translation for Video Dubbing.
- Vishal Kumar, Vinayak Abrol: ArticulateX: End-to-End Monolingual Speech Translation in Articulator Space.
- Jiale Ou, Hongying Zan: CMSP-ST: Cross-modal Mixup with Speech Purification for End-to-End Speech Translation.
- Yosuke Higuchi, Tetsuji Ogawa, Tetsunori Kobayashi: End-to-End Speech Translation Guided by Robust Translation Capability of Large Language Model.
- Yu Pu, Xiaoqian Liu, Guangyu Zhang, Zheng Yan, Wei-Qiang Zhang, Xie Chen: Empowering Large Language Models for End-to-End Speech Translation Leveraging Synthetic Data.
- Gerard I. Gállego, Oriol Pareras, Martí Cortada Garcia, Lucas Takanori, Javier Hernando: Speech-to-Text Translation with Phoneme-Augmented CoT: Enhancing Cross-Lingual Transfer in Low-Resource Scenarios.
- Hayato Futami, Emiru Tsunoo, Yosuke Kashiwagi, Yuki Ito, Hassan Shahmohammadi, Siddhant Arora, Shinji Watanabe: Scheduled Interleaved Speech-Text Training for Speech-to-Speech Translation with LLMs.
- Aishwarya Pothula, Bhavana Akkiraju, Srihari Bandarupalli, Charan Devarkonda, Santosh Kesiraju, Anil Kumar Vuppala: End-to-End Speech Translation for Low-Resource Languages Using Weakly Labeled Data.
- Shaowen Wang, Xinyuan Chen, Yao Xu: Self-Improvement for Audio Large Language Model using Unlabeled Speech.
Real-time Speech Enhancement
- Yan Ru Pei, Ritik Shrivastava, Sidharth: Optimized Real-time Speech Enhancement with Deep SSMs on Raw Audio.
- Shenghui Lu, Hukai Huang, Jinanglong Yao, Kaidi Wang, Qingyang Hong, Lin Li: A Two-Stage Hierarchical Deep Filtering Framework for Real-Time Speech Enhancement.
- Teng Aleksandra Ma, Sile Yin, Li-Chia Yang, Shuo Zhang: Real-Time Audio-Visual Speech Enhancement Using Pre-trained Visual Representations.
- Yonghun Song, Yeeun Kim, Yoonyoung Chung: Lightweight Speech Enhancement Model Based on Harmonic Attention and Phase Estimation with Skin-Attachable Accelerometer.
- Yi Gao, Hangting Chen, Siyu Zhang, Qingshan Yang, Jingcong Chen: TSDT-Net: Ultra-Low-Complexity Two-Stage Model Combining Dual-Path-Transformer and Transform-Average-Concatenate Network for Speech Enhancement.
- Chidambar B, Hanumanth Rao Naidu: Structured Codebook Based Hierarchical Framework for DNN for Computationally Efficient Speech Enhancement.
Multilinguality, Cross-linguistic Studies, L2 Speech
- Qian Zhou, Mathilde Hutin: Evaluation of Three Automatic Alignment Tools for the Processing of Non-native French.
- Hongchen Wu, Yixin Gu: CrossPhon: An Auto Phone Mapping Tool to Streamline Cross-language Modeling for Phone Alignment of Low-resource Languages.
- Ryo Magoshi, Shinsuke Sakai, Jaeyoung Lee, Tatsuya Kawahara: Multi-lingual and Zero-Shot Speech Recognition by Incorporating Classification of Language-Independent Articulatory Features.
- Haley Hsu, Dani Byrd, Khalil Iskarous, Louis Goldstein: Instantaneous changes in acoustic signals reflect syllable progression and cross-linguistic syllable variation.
- Linda Bakkouche, Brechtje Post: Influence of Proficiency and L2 Experience on Dynamic Spectral Cue Utilization in L2 Vowel Perception and Production.
- Kakeru Yazawa, Takayuki Konishi: A Bayesian Approach to L2 Fluency Ratings by Native and Nonnative Listeners.
- Silke Hamann, Andrea Alicehajic: Are loan sequences different from foreign sequences? A perception study with Japanese listeners on coronal obstruent - high front vowel sequences.
- Le Xuan Chan, Annika Heuser: Relative cue weighting in multilingual stop voicing production.
- Hannah White, Joshua Penney, Felicity Cox: Variability in Intervocalic /t/ and Community Diversity in Australian English.
Speech Emotion Recognition 1
- Pravin Mote, Donita Robinson, Elizabeth Richerson, Carlos Busso: Vector Quantized Cross-lingual Unsupervised Domain Adaptation for Speech Emotion Recognition.
- Orchid Chetia Phukan, Girish, Mohd Mujtaba Akhtar, Swarup Ranjan Behera, Pailla Balakrishna Reddy, Arun Balaji Buduru, Rajesh Sharma: HYFuse: Aligning Heterogeneous Speech Pre-Trained Representations in Hyperbolic Space for Speech Emotion Recognition.
- Shi-Xin Fang, Liang-Yeh Shen, Yi-Cheng Lin, Huang-Cheng Chou, Hung-yi Lee: Meta-PerSER: Few-Shot Listener Personalized Speech Emotion Recognition via Meta-learning.
- Yi Chang, Zhao Ren, Zhonghao Zhao, Thanh Tam Nguyen, Kun Qian, Tanja Schultz, Björn W. Schuller: Breaking Resource Barriers in Speech Emotion Recognition via Data Distillation.
- Mehedi Hasan Bijoy, Dejan Porjazovski, Tamás Grósz, Mikko Kurimo: Multi-Teacher Language-Aware Knowledge Distillation for Multilingual Speech Emotion Recognition.
- Ziwei Gong, Pengyuan Shi, Kaan Donbekci, Lin Ai, Run Chen, David Sasu, Zehui Wu, Julia Hirschberg: Learning More with Less: Self-Supervised Approaches for Low-Resource Speech Emotion Recognition.
Multimodal Resources
- Yong Ren, Chenxing Li, Le Xu, Hao Gu, Duzhen Zhang, Yujie Chen, Manjie Xu, Ruibo Fu, Shan Yang, Dong Yu: Hearing from Silence: Reasoning Audio Descriptions from Silent Videos via Vision-Language Model.
- Le Xu, Chenxing Li, Yong Ren, Yujie Chen, Yu Gu, Ruibo Fu, Shan Yang, Dong Yu: Mitigating Audiovisual Mismatch in Visual-Guide Audio Captioning.
- Thai-Binh Nguyen, Thi Van Nguyen, Quoc Truong Do, Chi Mai Luong: ViCocktail: Automated Multi-Modal Data Collection for Vietnamese Audio-Visual Speech Recognition.
- Yihan Wu, Yichen Lu, Yijing Chen, Jiaqi Song, William Chen, Ruihua Song, Shinji Watanabe: GALAXY: A Large-Scale Open-Domain Dataset for Multimodal Learning.
- Yizhou Peng, Yi-Wen Chao, Dianwen Ng, Yukun Ma, Chongjia Ni, Bin Ma, Eng Siong Chng: FD-Bench: A Full-Duplex Benchmarking Pipeline Designed for Full Duplex Spoken Dialogue Systems.
- Sho Inoue, Shuai Wang, Haizhou Li: PersonaTAB: Predicting Personality Traits using Textual, Acoustic, and Behavioral Cues in Fully-Duplex Speech Dialogs.
- Boya Dong, Wentao Lei, Li Liu: FFD: Fine-Finger Diffusion Model for Music to Fine-grained Finger Dance Generation.
- Manjie Xu, Chenxing Li, Yong Ren, Xinyi Tu, Ruibo Fu, Wei Liang, Dong Yu: Towards Diverse and Efficient Audio Captioning via Diffusion Models.
- Amit Sofer, Yoav Goldman, Shlomo E. Chazan: Pull It Together: Reducing the Modality Gap in Contrastive Learning.
Interpretability in Audio and Speech Technology
- Han Yin, Yang Xiao, Rohan Kumar Das, Jisheng Bai, Haohe Liu, Wenwu Wang, Mark D. Plumbley: EnvSDD: Benchmarking Environmental Sound Deepfake Detection.
- Dennis Fucci, Marco Gaido, Matteo Negri, Mauro Cettolo, Luisa Bentivogli: Echoes of Phonetics: Unveiling Relevant Acoustic Cues for ASR via Feature Attribution.
- Cecilia Bolaños, Leonardo Pepino, Martín Meza, Luciana Ferrer: Benchmarking Time-localized Explanations for Audio Classification Models.
- Andrew Chang, Yike Li, Iran R. Roman, David Poeppel: Spectrotemporal Modulation: Efficient and Interpretable Feature Representation for Classifying Speech, Music, and Environmental Sounds.
- Kentaro Onda, Keisuke Imoto, Satoru Fukayama, Daisuke Saito, Nobuaki Minematsu: Discrete Tokens Exhibit Interlanguage Speech Intelligibility Benefit: an Analytical Study Towards Accent-robust ASR Only with Native Speech Data.
- Takanori Ashihara, Marc Delcroix, Tsubasa Ochiai, Kohei Matsuura, Shota Horiguchi: Analysis of Semantic and Acoustic Token Variability Across Speech, Music, and Audio Domains.
- Yaroslav Getman, Tamás Grósz, Tommi Lehtonen, Mikko Kurimo: Is your model big enough? Training and interpreting large-scale monolingual speech foundation models.
- Andreas Patakis, Vassilis Lyberatos, Spyridon Kantarelis, Edmund Dervakos, Giorgos Stamou: Semantic-Aware Interpretable Multimodal Music Auto-Tagging.
- Asim Ersoy, Basel Ahmad Mousi, Shammur Absar Chowdhury, Firoj Alam, Fahim Dalvi, Nadir Durrani: From Words to Waves: Analyzing Concept Formation in Speech and Text-Based Foundation Models.
- Yen Meng, Sharon Goldwater, Hao Tang: Effective Context in Neural Speech Models.
- Martijn Bentum, Louis ten Bosch, Tomas O. Lentz: Word stress in self-supervised speech models: A cross-linguistic comparison.
- Marianne de Heer Kloots, Hosein Mohebbi, Charlotte Pouw, Gaofei Shen, Willem H. Zuidema, Martijn Bentum: What do self-supervised speech models know about Dutch? Analyzing advantages of language-specific pre-training.
- Robin Huo, Ewan Dunbar: Iterative Refinement, Not Training Objective, Makes HuBERT Behave Differently from wav2vec 2.0.
- Gaofei Shen, Hosein Mohebbi, Arianna Bisazza, Afra Alishahi, Grzegorz Chrupala: On the reliability of feature attribution methods for speech classification.
- Emma Cathrine Liisborg Leschly, Oliver Roesler, Michael Neumann, Jackson Liscombe, Abhishek Hosamath, Lakshmi Arbatti, Line H. Clemmensen, Melanie Ganz, Vikram Ramanarayanan: An Exploration of Interpretable Deep Learning Models for the Assessment of Mild Cognitive Impairment.
Summarization
- Steffen Freisinger, Philipp Seeberger, Thomas Ranzenberger, Tobias Bocklet, Korbinian Riedhammer: Towards Multi-Level Transcript Segmentation: LoRA Fine-Tuning for Table-of-Contents Generation.
- Takatomo Kano, Atsunori Ogawa, Marc Delcroix, Ryo Fukuda, William Chen, Shinji Watanabe: Pick and Summarize: Integrating Extractive and Abstractive Speech Summarization.
- Othman Istaiteh, Salima Mdhaffar, Yannick Estève: Beyond Similarity Scoring: Detecting Entailment and Contradiction in Multilingual and Multimodal Contexts.
- Ziwei Gong, Lin Ai, Harsh Deshpande, Alexander Johnson, Emmy Phung, Zehui Wu, Ahmad Emami, Julia Hirschberg: Comparison-Based Automatic Evaluation for Meeting Summarization.
Show and Tell 1: ASR / Tools
- Alessandro De Luca, Srikanth Madikeri, Volker Dellwo: Voxplorer: Voice data exploration and projection in an interactive dashboard.
- Anand Kumar Rai, Satyam Rahangdale, Utkarsh Anand, Animesh Mukherjee: ASR-FAIRBENCH: Measuring and Benchmarking Equity Across Speech Recognition Systems.
- Christoph Draxler, Julian Pömp, Henk van den Heuvel, Fabio Ardolino, Arjan van Hessen: Transcribing Oral History Recordings Using the Transcription Portal.
- Teodora Vukovic, Jérémy Zehr, Jonathan Schaber, Igor Mustac, Nikolina Rajovic, Daniel McDonald, Johannes Graën, Noah Bubenhofer: LiRI Corpus Platform: Demonstration of a Web-Based Infrastructure for Multimodal Corpus Analysis.
- Zirong Li, Hongchen Wu, Yixin Gu, Yao Du, Yang Yue: Speech Annotation for A: Accuracy, Access, and Application.
- Arturs Znotins, Didzis Gosko, Normunds Gruzitis: LATE: Open Source Toolkit for Latvian and Latgalian Speech Transcription.
- Kumarmanas Nethil, Vaibhav Mishra, Kriti Anandan, Kavya Manohar: Scalable Offline ASR for Command-Style Dictation in Courtrooms.
Models of Speech Production
- Yijing Lu, Khalil Iskarous, Louis Goldstein: Towards a dynamical model of transitions between fluent and stuttered speech.
- Juliette Dindart, Agnès Rouxel, Crystal Lin, Trung Kien Bui, Muriel Lefort, Claire Pillot-Loiseau, Christophe Trésallet, Frédérique Frouin: Study of vocal fold vibration using M-mode ultrasound: a proof of concept.
- Jihwan Lee, Kevin Huang, Kleanthis Avramidis, Simon Pistrosch, Monica González Machorro, Yoonjeong Lee, Björn W. Schuller, Louis Goldstein, Shrikanth Narayanan: Articulatory Feature Prediction from Surface EMG during Speech Production.
- Saba Tabatabaee, Suzanne Boyce, Liran Oren, Mark Tiede, Carol Y. Espy-Wilson: Enhancing Acoustic-to-Articulatory Speech Inversion by Incorporating Nasality.
Speech and Grammar/Articulatory Analyses
- Anna Stein, Kevin Tang: Modeling Probabilistic Reduction using Information Theory and Naive Discriminative Learning.
- Zofia Malisz, Jan Foremski, Malgorzata Kul: Contextual predictability effects on acoustic distinctiveness in read Polish speech.
- Ivan Yuen, Katherine Demuth, Stefanie Shattuck-Hufnagel: How do both phonological and syntactic complexity influence speech planning?
- Anqi Xu, Yu-Yin Hsu: When focus shapes the flow: prosodic restructuring in Mandarin complex nominals.
- Sofoklis Kakouros: Investigating the Impact of Word Informativeness on Speech Emotion Recognition.
- Bowei Shao, Philipp Buech, Anne Hermes, Maria Giavazzi: Lexical stress affects lenition: The case of Italian palato-alveolar affricates.
- Peter Birkholz, Tianyi Zhang: Evaluation of a model for sound radiation from the vocal tract wall.
- Satu Hopponen, Tomi Kinnunen, Alexandre Nikolaev, Rosa González Hautamäki, Lauri Tavi, Einar Meister: FROST-EMA: Finnish and Russian Oral Speech Dataset of Electromagnetic Articulography Measurements with L1, L2 and Imitated L2 Accents.
Speaking Styles, Register and Conversational Speech
- Yunzhuo Xiang, Jingyi Sun: Modeling Formant Dynamics in Mandarin /ai/: Effects of Speech Style and Speech Rate.
- Livia Qian, Carol Figueroa, Gabriel Skantze: Representation of Perceived Prosodic Similarity of Conversational Feedback.
- Oana Niculescu, Monica Vasileanu: Prolongation in Romanian.
- Kübra Bodur, Corinne Fredouille, Christine Meunier: Speech Reduction in French: The Relationship Between Vowel Space and Articulation Dynamics.
- Andre Batchelder-Schwab, Vasileios Michos, Jonathan Barnes: Stress in Spoken and Whistled Greek.
Emotional Distress in Speech
- Justyna Krzywdziak, Bartlomiej Eljasiak, Joanna Stepien, Michal Swiatek, Agnieszka Pruszek: Leveraging Text and Speech Processing for Suicide Risk Classification in Chinese Adolescents.
- Wen Wu, Ziyun Cui, Chang Lei, Yinan Duan, Diyang Qu, Ji Wu, Bowen Zhou, Runsen Chen, Chao Zhang: The 1st SpeechWellness Challenge: Detecting Suicide Risk Among Adolescents.
- Yifan Gao, Jiao Fu, Long Guo, Hong Liu: Leveraging Large Language Models for Spontaneous Speech-Based Suicide Risk Detection.
- Xi Chen, Renzhe Yu, Yanshen Tan, Yiyi Li, Quan Qian, Ying Lin: Predicting Adolescent Suicidal Risk from Multi-task-based Speech: An Ensemble Learning Approach.
- Filomene Roquefort, Alexandre Ducorroy, Rachid Riad: In-context learning capabilities of Large Language Models to detect suicide risk among adolescents from speech transcripts.
- June-Woo Kim, Wonkyo Oh, Haram Yoon, Sung-Hoon Yoon, Dae-Jin Kim, Dong-Ho Lee, Sang-Yeol Lee, Chan-Mo Yang: Language-Agnostic Suicidal Risk Detection Using Large Language Models.
- Vincent P. Martin, Charles Brazier, Maxime Amblard, Michel Musiol, Jean-Luc Rouas: Network of acoustic characteristics for the automatic detection of suicide risk from speech. Contribution to the 2025 SpeechWellness challenge by the Semawave team.
Prosody in Speech Synthesis
- Eray Eren, Qingju Liu, Hyeongwoo Kim, Pablo Garrido, Abeer Alwan: ProMode: A Speech Prosody Model Conditioned on Acoustic and Textual Inputs.
- Kyowoon Lee, Artyom Stitsyuk, Gunu Jho, Inchul Hwang, Jaesik Choi: Counterfactual Activation Editing for Post-hoc Prosody and Mispronunciation Correction in TTS Models.
- Paul Mayer, Florian Lux, Alejandro Pérez González de Martos, Angelina Elizarova, Lindsey Vanderlyn, Dirk Väth, Ngoc Thang Vu: Investigating Stochastic Methods for Prosody Modeling in Speech Synthesis.
- Tadashi Ogura, Takuma Okamoto, Yamato Ohtani, Erica Cooper, Tomoki Toda, Hisashi Kawai: GST-BERT-TTS: Prosody Prediction Without Accentual Labels For Multi-Speaker TTS Using BERT With Global Style Tokens.
- Anindita Mondal, Monica Surtani, Anil Kumar Vuppala, Parameswari Krishnamurthy, Chiranjeevi Yarra: ExagTTS: An Approach Towards Controllable Word Stress Incorporated TTS for Exaggerated Synthesized Speech Aiding Second Language Learners.
- Hoyeon Lee, Sejung Son, Ye-Eun Kang, Jong-Hwan Kim: Synthetic Data Generation for Phrase Break Prediction with Large Language Model.
Depression Detection and Assessment 1
- Lauren L. White, Ewan Carr, Judith Dineley, Catarina Botelho, Pauline Conde, Faith Matcham, Carolin Oetzmann, Amos Folarin, George Fairs, Agnes Norbury, Stefano Goria, Srinivasan Vairavan, Til Wykes, Richard J. B. Dobson, Vaibhav A. Narayan, Matthew Hotopf, Alberto Abad, Isabel Trancoso, Nicholas Cummins: Speech Reference Intervals: An Assessment of Feasibility in Depression Symptom Severity Prediction.
- Wenrui Liang, Rong Zhang, Xuezhen Zhang, Ying Ma, Wei-Qiang Zhang: DepressGEN: Synthetic Data Generation Framework for Depression Detection.
- Yuqiu Zhou, Yongjie Zhou, Yudong Yang, Yang Liu, Jun Huang, Shuzhi Zhao, Rongfeng Su, Lan Wang, Nan Yan: Emotion-Guided Graph Attention Networks for Speech-Based Depression Detection under Emotion-Inducting Tasks.
- Patawee Prakrankamanant, Shinji Watanabe, Ekapol Chuangsuwanich: Explainable Depression Detection using Masked Hard Instance Mining.
- Sri Harsha Dumpala, Chandramouli Shama Sastry, Rudolf Uher, Sageev Oore: Test-Time Training for Speech-based Depression Detection.
- Lishi Zuo, Man-Wai Mak: Leveraging Ordinal Information for Speech-based Depression Classification.
- Erfan Loweimi, Sofia de la Fuente Garcia, Saturnino Luz: Zero-Shot Speech-Based Depression and Anxiety Assessment with LLMs.
- Robert Lewis, Szymon Fedor, Nelson Hidalgo Julia, Joshua Curtiss, Jiyeon Kim, Noah Jones, David Mischoulon, Thomas F. Quatieri, Nicholas Cummins, Paola Pedrelli, Rosalind W. Picard: Towards the Objective Characterisation of Major Depressive Disorder Using Speech Data from a 12-week Observational Study with Daily Measurements.
- Sophie Young, Fuxiang Tao, Bahman Mirheidari, Madhurananda Pahar, Markus Reuber, Heidi Christensen: Can Speech Accurately Detect Depression in Patients With Comorbid Dementia? An Approach for Mitigating Confounding Effects of Depression and Dementia.
Speech Analysis, Detection and Classification 1
- Shaojie Li, Qintuya Si, De Hu: Temporal Convolutional Network with Smoothed and Weighted Losses for Distant Voice Activity and Overlapped Speech Detection.
- Kumud Tripathi, Chowdam Venkata Kumar, Pankaj Wasnik: Attention Is Not Always the Answer: Optimizing Voice Activity Detection with Simple Feature Fusion.
- Miseul Kim, Seyun Um, Hyeonjin Cha, Hong-Goo Kang: SpeechMLC: Speech Multi-label Classification.
- Dohyun Kim, Jiwook Hwang: Fully End-to-end Streaming Open-vocabulary Keyword Spotting with W-CTC Forced Alignment.
- Anna Seo Gyeong Choi, Alexander Richardson, Ryan Partlan, Sunny X. Tang, Sunghye Cho: Comparative Evaluation of Acoustic Feature Extraction Tools for Clinical Speech Analysis.
- Tahiya Chowdhury, Verónica Romero: Can We Trust Machine Learning? The Reliability of Features from Open-Source Speech Analysis Tools for Speech Modeling.
Speech-based Cognitive Assessment 1
- Vi Jun Sean Yong, Serkan Kumyol, Pau Le Lisa Low, Winnie Suk Wai Leung, Tristan Braud: HK-GenSpeech: A Generative AI Scene Creation Framework for Speech Based Cognitive Assessment.
- Parismita Gogoi, Vishwanath Pratap Singh, Seema Khadirnaikar, Soma Siddhartha, Sishir Kalita, Jagabandhu Mishra, Md. Sahidullah, Priyankoo Sarmah, S. R. M. Prasanna: Leveraging AM and FM Rhythm Spectrograms for Dementia Classification and Assessment.
- Yin-Long Liu, Yuanchao Li, Rui Feng, Liu He, Jia-Xin Chen, Yi-Ming Wang, Yu-Ang Chen, Yan-Han Peng, Jia-Hong Yuan, Zhen-Hua Ling: Leveraging Cascaded Binary Classification and Multimodal Fusion for Dementia Detection through Spontaneous Speech.
- Kaichen Jia, Jinpeng Li, Ke Li, Wei-Qiang Zhang: Whisper-Based Multilingual Alzheimer's Disease Detection and Improvements for Low-Resource Language.
- Qi Sun, Ziyue Qiu, Yu Pu, Jinpeng Li, Xuchu Chen, Wei-Qiang Zhang: PPGs-BERT: Leveraging Phoneme Sequence and BERT for Alzheimer's Disease Detection from Spontaneous Speech.
Large Language Models in Speech Recognition
- Te Ma, Min Bi, Saierdaer Yusuyin, Hao Huang, Zhijian Ou: LLM-based phoneme-to-grapheme for phoneme-based speech recognition.
- Jie Zhengjie, Gaofeng Cheng: Pinyin-Guided Chinese Speech Recognition with Large Language Model.
- Hang Su, Yuxiang Kong, Lichun Fan, Jian Luan: Text-Enhanced Audio Encoder for Large Language Model based Speech Recognition via Cross-Modality Pre-training with Unpaired Audio-Text Data.
- Jinda Zhang, Aanchan Mohan: Towards atypical speech transcription using LLM-based ASR.
- Jeena Prakash, Blessingh Kumar, Kadri Hacioglu, Bidisha Sharma, Sindhuja Gopalan, Malolan Chetlur, Shankar Venkatesan, Andreas Stolcke: Better Pseudo-labeling with Multi-ASR Fusion and Error Correction by SpeechLLM.
- Tianyi Xu, Hongjie Chen, Qing Wang, Hang Lv, Jian Kang, Jie Li, Zhennan Lin, Yongxiang Li, Lei Xie: Leveraging LLM and Self-Supervised Training Models for Speech Recognition in Chinese Dialects: A Comparative Analysis.
Speech Coding and Echo Cancellation
- Shanhui Gan, Zijian Liang, Kai Niu, Ping Zhang: Synonymity-Based Semantic Coding for Efficient Speech Compression.
- Byeong Hyeon Kim, Hyungseob Lim, Inseon Jang, Hong-Goo Kang: Towards an Ultra-Low-Delay Neural Audio Coding with Computational Efficiency.
- Zixiang Wan, Guochang Zhang, Yifeng He, Jianqiang Wei: SpecTokenizer: A Lightweight Streaming Codec in the Compressed Spectrum Domain.
- Haibin Wu, Naoyuki Kanda, Sefik Emre Eskimez, Jinyu Li: TS3-Codec: Transformer-Based Simple Streaming Single Codec.
- Yunkee Chae, Kyogu Lee: Towards Bitrate-Efficient and Noise-Robust Speech Coding with Variable Bitrate RVQ.
- Bowen Zhang, Ian McLoughlin, Xiaoxiao Miao, A. S. Madhukumar: LSPnet: an ultra-low bitrate hybrid neural codec.
- Yao Guo, Yang Ai, Rui-Chen Zheng, Hui-Peng Du, Xiao-Hang Jiang, Zhen-Hua Ling: Vision-Integrated High-Quality Neural Speech Coding.
- Woongjib Choi, Byeong Hyeon Kim, Hyungseob Lim, Inseon Jang, Hong-Goo Kang: Neural Spectral Band Generation for Audio Coding.
- Fei Zhao, Xueliang Zhang, Zhong-Qiu Wang: Multi-Channel Acoustic Echo Cancellation Based on Direction-of-Arrival Estimation.
Decoding Algorithms
- Koji Okabe, Hitoshi Yamamoto: Simultaneous Masked and Unmasked Decoding with Speculative Decoding Masking for Fast ASR without Accuracy Loss.
- Hainan Xu, Vladimir Bataev, Lilit Grigoryan, Boris Ginsburg: WIND: Accelerated RNN-T Decoding with Windowed Inference for Non-blank Detection.
- Vladimir Bataev, Andrei Andrusenko, Lilit Grigoryan, Aleksandr Laptev, Vitaly Lavrukhin, Boris Ginsburg: NGPU-LM: GPU-Accelerated N-Gram Language Model for Context-Biasing in Greedy ASR Decoding.
- Lilit Grigoryan, Vladimir Bataev, Andrei Andrusenko, Hainan Xu, Vitaly Lavrukhin, Boris Ginsburg: Pushing the Limits of Beam Search Decoding for Transducer-based ASR models.
- Ashish R. Mittal, Darshan Prabhu, Sunita Sarawagi, Preethi Jyothi: Skip-Salsa: Skip Synchronous Fusion of ASR LLM Decoders.
- Kwok Chin Yuen, Jia Qi Yip: Efficient Trie-based Biasing using K-step Prediction for Rare Word Recognition.
Queer and Trans Speech Science and Technology
- Tara McAllister, Collin Eagen, Yi Shan, Peter Traver, Daphna Harel, Tae Hong Park, Vesna D. Novak: Web-Based Application for Real-Time Biofeedback of Vocal Resonance in Gender-Affirming Voice Training: Design and Usability Evaluation.
- Robin Netzorg, Naomi Carvalho, Andrea Guzman, Lydia Wang, Juliana Francis, Klo Vivienne Garoute, Keith Johnson, Gopala Anumanchipalli: On the Production and Perception of a Single Speaker's Gender.
- Alice Ross, Cliodhna Hughes, Eddie L. Ungless, Catherine Lai: Conveying Gender Through Speech: Insights from Trans Men.
- Ingo Siegert, Jan Marquenie, Sven Grawunder: Queer Waves: A German Speech Dataset Capturing Gender and Sexual Diversity from Podcasts and YouTube.
- Carlos Hartmann: Reddit FlairShare: A Human-Annotated Dataset of Gender-Progressive Online Discourse.
- Maxwell Hope, Éva Székely: Voices of 'cyborg awesomeness': Posthuman embodiment of nonbinary gender expression in AI speech technologies.
Tone
- Xiao Dong, Fengming Liu, Chien-Jer Charles Lin, Monica Nesbitt, Shuju Shi: Neutral Tone Variation in Beijing Mandarin: Is Neutral Tone Toneless?
- Siqi Lu, Hui Feng, Ziyu Xiong: The Role of Syntactic Structures in Shaping Directionality in Trisyllabic Tone Sandhi: Evidence from Tianjin Mandarin.
- Zhijie Li, Hui Feng: Acoustic Representation and Realization of Weak Elements Subcategories: In the Case of Tianjin Mandarin.
- Lishan Li, Yaolin Zhou, Xiaoying Xu: Lexical competition in the process of Cantonese tone merging: Diverse Impact Mechanisms Across Different Individuals and Tone Pairs.
- Zhenrui Zhang, Fang Hu: Tonal Perception in Changde Mandarin.
- Changhong Du, Fang Hu: Tonal Contrasts in the Malipo Variety of the Mienic Language.
Cross-Lingual and Multilingual Processing
- Yanir Marmor, Yair Lifshitz, Yoad Snapir, Kinneret Misgav: Building an Accurate Open-Source Hebrew ASR System through Crowdsourcing.
- Ondrej Klejch, William Lamb, Peter Bell: A Practitioner's Guide to Building ASR Models for Low-Resource Languages: A Case Study on Scottish Gaelic.
- Razhan Hameed, Sina Ahmadi, Hanah Hadi, Rico Sennrich: Automatic Speech Recognition for Low-Resourced Middle Eastern Languages.
- Zhaolin Li, Jan Niehues: In-context Language Learning for Endangered Languages in Speech Recognition.
- Brian Yan, Injy Hamed, Shuichiro Shimizu, Vasista Sai Lodagala, William Chen, Olga Iakovenko, Bashar Talafha, Amir Hussein, Alexander Polok, Kalvin Chang, Dominik Klement, Sara Althubaiti, Puyuan Peng, Matthew Wiesner, Thamar Solorio, Ahmed Ali, Sanjeev Khudanpur, Shinji Watanabe: CS-FLEURS: A Massively Multilingual and Code-Switched Speech Dataset.
- Mingchen Shao, Xinfa Zhu, Chengyou Wang, Bingshen Mu, Hai Li, Ying Yan, Junhui Liu, Danming Xie, Lei Xie: Weakly Supervised Data Refinement and Flexible Sequence Compression for Efficient Thai LLM-based ASR.
- Tuan Nguyen, Huy Dat Tran: Can we train ASR systems on Code-switch without real code-switch data? Case study for Singapore's languages.
- Leonora Vesterbacka, Faton Rekathati, Robin Kurtz, Justyna Sikora, Agnes Toftgård: Swedish Whispers; Leveraging a Massive Speech Corpus for Swedish Speech Recognition.
Echo Cancellation, Feedback Control, and Near-end Enhancement
- Fei Zhao, Shulin He, Xueliang Zhang: Room Impulse Response as a Prompt for Acoustic Echo Cancellation.
- Yuyang Wang, Yonghui Liu, Jianbing Liu, Kai Niu, Zhiqiang He: CAGCRN: Real-Time Speech Enhancement with a Lightweight Model for Joint Acoustic Echo Cancellation and Noise Suppression.
- Jinfu Wang, Ziteng Wang, Xin Liu, Yang Liu, Qing Shi, Zhengqiang Luo, Feiran Yang: Exploiting Echo Path Priors for Enhanced Stereo Acoustic Echo Cancellation.
- Quang Minh Dinh, Hoda Rezaee Kaviani, Mehrdad Hosseinzadeh, Yuanhao Yu: Extended Loss: Incorporating Long Context into Training Models when using Short Audio Frames.
- Filippo Villani, Wai-Yip Chan, Zheng-Hua Tan, Jan Østergaard, Jesper Jensen: Analysis and Extension of a Near-End Listening Enhancement Method Based on Long-Term Fractile Noise Statistics.
- Yuan-Kuei Wu, Juan Azcarreta Ortiz, Kashyap Patel, Buye Xu, Jung-Suk Lee, Sanha Lee, Ashutosh Pandey: A Novel Deep Learning Framework for Efficient Multichannel Acoustic Feedback Control.
- Bunlong Lay, Rostilav Makarov, Timo Gerkmann: Diffusion Buffer: Online Diffusion-based Speech Enhancement with Sub-Second Latency.
Pathological Speech Analysis 1
- Xiaokang Liu, Xingfeng Li, Yudong Yang, Lan Wang, Nan Yan: Addressing Task Conflicts in Stuttering Detection via MMoE-Based Multi-Task Learning.
- Y. S. Upendra Vishwanath, Tanuka Bhattacharjee, Deekshitha G, Sathvik Udupa, Chowdam Venkata Thirumala Kumar, Madassu Keerthipriya, Darshan Chikktimmegowda, Dipti Baskar, Yamini Belur, Seena Vengalil, Atchayaram Nalini, Prasanta Kumar Ghosh: Comparison of Acoustic and Textual Features for Dysarthria Severity Classification in Amyotrophic Lateral Sclerosis.
- Suhita Ghosh, Mélanie Jouaiti, Jan-Ole Perschewski, Sebastian Stober: StutterCut: Uncertainty-Guided Normalised Cut for Dysfluency Segmentation.
- Giulia Sanguedolce, Jón Guðnason, Dragos-Cristian Gruia, Emilie D'Olne, Fatemeh Geranmayeh, Patrick A. Naylor: Physiologically-Informed Feature Analysis of Acquired Speech Disorders for Stroke Assessment.
Hearing Disorders
- Gloria Araiza-Illan, Luke Meyer, Bert Maat, Deniz Baskent: Robot-assisted Recognition of Vocal Emotions in Pseudospeech for Cochlear Implanted Adolescents.
- Ahsan J. Cheema, Sunil Puria: Using Neurogram Similarity Index Measure (NSIM) to Model Hearing Loss and Cochlear Neural Degeneration.
- Longbin Jin, Donghun Min, Jung Eun Shin, Eun Yi Kim: Contrastive Learning-based Syllable-Level Mispronunciation Detection and Diagnosis for Speech Audiometry.
- Hsin-Tien Chiang, John H. L. Hansen: A Deformable Convolution GAN Approach for Speech Dereverberation in Cochlear Implant Users.
- Fengyuan Hao, Brian C. J. Moore, Huiyong Zhang, Xiaodong Li, Chengshi Zheng: L3C-DeepMFC: Low-Latency Low-Complexity Deep Marginal Feedback Cancellation with Closed-Loop Fine Tuning for Hearing Aids.
- Man Wang, Yixin Ding, Niels O. Schiller: Semantic Processing During Spoken Word Production by Children with Cochlear Implants.
- Yuting Ding, Xuefei Wang, Fei Chen: Linguistic Masking and Its Release in Simulated Electric-acoustic Hearing.
Interspeech 2025 URGENT Challenge
- Wangyou Zhang, Kohei Saijo, Samuele Cornell, Robin Scheibler, Chenda Li, Zhaoheng Ni, Anurag Kumar, Marvin Sach, Wei Wang, Yihui Fu, Shinji Watanabe, Tim Fingscheidt, Yanmin Qian: Lessons Learned from the URGENT 2024 Speech Enhancement Challenge.
- Kohei Saijo, Wangyou Zhang, Samuele Cornell, Robin Scheibler, Chenda Li, Zhaoheng Ni, Anurag Kumar, Marvin Sach, Yihui Fu, Wei Wang, Tim Fingscheidt, Shinji Watanabe: Interspeech 2025 URGENT Speech Enhancement Challenge.
- Xiaobin Rong, Dahan Wang, Qinwen Hu, Yushi Wang, Yuxiang Hu, Jing Lu: TS-URGENet: A Three-stage Universal Robust and Generalizable Speech Enhancement Network.
- Xiaohuai Le, Zhuangqi Chen, Siyu Sun, Xianjun Xia, Chuanzeng Huang: Multistage Universal Speech Enhancement System for URGENT Challenge.
- Zhihang Sun, Andong Li, Tong Lei, Rilin Chen, Meng Yu, Chengshi Zheng, Yi Zhou, Dong Yu: Scaling beyond Denoising: Submitted System and Findings in URGENT Challenge 2025.
- Sanberk Serbest, Tijana Stojkovic, Milos Cernak, Andrew Harper: DeepFilterGAN: A Full-band Real-time Speech Enhancement System with GAN-based Stochastic Regeneration.
- Nabarun Goswami, Tatsuya Harada: FUSE: Universal Speech Enhancement using Multi-Stage Fusion of Sparse Compression and Token Generation Models for the URGENT 2025 Challenge.
- Rong Chao, Rauf Nasretdinov, Yu-Chiang Frank Wang, Ante Jukic, Szu-Wei Fu, Yu Tsao: Universal Speech Enhancement with Regression and Generative Mamba.
Spoken Machine Translation 2
- Jean-Luc Rouas, Charles Brazier, Leila Ben Letaifa, Rafael Medina, Pedro Palacios, David Atienza, Giovanni Ansaloni: Structured pruning for efficient systolic array accelerated cascade Speech-to-Text Translation.
- Mohammad MohammadAmini, Aghilas Sini, Marie Tahon, Antoine Laurent: Scaling pseudo-labeling data for end-to-end low-resource speech translation (the case of Kurdish language).
- Kirandevraj R, Vinod K. Kurmi, Vinay P. Namboodiri, C. V. Jawahar: Multilingual Query-by-Example KWS for Indian Languages using Transliteration.
- Chenyang Le, Yinfeng Xia, Huiyan Li, Manhong Wang, Yutao Sun, Xingyang Ma, Yanmin Qian: Novel Parasitic Dual-Scale Modeling for Efficient and Accurate Multilingual Speech Translation.
- Verena Blaschke, Miriam Winkler, Constantin Förster, Gabriele Wenger-Glemser, Barbara Plank: A Multi-Dialectal Dataset for German Dialect ASR and Dialect-to-Standard Speech Translation.
- Tahir Javed, Kaushal Santosh Bhogale, Mitesh M. Khapra: NIRANTAR: Continual Learning with New Languages and Domains on Real-world Speech Data.
Spatial Audio and Acoustics 1
- Sheng Lyu, Yuemin Yu, Chenshu Wu:

Temporal Modeling of Room Impulse Response Generation via Multi-Scale Autoregressive Learning. - Yunqi C. Zhang, Dhruv Jagmohan, Hong Kit Li, C. T. Justine Hui, Yusuke Hioka:

Effect of Noise Floor in Room Impulse Response on Speech Perception Under Spherical Harmonics-based Spatial Sound Reproduction. - Christopher Ick, Gordon Wichern, Yoshiki Masuyama, François G. Germain, Jonathan Le Roux:

Direction-Aware Neural Acoustic Fields for Few-Shot Interpolation of Ambisonic Impulse Responses. - Linya Fu, Yu Liu, Zhijie Liu, Zedong Yang, Zhong-Qiu Wang, Youfu Li, He Kong:

AuralNet: Hierarchical Attention-based 3D Binaural Localization of Overlapping Speakers. - Tuochao Chen, D. Shin, Hakan Erdogan, Sinan Hersek:

SoundSculpt: Direction and Semantics Driven Ambisonic Target Sound Extraction. - Yang Xiao, Rohan Kumar Das:

TF-Mamba: A Time-Frequency Network for Sound Source Localization.
Articulatory and Vocal Tract Modelling
- Frédéric Berthommier:

Articulatory modeling of the S-shaped F2 trajectories observed in Öhman's spectrographic analysis of VCV syllables. - Allan Vurma, Einar Meister, Lya Meister, Jaan Ross, Marju Raju, Veeda Kala, Tuuri Dede:

The Role of Voiced Consonant Duration in Sung Vowel-Consonant and Consonant-Vowel Recognition. - Riccarda Funk, Melanie Weirich, Adrian P. Simpson:

How sibilant spectra shape gender perception in prepubertal children: A voice morphing study. - Tharinda Piyadasa, Joan Glaunès, Amelia Gully, Michael Proctor, Kirrie J. Ballard, Tünde Szalay, Naeim Sanaei, Sheryl Foster, David Waddington, Craig T. Jin:

Constrained LDDMM for Dynamic Vocal Tract Morphing: Integrating Volumetric and Real-Time MRI. - Rongshuai Wu, Debasish Ray Mohapatra, Sidney Fels:

2D Immersed Boundary Method in Vocal Tract Acoustics: An Eulerian-Lagrangian Model for Simulation of Diphthongs. - Sofiane Azzouz, Pierre-André Vuissoz, Yves Laprie:

Reconstruction of the Complete Vocal Tract Contour Through Acoustic to Articulatory Inversion Using Real-Time MRI Data. - Yubin Zhang, Prakash Kumar, Ye Tian, Ziwei Zhao, Xuan Shi, Kevin Huang, Kevin Lee, Haley Hsu, Shrikanth Narayanan, Krishna S. Nayak, Louis Goldstein:

Co-registration of real-time MRI and respiration for speech research.
Acoustic Assessment of Respiratory Health
- Loes van Bemmel, Lauren G. Reinders, Folkert Brijker, Bas Holverda, Frits M. E. Franssen, Hanneke van Helvoort, Visara Urovi, Marieke Spreeuwenberg, Sami O. Simons:

SPEAKtoCOPD: a flashmob study to collect COPD speech. - Yuyang Yan, Sami O. Simons, Visara Urovi:

Developing a LeFF Transformer Model for Exacerbated Speech Detection in COPD and Asthma. - Daisuke Niizumi, Daiki Takeuchi, Masahiro Yasuda, Binh Thien Nguyen, Yasunori Ohishi, Noboru Harada:

Towards Pre-training an Effective Respiratory Audio Foundation Model. - Lauren G. Reinders, Loes van Bemmel, Alexander Mackay, David Nobbs, Frits M. E. Franssen, Hester Gietema, Simona Schäfer, Sami O. Simons:

Effect of physical exercise on voice in people living with COPD. - Gaoyang Dong, Zhicheng Zhang, Ping Sun, Minghui Zhang:

Adaptive Differential Denoising for Respiratory Sounds Classification. - Peidong Wei, Shiyu Miao, Lin Li:

Disentangling Dual-Encoder Masked Autoencoder for Respiratory Sound Classification. - Seung Gyu Jeong, Seong Eun Kim:

Patient-Aware Feature Alignment for Robust Lung Sound Classification: Cohesion-Separation and Global Alignment Losses. - Miika Toikkanen, June-Woo Kim:

Improving Respiratory Sound Classification with Architecture-Agnostic Knowledge Distillation from Ensembles.
Advances in Modelling and Imaging
- Mélen Guillaume, Anahita Basirat, Julien Diard:

Theoretical proposal for a unified Bayesian model of adaptation in non-interactive and interactive speech production. - Juraj Simko, Benjamin Elie, Alice Turk:

Self-supervised Optimality-Guided Learning of Speech Articulation. - Zhe-chen Guo, Bharath Chandrasekaran:

Extended High-frequency Cues to Phoneme Recognition: Insights from ASR. - Jia-Xin Chen, Yi-Ming Wang, Ziyu Zhang, Jiayang Han, Yin-Long Liu, Rui Feng, Xiuyuan Liang, Zhen-Hua Ling, Jia-Hong Yuan:

Decoding Speaker-Normalized Pitch from EEG for Mandarin Perception. - Tong Zhu, Xiaoke Yang, Jian Zhou, Lu Li, Zhao Lv, Cunhang Fan:

SSF-DST: A Spectro-Spatial Features Enhanced Deep Spatiotemporal Network for EEG-Based Auditory Attention Detection. - Yujie Yan, Xiran Xu, Haolin Zhu, Songyi Li, Bo Wang, Xihong Wu, Jing Chen:

Overestimated performance of auditory attention decoding caused by experimental design in EEG recordings. - Chetan Sharma, Vaishnavi Chandwanshi, Shreya Shrikant Karkun, Aditya Anand Gupta, Prasanta Kumar Ghosh:

A real-time MRI study on asymmetry in velum dynamics during VCV production with nasal sounds. - Carey Smith, Hu Cheng, Pertti Palo, Daniel Aalto, Steven M. Lulich:

Exploratory Analysis of Brainstem fMRI Data During Sustained Phonation.
Conversation, Communication and Interaction 1
- Seongsil Heo, Christi Miller, Calvin Murdock, Michael J. Proulx:

Gaze-Enhanced Multimodal Turn-Taking Prediction in Triadic Conversations. - Sam O'Connor Russell, Naomi Harte:

Visual Cues Support Robust Turn-taking Prediction in Noise. - Yoshinori Fukunaga, Ryota Nishimura, Kengo Ohta, Norihide Kitaoka:

Backchannel prediction for natural spoken dialog systems using general speaker and listener information. - Muhammad Yeza Baihaqi, Angel F. Garcia Contreras, Seiya Kawano, Koichiro Yoshino:

Rapport-Building Dialogue Strategies for Deeper Connection: Integrating Proactive Behavior, Personalization, and Aizuchi Backchannels. - Lena-Marie Huttner, Jeppe H. Christensen, Gitte Keidser, Tobias May, Torsten Dau, Sergi Rotger-Griful:

Does effortful speech production indicate communication difficulty caused by noise and hearing aid support? - Julio Cesar Cavalcanti, Gabriel Skantze:

"Dyadosyncrasy", Idiosyncrasy and Demographic Factors in Turn-Taking.
Robust Speaker Verification
- Théo Lepage, Réda Dehak:

SSPS: Self-Supervised Positive Sampling for Robust Self-Supervised Speaker Verification. - Minu Kim, Kangwook Jang, Hoirin Kim:

ParaNoise-SV: Integrated Approach for Noise-Robust Speaker Verification with Parallel Joint Learning of Speech Enhancement and Noise Extraction. - Zhe Li, Man-Wai Mak, Jen-Tzung Chien, Mert Pilanci, Zezhong Jin, Helen Meng:

Disentangling Speaker and Content in Pre-trained Speech Models with Latent Diffusion for Robust Speaker Verification. - Alexandre Ferro Filho, Diogo Fernandes Costa Silva, Pedro Elias Engelberg Silva Borges, Arlindo Rodrigues Galvão Filho:

Evaluating Deep Speaker Embedding Robustness to Domain, Sampling Rate, and Codec Variations. - Zhiyong Chen, Shuhang Wu, Xinnuo Li, Zhiqi Ai, Shugong Xu:

Towards Robust Speaker Recognition against Intrinsic Variation with Foundation Model Few-shot Tuning and Effective Speech Synthesis. - Jin Li, Man-Wai Mak, Johan Rohdin, Kong Aik Lee, Hynek Hermansky:

Bayesian Learning for Domain-Invariant Speaker Verification and Anti-Spoofing.
Multilingual ASR
- Masato Mimura, Jaeyoung Lee, Tatsuya Kawahara:

Switch Conformer with Universal Phonetic Experts for Multilingual ASR. - Hongli Yang, Sheng Li, Hao Huang, Ayiduosi Tuohan, Yizhou Peng:

Language-Aware Prompt Tuning for Parameter-Efficient Seamless Language Expansion in Multilingual ASR. - Jiahong Li, Yiwen Shao, Jianheng Zhuo, Chenda Li, Liliang Tang, Dong Yu, Yanmin Qian:

Efficient Multilingual ASR Finetuning via LoRA Language Experts. - Raphaël Bagat, Irina Illina, Emmanuel Vincent:

Mixture of LoRA Experts for Low-Resourced Multi-Accent Automatic Speech Recognition. - Zheng Xin Yong, Vineel Pratap, Michael Auli, Jean Maillard:

Effects of Speaker Count, Duration, and Accent Diversity on Zero-Shot Accent Robustness in Low-Resource ASR. - Pouya Mehralian, Hugo Van hamme:

Leveraging Geographic Metadata for Dialect-Aware Speech Recognition. - Ömer Tarik Özyilmaz, Matt Coler, Matias Valdenegro-Toro:

Overcoming Data Scarcity in Multi-Dialectal Arabic ASR via Whisper Fine-Tuning. - Jianheng Zhuo, Yifan Yang, Yiwen Shao, Yong Xu, Dong Yu, Kai Yu, Xie Chen:

VietASR: Achieving Industry-level Vietnamese ASR with 50-hour labeled data and Large-Scale Speech Pretraining. - Yingzhi Wang, Anas Alhmoud, Muhammad Alqurishi:

Open Universal Arabic ASR Leaderboard.
Multi-channel Speech Enhancement
- Yujie Yang, Bing Yang, Xiaofei Li:

Mel-McNet: A Mel-Scale Framework for Online Multichannel Speech Enhancement. - Zheng Wang, Xiaobin Rong, Yu Sun, Tianchi Sun, Zhibin Lin, Jing Lu:

A Lightweight Hybrid Dual Channel Speech Enhancement System under Low-SNR Conditions. - Pengjie Shen, Xueliang Zhang, Zhong-Qiu Wang:

ARiSE: Auto-Regressive Multi-Channel Speech Enhancement. - Lu Han, Junqi Zhao, Renhua Peng:

WTFormer: A Wavelet Conformer Network for MIMO Speech Enhancement with Spatial Cues Preservation. - Nurali Alip, Tianrui Wang, Rui Cao, Meng Ge, Jingru Lin, Longbiao Wang, Jianwu Dang:

A Three-Stage Beamforming with Harmonic Guidance for Multi-Channel Speech Enhancement. - Chengyuan Qin, Wenmeng Xiong, Jing Zhou, Maoshen Jia, Changchun Bao:

Speech Enhancement with Dual-path Multi-Channel Linear Prediction Filter and Multi-norm Beamforming.
Self-supervised Learning
- Mingyu Cui, Yifan Yang, Jiajun Deng, Jiawen Kang, Shujie Hu, Tianzi Wang, Zhaoqing Li, Shiliang Zhang, Xie Chen, Xunying Liu:

Exploring SSL Discrete Speech Features for Zipformer-based Contextual ASR. - Nik Vaessen, Roeland Ordelman, David A. van Leeuwen:

Self-supervised learning of speech representations with Dutch archival data. - Aleksandr Kutsakov, Alexandr Maximenko, Georgii Gospodinov, Pavel Bogomolov, Fyodor Minkin:

GigaAM: Efficient Self-Supervised Learner for Speech Recognition. - Hyung-Gun Chi, Zakaria Aldeneh, Tatiana Likhomanenko, Oggi Rudovic, Takuya Higuchi, Li-Wei Chen, Shinji Watanabe, Ahmed Hussen Abdelaziz:

DiceHuBERT: Distilling HuBERT with a Self-Supervised Learning Objective. - Kentaro Onda, Yosuke Kashiwagi, Emiru Tsunoo, Hayato Futami, Shinji Watanabe:

Differentiable K-means for Fully-optimized Discrete Token-based ASR. - Ryan Whetten, Lucas Maison, Titouan Parcollet, Marco Dinarelli, Yannick Estève:

Towards Early Prediction of Self-Supervised Speech Model Performance.
Singing Voice and Audio Synthesis
- Joon-Seung Choi, Dong-Min Byun, Hyung-Seok Oh, Seong-Whan Lee:

VibE-SVC: Vibrato Extraction with High-frequency F0 Contour for Singing Voice Conversion. - Chenyu Yang, Hangting Chen, Shuai Wang, Haina Zhu, Haizhou Li:

TVC-MusicGen: Time-Varying Structure Control for Background Music Generation via Self-Supervised Training. - Mu Yang, Bowen Shi, Matthew Le, Wei-Ning Hsu, Andros Tjandra:

Audiobox TTA-RAG: Improving Zero-Shot and Few-Shot Text-To-Audio with Retrieval-Augmented Generation. - Mingda Liu, Jiatong Shi:

Bridging Speech and Singing: Multi-stage Speech-Prompted Singing Voice Conversion with Speaker Embedding Adaptation. - Yicheng Gu, Chaoren Wang, Zhizheng Wu, Lauri Juvela:

Neurodyne: Neural Pitch Manipulation with Representation Learning and Cycle-Consistency GAN. - Zijing Zhao, Kai Wang, Hao Huang, Ying Hu, Liang He, Jichen Yang:

VS-Singer: Vision-Guided Stereo Singing Voice Synthesis with Consistency Schrödinger Bridge. - Wei Chen, Binzhu Sha, Dan Luo, Jing Yang, Zhuo Wang, Fan Fan, Zhiyong Wu:

DAFMSVC: One-Shot Singing Voice Conversion with Dual Attention Mechanism and Flow Matching. - Wangjin Zhou, Tianjiao Du, Chenglin Xu, Sheng Li, Yi Zhao, Tatsuya Kawahara:

Simple and Effective Content Encoder for Singing Voice Conversion via SSL-Embedding Dimension Reduction. - Yunkee Chae, Eunsik Shin, Suntae Hwang, Seungryeol Paik, Kyogu Lee:

Song Form-aware Full-Song Text-to-Lyrics Generation with Multi-Level Granularity Syllable Count Control.
Acoustic and Articulatory Cues in Speech Perception
- Wenwei Dong, Alif Silpachai, Catia Cucchiarini, Helmer Strik:

Multitalker Babble in English Vowel Perception Training: A Comparison between Humans and Neural Models. - Etienne Gaudrain, Sarah Verhulst, Deniz Baskent:

Speech stimulus design to study the neural coding of speech and the impact of cochlear synaptopathy. - Esther Janse, Chen Shen, Martin Cooke:

Prediction of listening effort ratings for habitual and clear-Lombard speech presented in noise. - Shengyue Xiong, Zhe-chen Guo, Bharath Chandrasekaran:

Language and Accent Familiarity Effects on the Use of Acoustic Cues in Talker Identification. - Laura Rachman, Deniz Baskent:

Characterization of voice cue sensitivity and vocal emotion recognition across the adult lifespan. - Zixia Fan, Ronny Ibrahim, Joshua Penney, Felicity Cox:

Creaky Voice Facilitates More Efficient Phonological Processing of Mandarin Tone 3.
Audio Event Detection and Classification
- Tomoya Yoshinaga, Yoshiaki Bando, Keitaro Tanaka, Keisuke Imoto, Masaki Onishi, Shigeo Morishima:

Training Onset-and-Offset-Aware Sound Event Detection on a Heterogeneous Dataset via Probabilistic Sequential Modeling. - Yulu Fang, Mingyue He, Qisheng Xu, Jianqiao Zhao, Cheng Yang, Kele Xu, Yong Dou:

Multi-view Fusion and Parameter Perturbation for Few-Shot Class-Incremental Audio Classification. - Yongjie Si, Yanxiong Li, Jiaxin Tan, Qianhua He, Il-Youp Kwak:

Fully Few-shot Class-incremental Audio Classification Using Multi-level Embedding Extractor and Ridge Regression Classifier. - Claudia Montero-Ramírez, Alba Martínez-Serrano, Jorge Garcelán-Gómez, Francisco J. Valverde-Albacete, Carmen Peláez-Moreno:

Beyond Conventional Metrics: using Entropic Triangles to Explain Balancing Methods in Acoustic Scene Classification. - Emiliano Acevedo, Martín Rocamora, Magdalena Fuentes:

Domain Adaptation Method and Modality Gap Impact in Audio-Text Models for Prototypical Sound Classification. - Myeonghoon Ryu, Hongseok Oh, Suji Lee, Han Park:

Unified Microphone Conversion: Many-to-Many Device Mapping via Feature-wise Linear Modulation.
Inclusivity
- Chris Emezue, NaijaVoices Community, Busayo Awobade, Abraham Toluwase Owodunni, Handel Emezue, Gloria Monica Tobechukwu Emezue, Nefertiti Nneoma Emezue, Sewade Ogun, Bunmi Akinremi, David Ifeoluwa Adelani, Chris Pal:

The NaijaVoices Dataset: Cultivating Large-Scale, High-Quality, Culturally-Rich Speech Data for African Languages. - Maliha Jahan, Yinglun Sun, Priyam Mazumdar, Zsuzsanna Fagyal, Thomas Thebaud, Jesús Villalba, Mark Hasegawa-Johnson, Najim Dehak, Laureano Moro-Velázquez:

FaiST: A Benchmark Dataset for Fairness in Speech Technology. - Kemal Altwlkany, Amar Kuric, Emanuel Lacic:

On the Language and Gender Biases in PSTN, VoIP and Neural Audio Codecs. - José Giraldo, Alex Peiró Lilja, Carme Armentano-Oller, Rodolfo Zevallos, Cristina España-Bonet:

Evaluating Speech Enhancement Performance Across Demographics and Language.
Voice Conversion 1
- Seymanur Akti, Tuan-Nam Nguyen, Alexander Waibel:

Towards Better Disentanglement in Non-Autoregressive Zero-Shot Expressive Voice Conversion. - Hitoshi Suda, Shinnosuke Takamichi, Satoru Fukayama:

Voice Conversion for Likability Control via Automated Rating of Speech Synthesis Corpora. - Ishan D. Biyani, Nirmesh J. Shah, Ashishkumar P. Gudmalwar, Pankaj Wasnik, Rajiv Ratn Shah:

REWIND: Speech Time Reversal for Enhancing Speaker Representations in Diffusion-based Voice Conversion. - Alexander Lobashev, Assel Yermekova, Maria A. Larchenko:

Training-Free Voice Conversion with Factorized Optimal Transport. - Yihan Liu, Zhengyang Chen, Leying Zhang, Yanmin Qian:

E2E-BPVC: End-to-End Background-Preserving Voice Conversion via In-Context Learning. - Kaidi Wang, Wenhao Guan, Ziyue Jiang, Hukai Huang, Peijie Chen, Weijie Wu, Qingyang Hong, Lin Li:

Discl-VC: Disentangled Discrete Tokens and In-Context Learning for Controllable Zero-Shot Voice Conversion. - Pengyu Ren, Wenhao Guan, Kaidi Wang, Peijie Chen, Qingyang Hong, Lin Li:

ReFlow-VC: Zero-shot Voice Conversion Based on Rectified Flow and Speaker Feature Optimization. - Jiawei Jin, Zhihan Yang, Yixuan Zhou, Zhiyong Wu:

In This Environment, As That Speaker: A Text-Driven Framework for Multi-Attribute Speech Conversion. - Herman Kamper, Benjamin van Niekerk, Julian Zaïdi, Marc-André Carbonneau:

LinearVC: Linear Transformations of Self-Supervised Features Through the Lens of Voice Conversion. - Desheng Hu, Yang Xiang, Jian Lu, Xinhui Hu, Xinkang Xu:

Speaker Normalization and Content Restoration for Zero-Shot Voice Conversion with Attention-Enhanced Discriminator.
Speech-based Cognitive Assessment 2
- Xiaoquan Ke, Man-Wai Mak, Helen Meng:

Optimizing Pause Context in Fine-Tuning Pre-trained Large Language Models for Dementia Detection. - Emmanuel Akinrintoyo, Nadine Abdelhalim, Nicole Salomons:

WhisperD: Dementia Speech Recognition and Filler Word Detection with Whisper. - Catarina Botelho, David Gimeno-Gómez, Francisco Teixeira, John Mendonça, Patrícia Pereira, Diogo A. P. Nunes, Thomas Rolland, Anna Pompili, Rubén Solera-Ureña, Maria Ponte, David Martins de Matos, Carlos D. Martínez-Hinarejos, Isabel Trancoso, Alberto Abad:

Acoustic and Linguistic Biomarkers for Cognitive Impairment Detection from Speech. - Yao Xiao, Heidi Christensen, Stefan Goetze:

Alzheimer's Dementia Detection Using Perplexity from Paired Large Language Models. - Mansi, Anastasios Lepipas, Dominika C. Woszczyk, Yiying Guan, Soteris Demetriou:

Understanding Dementia Speech Alignment with Diffusion-Based Image Generation. - Dominika C. Woszczyk, Ranya Aloufi, Soteris Demetriou:

ClaritySpeech: Dementia Obfuscation in Speech.
Source Separation 1
- Jihyun Kim, Doyeon Kim, Hyewon Han, Jinyoung Lee, Jonguk Yoo, Chang Woo Han, Jeongook Song, Hoon-Young Cho, Hong-Goo Kang:

Quadruple Path Modeling with Latent Feature Transfer for Permutation-free Continuous Speech Separation. - Kangqi Jing, Wenbin Zhang, Yu Gao:

End-to-End DOA-Guided Speech Extraction in Noisy Multi-Talker Scenarios. - Xue Yang, Guiru Shen, Yu Yang:

Speaker Separation for an Unknown Number of Speakers with Encoder-Decoder-Based Contextual Information Module. - Yuzhu Wang, Archontis Politis, Konstantinos Drossos, Tuomas Virtanen:

Attractor-Based Speech Separation of Multiple Utterances by Unknown Number of Speakers. - Hadi Alizadeh, Rahil Mahdian Toroghi, Hassan Zareian:

ReSepNet: A Unified-Light Model for Recursive Speech Separation with Unknown Speaker Count. - Tzlil Avidan, Bracha Laufer-Goldshtein:

Deep-Simplex Multichannel Speech Separation. - Haoxu Wang, Yiheng Jiang, Gang Qiao, Pengteng Shi, Biao Tian:

FLASepformer: Efficient Speech Separation with Gated Focused Linear Attention Transformer. - Liang Tao, Maoshen Jia, Yonggang Hu:

Power Spectral Density Estimation for Acoustic Source Separation Using A Spherical Microphone Array. - Yiheng Jiang, Haoxu Wang, Yafeng Chen, Gang Qiao, Biao Tian:

Exploring Efficient Directional and Distance Cues for Regional Speech Separation.
Language and Accent Identification and Speaker Privacy
- Spandan Dey, Hirak Mondal, Sanjay Kumar Kurmi:

Teacher-Free Knowledge Distillation for Improving Short-Utterance Spoken Language Identification. - Niyati Bafna, Matthew Wiesner:

LID Models are Actually Accent Classifiers: Implications and Solutions for LID on Accented Speech. - Gowtham Premananth, Vinith Kugathasan, Carol Y. Espy-Wilson:

Analyzing the Impact of Accent on English Speech: Acoustic and Articulatory Perspectives. - Eliathamby Ambikairajah, Jingyao Wu, Ting Dang, Vidhyasaharan Sethu:

A Study of Speech Embedding Similarities Between Australian Aboriginal and High-Resource Languages. - Abderrahim Fathan, Jahangir Alam, Xiaolin Zhu:

An Investigative Study on Recent Sharpness- and Flatness-Based Optimizers for Enhanced Self-Supervised Speaker Verification. - Chenguang Hu, Yaqian Hao, Fulin Zhang, Xiaoxue Luo, Yao Shen, Yingying Gao, Chao Deng, Shilei Zhang, Junlan Feng:

Privacy-Preserving Speaker Verification via End-to-End Secure Representation Learning. - Elvir Karimov, Alexander Varlamov, Danil Ivanov, Dmitrii Korzh, Oleg Rogov:

Novel Loss-Enhanced Universal Adversarial Patches for Sustainable Speaker Privacy. - Ying Meng, Zhihua Fang, Liang He:

Federated Learning with Feature Space Separation for Speaker Recognition.
Source Tracing: The Origins of Synthetic or Manipulated Speech
- Pierre Falez, Tony Marteau, Damien Lolive, Arnaud Delhay:

Audio Deepfake Source Tracing using Multi-Attribute Open-Set Identification and Verification. - Ajinkya Kulkarni, Sandipana Dowerah, Tanel Alumäe, Mathew Magimai-Doss:

Unveiling Audio Deepfake Origins: A Deep Metric learning And Conformer Network Approach With Ensemble Fusion. - Xuanjun Chen, I-Ming Lin, Lin Zhang, Jiawei Du, Haibin Wu, Hung-yi Lee, Jyh-Shing Roger Jang:

Codec-Based Deepfake Source Tracing via Neural Audio Codec Taxonomy. - Adriana Stan, David Combei, Dan Oneata, Horia Cucu:

TADA: Training-free Attribution and Out-of-Domain Detection of Audio Deepfakes. - Viola Negroni, Davide Salvi, Paolo Bestagini, Stefano Tubaro:

Source Verification for Speech Deepfakes. - Anton Firc, Manasi Chhibber, Jagabandhu Mishra, Vishwanath Pratap Singh, Tomi Kinnunen, Kamil Malinka:

STOPA: A Dataset of Systematic VariaTion Of DeePfake Audio for Open-Set Source Tracing and Attribution. - Dimitrios Koutsianos, Stavros Zacharopoulos, Yannis Panagakis, Themos Stafylakis:

Synthetic Speech Source Tracing using Metric Learning. - Yang Xiao, Rohan Kumar Das:

Listen, Analyze, and Adapt to Learn New Attacks: An Exemplar-Free Class Incremental Learning Method for Audio Deepfake Source Tracing. - Thien-Phuc Doan, Kihun Hong, Souhwan Jung:

VIB-based Real Pre-emphasis Audio Deepfake Source Tracing. - Jiankun Zhao, Lingwei Meng, Chengxi Deng, Helen Meng, Xixin Wu:

Defending Unauthorized Voice Cloning with Watermark-Aware Codecs. - Nicholas Klein, Hemlata Tak, Elie Khoury:

Open-Set Source Tracing of Audio Deepfake Systems.
Speaker Diarization 1
- Jiangyu Han, Federico Landini, Johan Rohdin, Anna Silnova, Mireia Díez, Jan Cernocký, Lukás Burget:

Fine-tune Before Structured Pruning: Towards Compact and Accurate Self-Supervised Models for Speaker Diarization. - Prabhav Singh, Jesús Villalba, Najim Dehak:

Count Your Speakers! Multitask Learning for Multimodal Speaker Diarization. - David Palzer, Matthew Maciejewski, Eric Fosler-Lussier:

End-to-End Diarization utilizing Attractor Deep Clustering. - Berkin Durmus, Blaise Munyampirwa, Eduardo Pacheco, Atila Orhon, Andrey Leonov:

SDBench: A Comprehensive Benchmark Suite for Speaker Diarization. - Fengyun Tan, Tao Wei, Kun Zou, Ning Cheng, Shaojun Wang, Jing Xiao:

Enhancing Serialized Output Training for Multi-Talker ASR with Soft Monotonic Alignment and Utterance-level Timestamp. - Shota Horiguchi, Atsushi Ando, Naohiro Tawara, Marc Delcroix:

Pretraining Multi-Speaker Identification for Neural Speaker Diarization.
Multilingual Speech Synthesis and Special Applications 1
- Ki-Joong Kwon, Jun-Ho So, Sang-Hoon Lee:

Parameter-Efficient Fine-Tuning for Low-Resource Text-to-Speech via Cross-Lingual Continual Learning. - Qibing Bai, Sho Inoue, Shuai Wang, Zhongjie Jiang, Yannan Wang, Haizhou Li:

Accent Normalization Using Self-Supervised Discrete Tokens with Non-Parallel Data. - Chang Liu, Zhen-Hua Ling, Yu Gu:

LIST: Language-Independent Speech Token for Multilingual Speech Synthesis with Language Models. - Fatima Naseem, Maham Sajid, Farah Adeeba, Sahar Rauf, Asad Mustafa, Sarmad Hussain:

Developing High-Quality TTS for Punjabi and Urdu: Benchmarking against MMS Models. - Frederik Rautenberg, Fritz Seebauer, Jana Wiechmann, Michael Kuhlmann, Petra Wagner, Reinhold Haeb-Umbach:

Synthesizing Speech with Selected Perceptual Voice Qualities - A Case Study with Creaky Voice. - Christina Tånnander, David House, Jonas Beskow, Jens Edlund:

Intrasentential English in Swedish TTS: perceived English-accentedness.
Characterization and Multimodal Approaches for Speaker Recognition
- Shengyu Peng, Wu Guo, Jie Zhang, Yu Guan, Lipeng Dai, Zuoliang Li:

Parameter-Efficient Fine-tuning with Instance-Aware Prompt and Parallel Adapters for Speaker Verification. - Nathan Griot, Driss Matrouf, Raphaël Blouet, Jean-François Bonastre, Ana Mantecon:

Unified Text and Speaker Verification using SSL model for Text-Dependent Speaker Verification. - Zhaokai Sun, Li Zhang, Qing Wang, Pan Zhou, Lei Xie:

Towards Robust Overlapping Speech Detection: A Speaker-Aware Progressive Approach Using WavLM. - N. Shashaank, Xiao Quan, Andrew Kaluzny, Leonard Varghese, Marko Stamenovic, Chuan-Che Huang:

Towards Secure User Authentication for Headphones via In-Ear or In-Earcup Microphones. - Gwangyeol Yu

, Junhyeok Lee, Seoryeong Kim, Jimin Lee, Jehyuk Lee:
Mimic Blocker: Self-Supervised Adversarial Training for Voice Conversion Defense with Pretrained Feature Extractors. - Bhasi K. C., Rajeev Rajan:

A Siamese Network-Based Framework for Voice Mimicry Proficiency Assessment Using X-Vector Embeddings. - Orchid Chetia Phukan, Girish, Mohd Mujtaba Akhtar, Swarup Ranjan Behera, Priyabrata Mallick, Pailla Balakrishna Reddy, Arun Balaji Buduru, Rajesh Sharma:

Towards Source Attribution of Singing Voice Deepfake with Multimodal Foundation Models. - Rishabh Ranjan, Ayinala Likhith, Mayank Vatsa, Richa Singh:

Multimodal Zero-Shot Framework for Deepfake Hate Speech Detection in Low-Resource Languages. - Chikara Maeda, Muhammad Shakeel, Yui Sudo:

Joint Target-Speaker ASR and Activity Detection. - Wooil Kim, Bongsu Jung:

DLF-EEND: Dynamic Layer Fusion for End-to-End Speaker Diarization.
Acoustic Analysis and Bioacoustics
- Noumida A, Rajeev Rajan:

Analysis of Avian Biphonic Vocalization Using Computational Modelling. - Xingyuan Li, Kenny Q. Zhu, Mengyue Wu:

Dog2vec: Self-Supervised Pre-Training for Canine Vocal Representation. - Ezhini Rasendiran R, Chandresh Kumar Maurya:

Improving Bird Classification with Primary Color Additives. - Chenhao Wu, Xiangjun Cai, Haojie Zhang, Tianrui Jia, Yilu Deng, Kun Qian, Björn W. Schuller, Yoshiharu Yamamoto, Jiang Liu:

Exploring the Power of Empirical Mode Decomposition for Sensing the Sound of Silence: A Pilot Study on Mice Autism Detection via Ultrasonic Vocalisation. - Yuchen Song, Yucong Zhang, Ming Li:

Exploring Pre-trained models on Ultrasound Modeling for Mice Autism Detection with Uniform Filter Bank and Attentive Scoring. - Zijiang Yang, Meishu Song, Xin Jing, Haojie Zhang, Kun Qian, Bin Hu, Kota Tamada, Toru Takumi, Björn W. Schuller, Yoshiharu Yamamoto:

MADUV: The 1st INTERSPEECH Mice Autism Detection via Ultrasound Vocalization Challenge. - Szymon Szmajdzinski, Juliusz Wójtowicz-Kruk, Ivan Ryzhankow, Lukasz Lazarski, Jakub Zak, Wladyslaw Sredniawa:

Significance of Time-Frequency preprocessing for automatic Ultrasonic Vocalization classification in Autism Spectrum Disorder model detection. - Quentin Le Tellier, Marc Evrard, Albert Rilliard, Jean-Sylvain Liénard:

Robust Vocal Intensity Prediction: Overcoming Dataset Bias with Pretrained Deep Models. - Ryo Terashima, Yuma Shirahata, Masaya Kawamura:

SLASH: Self-Supervised Speech Pitch Estimation Leveraging DSP-derived Absolute Pitch.
Keynote 2 - Alexander Waibel: From Speech Science to Language Transparence
- Alexander Waibel:

From Speech Science to Language Transparence.
Spoken Dialogue Systems 1
- Truong Do, Phuong Minh Nguyen, Le-Minh Nguyen:

PruneSLU: Efficient On-device Spoken Language Understanding through Vocabulary and Structural Pruning. - Haris Gulzar, Monikka Roslianna Busto, Akiko Masaki, Takeharu Eda, Ryo Masumura:

Leveraging LLMs for Written to Spoken Style Data Transformation to Enhance Spoken Dialog State Tracking. - Simon Sedlácek, Bolaji Yusuf, Jan Svec, Pradyoth Hegde, Santosh Kesiraju, Oldrich Plchot, Jan Cernocký:

Approaching Dialogue State Tracking via Aligning Speech Encoders and LLMs. - Kiyotada Mori, Seiya Kawano, Chaoran Liu, Carlos Toshinori Ishi, Angel F. Garcia Contreras, Koichiro Yoshino:

What Do Humans Hear When Interacting? Experiments on Selective Listening for Evaluating ASR of Spoken Dialogue Systems. - Minghan Wang, Ye Bai, Yuxia Wang, Thuy-Trang Vu, Ehsan Shareghi, Gholamreza Haffari:

SpeechDialogueFactory: A Framework for Natural Speech Dialogue Generation. - Xiaohan Shi, Xingfeng Li, Tomoki Toda:

Who, When, and What: Leveraging the "Three Ws" Concept for Emotion Recognition in Conversation. - Alkis Koudounas, Claudio Savelli, Flavio Giobergia, Elena Baralis:

"Alexa, can you forget me?" Machine Unlearning Benchmark in Spoken Language Understanding. - Ebru Arisoy, Merve Ünlü Menevse, Yusufcan Manav, Arzucan Özgür:

Evaluating Large Language Models in Data Generation for Low-Resource Scenarios: A Case Study on Question Answering. - Maria Teleki, Lingfeng Shi, Chengkai Liu, James Caverlee:

I want a horror - comedy - movie: Slips-of-the-Tongue Impact Conversational Recommender System Performance. - Atsumoto Ohashi, Shinya Iizuka, Jingjing Jiang, Ryuichiro Higashinaka:

Towards a Japanese Full-duplex Spoken Dialogue System.
Speech Assessment
- Chih-Kai Yang, Neo Ho, Yen-Ting Piao, Hung-yi Lee:

SAKURA: On the Multi-hop Reasoning of Large Audio-Language Models Based on Speech and Audio Information. - Guitao Wang, Jinming Zhao, Hao Yang, Guilin Qi, Tongtong Wu, Gholamreza Haffari:

Continual Speech Learning with Fused Speech Features. - Jiatong Shi, Hye-jin Shim, Shinji Watanabe:

Uni-VERSA: Versatile Speech Assessment with a Unified Network. - John Alderete, Macarious Kin Fung Hui, Aanchan Mohan:

Evaluating ASR Robustness to Spontaneous Speech Errors: A Study of WhisperX Using a Speech Error Database. - Tomoya Mizumoto, Atsushi Kojima, Yusuke Fujita, Lianbo Liu, Yui Sudo:

Is Synthetic Data Truly Effective for Training Speech Language Models? - Francesco Verdini, Pierfrancesco Melucci, Stefano Perna, Francesco Cariaggi, Marco Gaido, Sara Papi, Szymon Mazurek, Marek Kasztelnik, Luisa Bentivogli, Sébastien Bratières, Paolo Merialdo, Simone Scardapane:

How to Connect Speech Foundation Models and Large Language Models? What Matters and What Does Not.
Audio-Visual ASR and Multimodal System
- Julián Zapata, Lara Hanna:

Text Entry for All: Towards Speech-based Multimodal Interaction for Inclusion, Accessibility and the Preservation of the World's Linguistic Heritage. - Umberto Cappellazzo, Minsu Kim, Stavros Petridis, Daniele Falavigna, Alessio Brutti:

Scaling and Enhancing LLM-based AVSR: A Sparse Mixture of Projectors Approach. - Thai-Binh Nguyen, Ngoc-Quan Pham, Alexander Waibel:

Cocktail-Party Audio-Visual Speech Recognition. - Zhengyang Li, Pascal Reichert, Thomas Graave, Patrick Blumenberg, Tim Fingscheidt:

Efficient Noise-Robust Hybrid Audiovisual Encoder with Joint Distillation and Pruning for Audiovisual Speech Recognition. - Naoki Makishima, Naotaka Kawata, Taiga Yamane, Mana Ihori, Tomohiro Tanaka, Satoshi Suzuki, Shota Orihashi, Ryo Masumura:

Unified Audio-Visual Modeling for Recognizing Which Face Spoke When and What in Multi-Talker Overlapped Speech and Video.
Speech and Voice Disorders 1
- Shangkun Huang, Jing Deng, Jintao Kang, Rong Zheng:

Leveraging LLM for Stuttering Speech: A Unified Architecture Bridging Recognition and Event Detection. - Zongli Ye, Jiachen Lian, Xuanru Zhou, Jinming Zhang, Haodong Li, Shuhe Li, Chenxu Guo, Anaisha Das, Peter Park, Zoe Ezzes, Jet Vonk, Brittany Morin, Rian Bogley, Lisa Wauters, Zachary A. Miller, Maria Luisa Gorno-Tempini, Gopala Anumanchipalli:

Seamless Dysfluent Speech Text Alignment for Disordered Speech Analysis. - Jinming Zhang, Xuanru Zhou, Jiachen Lian, Shuhe Li, William Li, Zoe Ezzes, Rian Bogley, Lisa Wauters, Zachary A. Miller, Jet Vonk, Brittany Morin, Maria Luisa Gorno-Tempini, Gopala Anumanchipalli:

Analysis and Evaluation of Synthetic Data Generation in Speech Dysfluency Detection. - Yeseul Park, Bowon Lee:

Fine-tuning Strategies for Automatic Speech Recognition of Low-Resource Speech with Autism Spectrum Disorder. - Margot Masson, Isabelle Ferrané, Julie Mauclair:

Identification of Pathological Pronunciation Profiles in ASR Transcription Errors. - Hadrien Titeux, Quang Tuan Rémy Nguyen, Andres Gil-Salcedo, Anne-Catherine Bachoud-Lévi, Emmanuel Dupoux:

A simple method for predicting Clinical Scores in Huntington's Disease by leveraging ASR's uncertainty on spontaneous speech. - Itay Ben-Dom, Catherine I. Watson, Clare M. McCann:

Introducing EMOPARKNZ: the Emotional Speech Database from New Zealand English Speakers with Parkinson's Disease. - Naoki Hojo, Ryoichi Takashima, Chihiro Sugiyama, Nobukazu Tanaka, Kanji Nohara, Kazunori Nozaki, Tetsuya Takiguchi:

Revisiting WFST-based Hybrid Japanese Speech Recognition System for Individuals with Organic Speech Disorders.
Multimodal Information Based Speech Processing (MISP) 2025 Challenge
- Longjie Luo, Shenghui Lu, Lin Li, Qingyang Hong:

Pseudo Labels-based Neural Speech Enhancement for the AVSR Task in the MISP-Meeting Challenge. - Ming Gao, Shilong Wu, Hang Chen, Jun Du, Chin-Hui Lee, Shinji Watanabe, Jingdong Chen, Sabato Marco Siniscalchi, Odette Scharenborg:

The Multimodal Information Based Speech Processing (MISP) 2025 Challenge: Audio-Visual Diarization and Recognition. - Zhaoyang Li, Haodong Zhou, Longjie Luo, XiaoXiao Li, Yongxin Chen, Lin Li, Qingyang Hong:

Cross-attention and Self-attention for Audio-visual Speaker Diarization in MISP-Meeting Challenge. - Ming Cheng, Fei Su, Cancan Li, Juan Liu, Ming Li:

Multi-Channel Sequence-to-Sequence Neural Diarization: Experimental Results for The MISP 2025 Challenge. - Zeyan Song, Tianchi Sun, Ronghui Hu, Kai Chen, Jing Lu:

Leveraging Self-Supervised Learning Based Speaker Diarization for MISP 2025 AVSD Challenge. - Shangkun Huang, Yuxuan Du, Jingwen Yang, Dejun Zhang, Xupeng Jia, Jing Deng, Jintao Kang, Rong Zheng:

Overlap-Adaptive Hybrid Speaker Diarization and ASR-Aware Observation Addition for MISP 2025 Challenge.
Speaker Extraction 1
- Md Asif Jalal, Luca Remaggi, Vasileios Moschopoulos, Thanasis Kotsiopoulos, Vandana Rajan, Karthikeyan Saravanan, Anastasios Drosou, Junho Heo, Hyuk Oh, Seokyeong Jeong:

Robust Target Speaker Diarization and Separation via Augmented Speaker Embedding Sampling. - Wang Dai, Archontis Politis, Tuomas Virtanen:

Inter-Speaker Relative Cues for Text-Guided Target Speech Extraction. - Shaole Li, Shuai Wang, Jiangyu Han, Ke Zhang, Wupeng Wang, Haizhou Li:

REAL-T: Real Conversational Mixtures for Target Speaker Extraction. - Zexu Pan, Wupeng Wang, Shengkui Zhao, Chong Zhang, Kun Zhou, Yukun Ma, Bin Ma:

Online Audio-Visual Autoregressive Speaker Extraction. - Zexu Pan, Shengkui Zhao, Tingting Wang, Kun Zhou, Yukun Ma, Chong Zhang, Bin Ma:

Plug-and-Play Co-Occurring Face Attention for Robust Audio-Visual Speaker Extraction.
Low Resource Speech Recognition
- Salvatore Carta, Alessandro Giuliani, Marco Manolo Manca, Mirko Marras, Leonardo Piano:

SardinianVoxes: A Speech Recognition Dataset for the Sardinian Languages. - Griffin Dietz Smith, Dianna Yee, Jennifer King Chen, Leah Findlater:

Prompting Whisper for Improved Verbatim Transcription and End-to-end Miscue Detection. - Bowen Zhang, Nur Afiqah Abdul Latiff, Justin Kan, Rong Tong, Donny Soh, Xiaoxiao Miao, Ian McLoughlin:

Automated evaluation of children's speech fluency for low-resource languages. - King Yiu Suen, Rudolf Chow, Albert Y. S. Lam:

Cantonese Punctuation Restoration using LLM Annotated Data. - David Sasu, Benedict Quartey, Kweku Andoh Yamoah, Natalie Schluter:

Enhancing Speech Instruction Understanding and Disambiguation in Robotics via Speech Prosody. - Abhijit Sinha, Hemant Kumar Kathania, Mikko Kurimo:

Beyond Traditional Speech Modifications : Utilizing Self Supervised Features for Enhanced Zero-Shot Children ASR. - Nicol Visser, Herman Kamper:

Spoken Language Modeling with Duration-Penalized Self-Supervised Units.
Computational Resource Constrained ASR
- Zhaoqing Li, Haoning Xu, Zengrui Jin, Lingwei Meng, Tianzi Wang, Huimeng Wang, Youjun Chen, Mingyu Cui, Shujie Hu, Xunying Liu:

Towards One-bit ASR: Extremely Low-bit Conformer Quantization Using Co-training and Stochastic Precision. - Zhaoqing Li, Haoning Xu, Xurong Xie, Zengrui Jin, Tianzi Wang, Xunying Liu:

Unfolding A Few Structures for The Many: Memory-Efficient Compression of Conformer and Speech Foundation Models. - Haoning Xu, Zhaoqing Li, Youjun Chen, Huimeng Wang, Guinan Li, Mengzhe Geng, Chengxi Deng, Xunying Liu:

Effective and Efficient One-pass Compression of Speech Foundation Models Using Sparsity-aware Self-pinching Gates. - Tianteng Gu, Bei Liu, Haoyu Wang, Yanmin Qian:

Ultra-Low Bit Post-Training Quantization of Large Speech Models via K-Means Clustering and Mixed Precision Allocation. - Masao Someki, Shikhar Bharadwaj, Atharva Anand Joshi, Chyi-Jiunn Lin, Jinchuan Tian, Jee-weon Jung, Markus Müller, Nathan Susanj, Jing Liu, Shinji Watanabe:

Context-Driven Dynamic Pruning for Large Speech Foundation Models. - Benedikt Hilmes, Nick Rossenbach, Ralf Schlüter:

Analyzing the Importance of Blank for CTC-Based Knowledge Distillation. - Seraphina Fong, Marco Matassoni, Alessio Brutti:

Speech LLMs in Low-Resource Scenarios: Data Volume Requirements and the Impact of Pretraining on High-Resource Languages.
Speech and Language Technology for Health Applications
- Yue Pan, Liwei Liu, Changxin Li, Xingyao Wang, Yili Xia, Hanyue Zhang, Ming Chu:

A Chinese Heart Failure Status Speech Database with Universal and Personalised Classification. - Harish Battula, Gauri Deshpande, Yagna Gudipalli, Sachin Patel:

Heart Rate as a Proxy Measure to Assess Human Confidence in Spoken Speech. - Jingping Nie, Tien Dung Tran, Karan Thakkar, Vasudha Kowtha, Jon Huang, Carlos Avendaño, Erdrin Azemi, Vikramjit Mitra:

Foundation Model Hidden Representations for Heart Rate Estimation from Auscultation. - Orchid Chetia Phukan, Girish, Mohd Mujtaba Akhtar, Swarup Ranjan Behera, Priyabrata Mallick, Santanu Roy, Arun Balaji Buduru, Rajesh Sharma:

Towards Fusion of Neural Audio Codec-based Representations with Spectral for Heart Murmur Classification via Bandit-based Cross-Attention Mechanism. - Yizhou Chen, Xiyu Wu:

Perception of Emotional Speech by Individuals with High Borderline Personality Features. - Agata Sage, Zuzanna Miodonska, Michal Krecichwost, Ewa Kwasniok, Pawel Badura:

Visual features of the oral region in Polish sibilants produced by children with various sibilance patterns. - Roseline Polle, Agnes Norbury, Alexandra Livia Georgescu, Nicholas Cummins, Stefano Goria:

Meta-Learning Approaches for Speaker-Dependent Voice Fatigue Models. - Ning Wang, Bingyang Wen, Minghui Wu, Yang Sun, Zongru Shao, Haojie Zhou, K. P. Subbalakshmi:

Decoding Alzheimer's: Interpretable Visual and Logical Attention in Picture Description Tasks.
Responsible Speech Foundation Models + SUPERB Challenge
- Antonios Alexos, Raghuveer Peri, Sai Muralidhar Jayanthi, Metehan Cekic, Srikanth Vishnubhotla, Kyu J. Han, Srikanth Ronanki:

Defending Speech-enabled LLMs Against Adversarial Jailbreak Threats. - Yi-Cheng Lin, Huang-Cheng Chou, Hung-yi Lee:

Mitigating Subgroup Disparities in Multi-Label Speech Emotion Recognition: A Pseudo-Labeling and Unsupervised Learning Approach. - Dariia Puhach, Amir H. Payberah, Éva Székely:

Who Gets the Mic? Investigating Gender Bias in the Speaker Assignment of a Speech-LLM. - Mengzhe Geng, Patrick Littell, Aidan Pine, Robbie Jimerson, Gilles Boulianne, Vishwa Gupta, Rolando Coto-Solano, Anna Kazantseva, Marc Tessier, Delaney Lothian, Akwiratékha' Martin, Eric Joanis, Samuel Larkin, Roland Kuhn:

Evaluating Speech Foundation Models for Automatic Speech Recognition in the Low-Resource Kanyen'kéha Language. - Debarpan Bhattacharya, Apoorva Kulkarni, Sriram Ganapathy:

Benchmarking and Confidence Evaluation of LALMs For Temporal Reasoning. - Chun-Yi Kuan, Hung-yi Lee:

Teaching Audio-Aware Large Language Models What Does Not Hear: Mitigating Hallucinations through Synthesized Negative Samples. - Ke-Han Lu, Chun-Yi Kuan, Hung-yi Lee:

Speech-IFEval: Evaluating Instruction-Following and Quantifying Catastrophic Forgetting in Speech-Aware Language Models. - Potsawee Manakul, Guangzhi Sun, Warit Sirichotedumrong, Kasima Tharnpipitchai, Kunat Pipatanakul:

Enhancing Low-Resource Language and Instruction Following Capabilities of Audio Language Models. - Qingzheng Wang, Jiancheng Sun, Yifan Peng, Shinji Watanabe:

Improving Multilingual Speech Models on ML-SUPERB 2.0: Fine-tuning with Data Augmentation and LID-Aware CTC. - William Chen, Chutong Meng, Jiatong Shi, Martijn Bartelds, Shih-Heng Wang, Hsiu-Hsuan Wang, Rafael Mosquera, Sara Hincapie, Dan Jurafsky, Antonis Anastasopoulos, Hung-yi Lee, Karen Livescu, Shinji Watanabe:

The ML-SUPERB 2.0 Challenge: Towards Inclusive ASR Benchmarking for All Language Varieties. - Tanel Alumäe, Artem Fedorchenko:

TalTech Systems for the Interspeech 2025 ML-SUPERB 2.0 Challenge.
Dysarthric Speech Assessment 1
- Tao Zhong, Mengzhe Geng, Shujie Hu, Guinan Li, Xunying Liu:

Regularized Federated Learning for Privacy-Preserving Dysarthric and Elderly Speech Recognition. - Yejin Jeon, Solee Im, Youngjae Kim, Gary Geunbae Lee:

Facilitating Personalized TTS for Dysarthric Speakers Using Knowledge Anchoring and Curriculum Learning. - Xueyuan Chen, Dongchao Yang, Wenxuan Wu, Minglin Wu, Jing Xu, Xixin Wu, Zhiyong Wu, Helen Meng:

DiffDSR: Dysarthric Speech Reconstruction Using Latent Diffusion Model. - Shoutrik Das, Nishant Singh, Arjun Gangwar, S. Umesh:

Improved Intelligibility of Dysarthric Speech using Conditional Flow Matching. - Ahmed Aboeitta, Ahmed Sharshar, Youssef Nafea, Shady Shehata:

Bridging ASR and LLMs for Dysarthric Speech Recognition: Benchmarking Self-Supervised and Generative Approaches. - Chin-Jou Li, Eunjung Yeo, Kwanghee Choi, Paula Andrea Pérez-Toro, Masao Someki, Rohan Kumar Das, Zhengjun Yue, Juan Rafael Orozco-Arroyave, Elmar Nöth, David R. Mortensen:

Towards Inclusive ASR: Investigating Voice Conversion for Dysarthric Speech Recognition in Low-Resource Languages. - Yan Xiong, Visar Berisha, Julie Liss, Chaitali Chakrabarti:

Mitigating Overfitting During Speech Foundation Model Fine-tuning: Applications to Dysarthric Speech Detection. - Seohyun Park, Chitralekha Gupta, Michelle Kah Yian Kwan, Xinhui Fung, Alexander Wenjun Yip, Suranga Nanayakkara:

Towards Temporally Explainable Dysarthric Speech Clarity Assessment.
Show and Tell 2: Speech Synthesis
- Vishal Gourav, Phanindra Mankale:

Code Mix TTS: An Approach to Infer Human Like Speech for Multi-Lingual Input Texts. - Binh Nguyen, Thai Le:

Turing's Echo: Investigating Linguistic Sensitivity of Deepfake Voice Detection via Gamification. - Namhyun Cho, Sunmin Kim, Minsu Kang, Seolhee Lee, Choonghyeon Lee, Yangsun Lee:

Unleashing the Inner Monster: Demonstrating High-Fidelity Human to Non-Human Voice Conversion. - Victor Shepardson, Jonathan Reus, Thor Magnusson:

Tungnaá In Live Performance: An Implementation Of Interactive Artistic Text-To-Voice. - Shree Harsha Bokkahalli Satish, Gustav Eje Henter, Éva Székely:

Hear Me Out: Interactive evaluation and bias discovery platform for speech-to-speech conversational AI. - Takayuki Arai:

Vocal-tract model with two directions: Static design for a dummy head and dynamic design for a speaking machine.
Databases and Progress in Methodology
- Arnav Rustagi, Satvik Bajpai, Nimrat Kaur, Siddharth:

Dhvani: A Weakly-supervised Phonemic Error Detection and Personalized Feedback System for Hindi. - Alexandra Fort, Francis Tyers:

Evaluating Wav2Vec2-Bert for Computer-Assisted Pronunciation Training for isiZulu. - Lubos Marcinek, Jonas Beskow, Joakim Gustafson:

Towards Adaptable and Intelligible Speech Synthesis in Noisy Environments. - Lidea Shahidi, Erdem Baha Topbas, Thu Ngan Dang

, Tobias Goehring:
Harnessing Text-to-Speech Voice Cloning Models for Improved Audiological Speech Assessment. - Xuan Shi, Yubin Zhang, Yijing Lu, Marcus Ma, Tiantian Feng, Asterios Toutios, Haley Hsu, Louis Goldstein, Shrikanth Narayanan:

75-Speaker Annot-16: A benchmark dataset for speech articulatory rt-MRI annotation with articulator contours and phonetic alignment. - Greta Tuckute, Klemen Kotar, Evelina Fedorenko, Daniel Yamins:

Representing Speech Through Autoregressive Prediction of Cochlear Tokens. - Chanwoo Park, Anna Seo Gyeong Choi, Sunghye Cho, Chanwoo Kim:

Reasoning-Based Approach with Chain-of-Thought for Alzheimer's Detection Using Speech and Large Language Models. - Linda Bakkouche, Charles McGhee, Emily Lau, Stephanie Cooper, Xinbing Luo, Madeleine Rees, Kai Alter, Brechtje Post, Julia Schwarz:

Finding the Human Voice in AI: Insights on the Perception of AI-Voice Clones from Naturalness and Similarity Ratings. - Kentaro Onda, Keisuke Imoto, Satoru Fukayama, Daisuke Saito, Nobuaki Minematsu:

Prosodically Enhanced Foreign Accent Simulation by Discrete Token-based Resynthesis Only with Native Speech Corpora.
Novel Architectures for ASR
- Enes Yavuz Ugan, Ngoc-Quan Pham, Alexander Waibel:

Weight Factorization and Centralization for Continual Learning in Speech Recognition. - Chenxu Guo, Jiachen Lian, Xuanru Zhou, Jinming Zhang, Shuhe Li, Zongli Ye, Peter Park, Anaisha Das, Zoe Ezzes, Jet Vonk, Brittany Morin, Rian Bogley, Lisa Wauters, Zachary A. Miller, Maria Luisa Gorno-Tempini, Gopala Anumanchipalli:

Dysfluent WFST: A Framework for Zero-Shot Speech Dysfluency Transcription and Detection. - I-Ting Hsieh, Chung-Hsien Wu:

Dysarthric Speech Recognition Using Curriculum Learning and Multi-stream Architecture. - Yui Sudo, Yosuke Fukumoto, Muhammad Shakeel, Yifan Peng, Chyi-Jiunn Lin, Shinji Watanabe:

DYNAC: Dynamic Vocabulary-based Non-Autoregressive Contextualization for Speech Recognition. - Hojun Jin, Eunsoo Hong, Ziwon Hyung, Sungjun Lim, Seungjin Lee, Keunseok Cho:

Beyond Hard Sharing: Efficient Multi-Task Speech-to-Text Modeling with Supervised Mixture of Experts. - Yifan Peng, Muhammad Shakeel, Yui Sudo, William Chen, Jinchuan Tian, Chyi-Jiunn Lin, Shinji Watanabe:

OWSM v4: Improving Open Whisper-Style Speech Models via Data Scaling and Cleaning.
Deepfake Detection
- Kwok Chin Yuen, Jia Qi Yip, Zhen Qiu, Chi-Hung Chi, Kwok-Yan Lam:

Bona fide Cross Testing Reveals Weak Spot in Audio Deepfake Detection Systems. - Yassine El Kheir, Tim Polzehl, Sebastian Möller:

BiCrossMamba-ST: Speech Deepfake Detection with Bidirectional Mamba Spectro-Temporal Cross-Attention. - Neta Glazer, David Chernin, Idan Achituve, Sharon Gannot, Ethan Fetaya:

Few-Shot Speech Deepfake Detection Adaptation with Gaussian Processes. - Nicolas M. Müller, Piotr Kawa, Wei Herng Choong, Adriana Stan, Aditya Tirumala Bukkapatnam, Karla Pizzi, Alexander Wagner, Philip Sperl:

Replay Attacks Against Audio Deepfake Detection. - Seung-bin Kim, Hyun-seo Shin, Jungwoo Heo, Chan-yeong Lim, Kyo-Won Koo, Jisoo Son, Sanghyun Hong, Souhwan Jung, Ha-Jin Yu:

Enhancing Audio Deepfake Detection by Improving Representation Similarity of Bonafide Speech. - Mingru Yang, Yanmei Gu, Qianhua He, Yanxiong Li, Peirong Zhang, Yongqiang Chen, Zhiming Wang, Huijia Zhu, Jian Liu, Weiqiang Wang:

Generalizable Audio Deepfake Detection via Hierarchical Structure Learning and Feature Whitening in Poincaré sphere.
Tools for Speech Analysis
- Kun Jin, Siva Penke, Srinivasa Algubelli:

VoiceNet: Multilingual On-Device Phoneme-To-Audio Alignment. - Maya Dewhurst, Jack Collins, Justin J. H. Lo, Roy Alderton, Sam Kirkham:

Nosey: Open-Source Hardware for Acoustic Nasalance. - James Tanner, Morgan Sonderegger, Jane Stuart-Smith, Jeff Mielke, Tyler Kendall:

Automatic classification of stop realisation with wav2vec2.0.
Text Processing and Evaluation for Speech Synthesis 1
- Siqi Sun, Korin Richmond:

Acquiring Pronunciation from Speech Audio via Multi-task Learning. - Sujoy Roychowdhury, Ranjani H. G., Sumit Soman, Nishtha Paul, Subhadip Bandyopadhyay, Siddhanth Iyengar:

Intelligibility of Text-to-Speech Systems for Mathematical Expressions. - Praveen Srinivasa Varadhan, Sherry Thomas, Sai Teja M. S., Suvrat Bhooshan, Mitesh M. Khapra:

The State Of TTS: A Case Study with Human Fooling Rates. - Jinzuomu Zhong, Suyuan Liu, Dan Wells, Korin Richmond:

Pairwise Evaluation of Accent Similarity in Speech Synthesis. - Harm Lameris, Joakim Gustafsson, Éva Székely:

VoiceQualityVC: A Voice Conversion System for Studying the Perceptual Effects of Voice Quality in Speech. - Michael Kuhlmann, Fritz Seebauer, Petra Wagner, Reinhold Haeb-Umbach:

Towards Frame-level Quality Predictions of Synthetic Speech.
Segmental and Tonal Units
- C. T. Justine Hui, Jenice Kuzhikombil, Isabella Shields, Hiraia Haami-Wells, Catherine I. Watson, Peter J. Keegan:

Perception of Long and Short Vowel Contrast in Te Reo Māori in Clean and Everyday Listening Environments. - Patrik Hrabánek, Michaela Watkins, Silke Hamann:

The function of creaky voice in South Korean: A perception study. - Mingxi Lu, Ran Tao, Yujia Tian:

Talker Normalization in Chinese Bilinguals: A Comparative Study. - Terumichi Ariga:

Coping with segmental-prosodic incongruity in spoken word recognition in Japanese. - Saskia Wepner, Lucas Eckert, Gernot Kubin, Barbara Schuppler:

What the Filler? Both ASR Systems and Humans Struggle More With Other Kinds of Disfluencies Than With Filler Particles.
Speech Quality Assessment
- Danilo de Oliveira, Julius Richter, Jean-Marie Lemercier, Simon Welker, Timo Gerkmann:

Non-intrusive Speech Quality Assessment with Diffusion Models Trained on Clean Speech. - Wafaa Wardah, Robert P. Spang, Vincent Barriac, Jan Reimes, Anna Llagostera, Jens Berger, Sebastian Möller:

SQ-AST: A Transformer-Based Model for Speech Quality Prediction. - Imran E. Kibria, Donald S. Williamson:

AttentiveMOS: A Lightweight Attention-Only Model for Speech Quality Prediction. - Yu-Fei Shi, Yang Ai, Zhen-Hua Ling:

Universal Preference-Score-based Pairwise Speech Quality Assessment. - Enjamamul Hoq, Nikhil Gupta, Danielle Omondi, Ifeoma Nwogu:

FUSE-MOS: Fusion of Speech Embeddings for MOS Prediction with Uncertainty Quantification. - Wen-Chin Huang, Erica Cooper, Tomoki Toda:

SHEET: A Multi-purpose Open-source Speech Human Evaluation Estimation Toolkit.
Speech Enhancement
- Haici Yang, Gordon Wichern, Ryo Aihara, Yoshiki Masuyama, Sameer Khurana, François G. Germain, Jonathan Le Roux:

Investigating continuous autoregressive generative speech enhancement. - Venkatesh Parvathala, K. Sri Rama Murty:

Dynamic Layer Gating for Speech Enhancement. - Saisamarth Rajesh Phaye, Milos Cernak, Andrew Harper:

Model as Loss: A Self-Consistent Training Paradigm. - Avishkar Behera, Riya Ann Easow, Venkatesh Parvathala, K. Sri Rama Murty:

Test-Time Training for Speech Enhancement. - Seungu Han, Sungho Lee, Juheon Lee, Kyogu Lee:

Few-step Adversarial Schrödinger Bridge for Generative Speech Enhancement. - Venkatesh Parvathala, Ramesh Gundluru, Sreekanth Sankala, K. Sri Rama Murty:

Exploiting Bispectral Features for Single-Channel Speech Enhancement.
Language Learning and Assessment
- Olli Kuparinen:

Automatic Dialectal Transcription: An Evaluation on Finnish and Norwegian. - Wieke Harmsen, Roeland van Hout, Catia Cucchiarini, Helmer Strik:

Can ASR generate valid measures of child reading fluency? - Chowdam Venkata Thirumala Kumar, Chiranjeevi Yarra:

SGED-Probe: Probing E2E ASR decoder and aligner for spoken grammar error detection under three speaking practice conditions. - Aditya Kamlesh Parikh, Cristian Tejedor García, Catia Cucchiarini, Helmer Strik:

Evaluating Logit-Based GOP Scores for Mispronunciation Detection. - Yassine El Kheir, Omnia Ibrahim, Amit Meghanani, Nada Almarwani, Hawau Olamide Toyin, Sadeen Alharbi, Modar Alfadly, Lamya Alkanhal, Ibrahim Selim, Shehab Elbatal, Salima Mdhaffar, Thomas Hain, Yasser Hifny, Mostafa Shahin, Ahmed Ali:

Towards a Unified Benchmark for Arabic Pronunciation Assessment: Qur'anic Recitation as Case Study. - Wen-Wei Hsieh, Hao-Wei Chi, Kuan-Chen Wang, Ping-Cheng Yeh, Te-Hsin Liu, Chen-Yu Chiang:

OMPAL: Bridging Speech and Learning with an Open-Source Mandarin Pronunciation Assessment Corpus for Global Learners. - Haopeng Geng, Daisuke Saito, Nobuaki Minematsu:

A Perception-Based L2 Speech Intelligibility Indicator: Leveraging a Rater's Shadowing and Sequence-to-sequence Voice Conversion. - Sehyun Oh, Sunhee Kim, Minhwa Chung:

Multimodal and Multitask Learning for Predicting Multiple Scores in L2 English Speech. - Jingran Xie, Xiang Li, Hui Wang, Yue Yu, Yang Xiang, Xixin Wu, Zhiyong Wu:

Enhancing Generalization of Speech Large Language Models with Multi-Task Behavior Imitation and Speech-Text Interleaving. - Nhan Phan, Mikko Kuronen, Maria Kautonen, Riikka Ullakonoja, Anna von Zansen, Yaroslav Getman, Ekaterina Voskoboinik, Tamás Grósz, Mikko Kurimo:

Mispronunciation Detection Without L2 Pronunciation Dataset in Low-Resource Setting: A Case Study in Finland Swedish.
Speech Synthesis Paradigms and Methods 1
- Hyun Joon Park, Jeongmin Liu, Jin Sob Kim, Jeong Yeol Yang, Sung Won Han, Eunwoo Song:

RapFlow-TTS: Rapid and High-Fidelity Text-to-Speech with Improved Consistency Flow Matching. - Qixi Zheng, Yushen Chen, Zhikang Niu, Ziyang Ma, Xiaofei Wang, Kai Yu, Xie Chen:

Accelerating Flow-Matching-Based Text-to-Speech via Empirically Pruned Step Sampling. - Changfeng Gao, Zhihao Du, Shiliang Zhang:

Differentiable Reward Optimization for LLM based TTS system. - Zhipeng Li, Xiaofen Xing, Jingyuan Xing, Hangrui Hu, Heng Lu, Xiangmin Xu:

Long-Context Speech Synthesis with Context-Aware Memory. - Yike Zhang, Yiming Li, Jie Chen, Qinghua Wu, Songjun Cao, Long Ma:

Monotonic Attention for Robust Text-to-Speech Synthesis in Large Language Model Frameworks. - Ye-Xin Lu, Hui-Peng Du, Fei Liu, Yang Ai, Zhen-Hua Ling:

Improving Noise Robustness of LLM-based Zero-shot TTS via Discrete Acoustic Token Denoising. - Frank Zalkow, Paolo Sani, Kishor Kayyar Lakshminarayana, Emanuël A. P. Habets, Nicola Pia, Christian Dittmar:

Bridging the Training-Inference Gap in TTS: Training Strategies for Robust Generative Postprocessing for Low-Resource Speakers. - Chunhui Lu, Xue Wen, Liming Song, Junkwang Oh:

Robust Neural Codec Language Modeling with Phoneme Position Prediction for Zero-Shot TTS.
Spatial Audio and Acoustics 2
- Roland Hartanto, Sakriani Sakti, Koichi Shinoda:

SepVAC: Multitask Learning of Speaker Separation, Speaker Localization, Microphone Array Localization, and Room Acoustic Parameter Estimation in Various Acoustic Conditions. - Junhui Zhao, Hang Chen, Qing Wang, Jun Du, Yanhui Tu, Feng Ma:

TA-RIR: Topology-Aware Neural Modeling of Acoustic Propagation for Room Impulse Response Synthesis. - Hyun-Soo Kim, Da-Hee Yang, Joon-Hyuk Chang:

Spatially Weighted Contrastive Learning for Robust Sound Source Localization. - Yiyuan Yang, Shitong Xu, Niki Trigoni, Andrew Markham:

Efficient and Microphone-Fault-Tolerant 3D Sound Source Localization. - De Hu, Shuyao Liu, Yanrong He:

Joint Reference Microphone Selection and Filter Order Determination in Multi-channel Active Noise Control. - Liang Tao, Maoshen Jia, Yonggang Hu:

Direct-path Relative Harmonic Coefficients Detection for Multi-source Direction-of-Arrival Estimation in Reverberant Environments. - Junsheng Hu, Shaojie Li, Qintuya Si, De Hu:

D-GAT: Dual Graph Attention Network for Global HRTF Interpolation. - Mateusz Guzik, Giulio Cengarle, Daniel Arteaga:

Deep learning based spatial aliasing reduction in beamforming for audio capture. - Xiaoming Zhang, Ke-Yue Zhang, Taiping Yao, Songjun Cao, Shouhong Ding, Long Ma:

SonarGuard2: Ultrasonic Face Liveness Detection Based on Adaptive Doppler Effect Feature Extraction.
Text Processing and Evaluation for Speech Synthesis 2
- Hien Ohnaka, Yuma Shirahata, Byeongseon Park, Ryuichi Yamamoto:

Grapheme-Coherent Phonemic and Prosodic Annotation of Speech by Implicit and Explicit Grapheme Conditioning. - Noe Berger, Siqi Sun, Korin Richmond:

Non-Standard Accent TTS Support via Large Multi-Accent Frontend Pronunciation Knowledge Transfer. - Timothy Shin Heng Mak, King Yiu Suen, Albert Y. S. Lam:

Speech-guided Grapheme-to-Phoneme Conversion for Cantonese Text-to-Speech. - Rui Hu, Xiaolong Lin, Jiawang Liu, Shixi Huang, Zhenpeng Zhan:

Transcript-Prompted Whisper with Dictionary-Enhanced Decoding for Japanese Speech Annotation. - Sébastien Le Maguer, Gwénolé Lecorvé, Damien Lolive, Naomi Harte, Juraj Simko:

Enabling the replicability of speech synthesis perceptual evaluations. - Natacha Miniconi, Meysam Shamsi, Anthony Larcher:

When The MOS Predictor Asks For Training Annotation In Cross Lingual/Domain Adaptation. - Ryo Setoguchi, Yoshiko Arimoto:

Assessment of the synthetic quality and controllability of laughing onset in speech-laugh synthesis.
General Topics in ASR
- Nick Rossenbach, Benedikt Hilmes, Leon Brackmann, Moritz Gunz, Ralf Schlüter:

Running Conventional Automatic Speech Recognition on Memristor Hardware: A Simulated Approach. - Ke Hu, Krishna C. Puvvada, Elena Rastorgueva, Zhehuai Chen, He Huang, Shuoyang Ding, Kunal Dhawan, Hainan Xu, Jagadeesh Balam, Boris Ginsburg:

Word Level Timestamp Generation for Automatic Speech Recognition and Translation. - Ju Lin, Yiteng Huang, Ming Sun, Frank Seide, Florian Metze:

Directional Speech Recognition with Full-Duplex Capability. - Jiajun He, Naoki Sawada, Koichi Miyazaki, Tomoki Toda:

CMT-LLM: Contextual Multi-Talker ASR Utilizing Large Language Models. - Hongfei Xue, Yufeng Tang, Jun Zhang, Xuelong Geng, Lei Xie:

Selective Invocation for Multilingual ASR: A Cost-effective Approach Adapting to Speech Recognition Difficulty.
Acoustic Event Detection and Classification
- James Taylor, Wolfgang Mack:

Improving Audio Classification by Transitioning from Zero- to Few-Shot. - Kohei Uehara, Ryoichi Takashima, Tetsuya Takiguchi:

Zero-Shot Learning for Acoustic Event Classification Using an Attribute Vector and Conditional GAN. - Lipeng Dai, Qing Wang, Jie Zhang, Shengyu Peng, Yu Guan, Wu Guo:

Leveraging Multi-Level Features of ATST with Conformer-Based Dual-Branch Network for Sound Event Detection. - Tatsuya Komatsu, Hokuto Munakata, Yuchi Ishikawa:

Leveraging Unlabeled Audio for Audio-Text Contrastive Learning via Audio-Composed Text Features. - Yuchi Ishikawa, Shota Nakada, Hokuto Munakata, Kazuhiro Saito, Tatsuya Komatsu, Yoshimitsu Aoki:

Language-Guided Contrastive Audio-Visual Masked Autoencoder with Automatically Generated Audio-Visual-Text Triplets from Videos. - Yusuke Fujita, Tomoya Mizumoto, Atsushi Kojima, Lianbo Liu, Yui Sudo:

AC/DC: LLM-based Audio Comprehension via Dialogue Continuation. - Yawei Wang, Qiaoling Zhang, Yi Zhang, Junyao Hu:

Anomalous Sound Detection Based Feature Fusion and Dual-path Non-linear Independent Components Estimation. - Nan Jiang, Yan Song, Qing Gu, Haoyu Song, Lirong Dai, Ian McLoughlin:

An Effective Anomalous Sound Detection Method Based on Global and Local Attribute Mining. - Long-Vu Hoang, Tuan Nguyen, Huy Dat Tran:

Acoustic scattering AI for non-invasive object classifications: A case study on hair assessment.
Keyword Spotting and Retrieval
- Anup Singh, Kris Demuynck, Vipul Arora:

Language-Agnostic Speech Tokenizer for Spoken Term Detection with Efficient Retrieval. - Akanksha Singh, Yi-Ping Phoebe Chen, Vipul Arora:

H-QuEST: Accelerating Query-by-Example Spoken Term Detection with Hierarchical Indexing. - Ruofan Hu, Yan Xia, Minjie Hong, Jieming Zhu, Bo Chen, Xiaoda Yang, Minghui Fang, Tao Jin:

Vela: Scalable Embeddings with Voice Large Language Models for Multimodal Retrieval. - Youngmoon Jung, Yong-Hyeok Lee, Myunghun Jung, Jaeyoung Roh, Chang Woo Han, Hoon-Young Cho:

Adversarial Deep Metric Learning for Cross-Modal Audio-Text Alignment in Open-Vocabulary Keyword Spotting. - Minghui Fang, Shengpeng Ji, Jialong Zuo, Xize Cheng, Wenrui Liu, Xiaoda Yang, Ruofan Hu, Jieming Zhu, Zhou Zhao:

GTA: Towards Generative Text-To-Audio Retrieval via Multi-Scale Tokenizer. - Changin Choi, Sungjun Lim, Wonjong Rhee:

Enhancing Retrieval-Augmented Audio Captioning with Generation-Assisted Multimodal Querying and Progressive Learning. - Ruochu Yang, Milind Rao, Harshavardhan Sundar, Anirudh Raju, Aparna Khare, Srinath Tankasala, Di He, Venkatesh Ravichandran:

On Retrieval of Long Audios with Complex Text Queries. - Jin-Gyo Lim, Seong-Eun Kim:

SIDC-KWS: Efficient Spiking Inception-Dilated Conformer with Self-Attention for Keyword Spotting. - Dzmitry Saladukha, Ivan Koriabkin, Kanstantsin Artsiom, Aliaksei Rak, Nikita Ryzhikov:

Multichannel Keyword Spotting for Noisy Conditions. - Pai Zhu, Quan Wang, Dhruuv Agarwal, Kurt Partridge:

LLM-Synth4KWS: Scalable Automatic Generation and Synthesis of Confusable Data for Custom Keyword Spotting. - Harry Zhang, Kurt Partridge, Pai Zhu, Neng Chen, Hyun Jin Park, Dhruuv Agarwal, Quan Wang:

GraphemeAug: A Systematic Approach to Synthesized Hard Negative Keyword Spotting Examples. - Firoj Alam, Md. Arid Hasan, Shammur Absar Chowdhury:

SpokenNativQA: Multilingual Everyday Spoken Queries for LLMs.
Multimodal Systems
- Sun-Kyung Lee, Jong-Hwan Kim:

CAMER: Contribution-Aware Multimodal Emotion Recognition. - Jiajun He, Jinyi Mi, Tomoki Toda:

GIA-MIC: Multimodal Emotion Recognition with Gated Interactive Attention and Modality-Invariant Learning Constraints. - Orchid Chetia Phukan, Mohd Mujtaba Akhtar, Girish, Swarup Ranjan Behera, Abu Osama Siddiqui, Sarthak Jain, Priyabrata Mallick, Jaya Sai Kiran Patibandla, Pailla Balakrishna Reddy, Arun Balaji Buduru, Rajesh Sharma:

SNIFR : Boosting Fine-Grained Child Harmful Content Detection Through Audio-Visual Alignment with Cascaded Cross-Transformer. - Zehua Liu, Xiaolou Li, Chen Chen, Lantian Li, Dong Wang:

CNVSRC 2024: The Second Chinese Continuous Visual Speech Recognition Challenge. - Abdul Hannan, Muhammad Arslan Manzoor, Shah Nawaz, Muhammad Irzam Liaqat, Markus Schedl, Mubashir Noman:

PAEFF: Precise Alignment and Enhanced Gated Feature Fusion for Face-Voice Association. - Ke Hu, Ehsan Hosseini-Asl, Chen Chen, Edresson Casanova, Subhankar Ghosh, Piotr Zelasko, Zhehuai Chen, Jason Li, Jagadeesh Balam, Boris Ginsburg:

Efficient and Direct Duplex Modeling for Speech-to-Speech Language Model. - Ziqian Wang, Xianjun Xia, Xinfa Zhu, Lei Xie:

U-SAM: An Audio Language Model for Unified Speech, Audio, and Music Understanding. - Yun Tang, Eesung Kim, Vijendra Raj Apsingekar:

Enhanced Hybrid Transducer and Attention Encoder Decoder with Text Data. - Yi Wang, Oli Danyi Liu, Peter Bell:

The role of audio-visual integration in the time course of phonetic encoding in self-supervised speech models.
Dysarthric Speech Assessment 2
- Éva Székely, Péter Mihajlik, Máté Soma Kádár, László Tóth:

Voice Reconstruction through Large-Scale TTS Models: Comparing Zero-Shot and Fine-tuning Approaches to Personalise TTS in Assistive Communication. - Dimme de Groot, Tanvina Patel, Devendra Kayande, Odette Scharenborg, Zhengjun Yue:

Objective and Subjective Evaluation of Diffusion-Based Speech Enhancement for Dysarthric Speech. - Minseop Kim, Minsu Han, Seokyoung Hong, Myoung-wan Koo:

Data Augmentation using Speech Synthesis for Speaker-Independent Dysarthria Severity Classification. - Anuprabha M, Krishna Gurugubelli, Anil Kumar Vuppala:

Fairness in Dysarthric Speech Synthesis: Understanding Intrinsic Bias in Dysarthric Speech Cloning using F5-TTS. - Jingting Li, Keyi Feng, Xinran Zhao, Yan Wang, Su-Jing Wang:

Synthetic Dysarthric Speech: A Supplement, Not a Substitute for Authentic Data in Dysarthric Speech Recognition. - Karl El Hajal, Enno Hermann, Sevada Hovsepyan, Mathew Magimai-Doss:

Unsupervised Rhythm and Voice Conversion to Improve ASR on Dysarthric Speech.
Dialect Identification in Different Languages
- Lorenz Gutscher, Michael Pucher:

Audio-Based Classification and Geographic Regression of Austrian Dialects. - Saurabh Kumar, Amartyaveer, Prasanta Kumar Ghosh:

Jointly Improving Dialect Identification and ASR in Indian Languages using Multimodal Feature Fusion. - Haroun Elleuch, Salima Mdhaffar, Yannick Estève, Fethi Bougares:

ADI-20: Arabic Dialect Identification dataset and models. - Lea Fischbach, Akbar Karimi, Caroline Kleen, Alfred Lameli, Lucie Flek:

Improving Low-Resource Dialect Classification Using Retrieval-based Voice Conversion. - Phoebe Parsons, Heming Strømholt Bremnes, Knut Kvale, Torbjørn Svendsen, Giampiero Salvi:

Effects of Prosodic Information on Dialect Classification Using Whisper Features. - Badr M. Abdullah, Matthew Baas, Bernd Möbius, Dietrich Klakow:

Voice Conversion Improves Cross-Domain Robustness for Spoken Arabic Dialect Identification.
Connecting Speech Science and Speech Technology for Children's Speech
- Xulin Fan, Jialu Li, Mark Hasegawa-Johnson, Nancy L. McElwain:

Band-Split Self-supervised Mamba for Infant-centered Audio Analysis. - Nina R. Benway, Saba Tabatabaee, Benjamin Munson, Jonathan Preston, Carol Y. Espy-Wilson:

Subtyping Speech Errors in Childhood Speech Sound Disorders with Acoustic-to-Articulatory Speech Inversion. - Amanda Eads, Heather Kabakoff, Nina Benway, Elaine Hitchcock, Jonathan L. Preston, Tara McAllister:

PERCEPT-US: A Multimodal American English Child Speech Corpus Specialized for Articulatory Feedback. - Ajinkya Kulkarni, Francisco Teixeira, Enno Hermann, Thomas Rolland, Isabel Trancoso, Mathew Magimai-Doss:

Children's Voice Privacy: First Steps and Emerging Challenges. - Saba Tabatabaee, Jing Liu, Carol Y. Espy-Wilson:

FT-Boosted SV: Towards Noise Robust Speaker Verification for English Speaking Classroom Environments. - Zhonghao Shi, Xuan Shi, Anfeng Xu, Tiantian Feng, Harshvardhan Srivastava, Shrikanth Narayanan, Maja J. Mataric:

Examining Test-Time Adaptation for Personalized Child Speech Recognition. - Theo Zhang, Madurya Suresh, Anne Warlaumont, Kasia Hitczenko, Alejandrina Cristià, Margaret Cychosz:

Employing self-supervised learning models for cross-linguistic child speech maturity classification. - Ankita, Shambhavi, Syed Shahnawazuddin:

On Enhancing the Performance of Children's ASR Task in Limited Data Scenario. - Tiantian Feng, Anfeng Xu, Xuan Shi, Somer Bishop, Shrikanth Narayanan:

Egocentric Speaker Classification in Child-Adult Dyadic Interactions: From Sensing to Computational Modeling. - Anfeng Xu, Tiantian Feng, So Hyun Kim, Somer Bishop, Catherine Lord, Shrikanth Narayanan:

Large Language Models based ASR Error Correction for Child Conversations. - Tarek Kunze, Marianne Métais, Hadrien Titeux, Lucas Elbert, Joseph Coffey, Emmanuel Dupoux, Alejandrina Cristià, Marvin Lavechin:

Challenges in Automated Processing of Speech from Child Wearables: The Case of Voice Type Classifier. - Lingyun Gao, Cristian Tejedor García, Catia Cucchiarini, Helmer Strik:

Improving Child Speech Recognition and Reading Mistake Detection by Using Prompts. - Jazmín Vidal, Luciana Ferrer, Juan Esteban Kamienkowski, Pablo Riera:

Improving Automatic Speech Recognition for Children's Reading Assessment with Disfluency-aware Language Models. - Sneha Raman, Preeti Rao:

Oral Reading Errors by Grade 3 Children in Indian Schools: A Hindi-English Perspective. - Christopher Gebauer, Lars Rumberg, Lars Köhn, Hanna Ehlert, Edith Beaulac, Jörn Ostermann:

Grammatical Error Detection on Spontaneous Children's Speech Using Iterative Pseudo Labeling. - Koharu Horii, Naohiro Tawara, Atsunori Ogawa, Shoko Araki:

Why is children's ASR so difficult? Analyzing children's phonological error patterns using SSL-based phoneme recognizers. - Darline Monika Marx, Marco Matassoni, Alessio Brutti:

Automatic detection of speech sound disorders in German-speaking children: augmenting the data with typically developed speech. - Edem Ahadzi, Vishwanath Pratap Singh, Tomi Kinnunen, Ville Hautamäki:

Continuous Learning for Children's ASR: Overcoming Catastrophic Forgetting with Elastic Weight Consolidation and Synaptic Intelligence. - Thomas Rolland, Alberto Abad:

Exploring Shared-Weight Mechanisms in Transformer and Conformer Architectures for Automatic Speech Recognition. - Karen Rosero, Ali N. Salman, Shreeram Suresh Chandra, Berrak Sisman, Cortney Van't Slot, Alex A. Kane, Rami R. Hallac, Carlos Busso:

Advancing Pediatric ASR: The Role of Voice Generation in Disordered Speech. - Natarajan Balaji Shankar, Zilai Wang, Kaiyuan Zhang, Mohan Shi, Abeer Alwan:

CHSER: A Dataset and Case Study on Generative Speech Error Correction for Child ASR. - Vishwanath Pratap Singh, Md. Sahidullah, Tomi Kinnunen:

Causal Structure Discovery for Error Diagnostics of Children's ASR.
Brain and Cognition
- Omer Moussa, Mariya Toneva:

Brain-tuned Speech Models Better Reflect Speech Processing Stages in the Brain. - Rini A. Sharon, Hema A. Murthy:

Enhancing Syllabic Recognition via Speech-EEG Phase Analysis and Non-Activity State Modeling. - Saravanakumar Duraisamy, Maurice Rekrut, Luis A. Leiva:

Functional Connectivity and Hilbert-Based Features for Covert Speech EEG Variability Analysis and Classification. - Siavash Shams, Richard J. Antonello, Gavin Mischler, Stephan Bickel, Ashesh D. Mehta, Nima Mesgarani:

Neuro2Semantic: A Transfer Learning Framework for Semantic Reconstruction of Continuous Language from Human Intracranial EEG. - Gabriel Ivucic, Saurav Pahuja, Dashanka De Silva, Tanja Schultz:

Selective Auditory Attention Decoding in Naturalistic Conversations Using EEG-Based Speech Envelope Tracking in Multi-Speaker Environments. - Mohammed Salah Al-Radhi, Géza Németh, Branislav Gerazov:

MiSTR: Multi-Modal iEEG-to-Speech Synthesis with Transformer-Based Prosody Prediction and Neural Phase Reconstruction.
Regional, Social and Diachronic Variation
- Gustavo Silveira, Aviad Albert, Martine Grice:

Probing Prosodic Differences Between Two Regional Varieties of Brazilian Portuguese. - Gilly Marchini, Jeremy Steffman:

Data-driven approaches to pitch modelling in two Mexican Spanish ethnolects: K-means Clustering & GAMMs. - Anisia Popescu, Lori Lamel, Marc Evrard, Ioana Vasilescu:

Tracking /r/ Deletion: Forced Alignment of Pronunciation Variants and Sociophonetic Insights into Post-Obstruent Final /r/ in French. - Lilian von Bressensdorf, Pia Greca, Jonathan Harrington:

Agent-based modelling, sound change, and metaphony in Southern Italian varieties of Italo-Romance. - John McGahay:

Modeling Vowel System Typology Using Iterated Confusion Minimization. - Bingliang Zhao, Xiyu Wu:

Investigating Glottal Stop Coda Loss During Sound Change of Checked Syllables Based on Speech-EGG Voice Offset Alignment.
Speaker Extraction 2
- Thomas Serre, Mathieu Fontaine, Eric Benhaim, Slim Essid:

MTSE: Multi-Target Speaker Extraction for Conversation Scenarios. - Daniel-José Alcala Padilla, Nils L. Westhausen, Swati Vivekananthan, Bernd T. Meyer:

Location-Aware Target Speaker Extraction for Hearing Aids. - Shengkui Zhao, Zexu Pan, Bin Ma:

ClearerVoice-Studio: Bridging Advanced Speech Processing Research and Practical Deployment. - Cheng Yu, Vahid Ahmadi Kalkhorani, Buye Xu, DeLiang Wang:

Online AV-CrossNet: a Causal and Efficient Audiovisual System for Speech Enhancement and Target Speaker Extraction. - Jakob Kienegger, Timo Gerkmann:

Steering Deep Non-Linear Spatially Selective Filters for Weakly Guided Extraction of Moving Speakers in Dynamic Scenarios.
Multimodal Emotion Recognition
- Haoqin Sun, Jingguang Tian, Jiaming Zhou, Hui Wang, Jiabei He, Shiwan Zhao, Xiangyu Kong, Desheng Hu, Xinkang Xu, Xinhui Hu, Yong Qin:

RA-CLAP: Relation-Augmented Emotional Speaking Style Contrastive Language-Audio Pretraining For Speech Retrieval. - Shreeram Suresh Chandra, Lucas Goncalves, Junchen Lu, Carlos Busso, Berrak Sisman:

EmotionRankCLAP: Bridging Natural Language Speaking Styles and Ordinal Speech Emotion via Rank-N-Contrast. - Georgios Chochlakis, Turab Iqbal, Woo Hyun Kang, Zhaocheng Huang:

Modality-Agnostic Multimodal Emotion Recognition using a Contrastive Masked Autoencoder. - Maxim Markitantov, Elena Ryumina, Heysem Kaya, Alexey Karpov:

Multi-Modal Multi-Task Affective States Recognition Based on Label Encoder Fusion.
Conversation, Communication and Interaction 2
- Mikey Elmers, Koji Inoue, Divesh Lala, Tatsuya Kawahara:

Triadic Multi-party Voice Activity Projection for Turn-taking in Spoken Dialogue Systems.