


25th Interspeech 2024: Kos, Greece
- Itshak Lapidot, Sharon Gannot: 25th Annual Conference of the International Speech Communication Association, Interspeech 2024, Kos, Greece, September 1-5, 2024. ISCA 2024
Keynote 1: ISCA Medallist
- Isabel Trancoso: Towards Responsible Speech Processing.
L2 Speech, Bilingualism and Code-Switching
- Sarah Wesolek, Piotr Gulgowski, Joanna Blaszczak, Marzena Zygis: The influence of L2 accent strength and different error types on personality trait ratings.
- Jie Chi, Electra Wallington, Peter Bell: Characterizing code-switching: Applying Linguistic Principles for Metric Assessment and Development.
- Wei Xue, Ivan Yuen, Bernd Möbius: Towards a better understanding of receptive multilingualism: listening conditions and priming effects.
- Debasish Ray Mohapatra, Victor Zappi, Sidney Fels: 2.5D Vocal Tract Modeling: Bridging Low-Dimensional Efficiency with 3D Accuracy.
Speaker Diarization 1
- Anurag Chowdhury, Abhinav Misra, Mark C. Fuhs, Monika Woszczyna: Investigating Confidence Estimation Measures for Speaker Diarization.
- Xiang Li, Vivek Govindan, Rohit Paturi, Sundararajan Srinivasan: Speakers Unembedded: Embedding-free Approach to Long-form Neural Diarization.
- Yiling Huang, Weiran Wang, Guanlong Zhao, Hank Liao, Wei Xia, Quan Wang: On the Success and Limitations of Auxiliary Network Based Word-Level End-to-End Neural Speaker Diarization.
- Marc Härkönen, Samuel J. Broughton, Lahiru Samarakoon: EEND-M2F: Masked-attention mask transformers for speaker diarization.
- Yongkang Yin, Xu Li, Ying Shan, Yuexian Zou: AFL-Net: Integrating Audio, Facial, and Lip Modalities with a Two-step Cross-attention for Robust Speaker Diarization in the Wild.
- Arunav Arya, Murtiza Ali, Karan Nathwani: Exploiting Wavelet Scattering Transform for an Unsupervised Speaker Diarization in Deep Neural Network Framework.
Speech and Audio Analysis and Representations
- Hang Zhao, Yifei Xin, Zhesong Yu, Bilei Zhu, Lu Lu, Zejun Ma: MINT: Boosting Audio-Language Model via Multi-Target Pre-Training and Instruction Tuning.
- Daisuke Niizumi, Daiki Takeuchi, Yasunori Ohishi, Noboru Harada, Masahiro Yasuda, Shunsuke Tsubaki, Keisuke Imoto: M2D-CLAP: Masked Modeling Duo Meets CLAP for Learning General-purpose Audio-Language Representation.
- Yusuke Fujita, Tatsuya Komatsu: Audio Fingerprinting with Holographic Reduced Representations.
- David Meyer, Eitan Abecassis, Clara Fernandez-Labrador, Christopher Schroers: RAST: A Reference-Audio Synchronization Tool for Dubbed Content.
- Xuefei Li, Hao Huang, Ying Hu, Liang He, Jiabao Zhang, Yuyi Wang: YOLOPitch: A Time-Frequency Dual-Branch YOLO Model for Pitch Estimation.
- Asad Ullah, Alessandro Ragano, Andrew Hines: Reduce, Reuse, Recycle: Is Perturbed Data Better than Other Language Augmentation for Low Resource Self-Supervised Speech Models.
- Jaden Pieper, Stephen Voran: AlignNet: Learning dataset score alignment functions to enable better training of speech quality estimators.
Acoustic Event Detection and Classification 2
- Dawei Liang, Alice Zhang, David Harwath, Edison Thomaz: Improving Audio Classification with Low-Sampled Microphone Input: An Empirical Study Using Model Self-Distillation.
- Da Mu, Zhicheng Zhang, Haobo Yue: MFF-EINV2: Multi-scale Feature Fusion across Spectral-Spatial-Temporal Domains for Sound Event Localization and Detection.
- Hyeonuk Nam, Seong-Hu Kim, Deokki Min, Junhyeok Lee, Yong-Hwa Park: Diversifying and Expanding Frequency-Adaptive Convolution Kernels for Sound Event Detection.
- Tuan Vu Ho, Kota Dohi, Yohei Kawaguchi: Stream-based Active Learning for Anomalous Sound Detection in Machine Condition Monitoring.
- Anbai Jiang, Bing Han, Zhiqiang Lv, Yufeng Deng, Wei-Qiang Zhang, Xie Chen, Yanmin Qian, Jia Liu, Pingyi Fan: AnoPatch: Towards Better Consistency in Machine Anomalous Sound Detection.
- Zeyu Xie, Baihan Li, Xuenan Xu, Zheng Liang, Kai Yu, Mengyue Wu: FakeSound: Deepfake General Audio Detection.
- Shabnam Ghaffarzadegan, Luca Bondi, Wei-Cheng Lin, Abinaya Kumar, Ho-Hsiang Wu, Hans-Georg Horst, Samarjit Das: Sound of Traffic: A Dataset for Acoustic Traffic Identification and Counting.
Detection and Classification of Bioacoustic Signals
- Sahil Kumar, Jialu Li, Youshan Zhang: Vision Transformer Segmentation for Visual Bird Sound Denoising.
- Xin Jing, Luyang Zhang, Jiangjian Xie, Alexander Gebhard, Alice Baird, Björn W. Schuller: DB3V: A Dialect Dominated Dataset of Bird Vocalisation for Cross-corpus Bird Species Recognition.
- Jules Cauzinille, Benoît Favre, Ricard Marxer, Dena J. Clink, Abdul Hamid Ahmad, Arnaud Rey: Investigating self-supervised speech models' ability to classify animal vocalizations: The case of gibbon's vocal signatures.
- Xihang Qiu, Lixian Zhu, Zikai Song, Zeyu Chen, Haojie Zhang, Kun Qian, Ye Zhang, Bin Hu, Yoshiharu Yamamoto, Björn W. Schuller: Study Selectively: An Adaptive Knowledge Distillation based on a Voting Network for Heart Sound Classification.
- Jie Lin, Xiuping Yang, Li Xiao, Xinhong Li, Weiyan Yi, Yuhong Yang, Weiping Tu, Xiong Chen: SimuSOE: A Simulated Snoring Dataset for Obstructive Sleep Apnea-Hypopnea Syndrome Evaluation during Wakefulness.
Acoustic Echo Cancellation
- Premanand Nayak, Kamini Sabu, M. Ali Basha Shaik: Multi-mic Echo Cancellation Coalesced with Beamforming for Real World Adverse Acoustic Conditions.
- Vahid Khanagha, Dimitris Koutsaidis, Kaustubh Kalgaonkar, Sriram Srinivasan: Interference Aware Training Target for DNN based joint Acoustic Echo Cancellation and Noise Suppression.
- Yi Gao, Xiang Su: Low Complexity Echo Delay Estimator Based on Binarized Feature Matching.
- Ye Ni, Cong Pang, Chengwei Huang, Cairong Zou: MSA-DPCRN: A Multi-Scale Asymmetric Dual-Path Convolution Recurrent Network with Attentional Feature Fusion for Acoustic Echo Cancellation.
- Ofer Schwartz, Sharon Gannot: Efficient Joint Beamforming and Acoustic Echo Cancellation Structure for Conference Call Scenarios.
- Fei Zhao, Jinjiang Liu, Xueliang Zhang: SDAEC: Signal Decoupling for Advancing Acoustic Echo Cancellation.
Speech Synthesis: Voice Conversion 1
- Kentaro Seki, Shinnosuke Takamichi, Norihiro Takamune, Yuki Saito, Kanami Imamura, Hiroshi Saruwatari: Spatial Voice Conversion: Voice Conversion Preserving Spatial Information and Non-target Signals.
- Alan Baade, Puyuan Peng, David Harwath: Neural Codec Language Models for Disentangled and Textless Voice Conversion.
- Max Morrison, Cameron Churchwell, Nathan Pruyne, Bryan Pardo: Fine-Grained and Interpretable Neural Speech Editing.
- Takuhiro Kaneko, Hirokazu Kameoka, Kou Tanaka, Yuto Kondo: FastVoiceGrad: One-step Diffusion-Based Voice Conversion with Adversarial Conditional Diffusion Distillation.
- Ziqian Ning, Shuai Wang, Pengcheng Zhu, Zhichao Wang, Jixun Yao, Lei Xie, Mengxiao Bi: DualVC 3: Leveraging Language Model Generated Pseudo Context for End-to-end Low Latency Streaming Voice Conversion.
- Tianhua Qi, Shiyan Wang, Cheng Lu, Yan Zhao, Yuan Zong, Wenming Zheng: Towards Realistic Emotional Voice Conversion using Controllable Emotional Intensity.
Neural Network Architectures for ASR 2
- Yu Nakagome, Michael Hentschel: InterBiasing: Boost Unseen Word Recognition through Biasing Intermediate Predictions.
- Qinglin Meng, Min Liu, Kaixun Huang, Kun Wei, Lei Xie, Zongfeng Quan, Weihong Deng, Quan Lu, Ning Jiang, Guoqing Zhao: SEQ-former: A context-enhanced and efficient automatic speech recognition framework.
- Robert Flynn, Anton Ragni: How Much Context Does My Attention-Based ASR System Need?
- Vincenzo Norman Vitale, Loredana Schettino, Francesco Cutugno: Rich speech signal: exploring and exploiting end-to-end automatic speech recognizers' ability to model hesitation phenomena.
- Tian-Hao Zhang, Xinyuan Qian, Feng Chen, Xu-Cheng Yin: Transmitted and Aggregated Self-Attention for Automatic Speech Recognition.
- Darshan Prabhu, Yifan Peng, Preethi Jyothi, Shinji Watanabe: MULTI-CONVFORMER: Extending Conformer with Multiple Convolution Kernels.
- Koichi Miyazaki, Yoshiki Masuyama, Masato Murata: Exploring the Capability of Mamba in Speech Applications.
- Genshun Wan, Mengzhi Wang, Tingzhi Mao, Hang Chen, Zhongfu Ye: Lightweight Transducer Based on Frame-Level Criterion.
- Ankit Gupta, George Saon, Brian Kingsbury: Exploring the limits of decoder-only models trained on public speech recognition corpora.
- Xun Gong, Anqi Lv, Zhiming Wang, Yanmin Qian: Contextual Biasing Speech Recognition in Speech-enhanced Large Language Model.
Decoding Algorithms
- Tianzi Wang, Xurong Xie, Zhaoqing Li, Shoukang Hu, Zengrui Jin, Jiajun Deng, Mingyu Cui, Shujie Hu, Mengzhe Geng, Guinan Li, Helen Meng, Xunying Liu: Towards Effective and Efficient Non-autoregressive Decoding Using Block-based Attention Mask.
- Kun Zou, Fengyun Tan, Ziyang Zhuang, Chenfeng Miao, Tao Wei, Shaodan Zhai, Zijian Li, Wei Hu, Shaojun Wang, Jing Xiao: E-Paraformer: A Faster and Better Parallel Transformer for Non-autoregressive End-to-End Mandarin Speech Recognition.
- Martino Ciaperoni, Athanasios Katsamanis, Aristides Gionis, Panagiotis Karras: Beam-search SIEVE for low-memory speech recognition.
- Daniel Galvez, Vladimir Bataev, Hainan Xu, Tim Kaldewey: Speed of Light Exact Greedy Decoding for RNN-T Speech Recognition Models on GPU.
- Weiran Wang, Zelin Wu, Diamantino Caseiro, Tsendsuren Munkhdalai, Khe Chai Sim, Pat Rondon, Golan Pundak, Gan Song, Rohit Prabhavalkar, Zhong Meng, Ding Zhao, Tara Sainath, Yanzhang He, Pedro Moreno Mengibar: Contextual Biasing with the Knuth-Morris-Pratt Matching Algorithm.
- Tatsunari Takagi, Yukoh Wakabayashi, Atsunori Ogawa, Norihide Kitaoka: Text-only Domain Adaptation for CTC-based Speech Recognition through Substitution of Implicit Linguistic Information in the Search Space.
Pronunciation Assessment
- Xintong Wang, Mingqian Shi, Ye Wang: Pitch-Aware RNN-T for Mandarin Chinese Mispronunciation Detection and Diagnosis.
- Yu-Wen Chen, Zhou Yu, Julia Hirschberg: MultiPA: A Multi-task Speech Pronunciation Assessment Model for Open Response Scenarios.
- Xinwei Cao, Zijian Fan, Torbjørn Svendsen, Giampiero Salvi: A Framework for Phoneme-Level Pronunciation Assessment Using CTC.
- Mostafa Shahin, Beena Ahmed: Phonological-Level Mispronunciation Detection and Diagnosis.
- Heejin Do, Wonjun Lee, Gary Geunbae Lee: Acoustic Feature Mixup for Balanced Multi-aspect Pronunciation Assessment.
- Nhan Phan, Anna von Zansen, Maria Kautonen, Ekaterina Voskoboinik, Tamás Grósz, Raili Hildén, Mikko Kurimo: Automated content assessment and feedback for Finnish L2 learners in a picture description speaking task.
Spoken Language Processing
- Zhenyu Wang, Shuyu Kong, Li Wan, Biqiao Zhang, Yiteng Huang, Mumin Jin, Ming Sun, Xin Lei, Zhaojun Yang: Query-by-Example Keyword Spotting Using Spectral-Temporal Graph Attentive Pooling and Multi-Task Learning.
- Youngmoon Jung, Seungjin Lee, Joon-Young Yang, Jaeyoung Roh, Chang Woo Han, Hoonyoung Cho: Relational Proxy Loss for Audio-Text based Keyword Spotting.
- Sichen Jin, Youngmoon Jung, Seungjin Lee, Jaeyoung Roh, Changwoo Han, Hoonyoung Cho: CTC-aligned Audio-Text Embedding for Streaming Open-vocabulary Keyword Spotting.
- Haoyu Li, Baochen Yang, Yu Xi, Linfeng Yu, Tian Tan, Hao Li, Kai Yu: Text-aware Speech Separation for Multi-talker Keyword Spotting.
- Hao Yen, Pin-Jui Ku, Sabato Marco Siniscalchi, Chin-Hui Lee: Language-Universal Speech Attributes Modeling for Zero-Shot Multilingual Spoken Keyword Recognition.
- Raul Monteiro: Adding User Feedback To Enhance CB-Whisper.
- Yifan Peng, Jinchuan Tian, William Chen, Siddhant Arora, Brian Yan, Yui Sudo, Muhammad Shakeel, Kwanghee Choi, Jiatong Shi, Xuankai Chang, Jee-weon Jung, Shinji Watanabe: OWSM v3.1: Better and Faster Open Whisper-Style Speech Models based on E-Branchformer.
Spoken Machine Translation 2
- Nan Chen, Yonghe Wang, Feilong Bao: Parameter-Efficient Adapter Based on Pre-trained Models for Speech Translation.
- Badr M. Abdullah, Mohammed Maqsood Shaik, Dietrich Klakow: Wave to Interlingua: Analyzing Representations of Multilingual Speech Transformers for Spoken Language Translation.
- Nan Chen, Yonghe Wang, Feilong Bao: Knowledge-Preserving Pluggable Modules for Multilingual Speech Translation Tasks.
- Rastislav Rabatin, Frank Seide, Ernie Chang: Navigating the Minefield of MT Beam Search in Cascaded Streaming Speech Translation.
- Peidong Wang, Jian Xue, Jinyu Li, Jun-Kun Chen, Aswin Shanmugam Subramanian: Soft Language Identification for Language-Agnostic Many-to-One End-to-End Speech Translation.
- Dan Oneata, Herman Kamper: Translating speech with just images.
- Sameer Khurana, Chiori Hori, Antoine Laurent, Gordon Wichern, Jonathan Le Roux: ZeroST: Zero-Shot Speech Translation.
Biosignal-enabled Spoken Communication
- Jinyu Li, Leonardo Lancia: A multimodal approach to study the nature of coordinative patterns underlying speech rhythm.
- Peter Wu, Ryan Kaveh, Raghav Nautiyal, Christine Zhang, Albert Guo, Anvitha Kachinthaya, Tavish Mishra, Bohan Yu, Alan W. Black, Rikky Muller, Gopala Krishna Anumanchipalli: Towards EMG-to-Speech with Necklace Form Factor.
- Chris Bras, Tanvina Patel, Odette Scharenborg: Using articulated speech EEG signals for imagined speech decoding.
- Jinuk Kwon, David Harwath, Debadatta Dash, Paul Ferrari, Jun Wang: Direct Speech Synthesis from Non-Invasive, Neuromagnetic Signals.
- Yudong Yang, Rongfeng Su, Rukiye Ruzi, Manwa L. Ng, Shaofeng Zhao, Nan Yan, Lan Wang: Optical Flow Guided Tongue Trajectory Generation for Diffusion-based Acoustic to Articulatory Inversion.
- Rishi Jain, Bohan Yu, Peter Wu, Tejas S. Prabhune, Gopala Anumanchipalli: Multimodal Segmentation for Vocal Tract Modeling.
- Jesuraj Bandekar, Sathvik Udupa, Prasanta Kumar Ghosh: Articulatory synthesis using representations learnt through phonetic label-aware contrastive loss.
- Yujie Yan, Xiran Xu, Haolin Zhu, Pei Tian, Zhongshu Ge, Xihong Wu, Jing Chen: Auditory Attention Decoding in Four-Talker Environment with EEG.
- Zijie Lin, Tianyu He, Siqi Cai, Haizhou Li: ASA: An Auditory Spatial Attention Dataset with Multiple Speaking Locations.
- Saurav Pahuja, Gabriel Ivucic, Pascal Himmelmann, Siqi Cai, Tanja Schultz, Haizhou Li: Leveraging Graphic and Convolutional Neural Networks for Auditory Attention Detection with EEG.
Individual and Social Factors in Phonetics
- Tillmann Pistor, Adrian Leemann: Echoes of Implicit Bias: Exploring Aesthetics and Social Meanings of Swiss German Dialect Features.
- Vivian Guo Li: In search of structure and correspondence in intra-speaker trial-to-trial variability.
- Irene Smith, Morgan Sonderegger, Spade Consortium: Modelled Multivariate Overlap: A method for measuring vowel merger.
- Keiko Ochi, Koji Inoue, Divesh Lala, Tatsuya Kawahara: Entrainment Analysis and Prosody Prediction of Subsequent Interlocutor's Backchannels in Dialogue.
- James Tanner, Morgan Sonderegger, Jane Stuart-Smith, Tyler Kendall, Jeff Mielke, Robin Dodsworth, Erik Thomas: Exploring the anatomy of articulation rate in spontaneous English speech: relationships between utterance length effects and social factors.
- Katelyn Taylor, Amelia Jane Gully, Helena Daffern: Familiar and Unfamiliar Speaker Identification in Speech and Singing.
Paralinguistics
- Luis Felipe Parra-Gallego, Tilak Purohit, Bogdan Vlasenko, Juan Rafael Orozco-Arroyave, Mathew Magimai-Doss: Cross-transfer Knowledge between Speech and Text Encoders to Evaluate Customer Satisfaction.
- Manila Kodali, Sudarsana Reddy Kadiri, Paavo Alku: Fine-tuning of Pre-trained Models for Classification of Vocal Intensity Category from Speech Signals.
- Alexander Kathan, Martin Bürger, Andreas Triantafyllopoulos, Sabrina Milkus, Jonas Hohmann, Pauline Muderlak, Jürgen Schottdorf, Richard Musil, Björn W. Schuller, Shahin Amiriparian: Real-world PTSD Recognition: A Cross-corpus and Cross-linguistic Evaluation.
- Debasmita Bhattacharya, Eleanor Lin, Run Chen, Julia Hirschberg: Switching Tongues, Sharing Hearts: Identifying the Relationship between Empathy and Code-switching in Speech.
Speaker Recognition: Adversarial and Spoofing Attacks
- Eros Rosello, Angel M. Gomez, Iván López-Espejo, Antonio M. Peinado, Juan M. Martín-Doñas: Anti-spoofing Ensembling Model: Dynamic Weight Allocation in Ensemble Models for Improved Voice Biometrics Security.
- Lin Zhang, Xin Wang, Erica Cooper, Mireia Díez, Federico Landini, Nicholas W. D. Evans, Junichi Yamagishi: Spoof Diarization: "What Spoofed When" in Partially Spoofed Audio.
- Haochen Wu, Wu Guo, Zhentao Zhang, Wenting Zhao, Shengyu Peng, Jie Zhang: Spoofing Speech Detection by Modeling Local Spectro-Temporal and Long-term Dependency.
- Jingze Lu, Yuxiang Zhang, Zhuo Li, Zengqiang Shang, Wenchao Wang, Pengyuan Zhang: Improving Copy-Synthesis Anti-Spoofing Training Method with Rhythm and Speaker Perturbation.
- Yip Keng Kan, Ke Xu, Hao Li, Jie Shi: VoiceDefense: Protecting Automatic Speaker Verification Models Against Black-box Adversarial Attacks.
- Xuanjun Chen, Jiawei Du, Haibin Wu, Jyh-Shing Roger Jang, Hung-yi Lee: Neural Codec-based Adversarial Sample Detection for Speaker Verification.
- Sizhou Chen, Yibo Bai, Jiadi Yao, Xiao-Lei Zhang, Xuelong Li: Textual-Driven Adversarial Purification for Speaker Verification.
- Zhuhai Li, Jie Zhang, Wu Guo, Haochen Wu: Boosting the Transferability of Adversarial Examples with Gradient-Aligned Ensemble Attack for Speaker Recognition.
- Duc-Tuan Truong, Ruijie Tao, Tuan Nguyen, Hieu-Thi Luong, Kong Aik Lee, Eng Siong Chng: Temporal-Channel Modeling in Multi-head Self-Attention for Synthetic Speech Detection.
Audio Event Detection and Classification 1
- Tiantian Feng, Dimitrios Dimitriadis, Shrikanth S. Narayanan: Can Synthetic Audio From Generative Foundation Models Assist Audio Recognition and Speech Modeling?
- Heinrich Dinkel, Zhiyong Yan, Yongqing Wang, Junbo Zhang, Yujun Wang, Bin Wang: Scaling up masked audio encoder learning for general audio classification.
- Sarthak Yadav, Zheng-Hua Tan: Audio Mamba: Selective State Spaces for Self-Supervised Audio Representations.
- Pengfei Cai, Yan Song, Kang Li, Haoyu Song, Ian McLoughlin: MAT-SED: A Masked Audio Transformer with Masked-Reconstruction Based Pre-training for Sound Event Detection.
- Janek Ebbers, François G. Germain, Gordon Wichern, Jonathan Le Roux: Sound Event Bounding Boxes.
- Yanxiong Li, Jiaxin Tan, Guoqing Chen, Jialong Li, Yongjie Si, Qianhua He: Low-Complexity Acoustic Scene Classification Using Parallel Attention-Convolution Network.
Source Separation 2
- Hassan Taherian, Vahid Ahmadi Kalkhorani, Ashutosh Pandey, Daniel Wong, Buye Xu, DeLiang Wang: Towards Explainable Monaural Speaker Separation with Auditory-based Training.
- Iva Ewert, Marvin Borsdorf, Haizhou Li, Tanja Schultz: Does the Lombard Effect Matter in Speech Separation? Introducing the Lombard-GRID-2mix Dataset.
- Zexu Pan, Gordon Wichern, François G. Germain, Kohei Saijo, Jonathan Le Roux: PARIS: Pseudo-AutoRegressIve Siamese Training for Online Speech Separation.
- Yiru Zhang, Linyu Yao, Qun Yang: OR-TSE: An Overlap-Robust Speaker Encoder for Target Speech Extraction.
- Tsun-An Hsieh, Heeyoul Choi, Minje Kim: Multimodal Representation Loss Between Timed Text and Audio for Regularized Speech Separation.
- Jingru Lin, Meng Ge, Junyi Ao, Liqun Deng, Haizhou Li: SA-WavLM: Speaker-Aware Self-Supervised Pre-training for Mixture Speech.
- Yiwen Wang, Xihong Wu: TSE-PI: Target Sound Extraction under Reverberant Environments with Pitch Information.
- Kohei Saijo, Gordon Wichern, François G. Germain, Zexu Pan, Jonathan Le Roux: Enhanced Reverberation as Supervision for Unsupervised Speech Separation.
Noise Reduction, Dereverberation, and Echo Cancellation
- Fei Zhao, Chenggang Zhang, Shulin He, Jinjiang Liu, Xueliang Zhang: Deep Echo Path Modeling for Acoustic Echo Cancellation.
- Hongmei Guo, Yijiang Chen, Xiaolei Zhang, Xuelong Li: Graph Attention Based Multi-Channel U-Net for Speech Dereverberation With Ad-Hoc Microphone Arrays.
- Louis Bahrman, Mathieu Fontaine, Jonathan Le Roux, Gaël Richard: Speech dereverberation constrained on room impulse response characteristics.
- Kuang Yuan, Shuo Han, Swarun Kumar, Bhiksha Raj: DeWinder: Single-Channel Wind Noise Reduction using Ultrasound Sensing.
- Alexander Barnhill, Elmar Nöth, Andreas K. Maier, Christian Bergler: ANIMAL-CLEAN - A Deep Denoising Toolkit for Animal-Independent Signal Enhancement.
- Premanand Nayak, M. Ali Basha Shaik: Elucidating Clock-drift Using Real-world Audios In Wireless Mode For Time-offset Insensitive End-to-End Asynchronous Acoustic Echo Cancellation.
- Shilin Wang, Haixin Guan, Yanhua Long: QMixCAT: Unsupervised Speech Enhancement Using Quality-guided Signal Mixing and Competitive Alternating Model Training.
Computationally-Efficient Speech Enhancement
- Hanbin Bae, Pavel Andreev, Azat Saginbaev, Nicholas Babaev, Won-Jun Lee, Hosang Sung, Hoon-Young Cho: Speech Boosting: Low-Latency Live Speech Enhancement for TWS Earbuds.
- Behnam Gholami, Mostafa El-Khamy, Kee-Bong Song: Knowledge Distillation for Tiny Speech Enhancement with Latent Feature Augmentation.
- Yuewei Zhang, Huanbin Zou, Jie Zhu: Sub-PNWR: Speech Enhancement Based on Signal Sub-Band Splitting and Pseudo Noisy Waveform Reconstruction Loss.
- Zugang Zhao, Jinghong Zhang, Yonghui Liu, Jianbing Liu, Kai Niu, Zhiqiang He: Streamlining Speech Enhancement DNNs: an Automated Pruning Method Based on Dependency Graph with Advanced Regularized Loss Strategies.
- Zehua Zhang, Xuyi Zhuang, Yukun Qian, Mingjiang Wang: Lightweight Dynamic Sparse Transformer for Monaural Speech Enhancement.
- Zizhen Lin, Xiaoting Chen, Junyu Wang: MUSE: Flexible Voiceprint Receptive Fields and Multi-Path Fusion Enhanced Taylor Transformer for U-Net-based Speech Enhancement.
- Longbiao Cheng, Ashutosh Pandey, Buye Xu, Tobi Delbruck, Shih-Chii Liu: Dynamic Gated Recurrent Neural Network for Compute-efficient Speech Enhancement.
Zero-shot TTS
- Jinlong Xue, Yayue Deng, Yicheng Han, Yingming Gao, Ya Li: Improving Audio Codec-based Zero-Shot Text-to-Speech Synthesis with Multi-Modal Context and Large Language Model.
- Xiaofei Wang, Sefik Emre Eskimez, Manthan Thakker, Hemin Yang, Zirun Zhu, Min Tang, Yufei Xia, Jinzhu Li, Sheng Zhao, Jinyu Li, Naoyuki Kanda: An Investigation of Noise Robustness for Flow-Matching-Based Zero-Shot TTS.
- Kenichi Fujita, Takanori Ashihara, Marc Delcroix, Yusuke Ijima: Lightweight Zero-shot Text-to-Speech with Mixture of Adapters.
- Vikentii Pankov, Valeria Pronina, Alexander Kuzmin, Maksim Borisov, Nikita Usoltsev, Xingshan Zeng, Alexander Golubkov, Nikolai Ermolenko, Aleksandra Shirshova, Yulia Matveeva: DINO-VITS: Data-Efficient Zero-Shot TTS with Self-Supervised Speaker Verification Loss for Noise Robustness.
Noise Robustness, Far-Field, and Multi-Talker ASR
- Zengrui Jin, Yifan Yang, Mohan Shi, Wei Kang, Xiaoyu Yang, Zengwei Yao, Fangjun Kuang, Liyong Guo, Lingwei Meng, Long Lin, Yong Xu, Shi-Xiong Zhang, Daniel Povey: LibriheavyMix: A 20,000-Hour Dataset for Single-Channel Reverberant Multi-Talker Speech Separation, ASR and Speaker Diarization.
- Xujiang Xing, Mingxing Xu, Thomas Fang Zheng: A Joint Noise Disentanglement and Adversarial Training Framework for Robust Speaker Verification.
- Ying Shi, Lantian Li, Shi Yin, Dong Wang, Jiqing Han: Serialized Output Training by Learned Dominance.
- Xianrui Zheng, Guangzhi Sun, Chao Zhang, Philip C. Woodland: SOT Triggered Neural Clustering for Speaker Attributed ASR.
- Yoshiaki Bando, Tomohiko Nakamura, Shinji Watanabe: Neural Blind Source Separation and Diarization for Distant Speech Recognition.
- Ryo Masumura, Naoki Makishima, Tomohiro Tanaka, Mana Ihori, Naotaka Kawata, Shota Orihashi, Kazutoshi Shinoda, Taiga Yamane, Saki Mizuno, Keita Suzuki, Satoshi Suzuki, Nobukatsu Hojo, Takafumi Moriya, Atsushi Ando: Unified Multi-Talker ASR with and without Target-speaker Enrollment.
Contextual Biasing and Adaptation
- Aviv Shamsian, Aviv Navon, Neta Glazer, Gill Hetz, Joseph Keshet: Keyword-Guided Adaptation of Automatic Speech Recognition.
- Nguyen Manh Tien Anh, Thach Ho Sy: Improving Speech Recognition with Prompt-based Contextualized ASR and LLM-based Re-predictor.
- Peng Wang, Yifan Yang, Zheng Liang, Tian Tan, Shiliang Zhang, Xie Chen: Incorporating Class-based Language Model for Named Entity Recognition in Factorized Neural Transducer.
- Chengxu Yang, Lin Zheng, Sanli Tian, Gaofeng Cheng, Sujie Xiao, Ta Li: Contextual Biasing with Confidence-based Homophone Detector for Mandarin End-to-End Speech Recognition.
- Ruizhe Huang, Mahsa Yarmohammadi, Sanjeev Khudanpur, Daniel Povey: Improving Neural Biasing for Contextual Speech Recognition by Early Context Injection and Text Perturbation.
- Andrei Andrusenko, Aleksandr Laptev, Vladimir Bataev, Vitaly Lavrukhin, Boris Ginsburg: Fast Context-Biasing for CTC and Transducer ASR models with CTC-based Word Spotter.
- Xizi Wei, Stephen McGregor: Prompt Tuning for Speech Recognition on Unknown Spoken Name Entities.
- Junzhe Liu, Jianwei Yu, Xie Chen: Improved Factorized Neural Transducer Model For Text-only Domain Adaptation.
- Pin-Yen Liu, Jen-Tzung Chien: Modality Translation Learning for Joint Speech-Text Model.
- Qiuming Zhao, Guangzhi Sun, Chao Zhang, Mingxing Xu, Thomas Fang Zheng: SAML: Speaker Adaptive Mixture of LoRA Experts for End-to-End ASR.
- Atsushi Ando, Takafumi Moriya, Shota Horiguchi, Ryo Masumura: Factor-Conditioned Speaking-Style Captioning.
- Yerbolat Khassanov, Zhipeng Chen, Tianfeng Chen, Tze Yuang Chong, Wei Li, Jun Zhang, Lu Lu, Yuxuan Wang: Dual-Pipeline with Low-Rank Adaptation for New Language Integration in Multilingual ASR.
- Bolaji Yusuf, Murali Karthick Baskar, Andrew Rosenberg, Bhuvana Ramabhadran: Speculative Speech Recognition by Audio-Prefixed Low-Rank Adaptation of Language Models.
- Junghun Kim, Ka Hyun Park, Hoyoung Yoon, U Kang: Domain-Aware Data Selection for Speech Classification via Meta-Reweighting.
Spoken Language Understanding
- Hayato Futami, Siddhant Arora, Yosuke Kashiwagi, Emiru Tsunoo, Shinji Watanabe: Finding Task-specific Subnetworks in Multi-task Spoken Language Understanding Model.
- Dejan Porjazovski, Anssi Moisio, Mikko Kurimo: Out-of-distribution generalisation in spoken language understanding.
- Gaëlle Laperrière, Sahar Ghannay, Bassam Jabaian, Yannick Estève: A dual task learning approach to fine-tune a multilingual semantic speech encoder for Spoken Language Understanding.
- Beomseok Lee, Ioan Calapodescu, Marco Gaido, Matteo Negri, Laurent Besacier: Speech-MASSIVE: A Multilingual Speech Dataset for SLU and Beyond.
- Yuang Li, Jiawei Yu, Min Zhang, Mengxin Ren, Yanqing Zhao, Xiaofeng Zhao, Shimin Tao, Jinsong Su, Hao Yang: Using Large Language Model for End-to-End Chinese ASR and NER.
- Alkis Koudounas, Flavio Giobergia, Eliana Pastor, Elena Baralis: A Contrastive Learning Approach to Mitigate Bias in Speech Models.
Spoken Machine Translation 1
- Chao-Wei Huang, Hui Lu, Hongyu Gong, Hirofumi Inaguma, Ilia Kulikov, Ruslan Mavlyutov, Sravya Popuri: Investigating Decoder-only Large Language Models for Speech-to-text Translation.
- Nameer Hirschkind, Xiao Yu, Mahesh Kumar Nandwana, Joseph Liu, Eloi du Bois, Dao Le, Nicolas Thiebaut, Colin Sinclair, Kyle Spence, Charles Shang, Zoë Abrams, Morgan McGuire: Diffusion Synthesizer for Efficient Multilingual Speech to Speech Translation.
- Nan Chen, Yonghe Wang, Feilong Bao: Sign Value Constraint Decomposition for Efficient 1-Bit Quantization of Speech Translation Tasks.
- Jaesong Lee, Soyoon Kim, Hanbyul Kim, Joon Son Chung: Lightweight Audio Segmentation for Long-form Speech Translation.
- Haotian Tan, Sakriani Sakti: Contrastive Feedback Mechanism for Simultaneous Speech Translation.
- Cécile Macaire, Chloé Dion, Didier Schwab, Benjamin Lecouteux, Emmanuelle Esperança-Rodier: Towards Speech-to-Pictograms Translation.
Hearing Disorders
- Seonwoo Lee, Sunhee Kim, Minhwa Chung: Automatic Assessment of Speech Production Skills for Children with Cochlear Implants Using Wav2Vec2.0 Acoustic Embeddings.
- Youngjin Ahn, Jungwoo Park, Sangha Park, Jonghyun Choi, Kee-Eung Kim: SyncVSR: Data-Efficient Visual Speech Recognition with End-to-End Crossmodal Audio Token Synchronization.
- Mark A. Huckvale, Gaston Hilkhuysen: Evaluating a 3-factor listener model for prediction of speech intelligibility to hearing-impaired listeners.
- Sophie Fagniart, Brigitte Charlier, Véronique Delvaux, Bernard Harmegnies, Anne Huberlant, Myriam Piccaluga, Kathy Huet: Production of fricative consonants in French-speaking children with cochlear implants and typical hearing: acoustic and phonological analyses.
- Toshio Irino, Shintaro Doan, Minami Ishikawa: Signal processing algorithm effective for sound quality of hearing loss simulators.
- Yixiang Niu, Ning Chen, Hongqing Zhu, Zhiying Zhu, Guangqiang Li, Yibo Chen: Auditory Spatial Attention Detection Based on Feature Disentanglement and Brain Connectivity-Informed Graph Neural Networks.
- Jessica Monaghan, Arun Sebastian, Nicky Chong-White, Vicky Zhang, Vijayalakshmi Easwar, Pádraig Kitterick: Automatic Detection of Hearing Loss from Children's Speech using wav2vec 2.0 Features.
Speech Disorders 2
- Vrushank Changawala, Frank Rudzicz: Whister: Using Whisper's representations for Stuttering detection.
- Yan Xiong, Visar Berisha, Julie Liss, Chaitali Chakrabarti: Improving Speech-Based Dysarthria Detection using Multi-task Learning with Gradient Projection.
- Guanlin Chen, Yun Jin: Cascaded Transfer Learning Strategy for Cross-Domain Alzheimer's Disease Recognition through Spontaneous Speech.
- Loukas Ilias, Dimitris Askounis: A Cross-Attention Layer coupled with Multimodal Fusion Methods for Recognizing Depression from Spontaneous Speech.
- Si-Ioi Ng, Lingfeng Xu, Kimberly D. Mueller, Julie Liss, Visar Berisha: Segmental and Suprasegmental Speech Foundation Models for Classifying Cognitive Risk Factors: Evaluating Out-of-the-Box Performance.
- Katerina Papadimitriou, Gerasimos Potamianos: Multimodal Continuous Fingerspelling Recognition via Visual Alignment Learning.
- Tomás Arias-Vergara, Paula Andrea Pérez-Toro, Xiaofeng Liu, Fangxu Xing, Maureen Stone, Jiachen Zhuo, Jerry L. Prince, Maria Schuster, Elmar Nöth, Jonghye Woo, Andreas K. Maier: Contrastive Learning Approach for Assessment of Phonological Precision in Patients with Tongue Cancer Using MRI Data.
- Haojie Zhang, Tao Zhang, Ganjun Liu, Dehui Fu, Xiaohui Hou, Ying Lv: DysArinVox: DYSphonia & DYSarthria mandARIN speech corpus.
- Xuanru Zhou, Anshul Kashyap, Steve Li, Ayati Sharma, Brittany Morin, David Baquirin, Jet Vonk, Zoe Ezzes, Zachary A. Miller, Maria Luisa Gorno-Tempini, Jiachen Lian, Gopala Anumanchipalli: YOLO-Stutter: End-to-end Region-Wise Speech Dysfluency Detection.
- Gábor Gosztolya, Veronika Svindt, Judit Bóna, Ildikó Hoffmann: Automatic Longitudinal Investigation of Multiple Sclerosis Subjects.
TAUKADIAL Challenge: Speech-Based Cognitive Assessment in Chinese and English (Special Session)
- Saturnino Luz, Sofia de la Fuente Garcia, Fasih Haider, Davida Fromm, Brian MacWhinney, Alyssa Lanzi

, Ya-Ning Chang
, Chia-Ju Chou, Yi-Chien Liu:
Connected Speech-Based Cognitive Assessment in Chinese and English. - David Ortiz-Perez, José García Rodríguez, David Tomás:

Cognitive Insights Across Languages: Enhancing Multimodal Interview Analysis. - Gábor Gosztolya, László Tóth:

Combining Acoustic Feature Sets for Detecting Mild Cognitive Impairment in the Interspeech'24 TAUKADIAL Challenge. - Junwen Duan, Fangyuan Wei, Hong-Dong Li, Jin Liu:

Pre-trained Feature Fusion and Matching for Mild Cognitive Impairment Detection. - Benjamin Barrera-Altuna, Daeun Lee, Zaima Zarnaz, Jinyoung Han, Seungbae Kim:

The Interspeech 2024 TAUKADIAL Challenge: Multilingual Mild Cognitive Impairment Detection with Multimodal Approach. - Anna Favaro, Tianyu Cao, Najim Dehak, Laureano Moro-Velázquez:

Leveraging Universal Speech Representations for Detecting and Assessing the Severity of Mild Cognitive Impairment Across Languages. - Bao Hoang, Yijiang Pang, Hiroko H. Dodge, Jiayu Zhou:

Translingual Language Markers for Cognitive Assessment from Spontaneous Speech. - Paula Andrea Pérez-Toro, Tomás Arias-Vergara, Philipp Klumpp, Tobias Weise, Maria Schuster, Elmar Nöth, Juan Rafael Orozco-Arroyave, Andreas K. Maier:

Multilingual Speech and Language Analysis for the Assessment of Mild Cognitive Impairment: Outcomes from the Taukadial Challenge.
Show and Tell 1
- Takayuki Arai, Ryohei Suzuki, Chandler Earp, Shinya Tsuji, Keiko Ochi:

Production of phrases by mechanical models of the human vocal tract. - Vishal Gourav, Ankit Tyagi, Phanindra Mankale:

Faster Vocoder: a multi threading approach to achieve low latency during TTS Inference. - Aanchan Mohan, Monideep Chakraborti, Katelyn Eng, Nailia Kushaeva, Mirjana Prpa, Jordan Lewis, Tianyi Zhang, Vince Geisler, Carol Geisler:

A powerful and modern AAC composition tool for impaired speakers. - Grzegorz P. Mika, Konrad Zielinski, Pawel Cyrta, Marek Grzelec:

VoxFlow AI: wearable voice converter for atypical speech. - Sai Akarsh C, Vamshiraghusimha Narasinga, Anil Kumar Vuppala:

Stress transfer in speech-to-speech machine translation. - Takuma Okamoto, Yamato Ohtani, Hisashi Kawai:

Mobile PresenTra: NICT fast neural text-to-speech system on smartphones with incremental inference of MS-FC-HiFi-GAN for low-latency synthesis. - Alex Peiró Lilja, José Giraldo, Martí Llopart-Font, Carme Armentano-Oller, Baybars Külebi, Mireia Farrús:

Multi-speaker and multi-dialectal Catalan TTS models for video gaming. - Juliana Francis, Éva Székely, Joakim Gustafson:

ConnecTone: a modular AAC system prototype with contextual generative text prediction and style-adaptive conversational TTS. - Mahdin Rohmatillah, Bryan Gautama Ngo, Willianto Sulaiman, Po-Chuan Chen, Jen-Tzung Chien:

Reliable dialogue system for facilitating student-counselor communication. - Harm Lameris, Joakim Gustafson, Éva Székely:

CreakVC: a voice conversion tool for modulating creaky voice. - Yu-Sheng Tsao, Yung-Chang Hsu, Jiun-Ting Li, Siang-Hong Weng, Tien-Hong Lo, Berlin Chen:

EZTalking: English assessment platform for teachers and students.
Keynote 2
- Shoko Araki:

Frontier of Frontend for Conversational Speech Processing.
Phonetics and Phonology of Second Language Acquisition
- Paige Tuttösí, H. Henny Yeung, Yue Wang, Fenqi Wang, Guillaume Denis, Jean-Julien Aucouturier, Angelica Lim:

Mmm whatcha say? Uncovering distal and proximal context effects in first and second-language word perception using psychophysical reverse correlation. - Anisia Popescu, Lori Lamel, Ioana Vasilescu, Laurence Devillers:

Automatic Speech Recognition with parallel L1 and L2 acoustic phone models to evaluate /l/ allophony in L2 English speech production. - Kevin Huang, Jack Goldberg, Louis Goldstein, Shrikanth Narayanan:

Analysis of articulatory setting for L1 and L2 English speakers using MRI data. - Ioana Colgiu, Laura Spinu, Rajiv Rao, Yasaman Rafat:

Bilingual Rhotic Production Patterns: A Generational Comparison of Spanish-English Bilingual Speakers in Canada. - Sylvain Coulange, Tsuneo Kato, Solange Rossato, Monica Masperi:

Exploring Impact of Pausing and Lexical Stress Patterns on L2 English Comprehensibility in Real Time. - Qi Wu:

Mandarin T3 Production by Chinese and Japanese Native Speakers.
Corpora-based Approaches in Automatic Emotion Recognition
- Sumit Ranjan, Rupayan Chakraborty, Sunil Kumar Kopparapu:

Reinforcement Learning based Data Augmentation for Noise Robust Speech Emotion Recognition. - Pravin Mote, Berrak Sisman, Carlos Busso:

Unsupervised Domain Adaptation for Speech Emotion Recognition using K-Nearest Neighbors Voice Conversion. - Jincen Wang, Yan Zhao, Cheng Lu, Hailun Lian, Hongli Chang, Yuan Zong, Wenming Zheng:

Confidence-aware Hypothesis Transfer Networks for Source-Free Cross-Corpus Speech Emotion Recognition. - Yuxuan Xi, Yan Song, Lirong Dai, Haoyu Song, Ian McLoughlin:

An Effective Local Prototypical Mapping Network for Speech Emotion Recognition. - Yuan Gao, Hao Shi, Chenhui Chu, Tatsuya Kawahara:

Speech Emotion Recognition with Multi-level Acoustic and Semantic Information Extraction and Interaction.
Analysis of Speakers States and Traits
- Oliver Niebuhr, Nafiseh Taghva:

How rhythm metrics are linked to produced and perceived speaker charisma. - Zhu Li, Xiyuan Gao, Yuqing Zhang, Shekhar Nayak, Matt Coler:

A Functional Trade-off between Prosodic and Semantic Cues in Conveying Sarcasm. - John Murzaku, Adil Soubki, Owen Rambow:

Multimodal Belief Prediction. - Run Chen, Haozhe Chen, Anushka Kulkarni, Eleanor Lin, Linda Pang, Divya Tadimeti, Jun Shin, Julia Hirschberg:

Detecting Empathy in Speech. - Dehua Tao, Tan Lee, Harold Chui, Sarah Luk:

Learning Representation of Therapist Empathy in Counseling Conversation Using Siamese Hierarchical Attention Network. - Han Kunmei:

Modelling Lexical Characteristics of the Healthy Aging Population: A Corpus-Based Study. - Maurice Gerczuk, Shahin Amiriparian, Justina Lutz, Wolfgang Strube, Irina Papazova, Alkomiet Hasan, Björn W. Schuller:

Exploring Gender-Specific Speech Patterns in Automatic Suicide Risk Assessment.
Spoofing and Deepfake Detection
- Nicholas Klein, Tianxiang Chen, Hemlata Tak, Ricardo Casal, Elie Khoury:

Source Tracing of Audio Deepfake Systems. - Tianchi Liu, Lin Zhang, Rohan Kumar Das, Yi Ma, Ruijie Tao, Haizhou Li:

How Do Neural Spoofing Countermeasures Detect Partially Spoofed Audio? - Xin Wang, Tomi Kinnunen, Kong Aik Lee, Paul-Gauthier Noé, Junichi Yamagishi:

Revisiting and Improving Scoring Fusion for Spoofing-aware Speaker Verification Using Compositional Data Analysis. - Oguzhan Baser, Kaan Kale, Sandeep P. Chinchali:

SecureSpectra: Safeguarding Digital Identity from Deep Fake Threats via Intelligent Signatures. - Menglu Li, Xiao-Ping Zhang:

Interpretable Temporal Class Activation Representation for Audio Spoofing Detection. - Zirui Ge, Xinzhou Xu, Haiyan Guo, Tingting Wang, Zhen Yang, Björn W. Schuller:

DGPN: A Dual Graph Prototypical Network for Few-Shot Speech Spoofing Algorithm Recognition.
Audio Captioning, Tagging, and Audio-Text Retrieval
- Jianyuan Sun, Wenwu Wang, Mark D. Plumbley:

PFCA-Net: Pyramid Feature Fusion and Cross Content Attention Network for Automated Audio Captioning. - Jizhong Liu, Gang Li, Junbo Zhang, Heinrich Dinkel, Yongqing Wang, Zhiyong Yan, Yujun Wang, Bin Wang:

Enhancing Automated Audio Captioning via Large Language Models with Optimized Audio Encoding. - Yifei Xin, Zhihong Zhu, Xuxin Cheng, Xusheng Yang, Yuexian Zou:

Audio-text Retrieval with Transformer-based Hierarchical Alignment and Disentangled Cross-modal Representation. - Heinrich Dinkel, Zhiyong Yan, Yongqing Wang, Junbo Zhang, Yujun Wang, Bin Wang:

Streaming Audio Transformers for Online Audio Tagging. - Aryan Chaudhary, Arshdeep Singh, Vinayak Abrol, Mark D. Plumbley:

Efficient CNNs with Quaternion Transformations and Pruning for Audio Tagging. - Xin Jing, Andreas Triantafyllopoulos, Björn W. Schuller:

ParaCLAP - Towards a general language-audio model for computational paralinguistic tasks. - Xuenan Xu, Haohe Liu, Mengyue Wu, Wenwu Wang, Mark D. Plumbley:

Efficient Audio Captioning with Encoder-Level Knowledge Distillation.
Generative Speech Enhancement
- Robin Scheibler, Yusuke Fujita, Yuma Shirahata, Tatsuya Komatsu:

Universal Score-based Speech Enhancement with High Content Preservation. - Haici Yang, Jiaqi Su, Minje Kim, Zeyu Jin:

Genhancer: High-Fidelity Speech Enhancement via Generative Modeling on Discrete Codec Tokens. - Ante Jukic, Roman Korostik, Jagadeesh Balam, Boris Ginsburg:

Schrödinger Bridge for Generative Speech Enhancement. - Thanapat Trachu, Chawan Piansaddhayanon, Ekapol Chuangsuwanich:

Thunder : Unified Regression-Diffusion Speech Enhancement with a Single Reverse Step using Brownian Bridge. - Yiyuan Yang, Niki Trigoni, Andrew Markham:

Pre-training Feature Guided Diffusion Model for Speech Enhancement. - Dail Kim, Da-Hee Yang, Donghyun Kim, Joon-Hyuk Chang, Jeonghwan Choi, Moa Lee, Jaemo Yang, Han-gil Moon:

Guided conditioning with predictive network on score-based diffusion model for speech enhancement.
Speech Synthesis: Evaluation
- Chun Yin, Tai-Shih Chi, Yu Tsao, Hsin-Min Wang:

SVSNet+: Enhancing Speaker Voice Similarity Assessment Models with Representations from Speech Foundation Models. - Srija Anand, Praveen Srinivasa Varadhan, Ashwin Sankar, Giri Raju, Mitesh M. Khapra:

Enhancing Out-of-Vocabulary Performance of Indian TTS Systems for Practical Applications through Low-Effort Data Strategies. - Jens Edlund, Christina Tånnander, Sébastien Le Maguer, Petra Wagner:

Assessing the impact of contextual framing on subjective TTS quality. - Adaeze Adigwe, Sarenne Wallbridge, Simon King:

What do people hear? Listeners' Perception of Conversational Speech. - Hui Wang, Shiwan Zhao, Jiaming Zhou, Xiguang Zheng, Haoqin Sun, Xuechen Wang, Yong Qin:

Uncertainty-Aware Mean Opinion Score Prediction. - Félix Saget, Meysam Shamsi, Marie Tahon:

Lifelong Learning MOS Prediction for Synthetic Speech Quality Evaluation.
Multilingual ASR
- Kwok Chin Yuen, Jia Qi Yip, Eng Siong Chng:

Continual Learning Optimizations for Auto-regressive Decoder of Multilingual ASR systems. - Jiatong Shi, Shih-Heng Wang, William Chen, Martijn Bartelds, Vanya Bannihatti Kumar, Jinchuan Tian, Xuankai Chang, Dan Jurafsky, Karen Livescu, Hung-yi Lee, Shinji Watanabe:

ML-SUPERB 2.0: Benchmarking Multilingual Speech Models Across Modeling Constraints, Languages, and Datasets. - Andrés Piñeiro Martín, Carmen García-Mateo, Laura Docío Fernández, Maria del Carmen Lopez-Perez, Georg Rehm:

Weighted Cross-entropy for Low-Resource Languages in Multilingual Speech Recognition. - A. F. M. Saif, Lisha Chen, Xiaodong Cui, Songtao Lu, Brian Kingsbury, Tianyi Chen:

M2ASR: Multilingual Multi-task Automatic Speech Recognition via Multi-objective Optimization. - Song Li, Yongbin You, Xuezhi Wang, Zhengkun Tian, Ke Ding, Guanglu Wan:

MSR-86K: An Evolving, Multilingual Corpus with 86,300 Hours of Transcribed Audio for Speech Recognition Research. - Brady Houston, Omid Sadjadi, Zejiang Hou, Srikanth Vishnubhotla, Kyu J. Han:

Improving Multilingual ASR Robustness to Errors in Language Input.
General Topics in ASR
- Jiwon Suh, Injae Na, Woohwan Jung:

Improving Domain-Specific ASR with LLM-Generated Contextual Descriptions. - Yuang Li, Min Zhang, Chang Su, Yinglu Li, Xiaosong Qiao, Mengxin Ren, Miaomiao Ma, Daimeng Wei, Shimin Tao, Hao Yang:

A Multitask Training Approach to Enhance Whisper with Open-Vocabulary Keyword Spotting. - Mario Zusag, Laurin Wagner, Bernhard Thallinger:

CrisperWhisper: Accurate Timestamps on Verbatim Speech Transcriptions. - Péter Mihajlik, Yan Meng, Mate S. Kadar, Julian Linke, Barbara Schuppler, Katalin Mády:

On Disfluency and Non-lexical Sound Labeling for End-to-end Automatic Speech Recognition. - Dena F. Mujtaba, Nihar R. Mahapatra, Megan Arney, J. Scott Yaruss, Caryn Herring, Jia Bin:

Inclusive ASR for Disfluent Speech: Cascaded Large-Scale Self-Supervised Learning with Targeted Fine-Tuning and Data Augmentation. - Hao Tan, Xiaochen Liu, Huan Zhang, Junjian Zhang, Yaguan Qian, Zhaoquan Gu:

DualPure: An Efficient Adversarial Purification Method for Speech Command Recognition. - Jan Lehecka, Josef V. Psutka, Lubos Smídl, Pavel Ircing, Josef Psutka:

A Comparative Analysis of Bilingual and Trilingual Wav2Vec Models for Automatic Speech Recognition in Multilingual Oral History Archives. - Antón de la Fuente, Dan Jurafsky:

A layer-wise analysis of Mandarin and English suprasegmentals in SSL speech models. - Spyretta Leivaditi, Tatsunari Matsushima, Matt Coler, Shekhar Nayak, Vass Verkhodanova:

Fine-Tuning Strategies for Dutch Dysarthric Speech Recognition: Evaluating the Impact of Healthy, Disease-Specific, and Speaker-Specific Data. - I-Ting Hsieh, Chung-Hsien Wu:

Dysarthric Speech Recognition Using Curriculum Learning and Articulatory Feature Embedding. - Shiyao Wang, Shiwan Zhao, Jiaming Zhou, Aobo Kong, Yong Qin:

Enhancing Dysarthric Speech Recognition for Unseen Speakers via Prototype-Based Adaptation. - Naijun Zheng, Xucheng Wan, Kai Liu, Ziqing Du, Huan Zhou:

An efficient text augmentation approach for contextualized Mandarin speech recognition. - Sheng Li, Chen Chen, Kwok Chin Yuen, Chenhui Chu, Eng Siong Chng, Hisashi Kawai:

Investigating ASR Error Correction with Large Language Model and Multilingual 1-best Hypotheses. - Lun Wang, Om Thakkar, Zhong Meng, Nicole Rafidi, Rohit Prabhavalkar, Arun Narayanan:

Efficiently Train ASR Models that Memorize Less and Perform Better with Per-core Clipping.
Spoken Language Understanding
- Emmy Phung, Harsh Deshpande, Ahmad Emami, Kanishk Singh:

AR-NLU: A Framework for Enhancing Natural Language Understanding Model Robustness against ASR Errors. - Mohan Li, Simon Keizer, Rama Doddipatla:

Prompting Whisper for QA-driven Zero-shot End-to-end Spoken Language Understanding. - Tuyen Tran, Khanh Le, Ngoc Dang Nguyen, Minh Vu, Huyen Ngo, Woomyoung Park, Thi Thu Trang Nguyen:

VN-SLU: A Vietnamese Spoken Language Understanding Dataset. - Shunsuke Kando, Yusuke Miyao, Jason Naradowsky, Shinnosuke Takamichi:

Textless Dependency Parsing by Labeled Sequence Prediction. - Yaoyao Yue, Michael Proctor, Luping Zhou, Rijul Gupta, Tharinda Piyadasa, Amelia Gully, Kirrie J. Ballard, Craig T. Jin:

Towards Speech Classification from Acoustic and Vocal Tract data in Real-time MRI. - Alexander Johnson, Peter Plantinga, Pheobe Sun, Swaroop Gadiyaram, Abenezer Girma, Ahmad Emami:

Efficient SQA from Long Audio Contexts: A Policy-driven Approach.
Speech and Multimodal Resources
- Jan Pesán, Vojtech Jurík, Martin Karafiát, Jan Cernocký:

BESST Dataset: A Multimodal Resource for Speech-based Stress Detection and Analysis. - Arnon Turetzky, Or Tal, Yael Segal, Yehoshua Dissen, Ella Zeldes, Amit Roth, Eyal Cohen, Yosi Shrem, Bronya Roni Chernyak, Olga Seleznova, Joseph Keshet, Yossi Adi:

HebDB: a Weakly Supervised Dataset for Hebrew Speech Processing. - Wenbin Wang, Yang Song, Sanjay Jha:

GLOBE: A High-quality English Corpus with Global Accents for Zero-shot Speaker Adaptive Text-to-Speech. - Yuexuan Kong, Viet-Anh Tran, Romain Hennequin:

STraDa: A Singer Traits Dataset. - Katharina Anderer, Andreas Reich, Matthias Wölfel:

MaViLS, a Benchmark Dataset for Video-to-Slide Alignment, Assessing Baseline Accuracy with a Multimodal Alignment Algorithm Leveraging Speech, OCR, and Visual Features. - Kim Sung-Bin, Lee Chae-Yeon, Gihun Son, Oh Hyun-Bin, Janghoon Ju, Suekyeong Nam, Tae-Hyun Oh:

MultiTalk: Enhancing 3D Talking Head Generation Across Languages with Multilingual Video Dataset. - Irina-Elena Veliche, Zhuangqun Huang, Vineeth Ayyat Kochaniyan, Fuchun Peng, Ozlem Kalinli, Michael L. Seltzer:

Towards measuring fairness in speech recognition: Fair-Speech dataset. - Yi Lu, Yuankun Xie, Ruibo Fu, Zhengqi Wen, Jianhua Tao, Zhiyong Wang, Xin Qi, Xuefei Liu, Yongwei Li, Yukun Liu, Xiaopeng Wang, Shuchen Shi:

Codecfake: An Initial Dataset for Detecting LLM-based Deepfake Audio. - Mohamed Osman, Daniel Z. Kaplan, Tamer Nadeem:

SER Evals: In-domain and Out-of-domain benchmarking for speech emotion recognition.
Pathological Speech Analysis 1
- Vidar Freyr Gudmundsson, Keve Márton Gönczi, Malin Svensson Lundmark, Donna Erickson, Oliver Niebuhr:

The MARRYS helmet: A new device for researching and training "jaw dancing". - Moreno La Quatra, Maria Francesca Turco, Torbjørn Svendsen, Giampiero Salvi, Juan Rafael Orozco-Arroyave, Sabato Marco Siniscalchi:

Exploiting Foundation Models and Speech Enhancement for Parkinson's Disease Detection from Speech in Real-World Operative Conditions. - Andreas Triantafyllopoulos, Anton Batliner, Wolfgang Mayr, Markus Fendler, Florian B. Pokorny, Maurice Gerczuk, Shahin Amiriparian, Thomas M. Berghaus, Björn W. Schuller:

Sustained Vowels for Pre- vs Post-Treatment COPD Classification. - Mahdi Amiri, Ina Kodrasi:

Adversarial Robustness Analysis in Automatic Pathological Speech Detection Approaches. - Gahye Kim, Yunjung Eom, Selina S. Sung, Seunghee Ha, Tae-Jin Yoon, Jungmin So:

Automatic Children Speech Sound Disorder Detection with Age and Speaker Bias Mitigation.
Speech and Language in Health: from Remote Monitoring to Medical Conversations - 1 (Special Session)
- Mojtaba Kadkhodaie Elyaderani, John Glover, Thomas Schaaf:

Reference-Free Estimation of the Quality of Clinical Notes Generated from Doctor-Patient Conversations. - Jihyun Mun, Sunhee Kim, Minhwa Chung:

Developing an End-to-End Framework for Predicting the Social Communication Severity Scores of Children with Autism Spectrum Disorder. - Vladimir Despotovic, Abir Elbéji, Petr V. Nazarov, Guy Fagherazzi:

Multimodal Fusion for Vocal Biomarkers Using Vector Cross-Attention. - Stefano Goria, Roseline Polle, Salvatore Fara, Nicholas Cummins:

Revealing Confounding Biases: A Novel Benchmarking Approach for Aggregate-Level Performance Metrics in Health Assessments. - Anaïs Rameau, Satrajit Ghosh, Alexandros Sigaras, Olivier Elemento, Jean-Christophe Bélisle-Pipon, Vardit Ravitsky, Maria Powell, Alistair Johnson, David A. Dorr, Philip R. O. Payne, Micah Boyer, Stephanie Watts, Ruth Bahr, Frank Rudzicz, Jordan Lerner-Ellis, Shaheen Awan, Don Bolser, Yael Bensoussan:

Developing Multi-Disorder Voice Protocols: A team science approach involving clinical expertise, bioethics, standards, and DEI. - Sri Harsha Dumpala, Katerina Dikaios, Abraham Nunes, Frank Rudzicz, Rudolf Uher, Sageev Oore:

Self-Supervised Embeddings for Detecting Individual Symptoms of Depression. - Daryush D. Mehta, Jarrad H. Van Stan, Hamzeh Ghasemzadeh, Robert E. Hillman:

Comparing ambulatory voice measures during daily life with brief laboratory assessments in speakers with and without vocal hyperfunction. - Jennifer Williams, Eike Schneiders, Henry Card, Tina Seabrooke, Beatrice Pakenham-Walsh, Tayyaba Azim, Lucy Valls-Reed, Ganesh Vigneswaran, John Robert Bautista, Rohan Chandra, Arya Farahi:

Predicting Acute Pain Levels Implicitly from Vocal Features. - Kubilay Can Demir, Belén Lojo Rodríguez, Tobias Weise, Andreas K. Maier, Seung Hee Yang:

Towards Intelligent Speech Assistants in Operating Rooms: A Multimodal Model for Surgical Workflow Analysis. - Gowtham Premananth, Yashish M. Siriwardena, Philip Resnik, Sonia Bansal, Deanna L. Kelly, Carol Y. Espy-Wilson:

A Multimodal Framework for the Assessment of the Schizophrenia Spectrum.
Speech and Brain
- Yuzhe Wang, Anna Favaro, Thomas Thebaud, Jesús Villalba, Najim Dehak, Laureano Moro-Velázquez:

Exploring the Complementary Nature of Speech and Eye Movements for Profiling Neurological Disorders. - Hengyu Li, Kangdi Mei, Zhaoci Liu, Yang Ai, Liping Chen, Jie Zhang, Zhenhua Ling:

Refining Self-supervised Learnt Speech Representation using Brain Activations. - Yuejiao Wang, Xianmin Gong, Lingwei Meng, Xixin Wu, Helen Meng:

Large Language Model-based FMRI Encoding of Language Functions for Subjects with Neurocognitive Disorder. - Kumar Neelabh, Vishnu Sreekumar:

From Sound to Meaning in the Auditory Cortex: A Neuronal Representation and Classification Analysis. - Sheng Feng, Heyang Liu, Yu Wang, Yanfeng Wang:

Towards an End-to-End Framework for Invasive Brain Signal Decoding with Large Language Models. - Jihwan Lee, Aditya Kommineni, Tiantian Feng, Kleanthis Avramidis, Xuan Shi, Sudarsana Reddy Kadiri, Shrikanth Narayanan:

Toward Fully-End-to-End Listened Speech Decoding from EEG Signals.
Innovative Methods in Phonetics and Phonology
- Emily P. Ahn, Eleanor Chodroff, Myriam Lapierre, Gina-Anne Levow:

The Use of Phone Categories and Cross-Language Modeling for Phone Alignment of Panãra. - Sneha Ray Barman, Shakuntala Mahanta, Neeraj Kumar Sharma:

Deciphering Assamese Vowel Harmony with Featural InfoWaveGAN. - Harsha Veena Tadavarthy, Austin Jones, Margaret E. L. Renwick:

Phonological Feature Detection for US English using the Phonet Library. - Constantijn Kaland, Jeremy Steffman, Jennifer Cole:

K-means and hierarchical clustering of f0 contours. - Rotem Rousso, Eyal Cohen, Joseph Keshet, Eleanor Chodroff:

Tradition or Innovation: A Comparison of Modern ASR Methods for Forced Alignment. - Lila Kim, Cédric Gendrot:

Using wav2vec 2.0 for phonetic classification tasks: methodological aspects. - Michael Lambropoulos, Frantz Clermont, Shunichi Ishihara:

The sub-band cepstrum as a tool for locating local spectral regions of phonetic sensitivity: A first attempt with multi-speaker vowel data. - Woo-Jin Chung, Hong-Goo Kang:

Speaker-Independent Acoustic-to-Articulatory Inversion through Multi-Channel Attention Discriminator. - Tobias Weise, Philipp Klumpp, Kubilay Can Demir, Paula Andrea Pérez-Toro, Maria Schuster, Elmar Nöth, Björn Heismann, Andreas K. Maier, Seung Hee Yang:

Speaker- and Text-Independent Estimation of Articulatory Movements and Phoneme Alignments from Speech. - Anna Oura, Hideaki Kikuchi, Tetsunori Kobayashi:

Preprocessing for acoustic-to-articulatory inversion using real-time MRI movies of Japanese speech.
Voice, Tones and F0
- Chenyu Li, Jalal Al-Tamimi:

Impact of the tonal factor on diphthong realizations in Standard Mandarin with Generalized Additive Mixed Models. - Xiaowang Liu, Jinsong Zhang:

A Study on the Information Mechanism of the 3rd Tone Sandhi Rule in Mandarin Disyllabic Words. - Melanie Weirich, Daniel Duran, Stefanie Jannedy:

Gender and age based f0-variation in the German Plapper Corpus. - Chenzi Xu, Jessica Wormald, Paul Foulkes, Philip Harrison, Vincent Hughes, Poppy Welch, Finnian Kelly, David van der Vloed:

Voice quality in telephone speech: Comparing acoustic measures between VoIP telephone and high-quality recordings. - Iona Gessinger, Bistra Andreeva, Benjamin R. Cowan:

The Use of Modifiers and f0 in Remote Referential Communication with Human and Computer Partners.
Emotion Recognition: Resources and Benchmarks
- Ziyang Ma, Mingjie Chen, Hezhao Zhang, Zhisheng Zheng, Wenxi Chen, Xiquan Li, Jiaxin Ye, Xie Chen, Thomas Hain:

EmoBox: Multilingual Multi-corpus Speech Emotion Recognition Toolkit and Benchmark. - Andreas Triantafyllopoulos, Anton Batliner, Simon David Noel Rampp, Manuel Milling, Björn W. Schuller:

INTERSPEECH 2009 Emotion Challenge Revisited: Benchmarking 15 Years of Progress in Speech Emotion Recognition. - Adham Ibrahim, Shady Shehata, Ajinkya Kulkarni, Mukhtar Mohamed, Muhammad Abdul-Mageed:

What Does it Take to Generalize SER Model Across Datasets? A Comprehensive Benchmark. - Abinay Reddy Naini, Lucas Goncalves, Mary A. Kohler, Donita Robinson, Elizabeth Richerson, Carlos Busso:

WHiSER: White House Tapes Speech Emotion Recognition Corpus. - Siddique Latif, Raja Jurdak, Björn W. Schuller:

Evaluating Transformer-Enhanced Deep Reinforcement Learning for Speech Emotion Recognition. - Jincen Wang, Yan Zhao, Cheng Lu, Chuangao Tang, Sunan Li, Yuan Zong, Wenming Zheng:

Boosting Cross-Corpus Speech Emotion Recognition using CycleGAN with Contrastive Learning.
Speaker and Language Identification and Diarization
- Bilal Rahou, Hervé Bredin:

Multi-latency look-ahead for streaming speaker segmentation. - Christoph Boeddeker, Tobias Cord-Landwehr, Reinhold Haeb-Umbach:

Once more Diarization: Improving meeting transcription systems through segment-level speaker reassignment. - Théo Mariotte, Anthony Larcher, Silvio Montrésor, Jean-Hugh Thomas:

ASoBO: Attentive Beamformer Selection for Distant Speaker Diarization in Meetings. - Gabriel Pirlogeanu, Octavian Pascu, Alexandru-Lucian Georgescu, Horia Cucu:

Hybrid-Diarization System with Overlap Post-Processing for the DISPLACE 2024 Challenge. - Shareef Babu Kalluri, Prachi Singh, Pratik Roy Chowdhuri, Apoorva Kulkarni, Shikha Baghel, Pradyoth Hegde, Swapnil Sontakke, Deepak K. T., S. R. Mahadeva Prasanna, Deepu Vijayasenan, Sriram Ganapathy:

The Second DISPLACE Challenge: DIarization of SPeaker and LAnguage in Conversational Environments. - Joonas Kalda, Tanel Alumäe, Martin Lebourdais, Hervé Bredin, Séverin Baroudi, Ricard Marxer:

TalTech-IRIT-LIS Speaker and Language Diarization Systems for DISPLACE 2024. - Yaqian Hao, Chenguang Hu, Yingying Gao, Shilei Zhang, Junlan Feng:

Exploring Energy-Based Models for Out-of-Distribution Detection in Dialect Identification. - Martina Valente, Fabio Brugnara, Giovanni Morrone, Enrico Zovato, Leonardo Badino:

Exploring Spoken Language Identification Strategies for Automatic Transcription of Multilingual Broadcast and Institutional Speech. - Rohit Paturi, Xiang Li, Sundararajan Srinivasan:

AG-LSEC: Audio Grounded Lexical Speaker Error Correction. - Hang Su, Yuxiang Kong, Lichun Fan, Peng Gao, Yujun Wang, Zhiyong Wu:

Speaker Change Detection with Weighted-sum Knowledge Distillation based on Self-supervised Pre-trained Models. - Naoki Makishima, Naotaka Kawata, Mana Ihori, Tomohiro Tanaka, Shota Orihashi, Atsushi Ando, Ryo Masumura:

SOMSRED: Sequential Output Modeling for Joint Multi-talker Overlapped Speech Recognition and Speaker Diarization. - Hokuto Munakata, Ryo Terashima, Yusuke Fujita:

Song Data Cleansing for End-to-End Neural Singer Diarization Using Neural Analysis and Synthesis Framework.
Audio-Text Retrieval
- Yifei Xin, Xuxin Cheng, Zhihong Zhu, Xusheng Yang, Yuexian Zou:

DiffATR: Diffusion-based Generative Modeling for Audio-Text Retrieval. - Zhiyong Yan, Heinrich Dinkel, Yongqing Wang, Jizhong Liu, Junbo Zhang, Yujun Wang, Bin Wang:

Bridging Language Gaps in Audio-Text Retrieval. - Soham Deshmukh, Rita Singh, Bhiksha Raj:

Domain Adaptation for Contrastive Audio-Language Models. - Francesco Paissan, Elisabetta Farella:

tinyCLAP: Distilling Constrastive Language-Audio Pretrained Models. - June-Woo Kim, Miika Toikkanen, Yera Choi, Seoung-Eun Moon, Ho-Young Jung:

BTS: Bridging Text and Sound Modalities for Metadata-Aided Respiratory Sound Classification. - Yuwu Tang, Ziang Ma, Haitao Zhang:

Enhanced Feature Learning with Normalized Knowledge Distillation for Audio Tagging.
Speech Enhancement
- Mingshuai Liu, Zhuangqi Chen, Xiaopeng Yan, Yuanjun Lv, Xianjun Xia, Chuanzeng Huang, Yijian Xiao, Lei Xie:

RaD-Net 2: A causal two-stage repairing and denoising speech enhancement network with knowledge distillation and complex axial self-attention. - Xi Liu, John H. L. Hansen:

DNN-based monaural speech enhancement using alternate analysis windows for phase and magnitude modification. - Li Li, Shogo Seki:

Improved Remixing Process for Domain Adaptation-Based Speech Enhancement by Mitigating Data Imbalance in Signal-to-Noise Ratio. - Yixuan Zhang, Hao Zhang, Meng Yu, Dong Yu:

Neural Network Augmented Kalman Filter for Robust Acoustic Howling Suppression. - Jizhen Li, Xinmeng Xu, Weiping Tu, Yuhong Yang, Rong Zhu:

Improving Speech Enhancement by Integrating Inter-Channel and Band Features with Dual-branch Conformer. - Qiquan Zhang, Hongxu Zhu, Xinyuan Qian, Eliathamby Ambikairajah, Haizhou Li:

An Exploration of Length Generalization in Transformer-Based Speech Enhancement. - Haixin Guan, Wei Dai, Guangyong Wang, Xiaobin Tan, Peng Li, Jiaen Liang:

Reducing Speech Distortion and Artifacts for Speech Enhancement by Loss Function. - Candy Olivia Mawalim, Shogo Okada, Masashi Unoki:

Are Recent Deep Learning-Based Speech Enhancement Methods Ready to Confront Real-World Noisy Environments? - Wangyou Zhang, Kohei Saijo, Jee-weon Jung, Chenda Li, Shinji Watanabe, Yanmin Qian:

Beyond Performance Plateaus: A Comprehensive Study on Scalability in Speech Enhancement.
Speech Coding
- Jinghong Zhang, Zugang Zhao, Yonghui Liu, Jianbing Liu, Zhiqiang He, Kai Niu:

TD-PLC: A Semantic-Aware Speech Encoding for Improved Packet Loss Concealment. - Zihan Zhang, Xianjun Xia, Chuanzeng Huang, Yijian Xiao, Lei Xie:

BS-PLCNet 2: Two-stage Band-split Packet Loss Concealment Network with Intra-model Knowledge Distillation. - Kishan Gupta, Nicola Pia, Srikanth Korse, Andreas Brendel, Guillaume Fuchs, Markus Multrus:

On Improving Error Resilience of Neural End-to-End Speech Coders. - Thomas Muller, Stéphane Ragot, Laetitia Gros, Pierrick Philippe, Pascal Scalart:

Speech quality evaluation of neural audio codecs. - Yang Ai, Ye-Xin Lu, Xiao-Hang Jiang, Zheng-Yan Sheng, Rui-Chen Zheng, Zhen-Hua Ling:

A Low-Bitrate Neural Audio Codec Framework with Bandwidth Reduction and Recovery for High-Sampling-Rate Waveforms. - Haibin Wu, Yuan Tseng, Hung-yi Lee:

CodecFake: Enhancing Anti-Spoofing Models Against Deepfake Audios from Codec-Based Speech Synthesis Systems.
Speech Synthesis: Expressivity and Emotion
- Zehua Kcriss Li, Meiying Melissa Chen, Yi Zhong, Pinxin Liu, Zhiyao Duan:

GTR-Voice: Articulatory Phonetics Informed Controllable Expressive Speech Synthesis. - Donghyun Seong, Hoyoung Lee, Joon-Hyuk Chang:

TSP-TTS: Text-based Style Predictor with Residual Vector Quantization for Expressive Text-to-Speech. - Weiqin Li, Peiji Yang, Yicheng Zhong, Yixuan Zhou, Zhisheng Wang, Zhiyong Wu, Xixin Wu, Helen Meng:

Spontaneous Style Text-to-Speech Synthesis with Controllable Spontaneous Behaviors Based on Language Models. - Dake Guo, Xinfa Zhu, Liumeng Xue, Yongmao Zhang, Wenjie Tian, Lei Xie:

Text-aware and Context-aware Expressive Audiobook Speech Synthesis. - Thomas Bott, Florian Lux, Ngoc Thang Vu:

Controlling Emotion in Text-to-Speech with Natural Language Prompts. - Jinlong Xue, Yayue Deng, Yingming Gao, Ya Li:

Retrieval Augmented Generation in Prompt-based Text-to-Speech Synthesis with Context-Aware Contrastive Language-Audio Pretraining. - Pavan Kalyan, Preeti Rao, Preethi Jyothi, Pushpak Bhattacharyya:

Emotion Arithmetic: Emotional Speech Synthesis via Weight Space Interpolation. - Deok-Hyeon Cho, Hyung-Seok Oh, Seung-Bin Kim, Sang-Hoon Lee, Seong-Whan Lee:

EmoSphere-TTS: Emotional Style and Intensity Modeling via Spherical Emotion Vector for Controllable Emotional Text-to-Speech. - Xuyuan Li, Zengqiang Shang, Peiyang Shi, Hua Hua, Ta Li, Pengyuan Zhang:

Expressive paragraph text-to-speech synthesis with multi-step variational autoencoder. - Chin-Yun Yu, György Fazekas:

Differentiable Time-Varying Linear Prediction in the Context of End-to-End Analysis-by-Synthesis.
Speech Synthesis: Tools and Data
- Yuki Saito, Takuto Igarashi, Kentaro Seki, Shinnosuke Takamichi, Ryuichi Yamamoto, Kentaro Tachibana, Hiroshi Saruwatari:

SRC4VC: Smartphone-Recorded Corpus for Voice Conversion Benchmark. - Praveen Srinivasa Varadhan, Ashwin Sankar, Giri Raju, Mitesh M. Khapra:

Rasa: Building Expressive Speech Synthesis Systems for Indian Languages in Low-resource Settings. - Min Ma, Yuma Koizumi, Shigeki Karita, Heiga Zen, Jason Riesa, Haruko Ishikawa, Michiel Bacchiani:

FLEURS-R: A Restored Multilingual Speech Corpus for Generation Tasks. - Linhan Ma, Dake Guo, Kun Song, Yuepeng Jiang, Shuai Wang, Liumeng Xue, Weiming Xu, Huan Zhao, Binbin Zhang, Lei Xie:

WenetSpeech4TTS: A 12,800-hour Mandarin TTS Corpus for Large Speech Generation Model Benchmark. - Qian Yang, Jialong Zuo, Zhe Su, Ziyue Jiang, Mingze Li, Zhou Zhao, Feiyang Chen, Zhefeng Wang, Baoxing Huai:

MSceneSpeech: A Multi-Scene Speech Dataset For Expressive Speech Synthesis. - Masaya Kawamura, Ryuichi Yamamoto, Yuma Shirahata, Takuya Hasumi, Kentaro Tachibana:

LibriTTS-P: A Corpus with Speaking Style and Speaker Identity Prompts for Text-to-Speech and Style Captioning. - Sewade Ogun, Abraham Toluwase Owodunni, Tobi Olatunji, Eniola Alese, Babatunde Oladimeji, Tejumade Afonja, Kayode Olaleye, Naome A. Etori, Tosin P. Adewumi:

1000 African Voices: Advancing inclusive multi-speaker multi-accent speech synthesis. - Osamu Take, Shinnosuke Takamichi, Kentaro Seki, Yoshiaki Bando, Hiroshi Saruwatari:

SaSLaW: Dialogue Speech Corpus with Audio-visual Egocentric Information Toward Environment-adaptive Dialogue Speech Synthesis.
Speech Synthesis: Singing Voice Synthesis
- Semin Kim, Myeonghun Jeong, Hyeonseung Lee, Minchan Kim, Byoung Jin Choi, Nam Soo Kim:

MakeSinger: A Semi-Supervised Training Method for Data-Efficient Singing Voice Synthesis via Classifier-free Diffusion Guidance. - Takuma Okamoto, Yamato Ohtani, Sota Shimizu, Tomoki Toda, Hisashi Kawai:

Challenge of Singing Voice Synthesis Using Only Text-To-Speech Corpus With FIRNet Source-Filter Neural Vocoder. - Taewoo Kim, Choongsang Cho, Young Han Lee:

Period Singer: Integrating Periodic and Aperiodic Variational Autoencoders for Natural-Sounding End-to-End Singing Voice Synthesis. - Jiatong Shi, Yueqian Lin, Xinyi Bai, Keyi Zhang, Yuning Wu, Yuxun Tang, Yifeng Yu, Qin Jin, Shinji Watanabe:
Singing Voice Data Scaling-up: An Introduction to ACE-Opencpop and ACE-KiSing. - Ji-Sang Hwang, HyeongRae Noh, Yoonseok Hong, Insoo Oh:

X-Singer: Code-Mixed Singing Voice Synthesis via Cross-Lingual Learning. - Shuochen Gao, Shun Lei, Fan Zhuo, Hangyu Liu, Feng Liu, Boshi Tang, Qiaochu Huang, Shiyin Kang, Zhiyong Wu:

An End-to-End Approach for Chord-Conditioned Song Generation.
LLM in ASR
- Murali Karthick Baskar, Andrew Rosenberg, Bhuvana Ramabhadran, Neeraj Gaur, Zhong Meng:

Speech Prefix-Tuning with RNNT Loss for Improving LLM Predictions. - Frank Seide, Yangyang Shi, Morrie Doulaty, Yashesh Gaur, Junteng Jia, Chunyang Wu:

Speech ReaLLM - Real-time Speech Recognition with Multimodal Language Models by Teaching the Flow of Time. - Yangze Li, Xiong Wang, Songjun Cao, Yike Zhang, Long Ma, Lei Xie:

A Transcription Prompt-based Efficient Audio Large Language Model for Robust Speech Recognition. - Zhiyuan Tang, Dong Wang, Shen Huang, Shidong Shang:

Pinyin Regularization in Error Correction for Chinese Speech Recognition with Large Language Models.
Vision and Speech
- Jongsuk Kim, Jiwon Shin, Junmo Kim:

AVCap: Leveraging Audio-Visual Features as Text Tokens for Captioning. - Sreyan Ghosh, Sonal Kumar, Ashish Seth, Purva Chiniya, Utkarsh Tyagi, Ramani Duraiswami, Dinesh Manocha:

LipGER: Visually-Conditioned Generative Error Correction for Robust Automatic Speech Recognition. - Guinan Li, Jiajun Deng, Youjun Chen, Mengzhe Geng, Shujie Hu, Zhe Li, Zengrui Jin, Tianzi Wang, Xurong Xie, Helen Meng, Xunying Liu:

Joint Speaker Features Learning for Audio-visual Multichannel Speech Separation and Recognition. - Chen Chen, Zehua Liu, Xiaolou Li, Lantian Li, Dong Wang:

CNVSRC 2023: The First Chinese Continuous Visual Speech Recognition Challenge.
Spoken Document Summarization
- Margaret Kroll, Kelsey Kraus:

Optimizing the role of human evaluation in LLM-based spoken document summarization systems. - Sangwon Ryu, Heejin Do, Yunsu Kim, Gary Geunbae Lee, Jungseul Ok:
Key-Element-Informed sLLM Tuning for Document Summarization. - Kohei Matsuura, Takanori Ashihara, Takafumi Moriya, Masato Mimura, Takatomo Kano, Atsunori Ogawa, Marc Delcroix:

Sentence-wise Speech Summarization: Task, Datasets, and End-to-End Modeling with LM Knowledge Distillation. - Hengchao Shang, Zongyao Li, Jiaxin Guo, Shaojun Li, Zhiqiang Rao, Yuanchang Luo, Daimeng Wei, Hao Yang:

An End-to-End Speech Summarization Using Large Language Model. - Wonjune Kang, Deb Roy:

Prompting Large Language Models with Audio for General-Purpose Speech Summarization. - Khai Le-Duc, Khai-Nguyen Nguyen, Long Vo-Dang, Truong-Son Hy:
Real-time Speech Summarization for Medical Conversations.
Speech and Language in Health: from Remote Monitoring to Medical Conversations - 2 (Special Sessions)
- Daniel Escobar-Grisales, Cristian David Ríos-Urrego, Ilja Baumann, Korbinian Riedhammer, Elmar Nöth, Tobias Bocklet, Adolfo M. García, Juan Rafael Orozco-Arroyave:

It's Time to Take Action: Acoustic Modeling of Motor Verbs to Detect Parkinson's Disease. - Malo Maisonneuve, Corinne Fredouille, Muriel Lalain, Alain Ghio, Virginie Woisard:

Towards objective and interpretable speech disorder assessment: a comparative analysis of CNN and transformer-based models. - Catarina Botelho, John Mendonça, Anna Pompili, Tanja Schultz, Alberto Abad, Isabel Trancoso:
Macro-descriptors for Alzheimer's disease detection using large language models. - Franziska Braun, Sebastian P. Bayerl, Florian Hönig, Hartmut Lehfeld, Thomas Hillemacher, Tobias Bocklet, Korbinian Riedhammer:

Infusing Acoustic Pause Context into Text-Based Dementia Assessment. - Oliver Roesler, Jackson Liscombe, Michael Neumann, Hardik Kothare, Abhishek Hosamath, Lakshmi Arbatti, Doug Habberstad, Christiane Suendermann-Oeft, Meredith Bartlett, Cathy Zhang, Nikhil Sukhdev, Kolja Wilms, Anusha Badathala, Sandrine Istas, Steve Ruhmel, Bryan Hansen, Madeline Hannan, David Henley, Arthur W. Wallace, Ira Shoulson, David Suendermann-Oeft, Vikram Ramanarayanan:

Towards Scalable Remote Assessment of Mild Cognitive Impairment Via Multimodal Dialog. - Mara Barberis, Pieter De Clercq, Bastiaan Tamm, Hugo Van hamme, Maaike Vandermosten:
Automatic recognition and detection of aphasic natural speech. - Giulia Sanguedolce, Sophie Brook, Dragos-Cristian Gruia, Patrick A. Naylor, Fatemeh Geranmayeh:
When Whisper Listens to Aphasia: Advancing Robust Post-Stroke Speech Recognition. - Liming Wang, Yuan Gong, Nauman Dawalatabad, Marco Vilela, Katerina Placek, Brian Tracey, Yishu Gong, Alan Premasiri, Fernando Vieira, James R. Glass:

Automatic Prediction of Amyotrophic Lateral Sclerosis Progression using Longitudinal Speech Transformer. - Hardik Kothare, Michael Neumann, Cathy Zhang, Jackson Liscombe, Jordi W. J. van Unnik, Lianne C. M. Botman, Leonard H. van den Berg, Ruben P. A. van Eijk, Vikram Ramanarayanan:

How Consistent are Speech-Based Biomarkers in Remote Tracking of ALS Disease Progression Across Languages? A Case Study of English and Dutch. - Anika A. Spiesberger, Andreas Triantafyllopoulos, Alexander Kathan, Anastasia Semertzidou, Caterina Gawrilow, Tilman Reinelt, Wolfgang A. Rauch, Björn W. Schuller:
"So . . . my child . . . " - How Child ADHD Influences the Way Parents Talk. - Judith Dineley, Ewan Carr, Lauren L. White, Catriona Lucas, Zahia Rahman, Tian Pan, Faith Matcham, Johnny Downs, Richard J. B. Dobson, Thomas F. Quatieri, Nicholas Cummins:
Variability of speech timing features across repeated recordings: a comparison of open-source extraction techniques. - Yanis Labrak, Adel Moumen, Richard Dufour, Mickael Rouvier:

Zero-Shot End-To-End Spoken Question Answering In Medical Domain. - Yicong Jiang, Tianzi Wang, Xurong Xie, Juan Liu, Wei Sun, Nan Yan, Hui Chen, Lan Wang, Xunying Liu, Feng Tian:

Perceiver-Prompt: Flexible Speaker Adaptation in Whisper for Chinese Disordered Speech Recognition.
Show and Tell 2
- Kesavaraj V, Charan Devarkonda, Vamshiraghusimha Narasinga, Anil Kumar Vuppala:

Custom wake word detection. - Song Chen, Mandar Gogate, Kia Dashtipour, Jasper Kirton-Wingate, Adeel Hussain, Faiyaz Doctor, Tughrul Arslan, Amir Hussain:

Edged based audio-visual speech enhancement demonstrator. - Arif Reza Anway, Bryony Buck, Mandar Gogate, Kia Dashtipour, Michael Akeroyd, Amir Hussain:

Real-Time Gaze-directed speech enhancement for audio-visual hearing-aids. - Abhishek Kumar, Srikanth Konjeti, Jithendra Vepa:

Detection of background agents speech in contact centers. - Bramhendra Koilakuntla, Prajesh Rana, Paras Ahuja, Srikanth Konjeti, Jithendra Vepa:

Leveraging large language models for post-transcription correction in contact centers. - Leonie Schade, Nico Dallmann, Olcay Türk, Stefan Lazarov, Petra Wagner:

Understanding "understanding": presenting a richly annotated multimodal corpus of dyadic interaction. - João Vítor Possamai de Menezes, Arne-Lukas Fietkau, Tom Diener, Steffen Kürbis, Peter Birkholz:

A demonstrator for articulation-based command word recognition. - Nigel G. Ward, Andres Segura:

Pragmatically similar utterance finder demonstration. - Kai Liu, Ziqing Du, Huan Zhou, Xucheng Wan, Naijun Zheng:

Real-time scheme for rapid extraction of speaker embeddings in challenging recording conditions. - Szu-Yu Chen, Tien-Hong Lo, Yao-Ting Sung, Ching-Yu Tseng, Berlin Chen:

TEEMI: a speaking practice tool for L2 English learners.
Prosody
- Na Hu, Hugo Schnack, Amalia Arvaniti:
Automatic pitch accent classification through image classification. - Tianqi Geng, Hui Feng:

Form and Function in Prosodic Representation: In the Case of 'ma' in Tianjin Mandarin. - Joyshree Chakraborty, Leena Dihingia, Priyankoo Sarmah, Rohit Sinha:

On Comparing Time- and Frequency-Domain Rhythm Measures in Classifying Assamese Dialects. - Chiara Riegger, Tina Bögel, George Walkden:

The prosody of the verbal prefix ge-: historical and experimental evidence. - Hongchen Wu, Jiwon Yun:

Influences of Morphosyntax and Semantics on the Intonation of Mandarin Chinese Wh-indeterminates. - Benazir Mumtaz, Miriam Butt:

Urdu Alternative Questions: A Hat Pattern.
Foundational Models for Deepfake and Spoofed Speech Detection
- Hoan My Tran, David Guennec, Philippe Martin, Aghilas Sini, Damien Lolive, Arnaud Delhay, Pierre-François Marteau:

Spoofed Speech Detection with a Focus on Speaker Embedding. - Juan M. Martín-Doñas, Aitor Álvarez, Eros Rosello, Angel M. Gomez, Antonio M. Peinado:
Exploring Self-supervised Embeddings and Synthetic Data Augmentation for Robust Audio Deepfake Detection. - Zihan Pan, Tianchi Liu, Hardik B. Sailor, Qiongqiong Wang:
Attentive Merging of Hidden Embeddings from Pre-trained Speech Model for Anti-spoofing Detection. - Haochen Wu, Wu Guo, Shengyu Peng, Zhuhai Li, Jie Zhang:

Adapter Learning from Pre-trained Model for Robust Spoof Speech Detection. - Kexu Liu, Yuanxin Wang, Shengchen Li, Xi Shao:
Speech Formants Integration for Generalized Detection of Synthetic Speech Spoofing Attacks. - Thien-Phuc Doan, Long Nguyen-Vu, Kihun Hong, Souhwan Jung:
Balance, Multiple Augmentation, and Re-synthesis: A Triad Training Strategy for Enhanced Audio Deepfake Detection.
Speaker Recognition 1
- Shengyu Peng, Wu Guo, Haochen Wu, Zuoliang Li, Jie Zhang:

Fine-tune Pre-Trained Models with Multi-Level Feature Fusion for Speaker Verification. - En-Lun Yu, Kuan-Hsun Ho, Jeih-weih Hung, Shih-Chieh Huang, Berlin Chen:

Speaker Conditional Sinc-Extractor for Personal VAD. - Shiu-Hsiang Liou, Po-Cheng Chan, Chia-Ping Chen, Tzu-Chieh Lin, Chung-Li Lu, Yu-Han Cheng, Hsiang-Feng Chuang, Wei-Yu Chen:

Enhancing ECAPA-TDNN with Feature Processing Module and Attention Mechanism for Speaker Verification. - Seung-bin Kim, Chan-yeong Lim, Jungwoo Heo, Ju-ho Kim, Hyun-seo Shin, Kyo-Won Koo, Ha-Jin Yu:

MR-RawNet: Speaker verification system with multiple temporal resolutions for variable duration utterances using raw waveforms. - Kihyun Nam, Hee-Soo Heo, Jee-Weon Jung, Joon Son Chung:
Disentangled Representation Learning for Environment-agnostic Speaker Recognition. - Ladislav Mosner, Romain Serizel, Lukás Burget, Oldrich Plchot, Emmanuel Vincent, Junyi Peng, Jan Cernocký:
Multi-Channel Extension of Pre-trained Models for Speaker Verification. - Yishuang Li, Wenhao Guan, Hukai Huang, Shiyu Miao, Qi Su, Lin Li, Qingyang Hong:

Efficient Integrated Features Based on Pre-trained Models for Speaker Verification. - Tianhao Wang, Lantian Li, Dong Wang:

SE/BN Adapter: Parametric Efficient Domain Adaptation for Speaker Recognition. - Wei-Lin Xie, Yu-Xuan Xi, Yan Song, Jian-Tao Zhang, Hao-Yu Song, Ian McLoughlin:

DB-PMAE: Dual-Branch Prototypical Masked AutoEncoder with locality for domain robust speaker verification. - Matthew Maciejewski, Dominik Klement, Ruizhe Huang, Matthew Wiesner, Sanjeev Khudanpur:
Evaluating the Santa Barbara Corpus: Challenges of the Breadth of Conversational Spoken Language. - Zhenyu Zhou, Shibiao Xu, Shi Yin, Lantian Li, Dong Wang:

A Comprehensive Investigation on Speaker Augmentation for Speaker Recognition.
Source Separation 1
- Helin Wang, Jesús Villalba, Laureano Moro-Velázquez, Jiarui Hai, Thomas Thebaud, Najim Dehak:

Noise-robust Speech Separation with Fast Generative Correction. - Roland Hartanto, Sakriani Sakti, Koichi Shinoda:

MSDET: Multitask Speaker Separation and Direction-of-Arrival Estimation Training. - Jacob Kealey, John R. Hershey, François Grondin:

Unsupervised Improved MVDR Beamforming for Sound Enhancement. - Ke Chen, Jiaqi Su, Taylor Berg-Kirkpatrick, Shlomo Dubnov, Zeyu Jin:

Improving Generalization of Speech Separation in Real-World Scenarios: Strategies in Simulation, Optimization, and Evaluation. - Jihyun Kim, Stijn Kindt, Nilesh Madhu, Hong-Goo Kang:

Enhanced Deep Speech Separation in Clustered Ad Hoc Distributed Microphone Environments. - Jia Qi Yip, Shengkui Zhao, Dianwen Ng, Eng Siong Chng, Bin Ma:

Towards Audio Codec-based Speech Separation.
Audio-Visual and Generative Speech Enhancement
- Zhengxiao Li, Nakamasa Inoue:

Locally Aligned Rectified Flow Model for Speech Enhancement Towards Single-Step Diffusion. - Pu Wang, Junhui Li, Jialu Li, Liangdong Guo, Youshan Zhang:

Diffusion Gaussian Mixture Audio Denoise. - Bunlong Lay, Timo Gerkmann:

An Analysis of the Variance of Diffusion-based Speech Enhancement. - Chaeyoung Jung, Suyeon Lee, Ji-Hoon Kim, Joon Son Chung:

FlowAVSE: Efficient Audio-Visual Speech Enhancement with Conditional Flow Matching. - Honglie Chen, Rodrigo Mira, Stavros Petridis, Maja Pantic:

RT-LA-VocE: Real-Time Low-SNR Audio-Visual Speech Enhancement. - Junhui Li, Pu Wang, Jialu Li, Youshan Zhang:

Complex Image-Generative Diffusion Transformer for Audio Denoising. - Yuchen Hu, Chen Chen, Ruizhe Li, Qiushi Zhu, Eng Siong Chng:
Noise-aware Speech Enhancement using Diffusion Probabilistic Model.
Speech Privacy and Bandwidth Expansion
- Mohammad Hassan Vali, Tom Bäckström:
Privacy PORCUPINE: Anonymization of Speaker Attributes Using Occurrence Normalization for Space-Filling Vector Quantization. - Mayank Kumar Singh, Naoya Takahashi, Wei-Hsiang Liao, Yuki Mitsufuji:

SilentCipher: Deep Audio Watermarking. - Cunhang Fan, Shunbo Dong, Jun Xue, Yujie Chen, Jiangyan Yi, Zhao Lv:

Frequency-mix Knowledge Distillation for Fake Speech Detection. - Nicolas M. Müller, Piotr Kawa, Shen Hu, Matthias Neu, Jennifer Williams, Philip Sperl, Konstantin Böttinger:
A New Approach to Voice Authenticity. - Junzuo Zhou, Jiangyan Yi, Tao Wang, Jianhua Tao, Ye Bai, Chu Yuan Zhang, Yong Ren, Zhengqi Wen:

TraceableSpeech: Towards Proactively Traceable Text-to-Speech with Watermarking. - Liwei Liu, Huihui Wei, Dongya Liu, Zhonghua Fu:

HarmoNet: Partial DeepFake Detection Network based on Multi-scale HarmoF0 Feature Fusion. - Denise Moussa, Sandra Bergmann, Christian Riess:

Unmasking Neural Codecs: Forensic Identification of AI-compressed Speech. - Yin-Tse Lin, Shreya G. Upadhyay, Bo-Hao Su, Chi-Chun Lee:

SWiBE: A Parameterized Stochastic Diffusion Process for Noise-Robust Bandwidth Expansion. - Ye-Xin Lu, Yang Ai, Zheng-Yan Sheng, Zhen-Hua Ling:

MultiStage Speech Bandwidth Extension with Flexible Sampling Rate Control. - Xu Li, Qirui Wang, Xiaoyu Liu:

MaskSR: Masked Language Model for Full-band Speech Restoration.
Speech Synthesis: Prosody
- Yuliya Korotkova, Ilya Kalinovskiy, Tatiana Vakhrusheva:

Word-level Text Markup for Prosody Control in Speech Synthesis. - Shivam Mehta, Harm Lameris, Rajiv Punmiya, Jonas Beskow, Éva Székely, Gustav Eje Henter:

Should you use a probabilistic duration model in TTS? Probably! Especially for spontaneous speech. - Sefik Emre Eskimez, Xiaofei Wang, Manthan Thakker, Chung-Hsien Tsai, Canrun Li, Zhen Xiao, Hemin Yang, Zirun Zhu, Min Tang, Jinyu Li, Sheng Zhao, Naoyuki Kanda:

Total-Duration-Aware Duration Modeling for Text-to-Speech Systems. - Himanshu Maurya, Atli Sigurgeirsson:

A Human-in-the-Loop Approach to Improving Cross-Text Prosody Transfer. - Yuepeng Jiang, Tao Li, Fengyu Yang, Lei Xie, Meng Meng, Yujun Wang:

Towards Expressive Zero-Shot Speech Synthesis with Hierarchical Prosody Modeling. - Jinzuomu Zhong, Yang Li, Hui Huang, Korin Richmond, Jie Liu, Zhiba Su, Jing Guo, Benlai Tang, Fengjie Zhu:
Multi-Modal Automatic Prosody Annotation with Contrastive Pretraining of Speech-Silence and Word-Punctuation.
Accented Speech, Prosodic Features, Dialect, Emotion, Sound Classification
- Darshan Prabhu, Abhishek Gupta, Omkar Nitsure, Preethi Jyothi, Sriram Ganapathy:
Improving Self-supervised Pre-training using Accent-Specific Codebooks. - Tejumade Afonja, Tobi Olatunji, Sewade Ogun, Naome A. Etori, Abraham Toluwase Owodunni, Moshood Yekini:
Performant ASR Models for Medical Entities in Accented Speech. - Tahir Javed, Janki Nawale, Sakshi Joshi, Eldho Ittan George, Kaushal Santosh Bhogale, Deovrat Mehendale, Mitesh M. Khapra:

LAHAJA: A Robust Multi-accent Benchmark for Evaluating Hindi ASR Systems. - Haechan Kim, Junho Myung, Seoyoung Kim, Sungpah Lee, Dongyeop Kang, Juho Kim:

LearnerVoice: A Dataset of Non-Native English Learners' Spontaneous Speech. - Jiayan Lin, Shenghui Lu, Hukai Huang, Wenhao Guan, Binbin Xu, Hui Bu, Qingyang Hong, Lin Li:

MinSpeech: A Corpus of Southern Min Dialect for Automatic Speech Recognition. - Ying Hu, Huamin Yang, Hao Huang, Liang He:
Cross-modal Features Interaction-and-Aggregation Network with Self-consistency Training for Speech Emotion Recognition. - Arnav Goel, Medha Hira, Anubha Gupta:
Exploring Multilingual Unseen Speaker Emotion Recognition: Leveraging Co-Attention Cues in Multitask Learning. - Hazim T. Bukhari, Soham Deshmukh, Hira Dhamyal, Bhiksha Raj, Rita Singh:

SELM: Enhancing Speech Emotion Recognition for Out-of-Domain Scenarios. - Martijn Bentum, Louis ten Bosch, Tom Lentz:
The Processing of Stress in End-to-End Automatic Speech Recognition Models. - Tuan Nguyen, Huy Dat Tran:

LingWav2Vec2: Linguistic-augmented wav2vec 2.0 for Vietnamese Mispronunciation Detection. - Rhiannon Mogridge, Anton Ragni:

Learning from memory-based models. - Meiling Chen, Pengjie Liu, Heng Yang, Haofeng Wang:

Towards End-to-End Unified Recognition for Mandarin and Cantonese.
Neural Network Adaptation
- Thomas Rolland, Alberto Abad:

Shared-Adapters: A Novel Transformer-based Parameter Efficient Transfer Learning Approach For Children's Automatic Speech Recognition. - Zhouyuan Huo, Dongseong Hwang, Gan Song, Khe Chai Sim, Weiran Wang:

AdaRA: Adaptive Rank Allocation of Residual Adapters for Speech Foundation Model. - Kyuhong Shim, Jinkyu Lee, Hyunjae Kim:

Leveraging Adapter for Parameter-Efficient ASR Encoder. - Ji-Hun Kang, Jae-Hong Lee, Mun-Hak Lee, Joon-Hyuk Chang:

Whisper Multilingual Downstream Task Tuning Using Task Vectors. - Shaojun Li, Daimeng Wei, Hengchao Shang, Jiaxin Guo, Zongyao Li, Zhanglin Wu, Zhiqiang Rao, Yuanchang Luo, Xianghui He, Hao Yang:

Speaker-Smoothed kNN Speaker Adaptation for End-to-End ASR. - Jinming Chen, Jingyi Fang, Yuanzhong Zheng, Yaoxuan Wang, Haojun Fei:

Qifusion-Net: Layer-adapted Stream/Non-stream Model for End-to-End Multi-Accent Speech Recognition.
ASR and LLMs
- Ji Won Yoon, Beom Jun Woo, Nam Soo Kim:

HuBERT-EE: Early Exiting HuBERT for Efficient Speech Recognition. - Guanrou Yang, Ziyang Ma, Fan Yu, Zhifu Gao, Shiliang Zhang, Xie Chen:

MaLa-ASR: Multimedia-Assisted LLM-Based ASR. - HyunJung Choi, Muyeol Choi, Yohan Lim, Minkyu Lee, Seon Hui Kim, Seung Yun, Donghyun Kim, Sang-Hun Kim:

Spoken-to-written text conversion with Large Language Model. - Zhiqi Ai, Zhiyong Chen, Shugong Xu:
MM-KWS: Multi-modal Prompts for Multilingual User-defined Keyword Spotting. - Andrew Rouditchenko, Yuan Gong, Samuel Thomas, Leonid Karlinsky, Hilde Kuehne, Rogério Feris, James Glass:

Whisper-Flamingo: Integrating Visual Features into Whisper for Audio-Visual Speech Recognition and Translation. - K. R. Prajwal, Triantafyllos Afouras, Andrew Zisserman:

Speech Recognition Models are Strong Lip-readers.
Pathological Speech Analysis 3
- Ilja Baumann, Dominik Wagner, Maria Schuster, Korbinian Riedhammer, Elmar Nöth, Tobias Bocklet:

Towards Self-Attention Understanding for Automatic Articulatory Processes Analysis in Cleft Lip and Palate Speech. - Yin-Long Liu, Rui Feng, Jia-Hong Yuan, Zhen-Hua Ling:

Clever Hans Effect Found in Automatic Detection of Alzheimer's Disease through Speech. - Yeh-Sheng Lin, Shu-Chuan Tseng, Jyh-Shing Roger Jang:

Leveraging Phonemic Transcription and Whisper toward Clinically Significant Indices for Automatic Child Speech Assessment. - Shaoxiang Dang, Tetsuya Matsumoto, Yoshinori Takeuchi, Takashi Tsuboi, Yasuhiro Tanaka, Daisuke Nakatsubo, Satoshi Maesawa, Ryuta Saito, Masahisa Katsuno, Hiroaki Kudo:

Developing vocal system impaired patient-aimed voice quality assessment approach using ASR representation-included multiple features. - Wei-Tung Hsu, Chin-Po Chen, Yun-Shao Lin, Chi-Chun Lee:

A Cluster-based Personalized Federated Learning Strategy for End-to-End ASR of Dementia Patients. - Stefan Kalabakov, Monica González Machorro, Florian Eyben, Björn W. Schuller, Bert Arnrich:

A Comparative Analysis of Federated Learning for Speech-Based Cognitive Decline Detection. - Michael Neumann, Hardik Kothare, Jackson Liscombe, Emma C. L. Leschly, Oliver Roesler, Vikram Ramanarayanan:

Multimodal Digital Biomarkers for Longitudinal Tracking of Speech Impairment Severity in ALS: An Investigation of Clinically Important Differences.
Speech Disorders 3
- Ming Gao, Hang Chen, Jun Du, Xin Xu, Hongxiao Guo, Hui Bu, Jianxing Yang, Ming Li, Chin-Hui Lee:

Enhancing Voice Wake-Up for Dysarthria: Mandarin Dysarthria Speech Corpus Release and Customized System Design. - Neil Kumar Shah, Shirish S. Karande, Vineet Gandhi:

Towards Improving NAM-to-Speech Synthesis Intelligibility using Self-Supervised Speech Models. - Seyun Um, Doyeon Kim, Hong-Goo Kang:

PARAN: Variational Autoencoder-based End-to-End Articulation-to-Speech System for Speech Intelligibility. - Si Chen, Bruce Xiao Wang, Yitian Hong, Fang Zhou, Angel Chan, Po-yi Tang, Bin Li, Chunyi Wen, James Cheung, Yan Liu, Zhuoming Chen:
Acoustic changes in speech prosody produced by children with autism after robot-assisted speech training. - Xiuwen Zheng, Bornali Phukon, Mark Hasegawa-Johnson:

Fine-Tuning Automatic Speech Recognition for People with Parkinson's: An Effective Strategy for Enhancing Speech Technology Accessibility. - Pan-Pan Jiang, Jimmy Tobin, Katrin Tomanek, Robert L. MacDonald, Katie Seaver, Richard Cave, Marilyn A. Ladewig, Rus Heywood, Jordan R. Green:
Learnings from curating a trustworthy, well-annotated, and useful dataset of disordered English speech. - Wing-Zin Leung, Mattias Cross, Anton Ragni, Stefan Goetze:
Training Data Augmentation for Dysarthric Automatic Speech Recognition by Text-to-Dysarthric-Speech Synthesis. - Gábor Gosztolya, Mercedes Vetráb, Veronika Svindt, Judit Bóna, Ildikó Hoffmann:
Wav2vec 2.0 Embeddings Are No Swiss Army Knife - A Case Study for Multiple Sclerosis.
Speech Recognition with Large Pretrained Speech Models for Under-represented Languages (Special Session)
- Yi-Jen Shih, David Harwath:

Interface Design for Self-Supervised Speech Models. - Yaoxun Xu, Shi-Xiong Zhang, Jianwei Yu, Zhiyong Wu, Dong Yu:

Comparing Discrete and Continuous Space LLMs for Speech Recognition. - Jinpeng Li, Yu Pu, Qi Sun, Wei-Qiang Zhang:

Improving Whisper's Recognition Performance for Under-Represented Language Kazakh Leveraging Unpaired Speech and Text. - Kaushal Santosh Bhogale, Deovrat Mehendale, Niharika Parasa, Sathish Kumar Reddy G, Tahir Javed, Pratyush Kumar, Mitesh M. Khapra:

Empowering Low-Resource Language ASR via Large-Scale Pseudo Labeling. - Zhengyang Li, Patrick Blumenberg, Jing Liu, Thomas Graave, Timo Lohrenz, Siegfried Kunzmann, Tim Fingscheidt:
Interleaved Audio/Audiovisual Transfer Learning for AV-ASR in Low-Resourced Languages. - Sathvik Udupa, Jesuraj Bandekar, Saurabh Kumar, Deekshitha G, Sandhya Badiger, Abhayjeet Singh, Savitha Murthy, Priyanka Pai, Srinivasa Raghavan K. M., Raoul Nanavati, Prasanta Kumar Ghosh:

Adapter pre-training for improved speech recognition in unseen domains using low resource adapter tuning of self-supervised models. - Tianyi Xu, Kaixun Huang, Pengcheng Guo, Yu Zhou, Longtao Huang, Hui Xue, Lei Xie:

Towards Rehearsal-Free Multilingual ASR: A LoRA-based Case Study on Whisper. - Yaroslav Getman, Tamás Grósz, Katri Hiovain-Asikainen, Mikko Kurimo:
Exploring adaptation techniques of large speech foundation models for low-resource ASR: a case study on Northern Sámi. - Mengjie Qian, Siyuan Tang, Rao Ma, Kate M. Knill, Mark J. F. Gales:

Learn and Don't Forget: Adding a New Language to ASR Foundation Models.
Speech Processing Using Discrete Speech Units (Special Session)
- Yuning Wu, Chunlei Zhang, Jiatong Shi, Yuxun Tang, Shan Yang, Qin Jin:

TokSing: Singing Voice Synthesis based on Discrete Tokens. - Pooneh Mousavi, Jarod Duret, Salah Zaiem, Luca Della Libera, Artem Ploujnikov, Cem Subakan, Mirco Ravanelli:

How Should We Extract Discrete Audio Tokens from Self-Supervised Models? - Xuankai Chang, Jiatong Shi, Jinchuan Tian, Yuning Wu, Yuxun Tang, Yihan Wu, Shinji Watanabe, Yossi Adi, Xie Chen, Qin Jin:
The Interspeech 2024 Challenge on Speech Processing Using Discrete Units. - Yuxun Tang, Yuning Wu, Jiatong Shi, Qin Jin:

SingOMD: Singing Oriented Multi-resolution Discrete Representation Construction from Speech Models. - Jiatong Shi, Xutai Ma, Hirofumi Inaguma, Anna Sun, Shinji Watanabe:
MMM: Multi-Layer Multi-Residual Multi-Stream Discrete Speech Representation from Self-supervised Learning Model. - Kunal Dhawan, Nithin Rao Koluguri, Ante Jukic, Ryan Langman, Jagadeesh Balam, Boris Ginsburg:

Codec-ASR: Training Performant Automatic Speech Recognition Systems with Discrete Speech Representations.
Keynote 3
- Elmar Nöth:

Analysis of Pathological Speech - Pitfalls along the Way.
Databases and Progress in Methodology
- Junseok Ahn, Youkyum Kim, Yeunju Choi, Doyeop Kwak, Ji-Hoon Kim, Seongkyu Mun, Joon Son Chung:

VoxSim: A perceptual voice similarity dataset. - Mewlude Nijat, Chen Chen, Dong Wang, Askar Hamdulla:

UY/CH-CHILD - A Public Chinese L2 Speech Database of Uyghur Children. - Prakash Kumar, Ye Tian, Yongwan Lim, Sophia X. Cui, Christina Hagedorn, Dani Byrd, Uttam K. Sinha, Shrikanth Narayanan, Krishna S. Nayak:

State-of-the-art speech production MRI protocol for new 0.55 Tesla scanners. - Mingyue Shi, Huali Zhou, Qinglin Meng, Nengheng Zheng:

DBD-CI: Doubling the Band Density for Bilateral Cochlear Implants. - Huihang Zhong, Yanlu Xie, ZiJin Yao:

Leveraging Large Language Models to Refine Automatic Feedback Generation at Articulatory Level in Computer Aided Pronunciation Training. - Bin Zhao, Mingxuan Huang, Chenlu Ma, Jinyi Xue, Aijun Li, Kunyu Xu:
Decoding Human Language Acquisition: EEG Evidence for Predictive Probabilistic Statistics in Word Segmentation.
Articulation, Convergence and Perception
- Jérémy Giroud, Jessica Lei, Kirsty Phillips, Matthew H. Davis:
Behavioral evidence for higher speech rate convergence following natural than artificial time altered speech. - Qingye Shen, Leonardo Lancia, Noël Nguyen:

A novel experimental design for the study of listener-to-listener convergence in phoneme categorization. - Hao Li, Yuan Fang, Xueliang Zhang, Fei Chen, Guanglai Gao:

Cross-Attention-Guided WaveNet for EEG-to-MEL Spectrogram Reconstruction. - Nicolò Loddo, Francisca Pessanha, Almila Akdag Salah:
What if HAL breathed? Enhancing Empathy in Human-AI Interactions with Breathing Speech Synthesis. - Malin Svensson Lundmark:
Magnitude and timing of acceleration peaks in stressed and unstressed syllables.
Speech Emotion Recognition
- Shahin Amiriparian, Filip Packan, Maurice Gerczuk, Björn W. Schuller:
ExHuBERT: Enhancing HuBERT Through Block Extension and Fine-Tuning on 37 Emotion Datasets. - Fabian Ritter Gutierrez, Kuan-Po Huang, Jeremy H. M. Wong, Dianwen Ng, Hung-yi Lee, Nancy F. Chen, Eng Siong Chng:

Dataset-Distillation Generative Model for Speech Emotion Recognition. - Jialong Mai, Xiaofen Xing, Weidong Chen, Xiangmin Xu:

DropFormer: A Dynamic Noise-Dropping Transformer for Speech Emotion Recognition. - Minxue Niu, Mimansa Jaiswal, Emily Mower Provost:

From Text to Emotion: Unveiling the Emotion Annotation Capabilities of LLMs.
Self-Supervised Models in Speaker Recognition
- Ju-ho Kim, Hee-Soo Heo, Bong-Jin Lee, Youngki Kwon, Minjae Lee, Ha-Jin Yu:

Self-supervised speaker verification with relational mask prediction. - Victor Miara, Théo Lepage, Réda Dehak:
Towards Supervised Performance on Speaker Verification with Self-Supervised Learning by Leveraging Large-Scale ASR Models. - Chan-yeong Lim, Hyun-seo Shin, Ju-ho Kim, Jungwoo Heo, Kyo-Won Koo, Seung-bin Kim, Ha-Jin Yu:

Improving Noise Robustness in Self-supervised Pre-trained Model for Speaker Verification. - Abderrahim Fathan, Xiaolin Zhu, Jahangir Alam:

On the impact of several regularization techniques on label noise robustness of self-supervised speaker verification systems. - Zhe Li, Man-Wai Mak, Hung-yi Lee, Helen Meng:

Parameter-efficient Fine-tuning of Speaker-Aware Dynamic Prompts for Speaker Verification. - Yiyang Zhao, Shuai Wang, Guangzhi Sun, Zehua Chen, Chao Zhang, Mingxing Xu, Thomas Fang Zheng:

Whisper-PMFA: Partial Multi-Scale Feature Aggregation for Speaker Verification using Whisper Models.
Speech Quality Assessment
- Cheng-Hung Hu, Yusuke Yasuda, Tomoki Toda:

Embedding Learning for Preference-based Speech Quality Assessment. - Sathvik Udupa, Soumi Maiti, Prasanta Kumar Ghosh:

IndicMOS: Multilingual MOS Prediction for 7 Indian languages. - Dan Wells, Andrea Lorena Aldana Blanco, Cassia Valentini-Botinhao, Erica Cooper, Aidan Pine, Junichi Yamagishi, Korin Richmond:

Experimental evaluation of MOS, AB and BWS listening test designs. - Bao Thang Ta, Minh Tu Le, Van Hai Do, Huynh Thi Thanh Binh:

Enhancing No-Reference Speech Quality Assessment with Pairwise, Triplet Ranking Losses, and ASR Pretraining.
Privacy and Security in Speech Communication 1
- Nicolas M. Müller, Nicholas W. D. Evans, Hemlata Tak, Philip Sperl, Konstantin Böttinger:

Harder or Different? Understanding Generalization of Audio Deepfake Detection. - Hideyuki Oiso, Yuto Matsunaga, Kazuya Kakizaki, Taiki Miyagawa:
Prompt Tuning for Audio Deepfake Detection: Computationally Efficient Test-time Domain Adaptation with Limited Target Dataset. - David Looney, Nikolay D. Gaubitch:

Robust spread spectrum speech watermarking using linear prediction and deep spectral shaping. - Yujie Chen, Jiangyan Yi, Jun Xue, Chenglong Wang, Xiaohui Zhang, Shunbo Dong, Siding Zeng, Jianhua Tao, Zhao Lv, Cunhang Fan:

RawBMamba: End-to-End Bidirectional State Space Model for Audio Deepfake Detection. - Ailin Liu, Pepijn Vunderink, Jose Vargas Quiros, Chirag Raman, Hayley Hung:

How Private is Low-Frequency Speech Audio in the Wild? An Analysis of Verbal Intelligibility by Humans and Machines. - Ching-Yu Yang, Shreya G. Upadhyay, Ya-Tse Wu, Bo-Hao Su, Chi-Chun Lee:

RW-VoiceShield: Raw Waveform-based Adversarial Attack on One-shot Voice Conversion.
Speech Synthesis: Voice Conversion 2
- Aleksei Gusev, Anastasia Avdeeva:

Improvement Speaker Similarity for Zero-Shot Any-to-Any Voice Conversion of Whispered and Regular Speech. - Ji Sub Um, Hoirin Kim:

Utilizing Adaptive Global Response Normalization and Cluster-Based Pseudo Labels for Zero-Shot Voice Conversion. - Linhan Ma, Xinfa Zhu, Yuanjun Lv, Zhichao Wang, Ziqian Wang, Wendi He, Hongbin Zhou, Lei Xie:

Vec-Tok-VC+: Residual-enhanced Robust Zero-shot Voice Conversion with Progressive Constraints in a Dual-mode Training Strategy. - Takuto Igarashi, Yuki Saito, Kentaro Seki, Shinnosuke Takamichi, Ryuichi Yamamoto, Kentaro Tachibana, Hiroshi Saruwatari:

Noise-Robust Voice Conversion by Conditional Denoising Training Using Latent Variables of Recording Quality and Environment. - Hiroki Kanagawa, Takafumi Moriya, Yusuke Ijima:

Pre-training Neural Transducer-based Streaming Voice Conversion for Faster Convergence and Alignment-free Training. - Le Xu, Jiangyan Yi, Tao Wang, Yong Ren, Rongxiu Zhong, Zhengqi Wen, Jianhua Tao:

Residual Speaker Representation for One-Shot Voice Conversion. - Nicolas Gengembre, Olivier Le Blouch, Cédric Gendrot:
Disentangling prosody and timbre embeddings via voice conversion. - Shihao Chen, Yu Gu, Jie Zhang, Na Li, Rilin Chen, Liping Chen, Lirong Dai:

LDM-SVC: Latent Diffusion Model Based Zero-Shot Any-to-Any Singing Voice Conversion with Singer Guidance.
Speech Synthesis: Text Processing
- Amit Roth, Arnon Turetzky, Yossi Adi:

A Language Modeling Approach to Diacritic-Free Hebrew TTS. - Avihu Dekel, Raul Fernandez:

Exploring the Benefits of Tokenization of Discrete Acoustic Units. - Markéta Rezácková, Daniel Tihelka, Jindrich Matousek:

Homograph Disambiguation with Text-to-Text Transfer Transformer. - Kiyoshi Kurihara, Masanori Sano:

Enhancing Japanese Text-to-Speech Accuracy with a Novel Combination Transformer-BERT-based G2P: Integrating Pronunciation Dictionaries and Accent Sandhi. - Yuma Shirahata, Byeongseon Park, Ryuichi Yamamoto, Kentaro Tachibana:

Audio-conditioned phonemic and prosodic annotation for building text-to-speech models from unlabeled speech data. - Xingxing Yang:

G2PA: G2P with Aligned Audio for Mandarin Chinese. - Siqi Sun, Korin Richmond:

Learning Pronunciation from Other Accents via Pronunciation Knowledge Transfer. - Deepanshu Gupta, Javier Latorre:

Positional Description for Numerical Normalization. - Christina Tånnander, Shivam Mehta, Jonas Beskow, Jens Edlund:

Beyond graphemes and phonemes: continuous phonological features in neural text-to-speech synthesis.
Training Methods, Self-Supervised Learning, Adaptation
- Adriana Fernandez-Lopez, Honglie Chen, Pingchuan Ma, Lu Yin, Qiao Xiao, Stavros Petridis, Shiwei Liu, Maja Pantic:

MSRS: Training Multimodal Speech Recognition Models from Scratch with Sparse Mask Optimization. - Amrutha Prasad, Srikanth R. Madikeri, Driss Khalil, Petr Motlícek, Christof Schüpbach:

Speech and Language Recognition with Low-rank Adaptation of Pretrained Models. - Kwangyoun Kim, Suwon Shon, Yi-Te Hsu, Prashant Sridhar, Karen Livescu, Shinji Watanabe:

Convolution-Augmented Parameter-Efficient Fine-Tuning for Speech Recognition. - Amit Meghanani, Thomas Hain:

LASER: Learning by Aligning Self-supervised Representations of Speech for Improving Content-related Tasks. - Robert Flynn, Anton Ragni:

Self-Train Before You Transcribe. - Steven Vander Eeckt, Hugo Van hamme:

Unsupervised Online Continual Learning for Automatic Speech Recognition. - Hao Shi, Tatsuya Kawahara:

Dual-path Adaptation of Pretrained Feature Extraction Module for Robust Automatic Speech Recognition. - Nahomi Kusunoki, Yosuke Higuchi, Tetsuji Ogawa, Tetsunori Kobayashi:

Hierarchical Multi-Task Learning with CTC and Recursive Operation. - Keigo Hojo, Yukoh Wakabayashi, Kengo Ohta, Atsunori Ogawa, Norihide Kitaoka:

Boosting CTC-based ASR using inter-layer attention-based CTC loss. - Hyung Yong Kim, Byeong-Yeol Kim, Yunkyu Lim, Jihwan Park, Shukjae Choi, Yooncheol Ju, Jinseok Park, Youshin Lim, Seung Woo Yu, Hanbin Lee, Shinji Watanabe:

Self-training ASR Guided by Unsupervised ASR Teacher. - Yue Gu, Zhihao Du, Shiliang Zhang, Jiqing Han, Yongjun He:

Personality-memory Gated Adaptation: An Efficient Speaker Adaptation for Personalized End-to-end Automatic Speech Recognition. - George Joseph, Arun Baby:

Speaker Personalization for Automatic Speech Recognition using Weight-Decomposed Low-Rank Adaptation. - Jae-Hong Lee, Sang-Eon Lee, Dong-Hyun Kim, Do-Hee Kim, Joon-Hyuk Chang:

Online Subloop Search via Uncertainty Quantization for Efficient Test-Time Adaptation. - Vishwanath Pratap Singh, Federico Malato, Ville Hautamäki, Md. Sahidullah, Tomi Kinnunen:

ROAR: Reinforcing Original to Augmented Data Ratio Dynamics for Wav2vec2.0 Based ASR. - Jeehye Lee, Hyeji Seo:

Online Knowledge Distillation of Decoder-Only Large Language Models for Efficient Speech Recognition.
Novel Architectures for ASR
- Tomoki Honda, Shinsuke Sakai, Tatsuya Kawahara:

Efficient and Robust Long-Form Speech Recognition with Hybrid H3-Conformer. - Yosuke Kashiwagi, Hayato Futami, Emiru Tsunoo, Siddhant Arora, Shinji Watanabe:

Rapid Language Adaptation for Multilingual E2E Speech Recognition Using Encoder Prompting. - Virat Shejwalkar, Om Thakkar, Arun Narayanan:

Quantifying Unintended Memorization in BEST-RQ ASR Encoders. - Woo Hyun Kang, Srikanth Vishnubhotla, Rudolf Braun, Yogesh Virkar, Raghuveer Peri, Kyu J. Han:

SWAN: SubWord Alignment Network for HMM-free word timing estimation in end-to-end automatic speech recognition.
Multimodality and Foundation Models
- Ziyun Cui, Chang Lei, Wen Wu, Yinan Duan, Diyang Qu, Ji Wu, Runsen Chen, Chao Zhang:

Spontaneous Speech-Based Suicide Risk Detection Using Whisper and Large Language Models. - Mohammad Amaan Sayeed, Hanan Aldarmaki:

Spoken Word2Vec: Learning Skipgram Embeddings from Speech. - Pawel Bujnowski, Bartlomiej Kuzma, Bartlomiej Paziewski, Jacek Rutkowski, Joanna Marhula, Zuzanna Bordzicka, Piotr Andruszkiewicz:

SAMSEMO: New dataset for multilingual and multimodal emotion recognition. - Bonian Jia, Huiyao Chen, Yueheng Sun, Meishan Zhang, Min Zhang:

LLM-Driven Multimodal Opinion Expression Identification. - Xiaolou Li, Zehua Liu, Chen Chen, Lantian Li, Li Guo, Dong Wang:

Zero-Shot Fake Video Detection by Audio-Visual Consistency. - Han EunGi, Oh Hyun-Bin, Kim Sung-Bin, Corentin Nivelet Etcheberry, Suekyeong Nam, Janghoon Ju, Tae-Hyun Oh:

Enhancing Speech-Driven 3D Facial Animation with Audio-Visual Guidance from Lip Reading Expert.
Spoken Dialogue Systems and Conversational Analysis 1
- Matthew McNeill, Rivka Levitan:

Autoregressive cross-interlocutor attention scores meaningfully capture conversational dynamics. - Conor Atkins, Ian D. Wood, Mohamed Ali Kâafar, Hassan Asghar, Nardine Basta, Michal Kepkowski:

ConvoCache: Smart Re-Use of Chatbot Responses. - Livia Qian, Gabriel Skantze:

Joint Learning of Context and Feedback Embeddings in Spoken Dialogue. - Neha Sahipjohn, Ashishkumar Gudmalwar, Nirmesh Shah, Pankaj Wasnik, Rajiv Ratn Shah:

DubWise: Video-Guided Speech Duration Control in Multimodal LLM-based Text-to-Speech for Dubbing. - Siyang Wang, Éva Székely, Joakim Gustafson:

Contextual Interactive Evaluation of TTS Models in Dialogue Systems. - Min-Han Shih, Ho-Lam Chung, Yu-Chi Pai, Ming-Hao Hsu, Guan-Ting Lin, Shang-Wen Li, Hung-yi Lee:

GSQA: An End-to-End Model for Generative Spoken Question Answering.
Speech Technology
- Mattias Nilsson, Riccardo Miccini, Clement Laroche, Tobias Piechowiak, Friedemann Zenke:

Resource-Efficient Speech Quality Prediction through Quantization Aware Training and Binary Activation Maps. - Maryam Naderi, Enno Hermann, Alexandre Nanchen, Sevada Hovsepyan, Mathew Magimai-Doss:

Towards interfacing large language models with ASR systems using confidence measures and prompting. - Zhong Meng, Zelin Wu, Rohit Prabhavalkar, Cal Peyser, Weiran Wang, Nanxin Chen, Tara N. Sainath, Bhuvana Ramabhadran:

Text Injection for Neural Contextual Biasing. - Minglin Wu, Jing Xu, Xixin Wu, Helen Meng:

Prompting Large Language Models with Mispronunciation Detection and Diagnosis Abilities.

