


25th Interspeech 2024: Kos, Greece
- Itshak Lapidot, Sharon Gannot:
25th Annual Conference of the International Speech Communication Association, Interspeech 2024, Kos, Greece, September 1-5, 2024. ISCA 2024
Keynote 1: ISCA Medallist
- Isabel Trancoso:
Towards Responsible Speech Processing.
L2 Speech, Bilingualism and Code-Switching
- Sarah Wesolek, Piotr Gulgowski, Joanna Blaszczak, Marzena Zygis:
The influence of L2 accent strength and different error types on personality trait ratings.
- Jie Chi, Electra Wallington, Peter Bell:
Characterizing code-switching: Applying Linguistic Principles for Metric Assessment and Development.
- Wei Xue, Ivan Yuen, Bernd Möbius:
Towards a better understanding of receptive multilingualism: listening conditions and priming effects.
- Debasish Ray Mohapatra, Victor Zappi, Sidney Fels:
2.5D Vocal Tract Modeling: Bridging Low-Dimensional Efficiency with 3D Accuracy.
Speaker Diarization 1
- Anurag Chowdhury, Abhinav Misra, Mark C. Fuhs, Monika Woszczyna:
Investigating Confidence Estimation Measures for Speaker Diarization.
- Xiang Li, Vivek Govindan, Rohit Paturi, Sundararajan Srinivasan:
Speakers Unembedded: Embedding-free Approach to Long-form Neural Diarization.
- Yiling Huang, Weiran Wang, Guanlong Zhao, Hank Liao, Wei Xia, Quan Wang:
On the Success and Limitations of Auxiliary Network Based Word-Level End-to-End Neural Speaker Diarization.
- Marc Härkönen, Samuel J. Broughton, Lahiru Samarakoon:
EEND-M2F: Masked-attention mask transformers for speaker diarization.
- Yongkang Yin, Xu Li, Ying Shan, Yuexian Zou:
AFL-Net: Integrating Audio, Facial, and Lip Modalities with a Two-step Cross-attention for Robust Speaker Diarization in the Wild.
- Arunav Arya, Murtiza Ali, Karan Nathwani:
Exploiting Wavelet Scattering Transform for an Unsupervised Speaker Diarization in Deep Neural Network Framework.
Speech and Audio Analysis and Representations
- Hang Zhao, Yifei Xin, Zhesong Yu, Bilei Zhu, Lu Lu, Zejun Ma:
MINT: Boosting Audio-Language Model via Multi-Target Pre-Training and Instruction Tuning.
- Daisuke Niizumi, Daiki Takeuchi, Yasunori Ohishi, Noboru Harada, Masahiro Yasuda, Shunsuke Tsubaki, Keisuke Imoto:
M2D-CLAP: Masked Modeling Duo Meets CLAP for Learning General-purpose Audio-Language Representation.
- Yusuke Fujita, Tatsuya Komatsu:
Audio Fingerprinting with Holographic Reduced Representations.
- David Meyer, Eitan Abecassis, Clara Fernandez-Labrador, Christopher Schroers:
RAST: A Reference-Audio Synchronization Tool for Dubbed Content.
- Xuefei Li, Hao Huang, Ying Hu, Liang He, Jiabao Zhang, Yuyi Wang:
YOLOPitch: A Time-Frequency Dual-Branch YOLO Model for Pitch Estimation.
- Asad Ullah, Alessandro Ragano, Andrew Hines:
Reduce, Reuse, Recycle: Is Perturbed Data Better than Other Language Augmentation for Low Resource Self-Supervised Speech Models.
- Jaden Pieper, Stephen Voran:
AlignNet: Learning dataset score alignment functions to enable better training of speech quality estimators.
Acoustic Event Detection and Classification 2
- Dawei Liang, Alice Zhang, David Harwath, Edison Thomaz:
Improving Audio Classification with Low-Sampled Microphone Input: An Empirical Study Using Model Self-Distillation.
- Da Mu, Zhicheng Zhang, Haobo Yue:
MFF-EINV2: Multi-scale Feature Fusion across Spectral-Spatial-Temporal Domains for Sound Event Localization and Detection.
- Hyeonuk Nam, Seong-Hu Kim, Deokki Min, Junhyeok Lee, Yong-Hwa Park:
Diversifying and Expanding Frequency-Adaptive Convolution Kernels for Sound Event Detection.
- Tuan Vu Ho, Kota Dohi, Yohei Kawaguchi:
Stream-based Active Learning for Anomalous Sound Detection in Machine Condition Monitoring.
- Anbai Jiang, Bing Han, Zhiqiang Lv, Yufeng Deng, Wei-Qiang Zhang, Xie Chen, Yanmin Qian, Jia Liu, Pingyi Fan:
AnoPatch: Towards Better Consistency in Machine Anomalous Sound Detection.
- Zeyu Xie, Baihan Li, Xuenan Xu, Zheng Liang, Kai Yu, Mengyue Wu:
FakeSound: Deepfake General Audio Detection.
- Shabnam Ghaffarzadegan, Luca Bondi, Wei-Cheng Lin, Abinaya Kumar, Ho-Hsiang Wu, Hans-Georg Horst, Samarjit Das:
Sound of Traffic: A Dataset for Acoustic Traffic Identification and Counting.
Detection and Classification of Bioacoustic Signals
- Sahil Kumar, Jialu Li, Youshan Zhang:
Vision Transformer Segmentation for Visual Bird Sound Denoising.
- Xin Jing, Luyang Zhang, Jiangjian Xie, Alexander Gebhard, Alice Baird, Björn W. Schuller:
DB3V: A Dialect Dominated Dataset of Bird Vocalisation for Cross-corpus Bird Species Recognition.
- Jules Cauzinille, Benoît Favre, Ricard Marxer, Dena J. Clink, Abdul Hamid Ahmad, Arnaud Rey:
Investigating self-supervised speech models' ability to classify animal vocalizations: The case of gibbon's vocal signatures.
- Xihang Qiu, Lixian Zhu, Zikai Song, Zeyu Chen, Haojie Zhang, Kun Qian, Ye Zhang, Bin Hu, Yoshiharu Yamamoto, Björn W. Schuller:
Study Selectively: An Adaptive Knowledge Distillation based on a Voting Network for Heart Sound Classification.
- Jie Lin, Xiuping Yang, Li Xiao, Xinhong Li, Weiyan Yi, Yuhong Yang, Weiping Tu, Xiong Chen:
SimuSOE: A Simulated Snoring Dataset for Obstructive Sleep Apnea-Hypopnea Syndrome Evaluation during Wakefulness.
Acoustic Echo Cancellation
- Premanand Nayak, Kamini Sabu, M. Ali Basha Shaik:
Multi-mic Echo Cancellation Coalesced with Beamforming for Real World Adverse Acoustic Conditions.
- Vahid Khanagha, Dimitris Koutsaidis, Kaustubh Kalgaonkar, Sriram Srinivasan:
Interference Aware Training Target for DNN based joint Acoustic Echo Cancellation and Noise Suppression.
- Yi Gao, Xiang Su:
Low Complexity Echo Delay Estimator Based on Binarized Feature Matching.
- Ye Ni, Cong Pang, Chengwei Huang, Cairong Zou:
MSA-DPCRN: A Multi-Scale Asymmetric Dual-Path Convolution Recurrent Network with Attentional Feature Fusion for Acoustic Echo Cancellation.
- Ofer Schwartz, Sharon Gannot:
Efficient Joint Beamforming and Acoustic Echo Cancellation Structure for Conference Call Scenarios.
- Fei Zhao, Jinjiang Liu, Xueliang Zhang:
SDAEC: Signal Decoupling for Advancing Acoustic Echo Cancellation.
Speech Synthesis: Voice Conversion 1
- Kentaro Seki, Shinnosuke Takamichi, Norihiro Takamune, Yuki Saito, Kanami Imamura, Hiroshi Saruwatari:
Spatial Voice Conversion: Voice Conversion Preserving Spatial Information and Non-target Signals.
- Alan Baade, Puyuan Peng, David Harwath:
Neural Codec Language Models for Disentangled and Textless Voice Conversion.
- Max Morrison, Cameron Churchwell, Nathan Pruyne, Bryan Pardo:
Fine-Grained and Interpretable Neural Speech Editing.
- Takuhiro Kaneko, Hirokazu Kameoka, Kou Tanaka, Yuto Kondo:
FastVoiceGrad: One-step Diffusion-Based Voice Conversion with Adversarial Conditional Diffusion Distillation.
- Ziqian Ning, Shuai Wang, Pengcheng Zhu, Zhichao Wang, Jixun Yao, Lei Xie, Mengxiao Bi:
DualVC 3: Leveraging Language Model Generated Pseudo Context for End-to-end Low Latency Streaming Voice Conversion.
- Tianhua Qi, Shiyan Wang, Cheng Lu, Yan Zhao, Yuan Zong, Wenming Zheng:
Towards Realistic Emotional Voice Conversion using Controllable Emotional Intensity.
Neural Network Architectures for ASR 2
- Yu Nakagome, Michael Hentschel:
InterBiasing: Boost Unseen Word Recognition through Biasing Intermediate Predictions.
- Qinglin Meng, Min Liu, Kaixun Huang, Kun Wei, Lei Xie, Zongfeng Quan, Weihong Deng, Quan Lu, Ning Jiang, Guoqing Zhao:
SEQ-former: A context-enhanced and efficient automatic speech recognition framework.
- Robert Flynn, Anton Ragni:
How Much Context Does My Attention-Based ASR System Need?
- Vincenzo Norman Vitale, Loredana Schettino, Francesco Cutugno:
Rich speech signal: exploring and exploiting end-to-end automatic speech recognizers' ability to model hesitation phenomena.
- Tian-Hao Zhang, Xinyuan Qian, Feng Chen, Xu-Cheng Yin:
Transmitted and Aggregated Self-Attention for Automatic Speech Recognition.
- Darshan Prabhu, Yifan Peng, Preethi Jyothi, Shinji Watanabe:
MULTI-CONVFORMER: Extending Conformer with Multiple Convolution Kernels.
- Koichi Miyazaki, Yoshiki Masuyama, Masato Murata:
Exploring the Capability of Mamba in Speech Applications.
- Genshun Wan, Mengzhi Wang, Tingzhi Mao, Hang Chen, Zhongfu Ye:
Lightweight Transducer Based on Frame-Level Criterion.
- Ankit Gupta, George Saon, Brian Kingsbury:
Exploring the limits of decoder-only models trained on public speech recognition corpora.
- Xun Gong, Anqi Lv, Zhiming Wang, Yanmin Qian:
Contextual Biasing Speech Recognition in Speech-enhanced Large Language Model.
Decoding Algorithms
- Tianzi Wang, Xurong Xie, Zhaoqing Li, Shoukang Hu, Zengrui Jin, Jiajun Deng, Mingyu Cui, Shujie Hu, Mengzhe Geng, Guinan Li, Helen Meng, Xunying Liu:
Towards Effective and Efficient Non-autoregressive Decoding Using Block-based Attention Mask.
- Kun Zou, Fengyun Tan, Ziyang Zhuang, Chenfeng Miao, Tao Wei, Shaodan Zhai, Zijian Li, Wei Hu, Shaojun Wang, Jing Xiao:
E-Paraformer: A Faster and Better Parallel Transformer for Non-autoregressive End-to-End Mandarin Speech Recognition.
- Martino Ciaperoni, Athanasios Katsamanis, Aristides Gionis, Panagiotis Karras:
Beam-search SIEVE for low-memory speech recognition.
- Daniel Galvez, Vladimir Bataev, Hainan Xu, Tim Kaldewey:
Speed of Light Exact Greedy Decoding for RNN-T Speech Recognition Models on GPU.
- Weiran Wang, Zelin Wu, Diamantino Caseiro, Tsendsuren Munkhdalai, Khe Chai Sim, Pat Rondon, Golan Pundak, Gan Song, Rohit Prabhavalkar, Zhong Meng, Ding Zhao, Tara Sainath, Yanzhang He, Pedro Moreno Mengibar:
Contextual Biasing with the Knuth-Morris-Pratt Matching Algorithm.
- Tatsunari Takagi, Yukoh Wakabayashi, Atsunori Ogawa, Norihide Kitaoka:
Text-only Domain Adaptation for CTC-based Speech Recognition through Substitution of Implicit Linguistic Information in the Search Space.
Pronunciation Assessment
- Xintong Wang, Mingqian Shi, Ye Wang:
Pitch-Aware RNN-T for Mandarin Chinese Mispronunciation Detection and Diagnosis.
- Yu-Wen Chen, Zhou Yu, Julia Hirschberg:
MultiPA: A Multi-task Speech Pronunciation Assessment Model for Open Response Scenarios.
- Xinwei Cao, Zijian Fan, Torbjørn Svendsen, Giampiero Salvi:
A Framework for Phoneme-Level Pronunciation Assessment Using CTC.
- Mostafa Shahin, Beena Ahmed:
Phonological-Level Mispronunciation Detection and Diagnosis.
- Heejin Do, Wonjun Lee, Gary Geunbae Lee:
Acoustic Feature Mixup for Balanced Multi-aspect Pronunciation Assessment.
- Nhan Phan, Anna von Zansen, Maria Kautonen, Ekaterina Voskoboinik, Tamás Grósz, Raili Hildén, Mikko Kurimo:
Automated content assessment and feedback for Finnish L2 learners in a picture description speaking task.
Spoken Language Processing
- Zhenyu Wang, Shuyu Kong, Li Wan, Biqiao Zhang, Yiteng Huang, Mumin Jin, Ming Sun, Xin Lei, Zhaojun Yang:
Query-by-Example Keyword Spotting Using Spectral-Temporal Graph Attentive Pooling and Multi-Task Learning.
- Youngmoon Jung, Seungjin Lee, Joon-Young Yang, Jaeyoung Roh, Chang Woo Han, Hoonyoung Cho:
Relational Proxy Loss for Audio-Text based Keyword Spotting.
- Sichen Jin, Youngmoon Jung, Seungjin Lee, Jaeyoung Roh, Changwoo Han, Hoonyoung Cho:
CTC-aligned Audio-Text Embedding for Streaming Open-vocabulary Keyword Spotting.
- Haoyu Li, Baochen Yang, Yu Xi, Linfeng Yu, Tian Tan, Hao Li, Kai Yu:
Text-aware Speech Separation for Multi-talker Keyword Spotting.
- Hao Yen, Pin-Jui Ku, Sabato Marco Siniscalchi, Chin-Hui Lee:
Language-Universal Speech Attributes Modeling for Zero-Shot Multilingual Spoken Keyword Recognition.
- Raul Monteiro:
Adding User Feedback To Enhance CB-Whisper.
- Yifan Peng, Jinchuan Tian, William Chen, Siddhant Arora, Brian Yan, Yui Sudo, Muhammad Shakeel, Kwanghee Choi, Jiatong Shi, Xuankai Chang, Jee-weon Jung, Shinji Watanabe:
OWSM v3.1: Better and Faster Open Whisper-Style Speech Models based on E-Branchformer.
Spoken Machine Translation 2
- Nan Chen, Yonghe Wang, Feilong Bao:
Parameter-Efficient Adapter Based on Pre-trained Models for Speech Translation.
- Badr M. Abdullah, Mohammed Maqsood Shaik, Dietrich Klakow:
Wave to Interlingua: Analyzing Representations of Multilingual Speech Transformers for Spoken Language Translation.
- Nan Chen, Yonghe Wang, Feilong Bao:
Knowledge-Preserving Pluggable Modules for Multilingual Speech Translation Tasks.
- Rastislav Rabatin, Frank Seide, Ernie Chang:
Navigating the Minefield of MT Beam Search in Cascaded Streaming Speech Translation.
- Peidong Wang, Jian Xue, Jinyu Li, Jun-Kun Chen, Aswin Shanmugam Subramanian:
Soft Language Identification for Language-Agnostic Many-to-One End-to-End Speech Translation.
- Dan Oneata, Herman Kamper:
Translating speech with just images.
- Sameer Khurana, Chiori Hori, Antoine Laurent, Gordon Wichern, Jonathan Le Roux:
ZeroST: Zero-Shot Speech Translation.
Biosignal-enabled Spoken Communication
- Jinyu Li, Leonardo Lancia:
A multimodal approach to study the nature of coordinative patterns underlying speech rhythm.
- Peter Wu, Ryan Kaveh, Raghav Nautiyal, Christine Zhang, Albert Guo, Anvitha Kachinthaya, Tavish Mishra, Bohan Yu, Alan W. Black, Rikky Muller, Gopala Krishna Anumanchipalli:
Towards EMG-to-Speech with Necklace Form Factor.
- Chris Bras, Tanvina Patel, Odette Scharenborg:
Using articulated speech EEG signals for imagined speech decoding.
- Jinuk Kwon, David Harwath, Debadatta Dash, Paul Ferrari, Jun Wang:
Direct Speech Synthesis from Non-Invasive, Neuromagnetic Signals.
- Yudong Yang, Rongfeng Su, Rukiye Ruzi, Manwa L. Ng, Shaofeng Zhao, Nan Yan, Lan Wang:
Optical Flow Guided Tongue Trajectory Generation for Diffusion-based Acoustic to Articulatory Inversion.
- Rishi Jain, Bohan Yu, Peter Wu, Tejas S. Prabhune, Gopala Anumanchipalli:
Multimodal Segmentation for Vocal Tract Modeling.
- Jesuraj Bandekar, Sathvik Udupa, Prasanta Kumar Ghosh:
Articulatory synthesis using representations learnt through phonetic label-aware contrastive loss.
- Yujie Yan, Xiran Xu, Haolin Zhu, Pei Tian, Zhongshu Ge, Xihong Wu, Jing Chen:
Auditory Attention Decoding in Four-Talker Environment with EEG.
- Zijie Lin, Tianyu He, Siqi Cai, Haizhou Li:
ASA: An Auditory Spatial Attention Dataset with Multiple Speaking Locations.
- Saurav Pahuja, Gabriel Ivucic, Pascal Himmelmann, Siqi Cai, Tanja Schultz, Haizhou Li:
Leveraging Graphic and Convolutional Neural Networks for Auditory Attention Detection with EEG.
Individual and Social Factors in Phonetics
- Tillmann Pistor, Adrian Leemann:
Echoes of Implicit Bias: Exploring Aesthetics and Social Meanings of Swiss German Dialect Features.
- Vivian Guo Li:
In search of structure and correspondence in intra-speaker trial-to-trial variability.
- Irene Smith, Morgan Sonderegger, Spade Consortium:
Modelled Multivariate Overlap: A method for measuring vowel merger.
- Keiko Ochi, Koji Inoue, Divesh Lala, Tatsuya Kawahara:
Entrainment Analysis and Prosody Prediction of Subsequent Interlocutor's Backchannels in Dialogue.
- James Tanner, Morgan Sonderegger, Jane Stuart-Smith, Tyler Kendall, Jeff Mielke, Robin Dodsworth, Erik Thomas:
Exploring the anatomy of articulation rate in spontaneous English speech: relationships between utterance length effects and social factors.
- Katelyn Taylor, Amelia Jane Gully, Helena Daffern:
Familiar and Unfamiliar Speaker Identification in Speech and Singing.
Paralinguistics
- Luis Felipe Parra-Gallego, Tilak Purohit, Bogdan Vlasenko, Juan Rafael Orozco-Arroyave, Mathew Magimai-Doss:
Cross-transfer Knowledge between Speech and Text Encoders to Evaluate Customer Satisfaction.
- Manila Kodali, Sudarsana Reddy Kadiri, Paavo Alku:
Fine-tuning of Pre-trained Models for Classification of Vocal Intensity Category from Speech Signals.
- Alexander Kathan, Martin Bürger, Andreas Triantafyllopoulos, Sabrina Milkus, Jonas Hohmann, Pauline Muderlak, Jürgen Schottdorf, Richard Musil, Björn W. Schuller, Shahin Amiriparian:
Real-world PTSD Recognition: A Cross-corpus and Cross-linguistic Evaluation.
- Debasmita Bhattacharya, Eleanor Lin, Run Chen, Julia Hirschberg:
Switching Tongues, Sharing Hearts: Identifying the Relationship between Empathy and Code-switching in Speech.
Speaker Recognition: Adversarial and Spoofing Attacks
- Eros Rosello, Angel M. Gomez, Iván López-Espejo, Antonio M. Peinado, Juan M. Martín-Doñas:
Anti-spoofing Ensembling Model: Dynamic Weight Allocation in Ensemble Models for Improved Voice Biometrics Security.
- Lin Zhang, Xin Wang, Erica Cooper, Mireia Díez, Federico Landini, Nicholas W. D. Evans, Junichi Yamagishi:
Spoof Diarization: "What Spoofed When" in Partially Spoofed Audio.
- Haochen Wu, Wu Guo, Zhentao Zhang, Wenting Zhao, Shengyu Peng, Jie Zhang:
Spoofing Speech Detection by Modeling Local Spectro-Temporal and Long-term Dependency.
- Jingze Lu, Yuxiang Zhang, Zhuo Li, Zengqiang Shang, Wenchao Wang, Pengyuan Zhang:
Improving Copy-Synthesis Anti-Spoofing Training Method with Rhythm and Speaker Perturbation.
- Yip Keng Kan, Ke Xu, Hao Li, Jie Shi:
VoiceDefense: Protecting Automatic Speaker Verification Models Against Black-box Adversarial Attacks.
- Xuanjun Chen, Jiawei Du, Haibin Wu, Jyh-Shing Roger Jang, Hung-yi Lee:
Neural Codec-based Adversarial Sample Detection for Speaker Verification.
- Sizhou Chen, Yibo Bai, Jiadi Yao, Xiao-Lei Zhang, Xuelong Li:
Textual-Driven Adversarial Purification for Speaker Verification.
- Zhuhai Li, Jie Zhang, Wu Guo, Haochen Wu:
Boosting the Transferability of Adversarial Examples with Gradient-Aligned Ensemble Attack for Speaker Recognition.
- Duc-Tuan Truong, Ruijie Tao, Tuan Nguyen, Hieu-Thi Luong, Kong Aik Lee, Eng Siong Chng:
Temporal-Channel Modeling in Multi-head Self-Attention for Synthetic Speech Detection.
Audio Event Detection and Classification 1
- Tiantian Feng, Dimitrios Dimitriadis, Shrikanth S. Narayanan:
Can Synthetic Audio From Generative Foundation Models Assist Audio Recognition and Speech Modeling?
- Heinrich Dinkel, Zhiyong Yan, Yongqing Wang, Junbo Zhang, Yujun Wang, Bin Wang:
Scaling up masked audio encoder learning for general audio classification.
- Sarthak Yadav, Zheng-Hua Tan:
Audio Mamba: Selective State Spaces for Self-Supervised Audio Representations.
- Pengfei Cai, Yan Song, Kang Li, Haoyu Song, Ian McLoughlin:
MAT-SED: A Masked Audio Transformer with Masked-Reconstruction Based Pre-training for Sound Event Detection.
- Janek Ebbers, François G. Germain, Gordon Wichern, Jonathan Le Roux:
Sound Event Bounding Boxes.
- Yanxiong Li, Jiaxin Tan, Guoqing Chen, Jialong Li, Yongjie Si, Qianhua He:
Low-Complexity Acoustic Scene Classification Using Parallel Attention-Convolution Network.
Source Separation 2
- Hassan Taherian, Vahid Ahmadi Kalkhorani, Ashutosh Pandey, Daniel Wong, Buye Xu, DeLiang Wang:
Towards Explainable Monaural Speaker Separation with Auditory-based Training.
- Iva Ewert, Marvin Borsdorf, Haizhou Li, Tanja Schultz:
Does the Lombard Effect Matter in Speech Separation? Introducing the Lombard-GRID-2mix Dataset.
- Zexu Pan, Gordon Wichern, François G. Germain, Kohei Saijo, Jonathan Le Roux:
PARIS: Pseudo-AutoRegressIve Siamese Training for Online Speech Separation.
- Yiru Zhang, Linyu Yao, Qun Yang:
OR-TSE: An Overlap-Robust Speaker Encoder for Target Speech Extraction.
- Tsun-An Hsieh, Heeyoul Choi, Minje Kim:
Multimodal Representation Loss Between Timed Text and Audio for Regularized Speech Separation.
- Jingru Lin, Meng Ge, Junyi Ao, Liqun Deng, Haizhou Li:
SA-WavLM: Speaker-Aware Self-Supervised Pre-training for Mixture Speech.
- Yiwen Wang, Xihong Wu:
TSE-PI: Target Sound Extraction under Reverberant Environments with Pitch Information.
- Kohei Saijo, Gordon Wichern, François G. Germain, Zexu Pan, Jonathan Le Roux:
Enhanced Reverberation as Supervision for Unsupervised Speech Separation.
Noise Reduction, Dereverberation, and Echo Cancellation
- Fei Zhao, Chenggang Zhang, Shulin He, Jinjiang Liu, Xueliang Zhang:
Deep Echo Path Modeling for Acoustic Echo Cancellation.
- Hongmei Guo, Yijiang Chen, Xiaolei Zhang, Xuelong Li:
Graph Attention Based Multi-Channel U-Net for Speech Dereverberation With Ad-Hoc Microphone Arrays.
- Louis Bahrman, Mathieu Fontaine, Jonathan Le Roux, Gaël Richard:
Speech dereverberation constrained on room impulse response characteristics.
- Kuang Yuan, Shuo Han, Swarun Kumar, Bhiksha Raj:
DeWinder: Single-Channel Wind Noise Reduction using Ultrasound Sensing.
- Alexander Barnhill, Elmar Nöth, Andreas K. Maier, Christian Bergler:
ANIMAL-CLEAN - A Deep Denoising Toolkit for Animal-Independent Signal Enhancement.
- Premanand Nayak, M. Ali Basha Shaik:
Elucidating Clock-drift Using Real-world Audios In Wireless Mode For Time-offset Insensitive End-to-End Asynchronous Acoustic Echo Cancellation.
- Shilin Wang, Haixin Guan, Yanhua Long:
QMixCAT: Unsupervised Speech Enhancement Using Quality-guided Signal Mixing and Competitive Alternating Model Training.
Computationally-Efficient Speech Enhancement
- Hanbin Bae, Pavel Andreev, Azat Saginbaev, Nicholas Babaev, Won-Jun Lee, Hosang Sung, Hoon-Young Cho:
Speech Boosting: Low-Latency Live Speech Enhancement for TWS Earbuds.
- Behnam Gholami, Mostafa El-Khamy, Kee-Bong Song:
Knowledge Distillation for Tiny Speech Enhancement with Latent Feature Augmentation.
- Yuewei Zhang, Huanbin Zou, Jie Zhu:
Sub-PNWR: Speech Enhancement Based on Signal Sub-Band Splitting and Pseudo Noisy Waveform Reconstruction Loss.
- Zugang Zhao, Jinghong Zhang, Yonghui Liu, Jianbing Liu, Kai Niu, Zhiqiang He:
Streamlining Speech Enhancement DNNs: an Automated Pruning Method Based on Dependency Graph with Advanced Regularized Loss Strategies.
- Zehua Zhang, Xuyi Zhuang, Yukun Qian, Mingjiang Wang:
Lightweight Dynamic Sparse Transformer for Monaural Speech Enhancement.
- Zizhen Lin, Xiaoting Chen, Junyu Wang:
MUSE: Flexible Voiceprint Receptive Fields and Multi-Path Fusion Enhanced Taylor Transformer for U-Net-based Speech Enhancement.
- Longbiao Cheng, Ashutosh Pandey, Buye Xu, Tobi Delbruck, Shih-Chii Liu:
Dynamic Gated Recurrent Neural Network for Compute-efficient Speech Enhancement.
Zero-shot TTS
- Jinlong Xue, Yayue Deng, Yicheng Han, Yingming Gao, Ya Li:
Improving Audio Codec-based Zero-Shot Text-to-Speech Synthesis with Multi-Modal Context and Large Language Model.
- Xiaofei Wang, Sefik Emre Eskimez, Manthan Thakker, Hemin Yang, Zirun Zhu, Min Tang, Yufei Xia, Jinzhu Li, Sheng Zhao, Jinyu Li, Naoyuki Kanda:
An Investigation of Noise Robustness for Flow-Matching-Based Zero-Shot TTS.
- Kenichi Fujita, Takanori Ashihara, Marc Delcroix, Yusuke Ijima:
Lightweight Zero-shot Text-to-Speech with Mixture of Adapters.
- Vikentii Pankov, Valeria Pronina, Alexander Kuzmin, Maksim Borisov, Nikita Usoltsev, Xingshan Zeng, Alexander Golubkov, Nikolai Ermolenko, Aleksandra Shirshova, Yulia Matveeva:
DINO-VITS: Data-Efficient Zero-Shot TTS with Self-Supervised Speaker Verification Loss for Noise Robustness.
Noise Robustness, Far-Field, and Multi-Talker ASR
- Zengrui Jin, Yifan Yang, Mohan Shi, Wei Kang, Xiaoyu Yang, Zengwei Yao, Fangjun Kuang, Liyong Guo, Lingwei Meng, Long Lin, Yong Xu, Shi-Xiong Zhang, Daniel Povey:
LibriheavyMix: A 20,000-Hour Dataset for Single-Channel Reverberant Multi-Talker Speech Separation, ASR and Speaker Diarization.
- Xujiang Xing, Mingxing Xu, Thomas Fang Zheng:
A Joint Noise Disentanglement and Adversarial Training Framework for Robust Speaker Verification.
- Ying Shi, Lantian Li, Shi Yin, Dong Wang, Jiqing Han:
Serialized Output Training by Learned Dominance.
- Xianrui Zheng, Guangzhi Sun, Chao Zhang, Philip C. Woodland:
SOT Triggered Neural Clustering for Speaker Attributed ASR.
- Yoshiaki Bando, Tomohiko Nakamura, Shinji Watanabe:
Neural Blind Source Separation and Diarization for Distant Speech Recognition.
- Ryo Masumura, Naoki Makishima, Tomohiro Tanaka, Mana Ihori, Naotaka Kawata, Shota Orihashi, Kazutoshi Shinoda, Taiga Yamane, Saki Mizuno, Keita Suzuki, Satoshi Suzuki, Nobukatsu Hojo, Takafumi Moriya, Atsushi Ando:
Unified Multi-Talker ASR with and without Target-speaker Enrollment.
Contextual Biasing and Adaptation
- Aviv Shamsian, Aviv Navon, Neta Glazer, Gill Hetz, Joseph Keshet:
Keyword-Guided Adaptation of Automatic Speech Recognition.
- Nguyen Manh Tien Anh, Thach Ho Sy:
Improving Speech Recognition with Prompt-based Contextualized ASR and LLM-based Re-predictor.
- Peng Wang, Yifan Yang, Zheng Liang, Tian Tan, Shiliang Zhang, Xie Chen:
Incorporating Class-based Language Model for Named Entity Recognition in Factorized Neural Transducer.
- Chengxu Yang, Lin Zheng, Sanli Tian, Gaofeng Cheng, Sujie Xiao, Ta Li:
Contextual Biasing with Confidence-based Homophone Detector for Mandarin End-to-End Speech Recognition.
- Ruizhe Huang, Mahsa Yarmohammadi, Sanjeev Khudanpur, Daniel Povey:
Improving Neural Biasing for Contextual Speech Recognition by Early Context Injection and Text Perturbation.
- Andrei Andrusenko, Aleksandr Laptev, Vladimir Bataev, Vitaly Lavrukhin, Boris Ginsburg:
Fast Context-Biasing for CTC and Transducer ASR models with CTC-based Word Spotter.
- Xizi Wei, Stephen McGregor:
Prompt Tuning for Speech Recognition on Unknown Spoken Name Entities.
- Junzhe Liu, Jianwei Yu, Xie Chen:
Improved Factorized Neural Transducer Model For Text-only Domain Adaptation.
- Pin-Yen Liu, Jen-Tzung Chien:
Modality Translation Learning for Joint Speech-Text Model.
- Qiuming Zhao, Guangzhi Sun, Chao Zhang, Mingxing Xu, Thomas Fang Zheng:
SAML: Speaker Adaptive Mixture of LoRA Experts for End-to-End ASR.
- Atsushi Ando, Takafumi Moriya, Shota Horiguchi, Ryo Masumura:
Factor-Conditioned Speaking-Style Captioning.
- Yerbolat Khassanov, Zhipeng Chen, Tianfeng Chen, Tze Yuang Chong, Wei Li, Jun Zhang, Lu Lu, Yuxuan Wang:
Dual-Pipeline with Low-Rank Adaptation for New Language Integration in Multilingual ASR.
- Bolaji Yusuf, Murali Karthick Baskar, Andrew Rosenberg, Bhuvana Ramabhadran:
Speculative Speech Recognition by Audio-Prefixed Low-Rank Adaptation of Language Models.
- Junghun Kim, Ka Hyun Park, Hoyoung Yoon, U Kang:
Domain-Aware Data Selection for Speech Classification via Meta-Reweighting.
Spoken Language Understanding
- Hayato Futami, Siddhant Arora, Yosuke Kashiwagi, Emiru Tsunoo, Shinji Watanabe:
Finding Task-specific Subnetworks in Multi-task Spoken Language Understanding Model.
- Dejan Porjazovski, Anssi Moisio, Mikko Kurimo:
Out-of-distribution generalisation in spoken language understanding.
- Gaëlle Laperrière, Sahar Ghannay, Bassam Jabaian, Yannick Estève:
A dual task learning approach to fine-tune a multilingual semantic speech encoder for Spoken Language Understanding.
- Beomseok Lee, Ioan Calapodescu, Marco Gaido, Matteo Negri, Laurent Besacier:
Speech-MASSIVE: A Multilingual Speech Dataset for SLU and Beyond.
- Yuang Li, Jiawei Yu, Min Zhang, Mengxin Ren, Yanqing Zhao, Xiaofeng Zhao, Shimin Tao, Jinsong Su, Hao Yang:
Using Large Language Model for End-to-End Chinese ASR and NER.
- Alkis Koudounas, Flavio Giobergia, Eliana Pastor, Elena Baralis:
A Contrastive Learning Approach to Mitigate Bias in Speech Models.
Spoken Machine Translation 1
- Chao-Wei Huang, Hui Lu, Hongyu Gong, Hirofumi Inaguma, Ilia Kulikov, Ruslan Mavlyutov, Sravya Popuri:
Investigating Decoder-only Large Language Models for Speech-to-text Translation.
- Nameer Hirschkind, Xiao Yu, Mahesh Kumar Nandwana, Joseph Liu, Eloi du Bois, Dao Le, Nicolas Thiebaut, Colin Sinclair, Kyle Spence, Charles Shang, Zoë Abrams, Morgan McGuire:
Diffusion Synthesizer for Efficient Multilingual Speech to Speech Translation.
- Nan Chen, Yonghe Wang, Feilong Bao:
Sign Value Constraint Decomposition for Efficient 1-Bit Quantization of Speech Translation Tasks.
- Jaesong Lee, Soyoon Kim, Hanbyul Kim, Joon Son Chung:
Lightweight Audio Segmentation for Long-form Speech Translation.
- Haotian Tan, Sakriani Sakti:
Contrastive Feedback Mechanism for Simultaneous Speech Translation.
- Cécile Macaire, Chloé Dion, Didier Schwab, Benjamin Lecouteux, Emmanuelle Esperança-Rodier:
Towards Speech-to-Pictograms Translation.
Hearing Disorders
- Seonwoo Lee, Sunhee Kim, Minhwa Chung:
Automatic Assessment of Speech Production Skills for Children with Cochlear Implants Using Wav2Vec2.0 Acoustic Embeddings.
- Youngjin Ahn, Jungwoo Park, Sangha Park, Jonghyun Choi, Kee-Eung Kim:
SyncVSR: Data-Efficient Visual Speech Recognition with End-to-End Crossmodal Audio Token Synchronization.
- Mark A. Huckvale, Gaston Hilkhuysen:
Evaluating a 3-factor listener model for prediction of speech intelligibility to hearing-impaired listeners.
- Sophie Fagniart, Brigitte Charlier, Véronique Delvaux, Bernard Harmegnies, Anne Huberlant, Myriam Piccaluga, Kathy Huet:
Production of fricative consonants in French-speaking children with cochlear implants and typical hearing: acoustic and phonological analyses.
- Toshio Irino, Shintaro Doan, Minami Ishikawa:
Signal processing algorithm effective for sound quality of hearing loss simulators.
- Yixiang Niu, Ning Chen, Hongqing Zhu, Zhiying Zhu, Guangqiang Li, Yibo Chen:
Auditory Spatial Attention Detection Based on Feature Disentanglement and Brain Connectivity-Informed Graph Neural Networks.
- Jessica Monaghan, Arun Sebastian, Nicky Chong-White, Vicky Zhang, Vijayalakshmi Easwar, Pádraig Kitterick:
Automatic Detection of Hearing Loss from Children's Speech using wav2vec 2.0 Features.
Speech Disorders 2
- Vrushank Changawala, Frank Rudzicz:
Whister: Using Whisper's representations for Stuttering detection.
- Yan Xiong, Visar Berisha, Julie Liss, Chaitali Chakrabarti:
Improving Speech-Based Dysarthria Detection using Multi-task Learning with Gradient Projection.
- Guanlin Chen, Yun Jin:
Cascaded Transfer Learning Strategy for Cross-Domain Alzheimer's Disease Recognition through Spontaneous Speech.
- Loukas Ilias, Dimitris Askounis:
A Cross-Attention Layer coupled with Multimodal Fusion Methods for Recognizing Depression from Spontaneous Speech.
- Si-Ioi Ng, Lingfeng Xu, Kimberly D. Mueller, Julie Liss, Visar Berisha:
Segmental and Suprasegmental Speech Foundation Models for Classifying Cognitive Risk Factors: Evaluating Out-of-the-Box Performance.
- Katerina Papadimitriou, Gerasimos Potamianos:
Multimodal Continuous Fingerspelling Recognition via Visual Alignment Learning.
- Tomás Arias-Vergara, Paula Andrea Pérez-Toro, Xiaofeng Liu, Fangxu Xing, Maureen Stone, Jiachen Zhuo, Jerry L. Prince, Maria Schuster, Elmar Nöth, Jonghye Woo, Andreas K. Maier:
Contrastive Learning Approach for Assessment of Phonological Precision in Patients with Tongue Cancer Using MRI Data.
- Haojie Zhang, Tao Zhang, Ganjun Liu, Dehui Fu, Xiaohui Hou, Ying Lv:
DysArinVox: DYSphonia & DYSarthria mandARIN speech corpus.
- Xuanru Zhou, Anshul Kashyap, Steve Li, Ayati Sharma, Brittany Morin, David Baquirin, Jet Vonk, Zoe Ezzes, Zachary Miller, Maria Luisa Gorno-Tempini, Jiachen Lian, Gopala Anumanchipalli:
YOLO-Stutter: End-to-end Region-Wise Speech Dysfluency Detection.
- Gábor Gosztolya, Veronika Svindt, Judit Bóna, Ildikó Hoffmann:
Automatic Longitudinal Investigation of Multiple Sclerosis Subjects.
TAUKADIAL Challenge: Speech-Based Cognitive Assessment in Chinese and English (Special Session)
- Saturnino Luz, Sofia de la Fuente Garcia, Fasih Haider, Davida Fromm, Brian MacWhinney, Alyssa Lanzi, Ya-Ning Chang, Chia-Ju Chou, Yi-Chien Liu:
Connected Speech-Based Cognitive Assessment in Chinese and English.
- David Ortiz-Perez, José García Rodríguez, David Tomás:
Cognitive Insights Across Languages: Enhancing Multimodal Interview Analysis.
- Gábor Gosztolya, László Tóth:
Combining Acoustic Feature Sets for Detecting Mild Cognitive Impairment in the Interspeech'24 TAUKADIAL Challenge.
- Junwen Duan, Fangyuan Wei, Hong-Dong Li, Jin Liu:
Pre-trained Feature Fusion and Matching for Mild Cognitive Impairment Detection.
- Benjamin Barrera-Altuna, Daeun Lee, Zaima Zarnaz, Jinyoung Han, Seungbae Kim:
The Interspeech 2024 TAUKADIAL Challenge: Multilingual Mild Cognitive Impairment Detection with Multimodal Approach.
- Anna Favaro, Tianyu Cao, Najim Dehak, Laureano Moro-Velázquez:
Leveraging Universal Speech Representations for Detecting and Assessing the Severity of Mild Cognitive Impairment Across Languages.
- Bao Hoang, Yijiang Pang, Hiroko H. Dodge, Jiayu Zhou:
Translingual Language Markers for Cognitive Assessment from Spontaneous Speech.
- Paula Andrea Pérez-Toro, Tomás Arias-Vergara, Philipp Klumpp, Tobias Weise, Maria Schuster, Elmar Nöth, Juan Rafael Orozco-Arroyave, Andreas K. Maier:
Multilingual Speech and Language Analysis for the Assessment of Mild Cognitive Impairment: Outcomes from the Taukadial Challenge.
Show and Tell 1
- Takayuki Arai, Ryohei Suzuki, Chandler Earp, Shinya Tsuji, Keiko Ochi:
Production of phrases by mechanical models of the human vocal tract. - Vishal Gourav, Ankit Tyagi, Phanindra Mankale:
Faster Vocoder: a multi threading approach to achieve low latency during TTS Inference. - Aanchan Mohan, Monideep Chakraborti, Katelyn Eng, Nailia Kushaeva, Mirjana Prpa, Jordan Lewis, Tianyi Zhang, Vince Geisler, Carol Geisler:
A powerful and modern AAC composition tool for impaired speakers. - Grzegorz P. Mika, Konrad Zielinski, Pawel Cyrta, Marek Grzelec:
VoxFlow AI: wearable voice converter for atypical speech. - Sai Akarsh C, Vamshiraghusimha Narasinga, Anil Kumar Vuppala:
Stress transfer in speech-to-speech machine translation. - Takuma Okamoto, Yamato Ohtani, Hisashi Kawai:
Mobile PresenTra: NICT fast neural text-to-speech system on smartphones with incremental inference of MS-FC-HiFi-GAN for low-latency synthesis. - Alex Peiró Lilja, José Giraldo, Martí Llopart-Font, Carme Armentano-Oller, Baybars Külebi, Mireia Farrús:
Multi-speaker and multi-dialectal Catalan TTS models for video gaming. - Juliana Francis, Éva Székely, Joakim Gustafson:
ConnecTone: a modular AAC system prototype with contextual generative text prediction and style-adaptive conversational TTS. - Mahdin Rohmatillah, Bryan Gautama Ngo, Willianto Sulaiman, Po-Chuan Chen, Jen-Tzung Chien:
Reliable dialogue system for facilitating student-counselor communication. - Harm Lameris, Joakim Gustafson, Éva Székely:
CreakVC: a voice conversion tool for modulating creaky voice. - Yu-Sheng Tsao, Yung-Chang Hsu, Jiun-Ting Li, Siang-Hong Weng, Tien-Hong Lo, Berlin Chen:
EZTalking: English assessment platform for teachers and students.
Keynote 2
- Shoko Araki:
Frontier of Frontend for Conversational Speech Processing.
Phonetics and Phonology of Second Language Acquisition
- Paige Tuttösí, H. Henny Yeung, Yue Wang, Fenqi Wang, Guillaume Denis, Jean-Julien Aucouturier, Angelica Lim:
Mmm whatcha say? Uncovering distal and proximal context effects in first and second-language word perception using psychophysical reverse correlation. - Anisia Popescu, Lori Lamel, Ioana Vasilescu, Laurence Devillers:
Automatic Speech Recognition with parallel L1 and L2 acoustic phone models to evaluate /l/ allophony in L2 English speech production. - Kevin Huang, Jack Goldberg, Louis Goldstein, Shrikanth Narayanan:
Analysis of articulatory setting for L1 and L2 English speakers using MRI data. - Ioana Colgiu, Laura Spinu, Rajiv Rao, Yasaman Rafat:
Bilingual Rhotic Production Patterns: A Generational Comparison of Spanish-English Bilingual Speakers in Canada. - Sylvain Coulange, Tsuneo Kato, Solange Rossato, Monica Masperi:
Exploring Impact of Pausing and Lexical Stress Patterns on L2 English Comprehensibility in Real Time. - Qi Wu:
Mandarin T3 Production by Chinese and Japanese Native Speakers.
Corpora-based Approaches in Automatic Emotion Recognition
- Sumit Ranjan, Rupayan Chakraborty, Sunil Kumar Kopparapu:
Reinforcement Learning based Data Augmentation for Noise Robust Speech Emotion Recognition. - Pravin Mote, Berrak Sisman, Carlos Busso:
Unsupervised Domain Adaptation for Speech Emotion Recognition using K-Nearest Neighbors Voice Conversion. - Jincen Wang, Yan Zhao, Cheng Lu, Hailun Lian, Hongli Chang, Yuan Zong, Wenming Zheng:
Confidence-aware Hypothesis Transfer Networks for Source-Free Cross-Corpus Speech Emotion Recognition. - Yuxuan Xi, Yan Song, Lirong Dai, Haoyu Song, Ian McLoughlin:
An Effective Local Prototypical Mapping Network for Speech Emotion Recognition. - Yuan Gao, Hao Shi, Chenhui Chu, Tatsuya Kawahara:
Speech Emotion Recognition with Multi-level Acoustic and Semantic Information Extraction and Interaction.
Analysis of Speakers States and Traits
- Oliver Niebuhr, Nafiseh Taghva:
How rhythm metrics are linked to produced and perceived speaker charisma. - Zhu Li, Xiyuan Gao, Yuqing Zhang, Shekhar Nayak, Matt Coler:
A Functional Trade-off between Prosodic and Semantic Cues in Conveying Sarcasm. - John Murzaku, Adil Soubki, Owen Rambow:
Multimodal Belief Prediction. - Run Chen, Haozhe Chen, Anushka Kulkarni, Eleanor Lin, Linda Pang, Divya Tadimeti, Jun Shin, Julia Hirschberg:
Detecting Empathy in Speech. - Dehua Tao, Tan Lee, Harold Chui, Sarah Luk:
Learning Representation of Therapist Empathy in Counseling Conversation Using Siamese Hierarchical Attention Network. - Han Kunmei:
Modelling Lexical Characteristics of the Healthy Aging Population: A Corpus-Based Study. - Maurice Gerczuk, Shahin Amiriparian, Justina Lutz, Wolfgang Strube, Irina Papazova, Alkomiet Hasan, Björn W. Schuller:
Exploring Gender-Specific Speech Patterns in Automatic Suicide Risk Assessment.
Spoofing and Deepfake Detection
- Nicholas Klein, Tianxiang Chen, Hemlata Tak, Ricardo Casal, Elie Khoury:
Source Tracing of Audio Deepfake Systems. - Tianchi Liu, Lin Zhang, Rohan Kumar Das, Yi Ma, Ruijie Tao, Haizhou Li:
How Do Neural Spoofing Countermeasures Detect Partially Spoofed Audio? - Xin Wang, Tomi Kinnunen, Kong Aik Lee, Paul-Gauthier Noé, Junichi Yamagishi:
Revisiting and Improving Scoring Fusion for Spoofing-aware Speaker Verification Using Compositional Data Analysis. - Oguzhan Baser, Kaan Kale, Sandeep P. Chinchali:
SecureSpectra: Safeguarding Digital Identity from Deep Fake Threats via Intelligent Signatures. - Menglu Li, Xiao-Ping Zhang:
Interpretable Temporal Class Activation Representation for Audio Spoofing Detection. - Zirui Ge, Xinzhou Xu, Haiyan Guo, Tingting Wang, Zhen Yang, Björn W. Schuller:
DGPN: A Dual Graph Prototypical Network for Few-Shot Speech Spoofing Algorithm Recognition.
Audio Captioning, Tagging, and Audio-Text Retrieval
- Jianyuan Sun, Wenwu Wang, Mark D. Plumbley:
PFCA-Net: Pyramid Feature Fusion and Cross Content Attention Network for Automated Audio Captioning. - Jizhong Liu, Gang Li, Junbo Zhang, Heinrich Dinkel, Yongqing Wang, Zhiyong Yan, Yujun Wang, Bin Wang:
Enhancing Automated Audio Captioning via Large Language Models with Optimized Audio Encoding. - Yifei Xin, Zhihong Zhu, Xuxin Cheng, Xusheng Yang, Yuexian Zou:
Audio-text Retrieval with Transformer-based Hierarchical Alignment and Disentangled Cross-modal Representation. - Heinrich Dinkel, Zhiyong Yan, Yongqing Wang, Junbo Zhang, Yujun Wang, Bin Wang:
Streaming Audio Transformers for Online Audio Tagging. - Aryan Chaudhary, Arshdeep Singh, Vinayak Abrol, Mark D. Plumbley:
Efficient CNNs with Quaternion Transformations and Pruning for Audio Tagging. - Xin Jing, Andreas Triantafyllopoulos, Björn W. Schuller:
ParaCLAP - Towards a general language-audio model for computational paralinguistic tasks. - Xuenan Xu, Haohe Liu, Mengyue Wu, Wenwu Wang, Mark D. Plumbley:
Efficient Audio Captioning with Encoder-Level Knowledge Distillation.
Generative Speech Enhancement
- Robin Scheibler, Yusuke Fujita, Yuma Shirahata, Tatsuya Komatsu:
Universal Score-based Speech Enhancement with High Content Preservation. - Haici Yang, Jiaqi Su, Minje Kim, Zeyu Jin:
Genhancer: High-Fidelity Speech Enhancement via Generative Modeling on Discrete Codec Tokens. - Ante Jukic, Roman Korostik, Jagadeesh Balam, Boris Ginsburg:
Schrödinger Bridge for Generative Speech Enhancement. - Thanapat Trachu, Chawan Piansaddhayanon, Ekapol Chuangsuwanich:
Thunder: Unified Regression-Diffusion Speech Enhancement with a Single Reverse Step using Brownian Bridge. - Yiyuan Yang, Niki Trigoni, Andrew Markham:
Pre-training Feature Guided Diffusion Model for Speech Enhancement. - Dail Kim, Da-Hee Yang, Donghyun Kim, Joon-Hyuk Chang, Jeonghwan Choi, Moa Lee, Jaemo Yang, Han-gil Moon:
Guided conditioning with predictive network on score-based diffusion model for speech enhancement.
Speech Synthesis: Evaluation
- Chun Yin, Tai-Shih Chi, Yu Tsao, Hsin-Min Wang:
SVSNet+: Enhancing Speaker Voice Similarity Assessment Models with Representations from Speech Foundation Models. - Srija Anand, Praveen Srinivasa Varadhan, Ashwin Sankar, Giri Raju, Mitesh M. Khapra:
Enhancing Out-of-Vocabulary Performance of Indian TTS Systems for Practical Applications through Low-Effort Data Strategies. - Jens Edlund, Christina Tånnander, Sébastien Le Maguer, Petra Wagner:
Assessing the impact of contextual framing on subjective TTS quality. - Adaeze Adigwe, Sarenne Wallbridge, Simon King:
What do people hear? Listeners' Perception of Conversational Speech. - Hui Wang, Shiwan Zhao, Jiaming Zhou, Xiguang Zheng, Haoqin Sun, Xuechen Wang, Yong Qin:
Uncertainty-Aware Mean Opinion Score Prediction. - Félix Saget, Meysam Shamsi, Marie Tahon:
Lifelong Learning MOS Prediction for Synthetic Speech Quality Evaluation.
Multilingual ASR
- Kwok Chin Yuen, Jia Qi Yip, Eng Siong Chng:
Continual Learning Optimizations for Auto-regressive Decoder of Multilingual ASR systems. - Jiatong Shi, Shih-Heng Wang, William Chen, Martijn Bartelds, Vanya Bannihatti Kumar, Jinchuan Tian, Xuankai Chang, Dan Jurafsky, Karen Livescu, Hung-yi Lee, Shinji Watanabe:
ML-SUPERB 2.0: Benchmarking Multilingual Speech Models Across Modeling Constraints, Languages, and Datasets. - Andrés Piñeiro Martín, Carmen García-Mateo, Laura Docío Fernández, Maria del Carmen Lopez Perez, Georg Rehm:
Weighted Cross-entropy for Low-Resource Languages in Multilingual Speech Recognition. - A. F. M. Saif, Lisha Chen, Xiaodong Cui, Songtao Lu, Brian Kingsbury, Tianyi Chen:
M2ASR: Multilingual Multi-task Automatic Speech Recognition via Multi-objective Optimization. - Song Li, Yongbin You, Xuezhi Wang, Zhengkun Tian, Ke Ding, Guanglu Wan:
MSR-86K: An Evolving, Multilingual Corpus with 86,300 Hours of Transcribed Audio for Speech Recognition Research. - Brady Houston, Omid Sadjadi, Zejiang Hou, Srikanth Vishnubhotla, Kyu J. Han:
Improving Multilingual ASR Robustness to Errors in Language Input.
General Topics in ASR
- Jiwon Suh, Injae Na, Woohwan Jung:
Improving Domain-Specific ASR with LLM-Generated Contextual Descriptions. - Yuang Li, Min Zhang, Chang Su, Yinglu Li, Xiaosong Qiao, Mengxin Ren, Miaomiao Ma, Daimeng Wei, Shimin Tao, Hao Yang:
A Multitask Training Approach to Enhance Whisper with Open-Vocabulary Keyword Spotting. - Mario Zusag, Laurin Wagner, Bernhard Thallinger:
CrisperWhisper: Accurate Timestamps on Verbatim Speech Transcriptions. - Péter Mihajlik, Yan Meng, Mate S. Kadar, Julian Linke, Barbara Schuppler, Katalin Mády:
On Disfluency and Non-lexical Sound Labeling for End-to-end Automatic Speech Recognition. - Dena F. Mujtaba, Nihar R. Mahapatra, Megan Arney, J. Scott Yaruss, Caryn Herring, Jia Bin:
Inclusive ASR for Disfluent Speech: Cascaded Large-Scale Self-Supervised Learning with Targeted Fine-Tuning and Data Augmentation. - Hao Tan, Xiaochen Liu, Huan Zhang, Junjian Zhang, Yaguan Qian, Zhaoquan Gu:
DualPure: An Efficient Adversarial Purification Method for Speech Command Recognition. - Jan Lehecka, Josef V. Psutka, Lubos Smídl, Pavel Ircing, Josef Psutka:
A Comparative Analysis of Bilingual and Trilingual Wav2Vec Models for Automatic Speech Recognition in Multilingual Oral History Archives. - Antón de la Fuente, Dan Jurafsky:
A layer-wise analysis of Mandarin and English suprasegmentals in SSL speech models. - Spyretta Leivaditi, Tatsunari Matsushima, Matt Coler, Shekhar Nayak, Vass Verkhodanova:
Fine-Tuning Strategies for Dutch Dysarthric Speech Recognition: Evaluating the Impact of Healthy, Disease-Specific, and Speaker-Specific Data. - I-Ting Hsieh, Chung-Hsien Wu:
Dysarthric Speech Recognition Using Curriculum Learning and Articulatory Feature Embedding. - Shiyao Wang, Shiwan Zhao, Jiaming Zhou, Aobo Kong, Yong Qin:
Enhancing Dysarthric Speech Recognition for Unseen Speakers via Prototype-Based Adaptation. - Naijun Zheng, Xucheng Wan, Kai Liu, Ziqing Du, Huan Zhou:
An efficient text augmentation approach for contextualized Mandarin speech recognition. - Sheng Li, Chen Chen, Kwok Chin Yuen, Chenhui Chu, Eng Siong Chng, Hisashi Kawai:
Investigating ASR Error Correction with Large Language Model and Multilingual 1-best Hypotheses. - Lun Wang, Om Thakkar, Zhong Meng, Nicole Rafidi, Rohit Prabhavalkar, Arun Narayanan:
Efficiently Train ASR Models that Memorize Less and Perform Better with Per-core Clipping.
Spoken Language Understanding
- Emmy Phung, Harsh Deshpande, Ahmad Emami, Kanishk Singh:
AR-NLU: A Framework for Enhancing Natural Language Understanding Model Robustness against ASR Errors. - Mohan Li, Simon Keizer, Rama Doddipatla:
Prompting Whisper for QA-driven Zero-shot End-to-end Spoken Language Understanding. - Tuyen Tran, Khanh Le, Ngoc Dang Nguyen, Minh Vu, Huyen Ngo, Woomyoung Park, Thi Thu Trang Nguyen:
VN-SLU: A Vietnamese Spoken Language Understanding Dataset. - Shunsuke Kando, Yusuke Miyao, Jason Naradowsky, Shinnosuke Takamichi:
Textless Dependency Parsing by Labeled Sequence Prediction. - Yaoyao Yue, Michael Proctor, Luping Zhou, Rijul Gupta, Tharinda Piyadasa, Amelia Gully, Kirrie Ballard, Craig T. Jin:
Towards Speech Classification from Acoustic and Vocal Tract data in Real-time MRI. - Alexander Johnson, Peter Plantinga, Pheobe Sun, Swaroop Gadiyaram, Abenezer Girma, Ahmad Emami:
Efficient SQA from Long Audio Contexts: A Policy-driven Approach.
Speech and Multimodal Resources
- Jan Pesán, Vojtech Jurík, Martin Karafiát, Jan Cernocký:
BESST Dataset: A Multimodal Resource for Speech-based Stress Detection and Analysis. - Arnon Turetzky, Or Tal, Yael Segal, Yehoshua Dissen, Ella Zeldes, Amit Roth, Eyal Cohen, Yosi Shrem, Bronya Roni Chernyak, Olga Seleznova, Joseph Keshet, Yossi Adi:
HebDB: a Weakly Supervised Dataset for Hebrew Speech Processing. - Wenbin Wang, Yang Song, Sanjay Jha:
GLOBE: A High-quality English Corpus with Global Accents for Zero-shot Speaker Adaptive Text-to-Speech. - Yuexuan Kong, Viet-Anh Tran, Romain Hennequin:
STraDa: A Singer Traits Dataset. - Katharina Anderer, Andreas Reich, Matthias Wölfel:
MaViLS, a Benchmark Dataset for Video-to-Slide Alignment, Assessing Baseline Accuracy with a Multimodal Alignment Algorithm Leveraging Speech, OCR, and Visual Features. - Kim Sung-Bin, Lee Chae-Yeon, Gihun Son, Oh Hyun-Bin, Janghoon Ju, Suekyeong Nam, Tae-Hyun Oh:
MultiTalk: Enhancing 3D Talking Head Generation Across Languages with Multilingual Video Dataset. - Irina-Elena Veliche, Zhuangqun Huang, Vineeth Ayyat Kochaniyan, Fuchun Peng, Ozlem Kalinli, Michael L. Seltzer:
Towards measuring fairness in speech recognition: Fair-Speech dataset. - Yi Lu, Yuankun Xie, Ruibo Fu, Zhengqi Wen, Jianhua Tao, Zhiyong Wang, Xin Qi, Xuefei Liu, Yongwei Li, Yukun Liu, Xiaopeng Wang, Shuchen Shi:
Codecfake: An Initial Dataset for Detecting LLM-based Deepfake Audio. - Mohamed Osman, Daniel Z. Kaplan, Tamer Nadeem:
SER Evals: In-domain and Out-of-domain benchmarking for speech emotion recognition.
Pathological Speech Analysis 1
- Vidar Freyr Gudmundsson, Keve Márton Gönczi, Malin Svensson Lundmark, Donna Erickson, Oliver Niebuhr:
The MARRYS helmet: A new device for researching and training "jaw dancing". - Moreno La Quatra, Maria Francesca Turco, Torbjørn Svendsen, Giampiero Salvi, Juan Rafael Orozco-Arroyave, Sabato Marco Siniscalchi:
Exploiting Foundation Models and Speech Enhancement for Parkinson's Disease Detection from Speech in Real-World Operative Conditions. - Andreas Triantafyllopoulos, Anton Batliner, Wolfgang Mayr, Markus Fendler, Florian B. Pokorny, Maurice Gerczuk, Shahin Amiriparian, Thomas M. Berghaus, Björn W. Schuller:
Sustained Vowels for Pre- vs Post-Treatment COPD Classification. - Mahdi Amiri, Ina Kodrasi:
Adversarial Robustness Analysis in Automatic Pathological Speech Detection Approaches. - Gahye Kim, Yunjung Eom, Selina S. Sung, Seunghee Ha, Tae-Jin Yoon, Jungmin So:
Automatic Children Speech Sound Disorder Detection with Age and Speaker Bias Mitigation.
Speech and Language in Health: from Remote Monitoring to Medical Conversations - 1 (Special Session)
- Mojtaba Kadkhodaie Elyaderani, John Glover, Thomas Schaaf:
Reference-Free Estimation of the Quality of Clinical Notes Generated from Doctor-Patient Conversations. - Jihyun Mun, Sunhee Kim, Minhwa Chung:
Developing an End-to-End Framework for Predicting the Social Communication Severity Scores of Children with Autism Spectrum Disorder. - Vladimir Despotovic, Abir Elbéji, Petr V. Nazarov, Guy Fagherazzi:
Multimodal Fusion for Vocal Biomarkers Using Vector Cross-Attention. - Stefano Goria, Roseline Polle, Salvatore Fara, Nicholas Cummins:
Revealing Confounding Biases: A Novel Benchmarking Approach for Aggregate-Level Performance Metrics in Health Assessments. - Anaïs Rameau, Satrajit Ghosh, Alexandros Sigaras, Olivier Elemento, Jean-Christophe Bélisle-Pipon, Vardit Ravitsky, Maria Powell, Alistair Johnson, David Dorr, Philip R. O. Payne, Micah Boyer, Stephanie Watts, Ruth Bahr, Frank Rudzicz, Jordan Lerner-Ellis, Shaheen Awan, Don Bolser, Yael Bensoussan:
Developing Multi-Disorder Voice Protocols: A team science approach involving clinical expertise, bioethics, standards, and DEI. - Sri Harsha Dumpala, Katerina Dikaios, Abraham Nunes, Frank Rudzicz, Rudolf Uher, Sageev Oore:
Self-Supervised Embeddings for Detecting Individual Symptoms of Depression. - Daryush D. Mehta, Jarrad H. Van Stan, Hamzeh Ghasemzadeh, Robert E. Hillman:
Comparing ambulatory voice measures during daily life with brief laboratory assessments in speakers with and without vocal hyperfunction. - Jennifer Williams, Eike Schneiders, Henry Card, Tina Seabrooke, Beatrice Pakenham-Walsh, Tayyaba Azim, Lucy Valls-Reed, Ganesh Vigneswaran, John Robert Bautista, Rohan Chandra, Arya Farahi:
Predicting Acute Pain Levels Implicitly from Vocal Features. - Kubilay Can Demir, Belén Lojo Rodríguez, Tobias Weise, Andreas K. Maier, Seung Hee Yang:
Towards Intelligent Speech Assistants in Operating Rooms: A Multimodal Model for Surgical Workflow Analysis. - Gowtham Premananth, Yashish M. Siriwardena, Philip Resnik, Sonia Bansal, Deanna L. Kelly, Carol Y. Espy-Wilson:
A Multimodal Framework for the Assessment of the Schizophrenia Spectrum.
Speech and Brain
- Yuzhe Wang, Anna Favaro, Thomas Thebaud, Jesús Villalba, Najim Dehak, Laureano Moro-Velázquez:
Exploring the Complementary Nature of Speech and Eye Movements for Profiling Neurological Disorders. - Hengyu Li, Kangdi Mei, Zhaoci Liu, Yang Ai, Liping Chen, Jie Zhang, Zhenhua Ling:
Refining Self-supervised Learnt Speech Representation using Brain Activations. - Yuejiao Wang, Xianmin Gong, Lingwei Meng, Xixin Wu, Helen Meng:
Large Language Model-based FMRI Encoding of Language Functions for Subjects with Neurocognitive Disorder. - Kumar Neelabh, Vishnu Sreekumar:
From Sound to Meaning in the Auditory Cortex: A Neuronal Representation and Classification Analysis. - Sheng Feng, Heyang Liu, Yu Wang, Yanfeng Wang:
Towards an End-to-End Framework for Invasive Brain Signal Decoding with Large Language Models. - Jihwan Lee, Aditya Kommineni, Tiantian Feng, Kleanthis Avramidis, Xuan Shi, Sudarsana Reddy Kadiri, Shrikanth Narayanan:
Toward Fully-End-to-End Listened Speech Decoding from EEG Signals.
Innovative Methods in Phonetics and Phonology
- Emily P. Ahn, Eleanor Chodroff, Myriam Lapierre, Gina-Anne Levow:
The Use of Phone Categories and Cross-Language Modeling for Phone Alignment of Panãra. - Sneha Ray Barman, Shakuntala Mahanta, Neeraj Kumar Sharma:
Deciphering Assamese Vowel Harmony with Featural InfoWaveGAN. - Harsha Veena Tadavarthy, Austin Jones, Margaret E. L. Renwick:
Phonological Feature Detection for US English using the Phonet Library. - Constantijn Kaland, Jeremy Steffman, Jennifer Cole:
K-means and hierarchical clustering of f0 contours. - Rotem Rousso, Eyal Cohen, Joseph Keshet, Eleanor Chodroff:
Tradition or Innovation: A Comparison of Modern ASR Methods for Forced Alignment. - Lila Kim, Cédric Gendrot:
Using wav2vec 2.0 for phonetic classification tasks: methodological aspects. - Michael Lambropoulos, Frantz Clermont, Shunichi Ishihara:
The sub-band cepstrum as a tool for locating local spectral regions of phonetic sensitivity: A first attempt with multi-speaker vowel data. - Woo-Jin Chung, Hong-Goo Kang:
Speaker-Independent Acoustic-to-Articulatory Inversion through Multi-Channel Attention Discriminator. - Tobias Weise, Philipp Klumpp, Kubilay Can Demir, Paula Andrea Pérez-Toro, Maria Schuster, Elmar Nöth, Björn Heismann, Andreas K. Maier, Seung Hee Yang:
Speaker- and Text-Independent Estimation of Articulatory Movements and Phoneme Alignments from Speech. - Anna Oura, Hideaki Kikuchi, Tetsunori Kobayashi:
Preprocessing for acoustic-to-articulatory inversion using real-time MRI movies of Japanese speech.
Voice, Tones and F0
- Chenyu Li, Jalal Al-Tamimi:
Impact of the tonal factor on diphthong realizations in Standard Mandarin with Generalized Additive Mixed Models. - Xiaowang Liu, Jinsong Zhang:
A Study on the Information Mechanism of the 3rd Tone Sandhi Rule in Mandarin Disyllabic Words. - Melanie Weirich, Daniel Duran, Stefanie Jannedy:
Gender and age based f0-variation in the German Plapper Corpus. - Chenzi Xu, Jessica Wormald, Paul Foulkes, Philip Harrison, Vincent Hughes, Poppy Welch, Finnian Kelly, David van der Vloed:
Voice quality in telephone speech: Comparing acoustic measures between VoIP telephone and high-quality recordings. - Iona Gessinger, Bistra Andreeva, Benjamin R. Cowan:
The Use of Modifiers and f0 in Remote Referential Communication with Human and Computer Partners.
Emotion Recognition: Resources and Benchmarks
- Ziyang Ma, Mingjie Chen, Hezhao Zhang, Zhisheng Zheng, Wenxi Chen, Xiquan Li, Jiaxin Ye, Xie Chen, Thomas Hain:
EmoBox: Multilingual Multi-corpus Speech Emotion Recognition Toolkit and Benchmark. - Andreas Triantafyllopoulos, Anton Batliner, Simon David Noel Rampp, Manuel Milling, Björn W. Schuller:
INTERSPEECH 2009 Emotion Challenge Revisited: Benchmarking 15 Years of Progress in Speech Emotion Recognition. - Adham Ibrahim, Shady Shehata, Ajinkya Kulkarni, Mukhtar Mohamed, Muhammad Abdul-Mageed:
What Does it Take to Generalize SER Model Across Datasets? A Comprehensive Benchmark. - Abinay Reddy Naini, Lucas Goncalves, Mary A. Kohler, Donita Robinson, Elizabeth Richerson, Carlos Busso:
WHiSER: White House Tapes Speech Emotion Recognition Corpus. - Siddique Latif, Raja Jurdak, Björn W. Schuller:
Evaluating Transformer-Enhanced Deep Reinforcement Learning for Speech Emotion Recognition. - Jincen Wang, Yan Zhao, Cheng Lu, Chuangao Tang, Sunan Li, Yuan Zong, Wenming Zheng:
Boosting Cross-Corpus Speech Emotion Recognition using CycleGAN with Contrastive Learning.
Speaker and Language Identification and Diarization
- Bilal Rahou, Hervé Bredin:
Multi-latency look-ahead for streaming speaker segmentation. - Christoph Boeddeker, Tobias Cord-Landwehr, Reinhold Haeb-Umbach:
Once more Diarization: Improving meeting transcription systems through segment-level speaker reassignment. - Théo Mariotte, Anthony Larcher, Silvio Montrésor, Jean-Hugh Thomas:
ASoBO: Attentive Beamformer Selection for Distant Speaker Diarization in Meetings. - Gabriel Pirlogeanu, Octavian Pascu, Alexandru-Lucian Georgescu, Horia Cucu:
Hybrid-Diarization System with Overlap Post-Processing for the DISPLACE 2024 Challenge. - Shareef Babu Kalluri, Prachi Singh, Pratik Roy Chowdhuri, Apoorva Kulkarni, Shikha Baghel, Pradyoth Hegde, Swapnil Sontakke, Deepak K. T., S. R. Mahadeva Prasanna, Deepu Vijayasenan, Sriram Ganapathy:
The Second DISPLACE Challenge: DIarization of SPeaker and LAnguage in Conversational Environments. - Joonas Kalda, Tanel Alumäe, Martin Lebourdais, Hervé Bredin, Séverin Baroudi, Ricard Marxer:
TalTech-IRIT-LIS Speaker and Language Diarization Systems for DISPLACE 2024. - Yaqian Hao, Chenguang Hu, Yingying Gao, Shilei Zhang, Junlan Feng:
Exploring Energy-Based Models for Out-of-Distribution Detection in Dialect Identification. - Martina Valente, Fabio Brugnara, Giovanni Morrone, Enrico Zovato, Leonardo Badino:
Exploring Spoken Language Identification Strategies for Automatic Transcription of Multilingual Broadcast and Institutional Speech. - Rohit Paturi, Xiang Li, Sundararajan Srinivasan:
AG-LSEC: Audio Grounded Lexical Speaker Error Correction. - Hang Su, Yuxiang Kong, Lichun Fan, Peng Gao, Yujun Wang, Zhiyong Wu:
Speaker Change Detection with Weighted-sum Knowledge Distillation based on Self-supervised Pre-trained Models. - Naoki Makishima, Naotaka Kawata, Mana Ihori, Tomohiro Tanaka, Shota Orihashi, Atsushi Ando, Ryo Masumura:
SOMSRED: Sequential Output Modeling for Joint Multi-talker Overlapped Speech Recognition and Speaker Diarization. - Hokuto Munakata, Ryo Terashima, Yusuke Fujita:
Song Data Cleansing for End-to-End Neural Singer Diarization Using Neural Analysis and Synthesis Framework.
Audio-Text Retrieval
- Yifei Xin, Xuxin Cheng, Zhihong Zhu, Xusheng Yang, Yuexian Zou:
DiffATR: Diffusion-based Generative Modeling for Audio-Text Retrieval. - Zhiyong Yan, Heinrich Dinkel, Yongqing Wang, Jizhong Liu, Junbo Zhang, Yujun Wang, Bin Wang:
Bridging Language Gaps in Audio-Text Retrieval. - Soham Deshmukh, Rita Singh, Bhiksha Raj:
Domain Adaptation for Contrastive Audio-Language Models. - Francesco Paissan, Elisabetta Farella:
tinyCLAP: Distilling Contrastive Language-Audio Pretrained Models. - June-Woo Kim, Miika Toikkanen, Yera Choi, Seoung-Eun Moon, Ho-Young Jung:
BTS: Bridging Text and Sound Modalities for Metadata-Aided Respiratory Sound Classification. - Yuwu Tang, Ziang Ma, Haitao Zhang:
Enhanced Feature Learning with Normalized Knowledge Distillation for Audio Tagging.
Speech Enhancement
- Mingshuai Liu, Zhuangqi Chen, Xiaopeng Yan, Yuanjun Lv, Xianjun Xia, Chuanzeng Huang, Yijian Xiao, Lei Xie:
RaD-Net 2: A causal two-stage repairing and denoising speech enhancement network with knowledge distillation and complex axial self-attention. - Xi Liu, John H. L. Hansen:
DNN-based monaural speech enhancement using alternate analysis windows for phase and magnitude modification. - Li Li, Shogo Seki:
Improved Remixing Process for Domain Adaptation-Based Speech Enhancement by Mitigating Data Imbalance in Signal-to-Noise Ratio. - Yixuan Zhang, Hao Zhang, Meng Yu, Dong Yu:
Neural Network Augmented Kalman Filter for Robust Acoustic Howling Suppression. - Jizhen Li, Xinmeng Xu, Weiping Tu, Yuhong Yang, Rong Zhu:
Improving Speech Enhancement by Integrating Inter-Channel and Band Features with Dual-branch Conformer. - Qiquan Zhang, Hongxu Zhu, Xinyuan Qian, Eliathamby Ambikairajah, Haizhou Li:
An Exploration of Length Generalization in Transformer-Based Speech Enhancement. - Haixin Guan, Wei Dai, Guangyong Wang, Xiaobin Tan, Peng Li, Jiaen Liang:
Reducing Speech Distortion and Artifacts for Speech Enhancement by Loss Function. - Candy Olivia Mawalim, Shogo Okada, Masashi Unoki:
Are Recent Deep Learning-Based Speech Enhancement Methods Ready to Confront Real-World Noisy Environments? - Wangyou Zhang, Kohei Saijo, Jee-weon Jung, Chenda Li, Shinji Watanabe, Yanmin Qian:
Beyond Performance Plateaus: A Comprehensive Study on Scalability in Speech Enhancement.
Speech Coding
- Jinghong Zhang, Zugang Zhao, Yonghui Liu, Jianbing Liu, Zhiqiang He, Kai Niu:
TD-PLC: A Semantic-Aware Speech Encoding for Improved Packet Loss Concealment. - Zihan Zhang, Xianjun Xia, Chuanzeng Huang, Yijian Xiao, Lei Xie:
BS-PLCNet 2: Two-stage Band-split Packet Loss Concealment Network with Intra-model Knowledge Distillation. - Kishan Gupta, Nicola Pia, Srikanth Korse, Andreas Brendel, Guillaume Fuchs, Markus Multrus:
On Improving Error Resilience of Neural End-to-End Speech Coders. - Thomas Muller, Stéphane Ragot, Laetitia Gros, Pierrick Philippe, Pascal Scalart:
Speech quality evaluation of neural audio codecs. - Yang Ai, Ye-Xin Lu, Xiao-Hang Jiang, Zheng-Yan Sheng, Rui-Chen Zheng, Zhen-Hua Ling:
A Low-Bitrate Neural Audio Codec Framework with Bandwidth Reduction and Recovery for High-Sampling-Rate Waveforms. - Haibin Wu, Yuan Tseng, Hung-yi Lee:
CodecFake: Enhancing Anti-Spoofing Models Against Deepfake Audios from Codec-Based Speech Synthesis Systems.
Speech Synthesis: Expressivity and Emotion
- Zehua Kcriss Li, Meiying Melissa Chen, Yi Zhong, Pinxin Liu, Zhiyao Duan:
GTR-Voice: Articulatory Phonetics Informed Controllable Expressive Speech Synthesis. - Donghyun Seong, Hoyoung Lee, Joon-Hyuk Chang:
TSP-TTS: Text-based Style Predictor with Residual Vector Quantization for Expressive Text-to-Speech. - Weiqin Li, Peiji Yang, Yicheng Zhong, Yixuan Zhou, Zhisheng Wang, Zhiyong Wu, Xixin Wu, Helen Meng:
Spontaneous Style Text-to-Speech Synthesis with Controllable Spontaneous Behaviors Based on Language Models. - Dake Guo, Xinfa Zhu, Liumeng Xue, Yongmao Zhang, Wenjie Tian, Lei Xie:
Text-aware and Context-aware Expressive Audiobook Speech Synthesis. - Thomas Bott, Florian Lux, Ngoc Thang Vu:
Controlling Emotion in Text-to-Speech with Natural Language Prompts. - Jinlong Xue, Yayue Deng, Yingming Gao, Ya Li:
Retrieval Augmented Generation in Prompt-based Text-to-Speech Synthesis with Context-Aware Contrastive Language-Audio Pretraining. - Pavan Kalyan, Preeti Rao, Preethi Jyothi, Pushpak Bhattacharyya:
Emotion Arithmetic: Emotional Speech Synthesis via Weight Space Interpolation. - Deok-Hyeon Cho, Hyung-Seok Oh, Seung-Bin Kim, Sang-Hoon Lee, Seong-Whan Lee:
EmoSphere-TTS: Emotional Style and Intensity Modeling via Spherical Emotion Vector for Controllable Emotional Text-to-Speech. - Xuyuan Li, Zengqiang Shang, Peiyang Shi, Hua Hua, Ta Li, Pengyuan Zhang:
Expressive paragraph text-to-speech synthesis with multi-step variational autoencoder. - Chin-Yun Yu, György Fazekas:
Differentiable Time-Varying Linear Prediction in the Context of End-to-End Analysis-by-Synthesis.
Speech Synthesis: Tools and Data
- Yuki Saito, Takuto Igarashi, Kentaro Seki, Shinnosuke Takamichi, Ryuichi Yamamoto, Kentaro Tachibana, Hiroshi Saruwatari:
SRC4VC: Smartphone-Recorded Corpus for Voice Conversion Benchmark. - Praveen Srinivasa Varadhan, Ashwin Sankar, Giri Raju, Mitesh M. Khapra:
Rasa: Building Expressive Speech Synthesis Systems for Indian Languages in Low-resource Settings. - Min Ma, Yuma Koizumi, Shigeki Karita, Heiga Zen, Jason Riesa, Haruko Ishikawa, Michiel Bacchiani:
FLEURS-R: A Restored Multilingual Speech Corpus for Generation Tasks. - Linhan Ma, Dake Guo, Kun Song, Yuepeng Jiang, Shuai Wang, Liumeng Xue, Weiming Xu, Huan Zhao, Binbin Zhang, Lei Xie:
WenetSpeech4TTS: A 12,800-hour Mandarin TTS Corpus for Large Speech Generation Model Benchmark. - Qian Yang, Jialong Zuo, Zhe Su, Ziyue Jiang, Mingze Li, Zhou Zhao, Feiyang Chen, Zhefeng Wang, Baoxing Huai:
MSceneSpeech: A Multi-Scene Speech Dataset For Expressive Speech Synthesis. - Masaya Kawamura, Ryuichi Yamamoto, Yuma Shirahata, Takuya Hasumi, Kentaro Tachibana:
LibriTTS-P: A Corpus with Speaking Style and Speaker Identity Prompts for Text-to-Speech and Style Captioning. - Sewade Ogun, Abraham Toluwase Owodunni, Tobi Olatunji, Eniola Alese, Babatunde Oladimeji, Tejumade Afonja, Kayode Olaleye, Naome A. Etori, Tosin P. Adewumi:
1000 African Voices: Advancing inclusive multi-speaker multi-accent speech synthesis. - Osamu Take, Shinnosuke Takamichi, Kentaro Seki, Yoshiaki Bando, Hiroshi Saruwatari:
SaSLaW: Dialogue Speech Corpus with Audio-visual Egocentric Information Toward Environment-adaptive Dialogue Speech Synthesis.
Speech Synthesis: Singing Voice Synthesis
- Semin Kim, Myeonghun Jeong, Hyeonseung Lee, Minchan Kim, Byoung Jin Choi, Nam Soo Kim:
MakeSinger: A Semi-Supervised Training Method for Data-Efficient Singing Voice Synthesis via Classifier-free Diffusion Guidance. - Takuma Okamoto, Yamato Ohtani, Sota Shimizu, Tomoki Toda, Hisashi Kawai:
Challenge of Singing Voice Synthesis Using Only Text-To-Speech Corpus With FIRNet Source-Filter Neural Vocoder. - Taewoo Kim, Choongsang Cho, Young Han Lee:
Period Singer: Integrating Periodic and Aperiodic Variational Autoencoders for Natural-Sounding End-to-End Singing Voice Synthesis. - Jiatong Shi, Yueqian Lin, Xinyi Bai, Keyi Zhang, Yuning Wu, Yuxun Tang, Yifeng Yu, Qin Jin, Shinji Watanabe:
Singing Voice Data Scaling-up: An Introduction to ACE-Opencpop and ACE-KiSing. - Ji-Sang Hwang, HyeongRae Noh, Yoonseok Hong, Insoo Oh:
X-Singer: Code-Mixed Singing Voice Synthesis via Cross-Lingual Learning. - Shuochen Gao, Shun Lei, Fan Zhuo, Hangyu Liu, Feng Liu, Boshi Tang, Qiaochu Huang, Shiyin Kang, Zhiyong Wu:
An End-to-End Approach for Chord-Conditioned Song Generation.
LLM in ASR
- Murali Karthick Baskar, Andrew Rosenberg, Bhuvana Ramabhadran, Neeraj Gaur, Zhong Meng:
Speech Prefix-Tuning with RNNT Loss for Improving LLM Predictions. - Frank Seide, Yangyang Shi, Morrie Doulaty, Yashesh Gaur, Junteng Jia, Chunyang Wu:
Speech ReaLLM - Real-time Speech Recognition with Multimodal Language Models by Teaching the Flow of Time. - Yangze Li, Xiong Wang, Songjun Cao, Yike Zhang, Long Ma, Lei Xie:
A Transcription Prompt-based Efficient Audio Large Language Model for Robust Speech Recognition. - Zhiyuan Tang, Dong Wang, Shen Huang, Shidong Shang:
Pinyin Regularization in Error Correction for Chinese Speech Recognition with Large Language Models.
Vision and Speech
- Jongsuk Kim, Jiwon Shin, Junmo Kim:
AVCap: Leveraging Audio-Visual Features as Text Tokens for Captioning. - Sreyan Ghosh, Sonal Kumar, Ashish Seth, Purva Chiniya, Utkarsh Tyagi, Ramani Duraiswami, Dinesh Manocha:
LipGER: Visually-Conditioned Generative Error Correction for Robust Automatic Speech Recognition. - Guinan Li, Jiajun Deng, Youjun Chen, Mengzhe Geng, Shujie Hu, Zhe Li, Zengrui Jin, Tianzi Wang, Xurong Xie, Helen Meng, Xunying Liu:
Joint Speaker Features Learning for Audio-visual Multichannel Speech Separation and Recognition. - Chen Chen, Zehua Liu, Xiaolou Li, Lantian Li, Dong Wang:
CNVSRC 2023: The First Chinese Continuous Visual Speech Recognition Challenge.
Spoken Document Summarization
- Margaret Kroll, Kelsey Kraus:
Optimizing the role of human evaluation in LLM-based spoken document summarization systems. - Sangwon Ryu, Heejin Do, Yunsu Kim, Gary Geunbae Lee, Jungseul Ok:
Key-Element-Informed sLLM Tuning for Document Summarization. - Kohei Matsuura, Takanori Ashihara, Takafumi Moriya, Masato Mimura, Takatomo Kano, Atsunori Ogawa, Marc Delcroix:
Sentence-wise Speech Summarization: Task, Datasets, and End-to-End Modeling with LM Knowledge Distillation. - Hengchao Shang, Zongyao Li, Jiaxin Guo, Shaojun Li, Zhiqiang Rao, Yuanchang Luo, Daimeng Wei, Hao Yang:
An End-to-End Speech Summarization Using Large Language Model. - Wonjune Kang, Deb Roy:
Prompting Large Language Models with Audio for General-Purpose Speech Summarization. - Khai Le-Duc, Khai-Nguyen Nguyen, Long Vo-Dang, Truong-Son Hy:
Real-time Speech Summarization for Medical Conversations.
Speech and Language in Health: from Remote Monitoring to Medical Conversations - 2 (Special Sessions)
- Daniel Escobar-Grisales, Cristian David Ríos-Urrego, Ilja Baumann, Korbinian Riedhammer, Elmar Nöth, Tobias Bocklet, Adolfo M. García, Juan Rafael Orozco-Arroyave:
It's Time to Take Action: Acoustic Modeling of Motor Verbs to Detect Parkinson's Disease. - Malo Maisonneuve, Corinne Fredouille, Muriel Lalain, Alain Ghio, Virginie Woisard:
Towards objective and interpretable speech disorder assessment: a comparative analysis of CNN and transformer-based models. - Catarina Botelho, John Mendonça, Anna Pompili, Tanja Schultz, Alberto Abad, Isabel Trancoso:
Macro-descriptors for Alzheimer's disease detection using large language models. - Franziska Braun, Sebastian P. Bayerl, Florian Hönig, Hartmut Lehfeld, Thomas Hillemacher, Tobias Bocklet, Korbinian Riedhammer:
Infusing Acoustic Pause Context into Text-Based Dementia Assessment. - Oliver Roesler, Jackson Liscombe, Michael Neumann, Hardik Kothare, Abhishek Hosamath, Lakshmi Arbatti, Doug Habberstad, Christiane Suendermann-Oeft, Meredith Bartlett, Cathy Zhang, Nikhil Sukhdev, Kolja Wilms, Anusha Badathala, Sandrine Istas, Steve Ruhmel, Bryan Hansen, Madeline Hannan, David Henley, Arthur Wallace, Ira Shoulson, David Suendermann-Oeft, Vikram Ramanarayanan:
Towards Scalable Remote Assessment of Mild Cognitive Impairment Via Multimodal Dialog. - Mara Barberis, Pieter De Clercq, Bastiaan Tamm, Hugo Van hamme, Maaike Vandermosten:
Automatic recognition and detection of aphasic natural speech. - Giulia Sanguedolce, Sophie Brook, Dragos C. Gruia, Patrick A. Naylor, Fatemeh Geranmayeh:
When Whisper Listens to Aphasia: Advancing Robust Post-Stroke Speech Recognition. - Liming Wang, Yuan Gong, Nauman Dawalatabad, Marco Vilela, Katerina Placek, Brian Tracey, Yishu Gong, Alan Premasiri, Fernando Vieira, James R. Glass:
Automatic Prediction of Amyotrophic Lateral Sclerosis Progression using Longitudinal Speech Transformer. - Hardik Kothare, Michael Neumann, Cathy Zhang, Jackson Liscombe, Jordi W. J. van Unnik, Lianne C. M. Botman, Leonard H. van den Berg, Ruben P. A van Eijk, Vikram Ramanarayanan:
How Consistent are Speech-Based Biomarkers in Remote Tracking of ALS Disease Progression Across Languages? A Case Study of English and Dutch. - Anika A. Spiesberger, Andreas Triantafyllopoulos, Alexander Kathan, Anastasia Semertzidou, Caterina Gawrilow, Tilman Reinelt, Wolfgang A. Rauch, Björn W. Schuller:
"So ... my child ..." - How Child ADHD Influences the Way Parents Talk. - Judith Dineley, Ewan Carr, Lauren L. White, Catriona Lucas, Zahia Rahman, Tian Pan, Faith Matcham, Johnny Downs, Richard J. B. Dobson, Thomas F. Quatieri, Nicholas Cummins:
Variability of speech timing features across repeated recordings: a comparison of open-source extraction techniques. - Yanis Labrak, Adel Moumen, Richard Dufour, Mickael Rouvier:
Zero-Shot End-To-End Spoken Question Answering In Medical Domain. - Yicong Jiang, Tianzi Wang, Xurong Xie, Juan Liu, Wei Sun, Nan Yan, Hui Chen, Lan Wang, Xunying Liu, Feng Tian:
Perceiver-Prompt: Flexible Speaker Adaptation in Whisper for Chinese Disordered Speech Recognition.
Show and Tell 2
- Kesavaraj V, Charan Devarkonda, Vamshiraghusimha Narasinga, Anil Kumar Vuppala:
Custom wake word detection. - Song Chen, Mandar Gogate, Kia Dashtipour, Jasper Kirton-Wingate, Adeel Hussain, Faiyaz Doctor, Tughrul Arslan, Amir Hussain:
Edged based audio-visual speech enhancement demonstrator. - Arif Reza Anway, Bryony Buck, Mandar Gogate, Kia Dashtipour, Michael Akeroyd, Amir Hussain:
Real-Time Gaze-directed speech enhancement for audio-visual hearing-aids. - Abhishek Kumar, Srikanth Konjeti, Jithendra Vepa:
Detection of background agents speech in contact centers. - Bramhendra Koilakuntla, Prajesh Rana, Paras Ahuja, Srikanth Konjeti, Jithendra Vepa:
Leveraging large language models for post-transcription correction in contact centers. - Leonie Schade, Nico Dallmann, Olcay Türk, Stefan Lazarov, Petra Wagner:
Understanding "understanding": presenting a richly annotated multimodal corpus of dyadic interaction. - João Vítor Possamai de Menezes, Arne-Lukas Fietkau, Tom Diener, Steffen Kürbis, Peter Birkholz:
A demonstrator for articulation-based command word recognition. - Nigel G. Ward, Andres Segura:
Pragmatically similar utterance finder demonstration. - Kai Liu, Ziqing Du, Huan Zhou, Xucheng Wan, Naijun Zheng:
Real-time scheme for rapid extraction of speaker embeddings in challenging recording conditions. - Szu-Yu Chen, Tien-Hong Lo, Yao-Ting Sung, Ching-Yu Tseng, Berlin Chen:
TEEMI: a speaking practice tool for L2 English learners.
Prosody
- Na Hu, Hugo Schnack, Amalia Arvaniti:
Automatic pitch accent classification through image classification. - Tianqi Geng, Hui Feng:
Form and Function in Prosodic Representation: In the Case of 'ma' in Tianjin Mandarin. - Joyshree Chakraborty, Leena Dihingia, Priyankoo Sarmah, Rohit Sinha:
On Comparing Time- and Frequency-Domain Rhythm Measures in Classifying Assamese Dialects. - Chiara Riegger, Tina Bögel, George Walkden:
The prosody of the verbal prefix ge-: historical and experimental evidence. - Hongchen Wu, Jiwon Yun:
Influences of Morphosyntax and Semantics on the Intonation of Mandarin Chinese Wh-indeterminates. - Benazir Mumtaz, Miriam Butt:
Urdu Alternative Questions: A Hat Pattern.
Foundational Models for Deepfake and Spoofed Speech Detection
- Hoan My Tran, David Guennec, Philippe Martin, Aghilas Sini, Damien Lolive, Arnaud Delhay, Pierre-François Marteau:
Spoofed Speech Detection with a Focus on Speaker Embedding. - Juan M. Martín-Doñas
, Aitor Álvarez, Eros Rosello, Angel M. Gomez, Antonio M. Peinado:
Exploring Self-supervised Embeddings and Synthetic Data Augmentation for Robust Audio Deepfake Detection. - Zihan Pan, Tianchi Liu, Hardik B. Sailor, Qiongqiong Wang:
Attentive Merging of Hidden Embeddings from Pre-trained Speech Model for Anti-spoofing Detection. - Haochen Wu, Wu Guo, Shengyu Peng, Zhuhai Li, Jie Zhang:
Adapter Learning from Pre-trained Model for Robust Spoof Speech Detection. - Kexu Liu, Yuanxin Wang, Shengchen Li, Xi Shao:
Speech Formants Integration for Generalized Detection of Synthetic Speech Spoofing Attacks. - Thien-Phuc Doan, Long Nguyen-Vu, Kihun Hong, Souhwan Jung:
Balance, Multiple Augmentation, and Re-synthesis: A Triad Training Strategy for Enhanced Audio Deepfake Detection.
Speaker Recognition 1
- Shengyu Peng, Wu Guo, Haochen Wu, Zuoliang Li, Jie Zhang:
Fine-tune Pre-Trained Models with Multi-Level Feature Fusion for Speaker Verification. - En-Lun Yu, Kuan-Hsun Ho, Jeih-weih Hung, Shih-Chieh Huang, Berlin Chen:
Speaker Conditional Sinc-Extractor for Personal VAD. - Shiu-Hsiang Liou, Po-Cheng Chan, Chia-Ping Chen, Tzu-Chieh Lin, Chung-Li Lu, Yu-Han Cheng, Hsiang-Feng Chuang, Wei-Yu Chen:
Enhancing ECAPA-TDNN with Feature Processing Module and Attention Mechanism for Speaker Verification. - Seung-bin Kim, Chan-yeong Lim, Jungwoo Heo, Ju-ho Kim, Hyun-seo Shin, Kyo-Won Koo, Ha-Jin Yu:
MR-RawNet: Speaker verification system with multiple temporal resolutions for variable duration utterances using raw waveforms. - Kihyun Nam, Hee-Soo Heo, Jee-Weon Jung, Joon Son Chung:
Disentangled Representation Learning for Environment-agnostic Speaker Recognition. - Ladislav Mosner, Romain Serizel, Lukás Burget, Oldrich Plchot, Emmanuel Vincent, Junyi Peng, Jan Cernocký:
Multi-Channel Extension of Pre-trained Models for Speaker Verification. - Yishuang Li, Wenhao Guan, Hukai Huang, Shiyu Miao, Qi Su, Lin Li, Qingyang Hong:
Efficient Integrated Features Based on Pre-trained Models for Speaker Verification. - Tianhao Wang, Lantian Li, Dong Wang:
SE/BN Adapter: Parametric Efficient Domain Adaptation for Speaker Recognition. - Wei-Lin Xie, Yu-Xuan Xi, Yan Song, Jian-Tao Zhang, Hao-Yu Song, Ian McLoughlin:
DB-PMAE: Dual-Branch Prototypical Masked AutoEncoder with locality for domain robust speaker verification. - Matthew Maciejewski, Dominik Klement, Ruizhe Huang, Matthew Wiesner, Sanjeev Khudanpur:
Evaluating the Santa Barbara Corpus: Challenges of the Breadth of Conversational Spoken Language. - Zhenyu Zhou, Shibiao Xu, Shi Yin, Lantian Li, Dong Wang:
A Comprehensive Investigation on Speaker Augmentation for Speaker Recognition.
Source Separation 1
- Helin Wang, Jesús Villalba, Laureano Moro-Velázquez, Jiarui Hai, Thomas Thebaud, Najim Dehak:
Noise-robust Speech Separation with Fast Generative Correction. - Roland Hartanto, Sakriani Sakti, Koichi Shinoda:
MSDET: Multitask Speaker Separation and Direction-of-Arrival Estimation Training. - Jacob Kealey, John R. Hershey, François Grondin:
Unsupervised Improved MVDR Beamforming for Sound Enhancement. - Ke Chen, Jiaqi Su, Taylor Berg-Kirkpatrick, Shlomo Dubnov, Zeyu Jin:
Improving Generalization of Speech Separation in Real-World Scenarios: Strategies in Simulation, Optimization, and Evaluation. - Jihyun Kim, Stijn Kindt, Nilesh Madhu, Hong-Goo Kang:
Enhanced Deep Speech Separation in Clustered Ad Hoc Distributed Microphone Environments. - Jia Qi Yip, Shengkui Zhao, Dianwen Ng, Eng Siong Chng, Bin Ma:
Towards Audio Codec-based Speech Separation.
Audio-Visual and Generative Speech Enhancement
- Zhengxiao Li, Nakamasa Inoue:
Locally Aligned Rectified Flow Model for Speech Enhancement Towards Single-Step Diffusion. - Pu Wang, Junhui Li, Jialu Li, Liangdong Guo, Youshan Zhang:
Diffusion Gaussian Mixture Audio Denoise. - Bunlong Lay, Timo Gerkmann:
An Analysis of the Variance of Diffusion-based Speech Enhancement. - Chaeyoung Jung, Suyeon Lee, Ji-Hoon Kim, Joon Son Chung:
FlowAVSE: Efficient Audio-Visual Speech Enhancement with Conditional Flow Matching. - Honglie Chen, Rodrigo Mira, Stavros Petridis, Maja Pantic:
RT-LA-VocE: Real-Time Low-SNR Audio-Visual Speech Enhancement. - Junhui Li, Pu Wang, Jialu Li, Youshan Zhang:
Complex Image-Generative Diffusion Transformer for Audio Denoising. - Yuchen Hu, Chen Chen, Ruizhe Li, Qiushi Zhu, Eng Siong Chng:
Noise-aware Speech Enhancement using Diffusion Probabilistic Model.
Speech Privacy and Bandwidth Expansion
- Mohammad Hassan Vali, Tom Bäckström:
Privacy PORCUPINE: Anonymization of Speaker Attributes Using Occurrence Normalization for Space-Filling Vector Quantization. - Mayank Kumar Singh, Naoya Takahashi, Wei-Hsiang Liao, Yuki Mitsufuji:
SilentCipher: Deep Audio Watermarking. - Cunhang Fan, Shunbo Dong, Jun Xue, Yujie Chen, Jiangyan Yi, Zhao Lv:
Frequency-mix Knowledge Distillation for Fake Speech Detection. - Nicolas M. Müller, Piotr Kawa, Shen Hu, Matthias Neu, Jennifer Williams, Philip Sperl, Konstantin Böttinger:
A New Approach to Voice Authenticity. - Junzuo Zhou, Jiangyan Yi, Tao Wang, Jianhua Tao, Ye Bai, Chu Yuan Zhang, Yong Ren, Zhengqi Wen:
TraceableSpeech: Towards Proactively Traceable Text-to-Speech with Watermarking. - Liwei Liu, Huihui Wei, Dongya Liu, Zhonghua Fu:
HarmoNet: Partial DeepFake Detection Network based on Multi-scale HarmoF0 Feature Fusion. - Denise Moussa, Sandra Bergmann, Christian Riess:
Unmasking Neural Codecs: Forensic Identification of AI-compressed Speech. - Yin-Tse Lin, Shreya G. Upadhyay, Bo-Hao Su, Chi-Chun Lee:
SWiBE: A Parameterized Stochastic Diffusion Process for Noise-Robust Bandwidth Expansion. - Ye-Xin Lu, Yang Ai, Zheng-Yan Sheng, Zhen-Hua Ling:
MultiStage Speech Bandwidth Extension with Flexible Sampling Rate Control. - Xu Li, Qirui Wang, Xiaoyu Liu:
MaskSR: Masked Language Model for Full-band Speech Restoration.
Speech Synthesis: Prosody
- Yuliya Korotkova, Ilya Kalinovskiy, Tatiana Vakhrusheva:
Word-level Text Markup for Prosody Control in Speech Synthesis. - Shivam Mehta, Harm Lameris, Rajiv Punmiya, Jonas Beskow, Éva Székely, Gustav Eje Henter:
Should you use a probabilistic duration model in TTS? Probably! Especially for spontaneous speech. - Sefik Emre Eskimez, Xiaofei Wang, Manthan Thakker, Chung-Hsien Tsai, Canrun Li, Zhen Xiao, Hemin Yang, Zirun Zhu, Min Tang, Jinyu Li, Sheng Zhao, Naoyuki Kanda:
Total-Duration-Aware Duration Modeling for Text-to-Speech Systems. - Himanshu Maurya, Atli Sigurgeirsson:
A Human-in-the-Loop Approach to Improving Cross-Text Prosody Transfer. - Yuepeng Jiang, Tao Li, Fengyu Yang, Lei Xie, Meng Meng, Yujun Wang:
Towards Expressive Zero-Shot Speech Synthesis with Hierarchical Prosody Modeling. - Jinzuomu Zhong, Yang Li, Hui Huang, Korin Richmond, Jie Liu, Zhiba Su, Jing Guo, Benlai Tang, Fengjie Zhu:
Multi-Modal Automatic Prosody Annotation with Contrastive Pretraining of Speech-Silence and Word-Punctuation.
Accented Speech, Prosodic Features, Dialect, Emotion, Sound Classification
- Darshan Prabhu, Abhishek Gupta, Omkar Nitsure, Preethi Jyothi, Sriram Ganapathy:
Improving Self-supervised Pre-training using Accent-Specific Codebooks. - Tejumade Afonja, Tobi Olatunji, Sewade Ogun, Naome A. Etori, Abraham Toluwase Owodunni, Moshood Yekini:
Performant ASR Models for Medical Entities in Accented Speech. - Tahir Javed, Janki Nawale, Sakshi Joshi, Eldho Ittan George, Kaushal Santosh Bhogale, Deovrat Mehendale, Mitesh M. Khapra:
LAHAJA: A Robust Multi-accent Benchmark for Evaluating Hindi ASR Systems. - Haechan Kim, Junho Myung, Seoyoung Kim, Sungpah Lee, Dongyeop Kang, Juho Kim:
LearnerVoice: A Dataset of Non-Native English Learners' Spontaneous Speech. - Jiayan Lin, Shenghui Lu, Hukai Huang, Wenhao Guan, Binbin Xu, Hui Bu, Qingyang Hong, Lin Li:
MinSpeech: A Corpus of Southern Min Dialect for Automatic Speech Recognition. - Ying Hu, Huamin Yang, Hao Huang, Liang He:
Cross-modal Features Interaction-and-Aggregation Network with Self-consistency Training for Speech Emotion Recognition. - Arnav Goel, Medha Hira, Anubha Gupta:
Exploring Multilingual Unseen Speaker Emotion Recognition: Leveraging Co-Attention Cues in Multitask Learning. - Hazim T. Bukhari, Soham Deshmukh, Hira Dhamyal, Bhiksha Raj, Rita Singh:
SELM: Enhancing Speech Emotion Recognition for Out-of-Domain Scenarios. - Martijn Bentum, Louis ten Bosch, Tom Lentz:
The Processing of Stress in End-to-End Automatic Speech Recognition Models. - Tuan Nguyen, Huy Dat Tran:
LingWav2Vec2: Linguistic-augmented wav2vec 2.0 for Vietnamese Mispronunciation Detection. - Rhiannon Mogridge, Anton Ragni:
Learning from memory-based models. - Meiling Chen, Pengjie Liu, Heng Yang, Haofeng Wang:
Towards End-to-End Unified Recognition for Mandarin and Cantonese.
Neural Network Adaptation
- Thomas Rolland, Alberto Abad:
Shared-Adapters: A Novel Transformer-based Parameter Efficient Transfer Learning Approach For Children's Automatic Speech Recognition. - Zhouyuan Huo, Dongseong Hwang, Gan Song, Khe Chai Sim, Weiran Wang:
AdaRA: Adaptive Rank Allocation of Residual Adapters for Speech Foundation Model. - Kyuhong Shim, Jinkyu Lee, Hyunjae Kim:
Leveraging Adapter for Parameter-Efficient ASR Encoder. - Ji-Hun Kang, Jae-Hong Lee, Mun-Hak Lee, Joon-Hyuk Chang:
Whisper Multilingual Downstream Task Tuning Using Task Vectors. - Shaojun Li, Daimeng Wei, Hengchao Shang, Jiaxin Guo, Zongyao Li, Zhanglin Wu, Zhiqiang Rao, Yuanchang Luo, Xianghui He, Hao Yang:
Speaker-Smoothed kNN Speaker Adaptation for End-to-End ASR. - Jinming Chen, Jingyi Fang, Yuanzhong Zheng, Yaoxuan Wang, Haojun Fei:
Qifusion-Net: Layer-adapted Stream/Non-stream Model for End-to-End Multi-Accent Speech Recognition.
ASR and LLMs
- Ji Won Yoon, Beom Jun Woo, Nam Soo Kim:
HuBERT-EE: Early Exiting HuBERT for Efficient Speech Recognition. - Guanrou Yang, Ziyang Ma, Fan Yu, Zhifu Gao, Shiliang Zhang, Xie Chen:
MaLa-ASR: Multimedia-Assisted LLM-Based ASR. - HyunJung Choi, Muyeol Choi, Yohan Lim, Minkyu Lee, Seon Hui Kim, Seung Yun, Donghyun Kim, Sang-Hun Kim:
Spoken-to-written text conversion with Large Language Model. - Zhiqi Ai, Zhiyong Chen, Shugong Xu:
MM-KWS: Multi-modal Prompts for Multilingual User-defined Keyword Spotting. - Andrew Rouditchenko, Yuan Gong, Samuel Thomas, Leonid Karlinsky, Hilde Kuehne, Rogério Feris, James Glass:
Whisper-Flamingo: Integrating Visual Features into Whisper for Audio-Visual Speech Recognition and Translation. - K. R. Prajwal, Triantafyllos Afouras, Andrew Zisserman:
Speech Recognition Models are Strong Lip-readers.
Pathological Speech Analysis 3
- Ilja Baumann, Dominik Wagner, Maria Schuster, Korbinian Riedhammer, Elmar Nöth, Tobias Bocklet:
Towards Self-Attention Understanding for Automatic Articulatory Processes Analysis in Cleft Lip and Palate Speech. - Yin-Long Liu, Rui Feng, Jia-Hong Yuan, Zhen-Hua Ling:
Clever Hans Effect Found in Automatic Detection of Alzheimer's Disease through Speech. - Yeh-Sheng Lin, Shu-Chuan Tseng, Jyh-Shing Roger Jang:
Leveraging Phonemic Transcription and Whisper toward Clinically Significant Indices for Automatic Child Speech Assessment. - Shaoxiang Dang, Tetsuya Matsumoto, Yoshinori Takeuchi, Takashi Tsuboi, Yasuhiro Tanaka, Daisuke Nakatsubo, Satoshi Maesawa, Ryuta Saito, Masahisa Katsuno, Hiroaki Kudo:
Developing vocal system impaired patient-aimed voice quality assessment approach using ASR representation-included multiple features. - Wei-Tung Hsu, Chin-Po Chen, Yun-Shao Lin, Chi-Chun Lee:
A Cluster-based Personalized Federated Learning Strategy for End-to-End ASR of Dementia Patients. - Stefan Kalabakov, Monica González Machorro, Florian Eyben, Björn W. Schuller, Bert Arnrich:
A Comparative Analysis of Federated Learning for Speech-Based Cognitive Decline Detection. - Michael Neumann, Hardik Kothare, Jackson Liscombe, Emma C. L. Leschly, Oliver Roesler, Vikram Ramanarayanan:
Multimodal Digital Biomarkers for Longitudinal Tracking of Speech Impairment Severity in ALS: An Investigation of Clinically Important Differences.
Speech Disorders 3
- Ming Gao, Hang Chen, Jun Du, Xin Xu, Hongxiao Guo, Hui Bu, Jianxing Yang, Ming Li, Chin-Hui Lee:
Enhancing Voice Wake-Up for Dysarthria: Mandarin Dysarthria Speech Corpus Release and Customized System Design. - Neil Kumar Shah, Shirish S. Karande, Vineet Gandhi:
Towards Improving NAM-to-Speech Synthesis Intelligibility using Self-Supervised Speech Models. - Seyun Um, Doyeon Kim, Hong-Goo Kang:
PARAN: Variational Autoencoder-based End-to-End Articulation-to-Speech System for Speech Intelligibility. - Si Chen, Bruce Xiao Wang, Yitian Hong, Fang Zhou, Angel Chan, Po-yi Tang, Bin Li, Chunyi Wen, James Cheung, Yan Liu, Zhuoming Chen:
Acoustic changes in speech prosody produced by children with autism after robot-assisted speech training. - Xiuwen Zheng, Bornali Phukon, Mark Hasegawa-Johnson:
Fine-Tuning Automatic Speech Recognition for People with Parkinson's: An Effective Strategy for Enhancing Speech Technology Accessibility. - Pan-Pan Jiang, Jimmy Tobin, Katrin Tomanek, Robert L. MacDonald, Katie Seaver, Richard Cave, Marilyn A. Ladewig, Rus Heywood, Jordan R. Green:
Learnings from curating a trustworthy, well-annotated, and useful dataset of disordered English speech. - Wing-Zin Leung, Mattias Cross, Anton Ragni, Stefan Goetze:
Training Data Augmentation for Dysarthric Automatic Speech Recognition by Text-to-Dysarthric-Speech Synthesis. - Gábor Gosztolya, Mercedes Vetráb, Veronika Svindt, Judit Bóna, Ildikó Hoffmann:
Wav2vec 2.0 Embeddings Are No Swiss Army Knife - A Case Study for Multiple Sclerosis.
Speech Recognition with Large Pretrained Speech Models for Under-represented Languages (Special Session)
- Yi-Jen Shih, David Harwath:
Interface Design for Self-Supervised Speech Models. - Yaoxun Xu, Shi-Xiong Zhang, Jianwei Yu, Zhiyong Wu, Dong Yu:
Comparing Discrete and Continuous Space LLMs for Speech Recognition. - Jinpeng Li, Yu Pu, Qi Sun, Wei-Qiang Zhang:
Improving Whisper's Recognition Performance for Under-Represented Language Kazakh Leveraging Unpaired Speech and Text. - Kaushal Santosh Bhogale, Deovrat Mehendale, Niharika Parasa, Sathish Kumar Reddy G, Tahir Javed, Pratyush Kumar, Mitesh M. Khapra:
Empowering Low-Resource Language ASR via Large-Scale Pseudo Labeling. - Zhengyang Li, Patrick Blumenberg, Jing Liu, Thomas Graave, Timo Lohrenz, Siegfried Kunzmann, Tim Fingscheidt:
Interleaved Audio/Audiovisual Transfer Learning for AV-ASR in Low-Resourced Languages. - Sathvik Udupa, Jesuraj Bandekar, Saurabh Kumar, Deekshitha G, Sandhya Badiger, Abhayjeet Singh, Savitha Murthy, Priyanka Pai, Srinivasa Raghavan K. M., Raoul Nanavati, Prasanta Kumar Ghosh:
Adapter pre-training for improved speech recognition in unseen domains using low resource adapter tuning of self-supervised models. - Tianyi Xu, Kaixun Huang, Pengcheng Guo, Yu Zhou, Longtao Huang, Hui Xue, Lei Xie:
Towards Rehearsal-Free Multilingual ASR: A LoRA-based Case Study on Whisper. - Yaroslav Getman, Tamás Grósz, Katri Hiovain-Asikainen, Mikko Kurimo:
Exploring adaptation techniques of large speech foundation models for low-resource ASR: a case study on Northern Sámi. - Mengjie Qian, Siyuan Tang, Rao Ma, Kate M. Knill, Mark J. F. Gales:
Learn and Don't Forget: Adding a New Language to ASR Foundation Models.
Speech Processing Using Discrete Speech Units (Special Session)
- Yuning Wu, Chunlei Zhang, Jiatong Shi, Yuxun Tang, Shan Yang, Qin Jin:
TokSing: Singing Voice Synthesis based on Discrete Tokens. - Pooneh Mousavi, Jarod Duret, Salah Zaiem, Luca Della Libera, Artem Ploujnikov, Cem Subakan, Mirco Ravanelli:
How Should We Extract Discrete Audio Tokens from Self-Supervised Models? - Xuankai Chang, Jiatong Shi, Jinchuan Tian, Yuning Wu, Yuxun Tang, Yihan Wu, Shinji Watanabe, Yossi Adi, Xie Chen, Qin Jin:
The Interspeech 2024 Challenge on Speech Processing Using Discrete Units. - Yuxun Tang, Yuning Wu, Jiatong Shi, Qin Jin:
SingOMD: Singing Oriented Multi-resolution Discrete Representation Construction from Speech Models. - Jiatong Shi, Xutai Ma, Hirofumi Inaguma, Anna Sun, Shinji Watanabe:
MMM: Multi-Layer Multi-Residual Multi-Stream Discrete Speech Representation from Self-supervised Learning Model. - Kunal Dhawan, Nithin Rao Koluguri, Ante Jukic, Ryan Langman, Jagadeesh Balam, Boris Ginsburg:
Codec-ASR: Training Performant Automatic Speech Recognition Systems with Discrete Speech Representations.
Keynote 3
- Elmar Nöth:
Analysis of Pathological Speech - Pitfalls along the Way.
Databases and Progress in Methodology
- Junseok Ahn, Youkyum Kim, Yeunju Choi, Doyeop Kwak, Ji-Hoon Kim, Seongkyu Mun, Joon Son Chung:
VoxSim: A perceptual voice similarity dataset. - Mewlude Nijat, Chen Chen, Dong Wang, Askar Hamdulla:
UY/CH-CHILD - A Public Chinese L2 Speech Database of Uyghur Children. - Prakash Kumar, Ye Tian, Yongwan Lim, Sophia X. Cui, Christina Hagedorn, Dani Byrd, Uttam K. Sinha, Shrikanth Narayanan, Krishna S. Nayak:
State-of-the-art speech production MRI protocol for new 0.55 Tesla scanners. - Mingyue Shi, Huali Zhou, Qinglin Meng, Nengheng Zheng:
DBD-CI: Doubling the Band Density for Bilateral Cochlear Implants. - Huihang Zhong, Yanlu Xie, ZiJin Yao:
Leveraging Large Language Models to Refine Automatic Feedback Generation at Articulatory Level in Computer Aided Pronunciation Training. - Bin Zhao, Mingxuan Huang, Chenlu Ma, Jinyi Xue, Aijun Li, Kunyu Xu:
Decoding Human Language Acquisition: EEG Evidence for Predictive Probabilistic Statistics in Word Segmentation.
Articulation, Convergence and Perception
- Jérémy Giroud, Jessica Lei, Kirsty Phillips, Matthew H. Davis:
Behavioral evidence for higher speech rate convergence following natural than artificial time altered speech. - Qingye Shen, Leonardo Lancia, Noël Nguyen:
A novel experimental design for the study of listener-to-listener convergence in phoneme categorization. - Hao Li, Yuan Fang, Xueliang Zhang, Fei Chen, Guanglai Gao:
Cross-Attention-Guided WaveNet for EEG-to-MEL Spectrogram Reconstruction. - Nicolò Loddo, Francisca Pessanha, Almila Akdag Salah:
What if HAL breathed? Enhancing Empathy in Human-AI Interactions with Breathing Speech Synthesis. - Malin Svensson Lundmark:
Magnitude and timing of acceleration peaks in stressed and unstressed syllables.
Speech Emotion Recognition
- Shahin Amiriparian, Filip Packan, Maurice Gerczuk, Björn W. Schuller:
ExHuBERT: Enhancing HuBERT Through Block Extension and Fine-Tuning on 37 Emotion Datasets. - Fabian Ritter Gutierrez, Kuan-Po Huang, Jeremy H. M. Wong, Dianwen Ng, Hung-yi Lee, Nancy F. Chen, Eng Siong Chng:
Dataset-Distillation Generative Model for Speech Emotion Recognition. - Jialong Mai, Xiaofen Xing, Weidong Chen, Xiangmin Xu:
DropFormer: A Dynamic Noise-Dropping Transformer for Speech Emotion Recognition. - Minxue Niu, Mimansa Jaiswal, Emily Mower Provost:
From Text to Emotion: Unveiling the Emotion Annotation Capabilities of LLMs.
Self-Supervised Models in Speaker Recognition
- Ju-ho Kim, Hee-Soo Heo, Bong-Jin Lee, Youngki Kwon, Minjae Lee, Ha-Jin Yu:
Self-supervised speaker verification with relational mask prediction. - Victor Miara, Théo Lepage, Réda Dehak:
Towards Supervised Performance on Speaker Verification with Self-Supervised Learning by Leveraging Large-Scale ASR Models. - Chan-yeong Lim, Hyun-seo Shin, Ju-ho Kim, Jungwoo Heo, Kyo-Won Koo, Seung-bin Kim, Ha-Jin Yu:
Improving Noise Robustness in Self-supervised Pre-trained Model for Speaker Verification. - Abderrahim Fathan, Xiaolin Zhu, Jahangir Alam:
On the impact of several regularization techniques on label noise robustness of self-supervised speaker verification systems. - Zhe Li, Man-Wai Mak, Hung-yi Lee, Helen Meng:
Parameter-efficient Fine-tuning of Speaker-Aware Dynamic Prompts for Speaker Verification. - Yiyang Zhao, Shuai Wang, Guangzhi Sun, Zehua Chen, Chao Zhang, Mingxing Xu, Thomas Fang Zheng:
Whisper-PMFA: Partial Multi-Scale Feature Aggregation for Speaker Verification using Whisper Models.
Speech Quality Assessment
- Cheng-Hung Hu, Yusuke Yasuda, Tomoki Toda:
Embedding Learning for Preference-based Speech Quality Assessment. - Sathvik Udupa, Soumi Maiti, Prasanta Kumar Ghosh:
IndicMOS: Multilingual MOS Prediction for 7 Indian languages. - Dan Wells, Andrea Lorena Aldana Blanco, Cassia Valentini-Botinhao, Erica Cooper, Aidan Pine, Junichi Yamagishi, Korin Richmond:
Experimental evaluation of MOS, AB and BWS listening test designs. - Bao Thang Ta, Minh Tu Le, Van Hai Do, Huynh Thi Thanh Binh:
Enhancing No-Reference Speech Quality Assessment with Pairwise, Triplet Ranking Losses, and ASR Pretraining.
Privacy and Security in Speech Communication 1
- Nicolas M. Müller, Nicholas W. D. Evans, Hemlata Tak, Philip Sperl, Konstantin Böttinger:
Harder or Different? Understanding Generalization of Audio Deepfake Detection. - Hideyuki Oiso, Yuto Matsunaga, Kazuya Kakizaki, Taiki Miyagawa:
Prompt Tuning for Audio Deepfake Detection: Computationally Efficient Test-time Domain Adaptation with Limited Target Dataset. - David Looney, Nikolay D. Gaubitch:
Robust spread spectrum speech watermarking using linear prediction and deep spectral shaping. - Yujie Chen, Jiangyan Yi, Jun Xue, Chenglong Wang, Xiaohui Zhang, Shunbo Dong, Siding Zeng, Jianhua Tao, Zhao Lv, Cunhang Fan:
RawBMamba: End-to-End Bidirectional State Space Model for Audio Deepfake Detection. - Ailin Liu, Pepijn Vunderink, Jose Vargas Quiros, Chirag Raman, Hayley Hung:
How Private is Low-Frequency Speech Audio in the Wild? An Analysis of Verbal Intelligibility by Humans and Machines. - Ching-Yu Yang, Shreya G. Upadhyay, Ya-Tse Wu, Bo-Hao Su, Chi-Chun Lee:
RW-VoiceShield: Raw Waveform-based Adversarial Attack on One-shot Voice Conversion.
Speech Synthesis: Voice Conversion 2
- Aleksei Gusev, Anastasia Avdeeva:
Improvement Speaker Similarity for Zero-Shot Any-to-Any Voice Conversion of Whispered and Regular Speech. - Ji Sub Um, Hoirin Kim:
Utilizing Adaptive Global Response Normalization and Cluster-Based Pseudo Labels for Zero-Shot Voice Conversion. - Linhan Ma, Xinfa Zhu, Yuanjun Lv, Zhichao Wang, Ziqian Wang, Wendi He, Hongbin Zhou, Lei Xie:
Vec-Tok-VC+: Residual-enhanced Robust Zero-shot Voice Conversion with Progressive Constraints in a Dual-mode Training Strategy. - Takuto Igarashi, Yuki Saito, Kentaro Seki, Shinnosuke Takamichi, Ryuichi Yamamoto, Kentaro Tachibana, Hiroshi Saruwatari:
Noise-Robust Voice Conversion by Conditional Denoising Training Using Latent Variables of Recording Quality and Environment. - Hiroki Kanagawa, Takafumi Moriya, Yusuke Ijima:
Pre-training Neural Transducer-based Streaming Voice Conversion for Faster Convergence and Alignment-free Training. - Le Xu, Jiangyan Yi, Tao Wang, Yong Ren, Rongxiu Zhong, Zhengqi Wen, Jianhua Tao:
Residual Speaker Representation for One-Shot Voice Conversion. - Nicolas Gengembre, Olivier Le Blouch, Cédric Gendrot:
Disentangling prosody and timbre embeddings via voice conversion. - Shihao Chen, Yu Gu, Jie Zhang, Na Li, Rilin Chen, Liping Chen, Lirong Dai:
LDM-SVC: Latent Diffusion Model Based Zero-Shot Any-to-Any Singing Voice Conversion with Singer Guidance.
Speech Synthesis: Text Processing
- Amit Roth, Arnon Turetzky, Yossi Adi:
A Language Modeling Approach to Diacritic-Free Hebrew TTS. - Avihu Dekel, Raul Fernandez:
Exploring the Benefits of Tokenization of Discrete Acoustic Units. - Markéta Rezácková, Daniel Tihelka, Jindrich Matousek:
Homograph Disambiguation with Text-to-Text Transfer Transformer. - Kiyoshi Kurihara, Masanori Sano:
Enhancing Japanese Text-to-Speech Accuracy with a Novel Combination Transformer-BERT-based G2P: Integrating Pronunciation Dictionaries and Accent Sandhi. - Yuma Shirahata, Byeongseon Park, Ryuichi Yamamoto, Kentaro Tachibana:
Audio-conditioned phonemic and prosodic annotation for building text-to-speech models from unlabeled speech data. - Xingxing Yang:
G2PA: G2P with Aligned Audio for Mandarin Chinese. - Siqi Sun, Korin Richmond:
Learning Pronunciation from Other Accents via Pronunciation Knowledge Transfer. - Deepanshu Gupta, Javier Latorre:
Positional Description for Numerical Normalization. - Christina Tånnander, Shivam Mehta, Jonas Beskow, Jens Edlund:
Beyond graphemes and phonemes: continuous phonological features in neural text-to-speech synthesis.
Training Methods, Self-Supervised Learning, Adaptation
- Adriana Fernandez-Lopez, Honglie Chen, Pingchuan Ma, Lu Yin, Qiao Xiao, Stavros Petridis, Shiwei Liu, Maja Pantic:
MSRS: Training Multimodal Speech Recognition Models from Scratch with Sparse Mask Optimization. - Amrutha Prasad, Srikanth R. Madikeri, Driss Khalil, Petr Motlícek, Christof Schüpbach:
Speech and Language Recognition with Low-rank Adaptation of Pretrained Models. - Kwangyoun Kim, Suwon Shon, Yi-Te Hsu, Prashant Sridhar, Karen Livescu, Shinji Watanabe:
Convolution-Augmented Parameter-Efficient Fine-Tuning for Speech Recognition. - Amit Meghanani, Thomas Hain:
LASER: Learning by Aligning Self-supervised Representations of Speech for Improving Content-related Tasks. - Robert Flynn, Anton Ragni:
Self-Train Before You Transcribe. - Steven Vander Eeckt, Hugo Van hamme:
Unsupervised Online Continual Learning for Automatic Speech Recognition. - Hao Shi, Tatsuya Kawahara:
Dual-path Adaptation of Pretrained Feature Extraction Module for Robust Automatic Speech Recognition. - Nahomi Kusunoki, Yosuke Higuchi, Tetsuji Ogawa, Tetsunori Kobayashi:
Hierarchical Multi-Task Learning with CTC and Recursive Operation. - Keigo Hojo, Yukoh Wakabayashi, Kengo Ohta, Atsunori Ogawa, Norihide Kitaoka:
Boosting CTC-based ASR using inter-layer attention-based CTC loss. - Hyung Yong Kim, Byeong-Yeol Kim, Yunkyu Lim, Jihwan Park, Shukjae Choi, Yooncheol Ju, Jinseok Park, Youshin Lim, Seung Woo Yu, Hanbin Lee, Shinji Watanabe:
Self-training ASR Guided by Unsupervised ASR Teacher. - Yue Gu, Zhihao Du, Shiliang Zhang, Jiqing Han, Yongjun He:
Personality-memory Gated Adaptation: An Efficient Speaker Adaptation for Personalized End-to-end Automatic Speech Recognition. - George Joseph, Arun Baby:
Speaker Personalization for Automatic Speech Recognition using Weight-Decomposed Low-Rank Adaptation. - Jae-Hong Lee, Sang-Eon Lee, Dong-Hyun Kim, Do-Hee Kim, Joon-Hyuk Chang:
Online Subloop Search via Uncertainty Quantization for Efficient Test-Time Adaptation. - Vishwanath Pratap Singh, Federico Malato, Ville Hautamäki, Md. Sahidullah, Tomi Kinnunen:
ROAR: Reinforcing Original to Augmented Data Ratio Dynamics for Wav2vec2.0 Based ASR. - Jeehye Lee, Hyeji Seo:
Online Knowledge Distillation of Decoder-Only Large Language Models for Efficient Speech Recognition.
Novel Architectures for ASR
- Tomoki Honda, Shinsuke Sakai, Tatsuya Kawahara:
Efficient and Robust Long-Form Speech Recognition with Hybrid H3-Conformer. - Yosuke Kashiwagi, Hayato Futami, Emiru Tsunoo, Siddhant Arora, Shinji Watanabe:
Rapid Language Adaptation for Multilingual E2E Speech Recognition Using Encoder Prompting. - Virat Shejwalkar, Om Thakkar, Arun Narayanan:
Quantifying Unintended Memorization in BEST-RQ ASR Encoders. - Woo Hyun Kang, Srikanth Vishnubhotla, Rudolf Braun, Yogesh Virkar, Raghuveer Peri, Kyu J. Han:
SWAN: SubWord Alignment Network for HMM-free word timing estimation in end-to-end automatic speech recognition.
Multimodality and Foundation Models
- Ziyun Cui, Chang Lei, Wen Wu, Yinan Duan, Diyang Qu, Ji Wu, Runsen Chen, Chao Zhang:
Spontaneous Speech-Based Suicide Risk Detection Using Whisper and Large Language Models. - Mohammad Amaan Sayeed, Hanan Aldarmaki:
Spoken Word2Vec: Learning Skipgram Embeddings from Speech. - Pawel Bujnowski, Bartlomiej Kuzma, Bartlomiej Paziewski, Jacek Rutkowski, Joanna Marhula, Zuzanna Bordzicka, Piotr Andruszkiewicz:
SAMSEMO: New dataset for multilingual and multimodal emotion recognition. - Bonian Jia, Huiyao Chen, Yueheng Sun, Meishan Zhang, Min Zhang:
LLM-Driven Multimodal Opinion Expression Identification. - Xiaolou Li, Zehua Liu, Chen Chen, Lantian Li, Li Guo, Dong Wang:
Zero-Shot Fake Video Detection by Audio-Visual Consistency. - Han EunGi, Oh Hyun-Bin, Kim Sung-Bin, Corentin Nivelet Etcheberry, Suekyeong Nam, Janghoon Ju, Tae-Hyun Oh:
Enhancing Speech-Driven 3D Facial Animation with Audio-Visual Guidance from Lip Reading Expert.
Spoken Dialogue Systems and Conversational Analysis 1
- Matthew McNeill, Rivka Levitan:
Autoregressive cross-interlocutor attention scores meaningfully capture conversational dynamics. - Conor Atkins, Ian D. Wood, Mohamed Ali Kâafar, Hassan Asghar, Nardine Basta, Michal Kepkowski:
ConvoCache: Smart Re-Use of Chatbot Responses. - Livia Qian, Gabriel Skantze:
Joint Learning of Context and Feedback Embeddings in Spoken Dialogue. - Neha Sahipjohn, Ashishkumar Gudmalwar, Nirmesh Shah, Pankaj Wasnik, Rajiv Ratn Shah:
DubWise: Video-Guided Speech Duration Control in Multimodal LLM-based Text-to-Speech for Dubbing. - Siyang Wang, Éva Székely, Joakim Gustafson:
Contextual Interactive Evaluation of TTS Models in Dialogue Systems. - Min-Han Shih
, Ho-Lam Chung, Yu-Chi Pai, Ming-Hao Hsu, Guan-Ting Lin, Shang-Wen Li, Hung-yi Lee:
GSQA: An End-to-End Model for Generative Spoken Question Answering.
Speech Technology
- Mattias Nilsson, Riccardo Miccini, Clement Laroche, Tobias Piechowiak, Friedemann Zenke:
Resource-Efficient Speech Quality Prediction through Quantization Aware Training and Binary Activation Maps. - Maryam Naderi, Enno Hermann, Alexandre Nanchen, Sevada Hovsepyan, Mathew Magimai-Doss:
Towards interfacing large language models with ASR systems using confidence measures and prompting. - Zhong Meng, Zelin Wu, Rohit Prabhavalkar, Cal Peyser, Weiran Wang, Nanxin Chen, Tara N. Sainath, Bhuvana Ramabhadran:
Text Injection for Neural Contextual Biasing. - Minglin Wu, Jing Xu, Xixin Wu, Helen Meng:
Prompting Large Language Models with Mispronunciation Detection and Diagnosis Abilities. - Haitong Sun, Jaehyun Choi, Nobuaki Minematsu, Daisuke Saito:
Acceleration of Posteriorgram-based DTW by Distilling the Class-to-class Distances Encoded in the Classifier Used to Calculate Posteriors.