


22nd Interspeech 2021: Brno, Czechia
- Hynek Hermansky, Honza Cernocký, Lukás Burget, Lori Lamel, Odette Scharenborg, Petr Motlícek:
22nd Annual Conference of the International Speech Communication Association, Interspeech 2021, Brno, Czechia, August 30 - September 3, 2021. ISCA 2021
Speech Synthesis: Other Topics
- Michael Pucher, Thomas Woltron:
Conversion of Airborne to Bone-Conducted Speech with Deep Neural Networks. 1-5
- Markéta Rezácková, Jan Svec, Daniel Tihelka:
T5G2P: Using Text-to-Text Transfer Transformer for Grapheme-to-Phoneme Conversion. 6-10
- Olivier Perrotin, Hussein El Amouri, Gérard Bailly, Thomas Hueber:
Evaluating the Extrapolation Capabilities of Neural Vocoders to Extreme Pitch Values. 11-15
- Phat Do, Matt Coler, Jelske Dijkstra, Esther Klabbers:
A Systematic Review and Analysis of Multilingual Data Strategies in Text-to-Speech for Low-Resource Languages. 16-20
Disordered Speech
- Tanya Talkar, Nancy Pearl Solomon, Douglas S. Brungart, Stefanie E. Kuchinsky, Megan M. Eitel, Sara M. Lippa, Tracey A. Brickell, Louis M. French, Rael T. Lange, Thomas F. Quatieri:
Acoustic Indicators of Speech Motor Coordination in Adults With and Without Traumatic Brain Injury. 21-25
- Juan Camilo Vásquez-Correa, Julian Fritsch, Juan Rafael Orozco-Arroyave, Elmar Nöth, Mathew Magimai-Doss:
On Modeling Glottal Source Information for Phonation Assessment in Parkinson's Disease. 26-30
- Khalid Daoudi, Biswajit Das, Solange Milhé de Saint Victor, Alexandra Foubert-Samier, Anne Pavy-Le Traon, Olivier Rascol, Wassilios G. Meissner, Virginie Woisard:
Distortion of Voiced Obstruents for Differential Diagnosis Between Parkinson's Disease and Multiple System Atrophy. 31-35
- Pu Wang, Bagher BabaAli, Hugo Van hamme:
A Study into Pre-Training Strategies for Spoken Language Understanding on Dysarthric Speech. 36-40
- Rosanna Turrisi, Arianna Braccia, Marco Emanuele, Simone Giulietti, Maura Pugliatti, Mariachiara Sensi, Luciano Fadiga, Leonardo Badino:
EasyCall Corpus: A Dysarthric Speech Dataset. 41-45
Speech Signal Analysis and Representation II
- Xiaoyu Bie, Laurent Girin, Simon Leglaive, Thomas Hueber, Xavier Alameda-Pineda:
A Benchmark of Dynamical Variational Autoencoders Applied to Speech Spectrogram Modeling. 46-50
- Metehan Yurt, Pavan Kantharaju, Sascha Disch, Andreas Niedermeier, Alberto N. Escalante-B., Veniamin I. Morgenshtern:
Fricative Phoneme Detection Using Deep Neural Networks and its Comparison to Traditional Methods. 51-55
- RaviShankar Prasad, Mathew Magimai-Doss:
Identification of F1 and F2 in Speech Using Modified Zero Frequency Filtering. 56-60
- Yann Teytaut, Axel Roebel:
Phoneme-to-Audio Alignment with Recurrent Neural Networks for Speaking and Singing Voice. 61-65
Feature, Embedding and Neural Architecture for Speaker Recognition
- Seong-Hu Kim, Yong-Hwa Park:
Adaptive Convolutional Neural Network for Text-Independent Speaker Recognition. 66-70
- Jiajun Qi, Wu Guo, Bin Gu:
Bidirectional Multiscale Feature Aggregation for Speaker Verification. 71-75
- Yu-Jia Zhang, Yih-Wen Wang, Chia-Ping Chen, Chung-Li Lu, Bo-Cheng Chan:
Improving Time Delay Neural Network Based Speaker Recognition with Convolutional Block and Feature Aggregation Methods. 76-80
- Yanfeng Wu, Junan Zhao, Chenkai Guo, Jing Xu:
Improving Deep CNN Architectures with Variable-Length Training Samples for Text-Independent Speaker Verification. 81-85
- Tinglong Zhu, Xiaoyi Qin, Ming Li:
Binary Neural Network for Speaker Verification. 86-90
- Youzhi Tu, Man-Wai Mak:
Mutual Information Enhanced Training for Speaker Embedding. 91-95
- Ge Zhu, Fei Jiang, Zhiyao Duan:
Y-Vector: Multiscale Waveform Encoder for Speaker Embedding. 96-100
- Yan Liu, Zheng Li, Lin Li, Qingyang Hong:
Phoneme-Aware and Channel-Wise Attentive Learning for Text Dependent Speaker Verification. 101-105
- Hongning Zhu, Kong Aik Lee, Haizhou Li:
Serialized Multi-Layer Multi-Head Attention for Neural Speaker Embedding. 106-110
Speech Synthesis: Toward End-to-End Synthesis II
- Cheng Gong, Longbiao Wang, Ju Zhang, Shaotong Guo, Yuguang Wang, Jianwu Dang:
TacoLPCNet: Fast and Stable TTS by Conditioning LPCNet on Mel Spectrogram Predictions. 111-115
- Taejun Bak, Jae-Sung Bae, Hanbin Bae, Young-Ik Kim, Hoon-Young Cho:
FastPitchFormant: Source-Filter Based Decomposed Modeling for Speech Synthesis. 116-120
- Taiki Nakamura, Tomoki Koriyama, Hiroshi Saruwatari:
Sequence-to-Sequence Learning for Deep Gaussian Process Based Speech Synthesis Using Self-Attention GP Layer. 121-125
- Naoto Kakegawa, Sunao Hara, Masanobu Abe, Yusuke Ijima:
Phonetic and Prosodic Information Estimation from Texts for Genuine Japanese End-to-End Text-to-Speech. 126-130
- Xudong Dai, Cheng Gong, Longbiao Wang, Kaili Zhang:
Information Sieve: Content Leakage Reduction in End-to-End Prosody Transfer for Expressive Speech Synthesis. 131-135
- Qingyun Dou, Xixin Wu, Moquan Wan, Yiting Lu, Mark J. F. Gales:
Deliberation-Based Multi-Pass Speech Synthesis. 136-140
- Isaac Elias, Heiga Zen, Jonathan Shen, Yu Zhang, Ye Jia, R. J. Skerry-Ryan, Yonghui Wu:
Parallel Tacotron 2: A Non-Autoregressive Neural TTS Model with Differentiable Duration Modeling. 141-145
- Chunyang Wu, Zhiping Xiu, Yangyang Shi, Ozlem Kalinli, Christian Fuegen, Thilo Köhler, Qing He:
Transformer-Based Acoustic Modeling for Streaming Speech Synthesis. 146-150
- Ye Jia, Heiga Zen, Jonathan Shen, Yu Zhang, Yonghui Wu:
PnG BERT: Augmented BERT on Phonemes and Graphemes for Neural TTS. 151-155
- Zhenhao Ge, Lakshmish Kaushik, Masanori Omote, Saket Kumar:
Speed up Training with Variable Length Inputs by Efficient Batching Strategies. 156-160
Speech Enhancement and Intelligibility
- Yuhang Sun, Linju Yang, Huifeng Zhu, Jie Hao:
Funnel Deep Complex U-Net for Phase-Aware Speech Enhancement. 161-165
- Qiquan Zhang, Qi Song, Aaron Nicolson, Tian Lan, Haizhou Li:
Temporal Convolutional Network with Frequency Dimension Adaptive Attention for Speech Enhancement. 166-170
- Changjie Pan, Feng Yang, Fei Chen:
Perceptual Contributions of Vowels and Consonant-Vowel Transitions in Understanding Time-Compressed Mandarin Sentences. 171-175
- Ritujoy Biswas, Karan Nathwani, Vinayak Abrol:
Transfer Learning for Speech Intelligibility Improvement in Noisy Environments. 176-180
- Ayako Yamamoto, Toshio Irino, Kenichi Arai, Shoko Araki, Atsunori Ogawa, Keisuke Kinoshita, Tomohiro Nakatani:
Comparison of Remote Experiments Using Crowdsourcing and Laboratory Experiments on Speech Intelligibility. 181-185
- Wenzhe Liu, Andong Li, Yuxuan Ke, Chengshi Zheng, Xiaodong Li:
Know Your Enemy, Know Yourself: A Unified Two-Stage Framework for Speech Enhancement. 186-190
- Qiuqiang Kong, Haohe Liu, Xingjian Du, Li Chen, Rui Xia, Yuxuan Wang:
Speech Enhancement with Weakly Labelled Data from AudioSet. 191-195
- Tsun-An Hsieh, Cheng Yu, Szu-Wei Fu, Xugang Lu, Yu Tsao:
Improving Perceptual Quality by Phone-Fortified Perceptual Loss Using Wasserstein Distance for Speech Enhancement. 196-200
- Szu-Wei Fu, Cheng Yu, Tsun-An Hsieh, Peter Plantinga, Mirco Ravanelli, Xugang Lu, Yu Tsao:
MetricGAN+: An Improved Version of MetricGAN for Speech Enhancement. 201-205
- Amin Edraki, Wai-Yip Chan, Jesper Jensen, Daniel Fogerty:
A Spectro-Temporal Glimpsing Index (STGI) for Speech Intelligibility Prediction. 206-210
- Yuanhang Qiu, Ruili Wang, Satwinder Singh, Zhizhong Ma, Feng Hou:
Self-Supervised Learning Based Phone-Fortified Speech Enhancement. 211-215
- Khandokar Md. Nayem, Donald S. Williamson:
Incorporating Embedding Vectors from a Human Mean-Opinion Score Prediction Model for Monaural Speech Enhancement. 216-220
- Jianwei Zhang, Suren Jayasuriya, Visar Berisha:
Restoring Degraded Speech via a Modified Diffusion Model. 221-225
Spoken Dialogue Systems I
- Hoang Long Nguyen, Vincent Renkens, Joris Pelemans, Srividya Pranavi Potharaju, Anil Kumar Nalamalapu, Murat Akbacak:
User-Initiated Repetition-Based Recovery in Multi-Utterance Dialogue Systems. 226-230
- Nuo Chen, Chenyu You, Yuexian Zou:
Self-Supervised Dialogue Learning for Spoken Conversational Question Answering. 231-235
- Ruolin Su, Ting-Wei Wu, Biing-Hwang Juang:
Act-Aware Slot-Value Predicting in Multi-Domain Dialogue State Tracking. 236-240
- Yuya Chiba, Ryuichiro Higashinaka:
Dialogue Situation Recognition for Everyday Conversation Using Multimodal Information. 241-245
- Yoshihiro Yamazaki, Yuya Chiba, Takashi Nose, Akinori Ito:
Neural Spoken-Response Generation Using Prosodic and Linguistic Context for Conversational Systems. 246-250
- Weiyuan Xu, Peilin Zhou, Chenyu You, Yuexian Zou:
Semantic Transportation Prototypical Network for Few-Shot Intent Detection. 251-255
- Li Tang, Yuke Si, Longbiao Wang, Jianwu Dang:
Domain-Specific Multi-Agent Dialog Policy Learning in Multi-Domain Task-Oriented Scenarios. 256-260
- Haoyu Wang, John Chen, Majid Laali, Kevin Durda, Jeff King, William Campbell, Yang Liu:
Leveraging ASR N-Best in Deep Entity Retrieval. 261-265
Topics in ASR: Robustness, Feature Extraction, and Far-Field ASR
- Shuai Zhang, Jiangyan Yi, Zhengkun Tian, Ye Bai, Jianhua Tao, Xuefei Liu, Zhengqi Wen:
End-to-End Spelling Correction Conditioned on Acoustic Feature for Code-Switching Speech Recognition. 266-270
- Kathleen Siminyu, Xinjian Li, Antonios Anastasopoulos, David R. Mortensen, Michael R. Marlo, Graham Neubig:
Phoneme Recognition Through Fine Tuning of Phonetic Representations: A Case Study on Luhya Language Varieties. 271-275
- Erfan Loweimi, Zoran Cvetkovic, Peter Bell, Steve Renals:
Speech Acoustic Modelling Using Raw Source and Filter Components. 276-280
- Masakiyo Fujimoto, Hisashi Kawai:
Noise Robust Acoustic Modeling for Single-Channel Speech Recognition Based on a Stream-Wise Transformer Architecture. 281-285
- Anton Ratnarajah, Zhenyu Tang, Dinesh Manocha:
IR-GAN: Room Impulse Response Generator for Far-Field Speech Recognition. 286-290
- Junqi Chen, Xiao-Lei Zhang:
Scaling Sparsemax Based Channel Selection for Speech Recognition with ad-hoc Microphone Arrays. 291-295
- Feng-Ju Chang, Martin Radfar, Athanasios Mouchtaris, Maurizio Omologo:
Multi-Channel Transformer Transducer for Speech Recognition. 296-300
- Emiru Tsunoo, Kentaro Shibata, Chaitanya Narisetty, Yosuke Kashiwagi, Shinji Watanabe:
Data Augmentation Methods for End-to-End Speech Recognition on Distant-Talk Scenarios. 301-305
- Guodong Ma, Pengfei Hu, Jian Kang, Shen Huang, Hao Huang:
Leveraging Phone Mask Training for Phonetic-Reduction-Robust E2E Uyghur Speech Recognition. 306-310
- Tatiana Likhomanenko, Qiantong Xu, Vineel Pratap, Paden Tomasello, Jacob Kahn, Gilad Avidov, Ronan Collobert, Gabriel Synnaeve:
Rethinking Evaluation in ASR: Are Our Models Robust Enough? 311-315
- Max W. Y. Lam, Jun Wang, Chao Weng, Dan Su, Dong Yu:
Raw Waveform Encoder with Multi-Scale Globally Attentive Locally Recurrent Networks for End-to-End Speech Recognition. 316-320
Voice Activity Detection and Keyword Spotting
- Yuanbo Hou, Zhesong Yu, Xia Liang, Xingjian Du, Bilei Zhu, Zejun Ma, Dick Botteldooren:
Attention-Based Cross-Modal Fusion for Audio-Visual Voice Activity Detection in Musical Video Streams. 321-325
- Ui-Hyun Kim:
Noise-Tolerant Self-Supervised Learning for Audio-Visual Voice Activity Detection. 326-330
- Hyun-Jin Park, Pai Zhu, Ignacio López-Moreno, Niranjan Subrahmanya:
Noisy Student-Teacher Training for Robust Keyword Spotting. 331-335
- Osamu Ichikawa, Kaito Nakano, Takahiro Nakayama, Hajime Shirouzu:
Multi-Channel VAD for Transcription of Group Discussion. 336-340
- Hengshun Zhou, Jun Du, Hang Chen, Zijun Jing, Shifu Xiong, Chin-Hui Lee:
Audio-Visual Information Fusion Using Cross-Modal Teacher-Student Learning for Voice Activity Detection in Realistic Environments. 341-345
- Naoki Makishima, Mana Ihori, Tomohiro Tanaka, Akihiko Takashima, Shota Orihashi, Ryo Masumura:
Enrollment-Less Training for Personalized Voice Activity Detection. 346-350
- Yuto Nonaka, Chee Siang Leow, Akio Kobayashi, Takehito Utsuro, Hiromitsu Nishizaki:
Voice Activity Detection for Live Speech of Baseball Game Based on Tandem Connection with Speech/Noise Separation Model. 351-355
- Young D. Kwon, Jagmohan Chauhan, Cecilia Mascolo:
FastICARL: Fast Incremental Classifier and Representation Learning with Efficient Budget Allocation in Audio Sensing Applications. 356-360
- Bo Wei, Meirong Yang, Tao Zhang, Xiao Tang, Xing Huang, Kyuhong Kim, Jaeyun Lee, Kiho Cho, Sung-Un Park:
End-to-End Transformer-Based Open-Vocabulary Keyword Spotting with Location-Guided Local Attention. 361-365
- Saurabhchand Bhati, Jesús Villalba, Piotr Zelasko, Laureano Moro-Velázquez, Najim Dehak:
Segmental Contrastive Predictive Coding for Unsupervised Word Segmentation. 366-370
- Xuenan Xu, Heinrich Dinkel, Mengyue Wu, Kai Yu:
A Lightweight Framework for Online Voice Activity Detection in the Wild. 371-375
Voice and Voicing
- Aurélie Chlébowski, Nicolas Ballier:
"See what I mean, huh?" Evaluating Visual Inspection of F0 Tracking in Nasal Grunts. 376-380
- Bruce Xiao Wang, Vincent Hughes:
System Performance as a Function of Calibration Methods, Sample Size and Sampling Variability in Likelihood Ratio-Based Forensic Voice Comparison. 381-385
- Anne Bonneau:
Voicing Assimilations by French Speakers of German in Stop-Fricative Sequences. 386-390
- Titas Chakraborty, Vaishali Patil, Preeti Rao:
The Four-Way Classification of Stops with Voicing and Aspiration for Non-Native Speech Evaluation. 391-395
- Saba Urooj, Benazir Mumtaz, Sarmad Hussain, Ehsan ul Haq:
Acoustic and Prosodic Correlates of Emotions in Urdu Speech. 396-400
- Nour Tamim, Silke Hamann:
Voicing Contrasts in the Singleton Stops of Palestinian Arabic: Production and Perception. 401-405
- Thomas Coy, Vincent Hughes, Philip Harrison, Amelia Jane Gully:
A Comparison of the Accuracy of Dissen and Keshet's (2016) DeepFormants and Traditional LPC Methods for Semi-Automatic Speaker Recognition. 406-410
- Michael Jessen:
MAP Adaptation Characteristics in Forensic Long-Term Formant Analysis. 411-415
- Justin J. H. Lo:
Cross-Linguistic Speaker Individuality of Long-Term Formant Distributions: Phonetic and Forensic Perspectives. 416-420
- Rachel Soo, Khia A. Johnson, Molly Babel:
Sound Change in Spontaneous Bilingual Speech: A Corpus Study on the Cantonese n-l Merger in Cantonese-English Bilinguals. 421-425
- Wendy Lalhminghlui, Priyankoo Sarmah:
Characterizing Voiced and Voiceless Nasals in Mizo. 426-430
The INTERSPEECH 2021 Computational Paralinguistics Challenge (ComParE) - COVID-19 Cough, COVID-19 Speech, Escalation & Primates
- Björn W. Schuller, Anton Batliner, Christian Bergler, Cecilia Mascolo, Jing Han, Iulia Lefter, Heysem Kaya, Shahin Amiriparian, Alice Baird, Lukas Stappen, Sandra Ottl, Maurice Gerczuk, Panagiotis Tzirakis, Chloë Brown, Jagmohan Chauhan, Andreas Grammenos, Apinan Hasthanasombat, Dimitris Spathis, Tong Xia, Pietro Cicuta, Léon J. M. Rothkrantz, Joeri A. Zwerts, Jelle Treep, Casper S. Kaandorp:
The INTERSPEECH 2021 Computational Paralinguistics Challenge: COVID-19 Cough, COVID-19 Speech, Escalation & Primates. 431-435
- Rubén Solera-Ureña, Catarina Botelho, Francisco Teixeira, Thomas Rolland, Alberto Abad, Isabel Trancoso:
Transfer Learning-Based Cough Representations for Automatic Detection of COVID-19. 436-440
- Philipp Klumpp, Tobias Bocklet, Tomás Arias-Vergara, Juan Camilo Vásquez-Correa, Paula Andrea Pérez-Toro, Sebastian P. Bayerl, Juan Rafael Orozco-Arroyave, Elmar Nöth:
The Phonetic Footprint of Covid-19? 441-445
- Edresson Casanova, Arnaldo Candido Jr., Ricardo Corso Fernandes Junior, Marcelo Finger, Lucas Rafael Stefanel Gris, Moacir Antonelli Ponti, Daniel Peixoto Pinto da Silva:
Transfer Learning and Data Augmentation Techniques to the COVID-19 Identification Tasks in ComParE 2021. 446-450
- Steffen Illium, Robert Müller, Andreas Sedlmeier, Claudia Linnhoff-Popien:
Visual Transformers for Primates Classification and Covid Detection. 451-455
- Thomas Pellegrini:
Deep-Learning-Based Central African Primate Species Classification with MixUp and SpecAugment. 456-460
- Robert Müller, Steffen Illium, Claudia Linnhoff-Popien:
A Deep and Recurrent Architecture for Primate Vocalization Classification. 461-465
- Joeri A. Zwerts, Jelle Treep, Casper S. Kaandorp, Floor Meewis, Amparo C. Koot, Heysem Kaya:
Introducing a Central African Primate Vocalisation Dataset for Automated Species Classification. 466-470
- Georgios Rizos, Jenna Lawson, Zhuoda Han, Duncan Butler, James Rosindell, Krystian Mikolajczyk, Cristina Banks-Leite, Björn W. Schuller:
Multi-Attentive Detection of the Spider Monkey Whinny in the (Actual) Wild. 471-475
- José Vicente Egas López, Mercedes Vetráb, László Tóth, Gábor Gosztolya:
Identifying Conflict Escalation and Primates by Using Ensemble X-Vectors and Fisher Vector Features. 476-480
- Oxana Verkholyak, Denis Dresvyanskiy, Anastasia Dvoynikova, Denis Kotov, Elena Ryumina, Alena Velichko, Danila Mamontov, Wolfgang Minker, Alexey Karpov:
Ensemble-Within-Ensemble Classification for Escalation Prediction from Speech. 481-485
- Dominik Schiller, Silvan Mertes, Pol van Rijn, Elisabeth André:
Analysis by Synthesis: Using an Expressive TTS Model as Feature Extractor for Paralinguistic Speech Classification. 486-490
Survey Talk 1: Heidi Christensen
- Heidi Christensen:
Towards Automatic Speech Recognition for People with Atypical Speech.
Embedding and Network Architecture for Speaker Recognition
- Chau Luu, Peter Bell, Steve Renals:
Leveraging Speaker Attribute Information Using Multi Task Learning for Speaker Verification and Diarization. 491-495
- Magdalena Rybicka, Jesús Villalba, Piotr Zelasko, Najim Dehak, Konrad Kowalczyk:
Spine2Net: SpineNet with Res2Net and Time-Squeeze-and-Excitation Blocks for Speaker Recognition. 496-500
- Themos Stafylakis, Johan Rohdin, Lukás Burget:
Speaker Embeddings by Modeling Channel-Wise Correlations. 501-505
- Weipeng He, Petr Motlícek, Jean-Marc Odobez:
Multi-Task Neural Network for Robust Multiple Speaker Embedding Extraction. 506-510
- Junyi Peng, Xiaoyang Qu, Jianzong Wang, Rongzhi Gu, Jing Xiao, Lukás Burget, Jan Cernocký:
ICSpk: Interpretable Complex Speaker Embedding Extractor from Raw Waveform. 511-515
Speech Perception I
- Xiao Xiao, Nicolas Audibert, Grégoire Locqueville, Christophe d'Alessandro, Barbara Kühnert, Claire Pillot-Loiseau:
Prosodic Disambiguation Using Chironomic Stylization of Intonation with Native and Non-Native Speakers. 516-520
- Aleese Block, Michelle Cohn, Georgia Zellou:
Variation in Perceptual Sensitivity and Compensation for Coarticulation Across Adult and Child Naturally-Produced and TTS Voices. 521-525
- Mohammad Jalilpour-Monesi, Bernd Accou, Tom Francart, Hugo Van hamme:
Extracting Different Levels of Speech Information from EEG Using an LSTM-Based Model. 526-530
- Louis ten Bosch, Lou Boves:
Word Competition: An Entropy-Based Approach in the DIANA Model of Human Word Comprehension. 531-535
- Louis ten Bosch, Lou Boves:
Time-to-Event Models for Analyzing Reaction Time Sequences. 536-540
- Sophie Brand, Kimberley Mulder, Louis ten Bosch, Lou Boves:
Models of Reaction Times in Auditory Lexical Decision: RTonset versus RToffset. 541-545
Acoustic Event Detection and Acoustic Scene Classification
- Gwantae Kim, David K. Han, Hanseok Ko:
SpecMix : A Mixed Sample Data Augmentation Method for Training with Time-Frequency Domain Features. 546-550
- Helin Wang, Yuexian Zou, Wenwu Wang:
SpecAugment++: A Hidden Space Data Augmentation Method for Acoustic Scene Classification. 551-555
- Xu Zheng, Yan Song, Li-Rong Dai, Ian McLoughlin, Lin Liu:
An Effective Mutual Mean Teaching Based Domain Adaptation Method for Sound Event Detection. 556-560
- Ritika Nandi, Shashank Shekhar, Manjunath Mulimani:
Acoustic Scene Classification Using Kervolution-Based SubSpectralNet. 561-565
- Harshavardhan Sundar, Ming Sun, Chao Wang:
Event Specific Attention for Polyphonic Sound Event Detection. 566-570
- Yuan Gong, Yu-An Chung, James R. Glass:
AST: Audio Spectrogram Transformer. 571-575
- Soonshin Seo, Donghyun Lee, Ji-Hwan Kim:
Shallow Convolution-Augmented Transformer with Differentiable Neural Computer for Low-Complexity Classification of Variable-Length Acoustic Scene. 576-580
- Helen L. Bear, Veronica Morfi, Emmanouil Benetos:
An Evaluation of Data Augmentation Methods for Sound Scene Geotagging. 581-585
- Chiori Hori, Takaaki Hori, Jonathan Le Roux:
Optimizing Latency for Online Video Captioning Using Audio-Visual Transformers. 586-590
- Shijing Si, Jianzong Wang, Huiming Sun, Jianhan Wu, Chuanyao Zhang, Xiaoyang Qu, Ning Cheng, Lei Chen, Jing Xiao:
Variational Information Bottleneck for Effective Low-Resource Audio Classification. 591-595
- Soham Deshmukh, Bhiksha Raj, Rita Singh:
Improving Weakly Supervised Sound Event Detection with Self-Supervised Auxiliary Tasks. 596-600
- Tatsuya Komatsu, Shinji Watanabe, Koichi Miyazaki, Tomoki Hayashi:
Acoustic Event Detection with Classifier Chains. 601-605
Diverse Modes of Speech Acquisition and Processing
- Shu-Chuan Tseng, Yi-Fen Liu:
Segment and Tone Production in Continuous Speech of Hearing and Hearing-Impaired Children. 606-610
- Feng Wang, Jing Chen, Fei Chen:
Effect of Carrier Bandwidth on Understanding Mandarin Sentences in Simulated Electric-Acoustic Hearing. 611-615
- Manthan Sharma, Navaneetha Gaddam, Tejas Umesh, Aditya Murthy, Prasanta Kumar Ghosh:
A Comparative Study of Different EMG Features for Acoustics-to-EMG Mapping. 616-620
- Ajish K. Abraham, V. Sivaramakrishnan, N. Swapna, N. Manohar:
Image-Based Assessment of Jaw Parameters and Jaw Kinematics for Articulatory Simulation: Preliminary Results. 621-625
- Jianrong Wang, Nan Gu, Mei Yu, Xuewei Li, Qiang Fang, Li Liu:
An Attention Self-Supervised Contrastive Learning Based Three-Stage Model for Hand Shape Feature Representation in Cued Speech. 626-630
- Judith Dineley, Grace Lavelle, Daniel Leightley, Faith Matcham, Sara Siddi, Maria Teresa Peñarrubia-María, Katie M. White, Alina Ivan, Carolin Oetzmann, Sara Simblett, Erin Dawe-Lane, Stuart Bruce, Daniel Stahl, Yatharth Ranjan, Zulqarnain Rashid, Pauline Conde, Amos A. Folarin, Josep Maria Haro, Til Wykes, Richard J. B. Dobson, Vaibhav A. Narayan, Matthew Hotopf, Björn W. Schuller, Nicholas Cummins, RADAR-CNS Consortium:
Remote Smartphone-Based Speech Collection: Acceptance and Barriers in Individuals with Major Depressive Disorder. 631-635
- Sarah R. Li, Colin T. Annand, Sarah Dugan, Sarah M. Schwab, Kathryn J. Eary, Michael Swearengen, Sarah Stack, Suzanne Boyce, Michael A. Riley, T. Douglas Mast:
An Automatic, Simple Ultrasound Biofeedback Parameter for Distinguishing Accurate and Misarticulated Rhotic Syllables. 636-640
- Manuel Sam Ribeiro, Aciel Eshky, Korin Richmond, Steve Renals:
Silent versus Modal Multi-Speaker Speech Recognition from Ultrasound and Video. 641-645
- David Ferreira, Samuel S. Silva, Francisco Curado, António J. S. Teixeira:
RaSSpeR: Radar-Based Silent Speech Recognition. 646-650
- Beiming Cao, Nordine Sebkhi, Arpan Bhavsar, Omer T. Inan, Robin Samlan, Ted Mau, Jun Wang:
Investigating Speech Reconstruction for Laryngectomees for Silent Speech Interfaces. 651-655
Multi-Channel Speech Enhancement and Hearing Aids
- Hendrik Schröter, Tobias Rosenkranz, Alberto N. Escalante-B., Andreas K. Maier:
LACOPE: Latency-Constrained Pitch Estimation for Speech Enhancement. 656-660
- Mathieu Fontaine, Kouhei Sekiguchi, Aditya Arie Nugraha, Yoshiaki Bando, Kazuyoshi Yoshii:
Alpha-Stable Autoregressive Fast Multichannel Nonnegative Matrix Factorization for Joint Speech Enhancement and Dereverberation. 661-665
- Siyuan Zhang, Xiaofei Li:
Microphone Array Generalization for Multichannel Narrowband Deep Speech Enhancement. 666-670
- Hyungchan Song, Jong Won Shin:
Multiple Sound Source Localization Based on Interchannel Phase Differences in All Frequencies with Spectral Masks. 671-675
- Pablo Pérez Zarazaga, Mariem Bouafif Mansali, Tom Bäckström, Zied Lachiri:
Cancellation of Local Competing Speaker with Near-Field Localization for Distributed ad-hoc Sensor Network. 676-680
- Hao Zhang, DeLiang Wang:
A Deep Learning Method to Multi-Channel Active Noise Control. 681-685
- Simone Graetzer, Jon Barker, Trevor J. Cox, Michael Akeroyd, John F. Culling, Graham Naylor, Eszter Porter, Rhoddy Viveros Muñoz:
Clarity-2021 Challenges: Machine Learning Challenges for Advancing Hearing Aid Processing. 686-690
- Zehai Tu, Ning Ma, Jon Barker:
Optimising Hearing Aid Fittings for Speech in Noise with a Differentiable Hearing Loss Model. 691-695
- Sunit Sivasankaran, Emmanuel Vincent, Dominique Fohr:
Explaining Deep Learning Models for Speech Enhancement. 696-700
- Weilong Huang, Jinwei Feng:
Minimum-Norm Differential Beamforming for Linear Array with Directional Microphones. 701-705
Self-Supervision and Semi-Supervision for Neural ASR Training
- Songjun Cao, Yueteng Kang, Yanzhe Fu, Xiaoshuo Xu, Sining Sun, Yike Zhang, Long Ma:
Improving Streaming Transformer Based ASR Under a Framework of Self-Supervised Learning. 706-710
- Samik Sadhu, Di He, Che-Wei Huang, Sri Harish Mallidi, Minhua Wu, Ariya Rastrow, Andreas Stolcke, Jasha Droppo, Roland Maas:
wav2vec-C: A Self-Supervised Model for Speech Representation Learning. 711-715
- Electra Wallington, Benji Kershenbaum, Ondrej Klejch, Peter Bell:
On the Learning Dynamics of Semi-Supervised Training for ASR. 716-720
- Wei-Ning Hsu, Anuroop Sriram, Alexei Baevski, Tatiana Likhomanenko, Qiantong Xu, Vineel Pratap, Jacob Kahn, Ann Lee, Ronan Collobert, Gabriel Synnaeve, Michael Auli:
Robust wav2vec 2.0: Analyzing Domain Shift in Self-Supervised Pre-Training. 721-725
- Yosuke Higuchi, Niko Moritz, Jonathan Le Roux, Takaaki Hori:
Momentum Pseudo-Labeling for Semi-Supervised Speech Recognition. 726-730
- Ananya Misra, Dongseong Hwang, Zhouyuan Huo, Shefali Garg, Nikhil Siddhartha, Arun Narayanan, Khe Chai Sim:
A Comparison of Supervised and Unsupervised Pre-Training of End-to-End Models. 731-735
- Zhehuai Chen, Andrew Rosenberg, Yu Zhang, Heiga Zen, Mohammadreza Ghodsi, Yinghui Huang, Jesse Emond, Gary Wang, Bhuvana Ramabhadran, Pedro J. Moreno:
Semi-Supervision in ASR: Sequential MixMatch and Factorized TTS-Based Augmentation. 736-740
- Tatiana Likhomanenko, Qiantong Xu, Jacob Kahn, Gabriel Synnaeve, Ronan Collobert:
slimIPL: Language-Model-Free Iterative Pseudo-Labeling. 741-745
- Xianghu Yue, Haizhou Li:
Phonetically Motivated Self-Supervised Speech Representation Learning. 746-750
- Yan Deng, Rui Zhao, Zhong Meng, Xie Chen, Bing Liu, Jinyu Li, Yifan Gong, Lei He:
Improving RNN-T for Domain Scaling Using Semi-Supervised Training with Neural TTS. 751-755
Spoken Language Processing I
- Scott Seyfarth, Sundararajan Srinivasan, Katrin Kirchhoff:
Speaker-Conversation Factorial Designs for Diarization Error Analysis. 756-760
- Ross McGowan, Jinru Su, Vince DiCocco, Thejaswi Muniyappa, Grant P. Strimel:
SmallER: Scaling Neural Entity Resolution for Edge Devices. 761-765
- Johann C. Rocholl, Vicky Zayats, Daniel D. Walker, Noah B. Murad, Aaron Schneider, Daniel J. Liebling:
Disfluency Detection with Unlabeled Data and Small BERT Models. 766-770
- Qian Chen, Wen Wang, Mengzhe Chen, Qinglin Zhang:
Discriminative Self-Training for Punctuation Prediction. 771-775
- Mana Ihori, Naoki Makishima, Tomohiro Tanaka, Akihiko Takashima, Shota Orihashi, Ryo Masumura:
Zero-Shot Joint Modeling of Multiple Spoken-Text-Style Conversion Tasks Using Switching Tokens. 776-780
- Binghuai Lin, Liyuan Wang:
A Noise Robust Method for Word-Level Pronunciation Assessment. 781-785
- Jonathan Wintrode:
Targeted Keyword Filtering for Accelerated Spoken Topic Identification. 786-790
- Shruti Palaskar, Ruslan Salakhutdinov, Alan W. Black, Florian Metze:
Multimodal Speech Summarization Through Semantic Concept Learning. 791-795
- Hyunjae Lee, Jaewoong Yun, Hyunjin Choi, Seongho Joe, Youngjune L. Gwon:
Enhancing Semantic Understanding with Self-Supervised Methods for Abstractive Dialogue Summarization. 796-800
- Marcin Wlodarczak, Emer Gilmartin:
Speaker Transition Patterns in Three-Party Conversation: Evidence from English, Estonian and Swedish. 801-805
Voice Conversion and Adaptation II
- Samuel J. Broughton, Md. Asif Jalal, Roger K. Moore:
Investigating Deep Neural Structures and their Interpretability in the Domain of Voice Conversion. 806-810
- Kun Zhou, Berrak Sisman, Haizhou Li:
Limited Data Emotional Voice Conversion Leveraging Text-to-Speech: Two-Stage Sequence-to-Sequence Training. 811-815
- Yi-Yang Ding, Li-Juan Liu, Yu Hu, Zhen-Hua Ling:
Adversarial Voice Conversion Against Neural Spoofing Detectors. 816-820
- Xiangheng He, Junjie Chen, Georgios Rizos, Björn W. Schuller:
An Improved StarGAN for Emotional Voice Conversion: Enhancing Voice Quality and Data Augmentation. 821-825
- Ziyi Chen, Pengyuan Zhang:
TVQVC: Transformer Based Vector Quantized Variational Autoencoder with CTC Loss for Voice Conversion. 826-830
- Zhichao Wang, Xinyong Zhou, Fengyu Yang, Tao Li, Hongqiang Du, Lei Xie, Wendong Gan, Haitao Chen, Hai Li:
Enriching Source Style Transfer in Recognition-Synthesis Based Non-Parallel Voice Conversion. 831-835
- Jheng-Hao Lin, Yist Y. Lin, Chung-Ming Chien, Hung-yi Lee:
S2VC: A Framework for Any-to-Any Voice Conversion with Self-Supervised Pretrained Representations. 836-840
- Christopher Liberatore, Ricardo Gutierrez-Osuna:
An Exemplar Selection Algorithm for Native-Nonnative Voice Conversion. 841-845
- Jie Wang, Jingbei Li, Xintao Zhao, Zhiyong Wu, Shiyin Kang, Helen Meng:
Adversarially Learning Disentangled Speech Representations for Robust Multi-Factor Voice Conversion. 846-850
- Manh Luong, Viet-Anh Tran:
Many-to-Many Voice Conversion Based Feature Disentanglement Using Variational Autoencoder. 851-855
Privacy-Preserving Machine Learning for Audio & Speech Processing
- Oubaïda Chouchane, Baptiste Brossier, Jorge Esteban Gamboa Gamboa, Thomas Lardy, Hemlata Tak, Orhan Ermis, Madhu R. Kamble, Jose Patino, Nicholas W. D. Evans, Melek Önen, Massimiliano Todisco:
Privacy-Preserving Voice Anti-Spoofing Using Secure Multi-Party Computation. 856-860
- Ranya Aloufi, Hamed Haddadi, David Boyle:
Configurable Privacy-Preserving Automatic Speech Recognition. 861-865
- Scott Novotney, Yile Gu, Ivan Bulyko:
Adjunct-Emeritus Distillation for Semi-Supervised Language Model Adaptation. 866-870
- Jae Ro, Mingqing Chen, Rajiv Mathews, Mehryar Mohri, Ananda Theertha Suresh:
Communication-Efficient Agnostic Federated Averaging. 871-875
- Timm Koppelmann, Alexandru Nelus, Lea Schönherr, Dorothea Kolossa, Rainer Martin:
Privacy-Preserving Feature Extraction for Cloud-Based Wake Word Verification. 876-880
- Chao-Han Huck Yang, Sabato Marco Siniscalchi, Chin-Hui Lee:
PATE-AAE: Incorporating Adversarial Autoencoder into Private Aggregation of Teacher Ensembles for Spoken Command Classification. 881-885
- Haoxin Ma, Jiangyan Yi, Jianhua Tao, Ye Bai, Zhengkun Tian, Chenglong Wang:
Continual Learning for Fake Audio Detection. 886-890
- Muhammad A. Shah, Joseph Szurley, Markus Müller, Athanasios Mouchtaris, Jasha Droppo:
Evaluating the Vulnerability of End-to-End Automatic Speech Recognition Models to Membership Inference Attacks. 891-895
- Amin Fazel, Wei Yang, Yulan Liu, Roberto Barra-Chicote, Yixiong Meng, Roland Maas, Jasha Droppo:
SynthASR: Unlocking Synthetic Data for Speech Recognition. 896-900
The First DiCOVA Challenge: Diagnosis of COVID-19 Using Acoustics
- Ananya Muguli
, Lancelot Pinto, Nirmala R., Neeraj Kumar Sharma
, Prashant Krishnan V, Prasanta Kumar Ghosh, Rohit Kumar, Shrirama Bhat, Srikanth Raj Chetupalli
, Sriram Ganapathy, Shreyas Ramoji, Viral Nanda:
DiCOVA Challenge: Dataset, Task, and Baseline System for COVID-19 Diagnosis Using Acoustics. 901-905 - Madhu R. Kamble, José Andrés González López, Teresa Grau, Juan M. Espín, Lorenzo Cascioli
, Yiqing Huang, Alejandro Gómez Alanís, Jose Patino, Roberto Font, Antonio M. Peinado, Angel M. Gomez, Nicholas W. D. Evans, Maria A. Zuluaga, Massimiliano Todisco:
PANACEA Cough Sound-Based Diagnosis of COVID-19 for the DiCOVA 2021 Challenge. 906-910 - Vincent Karas, Björn W. Schuller:
Recognising Covid-19 from Coughing Using Ensembles of SVMs and LSTMs with Handcrafted and Deep Audio Features. 911-915 - Isabella Södergren, Maryam Pahlavan Nodeh, Prakash Chandra Chhipa, Konstantina Nikolaidou, György Kovács:
Detecting COVID-19 from Audio Recording of Coughs Using Random Forests and Support Vector Machines. 916-920 - Rohan Kumar Das, Maulik C. Madhavi, Haizhou Li:
Diagnosis of COVID-19 Using Auditory Acoustic Cues. 921-925 - John B. Harvill, Yash R. Wani, Mark Hasegawa-Johnson, Narendra Ahuja, David G. Beiser, David Chestek:
Classification of COVID-19 from Cough Using Autoregressive Predictive Coding Pretraining and Spectral Data Augmentation. 926-930 - Gauri Deshpande, Björn W. Schuller:
The DiCOVA 2021 Challenge - An Encoder-Decoder Approach for COVID-19 Recognition from Coughing Audio. 931-935 - Kotra Venkata Sai Ritwik, Shareef Babu Kalluri, Deepu Vijayasenan:
COVID-19 Detection from Spectral Features on the DiCOVA Dataset. 936-940 - Adria Mallol-Ragolta, Helena Cuesta, Emilia Gómez, Björn W. Schuller:
Cough-Based COVID-19 Detection with Contextual Attention Convolutional Neural Networks and Gender Information. 941-945 - Swapnil Bhosale, Upasana Tiwari, Rupayan Chakraborty, Sunil Kumar Kopparapu:
Contrastive Learning of Cough Descriptors for Automatic COVID-19 Preliminary Diagnosis. 946-950 - Flávio Ávila, Amir H. Poorjam, Deepak Mittal, Charles Dognin, Ananya Muguli, Rohit Kumar, Srikanth Raj Chetupalli, Sriram Ganapathy, Maneesh Singh:
Investigating Feature Selection and Explainability for COVID-19 Diagnostics from Cough Sounds. 951-955
Show and Tell 1
- Gábor Kiss, Dávid Sztahó, Miklós Gábriel Tulics:
Application for Detecting Depression, Parkinson's Disease and Dysphonic Speech. 956-957 - Lenka Weingartová, Veronika Volna, Ewa Balejová:
Beey: More Than a Speech-to-Text Editor. 958-959 - Takayuki Arai:
Downsizing of Vocal-Tract Models to Line up Variations and Reduce Manufacturing Costs. 960-961 - Maël Fabien, Shantipriya Parida, Petr Motlícek, Dawei Zhu, Aravind Krishnan, Hoang H. Nguyen:
ROXANNE Research Platform: Automate Criminal Investigations. 962-964 - Alexandre Flucha, Anthony Larcher, Ambuj Mehrish, Sylvain Meignier, Florian Plaut, Nicolas Poupon, Yevhenii Prokopalo, Adrien Puertolas, Meysam Shamsi, Marie Tahon:
The LIUM Human Active Correction Platform for Speaker Diarization. 965-966 - Yoo Rhee Oh, Kiyoung Park:
On-Device Streaming Transformer-Based End-to-End Speech Recognition. 967-968 - Jaroslav Cmejla, Tomás Kounovský, Jakub Janský, Jirí Málek, M. Rozkovec, Zbynek Koldovský:
Advanced Semi-Blind Speaker Extraction and Tracking Implemented in Experimental Device with Revolving Dense Microphone Array. 969-970
Keynote 1: Hermann Ney
- Hermann Ney:
Forty Years of Speech and Language Processing: From Bayes Decision Rule to Deep Learning.
ASR Technologies and Systems
- Jan Chorowski, Grzegorz Ciesielski, Jaroslaw Dzikowski, Adrian Lancucki, Ricard Marxer, Mateusz Opala, Piotr Pusz, Pawel Rychlikowski, Michal Stypulkowski:
Information Retrieval for ZeroSpeech 2021: The Submission by University of Wroclaw. 971-975 - Jan Chorowski, Grzegorz Ciesielski, Jaroslaw Dzikowski, Adrian Lancucki, Ricard Marxer, Mateusz Opala, Piotr Pusz, Pawel Rychlikowski, Michal Stypulkowski:
Aligned Contrastive Predictive Coding. 976-980 - Benjamin Suter, Josef Novák:
Neural Text Denormalization for Speech Transcripts. 981-985 - Aditya Joglekar, Seyed Omid Sadjadi, Meena Chandra Shekar, Christopher Cieri, John H. L. Hansen:
Fearless Steps Challenge Phase-3 (FSC P3): Advancing SLT for Unseen Channel and Mission Data Across NASA Apollo Audio. 986-990
Phonation and Voicing
- Hannah Leykum:
Voice Quality in Verbal Irony: Electroglottographic Analyses of Ironic Utterances in Standard Austrian German. 991-995 - Mathilde Hutin, Yaru Wu, Adèle Jatteau, Ioana Vasilescu, Lori Lamel, Martine Adda-Decker:
Synchronic Fortition in Five Romance Languages? A Large Corpus-Based Study of Word-Initial Devoicing. 996-1000 - Ivan Kraljevski, Maria Paola Bissiri, Frank Duckhorn, Constanze Tschöpe, Matthias Wolff:
Glottal Stops in Upper Sorbian: A Data-Driven Approach. 1001-1005 - Bogdan Ludusan, Petra Wagner, Marcin Wlodarczak:
Cue Interaction in the Perception of Prosodic Prominence: The Role of Voice Quality. 1006-1010 - Jenifer Vega Rodríguez, Nathalie Vallée:
Glottal Sounds in Korebaju. 1011-1014 - Anaïs Chanclu, Imen Ben Amor, Cédric Gendrot, Emmanuel Ferragne, Jean-François Bonastre:
Automatic Classification of Phonation Types in Spontaneous Speech: Towards a New Workflow for the Characterization of Speakers' Voice Quality. 1015-1018
Health and Affect I
- Rob J. J. H. van Son:
Measuring Voice Quality Parameters After Speaker Pseudonymization. 1019-1023 - Lars Steinert, Felix Putze, Dennis Küster, Tanja Schultz:
Audio-Visual Recognition of Emotional Engagement of People with Dementia. 1024-1028 - Pascal Hecker, Florian B. Pokorny, Katrin D. Bartl-Pokorny, Uwe D. Reichel, Zhao Ren, Simone Hantke, Florian Eyben, Dagmar M. Schuller, Bert Arnrich, Björn W. Schuller:
Speaking Corona? Human and Machine Recognition of COVID-19 from Voice. 1029-1033 - Huyen Nguyen, Ralph Vente, David Lupea, Sarah Ita Levitan, Julia Hirschberg:
Acoustic-Prosodic, Lexical and Demographic Cues to Persuasiveness in Competitive Debate Speeches. 1034-1038
Robust Speaker Recognition
- Bengt J. Borgström:
Unsupervised Bayesian Adaptation of PLDA for Speaker Verification. 1039-1043 - Weiqing Wang, Danwei Cai, Jin Wang, Qingjian Lin, Xuyang Wang, Mi Hong, Ming Li:
The DKU-Duke-Lenovo System Description for the Fearless Steps Challenge Phase III. 1044-1048 - Yafeng Chen, Wu Guo, Bin Gu:
Improved Meta-Learning Training for Speaker Verification. 1049-1053 - Dan Wang, Yuanjie Dong, Yaxing Li, Yunfei Zi, Zhihui Zhang, Xiaoqi Li, Shengwu Xiong:
Variational Information Bottleneck Based Regularization for Speaker Recognition. 1054-1058 - Niko Brümmer, Luciana Ferrer, Albert Swart:
Out of a Hundred Trials, How Many Errors Does Your Speaker Verifier Make? 1059-1063 - Roza Chojnacka, Jason Pelecanos, Quan Wang, Ignacio López-Moreno:
SpeakerStew: Scaling to Many Languages with a Triaged Multilingual Text-Dependent and Text-Independent Speaker Verification System. 1064-1068 - Zhiming Wang, Furong Xu, Kaisheng Yao, Yuan Cheng, Tao Xiong, Huijia Zhu:
AntVoice Neural Speaker Embedding System for FFSVC 2020. 1069-1073 - Jianchen Li, Jiqing Han, Hongwei Song:
Gradient Regularization for Noise-Robust Speaker Verification. 1074-1078 - Saurabh Kataria, Jesús Villalba, Piotr Zelasko, Laureano Moro-Velázquez, Najim Dehak:
Deep Feature CycleGANs: Speaker Identity Preserving Non-Parallel Microphone-Telephone Domain Adaptation for Speaker Verification. 1079-1083 - Jie Pu, Yuguang Yang, Ruirui Li, Oguz Elibol, Jasha Droppo:
Scaling Effect of Self-Supervised Speech Models. 1084-1088 - Yibo Wu, Longbiao Wang, Kong Aik Lee, Meng Liu, Jianwu Dang:
Joint Feature Enhancement and Speaker Recognition with Multi-Objective Task-Oriented Network. 1089-1093 - Li Zhang, Qing Wang, Kong Aik Lee, Lei Xie, Haizhou Li:
Multi-Level Transfer Learning from Near-Field to Far-Field Speaker Verification. 1094-1098 - Jose Patino, Natalia A. Tomashenko, Massimiliano Todisco, Andreas Nautsch, Nicholas W. D. Evans:
Speaker Anonymisation Using the McAdams Coefficient. 1099-1103
Source Separation, Dereverberation and Echo Cancellation
- Yiyu Luo, Jing Wang, Liang Xu, Lidong Yang:
Multi-Stream Gated and Pyramidal Temporal Convolutional Neural Networks for Audio-Visual Speech Separation in Multi-Talker Environments. 1104-1108 - Helin Wang, Bo Wu, Lianwu Chen, Meng Yu, Jianwei Yu, Yong Xu, Shi-Xiong Zhang, Chao Weng, Dan Su, Dong Yu:
TeCANet: Temporal-Contextual Attention Network for Environment-Aware Speech Dereverberation. 1109-1113 - Jianjun Gu, Longbiao Cheng, Xingwei Sun, Junfeng Li, Yonghong Yan:
Residual Echo and Noise Cancellation with Feature Attention Module and Multi-Domain Loss Function. 1114-1118 - Xiyun Li, Yong Xu, Meng Yu, Shi-Xiong Zhang, Jiaming Xu, Bo Xu, Dong Yu:
MIMO Self-Attentive RNN Beamformer for Multi-Speaker Speech Separation. 1119-1123 - Ritwik Giri, Shrikant Venkataramani, Jean-Marc Valin, Umut Isik, Arvindh Krishnaswamy:
Personalized PercepNet: Real-Time, Low-Complexity Target Voice Separation and Enhancement. 1124-1128 - Yochai Yemini, Ethan Fetaya, Haggai Maron, Sharon Gannot:
Scene-Agnostic Multi-Microphone Speech Dereverberation. 1129-1133 - Keitaro Tanaka, Ryosuke Sawata, Shusuke Takahashi:
Manifold-Aware Deep Clustering: Maximizing Angles Between Embedding Vectors Based on Regular Simplex. 1134-1138 - Hao Zhang, DeLiang Wang:
A Deep Learning Approach to Multi-Channel and Multi-Microphone Acoustic Echo Cancellation. 1139-1143 - Yueyue Na, Ziteng Wang, Zhang Liu, Biao Tian, Qiang Fu:
Joint Online Multichannel Acoustic Echo Cancellation, Speech Dereverberation and Source Separation. 1144-1148 - Hiroshi Sato, Tsubasa Ochiai, Marc Delcroix, Keisuke Kinoshita, Takafumi Moriya, Naoyuki Kamo:
Should We Always Separate?: Switching Between Enhanced and Observed Signals for Overlapping Speech Recognition. 1149-1153
Speech Signal Analysis and Representation I
- Sathvik Udupa, Anwesha Roy, Abhayjeet Singh, Aravind Illa, Prasanta Kumar Ghosh:
Estimating Articulatory Movements in Speech Production with Transformer Networks. 1154-1158 - Dongchao Yang, Helin Wang, Yuexian Zou:
Unsupervised Multi-Target Domain Adaptation for Acoustic Scene Classification. 1159-1163 - Alfredo Esquivel Jaramillo, Jesper Kjær Nielsen, Mads Græsbøll Christensen:
Speech Decomposition Based on a Hybrid Speech Model and Optimal Segmentation. 1164-1168 - Jian Luo, Jianzong Wang, Ning Cheng, Jing Xiao:
Dropout Regularization for Self-Supervised Learning of Transformer Encoder Speech Representation. 1169-1173 - Chiranjeevi Yarra, Prasanta Kumar Ghosh:
Noise Robust Pitch Stylization Using Minimum Mean Absolute Error Criterion. 1174-1178 - Yu-Lin Huang, Bo-Hao Su, Y.-W. Peter Hong, Chi-Chun Lee:
An Attribute-Aligned Strategy for Learning Speech Representation. 1179-1183 - Abdolreza Sabzi Shahrebabaki, Sabato Marco Siniscalchi, Torbjørn Svendsen:
Raw Speech-to-Articulatory Inversion by Temporal Filtering and Decimation. 1184-1188 - Jason Lilley, H. Timothy Bunnell:
Unsupervised Training of a DNN-Based Formant Tracker. 1189-1193 - Shu-Wen Yang, Po-Han Chi, Yung-Sung Chuang, Cheng-I Jeff Lai, Kushal Lakhotia, Yist Y. Lin, Andy T. Liu, Jiatong Shi, Xuankai Chang, Guan-Ting Lin, Tzu-Hsien Huang, Wei-Cheng Tseng, Ko-tik Lee, Da-Rong Liu, Zili Huang, Shuyan Dong, Shang-Wen Li, Shinji Watanabe, Abdelrahman Mohamed, Hung-yi Lee:
SUPERB: Speech Processing Universal PERformance Benchmark. 1194-1198 - Cong Zhang, Jian Zhu:
Synchronising Speech Segments with Musical Beats in Mandarin and English Singing. 1199-1203 - Jacob Peplinski, Joel Shor, Sachin Joglekar, Jake Garrison, Shwetak N. Patel:
FRILL: A Non-Semantic Speech Embedding for Mobile Devices. 1204-1208 - Hiroki Mori:
Pitch Contour Separation from Overlapping Speech. 1209-1213 - Anurag Kumar, Yun Wang, Vamsi Krishna Ithapu, Christian Fuegen:
Do Sound Event Representations Generalize to Other Audio Tasks? A Case Study in Audio Transfer Learning. 1214-1218
Spoken Language Understanding I
- Baolin Peng, Chenguang Zhu, Michael Zeng, Jianfeng Gao:
Data Augmentation for Spoken Language Understanding via Pretrained Language Models. 1219-1223 - Martin Radfar, Athanasios Mouchtaris, Siegfried Kunzmann, Ariya Rastrow:
FANS: Fusing ASR and NLU for On-Device SLU. 1224-1228 - Yiran Cao, Nihal Potdar, Anderson R. Avila:
Sequential End-to-End Intent and Slot Label Classification and Localization. 1229-1233 - Deepak Muralidharan, Joel Ruben Antony Moniz, Weicheng Zhang, Stephen Pulman, Lin Li, Megan Barnes, Jingjing Pan, Jason D. Williams, Alex Acero:
DEXTER: Deep Encoding of External Knowledge for Named Entity Recognition in Virtual Assistants. 1234-1238 - Ting-Wei Wu, Ruolin Su, Biing-Hwang Juang:
A Context-Aware Hierarchical BERT Fusion Network for Multi-Turn Dialog Act Detection. 1239-1243 - Qian Chen, Wen Wang, Qinglin Zhang:
Pre-Training for Spoken Language Understanding with Joint Textual and Phonetic Representation Learning. 1244-1248 - Quynh Do, Judith Gaspers, Daniil Sorokin, Patrick Lehnen:
Predicting Temporal Performance Drop of Deployed Production Spoken Language Understanding Models. 1249-1253 - Jatin Ganhotra, Samuel Thomas, Hong-Kwang Jeff Kuo, Sachindra Joshi, George Saon, Zoltán Tüske, Brian Kingsbury:
Integrating Dialog History into End-to-End Spoken Language Understanding Systems. 1254-1258 - Ting Han, Chongxuan Huang, Wei Peng:
Coreference Augmentation for Multi-Domain Task-Oriented Dialogue State Tracking. 1259-1263 - Siddhant Arora, Alissa Ostapenko, Vijay Viswanathan, Siddharth Dalmia, Florian Metze, Shinji Watanabe, Alan W. Black:
Rethinking End-to-End Evaluation of Decomposable Tasks: A Case Study on Spoken Language Understanding. 1264-1268
Topics in ASR: Adaptation, Transfer Learning, Children’s Speech, and Low-Resource Settings
- Jianwei Sun, Zhiyuan Tang, Hengxin Yin, Wei Wang, Xi Zhao, Shuaijiang Zhao, Xiaoning Lei, Wei Zou, Xiangang Li:
Semantic Data Augmentation for End-to-End Mandarin Speech Recognition. 1269-1273 - Xun Gong, Yizhou Lu, Zhikai Zhou, Yanmin Qian:
Layer-Wise Fast Adaptation for End-to-End Multi-Accent Speech Recognition. 1274-1278 - Jinhan Wang, Yunzheng Zhu, Ruchao Fan, Wei Chu, Abeer Alwan:
Low Resource German ASR with Untranscribed Data Spoken by Non-Native Children - INTERSPEECH 2021 Shared Task SPAPL System. 1279-1283 - Khe Chai Sim, Angad Chandorkar, Fan Gao, Mason Chua, Tsendsuren Munkhdalai, Françoise Beaufays:
Robust Continuous On-Device Personalization for Automatic Speech Recognition. 1284-1288 - Shashi Kumar, Shakti P. Rath, Abhishek Pandey:
Speaker Normalization Using Joint Variational Autoencoder. 1289-1293 - Gaopeng Xu, Song Yang, Lu Ma, Chengfei Li, Zhongqin Wu:
The TAL System for the INTERSPEECH2021 Shared Task on Automatic Speech Recognition for Non-Native Childrens Speech. 1294-1298 - Tsz Kin Lam, Mayumi Ohta, Shigehiko Schamoni, Stefan Riezler:
On-the-Fly Aligned Data Augmentation for Sequence-to-Sequence ASR. 1299-1303 - Heting Gao, Junrui Ni, Yang Zhang, Kaizhi Qian, Shiyu Chang, Mark Hasegawa-Johnson:
Zero-Shot Cross-Lingual Phonetic Recognition with External Language Embedding. 1304-1308 - Yan Huang, Guoli Ye, Jinyu Li, Yifan Gong:
Rapid Speaker Adaptation for Conformer Transducer: Attention and Bias Are All You Need. 1309-1313 - Nilaksh Das, Sravan Bodapati, Monica Sunkara, Sundararajan Srinivasan, Duen Horng Chau:
Best of Both Worlds: Robust Accented Speech Recognition with Adversarial Transfer Learning. 1314-1318 - Wei Chu, Peng Chang, Jing Xiao:
Extending Pronunciation Dictionary with Automatically Detected Word Mispronunciations to Improve PAII's System for Interspeech 2021 Non-Native Child English Close Track ASR Challenge. 1319-1323
Voice Conversion and Adaptation I
- Tingle Li, Yichen Liu, Chenxu Hu, Hang Zhao:
CVC: Contrastive Learning for Non-Parallel Voice Conversion. 1324-1328 - Wen-Chin Huang, Kazuhiro Kobayashi, Yu-Huai Peng, Ching-Feng Liu, Yu Tsao, Hsin-Min Wang, Tomoki Toda:
A Preliminary Study of a Two-Stage Paradigm for Preserving Speaker Identity in Dysarthric Voice Conversion. 1329-1333 - Sefik Emre Eskimez, Dimitrios Dimitriadis, Ken'ichi Kumatani, Robert Gmyr:
One-Shot Voice Conversion with Speaker-Agnostic StarGAN. 1334-1338 - Takeshi Koshizuka, Hidefumi Ohmura, Kouichi Katsurada:
Fine-Tuning Pre-Trained Voice Conversion Model for Adding New Target Speakers with Limited Data. 1339-1343 - Disong Wang, Liqun Deng, Yu Ting Yeung, Xiao Chen, Xunying Liu, Helen Meng:
VQMIVC: Vector Quantization and Mutual Information-Based Unsupervised Speech Representation Disentanglement for One-Shot Voice Conversion. 1344-1348 - Yinghao Aaron Li, Ali Zare, Nima Mesgarani:
StarGANv2-VC: A Diverse, Unsupervised, Non-Parallel Framework for Natural-Sounding Voice Conversion. 1349-1353 - Neeraj Kumar, Srishti Goel, Ankur Narang, Brejesh Lall:
Normalization Driven Zero-Shot Multi-Speaker Speech Synthesis. 1354-1358 - Shoki Sakamoto, Akira Taniguchi, Tadahiro Taniguchi, Hirokazu Kameoka:
StarGAN-VC+ASR: StarGAN-Based Non-Parallel Voice Conversion Regularized by Automatic Speech Recognition. 1359-1363 - Xuexin Xu, Liang Shi, Jinhui Chen, Xunquan Chen, Jie Lian, Pingyuan Lin, Zhihong Zhang, Edwin R. Hancock:
Two-Pathway Style Embedding for Arbitrary Voice Conversion. 1364-1368 - Yufei Liu, Chengzhu Yu, Shuai Wang, Zhenchuan Yang, Yang Chao, Weibin Zhang:
Non-Parallel Any-to-Many Voice Conversion by Replacing Speaker Statistics. 1369-1373 - Yi Zhou, Xiaohai Tian, Zhizheng Wu, Haizhou Li:
Cross-Lingual Voice Conversion with a Cycle Consistency Loss on Linguistic Representation. 1374-1378 - Hongqiang Du, Lei Xie:
Improving Robustness of One-Shot Voice Conversion with Deep Discriminative Speaker Encoder. 1379-1383
Voice Quality Characterization for Clinical Voice Assessment: Voice Production, Acoustics, and Auditory Perception
- Hannah White, Joshua Penney, Andy Gibson, Anita Szakay, Felicity Cox:
Optimizing an Automatic Creaky Voice Detection Method for Australian English Speaking Females. 1384-1388 - Joshua Penney, Andy Gibson, Felicity Cox, Michael I. Proctor, Anita Szakay:
A Comparison of Acoustic Correlates of Voice Quality Across Different Recording Devices: A Cautionary Tale. 1389-1393 - Anna Sfakianaki, George P. Kafentzis:
Investigating Voice Function Characteristics of Greek Speakers with Hearing Loss Using Automatic Glottal Source Feature Extraction. 1394-1398 - Mark A. Huckvale, Catinca Buciuleac:
Automated Detection of Voice Disorder in the Saarbrücken Voice Database: Effects of Pathology Subset and Audio Materials. 1399-1403 - Steven M. Lulich, Rita R. Patel:
Accelerometer-Based Measurements of Voice Quality in Children During Semi-Occluded Vocal Tract Exercise with a Narrow Straw in Air. 1404-1408 - Matthew Perez, Amrit Romana, Angela Roberts, Noelle Carlozzi, Jennifer Ann Miner, Praveen Dayalu, Emily Mower Provost:
Articulatory Coordination for Speech Motor Tracking in Huntington Disease. 1409-1413 - Carlos A. Ferrer, Efren Aragón, María E. Hdez-Díaz, Marc S. De Bodt, Roman Cmejla, Marina Englert, Mara Behlau, Elmar Nöth:
Modeling Dysphonia Severity as a Function of Roughness and Breathiness Ratings in the GRBAS Scale. 1414-1418
Miscellaneous Topics in ASR
- Nikolay Karpov, Alexander Denisenko, Fedor Minkin:
Golos: Russian Dataset for Speech Research. 1419-1423 - Samik Sadhu, Hynek Hermansky:
Radically Old Way of Computing Spectra: Applications in End-to-End ASR. 1424-1428 - Ragheb Al-Ghezi, Yaroslav Getman, Aku Rouhe, Raili Hildén, Mikko Kurimo:
Self-Supervised End-to-End ASR for Low Resource L2 Swedish. 1429-1433 - Patrick K. O'Neill, Vitaly Lavrukhin, Somshubra Majumdar, Vahid Noroozi, Yuekai Zhang, Oleksii Kuchaiev, Jagadeesh Balam, Yuliya Dovzhenko, Keenan Freyberg, Michael D. Shulman, Boris Ginsburg, Shinji Watanabe, Georg Kucsko:
SPGISpeech: 5,000 Hours of Transcribed Financial Audio for Fully Formatted End-to-End Speech Recognition. 1434-1438 - Solène Evain, Ha Nguyen, Hang Le, Marcely Zanon Boito, Salima Mdhaffar, Sina Alisamir, Ziyi Tong, Natalia A. Tomashenko, Marco Dinarelli, Titouan Parcollet, Alexandre Allauzen, Yannick Estève, Benjamin Lecouteux, François Portet, Solange Rossato, Fabien Ringeval, Didier Schwab, Laurent Besacier:
LeBenchmark: A Reproducible Framework for Assessing Self-Supervised Representation Learning from Speech. 1439-1443
Phonetics I
- Pavel Sturm, Radek Skarnitzl, Tomás Nechanský:
Prosodic Accommodation in Face-to-Face and Telephone Dialogues. 1444-1448 - Josiane Riverin-Coutlée, Conceição Cunha, Enkeleida Kapia, Jonathan Harrington:
Dialect Features in Heterogeneous and Homogeneous Gheg Speaking Communities. 1449-1453 - Margaret Zellers, Alena Witzlack-Makarevich, Lilja Saeboe, Saudah Namyalo:
An Exploration of the Acoustic Space of Rhotics and Laterals in Ruruuli. 1454-1458 - Kübra Bodur, Sweeney Branje, Morgane Peirolo, Ingrid Tiscareno, James Sneed German:
Domain-Initial Strengthening in Turkish: Acoustic Cues to Prosodic Hierarchy in Stop Consonants. 1459-1463
Target Speaker Detection, Localization and Separation
- Katerina Zmolíková, Marc Delcroix, Desh Raj, Shinji Watanabe, Jan Cernocký:
Auxiliary Loss Function for Target Speech Extraction and Recognition with Weak Supervision Based on Speaker Characteristics. 1464-1468 - Marvin Borsdorf, Chenglin Xu, Haizhou Li, Tanja Schultz:
Universal Speaker Extraction in the Presence and Absence of Target Speakers for Speech of One and Two Talkers. 1469-1473 - Lukás Mateju, Frantisek Kynych, Petr Cerva, Jindrich Zdánský, Jirí Málek:
Using X-Vectors for Speech Activity Detection in Broadcast Streams. 1474-1478 - Daniele Salvati, Carlo Drioli, Gian Luca Foresti:
Time Delay Estimation for Speaker Localization Using CNN-Based Parametrized GCC-PHAT Features. 1479-1483 - Midia Yousefi, John H. L. Hansen:
Real-Time Speaker Counting in a Cocktail Party Scenario Using Attention-Guided Convolutional Neural Network. 1484-1488
Language and Accent Recognition
- Hexin Liu, Leibny Paola García-Perera, Xinyi Zhang, Justin Dauwels, Andy W. H. Khong, Sanjeev Khudanpur, Suzy J. Styles:
End-to-End Language Diarization for Bilingual Code-Switching Speech. 1489-1493 - Raphaël Duroselle, Md. Sahidullah, Denis Jouvet, Irina Illina:
Modeling and Training Strategies for Language Recognition Systems. 1494-1498 - Hui Wang, Lin Liu, Yan Song, Lei Fang, Ian McLoughlin, Li-Rong Dai:
A Weight Moving Average Based Alternate Decoupled Learning Algorithm for Long-Tailed Language Identification. 1499-1503 - Keqi Deng, Songjun Cao, Long Ma:
Improving Accent Identification and Accented Speech Recognition Under a Framework of Self-Supervised Learning. 1504-1508 - Zhiyun Fan, Meng Li, Shiyu Zhou, Bo Xu:
Exploring wav2vec 2.0 on Speaker Verification and Language Identification. 1509-1513 - Gundluru Ramesh, C. Shiva Kumar, K. Sri Rama Murty:
Self-Supervised Phonotactic Representations for Language Identification. 1514-1518 - Jicheng Zhang, Yizhou Peng, Van Tung Pham, Haihua Xu, Hao Huang, Eng Siong Chng:
E2E-Based Multi-Task Learning Approach to Joint Speech and Accent Recognition. 1519-1523 - Moakala Tzudir, Shikha Baghel, Priyankoo Sarmah, S. R. Mahadeva Prasanna:
Excitation Source Feature Based Dialect Identification in Ao - A Low Resource Language. 1524-1528
Low-Resource Speech Recognition
- Shreya Khare, Ashish R. Mittal, Anuj Diwan, Sunita Sarawagi, Preethi Jyothi, Samarth Bharadwaj:
Low Resource ASR: The Surprising Effectiveness of High Resource Transliteration. 1529-1533 - Siyuan Feng, Piotr Zelasko, Laureano Moro-Velázquez, Odette Scharenborg:
Unsupervised Acoustic Unit Discovery by Leveraging a Language-Independent Subword Discriminative Feature Representation. 1534-1538 - Herman Kamper, Benjamin van Niekerk:
Towards Unsupervised Phone and Word Segmentation Using Self-Supervised Vector-Quantized Neural Networks. 1539-1543 - Dongwei Jiang, Wubo Li, Miao Cao, Wei Zou, Xiangang Li:
Speech SimCLR: Combining Contrastive and Reconstruction Objective for Self-Supervised Speech Representation Learning. 1544-1548 - Christiaan Jacobs, Herman Kamper:
Multilingual Transfer of Acoustic Word Embeddings Improves When Training on Languages Related to the Target Zero-Resource Language. 1549-1553 - Benjamin van Niekerk, Leanne Nortje, Matthew Baas, Herman Kamper:
Analyzing Speaker Information in Self-Supervised Models to Improve Zero-Resource Speech Processing. 1554-1558 - Shun Takahashi, Sakriani Sakti, Satoshi Nakamura:
Unsupervised Neural-Based Graph Clustering for Variable-Length Speech Representation Discovery of Zero-Resource Languages. 1559-1563 - Takashi Maekaku, Xuankai Chang, Yuya Fujita, Li-Wei Chen, Shinji Watanabe, Alexander I. Rudnicky:
Speech Representation Learning Combining Conformer CPC with Deep Cluster for the ZeroSpeech Challenge 2021. 1564-1568 - Xia Cui, Amila Gamage, Terry Hanley, Tingting Mu:
Identifying Indicators of Vulnerability from Short Speech Segments Using Acoustic and Textual Features. 1569-1573 - Ewan Dunbar, Mathieu Bernard, Nicolas Hamilakis, Tu Anh Nguyen, Maureen de Seyssel, Patricia Rozé, Morgane Rivière, Eugene Kharitonov, Emmanuel Dupoux:
The Zero Resource Speech Challenge 2021: Spoken Language Modelling. 1574-1578 - Gautham Krishna Gudur, Satheesh Kumar Perepu:
Zero-Shot Federated Learning with New Classes for Audio Classification. 1579-1583 - Andrew Rouditchenko, Angie W. Boggust, David Harwath, Brian Chen, Dhiraj Joshi, Samuel Thomas, Kartik Audhkhasi, Hilde Kuehne, Rameswar Panda, Rogério Schmidt Feris, Brian Kingsbury, Michael Picheny, Antonio Torralba, James R. Glass:
AVLnet: Learning Audio-Visual Language Representations from Instructional Videos. 1584-1588
Speech Synthesis: Singing, Multimodal, Crosslingual Synthesis
- Gyeong-Hoon Lee, Tae-Woo Kim, Hanbin Bae, Min-Ji Lee, Young-Ik Kim, Hoon-Young Cho:
N-Singer: A Non-Autoregressive Korean Singing Voice Synthesis System for Pronunciation Enhancement. 1589-1593 - Georgia Maniati, Nikolaos Ellinas, Konstantinos Markopoulos, Georgios Vamvoukakis, June Sig Sung, Hyoungmin Park, Aimilios Chalamandaris, Pirros Tsiakoulis:
Cross-Lingual Low Resource Speaker Adaptation Using Phonological Features. 1594-1598 - Haoyue Zhan, Haitong Zhang, Wenjie Ou, Yue Lin:
Improve Cross-Lingual Text-To-Speech Synthesis on Monolingual Corpora with Pitch Contour Information. 1599-1603 - Zhenchuan Yang, Weibin Zhang, Yufei Liu, Xiaofen Xing:
Cross-Lingual Voice Conversion with Disentangled Universal Linguistic Representations. 1604-1608 - Zhengchen Liu, Chenfeng Miao, Qingying Zhu, Minchuan Chen, Jun Ma, Shaojun Wang, Jing Xiao:
EfficientSing: A Chinese Singing Voice Synthesis System Using Duration-Free Acoustic Model and HiFi-GAN Vocoder. 1609-1613 - Detai Xin, Yuki Saito, Shinnosuke Takamichi, Tomoki Koriyama, Hiroshi Saruwatari:
Cross-Lingual Speaker Adaptation Using Domain Adaptation and Speaker Consistency Loss for Text-To-Speech Synthesis. 1614-1618 - Zengqiang Shang, Zhihua Huang, Haozhe Zhang, Pengyuan Zhang, Yonghong Yan:
Incorporating Cross-Speaker Style Transfer for Multi-Language Text-to-Speech. 1619-1623 - Ege Kesim, Engin Erzin:
Investigating Contributions of Speech and Facial Landmarks for Talking Head Generation. 1624-1628 - Shijing Si, Jianzong Wang, Xiaoyang Qu, Ning Cheng, Wenqi Wei, Xinghua Zhu, Jing Xiao:
Speech2Video: Cross-Modal Distillation for Speech to Video Generation. 1629-1633
Speech Coding and Privacy
- Junhyeok Lee, Seungu Han:
NU-Wave: A Diffusion Probabilistic Model for Neural Audio Upsampling. 1634-1638 - Gang-Xuan Lin, Shih-Wei Hu, Yen-Ju Lu, Yu Tsao, Chun-Shien Lu:
QISTA-Net-Audio: Audio Super-Resolution via Non-Convex ℓ_q-Norm Minimization. 1639-1643 - Liang Wen, Lizhong Wang, Xue Wen, Yuxing Zheng, Youngo Park, Kwang Pyo Choi:
X-net: A Joint Scale Down and Scale Up Method for Voice Call. 1644-1648 - Kexun Zhang, Yi Ren, Changliang Xu, Zhou Zhao:
WSRGlow: A Glow-Based Waveform Generative Model for Audio Super-Resolution. 1649-1653 - Jiangyan Yi, Ye Bai, Jianhua Tao, Haoxin Ma, Zhengkun Tian, Chenglong Wang, Tao Wang, Ruibo Fu:
Half-Truth: A Partially Fake Audio Detection Dataset. 1654-1658 - Bhusan Chettri, Rosa González Hautamäki, Md. Sahidullah, Tomi Kinnunen:
Data Quality as Predictor of Voice Anti-Spoofing Generalization. 1659-1663 - Youngju Cheon, Soojoong Hwang, Sangwook Han, Inseon Jang, Jong Won Shin:
Coded Speech Enhancement Using Neural Network-Based Vector-Quantized Residual Features. 1664-1668 - Lukas Drude, Jahn Heymann, Andreas Schwarz, Jean-Marc Valin:
Multi-Channel Opus Compression for Far-Field Automatic Speech Recognition with a Fixed Bitrate Budget. 1669-1673 - Ingo Siegert:
Effects of Prosodic Variations on Accidental Triggers of a Commercial Voice Assistant. 1674-1678 - Adam Gabrys, Yunlong Jiao, Viacheslav Klimkov, Daniel Korzekwa, Roberto Barra-Chicote:
Improving the Expressiveness of Neural Vocoding with Non-Affine Normalizing Flows. 1679-1683 - Gauri P. Prajapati, Dipesh K. Singh, Preet P. Amin, Hemant A. Patil:
Voice Privacy Through x-Vector and CycleGAN-Based Anonymization. 1684-1688 - Ju Lin, Yun Wang, Kaustubh Kalgaonkar, Gil Keren, Didi Zhang, Christian Fuegen:
A Two-Stage Approach to Speech Bandwidth Extension. 1689-1693 - Joon Byun, Seungmin Shin, Youngcheol Park, Jongmo Sung, Seungkwon Beack:
Development of a Psychoacoustic Loss Function for the Deep Neural Network (DNN)-Based Speech Coder. 1694-1698 - Dimitrios Stoidis, Andrea Cavallaro:
Protecting Gender and Identity with Disentangled Speech Representations. 1699-1703
Speech Perception II
- Yahya Aldholmi, Rawan Aldhafyan, Asma Alqahtani:
Perception of Standard Arabic Synthetic Speech Rate. 1704-1707 - Takeshi Kishiyama:
The Influence of Parallel Processing on Illusory Vowels. 1708-1712 - Anupama Chingacham, Vera Demberg, Dietrich Klakow:
Exploring the Potential of Lexical Paraphrases for Mitigating Noise-Induced Comprehension Errors. 1713-1717 - Olympia Simantiraki, Martin Cooke:
SpeechAdjuster: A Tool for Investigating Listener Preferences and Speech Intelligibility. 1718-1722 - Susumu Saito, Yuta Ide, Teppei Nakano, Tetsuji Ogawa:
VocalTurk: Exploring Feasibility of Crowdsourced Speaker Identification. 1723-1727 - Min Xu, Jing Shao, Lan Wang:
Effects of Aging and Age-Related Hearing Loss on Talker Discrimination. 1728-1732 - Yuqing Zhang, Zhu Li, Bin Wu, Yanlu Xie, Binghuai Lin, Jinsong Zhang:
Relationships Between Perceptual Distinctiveness, Articulatory Complexity and Functional Load in Speech Communication. 1733-1737 - Camryn Terblanche, Philip Harrison, Amelia Jane Gully:
Human Spoofing Detection Performance on Degraded Speech. 1738-1742 - Marieke Einfeldt, Rita Sevastjanova, Katharina Zahner-Ritter, Ekaterina Kazak, Bettina Braun:
Reliable Estimates of Interpretable Cue Effects with Active Learning in Psycholinguistic Research. 1743-1747 - Puneet Kumar, Vishesh Kaushik, Balasubramanian Raman:
Towards the Explainability of Multimodal Speech Emotion Recognition. 1748-1752 - Biao Zeng, Rui Wang, Guoxing Yu, Christian Dobel:
Primacy of Mouth over Eyes: Eye Movement Evidence from Audiovisual Mandarin Lexical Tones and Vowels. 1753-1756 - Takanori Ashihara, Takafumi Moriya, Makio Kashino:
Investigating the Impact of Spectral and Temporal Degradation on End-to-End Automatic Speech Recognition Performance. 1757-1761
Streaming for ASR/RNN Transducers
- Thai-Son Nguyen, Sebastian Stüker, Alex Waibel:
Super-Human Performance in Online Low-Latency Recognition of Conversational Speech. 1762-1766 - Vikas Joshi, Amit Das, Eric Sun, Rupesh R. Mehta, Jinyu Li, Yifan Gong:
Multiple Softmax Architecture for Streaming Multilingual End-to-End ASR Systems. 1767-1771 - Duc Le, Mahaveer Jain, Gil Keren, Suyoun Kim, Yangyang Shi, Jay Mahadeokar, Julian Chan, Yuan Shangguan, Christian Fuegen, Ozlem Kalinli, Yatharth Saraf, Michael L. Seltzer:
Contextualized Streaming End-to-End Speech Recognition with Trie-Based Deep Biasing and Shallow Fusion. 1772-1776 - Tara N. Sainath, Yanzhang He, Arun Narayanan, Rami Botros, Ruoming Pang, David Rybach, Cyril Allauzen, Ehsan Variani, James Qin, Quoc-Nam Le-The, Shuo-Yiin Chang, Bo Li, Anmol Gulati, Jiahui Yu, Chung-Cheng Chiu, Diamantino Caseiro, Wei Li, Qiao Liang, Pat Rondon:
An Efficient Streaming Non-Recurrent On-Device End-to-End Model with Improvements to Rare-Word Modeling. 1777-1781 - Liang Lu, Naoyuki Kanda, Jinyu Li, Yifan Gong:
Streaming Multi-Talker Speech Recognition with Joint Speaker Identification. 1782-1786 - Takafumi Moriya, Tomohiro Tanaka, Takanori Ashihara, Tsubasa Ochiai, Hiroshi Sato, Atsushi Ando, Ryo Masumura, Marc Delcroix, Taichi Asami:
Streaming End-to-End Speech Recognition for Hybrid RNN-T/Attention Architecture. 1787-1791 - Andreas Schwarz, Ilya Sklyar, Simon Wiesler:
Improving RNN-T ASR Accuracy Using Context Audio. 1792-1796 - Lu Huang, Jingyu Sun, Yufeng Tang, Junfeng Hou, Jinkun Chen, Jun Zhang, Zejun Ma:
HMM-Free Encoder Pre-Training for Streaming RNN Transducer. 1797-1801 - Xiaodong Cui, Brian Kingsbury, George Saon, David Haws, Zoltán Tüske:
Reducing Exposure Bias in Training Recurrent Neural Network Transducers. 1802-1806 - Thibault Doutre, Wei Han, Chung-Cheng Chiu, Ruoming Pang, Olivier Siohan, Liangliang Cao:
Bridging the Gap Between Streaming and Non-Streaming ASR Systems by Distilling Ensembles of CTC and RNN-T Models. 1807-1811 - Kartik Audhkhasi, Tongzhou Chen, Bhuvana Ramabhadran, Pedro J. Moreno:
Mixture Model Attention: Flexible Streaming and Non-Streaming Automatic Speech Recognition. 1812-1816 - Hirofumi Inaguma, Tatsuya Kawahara:
StableEmit: Selection Probability Discount for Reducing Emission Latency of Streaming Monotonic Attention ASR. 1817-1821 - Niko Moritz, Takaaki Hori, Jonathan Le Roux:
Dual Causal/Non-Causal Self-Attention for Streaming End-to-End Speech Recognition. 1822-1826 - Kwangyoun Kim, Felix Wu, Prashant Sridhar, Kyu Jeong Han, Shinji Watanabe:
Multi-Mode Transformer Transducer with Stochastic Future Context. 1827-1831
ConferencingSpeech 2021 Challenge: Far-Field Multi-Channel Speech Enhancement for Video Conferencing
- Xinlei Ren, Xu Zhang, Lianwu Chen, Xiguang Zheng, Chen Zhang, Liang Guo, Bing Yu:
A Causal U-Net Based Neural Beamforming Network for Real-Time Multi-Channel Speech Enhancement. 1832-1836 - Rui Zhu, Feiran Yang, Yuepeng Li, Shidong Shang:
A Partitioned-Block Frequency-Domain Adaptive Kalman Filter for Stereophonic Acoustic Echo Cancellation. 1837-1841 - Taihui Wang, Feiran Yang, Rui Zhu, Jun Yang:
Real-Time Independent Vector Analysis Using Semi-Supervised Nonnegative Matrix Factorization as a Source Model. 1842-1846 - Jiangyu Han, Wei Rao, Yannan Wang, Yanhua Long:
Improving Channel Decorrelation for Multi-Channel Target Speech Extraction. 1847-1851 - Jinjiang Liu, Xueliang Zhang:
Inplace Gated Convolutional Recurrent Neural Network for Dual-Channel Speech Enhancement. 1852-1856 - R. G. Prithvi Raj, Rohit Kumar, M. K. Jayesh, Anurenjan Purushothaman, Sriram Ganapathy, M. Ali Basha Shaik:
SRIB-LEAP Submission to Far-Field Multi-Channel Speech Enhancement Challenge for Video Conferencing. 1857-1861 - Cheng Xue, Weilong Huang, Weiguang Chen, Jinwei Feng:
Real-Time Multi-Channel Speech Enhancement Based on Neural Network Masking with Attention Model. 1862-1866
Survey Talk 2: Sriram Ganapathy
- Sriram Ganapathy:
Uncovering the Acoustic Cues of COVID-19 Infection.
Keynote 2: Pascale Fung
- Pascale Fung:
Ethical and Technological Challenges of Conversational AI.
Language Modeling and Text-Based Innovations for ASR
- Dominique Fohr, Irina Illina:
BERT-Based Semantic Model for Rescoring N-Best Speech Recognition List. 1867-1871 - Karel Benes, Lukás Burget:
Text Augmentation for Language Models in High Error Recognition Scenario. 1872-1876 - Yingbo Gao, David Thulke, Alexander Gerstenberger, Khoa Viet Tran, Ralf Schlüter, Hermann Ney:
On Sampling-Based Training Criteria for Neural Language Modeling. 1877-1881 - Janne Pylkkönen, Antti Ukkonen, Juho Kilpikoski, Samu Tamminen, Hannes Heikinheimo:
Fast Text-Only Domain Adaptation of RNN-Transducer Prediction Network. 1882-1886
Speaker, Language, and Privacy
- Christopher Cieri, James Fiumara, Jonathan Wright:
Using Games to Augment Corpora for Language Recognition and Confusability. 1887-1891 - Gianni Fenu, Mirko Marras, Giacomo Medda, Giacomo Meloni:
Fair Voice Biometrics: Impact of Demographic Imbalance on Group Fairness in Speaker Recognition. 1892-1896 - Leying Zhang, Zhengyang Chen, Yanmin Qian:
Knowledge Distillation from Multi-Modality to Single-Modality for Person Verification. 1897-1901 - Paul-Gauthier Noé, Mohammad MohammadAmini, Driss Matrouf, Titouan Parcollet, Andreas Nautsch, Jean-François Bonastre:
Adversarial Disentanglement of Speaker Representation for Attribute-Driven Privacy Preservation. 1902-1906
Assessment of Pathological Speech and Language I
- Amrit Romana, John Bandon, Matthew Perez, Stephanie Gutierrez, Richard Richter, Angela Roberts, Emily Mower Provost:
Automatically Detecting Errors and Disfluencies in Read Speech to Predict Cognitive Impairment in People with Parkinson's Disease. 1907-1911 - Robin Vaysse, Jérôme Farinas, Corine Astésano, Régine André-Obrecht:
Automatic Extraction of Speech Rhythm Descriptors for Speech Intelligibility Assessment in the Context of Head and Neck Cancers. 1912-1916 - Jinzi Qi, Hugo Van hamme:
Speech Disorder Classification Using Extended Factorized Hierarchical Variational Auto-Encoders. 1917-1921 - Vikram C. Mathad, Tristan J. Mahr, Nancy Scherer, Kathy Chapman, Katherine C. Hustad, Julie Liss, Visar Berisha:
The Impact of Forced-Alignment Errors on Automatic Pronunciation Evaluation. 1922-1926 - Esaú Villatoro-Tello, S. Pavankumar Dubagunta, Julian Fritsch, Gabriela Ramírez-de-la-Rosa, Petr Motlícek, Mathew Magimai-Doss:
Late Fusion of the Available Lexicon and Raw Waveform-Based Acoustic Modeling for Depression and Dementia Recognition. 1927-1931 - Amin Honarmandi Shandiz, László Tóth, Gábor Gosztolya, Alexandra Markó, Tamás Gábor Csapó:
Neural Speaker Embeddings for Ultrasound-Based Silent Speech Interfaces. 1932-1936
Communication and Interaction, Multimodality
- Jatin Lamba, Abhishek, Jayaprakash Akula, Rishabh Dabral, Preethi Jyothi, Ganesh Ramakrishnan:
Cross-Modal Learning for Audio-Visual Video Parsing. 1937-1941 - Darren Cook, Miri Zilka, Simon Maskell, Laurence Alison:
A Psychology-Driven Computational Analysis of Political Interviews. 1942-1946 - Jennifer Santoso, Takeshi Yamada, Shoji Makino, Kenkichi Ishizuka, Takekatsu Hiramura:
Speech Emotion Recognition Based on Attention Weight Correction Using Word-Level Confidence Measure. 1947-1951 - Alif Silpachai, Ivana Rehman, Taylor Anne Barriuso, John Levis, Evgeny Chukharev-Hudilainen, Guanlong Zhao, Ricardo Gutierrez-Osuna:
Effects of Voice Type and Task on L2 Learners' Awareness of Pronunciation Errors. 1952-1956 - Alla Menshikova, Daniil Kocharov, Tatiana Kachkovskaia:
Lexical Entrainment and Intra-Speaker Variability in Cooperative Dialogues. 1957-1961 - Shamila Nasreen, Julian Hough, Matthew Purver:
Detecting Alzheimer's Disease Using Interactional and Acoustic Features from Spontaneous Speech. 1962-1966 - Hardik Kothare, Vikram Ramanarayanan, Oliver Roesler, Michael Neumann, Jackson Liscombe, William Burke, Andrew Cornish, Doug Habberstad, Alaa Sakallah, Sara Markuson, Seemran Kansara, Afik Faerman, Yasmine Bensidi-Slimane, Laura Fry, Saige Portera, David Suendermann-Oeft, David Pautler, Carly Demopoulos:
Investigating the Interplay Between Affective, Phonatory and Motoric Subsystems in Autism Spectrum Disorder Using a Multimodal Dialogue Agent. 1967-1971 - Carlos Toshinori Ishi, Taiken Shintani:
Analysis of Eye Gaze Reasons and Gaze Aversions During Three-Party Conversations. 1972-1976
Language and Lexical Modeling for ASR
- Suyoun Kim, Abhinav Arora, Duc Le, Ching-Feng Yeh, Christian Fuegen, Ozlem Kalinli, Michael L. Seltzer:
Semantic Distance: A New Metric for ASR Performance Analysis Towards Spoken Language Understanding. 1977-1981 - Xiaoqiang Wang, Yanqing Liu, Sheng Zhao, Jinyu Li:
A Light-Weight Contextual Spelling Correction Model for Customizing Transducer-Based Speech Recognition Systems. 1982-1986 - Ning Shi, Wei Wang, Boxin Wang, Jinfeng Li, Xiangyu Liu, Zhouhan Lin:
Incorporating External POS Tagger for Punctuation Restoration. 1987-1991 - Vasileios Papadourakis, Markus Müller, Jing Liu, Athanasios Mouchtaris, Maurizio Omologo:
Phonetically Induced Subwords for End-to-End Speech Recognition. 1992-1996 - Courtney Mansfield, Sara Ng, Gina-Anne Levow, Richard A. Wright, Mari Ostendorf:
Revisiting Parity of Human vs. Machine Conversational Speech Transcription. 1997-2001 - W. Ronny Huang, Tara N. Sainath, Cal Peyser, Shankar Kumar, David Rybach, Trevor Strohman:
Lookup-Table Recurrent Language Models for Long Tail Speech Recognition. 2002-2006 - Jesús Andrés-Ferrer, Dario Albesano, Puming Zhan, Paul Vozila:
Contextual Density Ratio for Language Model Biasing of Sequence to Sequence ASR Systems. 2007-2011 - Qiushi Huang, Tom Ko, H. Lilian Tang, Xubo Liu, Bo Wu:
Token-Level Supervised Contrastive Learning for Punctuation Restoration. 2012-2016 - Yun Zhao, Xuerui Yang, Jinchao Wang, Yongyu Gao, Chao Yan, Yuanfu Zhou:
BART Based Semantic Correction for Mandarin Automatic Speech Recognition System. 2017-2021 - Lingfeng Dai, Qi Liu, Kai Yu:
Class-Based Neural Network Language Model for Second-Pass Rescoring in ASR. 2022-2026 - Gakuto Kurata, George Saon, Brian Kingsbury, David Haws, Zoltán Tüske:
Improving Customization of Neural Transducers by Mitigating Acoustic Mismatch of Synthesized Audio. 2027-2031 - Mandana Saebi, Ernest Pusateri, Aaksha Meghawat, Christophe Van Gysel:
A Discriminative Entity-Aware Language Model for Virtual Assistants. 2032-2036 - Mahdi Namazifar, John Malik, Li Erran Li, Gökhan Tür, Dilek Hakkani-Tür:
Correcting Automated and Manual Speech Transcription Errors Using Warped Language Models. 2037-2041
Novel Neural Network Architectures for ASR
- Yangyang Shi, Varun Nagaraja, Chunyang Wu, Jay Mahadeokar, Duc Le, Rohit Prabhavalkar, Alex Xiao, Ching-Feng Yeh, Julian Chan, Christian Fuegen, Ozlem Kalinli, Michael L. Seltzer:
Dynamic Encoder Transducer: A Flexible Solution for Trading Off Accuracy for Latency. 2042-2046 - Shiqi Zhang, Yan Liu, Deyi Xiong, Pei Zhang, Boxing Chen:
Domain-Aware Self-Attention for Multi-Domain Neural Machine Translation. 2047-2051 - Albert Zeyer, André Merboldt, Wilfried Michel, Ralf Schlüter, Hermann Ney:
Librispeech Transducer Model with Internal Language Model Prior Correction. 2052-2056 - Sepand Mavandadi, Tara N. Sainath, Ke Hu, Zelin Wu:
A Deliberation-Based Joint Acoustic and Text Decoder. 2057-2061 - Zoltán Tüske, George Saon, Brian Kingsbury:
On the Limit of English Conversational Speech Recognition. 2062-2066 - Keyu An, Yi Zhang, Zhijian Ou:
Deformable TDNN with Adaptive Receptive Fields for Speech Recognition. 2067-2071 - Zhao You, Shulin Feng, Dan Su, Dong Yu:
SpeechMoE: Scaling to Large Acoustic Models with Dynamic Routing Mixture of Experts. 2077-2081 - Chi-Hang Leong, Yu-Han Huang, Jen-Tzung Chien:
Online Compressive Transformer for End-to-End Speech Recognition. 2082-2086 - Binghuai Lin, Liyuan Wang:
End to End Transformer-Based Contextual Speech Recognition Based on Pointer Network. 2087-2091 - Shigeki Karita, Yotaro Kubo, Michiel Adriaan Unico Bacchiani, Llion Jones:
A Comparative Study on Neural Architectures and Training Methods for Japanese Speech Recognition. 2092-2096 - Takaaki Hori, Niko Moritz, Chiori Hori, Jonathan Le Roux:
Advanced Long-Context End-to-End Speech Recognition Using Context-Expanded Transformers. 2097-2101 - Md. Akmal Haidar, Chao Xing, Mehdi Rezagholizadeh:
Transformer-Based ASR Incorporating Time-Reduction Layer and Fine-Tuning with Self-Knowledge Distillation. 2102-2106 - Jay Mahadeokar, Yangyang Shi, Yuan Shangguan, Chunyang Wu, Alex Xiao, Hang Su, Duc Le, Ozlem Kalinli, Christian Fuegen, Michael L. Seltzer:
Flexi-Transducer: Optimizing Latency, Accuracy and Compute for Multi-Domain On-Device Scenarios. 2107-2111
Speech Localization, Enhancement, and Quality Assessment
- Przemyslaw Falkowski-Gilski:
Difference in Perceived Speech Signal Quality Assessment Among Monolingual and Bilingual Teenage Students. 2112-2116 - Christopher Schymura, Benedikt T. Bönninghoff, Tsubasa Ochiai, Marc Delcroix, Keisuke Kinoshita, Tomohiro Nakatani, Shoko Araki, Dorothea Kolossa:
PILOT: Introducing Transformers for Probabilistic Sound Event Localization. 2117-2121 - Masahito Togami, Robin Scheibler:
Sound Source Localization with Majorization Minimization. 2122-2126 - Gabriel Mittag, Babak Naderi, Assmaa Chehadi, Sebastian Möller:
NISQA: A Deep CNN-Self-Attention Model for Multidimensional Speech Quality Prediction with Crowdsourced Datasets. 2127-2131 - Babak Naderi, Ross Cutler:
Subjective Evaluation of Noise Suppression Algorithms in Crowdsourcing. 2132-2136 - Jianhua Geng, Sifan Wang, Juan Li, Jingwei Li, Xin Lou:
Reliable Intensity Vector Selection for Multi-Source Direction-of-Arrival Estimation Using a Single Acoustic Vector Sensor. 2137-2141 - Meng Yu, Chunlei Zhang, Yong Xu, Shi-Xiong Zhang, Dong Yu:
MetricNet: Towards Improved Modeling For Non-Intrusive Speech Quality Assessment. 2142-2146 - Andrea Toma, Daniele Salvati, Carlo Drioli, Gian Luca Foresti:
CNN-Based Processing of Acoustic and Radio Frequency Signals for Speaker Localization from MAVs. 2147-2151 - Katsutoshi Itoyama, Yoshiya Morimoto, Shungo Masaki, Ryosuke Kojima, Kenji Nishida, Kazuhiro Nakadai:
Assessment of von Mises-Bernoulli Deep Neural Network in Sound Source Localization. 2152-2156 - Rongliang Liu, Nengheng Zheng, Xi Chen:
Feature Fusion by Attention Networks for Robust DOA Estimation. 2157-2161 - Shoufeng Lin, Zhaojie Luo:
Far-Field Speaker Localization and Adaptive GLMB Tracking. 2162-2166 - Vivek Sivaraman Narayanaswamy, Jayaraman J. Thiagarajan, Andreas Spanias:
On the Design of Deep Priors for Unsupervised Audio Restoration. 2167-2171 - Weiguang Chen, Cheng Xue, Xionghu Zhong:
Cramér-Rao Lower Bound for DOA Estimation with an Array of Directional Microphones in Reverberant Environments. 2172-2176
Speech Synthesis: Neural Waveform Generation
- Jaeseong You, Dalhyun Kim, Gyuhyeon Nam, Geumbyeol Hwang, Gyeongsu Chae:
GAN Vocoder: Multi-Resolution Discriminator Is All You Need. 2177-2181 - Jian Cong, Shan Yang, Lei Xie, Dan Su:
Glow-WaveGAN: Learning Speech Representations from GAN-Based Variational Auto-Encoder for High Fidelity Flow-Based Speech Synthesis. 2182-2186 - Reo Yoneyama, Yi-Chiao Wu, Tomoki Toda:
Unified Source-Filter GAN: Unified Source-Filter Network Based On Factorization of Quasi-Periodic Parallel WaveGAN. 2187-2191 - Kazuki Mizuta, Tomoki Koriyama, Hiroshi Saruwatari:
Harmonic WaveGAN: GAN-Based Speech Waveform Generation Model with Harmonic Structure Discriminator. 2192-2196 - Ji-Hoon Kim, Sang-Hoon Lee, Ji-Hyun Lee, Seong-Whan Lee:
Fre-GAN: Adversarial Frequency-Consistent Audio Synthesis. 2197-2201 - Jinhyeok Yang, Jae-Sung Bae, Taejun Bak, Young-Ik Kim, Hoon-Young Cho:
GANSpeech: Adversarial Training for High-Fidelity Multi-Speaker Speech Synthesis. 2202-2206 - Won Jang, Dan Lim, Jaesam Yoon, Bongwan Kim, Juntae Kim:
UnivNet: A Neural Vocoder with Multi-Resolution Spectrogram Discriminators for High-Fidelity Waveform Generation. 2207-2211 - Mohammed Salah Al-Radhi, Tamás Gábor Csapó, Csaba Zainkó, Géza Németh:
Continuous Wavelet Vocoder-Based Decomposition of Parametric Speech Waveform Synthesis. 2212-2216 - Patrick Lumban Tobing, Tomoki Toda:
High-Fidelity and Low-Latency Universal Neural Vocoder Based on Multiband WaveRNN with Data-Driven Linear Prediction for Discrete Waveform Modeling. 2217-2221 - Zhengxi Liu, Yanmin Qian:
Basis-MelGAN: Efficient Neural Vocoder Based on Audio Decomposition. 2222-2226 - Min-Jae Hwang, Ryuichi Yamamoto, Eunwoo Song, Jae-Min Kim:
High-Fidelity Parallel WaveGAN with Multi-Band Harmonic-Plus-Noise Model. 2227-2231
Spoken Machine Translation
- Junkun Chen, Mingbo Ma, Renjie Zheng, Liang Huang:
SpecRec: An Alternative Solution for Improving End-to-End Speech-to-Text Translation via Spectrogram Reconstruction. 2232-2236 - Colin Cherry, Naveen Arivazhagan, Dirk Padfield, Maxim Krikun:
Subtitle Translation as Markup Translation. 2237-2241 - Changhan Wang, Anne Wu, Juan Pino, Alexei Baevski, Michael Auli, Alexis Conneau:
Large-Scale Self- and Semi-Supervised Learning for Speech Translation. 2242-2246 - Changhan Wang, Anne Wu, Jiatao Gu, Juan Pino:
CoVoST 2 and Massively Multilingual Speech Translation. 2247-2251 - Yao-Fei Cheng, Hung-Shin Lee, Hsin-Min Wang:
AlloST: Low-Resource Speech Translation Without Source Transcription. 2252-2256 - Johanes Effendi, Sakriani Sakti, Satoshi Nakamura:
Weakly-Supervised Speech-to-Text Mapping with Visually Connected Non-Parallel Speech-Text Data Using Cyclic Partially-Aligned Transformer. 2257-2261 - Hirotaka Tokuyama, Sakriani Sakti, Katsuhito Sudoh, Satoshi Nakamura:
Transcribing Paralinguistic Acoustic Cues to Target Language Text in Transformer-Based Speech-to-Text Translation. 2262-2266 - Rong Ye, Mingxuan Wang, Lei Li:
End-to-End Speech Translation via Cross-Modal Progressive Training. 2267-2271 - Yuka Ko, Katsuhito Sudoh, Sakriani Sakti, Satoshi Nakamura:
ASR Posterior-Based Loss for Multi-Task End-to-End Speech Translation. 2272-2276 - Alejandro Pérez González de Martos, Javier Iranzo-Sánchez, Adrià Giménez-Pastor, Javier Jorge, Joan Albert Silvestre-Cerdà, Jorge Civera, Albert Sanchís, Alfons Juan:
Towards Simultaneous Machine Interpretation. 2277-2281 - Giuseppe Martucci, Mauro Cettolo, Matteo Negri, Marco Turchi:
Lexical Modeling of ASR Errors for Robust Speech Translation. 2282-2286 - Piyush Vyas, Anastasia Kuznetsova, Donald S. Williamson:
Optimally Encoding Inductive Biases into the Transformer Improves End-to-End Speech Translation. 2287-2291 - Tejaswini Ananthanarayana, Lipisha Chaudhary, Ifeoma Nwogu:
Effects of Feature Scaling and Fusion on Sign Language Translation. 2292-2296
SdSV Challenge 2021: Analysis and Exploration of New Ideas on Short-Duration Speaker Verification
- Alexander Alenin, Anton Okhotnikov, Rostislav Makarov, Nikita Torgashov, Ilya Shigabeev, Konstantin Simonchik:
The ID R&D System Description for Short-Duration Speaker Verification Challenge 2021. 2297-2301 - Jenthe Thienpondt, Brecht Desplanques, Kris Demuynck:
Integrating Frequency Translational Invariance in TDNNs and Frequency Positional Information in 2D ResNets to Enhance Speaker Verification. 2302-2306 - Aleksei Gusev, Alisa Vinogradova, Sergey Novoselov, Sergei Astapov:
SdSVC Challenge 2021: Tips and Tricks to Boost the Short-Duration Speaker Verification System Performance. 2307-2311 - Woo Hyun Kang, Nam Soo Kim:
Team02 Text-Independent Speaker Verification System for SdSV Challenge 2021. 2312-2316 - Xiaoyi Qin, Chao Wang, Yong Ma, Min Liu, Shilei Zhang, Ming Li:
Our Learned Lessons from Cross-Lingual Speaker Verification: The CRMI-DKU System Description for the Short-Duration Speaker Verification Challenge 2021. 2317-2321 - Peng Zhang, Peng Hu, Xueliang Zhang:
Investigation of IMU&Elevoc Submission for the Short-Duration Speaker Verification Challenge 2021. 2322-2326 - Jie Yan, Shengyu Yao, Yiqian Pan, Wei Chen:
The Sogou System for Short-Duration Speaker Verification Challenge 2021. 2327-2331 - Bing Han, Zhengyang Chen, Zhikai Zhou, Yanmin Qian:
The SJTU System for Short-Duration Speaker Verification Challenge 2021. 2332-2336
Show and Tell 2
- Sungjae Cho, Soo-Young Lee:
Multi-Speaker Emotional Text-to-Speech Synthesizer. 2337-2338 - Ales Prazák, Zdenek Loose, Josef V. Psutka, Vlasta Radová, Josef Psutka, Jan Svec:
Live TV Subtitling Through Respeaking. 2339-2340 - Stefan Fragner, Tobias Topar, Maximilian Giller, Lukas Pfeifenberger, Franz Pernkopf:
Autonomous Robot for Measuring Room Impulse Responses. 2341-2342 - Jonas Beskow, Charlie Caper, Johan Ehrenfors, Nils Hagberg, Anne Jansen, Chris Wood:
Expressive Robot Performance Based on Facial Motion Capture. 2343-2344 - Mónica Domínguez, Juan Soler Company, Leo Wanner:
ThemePro 2.0: Showcasing the Role of Thematic Progression in Engaging Human-Computer Interaction. 2345-2346 - Sai Guruju, Jithendra Vepa:
Addressing Compliance in Call Centers with Entity Extraction. 2347-2348 - Krishnachaitanya Gogineni, Tarun Reddy Yadama, Jithendra Vepa:
Audio Segmentation Based Conversational Silence Detection for Contact Center Calls. 2349-2350
Graph and End-to-End Learning for Speaker Recognition
- Desh Raj, Sanjeev Khudanpur:
Reformulating DOVER-Lap Label Mapping as a Graph Partitioning Problem. 2351-2355 - Hemlata Tak, Jee-weon Jung, Jose Patino, Massimiliano Todisco, Nicholas W. D. Evans:
Graph Attention Networks for Anti-Spoofing. 2356-2360 - Victoria Mingote, Antonio Miguel, Alfonso Ortega Giménez, Eduardo Lleida:
Log-Likelihood-Ratio Cost Function as Objective Loss for Speaker Verification Systems. 2361-2365 - Junyi Peng, Xiaoyang Qu, Rongzhi Gu, Jianzong Wang, Jing Xiao, Lukás Burget, Jan Cernocký:
Effective Phase Encoding for End-To-End Speaker Verification. 2366-2370
Spoken Language Processing II
- Ha Nguyen, Yannick Estève, Laurent Besacier:
Impact of Encoding and Segmentation Strategies on End-to-End Simultaneous Speech Translation. 2371-2375 - Dominik Machácek, Matús Zilinec, Ondrej Bojar:
Lost in Interpreting: Speech Translation from Source or Interpreter? 2376-2380 - Baptiste Pouthier, Laurent Pilati, Leela K. Gudupudi, Charles Bouveyron, Frédéric Precioso:
Active Speaker Detection as a Multi-Objective Optimization with Uncertainty-Based Multimodal Fusion. 2381-2385 - Sarenne Wallbridge, Peter Bell, Catherine Lai:
It's Not What You Said, it's How You Said it: Discriminative Perception of Speech as a Multichannel Communication System. 2386-2390
Speech and Audio Analysis
- Thilo Michael, Gabriel Mittag, Andreas Bütow, Sebastian Möller:
Extending the Fullband E-Model Towards Background Noise, Bursty Packet Loss, and Conversational Degradations. 2391-2395 - Christian Bergler, Manuel Schmitt, Andreas K. Maier, Helena Symonds, Paul Spong, Steven R. Ness, George Tzanetakis, Elmar Nöth:
ORCA-SLANG: An Automatic Multi-Stage Semi-Supervised Deep Learning Framework for Large-Scale Killer Whale Call Type Identification. 2396-2400 - Wim Boes, Hugo Van hamme:
Audiovisual Transfer Learning for Audio Tagging and Sound Event Detection. 2401-2405 - Natalia Nessler, Milos Cernak, Paolo Prandoni, Pablo Mainar:
Non-Intrusive Speech Quality Assessment with Transfer Learning and Subject-Specific Scaling. 2406-2410 - Andreea-Maria Oncescu, A. Sophia Koepke, João F. Henriques, Zeynep Akata, Samuel Albanie:
Audio Retrieval with Natural Language Queries. 2411-2415
Cross/Multi-Lingual and Code-Switched ASR
- Manuel Giollo, Deniz Gunceler, Yulan Liu, Daniel Willett:
Bootstrap an End-to-End ASR System by Multilingual Training, Transfer Learning, Text-to-Text Mapping and Synthetic Audio. 2416-2420 - Ngoc-Quan Pham, Tuan-Nam Nguyen, Sebastian Stüker, Alex Waibel:
Efficient Weight Factorization for Multilingual Speech Recognition. 2421-2425 - Alexis Conneau, Alexei Baevski, Ronan Collobert, Abdelrahman Mohamed, Michael Auli:
Unsupervised Cross-Lingual Representation Learning for Speech Recognition. 2426-2430 - Tomoaki Hayakawa, Chee Siang Leow, Akio Kobayashi, Takehito Utsuro, Hiromitsu Nishizaki:
Language and Speaker-Independent Feature Transformation for End-to-End Multilingual Speech Recognition. 2431-2435 - Krishna D. N, Pinyi Wang, Bruno Bozza:
Using Large Self-Supervised Models for Low-Resource Speech Recognition. 2436-2440 - Mari Ganesh Kumar, Jom Kuriakose, Anand Thyagachandran, Arun Kumar A, Ashish Seth, Lodagala Durga Prasad, Saish Jaiswal, Anusha Prakash, Hema A. Murthy:
Dual Script E2E Framework for Multilingual and Code-Switching ASR. 2441-2445 - Anuj Diwan, Rakesh Vaideeswaran, Sanket Shah, Ankita Singh, Srinivasa Raghavan K. M., Shreya Khare, Vinit Unni, Saurabh Vyas, Akash Rajpuria, Chiranjeevi Yarra, Ashish R. Mittal, Prasanta Kumar Ghosh, Preethi Jyothi, Kalika Bali, Vivek Seshadri, Sunayana Sitaram, Samarth Bharadwaj, Jai Nanavati, Raoul Nanavati, Karthik Sankaranarayanan:
MUCS 2021: Multilingual and Code-Switching ASR Challenges for Low Resource Indian Languages. 2446-2450 - Genta Indra Winata, Guangsen Wang, Caiming Xiong, Steven C. H. Hoi:
Adapt-and-Adjust: Overcoming the Long-Tail Problem of Multilingual Speech Recognition. 2451-2455 - Hardik B. Sailor, Kiran Praveen, Vikas Agrawal, Abhinav Jain, Abhishek Pandey:
SRI-B End-to-End System for Multilingual and Code-Switching ASR Challenges for Low Resource Indian Languages. 2456-2460 - Xinjian Li, Juncheng Li, Florian Metze, Alan W. Black:
Hierarchical Phone Recognition with Compositional Phonetics. 2461-2465 - Shammur Absar Chowdhury, Amir Hussein, Ahmed Abdelali, Ahmed Ali:
Towards One Model to Rule All: Multilingual Strategy for Dialectal Code-Switching Arabic ASR. 2466-2470 - Brian Yan, Siddharth Dalmia, David R. Mortensen, Florian Metze, Shinji Watanabe:
Differentiable Allophone Graphs for Language-Universal Speech Recognition. 2471-2475
Health and Affect II
- Vincent P. Martin, Jean-Luc Rouas, Florian Boyer, Pierre Philip:
Automatic Speech Recognition Systems Errors for Objective Sleepiness Detection Through Voice. 2476-2480 - Jon Gillick, Wesley Deng, Kimiko Ryokai, David Bamman:
Robust Laughter Detection in Noisy Environments. 2481-2485 - Mizuki Nagano, Yusuke Ijima, Sadao Hiroya:
Impact of Emotional State on Estimation of Willingness to Buy from Advertising Speech. 2486-2490 - Huda Alsofyani, Alessandro Vinciarelli:
Stacked Recurrent Neural Networks for Speech-Based Inference of Attachment Condition in School Age Children. 2491-2495 - Nujud Aloshban, Anna Esposito, Alessandro Vinciarelli:
Language or Paralanguage, This is the Problem: Comparing Depressed and Non-Depressed Speakers Through the Analysis of Gated Multimodal Units. 2496-2500 - Aniruddha Tammewar, Alessandra Cervone, Giuseppe Riccardi:
Emotion Carrier Recognition from Personal Narratives. 2501-2505 - Scott Condron, Georgia Clarke, Anita Klementiev, Daniela Morse-Kopp, Jack Parry, Dimitri Palaz:
Non-Verbal Vocalisation and Laughter Detection Using Sequence-to-Sequence Models and Multi-Label Training. 2506-2510 - Cong Cai, Mingyue Niu, Bin Liu, Jianhua Tao, Xuefei Liu:
TDCA-Net: Time-Domain Channel Attention Network for Depression Detection. 2511-2515 - Catarina Botelho, Alberto Abad, Tanja Schultz, Isabel Trancoso:
Visual Speech for Obstructive Sleep Apnea Detection. 2516-2520 - Héctor A. Cordourier Maruri, Sinem Aslan, Georg Stemmer, Nese Alyüz, Lama Nachman:
Analysis of Contextual Voice Changes in Remote Meetings. 2521-2525 - Nadee Seneviratne, Carol Y. Espy-Wilson:
Speech Based Depression Severity Level Classification Using a Multi-Stage Dilated CNN-LSTM Model. 2526-2530
Neural Network Training Methods for ASR
- Ho-Gyeong Kim, Min-Joong Lee, Hoshik Lee, Tae Gyoon Kang, Jihyun Lee, Eunho Yang, Sung Ju Hwang:
Multi-Domain Knowledge Distillation via Uncertainty-Matching for End-to-End ASR Models. 2531-2535 - Jonathan Macoskey, Grant P. Strimel, Ariya Rastrow:
Learning a Neural Diff for Speech Models. 2536-2540 - Shucong Zhang, Erfan Loweimi, Peter Bell, Steve Renals:
Stochastic Attention Head Removal: A Simple and Effective Method for Improving Transformer Based ASR Models. 2541-2545 - Jiabin Xue, Tieran Zheng, Jiqing Han:
Model-Agnostic Fast Adaptive Multi-Objective Balancing Algorithm for Multilingual Automatic Speech Recognition Model Training. 2546-2550 - Heng-Jui Chang, Hung-yi Lee, Lin-Shan Lee:
Towards Lifelong Learning of End-to-End ASR. 2551-2555 - Isabel Leal, Neeraj Gaur, Parisa Haghani, Brian Farris, Pedro J. Moreno, Manasa Prasad, Bhuvana Ramabhadran, Yun Zhu:
Self-Adaptive Distillation for Multilingual Speech Recognition: Leveraging Student Independence. 2556-2560 - Hainan Xu, Kartik Audhkhasi, Yinghui Huang, Jesse Emond, Bhuvana Ramabhadran:
Regularizing Word Segmentation by Creating Misspellings. 2561-2565 - Peidong Wang, Tara N. Sainath, Ron J. Weiss:
Multitask Training with Text Data for End-to-End Speech Recognition. 2566-2570 - Xianzhao Chen, Hao Ni, Yi He, Kang Wang, Zejun Ma, Zongxia Xie:
Emitting Word Timings with HMM-Free End-to-End System in Automatic Speech Recognition. 2571-2575 - Jasha Droppo, Oguz Elibol:
Scaling Laws for Acoustic Models. 2576-2580 - Jayadev Billa:
Leveraging Non-Target Language Resources to Improve ASR Performance in a Target Language. 2581-2585 - Andrea Fasoli, Chia-Yu Chen, Mauricio J. Serrano, Xiao Sun, Naigang Wang, Swagath Venkataramani, George Saon, Xiaodong Cui, Brian Kingsbury, Wei Zhang, Zoltán Tüske, Kailash Gopalakrishnan:
4-Bit Quantization of LSTM-Based Speech Recognition Models. 2586-2590 - Ryo Masumura, Daiki Okamura, Naoki Makishima, Mana Ihori, Akihiko Takashima, Tomohiro Tanaka, Shota Orihashi:
Unified Autoregressive Modeling for Joint End-to-End Multi-Talker Overlapped Speech Recognition and Speaker Attribute Estimation. 2591-2595 - Zhong Meng, Yu Wu, Naoyuki Kanda, Liang Lu, Xie Chen, Guoli Ye, Eric Sun, Jinyu Li, Yifan Gong:
Minimum Word Error Rate Training with Language Model Fusion for End-to-End Speech Recognition. 2596-2600 - Dongcheng Jiang, Chao Zhang, Philip C. Woodland:
Variable Frame Rate Acoustic Models Using Minimum Error Reinforcement Learning. 2601-2605
Prosodic Features and Structure
- Constantijn Kaland, Matthew Gordon:
How f0 and Phrase Position Affect Papuan Malay Word Identification. 2606-2610 - Anna Bothe Jespersen, Pavel Sturm, Mísa Hejná:
On the Feasibility of the Danish Model of Intonational Transcription: Phonetic Evidence from Jutlandic Danish. 2611-2615 - Adrien Méli, Nicolas Ballier, Achille Falaise, Alice Henderson:
An Experiment in Paratone Detection in a Prosodically Annotated EAP Spoken Corpus. 2616-2620 - Branislav Gerazov, Michael Wagner:
ProsoBeast Prosody Annotation Tool. 2621-2625 - Trang Tran, Mari Ostendorf:
Assessing the Use of Prosody in Constituency Parsing of Imperfect Transcripts. 2626-2630 - Roger Cheng-yen Liu, Feng-fan Hsieh, Yueh-Chin Chang:
Targeted and Targetless Neutral Tones in Taiwanese Southern Min. 2631-2635 - Mária Gósy, Kálmán Abari:
The Interaction of Word Complexity and Word Duration in an Agglutinative Language. 2636-2640 - Ho-hsien Pan, Shao-Ren Lyu:
Taiwan Min Nan (Taiwanese) Checked Tones Sound Change. 2641-2645 - Moritz Jakob, Bettina Braun, Katharina Zahner-Ritter:
In-Group Advantage in the Perception of Emotions: Evidence from Three Varieties of German. 2646-2650 - Christer Gobl:
The LF Model in the Frequency Domain for Glottal Airflow Modelling Without Aliasing Distortion. 2651-2655 - Michael Wagner, Alvaro Iturralde Zurita, Sijia Zhang:
Parsing Speech for Grouping and Prominence, and the Typology of Rhythm. 2656-2660 - Benazir Mumtaz, Massimiliano Canzi, Miriam Butt:
Prosody of Case Markers in Urdu. 2661-2665 - Brynhildur Stefansdottir, Francesco Burroni, Sam Tilsen:
Articulatory Characteristics of Icelandic Voiced Fricative Lenition: Gradience, Categoricity, and Speaker/Gesture-Specific Effects. 2666-2670 - Khia A. Johnson:
Leveraging the Uniformity Framework to Examine Crosslinguistic Similarity for Long-Lag Stops in Spontaneous Cantonese-English Bilingual Speech. 2671-2675
Single-Channel Speech Enhancement
- Aswin Sivaraman, Sunwoo Kim, Minje Kim:
Personalized Speech Enhancement Through Self-Supervised Data Augmentation and Purification. 2676-2680 - Mark R. Saddler, Andrew Francl, Jenelle Feather, Kaizhi Qian, Yang Zhang, Josh H. McDermott:
Speech Denoising with Auditory Models. 2681-2685 - Sefik Emre Eskimez, Xiaofei Wang, Min Tang, Hemin Yang, Zirun Zhu, Zhuo Chen, Huaming Wang, Takuya Yoshioka:
Human Listening and Live Captioning: Multi-Task Training for Speech Enhancement. 2686-2690 - Xinmeng Xu, Yang Wang, Dongxiang Xu, Yiyuan Peng, Cong Zhang, Jie Jia, Binbin Chen:
Multi-Stage Progressive Speech Enhancement Network. 2691-2695 - Oscar Chang, Dung N. Tran, Kazuhito Koishida:
Single-Channel Speech Enhancement Using Learnable Loss Mixup. 2696-2700 - Xiaoqi Zhang, Jun Du, Li Chai, Chin-Hui Lee:
A Maximum Likelihood Approach to SNR-Progressive Learning Using Generalized Gaussian Distribution for LSTM-Based Speech Enhancement. 2701-2705 - Vikas Agrawal, Shashi Kumar, Shakti P. Rath:
Whisper Speech Enhancement Using Joint Variational Autoencoder for Improved Speech Recognition. 2706-2710 - Lukas Lee, Youna Ji, Minjae Lee, Min-Seok Choi:
DEMUCS-Mobile : On-Device Lightweight Speech Enhancement. 2711-2715 - Madhav Mahesh Kashyap, Anuj Tambwekar, Krishnamoorthy Manohara, S. Natarajan:
Speech Denoising Without Clean Training Data: A Noise2Noise Approach. 2716-2720 - Feng Dang, Pengyuan Zhang, Hangting Chen:
Improved Speech Enhancement Using a Complex-Domain GAN with Fused Time-Domain and Time-Frequency Domain Constraints. 2721-2725 - Xudong Zhang, Liang Zhao, Feng Gu:
Speech Enhancement with Topology-Enhanced Generative Adversarial Networks (GANs). 2726-2730 - Suliang Bu, Yunxin Zhao, Shaojun Wang, Mei Han:
Learning Speech Structure to Improve Time-Frequency Masks. 2731-2735 - Eesung Kim, Hyeji Seo:
SE-Conformer: Time-Domain Speech Enhancement Using Conformer. 2736-2740
Speech Synthesis: Tools, Data, Evaluation
- Thananchai Kongthaworn, Burin Naowarat, Ekapol Chuangsuwanich:
Spectral and Latent Speech Representation Distortion for TTS Evaluation. 2741-2745 - Cassia Valentini-Botinhao, Simon King:
Detection and Analysis of Attention Errors in Sequence-to-Sequence Text-to-Speech. 2746-2750 - Rohola Zandie, Mohammad H. Mahoor, Julia Madsen, Eshrat S. Emamian:
RyanSpeech: A Corpus for Conversational Text-to-Speech Synthesis. 2751-2755 - Yao Shi, Hui Bu, Xin Xu, Shaoji Zhang, Ming Li:
AISHELL-3: A Multi-Speaker Mandarin TTS Corpus. 2756-2760 - Nicholas Eng, C. T. Justine Hui, Yusuke Hioka, Catherine I. Watson:
Comparing Speech Enhancement Techniques for Voice Adaptation-Based Speech Synthesis. 2761-2765 - Chenye Cui, Yi Ren, Jinglin Liu, Feiyang Chen, Rongjie Huang, Ming Lei, Zhou Zhao:
EMOVIE: A Mandarin Emotion Speech Dataset with a Simple Emotional Text-to-Speech Model. 2766-2770 - Sai Sirisha Rallabandi, Abhinav Bharadwaj, Babak Naderi, Sebastian Möller:
Perception of Social Speaker Characteristics in Synthetic Speech. 2771-2775 - Evelina Bakhturina, Vitaly Lavrukhin, Boris Ginsburg, Yang Zhang:
Hi-Fi Multi-Speaker English TTS Dataset. 2776-2780 - Wei-Cheng Tseng, Chien-yu Huang, Wei-Tsung Kao, Yist Y. Lin, Hung-yi Lee:
Utilizing Self-Supervised Representations for MOS Prediction. 2781-2785 - Saida Mussakhojayeva, Aigerim Janaliyeva, Almas Mirzakhmetov, Yerbolat Khassanov, Huseyin Atakan Varol:
KazakhTTS: An Open-Source Kazakh Text-to-Speech Synthesis Dataset. 2786-2790 - Jason Taylor, Korin Richmond:
Confidence Intervals for ASR-Based TTS Evaluation. 2791-2795
INTERSPEECH 2021 Deep Noise Suppression Challenge
- Chandan K. A. Reddy, Harishchandra Dubey, Kazuhito Koishida, Arun Asokan Nair, Vishak Gopal, Ross Cutler, Sebastian Braun, Hannes Gamper, Robert Aichner, Sriram Srinivasan:
INTERSPEECH 2021 Deep Noise Suppression Challenge. 2796-2800 - Andong Li, Wenzhe Liu, Xiaoxue Luo, Guochen Yu, Chengshi Zheng, Xiaodong Li:
A Simultaneous Denoising and Dereverberation Framework with Target Decoupling. 2801-2805 - Ziyi Xu, Maximilian Strake, Tim Fingscheidt:
Deep Noise Suppression with Non-Intrusive PESQNet Supervision Enabling the Use of Real Training Data. 2806-2810 - Xiaohuai Le, Hongsheng Chen, Kai Chen, Jing Lu:
DPCRN: Dual-Path Convolution Recurrent Network for Single Channel Speech Enhancement. 2811-2815 - Shubo Lv, Yanxin Hu, Shimin Zhang, Lei Xie:
DCCRN+: Channel-Wise Subband DCCRN with SNR Estimation for Speech Enhancement. 2816-2820 - Kanghao Zhang, Shulin He, Hao Li, Xueliang Zhang:
DBNet: A Dual-Branch Network Architecture Processing on Spectrum and Waveform for Single-Channel Speech Enhancement. 2821-2825 - Xu Zhang, Xinlei Ren, Xiguang Zheng, Lianwu Chen, Chen Zhang, Liang Guo, Bing Yu:
Low-Delay Speech Enhancement Using Perceptually Motivated Target and Loss. 2826-2830 - Koen Oostermeijer, Qing Wang, Jun Du:
Lightweight Causal Transformer with Local Self-Attention for Real-Time Speech Enhancement. 2831-2835
Neural Network Training Methods and Architectures for ASR
- Nicolae-Catalin Ristea, Radu Tudor Ionescu:
Self-Paced Ensemble Learning for Speech and Audio Classification. 2836-2840 - Atsushi Kojima:
Knowledge Distillation for Streaming Transformer-Transducer. 2841-2845 - Timo Lohrenz, Zhengyang Li, Tim Fingscheidt:
Multi-Encoder Learning and Stream Fusion for Transformer-Based End-to-End Automatic Speech Recognition. 2846-2850 - Salah Zaiem, Titouan Parcollet, Slim Essid:
Conditional Independence for Pretext Task Selection in Self-Supervised Speech Representation Learning. 2851-2855 - Mohammad Zeineldeen, Aleksandr Glushko, Wilfried Michel, Albert Zeyer, Ralf Schlüter, Hermann Ney:
Investigating Methods to Improve Language Model Integration for Attention-Based Encoder-Decoder ASR Models. 2856-2860 - Apoorv Vyas, Srikanth R. Madikeri, Hervé Bourlard:
Comparing CTC and LFMMI for Out-of-Domain Adaptation of wav2vec 2.0 Acoustic Model. 2861-2865
Emotion and Sentiment Analysis I
- Clément Le Moine, Nicolas Obin, Axel Roebel:
Speaker Attentive Speech Emotion Recognition. 2866-2870 - Seong-Gyun Leem, Daniel Fulford, Jukka-Pekka Onnela, David Gard, Carlos Busso:
Separation of Emotional and Reconstruction Embeddings on Ladder Network to Improve Speech Emotion Recognition Robustness in Noisy Conditions. 2871-2875 - Efthymios Georgiou, Georgios Paraskevopoulos, Alexandros Potamianos:
M3: MultiModal Masking Applied to Sentiment Analysis. 2876-2880
Linguistic Components in End-to-End ASR
- Ondrej Klejch, Electra Wallington, Peter Bell:
The CSTR System for Multilingual and Code-Switching ASR Challenges for Low Resource Indian Languages. 2881-2885 - Wei Zhou, Mohammad Zeineldeen, Zuoyun Zheng, Ralf Schlüter, Hermann Ney:
Acoustic Data-Driven Subword Modeling for End-to-End Speech Recognition. 2886-2890 - Wei Zhou, Albert Zeyer, André Merboldt, Ralf Schlüter, Hermann Ney:
Equivalence of Segmental and Neural Transducer Modeling: A Proof of Concept. 2891-2895 - Abbas Khosravani, Philip N. Garner, Alexandros Lazaridis:
Modeling Dialectal Variation for Swiss German Automatic Speech Recognition. 2896-2900 - Ekaterina Egorova, Hari Krishna Vydana, Lukás Burget, Jan Cernocký:
Out-of-Vocabulary Words Detection with Attention and CTC Alignments in an End-to-End ASR System. 2901-2905 - Matthew Wiesner, Mousmita Sarma, Ashish Arora, Desh Raj, Dongji Gao, Ruizhe Huang, Supreet Preet, Moris Johnson, Zikra Iqbal, Nagendra Goel, Jan Trmal, Leibny Paola García-Perera, Sanjeev Khudanpur:
Training Hybrid Models on Noisy Transliterated Transcripts for Code-Switched Speech Recognition. 2906-2910
Assessment of Pathological Speech and Language II
- Wei Xue, Roeland van Hout, Fleur Boogmans, Mario Ganzeboom, Catia Cucchiarini, Helmer Strik:
Speech Intelligibility of Dysarthric Speech: Human Scores and Acoustic-Phonetic Features. 2911-2915 - Young-Kyung Kim, Rimita Lahiri, Md. Nasir, So Hyun Kim, Somer Bishop, Catherine Lord, Shrikanth S. Narayanan:
Analyzing Short Term Dynamic Speech Features for Understanding Behavioral Traits of Children with Autism Spectrum Disorder. 2916-2920 - Waldemar Jesko:
Vocalization Recognition of People with Profound Intellectual and Multiple Disabilities (PIMD) Using Machine Learning Algorithms. 2921-2925 - Barbara Gili Fivela, Vincenzo Sallustio, Silvia Pede, Danilo Patrocinio:
Phonetic Complexity, Speech Accuracy and Intelligibility Assessment of Italian Dysarthric Speech. 2926-2930 - Si Ioi Ng, Cymie Wing-Yee Ng, Jingyu Li, Tan Lee:
Detection of Consonant Errors in Disordered Speech Based on Consonant-Vowel Segment Embedding. 2931-2935 - Adam Hair, Guanlong Zhao, Beena Ahmed, Kirrie J. Ballard, Ricardo Gutierrez-Osuna:
Assessing Posterior-Based Mispronunciation Detection on Field-Collected Recordings from Child Speech Therapy Sessions. 2936-2940 - Bahman Mirheidari, Yilin Pan, Daniel Blackburn, Ronan O'Malley, Heidi Christensen:
Identifying Cognitive Impairment Using Sentence Representation Vectors. 2941-2945 - Zhengjun Yue, Jon Barker, Heidi Christensen, Cristina McKean, Elaine Ashton, Yvonne Wren, Swapnil Gadgil, Rebecca Bright:
Parental Spoken Scaffolding and Narrative Skills in Crowd-Sourced Storytelling Samples of Young Children. 2946-2950 - Tong Xia, Jing Han, Lorena Qendro, Ting Dang, Cecilia Mascolo:
Uncertainty-Aware COVID-19 Detection from Imbalanced Sound Data. 2951-2955 - Disong Wang, Liqun Deng, Yu Ting Yeung, Xiao Chen, Xunying Liu, Helen Meng:
Unsupervised Domain Adaptation for Dysarthric Speech Detection via Domain Adversarial Training and Mutual Information Minimization. 2956-2960 - Tanuka Bhattacharjee, Jhansi Mallela, Yamini Belur, Atchayaram Nalini, Ravi Yadav, Pradeep Reddy, Dipanjan Gope, Prasanta Kumar Ghosh:
Source and Vocal Tract Cues for Speech-Based Classification of Patients with Parkinson's Disease and Healthy Subjects. 2961-2965 - R'mani Haulcy, James R. Glass:
CLAC: A Speech Corpus of Healthy English Speakers. 2966-2970
Multimodal Systems
- Leanne Nortje, Herman Kamper:
Direct Multimodal Few-Shot Learning of Speech and Images. 2971-2975 - Ramon Sanabria, Austin Waters, Jason Baldridge:
Talk, Don't Write: A Study of Direct Speech-Based Image Retrieval. 2976-2980 - Huan Zhao, Kaili Ma:
A Fast Discrete Two-Step Learning Hashing for Scalable Cross-Modal Retrieval. 2981-2985 - Jianrong Wang, Ziyue Tang, Xuewei Li, Mei Yu, Qiang Fang, Li Liu:
Cross-Modal Knowledge Distillation Method for Automatic Cued Speech Recognition. 2986-2990 - Kayode Olaleye, Herman Kamper:
Attention-Based Keyword Localisation in Speech Using Visual Grounding. 2991-2995 - Khazar Khorrami, Okko Räsänen:
Evaluation of Audio-Visual Alignments in Visually Grounded Speech Models. 2996-3000