doi: 10.21437/Interspeech.2021
ISSN: 2958-1796
Conversion of Airborne to Bone-Conducted Speech with Deep Neural Networks
Michael Pucher, Thomas Woltron
T5G2P: Using Text-to-Text Transfer Transformer for Grapheme-to-Phoneme Conversion
Markéta Řezáčková, Jan Švec, Daniel Tihelka
Evaluating the Extrapolation Capabilities of Neural Vocoders to Extreme Pitch Values
Olivier Perrotin, Hussein El Amouri, Gérard Bailly, Thomas Hueber
A Systematic Review and Analysis of Multilingual Data Strategies in Text-to-Speech for Low-Resource Languages
Phat Do, Matt Coler, Jelske Dijkstra, Esther Klabbers
Acoustic Indicators of Speech Motor Coordination in Adults With and Without Traumatic Brain Injury
Tanya Talkar, Nancy Pearl Solomon, Douglas S. Brungart, Stefanie E. Kuchinsky, Megan M. Eitel, Sara M. Lippa, Tracey A. Brickell, Louis M. French, Rael T. Lange, Thomas F. Quatieri
On Modeling Glottal Source Information for Phonation Assessment in Parkinson’s Disease
J.C. Vásquez-Correa, Julian Fritsch, J.R. Orozco-Arroyave, Elmar Nöth, Mathew Magimai-Doss
Distortion of Voiced Obstruents for Differential Diagnosis Between Parkinson’s Disease and Multiple System Atrophy
Khalid Daoudi, Biswajit Das, Solange Milhé de Saint Victor, Alexandra Foubert-Samier, Anne Pavy-Le Traon, Olivier Rascol, Wassilios G. Meissner, Virginie Woisard
A Study into Pre-Training Strategies for Spoken Language Understanding on Dysarthric Speech
Pu Wang, Bagher BabaAli, Hugo Van hamme
EasyCall Corpus: A Dysarthric Speech Dataset
Rosanna Turrisi, Arianna Braccia, Marco Emanuele, Simone Giulietti, Maura Pugliatti, Mariachiara Sensi, Luciano Fadiga, Leonardo Badino
A Benchmark of Dynamical Variational Autoencoders Applied to Speech Spectrogram Modeling
Xiaoyu Bie, Laurent Girin, Simon Leglaive, Thomas Hueber, Xavier Alameda-Pineda
Fricative Phoneme Detection Using Deep Neural Networks and its Comparison to Traditional Methods
Metehan Yurt, Pavan Kantharaju, Sascha Disch, Andreas Niedermeier, Alberto N. Escalante-B, Veniamin I. Morgenshtern
Identification of F1 and F2 in Speech Using Modified Zero Frequency Filtering
RaviShankar Prasad, Mathew Magimai-Doss
Phoneme-to-Audio Alignment with Recurrent Neural Networks for Speaking and Singing Voice
Yann Teytaut, Axel Roebel
Adaptive Convolutional Neural Network for Text-Independent Speaker Recognition
Seong-Hu Kim, Yong-Hwa Park
Bidirectional Multiscale Feature Aggregation for Speaker Verification
Jiajun Qi, Wu Guo, Bin Gu
Improving Time Delay Neural Network Based Speaker Recognition with Convolutional Block and Feature Aggregation Methods
Yu-Jia Zhang, Yih-Wen Wang, Chia-Ping Chen, Chung-Li Lu, Bo-Cheng Chan
Improving Deep CNN Architectures with Variable-Length Training Samples for Text-Independent Speaker Verification
Yanfeng Wu, Junan Zhao, Chenkai Guo, Jing Xu
Binary Neural Network for Speaker Verification
Tinglong Zhu, Xiaoyi Qin, Ming Li
Mutual Information Enhanced Training for Speaker Embedding
Youzhi Tu, Man-Wai Mak
Y-Vector: Multiscale Waveform Encoder for Speaker Embedding
Ge Zhu, Fei Jiang, Zhiyao Duan
Phoneme-Aware and Channel-Wise Attentive Learning for Text Dependent Speaker Verification
Yan Liu, Zheng Li, Lin Li, Qingyang Hong
Serialized Multi-Layer Multi-Head Attention for Neural Speaker Embedding
Hongning Zhu, Kong Aik Lee, Haizhou Li
TacoLPCNet: Fast and Stable TTS by Conditioning LPCNet on Mel Spectrogram Predictions
Cheng Gong, Longbiao Wang, Ju Zhang, Shaotong Guo, Yuguang Wang, Jianwu Dang
FastPitchFormant: Source-Filter Based Decomposed Modeling for Speech Synthesis
Taejun Bak, Jae-Sung Bae, Hanbin Bae, Young-Ik Kim, Hoon-Young Cho
Sequence-to-Sequence Learning for Deep Gaussian Process Based Speech Synthesis Using Self-Attention GP Layer
Taiki Nakamura, Tomoki Koriyama, Hiroshi Saruwatari
Phonetic and Prosodic Information Estimation from Texts for Genuine Japanese End-to-End Text-to-Speech
Naoto Kakegawa, Sunao Hara, Masanobu Abe, Yusuke Ijima
Information Sieve: Content Leakage Reduction in End-to-End Prosody Transfer for Expressive Speech Synthesis
Xudong Dai, Cheng Gong, Longbiao Wang, Kaili Zhang
Deliberation-Based Multi-Pass Speech Synthesis
Qingyun Dou, Xixin Wu, Moquan Wan, Yiting Lu, Mark J.F. Gales
Parallel Tacotron 2: A Non-Autoregressive Neural TTS Model with Differentiable Duration Modeling
Isaac Elias, Heiga Zen, Jonathan Shen, Yu Zhang, Ye Jia, R.J. Skerry-Ryan, Yonghui Wu
Transformer-Based Acoustic Modeling for Streaming Speech Synthesis
Chunyang Wu, Zhiping Xiu, Yangyang Shi, Ozlem Kalinli, Christian Fuegen, Thilo Koehler, Qing He
PnG BERT: Augmented BERT on Phonemes and Graphemes for Neural TTS
Ye Jia, Heiga Zen, Jonathan Shen, Yu Zhang, Yonghui Wu
Speed up Training with Variable Length Inputs by Efficient Batching Strategies
Zhenhao Ge, Lakshmish Kaushik, Masanori Omote, Saket Kumar
Funnel Deep Complex U-Net for Phase-Aware Speech Enhancement
Yuhang Sun, Linju Yang, Huifeng Zhu, Jie Hao
Temporal Convolutional Network with Frequency Dimension Adaptive Attention for Speech Enhancement
Qiquan Zhang, Qi Song, Aaron Nicolson, Tian Lan, Haizhou Li
Perceptual Contributions of Vowels and Consonant-Vowel Transitions in Understanding Time-Compressed Mandarin Sentences
Changjie Pan, Feng Yang, Fei Chen
Transfer Learning for Speech Intelligibility Improvement in Noisy Environments
Ritujoy Biswas, Karan Nathwani, Vinayak Abrol
Comparison of Remote Experiments Using Crowdsourcing and Laboratory Experiments on Speech Intelligibility
Ayako Yamamoto, Toshio Irino, Kenichi Arai, Shoko Araki, Atsunori Ogawa, Keisuke Kinoshita, Tomohiro Nakatani
Know Your Enemy, Know Yourself: A Unified Two-Stage Framework for Speech Enhancement
Wenzhe Liu, Andong Li, Yuxuan Ke, Chengshi Zheng, Xiaodong Li
Speech Enhancement with Weakly Labelled Data from AudioSet
Qiuqiang Kong, Haohe Liu, Xingjian Du, Li Chen, Rui Xia, Yuxuan Wang
Improving Perceptual Quality by Phone-Fortified Perceptual Loss Using Wasserstein Distance for Speech Enhancement
Tsun-An Hsieh, Cheng Yu, Szu-Wei Fu, Xugang Lu, Yu Tsao
MetricGAN+: An Improved Version of MetricGAN for Speech Enhancement
Szu-Wei Fu, Cheng Yu, Tsun-An Hsieh, Peter Plantinga, Mirco Ravanelli, Xugang Lu, Yu Tsao
A Spectro-Temporal Glimpsing Index (STGI) for Speech Intelligibility Prediction
Amin Edraki, Wai-Yip Chan, Jesper Jensen, Daniel Fogerty
Self-Supervised Learning Based Phone-Fortified Speech Enhancement
Yuanhang Qiu, Ruili Wang, Satwinder Singh, Zhizhong Ma, Feng Hou
Incorporating Embedding Vectors from a Human Mean-Opinion Score Prediction Model for Monaural Speech Enhancement
Khandokar Md. Nayem, Donald S. Williamson
Restoring Degraded Speech via a Modified Diffusion Model
Jianwei Zhang, Suren Jayasuriya, Visar Berisha
User-Initiated Repetition-Based Recovery in Multi-Utterance Dialogue Systems
Hoang Long Nguyen, Vincent Renkens, Joris Pelemans, Srividya Pranavi Potharaju, Anil Kumar Nalamalapu, Murat Akbacak
Self-Supervised Dialogue Learning for Spoken Conversational Question Answering
Nuo Chen, Chenyu You, Yuexian Zou
Act-Aware Slot-Value Predicting in Multi-Domain Dialogue State Tracking
Ruolin Su, Ting-Wei Wu, Biing-Hwang Juang
Dialogue Situation Recognition for Everyday Conversation Using Multimodal Information
Yuya Chiba, Ryuichiro Higashinaka
Neural Spoken-Response Generation Using Prosodic and Linguistic Context for Conversational Systems
Yoshihiro Yamazaki, Yuya Chiba, Takashi Nose, Akinori Ito
Semantic Transportation Prototypical Network for Few-Shot Intent Detection
Weiyuan Xu, Peilin Zhou, Chenyu You, Yuexian Zou
Domain-Specific Multi-Agent Dialog Policy Learning in Multi-Domain Task-Oriented Scenarios
Li Tang, Yuke Si, Longbiao Wang, Jianwu Dang
Leveraging ASR N-Best in Deep Entity Retrieval
Haoyu Wang, John Chen, Majid Laali, Kevin Durda, Jeff King, William Campbell, Yang Liu
End-to-End Spelling Correction Conditioned on Acoustic Feature for Code-Switching Speech Recognition
Shuai Zhang, Jiangyan Yi, Zhengkun Tian, Ye Bai, Jianhua Tao, Xuefei Liu, Zhengqi Wen
Phoneme Recognition Through Fine Tuning of Phonetic Representations: A Case Study on Luhya Language Varieties
Kathleen Siminyu, Xinjian Li, Antonios Anastasopoulos, David R. Mortensen, Michael R. Marlo, Graham Neubig
Speech Acoustic Modelling Using Raw Source and Filter Components
Erfan Loweimi, Zoran Cvetkovic, Peter Bell, Steve Renals
Noise Robust Acoustic Modeling for Single-Channel Speech Recognition Based on a Stream-Wise Transformer Architecture
Masakiyo Fujimoto, Hisashi Kawai
IR-GAN: Room Impulse Response Generator for Far-Field Speech Recognition
Anton Ratnarajah, Zhenyu Tang, Dinesh Manocha
Scaling Sparsemax Based Channel Selection for Speech Recognition with ad-hoc Microphone Arrays
Junqi Chen, Xiao-Lei Zhang
Multi-Channel Transformer Transducer for Speech Recognition
Feng-Ju Chang, Martin Radfar, Athanasios Mouchtaris, Maurizio Omologo
Data Augmentation Methods for End-to-End Speech Recognition on Distant-Talk Scenarios
Emiru Tsunoo, Kentaro Shibata, Chaitanya Narisetty, Yosuke Kashiwagi, Shinji Watanabe
Leveraging Phone Mask Training for Phonetic-Reduction-Robust E2E Uyghur Speech Recognition
Guodong Ma, Pengfei Hu, Jian Kang, Shen Huang, Hao Huang
Rethinking Evaluation in ASR: Are Our Models Robust Enough?
Tatiana Likhomanenko, Qiantong Xu, Vineel Pratap, Paden Tomasello, Jacob Kahn, Gilad Avidov, Ronan Collobert, Gabriel Synnaeve
Raw Waveform Encoder with Multi-Scale Globally Attentive Locally Recurrent Networks for End-to-End Speech Recognition
Max W.Y. Lam, Jun Wang, Chao Weng, Dan Su, Dong Yu
Attention-Based Cross-Modal Fusion for Audio-Visual Voice Activity Detection in Musical Video Streams
Yuanbo Hou, Zhesong Yu, Xia Liang, Xingjian Du, Bilei Zhu, Zejun Ma, Dick Botteldooren
Noise-Tolerant Self-Supervised Learning for Audio-Visual Voice Activity Detection
Ui-Hyun Kim
Noisy Student-Teacher Training for Robust Keyword Spotting
Hyun-Jin Park, Pai Zhu, Ignacio Lopez Moreno, Niranjan Subrahmanya
Multi-Channel VAD for Transcription of Group Discussion
Osamu Ichikawa, Kaito Nakano, Takahiro Nakayama, Hajime Shirouzu
Audio-Visual Information Fusion Using Cross-Modal Teacher-Student Learning for Voice Activity Detection in Realistic Environments
Hengshun Zhou, Jun Du, Hang Chen, Zijun Jing, Shifu Xiong, Chin-Hui Lee
Enrollment-Less Training for Personalized Voice Activity Detection
Naoki Makishima, Mana Ihori, Tomohiro Tanaka, Akihiko Takashima, Shota Orihashi, Ryo Masumura
Voice Activity Detection for Live Speech of Baseball Game Based on Tandem Connection with Speech/Noise Separation Model
Yuto Nonaka, Chee Siang Leow, Akio Kobayashi, Takehito Utsuro, Hiromitsu Nishizaki
FastICARL: Fast Incremental Classifier and Representation Learning with Efficient Budget Allocation in Audio Sensing Applications
Young D. Kwon, Jagmohan Chauhan, Cecilia Mascolo
End-to-End Transformer-Based Open-Vocabulary Keyword Spotting with Location-Guided Local Attention
Bo Wei, Meirong Yang, Tao Zhang, Xiao Tang, Xing Huang, Kyuhong Kim, Jaeyun Lee, Kiho Cho, Sung-Un Park
Segmental Contrastive Predictive Coding for Unsupervised Word Segmentation
Saurabhchand Bhati, Jesús Villalba, Piotr Żelasko, Laureano Moro-Velázquez, Najim Dehak
A Lightweight Framework for Online Voice Activity Detection in the Wild
Xuenan Xu, Heinrich Dinkel, Mengyue Wu, Kai Yu
“See what I mean, huh?” Evaluating Visual Inspection of F0 Tracking in Nasal Grunts
Aurélie Chlébowski, Nicolas Ballier
System Performance as a Function of Calibration Methods, Sample Size and Sampling Variability in Likelihood Ratio-Based Forensic Voice Comparison
Bruce Xiao Wang, Vincent Hughes
Voicing Assimilations by French Speakers of German in Stop-Fricative Sequences
Anne Bonneau
The Four-Way Classification of Stops with Voicing and Aspiration for Non-Native Speech Evaluation
Titas Chakraborty, Vaishali Patil, Preeti Rao
Acoustic and Prosodic Correlates of Emotions in Urdu Speech
Saba Urooj, Benazir Mumtaz, Sarmad Hussain, Ehsan ul Haq
Voicing Contrasts in the Singleton Stops of Palestinian Arabic: Production and Perception
Nour Tamim, Silke Hamann
A Comparison of the Accuracy of Dissen and Keshet’s (2016) DeepFormants and Traditional LPC Methods for Semi-Automatic Speaker Recognition
Thomas Coy, Vincent Hughes, Philip Harrison, Amelia J. Gully
MAP Adaptation Characteristics in Forensic Long-Term Formant Analysis
Michael Jessen
Cross-Linguistic Speaker Individuality of Long-Term Formant Distributions: Phonetic and Forensic Perspectives
Justin J.H. Lo
Sound Change in Spontaneous Bilingual Speech: A Corpus Study on the Cantonese n-l Merger in Cantonese-English Bilinguals
Rachel Soo, Khia A. Johnson, Molly Babel
Characterizing Voiced and Voiceless Nasals in Mizo
Wendy Lalhminghlui, Priyankoo Sarmah
The INTERSPEECH 2021 Computational Paralinguistics Challenge: COVID-19 Cough, COVID-19 Speech, Escalation & Primates
Björn W. Schuller, Anton Batliner, Christian Bergler, Cecilia Mascolo, Jing Han, Iulia Lefter, Heysem Kaya, Shahin Amiriparian, Alice Baird, Lukas Stappen, Sandra Ottl, Maurice Gerczuk, Panagiotis Tzirakis, Chloë Brown, Jagmohan Chauhan, Andreas Grammenos, Apinan Hasthanasombat, Dimitris Spathis, Tong Xia, Pietro Cicuta, Leon J.M. Rothkrantz, Joeri A. Zwerts, Jelle Treep, Casper S. Kaandorp
Transfer Learning-Based Cough Representations for Automatic Detection of COVID-19
Rubén Solera-Ureña, Catarina Botelho, Francisco Teixeira, Thomas Rolland, Alberto Abad, Isabel Trancoso
The Phonetic Footprint of Covid-19?
P. Klumpp, T. Bocklet, T. Arias-Vergara, J.C. Vásquez-Correa, P.A. Pérez-Toro, S.P. Bayerl, J.R. Orozco-Arroyave, Elmar Nöth
Transfer Learning and Data Augmentation Techniques to the COVID-19 Identification Tasks in ComParE 2021
Edresson Casanova, Arnaldo Candido Jr., Ricardo Corso Fernandes Jr., Marcelo Finger, Lucas Rafael Stefanel Gris, Moacir Antonelli Ponti, Daniel Peixoto Pinto da Silva
Visual Transformers for Primates Classification and Covid Detection
Steffen Illium, Robert Müller, Andreas Sedlmeier, Claudia Linnhoff-Popien
Deep-Learning-Based Central African Primate Species Classification with MixUp and SpecAugment
Thomas Pellegrini
A Deep and Recurrent Architecture for Primate Vocalization Classification
Robert Müller, Steffen Illium, Claudia Linnhoff-Popien
Introducing a Central African Primate Vocalisation Dataset for Automated Species Classification
Joeri A. Zwerts, Jelle Treep, Casper S. Kaandorp, Floor Meewis, Amparo C. Koot, Heysem Kaya
Multi-Attentive Detection of the Spider Monkey Whinny in the (Actual) Wild
Georgios Rizos, Jenna Lawson, Zhuoda Han, Duncan Butler, James Rosindell, Krystian Mikolajczyk, Cristina Banks-Leite, Björn W. Schuller
Identifying Conflict Escalation and Primates by Using Ensemble X-Vectors and Fisher Vector Features
José Vicente Egas-López, Mercedes Vetráb, László Tóth, Gábor Gosztolya
Ensemble-Within-Ensemble Classification for Escalation Prediction from Speech
Oxana Verkholyak, Denis Dresvyanskiy, Anastasia Dvoynikova, Denis Kotov, Elena Ryumina, Alena Velichko, Danila Mamontov, Wolfgang Minker, Alexey Karpov
Analysis by Synthesis: Using an Expressive TTS Model as Feature Extractor for Paralinguistic Speech Classification
Dominik Schiller, Silvan Mertes, Pol van Rijn, Elisabeth André
Towards Automatic Speech Recognition for People with Atypical Speech
Heidi Christensen
Leveraging Speaker Attribute Information Using Multi Task Learning for Speaker Verification and Diarization
Chau Luu, Peter Bell, Steve Renals
Spine2Net: SpineNet with Res2Net and Time-Squeeze-and-Excitation Blocks for Speaker Recognition
Magdalena Rybicka, Jesús Villalba, Piotr Żelasko, Najim Dehak, Konrad Kowalczyk
Speaker Embeddings by Modeling Channel-Wise Correlations
Themos Stafylakis, Johan Rohdin, Lukáš Burget
Multi-Task Neural Network for Robust Multiple Speaker Embedding Extraction
Weipeng He, Petr Motlicek, Jean-Marc Odobez
ICSpk: Interpretable Complex Speaker Embedding Extractor from Raw Waveform
Junyi Peng, Xiaoyang Qu, Jianzong Wang, Rongzhi Gu, Jing Xiao, Lukáš Burget, Jan Černocký
Prosodic Disambiguation Using Chironomic Stylization of Intonation with Native and Non-Native Speakers
Xiao Xiao, Nicolas Audibert, Grégoire Locqueville, Christophe d’Alessandro, Barbara Kuhnert, Claire Pillot-Loiseau
Variation in Perceptual Sensitivity and Compensation for Coarticulation Across Adult and Child Naturally-Produced and TTS Voices
Aleese Block, Michelle Cohn, Georgia Zellou
Extracting Different Levels of Speech Information from EEG Using an LSTM-Based Model
Mohammad Jalilpour Monesi, Bernd Accou, Tom Francart, Hugo Van hamme
Word Competition: An Entropy-Based Approach in the DIANA Model of Human Word Comprehension
Louis ten Bosch, Lou Boves
Time-to-Event Models for Analyzing Reaction Time Sequences
Louis ten Bosch, Lou Boves
Models of Reaction Times in Auditory Lexical Decision: RTonset versus RToffset
Sophie Brand, Kimberley Mulder, Louis ten Bosch, Lou Boves
SpecMix: A Mixed Sample Data Augmentation Method for Training with Time-Frequency Domain Features
Gwantae Kim, David K. Han, Hanseok Ko
SpecAugment++: A Hidden Space Data Augmentation Method for Acoustic Scene Classification
Helin Wang, Yuexian Zou, Wenwu Wang
An Effective Mutual Mean Teaching Based Domain Adaptation Method for Sound Event Detection
Xu Zheng, Yan Song, Li-Rong Dai, Ian McLoughlin, Lin Liu
Acoustic Scene Classification Using Kervolution-Based SubSpectralNet
Ritika Nandi, Shashank Shekhar, Manjunath Mulimani
Event Specific Attention for Polyphonic Sound Event Detection
Harshavardhan Sundar, Ming Sun, Chao Wang
AST: Audio Spectrogram Transformer
Yuan Gong, Yu-An Chung, James Glass
Shallow Convolution-Augmented Transformer with Differentiable Neural Computer for Low-Complexity Classification of Variable-Length Acoustic Scene
Soonshin Seo, Donghyun Lee, Ji-Hwan Kim
An Evaluation of Data Augmentation Methods for Sound Scene Geotagging
Helen L. Bear, Veronica Morfi, Emmanouil Benetos
Optimizing Latency for Online Video Captioning Using Audio-Visual Transformers
Chiori Hori, Takaaki Hori, Jonathan Le Roux
Variational Information Bottleneck for Effective Low-Resource Audio Classification
Shijing Si, Jianzong Wang, Huiming Sun, Jianhan Wu, Chuanyao Zhang, Xiaoyang Qu, Ning Cheng, Lei Chen, Jing Xiao
Improving Weakly Supervised Sound Event Detection with Self-Supervised Auxiliary Tasks
Soham Deshmukh, Bhiksha Raj, Rita Singh
Acoustic Event Detection with Classifier Chains
Tatsuya Komatsu, Shinji Watanabe, Koichi Miyazaki, Tomoki Hayashi
Segment and Tone Production in Continuous Speech of Hearing and Hearing-Impaired Children
Shu-Chuan Tseng, Yi-Fen Liu
Effect of Carrier Bandwidth on Understanding Mandarin Sentences in Simulated Electric-Acoustic Hearing
Feng Wang, Jing Chen, Fei Chen
A Comparative Study of Different EMG Features for Acoustics-to-EMG Mapping
Manthan Sharma, Navaneetha Gaddam, Tejas Umesh, Aditya Murthy, Prasanta Kumar Ghosh
Image-Based Assessment of Jaw Parameters and Jaw Kinematics for Articulatory Simulation: Preliminary Results
Ajish K. Abraham, V. Sivaramakrishnan, N. Swapna, N. Manohar
An Attention Self-Supervised Contrastive Learning Based Three-Stage Model for Hand Shape Feature Representation in Cued Speech
Jianrong Wang, Nan Gu, Mei Yu, Xuewei Li, Qiang Fang, Li Liu
Remote Smartphone-Based Speech Collection: Acceptance and Barriers in Individuals with Major Depressive Disorder
Judith Dineley, Grace Lavelle, Daniel Leightley, Faith Matcham, Sara Siddi, Maria Teresa Peñarrubia-María, Katie M. White, Alina Ivan, Carolin Oetzmann, Sara Simblett, Erin Dawe-Lane, Stuart Bruce, Daniel Stahl, Yatharth Ranjan, Zulqarnain Rashid, Pauline Conde, Amos A. Folarin, Josep Maria Haro, Til Wykes, Richard J.B. Dobson, Vaibhav A. Narayan, Matthew Hotopf, Björn W. Schuller, Nicholas Cummins, The RADAR-CNS Consortium
An Automatic, Simple Ultrasound Biofeedback Parameter for Distinguishing Accurate and Misarticulated Rhotic Syllables
Sarah R. Li, Colin T. Annand, Sarah Dugan, Sarah M. Schwab, Kathryn J. Eary, Michael Swearengen, Sarah Stack, Suzanne Boyce, Michael A. Riley, T. Douglas Mast
Silent versus Modal Multi-Speaker Speech Recognition from Ultrasound and Video
Manuel Sam Ribeiro, Aciel Eshky, Korin Richmond, Steve Renals
RaSSpeR: Radar-Based Silent Speech Recognition
David Ferreira, Samuel Silva, Francisco Curado, António Teixeira
Investigating Speech Reconstruction for Laryngectomees for Silent Speech Interfaces
Beiming Cao, Nordine Sebkhi, Arpan Bhavsar, Omer T. Inan, Robin Samlan, Ted Mau, Jun Wang
LACOPE: Latency-Constrained Pitch Estimation for Speech Enhancement
Hendrik Schröter, Tobias Rosenkranz, Alberto N. Escalante-B, Andreas Maier
Alpha-Stable Autoregressive Fast Multichannel Nonnegative Matrix Factorization for Joint Speech Enhancement and Dereverberation
Mathieu Fontaine, Kouhei Sekiguchi, Aditya Arie Nugraha, Yoshiaki Bando, Kazuyoshi Yoshii
Microphone Array Generalization for Multichannel Narrowband Deep Speech Enhancement
Siyuan Zhang, Xiaofei Li
Multiple Sound Source Localization Based on Interchannel Phase Differences in All Frequencies with Spectral Masks
Hyungchan Song, Jong Won Shin
Cancellation of Local Competing Speaker with Near-Field Localization for Distributed ad-hoc Sensor Network
Pablo Pérez Zarazaga, Mariem Bouafif Mansali, Tom Bäckström, Zied Lachiri
A Deep Learning Method to Multi-Channel Active Noise Control
Hao Zhang, DeLiang Wang
Clarity-2021 Challenges: Machine Learning Challenges for Advancing Hearing Aid Processing
Simone Graetzer, Jon Barker, Trevor J. Cox, Michael Akeroyd, John F. Culling, Graham Naylor, Eszter Porter, Rhoddy Viveros Muñoz
Optimising Hearing Aid Fittings for Speech in Noise with a Differentiable Hearing Loss Model
Zehai Tu, Ning Ma, Jon Barker
Explaining Deep Learning Models for Speech Enhancement
Sunit Sivasankaran, Emmanuel Vincent, Dominique Fohr
Minimum-Norm Differential Beamforming for Linear Array with Directional Microphones
Weilong Huang, Jinwei Feng
Improving Streaming Transformer Based ASR Under a Framework of Self-Supervised Learning
Songjun Cao, Yueteng Kang, Yanzhe Fu, Xiaoshuo Xu, Sining Sun, Yike Zhang, Long Ma
wav2vec-C: A Self-Supervised Model for Speech Representation Learning
Samik Sadhu, Di He, Che-Wei Huang, Sri Harish Mallidi, Minhua Wu, Ariya Rastrow, Andreas Stolcke, Jasha Droppo, Roland Maas
On the Learning Dynamics of Semi-Supervised Training for ASR
Electra Wallington, Benji Kershenbaum, Ondřej Klejch, Peter Bell
Robust wav2vec 2.0: Analyzing Domain Shift in Self-Supervised Pre-Training
Wei-Ning Hsu, Anuroop Sriram, Alexei Baevski, Tatiana Likhomanenko, Qiantong Xu, Vineel Pratap, Jacob Kahn, Ann Lee, Ronan Collobert, Gabriel Synnaeve, Michael Auli
Momentum Pseudo-Labeling for Semi-Supervised Speech Recognition
Yosuke Higuchi, Niko Moritz, Jonathan Le Roux, Takaaki Hori
A Comparison of Supervised and Unsupervised Pre-Training of End-to-End Models
Ananya Misra, Dongseong Hwang, Zhouyuan Huo, Shefali Garg, Nikhil Siddhartha, Arun Narayanan, Khe Chai Sim
Semi-Supervision in ASR: Sequential MixMatch and Factorized TTS-Based Augmentation
Zhehuai Chen, Andrew Rosenberg, Yu Zhang, Heiga Zen, Mohammadreza Ghodsi, Yinghui Huang, Jesse Emond, Gary Wang, Bhuvana Ramabhadran, Pedro J. Moreno
slimIPL: Language-Model-Free Iterative Pseudo-Labeling
Tatiana Likhomanenko, Qiantong Xu, Jacob Kahn, Gabriel Synnaeve, Ronan Collobert
Phonetically Motivated Self-Supervised Speech Representation Learning
Xianghu Yue, Haizhou Li
Improving RNN-T for Domain Scaling Using Semi-Supervised Training with Neural TTS
Yan Deng, Rui Zhao, Zhong Meng, Xie Chen, Bing Liu, Jinyu Li, Yifan Gong, Lei He
Speaker-Conversation Factorial Designs for Diarization Error Analysis
Scott Seyfarth, Sundararajan Srinivasan, Katrin Kirchhoff
SmallER: Scaling Neural Entity Resolution for Edge Devices
Ross McGowan, Jinru Su, Vince DiCocco, Thejaswi Muniyappa, Grant P. Strimel
Disfluency Detection with Unlabeled Data and Small BERT Models
Johann C. Rocholl, Vicky Zayats, Daniel D. Walker, Noah B. Murad, Aaron Schneider, Daniel J. Liebling
Discriminative Self-Training for Punctuation Prediction
Qian Chen, Wen Wang, Mengzhe Chen, Qinglin Zhang
Zero-Shot Joint Modeling of Multiple Spoken-Text-Style Conversion Tasks Using Switching Tokens
Mana Ihori, Naoki Makishima, Tomohiro Tanaka, Akihiko Takashima, Shota Orihashi, Ryo Masumura
A Noise Robust Method for Word-Level Pronunciation Assessment
Binghuai Lin, Liyuan Wang
Targeted Keyword Filtering for Accelerated Spoken Topic Identification
Jonathan Wintrode
Multimodal Speech Summarization Through Semantic Concept Learning
Shruti Palaskar, Ruslan Salakhutdinov, Alan W. Black, Florian Metze
Enhancing Semantic Understanding with Self-Supervised Methods for Abstractive Dialogue Summarization
Hyunjae Lee, Jaewoong Yun, Hyunjin Choi, Seongho Joe, Youngjune L. Gwon
Speaker Transition Patterns in Three-Party Conversation: Evidence from English, Estonian and Swedish
Marcin Włodarczak, Emer Gilmartin
Investigating Deep Neural Structures and their Interpretability in the Domain of Voice Conversion
Samuel J. Broughton, Md. Asif Jalal, Roger K. Moore
Limited Data Emotional Voice Conversion Leveraging Text-to-Speech: Two-Stage Sequence-to-Sequence Training
Kun Zhou, Berrak Sisman, Haizhou Li
Adversarial Voice Conversion Against Neural Spoofing Detectors
Yi-Yang Ding, Li-Juan Liu, Yu Hu, Zhen-Hua Ling
An Improved StarGAN for Emotional Voice Conversion: Enhancing Voice Quality and Data Augmentation
Xiangheng He, Junjie Chen, Georgios Rizos, Björn W. Schuller
TVQVC: Transformer Based Vector Quantized Variational Autoencoder with CTC Loss for Voice Conversion
Ziyi Chen, Pengyuan Zhang
Enriching Source Style Transfer in Recognition-Synthesis Based Non-Parallel Voice Conversion
Zhichao Wang, Xinyong Zhou, Fengyu Yang, Tao Li, Hongqiang Du, Lei Xie, Wendong Gan, Haitao Chen, Hai Li
S2VC: A Framework for Any-to-Any Voice Conversion with Self-Supervised Pretrained Representations
Jheng-hao Lin, Yist Y. Lin, Chung-Ming Chien, Hung-yi Lee
An Exemplar Selection Algorithm for Native-Nonnative Voice Conversion
Christopher Liberatore, Ricardo Gutierrez-Osuna
Adversarially Learning Disentangled Speech Representations for Robust Multi-Factor Voice Conversion
Jie Wang, Jingbei Li, Xintao Zhao, Zhiyong Wu, Shiyin Kang, Helen Meng
Many-to-Many Voice Conversion Based Feature Disentanglement Using Variational Autoencoder
Manh Luong, Viet Anh Tran
Privacy-Preserving Voice Anti-Spoofing Using Secure Multi-Party Computation
Oubaïda Chouchane, Baptiste Brossier, Jorge Esteban Gamboa Gamboa, Thomas Lardy, Hemlata Tak, Orhan Ermis, Madhu R. Kamble, Jose Patino, Nicholas Evans, Melek Önen, Massimiliano Todisco
Configurable Privacy-Preserving Automatic Speech Recognition
Ranya Aloufi, Hamed Haddadi, David Boyle
Adjunct-Emeritus Distillation for Semi-Supervised Language Model Adaptation
Scott Novotney, Yile Gu, Ivan Bulyko
Communication-Efficient Agnostic Federated Averaging
Jae Ro, Mingqing Chen, Rajiv Mathews, Mehryar Mohri, Ananda Theertha Suresh
Privacy-Preserving Feature Extraction for Cloud-Based Wake Word Verification
Timm Koppelmann, Alexandru Nelus, Lea Schönherr, Dorothea Kolossa, Rainer Martin
PATE-AAE: Incorporating Adversarial Autoencoder into Private Aggregation of Teacher Ensembles for Spoken Command Classification
Chao-Han Huck Yang, Sabato Marco Siniscalchi, Chin-Hui Lee
Continual Learning for Fake Audio Detection
Haoxin Ma, Jiangyan Yi, Jianhua Tao, Ye Bai, Zhengkun Tian, Chenglong Wang
Evaluating the Vulnerability of End-to-End Automatic Speech Recognition Models to Membership Inference Attacks
Muhammad A. Shah, Joseph Szurley, Markus Mueller, Athanasios Mouchtaris, Jasha Droppo
SynthASR: Unlocking Synthetic Data for Speech Recognition
Amin Fazel, Wei Yang, Yulan Liu, Roberto Barra-Chicote, Yixiong Meng, Roland Maas, Jasha Droppo
DiCOVA Challenge: Dataset, Task, and Baseline System for COVID-19 Diagnosis Using Acoustics
Ananya Muguli, Lancelot Pinto, Nirmala R, Neeraj Sharma, Prashant Krishnan, Prasanta Kumar Ghosh, Rohit Kumar, Shrirama Bhat, Srikanth Raj Chetupalli, Sriram Ganapathy, Shreyas Ramoji, Viral Nanda
PANACEA Cough Sound-Based Diagnosis of COVID-19 for the DiCOVA 2021 Challenge
Madhu R. Kamble, Jose A. Gonzalez-Lopez, Teresa Grau, Juan M. Espin, Lorenzo Cascioli, Yiqing Huang, Alejandro Gomez-Alanis, Jose Patino, Roberto Font, Antonio M. Peinado, Angel M. Gomez, Nicholas Evans, Maria A. Zuluaga, Massimiliano Todisco
Recognising Covid-19 from Coughing Using Ensembles of SVMs and LSTMs with Handcrafted and Deep Audio Features
Vincent Karas, Björn W. Schuller
Detecting COVID-19 from Audio Recording of Coughs Using Random Forests and Support Vector Machines
Isabella Södergren, Maryam Pahlavan Nodeh, Prakash Chandra Chhipa, Konstantina Nikolaidou, György Kovács
Diagnosis of COVID-19 Using Auditory Acoustic Cues
Rohan Kumar Das, Maulik Madhavi, Haizhou Li
Classification of COVID-19 from Cough Using Autoregressive Predictive Coding Pretraining and Spectral Data Augmentation
John Harvill, Yash R. Wani, Mark Hasegawa-Johnson, Narendra Ahuja, David Beiser, David Chestek
The DiCOVA 2021 Challenge — An Encoder-Decoder Approach for COVID-19 Recognition from Coughing Audio
Gauri Deshpande, Björn W. Schuller
COVID-19 Detection from Spectral Features on the DiCOVA Dataset
Kotra Venkata Sai Ritwik, Shareef Babu Kalluri, Deepu Vijayasenan
Cough-Based COVID-19 Detection with Contextual Attention Convolutional Neural Networks and Gender Information
Adria Mallol-Ragolta, Helena Cuesta, Emilia Gómez, Björn W. Schuller
Contrastive Learning of Cough Descriptors for Automatic COVID-19 Preliminary Diagnosis
Swapnil Bhosale, Upasana Tiwari, Rupayan Chakraborty, Sunil Kumar Kopparapu
Investigating Feature Selection and Explainability for COVID-19 Diagnostics from Cough Sounds
Flavio Avila, Amir H. Poorjam, Deepak Mittal, Charles Dognin, Ananya Muguli, Rohit Kumar, Srikanth Raj Chetupalli, Sriram Ganapathy, Maneesh Singh
Application for Detecting Depression, Parkinson’s Disease and Dysphonic Speech
Gábor Kiss, Dávid Sztahó, Miklós Gábriel Tulics
Beey: More Than a Speech-to-Text Editor
Lenka Weingartová, Veronika Volná, Ewa Balejová
Downsizing of Vocal-Tract Models to Line up Variations and Reduce Manufacturing Costs
Takayuki Arai
ROXANNE Research Platform: Automate Criminal Investigations
Maël Fabien, Shantipriya Parida, Petr Motlicek, Dawei Zhu, Aravind Krishnan, Hoang H. Nguyen
The LIUM Human Active Correction Platform for Speaker Diarization
Alexandre Flucha, Anthony Larcher, Ambuj Mehrish, Sylvain Meignier, Florian Plaut, Nicolas Poupon, Yevhenii Prokopalo, Adrien Puertolas, Meysam Shamsi, Marie Tahon
On-Device Streaming Transformer-Based End-to-End Speech Recognition
Yoo Rhee Oh, Kiyoung Park
Advanced Semi-Blind Speaker Extraction and Tracking Implemented in Experimental Device with Revolving Dense Microphone Array
J. Čmejla, T. Kounovský, J. Janský, Jiri Malek, M. Rozkovec, Z. Koldovský
Information Retrieval for ZeroSpeech 2021: The Submission by University of Wroclaw
Jan Chorowski, Grzegorz Ciesielski, Jarosław Dzikowski, Adrian Łańcucki, Ricard Marxer, Mateusz Opala, Piotr Pusz, Paweł Rychlikowski, Michał Stypułkowski
Aligned Contrastive Predictive Coding
Jan Chorowski, Grzegorz Ciesielski, Jarosław Dzikowski, Adrian Łańcucki, Ricard Marxer, Mateusz Opala, Piotr Pusz, Paweł Rychlikowski, Michał Stypułkowski
Neural Text Denormalization for Speech Transcripts
Benjamin Suter, Josef Novak
Fearless Steps Challenge Phase-3 (FSC P3): Advancing SLT for Unseen Channel and Mission Data Across NASA Apollo Audio
Aditya Joglekar, Seyed Omid Sadjadi, Meena Chandra-Shekar, Christopher Cieri, John H.L. Hansen
Voice Quality in Verbal Irony: Electroglottographic Analyses of Ironic Utterances in Standard Austrian German
Hannah Leykum
Synchronic Fortition in Five Romance Languages? A Large Corpus-Based Study of Word-Initial Devoicing
Mathilde Hutin, Yaru Wu, Adèle Jatteau, Ioana Vasilescu, Lori Lamel, Martine Adda-Decker
Glottal Stops in Upper Sorbian: A Data-Driven Approach
Ivan Kraljevski, Maria Paola Bissiri, Frank Duckhorn, Constanze Tschoepe, Matthias Wolff
Cue Interaction in the Perception of Prosodic Prominence: The Role of Voice Quality
Bogdan Ludusan, Petra Wagner, Marcin Włodarczak
Glottal Sounds in Korebaju
Jenifer Vega Rodriguez, Nathalie Vallée
Automatic Classification of Phonation Types in Spontaneous Speech: Towards a New Workflow for the Characterization of Speakers’ Voice Quality
Anaïs Chanclu, Imen Ben Amor, Cédric Gendrot, Emmanuel Ferragne, Jean-François Bonastre
Measuring Voice Quality Parameters After Speaker Pseudonymization
Rob J.J.H. van Son
Audio-Visual Recognition of Emotional Engagement of People with Dementia
Lars Steinert, Felix Putze, Dennis Küster, Tanja Schultz
Speaking Corona? Human and Machine Recognition of COVID-19 from Voice
Pascal Hecker, Florian B. Pokorny, Katrin D. Bartl-Pokorny, Uwe Reichel, Zhao Ren, Simone Hantke, Florian Eyben, Dagmar M. Schuller, Bert Arnrich, Björn W. Schuller
Acoustic-Prosodic, Lexical and Demographic Cues to Persuasiveness in Competitive Debate Speeches
Huyen Nguyen, Ralph Vente, David Lupea, Sarah Ita Levitan, Julia Hirschberg
Unsupervised Bayesian Adaptation of PLDA for Speaker Verification
Bengt J. Borgström
The DKU-Duke-Lenovo System Description for the Fearless Steps Challenge Phase III
Weiqing Wang, Danwei Cai, Jin Wang, Qingjian Lin, Xuyang Wang, Mi Hong, Ming Li
Improved Meta-Learning Training for Speaker Verification
Yafeng Chen, Wu Guo, Bin Gu
Variational Information Bottleneck Based Regularization for Speaker Recognition
Dan Wang, Yuanjie Dong, Yaxing Li, Yunfei Zi, Zhihui Zhang, Xiaoqi Li, Shengwu Xiong
Out of a Hundred Trials, How Many Errors Does Your Speaker Verifier Make?
Niko Brümmer, Luciana Ferrer, Albert Swart
SpeakerStew: Scaling to Many Languages with a Triaged Multilingual Text-Dependent and Text-Independent Speaker Verification System
Roza Chojnacka, Jason Pelecanos, Quan Wang, Ignacio Lopez Moreno
AntVoice Neural Speaker Embedding System for FFSVC 2020
Zhiming Wang, Furong Xu, Kaisheng Yao, Yuan Cheng, Tao Xiong, Huijia Zhu
Gradient Regularization for Noise-Robust Speaker Verification
Jianchen Li, Jiqing Han, Hongwei Song
Deep Feature CycleGANs: Speaker Identity Preserving Non-Parallel Microphone-Telephone Domain Adaptation for Speaker Verification
Saurabh Kataria, Jesús Villalba, Piotr Żelasko, Laureano Moro-Velázquez, Najim Dehak
Scaling Effect of Self-Supervised Speech Models
Jie Pu, Yuguang Yang, Ruirui Li, Oguz Elibol, Jasha Droppo
Joint Feature Enhancement and Speaker Recognition with Multi-Objective Task-Oriented Network
Yibo Wu, Longbiao Wang, Kong Aik Lee, Meng Liu, Jianwu Dang
Multi-Level Transfer Learning from Near-Field to Far-Field Speaker Verification
Li Zhang, Qing Wang, Kong Aik Lee, Lei Xie, Haizhou Li
Speaker Anonymisation Using the McAdams Coefficient
Jose Patino, Natalia Tomashenko, Massimiliano Todisco, Andreas Nautsch, Nicholas Evans
Multi-Stream Gated and Pyramidal Temporal Convolutional Neural Networks for Audio-Visual Speech Separation in Multi-Talker Environments
Yiyu Luo, Jing Wang, Liang Xu, Lidong Yang
TeCANet: Temporal-Contextual Attention Network for Environment-Aware Speech Dereverberation
Helin Wang, Bo Wu, Lianwu Chen, Meng Yu, Jianwei Yu, Yong Xu, Shi-Xiong Zhang, Chao Weng, Dan Su, Dong Yu
Residual Echo and Noise Cancellation with Feature Attention Module and Multi-Domain Loss Function
Jianjun Gu, Longbiao Cheng, Xingwei Sun, Junfeng Li, Yonghong Yan
MIMO Self-Attentive RNN Beamformer for Multi-Speaker Speech Separation
Xiyun Li, Yong Xu, Meng Yu, Shi-Xiong Zhang, Jiaming Xu, Bo Xu, Dong Yu
Personalized PercepNet: Real-Time, Low-Complexity Target Voice Separation and Enhancement
Ritwik Giri, Shrikant Venkataramani, Jean-Marc Valin, Umut Isik, Arvindh Krishnaswamy
Scene-Agnostic Multi-Microphone Speech Dereverberation
Yochai Yemini, Ethan Fetaya, Haggai Maron, Sharon Gannot
Manifold-Aware Deep Clustering: Maximizing Angles Between Embedding Vectors Based on Regular Simplex
Keitaro Tanaka, Ryosuke Sawata, Shusuke Takahashi
A Deep Learning Approach to Multi-Channel and Multi-Microphone Acoustic Echo Cancellation
Hao Zhang, DeLiang Wang
Joint Online Multichannel Acoustic Echo Cancellation, Speech Dereverberation and Source Separation
Yueyue Na, Ziteng Wang, Zhang Liu, Biao Tian, Qiang Fu
Should We Always Separate?: Switching Between Enhanced and Observed Signals for Overlapping Speech Recognition
Hiroshi Sato, Tsubasa Ochiai, Marc Delcroix, Keisuke Kinoshita, Takafumi Moriya, Naoyuki Kamo
Estimating Articulatory Movements in Speech Production with Transformer Networks
Sathvik Udupa, Anwesha Roy, Abhayjeet Singh, Aravind Illa, Prasanta Kumar Ghosh
Unsupervised Multi-Target Domain Adaptation for Acoustic Scene Classification
Dongchao Yang, Helin Wang, Yuexian Zou
Speech Decomposition Based on a Hybrid Speech Model and Optimal Segmentation
Alfredo Esquivel Jaramillo, Jesper Kjær Nielsen, Mads Græsbøll Christensen
Dropout Regularization for Self-Supervised Learning of Transformer Encoder Speech Representation
Jian Luo, Jianzong Wang, Ning Cheng, Jing Xiao
Noise Robust Pitch Stylization Using Minimum Mean Absolute Error Criterion
Chiranjeevi Yarra, Prasanta Kumar Ghosh
An Attribute-Aligned Strategy for Learning Speech Representation
Yu-Lin Huang, Bo-Hao Su, Y.-W. Peter Hong, Chi-Chun Lee
Raw Speech-to-Articulatory Inversion by Temporal Filtering and Decimation
Abdolreza Sabzi Shahrebabaki, Sabato Marco Siniscalchi, Torbjørn Svendsen
Unsupervised Training of a DNN-Based Formant Tracker
Jason Lilley, H. Timothy Bunnell
SUPERB: Speech Processing Universal PERformance Benchmark
Shu-wen Yang, Po-Han Chi, Yung-Sung Chuang, Cheng-I Jeff Lai, Kushal Lakhotia, Yist Y. Lin, Andy T. Liu, Jiatong Shi, Xuankai Chang, Guan-Ting Lin, Tzu-Hsien Huang, Wei-Cheng Tseng, Ko-tik Lee, Da-Rong Liu, Zili Huang, Shuyan Dong, Shang-Wen Li, Shinji Watanabe, Abdelrahman Mohamed, Hung-yi Lee
Synchronising Speech Segments with Musical Beats in Mandarin and English Singing
Cong Zhang, Jian Zhu
FRILL: A Non-Semantic Speech Embedding for Mobile Devices
Jacob Peplinski, Joel Shor, Sachin Joglekar, Jake Garrison, Shwetak Patel
Pitch Contour Separation from Overlapping Speech
Hiroki Mori
Do Sound Event Representations Generalize to Other Audio Tasks? A Case Study in Audio Transfer Learning
Anurag Kumar, Yun Wang, Vamsi Krishna Ithapu, Christian Fuegen
Data Augmentation for Spoken Language Understanding via Pretrained Language Models
Baolin Peng, Chenguang Zhu, Michael Zeng, Jianfeng Gao
FANS: Fusing ASR and NLU for On-Device SLU
Martin Radfar, Athanasios Mouchtaris, Siegfried Kunzmann, Ariya Rastrow
Sequential End-to-End Intent and Slot Label Classification and Localization
Yiran Cao, Nihal Potdar, Anderson R. Avila
DEXTER: Deep Encoding of External Knowledge for Named Entity Recognition in Virtual Assistants
Deepak Muralidharan, Joel Ruben Antony Moniz, Weicheng Zhang, Stephen Pulman, Lin Li, Megan Barnes, Jingjing Pan, Jason Williams, Alex Acero
A Context-Aware Hierarchical BERT Fusion Network for Multi-Turn Dialog Act Detection
Ting-Wei Wu, Ruolin Su, Biing-Hwang Juang
Pre-Training for Spoken Language Understanding with Joint Textual and Phonetic Representation Learning
Qian Chen, Wen Wang, Qinglin Zhang
Predicting Temporal Performance Drop of Deployed Production Spoken Language Understanding Models
Quynh Do, Judith Gaspers, Daniil Sorokin, Patrick Lehnen
Integrating Dialog History into End-to-End Spoken Language Understanding Systems
Jatin Ganhotra, Samuel Thomas, Hong-Kwang J. Kuo, Sachindra Joshi, George Saon, Zoltán Tüske, Brian Kingsbury
Coreference Augmentation for Multi-Domain Task-Oriented Dialogue State Tracking
Ting Han, Chongxuan Huang, Wei Peng
Rethinking End-to-End Evaluation of Decomposable Tasks: A Case Study on Spoken Language Understanding
Siddhant Arora, Alissa Ostapenko, Vijay Viswanathan, Siddharth Dalmia, Florian Metze, Shinji Watanabe, Alan W. Black
Semantic Data Augmentation for End-to-End Mandarin Speech Recognition
Jianwei Sun, Zhiyuan Tang, Hengxin Yin, Wei Wang, Xi Zhao, Shuaijiang Zhao, Xiaoning Lei, Wei Zou, Xiangang Li
Layer-Wise Fast Adaptation for End-to-End Multi-Accent Speech Recognition
Xun Gong, Yizhou Lu, Zhikai Zhou, Yanmin Qian
Low Resource German ASR with Untranscribed Data Spoken by Non-Native Children — INTERSPEECH 2021 Shared Task SPAPL System
Jinhan Wang, Yunzheng Zhu, Ruchao Fan, Wei Chu, Abeer Alwan
Robust Continuous On-Device Personalization for Automatic Speech Recognition
Khe Chai Sim, Angad Chandorkar, Fan Gao, Mason Chua, Tsendsuren Munkhdalai, Françoise Beaufays
Speaker Normalization Using Joint Variational Autoencoder
Shashi Kumar, Shakti P. Rath, Abhishek Pandey
The TAL System for the INTERSPEECH2021 Shared Task on Automatic Speech Recognition for Non-Native Childrens Speech
Gaopeng Xu, Song Yang, Lu Ma, Chengfei Li, Zhongqin Wu
On-the-Fly Aligned Data Augmentation for Sequence-to-Sequence ASR
Tsz Kin Lam, Mayumi Ohta, Shigehiko Schamoni, Stefan Riezler
Zero-Shot Cross-Lingual Phonetic Recognition with External Language Embedding
Heting Gao, Junrui Ni, Yang Zhang, Kaizhi Qian, Shiyu Chang, Mark Hasegawa-Johnson
Rapid Speaker Adaptation for Conformer Transducer: Attention and Bias Are All You Need
Yan Huang, Guoli Ye, Jinyu Li, Yifan Gong
Best of Both Worlds: Robust Accented Speech Recognition with Adversarial Transfer Learning
Nilaksh Das, Sravan Bodapati, Monica Sunkara, Sundararajan Srinivasan, Duen Horng Chau
Extending Pronunciation Dictionary with Automatically Detected Word Mispronunciations to Improve PAII’s System for Interspeech 2021 Non-Native Child English Close Track ASR Challenge
Wei Chu, Peng Chang, Jing Xiao
CVC: Contrastive Learning for Non-Parallel Voice Conversion
Tingle Li, Yichen Liu, Chenxu Hu, Hang Zhao
A Preliminary Study of a Two-Stage Paradigm for Preserving Speaker Identity in Dysarthric Voice Conversion
Wen-Chin Huang, Kazuhiro Kobayashi, Yu-Huai Peng, Ching-Feng Liu, Yu Tsao, Hsin-Min Wang, Tomoki Toda
One-Shot Voice Conversion with Speaker-Agnostic StarGAN
Sefik Emre Eskimez, Dimitrios Dimitriadis, Kenichi Kumatani, Robert Gmyr
Fine-Tuning Pre-Trained Voice Conversion Model for Adding New Target Speakers with Limited Data
Takeshi Koshizuka, Hidefumi Ohmura, Kouichi Katsurada
VQMIVC: Vector Quantization and Mutual Information-Based Unsupervised Speech Representation Disentanglement for One-Shot Voice Conversion
Disong Wang, Liqun Deng, Yu Ting Yeung, Xiao Chen, Xunying Liu, Helen Meng
StarGANv2-VC: A Diverse, Unsupervised, Non-Parallel Framework for Natural-Sounding Voice Conversion
Yinghao Aaron Li, Ali Zare, Nima Mesgarani
Normalization Driven Zero-Shot Multi-Speaker Speech Synthesis
Neeraj Kumar, Srishti Goel, Ankur Narang, Brejesh Lall
StarGAN-VC+ASR: StarGAN-Based Non-Parallel Voice Conversion Regularized by Automatic Speech Recognition
Shoki Sakamoto, Akira Taniguchi, Tadahiro Taniguchi, Hirokazu Kameoka
Two-Pathway Style Embedding for Arbitrary Voice Conversion
Xuexin Xu, Liang Shi, Jinhui Chen, Xunquan Chen, Jie Lian, Pingyuan Lin, Zhihong Zhang, Edwin R. Hancock
Non-Parallel Any-to-Many Voice Conversion by Replacing Speaker Statistics
Yufei Liu, Chengzhu Yu, Wang Shuai, Zhenchuan Yang, Yang Chao, Weibin Zhang
Cross-Lingual Voice Conversion with a Cycle Consistency Loss on Linguistic Representation
Yi Zhou, Xiaohai Tian, Zhizheng Wu, Haizhou Li
Improving Robustness of One-Shot Voice Conversion with Deep Discriminative Speaker Encoder
Hongqiang Du, Lei Xie
Optimizing an Automatic Creaky Voice Detection Method for Australian English Speaking Females
Hannah White, Joshua Penney, Andy Gibson, Anita Szakay, Felicity Cox
A Comparison of Acoustic Correlates of Voice Quality Across Different Recording Devices: A Cautionary Tale
Joshua Penney, Andy Gibson, Felicity Cox, Michael Proctor, Anita Szakay
Investigating Voice Function Characteristics of Greek Speakers with Hearing Loss Using Automatic Glottal Source Feature Extraction
Anna Sfakianaki, George P. Kafentzis
Automated Detection of Voice Disorder in the Saarbrücken Voice Database: Effects of Pathology Subset and Audio Materials
Mark Huckvale, Catinca Buciuleac
Accelerometer-Based Measurements of Voice Quality in Children During Semi-Occluded Vocal Tract Exercise with a Narrow Straw in Air
Steven M. Lulich, Rita R. Patel
Articulatory Coordination for Speech Motor Tracking in Huntington Disease
Matthew Perez, Amrit Romana, Angela Roberts, Noelle Carlozzi, Jennifer Ann Miner, Praveen Dayalu, Emily Mower Provost
Modeling Dysphonia Severity as a Function of Roughness and Breathiness Ratings in the GRBAS Scale
Carlos A. Ferrer, Efren Aragón, María E. Hdez-Díaz, Marc S. de Bodt, Roman Cmejla, Marina Englert, Mara Behlau, Elmar Nöth
Golos: Russian Dataset for Speech Research
Nikolay Karpov, Alexander Denisenko, Fedor Minkin
Radically Old Way of Computing Spectra: Applications in End-to-End ASR
Samik Sadhu, Hynek Hermansky
Self-Supervised End-to-End ASR for Low Resource L2 Swedish
Ragheb Al-Ghezi, Yaroslav Getman, Aku Rouhe, Raili Hildén, Mikko Kurimo
SPGISpeech: 5,000 Hours of Transcribed Financial Audio for Fully Formatted End-to-End Speech Recognition
Patrick K. O’Neill, Vitaly Lavrukhin, Somshubra Majumdar, Vahid Noroozi, Yuekai Zhang, Oleksii Kuchaiev, Jagadeesh Balam, Yuliya Dovzhenko, Keenan Freyberg, Michael D. Shulman, Boris Ginsburg, Shinji Watanabe, Georg Kucsko
LeBenchmark: A Reproducible Framework for Assessing Self-Supervised Representation Learning from Speech
Solène Evain, Ha Nguyen, Hang Le, Marcely Zanon Boito, Salima Mdhaffar, Sina Alisamir, Ziyi Tong, Natalia Tomashenko, Marco Dinarelli, Titouan Parcollet, Alexandre Allauzen, Yannick Estève, Benjamin Lecouteux, François Portet, Solange Rossato, Fabien Ringeval, Didier Schwab, Laurent Besacier
Prosodic Accommodation in Face-to-Face and Telephone Dialogues
Pavel Šturm, Radek Skarnitzl, Tomáš Nechanský
Dialect Features in Heterogeneous and Homogeneous Gheg Speaking Communities
Josiane Riverin-Coutlée, Conceição Cunha, Enkeleida Kapia, Jonathan Harrington
An Exploration of the Acoustic Space of Rhotics and Laterals in Ruruuli
Margaret Zellers, Alena Witzlack-Makarevich, Lilja Saeboe, Saudah Namyalo
Domain-Initial Strengthening in Turkish: Acoustic Cues to Prosodic Hierarchy in Stop Consonants
Kubra Bodur, Sweeney Branje, Morgane Peirolo, Ingrid Tiscareno, James S. German
Auxiliary Loss Function for Target Speech Extraction and Recognition with Weak Supervision Based on Speaker Characteristics
Katerina Zmolikova, Marc Delcroix, Desh Raj, Shinji Watanabe, Jan Černocký
Universal Speaker Extraction in the Presence and Absence of Target Speakers for Speech of One and Two Talkers
Marvin Borsdorf, Chenglin Xu, Haizhou Li, Tanja Schultz
Using X-Vectors for Speech Activity Detection in Broadcast Streams
Lukas Mateju, Frantisek Kynych, Petr Cerva, Jindrich Zdansky, Jiri Malek
Time Delay Estimation for Speaker Localization Using CNN-Based Parametrized GCC-PHAT Features
Daniele Salvati, Carlo Drioli, Gian Luca Foresti
Real-Time Speaker Counting in a Cocktail Party Scenario Using Attention-Guided Convolutional Neural Network
Midia Yousefi, John H.L. Hansen
End-to-End Language Diarization for Bilingual Code-Switching Speech
Hexin Liu, Leibny Paola García Perera, Xinyi Zhang, Justin Dauwels, Andy W.H. Khong, Sanjeev Khudanpur, Suzy J. Styles
Modeling and Training Strategies for Language Recognition Systems
Raphaël Duroselle, Md. Sahidullah, Denis Jouvet, Irina Illina
A Weight Moving Average Based Alternate Decoupled Learning Algorithm for Long-Tailed Language Identification
Hui Wang, Lin Liu, Yan Song, Lei Fang, Ian McLoughlin, Li-Rong Dai
Improving Accent Identification and Accented Speech Recognition Under a Framework of Self-Supervised Learning
Keqi Deng, Songjun Cao, Long Ma
Exploring wav2vec 2.0 on Speaker Verification and Language Identification
Zhiyun Fan, Meng Li, Shiyu Zhou, Bo Xu
Self-Supervised Phonotactic Representations for Language Identification
G. Ramesh, C. Shiva Kumar, K. Sri Rama Murty
E2E-Based Multi-Task Learning Approach to Joint Speech and Accent Recognition
Jicheng Zhang, Yizhou Peng, Van Tung Pham, Haihua Xu, Hao Huang, Eng Siong Chng
Excitation Source Feature Based Dialect Identification in Ao — A Low Resource Language
Moakala Tzudir, Shikha Baghel, Priyankoo Sarmah, S.R. Mahadeva Prasanna
Low Resource ASR: The Surprising Effectiveness of High Resource Transliteration
Shreya Khare, Ashish Mittal, Anuj Diwan, Sunita Sarawagi, Preethi Jyothi, Samarth Bharadwaj
Unsupervised Acoustic Unit Discovery by Leveraging a Language-Independent Subword Discriminative Feature Representation
Siyuan Feng, Piotr Żelasko, Laureano Moro-Velázquez, Odette Scharenborg
Towards Unsupervised Phone and Word Segmentation Using Self-Supervised Vector-Quantized Neural Networks
Herman Kamper, Benjamin van Niekerk
Speech SimCLR: Combining Contrastive and Reconstruction Objective for Self-Supervised Speech Representation Learning
Dongwei Jiang, Wubo Li, Miao Cao, Wei Zou, Xiangang Li
Multilingual Transfer of Acoustic Word Embeddings Improves When Training on Languages Related to the Target Zero-Resource Language
Christiaan Jacobs, Herman Kamper
Analyzing Speaker Information in Self-Supervised Models to Improve Zero-Resource Speech Processing
Benjamin van Niekerk, Leanne Nortje, Matthew Baas, Herman Kamper
Unsupervised Neural-Based Graph Clustering for Variable-Length Speech Representation Discovery of Zero-Resource Languages
Shun Takahashi, Sakriani Sakti, Satoshi Nakamura
Speech Representation Learning Combining Conformer CPC with Deep Cluster for the ZeroSpeech Challenge 2021
Takashi Maekaku, Xuankai Chang, Yuya Fujita, Li-Wei Chen, Shinji Watanabe, Alexander Rudnicky
Identifying Indicators of Vulnerability from Short Speech Segments Using Acoustic and Textual Features
Xia Cui, Amila Gamage, Terry Hanley, Tingting Mu
The Zero Resource Speech Challenge 2021: Spoken Language Modelling
Ewan Dunbar, Mathieu Bernard, Nicolas Hamilakis, Tu Anh Nguyen, Maureen de Seyssel, Patricia Rozé, Morgane Rivière, Eugene Kharitonov, Emmanuel Dupoux
Zero-Shot Federated Learning with New Classes for Audio Classification
Gautham Krishna Gudur, Satheesh Kumar Perepu
AVLnet: Learning Audio-Visual Language Representations from Instructional Videos
Andrew Rouditchenko, Angie Boggust, David Harwath, Brian Chen, Dhiraj Joshi, Samuel Thomas, Kartik Audhkhasi, Hilde Kuehne, Rameswar Panda, Rogerio Feris, Brian Kingsbury, Michael Picheny, Antonio Torralba, James Glass
N-Singer: A Non-Autoregressive Korean Singing Voice Synthesis System for Pronunciation Enhancement
Gyeong-Hoon Lee, Tae-Woo Kim, Hanbin Bae, Min-Ji Lee, Young-Ik Kim, Hoon-Young Cho
Cross-Lingual Low Resource Speaker Adaptation Using Phonological Features
Georgia Maniati, Nikolaos Ellinas, Konstantinos Markopoulos, Georgios Vamvoukakis, June Sig Sung, Hyoungmin Park, Aimilios Chalamandaris, Pirros Tsiakoulis
Improve Cross-Lingual Text-To-Speech Synthesis on Monolingual Corpora with Pitch Contour Information
Haoyue Zhan, Haitong Zhang, Wenjie Ou, Yue Lin
Cross-Lingual Voice Conversion with Disentangled Universal Linguistic Representations
Zhenchuan Yang, Weibin Zhang, Yufei Liu, Xiaofen Xing
EfficientSing: A Chinese Singing Voice Synthesis System Using Duration-Free Acoustic Model and HiFi-GAN Vocoder
Zhengchen Liu, Chenfeng Miao, Qingying Zhu, Minchuan Chen, Jun Ma, Shaojun Wang, Jing Xiao
Cross-Lingual Speaker Adaptation Using Domain Adaptation and Speaker Consistency Loss for Text-To-Speech Synthesis
Detai Xin, Yuki Saito, Shinnosuke Takamichi, Tomoki Koriyama, Hiroshi Saruwatari
Incorporating Cross-Speaker Style Transfer for Multi-Language Text-to-Speech
Zengqiang Shang, Zhihua Huang, Haozhe Zhang, Pengyuan Zhang, Yonghong Yan
Investigating Contributions of Speech and Facial Landmarks for Talking Head Generation
Ege Kesim, Engin Erzin
Speech2Video: Cross-Modal Distillation for Speech to Video Generation
Shijing Si, Jianzong Wang, Xiaoyang Qu, Ning Cheng, Wenqi Wei, Xinghua Zhu, Jing Xiao
NU-Wave: A Diffusion Probabilistic Model for Neural Audio Upsampling
Junhyeok Lee, Seungu Han
QISTA-Net-Audio: Audio Super-Resolution via Non-Convex ℓ_q-Norm Minimization
Gang-Xuan Lin, Shih-Wei Hu, Yen-Ju Lu, Yu Tsao, Chun-Shien Lu
X-net: A Joint Scale Down and Scale Up Method for Voice Call
Liang Wen, Lizhong Wang, Xue Wen, Yuxing Zheng, Youngo Park, Kwang Pyo Choi
WSRGlow: A Glow-Based Waveform Generative Model for Audio Super-Resolution
Kexun Zhang, Yi Ren, Changliang Xu, Zhou Zhao
Half-Truth: A Partially Fake Audio Detection Dataset
Jiangyan Yi, Ye Bai, Jianhua Tao, Haoxin Ma, Zhengkun Tian, Chenglong Wang, Tao Wang, Ruibo Fu
Data Quality as Predictor of Voice Anti-Spoofing Generalization
Bhusan Chettri, Rosa González Hautamäki, Md. Sahidullah, Tomi Kinnunen
Coded Speech Enhancement Using Neural Network-Based Vector-Quantized Residual Features
Youngju Cheon, Soojoong Hwang, Sangwook Han, Inseon Jang, Jong Won Shin
Multi-Channel Opus Compression for Far-Field Automatic Speech Recognition with a Fixed Bitrate Budget
Lukas Drude, Jahn Heymann, Andreas Schwarz, Jean-Marc Valin
Effects of Prosodic Variations on Accidental Triggers of a Commercial Voice Assistant
Ingo Siegert
Improving the Expressiveness of Neural Vocoding with Non-Affine Normalizing Flows
Adam Gabryś, Yunlong Jiao, Viacheslav Klimkov, Daniel Korzekwa, Roberto Barra-Chicote
Voice Privacy Through x-Vector and CycleGAN-Based Anonymization
Gauri P. Prajapati, Dipesh K. Singh, Preet P. Amin, Hemant A. Patil
A Two-Stage Approach to Speech Bandwidth Extension
Ju Lin, Yun Wang, Kaustubh Kalgaonkar, Gil Keren, Didi Zhang, Christian Fuegen
Development of a Psychoacoustic Loss Function for the Deep Neural Network (DNN)-Based Speech Coder
Joon Byun, Seungmin Shin, Youngcheol Park, Jongmo Sung, Seungkwon Beack
Protecting Gender and Identity with Disentangled Speech Representations
Dimitrios Stoidis, Andrea Cavallaro
Perception of Standard Arabic Synthetic Speech Rate
Yahya Aldholmi, Rawan Aldhafyan, Asma Alqahtani
The Influence of Parallel Processing on Illusory Vowels
Takeshi Kishiyama
Exploring the Potential of Lexical Paraphrases for Mitigating Noise-Induced Comprehension Errors
Anupama Chingacham, Vera Demberg, Dietrich Klakow
SpeechAdjuster: A Tool for Investigating Listener Preferences and Speech Intelligibility
Olympia Simantiraki, Martin Cooke
VocalTurk: Exploring Feasibility of Crowdsourced Speaker Identification
Susumu Saito, Yuta Ide, Teppei Nakano, Tetsuji Ogawa
Effects of Aging and Age-Related Hearing Loss on Talker Discrimination
Min Xu, Jing Shao, Lan Wang
Relationships Between Perceptual Distinctiveness, Articulatory Complexity and Functional Load in Speech Communication
Yuqing Zhang, Zhu Li, Bin Wu, Yanlu Xie, Binghuai Lin, Jinsong Zhang
Human Spoofing Detection Performance on Degraded Speech
Camryn Terblanche, Philip Harrison, Amelia J. Gully
Reliable Estimates of Interpretable Cue Effects with Active Learning in Psycholinguistic Research
Marieke Einfeldt, Rita Sevastjanova, Katharina Zahner-Ritter, Ekaterina Kazak, Bettina Braun
Towards the Explainability of Multimodal Speech Emotion Recognition
Puneet Kumar, Vishesh Kaushik, Balasubramanian Raman
Primacy of Mouth over Eyes: Eye Movement Evidence from Audiovisual Mandarin Lexical Tones and Vowels
Biao Zeng, Rui Wang, Guoxing Yu, Christian Dobel
Investigating the Impact of Spectral and Temporal Degradation on End-to-End Automatic Speech Recognition Performance
Takanori Ashihara, Takafumi Moriya, Makio Kashino
Super-Human Performance in Online Low-Latency Recognition of Conversational Speech
Thai-Son Nguyen, Sebastian Stüker, Alex Waibel
Multiple Softmax Architecture for Streaming Multilingual End-to-End ASR Systems
Vikas Joshi, Amit Das, Eric Sun, Rupesh R. Mehta, Jinyu Li, Yifan Gong
Contextualized Streaming End-to-End Speech Recognition with Trie-Based Deep Biasing and Shallow Fusion
Duc Le, Mahaveer Jain, Gil Keren, Suyoun Kim, Yangyang Shi, Jay Mahadeokar, Julian Chan, Yuan Shangguan, Christian Fuegen, Ozlem Kalinli, Yatharth Saraf, Michael L. Seltzer
An Efficient Streaming Non-Recurrent On-Device End-to-End Model with Improvements to Rare-Word Modeling
Tara N. Sainath, Yanzhang He, Arun Narayanan, Rami Botros, Ruoming Pang, David Rybach, Cyril Allauzen, Ehsan Variani, James Qin, Quoc-Nam Le-The, Shuo-Yiin Chang, Bo Li, Anmol Gulati, Jiahui Yu, Chung-Cheng Chiu, Diamantino Caseiro, Wei Li, Qiao Liang, Pat Rondon
Streaming Multi-Talker Speech Recognition with Joint Speaker Identification
Liang Lu, Naoyuki Kanda, Jinyu Li, Yifan Gong
Streaming End-to-End Speech Recognition for Hybrid RNN-T/Attention Architecture
Takafumi Moriya, Tomohiro Tanaka, Takanori Ashihara, Tsubasa Ochiai, Hiroshi Sato, Atsushi Ando, Ryo Masumura, Marc Delcroix, Taichi Asami
Improving RNN-T ASR Accuracy Using Context Audio
Andreas Schwarz, Ilya Sklyar, Simon Wiesler
HMM-Free Encoder Pre-Training for Streaming RNN Transducer
Lu Huang, Jingyu Sun, Yufeng Tang, Junfeng Hou, Jinkun Chen, Jun Zhang, Zejun Ma
Reducing Exposure Bias in Training Recurrent Neural Network Transducers
Xiaodong Cui, Brian Kingsbury, George Saon, David Haws, Zoltán Tüske
Bridging the Gap Between Streaming and Non-Streaming ASR Systems by Distilling Ensembles of CTC and RNN-T Models
Thibault Doutre, Wei Han, Chung-Cheng Chiu, Ruoming Pang, Olivier Siohan, Liangliang Cao
Mixture Model Attention: Flexible Streaming and Non-Streaming Automatic Speech Recognition
Kartik Audhkhasi, Tongzhou Chen, Bhuvana Ramabhadran, Pedro J. Moreno
StableEmit: Selection Probability Discount for Reducing Emission Latency of Streaming Monotonic Attention ASR
Hirofumi Inaguma, Tatsuya Kawahara
Dual Causal/Non-Causal Self-Attention for Streaming End-to-End Speech Recognition
Niko Moritz, Takaaki Hori, Jonathan Le Roux
Multi-Mode Transformer Transducer with Stochastic Future Context
Kwangyoun Kim, Felix Wu, Prashant Sridhar, Kyu J. Han, Shinji Watanabe
A Causal U-Net Based Neural Beamforming Network for Real-Time Multi-Channel Speech Enhancement
Xinlei Ren, Xu Zhang, Lianwu Chen, Xiguang Zheng, Chen Zhang, Liang Guo, Bing Yu
A Partitioned-Block Frequency-Domain Adaptive Kalman Filter for Stereophonic Acoustic Echo Cancellation
Rui Zhu, Feiran Yang, Yuepeng Li, Shidong Shang
Real-Time Independent Vector Analysis Using Semi-Supervised Nonnegative Matrix Factorization as a Source Model
Taihui Wang, Feiran Yang, Rui Zhu, Jun Yang
Improving Channel Decorrelation for Multi-Channel Target Speech Extraction
Jiangyu Han, Wei Rao, Yannan Wang, Yanhua Long
Inplace Gated Convolutional Recurrent Neural Network for Dual-Channel Speech Enhancement
Jinjiang Liu, Xueliang Zhang
SRIB-LEAP Submission to Far-Field Multi-Channel Speech Enhancement Challenge for Video Conferencing
R.G. Prithvi Raj, Rohit Kumar, M.K. Jayesh, Anurenjan Purushothaman, Sriram Ganapathy, M.A. Basha Shaik
Real-Time Multi-Channel Speech Enhancement Based on Neural Network Masking with Attention Model
Cheng Xue, Weilong Huang, Weiguang Chen, Jinwei Feng
BERT-Based Semantic Model for Rescoring N-Best Speech Recognition List
Dominique Fohr, Irina Illina
Text Augmentation for Language Models in High Error Recognition Scenario
Karel Beneš, Lukáš Burget
On Sampling-Based Training Criteria for Neural Language Modeling
Yingbo Gao, David Thulke, Alexander Gerstenberger, Khoa Viet Tran, Ralf Schlüter, Hermann Ney
Fast Text-Only Domain Adaptation of RNN-Transducer Prediction Network
Janne Pylkkönen, Antti Ukkonen, Juho Kilpikoski, Samu Tamminen, Hannes Heikinheimo
Using Games to Augment Corpora for Language Recognition and Confusability
Christopher Cieri, James Fiumara, Jonathan Wright
Fair Voice Biometrics: Impact of Demographic Imbalance on Group Fairness in Speaker Recognition
Gianni Fenu, Mirko Marras, Giacomo Medda, Giacomo Meloni
Knowledge Distillation from Multi-Modality to Single-Modality for Person Verification
Leying Zhang, Zhengyang Chen, Yanmin Qian
Adversarial Disentanglement of Speaker Representation for Attribute-Driven Privacy Preservation
Paul-Gauthier Noé, Mohammad Mohammadamini, Driss Matrouf, Titouan Parcollet, Andreas Nautsch, Jean-François Bonastre
Automatically Detecting Errors and Disfluencies in Read Speech to Predict Cognitive Impairment in People with Parkinson’s Disease
Amrit Romana, John Bandon, Matthew Perez, Stephanie Gutierrez, Richard Richter, Angela Roberts, Emily Mower Provost
Automatic Extraction of Speech Rhythm Descriptors for Speech Intelligibility Assessment in the Context of Head and Neck Cancers
Robin Vaysse, Jérôme Farinas, Corine Astésano, Régine André-Obrecht
Speech Disorder Classification Using Extended Factorized Hierarchical Variational Auto-Encoders
Jinzi Qi, Hugo Van hamme
The Impact of Forced-Alignment Errors on Automatic Pronunciation Evaluation
Vikram C. Mathad, Tristan J. Mahr, Nancy Scherer, Kathy Chapman, Katherine C. Hustad, Julie Liss, Visar Berisha
Late Fusion of the Available Lexicon and Raw Waveform-Based Acoustic Modeling for Depression and Dementia Recognition
Esaú Villatoro-Tello, S. Pavankumar Dubagunta, Julian Fritsch, Gabriela Ramírez-de-la-Rosa, Petr Motlicek, Mathew Magimai-Doss
Neural Speaker Embeddings for Ultrasound-Based Silent Speech Interfaces
Amin Honarmandi Shandiz, László Tóth, Gábor Gosztolya, Alexandra Markó, Tamás Gábor Csapó
Cross-Modal Learning for Audio-Visual Video Parsing
Jatin Lamba, Abhishek, Jayaprakash Akula, Rishabh Dabral, Preethi Jyothi, Ganesh Ramakrishnan
A Psychology-Driven Computational Analysis of Political Interviews
Darren Cook, Miri Zilka, Simon Maskell, Laurence Alison
Speech Emotion Recognition Based on Attention Weight Correction Using Word-Level Confidence Measure
Jennifer Santoso, Takeshi Yamada, Shoji Makino, Kenkichi Ishizuka, Takekatsu Hiramura
Effects of Voice Type and Task on L2 Learners’ Awareness of Pronunciation Errors
Alif Silpachai, Ivana Rehman, Taylor Anne Barriuso, John Levis, Evgeny Chukharev-Hudilainen, Guanlong Zhao, Ricardo Gutierrez-Osuna
Lexical Entrainment and Intra-Speaker Variability in Cooperative Dialogues
Alla Menshikova, Daniil Kocharov, Tatiana Kachkovskaia
Detecting Alzheimer’s Disease Using Interactional and Acoustic Features from Spontaneous Speech
Shamila Nasreen, Julian Hough, Matthew Purver
Investigating the Interplay Between Affective, Phonatory and Motoric Subsystems in Autism Spectrum Disorder Using a Multimodal Dialogue Agent
Hardik Kothare, Vikram Ramanarayanan, Oliver Roesler, Michael Neumann, Jackson Liscombe, William Burke, Andrew Cornish, Doug Habberstad, Alaa Sakallah, Sara Markuson, Seemran Kansara, Afik Faerman, Yasmine Bensidi-Slimane, Laura Fry, Saige Portera, David Suendermann-Oeft, David Pautler, Carly Demopoulos
Analysis of Eye Gaze Reasons and Gaze Aversions During Three-Party Conversations
Carlos Toshinori Ishi, Taiken Shintani
Semantic Distance: A New Metric for ASR Performance Analysis Towards Spoken Language Understanding
Suyoun Kim, Abhinav Arora, Duc Le, Ching-Feng Yeh, Christian Fuegen, Ozlem Kalinli, Michael L. Seltzer
A Light-Weight Contextual Spelling Correction Model for Customizing Transducer-Based Speech Recognition Systems
Xiaoqiang Wang, Yanqing Liu, Sheng Zhao, Jinyu Li
Incorporating External POS Tagger for Punctuation Restoration
Ning Shi, Wei Wang, Boxin Wang, Jinfeng Li, Xiangyu Liu, Zhouhan Lin
Phonetically Induced Subwords for End-to-End Speech Recognition
Vasileios Papadourakis, Markus Müller, Jing Liu, Athanasios Mouchtaris, Maurizio Omologo
Revisiting Parity of Human vs. Machine Conversational Speech Transcription
Courtney Mansfield, Sara Ng, Gina-Anne Levow, Richard A. Wright, Mari Ostendorf
Lookup-Table Recurrent Language Models for Long Tail Speech Recognition
W. Ronny Huang, Tara N. Sainath, Cal Peyser, Shankar Kumar, David Rybach, Trevor Strohman
Contextual Density Ratio for Language Model Biasing of Sequence to Sequence ASR Systems
Jesús Andrés-Ferrer, Dario Albesano, Puming Zhan, Paul Vozila
Token-Level Supervised Contrastive Learning for Punctuation Restoration
Qiushi Huang, Tom Ko, H. Lilian Tang, Xubo Liu, Bo Wu
BART Based Semantic Correction for Mandarin Automatic Speech Recognition System
Yun Zhao, Xuerui Yang, Jinchao Wang, Yongyu Gao, Chao Yan, Yuanfu Zhou
Class-Based Neural Network Language Model for Second-Pass Rescoring in ASR
Lingfeng Dai, Qi Liu, Kai Yu
Improving Customization of Neural Transducers by Mitigating Acoustic Mismatch of Synthesized Audio
Gakuto Kurata, George Saon, Brian Kingsbury, David Haws, Zoltán Tüske
A Discriminative Entity-Aware Language Model for Virtual Assistants
Mandana Saebi, Ernest Pusateri, Aaksha Meghawat, Christophe Van Gysel
Correcting Automated and Manual Speech Transcription Errors Using Warped Language Models
Mahdi Namazifar, John Malik, Li Erran Li, Gokhan Tur, Dilek Hakkani Tür
Dynamic Encoder Transducer: A Flexible Solution for Trading Off Accuracy for Latency
Yangyang Shi, Varun Nagaraja, Chunyang Wu, Jay Mahadeokar, Duc Le, Rohit Prabhavalkar, Alex Xiao, Ching-Feng Yeh, Julian Chan, Christian Fuegen, Ozlem Kalinli, Michael L. Seltzer
Domain-Aware Self-Attention for Multi-Domain Neural Machine Translation
Shiqi Zhang, Yan Liu, Deyi Xiong, Pei Zhang, Boxing Chen
Librispeech Transducer Model with Internal Language Model Prior Correction
Albert Zeyer, André Merboldt, Wilfried Michel, Ralf Schlüter, Hermann Ney
A Deliberation-Based Joint Acoustic and Text Decoder
Sepand Mavandadi, Tara N. Sainath, Ke Hu, Zelin Wu
On the Limit of English Conversational Speech Recognition
Zoltán Tüske, George Saon, Brian Kingsbury
Deformable TDNN with Adaptive Receptive Fields for Speech Recognition
Keyu An, Yi Zhang, Zhijian Ou
SpeechMoE: Scaling to Large Acoustic Models with Dynamic Routing Mixture of Experts
Zhao You, Shulin Feng, Dan Su, Dong Yu
Online Compressive Transformer for End-to-End Speech Recognition
Chi-Hang Leong, Yu-Han Huang, Jen-Tzung Chien
End to End Transformer-Based Contextual Speech Recognition Based on Pointer Network
Binghuai Lin, Liyuan Wang
A Comparative Study on Neural Architectures and Training Methods for Japanese Speech Recognition
Shigeki Karita, Yotaro Kubo, Michiel Adriaan Unico Bacchiani, Llion Jones
Advanced Long-Context End-to-End Speech Recognition Using Context-Expanded Transformers
Takaaki Hori, Niko Moritz, Chiori Hori, Jonathan Le Roux
Transformer-Based ASR Incorporating Time-Reduction Layer and Fine-Tuning with Self-Knowledge Distillation
Md. Akmal Haidar, Chao Xing, Mehdi Rezagholizadeh
Flexi-Transducer: Optimizing Latency, Accuracy and Compute for Multi-Domain On-Device Scenarios
Jay Mahadeokar, Yangyang Shi, Yuan Shangguan, Chunyang Wu, Alex Xiao, Hang Su, Duc Le, Ozlem Kalinli, Christian Fuegen, Michael L. Seltzer
Difference in Perceived Speech Signal Quality Assessment Among Monolingual and Bilingual Teenage Students
Przemyslaw Falkowski-Gilski
PILOT: Introducing Transformers for Probabilistic Sound Event Localization
Christopher Schymura, Benedikt Bönninghoff, Tsubasa Ochiai, Marc Delcroix, Keisuke Kinoshita, Tomohiro Nakatani, Shoko Araki, Dorothea Kolossa
Sound Source Localization with Majorization Minimization
Masahito Togami, Robin Scheibler
NISQA: A Deep CNN-Self-Attention Model for Multidimensional Speech Quality Prediction with Crowdsourced Datasets
Gabriel Mittag, Babak Naderi, Assmaa Chehadi, Sebastian Möller
Subjective Evaluation of Noise Suppression Algorithms in Crowdsourcing
Babak Naderi, Ross Cutler
Reliable Intensity Vector Selection for Multi-Source Direction-of-Arrival Estimation Using a Single Acoustic Vector Sensor
Jianhua Geng, Sifan Wang, Juan Li, JingWei Li, Xin Lou
MetricNet: Towards Improved Modeling for Non-Intrusive Speech Quality Assessment
Meng Yu, Chunlei Zhang, Yong Xu, Shi-Xiong Zhang, Dong Yu
CNN-Based Processing of Acoustic and Radio Frequency Signals for Speaker Localization from MAVs
Andrea Toma, Daniele Salvati, Carlo Drioli, Gian Luca Foresti
Assessment of von Mises-Bernoulli Deep Neural Network in Sound Source Localization
Katsutoshi Itoyama, Yoshiya Morimoto, Shungo Masaki, Ryosuke Kojima, Kenji Nishida, Kazuhiro Nakadai
Feature Fusion by Attention Networks for Robust DOA Estimation
Rongliang Liu, Nengheng Zheng, Xi Chen
Far-Field Speaker Localization and Adaptive GLMB Tracking
Shoufeng Lin, Zhaojie Luo
On the Design of Deep Priors for Unsupervised Audio Restoration
Vivek Sivaraman Narayanaswamy, Jayaraman J. Thiagarajan, Andreas Spanias
Cramér-Rao Lower Bound for DOA Estimation with an Array of Directional Microphones in Reverberant Environments
Weiguang Chen, Cheng Xue, Xionghu Zhong
GAN Vocoder: Multi-Resolution Discriminator Is All You Need
Jaeseong You, Dalhyun Kim, Gyuhyeon Nam, Geumbyeol Hwang, Gyeongsu Chae
Glow-WaveGAN: Learning Speech Representations from GAN-Based Variational Auto-Encoder for High Fidelity Flow-Based Speech Synthesis
Jian Cong, Shan Yang, Lei Xie, Dan Su
Unified Source-Filter GAN: Unified Source-Filter Network Based on Factorization of Quasi-Periodic Parallel WaveGAN
Reo Yoneyama, Yi-Chiao Wu, Tomoki Toda
Harmonic WaveGAN: GAN-Based Speech Waveform Generation Model with Harmonic Structure Discriminator
Kazuki Mizuta, Tomoki Koriyama, Hiroshi Saruwatari
Fre-GAN: Adversarial Frequency-Consistent Audio Synthesis
Ji-Hoon Kim, Sang-Hoon Lee, Ji-Hyun Lee, Seong-Whan Lee
GANSpeech: Adversarial Training for High-Fidelity Multi-Speaker Speech Synthesis
Jinhyeok Yang, Jae-Sung Bae, Taejun Bak, Young-Ik Kim, Hoon-Young Cho
UnivNet: A Neural Vocoder with Multi-Resolution Spectrogram Discriminators for High-Fidelity Waveform Generation
Won Jang, Dan Lim, Jaesam Yoon, Bongwan Kim, Juntae Kim
Continuous Wavelet Vocoder-Based Decomposition of Parametric Speech Waveform Synthesis
Mohammed Salah Al-Radhi, Tamás Gábor Csapó, Csaba Zainkó, Géza Németh
High-Fidelity and Low-Latency Universal Neural Vocoder Based on Multiband WaveRNN with Data-Driven Linear Prediction for Discrete Waveform Modeling
Patrick Lumban Tobing, Tomoki Toda
Basis-MelGAN: Efficient Neural Vocoder Based on Audio Decomposition
Zhengxi Liu, Yanmin Qian
High-Fidelity Parallel WaveGAN with Multi-Band Harmonic-Plus-Noise Model
Min-Jae Hwang, Ryuichi Yamamoto, Eunwoo Song, Jae-Min Kim
SpecRec: An Alternative Solution for Improving End-to-End Speech-to-Text Translation via Spectrogram Reconstruction
Junkun Chen, Mingbo Ma, Renjie Zheng, Liang Huang
Subtitle Translation as Markup Translation
Colin Cherry, Naveen Arivazhagan, Dirk Padfield, Maxim Krikun
Large-Scale Self- and Semi-Supervised Learning for Speech Translation
Changhan Wang, Anne Wu, Juan Pino, Alexei Baevski, Michael Auli, Alexis Conneau
CoVoST 2 and Massively Multilingual Speech Translation
Changhan Wang, Anne Wu, Jiatao Gu, Juan Pino
AlloST: Low-Resource Speech Translation Without Source Transcription
Yao-Fei Cheng, Hung-Shin Lee, Hsin-Min Wang
Weakly-Supervised Speech-to-Text Mapping with Visually Connected Non-Parallel Speech-Text Data Using Cyclic Partially-Aligned Transformer
Johanes Effendi, Sakriani Sakti, Satoshi Nakamura
Transcribing Paralinguistic Acoustic Cues to Target Language Text in Transformer-Based Speech-to-Text Translation
Hirotaka Tokuyama, Sakriani Sakti, Katsuhito Sudoh, Satoshi Nakamura
End-to-End Speech Translation via Cross-Modal Progressive Training
Rong Ye, Mingxuan Wang, Lei Li
ASR Posterior-Based Loss for Multi-Task End-to-End Speech Translation
Yuka Ko, Katsuhito Sudoh, Sakriani Sakti, Satoshi Nakamura
Towards Simultaneous Machine Interpretation
Alejandro Pérez-González-de-Martos, Javier Iranzo-Sánchez, Adrià Giménez Pastor, Javier Jorge, Joan-Albert Silvestre-Cerdà, Jorge Civera, Albert Sanchis, Alfons Juan
Lexical Modeling of ASR Errors for Robust Speech Translation
Giuseppe Martucci, Mauro Cettolo, Matteo Negri, Marco Turchi
Optimally Encoding Inductive Biases into the Transformer Improves End-to-End Speech Translation
Piyush Vyas, Anastasia Kuznetsova, Donald S. Williamson
Effects of Feature Scaling and Fusion on Sign Language Translation
Tejaswini Ananthanarayana, Lipisha Chaudhary, Ifeoma Nwogu
The ID R&D System Description for Short-Duration Speaker Verification Challenge 2021
Alexander Alenin, Anton Okhotnikov, Rostislav Makarov, Nikita Torgashov, Ilya Shigabeev, Konstantin Simonchik
Integrating Frequency Translational Invariance in TDNNs and Frequency Positional Information in 2D ResNets to Enhance Speaker Verification
Jenthe Thienpondt, Brecht Desplanques, Kris Demuynck
SdSVC Challenge 2021: Tips and Tricks to Boost the Short-Duration Speaker Verification System Performance
Aleksei Gusev, Alisa Vinogradova, Sergey Novoselov, Sergei Astapov
Team02 Text-Independent Speaker Verification System for SdSV Challenge 2021
Woo Hyun Kang, Nam Soo Kim
Our Learned Lessons from Cross-Lingual Speaker Verification: The CRMI-DKU System Description for the Short-Duration Speaker Verification Challenge 2021
Xiaoyi Qin, Chao Wang, Yong Ma, Min Liu, Shilei Zhang, Ming Li
Investigation of IMU&Elevoc Submission for the Short-Duration Speaker Verification Challenge 2021
Peng Zhang, Peng Hu, Xueliang Zhang
The Sogou System for Short-Duration Speaker Verification Challenge 2021
Jie Yan, Shengyu Yao, Yiqian Pan, Wei Chen
The SJTU System for Short-Duration Speaker Verification Challenge 2021
Bing Han, Zhengyang Chen, Zhikai Zhou, Yanmin Qian
Multi-Speaker Emotional Text-to-Speech Synthesizer
Sungjae Cho, Soo-Young Lee
Live TV Subtitling Through Respeaking
Aleš Pražák, Zdeněk Loose, Josef V. Psutka, Vlasta Radová, Josef Psutka, Jan Švec
Autonomous Robot for Measuring Room Impulse Responses
Stefan Fragner, Tobias Topar, Maximilian Giller, Lukas Pfeifenberger, Franz Pernkopf
Expressive Robot Performance Based on Facial Motion Capture
Jonas Beskow, Charlie Caper, Johan Ehrenfors, Nils Hagberg, Anne Jansen, Chris Wood
ThemePro 2.0: Showcasing the Role of Thematic Progression in Engaging Human-Computer Interaction
Mónica Domínguez, Juan Soler-Company, Leo Wanner
Addressing Compliance in Call Centers with Entity Extraction
Sai Guruju, Jithendra Vepa
Audio Segmentation Based Conversational Silence Detection for Contact Center Calls
Krishnachaitanya Gogineni, Tarun Reddy Yadama, Jithendra Vepa
Reformulating DOVER-Lap Label Mapping as a Graph Partitioning Problem
Desh Raj, Sanjeev Khudanpur
Graph Attention Networks for Anti-Spoofing
Hemlata Tak, Jee-weon Jung, Jose Patino, Massimiliano Todisco, Nicholas Evans
Log-Likelihood-Ratio Cost Function as Objective Loss for Speaker Verification Systems
Victoria Mingote, Antonio Miguel, Alfonso Ortega, Eduardo Lleida
Effective Phase Encoding for End-To-End Speaker Verification
Junyi Peng, Xiaoyang Qu, Rongzhi Gu, Jianzong Wang, Jing Xiao, Lukáš Burget, Jan Černocký
Impact of Encoding and Segmentation Strategies on End-to-End Simultaneous Speech Translation
Ha Nguyen, Yannick Estève, Laurent Besacier
Lost in Interpreting: Speech Translation from Source or Interpreter?
Dominik Macháček, Matúš Žilinec, Ondřej Bojar
Active Speaker Detection as a Multi-Objective Optimization with Uncertainty-Based Multimodal Fusion
Baptiste Pouthier, Laurent Pilati, Leela K. Gudupudi, Charles Bouveyron, Frederic Precioso
It’s Not What You Said, it’s How You Said it: Discriminative Perception of Speech as a Multichannel Communication System
Sarenne Wallbridge, Peter Bell, Catherine Lai
Extending the Fullband E-Model Towards Background Noise, Bursty Packet Loss, and Conversational Degradations
Thilo Michael, Gabriel Mittag, Andreas Bütow, Sebastian Möller
ORCA-SLANG: An Automatic Multi-Stage Semi-Supervised Deep Learning Framework for Large-Scale Killer Whale Call Type Identification
Christian Bergler, Manuel Schmitt, Andreas Maier, Helena Symonds, Paul Spong, Steven R. Ness, George Tzanetakis, Elmar Nöth
Audiovisual Transfer Learning for Audio Tagging and Sound Event Detection
Wim Boes, Hugo Van hamme
Non-Intrusive Speech Quality Assessment with Transfer Learning and Subject-Specific Scaling
Natalia Nessler, Milos Cernak, Paolo Prandoni, Pablo Mainar
Audio Retrieval with Natural Language Queries
Andreea-Maria Oncescu, A. Sophia Koepke, João F. Henriques, Zeynep Akata, Samuel Albanie
Bootstrap an End-to-End ASR System by Multilingual Training, Transfer Learning, Text-to-Text Mapping and Synthetic Audio
Manuel Giollo, Deniz Gunceler, Yulan Liu, Daniel Willett
Efficient Weight Factorization for Multilingual Speech Recognition
Ngoc-Quan Pham, Tuan-Nam Nguyen, Sebastian Stüker, Alex Waibel
Unsupervised Cross-Lingual Representation Learning for Speech Recognition
Alexis Conneau, Alexei Baevski, Ronan Collobert, Abdelrahman Mohamed, Michael Auli
Language and Speaker-Independent Feature Transformation for End-to-End Multilingual Speech Recognition
Tomoaki Hayakawa, Chee Siang Leow, Akio Kobayashi, Takehito Utsuro, Hiromitsu Nishizaki
Using Large Self-Supervised Models for Low-Resource Speech Recognition
Krishna D. N, Pinyi Wang, Bruno Bozza
Dual Script E2E Framework for Multilingual and Code-Switching ASR
Mari Ganesh Kumar, Jom Kuriakose, Anand Thyagachandran, Arun Kumar A, Ashish Seth, Lodagala V.S.V. Durga Prasad, Saish Jaiswal, Anusha Prakash, Hema A. Murthy
MUCS 2021: Multilingual and Code-Switching ASR Challenges for Low Resource Indian Languages
Anuj Diwan, Rakesh Vaideeswaran, Sanket Shah, Ankita Singh, Srinivasa Raghavan, Shreya Khare, Vinit Unni, Saurabh Vyas, Akash Rajpuria, Chiranjeevi Yarra, Ashish Mittal, Prasanta Kumar Ghosh, Preethi Jyothi, Kalika Bali, Vivek Seshadri, Sunayana Sitaram, Samarth Bharadwaj, Jai Nanavati, Raoul Nanavati, Karthik Sankaranarayanan
Adapt-and-Adjust: Overcoming the Long-Tail Problem of Multilingual Speech Recognition
Genta Indra Winata, Guangsen Wang, Caiming Xiong, Steven Hoi
SRI-B End-to-End System for Multilingual and Code-Switching ASR Challenges for Low Resource Indian Languages
Hardik Sailor, Kiran Praveen T, Vikas Agrawal, Abhinav Jain, Abhishek Pandey
Hierarchical Phone Recognition with Compositional Phonetics
Xinjian Li, Juncheng Li, Florian Metze, Alan W. Black
Towards One Model to Rule All: Multilingual Strategy for Dialectal Code-Switching Arabic ASR
Shammur Absar Chowdhury, Amir Hussein, Ahmed Abdelali, Ahmed Ali
Differentiable Allophone Graphs for Language-Universal Speech Recognition
Brian Yan, Siddharth Dalmia, David R. Mortensen, Florian Metze, Shinji Watanabe
Automatic Speech Recognition Systems Errors for Objective Sleepiness Detection Through Voice
Vincent P. Martin, Jean-Luc Rouas, Florian Boyer, Pierre Philip
Robust Laughter Detection in Noisy Environments
Jon Gillick, Wesley Deng, Kimiko Ryokai, David Bamman
Impact of Emotional State on Estimation of Willingness to Buy from Advertising Speech
Mizuki Nagano, Yusuke Ijima, Sadao Hiroya
Stacked Recurrent Neural Networks for Speech-Based Inference of Attachment Condition in School Age Children
Huda Alsofyani, Alessandro Vinciarelli
Language or Paralanguage, This is the Problem: Comparing Depressed and Non-Depressed Speakers Through the Analysis of Gated Multimodal Units
Nujud Aloshban, Anna Esposito, Alessandro Vinciarelli
Emotion Carrier Recognition from Personal Narratives
Aniruddha Tammewar, Alessandra Cervone, Giuseppe Riccardi
Non-Verbal Vocalisation and Laughter Detection Using Sequence-to-Sequence Models and Multi-Label Training
Scott Condron, Georgia Clarke, Anita Klementiev, Daniela Morse-Kopp, Jack Parry, Dimitri Palaz
TDCA-Net: Time-Domain Channel Attention Network for Depression Detection
Cong Cai, Mingyue Niu, Bin Liu, Jianhua Tao, Xuefei Liu
Visual Speech for Obstructive Sleep Apnea Detection
Catarina Botelho, Alberto Abad, Tanja Schultz, Isabel Trancoso
Analysis of Contextual Voice Changes in Remote Meetings
Hector A. Cordourier Maruri, Sinem Aslan, Georg Stemmer, Nese Alyuz, Lama Nachman
Speech Based Depression Severity Level Classification Using a Multi-Stage Dilated CNN-LSTM Model
Nadee Seneviratne, Carol Espy-Wilson
Multi-Domain Knowledge Distillation via Uncertainty-Matching for End-to-End ASR Models
Ho-Gyeong Kim, Min-Joong Lee, Hoshik Lee, Tae Gyoon Kang, Jihyun Lee, Eunho Yang, Sung Ju Hwang
Learning a Neural Diff for Speech Models
Jonathan Macoskey, Grant P. Strimel, Ariya Rastrow
Stochastic Attention Head Removal: A Simple and Effective Method for Improving Transformer Based ASR Models
Shucong Zhang, Erfan Loweimi, Peter Bell, Steve Renals
Model-Agnostic Fast Adaptive Multi-Objective Balancing Algorithm for Multilingual Automatic Speech Recognition Model Training
Jiabin Xue, Tieran Zheng, Jiqing Han
Towards Lifelong Learning of End-to-End ASR
Heng-Jui Chang, Hung-yi Lee, Lin-shan Lee
Self-Adaptive Distillation for Multilingual Speech Recognition: Leveraging Student Independence
Isabel Leal, Neeraj Gaur, Parisa Haghani, Brian Farris, Pedro J. Moreno, Manasa Prasad, Bhuvana Ramabhadran, Yun Zhu
Regularizing Word Segmentation by Creating Misspellings
Hainan Xu, Kartik Audhkhasi, Yinghui Huang, Jesse Emond, Bhuvana Ramabhadran
Multitask Training with Text Data for End-to-End Speech Recognition
Peidong Wang, Tara N. Sainath, Ron J. Weiss
Emitting Word Timings with HMM-Free End-to-End System in Automatic Speech Recognition
Xianzhao Chen, Hao Ni, Yi He, Kang Wang, Zejun Ma, Zongxia Xie
Scaling Laws for Acoustic Models
Jasha Droppo, Oguz Elibol
Leveraging Non-Target Language Resources to Improve ASR Performance in a Target Language
Jayadev Billa
4-Bit Quantization of LSTM-Based Speech Recognition Models
Andrea Fasoli, Chia-Yu Chen, Mauricio Serrano, Xiao Sun, Naigang Wang, Swagath Venkataramani, George Saon, Xiaodong Cui, Brian Kingsbury, Wei Zhang, Zoltán Tüske, Kailash Gopalakrishnan
Unified Autoregressive Modeling for Joint End-to-End Multi-Talker Overlapped Speech Recognition and Speaker Attribute Estimation
Ryo Masumura, Daiki Okamura, Naoki Makishima, Mana Ihori, Akihiko Takashima, Tomohiro Tanaka, Shota Orihashi
Minimum Word Error Rate Training with Language Model Fusion for End-to-End Speech Recognition
Zhong Meng, Yu Wu, Naoyuki Kanda, Liang Lu, Xie Chen, Guoli Ye, Eric Sun, Jinyu Li, Yifan Gong
Variable Frame Rate Acoustic Models Using Minimum Error Reinforcement Learning
Dongcheng Jiang, Chao Zhang, Philip C. Woodland
How f0 and Phrase Position Affect Papuan Malay Word Identification
Constantijn Kaland, Matthew Gordon
On the Feasibility of the Danish Model of Intonational Transcription: Phonetic Evidence from Jutlandic Danish
Anna Bothe Jespersen, Pavel Šturm, Míša Hejná
An Experiment in Paratone Detection in a Prosodically Annotated EAP Spoken Corpus
Adrien Méli, Nicolas Ballier, Achille Falaise, Alice Henderson
ProsoBeast Prosody Annotation Tool
Branislav Gerazov, Michael Wagner
Assessing the Use of Prosody in Constituency Parsing of Imperfect Transcripts
Trang Tran, Mari Ostendorf
Targeted and Targetless Neutral Tones in Taiwanese Southern Min
Roger Cheng-yen Liu, Feng-fan Hsieh, Yueh-chin Chang
The Interaction of Word Complexity and Word Duration in an Agglutinative Language
Mária Gósy, Kálmán Abari
Taiwan Min Nan (Taiwanese) Checked Tones Sound Change
Ho-hsien Pan, Shao-ren Lyu
In-Group Advantage in the Perception of Emotions: Evidence from Three Varieties of German
Moritz Jakob, Bettina Braun, Katharina Zahner-Ritter
The LF Model in the Frequency Domain for Glottal Airflow Modelling Without Aliasing Distortion
Christer Gobl
Parsing Speech for Grouping and Prominence, and the Typology of Rhythm
Michael Wagner, Alvaro Iturralde Zurita, Sijia Zhang
Prosody of Case Markers in Urdu
Benazir Mumtaz, Massimiliano Canzi, Miriam Butt
Articulatory Characteristics of Icelandic Voiced Fricative Lenition: Gradience, Categoricity, and Speaker/Gesture-Specific Effects
Brynhildur Stefansdottir, Francesco Burroni, Sam Tilsen
Leveraging the Uniformity Framework to Examine Crosslinguistic Similarity for Long-Lag Stops in Spontaneous Cantonese-English Bilingual Speech
Khia A. Johnson
Personalized Speech Enhancement Through Self-Supervised Data Augmentation and Purification
Aswin Sivaraman, Sunwoo Kim, Minje Kim
Speech Denoising with Auditory Models
Mark R. Saddler, Andrew Francl, Jenelle Feather, Kaizhi Qian, Yang Zhang, Josh H. McDermott
Human Listening and Live Captioning: Multi-Task Training for Speech Enhancement
Sefik Emre Eskimez, Xiaofei Wang, Min Tang, Hemin Yang, Zirun Zhu, Zhuo Chen, Huaming Wang, Takuya Yoshioka
Multi-Stage Progressive Speech Enhancement Network
Xinmeng Xu, Yang Wang, Dongxiang Xu, Yiyuan Peng, Cong Zhang, Jie Jia, Binbin Chen
Single-Channel Speech Enhancement Using Learnable Loss Mixup
Oscar Chang, Dung N. Tran, Kazuhito Koishida
A Maximum Likelihood Approach to SNR-Progressive Learning Using Generalized Gaussian Distribution for LSTM-Based Speech Enhancement
Xiao-Qi Zhang, Jun Du, Li Chai, Chin-Hui Lee
Whisper Speech Enhancement Using Joint Variational Autoencoder for Improved Speech Recognition
Vikas Agrawal, Shashi Kumar, Shakti P. Rath
DEMUCS-Mobile: On-Device Lightweight Speech Enhancement
Lukas Lee, Youna Ji, Minjae Lee, Min-Seok Choi
Speech Denoising Without Clean Training Data: A Noise2Noise Approach
Madhav Mahesh Kashyap, Anuj Tambwekar, Krishnamoorthy Manohara, S. Natarajan
Improved Speech Enhancement Using a Complex-Domain GAN with Fused Time-Domain and Time-Frequency Domain Constraints
Feng Dang, Pengyuan Zhang, Hangting Chen
Speech Enhancement with Topology-Enhanced Generative Adversarial Networks (GANs)
Xudong Zhang, Liang Zhao, Feng Gu
Learning Speech Structure to Improve Time-Frequency Masks
Suliang Bu, Yunxin Zhao, Shaojun Wang, Mei Han
SE-Conformer: Time-Domain Speech Enhancement Using Conformer
Eesung Kim, Hyeji Seo
Spectral and Latent Speech Representation Distortion for TTS Evaluation
Thananchai Kongthaworn, Burin Naowarat, Ekapol Chuangsuwanich
Detection and Analysis of Attention Errors in Sequence-to-Sequence Text-to-Speech
Cassia Valentini-Botinhao, Simon King
RyanSpeech: A Corpus for Conversational Text-to-Speech Synthesis
Rohola Zandie, Mohammad H. Mahoor, Julia Madsen, Eshrat S. Emamian
AISHELL-3: A Multi-Speaker Mandarin TTS Corpus
Yao Shi, Hui Bu, Xin Xu, Shaoji Zhang, Ming Li
Comparing Speech Enhancement Techniques for Voice Adaptation-Based Speech Synthesis
Nicholas Eng, C.T. Justine Hui, Yusuke Hioka, Catherine I. Watson
EMOVIE: A Mandarin Emotion Speech Dataset with a Simple Emotional Text-to-Speech Model
Chenye Cui, Yi Ren, Jinglin Liu, Feiyang Chen, Rongjie Huang, Ming Lei, Zhou Zhao
Perception of Social Speaker Characteristics in Synthetic Speech
Sai Sirisha Rallabandi, Abhinav Bharadwaj, Babak Naderi, Sebastian Möller
Hi-Fi Multi-Speaker English TTS Dataset
Evelina Bakhturina, Vitaly Lavrukhin, Boris Ginsburg, Yang Zhang
Utilizing Self-Supervised Representations for MOS Prediction
Wei-Cheng Tseng, Chien-yu Huang, Wei-Tsung Kao, Yist Y. Lin, Hung-yi Lee
KazakhTTS: An Open-Source Kazakh Text-to-Speech Synthesis Dataset
Saida Mussakhojayeva, Aigerim Janaliyeva, Almas Mirzakhmetov, Yerbolat Khassanov, Huseyin Atakan Varol
Confidence Intervals for ASR-Based TTS Evaluation
Jason Taylor, Korin Richmond
INTERSPEECH 2021 Deep Noise Suppression Challenge
Chandan K.A. Reddy, Harishchandra Dubey, Kazuhito Koishida, Arun Nair, Vishak Gopal, Ross Cutler, Sebastian Braun, Hannes Gamper, Robert Aichner, Sriram Srinivasan
A Simultaneous Denoising and Dereverberation Framework with Target Decoupling
Andong Li, Wenzhe Liu, Xiaoxue Luo, Guochen Yu, Chengshi Zheng, Xiaodong Li
Deep Noise Suppression with Non-Intrusive PESQNet Supervision Enabling the Use of Real Training Data
Ziyi Xu, Maximilian Strake, Tim Fingscheidt
DPCRN: Dual-Path Convolution Recurrent Network for Single Channel Speech Enhancement
Xiaohuai Le, Hongsheng Chen, Kai Chen, Jing Lu
DCCRN+: Channel-Wise Subband DCCRN with SNR Estimation for Speech Enhancement
Shubo Lv, Yanxin Hu, Shimin Zhang, Lei Xie
DBNet: A Dual-Branch Network Architecture Processing on Spectrum and Waveform for Single-Channel Speech Enhancement
Kanghao Zhang, Shulin He, Hao Li, Xueliang Zhang
Low-Delay Speech Enhancement Using Perceptually Motivated Target and Loss
Xu Zhang, Xinlei Ren, Xiguang Zheng, Lianwu Chen, Chen Zhang, Liang Guo, Bing Yu
Lightweight Causal Transformer with Local Self-Attention for Real-Time Speech Enhancement
Koen Oostermeijer, Qing Wang, Jun Du
Self-Paced Ensemble Learning for Speech and Audio Classification
Nicolae-Cătălin Ristea, Radu Tudor Ionescu
Knowledge Distillation for Streaming Transformer–Transducer
Atsushi Kojima
Multi-Encoder Learning and Stream Fusion for Transformer-Based End-to-End Automatic Speech Recognition
Timo Lohrenz, Zhengyang Li, Tim Fingscheidt
Conditional Independence for Pretext Task Selection in Self-Supervised Speech Representation Learning
Salah Zaiem, Titouan Parcollet, Slim Essid
Investigating Methods to Improve Language Model Integration for Attention-Based Encoder-Decoder ASR Models
Mohammad Zeineldeen, Aleksandr Glushko, Wilfried Michel, Albert Zeyer, Ralf Schlüter, Hermann Ney
Comparing CTC and LFMMI for Out-of-Domain Adaptation of wav2vec 2.0 Acoustic Model
Apoorv Vyas, Srikanth Madikeri, Hervé Bourlard
Speaker Attentive Speech Emotion Recognition
Clément Le Moine, Nicolas Obin, Axel Roebel
Separation of Emotional and Reconstruction Embeddings on Ladder Network to Improve Speech Emotion Recognition Robustness in Noisy Conditions
Seong-Gyun Leem, Daniel Fulford, Jukka-Pekka Onnela, David Gard, Carlos Busso
M3: MultiModal Masking Applied to Sentiment Analysis
Efthymios Georgiou, Georgios Paraskevopoulos, Alexandros Potamianos
The CSTR System for Multilingual and Code-Switching ASR Challenges for Low Resource Indian Languages
Ondřej Klejch, Electra Wallington, Peter Bell
Acoustic Data-Driven Subword Modeling for End-to-End Speech Recognition
Wei Zhou, Mohammad Zeineldeen, Zuoyun Zheng, Ralf Schlüter, Hermann Ney
Equivalence of Segmental and Neural Transducer Modeling: A Proof of Concept
Wei Zhou, Albert Zeyer, André Merboldt, Ralf Schlüter, Hermann Ney
Modeling Dialectal Variation for Swiss German Automatic Speech Recognition
Abbas Khosravani, Philip N. Garner, Alexandros Lazaridis
Out-of-Vocabulary Words Detection with Attention and CTC Alignments in an End-to-End ASR System
Ekaterina Egorova, Hari Krishna Vydana, Lukáš Burget, Jan Černocký
Training Hybrid Models on Noisy Transliterated Transcripts for Code-Switched Speech Recognition
Matthew Wiesner, Mousmita Sarma, Ashish Arora, Desh Raj, Dongji Gao, Ruizhe Huang, Supreet Preet, Moris Johnson, Zikra Iqbal, Nagendra Goel, Jan Trmal, Leibny Paola García Perera, Sanjeev Khudanpur
Speech Intelligibility of Dysarthric Speech: Human Scores and Acoustic-Phonetic Features
Wei Xue, Roeland van Hout, Fleur Boogmans, Mario Ganzeboom, Catia Cucchiarini, Helmer Strik
Analyzing Short Term Dynamic Speech Features for Understanding Behavioral Traits of Children with Autism Spectrum Disorder
Young-Kyung Kim, Rimita Lahiri, Md. Nasir, So Hyun Kim, Somer Bishop, Catherine Lord, Shrikanth S. Narayanan
Vocalization Recognition of People with Profound Intellectual and Multiple Disabilities (PIMD) Using Machine Learning Algorithms
Waldemar Jęśko
Phonetic Complexity, Speech Accuracy and Intelligibility Assessment of Italian Dysarthric Speech
Barbara Gili Fivela, Vincenzo Sallustio, Silvia Pede, Danilo Patrocinio
Detection of Consonant Errors in Disordered Speech Based on Consonant-Vowel Segment Embedding
Si-Ioi Ng, Cymie Wing-Yee Ng, Jingyu Li, Tan Lee
Assessing Posterior-Based Mispronunciation Detection on Field-Collected Recordings from Child Speech Therapy Sessions
Adam Hair, Guanlong Zhao, Beena Ahmed, Kirrie J. Ballard, Ricardo Gutierrez-Osuna
Identifying Cognitive Impairment Using Sentence Representation Vectors
Bahman Mirheidari, Yilin Pan, Daniel Blackburn, Ronan O’Malley, Heidi Christensen
Parental Spoken Scaffolding and Narrative Skills in Crowd-Sourced Storytelling Samples of Young Children
Zhengjun Yue, Jon Barker, Heidi Christensen, Cristina McKean, Elaine Ashton, Yvonne Wren, Swapnil Gadgil, Rebecca Bright
Uncertainty-Aware COVID-19 Detection from Imbalanced Sound Data
Tong Xia, Jing Han, Lorena Qendro, Ting Dang, Cecilia Mascolo
Unsupervised Domain Adaptation for Dysarthric Speech Detection via Domain Adversarial Training and Mutual Information Minimization
Disong Wang, Liqun Deng, Yu Ting Yeung, Xiao Chen, Xunying Liu, Helen Meng
Source and Vocal Tract Cues for Speech-Based Classification of Patients with Parkinson’s Disease and Healthy Subjects
Tanuka Bhattacharjee, Jhansi Mallela, Yamini Belur, Nalini Atchayaram, Ravi Yadav, Pradeep Reddy, Dipanjan Gope, Prasanta Kumar Ghosh
CLAC: A Speech Corpus of Healthy English Speakers
R’mani Haulcy, James Glass
Direct Multimodal Few-Shot Learning of Speech and Images
Leanne Nortje, Herman Kamper
Talk, Don’t Write: A Study of Direct Speech-Based Image Retrieval
Ramon Sanabria, Austin Waters, Jason Baldridge
A Fast Discrete Two-Step Learning Hashing for Scalable Cross-Modal Retrieval
Huan Zhao, Kaili Ma
Cross-Modal Knowledge Distillation Method for Automatic Cued Speech Recognition
Jianrong Wang, Ziyue Tang, Xuewei Li, Mei Yu, Qiang Fang, Li Liu
Attention-Based Keyword Localisation in Speech Using Visual Grounding
Kayode Olaleye, Herman Kamper
Evaluation of Audio-Visual Alignments in Visually Grounded Speech Models
Khazar Khorrami, Okko Räsänen
Automatic Lip-Reading with Hierarchical Pyramidal Convolution and Self-Attention for Image Sequences with No Word Boundaries
Hang Chen, Jun Du, Yu Hu, Li-Rong Dai, Bao-Cai Yin, Chin-Hui Lee
Cascaded Multilingual Audio-Visual Learning from Videos
Andrew Rouditchenko, Angie Boggust, David Harwath, Samuel Thomas, Hilde Kuehne, Brian Chen, Rameswar Panda, Rogerio Feris, Brian Kingsbury, Michael Picheny, James Glass
LiRA: Learning Visual Speech Representations from Audio Through Self-Supervision
Pingchuan Ma, Rodrigo Mira, Stavros Petridis, Björn W. Schuller, Maja Pantic
End-to-End Audio-Visual Speech Recognition for Overlapping Speech
Richard Rose, Olivier Siohan, Anshuman Tripathi, Otavio Braga
Audio-Visual Multi-Talker Speech Recognition in a Cocktail Party
Yifei Wu, Chenda Li, Song Yang, Zhongqin Wu, Yanmin Qian
Ultra Fast Speech Separation Model with Teacher Student Learning
Sanyuan Chen, Yu Wu, Zhuo Chen, Jian Wu, Takuya Yoshioka, Shujie Liu, Jinyu Li, Xiangzhan Yu
Group Delay Based Re-Weighted Sparse Recovery Algorithms for Robust and High-Resolution Source Separation in DOA Framework
Murtiza Ali, Ashwani Koul, Karan Nathwani
Continuous Speech Separation Using Speaker Inventory for Long Recording
Cong Han, Yi Luo, Chenda Li, Tianyan Zhou, Keisuke Kinoshita, Shinji Watanabe, Marc Delcroix, Hakan Erdogan, John R. Hershey, Nima Mesgarani, Zhuo Chen
Crossfire Conditional Generative Adversarial Networks for Singing Voice Extraction
Weitao Yuan, Shengbei Wang, Xiangrui Li, Masashi Unoki, Wenwu Wang
End-to-End Speech Separation Using Orthogonal Representation in Complex and Real Time-Frequency Domain
Kai Wang, Hao Huang, Ying Hu, Zhihua Huang, Sheng Li
Efficient and Stable Adversarial Learning Using Unpaired Data for Unsupervised Multichannel Speech Separation
Yu Nakagome, Masahito Togami, Tetsuji Ogawa, Tetsunori Kobayashi
Stabilizing Label Assignment for Speech Separation by Self-Supervised Pre-Training
Sung-Feng Huang, Shun-Po Chuang, Da-Rong Liu, Yi-Chen Chen, Gene-Ping Yang, Hung-yi Lee
Dual-Path Filter Network: Speaker-Aware Modeling for Speech Separation
Fan-Lin Wang, Yu-Huai Peng, Hung-Shin Lee, Hsin-Min Wang
Investigation of Practical Aspects of Single Channel Speech Separation for ASR
Jian Wu, Zhuo Chen, Sanyuan Chen, Yu Wu, Takuya Yoshioka, Naoyuki Kanda, Shujie Liu, Jinyu Li
Implicit Filter-and-Sum Network for End-to-End Multi-Channel Speech Separation
Yi Luo, Nima Mesgarani
Generalized Spatio-Temporal RNN Beamformer for Target Speech Separation
Yong Xu, Zhuohuang Zhang, Meng Yu, Shi-Xiong Zhang, Dong Yu
End-to-End Neural Diarization: From Transformer to Conformer
Yi Chieh Liu, Eunjung Han, Chul Lee, Andreas Stolcke
Three-Class Overlapped Speech Detection Using a Convolutional Recurrent Neural Network
Jee-weon Jung, Hee-Soo Heo, Youngki Kwon, Joon Son Chung, Bong-Jin Lee
Online Speaker Diarization Equipped with Discriminative Modeling and Guided Inference
Xucheng Wan, Kai Liu, Huan Zhou
Semi-Supervised Training with Pseudo-Labeling for End-To-End Neural Diarization
Yuki Takashima, Yusuke Fujita, Shota Horiguchi, Shinji Watanabe, Leibny Paola García Perera, Kenji Nagamatsu
Adapting Speaker Embeddings for Speaker Diarisation
Youngki Kwon, Jee-weon Jung, Hee-Soo Heo, You Jin Kim, Bong-Jin Lee, Joon Son Chung
Scenario-Dependent Speaker Diarization for DIHARD-III Challenge
Yu-Xuan Wang, Jun Du, Maokui He, Shu-Tong Niu, Lei Sun, Chin-Hui Lee
End-To-End Speaker Segmentation for Overlap-Aware Resegmentation
Hervé Bredin, Antoine Laurent
Online Streaming End-to-End Neural Diarization Handling Overlapping Speech and Flexible Numbers of Speakers
Yawen Xue, Shota Horiguchi, Yusuke Fujita, Yuki Takashima, Shinji Watanabe, Leibny Paola García Perera, Kenji Nagamatsu
A Thousand Words are Worth More Than One Recording: Word-Embedding Based Speaker Change Detection
Or Haim Anidjar, Itshak Lapidot, Chen Hajaj, Amit Dvir
Phrase Break Prediction with Bidirectional Encoder Representations in Japanese Text-to-Speech Synthesis
Kosuke Futamata, Byeongseon Park, Ryuichi Yamamoto, Kentaro Tachibana
Improving Multi-Speaker TTS Prosody Variance with a Residual Encoder and Normalizing Flows
Iván Vallés-Pérez, Julian Roth, Grzegorz Beringer, Roberto Barra-Chicote, Jasha Droppo
Rich Prosody Diversity Modelling with Phone-Level Mixture Density Network
Chenpeng Du, Kai Yu
Phoneme Duration Modeling Using Speech Rhythm-Based Speaker Embeddings for Multi-Speaker Speech Synthesis
Kenichi Fujita, Atsushi Ando, Yusuke Ijima
Fine-Grained Prosody Modeling in Neural Speech Synthesis Using ToBI Representation
Yuxiang Zou, Shichao Liu, Xiang Yin, Haopeng Lin, Chunfeng Wang, Haoyu Zhang, Zejun Ma
Intra-Sentential Speaking Rate Control in Neural Text-To-Speech for Automatic Dubbing
Mayank Sharma, Yogesh Virkar, Marcello Federico, Roberto Barra-Chicote, Robert Enyedi
Applying the Information Bottleneck Principle to Prosodic Representation Learning
Guangyan Zhang, Ying Qin, Daxin Tan, Tan Lee
A Prototypical Network Approach for Evaluating Generated Emotional Speech
Alice Baird, Silvan Mertes, Manuel Milling, Lukas Stappen, Thomas Wiest, Elisabeth André, Björn W. Schuller
A Simplified Model for the Vocal Tract of [s] with Inclined Incisors
Tsukasa Yoshinaga, Kohei Tada, Kazunori Nozaki, Akiyoshi Iida
Vocal-Tract Models to Visualize the Airstream of Human Breath and Droplets While Producing Speech
Takayuki Arai
Using Transposed Convolution for Articulatory-to-Acoustic Conversion from Real-Time MRI Data
Ryo Tanji, Hidefumi Ohmura, Kouichi Katsurada
Comparison Between Lumped-Mass Modeling and Flow Simulation of the Reed-Type Artificial Vocal Fold
Rafia Inaam, Tsukasa Yoshinaga, Takayuki Arai, Hiroshi Yokoyama, Akiyoshi Iida
Inhalations in Speech: Acoustic and Physiological Characteristics
Raphael Werner, Susanne Fuchs, Jürgen Trouvain, Bernd Möbius
Model-Based Exploration of Linking Between Vowel Articulatory Space and Acoustic Space
Anqi Xu, Daniel van Niekerk, Branislav Gerazov, Paul Konstantin Krug, Santitham Prom-on, Peter Birkholz, Yi Xu
Take a Breath: Respiratory Sounds Improve Recollection in Synthetic Speech
Mikey Elmers, Raphael Werner, Beeke Muhlack, Bernd Möbius, Jürgen Trouvain
Modeling Sensorimotor Adaptation in Speech Through Alterations to Forward and Inverse Models
Taijing Chen, Adam Lammert, Benjamin Parrell
Mixture of Orthogonal Sequences Made from Extended Time-Stretched Pulses Enables Measurement of Involuntary Voice Fundamental Frequency Response to Pitch Perturbation
Hideki Kawahara, Toshie Matsui, Kohei Yatabe, Ken-Ichi Sakakibara, Minoru Tsuzaki, Masanori Morise, Toshio Irino
Contextualized Attention-Based Knowledge Transfer for Spoken Conversational Question Answering
Chenyu You, Nuo Chen, Yuexian Zou
Injecting Descriptive Meta-Information into Pre-Trained Language Models with Hypernetworks
Wenying Duan, Xiaoxi He, Zimu Zhou, Hong Rao, Lothar Thiele
Causal Confusion Reduction for Robust Multi-Domain Dialogue Policy
Mahdin Rohmatillah, Jen-Tzung Chien
Timing Generating Networks: Neural Network Based Precise Turn-Taking Timing Prediction in Multiparty Conversation
Shinya Fujie, Hayato Katayama, Jin Sakuma, Tetsunori Kobayashi
Human-to-Human Conversation Dataset for Learning Fine-Grained Turn-Taking Action
Kehan Chen, Zezhong Li, Suyang Dai, Wei Zhou, Haiqing Chen
PhonemeBERT: Joint Language Modelling of Phoneme Sequence and ASR Transcript
Mukuntha Narayanan Sundararaman, Ayush Kumar, Jithendra Vepa
Joint Retrieval-Extraction Training for Evidence-Aware Dialog Response Selection
Hongyin Luo, James Glass, Garima Lalwani, Yi Zhang, Shang-Wen Li
Adapting Long Context NLM for ASR Rescoring in Conversational Agents
Ashish Shenoy, Sravan Bodapati, Monica Sunkara, Srikanth Ronanki, Katrin Kirchhoff
Oriental Language Recognition (OLR) 2020: Summary and Analysis
Jing Li, Binling Wang, Yiming Zhi, Zheng Li, Lin Li, Qingyang Hong, Dong Wang
Language Recognition on Unknown Conditions: The LORIA-Inria-MULTISPEECH System for AP20-OLR Challenge
Raphaël Duroselle, Md. Sahidullah, Denis Jouvet, Irina Illina
Dynamic Multi-Scale Convolution for Dialect Identification
Tianlong Kong, Shouyi Yin, Dawei Zhang, Wang Geng, Xin Wang, Dandan Song, Jinwen Huang, Huiyu Shi, Xiaorui Wang
An End-to-End Dialect Identification System with Transfer Learning from a Multilingual Automatic Speech Recognition Model
Ding Wang, Shuaishuai Ye, Xinhui Hu, Sheng Li, Xinkang Xu
Language Recognition Based on Unsupervised Pretrained Models
Haibin Yu, Jing Zhao, Song Yang, Zhongqin Wu, Yuting Nie, Wei-Qiang Zhang
Additive Phoneme-Aware Margin Softmax Loss for Language Recognition
Zheng Li, Yan Liu, Lin Li, Qingyang Hong
Towards an Accent-Robust Approach for ATC Communications Transcription
Nataly Jahchan, Florentin Barbier, Ariyanidevi Dharma Gita, Khaled Khelif, Estelle Delpech
Detecting English Speech in the Air Traffic Control Voice Communication
Igor Szöke, Santosh Kesiraju, Ondřej Novotný, Martin Kocour, Karel Veselý, Jan Černocký
Robust Command Recognition for Lithuanian Air Traffic Control Tower Utterances
Oliver Ohneiser, Seyyed Saeed Sarfjoo, Hartmut Helmke, Shruthi Shetty, Petr Motlicek, Matthias Kleinert, Heiko Ehr, Šarūnas Murauskas
Contextual Semi-Supervised Learning: An Approach to Leverage Air-Surveillance and Untranscribed ATC Data in ASR Systems
Juan Zuluaga-Gomez, Iuliia Nigmatulina, Amrutha Prasad, Petr Motlicek, Karel Veselý, Martin Kocour, Igor Szöke
Boosting of Contextual Information in ASR for Air-Traffic Call-Sign Recognition
Martin Kocour, Karel Veselý, Alexander Blatt, Juan Zuluaga Gomez, Igor Szöke, Jan Černocký, Dietrich Klakow, Petr Motlicek
Modeling the Effect of Military Oxygen Masks on Speech Characteristics
Benjamin Elie, Jodie Gauvain, Jean-Luc Gauvain, Lori Lamel
MoM: Minutes of Meeting Bot
Benjamin Milde, Tim Fischer, Steffen Remus, Chris Biemann
Articulatory Data Recorder: A Framework for Real-Time Articulatory Data Recording
Alexander Wilbrandt, Simon Stone, Peter Birkholz
The INGENIOUS Multilingual Operations App
Joan Codina-Filbà, Guillermo Cámbara, Alex Peiró-Lilja, Jens Grivolla, Roberto Carlini, Mireia Farrús
Digital Einstein Experience: Fast Text-to-Speech for Conversational AI
Joanna Rownicka, Kilian Sprenkamp, Antonio Tripiana, Volodymyr Gromoglasov, Timo P. Kunz
Live Subtitling for BigBlueButton with Open-Source Software
Robert Geislinger, Benjamin Milde, Timo Baumann, Chris Biemann
Expressive Latvian Speech Synthesis for Dialog Systems
Dāvis Nicmanis, Askars Salimbajevs
ViSTAFAE: A Visual Speech-Training Aid with Feedback of Articulatory Efforts
Pramod H. Kachare, Prem C. Pandey, Vishal Mane, Hirak Dasgupta, K.S. Nataraj, Akshada Rathod, Sheetal K. Pathak
Towards the Prediction of the Vocal Tract Shape from the Sequence of Phonemes to be Articulated
Vinicius Ribeiro, Karyna Isaieva, Justine Leclere, Pierre-André Vuissoz, Yves Laprie
Comparison of the Finite Element Method, the Multimodal Method and the Transmission-Line Model for the Computation of Vocal Tract Transfer Functions
Rémi Blandin, Marc Arnela, Simon Félix, Jean-Baptiste Doc, Peter Birkholz
Effects of Time Pressure and Spontaneity on Phonotactic Innovations in German Dialogues
Petra Wagner, Sina Zarrieß, Joana Cholin
Importance of Parasagittal Sensor Information in Tongue Motion Capture Through a Diphonic Analysis
Salvador Medina, Sarah Taylor, Mark Tiede, Alexander Hauptmann, Iain Matthews
Learning Robust Speech Representation with an Articulatory-Regularized Variational Autoencoder
Marc-Antoine Georges, Laurent Girin, Jean-Luc Schwartz, Thomas Hueber
Changes in Glottal Source Parameter Values with Light to Moderate Physical Load
Heather Weston, Laura L. Koenig, Susanne Fuchs
End-to-End Optimized Multi-Stage Vector Quantization of Spectral Envelopes for Speech and Audio Coding
Mohammad Hassan Vali, Tom Bäckström
Fusion-Net: Time-Frequency Information Fusion Y-Network for Speech Enhancement
Santhan Kumar Reddy Nareddula, Subrahmanyam Gorthi, Rama Krishna Sai S. Gorthi
N-MTTL SI Model: Non-Intrusive Multi-Task Transfer Learning-Based Speech Intelligibility Prediction Model with Scenery Classification
Ľuboš Marcinek, Michael Stone, Rebecca Millman, Patrick Gaydecki
Temporal Context in Speech Emotion Recognition
Yangyang Xia, Li-Wei Chen, Alexander Rudnicky, Richard M. Stern
Learning Fine-Grained Cross Modality Excitement for Speech Emotion Recognition
Hang Li, Wenbiao Ding, Zhongqin Wu, Zitao Liu
Automatic Analysis of the Emotional Content of Speech in Daylong Child-Centered Recordings from a Neonatal Intensive Care Unit
Einari Vaaras, Sari Ahlqvist-Björkroth, Konstantinos Drossos, Okko Räsänen
Multimodal Sentiment Analysis with Temporal Modality Attention
Fan Qian, Jiqing Han
Stochastic Process Regression for Cross-Cultural Speech Emotion Recognition
Mani Kumar T, Enrique Sanchez, Georgios Tzimiropoulos, Timo Giesbrecht, Michel Valstar
Acted vs. Improvised: Domain Adaptation for Elicitation Approaches in Audio-Visual Emotion Recognition
Haoqi Li, Yelin Kim, Cheng-Hao Kuo, Shrikanth S. Narayanan
Emotion Recognition from Speech Using wav2vec 2.0 Embeddings
Leonardo Pepino, Pablo Riera, Luciana Ferrer
Graph Isomorphism Network for Speech Emotion Recognition
Jiawang Liu, Haoxiang Wang
Applying TDNN Architectures for Analyzing Duration Dependencies on Speech Emotion Recognition
Pooja Kumawat, Aurobinda Routray
Acoustic Features and Neural Representations for Categorical Emotion Recognition from Speech
Aaron Keesing, Yun Sing Koh, Michael Witbrock
Leveraging Pre-Trained Language Model for Speech Sentiment Analysis
Suwon Shon, Pablo Brusco, Jing Pan, Kyu J. Han, Shinji Watanabe
Cross-Domain Speech Recognition with Unsupervised Character-Level Distribution Matching
Wenxin Hou, Jindong Wang, Xu Tan, Tao Qin, Takahiro Shinozaki
Large-Scale Pre-Training of End-to-End Multi-Talker ASR for Meeting Transcription with Single Distant Microphone
Naoyuki Kanda, Guoli Ye, Yu Wu, Yashesh Gaur, Xiaofei Wang, Zhong Meng, Zhuo Chen, Takuya Yoshioka
On Minimum Word Error Rate Training of the Hybrid Autoregressive Transducer
Liang Lu, Zhong Meng, Naoyuki Kanda, Jinyu Li, Yifan Gong
Reducing Streaming ASR Model Delay with Self Alignment
Jaeyoung Kim, Han Lu, Anshuman Tripathi, Qian Zhang, Hasim Sak
Reduce and Reconstruct: ASR for Low-Resource Phonetic Languages
Anuj Diwan, Preethi Jyothi
Knowledge Distillation Based Training of Universal ASR Source Models for Cross-Lingual Transfer
Takashi Fukuda, Samuel Thomas
Listen with Intent: Improving Speech Recognition with Audio-to-Intent Front-End
Swayambhu Nath Ray, Minhua Wu, Anirudh Raju, Pegah Ghahremani, Raghavendra Bilgi, Milind Rao, Harish Arsikere, Ariya Rastrow, Andreas Stolcke, Jasha Droppo
Exploring Targeted Universal Adversarial Perturbations to End-to-End ASR Models
Zhiyun Lu, Wei Han, Yu Zhang, Liangliang Cao
Earnings-21: A Practical Benchmark for ASR in the Wild
Miguel Del Rio, Natalie Delworth, Ryan Westerman, Michelle Huang, Nishchal Bhandari, Joseph Palakapilly, Quinten McNamara, Joshua Dong, Piotr Żelasko, Miguel Jetté
Improving Multilingual Transformer Transducer Models by Reducing Language Confusions
Eric Sun, Jinyu Li, Zhong Meng, Yu Wu, Jian Xue, Shujie Liu, Yifan Gong
Arabic Code-Switching Speech Recognition Using Monolingual Data
Ahmed Ali, Shammur Absar Chowdhury, Amir Hussein, Yasser Hifny
Online Blind Audio Source Separation Using Recursive Expectation-Maximization
Aviad Eisenberg, Boaz Schwartz, Sharon Gannot
Empirical Analysis of Generalized Iterative Speech Separation Networks
Yi Luo, Cong Han, Nima Mesgarani
Graph-PIT: Generalized Permutation Invariant Training for Continuous Separation of Arbitrary Numbers of Speakers
Thilo von Neumann, Keisuke Kinoshita, Christoph Boeddeker, Marc Delcroix, Reinhold Haeb-Umbach
Teacher-Student MixIT for Unsupervised and Semi-Supervised Speech Separation
Jisi Zhang, Cătălin Zorilă, Rama Doddipatla, Jon Barker
Few-Shot Learning of New Sound Classes for Target Sound Extraction
Marc Delcroix, Jorge Bennasar Vázquez, Tsubasa Ochiai, Keisuke Kinoshita, Shoko Araki
Binaural Speech Separation of Moving Speakers With Preserved Spatial Cues
Cong Han, Yi Luo, Nima Mesgarani
AvaTr: One-Shot Speaker Extraction with Transformers
Shell Xu Hu, Md. Rifat Arefin, Viet-Nhat Nguyen, Alish Dipani, Xaq Pitkow, Andreas Savas Tolias
Vocal Harmony Separation Using Time-Domain Neural Networks
Saurjya Sarkar, Emmanouil Benetos, Mark Sandler
Speaker Verification-Based Evaluation of Single-Channel Speech Separation
Matthew Maciejewski, Shinji Watanabe, Sanjeev Khudanpur
Improved Speech Separation with Time-and-Frequency Cross-Domain Feature Selection
Tian Lan, Yuxin Qian, Yilan Lyu, Refuoe Mokhosi, Wenxin Tai, Qiao Liu
Robust Speaker Extraction Network Based on Iterative Refined Adaptation
Chengyun Deng, Shiqian Ma, Yongtao Sha, Yi Zhang, Hui Zhang, Hui Song, Fei Wang
Neural Speaker Extraction with Speaker-Speech Cross-Attention Network
Wupeng Wang, Chenglin Xu, Meng Ge, Haizhou Li
Deep Audio-Visual Speech Separation Based on Facial Motion
Rémi Rigal, Jacques Chodorowski, Benoît Zerr
LEAP Submission for the Third DIHARD Diarization Challenge
Prachi Singh, Rajat Varma, Venkat Krishnamohan, Srikanth Raj Chetupalli, Sriram Ganapathy
Investigation of Spatial-Acoustic Features for Overlapping Speech Detection in Multiparty Meetings
Shiliang Zhang, Siqi Zheng, Weilong Huang, Ming Lei, Hongbin Suo, Jinwei Feng, Zhijie Yan
Target-Speaker Voice Activity Detection with Improved i-Vector Estimation for Unknown Number of Speaker
Maokui He, Desh Raj, Zili Huang, Jun Du, Zhuo Chen, Shinji Watanabe
ECAPA-TDNN Embeddings for Speaker Diarization
Nauman Dawalatabad, Mirco Ravanelli, François Grondin, Jenthe Thienpondt, Brecht Desplanques, Hwidong Na
Advances in Integration of End-to-End Neural and Clustering-Based Diarization for Real Conversational Speech
Keisuke Kinoshita, Marc Delcroix, Naohiro Tawara
The Third DIHARD Diarization Challenge
Neville Ryant, Prachi Singh, Venkat Krishnamohan, Rajat Varma, Kenneth Church, Christopher Cieri, Jun Du, Sriram Ganapathy, Mark Liberman
Robust End-to-End Speaker Diarization with Conformer and Additive Margin Penalty
Tsun-Yat Leung, Lahiru Samarakoon
Anonymous Speaker Clusters: Making Distinctions Between Anonymised Speech Recordings with Clustering Interface
Benjamin O’Brien, Natalia Tomashenko, Anaïs Chanclu, Jean-François Bonastre
Speaker Diarization Using Two-Pass Leave-One-Out Gaussian PLDA Clustering of DNN Embeddings
Kiran Karra, Alan McCree
Federated Learning with Dynamic Transformer for Text to Speech
Zhenhou Hong, Jianzong Wang, Xiaoyang Qu, Jie Liu, Chendong Zhao, Jing Xiao
LiteTTS: A Lightweight Mel-Spectrogram-Free Text-to-Wave Synthesizer Based on Generative Adversarial Networks
Huu-Kim Nguyen, Kihyuk Jeong, Seyun Um, Min-Jae Hwang, Eunwoo Song, Hong-Goo Kang
Zero-Shot Text-to-Speech for Text-Based Insertion in Audio Narration
Chuanxin Tang, Chong Luo, Zhiyuan Zhao, Dacheng Yin, Yucheng Zhao, Wenjun Zeng
Diff-TTS: A Denoising Diffusion Model for Text-to-Speech
Myeonghun Jeong, Hyeongju Kim, Sung Jun Cheon, Byoung Jin Choi, Nam Soo Kim
Hierarchical Context-Aware Transformers for Non-Autoregressive Text to Speech
Jae-Sung Bae, Taejun Bak, Young-Sun Joo, Hoon-Young Cho
Speech Resynthesis from Discrete Disentangled Self-Supervised Representations
Adam Polyak, Yossi Adi, Jade Copet, Eugene Kharitonov, Kushal Lakhotia, Wei-Ning Hsu, Abdelrahman Mohamed, Emmanuel Dupoux
A Learned Conditional Prior for the VAE Acoustic Space of a TTS System
Penny Karanasou, Sri Karlapati, Alexis Moinet, Arnaud Joly, Ammar Abbas, Simon Slangen, Jaime Lorenzo-Trueba, Thomas Drugman
A Universal Multi-Speaker Multi-Style Text-to-Speech via Disentangled Representation Learning Based on Rényi Divergence Minimization
Dipjyoti Paul, Sankar Mukherjee, Yannis Pantazis, Yannis Stylianou
Relational Data Selection for Data Augmentation of Speaker-Dependent Multi-Band MelGAN Vocoder
Yi-Chiao Wu, Cheng-Hung Hu, Hung-Shin Lee, Yu-Huai Peng, Wen-Chin Huang, Yu Tsao, Hsin-Min Wang, Tomoki Toda
Reinforce-Aligner: Reinforcement Alignment Search for Robust End-to-End Text-to-Speech
Hyunseung Chung, Sang-Hoon Lee, Seong-Whan Lee
Triple M: A Practical Text-to-Speech Synthesis System with Multi-Guidance Attention and Multi-Band Multi-Time LPCNet
Shilun Lin, Fenglong Xie, Li Meng, Xinhui Li, Li Lu
SC-GlowTTS: An Efficient Zero-Shot Multi-Speaker Text-To-Speech Model
Edresson Casanova, Christopher Shulby, Eren Gölge, Nicolas Michael Müller, Frederico Santos de Oliveira, Arnaldo Candido Jr., Anderson da Silva Soares, Sandra Maria Aluisio, Moacir Antonelli Ponti
Spoken ObjectNet: A Bias-Controlled Spoken Caption Dataset
Ian Palmer, Andrew Rouditchenko, Andrei Barbu, Boris Katz, James Glass
The Multilingual TEDx Corpus for Speech Recognition and Translation
Elizabeth Salesky, Matthew Wiesner, Jacob Bremerman, Roldano Cattoni, Matteo Negri, Marco Turchi, Douglas W. Oard, Matt Post
Tusom2021: A Phonetically Transcribed Speech Dataset from an Endangered Language for Universal Phone Recognition Experiments
David R. Mortensen, Jordan Picone, Xinjian Li, Kathleen Siminyu
AISHELL-4: An Open Source Dataset for Speech Enhancement, Separation, Recognition and Speaker Diarization in Conference Scenario
Yihui Fu, Luyao Cheng, Shubo Lv, Yukai Jv, Yuxiang Kong, Zhuo Chen, Yanxin Hu, Lei Xie, Jian Wu, Hui Bu, Xin Xu, Jun Du, Jingdong Chen
GigaSpeech: An Evolving, Multi-Domain ASR Corpus with 10,000 Hours of Transcribed Audio
Guoguo Chen, Shuzhou Chai, Guan-Bo Wang, Jiayu Du, Wei-Qiang Zhang, Chao Weng, Dan Su, Daniel Povey, Jan Trmal, Junbo Zhang, Mingjie Jin, Sanjeev Khudanpur, Shinji Watanabe, Shuaijiang Zhao, Wei Zou, Xiangang Li, Xuchen Yao, Yongqing Wang, Zhao You, Zhiyong Yan
Look Who’s Talking: Active Speaker Detection in the Wild
You Jin Kim, Hee-Soo Heo, Soyeon Choe, Soo-Whan Chung, Yoohwan Kwon, Bong-Jin Lee, Youngki Kwon, Joon Son Chung
AusKidTalk: An Auditory-Visual Corpus of 3- to 12-Year-Old Australian Children’s Speech
Beena Ahmed, Kirrie J. Ballard, Denis Burnham, Tharmakulasingam Sirojan, Hadi Mehmood, Dominique Estival, Elise Baker, Felicity Cox, Joanne Arciuli, Titia Benders, Katherine Demuth, Barbara Kelly, Chloé Diskin-Holdaway, Mostafa Shahin, Vidhyasaharan Sethu, Julien Epps, Chwee Beng Lee, Eliathamby Ambikairajah
Human-in-the-Loop Efficiency Analysis for Binary Classification in Edyson
Per Fallgren, Jens Edlund
Annotation Confidence vs. Training Sample Size: Trade-Off Solution for Partially-Continuous Categorical Emotion Recognition
Elena Ryumina, Oxana Verkholyak, Alexey Karpov
Europarl-ASR: A Large Corpus of Parliamentary Debates for Streaming ASR Benchmarking and Speech Data Filtering/Verbatimization
Gonçal V. Garcés Díaz-Munío, Joan-Albert Silvestre-Cerdà, Javier Jorge, Adrià Giménez Pastor, Javier Iranzo-Sánchez, Pau Baquero-Arnal, Nahuel Roselló, Alejandro Pérez-González-de-Martos, Jorge Civera, Albert Sanchis, Alfons Juan
Towards Automatic Speech to Sign Language Generation
Parul Kapoor, Rudrabha Mukhopadhyay, Sindhu B. Hegde, Vinay Namboodiri, C.V. Jawahar
kosp2e: Korean Speech to English Translation Corpus
Won Ik Cho, Seok Min Kim, Hyunchang Cho, Nam Soo Kim
speechocean762: An Open-Source Non-Native English Speech Corpus for Pronunciation Assessment
Junbo Zhang, Zhiwen Zhang, Yongqing Wang, Zhiyong Yan, Qiong Song, Yukai Huang, Ke Li, Daniel Povey, Yujun Wang
An Improved Single Step Non-Autoregressive Transformer for Automatic Speech Recognition
Ruchao Fan, Wei Chu, Peng Chang, Jing Xiao, Abeer Alwan
Multi-Speaker ASR Combining Non-Autoregressive Conformer CTC and Conditional Speaker Chain
Pengcheng Guo, Xuankai Chang, Shinji Watanabe, Lei Xie
Pushing the Limits of Non-Autoregressive Speech Recognition
Edwin G. Ng, Chung-Cheng Chiu, Yu Zhang, William Chan
Non-Autoregressive Predictive Coding for Learning Speech Representations from Local Dependencies
Alexander H. Liu, Yu-An Chung, James Glass
Relaxing the Conditional Independence Assumption of CTC-Based ASR by Conditioning on Intermediate Predictions
Jumon Nozaki, Tatsuya Komatsu
Toward Streaming ASR with Non-Autoregressive Insertion-Based Model
Yuya Fujita, Tianzi Wang, Shinji Watanabe, Motoi Omachi
Layer Pruning on Demand with Intermediate CTC
Jaesong Lee, Jingu Kang, Shinji Watanabe
Real-Time End-to-End Monaural Multi-Speaker Speech Recognition
Song Li, Beibei Ouyang, Fuchuan Tong, Dexin Liao, Lin Li, Qingyang Hong
Streaming End-to-End ASR Based on Blockwise Non-Autoregressive Models
Tianzi Wang, Yuya Fujita, Xuankai Chang, Shinji Watanabe
TalkNet: Non-Autoregressive Depth-Wise Separable Convolutional Model for Speech Synthesis
Stanislav Beliaev, Boris Ginsburg
WaveGrad 2: Iterative Refinement for Text-to-Speech Synthesis
Nanxin Chen, Yu Zhang, Heiga Zen, Ron J. Weiss, Mohammad Norouzi, Najim Dehak, William Chan
Align-Denoise: Single-Pass Non-Autoregressive Speech Recognition
Nanxin Chen, Piotr Żelasko, Laureano Moro-Velázquez, Jesús Villalba, Najim Dehak
VAENAR-TTS: Variational Auto-Encoder Based Non-AutoRegressive Text-to-Speech Synthesis
Hui Lu, Zhiyong Wu, Xixin Wu, Xu Li, Shiyin Kang, Xunying Liu, Helen Meng
Detecting Cognitive Decline Using Speech Only: The ADReSSo Challenge
Saturnino Luz, Fasih Haider, Sofia de la Fuente, Davida Fromm, Brian MacWhinney
Influence of the Interviewer on the Automatic Assessment of Alzheimer’s Disease in the Context of the ADReSSo Challenge
P.A. Pérez-Toro, S.P. Bayerl, T. Arias-Vergara, J.C. Vásquez-Correa, P. Klumpp, M. Schuster, Elmar Nöth, J.R. Orozco-Arroyave, K. Riedhammer
WavBERT: Exploiting Semantic and Non-Semantic Speech Using Wav2vec and BERT for Dementia Detection
Youxiang Zhu, Abdelrahman Obyat, Xiaohui Liang, John A. Batsis, Robert M. Roth
Alzheimer Disease Recognition Using Speech-Based Embeddings From Pre-Trained Models
Lara Gauder, Leonardo Pepino, Luciana Ferrer, Pablo Riera
Comparing Acoustic-Based Approaches for Alzheimer’s Disease Detection
Aparna Balagopalan, Jekaterina Novikova
Alzheimer’s Disease Detection from Spontaneous Speech Through Combining Linguistic Complexity and (Dis)Fluency Features with Pretrained Language Models
Yu Qiao, Xuefeng Yin, Daniel Wiechmann, Elma Kerz
Using the Outputs of Different Automatic Speech Recognition Paradigms for Acoustic- and BERT-Based Alzheimer’s Dementia Detection Through Spontaneous Speech
Yilin Pan, Bahman Mirheidari, Jennifer M. Harris, Jennifer C. Thompson, Matthew Jones, Julie S. Snowden, Daniel Blackburn, Heidi Christensen
Tackling the ADRESSO Challenge 2021: The MUET-RMIT System for Alzheimer’s Dementia Recognition from Spontaneous Speech
Zafi Sherhan Syed, Muhammad Shehram Shah Syed, Margaret Lech, Elena Pirogova
Alzheimer’s Dementia Recognition Using Acoustic, Lexical, Disfluency and Speech Pause Features Robust to Noisy Inputs
Morteza Rohanian, Julian Hough, Matthew Purver
Automatic Detection and Assessment of Alzheimer Disease Using Speech and Language Technologies in Low-Resource Scenarios
Raghavendra Pappagari, Jaejin Cho, Sonal Joshi, Laureano Moro-Velázquez, Piotr Żelasko, Jesús Villalba, Najim Dehak
Automatic Detection of Alzheimer’s Disease Using Spontaneous Speech Only
Jun Chen, Jieping Ye, Fengyi Tang, Jiayu Zhou
Modular Multi-Modal Attention Network for Alzheimer’s Disease Detection Using Patient Audio and Language Data
Ning Wang, Yupeng Cao, Shuai Hao, Zongru Shao, K.P. Subbalakshmi
Self-Attention Channel Combinator Frontend for End-to-End Multichannel Far-Field Speech Recognition
Rong Gong, Carl Quillen, Dushyant Sharma, Andrew Goderre, José Laínez, Ljubomir Milanović
ETLT 2021: Shared Task on Automatic Speech Recognition for Non-Native Children’s Speech
R. Gretter, Marco Matassoni, D. Falavigna, A. Misra, C.W. Leong, K. Knill, L. Wang
Age-Invariant Training for End-to-End Child Speech Recognition Using Adversarial Multi-Task Learning
Lars Rumberg, Hanna Ehlert, Ulrike Lüdtke, Jörn Ostermann
Learning to Rank Microphones for Distant Speech Recognition
Samuele Cornell, Alessio Brutti, Marco Matassoni, Stefano Squartini
Simulating Reading Mistakes for Child Speech Transformer-Based Phone Recognition
Lucile Gelin, Thomas Pellegrini, Julien Pinquier, Morgane Daniel
Alternate Endings: Improving Prosody for Incremental Neural TTS with Predicted Future Text Input
Brooke Stephenson, Thomas Hueber, Laurent Girin, Laurent Besacier
Exploring Emotional Prototypes in a High Dimensional TTS Latent Space
Pol van Rijn, Silvan Mertes, Dominik Schiller, Peter M.C. Harrison, Pauline Larrouy-Maestri, Elisabeth André, Nori Jacoby
Ctrl-P: Temporal Control of Prosodic Variation for Speech Synthesis
Devang S. Ram Mohan, Vivian Hu, Tian Huey Teh, Alexandra Torresquintero, Christopher G.R. Wallis, Marlene Staib, Lorenzo Foglianti, Jiameng Gao, Simon King
ADEPT: A Dataset for Evaluating Prosody Transfer
Alexandra Torresquintero, Tian Huey Teh, Christopher G.R. Wallis, Marlene Staib, Devang S. Ram Mohan, Vivian Hu, Lorenzo Foglianti, Jiameng Gao, Simon King
Prosodic Boundary Prediction Model for Vietnamese Text-To-Speech
Nguyen Thi Thu Trang, Nguyen Hoang Ky, Albert Rilliard, Christophe d’Alessandro
Many-Speakers Single Channel Speech Separation with Optimal Permutation Training
Shaked Dovrat, Eliya Nachmani, Lior Wolf
Combating Reverberation in NTF-Based Speech Separation Using a Sub-Source Weighted Multichannel Wiener Filter and Linear Prediction
Mieszko Fraś, Marcin Witkowski, Konrad Kowalczyk
A Hands-On Comparison of DNNs for Dialog Separation Using Transfer Learning from Music Source Separation
Martin Strauss, Jouni Paulus, Matteo Torcoli, Bernd Edler
GlobalPhone Mix-To-Separate Out of 2: A Multilingual 2000 Speakers Mixtures Database for Speech Separation
Marvin Borsdorf, Chenglin Xu, Haizhou Li, Tanja Schultz
Cross-Linguistic Perception of the Japanese Singleton/Geminate Contrast: Korean, Mandarin and Mongolian Compared
Kimiko Tsukada, Yurong, Joo-Yeon Kim, Jeong-Im Han, John Hajek
Detection of Lexical Stress Errors in Non-Native (L2) English with Data Augmentation and Attention
Daniel Korzekwa, Roberto Barra-Chicote, Szymon Zaporowski, Grzegorz Beringer, Jaime Lorenzo-Trueba, Alicja Serafinowicz, Jasha Droppo, Thomas Drugman, Bozena Kostek
Testing Acoustic Voice Quality Classification Across Languages and Speech Styles
Bettina Braun, Nicole Dehé, Marieke Einfeldt, Daniela Wochner, Katharina Zahner-Ritter
Acquisition of Prosodic Focus Marking by Three- to Six-Year-Old Children Learning Mandarin Chinese
Qianyutong Zhang, Kexin Lyu, Zening Chen, Ping Tang
Adaptive Listening Difficulty Detection for L2 Learners Through Moderating ASR Resources
Maryam Sadat Mirzaei, Kourosh Meshgi
F0 Patterns of L2 English Speech by Mandarin Chinese Learners
Hongwei Ding, Binghuai Lin, Liyuan Wang
A Neural Network-Based Noise Compensation Method for Pronunciation Assessment
Binghuai Lin, Liyuan Wang
Phonetic Distance and Surprisal in Multilingual Priming: Evidence from Slavic
Jacek Kudera, Philip Georgis, Bernd Möbius, Tania Avgustinova, Dietrich Klakow
A Preliminary Study on Discourse Prosody Encoding in L1 and L2 English Spontaneous Narratives
Yuqing Zhang, Zhu Li, Binghuai Lin, Jinsong Zhang
Transformer Based End-to-End Mispronunciation Detection and Diagnosis
Minglin Wu, Kun Li, Wai-Kim Leung, Helen Meng
L1 Identification from L2 Speech Using Neural Spectrogram Analysis
Calbert Graham
Leveraging Real-Time MRI for Illuminating Linguistic Velum Action
Miran Oh, Dani Byrd, Shrikanth S. Narayanan
Segmental Alignment of English Syllables with Singleton and Cluster Onsets
Zirui Liu, Yi Xu
Exploration of Welsh English Pre-Aspiration: How Wide-Spread is it?
Míša Hejná
Revisiting Recall Effects of Filler Particles in German and English
Beeke Muhlack, Mikey Elmers, Heiner Drenhaus, Jürgen Trouvain, Marjolein van Os, Raphael Werner, Margarita Ryzhova, Bernd Möbius
How Reliable Are Phonetic Data Collected Remotely? Comparison of Recording Devices and Environments on Acoustic Measurements
Chunyu Ge, Yixuan Xiong, Peggy Mok
A Cross-Dialectal Comparison of Apical Vowels in Beijing Mandarin, Northeastern Mandarin and Southwestern Mandarin: An EMA and Ultrasound Study
Jing Huang, Feng-fan Hsieh, Yueh-chin Chang
Dissecting the Aero-Acoustic Parameters of Open Articulatory Transitions
Mark Gibson, Oihane Muxika, Marianne Pouplier
Quantifying Vocal Tract Shape Variation and its Acoustic Impact: A Geometric Morphometric Approach
Amelia J. Gully
Speech Perception and Loanword Adaptations: The Case of Copy-Vowel Epenthesis
Adriana Guevara-Rukoz, Shi Yu, Sharon Peperkamp
Speakers Coarticulate Less When Facing Real and Imagined Communicative Difficulties: An Analysis of Read and Spontaneous Speech from the LUCID Corpus
Zhe-chen Guo, Rajka Smiljanic
Developmental Changes of Vowel Acoustics in Adolescents
Einar Meister, Lya Meister
Context and Co-Text Influence on the Accuracy Production of Italian L2 Non-Native Sounds
Sonia d’Apolito, Barbara Gili Fivela
A New Vowel Normalization for Sociophonetics
Wilbert Heeringa, Hans Van de Velde
The Pacific Expansion: Optimizing Phonetic Transcription of Archival Corpora
Rosey Billington, Hywel Stoakes, Nick Thieberger
FSR: Accelerating the Inference Process of Transducer-Based Models by Applying Fast-Skip Regularization
Zhengkun Tian, Jiangyan Yi, Ye Bai, Jianhua Tao, Shuai Zhang, Zhengqi Wen
LT-LM: A Novel Non-Autoregressive Language Model for Single-Shot Lattice Rescoring
Anton Mitrofanov, Mariya Korenevskaya, Ivan Podluzhny, Yuri Khokhlov, Aleksandr Laptev, Andrei Andrusenko, Aleksei Ilin, Maxim Korenevsky, Ivan Medennikov, Aleksei Romanenko
A Hybrid Seq-2-Seq ASR Design for On-Device and Server Applications
Cyril Allauzen, Ehsan Variani, Michael Riley, David Rybach, Hao Zhang
VAD-Free Streaming Hybrid CTC/Attention ASR for Unsegmented Recording
Hirofumi Inaguma, Tatsuya Kawahara
WeNet: Production Oriented Streaming and Non-Streaming End-to-End Speech Recognition Toolkit
Zhuoyuan Yao, Di Wu, Xiong Wang, Binbin Zhang, Fan Yu, Chao Yang, Zhendong Peng, Xiaoyu Chen, Lei Xie, Xin Lei
Cross-Modal Transformer-Based Neural Correction Models for Automatic Speech Recognition
Tomohiro Tanaka, Ryo Masumura, Mana Ihori, Akihiko Takashima, Takafumi Moriya, Takanori Ashihara, Shota Orihashi, Naoki Makishima
Deep Neural Network Calibration for E2E Speech Recognition System
Mun-Hak Lee, Joon-Hyuk Chang
Residual Energy-Based Models for End-to-End Speech Recognition
Qiujia Li, Yu Zhang, Bo Li, Liangliang Cao, Philip C. Woodland
Multi-Task Learning for End-to-End ASR Word and Utterance Confidence with Deletion Prediction
David Qiu, Yanzhang He, Qiujia Li, Yu Zhang, Liangliang Cao, Ian McGraw
Insights on Neural Representations for End-to-End Speech Recognition
Anna Ollerenshaw, Md. Asif Jalal, Thomas Hain
Sequence-Level Confidence Classifier for ASR Utterance Accuracy and Application to Acoustic Models
Amber Afshan, Kshitiz Kumar, Jian Wu
Unsupervised Learning of Disentangled Speech Content and Style Representation
Andros Tjandra, Ruoming Pang, Yu Zhang, Shigeki Karita
Label Embedding for Chinese Grapheme-to-Phoneme Conversion
Eunbi Choi, Hwa-Yeon Kim, Jong-Hwan Kim, Jae-Min Kim
PDF: Polyphone Disambiguation in Chinese by Using FLAT
Haiteng Zhang
Improving Polyphone Disambiguation for Mandarin Chinese by Combining Mix-Pooling Strategy and Window-Based Attention
Junjie Li, Zhiyu Zhang, Minchuan Chen, Jun Ma, Shaojun Wang, Jing Xiao
Polyphone Disambiguation in Mandarin Chinese with Semi-Supervised Learning
Yi Shi, Congyi Wang, Yu Chen, Bin Wang
A Neural-Network-Based Approach to Identifying Speakers in Novels
Yue Chen, Zhen-Hua Ling, Qing-Feng Liu
UnitNet-Based Hybrid Speech Synthesis
Xiao Zhou, Zhen-Hua Ling, Li-Rong Dai
Dynamically Adaptive Machine Speech Chain Inference for TTS in Noisy Environment: Listen and Speak Louder
Sashi Novitasari, Sakriani Sakti, Satoshi Nakamura
LinearSpeech: Parallel Text-to-Speech with Linear Complexity
Haozhe Zhang, Zhihua Huang, Zengqiang Shang, Pengyuan Zhang, Yonghong Yan
An Agent for Competing with Humans in a Deceptive Game Based on Vocal Cues
Noa Mansbach, Evgeny Hershkovitch Neiterman, Amos Azaria
A Multi-Branch Deep Learning Network for Automated Detection of COVID-19
Ahmed Fakhry, Xinyi Jiang, Jaclyn Xiao, Gunvant Chaudhari, Asriel Han
RW-Resnet: A Novel Speech Anti-Spoofing Model Using Raw Waveform
Youxuan Ma, Zongze Ren, Shugong Xu
Fake Audio Detection in Resource-Constrained Settings Using Microfeatures
Hira Dhamyal, Ayesha Ali, Ihsan Ayyub Qazi, Agha Ali Raza
Coughing-Based Recognition of Covid-19 with Spatial Attentive ConvLSTM Recurrent Neural Networks
Tianhao Yan, Hao Meng, Emilia Parada-Cabaleiro, Shuo Liu, Meishu Song, Björn W. Schuller
Knowledge Distillation for Singing Voice Detection
Soumava Paul, Gurunath Reddy M, K. Sreenivasa Rao, Partha Pratim Das
Age Estimation with Speech-Age Model for Heterogeneous Speech Datasets
Ryu Takeda, Kazunori Komatani
Open-Set Audio Classification with Limited Training Resources Based on Augmentation Enhanced Variational Auto-Encoder GAN with Detection-Classification Joint Training
Kah Kuan Teh, Huy Dat Tran
Deep Spectral-Cepstral Fusion for Shouted and Normal Speech Classification
Takahiro Fukumori
Automatic Detection of Shouted Speech Segments in Indian News Debates
Shikha Baghel, Mrinmoy Bhattacharjee, S.R. Mahadeva Prasanna, Prithwijit Guha
Generalized Spoofing Detection Inspired from Audio Generation Artifacts
Yang Gao, Tyler Vuong, Mahsa Elyasi, Gaurav Bharaj, Rita Singh
Overlapped Speech Detection Based on Spectral and Spatial Feature Fusion
Weiguang Chen, Van Tung Pham, Eng Siong Chng, Xionghu Zhong
Do Acoustic Word Embeddings Capture Phonological Similarity? An Empirical Study
Badr M. Abdullah, Marius Mosbach, Iuliia Zaitova, Bernd Möbius, Dietrich Klakow
Paraphrase Label Alignment for Voice Application Retrieval in Spoken Language Understanding
Zheng Gao, Radhika Arava, Qian Hu, Xibin Gao, Thahir Mohamed, Wei Xiao, Mohamed AbdelHady
Personalized Keyphrase Detection Using Speaker and Environment Information
Rajeev Rikhye, Quan Wang, Qiao Liang, Yanzhang He, Ding Zhao, Yiteng Huang, Arun Narayanan, Ian McGraw
Streaming Transformer for Hardware Efficient Voice Trigger Detection and False Trigger Mitigation
Vineet Garg, Wonil Chang, Siddharth Sigtia, Saurabh Adya, Pramod Simha, Pranay Dighe, Chandra Dhir
Few-Shot Keyword Spotting in Any Language
Mark Mazumder, Colby Banbury, Josh Meyer, Pete Warden, Vijay Janapa Reddi
Text Anchor Based Metric Learning for Small-Footprint Keyword Spotting
Li Wang, Rongzhi Gu, Nuo Chen, Yuexian Zou
A Meta-Learning Approach for User-Defined Spoken Term Classification with Varying Classes and Examples
Yangbin Chen, Tom Ko, Jianping Wang
Auxiliary Sequence Labeling Tasks for Disfluency Detection
Dongyub Lee, Byeongil Ko, Myeong Cheol Shin, Taesun Whang, Daniel Lee, Eunhwa Kim, Eunggyun Kim, Jaechoon Jo
Energy-Friendly Keyword Spotting System Using Add-Based Convolution
Hang Zhou, Wenchao Hu, Yu Ting Yeung, Xiao Chen
The 2020 Personalized Voice Trigger Challenge: Open Datasets, Evaluation Metrics, Baseline System and Results
Yan Jia, Xingming Wang, Xiaoyi Qin, Yinping Zhang, Xuyang Wang, Junjie Wang, Dong Zhang, Ming Li
Auto-KWS 2021 Challenge: Task, Datasets, and Baselines
Jingsong Wang, Yuxuan He, Chunyu Zhao, Qijie Shao, Wei-Wei Tu, Tom Ko, Hung-yi Lee, Lei Xie
Keyword Transformer: A Self-Attention Model for Keyword Spotting
Axel Berg, Mark O’Connor, Miguel Tairum Cruz
Teaching Keyword Spotters to Spot New Keywords with Limited Examples
Abhijeet Awasthi, Kevin Kilgour, Hassan Rom
A Comparative Study on Recent Neural Spoofing Countermeasures for Synthetic Speech Detection
Xin Wang, Junichi Yamagishi
An Initial Investigation for Detecting Partially Spoofed Audio
Lin Zhang, Xin Wang, Erica Cooper, Junichi Yamagishi, Jose Patino, Nicholas Evans
Siamese Network with wav2vec Feature for Spoofing Speech Detection
Yang Xie, Zhenchuan Zhang, Yingchun Yang
Cross-Database Replay Detection in Terminal-Dependent Speaker Verification
Xingliang Cheng, Mingxing Xu, Thomas Fang Zheng
The Effect of Silence and Dual-Band Fusion in Anti-Spoofing System
Yuxiang Zhang, Wenchao Wang, Pengyuan Zhang
Pairing Weak with Strong: Twin Models for Defending Against Adversarial Attack on Speaker Verification
Zhiyuan Peng, Xu Li, Tan Lee
Attention-Based Convolutional Neural Network for ASV Spoofing Detection
Hefei Ling, Leichao Huang, Junrui Huang, Baiyan Zhang, Ping Li
Voting for the Right Answer: Adversarial Defense for Speaker Verification
Haibin Wu, Yang Zhang, Zhiyong Wu, Dong Wang, Hung-yi Lee
Visualizing Classifier Adjacency Relations: A Case Study in Speaker Verification and Voice Anti-Spoofing
Tomi Kinnunen, Andreas Nautsch, Md. Sahidullah, Nicholas Evans, Xin Wang, Massimiliano Todisco, Héctor Delgado, Junichi Yamagishi, Kong Aik Lee
Representation Learning to Classify and Detect Adversarial Attacks Against Speaker and Speech Recognition Systems
Jesús Villalba, Sonal Joshi, Piotr Żelasko, Najim Dehak
An Empirical Study on Channel Effects for Synthetic Voice Spoofing Countermeasure Systems
You Zhang, Ge Zhu, Fei Jiang, Zhiyao Duan
Channel-Wise Gated Res2Net: Towards Robust Detection of Synthetic Speech Attacks
Xu Li, Xixin Wu, Hui Lu, Xunying Liu, Helen Meng
Partially-Connected Differentiable Architecture Search for Deepfake and Spoofing Detection
Wanying Ge, Michele Panariello, Jose Patino, Massimiliano Todisco, Nicholas Evans
OpenASR20: An Open Challenge for Automatic Speech Recognition of Conversational Telephone Speech in Low-Resource Languages
Kay Peterson, Audrey Tong, Yan Yu
Multitask Adaptation with Lattice-Free MMI for Multi-Genre Speech Recognition of Low Resource Languages
Srikanth Madikeri, Petr Motlicek, Hervé Bourlard
An Improved Wav2Vec 2.0 Pre-Training Approach Using Enhanced Local Dependency Modeling for Speech Recognition
Qiu-shi Zhu, Jie Zhang, Ming-hui Wu, Xin Fang, Li-Rong Dai
Systems for Low-Resource Speech Recognition Tasks in Open Automatic Speech Recognition and Formosa Speech Recognition Challenges
Hung-Pang Lin, Yu-Jia Zhang, Chia-Ping Chen
The TNT Team System Descriptions of Cantonese and Mongolian for IARPA OpenASR20
Jing Zhao, Zhiqiang Lv, Ambyera Han, Guan-Bo Wang, Guixin Shi, Jian Kang, Jinghao Yan, Pengfei Hu, Shen Huang, Wei-Qiang Zhang
Combining Hybrid and End-to-End Approaches for the OpenASR20 Challenge
Tanel Alumäe, Jiaming Kong
One Size Does Not Fit All in Resource-Constrained ASR
Ethan Morris, Robbie Jimerson, Emily Prud’hommeaux
Child Language Acquisition Studied with Wearables
Alejandrina Cristia
Unsupervised Representation Learning for Speech Activity Detection in the Fearless Steps Challenge 2021
Pablo Gimeno, Alfonso Ortega, Antonio Miguel, Eduardo Lleida
The Application of Learnable STRF Kernels to the 2021 Fearless Steps Phase-03 SAD Challenge
Tyler Vuong, Yangyang Xia, Richard M. Stern
Speech Activity Detection Based on Multilingual Speech Recognition System
Seyyed Saeed Sarfjoo, Srikanth Madikeri, Petr Motlicek
Voice Activity Detection with Teacher-Student Domain Emulation
Jarrod Luckenbaugh, Samuel Abplanalp, Rachel Gonzalez, Daniel Fulford, David Gard, Carlos Busso
EML Online Speech Activity Detection for the Fearless Steps Challenge Phase-III
Omid Ghahabi, Volker Fischer
Device Playback Augmentation with Echo Cancellation for Keyword Spotting
Kuba Łopatka, Katarzyna Kaszuba-Miotke, Piotr Klinke, Paweł Trella
End-to-End Open Vocabulary Keyword Search
Bolaji Yusuf, Alican Gok, Batuhan Gundogdu, Murat Saraclar
Semantic Sentence Similarity: Size does not Always Matter
Danny Merkx, Stefan L. Frank, Mirjam Ernestus
Spoken Term Detection and Relevance Score Estimation Using Dot-Product of Pronunciation Embeddings
Jan Švec, Luboš Šmídl, Josef V. Psutka, Aleš Pražák
Toward Genre Adapted Closed Captioning
François Buet, François Yvon
Weakly-Supervised Word-Level Pronunciation Error Detection in Non-Native English Speech
Daniel Korzekwa, Jaime Lorenzo-Trueba, Thomas Drugman, Shira Calamaro, Bozena Kostek
End-to-End Speaker-Attributed ASR with Transformer
Naoyuki Kanda, Guoli Ye, Yashesh Gaur, Xiaofei Wang, Zhong Meng, Zhuo Chen, Takuya Yoshioka
Understanding Medical Conversations: Rich Transcription, Confidence Scores & Information Extraction
Hagen Soltau, Mingqiu Wang, Izhak Shafran, Laurent El Shafey
Phone-Level Pronunciation Scoring for Spanish Speakers Learning English Using a GOP-DNN System
Jazmín Vidal, Cyntia Bonomi, Marcelo Sancinetti, Luciana Ferrer
Explore wav2vec 2.0 for Mispronunciation Detection
Xiaoshuo Xu, Yueteng Kang, Songjun Cao, Binghuai Lin, Long Ma
Lexical Density Analysis of Word Productions in Japanese English Using Acoustic Word Embeddings
Shintaro Ando, Nobuaki Minematsu, Daisuke Saito
Deep Feature Transfer Learning for Automatic Pronunciation Assessment
Binghuai Lin, Liyuan Wang
Multilingual Speech Evaluation: Case Studies on English, Malay and Tamil
Huayun Zhang, Ke Shi, Nancy F. Chen
A Study on Fine-Tuning wav2vec2.0 Model for the Task of Mispronunciation Detection and Diagnosis
Linkai Peng, Kaiqi Fu, Binghuai Lin, Dengfeng Ke, Jinsong Zhan
The Impact of ASR on the Automatic Analysis of Linguistic Complexity and Sophistication in Spontaneous L2 Speech
Yu Qiao, Wei Zhou, Elma Kerz, Ralf Schlüter
End-to-End Rich Transcription-Style Automatic Speech Recognition with Semi-Supervised Learning
Tomohiro Tanaka, Ryo Masumura, Mana Ihori, Akihiko Takashima, Shota Orihashi, Naoki Makishima
“You don’t understand me!”: Comparing ASR Results for L1 and L2 Speakers of Swedish
Ronald Cumbal, Birger Moell, José Lopes, Olov Engwall
NeMo Inverse Text Normalization: From Development to Production
Yang Zhang, Evelina Bakhturina, Kyle Gorman, Boris Ginsburg
Improvement of Automatic English Pronunciation Assessment with Small Number of Utterances Using Sentence Speakability
Satsuki Naijo, Akinori Ito, Takashi Nose
Affect Recognition Through Scalogram and Multi-Resolution Cochleagram Features
Fasih Haider, Saturnino Luz
A Speech Emotion Recognition Framework for Better Discrimination of Confusions
Jiawang Liu, Haoxiang Wang
Speech Emotion Recognition via Multi-Level Cross-Modal Distillation
Ruichen Li, Jinming Zhao, Qin Jin
Audio-Visual Speech Emotion Recognition by Disentangling Emotion and Identity Attributes
Koichiro Ito, Takuya Fujioka, Qinghua Sun, Kenji Nagamatsu
Parametric Distributions to Model Numerical Emotion Labels
Deboshree Bose, Vidhyasaharan Sethu, Eliathamby Ambikairajah
Metric Learning Based Feature Representation with Gated Fusion Model for Speech Emotion Recognition
Yuan Gao, Jiaxing Liu, Longbiao Wang, Jianwu Dang
Speech Emotion Recognition with Multi-Task Learning
Xingyu Cai, Jiahong Yuan, Renjie Zheng, Liang Huang, Kenneth Church
Generalized Dilated CNN Models for Depression Detection Using Inverted Vocal Tract Variables
Nadee Seneviratne, Carol Espy-Wilson
Learning Mutual Correlation in Multimodal Transformer for Speech Emotion Recognition
Yuhua Wang, Guang Shen, Yuezhu Xu, Jiahang Li, Zhengdao Zhao
Time-Frequency Representation Learning with Graph Convolutional Network for Dialogue-Level Speech Emotion Recognition
Jiaxing Liu, Yaodong Song, Longbiao Wang, Jianwu Dang, Ruiguo Yu
Compressing 1D Time-Channel Separable Convolutions Using Sparse Random Ternary Matrices
Gonçalo Mordido, Matthijs Van keirsbilck, Alexander Keller
Weakly Supervised Construction of ASR Systems from Massive Video Data
Mengli Cheng, Chengyu Wang, Jun Huang, Xiaobo Wang
Broadcasted Residual Learning for Efficient Keyword Spotting
Byeonggeun Kim, Simyung Chang, Jinkyu Lee, Dooyong Sung
CoDERT: Distilling Encoder Representations with Co-Learning for Transducer-Based Speech Recognition
Rupak Vignesh Swaminathan, Brian King, Grant P. Strimel, Jasha Droppo, Athanasios Mouchtaris
Extremely Low Footprint End-to-End ASR System for Smart Device
Zhifu Gao, Yiwu Yao, Shiliang Zhang, Jun Yang, Ming Lei, Ian McLoughlin
Dissecting User-Perceived Latency of On-Device E2E Speech Recognition
Yuan Shangguan, Rohit Prabhavalkar, Hang Su, Jay Mahadeokar, Yangyang Shi, Jiatong Zhou, Chunyang Wu, Duc Le, Ozlem Kalinli, Christian Fuegen, Michael L. Seltzer
Amortized Neural Networks for Low-Latency Speech Recognition
Jonathan Macoskey, Grant P. Strimel, Jinru Su, Ariya Rastrow
Tied & Reduced RNN-T Decoder
Rami Botros, Tara N. Sainath, Robert David, Emmanuel Guzman, Wei Li, Yanzhang He
PQK: Model Compression via Pruning, Quantization, and Knowledge Distillation
Jangho Kim, Simyung Chang, Nojun Kwak
Collaborative Training of Acoustic Encoders for Speech Recognition
Varun Nagaraja, Yangyang Shi, Ganesh Venkatesh, Ozlem Kalinli, Michael L. Seltzer, Vikas Chandra
Efficient Conformer with Prob-Sparse Attention Mechanism for End-to-End Speech Recognition
Xiong Wang, Sining Sun, Lei Xie, Long Ma
The Energy and Carbon Footprint of Training End-to-End Speech Recognizers
Titouan Parcollet, Mirco Ravanelli
Graph-Based Label Propagation for Semi-Supervised Speaker Identification
Long Chen, Venkatesh Ravichandran, Andreas Stolcke
Fusion of Embeddings Networks for Robust Combination of Text Dependent and Independent Speaker Recognition
Ruirui Li, Chelsea J.-T. Ju, Zeya Chen, Hongda Mao, Oguz Elibol, Andreas Stolcke
A Generative Model for Duration-Dependent Score Calibration
Sandro Cumani, Salvatore Sarni
Dr-Vectors: Decision Residual Networks and an Improved Loss for Speaker Recognition
Jason Pelecanos, Quan Wang, Ignacio Lopez Moreno
Multi-Channel Speaker Verification for Single and Multi-Talker Speech
Saurabh Kataria, Shi-Xiong Zhang, Dong Yu
Chronological Self-Training for Real-Time Speaker Diarization
Dirk Padfield, Daniel J. Liebling
Adaptive Margin Circle Loss for Speaker Verification
Runqiu Xiao, Xiaoxiao Miao, Wenchao Wang, Pengyuan Zhang, Bin Cai, Liuping Luo
Presentation Matters: Evaluating Speaker Identification Tasks
Benjamin O’Brien, Christine Meunier, Alain Ghio
Automatic Error Correction for Speaker Embedding Learning with Noisy Labels
Fuchuan Tong, Yan Liu, Song Li, Jie Wang, Lin Li, Qingyang Hong
An Integrated Framework for Two-Pass Personalized Voice Trigger
Dexin Liao, Jing Li, Yiming Zhi, Song Li, Qingyang Hong, Lin Li
Masked Proxy Loss for Text-Independent Speaker Verification
Jiachen Lian, Aiswarya Vinod Kumar, Hira Dhamyal, Bhiksha Raj, Rita Singh
STYLER: Style Factor Modeling with Rapidity and Robustness via Speech Decomposition for Expressive and Controllable Neural Text to Speech
Keon Lee, Kyumin Park, Daeyoung Kim
Reinforcement Learning for Emotional Text-to-Speech Synthesis with Improved Emotion Discriminability
Rui Liu, Berrak Sisman, Haizhou Li
Emotional Prosody Control for Speech Generation
Sarath Sivaprasad, Saiteja Kosgi, Vineet Gandhi
Controllable Context-Aware Conversational Speech Synthesis
Jian Cong, Shan Yang, Na Hu, Guangzhi Li, Lei Xie, Dan Su
Expressive Text-to-Speech Using Style Tag
Minchan Kim, Sung Jun Cheon, Byoung Jin Choi, Jong Jin Kim, Nam Soo Kim
Adaptive Text to Speech for Spontaneous Style
Yuzi Yan, Xu Tan, Bohan Li, Guangyan Zhang, Tao Qin, Sheng Zhao, Yuan Shen, Wei-Qiang Zhang, Tie-Yan Liu
Towards Multi-Scale Style Control for Expressive Speech Synthesis
Xiang Li, Changhe Song, Jingbei Li, Zhiyong Wu, Jia Jia, Helen Meng
Cross-Speaker Style Transfer with Prosody Bottleneck in Neural Speech Synthesis
Shifeng Pan, Lei He
Fine-Grained Style Modeling, Transfer and Prediction in Text-to-Speech Synthesis via Phone-Level Content-Style Disentanglement
Daxin Tan, Tan Lee
Improving Performance of Seen and Unseen Speech Style Transfer in End-to-End Neural TTS
Xiaochun An, Frank K. Soong, Lei Xie
Synthesis of Expressive Speaking Styles with Limited Training Data in a Multi-Speaker, Prosody-Controllable Sequence-to-Sequence Architecture
Slava Shechtman, Raul Fernandez, Alexander Sorin, David Haws
Intent Detection and Slot Filling for Vietnamese
Mai Hoang Dao, Thinh Hung Truong, Dat Quoc Nguyen
Augmenting Slot Values and Contexts for Spoken Language Understanding with Pretrained Models
Haitao Lin, Lu Xiang, Yu Zhou, Jiajun Zhang, Chengqing Zong
The Impact of Intent Distribution Mismatch on Semi-Supervised Spoken Language Understanding
Judith Gaspers, Quynh Do, Daniil Sorokin, Patrick Lehnen
Knowledge Distillation from BERT Transformer to Speech Transformer for Intent Classification
Yidi Jiang, Bidisha Sharma, Maulik Madhavi, Haizhou Li
Three-Module Modeling For End-to-End Spoken Language Understanding Using Pre-Trained DNN-HMM-Based Acoustic-Phonetic Model
Nick J.C. Wang, Lu Wang, Yandan Sun, Haimei Kang, Dejun Zhang
Speak or Chat with Me: End-to-End Spoken Language Understanding System with Flexible Inputs
Sujeong Cha, Wangrui Hou, Hyun Jung, My Phung, Michael Picheny, Hong-Kwang J. Kuo, Samuel Thomas, Edmilson Morais
End-to-End Cross-Lingual Spoken Language Understanding Model with Multilingual Pretraining
Xianwei Zhang, Liang He
Factorization-Aware Training of Transformers for Natural Language Understanding on the Edge
Hamidreza Saghir, Samridhi Choudhary, Sepehr Eghbali, Clement Chung
End-to-End Spoken Language Understanding for Generalized Voice Assistants
Michael Saxon, Samridhi Choudhary, Joseph P. McKenna, Athanasios Mouchtaris
Bi-Directional Joint Neural Networks for Intent Classification and Slot Filling
Soyeon Caren Han, Siqu Long, Huichun Li, Henry Weld, Josiah Poon
INTERSPEECH 2021 Acoustic Echo Cancellation Challenge
Ross Cutler, Ando Saabas, Tanel Parnamaa, Markus Loide, Sten Sootla, Marju Purin, Hannes Gamper, Sebastian Braun, Karsten Sorensen, Robert Aichner, Sriram Srinivasan
Acoustic Echo Cancellation with Cross-Domain Learning
Lukas Pfeifenberger, Matthias Zoehrer, Franz Pernkopf
F-T-LSTM Based Complex Network for Joint Acoustic Echo Cancellation and Speech Enhancement
Shimin Zhang, Yuxiang Kong, Shubo Lv, Yanxin Hu, Lei Xie
Y2-Net FCRN for Acoustic Echo and Noise Suppression
Ernst Seidel, Jan Franzen, Maximilian Strake, Tim Fingscheidt
Acoustic Echo Cancellation Using Deep Complex Neural Network with Nonlinear Magnitude Compression and Phase Information
Renhua Peng, Linjuan Cheng, Chengshi Zheng, Xiaodong Li
Nonlinear Acoustic Echo Cancellation with Deep Learning
Amir Ivry, Israel Cohen, Baruch Berdugo
Automatic Speech Recognition of Disordered Speech: Personalized Models Outperforming Human Listeners on Short Phrases
Jordan R. Green, Robert L. MacDonald, Pan-Pan Jiang, Julie Cattiau, Rus Heywood, Richard Cave, Katie Seaver, Marilyn A. Ladewig, Jimmy Tobin, Michael P. Brenner, Philip C. Nelson, Katrin Tomanek
Investigating the Utility of Multimodal Conversational Technology and Audiovisual Analytic Measures for the Assessment and Monitoring of Amyotrophic Lateral Sclerosis at Scale
Michael Neumann, Oliver Roesler, Jackson Liscombe, Hardik Kothare, David Suendermann-Oeft, David Pautler, Indu Navar, Aria Anvar, Jochen Kumm, Raquel Norel, Ernest Fraenkel, Alexander V. Sherman, James D. Berry, Gary L. Pattee, Jun Wang, Jordan R. Green, Vikram Ramanarayanan
Handling Acoustic Variation in Dysarthric Speech Recognition Systems Through Model Combination
Enno Hermann, Mathew Magimai-Doss
Spectro-Temporal Deep Features for Disordered Speech Assessment and Recognition
Mengzhe Geng, Shansong Liu, Jianwei Yu, Xurong Xie, Shoukang Hu, Zi Ye, Zengrui Jin, Xunying Liu, Helen Meng
Speaking with a KN95 Face Mask: ASR Performance and Speaker Compensation
Sarah E. Gutz, Hannah P. Rowe, Jordan R. Green
Adversarial Data Augmentation for Disordered Speech Recognition
Zengrui Jin, Mengzhe Geng, Xurong Xie, Jianwei Yu, Shansong Liu, Xunying Liu, Helen Meng
Variational Auto-Encoder Based Variability Encoding for Dysarthric Speech Recognition
Xurong Xie, Rukiye Ruzi, Xunying Liu, Lan Wang
Learning Explicit Prosody Models and Deep Speaker Embeddings for Atypical Voice Conversion
Disong Wang, Songxiang Liu, Lifa Sun, Xixin Wu, Xunying Liu, Helen Meng
Bayesian Parametric and Architectural Domain Adaptation of LF-MMI Trained TDNNs for Elderly and Dysarthric Speech Recognition
Jiajun Deng, Fabian Ritter Gutierrez, Shoukang Hu, Mengzhe Geng, Xurong Xie, Zi Ye, Shansong Liu, Jianwei Yu, Xunying Liu, Helen Meng
A Voice-Activated Switch for Persons with Motor and Speech Impairments: Isolated-Vowel Spotting Using Neural Networks
Shanqing Cai, Lisie Lillianfeld, Katie Seaver, Jordan R. Green, Michael P. Brenner, Philip C. Nelson, D. Sculley
Conformer Parrotron: A Faster and Stronger End-to-End Speech Conversion and Recognition Model for Atypical Speech
Zhehuai Chen, Bhuvana Ramabhadran, Fadi Biadsy, Xia Zhang, Youzheng Chen, Liyang Jiang, Fang Chu, Rohan Doshi, Pedro J. Moreno
Disordered Speech Data Collection: Lessons Learned at 1 Million Utterances from Project Euphonia
Robert L. MacDonald, Pan-Pan Jiang, Julie Cattiau, Rus Heywood, Richard Cave, Katie Seaver, Marilyn A. Ladewig, Jimmy Tobin, Michael P. Brenner, Philip C. Nelson, Jordan R. Green, Katrin Tomanek
Automatic Severity Classification of Korean Dysarthric Speech Using Phoneme-Level Pronunciation Features
Eun Jung Yeo, Sunhee Kim, Minhwa Chung
Comparing Supervised Models and Learned Speech Representations for Classifying Intelligibility of Disordered Speech on Selected Phrases
Subhashini Venugopalan, Joel Shor, Manoj Plakal, Jimmy Tobin, Katrin Tomanek, Jordan R. Green, Michael P. Brenner
Analysis and Tuning of a Voice Assistant System for Dysfluent Speech
Vikramjit Mitra, Zifang Huang, Colin Lea, Lauren Tooley, Sarah Wu, Darren Botten, Ashwini Palekar, Shrinath Thelapurath, Panayiotis Georgiou, Sachin Kajarekar, Jefferey Bigham
Interactive and Real-Time Acoustic Measurement Tools for Speech Data Acquisition and Presentation: Application of an Extended Member of Time Stretched Pulses
Hideki Kawahara, Kohei Yatabe, Ken-Ichi Sakakibara, Mitsunori Mizumachi, Masanori Morise, Hideki Banno, Toshio Irino
Save Your Voice: Voice Banking and TTS for Anyone
Daniel Tihelka, Markéta Řezáčková, Martin Grůber, Zdeněk Hanzlíček, Jakub Vít, Jindřich Matoušek
NeMo (Inverse) Text Normalization: From Development to Production
Yang Zhang, Evelina Bakhturina, Boris Ginsburg
Lalilo: A Reading Assistant for Children Featuring Speech Recognition-Based Reading Mistake Detection
Corentin Hembise, Lucile Gelin, Morgane Daniel
Automatic Radiology Report Editing Through Voice
Manh Hung Nguyen, Vu Hoang, Tu Anh Nguyen, Trung H. Bui
WittyKiddy: Multilingual Spoken Language Learning for Kids
Ke Shi, Kye Min Tan, Huayun Zhang, Siti Umairah Md. Salleh, Shikang Ni, Nancy F. Chen
Duplex Conversation in Outbound Agent System
Chunxiang Jin, Minghui Yang, Zujie Wen
Web Interface for Estimating Articulatory Movements in Speech Production from Acoustics and Text
Sathvik Udupa, Anwesha Roy, Abhayjeet Singh, Aravind Illa, Prasanta Kumar Ghosh