doi: 10.21437/Interspeech.2020
ISSN: 2958-1796
On the Comparison of Popular End-to-End Models for Large Scale Speech Recognition
Jinyu Li, Yu Wu, Yashesh Gaur, Chengyi Wang, Rui Zhao, Shujie Liu
SAN-M: Memory Equipped Self-Attention for End-to-End Speech Recognition
Zhifu Gao, Shiliang Zhang, Ming Lei, Ian McLoughlin
Contextual RNN-T for Open Domain ASR
Mahaveer Jain, Gil Keren, Jay Mahadeokar, Geoffrey Zweig, Florian Metze, Yatharth Saraf
ASAPP-ASR: Multistream CNN and Self-Attentive SRU for SOTA Speech Recognition
Jing Pan, Joshua Shapiro, Jeremy Wohlwend, Kyu J. Han, Tao Lei, Tao Ma
Compressing LSTM Networks with Hierarchical Coarse-Grain Sparsity
Deepak Kadetotad, Jian Meng, Visar Berisha, Chaitali Chakrabarti, Jae-sun Seo
BLSTM-Driven Stream Fusion for Automatic Speech Recognition: Novel Methods and a Multi-Size Window Fusion Example
Timo Lohrenz, Tim Fingscheidt
Relative Positional Encoding for Speech Recognition and Direct Translation
Ngoc-Quan Pham, Thanh-Le Ha, Tuan-Nam Nguyen, Thai-Son Nguyen, Elizabeth Salesky, Sebastian Stüker, Jan Niehues, Alex Waibel
Joint Speaker Counting, Speech Recognition, and Speaker Identification for Overlapped Speech of any Number of Speakers
Naoyuki Kanda, Yashesh Gaur, Xiaofei Wang, Zhong Meng, Zhuo Chen, Tianyan Zhou, Takuya Yoshioka
Implicit Transfer of Privileged Acoustic Information in a Generalized Knowledge Distillation Framework
Takashi Fukuda, Samuel Thomas
Effect of Adding Positional Information on Convolutional Neural Networks for End-to-End Speech Recognition
Jinhwan Park, Wonyong Sung
Deep Neural Network-Based Generalized Sidelobe Canceller for Robust Multi-Channel Speech Recognition
Guanjun Li, Shan Liang, Shuai Nie, Wenju Liu, Zhanlei Yang, Longshuai Xiao
Neural Spatio-Temporal Beamformer for Target Speech Separation
Yong Xu, Meng Yu, Shi-Xiong Zhang, Lianwu Chen, Chao Weng, Jianming Liu, Dong Yu
Online Directional Speech Enhancement Using Geometrically Constrained Independent Vector Analysis
Li Li, Kazuhito Koishida, Shoji Makino
End-to-End Multi-Look Keyword Spotting
Meng Yu, Xuan Ji, Bo Wu, Dan Su, Dong Yu
Differential Beamforming for Uniform Circular Array with Directional Microphones
Weilong Huang, Jinwei Feng
Exploring Deep Hybrid Tensor-to-Vector Network Architectures for Regression Based Speech Enhancement
Jun Qi, Hu Hu, Yannan Wang, Chao-Han Huck Yang, Sabato Marco Siniscalchi, Chin-Hui Lee
An End-to-End Architecture of Online Multi-Channel Speech Separation
Jian Wu, Zhuo Chen, Jinyu Li, Takuya Yoshioka, Zhili Tan, Edward Lin, Yi Luo, Lei Xie
Mentoring-Reverse Mentoring for Unsupervised Multi-Channel Speech Source Separation
Yu Nakagome, Masahito Togami, Tetsuji Ogawa, Tetsunori Kobayashi
Computationally Efficient and Versatile Framework for Joint Optimization of Blind Speech Separation and Dereverberation
Tomohiro Nakatani, Rintaro Ikeshita, Keisuke Kinoshita, Hiroshi Sawada, Shoko Araki
A Space-and-Speaker-Aware Iterative Mask Estimation Approach to Multi-Channel Speech Recognition in the CHiME-6 Challenge
Yan-Hui Tu, Jun Du, Lei Sun, Feng Ma, Jia Pan, Chin-Hui Lee
Identifying Causal Relationships Between Behavior and Local Brain Activity During Natural Conversation
Youssef Hmamouche, Laurent Prévot, Magalie Ochs, Thierry Chaminade
Neural Entrainment to Natural Speech Envelope Based on Subject Aligned EEG Signals
Di Zhou, Gaoyan Zhang, Jianwu Dang, Shuang Wu, Zhuo Zhang
Does Lexical Retrieval Deteriorate in Patients with Mild Cognitive Impairment? Analysis of Brain Functional Network Will Tell
Chongyuan Lian, Tianqi Wang, Mingxiao Gu, Manwa L. Ng, Feiqi Zhu, Lan Wang, Nan Yan
Congruent Audiovisual Speech Enhances Cortical Envelope Tracking During Auditory Selective Attention
Zhen Fu, Jing Chen
Contribution of RMS-Level-Based Speech Segments to Target Speech Decoding Under Noisy Conditions
Lei Wang, Ed X. Wu, Fei Chen
Cortical Oscillatory Hierarchy for Natural Sentence Processing
Bin Zhao, Jianwu Dang, Gaoyan Zhang, Masashi Unoki
Comparing EEG Analyses with Different Epoch Alignments in an Auditory Lexical Decision Experiment
Louis ten Bosch, Kimberley Mulder, Lou Boves
Detection of Subclinical Mild Traumatic Brain Injury (mTBI) Through Speech and Gait
Tanya Talkar, Sophia Yuditskaya, James R. Williamson, Adam C. Lammert, Hrishikesh Rao, Daniel Hannon, Anne O’Brien, Gloria Vergara-Diaz, Richard DeLaura, Douglas Sturim, Gregory Ciccarelli, Ross Zafonte, Jeffrey Palmer, Paolo Bonato, Thomas F. Quatieri
Towards Learning a Universal Non-Semantic Representation of Speech
Joel Shor, Aren Jansen, Ronnie Maor, Oran Lang, Omry Tuval, Félix de Chaumont Quitry, Marco Tagliasacchi, Ira Shavitt, Dotan Emanuel, Yinnon Haviv
Poetic Meter Classification Using i-Vector-MTF Fusion
Rajeev Rajan, Aiswarya Vinod Kumar, Ben P. Babu
Formant Tracking Using Dilated Convolutional Networks Through Dense Connection with Gating Mechanism
Wang Dai, Jinsong Zhang, Yingming Gao, Wei Wei, Dengfeng Ke, Binghuai Lin, Yanlu Xie
Automatic Analysis of Speech Prosody in Dutch
Na Hu, Berit Janssen, Judith Hanssen, Carlos Gussenhoven, Aoju Chen
Learning Voice Representation Using Knowledge Distillation for Automatic Voice Casting
Adrien Gresse, Mathias Quillot, Richard Dufour, Jean-François Bonastre
Enhancing Formant Information in Spectrographic Display of Speech
B. Yegnanarayana, Anand Joseph, Vishala Pannala
Unsupervised Methods for Evaluating Speech Representations
Michael Gump, Wei-Ning Hsu, James Glass
Robust Pitch Regression with Voiced/Unvoiced Classification in Nonstationary Noise Environments
Dung N. Tran, Uros Batricevic, Kazuhito Koishida
Nonlinear ISA with Auxiliary Variables for Learning Speech Representations
Amrith Setlur, Barnabás Póczos, Alan W. Black
Harmonic Lowering for Accelerating Harmonic Convolution for Audio Signals
Hirotoshi Takeuchi, Kunio Kashino, Yasunori Ohishi, Hiroshi Saruwatari
Knowledge-and-Data-Driven Amplitude Spectrum Prediction for Hierarchical Neural Vocoders
Yang Ai, Zhen-Hua Ling
FeatherWave: An Efficient High-Fidelity Neural Vocoder with Multi-Band Linear Prediction
Qiao Tian, Zewang Zhang, Heng Lu, Ling-Hui Chen, Shan Liu
VocGAN: A High-Fidelity Real-Time Vocoder with a Hierarchically-Nested Adversarial Network
Jinhyeok Yang, Junmo Lee, Youngik Kim, Hoon-Young Cho, Injung Kim
Lightweight LPCNet-Based Neural Vocoder with Tensor Decomposition
Hiroki Kanagawa, Yusuke Ijima
WG-WaveNet: Real-Time High-Fidelity Speech Synthesis Without GPU
Po-chun Hsu, Hung-yi Lee
What the Future Brings: Investigating the Impact of Lookahead for Incremental Neural TTS
Brooke Stephenson, Laurent Besacier, Laurent Girin, Thomas Hueber
Fast and Lightweight On-Device TTS with Tacotron2 and LPCNet
Vadim Popov, Stanislav Kamenev, Mikhail Kudinov, Sergey Repyevsky, Tasnima Sadekova, Vitalii Bushaev, Vladimir Kryzhanovskiy, Denis Parkhomenko
Efficient WaveGlow: An Improved WaveGlow Vocoder with Enhanced Speed
Wei Song, Guanghui Xu, Zhengchen Zhang, Chao Zhang, Xiaodong He, Bowen Zhou
Can Auditory Nerve Models Tell us What’s Different About WaveNet Vocoded Speech?
Sébastien Le Maguer, Naomi Harte
Speaker Conditional WaveRNN: Towards Universal Neural Vocoder for Unseen Speaker and Recording Conditions
Dipjyoti Paul, Yannis Pantazis, Yannis Stylianou
Neural Homomorphic Vocoder
Zhijun Liu, Kuan Chen, Kai Yu
Overview of the Interspeech TLT2020 Shared Task on ASR for Non-Native Children’s Speech
Roberto Gretter, Marco Matassoni, Daniele Falavigna, Keelan Evanini, Chee Wee Leong
The NTNU System at the Interspeech 2020 Non-Native Children’s Speech ASR Challenge
Tien-Hong Lo, Fu-An Chao, Shi-Yan Weng, Berlin Chen
Non-Native Children’s Automatic Speech Recognition: The INTERSPEECH 2020 Shared Task ALTA Systems
Kate M. Knill, Linlin Wang, Yu Wang, Xixin Wu, Mark J.F. Gales
Data Augmentation Using Prosody and False Starts to Recognize Non-Native Children’s Speech
Hemant Kathania, Mittul Singh, Tamás Grósz, Mikko Kurimo
UNSW System Description for the Shared Task on Automatic Speech Recognition for Non-Native Children’s Speech
Mostafa Shahin, Renée Lu, Julien Epps, Beena Ahmed
End-to-End Speaker Diarization for an Unknown Number of Speakers with Encoder-Decoder Based Attractors
Shota Horiguchi, Yusuke Fujita, Shinji Watanabe, Yawen Xue, Kenji Nagamatsu
Target-Speaker Voice Activity Detection: A Novel Approach for Multi-Speaker Diarization in a Dinner Party Scenario
Ivan Medennikov, Maxim Korenevsky, Tatiana Prisyach, Yuri Khokhlov, Mariya Korenevskaya, Ivan Sorokin, Tatiana Timofeeva, Anton Mitrofanov, Andrei Andrusenko, Ivan Podluzhny, Aleksandr Laptev, Aleksei Romanenko
New Advances in Speaker Diarization
Hagai Aronowitz, Weizhong Zhu, Masayuki Suzuki, Gakuto Kurata, Ron Hoory
Self-Attentive Similarity Measurement Strategies in Speaker Diarization
Qingjian Lin, Yu Hou, Ming Li
Speaker Attribution with Voice Profiles by Graph-Based Semi-Supervised Learning
Jixuan Wang, Xiong Xiao, Jian Wu, Ranjani Ramamurthy, Frank Rudzicz, Michael Brudno
Deep Self-Supervised Hierarchical Clustering for Speaker Diarization
Prachi Singh, Sriram Ganapathy
Spot the Conversation: Speaker Diarisation in the Wild
Joon Son Chung, Jaesung Huh, Arsha Nagrani, Triantafyllos Afouras, Andrew Zisserman
Learning Contextual Language Embeddings for Monaural Multi-Talker Speech Recognition
Wangyou Zhang, Yanmin Qian
Double Adversarial Network Based Monaural Speech Enhancement for Robust Speech Recognition
Zhihao Du, Jiqing Han, Xueliang Zhang
Anti-Aliasing Regularization in Stacking Layers
Antoine Bruguier, Ananya Misra, Arun Narayanan, Rohit Prabhavalkar
Towards a Competitive End-to-End Speech Recognition for CHiME-6 Dinner Party Transcription
Andrei Andrusenko, Aleksandr Laptev, Ivan Medennikov
End-to-End Far-Field Speech Recognition with Unified Dereverberation and Beamforming
Wangyou Zhang, Aswin Shanmugam Subramanian, Xuankai Chang, Shinji Watanabe, Yanmin Qian
Quaternion Neural Networks for Multi-Channel Distant Speech Recognition
Xinchi Qiu, Titouan Parcollet, Mirco Ravanelli, Nicholas D. Lane, Mohamed Morchid
Improved Guided Source Separation Integrated with a Strong Back-End for the CHiME-6 Dinner Party Scenario
Hangting Chen, Pengyuan Zhang, Qian Shi, Zuozhen Liu
Neural Speech Separation Using Spatially Distributed Microphones
Dongmei Wang, Zhuo Chen, Takuya Yoshioka
Utterance-Wise Meeting Transcription System Using Asynchronous Distributed Microphones
Shota Horiguchi, Yusuke Fujita, Kenji Nagamatsu
Simulating Realistically-Spatialised Simultaneous Speech Using Video-Driven Speaker Detection and the CHiME-5 Dataset
Jack Deadman, Jon Barker
Toward Silent Paralinguistics: Speech-to-EMG — Retrieving Articulatory Muscle Activity from Speech
Catarina Botelho, Lorenz Diener, Dennis Küster, Kevin Scheck, Shahin Amiriparian, Björn W. Schuller, Tanja Schultz, Alberto Abad, Isabel Trancoso
Multimodal Deception Detection Using Automatically Extracted Acoustic, Visual, and Lexical Features
Jiaxuan Zhang, Sarah Ita Levitan, Julia Hirschberg
Multi-Modal Attention for Speech Emotion Recognition
Zexu Pan, Zhaojie Luo, Jichen Yang, Haizhou Li
WISE: Word-Level Interaction-Based Multimodal Fusion for Speech Emotion Recognition
Guang Shen, Riwei Lai, Rui Chen, Yu Zhang, Kejia Zhang, Qilong Han, Hongtao Song
A Multi-Scale Fusion Framework for Bimodal Speech Emotion Recognition
Ming Chen, Xudong Zhao
Group Gated Fusion on Attention-Based Bidirectional Alignment for Multimodal Emotion Recognition
Pengfei Liu, Kun Li, Helen Meng
Multi-Modal Embeddings Using Multi-Task Learning for Emotion Recognition
Aparna Khare, Srinivas Parthasarathy, Shiva Sundaram
Using Speaker-Aligned Graph Memory Block in Multimodally Attentive Emotion Recognition Network
Jeng-Lin Li, Chi-Chun Lee
Context-Dependent Domain Adversarial Neural Network for Multimodal Emotion Recognition
Zheng Lian, Jianhua Tao, Bin Liu, Jian Huang, Zhanlei Yang, Rongjun Li
ATCSpeech: A Multilingual Pilot-Controller Speech Corpus from Real Air Traffic Control Environment
Bo Yang, Xianlong Tan, Zhengmao Chen, Bing Wang, Min Ruan, Dan Li, Zhongping Yang, Xiping Wu, Yi Lin
Developing an Open-Source Corpus of Yoruba Speech
Alexander Gutkin, Işın Demirşahin, Oddur Kjartansson, Clara Rivera, Kọ́lá Túbọ̀sún
ClovaCall: Korean Goal-Oriented Dialog Speech Corpus for Automatic Speech Recognition of Contact Centers
Jung-Woo Ha, Kihyun Nam, Jingu Kang, Sang-Woo Lee, Sohee Yang, Hyunhoon Jung, Hyeji Kim, Eunmi Kim, Soojin Kim, Hyun Ah Kim, Kyoungtae Doh, Chan Kyu Lee, Nako Sung, Sunghun Kim
LAIX Corpus of Chinese Learner English: Towards a Benchmark for L2 English ASR
Yanhong Wang, Huan Luan, Jiahong Yuan, Bin Wang, Hui Lin
Design and Development of a Human-Machine Dialog Corpus for the Automated Assessment of Conversational English Proficiency
Vikram Ramanarayanan
CUCHILD: A Large-Scale Cantonese Corpus of Child Speech for Phonology and Articulation Assessment
Si-Ioi Ng, Cymie Wing-Yee Ng, Jiarui Wang, Tan Lee, Kathy Yuet-Sheung Lee, Michael Chi-Fai Tong
FinChat: Corpus and Evaluation Setup for Finnish Chat Conversations on Everyday Topics
Katri Leino, Juho Leinonen, Mittul Singh, Sami Virpioja, Mikko Kurimo
DiPCo — Dinner Party Corpus
Maarten Van Segbroeck, Ahmed Zaid, Ksenia Kutsenko, Cirenia Huerta, Tinh Nguyen, Xuewen Luo, Björn Hoffmeister, Jan Trmal, Maurizio Omologo, Roland Maas
Learning to Detect Bipolar Disorder and Borderline Personality Disorder with Language and Speech in Non-Clinical Interviews
Bo Wang, Yue Wu, Niall Taylor, Terry Lyons, Maria Liakata, Alejo J. Nevado-Holgado, Kate E.A. Saunders
FT Speech: Danish Parliament Speech Corpus
Andreas Kirkedal, Marija Stepanović, Barbara Plank
Metric Learning Loss Functions to Reduce Domain Mismatch in the x-Vector Space for Language Recognition
Raphaël Duroselle, Denis Jouvet, Irina Illina
The XMUSPEECH System for the AP19-OLR Challenge
Zheng Li, Miao Zhao, Jing Li, Yiming Zhi, Lin Li, Qingyang Hong
On the Usage of Multi-Feature Integration for Speaker Verification and Language Identification
Zheng Li, Miao Zhao, Jing Li, Lin Li, Qingyang Hong
What Does an End-to-End Dialect Identification Model Learn About Non-Dialectal Information?
Shammur A. Chowdhury, Ahmed Ali, Suwon Shon, James Glass
Releasing a Toolkit and Comparing the Performance of Language Embeddings Across Various Spoken Language Identification Datasets
Matias Lindgren, Tommi Jauhiainen, Mikko Kurimo
Learning Intonation Pattern Embeddings for Arabic Dialect Identification
Aitor Arronte Alvarez, Elsayed Sabry Abdelaal Issa
Cross-Domain Adaptation of Spoken Language Identification for Related Languages: The Curious Case of Slavic Languages
Badr M. Abdullah, Tania Avgustinova, Bernd Möbius, Dietrich Klakow
ICE-Talk: An Interface for a Controllable Expressive Talking Machine
Noé Tits, Kevin El Haddad, Thierry Dutoit
Kaldi-Web: An Installation-Free, On-Device Speech Recognition System
Mathieu Hu, Laurent Pierron, Emmanuel Vincent, Denis Jouvet
Soapbox Labs Verification Platform for Child Speech
Amelia C. Kelly, Eleni Karamichali, Armin Saeb, Karel Veselý, Nicholas Parslow, Agape Deng, Arnaud Letondor, Robert O’Regan, Qiru Zhou
SoapBox Labs Fluency Assessment Platform for Child Speech
Amelia C. Kelly, Eleni Karamichali, Armin Saeb, Karel Veselý, Nicholas Parslow, Gloria Montoya Gomez, Agape Deng, Arnaud Letondor, Niall Mullally, Adrian Hempel, Robert O’Regan, Qiru Zhou
CATOTRON — A Neural Text-to-Speech System in Catalan
Baybars Külebi, Alp Öktem, Alex Peiró-Lilja, Santiago Pascual, Mireia Farrús
Toward Remote Patient Monitoring of Speech, Video, Cognitive and Respiratory Biomarkers Using Multimodal Dialog Technology
Vikram Ramanarayanan, Oliver Roesler, Michael Neumann, David Pautler, Doug Habberstad, Andrew Cornish, Hardik Kothare, Vignesh Murali, Jackson Liscombe, Dirk Schnelle-Walka, Patrick Lange, David Suendermann-Oeft
VoiceID on the Fly: A Speaker Recognition System that Learns from Scratch
Baihan Lin, Xinxin Zhang
Enhancing Transferability of Black-Box Adversarial Attacks via Lifelong Learning for Speech Emotion Recognition Models
Zhao Ren, Jing Han, Nicholas Cummins, Björn W. Schuller
End-to-End Speech Emotion Recognition Combined with Acoustic-to-Word ASR Model
Han Feng, Sei Ueno, Tatsuya Kawahara
Improving Speech Emotion Recognition Using Graph Attentive Bi-Directional Gated Recurrent Unit Network
Bo-Hao Su, Chun-Min Chang, Yun-Shao Lin, Chi-Chun Lee
An Investigation of Cross-Cultural Semi-Supervised Learning for Continuous Affect Recognition
Adria Mallol-Ragolta, Nicholas Cummins, Björn W. Schuller
Ensemble of Students Taught by Probabilistic Teachers to Improve Speech Emotion Recognition
Kusha Sridhar, Carlos Busso
Augmenting Generative Adversarial Networks for Speech Emotion Recognition
Siddique Latif, Muhammad Asim, Rajib Rana, Sara Khalifa, Raja Jurdak, Björn W. Schuller
Speech Emotion Recognition ‘in the Wild’ Using an Autoencoder
Vipula Dissanayake, Haimo Zhang, Mark Billinghurst, Suranga Nanayakkara
Emotion Profile Refinery for Speech Emotion Classification
Shuiyang Mao, P.C. Ching, Tan Lee
Speech Representation Learning for Emotion Recognition Using End-to-End ASR with Factorized Adaptation
Sung-Lin Yeh, Yun-Shao Lin, Chi-Chun Lee
Fast and Slow Acoustic Model
Kshitiz Kumar, Emilian Stoimenov, Hosam Khalil, Jian Wu
Self-Distillation for Improving CTC-Transformer-Based ASR Systems
Takafumi Moriya, Tsubasa Ochiai, Shigeki Karita, Hiroshi Sato, Tomohiro Tanaka, Takanori Ashihara, Ryo Masumura, Yusuke Shinohara, Marc Delcroix
Single Headed Attention Based Sequence-to-Sequence Model for State-of-the-Art Results on Switchboard
Zoltán Tüske, George Saon, Kartik Audhkhasi, Brian Kingsbury
Improving Speech Recognition Using GAN-Based Speech Synthesis and Contrastive Unspoken Text Selection
Zhehuai Chen, Andrew Rosenberg, Yu Zhang, Gary Wang, Bhuvana Ramabhadran, Pedro J. Moreno
PyChain: A Fully Parallelized PyTorch Implementation of LF-MMI for End-to-End ASR
Yiwen Shao, Yiming Wang, Daniel Povey, Sanjeev Khudanpur
CAT: A CTC-CRF Based ASR Toolkit Bridging the Hybrid and the End-to-End Approaches Towards Data Efficiency and Low Latency
Keyu An, Hongyu Xiang, Zhijian Ou
CTC-Synchronous Training for Monotonic Attention Model
Hirofumi Inaguma, Masato Mimura, Tatsuya Kawahara
Continual Learning for Multi-Dialect Acoustic Models
Brady Houston, Katrin Kirchhoff
SpecSwap: A Simple Data Augmentation Method for End-to-End Speech Recognition
Xingchen Song, Zhiyong Wu, Yiheng Huang, Dan Su, Helen Meng
RECOApy: Data Recording, Pre-Processing and Phonetic Transcription for End-to-End Speech-Based Applications
Adriana Stan
Analyzing the Quality and Stability of a Streaming End-to-End On-Device Speech Recognizer
Yuan Shangguan, Kate Knister, Yanzhang He, Ian McGraw, Françoise Beaufays
Statistical Testing on ASR Performance via Blockwise Bootstrap
Zhe Liu, Fuchun Peng
Sentence Level Estimation of Psycholinguistic Norms Using Joint Multidimensional Annotations
Anil Ramakrishna, Shrikanth Narayanan
Neural Zero-Inflated Quality Estimation Model for Automatic Speech Recognition System
Kai Fan, Bo Li, Jiayi Wang, Shiliang Zhang, Boxing Chen, Niyu Ge, Zhijie Yan
Confidence Measures in Encoder-Decoder Models for Speech Recognition
Alejandro Woodward, Clara Bonnín, Issey Masuda, David Varas, Elisenda Bou-Balust, Juan Carlos Riveiro
Word Error Rate Estimation Without ASR Output: e-WER2
Ahmed Ali, Steve Renals
An Evaluation of Manual and Semi-Automatic Laughter Annotation
Bogdan Ludusan, Petra Wagner
Understanding Racial Disparities in Automatic Speech Recognition: The Case of Habitual “be”
Joshua L. Martin, Kevin Tang
Secondary Phonetic Cues in the Production of the Nasal Short-a System in California English
Georgia Zellou, Rebecca Scarborough, Renee Kemp
Acoustic Properties of Strident Fricatives at the Edges: Implications for Consonant Discrimination
Louis-Marie Lorin, Lorenzo Maselli, Léo Varnet, Maria Giavazzi
Processes and Consequences of Co-Articulation in Mandarin V1N.(C2)V2 Context: Phonology and Phonetics
Mingqiong Luo
Voicing Distinction of Obstruents in the Hangzhou Wu Chinese Dialect
Yang Yue, Fang Hu
The Phonology and Phonetics of Kaifeng Mandarin Vowels
Lei Wang
Microprosodic Variability in Plosives in German and Austrian German
Margaret Zellers, Barbara Schuppler
Er-Suffixation in Southwestern Mandarin: An EMA and Ultrasound Study
Jing Huang, Feng-fan Hsieh, Yueh-chin Chang
Electroglottographic-Phonetic Study on Korean Phonation Induced by Tripartite Plosives in Yanbian Korean
Yinghao Li, Jinghua Zhang
Modeling Global Body Configurations in American Sign Language
Nicholas Wilkins, Max Cordes Galbraith, Ifeoma Nwogu
Augmenting Turn-Taking Prediction with Wearable Eye Activity During Conversation
Hang Li, Siyuan Chen, Julien Epps
CAM: Uninteresting Speech Detector
Weiyi Lu, Yi Xu, Peng Yang, Belinda Zeng
Mixed Case Contextual ASR Using Capitalization Masks
Diamantino Caseiro, Pat Rondon, Quoc-Nam Le The, Petar Aleksic
Speech Recognition and Multi-Speaker Diarization of Long Conversations
Huanru Henry Mao, Shuyang Li, Julian McAuley, Garrison W. Cottrell
Investigation of Data Augmentation Techniques for Disordered Speech Recognition
Mengzhe Geng, Xurong Xie, Shansong Liu, Jianwei Yu, Shoukang Hu, Xunying Liu, Helen Meng
A Real-Time Robot-Based Auxiliary System for Risk Evaluation of COVID-19 Infection
Wenqi Wei, Jianzong Wang, Jiteng Ma, Ning Cheng, Jing Xiao
An Utterance Verification System for Word Naming Therapy in Aphasia
David S. Barbera, Mark Huckvale, Victoria Fleming, Emily Upton, Henry Coley-Fisher, Ian Shaw, William Latham, Alexander P. Leff, Jenny Crinion
Exploiting Cross-Domain Visual Feature Generation for Disordered Speech Recognition
Shansong Liu, Xurong Xie, Jianwei Yu, Shoukang Hu, Mengzhe Geng, Rongfeng Su, Shi-Xiong Zhang, Xunying Liu, Helen Meng
Joint Prediction of Punctuation and Disfluency in Speech Transcripts
Binghuai Lin, Liyuan Wang
Focal Loss for Punctuation Prediction
Jiangyan Yi, Jianhua Tao, Zhengkun Tian, Ye Bai, Cunhang Fan
Improving X-Vector and PLDA for Text-Dependent Speaker Verification
Zhuxin Chen, Yue Lin
SdSV Challenge 2020: Large-Scale Evaluation of Short-Duration Speaker Verification
Hossein Zeinali, Kong Aik Lee, Jahangir Alam, Lukáš Burget
The XMUSPEECH System for Short-Duration Speaker Verification Challenge 2020
Tao Jiang, Miao Zhao, Lin Li, Qingyang Hong
Robust Text-Dependent Speaker Verification via Character-Level Information Preservation for the SdSV Challenge 2020
Sung Hwan Mun, Woo Hyun Kang, Min Hyun Han, Nam Soo Kim
The TalTech Systems for the Short-Duration Speaker Verification Challenge 2020
Tanel Alumäe, Jörgen Valk
Investigation of NICT Submission for Short-Duration Speaker Verification Challenge 2020
Peng Shen, Xugang Lu, Hisashi Kawai
Cross-Lingual Speaker Verification with Domain-Balanced Hard Prototype Mining and Language-Dependent Score Normalization
Jenthe Thienpondt, Brecht Desplanques, Kris Demuynck
BUT Text-Dependent Speaker Verification System for SdSV Challenge 2020
Alicia Lozano-Diez, Anna Silnova, Bhargav Pulugundla, Johan Rohdin, Karel Veselý, Lukáš Burget, Oldřich Plchot, Ondřej Glembek, Ondřej Novotný, Pavel Matějka
Exploring the Use of an Unsupervised Autoregressive Model as a Shared Encoder for Text-Dependent Speaker Verification
Vijay Ravi, Ruchao Fan, Amber Afshan, Huanhua Lu, Abeer Alwan
Recognition-Synthesis Based Non-Parallel Voice Conversion with Adversarial Learning
Jing-Xuan Zhang, Zhen-Hua Ling, Li-Rong Dai
Improving the Speaker Identity of Non-Parallel Many-to-Many Voice Conversion with Adversarial Speaker Recognition
Shaojin Ding, Guanlong Zhao, Ricardo Gutierrez-Osuna
Non-Parallel Many-to-Many Voice Conversion with PSR-StarGAN
Yanping Li, Dongxiang Xu, Yan Zhang, Yang Wang, Binbin Chen
TTS Skins: Speaker Conversion via ASR
Adam Polyak, Lior Wolf, Yaniv Taigman
GAZEV: GAN-Based Zero-Shot Voice Conversion Over Non-Parallel Speech Corpus
Zining Zhang, Bingsheng He, Zhenjie Zhang
Spoken Content and Voice Factorization for Few-Shot Speaker Adaptation
Tao Wang, Jianhua Tao, Ruibo Fu, Jiangyan Yi, Zhengqi Wen, Rongxiu Zhong
Unsupervised Cross-Domain Singing Voice Conversion
Adam Polyak, Lior Wolf, Yossi Adi, Yaniv Taigman
Attention-Based Speaker Embeddings for One-Shot Voice Conversion
Tatsuma Ishihara, Daisuke Saito
Data Efficient Voice Cloning from Noisy Samples with Domain Adversarial Training
Jian Cong, Shan Yang, Lei Xie, Guoqiao Yu, Guanglu Wan
Gated Multi-Head Attention Pooling for Weakly Labelled Audio Tagging
Sixin Hong, Yuexian Zou, Wenwu Wang
Environmental Sound Classification with Parallel Temporal-Spectral Attention
Helin Wang, Yuexian Zou, Dading Chong, Wenwu Wang
Contrastive Predictive Coding of Audio with an Adversary
Luyu Wang, Kazuya Kawakami, Aaron van den Oord
Memory Controlled Sequential Self Attention for Sound Recognition
Arjun Pankajakshan, Helen L. Bear, Vinod Subramanian, Emmanouil Benetos
Dual Stage Learning Based Dynamic Time-Frequency Mask Generation for Audio Event Classification
Donghyeon Kim, Jaihyun Park, David K. Han, Hanseok Ko
An Effective Perturbation Based Semi-Supervised Learning Method for Sound Event Detection
Xu Zheng, Yan Song, Jie Yan, Li-Rong Dai, Ian McLoughlin, Lin Liu
A Joint Framework for Audio Tagging and Weakly Supervised Acoustic Event Detection Using DenseNet with Global Average Pooling
Chieh-Chi Kao, Bowen Shi, Ming Sun, Chao Wang
Intra-Utterance Similarity Preserving Knowledge Distillation for Audio Tagging
Chun-Chieh Chang, Chieh-Chi Kao, Ming Sun, Chao Wang
Two-Stage Polyphonic Sound Event Detection Based on Faster R-CNN-LSTM with Multi-Token Connectionist Temporal Classification
Inyoung Park, Hong Kook Kim
SpeechMix — Augmenting Deep Sound Recognition Using Hidden Space Interpolations
Amit Jindal, Narayanan Elavathur Ranganatha, Aniket Didolkar, Arijit Ghosh Chowdhury, Di Jin, Ramit Sawhney, Rajiv Ratn Shah
End-to-End Neural Transformer Based Spoken Language Understanding
Martin Radfar, Athanasios Mouchtaris, Siegfried Kunzmann
Jointly Encoding Word Confusion Network and Dialogue Context with BERT for Spoken Language Understanding
Chen Liu, Su Zhu, Zijian Zhao, Ruisheng Cao, Lu Chen, Kai Yu
Speech to Semantics: Improve ASR and NLU Jointly via All-Neural Interfaces
Milind Rao, Anirudh Raju, Pranav Dheram, Bach Bui, Ariya Rastrow
Pretrained Semantic Speech Embeddings for End-to-End Spoken Language Understanding via Cross-Modal Teacher-Student Learning
Pavel Denisov, Ngoc Thang Vu
Context Dependent RNNLM for Automatic Transcription of Conversations
Srikanth Raj Chetupalli, Sriram Ganapathy
Improving End-to-End Speech-to-Intent Classification with Reptile
Yusheng Tian, Philip John Gorinski
Speech to Text Adaptation: Towards an Efficient Cross-Modal Distillation
Won Ik Cho, Donghyun Kwak, Ji Won Yoon, Nam Soo Kim
Towards an ASR Error Robust Spoken Language Understanding System
Weitong Ruan, Yaroslav Nechaev, Luoxin Chen, Chengwei Su, Imre Kiss
End-to-End Spoken Language Understanding Without Full Transcripts
Hong-Kwang J. Kuo, Zoltán Tüske, Samuel Thomas, Yinghui Huang, Kartik Audhkhasi, Brian Kingsbury, Gakuto Kurata, Zvi Kons, Ron Hoory, Luis Lastras
Are Neural Open-Domain Dialog Systems Robust to Speech Recognition Errors in the Dialog History? An Empirical Study
Karthik Gopalakrishnan, Behnam Hedayatnia, Longshaokan Wang, Yang Liu, Dilek Hakkani-Tür
AutoSpeech: Neural Architecture Search for Speaker Recognition
Shaojin Ding, Tianlong Chen, Xinyu Gong, Weiwei Zha, Zhangyang Wang
Densely Connected Time Delay Neural Network for Speaker Verification
Ya-Qi Yu, Wu-Jun Li
Phonetically-Aware Coupled Network For Short Duration Text-Independent Speaker Verification
Siqi Zheng, Yun Lei, Hongbin Suo
Multi-Task Network for Noise-Robust Keyword Spotting and Speaker Verification Using CTC-Based Soft VAD and Global Query Attention
Myunghun Jung, Youngmoon Jung, Jahyun Goo, Hoirin Kim
Vector-Based Attentive Pooling for Text-Independent Speaker Verification
Yanfeng Wu, Chenkai Guo, Hongcan Gao, Xiaolei Hou, Jing Xu
Self-Attention Encoding and Pooling for Speaker Recognition
Pooyan Safari, Miquel India, Javier Hernando
ARET: Aggregated Residual Extended Time-Delay Neural Networks for Speaker Verification
Ruiteng Zhang, Jianguo Wei, Wenhuan Lu, Longbiao Wang, Meng Liu, Lin Zhang, Jiayu Jin, Junhai Xu
Adversarial Separation Network for Speaker Recognition
Hanyi Zhang, Longbiao Wang, Yunchun Zhang, Meng Liu, Kong Aik Lee, Jianguo Wei
Text-Independent Speaker Verification with Dual Attention Network
Jingyu Li, Tan Lee
Evolutionary Algorithm Enhanced Neural Architecture Search for Text-Independent Speaker Verification
Xiaoyang Qu, Jianzong Wang, Jing Xiao
Minimum Bayes Risk Training of RNN-Transducer for End-to-End Speech Recognition
Chao Weng, Chengzhu Yu, Jia Cui, Chunlei Zhang, Dong Yu
Semantic Mask for Transformer Based End-to-End Speech Recognition
Chengyi Wang, Yu Wu, Yujiao Du, Jinyu Li, Shujie Liu, Liang Lu, Shuo Ren, Guoli Ye, Sheng Zhao, Ming Zhou
Faster, Simpler and More Accurate Hybrid ASR Systems Using Wordpieces
Frank Zhang, Yongqiang Wang, Xiaohui Zhang, Chunxi Liu, Yatharth Saraf, Geoffrey Zweig
A Federated Approach in Training Acoustic Models
Dimitrios Dimitriadis, Kenichi Kumatani, Robert Gmyr, Yashesh Gaur, Sefik Emre Eskimez
On Semi-Supervised LF-MMI Training of Acoustic Models with Limited Data
Imran Sheikh, Emmanuel Vincent, Irina Illina
On Front-End Gain Invariant Modeling for Wake Word Spotting
Yixin Gao, Noah D. Stein, Chieh-Chi Kao, Yunliang Cai, Ming Sun, Tao Zhang, Shiv Naga Prasad Vitaladevuni
Unsupervised Regularization-Based Adaptive Training for Speech Recognition
Fenglin Ding, Wu Guo, Bin Gu, Zhen-Hua Ling, Jun Du
On the Robustness and Training Dynamics of Raw Waveform Models
Erfan Loweimi, Peter Bell, Steve Renals
Iterative Pseudo-Labeling for Speech Recognition
Qiantong Xu, Tatiana Likhomanenko, Jacob Kahn, Awni Hannun, Gabriel Synnaeve, Ronan Collobert
Smart Tube: A Biofeedback System for Vocal Training and Therapy Through Tube Phonation
Naoko Kawamura, Tatsuya Kitamura, Kenta Hamada
VCTUBE: A Library for Automatic Speech Data Annotation
Seong Choi, Seunghoon Jeong, Jeewoo Yoon, Migyeong Yang, Minsam Ko, Eunil Park, Jinyoung Han, Munyoung Lee, Seonghee Lee
A Mandarin L2 Learning APP with Mispronunciation Detection and Feedback
Yanlu Xie, Xiaoli Feng, Boxue Li, Jinsong Zhang, Yujia Jin
Rapid Enhancement of NLP Systems by Acquisition of Data in Correlated Domains
Tejas Udayakumar, Kinnera Saranu, Mayuresh Sanjay Oak, Ajit Ashok Saunshikar, Sandip Shriram Bapat
Computer-Assisted Language Learning System: Automatic Speech Evaluation for Children Learning Malay and Tamil
Ke Shi, Kye Min Tan, Richeng Duan, Siti Umairah Md. Salleh, Nur Farah Ain Suhaimi, Rajan Vellu, Ngoc Thuy Huong Helen Thai, Nancy F. Chen
Real-Time, Full-Band, Online DNN-Based Voice Conversion System Using a Single CPU
Takaaki Saeki, Yuki Saito, Shinnosuke Takamichi, Hiroshi Saruwatari
A Dynamic 3D Pronunciation Teaching Model Based on Pronunciation Attributes and Anatomy
Xiaoli Feng, Yanlu Xie, Yayue Deng, Boxue Li
End-to-End Deep Learning Speech Recognition Model for Silent Speech Challenge
Naoki Kimura, Zixiong Su, Takaaki Saeki
Autosegmental Neural Nets: Should Phones and Tones be Synchronous or Asynchronous?
Jialu Li, Mark Hasegawa-Johnson
Development of Multilingual ASR Using GlobalPhone for Less-Resourced Languages: The Case of Ethiopian Languages
Martha Yifiru Tachbelie, Solomon Teferra Abate, Tanja Schultz
Large-Scale End-to-End Multilingual Speech Recognition and Language Identification with Multi-Task Learning
Wenxin Hou, Yue Dong, Bairong Zhuang, Longfei Yang, Jiatong Shi, Takahiro Shinozaki
Multi-Encoder-Decoder Transformer for Code-Switching Speech Recognition
Xinyuan Zhou, Emre Yılmaz, Yanhua Long, Yijie Li, Haizhou Li
Multilingual Acoustic and Language Modeling for Ethio-Semitic Languages
Solomon Teferra Abate, Martha Yifiru Tachbelie, Tanja Schultz
Multilingual Jointly Trained Acoustic and Written Word Embeddings
Yushi Hu, Shane Settle, Karen Livescu
Improving Code-Switching Language Modeling with Artificially Generated Texts Using Cycle-Consistent Adversarial Networks
Chia-Yu Li, Ngoc Thang Vu
Data Augmentation for Code-Switch Language Modeling by Fusing Multiple Text Generation Methods
Xinhui Hu, Qi Zhang, Lei Yang, Binbin Gu, Xinkang Xu
A 43 Language Multilingual Punctuation Prediction Neural Network Model
Xinxing Li, Edward Lin
Exploring Lexicon-Free Modeling Units for End-to-End Korean and Korean-English Code-Switching Speech Recognition
Jisung Wang, Jihwan Kim, Sangki Kim, Yeha Lee
Multi-Task Siamese Neural Network for Improving Replay Attack Detection
Patrick von Platen, Fei Tao, Gokhan Tur
POCO: A Voice Spoofing and Liveness Detection Corpus Based on Pop Noise
Kosuke Akimoto, Seng Pei Liew, Sakiko Mishima, Ryo Mizushima, Kong Aik Lee
Dual-Adversarial Domain Adaptation for Generalized Replay Attack Detection
Hongji Wang, Heinrich Dinkel, Shuai Wang, Yanmin Qian, Kai Yu
Self-Supervised Pre-Training with Acoustic Configurations for Replay Spoofing Detection
Hye-jin Shim, Hee-Soo Heo, Jee-weon Jung, Ha-Jin Yu
Competency Evaluation in Voice Mimicking Using Acoustic Cues
Abhijith G., Adharsh S., Akshay P. L., Rajeev Rajan
Light Convolutional Neural Network with Feature Genuinization for Detection of Synthetic Speech Attacks
Zhenzong Wu, Rohan Kumar Das, Jichen Yang, Haizhou Li
Spoofing Attack Detection Using the Non-Linear Fusion of Sub-Band Classifiers
Hemlata Tak, Jose Patino, Andreas Nautsch, Nicholas Evans, Massimiliano Todisco
Investigating Light-ResNet Architecture for Spoofing Detection Under Mismatched Conditions
Prasanth Parasu, Julien Epps, Kaavya Sriskandaraja, Gajan Suthokumar
Siamese Convolutional Neural Network Using Gaussian Probability Feature for Spoofing Speech Detection
Zhenchun Lei, Yingen Yang, Changhong Liu, Jihua Ye
Lightweight Online Noise Reduction on Embedded Devices Using Hierarchical Recurrent Neural Networks
H. Schröter, T. Rosenkranz, A.N. Escalante-B., P. Zobel, Andreas Maier
SEANet: A Multi-Modal Speech Enhancement Network
Marco Tagliasacchi, Yunpeng Li, Karolis Misiunas, Dominik Roblek
Lite Audio-Visual Speech Enhancement
Shang-Yi Chuang, Yu Tsao, Chen-Chou Lo, Hsin-Min Wang
ORCA-CLEAN: A Deep Denoising Toolkit for Killer Whale Communication
Christian Bergler, Manuel Schmitt, Andreas Maier, Simeon Smeele, Volker Barth, Elmar Nöth
A Deep Learning Approach to Active Noise Control
Hao Zhang, DeLiang Wang
Improving Speech Intelligibility Through Speaker Dependent and Independent Spectral Style Conversion
Tuan Dinh, Alexander Kain, Kris Tjaden
End-to-End Speech Intelligibility Prediction Using Time-Domain Fully Convolutional Neural Networks
Mathias B. Pedersen, Morten Kolbæk, Asger H. Andersen, Søren H. Jensen, Jesper Jensen
Predicting Intelligibility of Enhanced Speech Using Posteriors Derived from DNN-Based ASR System
Kenichi Arai, Shoko Araki, Atsunori Ogawa, Keisuke Kinoshita, Tomohiro Nakatani, Toshio Irino
Automatic Estimation of Intelligibility Measure for Consonants in Speech
Ali Abavisani, Mark Hasegawa-Johnson
Large Scale Evaluation of Importance Maps in Automatic Speech Recognition
Viet Anh Trinh, Michael I. Mandel
Neural Architecture Search on Acoustic Scene Classification
Jixiang Li, Chuming Liang, Bo Zhang, Zhao Wang, Fei Xiang, Xiangxiang Chu
Acoustic Scene Classification Using Audio Tagging
Jee-weon Jung, Hye-jin Shim, Ju-ho Kim, Seung-bin Kim, Ha-Jin Yu
ATReSN-Net: Capturing Attentive Temporal Relations in Semantic Neighborhood for Acoustic Scene Classification
Liwen Zhang, Jiqing Han, Ziqiang Shi
Environment Sound Classification Using Multiple Feature Channels and Attention Based Deep Convolutional Neural Network
Jivitesh Sharma, Ole-Christoffer Granmo, Morten Goodwin
Acoustic Scene Analysis with Multi-Head Attention Networks
Weimin Wang, Weiran Wang, Ming Sun, Chao Wang
Relational Teacher Student Learning with Neural Label Embedding for Device Adaptation in Acoustic Scene Classification
Hu Hu, Sabato Marco Siniscalchi, Yannan Wang, Chin-Hui Lee
An Acoustic Segment Model Based Segment Unit Selection Approach to Acoustic Scene Classification with Partial Utterances
Hu Hu, Sabato Marco Siniscalchi, Yannan Wang, Xue Bai, Jun Du, Chin-Hui Lee
Attention-Driven Projections for Soundscape Classification
Dhanunjaya Varma Devalraju, Muralikrishna H., Padmanabhan Rajan, Dileep Aroor Dinesh
Computer Audition for Continuous Rainforest Occupancy Monitoring: The Case of Bornean Gibbons’ Call Detection
Panagiotis Tzirakis, Alexander Shiarella, Robert Ewers, Björn W. Schuller
Deep Learning Based Open Set Acoustic Scene Classification
Zuzanna Kwiatkowska, Beniamin Kalinowski, Michał Kośmider, Krzysztof Rykaczewski
Singing Synthesis: With a Little Help from my Attention
Orazio Angelini, Alexis Moinet, Kayoko Yanagisawa, Thomas Drugman
Peking Opera Synthesis via Duration Informed Attention Network
Yusong Wu, Shengchen Li, Chengzhu Yu, Heng Lu, Chao Weng, Liqiang Zhang, Dong Yu
DurIAN-SC: Duration Informed Attention Network Based Singing Voice Conversion System
Liqiang Zhang, Chengzhu Yu, Heng Lu, Chao Weng, Chunlei Zhang, Yusong Wu, Xiang Xie, Zijin Li, Dong Yu
Transfer Learning for Improving Singing-Voice Detection in Polyphonic Instrumental Music
Yuanbo Hou, Frank K. Soong, Jian Luan, Shengchen Li
Channel-Wise Subband Input for Better Voice and Accompaniment Separation on High Resolution Music
Haohe Liu, Lei Xie, Jian Wu, Geng Yang
Continual Learning in Automatic Speech Recognition
Samik Sadhu, Hynek Hermansky
Speaker Adaptive Training for Speech Recognition Based on Attention-Over-Attention Mechanism
Genshun Wan, Jia Pan, Qingran Wang, Jianqing Gao, Zhongfu Ye
Rapid RNN-T Adaptation Using Personalized Speech Synthesis and Neural Language Generator
Yan Huang, Jinyu Li, Lei He, Wenning Wei, William Gale, Yifan Gong
Speech Transformer with Speaker Aware Persistent Memory
Yingzhu Zhao, Chongjia Ni, Cheung-Chi Leung, Shafiq Joty, Eng Siong Chng, Bin Ma
Adaptive Speaker Normalization for CTC-Based Speech Recognition
Fenglin Ding, Wu Guo, Bin Gu, Zhen-Hua Ling, Jun Du
Unsupervised Domain Adaptation Under Label Space Mismatch for Speech Classification
Akhil Mathur, Nadia Berthouze, Nicholas D. Lane
Learning Fast Adaptation on Cross-Accented Speech Recognition
Genta Indra Winata, Samuel Cahyawijaya, Zihan Liu, Zhaojiang Lin, Andrea Madotto, Peng Xu, Pascale Fung
Black-Box Adaptation of ASR for Accented Speech
Kartik Khandelwal, Preethi Jyothi, Abhijeet Awasthi, Sunita Sarawagi
Achieving Multi-Accent ASR via Unsupervised Acoustic Model Adaptation
M.A. Tuğtekin Turan, Emmanuel Vincent, Denis Jouvet
Frame-Wise Online Unsupervised Adaptation of DNN-HMM Acoustic Model from Perspective of Robust Adaptive Filtering
Ryu Takeda, Kazunori Komatani
Adversarially Trained Multi-Singer Sequence-to-Sequence Singing Synthesizer
Jie Wu, Jian Luan
Prediction of Head Motion from Speech Waveforms with a Canonical-Correlation-Constrained Autoencoder
JinHong Lu, Hiroshi Shimodaira
XiaoiceSing: A High-Quality and Integrated Singing Voice Synthesis System
Peiling Lu, Jie Wu, Jian Luan, Xu Tan, Li Zhou
Stochastic Talking Face Generation Using Latent Distribution Matching
Ravindra Yadav, Ashish Sardana, Vinay P. Namboodiri, Rajesh M. Hegde
Speech-to-Singing Conversion Based on Boundary Equilibrium GAN
Da-Yi Wu, Yi-Hsuan Yang
Face2Speech: Towards Multi-Speaker Text-to-Speech Synthesis Using an Embedding Vector Predicted from a Face Image
Shunsuke Goto, Kotaro Onishi, Yuki Saito, Kentaro Tachibana, Koichiro Mori
Speech Driven Talking Head Generation via Attentional Landmarks Based Representation
Wentao Wang, Yan Wang, Jianqing Sun, Qingsong Liu, Jiaen Liang, Teng Li
Optimization and Evaluation of an Intelligibility-Improving Signal Processing Approach (IISPA) for the Hurricane Challenge 2.0 with FADE
Marc René Schädler
iMetricGAN: Intelligibility Enhancement for Speech-in-Noise Using Generative Adversarial Network-Based Metric Learning
Haoyu Li, Szu-Wei Fu, Yu Tsao, Junichi Yamagishi
Intelligibility-Enhancing Speech Modifications — The Hurricane Challenge 2.0
Jan Rennies, Henning Schepker, Cassia Valentini-Botinhao, Martin Cooke
Exploring Listeners’ Speech Rate Preferences
Olympia Simantiraki, Martin Cooke
Adaptive Compressive Onset-Enhancement for Improved Speech Intelligibility in Noise and Reverberation
Felicitas Bederna, Henning Schepker, Christian Rollwage, Simon Doclo, Arne Pusch, Jörg Bitzer, Jan Rennies
A Sound Engineering Approach to Near End Listening Enhancement
Carol Chermaz, Simon King
Enhancing Speech Intelligibility in Text-To-Speech Synthesis Using Speaking Style Conversion
Dipjyoti Paul, Muhammed P.V. Shifas, Yannis Pantazis, Yannis Stylianou
Two Different Mechanisms of Movable Mandible for Vocal-Tract Model with Flexible Tongue
Takayuki Arai
Improving the Performance of Acoustic-to-Articulatory Inversion by Removing the Training Loss of Noncritical Portions of Articulatory Channels Dynamically
Qiang Fang
Speaker Conditioned Acoustic-to-Articulatory Inversion Using x-Vectors
Aravind Illa, Prasanta Kumar Ghosh
Coarticulation as Synchronised Sequential Target Approximation: An EMA Study
Zirui Liu, Yi Xu, Feng-fan Hsieh
Improved Model for Vocal Folds with a Polyp with Potential Application
Jônatas Santos, Jugurta Montalvão, Israel Santos
Regional Resonance of the Lower Vocal Tract and its Contribution to Speaker Characteristics
Lin Zhang, Kiyoshi Honda, Jianguo Wei, Seiji Adachi
Air-Tissue Boundary Segmentation in Real Time Magnetic Resonance Imaging Video Using 3-D Convolutional Neural Network
Renuka Mannem, Navaneetha Gaddam, Prasanta Kumar Ghosh
An Investigation of the Virtual Lip Trajectories During the Production of Bilabial Stops and Nasal at Different Speaking Rates
Tilak Purohit, Prasanta Kumar Ghosh
SpEx+: A Complete Time Domain Speaker Extraction Network
Meng Ge, Chenglin Xu, Longbiao Wang, Eng Siong Chng, Jianwu Dang, Haizhou Li
Atss-Net: Target Speaker Separation via Attention-Based Neural Network
Tingle Li, Qingjian Lin, Yuanyuan Bao, Ming Li
Multimodal Target Speech Separation with Voice and Face References
Leyuan Qu, Cornelius Weber, Stefan Wermter
X-TaSNet: Robust and Accurate Time-Domain Speaker Extraction Network
Zining Zhang, Bingsheng He, Zhenjie Zhang
Listen, Watch and Understand at the Cocktail Party: Audio-Visual-Contextual Speech Separation
Chenda Li, Yanmin Qian
A Unified Framework for Low-Latency Speaker Extraction in Cocktail Party Environments
Yunzhe Hao, Jiaming Xu, Jing Shi, Peng Zhang, Lei Qin, Bo Xu
Time-Domain Target-Speaker Speech Separation with Waveform-Based Speaker Embedding
Jianshu Zhao, Shengzhou Gao, Takahiro Shinozaki
Listen to What You Want: Neural Network-Based Universal Sound Selector
Tsubasa Ochiai, Marc Delcroix, Yuma Koizumi, Hiroaki Ito, Keisuke Kinoshita, Shoko Araki
Crossmodal Sound Retrieval Based on Specific Target Co-Occurrence Denoted with Weak Labels
Masahiro Yasuda, Yasunori Ohishi, Yuma Koizumi, Noboru Harada
Speaker-Aware Monaural Speech Separation
Jiahao Xu, Kun Hu, Chang Xu, Duc Chung Tran, Zhiyong Wang
A DNN-HMM-DNN Hybrid Model for Discovering Word-Like Units from Spoken Captions and Image Regions
Liming Wang, Mark Hasegawa-Johnson
Efficient Wait-k Models for Simultaneous Machine Translation
Maha Elbayad, Laurent Besacier, Jakob Verbeek
Investigating Self-Supervised Pre-Training for End-to-End Speech Translation
Ha Nguyen, Fethi Bougares, N. Tomashenko, Yannick Estève, Laurent Besacier
Contextualized Translation of Automatically Segmented Speech
Marco Gaido, Mattia A. Di Gangi, Matteo Negri, Mauro Cettolo, Marco Turchi
Self-Training for End-to-End Speech Translation
Juan Pino, Qiantong Xu, Xutai Ma, Mohammad Javad Dousti, Yun Tang
Evaluating and Optimizing Prosodic Alignment for Automatic Dubbing
Marcello Federico, Yogesh Virkar, Robert Enyedi, Roberto Barra-Chicote
Pair Expansion for Learning Multilingual Semantic Embeddings Using Disjoint Visually-Grounded Speech Audio Datasets
Yasunori Ohishi, Akisato Kimura, Takahito Kawanishi, Kunio Kashino, David Harwath, James Glass
Self-Supervised Representations Improve End-to-End Speech Translation
Anne Wu, Changhan Wang, Juan Pino, Jiatao Gu
Improved RawNet with Feature Map Scaling for Text-Independent Speaker Verification Using Raw Waveforms
Jee-weon Jung, Seung-bin Kim, Hye-jin Shim, Ju-ho Kim, Ha-Jin Yu
Improving Multi-Scale Aggregation Using Feature Pyramid Module for Robust Speaker Verification of Variable-Duration Utterances
Youngmoon Jung, Seong Min Kye, Yeunju Choi, Myunghun Jung, Hoirin Kim
An Adaptive X-Vector Model for Text-Independent Speaker Verification
Bin Gu, Wu Guo, Fenglin Ding, Zhen-Hua Ling, Jun Du
Shouted Speech Compensation for Speaker Verification Robust to Vocal Effort Conditions
Santi Prieto, Alfonso Ortega, Iván López-Espejo, Eduardo Lleida
Sum-Product Networks for Robust Automatic Speaker Identification
Aaron Nicolson, Kuldip K. Paliwal
Segment Aggregation for Short Utterances Speaker Verification Using Raw Waveforms
Seung-bin Kim, Jee-weon Jung, Hye-jin Shim, Ju-ho Kim, Ha-Jin Yu
Siamese X-Vector Reconstruction for Domain Adapted Speaker Recognition
Shai Rozenberg, Hagai Aronowitz, Ron Hoory
Speaker Re-Identification with Speaker Dependent Speech Enhancement
Yanpei Shi, Qiang Huang, Thomas Hain
Blind Speech Signal Quality Estimation for Speaker Verification Systems
Galina Lavrentyeva, Marina Volkova, Anastasia Avdeeva, Sergey Novoselov, Artem Gorlanov, Tseren Andzhukaev, Artem Ivanov, Alexander Kozlov
Investigating Robustness of Adversarial Samples Detection for Automatic Speaker Verification
Xu Li, Na Li, Jinghua Zhong, Xixin Wu, Xunying Liu, Dan Su, Dong Yu, Helen Meng
Modeling ASR Ambiguity for Neural Dialogue State Tracking
Vaishali Pal, Fabien Guillot, Manish Shrivastava, Jean-Michel Renders, Laurent Besacier
ASR Error Correction with Augmented Transformer for Entity Retrieval
Haoyu Wang, Shuyan Dong, Yue Liu, James Logan, Ashish Kumar Agrawal, Yang Liu
Large-Scale Transfer Learning for Low-Resource Spoken Language Understanding
Xueli Jia, Jianzong Wang, Zhiyong Zhang, Ning Cheng, Jing Xiao
Data Balancing for Boosting Performance of Low-Frequency Classes in Spoken Language Understanding
Judith Gaspers, Quynh Do, Fabian Triefenbach
An Interactive Adversarial Reward Learning-Based Spoken Language Understanding System
Yu Wang, Yilin Shen, Hongxia Jin
Style Attuned Pre-Training and Parameter Efficient Fine-Tuning for Spoken Language Understanding
Jin Cao, Jun Wang, Wael Hamza, Kelly Vanee, Shang-Wen Li
Unsupervised Domain Adaptation for Dialogue Sequence Labeling Based on Hierarchical Adversarial Training
Shota Orihashi, Mana Ihori, Tomohiro Tanaka, Ryo Masumura
Deep F-Measure Maximization for End-to-End Speech Understanding
Leda Sarı, Mark Hasegawa-Johnson
An Effective Domain Adaptive Post-Training Method for BERT in Response Selection
Taesun Whang, Dongyub Lee, Chanhee Lee, Kisu Yang, Dongsuk Oh, Heuiseok Lim
Confidence Measure for Speech-to-Concept End-to-End Spoken Language Understanding
Antoine Caubrière, Yannick Estève, Antoine Laurent, Emmanuel Morin
Attention to Indexical Information Improves Voice Recall
Grant L. McGuire, Molly Babel
Categorization of Whistled Consonants by French Speakers
Anaïs Tran Ngoc, Julien Meyer, Fanny Meunier
Whistled Vowel Identification by French Listeners
Anaïs Tran Ngoc, Julien Meyer, Fanny Meunier
F0 Slope and Mean: Cues to Speech Segmentation in French
Maria del Mar Cordero, Fanny Meunier, Nicolas Grimault, Stéphane Pota, Elsa Spinelli
Does French Listeners’ Ability to Use Accentual Information at the Word Level Depend on the Ear of Presentation?
Amandine Michelas, Sophie Dufour
A Perceptual Study of the Five Level Tones in Hmu (Xinzhai Variety)
Wen Liu
Mandarin and English Adults’ Cue-Weighting of Lexical Stress
Zhen Zeng, Karen Mattock, Liquan Liu, Varghese Peter, Alba Tuninetti, Feng-Ming Tsao
Age-Related Differences of Tone Perception in Mandarin-Speaking Seniors
Yan Feng, Gang Peng, William Shi-Yuan Wang
Social and Functional Pressures in Vocal Alignment: Differences for Human and Voice-AI Interlocutors
Georgia Zellou, Michelle Cohn
Identifying Important Time-Frequency Locations in Continuous Speech Utterances
Hassan Salami Kavaki, Michael I. Mandel
Raw Sign and Magnitude Spectra for Multi-Head Acoustic Modelling
Erfan Loweimi, Peter Bell, Steve Renals
Robust Raw Waveform Speech Recognition Using Relevance Weighted Representations
Purvi Agrawal, Sriram Ganapathy
A Deep 2D Convolutional Network for Waveform-Based Speech Recognition
Dino Oglic, Zoran Cvetkovic, Peter Bell, Steve Renals
Lightweight End-to-End Speech Recognition from Raw Audio Data Using Sinc-Convolutions
Ludwig Kürzinger, Nicolas Lindae, Palle Klewitz, Gerhard Rigoll
An Alternative to MFCCs for ASR
Pegah Ghahramani, Hossein Hadian, Daniel Povey, Hynek Hermansky, Sanjeev Khudanpur
Phase Based Spectro-Temporal Features for Building a Robust ASR System
Anirban Dutta, G. Ashishkumar, Ch.V. Rama Rao
Deep Scattering Power Spectrum Features for Robust Speech Recognition
Neethu M. Joy, Dino Oglic, Zoran Cvetkovic, Peter Bell, Steve Renals
FusionRNN: Shared Neural Parameters for Multi-Channel Distant Speech Recognition
Titouan Parcollet, Xinchi Qiu, Nicholas D. Lane
Bandpass Noise Generation and Augmentation for Unified ASR
Kshitiz Kumar, Bo Ren, Yifan Gong, Jian Wu
Deep Learning Based Dereverberation of Temporal Envelopes for Robust Speech Recognition
Anurenjan Purushothaman, Anirudh Sreeram, Rohit Kumar, Sriram Ganapathy
Introducing the VoicePrivacy Initiative
N. Tomashenko, Brij Mohan Lal Srivastava, Xin Wang, Emmanuel Vincent, Andreas Nautsch, Junichi Yamagishi, Nicholas Evans, Jose Patino, Jean-François Bonastre, Paul-Gauthier Noé, Massimiliano Todisco
The Privacy ZEBRA: Zero Evidence Biometric Recognition Assessment
Andreas Nautsch, Jose Patino, N. Tomashenko, Junichi Yamagishi, Paul-Gauthier Noé, Jean-François Bonastre, Massimiliano Todisco, Nicholas Evans
X-Vector Singular Value Modification and Statistical-Based Decomposition with Ensemble Regression Modeling for Speaker Anonymization System
Candy Olivia Mawalim, Kasorn Galajit, Jessada Karnjana, Masashi Unoki
A Comparative Study of Speech Anonymization Metrics
Mohamed Maouche, Brij Mohan Lal Srivastava, Nathalie Vauquier, Aurélien Bellet, Marc Tommasi, Emmanuel Vincent
Design Choices for X-Vector Based Speaker Anonymization
Brij Mohan Lal Srivastava, N. Tomashenko, Xin Wang, Emmanuel Vincent, Junichi Yamagishi, Mohamed Maouche, Aurélien Bellet, Marc Tommasi
Speech Pseudonymisation Assessment Using Voice Similarity Matrices
Paul-Gauthier Noé, Jean-François Bonastre, Driss Matrouf, N. Tomashenko, Andreas Nautsch, Nicholas Evans
g2pM: A Neural Grapheme-to-Phoneme Conversion Package for Mandarin Chinese Based on a New Open Benchmark Dataset
Kyubyong Park, Seanie Lee
A Mask-Based Model for Mandarin Chinese Polyphone Disambiguation
Haiteng Zhang, Huashan Pan, Xiulin Li
Perception of Concatenative vs. Neural Text-To-Speech (TTS): Differences in Intelligibility in Noise and Language Attitudes
Michelle Cohn, Georgia Zellou
Enhancing Sequence-to-Sequence Text-to-Speech with Morphology
Jason Taylor, Korin Richmond
Deep MOS Predictor for Synthetic Speech Using Cluster-Based Modeling
Yeunju Choi, Youngmoon Jung, Hoirin Kim
Deep Learning Based Assessment of Synthetic Speech Naturalness
Gabriel Mittag, Sebastian Möller
Distant Supervision for Polyphone Disambiguation in Mandarin Chinese
Jiawen Zhang, Yuanyuan Zhao, Jiaqi Zhu, Jinba Xiao
An Unsupervised Method to Select a Speaker Subset from Large Multi-Speaker Speech Synthesis Datasets
Pilar Oplustil Gallegos, Jennifer Williams, Joanna Rownicka, Simon King
Understanding the Effect of Voice Quality and Accent on Talker Similarity
Anurag Das, Guanlong Zhao, John Levis, Evgeny Chukharev-Hudilainen, Ricardo Gutierrez-Osuna
Robust Beam Search for Encoder-Decoder Attention Based Speech Recognition Without Length Bias
Wei Zhou, Ralf Schlüter, Hermann Ney
Transformer with Bidirectional Decoder for Speech Recognition
Xi Chen, Songyang Zhang, Dandan Song, Peng Ouyang, Shouyi Yin
An Investigation of Phone-Based Subword Units for End-to-End Speech Recognition
Weiran Wang, Guangsen Wang, Aadyot Bhatnagar, Yingbo Zhou, Caiming Xiong, Richard Socher
Combination of End-to-End and Hybrid Models for Speech Recognition
Jeremy H.M. Wong, Yashesh Gaur, Rui Zhao, Liang Lu, Eric Sun, Jinyu Li, Yifan Gong
Evolved Speech-Transformer: Applying Neural Architecture Search to End-to-End Automatic Speech Recognition
Jihwan Kim, Jisung Wang, Sangki Kim, Yeha Lee
Hierarchical Multi-Stage Word-to-Grapheme Named Entity Corrector for Automatic Speech Recognition
Abhinav Garg, Ashutosh Gupta, Dhananjaya Gowda, Shatrughan Singh, Chanwoo Kim
LVCSR with Transformer Language Models
Eugen Beck, Ralf Schlüter, Hermann Ney
DARTS-ASR: Differentiable Architecture Search for Multilingual Speech Recognition and Adaptation
Yi-Chen Chen, Jui-Yang Hsu, Cheng-Kuang Lee, Hung-yi Lee
Uncertainty-Aware Machine Support for Paper Reviewing on the Interspeech 2019 Submission Corpus
Lukas Stappen, Georgios Rizos, Madina Hasan, Thomas Hain, Björn W. Schuller
Individual Variation in Language Attitudes Toward Voice-AI: The Role of Listeners’ Autistic-Like Traits
Michelle Cohn, Melina Sarian, Kristin Predeck, Georgia Zellou
Differences in Gradient Emotion Perception: Human vs. Alexa Voices
Michelle Cohn, Eran Raveh, Kristin Predeck, Iona Gessinger, Bernd Möbius, Georgia Zellou
The MSP-Conversation Corpus
Luz Martinez-Lucas, Mohammed Abdelwahab, Carlos Busso
Spotting the Traces of Depression in Read Speech: An Approach Based on Computational Paralinguistics and Social Signal Processing
Fuxiang Tao, Anna Esposito, Alessandro Vinciarelli
Speech Sentiment and Customer Satisfaction Estimation in Socialbot Conversations
Yelin Kim, Joshua Levy, Yang Liu
Pardon the Interruption: An Analysis of Gender and Turn-Taking in U.S. Supreme Court Oral Arguments
Haley Lepp, Gina-Anne Levow
Are Germans Better Haters Than Danes? Language-Specific Implicit Prosodies of Types of Hate Speech and How They Relate to Perceived Severity and Societal Rules
Jana Neitsch, Oliver Niebuhr
An Objective Voice Gender Scoring System and Identification of the Salient Acoustic Measures
Fuling Chen, Roberto Togneri, Murray Maybery, Diana Tan
How Ordinal Are Your Data?
Sadari Jayawardena, Julien Epps, Zhaocheng Huang
Correlating Cepstra with Formant Frequencies: Implications for Phonetically-Informed Forensic Voice Comparison
Vincent Hughes, Frantz Clermont, Philip Harrison
Prosody and Breathing: A Comparison Between Rhetorical and Information-Seeking Questions in German and Brazilian Portuguese
Jana Neitsch, Plinio A. Barbosa, Oliver Niebuhr
Scaling Processes of Clause Chains in Pitjantjatjara
Rebecca Defina, Catalina Torres, Hywel Stoakes
Neutralization of Voicing Distinction of Stops in Tohoku Dialects of Japanese: Field Work and Acoustic Measurements
Ai Mizoguchi, Ayako Hashimoto, Sanae Matsui, Setsuko Imatomi, Ryunosuke Kobayashi, Mafuyu Kitahara
Correlation Between Prosody and Pragmatics: Case Study of Discourse Markers in French and English
Lou Lee, Denis Jouvet, Katarina Bartkova, Yvon Keromnes, Mathilde Dargnat
An Analysis of Prosodic Prominence Cues to Information Structure in Egyptian Arabic
Dina El Zarka, Anneliese Kelterer, Barbara Schuppler
Lexical Stress in Urdu
Benazir Mumtaz, Tina Bögel, Miriam Butt
Vocal Markers from Sustained Phonation in Huntington’s Disease
Rachid Riad, Hadrien Titeux, Laurie Lemoine, Justine Montillot, Jennifer Hamet Bagnou, Xuan-Nga Cao, Emmanuel Dupoux, Anne-Catherine Bachoud-Lévi
How Rhythm and Timbre Encode Mooré Language in Bendré Drummed Speech
Laure Dentel, Julien Meyer
Interaction of Tone and Voicing in Mizo
Wendy Lalhminghlui, Priyankoo Sarmah
Mandarin Lexical Tones: A Corpus-Based Study of Word Length, Syllable Position and Prosodic Position on Duration
Yaru Wu, Martine Adda-Decker, Lori Lamel
An Investigation of the Target Approximation Model for Tone Modeling and Recognition in Continuous Mandarin Speech
Yingming Gao, Xinyu Zhang, Yi Xu, Jinsong Zhang, Peter Birkholz
Integrating the Application and Realization of Mandarin 3rd Tone Sandhi in the Resolution of Sentence Ambiguity
Wei Lai, Aini Li
Neutral Tone in Changde Mandarin
Zhenrui Zhang, Fang Hu
Pitch Declination and Final Lowering in Northeastern Mandarin
Ping Cui, Jianjing Kuang
Variation in Spectral Slope and Interharmonic Noise in Cantonese Tones
Phil Rose
The Acoustic Realization of Mandarin Tones in Fast Speech
Ping Tang, Shanpeng Li
Do Face Masks Introduce Bias in Speech Technologies? The Case of Automated Scoring of Speaking Proficiency
Anastassia Loukina, Keelan Evanini, Matthew Mulholland, Ian Blood, Klaus Zechner
A Low Latency ASR-Free End to End Spoken Language Understanding System
Mohamed Mhiri, Samuel Myer, Vikrant Singh Tomar
An Audio-Based Wakeword-Independent Verification System
Joe Wang, Rajath Kumar, Mike Rodehorst, Brian Kulis, Shiv Naga Prasad Vitaladevuni
Learnable Spectro-Temporal Receptive Fields for Robust Voice Type Discrimination
Tyler Vuong, Yangyang Xia, Richard M. Stern
Low Latency Speech Recognition Using End-to-End Prefetching
Shuo-Yiin Chang, Bo Li, David Rybach, Yanzhang He, Wei Li, Tara N. Sainath, Trevor Strohman
AutoSpeech 2020: The Second Automated Machine Learning Challenge for Speech Classification
Jingsong Wang, Tom Ko, Zhen Xu, Xiawei Guo, Souxiang Liu, Wei-Wei Tu, Lei Xie
Building a Robust Word-Level Wakeword Verification Network
Rajath Kumar, Mike Rodehorst, Joe Wang, Jiacheng Gu, Brian Kulis
A Transformer-Based Audio Captioning Model with Keyword Estimation
Yuma Koizumi, Ryo Masumura, Kyosuke Nishida, Masahiro Yasuda, Shoichiro Saito
Neural Architecture Search for Keyword Spotting
Tong Mo, Yakun Yu, Mohammad Salameh, Di Niu, Shangling Jui
Small-Footprint Keyword Spotting with Multi-Scale Temporal Convolution
Ximin Li, Xiaodong Wei, Xiaowei Qin
Using Cyclic Noise as the Source Signal for Neural Source-Filter-Based Speech Waveform Model
Xin Wang, Junichi Yamagishi
Unconditional Audio Generation with Generative Adversarial Networks and Cycle Regularization
Jen-Yu Liu, Yu-Hua Chen, Yin-Cheng Yeh, Yi-Hsuan Yang
Complex-Valued Variational Autoencoder: A Novel Deep Generative Model for Direct Representation of Complex Spectra
Toru Nakashika
Attentron: Few-Shot Text-to-Speech Utilizing Attention-Based Variable-Length Embedding
Seungwoo Choi, Seungju Han, Dongyoung Kim, Sungjoo Ha
Reformer-TTS: Neural Speech Synthesis with Reformer Network
Hyeong Rae Ihm, Joun Yeop Lee, Byoung Jin Choi, Sung Jun Cheon, Nam Soo Kim
CycleGAN-VC3: Examining and Improving CycleGAN-VCs for Mel-Spectrogram Conversion
Takuhiro Kaneko, Hirokazu Kameoka, Kou Tanaka, Nobukatsu Hojo
High Quality Streaming Speech Synthesis with Low, Sentence-Length-Independent Latency
Nikolaos Ellinas, Georgios Vamvoukakis, Konstantinos Markopoulos, Aimilios Chalamandaris, Georgia Maniati, Panos Kakoulidis, Spyros Raptis, June Sig Sung, Hyoungmin Park, Pirros Tsiakoulis
DurIAN: Duration Informed Attention Network for Speech Synthesis
Chengzhu Yu, Heng Lu, Na Hu, Meng Yu, Chao Weng, Kun Xu, Peng Liu, Deyi Tuo, Shiyin Kang, Guangzhi Lei, Dan Su, Dong Yu
Multi-Speaker Text-to-Speech Synthesis Using Deep Gaussian Processes
Kentaro Mitsui, Tomoki Koriyama, Hiroshi Saruwatari
A Hybrid HMM-Waveglow Based Text-to-Speech Synthesizer Using Histogram Equalization for Low Resource Indian Languages
Mano Ranjith Kumar M., Sudhanshu Srivastava, Anusha Prakash, Hema A. Murthy
The INTERSPEECH 2020 Computational Paralinguistics Challenge: Elderly Emotion, Breathing & Masks
Björn W. Schuller, Anton Batliner, Christian Bergler, Eva-Maria Messner, Antonia Hamilton, Shahin Amiriparian, Alice Baird, Georgios Rizos, Maximilian Schmitt, Lukas Stappen, Harald Baumeister, Alexis Deighton MacIntyre, Simone Hantke
Learning Higher Representations from Pre-Trained Deep Models with Data Augmentation for the COMPARE 2020 Challenge Mask Task
Tomoya Koike, Kun Qian, Björn W. Schuller, Yoshiharu Yamamoto
Surgical Mask Detection with Convolutional Neural Networks and Data Augmentations on Spectrograms
Steffen Illium, Robert Müller, Andreas Sedlmeier, Claudia Linnhoff-Popien
Surgical Mask Detection with Deep Recurrent Phonetic Models
Philipp Klumpp, Tomás Arias-Vergara, Juan Camilo Vásquez-Correa, Paula Andrea Pérez-Toro, Florian Hönig, Elmar Nöth, Juan Rafael Orozco-Arroyave
Phonetic, Frame Clustering and Intelligibility Analyses for the INTERSPEECH 2020 ComParE Challenge
Claude Montacié, Marie-José Caraty
Exploring Text and Audio Embeddings for Multi-Dimension Elderly Emotion Recognition
Mariana Julião, Alberto Abad, Helena Moniz
Ensembling End-to-End Deep Models for Computational Paralinguistics Tasks: ComParE 2020 Mask and Breathing Sub-Challenges
Maxim Markitantov, Denis Dresvyanskiy, Danila Mamontov, Heysem Kaya, Wolfgang Minker, Alexey Karpov
Analyzing Breath Signals for the Interspeech 2020 ComParE Challenge
John Mendonça, Francisco Teixeira, Isabel Trancoso, Alberto Abad
Deep Attentive End-to-End Continuous Breath Sensing from Speech
Alexis Deighton MacIntyre, Georgios Rizos, Anton Batliner, Alice Baird, Shahin Amiriparian, Antonia Hamilton, Björn W. Schuller
Paralinguistic Classification of Mask Wearing by Image Classifiers and Fusion
Jeno Szep, Salim Hariri
Exploration of Acoustic and Lexical Cues for the INTERSPEECH 2020 Computational Paralinguistic Challenge
Ziqing Yang, Zifan An, Zehao Fan, Chengye Jing, Houwei Cao
Is Everything Fine, Grandma? Acoustic and Linguistic Modeling for Robust Elderly Speech Emotion Recognition
Gizem Soğancıoğlu, Oxana Verkholyak, Heysem Kaya, Dmitrii Fedotov, Tobias Cadée, Albert Ali Salah, Alexey Karpov
Are You Wearing a Mask? Improving Mask Detection from Speech Using Augmentation by Cycle-Consistent GANs
Nicolae-Cătălin Ristea, Radu Tudor Ionescu
1-D Row-Convolution LSTM: Fast Streaming ASR at Accuracy Parity with LC-BLSTM
Kshitiz Kumar, Chaojun Liu, Yifan Gong, Jian Wu
Low Latency End-to-End Streaming Speech Recognition with a Scout Network
Chengyi Wang, Yu Wu, Liang Lu, Shujie Liu, Jinyu Li, Guoli Ye, Ming Zhou
Knowledge Distillation from Offline to Streaming RNN Transducer for End-to-End Speech Recognition
Gakuto Kurata, George Saon
Parallel Rescoring with Transformer for Streaming On-Device Speech Recognition
Wei Li, James Qin, Chung-Cheng Chiu, Ruoming Pang, Yanzhang He
Improved Hybrid Streaming ASR with Transformer Language Models
Pau Baquero-Arnal, Javier Jorge, Adrià Giménez, Joan Albert Silvestre-Cerdà, Javier Iranzo-Sánchez, Albert Sanchis, Jorge Civera, Alfons Juan
Streaming Transformer-Based Acoustic Models Using Self-Attention with Augmented Memory
Chunyang Wu, Yongqiang Wang, Yangyang Shi, Ching-Feng Yeh, Frank Zhang
Enhancing Monotonic Multihead Attention for Streaming ASR
Hirofumi Inaguma, Masato Mimura, Tatsuya Kawahara
Streaming Chunk-Aware Multihead Attention for Online End-to-End Speech Recognition
Shiliang Zhang, Zhifu Gao, Haoneng Luo, Ming Lei, Jie Gao, Zhijie Yan, Lei Xie
High Performance Sequence-to-Sequence Model for Streaming Speech Recognition
Thai-Son Nguyen, Ngoc-Quan Pham, Sebastian Stüker, Alex Waibel
Transfer Learning Approaches for Streaming End-to-End Speech Recognition System
Vikas Joshi, Rui Zhao, Rupesh R. Mehta, Kshitiz Kumar, Jinyu Li
Tackling the ADReSS Challenge: A Multimodal Approach to the Automated Recognition of Alzheimer’s Dementia
Matej Martinc, Senja Pollak
Disfluencies and Fine-Tuning Pre-Trained Language Models for Detection of Alzheimer’s Disease
Jiahong Yuan, Yuchen Bian, Xingyu Cai, Jiaji Huang, Zheng Ye, Kenneth Church
To BERT or not to BERT: Comparing Speech and Language-Based Approaches for Alzheimer’s Disease Detection
Aparna Balagopalan, Benjamin Eyre, Frank Rudzicz, Jekaterina Novikova
Alzheimer’s Dementia Recognition Through Spontaneous Speech: The ADReSS Challenge
Saturnino Luz, Fasih Haider, Sofia de la Fuente, Davida Fromm, Brian MacWhinney
Using State of the Art Speaker Recognition and Natural Language Processing Technologies to Detect Alzheimer’s Disease and Assess its Severity
Raghavendra Pappagari, Jaejin Cho, Laureano Moro-Velázquez, Najim Dehak
A Comparison of Acoustic and Linguistics Methodologies for Alzheimer’s Dementia Recognition
Nicholas Cummins, Yilin Pan, Zhao Ren, Julian Fritsch, Venkata Srikanth Nallanthighal, Heidi Christensen, Daniel Blackburn, Björn W. Schuller, Mathew Magimai-Doss, Helmer Strik, Aki Härmä
Multi-Modal Fusion with Gating Using Audio, Lexical and Disfluency Features for Alzheimer’s Dementia Recognition from Spontaneous Speech
Morteza Rohanian, Julian Hough, Matthew Purver
Comparing Natural Language Processing Techniques for Alzheimer’s Dementia Prediction in Spontaneous Speech
Thomas Searle, Zina Ibrahim, Richard Dobson
Multiscale System for Alzheimer’s Dementia Recognition Through Spontaneous Speech
Erik Edwards, Charles Dognin, Bajibabu Bollepalli, Maneesh Singh
The INESC-ID Multi-Modal System for the ADReSS 2020 Challenge
Anna Pompili, Thomas Rolland, Alberto Abad
Exploring MMSE Score Prediction Using Verbal and Non-Verbal Cues
Shahla Farzana, Natalie Parde
Multimodal Inductive Transfer Learning for Detection of Alzheimer’s Dementia and its Severity
Utkarsh Sarawgi, Wazeer Zulfikar, Nouran Soliman, Pattie Maes
Exploiting Multi-Modal Features from Pre-Trained Networks for Alzheimer’s Dementia Recognition
Junghyun Koo, Jie Hwan Lee, Jaewoo Pyo, Yujin Jo, Kyogu Lee
Automated Screening for Alzheimer’s Dementia Through Spontaneous Speech
Muhammad Shehram Shah Syed, Zafi Sherhan Syed, Margaret Lech, Elena Pirogova
NEC-TT Speaker Verification System for SRE’19 CTS Challenge
Kong Aik Lee, Koji Okabe, Hitoshi Yamamoto, Qiongqiong Wang, Ling Guo, Takafumi Koshinaka, Jiacen Zhang, Keisuke Ishikawa, Koichi Shinoda
THUEE System for NIST SRE19 CTS Challenge
Ruyun Li, Tianyu Liang, Dandan Song, Yi Liu, Yangcheng Wu, Can Xu, Peng Ouyang, Xianwei Zhang, Xianhong Chen, Wei-Qiang Zhang, Shouyi Yin, Liang He
Automatic Quality Assessment for Audio-Visual Verification Systems. The LOVe Submission to NIST SRE Challenge 2019
Grigory Antipov, Nicolas Gengembre, Olivier Le Blouch, Gaël Le Lan
Audio-Visual Speaker Recognition with a Cross-Modal Discriminative Network
Ruijie Tao, Rohan Kumar Das, Haizhou Li
Multimodal Association for Speaker Verification
Suwon Shon, James Glass
Multi-Modality Matters: A Performance Leap on VoxCeleb
Zhengyang Chen, Shuai Wang, Yanmin Qian
Cross-Domain Adaptation with Discrepancy Minimization for Text-Independent Forensic Speaker Verification
Zhenyu Wang, Wei Xia, John H.L. Hansen
Open-Set Short Utterance Forensic Speaker Verification Using Teacher-Student Network with Explicit Inductive Bias
Mufan Sang, Wei Xia, John H.L. Hansen
JukeBox: A Multilingual Singer Recognition Dataset
Anurag Chowdhury, Austin Cozzo, Arun Ross
Speaker Identification for Household Scenarios with Self-Attention and Adversarial Training
Ruirui Li, Jyun-Yu Jiang, Xian Wu, Chu-Cheng Hsieh, Andreas Stolcke
Streaming Keyword Spotting on Mobile Devices
Oleg Rybakov, Natasha Kononenko, Niranjan Subrahmanya, Mirkó Visontai, Stella Laurenzo
Metadata-Aware End-to-End Keyword Spotting
Hongyi Liu, Apurva Abhyankar, Yuriy Mishchenko, Thibaud Sénéchal, Gengshen Fu, Brian Kulis, Noah D. Stein, Anish Shah, Shiv Naga Prasad Vitaladevuni
Adversarial Audio: A New Information Hiding Method
Yehao Kong, Jiliang Zhang
S2IGAN: Speech-to-Image Generation via Adversarial Learning
Xinsheng Wang, Tingting Qiao, Jihua Zhu, Alan Hanjalic, Odette Scharenborg
Automatic Speech Recognition Benchmark for Air-Traffic Communications
Juan Zuluaga-Gomez, Petr Motlicek, Qingran Zhan, Karel Veselý, Rudolf Braun
Whisper Augmented End-to-End/Hybrid Speech Recognition System — CycleGAN Approach
Prithvi R.R. Gudepu, Gowtham P. Vadisetti, Abhishek Niranjan, Kinnera Saranu, Raghava Sarma, M. Ali Basha Shaik, Periyasamy Paramasivam
Risk Forecasting from Earnings Calls Acoustics and Network Correlations
Ramit Sawhney, Arshiya Aggarwal, Piyush Khanna, Puneet Mathur, Taru Jain, Rajiv Ratn Shah
SpecMark: A Spectral Watermarking Framework for IP Protection of Speech Recognition Systems
Huili Chen, Bita Darvish, Farinaz Koushanfar
Evaluating Automatically Generated Phoneme Captions for Images
Justin van der Hout, Zoltán D’Haese, Mark Hasegawa-Johnson, Odette Scharenborg
An Efficient Temporal Modeling Approach for Speech Emotion Recognition by Mapping Varied Duration Sentences into Fixed Number of Chunks
Wei-Cheng Lin, Carlos Busso
Deep Architecture Enhancing Robustness to Noise, Adversarial Attacks, and Cross-Corpus Setting for Speech Emotion Recognition
Siddique Latif, Rajib Rana, Sara Khalifa, Raja Jurdak, Björn W. Schuller
Meta-Learning for Speech Emotion Recognition Considering Ambiguity of Emotion Labels
Takuya Fujioka, Takeshi Homma, Kenji Nagamatsu
Temporal Attention Convolutional Network for Speech Emotion Recognition with Latent Representation
Jiaxing Liu, Zhilei Liu, Longbiao Wang, Yuan Gao, Lili Guo, Jianwu Dang
Reconciliation of Multiple Corpora for Speech Emotion Recognition by Multiple Classifiers with an Adversarial Corpus Discriminator
Zhi Zhu, Yoshinao Sato
Conversational Emotion Recognition Using Self-Attention Mechanisms and Graph Neural Networks
Zheng Lian, Jianhua Tao, Bin Liu, Jian Huang, Zhanlei Yang, Rongjun Li
EigenEmo: Spectral Utterance Representation Using Dynamic Mode Decomposition for Speech Emotion Classification
Shuiyang Mao, P.C. Ching, Tan Lee
Advancing Multiple Instance Learning with Attention Modeling for Categorical Speech Emotion Recognition
Shuiyang Mao, P.C. Ching, C.-C. Jay Kuo, Tan Lee
The Effect of Language Proficiency on the Perception of Segmental Foreign Accent
Rubén Pérez-Ramón, María Luisa García Lecumberri, Martin Cooke
The Effect of Language Dominance on the Selective Attention of Segments and Tones in Urdu-Cantonese Speakers
Yi Liu, Jinghong Ning
The Effect of Input on the Production of English Tense and Lax Vowels by Chinese Learners: Evidence from an Elementary School in China
Mengrou Li, Ying Chen, Jie Cui
Exploring the Use of an Artificial Accent of English to Assess Phonetic Learning in Monolingual and Bilingual Speakers
Laura Spinu, Jiwon Hwang, Nadya Pincus, Mariana Vasilita
Effects of Dialectal Code-Switching on Speech Modules: A Study Using Egyptian Arabic Broadcast Speech
Shammur A. Chowdhury, Younes Samih, Mohamed Eldesouki, Ahmed Ali
Bilingual Acoustic Voice Variation is Similarly Structured Across Languages
Khia A. Johnson, Molly Babel, Robert A. Fuhrman
Monolingual Data Selection Analysis for English-Mandarin Hybrid Code-Switching Speech Recognition
Haobo Zhang, Haihua Xu, Van Tung Pham, Hao Huang, Eng Siong Chng
Perception and Production of Mandarin Initial Stops by Native Urdu Speakers
Dan Du, Xianjin Zhu, Zhu Li, Jinsong Zhang
Now You’re Speaking My Language: Visual Language Identification
Triantafyllos Afouras, Joon Son Chung, Andrew Zisserman
The Different Enhancement Roles of Covarying Cues in Thai and Mandarin Tones
Nari Rhee, Jianjing Kuang
Singing Voice Extraction with Attention-Based Spectrograms Fusion
Hao Shi, Longbiao Wang, Sheng Li, Chenchen Ding, Meng Ge, Nan Li, Jianwu Dang, Hiroshi Seki
Incorporating Broad Phonetic Information for Speech Enhancement
Yen-Ju Lu, Chien-Feng Liao, Xugang Lu, Jeih-weih Hung, Yu Tsao
A Recursive Network with Dynamic Attention for Monaural Speech Enhancement
Andong Li, Chengshi Zheng, Cunhang Fan, Renhua Peng, Xiaodong Li
Constrained Ratio Mask for Speech Enhancement Using DNN
Hongjiang Yu, Wei-Ping Zhu, Yuhong Yang
SERIL: Noise Adaptive Speech Enhancement Using Regularization-Based Incremental Learning
Chi-Chang Lee, Yu-Chen Lin, Hsuan-Tien Lin, Hsin-Min Wang, Yu Tsao
Adaptive Neural Speech Enhancement with a Denoising Variational Autoencoder
Yoshiaki Bando, Kouhei Sekiguchi, Kazuyoshi Yoshii
Low-Latency Single Channel Speech Dereverberation Using U-Net Convolutional Neural Networks
Ahmet E. Bulut, Kazuhito Koishida
Single-Channel Speech Enhancement by Subspace Affinity Minimization
Dung N. Tran, Kazuhito Koishida
Noise Tokens: Learning Neural Noise Templates for Environment-Aware Speech Enhancement
Haoyu Li, Junichi Yamagishi
NAAGN: Noise-Aware Attention-Gated Network for Speech Enhancement
Feng Deng, Tao Jiang, Xiao-Rui Wang, Chen Zhang, Yan Li
Online Monaural Speech Enhancement Using Delayed Subband LSTM
Xiaofei Li, Radu Horaud
INTERSPEECH 2020 Deep Noise Suppression Challenge: A Fully Convolutional Recurrent Network (FCRN) for Joint Dereverberation and Denoising
Maximilian Strake, Bruno Defraene, Kristoff Fluyt, Wouter Tirry, Tim Fingscheidt
DCCRN: Deep Complex Convolution Recurrent Network for Phase-Aware Speech Enhancement
Yanxin Hu, Yun Liu, Shubo Lv, Mengtao Xing, Shimin Zhang, Yihui Fu, Jian Wu, Bihong Zhang, Lei Xie
Dual-Signal Transformation LSTM Network for Real-Time Noise Suppression
Nils L. Westhausen, Bernd T. Meyer
A Perceptually-Motivated Approach for Low-Complexity, Real-Time Enhancement of Fullband Speech
Jean-Marc Valin, Umut Isik, Neerad Phansalkar, Ritwik Giri, Karim Helwani, Arvindh Krishnaswamy
PoCoNet: Better Speech Enhancement with Frequency-Positional Embeddings, Semi-Supervised Conversational Data, and Biased Loss
Umut Isik, Ritwik Giri, Neerad Phansalkar, Jean-Marc Valin, Karim Helwani, Arvindh Krishnaswamy
The INTERSPEECH 2020 Deep Noise Suppression Challenge: Datasets, Subjective Testing Framework, and Challenge Results
Chandan K.A. Reddy, Vishak Gopal, Ross Cutler, Ebrahim Beyrami, Roger Cheng, Harishchandra Dubey, Sergiy Matusevych, Robert Aichner, Ashkan Aazami, Sebastian Braun, Puneet Rana, Sriram Srinivasan, Johannes Gehrke
The Implication of Sound Level on Spatial Selective Auditory Attention for Cochlear Implant Users: Behavioral and Electrophysiological Measurement
Sara Akbarzadeh, Sungmin Lee, Chin-Tuan Tan
Enhancing the Interaural Time Difference of Bilateral Cochlear Implants with the Temporal Limits Encoder
Yangyang Wan, Huali Zhou, Qinglin Meng, Nengheng Zheng
Speech Clarity Improvement by Vocal Self-Training Using a Hearing Impairment Simulator and its Correlation with an Auditory Modulation Index
Toshio Irino, Soichi Higashiyama, Hanako Yoshigi
Investigation of Phase Distortion on Perceived Speech Quality for Hearing-Impaired Listeners
Zhuohuang Zhang, Donald S. Williamson, Yi Shen
EEG-Based Short-Time Auditory Attention Detection Using Multi-Task Deep Learning
Zhuo Zhang, Gaoyan Zhang, Jianwu Dang, Shuang Wu, Di Zhou, Longbiao Wang
Towards Interpreting Deep Learning Models to Understand Loss of Speech Intelligibility in Speech Disorders — Step 1: CNN Model-Based Phone Classification
Sondes Abderrazek, Corinne Fredouille, Alain Ghio, Muriel Lalain, Christine Meunier, Virginie Woisard
Improving Cognitive Impairment Classification by Generative Neural Network-Based Feature Augmentation
Bahman Mirheidari, Daniel Blackburn, Ronan O’Malley, Annalena Venneri, Traci Walker, Markus Reuber, Heidi Christensen
UncommonVoice: A Crowdsourced Dataset of Dysphonic Speech
Meredith Moore, Piyush Papreja, Michael Saxon, Visar Berisha, Sethuraman Panchanathan
Towards Automatic Assessment of Voice Disorders: A Clinical Approach
Purva Barche, Krishna Gurugubelli, Anil Kumar Vuppala
BlaBla: Linguistic Feature Extraction for Clinical Analysis in Multiple Languages
Abhishek Shivkumar, Jack Weston, Raphael Lenain, Emil Fristed
Depthwise Separable Convolutional ResNet with Squeeze-and-Excitation Blocks for Small-Footprint Keyword Spotting
Menglong Xu, Xiao-Lei Zhang
Predicting Detection Filters for Small Footprint Open-Vocabulary Keyword Spotting
Théodore Bluche, Thibault Gisselbrecht
Deep Convolutional Spiking Neural Networks for Keyword Spotting
Emre Yılmaz, Özgür Bora Gevrek, Jibin Wu, Yuxiang Chen, Xuanbo Meng, Haizhou Li
Domain Aware Training for Far-Field Small-Footprint Keyword Spotting
Haiwei Wu, Yan Jia, Yuanfei Nie, Ming Li
Re-Weighted Interval Loss for Handling Data Imbalance Problem of End-to-End Keyword Spotting
Kun Zhang, Zhiyong Wu, Daode Yuan, Jian Luan, Jia Jia, Helen Meng, Binheng Song
Deep Template Matching for Small-Footprint and Configurable Keyword Spotting
Peng Zhang, Xueliang Zhang
Multi-Scale Convolution for Robust Keyword Spotting
Chen Yang, Xue Wen, Liming Song
An Investigation of Few-Shot Learning in Spoken Term Classification
Yangbin Chen, Tom Ko, Lifeng Shang, Xiao Chen, Xin Jiang, Qing Li
End-to-End Keyword Search Based on Attention and Energy Scorer for Low Resource Languages
Zeyu Zhao, Wei-Qiang Zhang
Stacked 1D Convolutional Networks for End-to-End Small Footprint Voice Trigger Detection
Takuya Higuchi, Mohammad Ghasemzadeh, Kisun You, Chandra Dhir
Statistical and Neural Network Based Speech Activity Detection in Non-Stationary Acoustic Environments
Jens Heitkaemper, Joerg Schmalenstroeer, Reinhold Haeb-Umbach
Speaker Diarization System Based on DPCA Algorithm for Fearless Steps Challenge Phase-2
Xueshuai Zhang, Wenchao Wang, Pengyuan Zhang
The DKU Speech Activity Detection and Speaker Identification Systems for Fearless Steps Challenge Phase-02
Qingjian Lin, Tingle Li, Ming Li
“This is Houston. Say again, please”. The Behavox System for the Apollo-11 Fearless Steps Challenge (Phase II)
Arseniy Gorin, Daniil Kulko, Steven Grima, Alex Glasman
FEARLESS STEPS Challenge (FS-2): Supervised Learning with Massive Naturalistic Apollo Data
Aditya Joglekar, John H.L. Hansen, Meena Chandra Shekar, Abhijeet Sangwan
Separating Varying Numbers of Sources with Auxiliary Autoencoding Loss
Yi Luo, Nima Mesgarani
On Synthesis for Supervised Monaural Speech Separation in Time Domain
Jingjing Chen, Qirong Mao, Dong Liu
Learning Better Speech Representations by Worsening Interference
Jun Wang
Asteroid: The PyTorch-Based Audio Source Separation Toolkit for Researchers
Manuel Pariente, Samuele Cornell, Joris Cosentino, Sunit Sivasankaran, Efthymios Tzinis, Jens Heitkaemper, Michel Olvera, Fabian-Robert Stöter, Mathieu Hu, Juan M. Martín-Doñas, David Ditter, Ariel Frank, Antoine Deleforge, Emmanuel Vincent
Dual-Path Transformer Network: Direct Context-Aware Modeling for End-to-End Monaural Speech Separation
Jingjing Chen, Qirong Mao, Dong Liu
Conv-TasSAN: Separative Adversarial Network Based on Conv-TasNet
Chengyun Deng, Yi Zhang, Shiqian Ma, Yongtao Sha, Hui Song, Xiangang Li
Multi-Path RNN for Hierarchical Modeling of Long Sequential Data and its Application to Speaker Stream Separation
Keisuke Kinoshita, Thilo von Neumann, Marc Delcroix, Tomohiro Nakatani, Reinhold Haeb-Umbach
Unsupervised Audio Source Separation Using Generative Priors
Vivek Narayanaswamy, Jayaraman J. Thiagarajan, Rushil Anirudh, Andreas Spanias
Adversarial Latent Representation Learning for Speech Enhancement
Yuanhang Qiu, Ruili Wang
An NMF-HMM Speech Enhancement Method Based on Kullback-Leibler Divergence
Yang Xiang, Liming Shi, Jesper Lisby Højvang, Morten Højfeldt Rasmussen, Mads Græsbøll Christensen
Multi-Scale TCN: Exploring Better Temporal DNN Model for Causal Speech Enhancement
Lu Zhang, Mingjiang Wang
VoiceFilter-Lite: Streaming Targeted Voice Separation for On-Device Speech Recognition
Quan Wang, Ignacio Lopez Moreno, Mert Saglam, Kevin Wilson, Alan Chiao, Renjie Liu, Yanzhang He, Wei Li, Jason Pelecanos, Marily Nika, Alexander Gruenstein
Speech Separation Based on Multi-Stage Elaborated Dual-Path Deep BiLSTM with Auxiliary Identity Loss
Ziqiang Shi, Rujie Liu, Jiqing Han
Sub-Band Knowledge Distillation Framework for Speech Enhancement
Xiang Hao, Shixue Wen, Xiangdong Su, Yun Liu, Guanglai Gao, Xiaofei Li
A Deep Learning-Based Kalman Filter for Speech Enhancement
Sujan Kumar Roy, Aaron Nicolson, Kuldip K. Paliwal
Subband Kalman Filtering with DNN Estimated Parameters for Speech Enhancement
Hongjiang Yu, Wei-Ping Zhu, Benoit Champagne
Bidirectional LSTM Network with Ordered Neurons for Speech Enhancement
Xiaoqi Li, Yaxing Li, Yuanjie Dong, Shan Xu, Zhihui Zhang, Dan Wang, Shengwu Xiong
Speaker-Conditional Chain Model for Speech Separation and Extraction
Jing Shi, Jiaming Xu, Yusuke Fujita, Shinji Watanabe, Bo Xu
Unsupervised vs. Transfer Learning for Multimodal One-Shot Matching of Speech and Images
Leanne Nortje, Herman Kamper
Multimodal Speech Emotion Recognition Using Cross Attention with Aligned Audio and Text
Yoonhyung Lee, Seunghyun Yoon, Kyomin Jung
Speaker Dependent Articulatory-to-Acoustic Mapping Using Real-Time MRI of the Vocal Tract
Tamás Gábor Csapó
Ultrasound-Based Articulatory-to-Acoustic Mapping with WaveGlow Speech Synthesis
Tamás Gábor Csapó, Csaba Zainkó, László Tóth, Gábor Gosztolya, Alexandra Markó
Unsupervised Subword Modeling Using Autoregressive Pretraining and Cross-Lingual Phone-Aware Modeling
Siyuan Feng, Odette Scharenborg
Generative Adversarial Training Data Adaptation for Very Low-Resource Automatic Speech Recognition
Kohei Matsuura, Masato Mimura, Shinsuke Sakai, Tatsuya Kawahara
Neural Speech Completion
Kazuki Tsunematsu, Johanes Effendi, Sakriani Sakti, Satoshi Nakamura
Improving Unsupervised Sparsespeech Acoustic Models with Categorical Reparameterization
Benjamin Milde, Chris Biemann
Multimodal Sign Language Recognition via Temporal Deformable Convolutional Sequence Learning
Katerina Papadimitriou, Gerasimos Potamianos
MLS: A Large-Scale Multilingual Dataset for Speech Research
Vineel Pratap, Qiantong Xu, Anuroop Sriram, Gabriel Synnaeve, Ronan Collobert
Combining Audio and Brain Activity for Predicting Speech Quality
Ivan Halim Parmonangan, Hiroki Tanaka, Sakriani Sakti, Satoshi Nakamura
The “Sound of Silence” in EEG — Cognitive Voice Activity Detection
Rini A. Sharon, Hema A. Murthy
Low Latency Auditory Attention Detection with Common Spatial Pattern Analysis of EEG Signals
Siqi Cai, Enze Su, Yonghao Song, Longhan Xie, Haizhou Li
Speech Spectrogram Estimation from Intracranial Brain Activity Using a Quantization Approach
Miguel Angrick, Christian Herff, Garett Johnson, Jerry Shih, Dean Krusienski, Tanja Schultz
Neural Speech Decoding for Amyotrophic Lateral Sclerosis
Debadatta Dash, Paul Ferrari, Angel Hernandez, Daragh Heitzman, Sara G. Austin, Jun Wang
Semi-Supervised ASR by End-to-End Self-Training
Yang Chen, Weiran Wang, Chao Wang
Improved Training Strategies for End-to-End Speech Recognition in Digital Voice Assistants
Hitesh Tulsiani, Ashtosh Sapru, Harish Arsikere, Surabhi Punjabi, Sri Garimella
Serialized Output Training for End-to-End Overlapped Speech Recognition
Naoyuki Kanda, Yashesh Gaur, Xiaofei Wang, Zhong Meng, Takuya Yoshioka
Semi-Supervised Learning with Data Augmentation for End-to-End ASR
Felix Weninger, Franco Mana, Roberto Gemello, Jesús Andrés-Ferrer, Puming Zhan
Efficient Minimum Word Error Rate Training of RNN-Transducer for End-to-End Speech Recognition
Jinxi Guo, Gautam Tiwari, Jasha Droppo, Maarten Van Segbroeck, Che-Wei Huang, Andreas Stolcke, Roland Maas
A New Training Pipeline for an Improved Neural Transducer
Albert Zeyer, André Merboldt, Ralf Schlüter, Hermann Ney
Improved Noisy Student Training for Automatic Speech Recognition
Daniel S. Park, Yu Zhang, Ye Jia, Wei Han, Chung-Cheng Chiu, Bo Li, Yonghui Wu, Quoc V. Le
Phoneme-to-Grapheme Conversion Based Large-Scale Pre-Training for End-to-End Automatic Speech Recognition
Ryo Masumura, Naoki Makishima, Mana Ihori, Akihiko Takashima, Tomohiro Tanaka, Shota Orihashi
Utterance Invariant Training for Hybrid Two-Pass End-to-End Speech Recognition
Dhananjaya Gowda, Ankur Kumar, Kwangyoun Kim, Hejung Yang, Abhinav Garg, Sachin Singh, Jiyeon Kim, Mehul Kumar, Sichen Jin, Shatrughan Singh, Chanwoo Kim
SCADA: Stochastic, Consistent and Adversarial Data Augmentation to Improve ASR
Gary Wang, Andrew Rosenberg, Zhehuai Chen, Yu Zhang, Bhuvana Ramabhadran, Pedro J. Moreno
Fundamental Frequency Model for Postfiltering at Low Bitrates in a Transform-Domain Speech and Audio Codec
Sneha Das, Tom Bäckström, Guillaume Fuchs
Hearing-Impaired Bio-Inspired Cochlear Models for Real-Time Auditory Applications
Arthur Van Den Broucke, Deepak Baby, Sarah Verhulst
Improving Opus Low Bit Rate Quality with Neural Speech Synthesis
Jan Skoglund, Jean-Marc Valin
A Differentiable Perceptual Audio Metric Learned from Just Noticeable Differences
Pranay Manocha, Adam Finkelstein, Richard Zhang, Nicholas J. Bryan, Gautham J. Mysore, Zeyu Jin
StoRIR: Stochastic Room Impulse Response Generation for Audio Data Augmentation
Piotr Masztalski, Mateusz Matuszewski, Karol Piaskowski, Michal Romaniuk
An Open Source Implementation of ITU-T Recommendation P.808 with Validation
Babak Naderi, Ross Cutler
DNN No-Reference PSTN Speech Quality Prediction
Gabriel Mittag, Ross Cutler, Yasaman Hosseinkashi, Michael Revow, Sriram Srinivasan, Naglakshmi Chande, Robert Aichner
Non-Intrusive Diagnostic Monitoring of Fullband Speech Quality
Sebastian Möller, Tobias Hübschen, Thilo Michael, Gabriel Mittag, Gerhard Schmidt
Transfer Learning of Articulatory Information Through Phone Information
Abdolreza Sabzi Shahrebabaki, Negar Olfati, Sabato Marco Siniscalchi, Giampiero Salvi, Torbjørn Svendsen
Sequence-to-Sequence Articulatory Inversion Through Time Convolution of Sub-Band Frequency Signals
Abdolreza Sabzi Shahrebabaki, Sabato Marco Siniscalchi, Giampiero Salvi, Torbjørn Svendsen
Discriminative Singular Spectrum Analysis for Bioacoustic Classification
Bernardo B. Gatto, Eulanda M. dos Santos, Juan G. Colonna, Naoya Sogi, Lincon S. Souza, Kazuhiro Fukui
Speech Rate Task-Specific Representation Learning from Acoustic-Articulatory Data
Renuka Mannem, Hima Jyothi R., Aravind Illa, Prasanta Kumar Ghosh
Dysarthria Detection and Severity Assessment Using Rhythm-Based Metrics
Abner Hernandez, Eun Jung Yeo, Sunhee Kim, Minhwa Chung
LungRN+NL: An Improved Adventitious Lung Sound Classification Using Non-Local Block ResNet Neural Network with Mixup Data Augmentation
Yi Ma, Xinzi Xu, Yongfu Li
Attention and Encoder-Decoder Based Models for Transforming Articulatory Movements at Different Speaking Rates
Abhayjeet Singh, Aravind Illa, Prasanta Kumar Ghosh
Adventitious Respiratory Classification Using Attentive Residual Neural Networks
Zijiang Yang, Shuo Liu, Meishu Song, Emilia Parada-Cabaleiro, Björn W. Schuller
Surfboard: Audio Feature Extraction for Modern Machine Learning
Raphael Lenain, Jack Weston, Abhishek Shivkumar, Emil Fristed
Whisper Activity Detection Using CNN-LSTM Based Attention Pooling Network Trained for a Speaker Identification Task
Abinay Reddy Naini, Malla Satyapriya, Prasanta Kumar Ghosh
Towards Natural Bilingual and Code-Switched Speech Synthesis Based on Mix of Monolingual Recordings and Cross-Lingual Voice Conversion
Shengkui Zhao, Trung Hieu Nguyen, Hao Wang, Bin Ma
Multi-Lingual Multi-Speaker Text-to-Speech Synthesis for Voice Cloning with Online Speaker Enrollment
Zhaoyu Liu, Brian Mak
Dynamic Soft Windowing and Language Dependent Style Token for Code-Switching End-to-End Speech Synthesis
Ruibo Fu, Jianhua Tao, Zhengqi Wen, Jiangyan Yi, Chunyu Qiang, Tao Wang
Phonological Features for 0-Shot Multilingual Speech Synthesis
Marlene Staib, Tian Huey Teh, Alexandra Torresquintero, Devang S. Ram Mohan, Lorenzo Foglianti, Raphael Lenain, Jiameng Gao
Cross-Lingual Text-To-Speech Synthesis via Domain Adaptation and Perceptual Similarity Regression in Speaker Space
Detai Xin, Yuki Saito, Shinnosuke Takamichi, Tomoki Koriyama, Hiroshi Saruwatari
Tone Learning in Low-Resource Bilingual TTS
Ruolan Liu, Xue Wen, Chunhui Lu, Xiao Chen
On Improving Code Mixed Speech Synthesis with Mixlingual Grapheme-to-Phoneme Model
Shubham Bansal, Arijit Mukherjee, Sandeepkumar Satpal, Rupeshkumar Mehta
Generic Indic Text-to-Speech Synthesisers with Rapid Adaptation in an End-to-End Framework
Anusha Prakash, Hema A. Murthy
Efficient Neural Speech Synthesis for Low-Resource Languages Through Multilingual Modeling
Marcel de Korte, Jaebok Kim, Esther Klabbers
One Model, Many Languages: Meta-Learning for Multilingual Text-to-Speech
Tomáš Nekvinda, Ondřej Dušek
In Defence of Metric Learning for Speaker Recognition
Joon Son Chung, Jaesung Huh, Seongkyu Mun, Minjae Lee, Hee-Soo Heo, Soyeon Choe, Chiheon Ham, Sunghwan Jung, Bong-Jin Lee, Icksang Han
Meta-Learning for Short Utterance Speaker Recognition with Imbalance Length Pairs
Seong Min Kye, Youngmoon Jung, Hae Beom Lee, Sung Ju Hwang, Hoirin Kim
Segment-Level Effects of Gender, Nationality and Emotion Information on Text-Independent Speaker Verification
Kai Li, Masato Akagi, Yibo Wu, Jianwu Dang
Weakly Supervised Training of Hierarchical Attention Networks for Speaker Identification
Yanpei Shi, Qiang Huang, Thomas Hain
Multi-Task Learning for Voice Related Recognition Tasks
Ana Montalvo, Jose R. Calvo, Jean-François Bonastre
Unsupervised Training of Siamese Networks for Speaker Verification
Umair Khan, Javier Hernando
An Effective Speaker Recognition Method Based on Joint Identification and Verification Supervisions
Ying Liu, Yan Song, Yiheng Jiang, Ian McLoughlin, Lin Liu, Li-Rong Dai
Speaker-Aware Linear Discriminant Analysis in Speaker Verification
Naijun Zheng, Xixin Wu, Jinghua Zhong, Xunying Liu, Helen Meng
Adversarial Domain Adaptation for Speaker Verification Using Partially Shared Network
Zhengyang Chen, Shuai Wang, Yanmin Qian
Automatic Scoring at Multi-Granularity for L2 Pronunciation
Binghuai Lin, Liyuan Wang, Xiaoli Feng, Jinsong Zhang
An Effective End-to-End Modeling Approach for Mispronunciation Detection
Tien-Hong Lo, Shi-Yan Weng, Hsiu-Jui Chang, Berlin Chen
An End-to-End Mispronunciation Detection System for L2 English Speech Leveraging Novel Anti-Phone Modeling
Bi-Cheng Yan, Meng-Che Wu, Hsiao-Tsung Hung, Berlin Chen
Unsupervised Feature Adaptation Using Adversarial Multi-Task Training for Automatic Evaluation of Children’s Speech
Richeng Duan, Nancy F. Chen
Pronunciation Erroneous Tendency Detection with Language Adversarial Represent Learning
Longfei Yang, Kaiqi Fu, Jinsong Zhang, Takahiro Shinozaki
ASR-Free Pronunciation Assessment
Sitong Cheng, Zhixin Liu, Lantian Li, Zhiyuan Tang, Dong Wang, Thomas Fang Zheng
Automatic Detection of Accent and Lexical Pronunciation Errors in Spontaneous Non-Native English Speech
Konstantinos Kyriakopoulos, Kate M. Knill, Mark J.F. Gales
Context-Aware Goodness of Pronunciation for Computer-Assisted Pronunciation Training
Jiatong Shi, Nan Huo, Qin Jin
Recognize Mispronunciations to Improve Non-Native Acoustic Modeling Through a Phone Decoder Built from One Edit Distance Finite State Automaton
Wei Chu, Yang Liu, Jianwei Zhou
Partial AUC Optimisation Using Recurrent Neural Networks for Music Detection with Limited Training Data
Pablo Gimeno, Victoria Mingote, Alfonso Ortega, Antonio Miguel, Eduardo Lleida
An Open-Source Voice Type Classifier for Child-Centered Daylong Recordings
Marvin Lavechin, Ruben Bousbib, Hervé Bredin, Emmanuel Dupoux, Alejandrina Cristia
Competing Speaker Count Estimation on the Fusion of the Spectral and Spatial Embedding Space
Chao Peng, Xihong Wu, Tianshu Qu
Audio-Visual Multi-Speaker Tracking Based on the GLMB Framework
Shoufeng Lin, Xinyuan Qian
Towards Speech Robustness for Acoustic Scene Classification
Shuo Liu, Andreas Triantafyllopoulos, Zhao Ren, Björn W. Schuller
Identify Speakers in Cocktail Parties with End-to-End Attention
Junzhe Zhu, Mark Hasegawa-Johnson, Leda Sarı
Multi-Talker ASR for an Unknown Number of Sources: Joint Training of Source Counting, Separation and ASR
Thilo von Neumann, Christoph Boeddeker, Lukas Drude, Keisuke Kinoshita, Marc Delcroix, Tomohiro Nakatani, Reinhold Haeb-Umbach
Attentive Convolutional Recurrent Neural Network Using Phoneme-Level Acoustic Representation for Rare Sound Event Detection
Shreya G. Upadhyay, Bo-Hao Su, Chi-Chun Lee
Detecting and Counting Overlapping Speakers in Distant Speech Scenarios
Samuele Cornell, Maurizio Omologo, Stefano Squartini, Emmanuel Vincent
All-in-One Transformer: Unifying Speech Recognition, Audio Tagging, and Event Detection
Niko Moritz, Gordon Wichern, Takaaki Hori, Jonathan Le Roux
Towards Silent Paralinguistics: Deriving Speaking Mode and Speaker ID from Electromyographic Signals
Lorenz Diener, Shahin Amiriparian, Catarina Botelho, Kevin Scheck, Dennis Küster, Isabel Trancoso, Björn W. Schuller, Tanja Schultz
Predicting Collaborative Task Performance Using Graph Interlocutor Acoustic Network in Small Group Interaction
Shun-Chang Zhong, Bo-Hao Su, Wei Huang, Yi-Ching Liu, Chi-Chun Lee
Very Short-Term Conflict Intensity Estimation Using Fisher Vectors
Gábor Gosztolya
Gaming Corpus for Studying Social Screams
Hiroki Mori, Yuki Kikuchi
Speaker Discrimination in Humans and Machines: Effects of Speaking Style Variability
Amber Afshan, Jody Kreiman, Abeer Alwan
Automatic Prediction of Confidence Level from Children’s Oral Reading Recordings
Kamini Sabu, Preeti Rao
Towards a Comprehensive Assessment of Speech Intelligibility for Pathological Speech
W. Xue, V. Mendoza Ramos, W. Harmsen, Catia Cucchiarini, R.W.N.M. van Hout, Helmer Strik
Effects of Communication Channels and Actor’s Gender on Emotion Identification by Native Mandarin Speakers
Yi Lin, Hongwei Ding
Detection of Voicing and Place of Articulation of Fricatives with Deep Learning in a Virtual Speech and Language Therapy Tutor
Ivo Anjos, Maxine Eskenazi, Nuno Marques, Margarida Grilo, Isabel Guimarães, João Magalhães, Sofia Cavaco
Unsupervised Learning for Sequence-to-Sequence Text-to-Speech for Low-Resource Languages
Haitong Zhang, Yue Lin
Conditional Spoken Digit Generation with StyleGAN
Kasperi Palkama, Lauri Juvela, Alexander Ilin
Towards Universal Text-to-Speech
Jingzhou Yang, Lei He
Speaker-Independent Mel-Cepstrum Estimation from Articulator Movements Using D-Vector Input
Kouichi Katsurada, Korin Richmond
Enhancing Monotonicity for Robust Autoregressive Transformer TTS
Xiangyu Liang, Zhiyong Wu, Runnan Li, Yanqing Liu, Sheng Zhao, Helen Meng
Incremental Text to Speech for Neural Sequence-to-Sequence Models Using Reinforcement Learning
Devang S. Ram Mohan, Raphael Lenain, Lorenzo Foglianti, Tian Huey Teh, Marlene Staib, Alexandra Torresquintero, Jiameng Gao
Semi-Supervised Learning for Multi-Speaker Text-to-Speech Synthesis Using Discrete Speech Representation
Tao Tu, Yuan-Jui Chen, Alexander H. Liu, Hung-yi Lee
Learning Joint Articulatory-Acoustic Representations with Normalizing Flows
Pramit Saha, Sidney Fels
Investigating Effective Additional Contextual Factors in DNN-Based Spontaneous Speech Synthesis
Yuki Yamashita, Tomoki Koriyama, Yuki Saito, Shinnosuke Takamichi, Yusuke Ijima, Ryo Masumura, Hiroshi Saruwatari
Hider-Finder-Combiner: An Adversarial Architecture for General Speech Signal Modification
Jacob J. Webber, Olivier Perrotin, Simon King
Wav2Spk: A Simple DNN Architecture for Learning Speaker Embeddings from Waveforms
Weiwei Lin, Man-Wai Mak
How Does Label Noise Affect the Quality of Speaker Embeddings?
Minh Pham, Zeqian Li, Jacob Whitehill
A Comparative Re-Assessment of Feature Extractors for Deep Speaker Embeddings
Xuechen Liu, Md. Sahidullah, Tomi Kinnunen
Speaker Representation Learning Using Global Context Guided Channel and Time-Frequency Transformations
Wei Xia, John H.L. Hansen
Intra-Class Variation Reduction of Speaker Representation in Disentanglement Framework
Yoohwan Kwon, Soo-Whan Chung, Hong-Goo Kang
Compact Speaker Embedding: lrx-Vector
Munir Georges, Jonathan Huang, Tobias Bocklet
Cosine-Distance Virtual Adversarial Training for Semi-Supervised Speaker-Discriminative Acoustic Embeddings
Florian L. Kreyssig, Philip C. Woodland
Deep Speaker Embedding with Long Short Term Centroid Learning for Text-Independent Speaker Verification
Junyi Peng, Rongzhi Gu, Yuexian Zou
Neural Discriminant Analysis for Deep Speaker Embedding
Lantian Li, Dong Wang, Thomas Fang Zheng
Learning Speaker Embedding from Text-to-Speech
Jaejin Cho, Piotr Żelasko, Jesús Villalba, Shinji Watanabe, Najim Dehak
Noisy-Reverberant Speech Enhancement Using DenseUNet with Time-Frequency Attention
Yan Zhao, DeLiang Wang
On Loss Functions and Recurrency Training for GAN-Based Speech Enhancement Systems
Zhuohuang Zhang, Chengyun Deng, Yi Shen, Donald S. Williamson, Yongtao Sha, Yi Zhang, Hui Song, Xiangang Li
Self-Supervised Adversarial Multi-Task Learning for Vocoder-Based Monaural Speech Enhancement
Zhihao Du, Ming Lei, Jiqing Han, Shiliang Zhang
Deep Speech Inpainting of Time-Frequency Masks
Mikolaj Kegler, Pierre Beckmann, Milos Cernak
Real-Time Single-Channel Deep Neural Network-Based Speech Enhancement on Edge Devices
Nikhil Shankar, Gautam Shreedhar Bhat, Issa M.S. Panahi
Improved Speech Enhancement Using a Time-Domain GAN with Mask Learning
Ju Lin, Sufeng Niu, Adriaan J. van Wijngaarden, Jerome L. McClendon, Melissa C. Smith, Kuang-Ching Wang
Real Time Speech Enhancement in the Waveform Domain
Alexandre Défossez, Gabriel Synnaeve, Yossi Adi
Efficient Low-Latency Speech Enhancement with Mobile Audio Streaming Networks
Michal Romaniuk, Piotr Masztalski, Karol Piaskowski, Mateusz Matuszewski
Multi-Stream Attention-Based BLSTM with Feature Segmentation for Speech Emotion Recognition
Yuya Chiba, Takashi Nose, Akinori Ito
Microphone Array Post-Filter for Target Speech Enhancement Without a Prior Information of Point Interferers
Guanjun Li, Shan Liang, Shuai Nie, Wenju Liu, Zhanlei Yang, Longshuai Xiao
Similarity-and-Independence-Aware Beamformer: Method for Target Source Extraction Using Magnitude Spectrogram as Reference
Atsuo Hiroe
The Method of Random Directions Optimization for Stereo Audio Source Separation
Oleg Golokolenko, Gerald Schuller
Gated Recurrent Fusion of Spatial and Spectral Features for Multi-Channel Speech Separation with Deep Embedding Representations
Cunhang Fan, Jianhua Tao, Bin Liu, Jiangyan Yi, Zhengqi Wen
Generalized Minimal Distortion Principle for Blind Source Separation
Robin Scheibler
A Lightweight Model Based on Separable Convolution for Speech Emotion Recognition
Ying Zhong, Ying Hu, Hao Huang, Wushour Silamu
Meta Multi-Task Learning for Speech Emotion Recognition
Ruichu Cai, Kaibin Guo, Boyan Xu, Xiaoyan Yang, Zhenjie Zhang
GEV Beamforming Supported by DOA-Based Masks Generated on Pairs of Microphones
François Grondin, Jean-Samuel Lauzon, Jonathan Vincent, François Michaud
Accurate Detection of Wake Word Start and End Using a CNN
Christin Jose, Yuriy Mishchenko, Thibaud Sénéchal, Anish Shah, Alex Escott, Shiv Naga Prasad Vitaladevuni
Hybrid Transformer/CTC Networks for Hardware Efficient Voice Triggering
Saurabh Adya, Vineet Garg, Siddharth Sigtia, Pramod Simha, Chandra Dhir
MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition
Somshubra Majumdar, Boris Ginsburg
Iterative Compression of End-to-End ASR Model Using AutoML
Abhinav Mehrotra, Łukasz Dudziak, Jinsu Yeo, Young-yoon Lee, Ravichander Vipperla, Mohamed S. Abdelfattah, Sourav Bhattacharya, Samin Ishtiaq, Alberto Gil C.P. Ramos, SangJeong Lee, Daehyun Kim, Nicholas D. Lane
Quantization Aware Training with Absolute-Cosine Regularization for Automatic Speech Recognition
Hieu Duy Nguyen, Anastasios Alexandridis, Athanasios Mouchtaris
Streaming On-Device End-to-End ASR System for Privacy-Sensitive Voice-Typing
Abhinav Garg, Gowtham P. Vadisetti, Dhananjaya Gowda, Sichen Jin, Aditya Jayasimha, Youngho Han, Jiyeon Kim, Junmo Park, Kwangyoun Kim, Sooyeon Kim, Young-yoon Lee, Kyungbo Min, Chanwoo Kim
Scaling Up Online Speech Recognition Using ConvNets
Vineel Pratap, Qiantong Xu, Jacob Kahn, Gilad Avidov, Tatiana Likhomanenko, Awni Hannun, Vitaliy Liptchinsky, Gabriel Synnaeve, Ronan Collobert
Listen Attentively, and Spell Once: Whole Sentence Generation via a Non-Autoregressive Architecture for Low-Latency Speech Recognition
Ye Bai, Jiangyan Yi, Jianhua Tao, Zhengkun Tian, Zhengqi Wen, Shuai Zhang
Rescore in a Flash: Compact, Cache Efficient Hashing Data Structures for n-Gram Language Models
Grant P. Strimel, Ariya Rastrow, Gautam Tiwari, Adrien Piérard, Jon Webb
Multi-Speaker Emotion Conversion via Latent Variable Regularization and a Chained Encoder-Decoder-Predictor Network
Ravi Shankar, Hsi-Wei Hsieh, Nicolas Charon, Archana Venkataraman
Non-Parallel Emotion Conversion Using a Deep-Generative Hybrid Network and an Adversarial Pair Discriminator
Ravi Shankar, Jacob Sager, Archana Venkataraman
Laughter Synthesis: Combining Seq2seq Modeling with Transfer Learning
Noé Tits, Kevin El Haddad, Thierry Dutoit
Nonparallel Emotional Speech Conversion Using VAE-GAN
Yuexin Cao, Zhengchen Liu, Minchuan Chen, Jun Ma, Shaojun Wang, Jing Xiao
Principal Style Components: Expressive Style Control and Cross-Speaker Transfer in Neural TTS
Alexander Sorin, Slava Shechtman, Ron Hoory
Converting Anyone’s Emotion: Towards Speaker-Independent Emotional Voice Conversion
Kun Zhou, Berrak Sisman, Mingyang Zhang, Haizhou Li
Controlling the Strength of Emotions in Speech-Like Emotional Sound Generated by WaveNet
Kento Matsumoto, Sunao Hara, Masanobu Abe
Learning Syllable-Level Discrete Prosodic Representation for Expressive Speech Generation
Guangyan Zhang, Ying Qin, Tan Lee
Simultaneous Conversion of Speaker Identity and Emotion Based on Multiple-Domain Adaptive RBM
Takuya Kishida, Shin Tsukamoto, Toru Nakashika
Exploiting Deep Sentential Context for Expressive End-to-End Speech Synthesis
Fengyu Yang, Shan Yang, Qinghua Wu, Yujun Wang, Lei Xie
Hierarchical Multi-Grained Generative Model for Expressive Speech Synthesis
Yukiya Hono, Kazuna Tsuboi, Kei Sawada, Kei Hashimoto, Keiichiro Oura, Yoshihiko Nankaku, Keiichi Tokuda
GAN-Based Data Generation for Speech Emotion Recognition
Sefik Emre Eskimez, Dimitrios Dimitriadis, Robert Gmyr, Kenichi Kumanati
The Phonetic Bases of Vocal Expressed Emotion: Natural versus Acted
Hira Dhamyal, Shahan Ali Memon, Bhiksha Raj, Rita Singh
The INTERSPEECH 2020 Far-Field Speaker Verification Challenge
Xiaoyi Qin, Ming Li, Hui Bu, Wei Rao, Rohan Kumar Das, Shrikanth Narayanan, Haizhou Li
Deep Embedding Learning for Text-Dependent Speaker Verification
Peng Zhang, Peng Hu, Xueliang Zhang
STC-Innovation Speaker Recognition Systems for Far-Field Speaker Verification Challenge 2020
Aleksei Gusev, Vladimir Volokhov, Alisa Vinogradova, Tseren Andzhukaev, Andrey Shulipa, Sergey Novoselov, Timur Pekhovsky, Alexander Kozlov
NPU Speaker Verification System for INTERSPEECH 2020 Far-Field Speaker Verification Challenge
Li Zhang, Jian Wu, Lei Xie
The JD AI Speaker Verification System for the FFSVC 2020 Challenge
Ying Tong, Wei Xue, Shanluo Huang, Lu Fan, Chao Zhang, Guohong Ding, Xiaodong He
FaceFilter: Audio-Visual Speech Separation Using Still Images
Soo-Whan Chung, Soyeon Choe, Joon Son Chung, Hong-Goo Kang
Seeing Voices and Hearing Voices: Learning Discriminative Embeddings Using Cross-Modal Self-Supervision
Soo-Whan Chung, Hong-Goo Kang, Joon Son Chung
Fusion Architectures for Word-Based Audiovisual Speech Recognition
Michael Wand, Jürgen Schmidhuber
Audio-Visual Multi-Channel Recognition of Overlapped Speech
Jianwei Yu, Bo Wu, Rongzhi Gu, Shi-Xiong Zhang, Lianwu Chen, Yong Xu, Meng Yu, Dan Su, Dong Yu, Xunying Liu, Helen Meng
TMT: A Transformer-Based Modal Translator for Improving Multimodal Sequence Representations in Audio Visual Scene-Aware Dialog
Wubo Li, Dongwei Jiang, Wei Zou, Xiangang Li
Should we Hard-Code the Recurrence Concept or Learn it Instead? Exploring the Transformer Architecture for Audio-Visual Speech Recognition
George Sterpu, Christian Saam, Naomi Harte
Resource-Adaptive Deep Learning for Visual Speech Recognition
Alexandros Koumparoulis, Gerasimos Potamianos, Samuel Thomas, Edmilson da Silva Morais
Speech-Image Semantic Alignment Does Not Depend on Any Prior Classification Tasks
Masood S. Mortazavi
Lip Graph Assisted Audio-Visual Speech Recognition Using Bidirectional Synchronous Fusion
Hong Liu, Zhan Chen, Bing Yang
Caption Alignment for Low Resource Audio-Visual Data
Vighnesh Reddy Konda, Mayur Warialani, Rakesh Prasanth Achari, Varad Bhatnagar, Jayaprakash Akula, Preethi Jyothi, Ganesh Ramakrishnan, Gholamreza Haffari, Pankaj Singh
Vocoder-Based Speech Synthesis from Silent Videos
Daniel Michelsanti, Olga Slizovskaia, Gloria Haro, Emilia Gómez, Zheng-Hua Tan, Jesper Jensen
Quasi-Periodic Parallel WaveGAN Vocoder: A Non-Autoregressive Pitch-Dependent Dilated Convolution Model for Parametric Speech Generation
Yi-Chiao Wu, Tomoki Hayashi, Takuma Okamoto, Hisashi Kawai, Tomoki Toda
A Cyclical Post-Filtering Approach to Mismatch Refinement of Neural Vocoder for Text-to-Speech Systems
Yi-Chiao Wu, Patrick Lumban Tobing, Kazuki Yasuhara, Noriyuki Matsunaga, Yamato Ohtani, Tomoki Toda
Audio Dequantization for High Fidelity Audio Generation in Flow-Based Neural Vocoder
Hyun-Wook Yoon, Sang-Hoon Lee, Hyeong-Rae Noh, Seong-Whan Lee
StrawNet: Self-Training WaveNet for TTS in Low-Data Regimes
Manish Sharma, Tom Kenter, Rob Clark
An Efficient Subband Linear Prediction for LPCNet-Based Neural Synthesis
Yang Cui, Xi Wang, Lei He, Frank K. Soong
Reverberation Modeling for Source-Filter-Based Neural Vocoder
Yang Ai, Xin Wang, Junichi Yamagishi, Zhen-Hua Ling
Bunched LPCNet: Vocoder for Low-Cost Neural Text-To-Speech Systems
Ravichander Vipperla, Sangjun Park, Kihyun Choo, Samin Ishtiaq, Kyoungbo Min, Sourav Bhattacharya, Abhinav Mehrotra, Alberto Gil C.P. Ramos, Nicholas D. Lane
Neural Text-to-Speech with a Modeling-by-Generation Excitation Vocoder
Eunwoo Song, Min-Jae Hwang, Ryuichi Yamamoto, Jin-Seob Kim, Ohsung Kwon, Jae-Min Kim
SpeedySpeech: Efficient Neural Speech Synthesis
Jan Vainer, Ondřej Dušek
Semi-Supervised End-to-End ASR via Teacher-Student Learning with Conditional Posterior Distribution
Zi-qiang Zhang, Yan Song, Jian-shu Zhang, Ian McLoughlin, Li-Rong Dai
Leveraging Unlabeled Speech for Sequence Discriminative Training of Acoustic Models
Ashtosh Sapru, Sri Garimella
Developing RNN-T Models Surpassing High-Performance Hybrid Models with Customization Capability
Jinyu Li, Rui Zhao, Zhong Meng, Yanqing Liu, Wenning Wei, Sarangarajan Parthasarathy, Vadim Mazalov, Zhenghao Wang, Lei He, Sheng Zhao, Yifan Gong
End-to-End ASR with Adaptive Span Self-Attention
Xuankai Chang, Aswin Shanmugam Subramanian, Pengcheng Guo, Shinji Watanabe, Yuya Fujita, Motoi Omachi
Subword Regularization: An Analysis of Scalability and Generalization for End-to-End Automatic Speech Recognition
Egor Lakomkin, Jahn Heymann, Ilya Sklyar, Simon Wiesler
Early Stage LM Integration Using Local and Global Log-Linear Combination
Wilfried Michel, Ralf Schlüter, Hermann Ney
ContextNet: Improving Convolutional Neural Networks for Automatic Speech Recognition with Global Context
Wei Han, Zhengdong Zhang, Yu Zhang, Jiahui Yu, Chung-Cheng Chiu, James Qin, Anmol Gulati, Ruoming Pang, Yonghui Wu
Emitting Word Timings with End-to-End Models
Tara N. Sainath, Ruoming Pang, David Rybach, Basi García, Trevor Strohman
Low-Latency Sequence-to-Sequence Speech Recognition and Translation by Partial Hypothesis Selection
Danni Liu, Gerasimos Spanakis, Jan Niehues
Neural Language Modeling with Implicit Cache Pointers
Ke Li, Daniel Povey, Sanjeev Khudanpur
Finnish ASR with Deep Transformer Models
Abhilash Jain, Aku Rouhe, Stig-Arne Grönroos, Mikko Kurimo
Distilling the Knowledge of BERT for Sequence-to-Sequence ASR
Hayato Futami, Hirofumi Inaguma, Sei Ueno, Masato Mimura, Shinsuke Sakai, Tatsuya Kawahara
Stochastic Convolutional Recurrent Networks for Language Modeling
Jen-Tzung Chien, Yu-Min Huang
Investigation of Large-Margin Softmax in Neural Language Modeling
Jingjing Huo, Yingbo Gao, Weiyue Wang, Ralf Schlüter, Hermann Ney
Contextualizing ASR Lattice Rescoring with Hybrid Pointer Network Language Model
Da-Rong Liu, Chunxi Liu, Frank Zhang, Gabriel Synnaeve, Yatharth Saraf, Geoffrey Zweig
Mask CTC: Non-Autoregressive End-to-End ASR with CTC and Mask Predict
Yosuke Higuchi, Shinji Watanabe, Nanxin Chen, Tetsuji Ogawa, Tetsunori Kobayashi
Insertion-Based Modeling for End-to-End Automatic Speech Recognition
Yuya Fujita, Shinji Watanabe, Motoi Omachi, Xuankai Chang
Voice Activity Detection in the Wild via Weakly Supervised Sound Event Detection
Yefei Chen, Heinrich Dinkel, Mengyue Wu, Kai Yu
Dual Attention in Time and Frequency Domain for Voice Activity Detection
Joohyung Lee, Youngmoon Jung, Hoirin Kim
Polishing the Classical Likelihood Ratio Test by Supervised Learning for Voice Activity Detection
Tianjiao Xu, Hui Zhang, Xueliang Zhang
A Noise Robust Technique for Detecting Vowels in Speech Signals
Avinash Kumar, S. Shahnawazuddin, Waquar Ahmad
End-to-End Domain-Adversarial Voice Activity Detection
Marvin Lavechin, Marie-Philippe Gill, Ruben Bousbib, Hervé Bredin, Leibny Paola Garcia-Perera
VOP Detection in Variable Speech Rate Condition
Ayush Agarwal, Jagabandhu Mishra, S.R. Mahadeva Prasanna
MLNET: An Adaptive Multiple Receptive-Field Attention Neural Network for Voice Activity Detection
Zhenpeng Zheng, Jianzong Wang, Ning Cheng, Jian Luo, Jing Xiao
Self-Supervised Contrastive Learning for Unsupervised Phoneme Segmentation
Felix Kreuk, Joseph Keshet, Yossi Adi
That Sounds Familiar: An Analysis of Phonetic Representations Transfer Across Languages
Piotr Żelasko, Laureano Moro-Velázquez, Mark Hasegawa-Johnson, Odette Scharenborg, Najim Dehak
Analyzing Read Aloud Speech by Primary School Pupils: Insights for Research and Development
S. Limonard, Catia Cucchiarini, R.W.N.M. van Hout, Helmer Strik
Discovering Articulatory Speech Targets from Synthesized Random Babble
Heikki Rasilo, Yannick Jadoul
Speaker Dependent Acoustic-to-Articulatory Inversion Using Real-Time MRI of the Vocal Tract
Tamás Gábor Csapó
Acoustic-to-Articulatory Inversion with Deep Autoregressive Articulatory-WaveNet
Narjes Bozorg, Michael T. Johnson
Using Silence MR Image to Synthesise Dynamic MRI Vocal Tract Data of CV
Ioannis K. Douros, Ajinkya Kulkarni, Chrysanthi Dourou, Yu Xie, Jacques Felblinger, Karyna Isaieva, Pierre-André Vuissoz, Yves Laprie
Quantification of Transducer Misalignment in Ultrasound Tongue Imaging
Tamás Gábor Csapó, Kele Xu
Independent and Automatic Evaluation of Speaker-Independent Acoustic-to-Articulatory Reconstruction
Maud Parrot, Juliette Millet, Ewan Dunbar
CSL-EMG_Array: An Open Access Corpus for EMG-to-Speech Conversion
Lorenz Diener, Mehrdad Roustay Vishkasougheh, Tanja Schultz
Links Between Production and Perception of Glottalisation in Individual Australian English Speaker/Listeners
Joshua Penney, Felicity Cox, Anita Szakay
Jointly Fine-Tuning “BERT-Like” Self Supervised Models to Improve Multimodal Speech Emotion Recognition
Shamane Siriwardhana, Andrew Reis, Rivindu Weerasekera, Suranga Nanayakkara
Vector-Quantized Autoregressive Predictive Coding
Yu-An Chung, Hao Tang, James Glass
Speech-XLNet: Unsupervised Acoustic Model Pretraining for Self-Attention Networks
Xingchen Song, Guangsen Wang, Yiheng Huang, Zhiyong Wu, Dan Su, Helen Meng
Large Scale Weakly and Semi-Supervised Learning for Low-Resource Video ASR
Kritika Singh, Vimal Manohar, Alex Xiao, Sergey Edunov, Ross Girshick, Vitaliy Liptchinsky, Christian Fuegen, Yatharth Saraf, Geoffrey Zweig, Abdelrahman Mohamed
Sequence-Level Self-Learning with Multiple Hypotheses
Kenichi Kumatani, Dimitrios Dimitriadis, Yashesh Gaur, Robert Gmyr, Sefik Emre Eskimez, Jinyu Li, Michael Zeng
Defense for Black-Box Attacks on Anti-Spoofing Models by Self-Supervised Learning
Haibin Wu, Andy T. Liu, Hung-yi Lee
Understanding Self-Attention of Self-Supervised Audio Transformers
Shu-wen Yang, Andy T. Liu, Hung-yi Lee
A Convolutional Deep Markov Model for Unsupervised Speech Representation Learning
Sameer Khurana, Antoine Laurent, Wei-Ning Hsu, Jan Chorowski, Adrian Lancucki, Ricard Marxer, James Glass
Automatic Speech Recognition for ILSE-Interviews: Longitudinal Conversational Speech Recordings Covering Aging and Cognitive Decline
Ayimunishagu Abulimiti, Jochen Weiner, Tanja Schultz
Dynamic Margin Softmax Loss for Speaker Verification
Dao Zhou, Longbiao Wang, Kong Aik Lee, Yibo Wu, Meng Liu, Jianwu Dang, Jianguo Wei
On Parameter Adaptation in Softmax-Based Cross-Entropy Loss for Improved Convergence Speed and Accuracy in DNN-Based Speaker Recognition
Magdalena Rybicka, Konrad Kowalczyk
Training Speaker Enrollment Models by Network Optimization
Victoria Mingote, Antonio Miguel, Alfonso Ortega, Eduardo Lleida
Supervised Domain Adaptation for Text-Independent Speaker Verification Using Limited Data
Seyyed Saeed Sarfjoo, Srikanth Madikeri, Petr Motlicek, Sébastien Marcel
Angular Margin Centroid Loss for Text-Independent Speaker Recognition
Yuheng Wei, Junzhao Du, Hui Liu
Domain-Invariant Speaker Vector Projection by Model-Agnostic Meta-Learning
Jiawen Kang, Ruiqi Liu, Lantian Li, Yunqi Cai, Dong Wang, Thomas Fang Zheng
ECAPA-TDNN: Emphasized Channel Attention, Propagation and Aggregation in TDNN Based Speaker Verification
Brecht Desplanques, Jenthe Thienpondt, Kris Demuynck
Length- and Noise-Aware Training Techniques for Short-Utterance Speaker Recognition
Wenda Chen, Jonathan Huang, Tobias Bocklet
Spoken Language ‘Grammatical Error Correction’
Yiting Lu, Mark J.F. Gales, Yu Wang
Mixtures of Deep Neural Experts for Automated Speech Scoring
Sara Papi, Edmondo Trentin, Roberto Gretter, Marco Matassoni, Daniele Falavigna
Targeted Content Feedback in Spoken Language Learning and Assessment
Xinhao Wang, Klaus Zechner, Christopher Hamill
Universal Adversarial Attacks on Spoken Language Assessment Systems
Vyas Raina, Mark J.F. Gales, Kate M. Knill
Ensemble Approaches for Uncertainty in Spoken Language Assessment
Xixin Wu, Kate M. Knill, Mark J.F. Gales, Andrey Malinin
Shadowability Annotation with Fine Granularity on L2 Utterances and its Improvement with Native Listeners’ Script-Shadowing
Zhenchao Lin, Ryo Takashima, Daisuke Saito, Nobuaki Minematsu, Noriko Nakanishi
ASR-Based Evaluation and Feedback for Individualized Reading Practice
Yu Bai, Ferdy Hubers, Catia Cucchiarini, Helmer Strik
Domain Adversarial Neural Networks for Dysarthric Speech Recognition
Dominika Woszczyk, Stavros Petridis, David Millard
Automatic Estimation of Pathological Voice Quality Based on Recurrent Neural Network Using Amplitude and Phase Spectrogram
Shunsuke Hidaka, Yogaku Lee, Kohei Wakamiya, Takashi Nakagawa, Tokihiko Kaburagi
Stochastic Curiosity Exploration for Dialogue Systems
Jen-Tzung Chien, Po-Chien Hsu
Conditional Response Augmentation for Dialogue Using Knowledge Distillation
Myeongho Jeong, Seungtaek Choi, Hojae Han, Kyungho Kim, Seung-won Hwang
Prototypical Q Networks for Automatic Conversational Diagnosis and Few-Shot New Disease Adaption
Hongyin Luo, Shang-Wen Li, James Glass
End-to-End Task-Oriented Dialog System Through Template Slot Value Generation
Teakgyu Hong, Oh-Woog Kwon, Young-Kil Kim
Task-Oriented Dialog Generation with Enhanced Entity Representation
Zhenhao He, Jiachun Wang, Jian Chen
End-to-End Speech-to-Dialog-Act Recognition
Viet-Trung Dang, Tianyu Zhao, Sei Ueno, Hirofumi Inaguma, Tatsuya Kawahara
Discriminative Transfer Learning for Optimizing ASR and Semantic Labeling in Task-Oriented Spoken Dialog
Yao Qian, Yu Shi, Michael Zeng
Datasets and Benchmarks for Task-Oriented Log Dialogue Ranking Task
Xinnuo Xu, Yizhe Zhang, Lars Liden, Sungjin Lee
A Semi-Blind Source Separation Approach for Speech Dereverberation
Ziteng Wang, Yueyue Na, Zhang Liu, Yun Li, Biao Tian, Qiang Fu
Virtual Acoustic Channel Expansion Based on Neural Networks for Weighted Prediction Error-Based Speech Dereverberation
Joon-Young Yang, Joon-Hyuk Chang
SkipConvNet: Skip Convolutional Neural Network for Speech Dereverberation Using Optimally Smoothed Spectral Mapping
Vinay Kothapally, Wei Xia, Shahram Ghorbani, John H.L. Hansen, Wei Xue, Jing Huang
A Robust and Cascaded Acoustic Echo Cancellation Based on Deep Learning
Chenggang Zhang, Xueliang Zhang
Generative Adversarial Network Based Acoustic Echo Cancellation
Yi Zhang, Chengyun Deng, Shiqian Ma, Yongtao Sha, Hui Song, Xiangang Li
Nonlinear Residual Echo Suppression Using a Recurrent Neural Network
Lukas Pfeifenberger, Franz Pernkopf
Independent Echo Path Modeling for Stereophonic Acoustic Echo Cancellation
Yi Gao, Ian Liu, J. Zheng, Cheng Luo, Bin Li
Nonlinear Residual Echo Suppression Based on Multi-Stream Conv-TasNet
Hongsheng Chen, Teng Xiang, Kai Chen, Jing Lu
Improving Partition-Block-Based Acoustic Echo Canceler in Under-Modeling Scenarios
Wenzhi Fan, Jing Lu
Attention Wave-U-Net for Acoustic Echo Cancellation
Jung-Hee Kim, Joon-Hyuk Chang
From Speaker Verification to Multispeaker Speech Synthesis, Deep Transfer with Feedback Constraint
Zexin Cai, Chuxiong Zhang, Ming Li
Can Speaker Augmentation Improve Multi-Speaker End-to-End TTS?
Erica Cooper, Cheng-I Lai, Yusuke Yasuda, Junichi Yamagishi
Non-Autoregressive End-to-End TTS with Coarse-to-Fine Decoding
Tao Wang, Xuefei Liu, Jianhua Tao, Jiangyan Yi, Ruibo Fu, Zhengqi Wen
Bi-Level Speaker Supervision for One-Shot Speech Synthesis
Tao Wang, Jianhua Tao, Ruibo Fu, Jiangyan Yi, Zhengqi Wen, Chunyu Qiang
Naturalness Enhancement with Linguistic Information in End-to-End TTS Using Unsupervised Parallel Encoding
Alex Peiró-Lilja, Mireia Farrús
MoBoAligner: A Neural Alignment Model for Non-Autoregressive TTS with Monotonic Boundary Search
Naihan Li, Shujie Liu, Yanqing Liu, Sheng Zhao, Ming Liu, Ming Zhou
JDI-T: Jointly Trained Duration Informed Transformer for Text-To-Speech without Explicit Alignment
Dan Lim, Won Jang, Gyeonghwan O, Heayoung Park, Bongwan Kim, Jaesam Yoon
End-to-End Text-to-Speech Synthesis with Unaligned Multiple Language Units Based on Attention
Masashi Aso, Shinnosuke Takamichi, Hiroshi Saruwatari
Attention Forcing for Speech Synthesis
Qingyun Dou, Joshua Efiong, Mark J.F. Gales
Testing the Limits of Representation Mixing for Pronunciation Correction in End-to-End Speech Synthesis
Jason Fong, Jason Taylor, Simon King
MultiSpeech: Multi-Speaker Text to Speech with Transformer
Mingjian Chen, Xu Tan, Yi Ren, Jin Xu, Hao Sun, Sheng Zhao, Tao Qin
Exploiting Conic Affinity Measures to Design Speech Enhancement Systems Operating in Unseen Noise Conditions
Pavlos Papadopoulos, Shrikanth Narayanan
Adversarial Dictionary Learning for Monaural Speech Enhancement
Yunyun Ji, Longting Xu, Wei-Ping Zhu
Semi-Supervised Self-Produced Speech Enhancement and Suppression Based on Joint Source Modeling of Air- and Body-Conducted Signals Using Variational Autoencoder
Shogo Seki, Moe Takada, Tomoki Toda
Spatial Covariance Matrix Estimation for Reverberant Speech with Application to Speech Enhancement
Ran Weisman, Vladimir Tourbabin, Paul Calamia, Boaz Rafaely
A Cross-Channel Attention-Based Wave-U-Net for Multi-Channel Speech Enhancement
Minh Tri Ho, Jinyoung Lee, Bong-Ki Lee, Dong Hoon Yi, Hong-Goo Kang
TinyLSTMs: Efficient Neural Speech Enhancement for Hearing Aids
Igor Fedorov, Marko Stamenovic, Carl Jensen, Li-Chia Yang, Ari Mandell, Yiming Gan, Matthew Mattina, Paul N. Whatmough
Intelligibility Enhancement Based on Speech Waveform Modification Using Hearing Impairment
Shu Hikosaka, Shogo Seki, Tomoki Hayashi, Kazuhiro Kobayashi, Kazuya Takeda, Hideki Banno, Tomoki Toda
Speaker and Phoneme-Aware Speech Bandwidth Extension with Residual Dual-Path Network
Nana Hou, Chenglin Xu, Van Tung Pham, Joey Tianyi Zhou, Eng Siong Chng, Haizhou Li
Multi-Task Learning for End-to-End Noise-Robust Bandwidth Extension
Nana Hou, Chenglin Xu, Joey Tianyi Zhou, Eng Siong Chng, Haizhou Li
Phase-Aware Music Super-Resolution Using Generative Adversarial Networks
Shichao Hu, Bin Zhang, Beici Liang, Ethan Zhao, Simon Lui
Learning Utterance-Level Representations with Label Smoothing for Speech Emotion Recognition
Jian Huang, Jianhua Tao, Bin Liu, Zheng Lian
Removing Bias with Residual Mixture of Multi-View Attention for Speech Emotion Recognition
Md. Asif Jalal, Rosanna Milner, Thomas Hain, Roger K. Moore
Adaptive Domain-Aware Representation Learning for Speech Emotion Recognition
Weiquan Fan, Xiangmin Xu, Xiaofen Xing, Dongyan Huang
Speech Emotion Recognition with Discriminative Feature Learning
Huan Zhou, Kai Liu
Using Speech Enhancement Preprocessing for Speech Emotion Recognition in Realistic Noisy Conditions
Hengshun Zhou, Jun Du, Yan-Hui Tu, Chin-Hui Lee
Comparison of Glottal Source Parameter Values in Emotional Vowels
Yongwei Li, Jianhua Tao, Bin Liu, Donna Erickson, Masato Akagi
Learning to Recognize Per-Rater’s Emotion Perception Using Co-Rater Training Strategy with Soft and Hard Labels
Huang-Cheng Chou, Chi-Chun Lee
Empirical Interpretation of Speech Emotion Perception with Attention Based Model for Speech Emotion Recognition
Md. Asif Jalal, Rosanna Milner, Thomas Hain
Phonetic Accommodation of L2 German Speakers to the Virtual Language Learning Tutor Mirabella
Iona Gessinger, Bernd Möbius, Bistra Andreeva, Eran Raveh, Ingmar Steiner
Characterization of Singaporean Children’s English: Comparisons to American and British Counterparts Using Archetypal Analysis
Yuling Gu, Nancy F. Chen
Rhythmic Convergence in Canadian French Varieties?
Svetlana Kaminskaïa
Malayalam-English Code-Switched: Grapheme to Phoneme System
Sreeja Manghat, Sreeram Manghat, Tanja Schultz
Ongoing Phonologization of Word-Final Voicing Alternations in Two Romance Languages: Romanian and French
Mathilde Hutin, Adèle Jatteau, Ioana Vasilescu, Lori Lamel, Martine Adda-Decker
Cues for Perception of Gender in Synthetic Voices and the Role of Identity
Maxwell Hope, Jason Lilley
Phonetic Entrainment in Cooperative Dialogues: A Case of Russian
Alla Menshikova, Daniil Kocharov, Tatiana Kachkovskaia
Prosodic Characteristics of Genuine and Mock (Im)polite Mandarin Utterances
Chengwei Xu, Wentao Gu
Tone Variations in Regionally Accented Mandarin
Yanping Li, Catherine T. Best, Michael D. Tyler, Denis Burnham
F0 Patterns in Mandarin Statements of Mandarin and Cantonese Speakers
Yike Yang, Si Chen, Xi Chen
SpeechBERT: An Audio-and-Text Jointly Learned Language Model for End-to-End Spoken Question Answering
Yung-Sung Chuang, Chi-Liang Liu, Hung-yi Lee, Lin-shan Lee
An Audio-Enriched BERT-Based Framework for Spoken Multiple-Choice Question Answering
Chia-Chih Kuo, Shang-Bao Luo, Kuan-Yu Chen
Entity Linking for Short Text Using Structured Knowledge Graph via Multi-Grained Text Matching
Binxuan Huang, Han Wang, Tong Wang, Yue Liu, Yang Liu
Sound-Image Grounding Based Focusing Mechanism for Efficient Automatic Spoken Language Acquisition
Mingxin Zhang, Tomohiro Tanaka, Wenxin Hou, Shengzhou Gao, Takahiro Shinozaki
Semi-Supervised Learning for Character Expression of Spoken Dialogue Systems
Kenta Yamamoto, Koji Inoue, Tatsuya Kawahara
Dimensional Emotion Prediction Based on Interactive Context in Conversation
Xiaohan Shi, Sixia Li, Jianwu Dang
HRI-RNN: A User-Robot Dynamics-Oriented RNN for Engagement Decrease Detection
Asma Atamna, Chloé Clavel
Neural Representations of Dialogical History for Improving Upcoming Turn Acoustic Parameters Prediction
Simone Fuscone, Benoit Favre, Laurent Prévot
Detecting Domain-Specific Credibility and Expertise in Text and Speech
Shengli Hu
The Attacker’s Perspective on Automatic Speaker Verification: An Overview
Rohan Kumar Das, Xiaohai Tian, Tomi Kinnunen, Haizhou Li
Extrapolating False Alarm Rates in Automatic Speaker Verification
Alexey Sholokhov, Tomi Kinnunen, Ville Vestman, Kong Aik Lee
Self-Supervised Spoofing Audio Detection Scheme
Ziyue Jiang, Hongcheng Zhu, Li Peng, Wenbing Ding, Yanzhen Ren
Inaudible Adversarial Perturbations for Targeted Attack in Speaker Recognition
Qing Wang, Pengcheng Guo, Lei Xie
x-Vectors Meet Adversarial Attacks: Benchmarking Adversarial Robustness in Speaker Verification
Jesús Villalba, Yuekai Zhang, Najim Dehak
Black-Box Attacks on Spoofing Countermeasures Using Transferability of Adversarial Examples
Yuekai Zhang, Ziyan Jiang, Jesús Villalba, Najim Dehak
Multimodal Emotion Recognition Using Cross-Modal Attention and 1D Convolutional Neural Networks
Krishna D. N., Ankita Patil
Abstractive Spoken Document Summarization Using Hierarchical Model with Multi-Stage Attention Diversity Optimization
Potsawee Manakul, Mark J.F. Gales, Linlin Wang
Improved Learning of Word Embeddings with Word Definitions and Semantic Injection
Yichi Zhang, Yinpei Dai, Zhijian Ou, Huixin Wang, Junlan Feng
Wake Word Detection with Alignment-Free Lattice-Free MMI
Yiming Wang, Hang Lv, Daniel Povey, Lei Xie, Sanjeev Khudanpur
Improving Vietnamese Named Entity Recognition from Speech Using Word Capitalization and Punctuation Recovery Models
Thai Binh Nguyen, Quang Minh Nguyen, Thi Thu Hien Nguyen, Quoc Truong Do, Chi Mai Luong
End-to-End Named Entity Recognition from English Speech
Hemant Yadav, Sreyan Ghosh, Yi Yu, Rajiv Ratn Shah
Semantic Complexity in End-to-End Spoken Language Understanding
Joseph P. McKenna, Samridhi Choudhary, Michael Saxon, Grant P. Strimel, Athanasios Mouchtaris
Analysis of Disfluency in Children’s Speech
Trang Tran, Morgan Tinkler, Gary Yeung, Abeer Alwan, Mari Ostendorf
Representation Based Meta-Learning for Few-Shot Spoken Intent Recognition
Ashish Mittal, Samarth Bharadwaj, Shreya Khare, Saneem Chemmengath, Karthik Sankaranarayanan, Brian Kingsbury
Complementary Language Model and Parallel Bi-LRNN for False Trigger Mitigation
Rishika Agarwal, Xiaochuan Niu, Pranay Dighe, Srikanth Vishnubhotla, Sameer Badaskar, Devang Naik
Speaker-Utterance Dual Attention for Speaker and Utterance Verification
Tianchi Liu, Rohan Kumar Das, Maulik Madhavi, Shengmei Shen, Haizhou Li
Adversarial Separation and Adaptation Network for Far-Field Speaker Verification
Lu Yi, Man-Wai Mak
MIRNet: Learning Multiple Identities Representations in Overlapped Speech
Hyewon Han, Soo-Whan Chung, Hong-Goo Kang
Strategies for End-to-End Text-Independent Speaker Verification
Weiwei Lin, Man-Wai Mak, Jen-Tzung Chien
Why Did the x-Vector System Miss a Target Speaker? Impact of Acoustic Mismatch Upon Target Score on VoxCeleb Data
Rosa González Hautamäki, Tomi Kinnunen
Variable Frame Rate-Based Data Augmentation to Handle Speaking-Style Variability for Automatic Speaker Verification
Amber Afshan, Jinxi Guo, Soo Jin Park, Vijay Ravi, Alan McCree, Abeer Alwan
A Machine of Few Words: Interactive Speaker Recognition with Reinforcement Learning
Mathieu Seurin, Florian Strub, Philippe Preux, Olivier Pietquin
Improving On-Device Speaker Verification Using Federated Learning with Privacy
Filip Granqvist, Matt Seigel, Rogier van Dalen, Áine Cahill, Stephen Shum, Matthias Paulik
Neural PLDA Modeling for End-to-End Speaker Verification
Shreyas Ramoji, Prashant Krishnan, Sriram Ganapathy
State Sequence Pooling Training of Acoustic Models for Keyword Spotting
Kuba Łopatka, Tobias Bocklet
Training Keyword Spotting Models on Non-IID Data with Federated Learning
Andrew Hard, Kurt Partridge, Cameron Nguyen, Niranjan Subrahmanya, Aishanee Shah, Pai Zhu, Ignacio Lopez Moreno, Rajiv Mathews
Class LM and Word Mapping for Contextual Biasing in End-to-End ASR
Rongqing Huang, Ossama Abdel-hamid, Xinwei Li, Gunnar Evermann
Do End-to-End Speech Recognition Models Care About Context?
Lasse Borgholt, Jakob D. Havtorn, Željko Agić, Anders Søgaard, Lars Maaløe, Christian Igel
Utterance Confidence Measure for End-to-End Speech Recognition with Applications to Distributed Speech Recognition Scenarios
Ankur Kumar, Sachin Singh, Dhananjaya Gowda, Abhinav Garg, Shatrughan Singh, Chanwoo Kim
Speaker Code Based Speaker Adaptive Training Using Model Agnostic Meta-Learning
Huaxin Wu, Genshun Wan, Jia Pan
Domain Adaptation Using Class Similarity for Robust Speech Recognition
Han Zhu, Jiangjiang Zhao, Yuling Ren, Li Wang, Pengyuan Zhang
Incremental Machine Speech Chain Towards Enabling Listening While Speaking in Real-Time
Sashi Novitasari, Andros Tjandra, Tomoya Yanagita, Sakriani Sakti, Satoshi Nakamura
Context-Dependent Acoustic Modeling Without Explicit Phone Clustering
Tina Raissi, Eugen Beck, Ralf Schlüter, Hermann Ney
Voice Conversion Based Data Augmentation to Improve Children’s Speech Recognition in Limited Data Scenario
S. Shahnawazuddin, Nagaraj Adiga, Kunal Kumar, Aayushi Poddar, Waquar Ahmad
CopyCat: Many-to-Many Fine-Grained Prosody Transfer for Neural Text-to-Speech
Sri Karlapati, Alexis Moinet, Arnaud Joly, Viacheslav Klimkov, Daniel Sáez-Trigueros, Thomas Drugman
Joint Detection of Sentence Stress and Phrase Boundary for Prosody
Binghuai Lin, Liyuan Wang, Xiaoli Feng, Jinsong Zhang
Transfer Learning of the Expressivity Using FLOW Metric Learning in Multispeaker Text-to-Speech Synthesis
Ajinkya Kulkarni, Vincent Colotte, Denis Jouvet
Speaking Speed Control of End-to-End Speech Synthesis Using Sentence-Level Conditioning
Jae-Sung Bae, Hanbin Bae, Young-Sun Joo, Junmo Lee, Gyeong-Hoon Lee, Hoon-Young Cho
Dynamic Prosody Generation for Speech Synthesis Using Linguistics-Driven Acoustic Embedding Selection
Shubhi Tyagi, Marco Nicolis, Jonas Rohnke, Thomas Drugman, Jaime Lorenzo-Trueba
Improving the Prosody of RNN-Based English Text-To-Speech Synthesis by Incorporating a BERT Model
Tom Kenter, Manish Sharma, Rob Clark
Improved Prosody from Learned F0 Codebook Representations for VQ-VAE Speech Waveform Reconstruction
Yi Zhao, Haoyu Li, Cheng-I Lai, Jennifer Williams, Erica Cooper, Junichi Yamagishi
Prosody Learning Mechanism for Speech Synthesis System Without Text Length Limit
Zhen Zeng, Jianzong Wang, Ning Cheng, Jing Xiao
Discriminative Method to Extract Coarse Prosodic Structure and its Application for Statistical Phrase/Accent Command Estimation
Yuma Shirahata, Daisuke Saito, Nobuaki Minematsu
Controllable Neural Text-to-Speech Synthesis Using Intuitive Prosodic Features
Tuomo Raitio, Ramya Rasipuram, Dan Castellani
Controllable Neural Prosody Synthesis
Max Morrison, Zeyu Jin, Justin Salamon, Nicholas J. Bryan, Gautham J. Mysore
Multi-Reference Neural TTS Stylization with Adversarial Cycle Consistency
Matt Whitehill, Shuang Ma, Daniel McDuff, Yale Song
Interactive Text-to-Speech System via Joint Style Analysis
Yang Gao, Weiyi Zheng, Zhaojun Yang, Thilo Köhler, Christian Fuegen, Qing He
Mobile-Assisted Prosody Training for Limited English Proficiency: Learner Background and Speech Learning Pattern
Kevin Hirschi, Okim Kang, Catia Cucchiarini, John H.L. Hansen, Keelan Evanini, Helmer Strik
Finding Intelligible Consonant-Vowel Sounds Using High-Quality Articulatory Synthesis
Daniel R. van Niekerk, Anqi Xu, Branislav Gerazov, Paul K. Krug, Peter Birkholz, Yi Xu
Audiovisual Correspondence Learning in Humans and Machines
Venkat Krishnamohan, Akshara Soman, Anshul Gupta, Sriram Ganapathy
Perception of English Fricatives and Affricates by Advanced Chinese Learners of English
Yizhou Lan
Perception of Japanese Consonant Length by Native Speakers of Korean Differing in Japanese Learning Experience
Kimiko Tsukada, Joo-Yeon Kim, Jeong-Im Han
Automatic Detection of Phonological Errors in Child Speech Using Siamese Recurrent Autoencoder
Si-Ioi Ng, Tan Lee
A Comparison of English Rhythm Produced by Native American Speakers and Mandarin ESL Primary School Learners
Hongwei Ding, Binghuai Lin, Liyuan Wang, Hui Wang, Ruomei Fang
Cross-Linguistic Interaction Between Phonological Categorization and Orthography Predicts Prosodic Effects in the Acquisition of Portuguese Liquids by L1-Mandarin Learners
Chao Zhou, Silke Hamann
Cross-Linguistic Perception of Utterances with Willingness and Reluctance in Mandarin by Korean L2 Learners
Wenqian Li, Jung-Yueh Tu
Speech Enhancement Based on Beamforming and Post-Filtering by Combining Phase Information
Rui Cheng, Changchun Bao
A Noise-Aware Memory-Attention Network Architecture for Regression-Based Speech Enhancement
Yu-Xuan Wang, Jun Du, Li Chai, Chin-Hui Lee, Jia Pan
HiFi-GAN: High-Fidelity Denoising and Dereverberation Based on Speech Deep Features in Adversarial Networks
Jiaqi Su, Zeyu Jin, Adam Finkelstein
Learning Complex Spectral Mapping for Speech Enhancement with Improved Cross-Corpus Generalization
Ashutosh Pandey, DeLiang Wang
Speech Enhancement with Stochastic Temporal Convolutional Networks
Julius Richter, Guillaume Carbajal, Timo Gerkmann
Visual Speech In Real Noisy Environments (VISION): A Novel Benchmark Dataset and Deep Learning-Based Baseline System
Mandar Gogate, Kia Dashtipour, Amir Hussain
Sparse Mixture of Local Experts for Efficient Speech Enhancement
Aswin Sivaraman, Minje Kim
Improved Speech Enhancement Using TCN with Multiple Encoder-Decoder Layers
Vinith Kishore, Nitya Tiwari, Periyasamy Paramasivam
Joint Training for Simultaneous Speech Denoising and Dereverberation with Deep Embedding Representations
Cunhang Fan, Jianhua Tao, Bin Liu, Jiangyan Yi, Zhengqi Wen
Unsupervised Robust Speech Enhancement Based on Alpha-Stable Fast Multichannel Nonnegative Matrix Factorization
Mathieu Fontaine, Kouhei Sekiguchi, Aditya Arie Nugraha, Kazuyoshi Yoshii
Squeeze for Sneeze: Compact Neural Networks for Cold and Flu Recognition
Merlin Albes, Zhao Ren, Björn W. Schuller, Nicholas Cummins
Extended Study on the Use of Vocal Tract Variables to Quantify Neuromotor Coordination in Depression
Nadee Seneviratne, James R. Williamson, Adam C. Lammert, Thomas F. Quatieri, Carol Espy-Wilson
Affective Conditioning on Hierarchical Attention Networks Applied to Depression Detection from Transcribed Clinical Interviews
Danai Xezonaki, Georgios Paraskevopoulos, Alexandros Potamianos, Shrikanth Narayanan
Domain Adaptation for Enhancing Speech-Based Depression Detection in Natural Environmental Conditions Using Dilated CNNs
Zhaocheng Huang, Julien Epps, Dale Joachim, Brian Stasak, James R. Williamson, Thomas F. Quatieri
Making a Distinction Between Schizophrenia and Bipolar Disorder Based on Temporal Parameters in Spontaneous Speech
Gábor Gosztolya, Anita Bagi, Szilvia Szalóki, István Szendi, Ildikó Hoffmann
Prediction of Sleepiness Ratings from Voice by Man and Machine
Mark Huckvale, András Beke, Mirei Ikushima
Tongue and Lip Motion Patterns in Alaryngeal Speech
Kristin J. Teplansky, Alan Wisler, Beiming Cao, Wendy Liang, Chad W. Whited, Ted Mau, Jun Wang
Autoencoder Bottleneck Features with Multi-Task Optimisation for Improved Continuous Dysarthric Speech Recognition
Zhengjun Yue, Heidi Christensen, Jon Barker
Raw Speech Waveform Based Classification of Patients with ALS, Parkinson’s Disease and Healthy Controls Using CNN-BLSTM
Jhansi Mallela, Aravind Illa, Yamini Belur, Nalini Atchayaram, Ravi Yadav, Pradeep Reddy, Dipanjan Gope, Prasanta Kumar Ghosh
Assessment of Parkinson’s Disease Medication State Through Automatic Speech Analysis
Anna Pompili, Rubén Solera-Ureña, Alberto Abad, Rita Cardoso, Isabel Guimarães, Margherita Fabbri, Isabel P. Martins, Joaquim Ferreira
Improving Replay Detection System with Channel Consistency DenseNeXt for the ASVspoof 2019 Challenge
Chao Zhang, Junjie Cheng, Yanmei Gu, Huacan Wang, Jun Ma, Shaojun Wang, Jing Xiao
Subjective Quality Evaluation of Speech Signals Transmitted via BPL-PLC Wired System
Przemyslaw Falkowski-Gilski, Grzegorz Debita, Marcin Habrych, Bogdan Miedzinski, Przemyslaw Jedlikowski, Bartosz Polnik, Jan Wandzio, Xin Wang
Investigating the Visual Lombard Effect with Gabor Based Features
Waito Chiu, Yan Xu, Andrew Abel, Chun Lin, Zhengzheng Tu
Exploration of Audio Quality Assessment and Anomaly Localisation Using Attention Models
Qiang Huang, Thomas Hain
Development of a Speech Quality Database Under Uncontrolled Conditions
Alessandro Ragano, Emmanouil Benetos, Andrew Hines
Evaluating the Reliability of Acoustic Speech Embeddings
Robin Algayres, Mohamed Salah Zaiem, Benoît Sagot, Emmanuel Dupoux
Frame-Level Signal-to-Noise Ratio Estimation Using Deep Learning
Hao Li, DeLiang Wang, Xueliang Zhang, Guanglai Gao
A Pyramid Recurrent Network for Predicting Crowdsourced Speech-Quality Ratings of Real-World Signals
Xuan Dong, Donald S. Williamson
Effect of Spectral Complexity Reduction and Number of Instruments on Musical Enjoyment with Cochlear Implants
Avamarie Brueggeman, John H.L. Hansen
Spectrum Correction: Acoustic Scene Classification with Mismatched Recording Devices
Michał Kośmider
Distributed Summation Privacy for Speech Enhancement
Matt O’Connor, W. Bastiaan Kleijn
Perception of Privacy Measured in the Crowd — Paired Comparison on the Effect of Background Noises
Anna Leschanowsky, Sneha Das, Tom Bäckström, Pablo Pérez Zarazaga
Hide and Speak: Towards Deep Neural Networks for Speech Steganography
Felix Kreuk, Yossi Adi, Bhiksha Raj, Rita Singh, Joseph Keshet
Detecting Adversarial Examples for Speech Recognition via Uncertainty Quantification
Sina Däubener, Lea Schönherr, Asja Fischer, Dorothea Kolossa
Privacy Guarantees for De-Identifying Text Transformations
David Ifeoluwa Adelani, Ali Davody, Thomas Kleinbauer, Dietrich Klakow
Detecting Audio Attacks on ASR Systems with Dropout Uncertainty
Tejas Jayashankar, Jonathan Le Roux, Pierre Moulin
Voice Transformer Network: Sequence-to-Sequence Voice Conversion Using Transformer with Text-to-Speech Pretraining
Wen-Chin Huang, Tomoki Hayashi, Yi-Chiao Wu, Hirokazu Kameoka, Tomoki Toda
Nonparallel Training of Exemplar-Based Voice Conversion System Using INCA-Based Alignment Technique
Hitoshi Suda, Gaku Kotani, Daisuke Saito
Enhancing Intelligibility of Dysarthric Speech Using Gated Convolutional-Based Voice Conversion System
Chen-Yu Chen, Wei-Zhong Zheng, Syu-Siang Wang, Yu Tsao, Pei-Chun Li, Ying-Hui Lai
VQVC+: One-Shot Voice Conversion by Vector Quantization and U-Net Architecture
Da-Yi Wu, Yen-Hao Chen, Hung-yi Lee
Cotatron: Transcription-Guided Speech Encoder for Any-to-Many Voice Conversion Without Parallel Data
Seung-won Park, Doo-young Kim, Myun-chul Joe
Dynamic Speaker Representations Adjustment and Decoder Factorization for Speaker Adaptation in End-to-End Speech Synthesis
Ruibo Fu, Jianhua Tao, Zhengqi Wen, Jiangyan Yi, Tao Wang, Chunyu Qiang
ARVC: An Auto-Regressive Voice Conversion System Without Parallel Training Data
Zheng Lian, Zhengqi Wen, Xinyong Zhou, Songbai Pu, Shengkai Zhang, Jianhua Tao
Improved Zero-Shot Voice Conversion Using Explicit Conditioning Signals
Shahan Nercessian
Non-Parallel Voice Conversion with Fewer Labeled Data by Conditional Generative Adversarial Networks
Minchuan Chen, Weijian Hou, Jun Ma, Shaojun Wang, Jing Xiao
Transferring Source Style in Non-Parallel Voice Conversion
Songxiang Liu, Yuewen Cao, Shiyin Kang, Na Hu, Xunying Liu, Dan Su, Dong Yu, Helen Meng
Voice Conversion Using Speech-to-Speech Neuro-Style Transfer
Ehab A. AlBadawy, Siwei Lyu
Improving Cross-Lingual Transfer Learning for End-to-End Speech Recognition with Speech Translation
Changhan Wang, Juan Pino, Jiatao Gu
Transliteration Based Data Augmentation for Training Multilingual ASR Acoustic Models in Low Resource Settings
Samuel Thomas, Kartik Audhkhasi, Brian Kingsbury
Multilingual Speech Recognition with Self-Attention Structured Parameterization
Yun Zhu, Parisa Haghani, Anshuman Tripathi, Bhuvana Ramabhadran, Brian Farris, Hainan Xu, Han Lu, Hasim Sak, Isabel Leal, Neeraj Gaur, Pedro J. Moreno, Qian Zhang
Lattice-Free Maximum Mutual Information Training of Multilingual Speech Recognition Systems
Srikanth Madikeri, Banriskhem K. Khonglah, Sibo Tong, Petr Motlicek, Hervé Bourlard, Daniel Povey
Massively Multilingual ASR: 50 Languages, 1 Model, 1 Billion Parameters
Vineel Pratap, Anuroop Sriram, Paden Tomasello, Awni Hannun, Vitaliy Liptchinsky, Gabriel Synnaeve, Ronan Collobert
Multilingual Speech Recognition Using Language-Specific Phoneme Recognition as Auxiliary Task for Indian Languages
Hardik B. Sailor, Thomas Hain
Style Variation as a Vantage Point for Code-Switching
Khyathi Raghavi Chandu, Alan W. Black
Bi-Encoder Transformer Network for Mandarin-English Code-Switching Speech Recognition Using Mixture of Experts
Yizhou Lu, Mingkun Huang, Hao Li, Jiaqi Guo, Yanmin Qian
Improving Low Resource Code-Switched ASR Using Augmented Code-Switched TTS
Yash Sharma, Basil Abraham, Karan Taneja, Preethi Jyothi
Towards Context-Aware End-to-End Code-Switching Speech Recognition
Zimeng Qiu, Yiyuan Li, Xinjian Li, Florian Metze, William M. Campbell
Increasing the Intelligibility and Naturalness of Alaryngeal Speech Using Voice Conversion and Synthetic Fundamental Frequency
Tuan Dinh, Alexander Kain, Robin Samlan, Beiming Cao, Jun Wang
Automatic Assessment of Dysarthric Severity Level Using Audio-Video Cross-Modal Approach in Deep Learning
Han Tong, Hamid Sharifzadeh, Ian McLoughlin
Staged Knowledge Distillation for End-to-End Dysarthric Speech Recognition and Speech Attribute Transcription
Yuqin Lin, Longbiao Wang, Sheng Li, Jianwu Dang, Chenchen Ding
Dysarthric Speech Recognition Based on Deep Metric Learning
Yuki Takashima, Ryoichi Takashima, Tetsuya Takiguchi, Yasuo Ariki
Automatic Glottis Detection and Segmentation in Stroboscopic Videos Using Convolutional Networks
Divya Degala, Achuth Rao M.V., Rahul Krishnamurthy, Pebbili Gopikishore, Veeramani Priyadharshini, Prakash T.K., Prasanta Kumar Ghosh
Acoustic Feature Extraction with Interpretable Deep Neural Network for Neurodegenerative Related Disorder Classification
Yilin Pan, Bahman Mirheidari, Zehai Tu, Ronan O’Malley, Traci Walker, Annalena Venneri, Markus Reuber, Daniel Blackburn, Heidi Christensen
Coswara — A Database of Breathing, Cough, and Voice Sounds for COVID-19 Diagnosis
Neeraj Sharma, Prashant Krishnan, Rohit Kumar, Shreyas Ramoji, Srikanth Raj Chetupalli, Nirmala R., Prasanta Kumar Ghosh, Sriram Ganapathy
Acoustic-Based Articulatory Phenotypes of Amyotrophic Lateral Sclerosis and Parkinson’s Disease: Towards an Interpretable, Hypothesis-Driven Framework of Motor Control
Hannah P. Rowe, Sarah E. Gutz, Marc F. Maffei, Jordan R. Green
Recognising Emotions in Dysarthric Speech Using Typical Speech Data
Lubna Alhinti, Stuart Cunningham, Heidi Christensen
Detecting and Analysing Spontaneous Oral Cancer Speech in the Wild
Bence Mark Halpern, Rob van Son, Michiel van den Brekel, Odette Scharenborg
The Zero Resource Speech Challenge 2020: Discovering Discrete Subword and Word Units
Ewan Dunbar, Julien Karadayi, Mathieu Bernard, Xuan-Nga Cao, Robin Algayres, Lucas Ondel, Laurent Besacier, Sakriani Sakti, Emmanuel Dupoux
Vector-Quantized Neural Networks for Acoustic Unit Discovery in the ZeroSpeech 2020 Challenge
Benjamin van Niekerk, Leanne Nortje, Herman Kamper
Exploration of End-to-End Synthesisers for Zero Resource Speech Challenge 2020
Karthik Pandia D.S., Anusha Prakash, Mano Ranjith Kumar M., Hema A. Murthy
Vector Quantized Temporally-Aware Correspondence Sparse Autoencoders for Zero-Resource Acoustic Unit Discovery
Batuhan Gundogdu, Bolaji Yusuf, Mansur Yesilbursa, Murat Saraclar
Transformer VQ-VAE for Unsupervised Unit Discovery and Speech Synthesis: ZeroSpeech 2020 Challenge
Andros Tjandra, Sakriani Sakti, Satoshi Nakamura
Exploring TTS Without T Using Biologically/Psychologically Motivated Neural Network Modules (ZeroSpeech 2020)
Takashi Morita, Hiroki Koda
Cyclic Spectral Modeling for Unsupervised Unit Discovery into Voice Conversion with Excitation and Waveform Modeling
Patrick Lumban Tobing, Tomoki Hayashi, Yi-Chiao Wu, Kazuhiro Kobayashi, Tomoki Toda
Unsupervised Acoustic Unit Representation Learning for Voice Conversion Using WaveNet Auto-Encoders
Mingjie Chen, Thomas Hain
Unsupervised Discovery of Recurring Speech Patterns Using Probabilistic Adaptive Metrics
Okko Räsänen, María Andrea Cruz Blandón
Self-Expressing Autoencoders for Unsupervised Spoken Term Discovery
Saurabhchand Bhati, Jesús Villalba, Piotr Żelasko, Najim Dehak
Perceptimatic: A Human Speech Perception Benchmark for Unsupervised Subword Modelling
Juliette Millet, Ewan Dunbar
Decoding Imagined, Heard, and Spoken Speech: Classification and Regression of EEG Using a 14-Channel Dry-Contact Mobile Headset
Jonathan Clayton, Scott Wellington, Cassia Valentini-Botinhao, Oliver Watts
Glottal Closure Instants Detection from EGG Signal by Classification Approach
Gurunath Reddy M., K. Sreenivasa Rao, Partha Pratim Das
Classify Imaginary Mandarin Tones with Cortical EEG Signals
Hua Li, Fei Chen
Augmenting Images for ASR and TTS Through Single-Loop and Dual-Loop Multimodal Chain Framework
Johanes Effendi, Andros Tjandra, Sakriani Sakti, Satoshi Nakamura
Punctuation Prediction in Spontaneous Conversations: Can We Mitigate ASR Errors with Retrofitted Word Embeddings?
Łukasz Augustyniak, Piotr Szymański, Mikołaj Morzy, Piotr Żelasko, Adrian Szymczak, Jan Mizgajski, Yishay Carmiel, Najim Dehak
Multimodal Semi-Supervised Learning Framework for Punctuation Prediction in Conversational Speech
Monica Sunkara, Srikanth Ronanki, Dhanush Bekal, Sravan Bodapati, Katrin Kirchhoff
Efficient MDI Adaptation for n-Gram Language Models
Ruizhe Huang, Ke Li, Ashish Arora, Daniel Povey, Sanjeev Khudanpur
Improving Tail Performance of a Deliberation E2E ASR Model Using a Large Text Corpus
Cal Peyser, Sepand Mavandadi, Tara N. Sainath, James Apfel, Ruoming Pang, Shankar Kumar
Language Model Data Augmentation Based on Text Domain Transfer
Atsunori Ogawa, Naohiro Tawara, Marc Delcroix
Contemporary Polish Language Model (Version 2) Using Big Data and Sub-Word Approach
Krzysztof Wołk
Improving Speech Recognition of Compound-Rich Languages
Prabhat Pandey, Volker Leutnant, Simon Wiesler, Jahn Heymann, Daniel Willett
Language Modeling for Speech Analytics in Under-Resourced Languages
Simone Wills, Pieter Uys, Charl van Heerden, Etienne Barnard
An Early Study on Intelligent Analysis of Speech Under COVID-19: Severity, Sleep Quality, Fatigue, and Anxiety
Jing Han, Kun Qian, Meishu Song, Zijiang Yang, Zhao Ren, Shuo Liu, Juan Liu, Huaiyuan Zheng, Wei Ji, Tomoya Koike, Xiao Li, Zixing Zhang, Yoshiharu Yamamoto, Björn W. Schuller
An Evaluation of the Effect of Anxiety on Speech — Computational Prediction of Anxiety from Sustained Vowels
Alice Baird, Nicholas Cummins, Sebastian Schnieder, Jarek Krajewski, Björn W. Schuller
Hybrid Network Feature Extraction for Depression Assessment from Speech
Ziping Zhao, Qifei Li, Nicholas Cummins, Bin Liu, Haishuai Wang, Jianhua Tao, Björn W. Schuller
Improving Detection of Alzheimer’s Disease Using Automatic Speech Recognition to Identify High-Quality Segments for More Robust Feature Extraction
Yilin Pan, Bahman Mirheidari, Markus Reuber, Annalena Venneri, Daniel Blackburn, Heidi Christensen
Classification of Manifest Huntington Disease Using Vowel Distortion Measures
Amrit Romana, John Bandon, Noelle Carlozzi, Angela Roberts, Emily Mower Provost
Parkinson’s Disease Detection from Speech Using Single Frequency Filtering Cepstral Coefficients
Sudarsana Reddy Kadiri, Rashmi Kethireddy, Paavo Alku
Automatic Prediction of Speech Intelligibility Based on X-Vectors in the Context of Head and Neck Cancer
Sebastião Quintas, Julie Mauclair, Virginie Woisard, Julien Pinquier
Spectral Moment and Duration of Burst of Plosives in Speech of Children with Hearing Impairment and Typically Developing Children — A Comparative Study
Ajish K. Abraham, M. Pushpavathi, N. Sreedevi, A. Navya, C.M. Vikram, S.R. Mahadeva Prasanna
Aphasic Speech Recognition Using a Mixture of Speech Intelligibility Experts
Matthew Perez, Zakaria Aldeneh, Emily Mower Provost
Automatic Discrimination of Apraxia of Speech and Dysarthria Using a Minimalistic Set of Handcrafted Features
Ina Kodrasi, Michaela Pernon, Marina Laganaro, Hervé Bourlard
Weak-Attention Suppression for Transformer Based Speech Recognition
Yangyang Shi, Yongqiang Wang, Chunyang Wu, Christian Fuegen, Frank Zhang, Duc Le, Ching-Feng Yeh, Michael L. Seltzer
Conv-Transformer Transducer: Low Latency, Low Frame Rate, Streamable End-to-End Speech Recognition
Wenyong Huang, Wenchao Hu, Yu Ting Yeung, Xiao Chen
Improving Transformer-Based Speech Recognition with Unsupervised Pre-Training and Multi-Task Semantic Knowledge Learning
Song Li, Lin Li, Qingyang Hong, Lingling Liu
Transformer-Based Long-Context End-to-End Speech Recognition
Takaaki Hori, Niko Moritz, Chiori Hori, Jonathan Le Roux
Self-and-Mixed Attention Decoder with Deep Acoustic Structure for Transformer-Based LVCSR
Xinyuan Zhou, Grandee Lee, Emre Yılmaz, Yanhua Long, Jiaen Liang, Haizhou Li
Universal Speech Transformer
Yingzhu Zhao, Chongjia Ni, Cheung-Chi Leung, Shafiq Joty, Eng Siong Chng, Bin Ma
Spike-Triggered Non-Autoregressive Transformer for End-to-End Speech Recognition
Zhengkun Tian, Jiangyan Yi, Jianhua Tao, Ye Bai, Shuai Zhang, Zhengqi Wen
Cross Attention with Monotonic Alignment for Speech Transformer
Yingzhu Zhao, Chongjia Ni, Cheung-Chi Leung, Shafiq Joty, Eng Siong Chng, Bin Ma
Conformer: Convolution-augmented Transformer for Speech Recognition
Anmol Gulati, James Qin, Chung-Cheng Chiu, Niki Parmar, Yu Zhang, Jiahui Yu, Wei Han, Shibo Wang, Zhengdong Zhang, Yonghui Wu, Ruoming Pang
Exploring Transformers for Large-Scale Speech Recognition
Liang Lu, Changliang Liu, Jinyu Li, Yifan Gong
Sparseness-Aware DOA Estimation with Majorization Minimization
Masahito Togami, Robin Scheibler
Spatial Resolution of Early Reflection for Speech and White Noise
Xiaoli Zhong, Hao Song, Xuejie Liu
Effect of Microphone Position Measurement Error on RIR and its Impact on Speech Intelligibility and Quality
Aditya Raikar, Karan Nathwani, Ashish Panda, Sunil Kumar Kopparapu
Online Blind Reverberation Time Estimation Using CRNNs
Shuwen Deng, Wolfgang Mack, Emanuël A.P. Habets
Single-Channel Blind Direct-to-Reverberation Ratio Estimation Using Masking
Wolfgang Mack, Shuwen Deng, Emanuël A.P. Habets
The Importance of Time-Frequency Averaging for Binaural Speaker Localization in Reverberant Environments
Hanan Beit-On, Vladimir Tourbabin, Boaz Rafaely
Acoustic Signal Enhancement Using Relative Harmonic Coefficients: Spherical Harmonics Domain Approach
Yonggang Hu, Prasanga N. Samarasinghe, Thushara D. Abhayapala
Instantaneous Time Delay Estimation of Broadband Signals
B.H.V.S. Narayana Murthy, J.V. Satyanarayana, Nivedita Chennupati, B. Yegnanarayana
U-Net Based Direct-Path Dominance Test for Robust Direction-of-Arrival Estimation
Hao Wang, Kai Chen, Jing Lu
Sound Event Localization and Detection Based on Multiple DOA Beamforming and Multi-Task Learning
Wei Xue, Ying Tong, Chao Zhang, Guohong Ding, Xiaodong He, Bowen Zhou