doi: 10.21437/Interspeech.2022
ISSN: 2958-1796
SANE-TTS: Stable And Natural End-to-End Multilingual Text-to-Speech
Hyunjae Cho, Wonbin Jung, Junhyeok Lee, Sang Hoon Woo
Enhancement of Pitch Controllability using Timbre-Preserving Pitch Augmentation in FastPitch
Hanbin Bae, Young-Sun Joo
Speaking Rate Control of end-to-end TTS Models by Direct Manipulation of the Encoder's Output Embeddings
Martin Lenglet, Olivier Perrotin, Gérard Bailly
TriniTTS: Pitch-controllable End-to-end TTS without External Aligner
Yooncheol Ju, Ilhwan Kim, Hongsun Yang, Ji-Hoon Kim, Byeongyeol Kim, Soumi Maiti, Shinji Watanabe
JETS: Jointly Training FastSpeech2 and HiFi-GAN for End to End Text to Speech
Dan Lim, Sunghee Jung, Eesung Kim
Interpretable dysarthric speaker adaptation based on optimal-transport
Rosanna Turrisi, Leonardo Badino
Dysarthric Speech Recognition From Raw Waveform with Parametric CNNs
Zhengjun Yue, Erfan Loweimi, Heidi Christensen, Jon Barker, Zoran Cvetkovic
The Effectiveness of Time Stretching for Enhancing Dysarthric Speech for Improved Dysarthric Speech Recognition
Luke Prananta, Bence Halpern, Siyuan Feng, Odette Scharenborg
Investigating Self-supervised Pretraining Frameworks for Pathological Speech Recognition
Lester Phillip Violeta, Wen Chin Huang, Tomoki Toda
Improved ASR Performance for Dysarthric Speech Using Two-stage Data Augmentation
Chitralekha Bhat, Ashish Panda, Helmer Strik
Cross-lingual Self-Supervised Speech Representations for Improved Dysarthric Speech Recognition
Abner Hernandez, Paula Andrea Pérez-Toro, Elmar Noeth, Juan Rafael Orozco-Arroyave, Andreas Maier, Seung Hee Yang
Regularizing Transformer-based Acoustic Models by Penalizing Attention Weights
Munhak Lee, Joon-Hyuk Chang, Sang-Eon Lee, Ju-Seok Seong, Chanhee Park, Haeyoung Kwon
Content-Context Factorized Representations for Automated Speech Recognition
David Chan, Shalini Ghosh
Comparison and Analysis of New Curriculum Criteria for End-to-End ASR
Georgios Karakasidis, Tamás Grósz, Mikko Kurimo
Incremental learning for RNN-Transducer based speech recognition models
Deepak Baby, Pasquale D'Alterio, Valentin Mendelev
Production federated keyword spotting via distillation, filtering, and joint federated-centralized training
Andrew Hard, Kurt Partridge, Neng Chen, Sean Augenstein, Aishanee Shah, Hyun Jin Park, Alex Park, Sara Ng, Jessica Nguyen, Ignacio Lopez-Moreno, Rajiv Mathews, Francoise Beaufays
Use of prosodic and lexical cues for disambiguating wh-words in Korean
Jieun Song, Hae-Sung Jeon, Jieun Kiaer
Autoencoder-Based Tongue Shape Estimation During Continuous Speech
Vinicius Ribeiro, Yves Laprie
Phonetic erosion and information structure in function words: the case of mia
Giuseppe Magistro, Claudia Crocco
Dynamic Vertical Larynx Actions Under Prosodic Focus
Miran Oh, Yoonjeong Lee
Fundamental Frequency Variability over Time in Telephone Interactions
Leah Bradshaw, Eleanor Chodroff, Lena Jäger, Volker Dellwo
SHAS: Approaching optimal Segmentation for End-to-End Speech Translation
Ioannis Tsiamas, Gerard I. Gállego, José A. R. Fonollosa, Marta R. Costa-jussà
M-Adapter: Modality Adaptation for End-to-End Speech-to-Text Translation
Jinming Zhao, Hao Yang, Gholamreza Haffari, Ehsan Shareghi
Cross-Modal Decision Regularization for Simultaneous Speech Translation
Mohd Abbas Zaidi, Beomseok Lee, Sangha Kim, Chanwoo Kim
Speech Segmentation Optimization using Segmented Bilingual Speech Corpus for End-to-end Speech Translation
Ryo Fukuda, Katsuhito Sudoh, Satoshi Nakamura
Generalized Keyword Spotting using ASR embeddings
Kirandevraj R, Vinod Kumar Kurmi, Vinay Namboodiri, C V Jawahar
Multi-Corpus Speech Emotion Recognition for Unseen Corpus Using Corpus-Wise Weights in Classification Loss
Youngdo Ahn, Sung Joo Lee, Jong Won Shin
Improving Speech Emotion Recognition Through Focus and Calibration Attention Mechanisms
Junghun Kim, Yoojin An, Jihie Kim
The Emotion is Not One-hot Encoding: Learning with Grayscale Label for Emotion Recognition in Conversation
Joosung Lee
Probing speech emotion recognition transformers for linguistic knowledge
Andreas Triantafyllopoulos, Johannes Wagner, Hagen Wierstorf, Maximilian Schmitt, Uwe Reichel, Florian Eyben, Felix Burkhardt, Björn W. Schuller
End-To-End Label Uncertainty Modeling for Speech-based Arousal Recognition Using Bayesian Neural Networks
Navin Raj Prabhu, Guillaume Carbajal, Nale Lehmann-Willenbrock, Timo Gerkmann
Mind the gap: On the value of silence representations to lexical-based speech emotion recognition
Matthew Perez, Mimansa Jaiswal, Minxue Niu, Cristina Gorrostieta, Matthew Roddy, Kye Taylor, Reza Lotfian, John Kane, Emily Mower Provost
Exploiting Co-occurrence Frequency of Emotions in Perceptual Evaluations To Train A Speech Emotion Classifier
Huang-Cheng Chou, Chi-Chun Lee, Carlos Busso
Positional Encoding for Capturing Modality Specific Cadence for Emotion Detection
Hira Dhamyal, Bhiksha Raj, Rita Singh
Speak Like a Professional: Increasing Speech Intelligibility by Mimicking Professional Announcer Voice with Voice Conversion
Tuan Vu Ho, Maori Kobayashi, Masato Akagi
Vector-quantized Variational Autoencoder for Phase-aware Speech Enhancement
Tuan Vu Ho, Quoc Huy Nguyen, Masato Akagi, Masashi Unoki
iDeepMMSE: An improved deep learning approach to MMSE speech and noise power spectrum estimation for speech enhancement
Minseung Kim, Hyungchan Song, Sein Cheong, Jong Won Shin
Boosting Self-Supervised Embeddings for Speech Enhancement
Kuo-Hsuan Hung, Szu-wei Fu, Huan-Hsin Tseng, Hsin-Tien Chiang, Yu Tsao, Chii-Wann Lin
Monoaural Speech Enhancement Using a Nested U-Net with Two-Level Skip Connections
Seorim Hwang, Sung Wook Park, Youngcheol Park
CycleGAN-based Unpaired Speech Dereverberation
Hannah Muckenhirn, Aleksandr Safin, Hakan Erdogan, Felix de Chaumont Quitry, Marco Tagliasacchi, Scott Wisdom, John R. Hershey
Attentive Training: A New Training Framework for Talker-independent Speaker Extraction
Ashutosh Pandey, DeLiang Wang
Improved Modulation-Domain Loss for Neural-Network-based Speech Enhancement
Tyler Vuong, Richard Stern
Perceptual Characteristics Based Multi-objective Model for Speech Enhancement
Chiang-Jen Peng, Yun-Ju Chan, Yih-Liang Shen, Cheng Yu, Yu Tsao, Tai-Shih Chi
Listen only to me! How well can target speech extraction handle false alarms?
Marc Delcroix, Keisuke Kinoshita, Tsubasa Ochiai, Katerina Zmolikova, Hiroshi Sato, Tomohiro Nakatani
Monaural Speech Enhancement Based on Spectrogram Decomposition for Convolutional Neural Network-sensitive Feature Extraction
Hao Shi, Longbiao Wang, Sheng Li, Jianwu Dang, Tatsuya Kawahara
Neural Network-augmented Kalman Filtering for Robust Online Speech Dereverberation in Noisy Reverberant Environments
Jean-Marie Lemercier, Joachim Thiemann, Raphael Koning, Timo Gerkmann
PodcastMix: A dataset for separating music and speech in podcasts
Nicolás Schmidt, Jordi Pons, Marius Miron
Independence-based Joint Dereverberation and Separation with Neural Source Model
Kohei Saijo, Robin Scheibler
Spatial Loss for Unsupervised Multi-channel Source Separation
Kohei Saijo, Robin Scheibler
Effect of Head Orientation on Speech Directivity
Samuel Bellows, Timothy W. Leishman
Unsupervised Training of Sequential Neural Beamformer Using Coarsely-separated and Non-separated Signals
Kohei Saijo, Tetsuji Ogawa
Blind Language Separation: Disentangling Multilingual Cocktail Party Voices by Language
Marvin Borsdorf, Kevin Scheck, Haizhou Li, Tanja Schultz
NTF of Spectral and Spatial Features for Tracking and Separation of Moving Sound Sources in Spherical Harmonic Domain
Mateusz Guzik, Konrad Kowalczyk
Modelling Turn-taking in Multispeaker Parties for Realistic Data Simulation
Jack Deadman, Jon Barker
An Initialization Scheme for Meeting Separation with Spatial Mixture Models
Christoph Boeddeker, Tobias Cord-Landwehr, Thilo von Neumann, Reinhold Haeb-Umbach
Prototypical speaker-interference loss for target voice separation using non-parallel audio samples
Seongkyu Mun, Dhananjaya Gowda, Jihwan Lee, Changwoo Han, Dokyun Lee, Chanwoo Kim
Reliability criterion based on learning-phase entropy for speaker recognition with neural network
Pierre-Michel Bousquet, Mickael Rouvier, Jean-Francois Bonastre
Attentive Feature Fusion for Robust Speaker Verification
Bei Liu, Zhengyang Chen, Yanmin Qian
Dual Path Embedding Learning for Speaker Verification with Triplet Attention
Bei Liu, Zhengyang Chen, Yanmin Qian
DF-ResNet: Boosting Speaker Verification Performance with Depth-First Design
Bei Liu, Zhengyang Chen, Shuai Wang, Haoyu Wang, Bing Han, Yanmin Qian
Adaptive Rectangle Loss for Speaker Verification
Li Ruida, Fang Shuo, Ma Chenguang, Li Liang
MFA-Conformer: Multi-scale Feature Aggregation Conformer for Automatic Speaker Verification
Yang Zhang, Zhiqiang Lv, Haibin Wu, Shanshan Zhang, Pengfei Hu, Zhiyong Wu, Hung-yi Lee, Helen Meng
Enroll-Aware Attentive Statistics Pooling for Target Speaker Verification
Leying Zhang, Zhengyang Chen, Yanmin Qian
Transport-Oriented Feature Aggregation for Speaker Embedding Learning
Yusheng Tian, Jingyu Li, Tan Lee
Multi-Frequency Information Enhanced Channel Attention Module for Speaker Representation Learning
Mufan Sang, John H.L. Hansen
CS-CTCSCONV1D: Small footprint speaker verification with channel split time-channel-time separable 1-dimensional convolution
Linjun Cai, Yuhong Yang, Xufeng Chen, Weiping Tu, Hongyang Chen
Reliable Visualization for Deep Speaker Recognition
Pengqi Li, Lantian Li, Askar Hamdulla, Dong Wang
Unifying Cosine and PLDA Back-ends for Speaker Verification
Zhiyuan Peng, Xuanji He, Ke Ding, Tan Lee, Guanglu Wan
CTFALite: Lightweight Channel-specific Temporal and Frequency Attention Mechanism for Enhancing the Speaker Embedding Extractor
Yuheng Wei, Junzhao Du, Hui Liu, Qian Wang
SpeechFormer: A Hierarchical Efficient Framework Incorporating the Characteristics of Speech
Weidong Chen, Xiaofen Xing, Xiangmin Xu, Jianxin Pang, Lan Du
VoiceLab: Software for Fully Reproducible Automated Voice Analysis
David Feinberg
TRILLsson: Distilled Universal Paralinguistic Speech Representations
Joel Shor, Subhashini Venugopalan
Global Signal-to-noise Ratio Estimation Based on Multi-subband Processing Using Convolutional Neural Network
Nan Li, Meng Ge, Longbiao Wang, Masashi Unoki, Sheng Li, Jianwu Dang
A Sparsity-promoting Dictionary Model for Variational Autoencoders
Mostafa Sadeghi, Paul Magron
Deep Transductive Transfer Regression Network for Cross-Corpus Speech Emotion Recognition
Yan Zhao, Jincen Wang, Ru Ye, Yuan Zong, Wenming Zheng, Li Zhao
Audio Anti-spoofing Using Simple Attention Module and Joint Optimization Based on Additive Angular Margin Loss and Meta-learning
John H.L. Hansen, Zhenyu Wang
PEAF: Learnable Power Efficient Analog Acoustic Features for Audio Recognition
Boris Bergsma, Minhao Yang, Milos Cernak
Hybrid Handcrafted and Learnable Audio Representation for Analysis of Speech Under Cognitive and Physical Load
Gasser Elbanna, Alice Biryukov, Neil Scheidwasser-Clow, Lara Orlandic, Pablo Mainar, Mikolaj Kegler, Pierre Beckmann, Milos Cernak
Generative Data Augmentation Guided by Triplet Loss for Speech Emotion Recognition
Shijun Wang, Hamed Hemati, Jón Guðnason, Damian Borth
Learning neural audio features without supervision
Sarthak Yadav, Neil Zeghidour
Densely-connected Convolutional Recurrent Network for Fundamental Frequency Estimation in Noisy Speech
Yixuan Zhang, Heming Wang, DeLiang Wang
Predicting label distribution improves non-intrusive speech quality estimation
Abu Zaher Md Faridee, Hannes Gamper
Deep versus Wide: An Analysis of Student Architectures for Task-Agnostic Knowledge Distillation of Self-Supervised Speech Models
Takanori Ashihara, Takafumi Moriya, Kohei Matsuura, Tomohiro Tanaka
Dataset Pruning for Resource-constrained Spoofed Audio Detection
Abdul Hameed Azeemi, Ihsan Ayyub Qazi, Agha Ali Raza
EdiTTS: Score-based Editing for Controllable Text-to-Speech
Jaesung Tae, Hyeongju Kim, Taesu Kim
Improving Mandarin Prosodic Structure Prediction with Multi-level Contextual Information
Jie Chen, Changhe Song, Deyi Tuo, Xixin Wu, Shiyin Kang, Zhiyong Wu, Helen Meng
SpeechPainter: Text-conditioned Speech Inpainting
Zalan Borsos, Matthew Sharifi, Marco Tagliasacchi
A polyphone BERT for Polyphone Disambiguation in Mandarin Chinese
Song Zhang, Ken Zheng, Xiaoxu Zhu, Baoxiang Li
Neural Lexicon Reader: Reduce Pronunciation Errors in End-to-end TTS by Leveraging External Textual Knowledge
Mutian He, Jingzhou Yang, Lei He, Frank Soong
ByT5 model for massively multilingual grapheme-to-phoneme conversion
Jian Zhu, Cong Zhang, David Jurgens
DocLayoutTTS: Dataset and Baselines for Layout-informed Document-level Neural Speech Synthesis
Puneet Mathur, Franck Dernoncourt, Quan Hung Tran, Jiuxiang Gu, Ani Nenkova, Vlad Morariu, Rajiv Jain, Dinesh Manocha
Mixed-Phoneme BERT: Improving BERT with Mixed Phoneme and Sup-Phoneme Representations for Text to Speech
Guangyan Zhang, Kaitao Song, Xu Tan, Daxin Tan, Yuzi Yan, Yanqing Liu, Gang Wang, Wei Zhou, Tao Qin, Tan Lee, Sheng Zhao
Unsupervised Text-to-Speech Synthesis by Unsupervised Automatic Speech Recognition
Junrui Ni, Liming Wang, Heting Gao, Kaizhi Qian, Yang Zhang, Shiyu Chang, Mark Hasegawa-Johnson
An Efficient and High Fidelity Vietnamese Streaming End-to-End Speech Synthesis
Tho Nguyen Duc Tran, The Chuong Chu, Vu Hoang, Trung Huu Bui, Hung Quoc Truong
Predicting pairwise preferences between TTS audio stimuli using parallel ratings data and anti-symmetric twin neural networks
Cassia Valentini-Botinhao, Manuel Sam Ribeiro, Oliver Watts, Korin Richmond, Gustav Eje Henter
An Automatic Soundtracking System for Text-to-Speech Audiobooks
Zikai Chen, Lin Wu, Junjie Pan, Xiang Yin
Environment Aware Text-to-Speech Synthesis
Daxin Tan, Guangyan Zhang, Tan Lee
SoundChoice: Grapheme-to-Phoneme Models with Semantic Disambiguation
Artem Ploujnikov, Mirco Ravanelli
Shallow Fusion of Weighted Finite-State Transducer and Language Model for Text Normalization
Evelina Bakhturina, Yang Zhang, Boris Ginsburg
Prosodic alignment for off-screen automatic dubbing
Yogesh Virkar, Marcello Federico, Robert Enyedi, Roberto Barra-Chicote
A Study of Modeling Rising Intonation in Cantonese Neural Speech Synthesis
Qibing Bai, Tom Ko, Yu Zhang
CAUSE: Crossmodal Action Unit Sequence Estimation from Speech
Hirokazu Kameoka, Takuhiro Kaneko, Shogo Seki, Kou Tanaka
Visualising Model Training via Vowel Space for Text-To-Speech Systems
Binu Nisal Abeysinghe, Jesin James, Catherine Watson, Felix Marattukalam
Binary Early-Exit Network for Adaptive Inference on Low-Resource Devices
Aaqib Saeed
Streaming Speaker-Attributed ASR with Token-Level Speaker Embeddings
Naoyuki Kanda, Jian Wu, Yu Wu, Xiong Xiao, Zhong Meng, Xiaofei Wang, Yashesh Gaur, Zhuo Chen, Jinyu Li, Takuya Yoshioka
Speaker consistency loss and step-wise optimization for semi-supervised joint training of TTS and ASR using unpaired text data
Naoki Makishima, Satoshi Suzuki, Atsushi Ando, Ryo Masumura
Audio-Visual Generalized Few-Shot Learning with Prototype-Based Co-Adaptation
Yi-Kai Zhang, Da-Wei Zhou, Han-Jia Ye, De-Chuan Zhan
Federated Domain Adaptation for ASR with Full Self-Supervision
Junteng Jia, Jay Mahadeokar, Weiyi Zheng, Yuan Shangguan, Ozlem Kalinli, Frank Seide
Augmented Adversarial Self-Supervised Learning for Early-Stage Alzheimer's Speech Detection
Longfei Yang, Wenqing Wei, Sheng Li, Jiyi Li, Takahiro Shinozaki
Extending RNN-T-based speech recognition systems with emotion and language classification
Zvi Kons, Hagai Aronowitz, Edmilson Morais, Matheus Damasceno, Hong-Kwang Kuo, Samuel Thomas, George Saon
Thutmose Tagger: Single-pass neural model for Inverse Text Normalization
Alexandra Antonova, Evelina Bakhturina, Boris Ginsburg
Leveraging Prosody for Punctuation Prediction of Spontaneous Speech
Yeonjin Cho, Sara Ng, Trang Tran, Mari Ostendorf
A Comparative Study on Speaker-attributed Automatic Speech Recognition in Multi-party Meetings
Fan Yu, Zhihao Du, Shiliang Zhang, Yuxiao Lin, Lei Xie
TMGAN-PLC: Audio Packet Loss Concealment using Temporal Memory Generative Adversarial Network
Yuansheng Guan, Guochen Yu, Andong Li, Chengshi Zheng, Jie Wang
Real-Time Packet Loss Concealment With Mixed Generative and Predictive Model
Jean-Marc Valin, Ahmed Mustafa, Christopher Montgomery, Timothy B. Terriberry, Michael Klingbeil, Paris Smaragdis, Arvindh Krishnaswamy
PLCNet: Real-time Packet Loss Concealment with Semi-supervised Generative Adversarial Network
Baiyun Liu, Qi Song, Mingxue Yang, Wuwen Yuan, Tianbao Wang
INTERSPEECH 2022 Audio Deep Packet Loss Concealment Challenge
Lorenz Diener, Sten Sootla, Solomiya Branets, Ando Saabas, Robert Aichner, Ross Cutler
End-to-End Multi-Loss Training for Low Delay Packet Loss Concealment
Nan Li, Xiguang Zheng, Chen Zhang, Liang Guo, Bing Yu
Extended U-Net for Speaker Verification in Noisy Environments
Ju-Ho Kim, Jungwoo Heo, Hye-jin Shim, Ha-Jin Yu
Domain Agnostic Few-shot Learning for Speaker Verification
Seunghan Yang, Debasmit Das, Janghoon Cho, Hyoungwoo Park, Sungrack Yun
Scoring of Large-Margin Embeddings for Speaker Verification: Cosine or PLDA?
Qiongqiong Wang, Kong Aik Lee, Tianchi Liu
Training speaker embedding extractors using multi-speaker audio with unknown speaker boundaries
Themos Stafylakis, Ladislav Mosner, Oldrich Plchot, Johan Rohdin, Anna Silnova, Lukas Burget, Jan Černocký
Investigating the contribution of speaker attributes to speaker separability using disentangled speaker representations
Chau Luu, Steve Renals, Peter Bell
Joint domain adaptation and speech bandwidth extension using time-domain GANs for speaker verification
Saurabh Kataria, Jesús Villalba, Laureano Moro-Velázquez, Najim Dehak
Variability in Production of Non-Sibilant Fricative [ç] in /hi/
Tsukasa Yoshinaga, Kikuo Maekawa, Akiyoshi Iida
Streaming model for Acoustic to Articulatory Inversion with transformer networks
Sathvik Udupa, Aravind Illa, Prasanta Ghosh
Trajectories predicted by optimal speech motor control using LSTM networks
Tsiky Rakotomalala, Pierre Baraduc, Pascal Perrier
Exploration strategies for articulatory synthesis of complex syllable onsets
Daniel Van Niekerk, Anqi Xu, Branislav Gerazov, Paul Konstantin Krug, Peter Birkholz, Yi Xu
Linguistic versus biological factors governing acoustic voice variation
Yoonjeong Lee, Jody Kreiman
Acquisition of allophonic variation in second language speech: An acoustic and articulatory study of English laterals by Japanese speakers
Takayuki Nagamine
SAQAM: Spatial Audio Quality Assessment Metric
Pranay Manocha, Anurag Kumar, Buye Xu, Anjali Menon, Israel Degene Gebru, Vamsi Krishna Ithapu, Paul Calamia
Speech Quality Assessment through MOS using Non-Matching References
Pranay Manocha, Anurag Kumar
An objective test tool for pitch extractors' response attributes
Hideki Kawahara, Kohei Yatabe, Ken-Ichi Sakakibara, Tatsuya Kitamura, Hideki Banno, Masanori Morise
Data Augmentation Using McAdams-Coefficient-Based Speaker Anonymization for Fake Audio Detection
Kai Li, Sheng Li, Xugang Lu, Masato Akagi, Meng Liu, Lin Zhang, Chang Zeng, Longbiao Wang, Jianwu Dang, Masashi Unoki
Automatic Data Augmentation Selection and Parametrization in Contrastive Self-Supervised Speech Representation Learning
Salah Zaiem, Titouan Parcollet, Slim Essid
Transformer-based quality assessment model for generalized user-generated multimedia audio content
Deebha Mumtaz, Ajit Jena, Vinit Jakhetiya, Karan Nathwani, Sharath Chandra Guntuku
Space-Efficient Representation of Entity-centric Query Language Models
Christophe Van Gysel, Mirko Hannemann, Ernest Pusateri, Youssef Oualil, Ilya Oparin
Domain Prompts: Towards memory and compute efficient domain adaptation of ASR systems
Saket Dingliwal, Ashish Shenoy, Sravan Bodapati, Ankur Gandhe, Ravi Teja Gadde, Katrin Kirchhoff
Sentence-Select: Large-Scale Language Model Data Selection for Rare-Word Speech Recognition
W. Ronny Huang, Cal Peyser, Tara Sainath, Ruoming Pang, Trevor D. Strohman, Shankar Kumar
UserLibri: A Dataset for ASR Personalization Using Only Text
Theresa Breiner, Swaroop Ramaswamy, Ehsan Variani, Shefali Garg, Rajiv Mathews, Khe Chai Sim, Kilol Gupta, Mingqing Chen, Lara McConnaughey
A BERT-based Language Modeling Framework
Chin-Yueh Chien, Kuan-Yu Chen
Joint Optimization of Sampling Rate Offsets Based on Entire Signal Relationship Among Distributed Microphones
Yoshiki Masuyama, Kouei Yamaoka, Nobutaka Ono
Challenges and Opportunities in Multi-device Speech Processing
Gregory Ciccarelli, Jarred Barber, Arun Nair, Israel Cohen, Tao Zhang
Practical Over-the-air Perceptual Acoustic Watermarking
Ameya Agaskar
Clustering-based Wake Word Detection in Privacy-aware Acoustic Sensor Networks
Timm Koppelmann, Luca Becker, Alexandru Nelus, Rene Glitza, Lea Schönherr, Rainer Martin
Relative Acoustic Features for Distance Estimation in Smart-Homes
Francesco Nespoli, Daniel Barreda, Patrick Naylor
Time-domain Ad-hoc Array Speech Enhancement Using a Triple-path Network
Ashutosh Pandey, Buye Xu, Anurag Kumar, Jacob Donley, Paul Calamia, DeLiang Wang
Relationship between the acoustic time intervals and tongue movements of German diphthongs
Arne-Lukas Fietkau, Simon Stone, Peter Birkholz
Development of allophonic realization until adolescence: A production study of the affricate-fricative variation of /z/ among Japanese children
Sanae Matsui, Kyoji Iwamoto, Reiko Mazuka
Recurrent multi-head attention fusion network for combining audio and text for speech emotion recognition
Chung-Soo Ahn, Chamara Kasun, Sunil Sivadas, Jagath Rajapakse
Low-Level Physiological Implications of End-to-End Learning for Speech Recognition
Louise Coppieters de Gibson, Philip N. Garner
Idiosyncratic lingual articulation of American English /æ/ and /ɑ/ using network analysis
Carolina Lins Machado, Volker Dellwo, Lei He
Method for improving the word intelligibility of presented speech using bone-conduction headphones
Teruki Toya, Wenyu Zhu, Maori Kobayashi, Kenichi Nakamura, Masashi Unoki
Three-dimensional finite-difference time-domain acoustic analysis of simplified vocal tract shapes
Debasish Mohapatra, Mario Fleischer, Victor Zappi, Peter Birkholz, Sidney Fels
Speech imitation skills predict automatic phonetic convergence: a GMM-UBM study on L2
Dorina de Jong, Aldo Pastore, Noël Nguyen, Alessandro D'Ausilio
Self-supervised speech unit discovery from articulatory and acoustic features using VQ-VAE
Marc-Antoine Georges, Jean-Luc Schwartz, Thomas Hueber
Deep Speech Synthesis from Articulatory Representations
Peter Wu, Shinji Watanabe, Louis Goldstein, Alan W Black, Gopala Krishna Anumanchipalli
Orofacial somatosensory inputs in speech perceptual training modulate speech production
Monica Ashokumar, Jean-Luc Schwartz, Takayuki Ito
Transfer Learning Framework for Low-Resource Text-to-Speech using a Large-Scale Unlabeled Speech Corpus
Minchan Kim, Myeonghun Jeong, Byoung Jin Choi, Sunghwan Ahn, Joun Yeop Lee, Nam Soo Kim
DRSpeech: Degradation-Robust Text-to-Speech Synthesis with Frame-Level and Utterance-Level Acoustic Representation Learning
Takaaki Saeki, Kentaro Tachibana, Ryuichi Yamamoto
MSR-NV: Neural Vocoder Using Multiple Sampling Rates
Kentaro Mitsui, Kei Sawada
SpecGrad: Diffusion Probabilistic Model based Neural Vocoder with Adaptive Noise Spectral Shaping
Yuma Koizumi, Heiga Zen, Kohei Yatabe, Nanxin Chen, Michiel Bacchiani
Bunched LPCNet2: Efficient Neural Vocoders Covering Devices from Cloud to Edge
Sangjun Park, Kihyun Choo, Joohyung Lee, Anton V. Porov, Konstantin Osipov, June Sig Sung
Hierarchical and Multi-Scale Variational Autoencoder for Diverse and Natural Non-Autoregressive Text-to-Speech
Jaesung Bae, Jinhyeok Yang, Taejun Bak, Young-Sun Joo
End-to-end LPCNet: A Neural Vocoder With Fully-Differentiable LPC Estimation
Krishna Subramani, Jean-Marc Valin, Umut Isik, Paris Smaragdis, Arvindh Krishnaswamy
EPIC TTS Models: Empirical Pruning Investigations Characterizing Text-To-Speech Models
Perry Lam, Huayun Zhang, Nancy Chen, Berrak Sisman
Fine-grained Noise Control for Multispeaker Speech Synthesis
Karolos Nikitaras, Georgios Vamvoukakis, Nikolaos Ellinas, Konstantinos Klapsas, Konstantinos Markopoulos, Spyros Raptis, June Sig Sung, Gunu Jho, Aimilios Chalamandaris, Pirros Tsiakoulis
WavThruVec: Latent speech representation as intermediate features for neural speech synthesis
Hubert Siuzdak, Piotr Dura, Pol van Rijn, Nori Jacoby
Fast Grad-TTS: Towards Efficient Diffusion-Based Speech Generation on CPU
Ivan Vovk, Tasnima Sadekova, Vladimir Gogoryan, Vadim Popov, Mikhail Kudinov, Jiansheng Wei
Simple and Effective Unsupervised Speech Synthesis
Alexander H. Liu, Cheng-I Lai, Wei-Ning Hsu, Michael Auli, Alexei Baevski, James Glass
Unified Source-Filter GAN with Harmonic-plus-Noise Source Excitation Generation
Reo Yoneyama, Yi-Chiao Wu, Tomoki Toda
NeMo Open Source Speaker Diarization System
Tae Jin Park, Nithin Rao Koluguri, Fei Jia, Jagadeesh Balam, Boris Ginsburg
Voice2Alliance: Automatic Speaker Diarization and Quality Assurance of Conversational Alignment
Baihan Lin
VAgyojaka: An Annotating and Post-Editing Tool for Automatic Speech Recognition
Rishabh Kumar, Devaraja Adiga, Mayank Kothari, Jatin Dalal, Ganesh Ramakrishnan, Preethi Jyothi
SKYE: More than a conversational AI
Alzahra Badi, Chungho Park, Minseok Keum, Miguel Alba, Youngsuk Ryu, Jeongmin Bae
Training Data Generation with DOA-based Selecting and Remixing for Unsupervised Training of Deep Separation Models
Hokuto Munakata, Ryu Takeda, Kazunori Komatani
Beam-Guided TasNet: An Iterative Speech Separation Framework with Multi-Channel Output
Hangting Chen, Yi Yang, Feng Dang, Pengyuan Zhang
Joint Estimation of Direction-of-Arrival and Distance for Arrays with Directional Sensors based on Sparse Bayesian Learning
Feifei Xiong, Pengyu Wang, Zhongfu Ye, Jinwei Feng
How to Listen? Rethinking Visual Sound Localization
Ho-Hsiang Wu, Magdalena Fuentes, Prem Seetharaman, Juan Pablo Bello
Small Footprint Neural Networks for Acoustic Direction of Arrival Estimation
Zhiheng Ouyang, Miao Wang, Wei-Ping Zhu
Multi-Modal Multi-Correlation Learning for Audio-Visual Speech Separation
Xiaoyu Wang, Xiangyu Kong, Xiulian Peng, Yan Lu
MIMO-DoAnet: Multi-channel Input and Multiple Outputs DoA Network with Unknown Number of Sound Sources
Haoran Yin, Meng Ge, Yanjie Fu, Gaoyan Zhang, Longbiao Wang, Lei Zhang, Lin Qiu, Jianwu Dang
Iterative Sound Source Localization for Unknown Number of Sources
Yanjie Fu, Meng Ge, Haoran Yin, Xinyuan Qian, Longbiao Wang, Gaoyan Zhang, Jianwu Dang
Distance-Based Sound Separation
Katharine Patterson, Kevin Wilson, Scott Wisdom, John R. Hershey
VCSE: Time-Domain Visual-Contextual Speaker Extraction Network
Junjie Li, Meng Ge, Zexu Pan, Longbiao Wang, Jianwu Dang
TRUNet: Transformer-Recurrent-U Network for Multi-channel Reverberant Sound Source Separation
Ali Aroudi, Stefan Uhlich, Marc Ferras Font
PercepNet+: A Phase and SNR Aware PercepNet for Real-Time Speech Enhancement
Xiaofeng Ge, Jiangyu Han, Yanhua Long, Haixin Guan
Lightweight Full-band and Sub-band Fusion Network for Real Time Speech Enhancement
Zhuangqi Chen, Pingjian Zhang
Cross-Layer Similarity Knowledge Distillation for Speech Enhancement
Jiaming Cheng, Ruiyu Liang, Yue Xie, Li Zhao, Björn Schuller, Jie Jia, Yiyuan Peng
Spectro-Temporal SubNet for Real-Time Monaural Speech Denoising and Dereverberation
Feifei Xiong, Weiguang Chen, Pengyu Wang, Xiaofei Li, Jinwei Feng
CMGAN: Conformer-based Metric GAN for Speech Enhancement
Ruizhe Cao, Sherif Abdulatif, Bin Yang
Model Compression by Iterative Pruning with Knowledge Distillation and Its Application to Speech Enhancement
Zeyuan Wei, Li Hao, Xueliang Zhang
Single-channel speech enhancement using Graph Fourier Transform
Chenhui Zhang, Xiang Pan
Joint Optimization of the Module and Sign of the Spectral Real Part Based on CRN for Speech Denoising
Zilu Guo, Xu Xu, Zhongfu Ye
Attentive Recurrent Network for Low-Latency Active Noise Control
Hao Zhang, Ashutosh Pandey, DeLiang Wang
Memory-Efficient Multi-Step Speech Enhancement with Neural ODE
Jen-Hung Huang, Chung-Hsien Wu
GLD-Net: Improving Monaural Speech Enhancement by Learning Global and Local Dependency Features with GLD Block
Xinmeng Xu, Yang Wang, Jie Jia, Binbin Chen, Jianjun Hao
Improving Visual Speech Enhancement Network by Learning Audio-visual Affinity with Multi-head Attention
Xinmeng Xu, Yang Wang, Jie Jia, Binbin Chen, Dejun Li
Speech Enhancement with Fullband-Subband Cross-Attention Network
Jun Chen, Wei Rao, Zilin Wang, Zhiyong Wu, Yannan Wang, Tao Yu, Shidong Shang, Helen Meng
OSSEM: one-shot speaker adaptive speech enhancement using meta learning
Cheng Yu, Szu-wei Fu, Tsun-An Hsieh, Yu Tsao, Mirco Ravanelli
Efficient Speech Enhancement with Neural Homomorphic Synthesis
Wenbin Jiang, Tao Liu, Kai Yu
Fast Real-time Personalized Speech Enhancement: End-to-End Enhancement Network (E3Net) and Knowledge Distillation
Manthan Thakker, Sefik Emre Eskimez, Takuya Yoshioka, Huaming Wang
Strategies to Improve Robustness of Target Speech Extraction to Enrollment Variations
Hiroshi Sato, Tsubasa Ochiai, Marc Delcroix, Keisuke Kinoshita, Takafumi Moriya, Naoki Makishima, Mana Ihori, Tomohiro Tanaka, Ryo Masumura
FedNST: Federated Noisy Student Training for Automatic Speech Recognition
Haaris Mehmood, Agnieszka Dobrowolska, Karthikeyan Saravanan, Mete Ozay
SCaLa: Supervised Contrastive Learning for End-to-End Speech Recognition
Li Fu, Xiaoxiao Li, Runyu Wang, Lu Fan, Zhengchen Zhang, Meng Chen, Youzheng Wu, Xiaodong He
NAS-SCAE: Searching Compact Attention-based Encoders For End-to-end Automatic Speech Recognition
Yukun Liu, Ta Li, Pengyuan Zhang, Yonghong Yan
Leveraging Acoustic Contextual Representation by Audio-textual Cross-modal Learning for Conversational ASR
Kun Wei, Yike Zhang, Sining Sun, Lei Xie, Long Ma
PM-MMUT: Boosted Phone-mask Data Augmentation using Multi-Modeling Unit Training for Phonetic-Reduction-Robust E2E Speech Recognition
Guodong Ma, Pengfei Hu, Nurmemet Yolwas, Shen Huang, Hao Huang
Analysis of Self-Attention Head Diversity for Conformer-based Automatic Speech Recognition
Kartik Audhkhasi, Yinghui Huang, Bhuvana Ramabhadran, Pedro J. Moreno
Improving Rare Word Recognition with LM-aware MWER Training
Wang Weiran, Tongzhou Chen, Tara Sainath, Ehsan Variani, Rohit Prabhavalkar, W. Ronny Huang, Bhuvana Ramabhadran, Neeraj Gaur, Sepand Mavandadi, Cal Peyser, Trevor Strohman, Yanzhang He, David Rybach
Improving the Training Recipe for a Robust Conformer-based Hybrid Model
Mohammad Zeineldeen, Jingjing Xu, Christoph Lüscher, Ralf Schlüter, Hermann Ney
CTC Variations Through New WFST Topologies
Aleksandr Laptev, Somshubra Majumdar, Boris Ginsburg
Dealing with Unknowns in Continual Learning for End-to-end Automatic Speech Recognition
Martin Sustek, Samik Sadhu, Hynek Hermansky
Towards Efficiently Learning Monotonic Alignments for Attention-based End-to-End Speech Recognition
Chenfeng Miao, Kun Zou, Ziyang Zhuang, Tao Wei, Jun Ma, Shaojun Wang, Jing Xiao
On monoaural speech enhancement for automatic recognition of real noisy speech using mixture invariant training
Jisi Zhang, Catalin Zorila, Rama Doddipatla, Jon Barker
From Undercomplete to Sparse Overcomplete Autoencoders to Improve LF-MMI based Speech Recognition
Selen Hande Kabil, Herve Bourlard
Domain Adversarial Self-Supervised Speech Representation Learning for Improving Unknown Domain Downstream Tasks
Tomohiro Tanaka, Ryo Masumura, Hiroshi Sato, Mana Ihori, Kohei Matsuura, Takanori Ashihara, Takafumi Moriya
Attention Weight Smoothing Using Prior Distributions for Transformer-Based End-to-End ASR
Takashi Maekaku, Yuya Fujita, Yifan Peng, Shinji Watanabe
Reducing Offensive Replies in Open Domain Dialogue Systems
Naokazu Uchida, Takeshi Homma, Makoto Iwayama, Yasuhiro Sogawa
Induce Spoken Dialog Intents via Deep Unsupervised Context Contrastive Clustering
Ting-Wei Wu, Biing Juang
Dialogue Acts Aided Important Utterance Detection Based on Multiparty and Multimodal Information
Fumio Nihei, Ryo Ishii, Yukiko Nakano, Kyosuke Nishida, Ryo Masumura, Atsushi Fukayama, Takao Nakamura
Contextual Acoustic Barge-In Classification for Spoken Dialog Systems
Dhanush Bekal, Sundararajan Srinivasan, Srikanth Ronanki, Sravan Bodapati, Katrin Kirchhoff
Calibrate and Refine! A Novel and Agile Framework for ASR Error Robust Intent Detection
Peilin Zhou, Dading Chong, Helin Wang, Qingcheng Zeng
ASR-Robust Natural Language Understanding on ASR-GLUE dataset
Lingyun Feng, Jianwei Yu, Yan Wang, Songxiang Liu, Deng Cai, Haitao Zheng
From Disfluency Detection to Intent Detection and Slot Filling
Mai Hoang Dao, Thinh Truong, Dat Quoc Nguyen
Audio-Visual Wake Word Spotting in MISP2021 Challenge: Dataset Release and Deep Analysis
Hengshun Zhou, Jun Du, Gongzhen Zou, Zhaoxu Nian, Chin-Hui Lee, Sabato Marco Siniscalchi, Shinji Watanabe, Odette Scharenborg, Jingdong Chen, Shifu Xiong, Jian-Qing Gao
Extending Compositional Attention Networks for Social Reasoning in Videos
Christina Sartzetaki, Georgios Paraskevopoulos, Alexandros Potamianos
TopicKS: Topic-driven Knowledge Selection for Knowledge-grounded Dialogue Generation
Shiquan Wang, Yuke Si, Xiao Wei, Longbiao Wang, Zhiqiang Zhuang, Xiaowang Zhang, Jianwu Dang
Bottom-up discovery of structure and variation in response tokens (‘backchannels’) across diverse languages
Andreas Liesenfeld, Mark Dingemanse
Cross-modal Transfer Learning via Multi-grained Alignment for End-to-End Spoken Language Understanding
Yi Zhu, Zexun Wang, Hang Liu, Peiying Wang, Mingchao Feng, Meng Chen, Xiaodong He
Use of Nods Less Synchronized with Turn-Taking and Prosody During Conversations in Adults with Autism
Keiko Ochi, Nobutaka Ono, Keiho Owada, Kuroda Miho, Shigeki Sagayama, Hidenori Yamasue
Analysis of Self-Supervised Learning and Dimensionality Reduction Methods in Clustering-Based Active Learning for Speech Emotion Recognition
Einari Vaaras, Manu Airaksinen, Okko Räsänen
Emotion-Shift Aware CRF for Decoding Emotion Sequence in Conversation
Chun-Yu Chen, Yun-Shao Lin, Chi-Chun Lee
Vaccinating SER to Neutralize Adversarial Attacks with Self-Supervised Augmentation Strategy
Bo-Hao Su, Chi-Chun Lee
Speech Emotion Recognition in the Wild using Multi-task and Adversarial Learning
Jack Parry, Eric DeMattos, Anita Klementiev, Axel Ind, Daniela Morse-Kopp, Georgia Clarke, Dimitri Palaz
The Magnitude and Phase based Speech Representation Learning using Autoencoder for Classifying Speech Emotions using Deep Canonical Correlation Analysis
Ashishkumar Gudmalwar, Biplove Basel, Anirban Dutta, Ch V Rama Rao
Improving Speech Emotion Recognition Using Self-Supervised Learning with Domain-Specific Audiovisual Tasks
Lucas Goncalves, Carlos Busso
SNRi Target Training for Joint Speech Enhancement and Recognition
Yuma Koizumi, Shigeki Karita, Arun Narayanan, Sankaran Panchapagesan, Michiel Bacchiani
Deep Self-Supervised Learning of Speech Denoising from Noisy Speeches
Yutaro Sanada, Takumi Nakagawa, Yuichiro Wada, Kosaku Takanashi, Yuhui Zhang, Kiichi Tokuyama, Takafumi Kanamori, Tomonori Yamada
NASTAR: Noise Adaptive Speech Enhancement with Target-Conditional Resampling
Chi-Chang Lee, Cheng-Hung Hu, Yu-Chen Lin, Chu-Song Chen, Hsin-Min Wang, Yu Tsao
FFC-SE: Fast Fourier Convolution for Speech Enhancement
Ivan Shchekotov, Pavel K. Andreev, Oleg Ivanov, Aibek Alanov, Dmitry Vetrov
A Systematic Comparison of Phonetic Aware Techniques for Speech Enhancement
Or Tal, Moshe Mandel, Felix Kreuk, Yossi Adi
Multi-View Attention Transfer for Efficient Speech Enhancement
Wooseok Shin, Hyun Joon Park, Jin Sob Kim, Byung Hoon Lee, Sung Won Han
SATTS: Speaker Attractor Text to Speech, Learning to Speak by Learning to Separate
Nabarun Goswami, Tatsuya Harada
Correcting Mispronunciations in Speech using Spectrogram Inpainting
Talia Ben Simon, Felix Kreuk, Faten Awwad, Jacob T. Cohen, Joseph Keshet
Speech Audio Corrector: using speech from non-target speakers for one-off correction of mispronunciations in grapheme-input text-to-speech
Jason Fong, Daniel Lyth, Gustav Eje Henter, Hao Tang, Simon King
End-to-End Binaural Speech Synthesis
Wen Chin Huang, Dejan Markovic, Alexander Richard, Israel Dejene Gebru, Anjali Menon
PoeticTTS - Controllable Poetry Reading for Literary Studies
Julia Koch, Florian Lux, Nadja Schauffler, Toni Bernhart, Felix Dieterle, Jonas Kuhn, Sandra Richter, Gabriel Viehhauser, Ngoc Thang Vu
Articulatory Synthesis for Data Augmentation in Phoneme Recognition
Paul Konstantin Krug, Peter Birkholz, Branislav Gerazov, Daniel Rudolph van Niekerk, Anqi Xu, Yi Xu
SF-DST: Few-Shot Self-Feeding Reading Comprehension Dialogue State Tracking with Auxiliary Task
Jihyun Lee, Gary Geunbae Lee
Benchmarking Transformers-based models on French Spoken Language Understanding tasks
Oralie Cattan, Sahar Ghannay, Christophe Servan, Sophie Rosset
mcBERT: Momentum Contrastive Learning with BERT for Zero-Shot Slot Filling
Seong-Hwan Heo, WonKee Lee, Jong-Hyeok Lee
Bottleneck Low-rank Transformers for Low-resource Spoken Language Understanding
Pu Wang, Hugo Van hamme
On joint training with interfaces for spoken language understanding
Anirudh Raju, Milind Rao, Gautam Tiwari, Pranav Dheram, Bryan Anderson, Zhe Zhang, Chul Lee, Bach Bui, Ariya Rastrow
Device-Directed Speech Detection: Regularization via Distillation for Weakly-Supervised Models
Vineet Garg, Ognjen Rudovic, Pranay Dighe, Ahmed Hussen Abdelaziz, Erik Marchi, Saurabh Adya, Chandra Dhir, Ahmed Tewfik
Building African Voices
Perez Ogayo, Graham Neubig, Alan W Black
Toward Fairness in Speech Recognition: Discovery and mitigation of performance disparities
Pranav Dheram, Murugesan Ramakrishnan, Anirudh Raju, I-Fan Chen, Brian King, Katherine Powell, Melissa Saboowala, Karan Shetty, Andreas Stolcke
Training and typological bias in ASR performance for world Englishes
May Pik Yu Chan, June Choe, Aini Li, Yiran Chen, Xin Gao, Nicole Holliday
A Study of Gender Impact in Self-supervised Models for Speech-to-Text Systems
Marcely Zanon Boito, Laurent Besacier, Natalia Tomashenko, Yannick Estève
Automatic Dialect Density Estimation for African American English
Alexander Johnson, Kevin Everson, Vijay Ravi, Anissa Gladney, Mari Ostendorf, Abeer Alwan
Improving Language Identification of Accented Speech
Kunnar Kukk, Tanel Alumäe
Design Guidelines for Inclusive Speaker Verification Evaluation Datasets
Wiebke Toussaint, Lauriane Gorce, Aaron Yi Ding
Reducing Geographic Disparities in Automatic Speech Recognition via Elastic Weight Consolidation
Viet Anh Trinh, Pegah Ghahremani, Brian King, Jasha Droppo, Andreas Stolcke, Roland Maas
Gradual Improvements Observed in Learners' Perception and Production of L2 Sounds Through Continuing Shadowing Practices on a Daily Basis
Takuya Kunihara, Chuanbo Zhu, Nobuaki Minematsu, Noriko Nakanishi
Spoofed speech from the perspective of a forensic phonetician
Christin Kirchhübel, Georgina Brown
Investigating Prosodic Variation in British English Varieties using ProPer
Hae-Sung Jeon, Stephen Nichols
Perceived prominence and downstep in Japanese
Hyun Kyung Hwang, Manami Hirayama, Takaomi Kato
The discrimination of [zi]-[dʑi] by Japanese listeners and the prospective phonologization of /zi/
Andrea Alicehajic, Silke Hamann
Glottal inverse filtering based on articulatory synthesis and deep learning
Ingo Langheinrich, Simon Stone, Xinyu Zhang, Peter Birkholz
Investigating phonetic convergence of laughter in conversation
Bogdan Ludusan, Marin Schröer, Petra Wagner
Telling self-defining memories: An acoustic study of natural emotional speech productions
Veronique Delvaux, Audrey Lavallée, Fanny Degouis, Xavier Saloppe, Jean-Louis Nandrino, Thierry Pham
Voicing neutralization in Romanian fricatives across different speech styles
Laura Spinu, Ioana Vasilescu, Lori Lamel, Jason Lilley
Nasal Coda Loss in the Chengdu Dialect of Mandarin: Evidence from RT-MRI
Sishi Liao, Phil Hoole, Conceição Cunha, Esther Kunay, Aletheia Cui, Lia Saki Bučar Shigemori, Felicitas Kleber, Dirk Voit, Jens Frahm, Jonathan Harrington
ema2wav: doing articulation by Praat
Philipp Buech, Simon Roessig, Lena Pagel, Doris Muecke, Anne Hermes
Improving Phonetic Transcriptions of Children’s Speech by Pronunciation Modelling with Constrained CTC-Decoding
Lars Rumberg, Christopher Gebauer, Hanna Ehlert, Ulrike Lüdtke, Jörn Ostermann
Leveraging Simultaneous Translation for Enhancing Transcription of Low-resource Language via Cross Attention Mechanism
Kak Soky, Sheng Li, Masato Mimura, Chenhui Chu, Tatsuya Kawahara
KSC2: An Industrial-Scale Open-Source Kazakh Speech Corpus
Saida Mussakhojayeva, Yerbolat Khassanov, Huseyin Atakan Varol
Knowledge of accent differences can be used to predict speech recognition
Tuende Szalay, Mostafa Shahin, Beena Ahmed, Kirrie Ballard
Lombard Effect for Bilingual Speakers in Cantonese and English: importance of spectro-temporal features
Maximilian Karl Scharf, Sabine Hochmuth, Lena L.N. Wong, Birger Kollmeier, Anna Warzybok
End-to-end speech recognition modeling from de-identified data
Martin Flechl, Shou-Chun Yin, Junho Park, Peter Skala
Multi-Task End-to-End Model for Telugu Dialect and Speech Recognition
Aditya Yadavalli, Ganesh Mirishkar, Anil Kumar Vuppala
DEFORMER: Coupling Deformed Localized Patterns with Global Context for Robust End-to-end Speech Recognition
Jiamin Xie, John H.L. Hansen
Keyword Spotting with Synthetic Data using Heterogeneous Knowledge Distillation
Yuna Lee, Seung Jun Baek
Probing phoneme, language and speaker information in unsupervised speech representations
Maureen de Seyssel, Marvin Lavechin, Yossi Adi, Emmanuel Dupoux, Guillaume Wisniewski
Automatic Detection of Reactive Attachment Disorder Through Turn-Taking Analysis in Clinical Child-Caregiver Sessions
Andrei Bîrlădeanu, Helen Minnis, Alessandro Vinciarelli
Automatic Pronunciation Assessment using Self-Supervised Speech Representation Learning
Eesung Kim, Jae-Jin Jeon, Hyeji Seo, Hoon Kim
Exploring Few-Shot Fine-Tuning Strategies for Models of Visually Grounded Speech
Tyler Miller, David Harwath
Pseudo Label Is Better Than Human Label
Dongseong Hwang, Khe Chai Sim, Zhouyuan Huo, Trevor Strohman
A Temporal Extension of Latent Dirichlet Allocation for Unsupervised Acoustic Unit Discovery
Werner van der Merwe, Herman Kamper, Johan Adam du Preez
PRISM: Pre-trained Indeterminate Speaker Representation Model for Speaker Diarization and Speaker Verification
Siqi Zheng, Hongbin Suo, Qian Chen
Cross-Age Speaker Verification: Learning Age-Invariant Speaker Embeddings
Xiaoyi Qin, Na Li, Weng Chao, Dan Su, Ming Li
Online Target Speaker Voice Activity Detection for Speaker Diarization
Weiqing Wang, Ming Li, Qingjian Lin
Probabilistic Spherical Discriminant Analysis: An Alternative to PLDA for length-normalized embeddings
Niko Brummer, Albert Swart, Ladislav Mosner, Anna Silnova, Oldrich Plchot, Themos Stafylakis, Lukas Burget
Deep speaker embedding with frame-constrained training strategy for speaker verification
Bin Gu
Interrelate Training and Searching: A Unified Online Clustering Framework for Speaker Diarization
Yifan Chen, Yifan Guo, Qingxuan Li, Gaofeng Cheng, Pengyuan Zhang, Yonghong Yan
End-to-End Audio-Visual Neural Speaker Diarization
Mao-Kui He, Jun Du, Chin-Hui Lee
Online Speaker Diarization with Core Samples Selection
Yanyan Yue, Jun Du, Mao-Kui He, YuTing Yeung, Renyu Wang
Robust End-to-end Speaker Diarization with Generic Neural Clustering
Chenyu Yang, Yu Wang
MSDWild: Multi-modal Speaker Diarization Dataset in the Wild
Tao Liu, Shuai Fan, Xu Xiang, Hongbo Song, Shaoxiong Lin, Jiaqi Sun, Tianyuan Han, Siyuan Chen, Binwei Yao, Sen Liu, Yifei Wu, Yanmin Qian, Kai Yu
Unsupervised Speaker Diarization that is Agnostic to Language, Overlap-Aware, and Tuning Free
Md Iftekhar Tanveer, Diego Casabuena, Jussi Karlgren, Rosie Jones
Utterance-by-utterance overlap-aware neural diarization with Graph-PIT
Keisuke Kinoshita, Thilo von Neumann, Marc Delcroix, Christoph Boeddeker, Reinhold Haeb-Umbach
Spatial-aware Speaker Diarization for Multi-channel Multi-party Meeting
Jie Wang, Yuji Liu, Binling Wang, Yiming Zhi, Song Li, Shipeng Xia, Jiayang Zhang, Feng Tong, Lin Li, Qingyang Hong
Selective Pseudo-labeling and Class-wise Discriminative Fusion for Sound Event Detection
Yunhao Liang, Yanhua Long, Yijie Li, Jiaen Liang
An End-to-End Macaque Voiceprint Verification Method Based on Channel Fusion Mechanism
Peng Liu, Songbin Li, Jigang Tang
Human Sound Classification based on Feature Fusion Method with Air and Bone Conducted Signal
Liang Xu, Jing Wang, Lizhong Wang, Sijun Bi, Jianqian Zhang, Qiuyue Ma
RaDur: A Reference-aware and Duration-robust Network for Target Sound Detection
Dongchao Yang, Helin Wang, Zhongjie Ye, Yuexian Zou, Wenwu Wang
Temporal Self Attention-Based Residual Network for Environmental Sound Classification
Achyut Tripathi, Konark Paul
Audio Tagging Done Right: 2nd comparison of deep learning methods for environmental sound classification
Juncheng Li, Shuhui Qu, Po-Yao Huang, Florian Metze
Improving Target Sound Extraction with Timestamp Information
Helin Wang, Dongchao Yang, Chao Weng, Jianwei Yu, Yuexian Zou
A Multi-grained based Attention Network for Semi-supervised Sound Event Detection
Ying Hu, Xiujuan Zhu, Yunlong Li, Hao Huang, Liang He
Temporal coding with magnitude-phase regularization for sound event detection
Sangwook Park, Sandeep Reddy Kothinti, Mounya Elhilali
RCT: Random consistency training for semi-supervised sound event detection
Nian Shao, Erfan Loweimi, Xiaofei Li
Audio Pyramid Transformer with Domain Adaption for Weakly Supervised Sound Event Detection and Audio Classification
Yifei Xin, Dongchao Yang, Yuexian Zou
Active Few-Shot Learning for Sound Event Detection
Yu Wang, Mark Cartwright, Juan Pablo Bello
Uncertainty Calibration for Deep Audio Classifiers
Tong Ye, Shijing Si, Jianzong Wang, Ning Cheng, Jing Xiao
Event-related data conditioning for acoustic event classification
Yuanbo Hou, Dick Botteldooren
A Multi-Scale Time-Frequency Spectrogram Discriminator for GAN-based Non-Autoregressive TTS
Haohan Guo, Hui Lu, Xixin Wu, Helen Meng
RetrieverTTS: Modeling Decomposed Factors for Text-Based Speech Insertion
Dacheng Yin, Chuanxin Tang, Yanqing Liu, Xiaoqiang Wang, Zhiyuan Zhao, Yucheng Zhao, Zhiwei Xiong, Sheng Zhao, Chong Luo
FlowVocoder: A small Footprint Neural Vocoder based Normalizing Flow for Speech Synthesis
Manh Luong, Viet Anh Tran
DelightfulTTS 2: End-to-End Speech Synthesis with Adversarial Vector-Quantized Auto-Encoders
Yanqing Liu, Ruiqing Xue, Lei He, Xu Tan, Sheng Zhao
AdaVocoder: Adaptive Vocoder for Custom Voice
Xin Yuan, Robin Feng, Mingming Ye, Cheng Tuo, Minhang Zhang
RefineGAN: Universally Generating Waveform Better than Ground Truth with Highly Accurate Pitch and Intensity Responses
Shengyuan Xu, Wenxiao Zhao, Jing Guo
VQTTS: High-Fidelity Text-to-Speech Synthesis with Self-Supervised VQ Acoustic Feature
Chenpeng Du, Yiwei Guo, Xie Chen, Kai Yu
Improving GAN-based vocoder for fast and high-quality speech synthesis
He Mengnan, Tingwei Guo, Zhenxing Lu, Zhang Ruixiong, Gong Caixia
SoftSpeech: Unsupervised Duration Model in FastSpeech 2
Yuan-Hao Yi, Lei He, Shifeng Pan, Xi Wang, Yuchao Zhang
A Multi-Stage Multi-Codebook VQ-VAE Approach to High-Performance Neural TTS
Haohan Guo, Feng-Long Xie, Frank Soong, Xixin Wu, Helen Meng
SiD-WaveFlow: A Low-Resource Vocoder Independent of Prior Knowledge
Yuhan Li, Ying Shen, Dongqing Wang, Lin Zhang
Text-to-speech synthesis using spectral modeling based on non-negative autoencoder
Takeru Gorai, Daisuke Saito, Nobuaki Minematsu
Joint Modeling of Multi-Sample and Subband Signals for Fast Neural Vocoding on CPU
Hiroki Kanagawa, Yusuke Ijima, Hiroyuki Toda
MISRNet: Lightweight Neural Vocoder Using Multi-Input Single Shared Residual Blocks
Takuhiro Kaneko, Hirokazu Kameoka, Kou Tanaka, Shogo Seki
A compact transformer-based GAN vocoder
Chenfeng Miao, Ting Chen, Minchuan Chen, Jun Ma, Shaojun Wang, Jing Xiao
Diffusion Generative Vocoder for Fullband Speech Synthesis Based on Weak Third-order SDE Solver
Hideyuki Tachibana, Muneyoshi Inahara, Mocho Go, Yotaro Katayama, Yotaro Watanabe
On Adaptive Weight Interpolation of the Hybrid Autoregressive Transducer
Ehsan Variani, Michael Riley, David Rybach, Cyril Allauzen, Tongzhou Chen, Bhuvana Ramabhadran
Learning to rank with BERT-based confidence models in ASR rescoring
Ting-Wei Wu, I-Fan Chen, Ankur Gandhe
VQ-T: RNN Transducers using Vector-Quantized Prediction Network States
Jiatong Shi, George Saon, David Haws, Shinji Watanabe, Brian Kingsbury
WeNet 2.0: More Productive End-to-End Speech Recognition Toolkit
Binbin Zhang, Di Wu, Zhendong Peng, Xingchen Song, Zhuoyuan Yao, Hang Lv, Lei Xie, Chao Yang, Fuping Pan, Jianwei Niu
Internal Language Model Estimation Through Explicit Context Vector Learning for Attention-based Encoder-decoder ASR
Yufei Liu, Rao Ma, Haihua Xu, Yi He, Zejun Ma, Weibin Zhang
Improving Streaming End-to-End ASR on Transformer-based Causal Models with Encoder States Revision Strategies
Zehan Li, Haoran Miao, Keqi Deng, Gaofeng Cheng, Sanli Tian, Ta Li, Yonghong Yan
Parameter-Efficient Conformers via Sharing Sparsely-Gated Experts for End-to-End Speech Recognition
Ye Bai, Jie Li, Wenjing Han, Hao Ni, Kaituo Xu, Zhuo Zhang, Cheng Yi, Xiaorui Wang
CaTT-KWS: A Multi-stage Customized Keyword Spotting Framework based on Cascaded Transducer-Transformer
Zhanheng Yang, Sining Sun, Jin Li, Xiaoming Zhang, Xiong Wang, Long Ma, Lei Xie
LightHuBERT: Lightweight and Configurable Speech Representation Learning with Once-for-All Hidden-Unit BERT
Rui Wang, Qibing Bai, Junyi Ao, Long Zhou, Zhixiang Xiong, Zhihua Wei, Yu Zhang, Tom Ko, Haizhou Li
Multi-stage Progressive Compression of Conformer Transducer for On-device Speech Recognition
Jash Rathod, Nauman Dawalatabad, Shatrughan Singh, Dhananjaya Gowda
Streaming Align-Refine for Non-autoregressive Deliberation
Wang Weiran, Ke Hu, Tara Sainath
Federated Pruning: Improving Neural Network Efficiency with Federated Learning
Rongmei Lin, Yonghui Xiao, Tien-Ju Yang, Ding Zhao, Li Xiong, Giovanni Motta, Francoise Beaufays
A Unified Cascaded Encoder ASR Model for Dynamic Model Sizes
Shaojin Ding, Wang Weiran, Ding Zhao, Tara Sainath, Yanzhang He, Robert David, Rami Botros, Xin Wang, Rina Panigrahy, Qiao Liang, Dongseong Hwang, Ian McGraw, Rohit Prabhavalkar, Trevor Strohman
4-bit Conformer with Native Quantization Aware Training for Speech Recognition
Shaojin Ding, Phoenix Meadowlark, Yanzhang He, Lukasz Lew, Shivani Agrawal, Oleg Rybakov
Self-Distillation Based on High-level Information Supervision for Compressing End-to-End ASR Model
Qiang Xu, Tongtong Song, Longbiao Wang, Hao Shi, Yuqin Lin, Yongjie Lv, Meng Ge, Qiang Yu, Jianwu Dang
Leveraging unsupervised and weakly-supervised data to improve direct speech-to-speech translation
Ye Jia, Yifan Ding, Ankur Bapna, Colin Cherry, Yu Zhang, Alexis Conneau, Nobu Morioka
A High-Quality and Large-Scale Dataset for English-Vietnamese Speech Translation
Linh The Nguyen, Nguyen Luong Tran, Long Doan, Manh Luong, Dat Quoc Nguyen
Investigating Parameter Sharing in Multilingual Speech Translation
Qian Wang, Chen Wang, Jiajun Zhang
Open Source MagicData-RAMC: A Rich Annotated Mandarin Conversational (RAMC) Speech Dataset
Zehui Yang, Yifan Chen, Lei Luo, Runyan Yang, Lingxuan Ye, Gaofeng Cheng, Ji Xu, Yaohui Jin, Qingqing Zhang, Pengyuan Zhang, Lei Xie, Yonghong Yan
TALCS: An open-source Mandarin-English code-switching corpus and a speech recognition baseline
Chengfei Li, Shuhao Deng, Yaoping Wang, Guangjing Wang, Yaguang Gong, Changbin Chen, Jinfeng Bai
Blockwise Streaming Transformer for Spoken Language Understanding and Simultaneous Speech Translation
Keqi Deng, Shinji Watanabe, Jiatong Shi, Siddhant Arora
BARTpho: Pre-trained Sequence-to-Sequence Models for Vietnamese
Nguyen Luong Tran, Duong Le, Dat Quoc Nguyen
Biometric Russian Audio-Visual Extended MASKS (BRAVE-MASKS) Corpus: Multimodal Mask Type Recognition Task
Maxim Markitantov, Elena Ryumina, Dmitry Ryumin, Alexey Karpov
Bayesian Transformer Using Disentangled Mask Attention
Jen-Tzung Chien, Yu-Han Huang
Audio-Visual Speech Recognition in MISP2021 Challenge: Dataset Release and Deep Analysis
Hang Chen, Jun Du, Yusheng Dai, Chin-Hui Lee, Sabato Marco Siniscalchi, Shinji Watanabe, Odette Scharenborg, Jingdong Chen, Baocai Yin, Jia Pan
From Start to Finish: Latency Reduction Strategies for Incremental Speech Synthesis in Simultaneous Speech-to-Speech Translation
Danni Liu, Changhan Wang, Hongyu Gong, Xutai Ma, Yun Tang, Juan Pino
Isochrony-Aware Neural Machine Translation for Automatic Dubbing
Derek Tam, Surafel M. Lakew, Yogesh Virkar, Prashant Mathur, Marcello Federico
Leveraging Pseudo-labeled Data to Improve Direct Speech-to-Speech Translation
Qianqian Dong, Fengpeng Yue, Tom Ko, Mingxuan Wang, Qibing Bai, Yu Zhang
A Hybrid Continuity Loss to Reduce Over-Suppression for Time-domain Target Speaker Extraction
Zexu Pan, Meng Ge, Haizhou Li
Extending GCC-PHAT using Shift Equivariant Neural Networks
Axel Berg, Mark O'Connor, Kalle Åström, Magnus Oskarsson
Heterogeneous Target Speech Separation
Efthymios Tzinis, Gordon Wichern, Aswin Shanmugam Subramanian, Paris Smaragdis, Jonathan Le Roux
Separate What You Describe: Language-Queried Audio Source Separation
Xubo Liu, Haohe Liu, Qiuqiang Kong, Xinhao Mei, Jinzheng Zhao, Qiushi Huang, Mark D. Plumbley, Wenwu Wang
Implicit Neural Spatial Filtering for Multichannel Source Separation in the Waveform Domain
Dejan Markovic, Alexandre Defossez, Alexander Richard
End-to-end Speech-to-Punctuated-Text Recognition
Jumon Nozaki, Tatsuya Kawahara, Kenkichi Ishizuka, Taiichi Hashimoto
End-to-End Dependency Parsing of Spoken French
Adrien Pupier, Maximin Coavoux, Benjamin Lecouteux, Jerome Goulian
Turn-Taking Prediction for Natural Conversational Speech
Shuo-Yiin Chang, Bo Li, Tara Sainath, Chao Zhang, Trevor Strohman, Qiao Liang, Yanzhang He
Streaming Intended Query Detection using E2E Modeling for Continued Conversation
Shuo-Yiin Chang, Guru Prakash, Zelin Wu, Tara Sainath, Bo Li, Qiao Liang, Adam Stambler, Shyam Upadhyay, Manaal Faruqui, Trevor Strohman
Exploring Capabilities of Monolingual Audio Transformers using Large Datasets in Automatic Speech Recognition of Czech
Jan Lehečka, Jan Švec, Aleš Pražák, Josef Psutka
SVTS: Scalable Video-to-Speech Synthesis
Rodrigo Schoburg Carrillo de Mira, Alexandros Haliassos, Stavros Petridis, Björn W. Schuller, Maja Pantic
One-step models in pitch perception: Experimental evidence from Japanese
Takeshi Kishiyama, Chuyu Huang, Yuki Hirose
Generating iso-accented stimuli for second language research: methodology and a dataset for Spanish-accented English
Rubén Pérez Ramón, Martin Cooke, Maria Luisa Garcia Lecumberri
Factors affecting the percept of Yanny v. Laurel (or mixed): Insights from a large-scale study on Swiss German listeners
Adrian Leemann, Péter Jeszenszky, Carina Steiner, Corinne Lanthemann
Effects of laryngeal manipulations on voice gender perception
Zhaoyan Zhang, Jason Zhang, Jody Kreiman
Why is Korean lenis stop difficult to perceive for L2 Korean learners?
Boram Lee, Naomi Yamaguchi, Cécile Fougeron
Lexical stress in Spanish word segmentation
Alvaro Martin Iturralde Zurita, Meghan Clayards
Learning Audio-Text Agreement for Open-vocabulary Keyword Spotting
Hyeon-Kyeong Shin, Hyewon Han, Doyeon Kim, Soo-Whan Chung, Hong-Goo Kang
Integrating Form and Meaning: A Multi-Task Learning Model for Acoustic Word Embeddings
Badr M. Abdullah, Bernd Möbius, Dietrich Klakow
Personalized Keyword Spotting through Multi-task Learning
Seunghan Yang, Byeonggeun Kim, Inseop Chung, Simyung Chang
Deep LSTM Spoken Term Detection using Wav2Vec 2.0 Recognizer
Jan Švec, Jan Lehečka, Luboš Šmídl
Latency Control for Keyword Spotting
Christin Jose, Joe Wang, Grant Strimel, Mohammad Omar Khursheed, Yuriy Mishchenko, Brian Kulis
Improving Voice Trigger Detection with Metric Learning
Prateeth Nayak, Takuya Higuchi, Anmol Gupta, Shivesh Ranjan, Stephen Shum, Siddharth Sigtia, Erik Marchi, Varun Lakshminarasimhan, Minsik Cho, Saurabh Adya, Chandra Dhir, Ahmed Tewfik
RNN Transducers for Named Entity Recognition with constraints on alignment for understanding medical conversations
Hagen Soltau, Izhak Shafran, Mingqiu Wang, Laurent El Shafey
Towards Automated Counselling Decision-Making: Remarks on Therapist Action Forecasting on the AnnoMI Dataset
Zixiu Wu, Rim Helaoui, Diego Reforgiato Recupero, Daniele Riboni
Speech and the n-Back task as a lens into depression. How combining both may allow us to isolate different core symptoms of depression
Salvatore Fara, Stefano Goria, Emilia Molimpakis, Nicholas Cummins
Enabling Off-the-Shelf Disfluency Detection and Categorization for Pathological Speech
Amrit Romana, Minxue Niu, Matthew Perez, Angela Roberts, Emily Mower Provost
Challenges of using longitudinal and cross-domain corpora on studies of pathological speech
Catarina Botelho, Tanja Schultz, Alberto Abad, Isabel Trancoso
g2pW: A Conditional Weighted Softmax BERT for Polyphone Disambiguation in Mandarin
Yi-Chang Chen, Yu-Chuan Steven, Yen-Cheng Chang, Yi-Ren Yeh
A Unified Accent Estimation Method Based on Multi-Task Learning for Japanese Text-to-Speech
Byeongseon Park, Ryuichi Yamamoto, Kentaro Tachibana
Vocal effort modeling in neural TTS for improving the intelligibility of synthetic speech in noise
Tuomo Raitio, Petko Petkov, Jiangchuan Li, Muhammed Shifas, Andrea Davis, Yannis Stylianou
TTS-by-TTS 2: Data-Selective Augmentation for Neural Speech Synthesis Using Ranking Support Vector Machine with Variational Autoencoder
Eunwoo Song, Ryuichi Yamamoto, Ohsung Kwon, Chan-Ho Song, Min-Jae Hwang, Suhyeon Oh, Hyun-Wook Yoon, Jin-Seob Kim, Jae-Min Kim
Low-data? No problem: low-resource, language-agnostic conversational text-to-speech via F0-conditioned data augmentation
Giulia Comini, Goeric Huybrechts, Manuel Sam Ribeiro, Adam Gabryś, Jaime Lorenzo-Trueba
Real-Time Monitoring of Silences in Contact Center Conversations
Digvijay Ingle, Ayush Kumar, Krishnachaitanya Gogineni, Jithendra Vepa
Humanizing bionic voice: interactive demonstration of aesthetic design and control factors influencing the device's assembly and waveshape engineering
Konrad Zieliński, Marek Grzelec, Martin Hagmüller
Application for Real-time Personalized Speaker Extraction
Damien Ronssin, Milos Cernak
Coswara: A website application enabling COVID-19 screening by analysing respiratory sound samples and health symptoms
Debarpan Bhattacharya, Debottam Dutta, Neeraj Kumar Sharma, Srikanth Raj Chetupalli, Pravin Mote, Sriram Ganapathy, Chandrakiran C, Sahiti Nori, Suhail K K, Sadhana Gonuguntla, Murali Alagesan
CoachLea: an Android Application to Evaluate the Speech Production and Perception of Children with Hearing Loss
P. Schäfer, P. A. Pérez-Toro, P. Klumpp, J. R. Orozco-Arroyave, E. Nöth, K. Maier, A. Abad, M. Schuster, T. Arias-Vergara
An Automated Mood Diary for Older Users using Ambient Assisted Living Recorded Speech
Fasih Haider, Saturnino Luz
Differential Time-frequency Log-mel Spectrogram Features for Vision Transformer Based Infant Cry Recognition
Hai-tao Xu, Jie Zhang, Li-rong Dai
Towards Automated Dialog Personalization using MBTI Personality Indicators
Daniel Fernau, Stefan Hillmann, Nils Feldhus, Tim Polzehl
Word-wise Sparse Attention for Multimodal Sentiment Analysis
Fan Qian, Hongwei Song, Jiqing Han
Estimation of speaker age and height from speech signal using bi-encoder transformer mixture model
Tarun Gupta, Tuan Duc Truong, Tran The Anh, Eng Siong Chng
Exploring Multi-task Learning Based Gender Recognition and Age Estimation for Class-imbalanced Data
Weiqiao Zheng, Ping Yang, Rongfeng Lai, Kongyang Zhu, Tao Zhang, Junpeng Zhang, Hongcheng Fu
Audio-Visual Domain Adaptation Feature Fusion for Speech Emotion Recognition
Jie Wei, Guanyu Hu, Xinyu Yang, Anh Tuan Luu, Yizhuo Dong
Impact of Background Noise and Contribution of Visual Information in Emotion Identification by Native Mandarin Speakers
Minyue Zhang, Hongwei Ding
Exploiting Fine-tuning of Self-supervised Learning Models for Improving Bi-modal Sentiment Analysis and Emotion Recognition
Wei Yang, Satoru Fukayama, Panikos Heracleous, Jun Ogata
Characterizing Therapist's Speaking Style in Relation to Empathy in Psychotherapy
Dehua Tao, Tan Lee, Harold Chui, Sarah Luk
Hierarchical Attention Network for Evaluating Therapist Empathy in Counseling Session
Dehua Tao, Tan Lee, Harold Chui, Sarah Luk
Context-aware Multimodal Fusion for Emotion Recognition
Jinchao Li, Shuai Wang, Yang Chao, Xunying Liu, Helen Meng
Unsupervised Instance Discriminative Learning for Depression Detection from Speech Signals
Jinhan Wang, Vijay Ravi, Jonathan Flint, Abeer Alwan
How do our eyebrows respond to masks and whispering? The case of Persians
Nasim Mahdinazhad Sardhaei, Marzena Zygis, Hamid Sharifzadeh
State & Trait Measurement from Nonverbal Vocalizations: A Multi-Task Joint Learning Approach
Alice Baird, Panagiotis Tzirakis, Jeff Brooks, Lauren Kim, Michael Opara, Chris Gregory, Jacob Metrick, Garrett Boseck, Dacher Keltner, Alan Cowen
Confidence Measure for Automatic Age Estimation From Speech
Amruta Saraf, Ganesh Sivaraman, Elie Khoury
Accelerating Inference and Language Model Fusion of Recurrent Neural Network Transducers via End-to-End 4-bit Quantization
Andrea Fasoli, Chia-Yu Chen, Mauricio Serrano, Swagath Venkataramani, George Saon, Xiaodong Cui, Brian Kingsbury, Kailash Gopalakrishnan
Tree-constrained Pointer Generator with Graph Neural Network Encodings for Contextual Speech Recognition
Guangzhi Sun, Chao Zhang, Phil Woodland
Bring dialogue-context into RNN-T for streaming ASR
Junfeng Hou, Jinkun Chen, Wanyu Li, Yufeng Tang, Jun Zhang, Zejun Ma
Conformer with dual-mode chunked attention for joint online and offline ASR
Felix Weninger, Marco Gaudesi, Md Akmal Haidar, Nicola Ferri, Jesús Andrés-Ferrer, Puming Zhan
Efficient Training of Neural Transducer for Speech Recognition
Wei Zhou, Wilfried Michel, Ralf Schlüter, Hermann Ney
Paraformer: Fast and Accurate Parallel Transformer for Non-autoregressive End-to-End Speech Recognition
Zhifu Gao, ShiLiang Zhang, Ian McLoughlin, Zhijie Yan
Pruned RNN-T for fast, memory-efficient ASR training
Fangjun Kuang, Liyong Guo, Wei Kang, Long Lin, Mingshuang Luo, Zengwei Yao, Daniel Povey
Deep Sparse Conformer for Speech Recognition
Xianchao Wu
Chain-based Discriminative Autoencoders for Speech Recognition
Hung-Shin Lee, Pin-Tuan Huang, Yao-Fei Cheng, Hsin-Min Wang
Streaming parallel transducer beam search with fast slow cascaded encoders
Jay Mahadeokar, Yangyang Shi, Ke Li, Duc Le, Jiedan Zhu, Vikas Chandra, Ozlem Kalinli, Michael Seltzer
Self-regularised Minimum Latency Training for Streaming Transformer-based Speech Recognition
Mohan Li, Rama Sanand Doddipatla, Catalin Zorila
On the Prediction Network Architecture in RNN-T for ASR
Dario Albesano, Jesús Andrés-Ferrer, Nicola Ferri, Puming Zhan
Minimum latency training of sequence transducers for streaming end-to-end speech recognition
Yusuke Shinohara, Shinji Watanabe
CUSIDE: Chunking, Simulating Future Context and Decoding for Streaming ASR
Keyu An, Huahuan Zheng, Zhijian Ou, Hongyu Xiang, Ke Ding, Guanglu Wan
Attention Enhanced Citrinet for Speech Recognition
Xianchao Wu
Simple and Effective Zero-shot Cross-lingual Phoneme Recognition
Qiantong Xu, Alexei Baevski, Michael Auli
Robust Self-Supervised Audio-Visual Speech Recognition
Bowen Shi, Wei-Ning Hsu, Abdelrahman Mohamed
Speech Sequence Embeddings using Nearest Neighbors Contrastive Learning
Robin Algayres, Adel Nabli, Benoît Sagot, Emmanuel Dupoux
Towards Green ASR: Lossless 4-bit Quantization of a Hybrid TDNN System on the 300-hr Switchboard Corpus
Junhao Xu, Shoukang Hu, Xunying Liu, Helen Meng
Finer-grained Modeling units-based Meta-Learning for Low-resource Tibetan Speech Recognition
Siqing Qin, Longbiao Wang, Sheng Li, Yuqin Lin, Jianwu Dang
Adversarial-Free Speaker Identity-Invariant Representation Learning for Automatic Dysarthric Speech Classification
Parvaneh Janbakhshi, Ina Kodrasi
Automated Detection of Wilson’s Disease Based on Improved Mel-frequency Cepstral Coefficients with Signal Decomposition
Zhenglin Zhang, Li-Zhuang Yang, Xun Wang, Hai Li
The effect of backward noise on lexical tone discrimination in Mandarin-speaking amusics
Zixia Fan, Jing Shao, Weigong Pan, Min Xu, Lan Wang
Automatic Selection of Discriminative Features for Dementia Detection in Cantonese-Speaking People
Xiaoquan Ke, Man-Wai Mak, Helen M. Meng
Automated Voice Pathology Discrimination from Continuous Speech Benefits from Analysis by Phonetic Context
Zhuoya Liu, Mark Huckvale, Julian McGlashan
Multi-Type Outer Product-Based Fusion of Respiratory Sounds for Detecting COVID-19
Adria Mallol-Ragolta, Helena Cuesta, Emilia Gomez, Björn Schuller
Robust Cough Feature Extraction and Classification Method for COVID-19 Cough Detection Based on Vocalization Characteristics
Xueshuai Zhang, Jiakun Shen, Jun Zhou, Pengyuan Zhang, Yonghong Yan, Zhihua Huang, Yanfen Tang, Yu Wang, Fujie Zhang, Shaoxing Zhang, Aijun Sun
Comparing 1-dimensional and 2-dimensional spectral feature representations in voice pathology detection using machine learning and deep learning classifiers
Farhad Javanmardi, Sudarsana Reddy Kadiri, Manila Kodali, Paavo Alku
Zero-Shot Cross-lingual Aphasia Detection using Automatic Speech Recognition
Gerasimos Chatzoudis, Manos Plitsis, Spyridoula Stamouli, Athanasia-Lida Dimou, Nassos Katsamanis, Vassilis Katsouros
Domain-aware Intermediate Pretraining for Dementia Detection with Limited Data
Youxiang Zhu, Xiaohui Liang, John A. Batsis, Robert M. Roth
Comparison of 5 methods for the evaluation of intelligibility in mild to moderate French dysarthric speech
Cécile Fougeron, Nicolas Audibert, Ina Kodrasi, Parvaneh Janbakhshi, Michaela Pernon, Nathalie Leveque, Stephanie Borel, Marina Laganaro, Herve Bourlard, Frederic Assal
Improving Distortion Robustness of Self-supervised Speech Processing Tasks with Domain Adaptation
Kuan Po Huang, Yu-Kuan Fu, Yu Zhang, Hung-yi Lee
Listen, Adapt, Better WER: Source-free Single-utterance Test-time Adaptation for Automatic Speech Recognition
Guan-Ting Lin, Shang-Wen Li, Hung-yi Lee
Distilling a Pretrained Language Model to a Multilingual ASR Model
Kwanghee Choi, Hyung-Min Park
Text-Only Domain Adaptation Based on Intermediate CTC
Hiroaki Sato, Tomoyasu Komori, Takeshi Mishima, Yoshihiko Kawai, Takahiro Mochizuki, Shoei Sato, Tetsuji Ogawa
Transfer Learning for Robust Low-Resource Children's Speech ASR with Transformers and Source-Filter Warping
Jenthe Thienpondt, Kris Demuynck
Updating Only Encoders Prevents Catastrophic Forgetting of End-to-End ASR Models
Yuki Takashima, Shota Horiguchi, Shinji Watanabe, Leibny Paola Garcia Perera, Yohei Kawaguchi
Improved CNN-Transformer using Broadcasted Residual Learning for Text-Independent Speaker Verification
Jeong-Hwan Choi, Joon-Young Yang, Ye-Rin Jeoung, Joon-Hyuk Chang
Pushing the limits of raw waveform speaker recognition
Jee-weon Jung, Youjin Kim, Hee-Soo Heo, Bong-Jin Lee, Youngki Kwon, Joon Son Chung
PHO-LID: A Unified Model Incorporating Acoustic-Phonetic and Phonotactic Information for Language Identification
Hexin Liu, Leibny Paola Garcia Perera, Andy Khong, Suzy Styles, Sanjeev Khudanpur
Prosodic Information in Dialect Identification of a Tonal Language: The case of Ao
Moakala Tzudir, Priyankoo Sarmah, S R Mahadeva Prasanna
A Multimodal Strategy for Singing Language Identification
Wo Jae Lee, Emanuele Coviello
A comparative study on vowel articulation in Parkinson's disease and multiple system atrophy
Khalid Daoudi, Biswajit Das, Solange Milhé de Saint Victor, Alexandra Foubert-Samier, Margherita Fabbri, Anne Pavy-Le Traon, Olivier Rascol, Virginie Woisard, Wassilios G. Meissner
Voicing decision based on phonemes classification and spectral moments for whisper-to-speech conversion
Luc Ardaillon, Nathalie Henrich, Olivier Perrotin
Speech Acoustics in Mild Cognitive Impairment and Parkinson's Disease With and Without Concurrent Drawing Tasks
Tanya Talkar, Christina Manxhari, James Williamson, Kara M. Smith, Thomas Quatieri
Investigating the Impact of Speech Compression on the Acoustics of Dysarthric Speech
Kelvin Tran, Lingfeng Xu, Gabriela Stegmann, Julie Liss, Visar Berisha, Rene Utianski
Speaker Trait Enhancement for Cochlear Implant Users: A Case Study for Speaker Emotion Perception
Avamarie Brueggeman, John H.L. Hansen
Optimal thyroplasty implant shape and stiffness for treatment of acute unilateral vocal fold paralysis: Evidence from a canine in vivo phonation model
Neha Reddy, Yoonjeong Lee, Zhaoyan Zhang, Dinesh K. Chhetri
XLS-R: Self-supervised Cross-lingual Speech Representation Learning at Scale
Arun Babu, Changhan Wang, Andros Tjandra, Kushal Lakhotia, Qiantong Xu, Naman Goyal, Kritika Singh, Patrick von Platen, Yatharth Saraf, Juan Pino, Alexei Baevski, Alexis Conneau, Michael Auli
Semantically Meaningful Metrics for Norwegian ASR Systems
Janine Rugayan, Torbjørn Svendsen, Giampiero Salvi
Deciphering Speech: a Zero-Resource Approach to Cross-Lingual Transfer in ASR
Ondrej Klejch, Electra Wallington, Peter Bell
Linguistically Informed Post-processing for ASR Error correction in Sanskrit
Rishabh Kumar, Devaraja Adiga, Rishav Ranjan, Amrith Krishna, Ganesh Ramakrishnan, Pawan Goyal, Preethi Jyothi
Cross-lingual articulatory feature information transfer for speech recognition using recurrent progressive neural networks
Mahir Morshed, Mark Hasegawa-Johnson
Comparison of Models for Detecting Off-Putting Speaking Styles
Diego Aguirre, Nigel Ward, Jonathan E. Avila, Heike Lehnert-LeHouillier
Multimodal Persuasive Dialogue Corpus using Teleoperated Android
Seiya Kawano, Muteki Arioka, Akishige Yuguchi, Kenta Yamamoto, Koji Inoue, Tatsuya Kawahara, Satoshi Nakamura, Koichiro Yoshino
Text-driven Emotional Style Control and Cross-speaker Style Transfer in Neural TTS
Yookyung Shin, Younggun Lee, Suhee Jo, Yeongtae Hwang, Taesu Kim
Strategies for developing a Conversational Speech Dataset for Text-To-Speech Synthesis
Adaeze O. Adigwe, Esther Klabbers
Deep CNN-based Inductive Transfer Learning for Sarcasm Detection in Speech
Xiyuan Gao, Shekhar Nayak, Matt Coler
End-to-End Text-to-Speech Based on Latent Representation of Speaking Styles Using Spontaneous Dialogue
Kentaro Mitsui, Tianyu Zhao, Kei Sawada, Yukiya Hono, Yoshihiko Nankaku, Keiichi Tokuda
Attention-based conditioning methods using variable frame rate for style-robust speaker verification
Amber Afshan, Abeer Alwan
Learning from human perception to improve automatic speaker verification in style-mismatched conditions
Amber Afshan, Abeer Alwan
Exploring audio-based stylistic variation in podcasts
Katariina Martikainen, Jussi Karlgren, Khiet Truong
Automatic Evaluation of Speaker Similarity
Kamil Deja, Ariadna Sanchez, Julian Roth, Marius Cotescu
Mix and Match: An Empirical Study on Training Corpus Composition for Polyglot Text-To-Speech (TTS)
Ziyao Zhang, Alessio Falai, Ariadna Sanchez, Orazio Angelini, Kayoko Yanagisawa
J-MAC: Japanese multi-speaker audiobook corpus for speech synthesis
Shinnosuke Takamichi, Wataru Nakata, Naoko Tanji, Hiroshi Saruwatari
REYD – The First Yiddish Text-to-Speech Dataset and System
Jacob Webber, Samuel K. Lo, Isaac L. Bleaman
Data-augmented cross-lingual synthesis in a teacher-student framework
Marcel de Korte, Jaebok Kim, Aki Kunikoshi, Adaeze Adigwe, Esther Klabbers
Production characteristics of obstruents in WaveNet and older TTS systems
Ayushi Pandey, Sébastien Le Maguer, Julie Carson-Berndsen, Naomi Harte
Back to the Future: Extending the Blizzard Challenge 2013
Sébastien Le Maguer, Simon King, Naomi Harte
BibleTTS: a large, high-fidelity, multilingual, and uniquely African speech corpus
Josh Meyer, David Adelani, Edresson Casanova, Alp Öktem, Daniel Whitenack, Julian Weber, Salomon Kabongo Kabenamualu, Elizabeth Salesky, Iroro Orife, Colin Leong, Perez Ogayo, Chris Chinenye Emezue, Jonathan Mukiibi, Salomey Osei, Apelete Agbolo, Victor Akinode, Bernard Opoku, Olanrewaju Samuel, Jesujoba Alabi, Shamsuddeen Hassan Muhammad
SOMOS: The Samsung Open MOS Dataset for the Evaluation of Neural Text-to-Speech Synthesis
Georgia Maniati, Alexandra Vioni, Nikolaos Ellinas, Karolos Nikitaras, Konstantinos Klapsas, June Sig Sung, Gunu Jho, Aimilios Chalamandaris, Pirros Tsiakoulis
Domain Generalization with Relaxed Instance Frequency-wise Normalization for Multi-device Acoustic Scene Classification
Byeonggeun Kim, Seunghan Yang, Jangho Kim, Hyunsin Park, Juntae Lee, Simyung Chang
Couple learning for semi-supervised sound event detection
Tao Rui, Yan Long, Ouchi Kazushige, Xiangdong Wang
Oktoechos Classification in Liturgical Music Using SBU-LSTM/GRU
Rajeev Rajan, Ananya Ayasi
SoundDoA: Learn Sound Source Direction of Arrival and Semantics from Sound Raw Waveforms
Yuhang He, Andrew Markham
ORCA-WHISPER: An Automatic Killer Whale Sound Type Generation Toolkit Using Deep Learning
Christian Bergler, Alexander Barnhill, Dominik Perrin, Manuel Schmitt, Andreas Maier, Elmar Nöth
Convolutional Recurrent Neural Network with Auxiliary Stream for Robust Variable-Length Acoustic Scene Classification
Joon-Hyuk Chang, Won-Gook Choi
Unsupervised Symbolic Music Segmentation using Ensemble Temporal Prediction Errors
Shahaf Bassan, Yossi Adi, Jeffrey Rosenschein
Visually-aware Acoustic Event Detection using Heterogeneous Graphs
Amir Shirian, Krishna Somandepalli, Victor Sanchez, Tanaya Guha
A Passive Similarity based CNN Filter Pruning for Efficient Acoustic Scene Classification
Arshdeep Singh, Mark D. Plumbley
MAE-AST: Masked Autoencoding Audio Spectrogram Transformer
Alan Baade, Puyuan Peng, David Harwath
What can Speech and Language Tell us About the Working Alliance in Psychotherapy
Sebastian Peter Bayerl, Gabriel Roccabruna, Shammur Absar Chowdhury, Tommaso Ciulli, Morena Danieli, Korbinian Riedhammer, Giuseppe Riccardi
TB or not TB? Acoustic cough analysis for tuberculosis classification
Geoffrey T. Frost, Grant Theron, Thomas Niesler
Are reported accuracies in the clinical speech machine learning literature overoptimistic?
Visar Berisha, Chelsea Krantsevich, Gabriela Stegmann, Shira Hahn, Julie Liss
Automatic Detection of Expressed Emotion from Five-Minute Speech Samples: Challenges and Opportunities
Bahman Mirheidari, Andre Bittar, Nicholas Cummins, Johnny Downs, Helen L. Fisher, Heidi Christensen
Automatic cognitive assessment: Combining sparse datasets with disparate cognitive scores
Bahman Mirheidari, Daniel Blackburn, Heidi Christensen
Exploring Semi-supervised Learning for Audio-based COVID-19 Detection using FixMatch
Ting Dang, Thomas Quinnell, Cecilia Mascolo
Analyzing the impact of SARS-CoV-2 variants on respiratory sound signals
Debarpan Bhattacharya, Debottam Dutta, Neeraj Sharma, Srikanth Raj Chetupalli, Pravin Mote, Sriram Ganapathy, Chandrakiran C, Sahiti Nori, Suhail K K, Sadhana Gonuguntla, Murali Alagesan
Automated Evaluation of Standardized Dementia Screening Tests
Franziska Braun, Markus Förstel, Bastian Oppermann, Andreas Erzigkeit, Hartmut Lehfeld, Thomas Hillemacher, Korbinian Riedhammer
Alzheimer's Detection from English to Spanish Using Acoustic and Linguistic Embeddings
Paula Andrea Pérez-Toro, Philipp Klumpp, Abner Hernandez, Tomas Arias, Patricia Lillo, Andrea Slachevsky, Adolfo Martín García, Maria Schuster, Andreas K. Maier, Elmar Noeth, Juan Rafael Orozco-Arroyave
Extract and Abstract with BART for Clinical Notes from Doctor-Patient Conversations
Jing Su, Longxiang Zhang, Hamid Reza Hassanzadeh, Thomas Schaaf
Dyadic Interaction Assessment from Free-living Audio for Depression Severity Assessment
Bishal Lamichhane, Nidal Moukaddam, Ankit B. Patel, Ashutosh Sabharwal
COVID-19 detection based on respiratory sensing from speech
Venkata Srikanth Nallanthighal, Aki Harma, Helmer Strik
Bifurcation and Reunion: A Loss-Guided Two-Stage Approach for Monaural Speech Dereverberation
Xiaoxue Luo, Chengshi Zheng, Andong Li, Yuxuan Ke, Xiaodong Li
A deep complex multi-frame filtering network for stereophonic acoustic echo cancellation
Linjuan Cheng, Chengshi Zheng, Andong Li, Yuquan Wu, Renhua Peng, Xiaodong Li
Speaker- and Phone-aware Convolutional Transformer Network for Acoustic Echo Cancellation
Chang Han, Weiping Tu, Yuhong Yang, Jingyi Li, Xinhong Li
Personalized Acoustic Echo Cancellation for Full-duplex Communications
Shimin Zhang, Ziteng Wang, Yukai Ju, Yihui Fu, Yueyue Na, Qiang Fu, Lei Xie
LCSM: A Lightweight Complex Spectral Mapping Framework for Stereophonic Acoustic Echo Cancellation
Chenggang Zhang, JinJiang Liu, Xueliang Zhang
Joint Neural AEC and Beamforming with Double-Talk Detection
Vinay Kothapally, Yong Xu, Meng Yu, Shi-Xiong Zhang, Dong Yu
Clock Skew Robust Acoustic Echo Cancellation
Karim Helwani, Erfan Soltanmohammadi, Michael Mark Goodwin, Arvindh Krishnaswamy
A Conformer-based Waveform-domain Neural Acoustic Echo Canceller Optimized for ASR Accuracy
Sankaran Panchapagesan, Arun Narayanan, Turaj Zakizadeh Shabestary, Shuai Shao, Nathan Howard, Alex Park, James Walker, Alexander Gruenstein
Complex-Valued Time-Frequency Self-Attention for Speech Dereverberation
Vinay Kothapally, John H.L. Hansen
Learning Noise-independent Speech Representation for High-quality Voice Conversion for Noisy Target Speakers
Liumeng Xue, Shan Yang, Na Hu, Dan Su, Lei Xie
Speech Representation Disentanglement with Adversarial Mutual Information Learning for One-shot Voice Conversion
SiCheng Yang, Methawee Tantrawenith, Haolin Zhuang, Zhiyong Wu, Aolan Sun, Jianzong Wang, Ning Cheng, Huaizhen Tang, Xintao Zhao, Jie Wang, Helen Meng
FlowCPCVC: A Contrastive Predictive Coding Supervised Flow Framework for Any-to-Any Voice Conversion
Jiahong Huang, Wen Xu, Yule Li, Junshi Liu, Dongpeng Ma, Wei Xiang
Glow-WaveGAN 2: High-quality Zero-shot Text-to-speech Synthesis and Any-to-any Voice Conversion
Yi Lei, Shan Yang, Jian Cong, Lei Xie, Dan Su
AdaSpeech 4: Adaptive Text to Speech in Zero-Shot Scenarios
Yihan Wu, Xu Tan, Bohan Li, Lei He, Sheng Zhao, Ruihua Song, Tao Qin, Tie-Yan Liu
Content-Dependent Fine-Grained Speaker Embedding for Zero-Shot Speaker Adaptation in Text-to-Speech Synthesis
Yixuan Zhou, Changhe Song, Xiang Li, Luwen Zhang, Zhiyong Wu, Yanyao Bian, Dan Su, Helen Meng
Streamable Speech Representation Disentanglement and Multi-Level Prosody Modeling for Live One-Shot Voice Conversion
Haoquan Yang, Liqun Deng, Yu Ting Yeung, Nianzu Zheng, Yong Xu
Accent Conversion using Pre-trained Model and Synthesized Data from Voice Conversion
Tuan Nam Nguyen, Ngoc-Quan Pham, Alexander Waibel
VoiceMe: Personalized voice generation in TTS
Pol van Rijn, Silvan Mertes, Dominik Schiller, Piotr Dura, Hubert Siuzdak, Peter M. C. Harrison, Elisabeth André, Nori Jacoby
DeID-VC: Speaker De-identification via Zero-shot Pseudo Voice Conversion
Ruibin Yuan, Yuxuan Wu, Jacob Li, Jaxter Kim
Towards Improved Zero-shot Voice Conversion with Conditional DSVAE
Jiachen Lian, Chunlei Zhang, Gopala Krishna Anumanchipalli, Dong Yu
Disentanglement of Emotional Style and Speaker Identity for Expressive Voice Conversion
Zongyang Du, Berrak Sisman, Kun Zhou, Haizhou Li
Internal Language Model Adaptation with Text-Only Data for End-to-End Speech Recognition
Zhong Meng, Yashesh Gaur, Naoyuki Kanda, Jinyu Li, Xie Chen, Yu Wu, Yifan Gong
A Complementary Joint Training Approach Using Unpaired Speech and Text
Yeqian Du, Jie Zhang, Qiu-shi Zhu, Lirong Dai, MingHui Wu, Xin Fang, ZhouWang Yang
Knowledge Transfer and Distillation from Autoregressive to Non-Autoregressive Speech Recognition
Xun Gong, Zhikai Zhou, Yanmin Qian
Confidence Score Based Conformer Speaker Adaptation for Speech Recognition
Jiajun Deng, Xurong Xie, Tianzi Wang, Mingyu Cui, Boyang Xue, Zengrui Jin, Mengzhe Geng, Guinan Li, Xunying Liu, Helen Meng
Decoupled Federated Learning for ASR with Non-IID Data
Han Zhu, Jindong Wang, Gaofeng Cheng, Pengyuan Zhang, Yonghong Yan
Knowledge Distillation For CTC-based Speech Recognition Via Consistent Acoustic Representation Learning
Sanli Tian, Keqi Deng, Zehan Li, Lingxuan Ye, Gaofeng Cheng, Ta Li, Yonghong Yan
Improving Generalization of Deep Neural Network Acoustic Models with Length Perturbation and N-best Based Label Smoothing
Xiaodong Cui, George Saon, Tohru Nagano, Masayuki Suzuki, Takashi Fukuda, Brian Kingsbury, Gakuto Kurata
Supervision-Guided Codebooks for Masked Prediction in Speech Pre-training
Chengyi Wang, Yiming Wang, Yu Wu, Sanyuan Chen, Jinyu Li, Shujie Liu, Furu Wei
Speech Pre-training with Acoustic Piece
Shuo Ren, Shujie Liu, Yu Wu, Long Zhou, Furu Wei
Censer: Curriculum Semi-supervised Learning for Speech Recognition Based on Self-supervised Pre-training
Bowen Zhang, Songjun Cao, Xiaoming Zhang, Yike Zhang, Long Ma, Takahiro Shinozaki
Pre-Training Transformer Decoder for End-to-End ASR Model with Unpaired Speech Data
Junyi Ao, Ziqiang Zhang, Long Zhou, Shujie Liu, Haizhou Li, Tom Ko, Lirong Dai, Jinyu Li, Yao Qian, Furu Wei
PISA: PoIncaré Saliency-Aware Interpolative Augmentation
Ramit Sawhney, Megh Thakkar, Vishwa Shah, Puneet Mathur, Vasu Sharma, Dinesh Manocha
Online Continual Learning of End-to-End Speech Recognition Models
Muqiao Yang, Ian Lane, Shinji Watanabe
Streaming Target-Speaker ASR with Neural Transducer
Takafumi Moriya, Hiroshi Sato, Tsubasa Ochiai, Marc Delcroix, Takahiro Shinozaki
SPLICEOUT: A Simple and Efficient Audio Augmentation Method
Arjit Jain, Pranay Reddy Samala, Deepak Mittal, Preethi Jyothi, Maneesh Singh
Tokenwise Contrastive Pretraining for Finer Speech-to-BERT Alignment in End-to-End Speech-to-Intent Systems
Vishal Sunder, Eric Fosler-Lussier, Samuel Thomas, Hong-Kwang Kuo, Brian Kingsbury
Japanese ASR-Robust Pre-trained Language Model with Pseudo-Error Sentences Generated by Grapheme-Phoneme Conversion
Yasuhito Ohsugi, Itsumi Saito, Kyosuke Nishida, Sen Yoshida
Improving Spoken Language Understanding with Cross-Modal Contrastive Learning
Jingjing Dong, Jiayi Fu, Peng Zhou, Hao Li, Xiaorui Wang
Low-bit Shift Network for End-to-End Spoken Language Understanding
Anderson R. Avila, Khalil Bibi, Rui Heng Yang, Xinlin Li, Chao Xing, Xiao Chen
Meta Auxiliary Learning for Low-resource Spoken Language Understanding
Yingying Gao, Junlan Feng, Chao Deng, Shilei Zhang
Adversarial Knowledge Distillation For Robust Spoken Language Understanding
Ye Wang, Baishun Ling, Yanmeng Wang, Junhao Xue, Shaojun Wang, Jing Xiao
Incorporating Dual-Aware with Hierarchical Interactive Memory Networks for Task-Oriented Dialogue
Yangyang Ou, Peng Zhang, Jing Zhang, Hui Gao, Xing Ma
Pay More Attention to History: A Context Modeling Strategy for Conversational Text-to-SQL
Yuntao Li, Hanchu Zhang, Yutian Li, Sirui Wang, Wei Wu, Yan Zhang
Small Changes Make Big Differences: Improving Multi-turn Response Selection in Dialogue Systems via Fine-Grained Contrastive Learning
Yuntao Li, Can Xu, Huang Hu, Lei Sha, Yan Zhang, Daxin Jiang
Toward Low-Cost End-to-End Spoken Language Understanding
Marco Dinarelli, Marco Naguib, François Portet
A Multi-Task BERT Model for Schema-Guided Dialogue State Tracking
Eleftherios Kapelonis, Efthymios Georgiou, Alexandros Potamianos
WavPrompt: Towards Few-Shot Spoken Language Understanding with Frozen Language Models
Heting Gao, Junrui Ni, Kaizhi Qian, Yang Zhang, Shiyu Chang, Mark Hasegawa-Johnson
Analysis of praising skills focusing on utterance contents
Asahi Ogushi, Toshiki Onishi, Yohei Tahara, Ryo Ishii, Atsushi Fukayama, Takao Nakamura, Akihiro Miyata
Speech2Slot: A Limited Generation Framework with Boundary Detection for Slot Filling from Speech
Pengwei Wang, Yinpei Su, Xiaohuan Zhou, Xin Ye, Liangchen Wei, Ming Liu, Yuan You, Feijun Jiang
Efficient Training of Audio Transformers with Patchout
Khaled Koutini, Jan Schlüter, Hamid Eghbal-zadeh, Gerhard Widmer
CNN-based Audio Event Recognition for Automated Violence Classification and Rating for Prime Video Content
Mayank Sharma, Tarun Gupta, Kenny Qiu, Xiang Hao, Raffay Hamid
Frequency Dynamic Convolution: Frequency-Adaptive Pattern Recognition for Sound Event Detection
Hyeonuk Nam, Seong-Hu Kim, Byeong-Yun Ko, Yong-Hwa Park
On Breathing Pattern Information in Synthetic Speech
Zohreh Mostaani, Mathew Magimai Doss
Interactive Audio-text Representation for Automated Audio Captioning with Contrastive Learning
Chen Chen, Nana Hou, Yuchen Hu, Heqing Zou, Xiaofeng Qi, Eng Siong Chng
Deformable CNN and Imbalance-Aware Feature Learning for Singing Technique Classification
Yuya Yamamoto, Juhan Nam, Hiroko Terasawa
Does Audio Deepfake Detection Generalize?
Nicolas Müller, Pavel Czempin, Franziska Diekmann, Adam Froghyar, Konstantin Böttinger
Attacker Attribution of Audio Deepfakes
Nicolas Müller, Franziska Diekmann, Jennifer Williams
Are disentangled representations all you need to build speaker anonymization systems?
Pierre Champion, Anthony Larcher, Denis Jouvet
Towards End-to-End Private Automatic Speaker Recognition
Francisco Teixeira, Alberto Abad, Bhiksha Raj, Isabel Trancoso
Extracting Targeted Training Data from ASR Models, and How to Mitigate It
Ehsan Amid, Om Dipakbhai Thakkar, Arun Narayanan, Rajiv Mathews, Francoise Beaufays
Detecting Unintended Memorization in Language-Model-Fused ASR
W. Ronny Huang, Steve Chien, Om Dipakbhai Thakkar, Rajiv Mathews
Transformer-Based Automatic Speech Recognition with Auxiliary Input of Source Language Text Toward Transcribing Simultaneous Interpretation
Shuta Taniguchi, Tsuneo Kato, Akihiro Tamura, Keiji Yasuda
AVATAR: Unconstrained Audiovisual Speech Recognition
Valentin Gabeur, Paul Hongsuck Seo, Arsha Nagrani, Chen Sun, Karteek Alahari, Cordelia Schmid
Word Discovery in Visually Grounded, Self-Supervised Speech Models
Puyuan Peng, David Harwath
End-to-End multi-talker audio-visual ASR using an active speaker attention module
Richard Rose, Olivier Siohan
Transformer-Based Video Front-Ends for Audio-Visual Speech Recognition for Single and Multi-Person Video
Dmitriy Serdyuk, Otavio Braga, Olivier Siohan
Visual Context-driven Audio Feature Enhancement for Robust End-to-End Audio-Visual Speech Recognition
Joanna Hong, Minsu Kim, Daehun Yoo, Yong Man Ro
Frame-Level Stutter Detection
John Harvill, Mark Hasegawa-Johnson, Chang D. Yoo
Detecting Heart Failure Through Voice Analysis using Self-Supervised Mode-Based Memory Fusion
Darshana Priyasad, Andi Partovi, Sridha Sridharan, Maryam Kashefpoor, Tharindu Fernando, Simon Denman, Clinton Fookes, Jia Tang, David Kaye
Automatic Detection of Speech Sound Disorder in Child Speech Using Posterior-based Speaker Representations
Si-Ioi Ng, Cymie Wing-Yee Ng, Jiarui Wang, Tan Lee
Data Augmentation for Dementia Detection in Spoken Language.
Dominika Woszczyk, Anna Hedlikova, Alican Akman, Soteris Demetriou, Björn Schuller
Acoustic Representation Learning on Breathing and Speech Signals for COVID-19 Detection
Debottam Dutta, Debarpan Bhattacharya, Sriram Ganapathy, Amir Hossein Poorjam, Deepak Mittal, Maneesh Singh
Detecting Dysfluencies in Stuttering Therapy Using wav2vec 2.0
Sebastian Peter Bayerl, Dominik Wagner, Elmar Noeth, Korbinian Riedhammer
HYU Submission for the SASV Challenge 2022: Reforming Speaker Embeddings with Spoofing-Aware Conditioning
Jeong-Hwan Choi, Joon-Young Yang, Ye-Rin Jeoung, Joon-Hyuk Chang
Two Methods for Spoofing-Aware Speaker Verification: Multi-Layer Perceptron Score Fusion Model and Integrated Embedding Projector
Jungwoo Heo, Ju-Ho Kim, Hyun-seo Shin
Spoofing-Aware Attention based ASV Back-end with Multiple Enrollment Utterances and a Sampling Strategy for the SASV Challenge 2022
Chang Zeng, Lin Zhang, Meng Liu, Junichi Yamagishi
A Subnetwork Approach for Spoofing Aware Speaker Verification
Alexander Alenin, Nikita Torgashov, Anton Okhotnikov, Rostislav Makarov, Ivan Yakovlev
SASV 2022: The First Spoofing-Aware Speaker Verification Challenge
Jee-weon Jung, Hemlata Tak, Hye-jin Shim, Hee-Soo Heo, Bong-Jin Lee, Soo-Whan Chung, Ha-Jin Yu, Nicholas Evans, Tomi Kinnunen
Representation Selective Self-distillation and wav2vec 2.0 Feature Exploration for Spoof-aware Speaker Verification
Jin Woo Lee, Eungbeom Kim, Junghyun Koo, Kyogu Lee
tPLCnet: Real-time Deep Packet Loss Concealment in the Time Domain Using a Short Temporal Context
Nils L. Westhausen, Bernd T. Meyer
On the Role of Spatial, Spectral, and Temporal Processing for DNN-based Non-linear Multi-channel Speech Enhancement
Kristina Tesch, Nils-Hendrik Mohrmann, Timo Gerkmann
DDS: A new device-degraded speech dataset for speech enhancement
Haoyu Li, Junichi Yamagishi
Direction-Aware Joint Adaptation of Neural Speech Enhancement and Recognition in Real Multiparty Conversational Environments
Yicheng Du, Aditya Arie Nugraha, Kouhei Sekiguchi, Yoshiaki Bando, Mathieu Fontaine, Kazuyoshi Yoshii
Refining DNN-based Mask Estimation using CGMM-based EM Algorithm for Multi-channel Noise Reduction
Julitta Bartolewska, Stanisław Kacprzak, Konrad Kowalczyk
Speech Enhancement with Score-Based Generative Models in the Complex STFT Domain
Simon Welker, Julius Richter, Timo Gerkmann
Enhancing Embeddings for Speech Classification in Noisy Conditions
Mohamed Nabih Ali, Alessio Brutti, Falavigna Daniele
Deep Audio Waveform Prior
Arnon Turetzky, Tzvi Michelson, Yossi Adi, Shmuel Peleg
Convolutive Weighted Multichannel Wiener Filter Front-end for Distant Automatic Speech Recognition in Reverberant Multispeaker Scenarios
Mieszko Fras, Marcin Witkowski, Konrad Kowalczyk
Efficient Transformer-based Speech Enhancement Using Long Frames and STFT Magnitudes
Danilo de Oliveira, Tal Peer, Timo Gerkmann
Improving Speech Enhancement through Fine-Grained Speech Characteristics
Muqiao Yang, Joseph Konan, David Bick, Anurag Kumar, Shinji Watanabe, Bhiksha Raj
Creating New Voices using Normalizing Flows
Piotr Bilinski, Thomas Merritt, Abdelhamid Ezzerg, Kamil Pokora, Sebastian Cygert, Kayoko Yanagisawa, Roberto Barra-Chicote, Daniel Korzekwa
Unify and Conquer: How Phonetic Feature Representation Affects Polyglot Text-To-Speech (TTS)
Ariadna Sanchez, Alessio Falai, Ziyao Zhang, Orazio Angelini, Kayoko Yanagisawa
Human-in-the-loop Speaker Adaptation for DNN-based Multi-speaker TTS
Kenta Udagawa, Yuki Saito, Hiroshi Saruwatari
GlowVC: Mel-spectrogram space disentangling model for language-independent text-free voice conversion
Magdalena Proszewska, Grzegorz Beringer, Daniel Sáez-Trigueros, Thomas Merritt, Abdelhamid Ezzerg, Roberto Barra-Chicote
One-Shot Speaker Adaptation Based on Initialization by Generative Adversarial Networks for TTS
Jaeuk Lee, Joon-Hyuk Chang
Zero-Shot Voice Conditioning for Denoising Diffusion TTS Models
Alon Levkovitch, Eliya Nachmani, Lior Wolf
Advanced Speaker Embedding with Predictive Variance of Gaussian Distribution for Speaker Adaptation in TTS
Jaeuk Lee, Joon-Hyuk Chang
Karaoker: Alignment-free singing voice synthesis with speech training data
Panagiotis Kakoulidis, Nikolaos Ellinas, Georgios Vamvoukakis, Konstantinos Markopoulos, June Sig Sung, Gunu Jho, Pirros Tsiakoulis, Aimilios Chalamandaris
ACNN-VC: Utilizing Adaptive Convolution Neural Network for One-Shot Voice Conversion
Ji Sub Um, Yeunju Choi, Hoi Rin Kim
A Unified System for Voice Cloning and Voice Conversion through Diffusion Probabilistic Modeling
Tasnima Sadekova, Vladimir Gogoryan, Ivan Vovk, Vadim Popov, Mikhail Kudinov, Jiansheng Wei
Adversarial Multi-Task Learning for Disentangling Timbre and Pitch in Singing Voice Synthesis
Tae-Woo Kim, Min-Su Kang, Gyeong-Hoon Lee
Leveraging Symmetrical Convolutional Transformer Networks for Speech to Singing Voice Style Transfer
Shrutina Agarwal, Naoya Takahashi, Sriram Ganapathy
Cross-Speaker Emotion Transfer for Low-Resource Text-to-Speech Using Non-Parallel Voice Conversion with Pitch-Shift Data Augmentation
Ryo Terashima, Ryuichi Yamamoto, Eunwoo Song, Yuma Shirahata, Hyun-Wook Yoon, Jae-Min Kim, Kentaro Tachibana
Deep residual spiking neural network for keyword spotting in low-resource settings
Qu Yang, Qi Liu, Haizhou Li
Reducing Domain mismatch in Self-supervised speech pre-training
Murali Karthick Baskar, Andrew Rosenberg, Bhuvana Ramabhadran, Yu Zhang, Nicolás Serrano
Sub-8-Bit Quantization Aware Training for 8-Bit Neural Network Accelerator with On-Device Speech Recognition
Kai Zhen, Hieu Duy Nguyen, Raviteja Chinta, Nathan Susanj, Athanasios Mouchtaris, Tariq Afzal, Ariya Rastrow
W2V2-Light: A Lightweight Version of Wav2vec 2.0 for Automatic Speech Recognition
Dong-Hyun Kim, Jae-Hong Lee, Ji-Hwan Mo, Joon-Hyuk Chang
Compute Cost Amortized Transformer for Streaming ASR
Yi Xie, Jonathan J. Macoskey, Martin Radfar, Feng-Ju Chang, Brian King, Ariya Rastrow, Athanasios Mouchtaris, Grant Strimel
On-demand compute reduction with stochastic wav2vec 2.0
Apoorv Vyas, Wei-Ning Hsu, Michael Auli, Alexei Baevski
Transfer Learning from Multi-Lingual Speech Translation Benefits Low-Resource Speech Recognition
Geoffroy Vanderreydt, François REMY, Kris Demuynck
FeaRLESS: Feature Refinement Loss for Ensembling Self-Supervised Learning Features in Robust End-to-end Speech Recognition
Szu-Jui Chen, Jiamin Xie, John H.L. Hansen
Perceptual Evaluation of Penetrating Voices through a Semantic Differential Method
Tatsuya Kitamura, Naoki Kunimoto, Hideki Kawahara, Shigeaki Amano
Non-native Perception of Japanese Singleton/Geminate Contrasts: Comparison of Mandarin and Mongolian Speakers Differing in Japanese Experience
Kimiko Tsukada, Yurong Yurong
Evaluating the effects of modified speech on perceptual speaker identification performance
Benjamin O'Brien, Christine Meunier, Alain Ghio
Mandarin Lombard Grid: a Lombard-grid-like corpus of Standard Chinese
Yuhong Yang, Xufeng Chen, Qingmu Liu, Weiping Tu, Hongyang Chen, Linjun Cai
Syllable sequence of /a/+/ta/ can be heard as /atta/ in Japanese with visual or tactile cues
Takayuki Arai, Miho Yamada, Megumi Okusawa
InQSS: a speech intelligibility and quality assessment model using a multi-task learning network
Yu-Wen Chen, Yu Tsao
Investigating the influence of personality on acoustic-prosodic entrainment
Andreas Weise, Rivka Levitan
Common and differential acoustic representation of interpersonal and tactile iconic perception of Mandarin vowels
Yi Li, Xiaoming Jiang
Effects of Noise on Speech Perception and Spoken Word Comprehension
Jovan Eranovic, Daniel Pape, Magda Stroińska, Elisabet Service, Marijana Matkovski
Acquisition of Two Consecutive Neutral Tones in Mandarin-Speaking Preschoolers: Phonological Representation and Phonetic Realization
Sichen Zhang, Aijun Li
Air tissue boundary segmentation using regional loss in real-time Magnetic Resonance Imaging video for speech production
Anwesha Roy, Varun Belagali, Prasanta Ghosh
Language-specific interactions of vowel discrimination in noise
Mark Gibson, Marcel Schlechtweg, Beatriz Blecua Falgueras, Judit Ayala Alcalde
An Improved Transformer Transducer Architecture for Hindi-English Code Switched Speech Recognition
Ansen Antony, Sumanth Reddy Kota, Akhilesh Lade, Spoorthy V, Shashidhar G. Koolagudi
VocaLiST: An Audio-Visual Synchronisation Model for Lips and Voices
Venkatesh Shenoy Kadandale, Juan F. Montesinos, Gloria Haro
Cross-Lingual Transfer Learning Approach to Phoneme Error Detection via Latent Phonetic Representation
Jovan M. Dalhouse, Katunobu Itou
Global RNN Transducer Models For Multi-dialect Speech Recognition
Takashi Fukuda, Samuel Thomas, Masayuki Suzuki, Gakuto Kurata, George Saon, Brian Kingsbury
Acoustic Stress Detection in Isolated English Words for Computer-Assisted Pronunciation Training
Vera Bernhard, Sandra Schwab, Jean-Philippe Goldman
On-the-fly ASR Corrections with Audio Exemplars
Golan Pundak, Tsendsuren Munkhdalai, Khe Chai Sim
FFM: A Frame Filtering Mechanism To Accelerate Inference Speed For Conformer In Speech Recognition
Zongfeng Quan, Nick J.C. Wang, Wei Chu, Tao Wei, Shaojun Wang, Jing Xiao
Two-pass Decoding and Cross-adaptation Based System Combination of End-to-end Conformer and Hybrid TDNN ASR Systems
Mingyu Cui, Jiajun Deng, Shoukang Hu, Xurong Xie, Tianzi Wang, Shujie Hu, Mengzhe Geng, Boyang Xue, Xunying Liu, Helen Meng
Improving Recognition of Out-of-vocabulary Words in E2E Code-switching ASR by Fusing Speech Generation Methods
Lingxuan Ye, Gaofeng Cheng, Runyan Yang, Zehui Yang, Sanli Tian, Pengyuan Zhang, Yonghong Yan
Mitigating bias against non-native accents
Yuanyuan Zhang, Yixuan Zhang, Bence Halpern, Tanvina Patel, Odette Scharenborg
A Multi-level Acoustic Feature Extraction Framework for Transformer Based End-to-End Speech Recognition
Jin Li, Rongfeng Su, Xurong Xie, Lan Wang, Nan Yan
LAE: Language-Aware Encoder for Monolingual and Multilingual ASR
Jinchuan Tian, Jianwei Yu, Chunlei Zhang, Yuexian Zou, Dong Yu
Significance of single frequency filter for the development of children’s KWS system
Biswaranjan Pattanayak, Gayadhar Pradhan
A Language Agnostic Multilingual Streaming On-Device ASR System
Bo Li, Tara Sainath, Ruoming Pang, Shuo-Yiin Chang, Qiumin Xu, Trevor Strohman, Vince Chen, Qiao Liang, Heguang Liu, Yanzhang He, Parisa Haghani, Sameer Bidichandani
Minimizing Sequential Confusion Error in Speech Command Recognition
Zhanheng Yang, Hang Lv, Xiong Wang, Ao Zhang, Lei Xie
Homophone Disambiguation Profits from Durational Information
Barbara Schuppler, Emil Berger, Xenia Kogler, Franz Pernkopf
Speaker-Specific Utterance Ensemble based Transfer Attack on Speaker Identification
Chu-Xiao Zuo, Jia-Yi Leng, Wu-Jun Li
Complex Frequency Domain Linear Prediction: A Tool to Compute Modulation Spectrum of Speech
Samik Sadhu, Hynek Hermansky
Spectral Modification Based Data Augmentation For Improving End-to-End ASR For Children’s Speech
Vishwanath Pratap Singh, Hardik Sailor, Supratik Bhattacharya, Abhishek Pandey
End-to-End Joint Modeling of Conversation History-Dependent and Independent ASR Systems with Multi-History Training
Ryo Masumura, Yoshihiro Yamazaki, Saki Mizuno, Naoki Makishima, Mana Ihori, Mihiro Uchida, Hiroshi Sato, Tomohiro Tanaka, Akihiko Takashima, Satoshi Suzuki, Shota Orihashi, Takafumi Moriya, Nobukatsu Hojo, Atsushi Ando
Streaming End-to-End Multilingual Speech Recognition with Joint Language Identification
Chao Zhang, Bo Li, Tara Sainath, Trevor Strohman, Sepand Mavandadi, Shuo-Yiin Chang, Parisa Haghani
An Anchor-Free Detector for Continuous Speech Keyword Spotting
Zhiyuan Zhao, Chuanxin Tang, Chengdong Yao, Chong Luo
Low-complex and Highly-performed Binary Residual Neural Network for Small-footprint Keyword Spotting
Xiao Wang, Song Cheng, Jun Li, Shushan Qiao, Yumei Zhou, Yi Zhan
UniKW-AT: Unified Keyword Spotting and Audio Tagging
Heinrich Dinkel, Yongqing Wang, Zhiyong Yan, Junbo Zhang, Yujun Wang
ESSumm: Extractive Speech Summarization from Untranscribed Meeting
Jun Wang
XTREME-S: Evaluating Cross-lingual Speech Representations
Alexis Conneau, Ankur Bapna, Yu Zhang, Min Ma, Patrick von Platen, Anton Lozhkov, Colin Cherry, Ye Jia, Clara Rivera, Mihir Kale, Daan van Esch, Vera Axelrod, Simran Khanuja, Jonathan Clark, Orhan Firat, Michael Auli, Sebastian Ruder, Jason Riesa, Melvin Johnson
Negative Guided Abstractive Dialogue Summarization
Junpeng Liu, Yanyan Zou, Yuxuan Xi, Shengjie Li, Mian Ma, Zhuoye Ding, Bo Long
Exploring representation learning for small-footprint keyword spotting
Fan Cui, Liyong Guo, Quandong Wang, Peng Gao, Yujun Wang
Large-Scale Streaming End-to-End Speech Translation with Neural Transducers
Jian Xue, Peidong Wang, Jinyu Li, Matt Post, Yashesh Gaur
Phonetic Embedding for ASR Robustness in Entity Resolution
Xiaozhou Zhou, Ruying Bao, William M. Campbell
Hierarchical Tagger with Multi-task Learning for Cross-domain Slot Filling
Xiao Wei, Yuke Si, Shiquan Wang, Longbiao Wang, Jianwu Dang
Multi-class AUC Optimization for Robust Small-footprint Keyword Spotting with Limited Training Data
MengLong Xu, Shengqiang Li, Chengdong Liang, Xiao-Lei Zhang
Weak supervision for Question Type Detection with large language models
Jiří Martínek, Christophe Cerisara, Pavel Kral, Ladislav Lenc, Josef Baloun
BIT-MI Deep Learning-based Model to Non-intrusive Speech Quality Assessment Challenge in Online Conferencing Applications
Miao Liu, Jing Wang, Liang Xu, Jianqian Zhang, Shicong Li, Fei Xiang
MOS Prediction Network for Non-intrusive Speech Quality Assessment in Online Conferencing
Wenjing Liu, Chuan Xie
Non-intrusive Speech Quality Assessment with a Multi-Task Learning based Subband Adaptive Attention Temporal Convolutional Neural Network
Xiaofeng Shu, Yanjie Chen, Chuxiang Shang, Yan Zhao, Chengshuai Zhao, Yehang Zhu, Chuanzeng Huang, Yuxuan Wang
Soft-label Learning for Non-Intrusive Speech Quality Assessment
Junyong Hao, Shunzhou Ye, Cheng Lu, Fei Dong, Jingang Liu, Dong Pi
ConferencingSpeech 2022 Challenge: Non-intrusive Objective Speech Quality Assessment (NISQA) Challenge for Online Conferencing Applications
Gaoxiong Yi, Wei Xiao, Yiming Xiao, Babak Naderi, Sebastian Möller, Wafaa Wardah, Gabriel Mittag, Ross Cutler, Zhuohuang Zhang, Donald S. Williamson, Fei Chen, Fuzheng Yang, Shidong Shang
MOSRA: Joint Mean Opinion Score and Room Acoustics Speech Quality Assessment
Karl El Hajal, Milos Cernak, Pablo Mainar
CCATMos: Convolutional Context-aware Transformer Network for Non-intrusive Speech Quality Assessment
Yuchen Liu, Li-Chia Yang, Alexander Pawlicki, Marko Stamenovic
Impairment Representation Learning for Speech Quality Assessment
Lianwu Chen, Xinlei Ren, Xu Zhang, Xiguang Zheng, Chen Zhang, Liang Guo, Bing Yu
Exploring linguistic feature and model combination for speech recognition based automatic AD detection
Yi Wang, Tianzi Wang, Zi Ye, Lingwei Meng, Shoukang Hu, Xixin Wu, Xunying Liu, Helen Meng
ECAPA-TDNN Based Depression Detection from Clinical Speech
Dong Wang, Yanhui Ding, Qing Zhao, Peilin Yang, Shuping Tan, Ya Li
A Step Towards Preserving Speakers’ Identity While Detecting Depression Via Speaker Disentanglement
Vijay Ravi, Jinhan Wang, Jonathan Flint, Abeer Alwan
Toward Corpus Size Requirements for Training and Evaluating Depression Risk Models Using Spoken Language
Tomasz Rutowski, Amir Harati, Elizabeth Shriberg, Yang Lu, Piotr Chlebek, Ricardo Oliveira
Deep Learning Approaches for Detecting Alzheimer’s Dementia from Conversational Speech of ILSE Study
Ayimnisagul Ablimit, Karen Scholz, Tanja Schultz
Multimodal Depression Severity Score Prediction Using Articulatory Coordination Features and Hierarchical Attention Based Text Embeddings
Nadee Seneviratne, Carol Espy-Wilson
ASR Error Detection via Audio-Transcript Entailment
Nimshi Venkat Meripo, Sandeep Konam
CopyCat2: A Single Model for Multi-Speaker TTS and Many-to-Many Fine-Grained Prosody Transfer
Sri Karlapati, Penny Karanasou, Mateusz Łajszczak, Syed Ammar Abbas, Alexis Moinet, Peter Makarov, Ray Li, Arent van Korlaar, Simon Slangen, Thomas Drugman
Simple and Effective Multi-sentence TTS with Expressive and Coherent Prosody
Peter Makarov, Syed Ammar Abbas, Mateusz Łajszczak, Arnaud Joly, Sri Karlapati, Alexis Moinet, Thomas Drugman, Penny Karanasou
Acoustic Modeling for End-to-End Empathetic Dialogue Speech Synthesis Using Linguistic and Prosodic Contexts of Dialogue History
Yuto Nishimura, Yuki Saito, Shinnosuke Takamichi, Kentaro Tachibana, Hiroshi Saruwatari
Emphasis Control for Parallel Neural TTS
Shreyas Seshadri, Tuomo Raitio, Dan Castellani, Jiangchuan Li
BERT, can HE predict contrastive focus? Predicting and controlling prominence in neural TTS using a language model
Brooke Stephenson, Laurent Besacier, Laurent Girin, Thomas Hueber
Combining conversational speech with read speech to improve prosody in Text-to-Speech synthesis
Johannah O'Mahony, Catherine Lai, Simon King
Unsupervised Data Selection via Discrete Speech Representation for ASR
Zhiyun Lu, Yongqiang Wang, Yu Zhang, Wei Han, Zhehuai Chen, Parisa Haghani
CTRL: Continual Representation Learning to Transfer Information of Pre-trained for WAV2VEC 2.0
Jae-Hong Lee, Chae-Won Lee, Jin-Seong Choi, Joon-Hyuk Chang, Woo Kyeong Seong, Jeonghan Lee
Speaker adaptation for Wav2vec2 based dysarthric ASR
Murali Karthick Baskar, Tim Herzig, Diana Nguyen, Mireia Diez, Tim Polzehl, Lukas Burget, Jan Černocký
Non-Parallel Voice Conversion for ASR Augmentation
Gary Wang, Andrew Rosenberg, Bhuvana Ramabhadran, Fadi Biadsy, Jesse Emond, Yinghui Huang, Pedro J. Moreno
Improved Consistency Training for Semi-Supervised Sequence-to-Sequence ASR via Speech Chain Reconstruction and Self-Transcribing
Heli Qi, Sashi Novitasari, Sakriani Sakti, Satoshi Nakamura
Joint Encoder-Decoder Self-Supervised Pre-training for ASR
A Arunkumar, Srinivasan Umesh
An overview of discourse clicks in Central Swedish
Margaret Zellers
VOT and F0 perturbations for the realization of voicing contrast in Tohoku Japanese
Hiroto Noguchi, Sanae Matsui, Naoya Watabe, Chuyu Huang, Ayako Hashimoto, Ai Mizoguchi, Mafuyu Kitahara
Complex sounds and cross-language influence: The case of ejectives in Omani Mehri
Rachid Ridouane, Philipp Buech
When Phonetics Meets Morphology: Intervocalic Voicing Within and Across Words in Romance Languages
Mathilde Hutin, Martine Adda-Decker, Lori Lamel, Ioana Vasilescu
The mapping between syntactic and prosodic phrasing in English and Mandarin
Jianjing Kuang, May Pik Yu Chan, Nari Rhee, Mark Liberman, Hongwei Ding
Pharyngealization in Amazigh: Acoustic and articulatory marking over time
Philipp Buech, Rachid Ridouane, Anne Hermes
ASR-Generated Text for Language Model Pre-training Applied to Speech Tasks
Valentin Pelloin, Franck Dary, Nicolas Hervé, Benoit Favre, Nathalie Camelin, Antoine Laurent, Laurent Besacier
Contrastive Learning for Improving ASR Robustness in Spoken Language Understanding
Ya-Hsin Chang, Yun-Nung Chen
Learning Under Label Noise for Robust Spoken Language Understanding systems
Anoop Kumar, Pankaj Kumar Sharma, Aravind Illa, Sriram Venkatapathy, Subhrangshu Nandi, Pritam Varma, Anurag Dwarakanath, Aram Galstyan
Deliberation Model for On-Device Spoken Language Understanding
Duc Le, Akshat Shrivastava, Paden D. Tomasello, Suyoun Kim, Aleksandr Livshits, Ozlem Kalinli, Michael Seltzer
Intent classification using pre-trained language agnostic embeddings for low resource languages
Hemant Yadav, Akshat Gupta, Sai Krishna Rallabandi, Alan W Black, Rajiv Ratn Shah
Two-Pass Low Latency End-to-End Spoken Language Understanding
Siddhant Arora, Siddharth Dalmia, Xuankai Chang, Brian Yan, Alan W Black, Shinji Watanabe
Non-intrusive Speech Intelligibility Metric Prediction for Hearing Impaired Individuals
George Close, Samuel Hollands, Stefan Goetze, Thomas Hain
Exploiting Hidden Representations from a DNN-based Speech Recogniser for Speech Intelligibility Prediction in Hearing-impaired Listeners
Zehai Tu, Ning Ma, Jon Barker
Unsupervised Uncertainty Measures of Automatic Speech Recognition for Non-intrusive Speech Intelligibility Prediction
Zehai Tu, Ning Ma, Jon Barker
Speech Intelligibility Prediction for Hearing-Impaired Listeners with the LEAP Model
Jana Roßbach, Rainer Huber, Saskia Röttges, Christopher F. Hauth, Thomas Biberger, Thomas Brand, Bernd T. Meyer, Jan Rennies
Predicting Speech Intelligibility using the Spike Activity Mutual Information Index
Franklin Alvarez Cardinale, Waldo Nogueira
The 1st Clarity Prediction Challenge: A machine learning challenge for hearing aid intelligibility prediction
Jon Barker, Michael Akeroyd, Trevor J. Cox, John F. Culling, Jennifer Firth, Simone Graetzer, Holly Griffiths, Lara Harris, Graham Naylor, Zuzanna Podwinska, Eszter Porter, Rhoddy Viveros Munoz
Voice Conversion Can Improve ASR in Very Low-Resource Settings
Matthew Baas, Herman Kamper
Data Augmentation for Low-Resource Quechua ASR Improvement
Rodolfo Zevallos, Núria Bel, Guillermo Cámbara, Mireia Farrús, Jordi Luque
ScoutWav: Two-Step Fine-Tuning on Self-Supervised Automatic Speech Recognition for Low-Resource Environments
Kavan Fatehi, Mercedes Torres Torres, Ayse Kucukyilmaz
Semi-supervised Acoustic and Language Modeling for Hindi ASR
Tarun Sai Bandarupalli, Shakti Rath, Nirmesh Shah, Onoe Naoyuki, Sriram Ganapathy
Combining Spectral and Self-Supervised Features for Low Resource Speech Recognition and Translation
Dan Berrebbi, Jiatong Shi, Brian Yan, Osbel López-Francisco, Jonathan Amith, Shinji Watanabe
When Is TTS Augmentation Through a Pivot Language Useful?
Nathaniel Romney Robinson, Perez Ogayo, Swetha R. Gangu, David R. Mortensen, Shinji Watanabe
Low Resource Comparison of Attention-based and Hybrid ASR Exploiting wav2vec 2.0
Aku Rouhe, Anja Virkkunen, Juho Leinonen, Mikko Kurimo
Gram Vaani ASR Challenge on spontaneous telephone speech recordings in regional variations of Hindi
Anish Bhanushali, Grant Bridgman, Deekshitha G, Prasanta Ghosh, Pratik Kumar, Saurabh Kumar, Adithya Raj Kolladath, Nithya Ravi, Aaditeshwar Seth, Ashish Seth, Abhayjeet Singh, Vrunda Sukhadia, Umesh S, Sathvik Udupa, Lodagala V. S. V. Durga Prasad
Audio Similarity is Unreliable as a Proxy for Audio Quality
Pranay Manocha, Zeyu Jin, Adam Finkelstein
Overlapped Frequency-Distributed Network: Frequency-Aware Voice Spoofing Countermeasure
Sunmook Choi, Il-Youp Kwak, Seungsang Oh
Formant Estimation and Tracking using Probabilistic Heat-Maps
Yosi Shrem, Felix Kreuk, Joseph Keshet
Anti-Spoofing Using Transfer Learning with Variational Information Bottleneck
Youngsik Eom, Yeonghyeon Lee, Ji Sub Um, Hoi Rin Kim
Robust Pitch Estimation Using Multi-Branch CNN-LSTM and 1-Norm LP Residual
Mudit D. Batra, Jayesh, C.S. Ramalingam
DeepFry: Identifying Vocal Fry Using Deep Neural Networks
Bronya Roni Chernyak, Talia Ben Simon, Yael Segal, Jeremy Steffman, Eleanor Chodroff, Jennifer Cole, Joseph Keshet
Phonetic Analysis of Self-supervised Representations of English Speech
Dan Wells, Hao Tang, Korin Richmond
FitHuBERT: Going Thinner and Deeper for Knowledge Distillation of Speech Self-Supervised Models
Yeonghyeon Lee, Kangwook Jang, Jahyun Goo, Youngmoon Jung, Hoi Rin Kim
On Combining Global and Localized Self-Supervised Models of Speech
Sri Harsha Dumpala, Chandramouli Shama Sastry, Rudolf Uher, Sageev Oore
Self-supervised Representation Fusion for Speech and Wearable Based Emotion Recognition
Vipula Dissanayake, Sachith Seneviratne, Hussel Suriyaarachchi, Elliott Wen, Suranga Nanayakkara
Towards Disentangled Speech Representations
Cal Peyser, W. Ronny Huang, Andrew Rosenberg, Tara Sainath, Michael Picheny, Kyunghyun Cho
Automatic Assessment of Speech Intelligibility using Consonant Similarity for Head and Neck Cancer
Sebastião Quintas, Julie Mauclair, Virginie Woisard, Julien Pinquier
Compensation in Verbal and Nonverbal Communication after Total Laryngectomy
Marise Neijman, Femke Hof, Noelle Oosterom, Roland Pfau, Bertus van Rooy, Rob J.J.H. van Son, Michiel M.W.M. van den Brekel
wav2vec2-based Speech Rating System for Children with Speech Sound Disorder
Yaroslav Getman, Ragheb Al-Ghezi, Katja Voskoboinik, Tamás Grósz, Mikko Kurimo, Giampiero Salvi, Torbjørn Svendsen, Sofia Strömbergsson
Distinguishing between pre- and post-treatment in the speech of patients with chronic obstructive pulmonary disease
Andreas Triantafyllopoulos, Markus Fendler, Anton Batliner, Maurice Gerczuk, Shahin Amiriparian, Thomas Berghaus, Björn W. Schuller
A Study on the Phonetic Inventory Development of Children with Cochlear Implants for 5 Years after Implantation
Seonwoo Lee, Sunhee Kim, Minhwa Chung
Evaluation of different antenna types and positions in a stepped frequency continuous-wave radar-based silent speech interface
Joao Vitor Menezes, Pouriya Amini Digehsara, Christoph Wagner, Marco Mütze, Michael Bärhold, Petr Schaffer, Dirk Plettemeier, Peter Birkholz
Validation of the Neuro-Concept Detector framework for the characterization of speech disorders: A comparative study including Dysarthria and Dysphonia
Sondes Abderrazek, Corinne Fredouille, Alain Ghio, Muriel Lalain, Christine Meunier, Virginie Woisard
Nonwords Pronunciation Classification in Language Development Tests for Preschool Children
Ilja Baumann, Dominik Wagner, Sebastian Bayerl, Tobias Bocklet
PERCEPT-R: An Open-Access American English Child/Clinical Speech Corpus Specialized for the Audio Classification of /ɹ/
Nina Benway, Jonathan L. Preston, Elaine Hitchcock, Asif Salekin, Harshit Sharma, Tara McAllister
Data Augmentation for End-to-end Silent Speech Recognition for Laryngectomees
Beiming Cao, Kristin Teplansky, Nordine Sebkhi, Arpan Bhavsar, Omer Inan, Robin Samlan, Ted Mau, Jun Wang
Statistical and clinical utility of multimodal dialogue-based speech and facial metrics for Parkinson's disease assessment
Hardik Kothare, Michael Neumann, Jackson Liscombe, Oliver Roesler, William Burke, Andrew Exner, Sandy Snyder, Andrew Cornish, Doug Habberstad, David Pautler, David Suendermann-Oeft, Jessica Huber, Vikram Ramanarayanan
Evaluation of call centre conversations based on a high-level symbolic representation
Leticia Arco, Carlos Mosquera, Fabjola Braho, Yisel Clavel, Johan Loeckx
Evoc-Learn — High quality simulation of early vocal learning
Yi Xu, Anqi Xu, Daniel R. van Niekerk, Branislav Gerazov, Peter Birkholz, Paul Konstantin Krug, Santitham Prom-on, Lorna F. Halliday
Watch Me Speak: 2D Visualization of Human Mouth during Speech
C Siddarth, Sathvik Udupa, Prasanta Kumar Ghosh
Classification of Accented English Using CNN Model Trained on Amplitude Mel-Spectrograms
Mariia Lesnichaia, Veranika Mikhailava, Natalia Bogach, Iurii Lezhenin, John Blake, Evgeny Pyshkin
MIM-DG: Mutual information minimization-based domain generalization for speaker verification
Woohyun Kang, Md Jahangir Alam, Abderrahim Fathan
Multi-Channel Far-Field Speaker Verification with Large-Scale Ad-hoc Microphone Arrays
Chengdong Liang, Yijiang Chen, Jiadi Yao, Xiao-Lei Zhang
Ant Multilingual Recognition System for OLR 2021 Challenge
Anqi Lyu, Zhiming Wang, Huijia Zhu
Class-Aware Distribution Alignment based Unsupervised Domain Adaptation for Speaker Verification
Hang-Rui Hu, Yan Song, Li-Rong Dai, Ian McLoughlin, Lin Liu
EDITnet: A Lightweight Network for Unsupervised Domain Adaptation in Speaker Verification
Jingyu Li, Wei Liu, Tan Lee
Why does Self-Supervised Learning for Speech Recognition Benefit Speaker Recognition?
Sanyuan Chen, Yu Wu, Chengyi Wang, Shujie Liu, Zhuo Chen, Peidong Wang, Gang Liu, Jinyu Li, Jian Wu, Xiangzhan Yu, Furu Wei
Audio Visual Multi-Speaker Tracking with Improved GCF and PMBM Filter
Jinzheng Zhao, Peipei Wu, Xubo Liu, Shidrokh Goudarzi, Haohe Liu, Yong Xu, Wenwu Wang
The HCCL System for the NIST SRE21
Zhuo Li, Runqiu Xiao, Hangting Chen, Zhenduo Zhao, Zihan Zhang, Wenchao Wang
UNet-DenseNet for Robust Far-Field Speaker Verification
Zhenke Gao, Manwai Mak, Weiwei Lin
Linguistic-Acoustic Similarity Based Accent Shift for Accent Recognition
Qijie Shao, Jinghao Yan, Jian Kang, Pengcheng Guo, Xian Shi, Pengfei Hu, Lei Xie
Transducer-based language embedding for spoken language identification
Peng Shen, Xugang Lu, Hisashi Kawai
Oriental Language Recognition (OLR) 2021: Summary and Analysis
Binling Wang, Feng Wang, Wenxuan Hu, Qiulin Wang, Jing Li, Dong Wang, Lin Li, Qingyang Hong
Mixup regularization strategies for spoofing countermeasure system
Woohyun Kang, Md Jahangir Alam, Abderrahim Fathan
Low-resource Low-footprint Wake-word Detection using Knowledge Distillation
Arindam Ghosh, Mark Fuhs, Deblin Bagchi, Bahman Farahani, Monika Woszczyna
Personal VAD 2.0: Optimizing Personal Voice Activity Detection for On-Device Speech Recognition
Shaojin Ding, Rajeev Rikhye, Qiao Liang, Yanzhang He, Quan Wang, Arun Narayanan, Tom O’Malley, Ian McGraw
Token-level Speaker Change Detection Using Speaker Difference and Speech Content via Continuous Integrate-and-fire
Zhiyun Fan, Zhenlin Liang, Linhao Dong, Yi Liu, Shiyu Zhou, Meng Cai, Jun Zhang, Zejun Ma, Bo Xu
NAS-VAD: Neural Architecture Search for Voice Activity Detection
Daniel Rho, Jinhyeok Park, Jong Hwan Ko
Adversarial Multi-Task Deep Learning for Noise-Robust Voice Activity Detection with Low Algorithmic Delay
Claus Larsen, Peter Koch, Zheng-Hua Tan
Rainbow Keywords: Efficient Incremental Learning for Online Spoken Keyword Spotting
Yang Xiao, Nana Hou, Eng Siong Chng
Filler Word Detection and Classification: A Dataset and Benchmark
Ge Zhu, Juan-Pablo Caceres, Justin Salamon
Streaming Multi-Talker ASR with Token-Level Serialized Output Training
Naoyuki Kanda, Jian Wu, Yu Wu, Xiong Xiao, Zhong Meng, Xiaofei Wang, Yashesh Gaur, Zhuo Chen, Jinyu Li, Takuya Yoshioka
pMCT: Patched Multi-Condition Training for Robust Speech Recognition
Pablo Peso Parada, Agnieszka Dobrowolska, Karthikeyan Saravanan, Mete Ozay
Improving ASR Robustness in Noisy Condition Through VAD Integration
Sashi Novitasari, Takashi Fukuda, Gakuto Kurata
Empirical Sampling from Latent Utterance-wise Evidence Model for Missing Data ASR based on Neural Encoder-Decoder Model
Ryu Takeda, Yui Sudo, Kazuhiro Nakadai, Kazunori Komatani
Coarse-Grained Attention Fusion With Joint Training Framework for Complex Speech Enhancement and End-to-End Speech Recognition
Xuyi Zhuang, Lu Zhang, Zehua Zhang, Yukun Qian, Mingjiang Wang
DENT-DDSP: Data-efficient noisy speech generator using differentiable digital signal processors for explicit distortion modelling and noise-robust speech recognition
Zixun Guo, Chen Chen, Eng Siong Chng
Improving Transformer-based Conversational ASR by Inter-Sentential Attention Mechanism
Kun Wei, Pengcheng Guo, Ning Jiang
Federated Self-supervised Speech Representations: Are We There Yet?
Yan Gao, Javier Fernandez-Marques, Titouan Parcollet, Abhinav Mehrotra, Nicholas Lane
Leveraging Real Conversational Data for Multi-Channel Continuous Speech Separation
Xiaofei Wang, Dongmei Wang, Naoyuki Kanda, Sefik Emre Eskimez, Takuya Yoshioka
End-to-End Integration of Speech Recognition, Speech Enhancement, and Self-Supervised Learning Representation
Xuankai Chang, Takashi Maekaku, Yuya Fujita, Shinji Watanabe
Weakly-Supervised Neural Full-Rank Spatial Covariance Analysis for a Front-End System of Distant Speech Recognition
Yoshiaki Bando, Takahiro Aizawa, Katsutoshi Itoyama, Kazuhiro Nakadai
A universally-deployable ASR frontend for joint acoustic echo cancellation, speech enhancement, and voice separation
Thomas R. O'Malley, Arun Narayanan, Quan Wang
Speaker conditioned acoustic modeling for multi-speaker conversational ASR
Srikanth Raj Chetupalli, Sriram Ganapathy
Hear No Evil: Towards Adversarial Robustness of Automatic Speech Recognition via Multi-Task Learning
Nilaksh Das, Polo Chau
Tandem Multitask Training of Speaker Diarisation and Speech Recognition for Meeting Transcription
Xianrui Zheng, Chao Zhang, Phil Woodland
Investigating the Impact of Crosslingual Acoustic-Phonetic Similarities on Multilingual Speech Recognition
Muhammad Umar Farooq, Thomas Hain
An Improved Deliberation Network with Text Pre-training for Code-Switching Automatic Speech Recognition
Zhijie Shen, Wu Guo
CyclicAugment: Speech Data Random Augmentation with Cosine Annealing Scheduler for Automatic Speech Recognition
Zhihan Wang, Feng Hou, Yuanhang Qiu, Zhizhong Ma, Satwinder Singh, Ruili Wang
Prompt-based Re-ranking Language Model for ASR
Mengxi Nie, Ming Yan, Caixia Gong
Avoid Overfitting User Specific Information in Federated Keyword Spotting
Xin-Chun Li, Jin-Lin Tang, Shaoming Song, Bingshuai Li, Yinchuan Li, Yunfeng Shao, Le Gan, De-Chuan Zhan
ASR Error Correction with Constrained Decoding on Operation Prediction
Jingyuan Yang, Rongjun Li, Wei Peng
Adaptive multilingual speech recognition with pretrained models
Ngoc-Quan Pham, Alexander Waibel, Jan Niehues
Vietnamese Capitalization and Punctuation Recovery Models
Hoang Thi Thu Uyen, Nguyen Anh Tu, Ta Duc Huy
Non-autoregressive Error Correction for CTC-based ASR with Phone-conditioned Masked LM
Hayato Futami, Hirofumi Inaguma, Sei Ueno, Masato Mimura, Shinsuke Sakai, Tatsuya Kawahara
Reducing multilingual context confusion for end-to-end code-switching automatic speech recognition
Shuai Zhang, Jiangyan Yi, Zhengkun Tian, Jianhua Tao, Yu Ting Yeung, Liqun Deng
Residual Language Model for End-to-end Speech Recognition
Emiru Tsunoo, Yosuke Kashiwagi, Chaitanya Prasad Narisetty, Shinji Watanabe
An Empirical Study of Language Model Integration for Transducer based Speech Recognition
Huahuan Zheng, Keyu An, Zhijian Ou, Chen Huang, Ke Ding, Guanglu Wan
Self-Normalized Importance Sampling for Neural Language Modeling
Zijian Yang, Yingbo Gao, Alexander Gerstenberger, Jintao Jiang, Ralf Schlüter, Hermann Ney
Improving Contextual Recognition of Rare Words with an Alternate Spelling Prediction Model
Jennifer Fox, Natalie Delworth
Effect and Analysis of Large-scale Language Model Rescoring on Competitive ASR Systems
Takuma Udagawa, Masayuki Suzuki, Gakuto Kurata, Nobuyasu Itoh, George Saon
Language-specific Characteristic Assistance for Code-switching Speech Recognition
Tongtong Song, Qiang Xu, Meng Ge, Longbiao Wang, Hao Shi, Yongjie Lv, Yuqin Lin, Jianwu Dang
Speech intelligibility of simulated hearing loss sounds and its prediction using the Gammachirp Envelope Similarity Index (GESI)
Toshio Irino, Honoka Tamaru, Ayako Yamamoto
ELO-SPHERES intelligibility prediction model for the Clarity Prediction Challenge 2022
Mark Huckvale, Gaston Hilkhuysen
Listening with Googlears: Low-Latency Neural Multiframe Beamforming and Equalization for Hearing Aids
Samuel Yang, Scott Wisdom, Chet Gnegy, Richard F. Lyon, Sagar Savla
MBI-Net: A Non-Intrusive Multi-Branched Speech Intelligibility Prediction Model for Hearing Aids
Ryandhimas Edo Zezario, Fei Chen, Chiou-Shann Fuh, Hsin-Min Wang, Yu Tsao
Squashed Weight Distribution for Low Bit Quantization of Deep Models
Nikko Strom, Haidar Khan, Wael Hamza
Evaluating the Performance of State-of-the-Art ASR Systems on Non-Native English using Corpora with Extensive Language Background Variation
Samuel Hollands, Daniel Blackburn, Heidi Christensen
Seq-2-Seq based Refinement of ASR Output for Spoken Name Capture
Karan Singla, Shahab Jalalvand, Yeon-Jun Kim, Ryan Price, Daniel Pressel, Srinivas Bangalore
Qualitative Evaluation of Language Model Rescoring in Automatic Speech Recognition
Thibault Bañeras Roux, Mickael Rouvier, Jane Wottawa, Richard Dufour
Toward Zero Oracle Word Error Rate on the Switchboard Benchmark
Arlo Faria, Adam Janin, Sidhi Adkoli, Korbinian Riedhammer
Evaluating User Perception of Speech Recognition System Quality with Semantic Distance Metric
Suyoun Kim, Duc Le, Weiyi Zheng, Tarun Singh, Abhinav Arora, Xiaoyu Zhai, Christian Fuegen, Ozlem Kalinli, Michael Seltzer
Predicting Emotional Intensity in Political Debates via Non-verbal Signals
Jeewoo Yoon, Jinyoung Han, Erik Bucy, Jungseock Joo
Confusion Detection for Adaptive Conversational Strategies of An Oral Proficiency Assessment Interview Agent
Mao Saeki, Kotoka Miyagi, Shinya Fujie, Shungo Suzuki, Tetsuji Ogawa, Tetsunori Kobayashi, Yoichi Matsuyama
Deep Learning for Prosody-Based Irony Classification in Spontaneous Speech
Helen Gent, Chase Adams, Yan Tang, Chilin Shih
Span Classification with Structured Information for Disfluency Detection in Spoken Utterances
Sreyan Ghosh, Sonal Kumar, Yaman Kumar, Rajiv Ratn Shah, Srinivasan Umesh
Example-based Explanations with Adversarial Attacks for Respiratory Sound Analysis
Yi Chang, Zhao Ren, Thanh Tam Nguyen, Wolfgang Nejdl, Björn W. Schuller
Which Model is Best: Comparing Methods and Metrics for Automatic Laughter Detection in a Naturalistic Conversational Dataset
Gordon Rennie, Olga Perepelkina, Alessandro Vinciarelli
Self-supervised Speaker Diarization
Yehoshua Dissen, Felix Kreuk, Joseph Keshet
Label-Efficient Self-Supervised Speaker Verification With Information Maximization and Contrastive Learning
Theo Lepage, Reda Dehak
Attack Agnostic Dataset: Towards Generalization and Stabilization of Audio DeepFake Detection
Piotr Kawa, Marcin Plata, Piotr Syga
Non-contrastive self-supervised learning of utterance-level speech representations
Jaejin Cho, Raghavendra Pappagari, Piotr Żelasko, Laureano Moro Velazquez, Jesus Villalba, Najim Dehak
Barlow Twins self-supervised learning for robust speaker recognition
Mohammad Mohammadamini, Driss Matrouf, Jean-Francois Bonastre, Sandipana Dowerah, Romain Serizel, Denis Jouvet
Relating the fundamental frequency of speech with EEG using a dilated convolutional network
Corentin Puffay, Jana Van Canneyt, Jonas Vanthornhout, Hugo Van hamme, Tom Francart
Prediction of L2 speech proficiency based on multi-level linguistic features
Verdiana De Fino, Lionel Fontan, Julien Pinquier, Isabelle Ferrané, Sylvain Detey
The effect of increasing acoustic and linguistic complexity on auditory processing: an EEG study
Fareeha S. Rana, Daniel Pape, Elisabet Service
Recording and timing vocal responses in online experimentation
Katrina Kechun Li, Julia Schwarz, Jasper Hong Sim, Yixin Zhang, Elizabeth Buchanan-Worster, Brechtje Post, Kirsty McDougall
Neural correlates of acoustic and semantic cues during speech segmentation in French
Maria del Mar Cordero, Ambre Denis-Noël, Elsa Spinelli, Fanny Meunier
Evidence of Onset and Sustained Neural Responses to Isolated Phonemes from Intracranial Recordings in a Voice-based Cursor Control Task
Kevin Meng, Seo-Hyun Lee, Farhad Goodarzy, Simon Vogrin, Mark J. Cook, Seong-Whan Lee, David B. Grayden
End-to-end model for named entity recognition from speech without paired training data
Salima Mdhaffar, Jarod Duret, Titouan Parcollet, Yannick Estève
Multitask Learning for Low Resource Spoken Language Understanding
Quentin Meeus, Marie Francine Moens, Hugo Van hamme
Transformer Networks for Non-Intrusive Speech Quality Prediction
M K Jayesh, Mukesh Sharma, Praneeth Vonteddu, Mahaboob Ali Basha Shaik, Sriram Ganapathy
Pre-trained Speech Representations as Feature Extractors for Speech Quality Assessment in Online Conferencing Applications
Bastiaan Tamm, Helena Balabin, Rik Vandenberghe, Hugo Van hamme
Exploring the influence of fine-tuning data on wav2vec 2.0 model for blind speech quality prediction
Helard Becerra, Alessandro Ragano, Andrew Hines
MAESTRO: Matched Speech Text Representations through Modality Matching
Zhehuai Chen, Yu Zhang, Andrew Rosenberg, Bhuvana Ramabhadran, Pedro J. Moreno, Ankur Bapna, Heiga Zen
FiLM Conditioning with Enhanced Feature to the Transformer-based End-to-End Noisy Speech Recognition
Da-Hee Yang, Joon-Hyuk Chang
SepTr: Separable Transformer for Audio Spectrogram Processing
Nicolae-Catalin Ristea, Radu Tudor Ionescu, Fahad Shahbaz Khan
End-to-End Spontaneous Speech Recognition Using Disfluency Labeling
Koharu Horii, Meiko Fukuda, Kengo Ohta, Ryota Nishimura, Atsunori Ogawa, Norihide Kitaoka
Recent improvements of ASR models in the face of adversarial attacks
Raphael Olivier, Bhiksha Raj
Similarity and Content-based Phonetic Self Attention for Speech Recognition
Kyuhong Shim, Wonyong Sung
Generalizing RNN-Transducer to Out-Domain Audio via Sparse Self-Attention Layers
Juntae Kim, Jeehye Lee
Knowledge distillation for In-memory keyword spotting model
Zeyang Song, Qi Liu, Qu Yang, Haizhou Li
Automatic Learning of Subword Dependent Model Scales
Felix Meyer, Wilfried Michel, Mohammad Zeineldeen, Ralf Schlüter, Hermann Ney
Bayesian Recurrent Units and the Forward-Backward Algorithm
Alexandre Bittar, Philip N. Garner
On Metric Learning for Audio-Text Cross-Modal Retrieval
Xinhao Mei, Xubo Liu, Jianyuan Sun, Mark Plumbley, Wenwu Wang
CT-SAT: Contextual Transformer for Sequential Audio Tagging
Yuanbo Hou, Zhaoyi Liu, Bo Kang, Yun Wang, Dick Botteldooren
ADFF: Attention Based Deep Feature Fusion Approach for Music Emotion Recognition
Zi Huang, Shulei Ji, Zhilan Hu, Chuangjian Cai, Jing Luo, Xinyu Yang
Audio-Visual Scene Classification Based on Multi-modal Graph Fusion
Han Lei, Ning Chen
MusicNet: Compact Convolutional Neural Network for Real-time Background Music Detection
Chandan Reddy, Vishak Gopal, Harishchandra Dubey, Ross Cutler, Sergiy Matusevych, Robert Aichner
iCNN-Transformer: An improved CNN-Transformer with Channel-spatial Attention and Keyword Prediction for Automated Audio Captioning
Kun Chen, Jun Wang, Feng Deng, Xiaorui Wang
ATST: Audio Representation Learning with Teacher-Student Transformer
Xian Li, Xiaofei Li
Deep Segment Model for Acoustic Scene Classification
Yajian Wang, Jun Du, Hang Chen, Qing Wang, Chin-Hui Lee
Novel Augmentation Schemes for Device Robust Acoustic Scene Classification
Sukanya Sonowal, Anish Tamse
WideResNet with Joint Representation Learning and Data Augmentation for Cover Song Identification
Shichao Hu, Bin Zhang, Jinhong Lu, Yiliang Jiang, Wucheng Wang, Lingcheng Kong, Weifeng Zhao, Tao Jiang
Impact of Acoustic Event Tagging on Scene Classification in a Multi-Task Learning Framework
Rahil Parikh, Harshavardhan Sundar, Ming Sun, Chao Wang, Spyros Matsoukas
Introducing Auxiliary Text Query-modifier to Content-based Audio Retrieval
Daiki Takeuchi, Yasunori Ohishi, Daisuke Niizumi, Noboru Harada, Kunio Kashino
Speaker recognition-assisted robust audio deepfake detection
Jiahui Pan, Shuai Nie, Hui Zhang, Shulin He, Kanghao Zhang, Shan Liang, Xueliang Zhang, Jianhua Tao
Preventing sensitive-word recognition using self-supervised learning to preserve user-privacy for automatic speech recognition
Yuchen Liu, Apu Kapadia, Donald Williamson
NESC: Robust Neural End-2-End Speech Coding with GANs
Nicola Pia, Kishan Gupta, Srikanth Korse, Markus Multrus, Guillaume Fuchs
Towards Error-Resilient Neural Speech Coding
Huaying Xue, Xiulian Peng, Xue Jiang, Yan Lu
Cross-Scale Vector Quantization for Scalable Neural Speech Coding
Xue Jiang, Xiulian Peng, Huaying Xue, Yuan Zhang, Yan Lu
Neural Vocoder is All You Need for Speech Super-resolution
Haohe Liu, Woosung Choi, Xubo Liu, Qiuqiang Kong, Qiao Tian, DeLiang Wang
VoiceFixer: A Unified Framework for High-Fidelity Speech Restoration
Haohe Liu, Xubo Liu, Qiuqiang Kong, Qiao Tian, Yan Zhao, DeLiang Wang, Chuanzeng Huang, Yuxuan Wang
Generating gender-ambiguous voices for privacy-preserving speech recognition
Dimitrios Stoidis, Andrea Cavallaro
Opencpop: A High-Quality Open Source Chinese Popular Song Corpus for Singing Voice Synthesis
Yu Wang, Xinsheng Wang, Pengcheng Zhu, Jie Wu, Hanzhao Li, Heyang Xue, Yongmao Zhang, Lei Xie, Mengxiao Bi
Exploring Timbre Disentanglement in Non-Autoregressive Cross-Lingual Text-to-Speech
Haoyue Zhan, Xinyuan Yu, Haitong Zhang, Yang Zhang, Yue Lin
WeSinger: Data-augmented Singing Voice Synthesis with Auxiliary Losses
Zewang Zhang, Yibin Zheng, Xinhui Li, Li Lu
Decoupled Pronunciation and Prosody Modeling in Meta-Learning-based Multilingual Speech Synthesis
Yukun Peng, Zhenhua Ling
KaraTuner: Towards End-to-End Natural Pitch Correction for Singing Voice in Karaoke
Xiaobin Zhuang, Huiran Yu, Weifeng Zhao, Tao Jiang, Peng Hu
Learn2Sing 2.0: Diffusion and Mutual Information-Based Target Speaker SVS by Learning from Singing Teacher
Heyang Xue, Xinsheng Wang, Yongmao Zhang, Lei Xie, Pengcheng Zhu, Mengxiao Bi
SingAug: Data Augmentation for Singing Voice Synthesis with Cycle-consistent Training Strategy
Shuai Guo, Jiatong Shi, Tao Qian, Shinji Watanabe, Qin Jin
Muskits: an End-to-end Music Processing Toolkit for Singing Voice Synthesis
Jiatong Shi, Shuai Guo, Tao Qian, Tomoki Hayashi, Yuning Wu, Fangzheng Xu, Xuankai Chang, Huazhe Li, Peter Wu, Shinji Watanabe, Qin Jin
Pronunciation Dictionary-Free Multilingual Speech Synthesis by Combining Unsupervised and Supervised Phonetic Representations
Chang Liu, Zhen-Hua Ling, Ling-Hui Chen
Towards high-fidelity singing voice conversion with acoustic reference and contrastive predictive coding
Chao Wang, Zhonghao Li, Benlai Tang, Xiang Yin, Yuan Wan, Yibiao Yu, Zejun Ma
Towards Improving the Expressiveness of Singing Voice Synthesis with BERT Derived Semantic Information
Shaohuan Zhou, Shun Lei, Weiya You, Deyi Tuo, Yuren You, Zhiyong Wu, Shiyin Kang, Helen Meng
Normalization of code-switched text for speech synthesis
Sreeram Manghat, Sreeja Manghat, Tanja Schultz
Synthesizing Near Native-accented Speech for a Non-native Speaker by Imitating the Pronunciation and Prosody of a Native Speaker
Raymond Chung, Brian Mak
A Hierarchical Speaker Representation Framework for One-shot Singing Voice Conversion
Xu Li, Shansong Liu, Ying Shan
Self-Supervised Learning with Multi-Target Contrastive Coding for Non-Native Acoustic Modeling of Mispronunciation Verification
Longfei Yang, Jinsong Zhang, Takahiro Shinozaki
L2-GEN: A Neural Phoneme Paraphrasing Approach to L2 Speech Synthesis for Mispronunciation Diagnosis
Daniel Zhang, Ashwinkumar Ganesan, Sarah Campbell, Daniel Korzekwa
Challenges remain in Building ASR for Spontaneous Preschool Children Speech in Naturalistic Educational Environments
Satwik Dutta, Sarah Anne Tao, Jacob C. Reyna, Rebecca Elizabeth Hacker, Dwight W. Irvin, Jay F. Buzhardt, John H.L. Hansen
End-to-end Mispronunciation Detection with Simulated Error Distance
Zhan Zhang, Yuehai Wang, Jianyi Yang
BiCAPT: Bidirectional Computer-Assisted Pronunciation Training with Normalizing Flows
Zhan Zhang, Yuehai Wang, Jianyi Yang
Using Fluency Representation Learned from Sequential Raw Features for Improving Non-native Fluency Scoring
Kaiqi Fu, Shaojun Gao, Xiaohai Tian, Wei Li, Zejun Ma
An Alignment Method Leveraging Articulatory Features for Mispronunciation Detection and Diagnosis in L2 English
Qi Chen, BingHuai Lin, YanLu Xie
RefTextLAS: Reference Text Biased Listen, Attend, and Spell Model For Accurate Reading Evaluation
Phani Sankar Nidadavolu, Na Xu, Nick Jutila, Ravi Teja Gadde, Aswarth Abhilash Dara, Joseph Savold, Sapan Patel, Aaron Hoff, Veerdhawal Pande, Kevin Crews, Ankur Gandhe, Ariya Rastrow, Roland Maas
CoCA-MDD: A Coupled Cross-Attention based Framework for Streaming Mispronunciation Detection and Diagnosis
Nianzu Zheng, Liqun Deng, Wenyong Huang, Yu Ting Yeung, Baohua Xu, Yuanyuan Guo, Yasheng Wang, Xiao Chen, Xin Jiang, Qun Liu
Spoofing-Aware Speaker Verification by Multi-Level Fusion
Haibin Wu, Lingwei Meng, Jiawen Kang, Jinchao Li, Xu Li, Xixin Wu, Hung-yi Lee, Helen Meng
End-to-end framework for spoof-aware speaker verification
Woohyun Kang, Md Jahangir Alam, Abderrahim Fathan
The CLIPS System for 2022 Spoofing-Aware Speaker Verification Challenge
Jucai Lin, Tingwei Chen, Jingbiao Huang, Ruidong Fang, Jun Yin, Yuanping Yin, Wei Shi, Weizhen Huang, Yapeng Mao
Norm-constrained Score-level Ensemble for Spoofing Aware Speaker Verification
Peng Zhang, Peng Hu, Xueliang Zhang
SASV Based on Pre-trained ASV System and Integrated Scoring Module
Yuxiang Zhang, Zhuo Li, Wenchao Wang, Pengyuan Zhang
Backend Ensemble for Speaker Verification and Spoofing Countermeasure
Li Zhang, Yue Li, Huan Zhao, Qing Wang, Lei Xie
NRI-FGSM: An Efficient Transferable Adversarial Attack for Speaker Recognition Systems
Hao Tan, Junjian Zhang, Huan Zhang, Le Wang, Yaguan Qian, Zhaoquan Gu
SA-SASV: An End-to-End Spoof-Aggregated Spoofing-Aware Speaker Verification System
Zhongwei Teng, Quchen Fu, Jules White, Maria Powell, Douglas Schmidt
The DKU-OPPO System for the 2022 Spoofing-Aware Speaker Verification Challenge
Xingming Wang, Xiaoyi Qin, Yikang Wang, Yunfei Xu, Ming Li
NU-Wave 2: A General Neural Audio Upsampling Model for Various Sampling Rates
Seungu Han, Junhyeok Lee
SelfRemaster: Self-Supervised Speech Restoration with Analysis-by-Synthesis Approach Using Channel Modeling
Takaaki Saeki, Shinnosuke Takamichi, Tomohiko Nakamura, Naoko Tanji, Hiroshi Saruwatari
Optimization of Deep Neural Network (DNN) Speech Coder Using a Multi Time Scale Perceptual Loss Function
Joon Byun, Seungmin Shin, Jongmo Sung, Seungkwon Beack, Youngcheol Park
Phase Vocoder For Time Stretch Based On Center Frequency Estimation
Donghyeon Kim, Bowon Lee
Ultra-Low-Bitrate Speech Coding with Pretrained Transformers
Ali Siahkoohi, Michael Chinen, Tom Denton, W. Bastiaan Kleijn, Jan Skoglund
Analyzing Language-Independent Speaker Anonymization Framework under Unseen Conditions
Xiaoxiao Miao, Xin Wang, Erica Cooper, Junichi Yamagishi, Natalia Tomashenko
ConvRNN-T: Convolutional Augmented Recurrent Neural Network Transducers for Streaming Speech Recognition
Martin Radfar, Rohit Barnwal, Rupak Vignesh Swaminathan, Feng-Ju Chang, Grant P. Strimel, Nathan Susanj, Athanasios Mouchtaris
Knowledge Distillation via Module Replacing for Automatic Speech Recognition with Recurrent Neural Network Transducer
Kaiqi Zhao, Hieu Nguyen, Animesh Jain, Nathan Susanj, Athanasios Mouchtaris, Lokesh Gupta, Ming Zhao
Memory-Efficient Training of RNN-Transducer with Sampled Softmax
Jaesong Lee, Lukas Lee, Shinji Watanabe
Multiple-hypothesis RNN-T Loss for Unsupervised Fine-tuning and Self-training of Neural Transducer
Cong-Thanh Do, Mohan Li, Rama Doddipatla
Separator-Transducer-Segmenter: Streaming Recognition and Segmentation of Multi-party Speech
Ilya Sklyar, Anna Piunova, Christian Osendorfer
Variations of multi-task learning for spoken language assessment
Jeremy Heng Meng Wong, Huayun Zhang, Nancy Chen
Detection of Learners' Listening Breakdown with Oral Dictation and Its Use to Model Listening Skill Improvement Exclusively Through Shadowing
Takuya Kunihara, Chuanbo Zhu, Daisuke Saito, Nobuaki Minematsu, Noriko Nakanishi
Automatic Prosody Evaluation of L2 English Read Speech in Reference to Accent Dictionary with Transformer Encoder
Yu Suzuki, Tsuneo Kato, Akihiro Tamura
View-Specific Assessment of L2 Spoken English
Stefano Bannò, Bhanu Balusu, Mark Gales, Kate Knill, Konstantinos Kyriakopoulos
The Effects of Implicit and Explicit Feedback in an ASR-based Reading Tutor for Dutch First-graders
Yu Bai, Ferdy Hubers, Catia Cucchiarini, Roeland van Hout, Helmer Strik
Improving Mispronunciation Detection with Wav2vec2-based Momentum Pseudo-Labeling for Accentedness and Intelligibility Assessment
Mu Yang, Kevin Hirschi, Stephen Daniel Looney, Okim Kang, John H.L. Hansen
Response Timing Estimation for Spoken Dialog System using Dialog Act Estimation
Jin Sakuma, Shinya Fujie, Tetsunori Kobayashi
Hesitations in Urdu/Hindi: Distribution and Properties of Fillers & Silences
Farhat Jabeen, Simon Betz
Interpretability of Speech Emotion Recognition modelled using Self-Supervised Speech and Text Pre-Trained Embeddings
K V Vijay Girish, Srikanth Konjeti, Jithendra Vepa
Does Utterance entails Intent?: Evaluating Natural Language Inference Based Setup for Few-Shot Intent Detection
Ayush Kumar, Vijit Malik, Jithendra Vepa
Investigating perception of spoken dialogue acceptability through surprisal
Sarenne Carrol Wallbridge, Catherine Lai, Peter Bell
Low-Latency Online Streaming VideoQA Using Audio-Visual Transformers
Chiori Hori, Takaaki Hori, Jonathan Le Roux
The ZevoMOS entry to VoiceMOS Challenge 2022
Adriana Stan
UTMOS: UTokyo-SaruLab System for VoiceMOS Challenge 2022
Takaaki Saeki, Detai Xin, Wataru Nakata, Tomoki Koriyama, Shinnosuke Takamichi, Hiroshi Saruwatari
Automatic Mean Opinion Score Estimation with Temporal Modulation Features on Gammatone Filterbank for Speech Assessment
Huy Nguyen, Kai Li, Masashi Unoki
Using Rater and System Metadata to Explain Variance in the VoiceMOS Challenge 2022 Dataset
Michael Chinen, Jan Skoglund, Chandan K. A. Reddy, Alessandro Ragano, Andrew Hines
The VoiceMOS Challenge 2022
Wen Chin Huang, Erica Cooper, Yu Tsao, Hsin-Min Wang, Tomoki Toda, Junichi Yamagishi
DDOS: A MOS Prediction Framework utilizing Domain Adaptive Pre-training and Distribution of Opinion Scores
Wei-Cheng Tseng, Wei-Tsung Kao, Hung-yi Lee
Expressive, Variable, and Controllable Duration Modelling in TTS
Syed Ammar Abbas, Thomas Merritt, Alexis Moinet, Sri Karlapati, Ewa Muszynska, Simon Slangen, Elia Gatti, Thomas Drugman
Predicting VQVAE-based Character Acting Style from Quotation-Annotated Text for Audiobook Speech Synthesis
Wataru Nakata, Tomoki Koriyama, Shinnosuke Takamichi, Yuki Saito, Yusuke Ijima, Ryo Masumura, Hiroshi Saruwatari
Adversarial and Sequential Training for Cross-lingual Prosody Transfer TTS
Min-Kyung Kim, Joon-Hyuk Chang
FluentTTS: Text-dependent Fine-grained Style Control for Multi-style TTS
Changhwan Kim, Seyun Um, Hyungchan Yoon, Hong-Goo Kang
Few Shot Cross-Lingual TTS Using Transferable Phoneme Embedding
Wei-Ping Huang, Po-Chun Chen, Sung-Feng Huang, Hung-yi Lee
Training Text-To-Speech Systems From Synthetic Data: A Practical Approach For Accent Transfer Tasks
Lev Finkelstein, Heiga Zen, Norman Casagrande, Chun-an Chan, Ye Jia, Tom Kenter, Alex Petelin, Jonathan Shen, Vincent Wan, Yu Zhang, Yonghui Wu, Robert Clark
Spoken-Text-Style Transfer with Conditional Variational Autoencoder and Content Word Storage
Daiki Yoshioka, Yusuke Yasuda, Noriyuki Matsunaga, Yamato Ohtani, Tomoki Toda
Analysis of expressivity transfer in non-autoregressive end-to-end multispeaker TTS systems
Ajinkya Kulkarni, Vincent Colotte, Denis Jouvet
Cross-lingual Style Transfer with Conditional Prior VAE and Style Loss
Dino Ratcliffe, You Wang, Alex Mansbridge, Penny Karanasou, Alexis Moinet, Marius Cotescu
Daft-Exprt: Cross-Speaker Prosody Transfer on Any Text for Expressive Speech Synthesis
Julian Zaïdi, Hugo Seuté, Benjamin van Niekerk, Marc-André Carbonneau
Language Model-Based Emotion Prediction Methods for Emotional Speech Synthesis Systems
Hyun-Wook Yoon, Ohsung Kwon, Hoyeon Lee, Ryuichi Yamamoto, Eunwoo Song, Jae-Min Kim, Min-Jae Hwang
Text aware Emotional Text-to-speech with BERT
Arijit Mukherjee, Shubham Bansal, Sandeepkumar Satpal, Rupesh Mehta
Overlapped Speech Detection in Broadcast Streams Using X-vectors
Lukas Mateju, Frantisek Kynych, Petr Cerva, Jiri Malek, Jindrich Zdansky
DDKtor: Automatic Diadochokinetic Speech Analysis
Yael Segal, Kasia Hitczenko, Matt Goldrick, Adam Buchwald, Angela Roberts, Joseph Keshet
SiDi KWS: A Large-Scale Multilingual Dataset for Keyword Spotting
Michel Cardoso Meneses, Rafael Bérgamo Holanda, Luis Vasconcelos Peres, Gabriela Dantas Rocha
Dummy Prototypical Networks for Few-Shot Open-Set Keyword Spotting
Byeonggeun Kim, Seunghan Yang, Inseop Chung, Simyung Chang
Unsupervised Voice Activity Detection by Modeling Source and System Information using Zero Frequency Filtering
Eklavya Sarkar, RaviShankar Prasad, Mathew Magimai Doss
Multilingual and Multimodal Abuse Detection
Rini Sharon, Heet Shah, Debdoot Mukherjee, Vikram Gupta
Microphone Array Channel Combination Algorithms for Overlapped Speech Detection
Theo Mariotte, Anthony Larcher, Silvio Montrésor, Jean-Hugh Thomas
Streaming Automatic Speech Recognition with Re-blocking Processing Based on Integrated Voice Activity Detection
Yui Sudo, Shakeel Muhammad, Kazuhiro Nakadai, Jiatong Shi, Shinji Watanabe
Unsupervised Word Segmentation using K Nearest Neighbors
Tzeviya Fuchs, Yedid Hoshen, Yossi Keshet
Investigation on the Band Importance of Phase-aware Speech Enhancement
Zhuohuang Zhang, Donald Williamson, Yi Shen
Unsupervised Acoustic-to-Articulatory Inversion with Variable Vocal Tract Anatomy
Yifan Sun, Qinlong Huang, Xihong Wu
Unsupervised Inference of Physiologically Meaningful Articulatory Trajectories with VocalTractLab
Yifan Sun, Qinlong Huang, Xihong Wu
Radio2Speech: High Quality Speech Recovery from Radio Frequency Signals
Running Zhao, Jiangtao Yu, Tingle Li, Hang Zhao, Edith C. H. Ngai
Isochronous is beautiful? Syllabic event detection in a neuro-inspired oscillatory model is facilitated by isochrony in speech
Mamady Nabe, Julien Diard, Jean-Luc Schwartz
An investigation of regression-based prediction of the femininity or masculinity in speech of transgender people
Leon Liebig, Christoph Wagner, Alexander Mainka, Peter Birkholz
Acoustic To Articulatory Speech Inversion Using Multi-Resolution Spectro-Temporal Representations Of Speech Signals
Rahil Parikh, Nadee Seneviratne, Ganesh Sivaraman, Shihab Shamma, Carol Espy-Wilson
Deep Neural Convolutive Matrix Factorization for Articulatory Representation Decomposition
Jiachen Lian, Alan W Black, Louis Goldstein, Gopala Krishna Anumanchipalli
Vocal-Tract Area Functions with Articulatory Reality for Tract Opening
Zhao Zhang, Ju Zhang, Jianguo Wei, Kiyoshi Honda, Tatsuya Kitamura
Coupled Discriminant Subspace Alignment for Cross-database Speech Emotion Recognition
Shaokai Li, Peng Song, Keke Zhao, Wenjing Zhang, Wenming Zheng
Performance Improvement of Speech Emotion Recognition by Neutral Speech Detection Using Autoencoder and Intermediate Representation
Jennifer Santoso, Takeshi Yamada, Kenkichi Ishizuka, Taiichi Hashimoto, Shoji Makino
A Graph Isomorphism Network with Weighted Multiple Aggregators for Speech Emotion Recognition
Ying Hu, Yuwu Tang, Hao Huang, Liang He
Speech Emotion Recognition via Generation using an Attention-based Variational Recurrent Neural Network
Murchana Baruah, Bonny Banerjee
Speech Emotion: Investigating Model Representations, Multi-Task Learning and Knowledge Distillation
Vikramjit Mitra, Hsiang-Yun Sherry Chien, Vasudha Kowtha, Joseph Yitan Cheng, Erdrin Azemi
Multiple Enhancements to LSTM for Learning Emotion-Salient Features in Speech Emotion Recognition
Desheng Hu, Xinhui Hu, Xinkang Xu
Multi-level Fusion of Wav2vec 2.0 and BERT for Multimodal Emotion Recognition
Zihan Zhao, Yanfeng Wang, Yu Wang
CTA-RNN: Channel and Temporal-wise Attention RNN leveraging Pre-trained ASR Embeddings for Speech Emotion Recognition
Chengxin Chen, Pengyuan Zhang
Complex Paralinguistic Analysis of Speech: Predicting Gender, Emotions and Deception in a Hierarchical Framework
Alena Velichko, Maxim Markitantov, Heysem Kaya, Alexey Karpov
Interactive Co-Learning with Cross-Modal Transformer for Audio-Visual Emotion Recognition
Akihiko Takashima, Ryo Masumura, Atsushi Ando, Yoshihiro Yamazaki, Mihiro Uchida, Shota Orihashi
SpeechEQ: Speech Emotion Recognition based on Multi-scale Unified Datasets and Multitask Learning
Zuheng Kang, Junqing Peng, Jianzong Wang, Jing Xiao
Discriminative Feature Representation Based on Cascaded Attention Network with Adversarial Joint Loss for Speech Emotion Recognition
Yang Liu, Haoqin Sun, Wenbo Guan, Yuqi Xia, Zhen Zhao
Intra-speaker phonetic variation in read speech: comparison with inter-speaker variability in a controlled population
Nicolas Audibert, Cécile Fougeron
Training speaker recognition systems with limited data
Nik Vaessen, David van Leeuwen
A Deep One-Class Learning Method for Replay Attack Detection
Yijie Lou, Shiliang Pu, Jianfeng Zhou, Xin Qi, Qinbo Dong, Hongwei Zhou
A Universal Identity Backdoor Attack against Speaker Verification based on Siamese Network
Haodong Zhao, Wei Du, Junjie Guo, Gongshen Liu
A Novel Phoneme-based Modeling for Text-independent Speaker Identification
Xin Wang, Chuan Xie, Qiang Wu, Huayi Zhan, Ying Wu
Self-Supervised Speaker Verification Using Dynamic Loss-Gate and Label Correction
Bing Han, Zhengyang Chen, Yanmin Qian
Learning Lip-Based Audio-Visual Speaker Embeddings with AV-HuBERT
Bowen Shi, Abdelrahman Mohamed, Wei-Ning Hsu
Acoustic Feature Shuffling Network for Text-independent Speaker Verification
Jin Li, Xin Fang, Fan Chu, Tian Gao, Yan Song, Li-Rong Dai
Multi-Path GMM-MobileNet Based on Attack Algorithms and Codecs for Synthetic Speech and Deepfake Detection
Yan Wen, Zhenchun Lei, Yingen Yang, Changhong Liu, Minglei Ma
Adversarial Reweighting for Speaker Verification Fairness
Minho Jin, Chelsea Ju, Zeya Chen, Yi Chieh Liu, Jasha Droppo, Andreas Stolcke
Graph-based Multi-View Fusion and Local Adaptation: Mitigating Within-Household Confusability for Speaker Identification
Long Chen, Yixiong Meng, Venkatesh Ravichandran, Andreas Stolcke
Local Context-aware Self-attention for Continuous Sign Language Recognition
Ronglai Zuo, Brian Mak
Disentangled Latent Speech Representation for Automatic Pathological Intelligibility Assessment
Tobias Weise, Philipp Klumpp, Andreas Maier, Elmar Nöth, Björn Heismann, Maria Schuster, Seung Hee Yang
Improving Hypernasality Estimation with Automatic Speech Recognition in Cleft Palate Speech
Kaitao Song, Teng Wan, Bixia Wang, Huiqiang Jiang, Luna Qiu, Jiahang Xu, Liping Jiang, Qun Lou, Yuqing Yang, Dongsheng Li, Xudong Wang, Lili Qiu
Conformer Based Elderly Speech Recognition System for Alzheimer’s Disease Detection
Tianzi Wang, Jiajun Deng, Mengzhe Geng, Zi Ye, Shoukang Hu, Yi Wang, Mingyu Cui, Zengrui Jin, Xunying Liu, Helen Meng
Revisiting visuo-spatial processing in individuals with congenital amusia
Zixia Fan, Jing Shao, Weigong Pan, Lan Wang
A user-friendly headset for radar-based silent speech recognition
Pouriya Amini Digehsara, João Vítor Possamai de Menezes, Christoph Wagner, Michael Bärhold, Petr Schaffer, Dirk Plettemeier, Peter Birkholz
A study of production error analysis for Mandarin-speaking Children with Hearing Impairment
Jingwen Cheng, Yuchen Yan, Yingming Gao, Xiaoli Feng, Yannan Wang, Jinsong Zhang
Incremental Layer-Wise Self-Supervised Learning for Efficient Unsupervised Speech Domain Adaptation On Device
Zhouyuan Huo, Dongseong Hwang, Khe Chai Sim, Shefali Garg, Ananya Misra, Nikhil Siddhartha, Trevor Strohman, Francoise Beaufays
Non-Linear Pairwise Language Mappings for Low-Resource Multilingual Acoustic Model Fusion
Muhammad Umar Farooq, Darshan Adiga Haniya Narayana, Thomas Hain
The THUEE System Description for the IARPA OpenASR21 Challenge
Jing Zhao, Haoyu Wang, Jinpeng Li, Shuzhou Chai, Guanbo Wang, Guoguo Chen, Wei-Qiang Zhang
External Text Based Data Augmentation for Low-Resource Speech Recognition in the Constrained Condition of OpenASR21 Challenge
Guolong Zhong, Hongyu Song, Ruoyu Wang, Lei Sun, Diyuan Liu, Jia Pan, Xin Fang, Jun Du, Jie Zhang, Lirong Dai
Cross-dialect lexicon optimisation for an endangered language ASR system: the case of Irish
Liam Lonergan, Mengjie Qian, Neasa Ní Chiaráin, Christer Gobl, Ailbhe Ní Chasaide
Wav2vec-S: Semi-Supervised Pre-Training for Low-Resource ASR
Han Zhu, Li Wang, Gaofeng Cheng, Jindong Wang, Pengyuan Zhang, Yonghong Yan
Comparison of Unsupervised Learning and Supervised Learning with Noisy Labels for Low-Resource Speech Recognition
Yanick Schraner, Christian Scheller, Michel Plüss, Lukas Neukom, Manfred Vogel
Using cross-model learnings for the Gram Vaani ASR Challenge 2022
Tanvina Patel, Odette Scharenborg
ASR2K: Speech Recognition for Around 2000 Languages without Audio
Xinjian Li, Florian Metze, David R. Mortensen, Alan W Black, Shinji Watanabe
Combining Simple but Novel Data Augmentation Methods for Improving Conformer ASR
Ronit Damania, Christopher Homan, Emily Prud'hommeaux
OpenASR21: The Second Open Challenge for Automatic Speech Recognition of Low-Resource Languages
Kay Peterson, Audrey Tong, Yan Yu
DRAFT: A Novel Framework to Reduce Domain Shifting in Self-supervised Learning and Its Application to Children’s ASR
Ruchao Fan, Abeer Alwan
Plugging a neural phoneme recognizer into a simple language model: a workflow for low-resource setting
Séverine Guillaume, Guillaume Wisniewski, Benjamin Galliot, Minh-Châu Nguyên, Maxime Fily, Guillaume Jacques, Alexis Michaud
An Evaluation of Three-Stage Voice Conversion Framework for Noisy and Reverberant Conditions
Yeonjong Choi, Chao Xie, Tomoki Toda
An Overview & Analysis of Sequence-to-Sequence Emotional Voice Conversion
Zijiang Yang, Xin Jing, Andreas Triantafyllopoulos, Meishu Song, Ilhan Aslan, Björn W. Schuller
Zero-Shot Foreign Accent Conversion without a Native Reference
Waris Quamer, Anurag Das, John Levis, Evgeny Chukharev-Hudilainen, Ricardo Gutierrez-Osuna
Speaker Anonymization with Phonetic Intermediate Representations
Sarina Meyer, Florian Lux, Pavel Denisov, Julia Koch, Pascal Tilli, Ngoc Thang Vu
Investigation into Target Speaking Rate Adaptation for Voice Conversion
Michael Kuhlmann, Fritz Seebauer, Janek Ebbers, Petra Wagner, Reinhold Haeb-Umbach
Self supervised learning for robust voice cloning
Konstantinos Klapsas, Nikolaos Ellinas, Karolos Nikitaras, Georgios Vamvoukakis, Panagiotis Kakoulidis, Konstantinos Markopoulos, Spyros Raptis, June Sig Sung, Gunu Jho, Aimilios Chalamandaris, Pirros Tsiakoulis
Improving Deliberation by Text-Only and Semi-Supervised Training
Ke Hu, Tara Sainath, Yanzhang He, Rohit Prabhavalkar, Trevor Strohman, Sepand Mavandadi, Weiran Wang
K-Wav2vec 2.0: Automatic Speech Recognition based on Joint Decoding of Graphemes and Syllables
Jounghee Kim, Pilsung Kang
Wav2Vec-Aug: Improved self-supervised training with limited data
Anuroop Sriram, Michael Auli, Alexei Baevski
Revisiting joint decoding based multi-talker speech recognition with DNN acoustic model
Martin Kocour, Katerina Zmolikova, Lucas Ondel, Jan Svec, Marc Delcroix, Tsubasa Ochiai, Lukas Burget, Jan Cernocky
RNN-T lattice enhancement by grafting of pruned paths
Mirek Novak, Pavlos Papadopoulos
Better Intermediates Improve CTC Inference
Tatsuya Komatsu, Yusuke Fujita, Jaesong Lee, Lukas Lee, Shinji Watanabe, Yusuke Kida
Cross-Cultural Comparison of Gradient Emotion Perception: Human vs. Alexa TTS Voices
Iona Gessinger, Michelle Cohn, Georgia Zellou, Bernd Möbius
Discriminative Adversarial Learning for Speaker Independent Emotion Recognition
Chamara Kasun, Chung Soo Ahn, Jagath Rajapakse, Zhiping Lin, Guang-Bin Huang
Representing 'how you say' with 'what you say': English corpus of focused speech and text reflecting corresponding implications
Naoaki Suzuki, Satoshi Nakamura
Production Strategies of Vocal Attitudes
Léane Salais, Pablo Arias, Clément Le Moine, Victor Rosi, Yann Teytaut, Nicolas Obin, Axel Roebel
Where's the uh, hesitation? The interplay between filled pause location, speech rate and fundamental frequency in perception of confidence
Ambika Kirkland, Harm Lameris, Eva Szekely, Joakim Gustafson
E2E Segmenter: Joint Segmenting and Decoding for Long-Form ASR
W. Ronny Huang, Shuo-Yiin Chang, David Rybach, Tara Sainath, Rohit Prabhavalkar, Cal Peyser, Zhiyun Lu, Cyril Allauzen
Autoregressive Co-Training for Learning Discrete Speech Representation
Sung-Lin Yeh, Hao Tang
An Exploration of Prompt Tuning on Generative Spoken Language Model for Speech Processing Tasks
Kai-Wei Chang, Wei-Cheng Tseng, Shang-Wen Li, Hung-yi Lee
Overlapped speech and gender detection with WavLM pre-trained features
Martin Lebourdais, Marie Tahon, Antoine LAURENT, Sylvain Meignier
A study on constraining Connectionist Temporal Classification for temporal audio alignment
Yann Teytaut, Baptiste Bouvier, Axel Roebel
Acoustic-to-articulatory Speech Inversion with Multi-task Learning
Yashish M. Siriwardena, Ganesh Sivaraman, Carol Espy-Wilson
Enhancing Speech Privacy with Slicing
Mohamed Maouche, Brij Mohan Lal Srivastava, Nathalie Vauquier, Aurélien Bellet, Marc Tommasi, Emmanuel Vincent
An Attention-Based Method for Guiding Attribute-Aligned Speech Representation Learning
Yu-Lin Huang, Bo-Hao Su, Y.-W. Peter Hong, Chi-Chun Lee
Defense against Adversarial Attacks on Hybrid Speech Recognition System using Adversarial Fine-tuning with Denoiser
Sonal Joshi, Saurabh Kataria, Yiwen Shao, Piotr Żelasko, Jesús Villalba, Sanjeev Khudanpur, Najim Dehak
Membership Inference Attacks Against Self-supervised Speech Models
Wei-Cheng Tseng, Wei-Tsung Kao, Hung-yi Lee
Chunking Defense for Adversarial Attacks on ASR
Yiwen Shao, Jesus Villalba, Sonal Joshi, Saurabh Kataria, Sanjeev Khudanpur, Najim Dehak
Semi-FedSER: Semi-supervised Learning for Speech Emotion Recognition On Federated Learning using Multiview Pseudo-Labeling
Tiantian Feng, Shrikanth Narayanan
User-Level Differential Privacy against Attribute Inference Attack of Speech Emotion Recognition on Federated Learning
Tiantian Feng, Raghuveer Peri, Shrikanth Narayanan
AdvEst: Adversarial Perturbation Estimation to Classify and Detect Adversarial Attacks against Speaker Identification
Sonal Joshi, Saurabh Kataria, Jesús Villalba, Najim Dehak
Online Learning of Open-set Speaker Identification by Active User-registration
Eunkyung Yoo, Hyeonseop Song, Taehyeong Kim, Chul Lee
Automatic Speaker Verification System for Dysarthria Patients
Shinimol Salim, Syed Shahnawazuddin, Waquar Ahmad
Multimodal Clustering with Role Induced Constraints for Speaker Diarization
Nikolaos Flemotomos, Shrikanth Narayanan
Multi-scale Speaker Diarization with Dynamic Scale Weighting
Tae Jin Park, Nithin Rao Koluguri, Jagadeesh Balam, Boris Ginsburg
Improved Relation Networks for End-to-End Speaker Verification and Identification
Ashutosh Chaubey, Sparsh Sinha, Susmita Ghose
End-to-End Neural Speaker Diarization with an Iterative Refinement of Non-Autoregressive Attention-based Attractors
Magdalena Rybicka, Jesus Villalba, Najim Dehak, Konrad Kowalczyk
From Simulated Mixtures to Simulated Conversations as Training Data for End-to-End Neural Diarization
Federico Landini, Alicia Lozano-Diez, Mireia Diez, Lukáš Burget
Can Humans Correct Errors From System? Investigating Error Tendencies in Speaker Identification Using Crowdsourcing
Yuta Ide, Susumu Saito, Teppei Nakano, Tetsuji Ogawa
Light-Weight Speaker Verification with Global Context Information
Miseul Kim, Zhenyu Piao, Seyun Um, Ran Lee, Jaemin Joh, Seungshin Lee, Hong-Goo Kang
Learnable Sparse Filterbank for Speaker Verification
Junyi Peng, Rongzhi Gu, Ladislav Mošner, Oldrich Plchot, Lukas Burget, Jan Černocký
Using Data Augmentation and Consistency Regularization to Improve Semi-supervised Speech Recognition
Ashtosh Sapru
Unsupervised domain adaptation for speech recognition with unsupervised error correction
Long Mai, Julie Carson-Berndsen
A Scalable Model Specialization Framework for Training and Inference using Submodels and its Application to Speech Model Personalization
Fadi Biadsy, Youzheng Chen, Xia Zhang, Oleg Rybakov, Andrew Rosenberg, Pedro Moreno
Wav2vec behind the Scenes: How end2end Models learn Phonetics
Teena tom Dieck, Paula Andrea Pérez-Toro, Tomas Arias, Elmar Noeth, Philipp Klumpp
Scaling ASR Improves Zero and Few Shot Learning
Weiyi Zheng, Alex Xiao, Gil Keren, Duc Le, Frank Zhang, Christian Fuegen, Ozlem Kalinli, Yatharth Saraf, Abdelrahman Mohamed
InterAug: Augmenting Noisy Intermediate Predictions for CTC-based ASR
Yu Nakagome, Tatsuya Komatsu, Yusuke Fujita, Shuta Ichimura, Yusuke Kida
Investigation of Ensemble features of Self-Supervised Pretrained Models for Automatic Speech Recognition
A Arunkumar, Vrunda Nileshkumar Sukhadia, Srinivasan Umesh
Dynamic Sliding Window Modeling for Abstractive Meeting Summarization
Zhengyuan Liu, Nancy Chen
STUDIES: Corpus of Japanese Empathetic Dialogue Speech Towards Friendly Voice Agent
Yuki Saito, Yuto Nishimura, Shinnosuke Takamichi, Kentaro Tachibana, Hiroshi Saruwatari
kidsTALC: A Corpus of 3- to 11-year-old German Children’s Connected Natural Speech
Lars Rumberg, Christopher Gebauer, Hanna Ehlert, Maren Wallbaum, Lena Bornholt, Jörn Ostermann, Ulrike Lüdtke
DUAL: Discrete Spoken Unit Adaptive Learning for Textless Spoken Question Answering
Guan-Ting Lin, Yung-Sung Chuang, Ho-Lam Chung, Shu-wen Yang, Hsuan-Jui Chen, Shuyan Annie Dong, Shang-Wen Li, Abdelrahman Mohamed, Hung-yi Lee, Lin-shan Lee
Asymmetric Proxy Loss for Multi-View Acoustic Word Embeddings
Myunghun Jung, Hoi Rin Kim
Exploring Continuous Integrate-and-Fire for Adaptive Simultaneous Speech Translation
Chih-Chiang Chang, Hung-yi Lee
Building Vietnamese Conversational Smart Home Dataset and Natural Language Understanding Model
Thi Thu Trang Nguyen, Trung Duc Anh Dang, Quoc Viet Vu, Woomyoung Park
DeToxy: A Large-Scale Multimodal Dataset for Toxicity Classification in Spoken Utterances
Sreyan Ghosh, Samden Lepcha, S Sakshi, Rajiv Ratn Shah, Srinivasan Umesh
Voice Activity Projection: Self-supervised Learning of Turn-taking Events
Erik Ekstedt, Gabriel Skantze
Enhanced Direct Speech-to-Speech Translation Using Self-supervised Pre-training and Data Augmentation
Sravya Popuri, Peng-Jen Chen, Changhan Wang, Juan Pino, Yossi Adi, Jiatao Gu, Wei-Ning Hsu, Ann Lee
QbyE-MLPMixer: Query-by-Example Open-Vocabulary Keyword Spotting using MLPMixer
Jinmiao Huang, Waseem Gharbieh, Qianhui Wan, Han Suk Shim, Hyun Chul Lee
DyConvMixer: Dynamic Convolution Mixer Architecture for Open-Vocabulary Keyword Spotting
Waseem Gharbieh, Jinmiao Huang, Qianhui Wan, Han Suk Shim, Hyun Chul Lee
Challenges in Metadata Creation for Massive Naturalistic Team-Based Audio Data
Chelzy Belitz, John H.L. Hansen
Spoken Dialogue System for Call Centers with Expressive Speech Synthesis
Davis Nicmanis, Askars Salimbajevs
OCTRA – An Innovative Approach to Orthographic Transcription
Christoph Draxler, Julian Pomp
Voice Puppetry with FastPitch
Emelie Van De Vreken, Korin Richmond, Catherine Lai
Improving Data Driven Inverse Text Normalization using Data Augmentation and Machine Translation
Debjyoti Paul, Yutong Pang, Szu-Jui Chen, Xuedong Zhang
Native phonotactic interference in L2 vowel processing: Mouse-tracking reveals cognitive conflicts during identification
Yizhou Wang, Rikke Bundgaard-Nielsen, Brett Baker, Olga Maxwell
Mandarin nasal place assimilation revisited: an acoustic study
Mingqiong Luo
Bending the string: intonation contour length as a correlate of macro-rhythm
Constantijn Kaland
Eliciting and evaluating likelihood ratios for speaker recognition by human listeners under forensically realistic channel-mismatched conditions
Vincent Hughes, Carmen Llamas, Thomas Kettig
Reducing uncertainty at the score-to-LR stage in likelihood ratio-based forensic voice comparison using automatic speaker recognition systems
Bruce Xiao Wang, Vincent Hughes
Durational Patterning at Discourse Boundaries in Relation to Therapist Empathy in Psychotherapy
Jonathan Him Nok Lee, Dehua Tao, Harold Chui, Tan Lee, Sarah Luk, Nicolette Wing Tung Lee, Koonkan Fung
Convolutional Neural Networks for Classification of Voice Qualities from Speech and Neck Surface Accelerometer Signals
Sudarsana Reddy Kadiri, Farhad Javanmardi, Paavo Alku
Applying Syntax–Prosody Mapping Hypothesis and Prosodic Well-Formedness Constraints to Neural Sequence-to-Sequence Speech Synthesis
Kei Furukawa, Takeshi Kishiyama, Satoshi Nakamura
Effects of Language Contact on Vowel Nasalization in Wenzhou and Rugao Dialects
Yan Li, Ying Chen, Xinya Zhang, Yanyang Chen, Jiazheng Wang
A blueprint for using deepfakes in sociolinguistic matched-guise experiments
Nathan Joel Young, David Britain, Adrian Leemann
Mandarin Tone Sandhi Realization: Evidence from Large Speech Corpora
Zuoyu Tian, Xiao Dong, Feier Gao, Haining Wang, Charles Lin
A Laryngographic Study on the Voice Quality of Northern Vietnamese Tones under the Lombard Effect
Giang Le, Chilin Shih, Yan Tang
The Prosody of Cheering in Sport Events
Marzena Zygis, Sarah Wesolek, Nina Hosseini-Kivanani, Manfred Krifka
Contribution of the glottal flow residual in affect-related voice transformation
Zihan Wang, Christer Gobl
High level feature fusion in forensic voice comparison
Michael Carne, Yuko Kinoshita, Shunichi Ishihara
Modeling speech recognition and synthesis simultaneously: Encoding and decoding lexical and sublexical semantic information into speech with no direct access to speech data
Gasper Begus, Alan Zhou
Paraguayan Guarani: Tritonal pitch accent and Accentual Phrase
Sun-Ah Jun, Maria Luisa Zubizarreta
Low-resource Accent Classification in Geographically-proximate Settings: A Forensic and Sociophonetics Perspective
Qingcheng Zeng, Dading Chong, Peilin Zhou, Jie Yang
Tiny-Sepformer: A Tiny Time-Domain Transformer Network For Speech Separation
Jian Luo, Jianzong Wang, Ning Cheng, Edward Xiao, Xulong Zhang, Jing Xiao
Speaker-Aware Mixture of Mixtures Training for Weakly Supervised Speaker Extraction
Zifeng Zhao, Rongzhi Gu, Dongchao Yang, Jinchuan Tian, Yuexian Zou
SepIt: Approaching a Single Channel Speech Separation Bound
Shahar Lutati, Eliya Nachmani, Lior Wolf
On the Use of Deep Mask Estimation Module for Neural Source Separation Systems
Kai Li, Xiaolin Hu, Yi Luo
Target Confusion in End-to-end Speaker Extraction: Analysis and Approaches
Zifeng Zhao, Dongchao Yang, Rongzhi Gu, Haoran Zhang, Yuexian Zou
Embedding Recurrent Layers with Dual-Path Strategy in a Variant of Convolutional Network for Speaker-Independent Speech Separation
Xue Yang, Changchun Bao
Disentangling the Impacts of Language and Channel Variability on Speech Separation Networks
Fan-Lin Wang, Hung-Shin Lee, Yu Tsao, Hsin-Min Wang
Objective Metrics to Evaluate Residual-Echo Suppression During Double-Talk in the Stereophonic Case
Amir Ivry, Israel Cohen, Baruch Berdugo
QDPN - Quasi-dual-path Network for single-channel Speech Separation
Joel Rixen, Matthias Renz
Conformer Space Neural Architecture Search for Multi-Task Audio Separation
Shun Lu, Yang Wang, Peng Yao, Chenxing Li, Jianchao Tan, Feng Deng, Xiaorui Wang, Chengru Song
ResectNet: An Efficient Architecture for Voice Activity Detection on Mobile Devices
Okan Köpüklü, Maja Taseska
Gated Convolutional Fusion for Time-Domain Target Speaker Extraction Network
Wenjing Liu, Chuan Xie
WA-Transformer: Window Attention-based Transformer with Two-stage Strategy for Multi-task Audio Source Separation
Yang Wang, Chenxing Li, Feng Deng, Shun Lu, Peng Yao, Jianchao Tan, Chengru Song, Xiaorui Wang
Multichannel Speech Separation with Narrow-band Conformer
Changsheng Quan, Xiaofei Li
Separating Long-Form Speech with Group-wise Permutation Invariant Training
Wangyou Zhang, Zhuo Chen, Naoyuki Kanda, Shujie Liu, Jinyu Li, Sefik Emre Eskimez, Takuya Yoshioka, Xiong Xiao, Zhong Meng, Yanmin Qian, Furu Wei
Directed speech separation for automatic speech recognition of long form conversational speech
Rohit Paturi, Sundararajan Srinivasan, Katrin Kirchhoff, Daniel Garcia-Romero
Speech Separation for an Unknown Number of Speakers Using Transformers With Encoder-Decoder Attractors
Srikanth Raj Chetupalli, Emanuël Habets
Cooperative Speech Separation With a Microphone Array and Asynchronous Wearable Devices
Ryan Corey, Manan Mittal, Kanad Sarkar, Andrew C. Singer
Text-Driven Separation of Arbitrary Sounds
Kevin Kilgour, Beat Gfeller, Qingqing Huang, Aren Jansen, Scott Wisdom, Marco Tagliasacchi
An Empirical Analysis on the Vulnerabilities of End-to-End Speech Segregation Models
Rahil Parikh, Gaspar Rochette, Carol Espy-Wilson, Shihab Shamma
TaylorBeamformer: Learning All-Neural Beamformer for Multi-Channel Speech Enhancement from Taylor’s Approximation Theory
Andong Li, Guochen Yu, Chengshi Zheng, Xiaodong Li
How bad are artifacts?: Analyzing the impact of speech enhancement errors on ASR
Kazuma Iwamoto, Tsubasa Ochiai, Marc Delcroix, Rintaro Ikeshita, Hiroshi Sato, Shoko Araki, Shigeru Katagiri
Multi-source wideband DOA estimation method by frequency focusing and error weighting
Jing Zhou, Changchun Bao
Convolutional Recurrent Smart Speech Enhancement Architecture for Hearing Aids
Soha Nossier, Julie Wall, Mansour Moniri, Cornelius Glackin, Nigel Cannings
Fully Automatic Balance between Directivity Factor and White Noise Gain for Large-scale Microphone Arrays in Diffuse Noise Fields
Weixin Meng, Chengshi Zheng, Xiaodong Li
A Transfer and Multi-Task Learning based Approach for MOS Prediction
Xiaohai Tian, Kaiqi Fu, Shaojun Gao, Yiwei Gu, Kai Wang, Wei Li, Zejun Ma
Fusion of Self-supervised Learned Models for MOS Prediction
Zhengdong Yang, Wangjin Zhou, Chenhui Chu, Sheng Li, Raj Dabre, Raphael Rubino, Yi Zhao
Perceptual Contrast Stretching on Target Feature for Speech Enhancement
Rong Chao, Cheng Yu, Szu-wei Fu, Xugang Lu, Yu Tsao
A speech enhancement method for long-range speech acquisition task
Yanzhang Geng, Heng Wang, Tao Zhang, Xin Zhao
ESPnet-SE++: Speech Enhancement for Robust Speech Recognition, Translation, and Understanding
Yen-Ju Lu, Xuankai Chang, Chenda Li, Wangyou Zhang, Samuele Cornell, Zhaoheng Ni, Yoshiki Masuyama, Brian Yan, Robin Scheibler, Zhong-Qiu Wang, Yu Tsao, Yanmin Qian, Shinji Watanabe
MTI-Net: A Multi-Target Speech Intelligibility Prediction Model
Ryandhimas Edo Zezario, Szu-wei Fu, Fei Chen, Chiou-Shann Fuh, Hsin-Min Wang, Yu Tsao
Steering vector correction in MVDR beamformer for speech enhancement
Suliang Bu, Yunxin Zhao, Tuo Zhao
Speech Modification for Intelligibility in Cochlear Implant Listeners: Individual Effects of Vowel- and Consonant-Boosting
Juliana N. Saba, John H.L. Hansen
DCTCN: Deep Complex Temporal Convolutional Network for Long Time Speech Enhancement
Jigang Ren, Qirong Mao
Improve Speech Enhancement using Perception-High-Related Time-Frequency Loss
Ding Zhao, Zhan Zhang, Bin Yu, Yuehai Wang
Transplantation of Conversational Speaking Style with Interjections in Sequence-to-Sequence Speech Synthesis
Raul Fernandez, David Haws, Guy Lorberbom, Slava Shechtman, Alexander Sorin
Accurate Emotion Strength Assessment for Seen and Unseen Speech Based on Data-Driven Deep Learning
Rui Liu, Berrak Sisman, Björn Schuller, Guanglai Gao, Haizhou Li
Cross-speaker Emotion Transfer Based On Prosody Compensation for End-to-End Speech Synthesis
Tao Li, Xinsheng Wang, Qicong Xie, Zhichao Wang, Mingqi Jiang, Lei Xie
Self-supervised Context-aware Style Representation for Expressive Speech Synthesis
Yihan Wu, Xi Wang, Shaofei Zhang, Lei He, Ruihua Song, Jian-Yun Nie
Integrating Discrete Word-Level Style Variations into Non-Autoregressive Acoustic Models for Speech Synthesis
Zhaoci Liu, Ningqian Wu, Yajie Zhang, Zhenhua Ling
Automatic Prosody Annotation with Pre-Trained Text-Speech Model
Ziqian Dai, Jianwei Yu, Yan Wang, Nuo Chen, Yanyao Bian, GuangZhi Li, Deng Cai, Dong Yu
Enhancing Word-Level Semantic Representation via Dependency Structure for Expressive Text-to-Speech Synthesis
Yixuan Zhou, Changhe Song, Jingbei Li, Zhiyong Wu, Yanyao Bian, Dan Su, Helen Meng
Towards Multi-Scale Speaking Style Modelling with Hierarchical Context Information for Mandarin Speech Synthesis
Shun Lei, Yixuan Zhou, Liyang Chen, Jiankun Hu, Zhiyong Wu, Shiyin Kang, Helen Meng
Towards Cross-speaker Reading Style Transfer on Audiobook Dataset
Xiang Li, Changhe Song, Xianhao Wei, Zhiyong Wu, Jia Jia, Helen Meng
CALM: Contrastive Cross-modal Speaking Style Modeling for Expressive Text-to-Speech Synthesis
Yi Meng, Xiang Li, Zhiyong Wu, Tingtian Li, Zixun Sun, Xinyu Xiao, Chi Sun, Hui Zhan, Helen Meng
Improve emotional speech synthesis quality by learning explicit and implicit representations with semi-supervised training
Jiaxu He, Cheng Gong, Longbiao Wang, Di Jin, Xiaobao Wang, Junhai Xu, Jianwu Dang