doi: 10.21437/Interspeech.2025
ISSN: 2958-1796
From Talking and Listening Devices to Intelligent Communicative Machines
Roger Moore
Speech transcription from South Tyrolean Dialect to Standard German with Whisper
Luca Ducceschi, Greta H. Franzini
Length Aware Speech Translation for Video Dubbing
Aswin Shanmugam Subramanian, Harveen Chadha, Vikas Joshi, Shubham Bansal, Jian Xue, Rupeshkumar Mehta, Jinyu Li
ArticulateX: End-to-End Monolingual Speech Translation in Articulator Space
Vishal Kumar, Vinayak Abrol
CMSP-ST: Cross-modal Mixup with Speech Purification for End-to-End Speech Translation
Jiale Ou, Hongying Zan
End-to-End Speech Translation Guided by Robust Translation Capability of Large Language Model
Yosuke Higuchi, Tetsuji Ogawa, Tetsunori Kobayashi
Empowering Large Language Models for End-to-End Speech Translation Leveraging Synthetic Data
Yu Pu, Xiaoqian Liu, Guangyu Zhang, Zheng Yan, Wei-Qiang Zhang, Xie Chen
Speech-to-Text Translation with Phoneme-Augmented CoT: Enhancing Cross-Lingual Transfer in Low-Resource Scenarios
Gerard I. Gállego, Oriol Pareras, Martí Cortada Garcia, Lucas Takanori, Javier Hernando
Scheduled Interleaved Speech-Text Training for Speech-to-Speech Translation with LLMs
Hayato Futami, Emiru Tsunoo, Yosuke Kashiwagi, Yuki Ito, Hassan Shahmohammadi, Siddhant Arora, Shinji Watanabe
End-to-End Speech Translation for Low-Resource Languages Using Weakly Labeled Data
Aishwarya Pothula, Bhavana Akkiraju, Srihari Bandarupalli, Charan D, Santosh Kesiraju, Anil Kumar Vuppala
Self-Improvement for Audio Large Language Model using Unlabeled Speech
Shaowen Wang, Xinyuan Chen, Yao Xu
Optimized Real-time Speech Enhancement with Deep SSMs on Raw Audio
Yan Ru Pei, Ritik Shrivastava, Fnu Sidharth
A Two-Stage Hierarchical Deep Filtering Framework for Real-Time Speech Enhancement
Shenghui Lu, Hukai Huang, Jinanglong Yao, Kaidi Wang, Qingyang Hong, Lin Li
Real-Time Audio-Visual Speech Enhancement Using Pre-trained Visual Representations
Teng Ma, Sile Yin, Li-Chia Yang, Shuo Zhang
Lightweight Speech Enhancement Model Based on Harmonic Attention and Phase Estimation with Skin-Attachable Accelerometer
Yonghun Song, Yeeun Kim, Yoonyoung Chung
TSDT-Net: Ultra-Low-Complexity Two-Stage Model Combining Dual-Path-Transformer and Transform-Average-Concatenate Network for Speech Enhancement
Yi Gao, Hangting Chen, Siyu Zhang, Qingshan Yang, Jingcong Chen
Structured Codebook Based Hierarchical Framework for DNN for Computationally Efficient Speech Enhancement
Chidambar B, Hanumanth Rao Naidu
Evaluation of Three Automatic Alignment Tools for the Processing of Non-native French
Qian Zhou, Mathilde Hutin
CrossPhon: An Auto Phone Mapping Tool to Streamline Cross-language Modeling for Phone Alignment of Low-resource Languages
Hongchen Wu, Yixin Gu
Multi-lingual and Zero-Shot Speech Recognition by Incorporating Classification of Language-Independent Articulatory Features
Ryo Magoshi, Shinsuke Sakai, Jaeyoung Lee, Tatsuya Kawahara
Instantaneous changes in acoustic signals reflect syllable progression and cross-linguistic syllable variation
Haley Hsu, Dani Byrd, Khalil Iskarous, Louis Goldstein
Influence of Proficiency and L2 Experience on Dynamic Spectral Cue Utilization in L2 Vowel Perception and Production
Linda Bakkouche, Brechtje Post
A Bayesian Approach to L2 Fluency Ratings by Native and Nonnative Listeners
Kakeru Yazawa, Takayuki Konishi
Are loan sequences different from foreign sequences? A perception study with Japanese listeners on coronal obstruent – high front vowel sequences
Silke Hamann, Andrea Alićehajić
Relative cue weighting in multilingual stop voicing production
Le Xuan Chan, Annika Heuser
Variability in Intervocalic /t/ and Community Diversity in Australian English
Hannah White, Joshua Penney, Felicity Cox
Vector Quantized Cross-lingual Unsupervised Domain Adaptation for Speech Emotion Recognition
Pravin Mote, Donita Robinson, Elizabeth Richerson, Carlos Busso
HYFuse: Aligning Heterogeneous Speech Pre-Trained Representations in Hyperbolic Space for Speech Emotion Recognition
Orchid Chetia Phukan, Girish, Mohd Mujtaba Akhtar, Swarup Ranjan Behera, Pailla Balakrishna Reddy, Arun Balaji Buduru, Rajesh Sharma
Meta-PerSER: Few-Shot Listener Personalized Speech Emotion Recognition via Meta-learning
Shi-Xin Fang, Liang-Yeh Shen, Yi-Cheng Lin, Huang-Cheng Chou, Hung-yi Lee
Breaking Resource Barriers in Speech Emotion Recognition via Data Distillation
Yi Chang, Zhao Ren, Zhonghao Zhao, Thanh Tam Nguyen, Kun Qian, Tanja Schultz, Björn W. Schuller
Multi-Teacher Language-Aware Knowledge Distillation for Multilingual Speech Emotion Recognition
Mehedi Hasan Bijoy, Dejan Porjazovski, Tamás Grósz, Mikko Kurimo
Learning More with Less: Self-Supervised Approaches for Low-Resource Speech Emotion Recognition
Ziwei Gong, Pengyuan Shi, Kaan Donbekci, Lin Ai, Run Chen, David Sasu, Zehui Wu, Julia Hirschberg
Hearing from Silence: Reasoning Audio Descriptions from Silent Videos via Vision-Language Model
Yong Ren, Chenxing Li, Le Xu, Hao Gu, Duzhen Zhang, Yujie Chen, Manjie Xu, Ruibo Fu, Shan Yang, Dong Yu
Mitigating Audiovisual Mismatch in Visual-Guide Audio Captioning
Le Xu, Chenxing Li, Yong Ren, Yujie Chen, Yu Gu, Ruibo Fu, Shan Yang, Dong Yu
ViCocktail: Automated Multi-Modal Data Collection for Vietnamese Audio-Visual Speech Recognition
Thai-Binh Nguyen, Thi Van Nguyen, Quoc Truong Do, Chi Mai Luong
GALAXY: A Large-Scale Open-Domain Dataset for Multimodal Learning
Yihan Wu, Yichen Lu, Yijing Chen, Jiaqi Song, William Chen, Ruihua Song, Shinji Watanabe
FD-Bench: A Full-Duplex Benchmarking Pipeline Designed for Full Duplex Spoken Dialogue Systems
Yizhou Peng, Yi-Wen Chao, Dianwen Ng, Yukun Ma, Chongjia Ni, Bin Ma, Eng Siong Chng
PersonaTAB: Predicting Personality Traits using Textual, Acoustic, and Behavioral Cues in Fully-Duplex Speech Dialogs
Sho Inoue, Shuai Wang, Haizhou Li
FFD: Fine-Finger Diffusion Model for Music to Fine-grained Finger Dance Generation
Boya Dong, Wentao Lei, Li Liu
Towards Diverse and Efficient Audio Captioning via Diffusion Models
Manjie Xu, Chenxing Li, Yong Ren, Xinyi Tu, Ruibo Fu, Wei Liang, Dong Yu
Pull It Together: Reducing the Modality Gap in Contrastive Learning
Amit Sofer, Yoav Goldman, Shlomo E. Chazan
EnvSDD: Benchmarking Environmental Sound Deepfake Detection
Han Yin, Yang Xiao, Rohan Kumar Das, Jisheng Bai, Haohe Liu, Wenwu Wang, Mark D Plumbley
Echoes of Phonetics: Unveiling Relevant Acoustic Cues for ASR via Feature Attribution
Dennis Fucci, Marco Gaido, Matteo Negri, Mauro Cettolo, Luisa Bentivogli
Benchmarking Time-localized Explanations for Audio Classification Models
Cecilia Bolaños, Leonardo Pepino, Martin Meza, Luciana Ferrer
Spectrotemporal Modulation: Efficient and Interpretable Feature Representation for Classifying Speech, Music, and Environmental Sounds
Andrew Chang, Yike Li, Iran R. Roman, David Poeppel
Discrete Tokens Exhibit Interlanguage Speech Intelligibility Benefit: an Analytical Study Towards Accent-robust ASR Only with Native Speech Data
Kentaro Onda, Keisuke Imoto, Satoru Fukayama, Daisuke Saito, Nobuaki Minematsu
Analysis of Semantic and Acoustic Token Variability Across Speech, Music, and Audio Domains
Takanori Ashihara, Marc Delcroix, Tsubasa Ochiai, Kohei Matsuura, Shota Horiguchi
Is your model big enough? Training and interpreting large-scale monolingual speech foundation models
Yaroslav Getman, Tamás Grósz, Tommi Lehtonen, Mikko Kurimo
Semantic-Aware Interpretable Multimodal Music Auto-Tagging
Andreas Patakis, Vassilis Lyberatos, Spyridon Kantarelis, Edmund Dervakos, Giorgos Stamou
From Words to Waves: Analyzing Concept Formation in Speech and Text-Based Foundation Models
Asim Ersoy, Basel Ahmad Mousi, Shammur Absar Chowdhury, Firoj Alam, Fahim I Dalvi, Nadir Durrani
Effective Context in Neural Speech Models
Yen Meng, Sharon Goldwater, Hao Tang
Word stress in self-supervised speech models: A cross-linguistic comparison
Martijn Bentum, Louis ten Bosch, Tomas O. Lentz
What do self-supervised speech models know about Dutch? Analyzing advantages of language-specific pre-training
Marianne de Heer Kloots, Hosein Mohebbi, Charlotte Pouw, Gaofei Shen, Willem Zuidema, Martijn Bentum
Iterative Refinement, Not Training Objective, Makes HuBERT Behave Differently from wav2vec 2.0
Robin Huo, Ewan Dunbar
On the reliability of feature attribution methods for speech classification
Gaofei Shen, Hosein Mohebbi, Arianna Bisazza, Afra Alishahi, Grzegorz Chrupala
An Exploration of Interpretable Deep Learning Models for the Assessment of Mild Cognitive Impairment
Emma Cathrine Liisborg Leschly, Oliver Roesler, Michael Neumann, Jackson Liscombe, Abhishek Hosamath, Lakshmi Arbatti, Line H. Clemmensen, Melanie Ganz, Vikram Ramanarayanan
Towards Multi-Level Transcript Segmentation: LoRA Fine-Tuning for Table-of-Contents Generation
Steffen Freisinger, Philipp Seeberger, Thomas Ranzenberger, Tobias Bocklet, Korbinian Riedhammer
Pick and Summarize: Integrating Extractive and Abstractive Speech Summarization
Takatomo Kano, Atsunori Ogawa, Marc Delcroix, Ryo Fukuda, William Chen, Shinji Watanabe
Beyond Similarity Scoring: Detecting Entailment and Contradiction in Multilingual and Multimodal Contexts
Othman Istaiteh, Salima Mdhaffar, Yannick Estève
Comparison-Based Automatic Evaluation for Meeting Summarization
Ziwei Gong, Lin Ai, Harsh Deshpande, Alexander Johnson, Emmy Phung, Zehui Wu, Ahmad Emami, Julia Hirschberg
Voxplorer: Voice data exploration and projection in an interactive dashboard
Alessandro De Luca, Srikanth Madikeri, Volker Dellwo
ASR-FAIRBENCH: Measuring and Benchmarking Equity Across Speech Recognition Systems
Anand Rai, Satyam Rahangdale, Utkarsh Anand, Animesh Mukherjee
Transcribing Oral History Recordings Using the Transcription Portal
Christoph Draxler, Julian Pömp, Henk van den Heuvel, Fabio Ardolino, Arjan van Hessen
LiRI Corpus Platform: Demonstration of a Web-Based Infrastructure for Multimodal Corpus Analysis
Teodora Vuković, Jeremy Zehr, Jonathan Schaber, Igor Mustač, Nikolina Rajović, Daniel McDonald, Johannes Graën, Noah Bubenhofer
Speech Annotation for A: Accuracy, Access, and Application
Zirong Li, Hongchen Wu, Yixin Gu, Yao Du, Yang Yue
LATE: Open Source Toolkit for Latvian and Latgalian Speech Transcription
Arturs Znotins, Didzis Gosko, Normunds Gruzitis
Scalable Offline ASR for Command-Style Dictation in Courtrooms
Kumarmanas Nethil, Vaibhav Mishra, Kriti Anandan, Kavya Manohar
Towards a dynamical model of transitions between fluent and stuttered speech
Yijing Lu, Khalil Iskarous, Louis Goldstein
Study of vocal fold vibration using M-mode ultrasound: a proof of concept
Juliette Dindart, Agnès Rouxel, Crystal Lin, Trung Kien Bui, Muriel Lefort, Claire Pillot-Loiseau, Christophe Trésallet, Frédérique Frouin
Articulatory Feature Prediction from Surface EMG during Speech Production
Jihwan Lee, Kevin Huang, Kleanthis Avramidis, Simon Pistrosch, Monica Gonzalez-Machorro, Yoonjeong Lee, Björn W. Schuller, Louis Goldstein, Shrikanth Narayanan
Enhancing Acoustic-to-Articulatory Speech Inversion by Incorporating Nasality
Saba Tabatabaee, Suzanne Boyce, Liran Oren, Mark Tiede, Carol Espy-Wilson
Modeling Probabilistic Reduction using Information Theory and Naive Discriminative Learning
Anna Stein, Kevin Tang
Contextual predictability effects on acoustic distinctiveness in read Polish speech
Zofia Malisz, Jan Foremski, Małgorzata Kul
How do both phonological and syntactic complexity influence speech planning?
Ivan Yuen, Katherine Demuth, Stefanie Shattuck-Hufnagel
When focus shapes the flow: prosodic restructuring in Mandarin complex nominals
Anqi Xu, Yu-yin Hsu
Investigating the Impact of Word Informativeness on Speech Emotion Recognition
Sofoklis Kakouros
Lexical stress affects lenition: The case of Italian palato-alveolar affricates
Bowei Shao, Philipp Buech, Anne Hermes, Maria Giavazzi
Evaluation of a model for sound radiation from the vocal tract wall
Peter Birkholz, Tianyi Zhang
FROST-EMA: Finnish and Russian Oral Speech Dataset of Electromagnetic Articulography Measurements with L1, L2 and Imitated L2 Accents
Satu Hopponen, Tomi Kinnunen, Alexandre Nikolaev, Rosa González Hautamäki, Lauri Tavi, Einar Meister
Modeling Formant Dynamics in Mandarin /ai/: Effects of Speech Style and Speech Rate
Yunzhuo Xiang, Jingyi Sun
Representation of Perceived Prosodic Similarity of Conversational Feedback
Livia Qian, Carol Figueroa, Gabriel Skantze
Prolongation in Romanian
Oana Niculescu, Monica Vasileanu
Speech Reduction in French: The Relationship Between Vowel Space and Articulation Dynamics
Kübra Bodur, Corinne Fredouille, Christine Meunier
Stress in Spoken and Whistled Greek
Andre Batchelder-Schwab, Vasileios Michos, Jonathan Barnes
Leveraging Text and Speech Processing for Suicide Risk Classification in Chinese Adolescents
Justyna Krzywdziak, Bartłomiej Eljasiak, Joanna Stępień, Michał Świątek, Agnieszka Pruszek
The 1st SpeechWellness Challenge: Detecting Suicide Risk Among Adolescents
Wen Wu, Ziyun Cui, Chang Lei, Yinan Duan, Diyang Qu, Ji Wu, Bowen Zhou, Runsen Chen, Chao Zhang
Leveraging Large Language Models for Spontaneous Speech-Based Suicide Risk Detection
Yifan Gao, Jiao Fu, Long Guo, Hong Liu
Predicting Adolescent Suicidal Risk from Multi-task-based Speech: An Ensemble Learning Approach
Xi Chen, Renzhe Yu, Yanshen Tan, Yiyi Li, Quan Qian, Ying Lin
In-context learning capabilities of Large Language Models to detect suicide risk among adolescents from speech transcripts
Filomene Roquefort, Alexandre Ducorroy, Rachid Riad
Language-Agnostic Suicidal Risk Detection Using Large Language Models
June-Woo Kim, Wonkyo Oh, Haram Yoon, Sung-Hoon Yoon, Dae-Jin Kim, Dong-Ho Lee, Sang-Yeol Lee, Chan-Mo Yang
Network of acoustic characteristics for the automatic detection of suicide risk from speech. Contribution to the 2025 SpeechWellness challenge by the Semawave team
Vincent P. Martin, Charles Brazier, Maxime Amblard, Michel Musiol, Jean-Luc Rouas
ProMode: A Speech Prosody Model Conditioned on Acoustic and Textual Inputs
Eray Eren, Qingju Liu, Hyeongwoo Kim, Pablo Garrido, Abeer Alwan
Counterfactual Activation Editing for Post-hoc Prosody and Mispronunciation Correction in TTS Models
Kyowoon Lee, Artyom Stitsyuk, Gunu Jho, Inchul Hwang, Jaesik Choi
Investigating Stochastic Methods for Prosody Modeling in Speech Synthesis
Paul Mayer, Florian Lux, Alejandro Pérez-González-de-Martos, Angelina Elizarova, Lindsey Vanderlyn, Dirk Väth, Ngoc Thang Vu
GST-BERT-TTS: Prosody Prediction Without Accentual Labels For Multi-Speaker TTS Using BERT With Global Style Tokens
Tadashi Ogura, Takuma Okamoto, Yamato Ohtani, Erica Cooper, Tomoki Toda, Hisashi Kawai
ExagTTS: An Approach Towards Controllable Word Stress Incorporated TTS for Exaggerated Synthesized Speech Aiding Second Language Learners
Anindita Mondal, Monica Surtani, Anil Kumar Vuppala, Parameswari Krishnamurthy, Chiranjeevi Yarra
Synthetic Data Generation for Phrase Break Prediction with Large Language Model
Hoyeon Lee, Sejung Son, Ye-Eun Kang, Jong-Hwan Kim
Speech Reference Intervals: An Assessment of Feasibility in Depression Symptom Severity Prediction
Lauren White, Ewan Carr, Judith Dineley, Catarina Botelho, Pauline Conde, Faith Matcham, Carolin Oetzmann, Amos Folarin, George Fairs, Agnes Norbury, Stefano Goria, Srinivasan Vairavan, Til Wykes, Richard Dobson, Vaibhav Naraya, Matthew Hotopf, Alberto Abad, Isabel Trancoso, Nicholas Cummins
DepressGEN: Synthetic Data Generation Framework for Depression Detection
Wenrui Liang, Rong Zhang, Xuezhen Zhang, Ying Ma, Wei-Qiang Zhang
Emotion-Guided Graph Attention Networks for Speech-Based Depression Detection under Emotion-Inducting Tasks
Yuqiu Zhou, Yongjie Zhou, Yudong Yang, Yang Liu, Jun Huang, Shuzhi Zhao, Rongfeng Su, Lan Wang, Nan Yan
Explainable Depression Detection using Masked Hard Instance Mining
Patawee Prakrankamanant, Shinji Watanabe, Ekapol Chuangsuwanich
Test-Time Training for Speech-based Depression Detection
Sri Harsha Dumpala, Chandramouli S. Sastry, Rudolf Uher, Sageev Oore
Leveraging Ordinal Information for Speech-based Depression Classification
Lishi Zuo, Man-Wai Mak
Zero-Shot Speech-Based Depression and Anxiety Assessment with LLMs
Erfan Loweimi, Sofia de la Fuente Garcia, Saturnino Luz
Towards the Objective Characterisation of Major Depressive Disorder Using Speech Data from a 12-week Observational Study with Daily Measurements
Robert Lewis, Szymon Fedor, Nelson Hidalgo Julia, Joshua Curtiss, Jiyeon Kim, Noah Jones, David Mischoulon, Thomas F Quatieri, Nicholas Cummins, Paola Pedrelli, Rosalind Picard
Can Speech Accurately Detect Depression in Patients With Comorbid Dementia? An Approach for Mitigating Confounding Effects of Depression and Dementia
Sophie Young, Fuxiang Tao, Bahman Mirheidari, Madhurananda Pahar, Markus Reuber, Heidi Christensen
Temporal Convolutional Network with Smoothed and Weighted Losses for Distant Voice Activity and Overlapped Speech Detection
Shaojie Li, Qintuya Si, De Hu
Attention Is Not Always the Answer: Optimizing Voice Activity Detection with Simple Feature Fusion
Kumud Tripathi, Chowdam Venkata Kumar, Pankaj Wasnik
SpeechMLC: Speech Multi-label Classification
Miseul Kim, Seyun Um, Hyeonjin Cha, Hong-Goo Kang
Fully End-to-end Streaming Open-vocabulary Keyword Spotting with W-CTC Forced Alignment
Dohyun Kim, Jiwook Hwang
Comparative Evaluation of Acoustic Feature Extraction Tools for Clinical Speech Analysis
Anna Seo Gyeong Choi, Alexander Richardson, Ryan Partlan, Sunny X. Tang, Sunghye Cho
Can We Trust Machine Learning? The Reliability of Features from Open-Source Speech Analysis Tools for Speech Modeling
Tahiya Chowdhury, Veronica Romero
HK-GenSpeech: A Generative AI Scene Creation Framework for Speech Based Cognitive Assessment
Vi Jun Sean Yong, Serkan Kumyol, Pau Le Lisa Low, Suk Wai Winnie Leung, Tristan Braud
Leveraging AM and FM Rhythm Spectrograms for Dementia Classification and Assessment
Parismita Gogoi, Vishwanath Pratap Singh, Seema Khadirnaikar, Soma Siddhartha, Sishir Kalita, Jagabandhu Mishra, Md Sahidullah, Priyankoo Sarmah, S. R. M. Prasanna
Leveraging Cascaded Binary Classification and Multimodal Fusion for Dementia Detection through Spontaneous Speech
Yin-Long Liu, Yuanchao Li, Rui Feng, Liu He, Jia-Xin Chen, Yi-Ming Wang, Yu-Ang Chen, Yan-Han Peng, Jia-Hong Yuan, Zhen-Hua Ling
Whisper-Based Multilingual Alzheimer's Disease Detection and Improvements for Low-Resource Language
Kaichen Jia, Jinpeng Li, Ke Li, Wei-Qiang Zhang
PPGs-BERT: Leveraging Phoneme Sequence and BERT for Alzheimer’s Disease Detection from Spontaneous Speech
Qi Sun, Ziyue Qiu, Yu Pu, Jinpeng Li, Xuchu Chen, Wei-Qiang Zhang
LLM-based phoneme-to-grapheme for phoneme-based speech recognition
Te Ma, Min Bi, Saierdaer Yusuyin, Hao Huang, Zhijian Ou
Pinyin-Guided Chinese Speech Recognition with Large Language Model
Jie Zhengjie, Gaofeng Cheng
Text-Enhanced Audio Encoder for Large Language Model based Speech Recognition via Cross-Modality Pre-training with Unpaired Audio-Text Data
Hang Su, Yuxiang Kong, Lichun Fan, Jian Luan
Towards atypical speech transcription using LLM-based ASR
Jinda Zhang, Aanchan Mohan
Better Pseudo-labeling with Multi-ASR Fusion and Error Correction by SpeechLLM
Jeena Prakash, Blessingh Kumar, Kadri Hacioglu, Bidisha Sharma, Sindhuja Gopalan, Malolan Chetlur, Shankar Venkatesan, Andreas Stolcke
Leveraging LLM and Self-Supervised Training Models for Speech Recognition in Chinese Dialects: A Comparative Analysis
Tianyi Xu, Hongjie Chen, Qing Wang, Lv Hang, Jian Kang, Jie Li, Zhennan Lin, Yongxiang Li, Lei Xie
Synonymity-Based Semantic Coding for Efficient Speech Compression
Shanhui Gan, Zijian Liang, Kai Niu, Ping Zhang
Towards an Ultra-Low-Delay Neural Audio Coding with Computational Efficiency
Byeong Hyeon Kim, Hyungseob Lim, Inseon Jang, Hong-Goo Kang
SpecTokenizer: A Lightweight Streaming Codec in the Compressed Spectrum Domain
Zixiang Wan, Guochang Zhang, Yifeng He, Jianqiang Wei
TS3-Codec: Transformer-Based Simple Streaming Single Codec
Haibin Wu, Naoyuki Kanda, Sefik Emre Eskimez, Jinyu Li
Towards Bitrate-Efficient and Noise-Robust Speech Coding with Variable Bitrate RVQ
Yunkee Chae, Kyogu Lee
LSPnet: an ultra-low bitrate hybrid neural codec
Bowen Zhang, Ian McLoughlin, Xiaoxiao Miao, AS Madhukumar
Vision-Integrated High-Quality Neural Speech Coding
Yao Guo, Yang Ai, Rui-Chen Zheng, Hui-Peng Du, Xiao-Hang Jiang, Zhen-Hua Ling
Neural Spectral Band Generation for Audio Coding
Woongjib Choi, Byeong Hyeon Kim, Hyungseob Lim, Inseon Jang, Hong-Goo Kang
Multi-Channel Acoustic Echo Cancellation Based on Direction-of-Arrival Estimation
Fei Zhao, Xueliang Zhang, Zhong-Qiu Wang
Simultaneous Masked and Unmasked Decoding with Speculative Decoding Masking for Fast ASR without Accuracy Loss
Koji Okabe, Hitoshi Yamamoto
WIND: Accelerated RNN-T Decoding with Windowed Inference for Non-blank Detection
Hainan Xu, Vladimir Bataev, Lilit Grigoryan, Boris Ginsburg
NGPU-LM: GPU-Accelerated N-Gram Language Model for Context-Biasing in Greedy ASR Decoding
Vladimir Bataev, Andrei Andrusenko, Lilit Grigoryan, Aleksandr Laptev, Vitaly Lavrukhin, Boris Ginsburg
Pushing the Limits of Beam Search Decoding for Transducer-based ASR models
Lilit Grigoryan, Vladimir Bataev, Andrei Andrusenko, Hainan Xu, Vitaly Lavrukhin, Boris Ginsburg
Skip-Salsa: Skip Synchronous Fusion of ASR LLM Decoders
Ashish Mittal, Darshan Prabhu, Sunita Sarawagi, Preethi Jyothi
Efficient Trie-based Biasing using K-step Prediction for Rare Word Recognition
Chin Yuen Kwok, Jia Qi Yip
Web-Based Application for Real-Time Biofeedback of Vocal Resonance in Gender-Affirming Voice Training: Design and Usability Evaluation
Tara McAllister, Collin Eagen, Yi Shan, Peter Traver, Daphna Harel, Tae Hong Park, Vesna Novak
On the Production and Perception of a Single Speaker's Gender
Robin Netzorg, Naomi Carvalho, Andrea Guzman, Lydia Wang, Juliana Francis, Klo Vivienne Garoute, Keith Johnson, Gopala Anumanchipalli
Conveying Gender Through Speech: Insights from Trans Men
Alice Ross, Cliodhna Hughes, Eddie L. Ungless, Catherine Lai
Queer Waves: A German Speech Dataset Capturing Gender and Sexual Diversity from Podcasts and YouTube
Ingo Siegert, Jan Marquenie, Sven Grawunder
Reddit FlairShare: A Human-Annotated Dataset of Gender-Progressive Online Discourse
Carlos Hartmann
Voices of 'cyborg awesomeness': Posthuman embodiment of nonbinary gender expression in AI speech technologies
Maxwell Hope, Éva Székely
Neutral Tone Variation in Beijing Mandarin: Is Neutral Tone Toneless?
Xiao Dong, Fengming Liu, Chien-Jer Lin, Monica Nesbitt, Shuju Shi
The Role of Syntactic Structures in Shaping Directionality in Trisyllabic Tone Sandhi: Evidence from Tianjin Mandarin
Siqi Lu, Hui Feng, Ziyu Xiong
Acoustic Representation and Realization of Weak Elements Subcategories: In the Case of Tianjin Mandarin
Zhijie Li, Hui Feng
Lexical competition in the process of Cantonese tone merging: Diverse Impact Mechanisms Across Different Individuals and Tone Pairs
Lishan Li, Yaolin Zhou, Xiaoying Xu
Tonal Perception in Changde Mandarin
Zhenrui Zhang, Fang Hu
Tonal Contrasts in the Malipo Variety of the Mienic Language
Changhong Du, Fang Hu
Building an Accurate Open-Source Hebrew ASR System through Crowdsourcing
Yanir Marmor, Yair Lifshitz, Yoad Snapir, Kinneret Misgav
A Practitioner’s Guide to Building ASR Models for Low-Resource Languages: A Case Study on Scottish Gaelic
Ondřej Klejch, William Lamb, Peter Bell
Automatic Speech Recognition for Low-Resourced Middle Eastern Languages
Razhan Hameed, Sina Ahmadi, Hanah Hadi, Rico Sennrich
In-context Language Learning for Endangered Languages in Speech Recognition
Zhaolin Li, Jan Niehues
CS-FLEURS: A Massively Multilingual and Code-Switched Speech Dataset
Brian Yan, Injy Hamed, Shuichiro Shimizu, Vasista Sai Lodagala, William Chen, Olga Iakovenko, Bashar Talafha, Amir Hussein, Alexander Polok, Kalvin Chang, Dominik Klement, Sara Althubaiti, Puyuan Peng, Matthew Wiesner, Thamar Solorio, Ahmed Ali, Sanjeev Khudanpur, Shinji Watanabe
Weakly Supervised Data Refinement and Flexible Sequence Compression for Efficient Thai LLM-based ASR
Mingchen Shao, Xinfa Zhu, Chengyou Wang, Bingshen Mu, Hai Li, Ying Yan, Junhui Liu, Danming Xie, Lei Xie
Can we train ASR systems on Code-switch without real code-switch data? Case study for Singapore's languages
Tuan Nguyen, Huy Dat Tran
Swedish Whispers; Leveraging a Massive Speech Corpus for Swedish Speech Recognition
Leonora Vesterbacka, Faton Rekathati, Robin Kurtz, Justyna Sikora, Agnes Toftgård
Room Impulse Response as a Prompt for Acoustic Echo Cancellation
Fei Zhao, Shulin He, Xueliang Zhang
CAGCRN: Real-Time Speech Enhancement with a Lightweight Model for Joint Acoustic Echo Cancellation and Noise Suppression
Yuyang Wang, Yonghui Liu, Jianbing Liu, Kai Niu, Zhiqiang He
Exploiting Echo Path Priors for Enhanced Stereo Acoustic Echo Cancellation
Jinfu Wang, Ziteng Wang, Xin Liu, Yang Liu, Qing Shi, Zhengqiang Luo, Feiran Yang
Extended Loss: Incorporating Long Context into Training Models when using Short Audio Frames
Quang Minh Dinh, Hoda Rezaee Kaviani, Mehrdad Hosseinzadeh, Yuanhao Yu
Analysis and Extension of a Near-End Listening Enhancement Method Based on Long-Term Fractile Noise Statistics
Filippo Villani, Wai-Yip Chan, Zheng-Hua Tan, Jan Østergaard, Jesper Jensen
A Novel Deep Learning Framework for Efficient Multichannel Acoustic Feedback Control
Yuan-Kuei Wu, Juan Azcarreta Ortiz, Kashyap Patel, Buye Xu, Jung-Suk Lee, Sanha Lee, Ashutosh Pandey
Diffusion Buffer: Online Diffusion-based Speech Enhancement with Sub-Second Latency
Bunlong Lay, Rostilav Makarov, Timo Gerkmann
Addressing Task Conflicts in Stuttering Detection via MMoE-Based Multi-Task Learning
Xiaokang Liu, Xingfeng Li, Yudong Yang, Lan Wang, Nan Yan
Comparison of Acoustic and Textual Features for Dysarthria Severity Classification in Amyotrophic Lateral Sclerosis
Upendra Vishwanath Y. S., Tanuka Bhattacharjee, Deekshitha G, Sathvik Udupa, Kumar Chowdam Venkata Thirumala, Madassu Keerthipriya, Darshan Chikktimmegowda, Dipti Baskar, Yamini Belur, Seena Vengalil, Atchayaram Nalini, Prasanta Kumar Ghosh
StutterCut: Uncertainty-Guided Normalised Cut for Dysfluency Segmentation
Suhita Ghosh, Melanie Jouaiti, Jan-Ole Perschewski, Sebastian Stober
Physiologically-Informed Feature Analysis of Acquired Speech Disorders for Stroke Assessment
Giulia Sanguedolce, Jón Guðnason, Dragos C. Gruia, Emilie d'Olne, Fatemeh Geranmayeh, Patrick A. Naylor
Robot-assisted Recognition of Vocal Emotions in Pseudospeech for Cochlear Implanted Adolescents
Gloria Araiza-Illan, Luke Meyer, Bert Maat, Deniz Başkent
Using Neurogram Similarity Index Measure (NSIM) to Model Hearing Loss and Cochlear Neural Degeneration
Ahsan Cheema, Sunil Puria
Contrastive Learning-based Syllable-Level Mispronunciation Detection and Diagnosis for Speech Audiometry
Longbin Jin, Donghun Min, Jung Eun Shin, Eun Yi Kim
A Deformable Convolution GAN Approach for Speech Dereverberation in Cochlear Implant Users
Hsin-Tien Chiang, John H.L. Hansen
L3C-DeepMFC: Low-Latency Low-Complexity Deep Marginal Feedback Cancellation with Closed-Loop Fine Tuning for Hearing Aids
Fengyuan Hao, Brian C. J. Moore, Huiyong Zhang, Xiaodong Li, Chengshi Zheng
Semantic Processing During Spoken Word Production by Children with Cochlear Implants
Man Wang, Yixin Ding, Niels Schiller
Linguistic Masking and Its Release in Simulated Electric-acoustic Hearing
Yuting Ding, Xuefei Wang, Fei Chen
Lessons Learned from the URGENT 2024 Speech Enhancement Challenge
Wangyou Zhang, Kohei Saijo, Samuele Cornell, Robin Scheibler, Chenda Li, Zhaoheng Ni, Anurag Kumar, Marvin Sach, Wei Wang, Yihui Fu, Shinji Watanabe, Tim Fingscheidt, Yanmin Qian
Interspeech 2025 URGENT Speech Enhancement Challenge
Kohei Saijo, Wangyou Zhang, Samuele Cornell, Robin Scheibler, Chenda Li, Zhaoheng Ni, Anurag Kumar, Marvin Sach, Yihui Fu, Wei Wang, Tim Fingscheidt, Shinji Watanabe
TS-URGENet: A Three-stage Universal Robust and Generalizable Speech Enhancement Network
Xiaobin Rong, Dahan Wang, Qinwen Hu, Yushi Wang, Yuxiang Hu, Jing Lu
Multistage Universal Speech Enhancement System for URGENT Challenge
Xiaohuai Le, Zhuangqi Chen, Siyu Sun, Xianjun Xia, Chuanzeng Huang
Scaling beyond Denoising: Submitted System and Findings in URGENT Challenge 2025
Zhihang Sun, Andong Li, Tong Lei, Rilin Chen, Meng Yu, Chengshi Zheng, Yi Zhou, Dong Yu
DeepFilterGAN: A Full-band Real-time Speech Enhancement System with GAN-based Stochastic Regeneration
Sanberk Serbest, Tijana Stojkovic, Milos Cernak, Andrew Harper
FUSE: Universal Speech Enhancement using Multi-Stage Fusion of Sparse Compression and Token Generation Models for the URGENT 2025 Challenge
Nabarun Goswami, Tatsuya Harada
Universal Speech Enhancement with Regression and Generative Mamba
Rong Chao, Rauf Nasretdinov, Yu-Chiang Frank Wang, Ante Jukic, Szu-Wei Fu, Yu Tsao
Structured pruning for efficient systolic array accelerated cascade Speech-to-Text Translation
Jean-Luc Rouas, Charles Brazier, Leila Ben Letaifa, Rafael Medina, Pedro Palacios, David Atienza, Giovanni Ansaloni
Scaling pseudo-labeling data for end-to-end low-resource speech translation (the case of Kurdish language)
Mohammad Mohammadamini, Aghilas Sini, Marie Tahon, Antoine Laurent
Multilingual Query-by-Example KWS for Indian Languages using Transliteration
Kirandevraj R, Vinod K Kurmi, Vinay Namboodiri, CV Jawahar
Novel Parasitic Dual-Scale Modeling for Efficient and Accurate Multilingual Speech Translation
Chenyang Le, Yinfeng Xia, Huiyan Li, Manhong Wang, Yutao Sun, Xingyang Ma, Yanmin Qian
A Multi-Dialectal Dataset for German Dialect ASR and Dialect-to-Standard Speech Translation
Verena Blaschke, Miriam Winkler, Constantin Förster, Gabriele Wenger-Glemser, Barbara Plank
NIRANTAR: Continual Learning with New Languages and Domains on Real-world Speech Data
Tahir Javed, Kaushal Bhogale, Mitesh M. Khapra
Temporal Modeling of Room Impulse Response Generation via Multi-Scale Autoregressive Learning
Sheng Lyu, Yuemin Yu, Chenshu Wu
Effect of Noise Floor in Room Impulse Response on Speech Perception Under Spherical Harmonics-based Spatial Sound Reproduction
Yunqi C. Zhang, Dhruv Jagmohan, Hong Kit Li, C. T. Justine Hui, Yusuke Hioka
Direction-Aware Neural Acoustic Fields for Few-Shot Interpolation of Ambisonic Impulse Responses
Christopher Ick, Gordon Wichern, Yoshiki Masuyama, François G. Germain, Jonathan Le Roux
AuralNet: Hierarchical Attention-based 3D Binaural Localization of Overlapping Speakers
Linya Fu, Yu Liu, Zhijie Liu, Zedong Yang, Zhong-Qiu Wang, Youfu Li, He Kong
SoundSculpt: Direction and Semantics Driven Ambisonic Target Sound Extraction
Tuochao Chen, D Shin, Hakan Erdogan, Sinan Hersek
TF-Mamba: A Time-Frequency Network for Sound Source Localization
Yang Xiao, Rohan Kumar Das
Articulatory modeling of the S-shaped F2 trajectories observed in Öhman's spectrographic analysis of VCV syllables
Frédéric Berthommier
The Role of Voiced Consonant Duration in Sung Vowel-Consonant and Consonant-Vowel Recognition
Allan Vurma, Einar Meister, Lya Meister, Jaan Ross, Marju Raju, Veeda Kala, Tuuri Dede
How sibilant spectra shape gender perception in prepubertal children: A voice morphing study
Riccarda Funk, Melanie Weirich, Adrian Simpson
Constrained LDDMM for Dynamic Vocal Tract Morphing: Integrating Volumetric and Real-Time MRI
Tharinda Piyadasa, Joan Glaunès, Amelia Gully, Michael Proctor, Kirrie Ballard, Tünde Szalay, Naeim Sanaei, Sheryl Foster, David Waddington, Craig Jin
2D Immersed Boundary Method in Vocal Tract Acoustics: An Eulerian–Lagrangian Model for Simulation of Diphthongs
Rongshuai Wu, Debasish Ray Mohapatra, Sidney Fels
Reconstruction of the Complete Vocal Tract Contour Through Acoustic to Articulatory Inversion Using Real-Time MRI Data
Sofiane Azzouz, Pierre-André Vuissoz, Yves Laprie
Co-registration of real-time MRI and respiration for speech research
Yubin Zhang, Prakash Kumar, Ye Tian, Ziwei Zhao, Xuan Shi, Kevin Huang, Kevin Lee, Haley Hsu, Shrikanth Narayanan, Krishna Nayak, Louis Goldstein
SPEAKtoCOPD: a flashmob study to collect COPD speech
Loes van Bemmel, Lauren G Reinders, Folkert Brijker, Bas Holverda, Frits M.E. Franssen, Hanneke van Helvoort, Visara Urovi, Marieke Spreeuwenberg, Sami O. Simons
Developing a LeFF Transformer Model for Exacerbated Speech Detection in COPD and Asthma
Yuyang Yan, Sami O. Simons, Visara Urovi
Towards Pre-training an Effective Respiratory Audio Foundation Model
Daisuke Niizumi, Daiki Takeuchi, Masahiro Yasuda, Binh Thien Nguyen, Yasunori Ohishi, Noboru Harada
Effect of physical exercise on voice in people living with COPD
Lauren G Reinders, Loes van Bemmel, Alexander Mackay, David Nobbs, Frits M.E. Franssen, Hester Gietema, Simona Schäfer, Sami O. Simons
Adaptive Differential Denoising for Respiratory Sounds Classification
Gaoyang Dong, Zhicheng Zhang, Ping Sun, Minghui Zhang
Disentangling Dual-Encoder Masked Autoencoder for Respiratory Sound Classification
Peidong Wei, Shiyu Miao, Lin Li
Patient-Aware Feature Alignment for Robust Lung Sound Classification: Cohesion-Separation and Global Alignment Losses
Seung Gyu Jeong, Seong Eun Kim
Improving Respiratory Sound Classification with Architecture-Agnostic Knowledge Distillation from Ensembles
Miika Toikkanen, June-Woo Kim
Theoretical proposal for a unified Bayesian model of adaptation in non-interactive and interactive speech production
Mélen Guillaume, Anahita Basirat, Julien Diard
Self-supervised Optimality-Guided Learning of Speech Articulation
Juraj Šimko, Benjamin Elie, Alice Turk
Extended High-frequency Cues to Phoneme Recognition: Insights from ASR
Zhe-chen Guo, Bharath Chandrasekaran
Decoding Speaker-Normalized Pitch from EEG for Mandarin Perception
Jia-Xin Chen, Yi-Ming Wang, Ziyu Zhang, Jiayang Han, Yin-Long Liu, Rui Feng, Xiuyuan Liang, Zhen-Hua Ling, Jia-Hong Yuan
SSF-DST: A Spectro-Spatial Features Enhanced Deep Spatiotemporal Network for EEG-Based Auditory Attention Detection
Tong Zhu, Xiaoke Yang, Jian Zhou, Lu Li, Zhao Lv, Cunhang Fan
Overestimated performance of auditory attention decoding caused by experimental design in EEG recordings
Yujie Yan, Xiran Xu, Haolin Zhu, Songyi Li, Bo Wang, Xihong Wu, Jing Chen
A real-time MRI study on asymmetry in velum dynamics during VCV production with nasal sounds
Chetan Sharma, Vaishnavi Chandwanshi, Shreya Shrikant Karkun, Aditya Anand Gupta, Prasanta Kumar Ghosh
Exploratory Analysis of Brainstem fMRI Data During Sustained Phonation
Carey Smith, Hu Cheng, Pertti Palo, Daniel Aalto, Steven M. Lulich
Gaze-Enhanced Multimodal Turn-Taking Prediction in Triadic Conversations
Seongsil Heo, Christi Miller, Calvin Murdock, Michael Proulx
Visual Cues Support Robust Turn-taking Prediction in Noise
Sam O'Connor Russell, Naomi Harte
Backchannel prediction for natural spoken dialog systems using general speaker and listener information
Yoshinori Fukunaga, Ryota Nishimura, Kengo Ohta, Norihide Kitaoka
Rapport-Building Dialogue Strategies for Deeper Connection: Integrating Proactive Behavior, Personalization, and Aizuchi Backchannels
Muhammad Yeza Baihaqi, Angel García Contreras, Seiya Kawano, Koichiro Yoshino
Does effortful speech production indicate communication difficulty caused by noise and hearing aid support?
Lena-Marie Huttner, Jeppe H. Christensen, Gitte Keidser, Tobias May, Torsten Dau, Sergi Rotger-Griful
"Dyadosyncrasy", Idiosyncrasy and Demographic Factors in Turn-Taking
Julio Cesar Cavalcanti, Gabriel Skantze
SSPS: Self-Supervised Positive Sampling for Robust Self-Supervised Speaker Verification
Theo Lepage, Reda Dehak
ParaNoise-SV: Integrated Approach for Noise-Robust Speaker Verification with Parallel Joint Learning of Speech Enhancement and Noise Extraction
Minu Kim, Kangwook Jang, Hoirin Kim
Disentangling Speaker and Content in Pre-trained Speech Models with Latent Diffusion for Robust Speaker Verification
Zhe Li, Man-Wai Mak, Jen-Tzung Chien, Mert Pilanci, Zezhong Jin, Helen Meng
Evaluating Deep Speaker Embedding Robustness to Domain, Sampling Rate, and Codec Variations
Alexandre Ferro Filho, Diogo Fernandes Costa Silva, Pedro Elias Engelberg Silva Borges, Arlindo Rodrigues Galvão Filho
Towards Robust Speaker Recognition against Intrinsic Variation with Foundation Model Few-shot Tuning and Effective Speech Synthesis
Zhiyong Chen, Shuhang Wu, Xinnuo Li, Zhiqi Ai, Shugong Xu
Bayesian Learning for Domain-Invariant Speaker Verification and Anti-Spoofing
Jin Li, Man-Wai Mak, Johan Rohdin, Kong Aik Lee, Hynek Hermansky
Switch Conformer with Universal Phonetic Experts for Multilingual ASR
Masato Mimura, Jaeyoung Lee, Tatsuya Kawahara
Language-Aware Prompt Tuning for Parameter-Efficient Seamless Language Expansion in Multilingual ASR
Hongli Yang, Sheng Li, Hao Huang, Ayiduosi Tuohan, Yizhou Peng
Efficient Multilingual ASR Finetuning via LoRA Language Experts
Jiahong Li, Yiwen Shao, Jianheng Zhuo, Chenda Li, Liliang Tang, Dong Yu, Yanmin Qian
Mixture of LoRA Experts for Low-Resourced Multi-Accent Automatic Speech Recognition
Raphaël Bagat, Irina Illina, Emmanuel Vincent
Effects of Speaker Count, Duration, and Accent Diversity on Zero-Shot Accent Robustness in Low-Resource ASR
Zheng Xin Yong, Vineel Pratap, Michael Auli, Jean Maillard
Leveraging Geographic Metadata for Dialect-Aware Speech Recognition
Pouya Mehralian, Hugo Van hamme
Overcoming Data Scarcity in Multi-Dialectal Arabic ASR via Whisper Fine-Tuning
Ömer Tarik Özyilmaz, Matt Coler, Matias Valdenegro-Toro
VietASR: Achieving Industry-level Vietnamese ASR with 50-hour labeled data and Large-Scale Speech Pretraining
Jianheng Zhuo, Yifan Yang, Yiwen Shao, Yong Xu, Dong Yu, Kai Yu, Xie Chen
Open Universal Arabic ASR Leaderboard
Yingzhi Wang, Anas Alhmoud, Muhammad Alqurishi
Mel-McNet: A Mel-Scale Framework for Online Multichannel Speech Enhancement
Yujie Yang, Bing Yang, Xiaofei Li
A Lightweight Hybrid Dual Channel Speech Enhancement System under Low-SNR Conditions
Zheng Wang, Xiaobin Rong, Yu Sun, Tianchi Sun, Zhibin Lin, Jing Lu
ARiSE: Auto-Regressive Multi-Channel Speech Enhancement
Pengjie Shen, Xueliang Zhang, Zhong-Qiu Wang
WTFormer: A Wavelet Conformer Network for MIMO Speech Enhancement with Spatial Cues Preservation
Lu Han, Junqi Zhao, Renhua Peng
A Three-Stage Beamforming with Harmonic Guidance for Multi-Channel Speech Enhancement
Nurali Alip, Tianrui Wang, Rui Cao, Meng Ge, Jingru Lin, Longbiao Wang, Jianwu Dang
Speech Enhancement with Dual-path Multi-Channel Linear Prediction Filter and Multi-norm Beamforming
Chengyuan Qin, Wenmeng Xiong, Jing Zhou, Maoshen Jia, Changchun Bao
Exploring SSL Discrete Speech Features for Zipformer-based Contextual ASR
Mingyu Cui, Yifan Yang, Jiajun Deng, Jiawen Kang, Shujie Hu, Tianzi Wang, Zhaoqing Li, Shiliang Zhang, Xie Chen, Xunying Liu
Self-supervised learning of speech representations with Dutch archival data
Nik Vaessen, Roeland Ordelman, David A. van Leeuwen
GigaAM: Efficient Self-Supervised Learner for Speech Recognition
Aleksandr Kutsakov, Alexandr Maximenko, Georgii Gospodinov, Pavel Bogomolov, Fyodor Minkin
DiceHuBERT: Distilling HuBERT with a Self-Supervised Learning Objective
Hyung-gun Chi, Zakaria Aldeneh, Tatiana Likhomanenko, Oggi Rudovic, Takuya Higuchi, Li-Wei Chen, Shinji Watanabe, Ahmed Hussen Abdelaziz
Differentiable K-means for Fully-optimized Discrete Token-based ASR
Kentaro Onda, Yosuke Kashiwagi, Emiru Tsunoo, Hayato Futami, Shinji Watanabe
Towards Early Prediction of Self-Supervised Speech Model Performance
Ryan Whetten, Lucas Maison, Titouan Parcollet, Marco Dinarelli, Yannick Estève
VibE-SVC: Vibrato Extraction with High-frequency F0 Contour for Singing Voice Conversion
Joon-Seung Choi, Dong-Min Byun, Hyung-Seok Oh, Seong-Whan Lee
TVC-MusicGen: Time-Varying Structure Control for Background Music Generation via Self-Supervised Training
Chenyu Yang, Hangting Chen, Shuai Wang, Haina Zhu, Haizhou Li
Audiobox TTA-RAG: Improving Zero-Shot and Few-Shot Text-To-Audio with Retrieval-Augmented Generation
Mu Yang, Bowen Shi, Matthew Le, Wei-Ning Hsu, Andros Tjandra
Bridging Speech and Singing: Multi-stage Speech-Prompted Singing Voice Conversion with Speaker Embedding Adaptation
Mingda Liu, Jiatong Shi
Neurodyne: Neural Pitch Manipulation with Representation Learning and Cycle-Consistency GAN
Yicheng Gu, Chaoren Wang, Zhizheng Wu, Lauri Juvela
VS-Singer: Vision-Guided Stereo Singing Voice Synthesis with Consistency Schrödinger Bridge
Zijing Zhao, Kai Wang, Hao Huang, Ying Hu, Liang He, Jichen Yang
DAFMSVC: One-Shot Singing Voice Conversion with Dual Attention Mechanism and Flow Matching
Wei Chen, Binzhu Sha, Dan Luo, Jing Yang, Zhuo Wang, Fan Fan, Zhiyong Wu
Simple and Effective Content Encoder for Singing Voice Conversion via SSL-Embedding Dimension Reduction
Wangjin Zhou, Tianjiao Du, Chenglin Xu, Sheng Li, Yi Zhao, Tatsuya Kawahara
Song Form-aware Full-Song Text-to-Lyrics Generation with Multi-Level Granularity Syllable Count Control
Yunkee Chae, Eunsik Shin, Suntae Hwang, Seungryeol Paik, Kyogu Lee
Multitalker Babble in English Vowel Perception Training: A Comparison between Humans and Neural Models
Wenwei Dong, Alif Silpachai, Catia Cucchiarini, Helmer Strik
Speech stimulus design to study the neural coding of speech and the impact of cochlear synaptopathy
Etienne Gaudrain, Sarah Verhulst, Deniz Başkent
Prediction of listening effort ratings for habitual and clear-Lombard speech presented in noise
Esther Janse, Chen Shen, Martin Cooke
Language and Accent Familiarity Effects on the Use of Acoustic Cues in Talker Identification
Shengyue Xiong, Zhe-chen Guo, Bharath Chandrasekaran
Characterization of voice cue sensitivity and vocal emotion recognition across the adult lifespan
Laura Rachman, Deniz Başkent
Creaky Voice Facilitates More Efficient Phonological Processing of Mandarin Tone 3
Zixia Fan, Ronny Ibrahim, Joshua Penney, Felicity Cox
Training Onset-and-Offset-Aware Sound Event Detection on a Heterogeneous Dataset via Probabilistic Sequential Modeling
Tomoya Yoshinaga, Yoshiaki Bando, Keitaro Tanaka, Keisuke Imoto, Masaki Onishi, Shigeo Morishima
Multi-view Fusion and Parameter Perturbation for Few-Shot Class-Incremental Audio Classification
Yulu Fang, Mingyue He, Qisheng Xu, Jianqiao Zhao, Cheng Yang, Kele Xu, Yong Dou
Fully Few-shot Class-incremental Audio Classification Using Multi-level Embedding Extractor and Ridge Regression Classifier
Yongjie Si, Yanxiong Li, Jiaxin Tan, Qianhua He, Il-Youp Kwak
Beyond Conventional Metrics: using Entropic Triangles to Explain Balancing Methods in Acoustic Scene Classification
Claudia Montero-Ramírez, Alba Martínez-Serrano, Jorge Garcelán-Gómez, Francisco J. Valverde-Albacete, Carmen Peláez-Moreno
Domain Adaptation Method and Modality Gap Impact in Audio-Text Models for Prototypical Sound Classification
Emiliano Acevedo, Martín Rocamora, Magdalena Fuentes
Unified Microphone Conversion: Many-to-Many Device Mapping via Feature-wise Linear Modulation
Myeonghoon Ryu, Hongseok Oh, Suji Lee, Han Park
The NaijaVoices Dataset: Cultivating Large-Scale, High-Quality, Culturally-Rich Speech Data for African Languages
Chris Emezue, The NaijaVoices Community, Busayo Awobade, Abraham Toluwase Owodunni, Handel Emezue, Gloria Monica Tobechukwu Emezue, Nefertiti Nneoma Emezue, Sewade Ogun, Bunmi Akinremi, David Ifeoluwa Adelani, Chris Pal
FaiST: A Benchmark Dataset for Fairness in Speech Technology
Maliha Jahan, Yinglun Sun, Priyam Mazumdar, Zsuzsanna Fagyal, Thomas Thebaud, Jesus Villalba, Mark Hasegawa-Johnson, Najim Dehak, Laureano Moro Velazquez
On the Language and Gender Biases in PSTN, VoIP and Neural Audio Codecs
Kemal Altwlkany, Amar Kuric, Emanuel Lacic
Evaluating Speech Enhancement Performance Across Demographics and Language
Jose Giraldo, Alex Peiró-Lilja, Carme Armentano-Oller, Rodolfo Zevallos, Cristina España-Bonet
Towards Better Disentanglement in Non-Autoregressive Zero-Shot Expressive Voice Conversion
Şeymanur Akti, Tuan-Nam Nguyen, Alexander Waibel
Voice Conversion for Likability Control via Automated Rating of Speech Synthesis Corpora
Hitoshi Suda, Shinnosuke Takamichi, Satoru Fukayama
REWIND: Speech Time Reversal for Enhancing Speaker Representations in Diffusion-based Voice Conversion
Ishan D. Biyani, Nirmesh J. Shah, Ashishkumar P. Gudmalwar, Pankaj Wasnik, Rajiv R. Shah
Training-Free Voice Conversion with Factorized Optimal Transport
Alexander Lobashev, Assel Yermekova, Maria Larchenko
E2E-BPVC: End-to-End Background-Preserving Voice Conversion via In-Context Learning
Yihan Liu, Zhengyang Chen, Leying Zhang, Yanmin Qian
Discl-VC: Disentangled Discrete Tokens and In-Context Learning for Controllable Zero-Shot Voice Conversion
Kaidi Wang, Wenhao Guan, Ziyue Jiang, Hukai Huang, Peijie Chen, Weijie Wu, Qingyang Hong, Lin Li
ReFlow-VC: Zero-shot Voice Conversion Based on Rectified Flow and Speaker Feature Optimization
Pengyu Ren, Wenhao Guan, Kaidi Wang, Peijie Chen, Qingyang Hong, Lin Li
In This Environment, As That Speaker: A Text-Driven Framework for Multi-Attribute Speech Conversion
Jiawei Jin, Zhihan Yang, Yixuan Zhou, Zhiyong Wu
LinearVC: Linear Transformations of Self-Supervised Features Through the Lens of Voice Conversion
Herman Kamper, Benjamin van Niekerk, Julian Zaïdi, Marc-André Carbonneau
Speaker Normalization and Content Restoration for Zero-Shot Voice Conversion with Attention-Enhanced Discriminator
Desheng Hu, Yang Xiang, Jian Lu, Xinhui Hu, Xinkang Xu
Optimizing Pause Context in Fine-Tuning Pre-trained Large Language Models for Dementia Detection
Xiaoquan Ke, Man-Wai Mak, Helen Meng
WhisperD: Dementia Speech Recognition and Filler Word Detection with Whisper
Emmanuel Akinrintoyo, Nadine Abdelhalim, Nicole Salomons
Acoustic and Linguistic Biomarkers for Cognitive Impairment Detection from Speech
Catarina Botelho, David Gimeno-Gómez, Francisco Teixeira, John Mendonça, Patrícia Pereira, Diogo A. P. Nunes, Thomas Rolland, Anna Pompili, Rubén Solera-Ureña, Maria Ponte, David Martins de Matos, Carlos-D. Martínez-Hinarejos, Isabel Trancoso, Alberto Abad
Alzheimer’s Dementia Detection Using Perplexity from Paired Large Language Models
Yao Xiao, Heidi Christensen, Stefan Goetze
Understanding Dementia Speech Alignment with Diffusion-Based Image Generation
Mansi, Anastasios Lepipas, Dominika C Woszczyk, Yiying Guan, Soteris Demetriou
ClaritySpeech: Dementia Obfuscation in Speech
Dominika C Woszczyk, Ranya Aloufi, Soteris Demetriou
Quadruple Path Modeling with Latent Feature Transfer for Permutation-free Continuous Speech Separation
Jihyun Kim, Doyeon Kim, Hyewon Han, Jinyoung Lee, Jonguk Yoo, Chang Woo Han, Jeongook Song, Hoon-Young Cho, Hong-Goo Kang
End-to-End DOA-Guided Speech Extraction in Noisy Multi-Talker Scenarios
Kangqi Jing, Wenbin Zhang, Yu Gao
Speaker Separation for an Unknown Number of Speakers with Encoder-Decoder-Based Contextual Information Module
Xue Yang, Guiru Shen, Yu Yang
Attractor-Based Speech Separation of Multiple Utterances by Unknown Number of Speakers
Yuzhu Wang, Archontis Politis, Konstantinos Drossos, Tuomas Virtanen
ReSepNet: A Unified-Light Model for Recursive Speech Separation with Unknown Speaker Count
Hadi Alizadeh, Rahil Mahdian Toroghi, Hassan Zareian
Deep-Simplex Multichannel Speech Separation
Tzlil Avidan, Bracha Laufer-Goldshtein
FLASepformer: Efficient Speech Separation with Gated Focused Linear Attention Transformer
Haoxu Wang, Yiheng Jiang, Gang Qiao, Pengteng Shi, Biao Tian
Power Spectral Density Estimation for Acoustic Source Separation Using A Spherical Microphone Array
Liang Tao, Maoshen Jia, Yonggang Hu
Exploring Efficient Directional and Distance Cues for Regional Speech Separation
Yiheng Jiang, Haoxu Wang, Yafeng Chen, Gang Qiao, Biao Tian
Teacher-Free Knowledge Distillation for Improving Short-Utterance Spoken Language Identification
Spandan Dey, Hirak Mondal, Sanjay Kumar Kurmi
LID Models are Actually Accent Classifiers: Implications and Solutions for LID on Accented Speech
Niyati Bafna, Matthew Wiesner
Analyzing the Impact of Accent on English Speech: Acoustic and Articulatory Perspectives
Gowtham Premananth, Vinith Kugathasan, Carol Espy-Wilson
A Study of Speech Embedding Similarities Between Australian Aboriginal and High-Resource Languages
Eliathamby Ambikairajah, Jingyao Wu, Ting Dang, Vidhyasaharan Sethu
An Investigative Study on Recent Sharpness- and Flatness-Based Optimizers for Enhanced Self-Supervised Speaker Verification
Abderrahim Fathan, Jahangir Alam, Xiaolin Zhu
Privacy-Preserving Speaker Verification via End-to-End Secure Representation Learning
Chenguang Hu, Yaqian Hao, Fulin Zhang, Xiaoxue Luo, Yao Shen, Yingying Gao, Chao Deng, Shilei Zhang, Junlan Feng
Novel Loss-Enhanced Universal Adversarial Patches for Sustainable Speaker Privacy
Elvir Karimov, Alexander Varlamov, Danil Ivanov, Dmitrii Korzh, Oleg Rogov
Federated Learning with Feature Space Separation for Speaker Recognition
Ying Meng, Zhihua Fang, Liang He
Audio Deepfake Source Tracing using Multi-Attribute Open-Set Identification and Verification
Pierre Falez, Tony Marteau, Damien Lolive, Arnaud Delhay
Unveiling Audio Deepfake Origins: A Deep Metric learning And Conformer Network Approach With Ensemble Fusion
Ajinkya Kulkarni, Sandipana Dowerah, Tanel Alumäe, Mathew Magimai Doss
Codec-Based Deepfake Source Tracing via Neural Audio Codec Taxonomy
Xuanjun Chen, I-Ming Lin, Lin Zhang, Jiawei Du, Haibin Wu, Hung-yi Lee, Jyh-Shing Roger Jang
TADA: Training-free Attribution and Out-of-Domain Detection of Audio Deepfakes
Adriana Stan, David Combei, Dan Oneata, Horia Cucu
Source Verification for Speech Deepfakes
Viola Negroni, Davide Salvi, Paolo Bestagini, Stefano Tubaro
STOPA: A Dataset of Systematic VariaTion Of DeePfake Audio for Open-Set Source Tracing and Attribution
Anton Firc, Manasi Chhibber, Jagabandhu Mishra, Vishwanath Pratap Singh, Tomi Kinnunen, Kamil Malinka
Synthetic Speech Source Tracing using Metric Learning
Dimitrios Koutsianos, Stavros Zacharopoulos, Yannis Panagakis, Themos Stafylakis
Listen, Analyze, and Adapt to Learn New Attacks: An Exemplar-Free Class Incremental Learning Method for Audio Deepfake Source Tracing
Yang Xiao, Rohan Kumar Das
VIB-based Real Pre-emphasis Audio Deepfake Source Tracing
Thien-Phuc Doan, Kihun Hong, Souhwan Jung
Defending Unauthorized Voice Cloning with Watermark-Aware Codecs
Jiankun Zhao, Lingwei Meng, Chengxi Deng, Helen Meng, Xixin Wu
Open-Set Source Tracing of Audio Deepfake Systems
Nicholas Klein, Hemlata Tak, Elie Khoury
Fine-tune Before Structured Pruning: Towards Compact and Accurate Self-Supervised Models for Speaker Diarization
Jiangyu Han, Federico Landini, Johan Rohdin, Anna Silnova, Mireia Diez, Jan Černocký, Lukáš Burget
Count Your Speakers! Multitask Learning for Multimodal Speaker Diarization
Prabhav Singh, Jesus Villalba, Najim Dehak
End-to-End Diarization utilizing Attractor Deep Clustering
David Palzer, Matthew Maciejewski, Eric Fosler-Lussier
SDBench: A Comprehensive Benchmark Suite for Speaker Diarization
Berkin Durmus, Blaise Munyampirwa, Eduardo Pacheco, Atila Orhon, Andrey Leonov
Enhancing Serialized Output Training for Multi-Talker ASR with Soft Monotonic Alignment and Utterance-level Timestamp
Fengyun Tan, Tao Wei, Kun Zou, Ning Cheng, Shaojun Wang, Jing Xiao
Pretraining Multi-Speaker Identification for Neural Speaker Diarization
Shota Horiguchi, Atsushi Ando, Naohiro Tawara, Marc Delcroix
Parameter-Efficient Fine-Tuning for Low-Resource Text-to-Speech via Cross-Lingual Continual Learning
Ki-Joong Kwon, Jun-Ho So, Sang-Hoon Lee
Accent Normalization Using Self-Supervised Discrete Tokens with Non-Parallel Data
Qibing Bai, Sho Inoue, Shuai Wang, Zhongjie Jiang, Yannan Wang, Haizhou Li
LIST: Language-Independent Speech Token for Multilingual Speech Synthesis with Language Models
Chang Liu, Zhen-Hua Ling, Yu Gu
Developing High-Quality TTS for Punjabi and Urdu: Benchmarking against MMS Models
Fatima Naseem, Maham Sajid, Farah Adeeba, Sahar Rauf, Asad Mustafa, Sarmad Hussain, Faisal Kamiran
Synthesizing Speech with Selected Perceptual Voice Qualities – A Case Study with Creaky Voice
Frederik Rautenberg, Fritz Seebauer, Jana Wiechmann, Michael Kuhlmann, Petra Wagner, Reinhold Haeb-Umbach
Intrasentential English in Swedish TTS: perceived English-accentedness
Christina Tånnander, David House, Jonas Beskow, Jens Edlund
Parameter-Efficient Fine-tuning with Instance-Aware Prompt and Parallel Adapters for Speaker Verification
Shengyu Peng, Wu Guo, Jie Zhang, Yu Guan, Lipeng Dai, Zuoliang Li
Unified Text and Speaker Verification using SSL model for Text-Dependent Speaker Verification
Nathan Griot, Driss Matrouf, Raphael Blouet, Jean-François Bonastre, Ana Mantecon
Towards Robust Overlapping Speech Detection: A Speaker-Aware Progressive Approach Using WavLM
Zhaokai Sun, Li Zhang, Qing Wang, Pan Zhou, Lei Xie
Towards Secure User Authentication for Headphones via In-Ear or In-Earcup Microphones
N Shashaank, Xiao Quan, Andrew Kaluzny, Leonard Varghese, Marko Stamenovic, Chuan-Che Huang
Mimic Blocker: Self-Supervised Adversarial Training for Voice Conversion Defense with Pretrained Feature Extractors
Gwangyeol Yu, Junhyeok Lee, Seoryeong Kim, Jimin Lee, Jehyuk Lee
A Siamese Network-Based Framework for Voice Mimicry Proficiency Assessment Using X-Vector Embeddings
Bhasi K.C., Rajeev Rajan
Towards Source Attribution of Singing Voice Deepfake with Multimodal Foundation Models
Orchid Chetia Phukan, Girish, Mohd Mujtaba Akhtar, Swarup Ranjan Behera, Priyabrata Mallick, Pailla Balakrishna Reddy, Arun Balaji Buduru, Rajesh Sharma
Multimodal Zero-Shot Framework for Deepfake Hate Speech Detection in Low-Resource Languages
Rishabh Ranjan, Likhith Ayinala, Mayank Vatsa, Richa Singh
Joint Target-Speaker ASR and Activity Detection
Chikara Maeda, Muhammad Shakeel, Yui Sudo
DLF-EEND: Dynamic Layer Fusion for End-to-End Speaker Diarization
Wooil Kim, Bongsu Jung
Analysis of Avian Biphonic Vocalization Using Computational Modelling
Noumida A, Rajeev Rajan
Dog2vec: Self-Supervised Pre-Training for Canine Vocal Representation
Xingyuan Li, Kenny Zhu, Mengyue Wu
Improving Bird Classification with Primary Color Additives
Ezhini Rasendiran R, Chandresh Kumar Maurya
Exploring the Power of Empirical Mode Decomposition for Sensing the Sound of Silence: A Pilot Study on Mice Autism Detection via Ultrasonic Vocalisation
Chenhao Wu, Xiangjun Cai, Haojie Zhang, Tianrui Jia, Yilu Deng, Kun Qian, Björn W. Schuller, Yoshiharu Yamamoto, Jiang Liu
Exploring Pre-trained models on Ultrasound Modeling for Mice Autism Detection with Uniform Filter Bank and Attentive Scoring
Yuchen Song, Yucong Zhang, Ming Li
MADUV: The 1st INTERSPEECH Mice Autism Detection via Ultrasound Vocalization Challenge
Zijiang Yang, Meishu Song, Xin Jing, Haojie Zhang, Kun Qian, Bin Hu, Kota Tamada, Toru Takumi, Björn W. Schuller, Yoshiharu Yamamoto
Significance of Time-Frequency preprocessing for automatic Ultrasonic Vocalization classification in Autism Spectrum Disorder model detection
Szymon Szmajdziński, Juliusz Wójtowicz-Kruk, Ivan Ryzhankow, Łukasz Łazarski, Jakub Żak, Władysław Średniawa
Robust Vocal Intensity Prediction: Overcoming Dataset Bias with Pretrained Deep Models
Quentin Le Tellier, Marc Evrard, Albert Rilliard, Jean-Sylvain Liénard
SLASH: Self-Supervised Speech Pitch Estimation Leveraging DSP-derived Absolute Pitch
Ryo Terashima, Yuma Shirahata, Masaya Kawamura
From Speech Science to Language Transparence
Alexander Waibel
PruneSLU: Efficient On-device Spoken Language Understanding through Vocabulary and Structural Pruning
Truong Do, Minh-Phuong Nguyen, Le-Minh Nguyen
Leveraging LLMs for Written to Spoken Style Data Transformation to Enhance Spoken Dialog State Tracking
Haris Gulzar, Monikka Roslianna Busto, Akiko Masaki, Takeharu Eda, Ryo Masumura
Approaching Dialogue State Tracking via Aligning Speech Encoders and LLMs
Šimon Sedláček, Bolaji Yusuf, Ján Švec, Pradyoth Hegde, Santosh Kesiraju, Oldřich Plchot, Jan Černocký
What Do Humans Hear When Interacting? Experiments on Selective Listening for Evaluating ASR of Spoken Dialogue Systems
Kiyotada Mori, Seiya Kawano, Chaoran Liu, Carlos Toshinori Ishi, Angel García Contreras, Koichiro Yoshino
SpeechDialogueFactory: A Framework for Natural Speech Dialogue Generation
Minghan Wang, Ye Bai, Yuxia Wang, Thuy-Trang Vu, Ehsan Shareghi, Gholamreza Haffari
Who, When, and What: Leveraging the "Three Ws" Concept for Emotion Recognition in Conversation
Xiaohan Shi, Xingfeng Li, Tomoki Toda
"Alexa, can you forget me?" Machine Unlearning Benchmark in Spoken Language Understanding
Alkis Koudounas, Claudio Savelli, Flavio Giobergia, Elena Baralis
Evaluating Large Language Models in Data Generation for Low-Resource Scenarios: A Case Study on Question Answering
Ebru Arisoy, Merve Unlu Menevse, Yusufcan Manav, Arzucan Ozgur
I want a horror – comedy – movie: Slips-of-the-Tongue Impact Conversational Recommender System Performance
Maria Teleki, Lingfeng Shi, Chengkai Liu, James Caverlee
Towards a Japanese Full-duplex Spoken Dialogue System
Atsumoto Ohashi, Shinya Iizuka, Jingjing Jiang, Ryuichiro Higashinaka
SAKURA: On the Multi-hop Reasoning of Large Audio-Language Models Based on Speech and Audio Information
Chih-Kai Yang, Neo Ho, Yen-Ting Piao, Hung-yi Lee
Continual Speech Learning with Fused Speech Features
Guitao Wang, Jinming Zhao, Hao Yang, Guilin Qi, Tongtong Wu, Gholamreza Haffari
Uni-VERSA: Versatile Speech Assessment with a Unified Network
Jiatong Shi, Hye-jin Shim, Shinji Watanabe
Evaluating ASR Robustness to Spontaneous Speech Errors: A Study of WhisperX Using a Speech Error Database
John Alderete, Macarious Kin Fung Hui, Aanchan Mohan
Is Synthetic Data Truly Effective for Training Speech Language Models?
Tomoya Mizumoto, Atsushi Kojima, Yusuke Fujita, Lianbo Liu, Yui Sudo
How to Connect Speech Foundation Models and Large Language Models? What Matters and What Does Not
Francesco Verdini, Pierfrancesco Melucci, Stefano Perna, Francesco Cariaggi, Marco Gaido, Sara Papi, Szymon Mazurek, Marek Kasztelnik, Luisa Bentivogli, Sebastien Bratières, Paolo Merialdo, Simone Scardapane
Text Entry for All: Towards Speech-based Multimodal Interaction for Inclusion, Accessibility and the Preservation of the World’s Linguistic Heritage
Julian Zapata, Lara Hanna
Scaling and Enhancing LLM-based AVSR: A Sparse Mixture of Projectors Approach
Umberto Cappellazzo, Minsu Kim, Stavros Petridis, Daniele Falavigna, Alessio Brutti
Cocktail-Party Audio-Visual Speech Recognition
Thai-Binh Nguyen, Ngoc-Quan Pham, Alexander Waibel
Efficient Noise-Robust Hybrid Audiovisual Encoder with Joint Distillation and Pruning for Audiovisual Speech Recognition
Zhengyang Li, Pascal Reichert, Thomas Graave, Patrick Blumenberg, Tim Fingscheidt
Unified Audio-Visual Modeling for Recognizing Which Face Spoke When and What in Multi-Talker Overlapped Speech and Video
Naoki Makishima, Naotaka Kawata, Taiga Yamane, Mana Ihori, Tomohiro Tanaka, Satoshi Suzuki, Shota Orihashi, Ryo Masumura
Leveraging LLM for Stuttering Speech: A Unified Architecture Bridging Recognition and Event Detection
Shangkun Huang, Jing Deng, Jintao Kang, Rong Zheng
Seamless Dysfluent Speech Text Alignment for Disordered Speech Analysis
Zongli Ye, Jiachen Lian, Xuanru Zhou, Jinming Zhang, Haodong Li, Shuhe Li, Chenxu Guo, Anaisha Das, Peter Park, Zoe Ezzes, Jet Vonk, Brittany Morin, Rian Bogley, Lisa Wauters, Zachary Miller, Maria Gorno-Tempini, Gopala Anumanchipalli
Analysis and Evaluation of Synthetic Data Generation in Speech Dysfluency Detection
Jinming Zhang, Xuanru Zhou, Jiachen Lian, Shuhe Li, William Li, Zoe Ezzes, Rian Bogley, Lisa Wauters, Zachary Miller, Jet Vonk, Brittany Morin, Maria Gorno-Tempini, Gopala Anumanchipalli
Fine-tuning Strategies for Automatic Speech Recognition of Low-Resource Speech with Autism Spectrum Disorder
Yeseul Park, Bowon Lee
Identification of Pathological Pronunciation Profiles in ASR Transcription Errors
Margot Masson, Isabelle Ferrané, Julie Mauclair
A simple method for predicting Clinical Scores in Huntington’s Disease by leveraging ASR's uncertainty on spontaneous speech
Hadrien Titeux, Quang Tuan Rémy Nguyen, Andres Gil-Salcedo, Anne-Catherine Bachoud-Levi, Emmanuel Dupoux
Introducing EMOPARKNZ: the Emotional Speech Database from New Zealand English Speakers with Parkinson’s Disease
Itay Ben-Dom, Catherine I. Watson, Clare M. McCann
Revisiting WFST-based Hybrid Japanese Speech Recognition System for Individuals with Organic Speech Disorders
Naoki Hojo, Ryoichi Takashima, Chihiro Sugiyama, Nobukazu Tanaka, Kanji Nohara, Kazunori Nozaki, Tetsuya Takiguchi
Pseudo Labels-based Neural Speech Enhancement for the AVSR Task in the MISP-Meeting Challenge
Longjie Luo, Shenghui Lu, Lin Li, Qingyang Hong
The Multimodal Information Based Speech Processing (MISP) 2025 Challenge: Audio-Visual Diarization and Recognition
Ming Gao, Shilong Wu, Hang Chen, Jun Du, Chin-Hui Lee, Shinji Watanabe, Jingdong Chen, Sabato Marco Siniscalchi, Odette Scharenborg
Cross-attention and Self-attention for Audio-visual Speaker Diarization in MISP-Meeting Challenge
Zhaoyang Li, Haodong Zhou, Longjie Luo, XiaoXiao Li, Yongxin Chen, Lin Li, Qingyang Hong
Multi-Channel Sequence-to-Sequence Neural Diarization: Experimental Results for The MISP 2025 Challenge
Ming Cheng, Fei Su, Cancan Li, Juan Liu, Ming Li
Leveraging Self-Supervised Learning Based Speaker Diarization for MISP 2025 AVSD Challenge
Zeyan Song, Tianchi Sun, Ronghui Hu, Kai Chen, Jing Lu
Overlap-Adaptive Hybrid Speaker Diarization and ASR-Aware Observation Addition for MISP 2025 Challenge
Shangkun Huang, Yuxuan Du, Jingwen Yang, Dejun Zhang, Xupeng Jia, Jing Deng, Jintao Kang, Rong Zheng
Robust Target Speaker Diarization and Separation via Augmented Speaker Embedding Sampling
Md Asif Jalal, Luca Remaggi, Vasileios Moschopoulos, Thanasis Kotsiopoulos, Vandana Rajan, Karthikeyan Saravanan, Anastasis Drosou, Junho Heo, Hyuk Oh, Seokyeong Jeong
Inter-Speaker Relative Cues for Text-Guided Target Speech Extraction
Wang Dai, Archontis Politis, Tuomas Virtanen
REAL-T: Real Conversational Mixtures for Target Speaker Extraction
Shaole Li, Shuai Wang, Jiangyu Han, Ke Zhang, Wupeng Wang, Haizhou Li
Online Audio-Visual Autoregressive Speaker Extraction
Zexu Pan, Wupeng Wang, Shengkui Zhao, Chong Zhang, Kun Zhou, Yukun Ma, Bin Ma
Plug-and-Play Co-Occurring Face Attention for Robust Audio-Visual Speaker Extraction
Zexu Pan, Shengkui Zhao, Tingting Wang, Kun Zhou, Yukun Ma, Chong Zhang, Bin Ma
SardinianVoxes: A Speech Recognition Dataset for the Sardinian Languages
Salvatore Carta, Alessandro Giuliani, Marco Manolo Manca, Mirko Marras, Leonardo Piano
Prompting Whisper for Improved Verbatim Transcription and End-to-end Miscue Detection
Griffin Smith, Dianna Yee, Jennifer King Chen, Leah Findlater
Automated evaluation of children's speech fluency for low-resource languages
Bowen Zhang, Nur Afiqah Abdul Latiff, Justin Kan, Rong Tong, Donny Soh, Xiaoxiao Miao, Ian McLoughlin
Cantonese Punctuation Restoration using LLM Annotated Data
King Yiu Suen, Rudolf Chow, Albert Y.S. Lam
Enhancing Speech Instruction Understanding and Disambiguation in Robotics via Speech Prosody
David Sasu, Benedict Quartey, Kweku Andoh Yamoah, Natalie Schluter
Beyond Traditional Speech Modifications: Utilizing Self-Supervised Features for Enhanced Zero-Shot Children ASR
Abhijit Sinha, Hemant Kumar Kathania, Mikko Kurimo
Spoken Language Modeling with Duration-Penalized Self-Supervised Units
Nicol Visser, Herman Kamper
Towards One-bit ASR: Extremely Low-bit Conformer Quantization Using Co-training and Stochastic Precision
Zhaoqing Li, Haoning Xu, Zengrui Jin, Lingwei Meng, Tianzi Wang, Huimeng Wang, Youjun Chen, Mingyu Cui, Shujie Hu, Xunying Liu
Unfolding A Few Structures for The Many: Memory-Efficient Compression of Conformer and Speech Foundation Models
Zhaoqing Li, Haoning Xu, Xurong Xie, Zengrui Jin, Tianzi Wang, Xunying Liu
Effective and Efficient One-pass Compression of Speech Foundation Models Using Sparsity-aware Self-pinching Gates
Haoning Xu, Zhaoqing Li, Youjun Chen, Huimeng Wang, Guinan Li, Mengzhe Geng, Chengxi Deng, Xunying Liu
Ultra-Low Bit Post-Training Quantization of Large Speech Models via K-Means Clustering and Mixed Precision Allocation
Tianteng Gu, Bei Liu, Haoyu Wang, Yanmin Qian
Context-Driven Dynamic Pruning for Large Speech Foundation Models
Masao Someki, Shikhar Bharadwaj, Atharva Anand Joshi, Chyi-Jiunn Lin, Jinchuan Tian, Jee-weon Jung, Markus Müller, Nathan Susanj, Jing Liu, Shinji Watanabe
Analyzing the Importance of Blank for CTC-Based Knowledge Distillation
Benedikt Hilmes, Nick Rossenbach, Ralf Schlüter
Speech LLMs in Low-Resource Scenarios: Data Volume Requirements and the Impact of Pretraining on High-Resource Languages
Seraphina Fong, Marco Matassoni, Alessio Brutti
A Chinese Heart Failure Status Speech Database with Universal and Personalised Classification
Yue Pan, Liwei Liu, Changxin Li, Xingyao Wang, Yili Xia, Hanyue Zhang, Ming Chu
Heart Rate as a Proxy Measure to Assess Human Confidence in Spoken Speech
Harish Battula, Gauri Deshpande, Yagna Gudipalli, Sachin Patel
Foundation Model Hidden Representations for Heart Rate Estimation from Auscultation
Jingping Nie, Tien Dung Tran, Karan Thakkar, Vasudha Kowtha, Jon Huang, Carlos Avendano, Erdrin Azemi, Vikramjit Mitra
Towards Fusion of Neural Audio Codec-based Representations with Spectral for Heart Murmur Classification via Bandit-based Cross-Attention Mechanism
Orchid Chetia Phukan, Girish, Mohd Mujtaba Akhtar, Swarup Ranjan Behera, Priyabrata Mallick, Santanu Roy, Arun Balaji Buduru, Rajesh Sharma
Perception of Emotional Speech by Individuals with High Borderline Personality Features
Yizhou Chen, Xiyu Wu
Visual features of the oral region in Polish sibilants produced by children with various sibilance patterns
Agata Sage, Zuzanna Miodońska, Michał Kręcichwost, Ewa Kwaśniok, Paweł Badura
Meta-Learning Approaches for Speaker-Dependent Voice Fatigue Models
Roseline Polle, Agnes Norbury, Alexandra Livia Georgescu, Nicholas Cummins, Stefano Goria
Decoding Alzheimer’s: Interpretable Visual and Logical Attention in Picture Description Tasks
Ning Wang, Bingyang Wen, Minghui Wu, Yang Sun, Zongru Shao, Haojie Zhou, K.P. Subbalakshmi
Defending Speech-enabled LLMs Against Adversarial Jailbreak Threats
Antonios Alexos, Raghuveer Peri, Sai Muralidhar Jayanthi, Metehan Cekic, Srikanth Vishnubhotla, Kyu J. Han, Srikanth Ronanki
Mitigating Subgroup Disparities in Multi-Label Speech Emotion Recognition: A Pseudo-Labeling and Unsupervised Learning Approach
Yi-Cheng Lin, Huang-Cheng Chou, Hung-yi Lee
Who Gets the Mic? Investigating Gender Bias in the Speaker Assignment of a Speech-LLM
Dariia Puhach, Amir H. Payberah, Éva Székely
Evaluating Speech Foundation Models for Automatic Speech Recognition in the Low-Resource Kanyen’kéha Language
Mengzhe Geng, Patrick Littell, Aidan Pine, Robbie Jimerson, Gilles Boulianne, Vishwa Gupta, Rolando Coto-Solano, Anna Kazantseva, Marc Tessier, Delaney Lothian, Akwiratékha' Martin, Eric Joanis, Samuel Larkin, Roland Kuhn
Benchmarking and Confidence Evaluation of LALMs For Temporal Reasoning
Debarpan Bhattacharya, Apoorva Kulkarni, Sriram Ganapathy
Teaching Audio-Aware Large Language Models What Does Not Hear: Mitigating Hallucinations through Synthesized Negative Samples
Chun-Yi Kuan, Hung-yi Lee
Speech-IFEval: Evaluating Instruction-Following and Quantifying Catastrophic Forgetting in Speech-Aware Language Models
Ke-Han Lu, Chun-Yi Kuan, Hung-yi Lee
Enhancing Low-Resource Language and Instruction Following Capabilities of Audio Language Models
Potsawee Manakul, Guangzhi Sun, Warit Sirichotedumrong, Kasima Tharnpipitchai, Kunat Pipatanakul
Improving Multilingual Speech Models on ML-SUPERB 2.0: Fine-tuning with Data Augmentation and LID-Aware CTC
Qingzheng Wang, Jiancheng Sun, Yifan Peng, Shinji Watanabe
The ML-SUPERB 2.0 Challenge: Towards Inclusive ASR Benchmarking for All Language Varieties
William Chen, Chutong Meng, Jiatong Shi, Martijn Bartelds, Shih-Heng Wang, Hsiu-Hsuan Wang, Rafael Mosquera, Sara Hincapie, Dan Jurafsky, Antonis Anastasopoulos, Hung-yi Lee, Karen Livescu, Shinji Watanabe
TalTech Systems for the Interspeech 2025 ML-SUPERB 2.0 Challenge
Tanel Alumäe, Artem Fedorchenko
Regularized Federated Learning for Privacy-Preserving Dysarthric and Elderly Speech Recognition
Tao Zhong, Mengzhe Geng, Shujie Hu, Guinan Li, Xunying Liu
Facilitating Personalized TTS for Dysarthric Speakers Using Knowledge Anchoring and Curriculum Learning
Yejin Jeon, Solee Im, Youngjae Kim, Gary Geunbae Lee
DiffDSR: Dysarthric Speech Reconstruction Using Latent Diffusion Model
Xueyuan Chen, Dongchao Yang, Wenxuan Wu, Minglin Wu, Jing Xu, Xixin Wu, Zhiyong Wu, Helen Meng
Improved Intelligibility of Dysarthric Speech using Conditional Flow Matching
Shoutrik Das, Nishant Singh, Arjun Gangwar, S Umesh
Bridging ASR and LLMs for Dysarthric Speech Recognition: Benchmarking Self-Supervised and Generative Approaches
Ahmed Aboeitta, Ahmed Sharshar, Youssef Nafea, Shady Shehata
Towards Inclusive ASR: Investigating Voice Conversion for Dysarthric Speech Recognition in Low-Resource Languages
Chin-Jou Li, Eunjung Yeo, Kwanghee Choi, Paula Andrea Pérez-Toro, Masao Someki, Rohan Kumar Das, Zhengjun Yue, Juan Rafael Orozco-Arroyave, Elmar Nöth, David R. Mortensen
Mitigating Overfitting During Speech Foundation Model Fine-tuning: Applications to Dysarthric Speech Detection
Yan Xiong, Visar Berisha, Julie Liss, Chaitali Chakrabarti
Towards Temporally Explainable Dysarthric Speech Clarity Assessment
Seohyun Park, Chitralekha Gupta, Michelle Kah Yian Kwan, Xinhui Fung, Alexander Wenjun Yip, Suranga Nanayakkara
Code Mix TTS: An Approach to Infer Human Like Speech for Multi-Lingual Input Texts
Vishal Gourav, Phanindra Mankale
Turing's Echo: Investigating Linguistic Sensitivity of Deepfake Voice Detection via Gamification
Binh Nguyen, Thai Le
Unleashing the Inner Monster: Demonstrating High-Fidelity Human to Non-Human Voice Conversion
Namhyun Cho, Sunmin Kim, Minsu Kang, Seolhee Lee, Choonghyeon Lee, Yangsun Lee
Tungnaá In Live Performance: An Implementation Of Interactive Artistic Text-To-Voice
Victor Shepardson, Jonathan Reus, Thor Magnusson
Hear Me Out: Interactive evaluation and bias discovery platform for speech-to-speech conversational AI
Shree Harsha Bokkahalli Satish, Gustav Eje Henter, Éva Székely
Vocal-tract model with two directions: Static design for a dummy head and dynamic design for a speaking machine
Takayuki Arai
Dhvani: A Weakly-supervised Phonemic Error Detection and Personalized Feedback System for Hindi
Arnav Rustagi, Satvik Bajpai, Nimrat Kaur, Siddharth Siddharth
Evaluating Wav2Vec2-Bert for Computer-Assisted Pronunciation Training for isiZulu
Alexandra Fort, Francis Tyers
Towards Adaptable and Intelligible Speech Synthesis in Noisy Environments
Lubos Marcinek, Jonas Beskow, Joakim Gustafsson
Harnessing Text-to-Speech Voice Cloning Models for Improved Audiological Speech Assessment
Lidea Shahidi, Erdem Baha Topbas, Thu Ngan Dang, Tobias Goehring
75-Speaker Annot-16: A benchmark dataset for speech articulatory rt-MRI annotation with articulator contours and phonetic alignment
Xuan Shi, Yubin Zhang, Yijing Lu, Marcus Ma, Tiantian Feng, Asterios Toutios, Haley Hsu, Louis Goldstein, Shrikanth Narayanan
Representing Speech Through Autoregressive Prediction of Cochlear Tokens
Greta Tuckute, Klemen Kotar, Evelina Fedorenko, Daniel Yamins
Reasoning-Based Approach with Chain-of-Thought for Alzheimer’s Detection Using Speech and Large Language Models
Chanwoo Park, Anna Seo Gyeong Choi, Sunghye Cho, Chanwoo Kim
Finding the Human Voice in AI: Insights on the Perception of AI-Voice Clones from Naturalness and Similarity Ratings
Linda Bakkouche, Charles McGhee, Emily Lau, Stephanie Cooper, Xinbing Luo, Madeleine Rees, Kai Alter, Brechtje Post, Julia Schwarz
Prosodically Enhanced Foreign Accent Simulation by Discrete Token-based Resynthesis Only with Native Speech Corpora
Kentaro Onda, Keisuke Imoto, Satoru Fukayama, Daisuke Saito, Nobuaki Minematsu
Weight Factorization and Centralization for Continual Learning in Speech Recognition
Enes Ugan, Ngoc-Quan Pham, Alexander Waibel
Dysfluent WFST: A Framework for Zero-Shot Speech Dysfluency Transcription and Detection
Chenxu Guo, Jiachen Lian, Xuanru Zhou, Jinming Zhang, Shuhe Li, Zongli Ye, Peter Park, Anaisha Das, Zoe Ezzes, Jet Vonk, Brittany Morin, Rian Bogley, Lisa Wauters, Zachary Miller, Maria Gorno-Tempini, Gopala Anumanchipalli
Dysarthric Speech Recognition Using Curriculum Learning and Multi-stream Architecture
I-Ting Hsieh, Chung-Hsien Wu
DYNAC: Dynamic Vocabulary-based Non-Autoregressive Contextualization for Speech Recognition
Yui Sudo, Yosuke Fukumoto, Muhammad Shakeel, Yifan Peng, Chyi-Jiunn Lin, Shinji Watanabe
Beyond Hard Sharing: Efficient Multi-Task Speech-to-Text Modeling with Supervised Mixture of Experts
Hojun Jin, Eunsoo Hong, Ziwon Hyung, Sungjun Lim, Seungjin Lee, Keunseok Cho
OWSM v4: Improving Open Whisper-Style Speech Models via Data Scaling and Cleaning
Yifan Peng, Muhammad Shakeel, Yui Sudo, William Chen, Jinchuan Tian, Chyi-Jiunn Lin, Shinji Watanabe
Bona fide Cross Testing Reveals Weak Spot in Audio Deepfake Detection Systems
Chin Yuen Kwok, Jia Qi Yip, Zhen Qiu, Chi Hung Chi, Kwok Yan Lam
BiCrossMamba-ST: Speech Deepfake Detection with Bidirectional Mamba Spectro-Temporal Cross-Attention
Yassine El Kheir, Tim Polzehl, Sebastian Möller
Few-Shot Speech Deepfake Detection Adaptation with Gaussian Processes
Neta Glazer, David Chernin, Idan Achituve, Sharon Gannot, Ethan Fetaya
Replay Attacks Against Audio Deepfake Detection
Nicolas Müller, Piotr Kawa, Wei-Herng Choong, Adriana Stan, Aditya Tirumala Bukkapatnam, Karla Pizzi, Alexander Wagner, Philip Sperl
Enhancing Audio Deepfake Detection by Improving Representation Similarity of Bonafide Speech
Seung-bin Kim, Hyun-seo Shin, Jungwoo Heo, Chan-yeong Lim, Kyo-Won Koo, Jisoo Son, Sanghyun Hong, Souhwan Jung, Ha-Jin Yu
Generalizable Audio Deepfake Detection via Hierarchical Structure Learning and Feature Whitening in Poincaré sphere
Mingru Yang, Yanmei Gu, Qianhua He, Yanxiong Li, Peirong Zhang, Yongqiang Chen, Zhiming Wang, Huijia Zhu, Jian Liu, Weiqiang Wang
VoiceNet: Multilingual On-Device Phoneme-To-Audio Alignment
Kun Jin, Siva Penke, Srinivasa Algubelli
Nosey: Open-Source Hardware for Acoustic Nasalance
Maya Dewhurst, Jack Collins, Justin J. H. Lo, Roy Alderton, Sam Kirkham
Automatic classification of stop realisation with wav2vec2.0
James Tanner, Morgan Sonderegger, Jane Stuart-Smith, Jeff Mielke, Tyler Kendall
Acquiring Pronunciation from Speech Audio via Multi-task Learning
Siqi Sun, Korin Richmond
Intelligibility of Text-to-Speech Systems for Mathematical Expressions
Sujoy Roychowdhury, Ranjani H.G., Sumit Soman, Nishtha Paul, Subhadip Bandyopadhyay, Siddhanth Iyengar
The State Of TTS: A Case Study with Human Fooling Rates
Praveen Srinivasa Varadhan, Sherry Thomas, Sai Teja M S, Suvrat Bhooshan, Mitesh M. Khapra
Pairwise Evaluation of Accent Similarity in Speech Synthesis
Jinzuomu Zhong, Suyuan Liu, Dan Wells, Korin Richmond
VoiceQualityVC: A Voice Conversion System for Studying the Perceptual Effects of Voice Quality in Speech
Harm Lameris, Joakim Gustafsson, Éva Székely
Towards Frame-level Quality Predictions of Synthetic Speech
Michael Kuhlmann, Fritz Seebauer, Petra Wagner, Reinhold Haeb-Umbach
Perception of Long and Short Vowel Contrast in Te Reo Māori in Clean and Everyday Listening Environments
C. T. Justine Hui, Jenice Kuzhikombil, Isabella Shields, Hiraia Haami-Wells, Catherine I. Watson, Peter J. Keegan
The function of creaky voice in South Korean: A perception study
Patrik Hrabánek, Michaela Watkins, Silke Hamann
Talker Normalization in Chinese Bilinguals: A Comparative Study
Mingxi Lu, Ran Tao, Yujia Tian
Coping with segmental–prosodic incongruity in spoken word recognition in Japanese
Terumichi Ariga
What the Filler? Both ASR Systems and Humans Struggle More With Other Kinds of Disfluencies Than With Filler Particles
Saskia Wepner, Lucas Eckert, Gernot Kubin, Barbara Schuppler
Non-intrusive Speech Quality Assessment with Diffusion Models Trained on Clean Speech
Danilo de Oliveira, Julius Richter, Jean-Marie Lemercier, Simon Welker, Timo Gerkmann
SQ-AST: A Transformer-Based Model for Speech Quality Prediction
Wafaa Wardah, Robert P. Spang, Vincent Barriac, Jan Reimes, Anna Llagostera, Jens Berger, Sebastian Möller
AttentiveMOS: A Lightweight Attention-Only Model for Speech Quality Prediction
Imran E Kibria, Donald S. Williamson
Universal Preference-Score-based Pairwise Speech Quality Assessment
Yu-Fei Shi, Yang Ai, Zhen-Hua Ling
FUSE-MOS: Fusion of Speech Embeddings for MOS Prediction with Uncertainty Quantification
Enjamamul Hoq, Nikhil Gupta, Danielle Omondi, Ifeoma Nwogu
SHEET: A Multi-purpose Open-source Speech Human Evaluation Estimation Toolkit
Wen-Chin Huang, Erica Cooper, Tomoki Toda
Investigating continuous autoregressive generative speech enhancement
Haici Yang, Gordon Wichern, Ryo Aihara, Yoshiki Masuyama, Sameer Khurana, François G. Germain, Jonathan Le Roux
Dynamic Layer Gating for Speech Enhancement
Venkatesh Parvathala, K. Sri Rama Murty
Model as Loss: A Self-Consistent Training Paradigm
Saisamarth Rajesh Phaye, Milos Cernak, Andrew Harper
Test-Time Training for Speech Enhancement
Avishkar Behera, Riya Ann Easow, Venkatesh Parvathala, K. Sri Rama Murty
Few-step Adversarial Schrödinger Bridge for Generative Speech Enhancement
Seungu Han, Sungho Lee, Juheon Lee, Kyogu Lee
Exploiting Bispectral Features for Single-Channel Speech Enhancement
Venkatesh Parvathala, Ramesh Gundluru, Sreekanth Sankala, K. Sri Rama Murty
Automatic Dialectal Transcription: An Evaluation on Finnish and Norwegian
Olli Kuparinen
Can ASR generate valid measures of child reading fluency?
Wieke Harmsen, Roeland van Hout, Catia Cucchiarini, Helmer Strik
SGED-Probe: Probing E2E ASR decoder and aligner for spoken grammar error detection under three speaking practice conditions
Chowdam Venkata Thirumala Kumar, Chiranjeevi Yarra
Evaluating Logit-Based GOP Scores for Mispronunciation Detection
Aditya Kamlesh Parikh, Cristian Tejedor-Garcia, Catia Cucchiarini, Helmer Strik
Towards a Unified Benchmark for Arabic Pronunciation Assessment: Qur’anic Recitation as Case Study
Yassine El Kheir, Omnia Ibrahim, Amit Meghanani, Nada Almarwani, Hawau Toyin, Sadeen Alharbi, Modar Alfadly, Lamya Alkanhal, Ibrahim Selim, Shehab Elbatal, Salima Mdhaffar, Thomas Hain, Yasser Hifny, Mostafa Shahin, Ahmed Ali
OMPAL: Bridging Speech and Learning with an Open-Source Mandarin Pronunciation Assessment Corpus for Global Learners
Wen-Wei Hsieh, Hao-Wei Chi, Kuan-Chen Wang, Ping-Cheng Yeh, Te-hsin Liu, Chen-Yu Chiang
A Perception-Based L2 Speech Intelligibility Indicator: Leveraging a Rater’s Shadowing and Sequence-to-sequence Voice Conversion
Haopeng Geng, Daisuke Saito, Nobuaki Minematsu
Multimodal and Multitask Learning for Predicting Multiple Scores in L2 English Speech
Sehyun Oh, Sunhee Kim, Minhwa Chung
Enhancing Generalization of Speech Large Language Models with Multi-Task Behavior Imitation and Speech-Text Interleaving
Jingran Xie, Xiang Li, Hui Wang, Yue Yu, Yang Xiang, Xixin Wu, Zhiyong Wu
Mispronunciation Detection Without L2 Pronunciation Dataset in Low-Resource Setting: A Case Study in Finland Swedish
Nhan Phan, Mikko Kuronen, Maria Kautonen, Riikka Ullakonoja, Anna von Zansen, Yaroslav Getman, Ekaterina Voskoboinik, Tamás Grósz, Mikko Kurimo
RapFlow-TTS: Rapid and High-Fidelity Text-to-Speech with Improved Consistency Flow Matching
Hyun Joon Park, Jeongmin Liu, Jin Sob Kim, Jeong Yeol Yang, Sung Won Han, Eunwoo Song
Accelerating Flow-Matching-Based Text-to-Speech via Empirically Pruned Step Sampling
Qixi Zheng, Yushen Chen, Zhikang Niu, Ziyang Ma, Xiaofei Wang, Kai Yu, Xie Chen
Differentiable Reward Optimization for LLM based TTS system
Changfeng Gao, Zhihao Du, Shiliang Zhang
Long-Context Speech Synthesis with Context-Aware Memory
Zhipeng Li, Xiaofen Xing, Jingyuan Xing, Hangrui Hu, Heng Lu, Xiangmin Xu
Monotonic Attention for Robust Text-to-Speech Synthesis in Large Language Model Frameworks
Yike Zhang, Yiming Li, Jie Chen, Qinghua Wu, Songjun Cao, Long Ma
Improving Noise Robustness of LLM-based Zero-shot TTS via Discrete Acoustic Token Denoising
Ye-Xin Lu, Hui-Peng Du, Fei Liu, Yang Ai, Zhen-Hua Ling
Bridging the Training–Inference Gap in TTS: Training Strategies for Robust Generative Postprocessing for Low-Resource Speakers
Frank Zalkow, Paolo Sani, Kishor Kayyar Lakshminarayana, Emanuël A. P. Habets, Nicola Pia, Christian Dittmar
Robust Neural Codec Language Modeling with Phoneme Position Prediction for Zero-Shot TTS
Chunhui Lu, Xue Wen, Liming Song, Junkwang Oh
SepVAC: Multitask Learning of Speaker Separation, Speaker Localization, Microphone Array Localization, and Room Acoustic Parameter Estimation in Various Acoustic Conditions
Roland Hartanto, Sakriani Sakti, Koichi Shinoda
TA-RIR: Topology-Aware Neural Modeling of Acoustic Propagation for Room Impulse Response Synthesis
Junhui Zhao, Hang Chen, Qing Wang, Jun Du, Yanhui Tu, Feng Ma
Spatially Weighted Contrastive Learning for Robust Sound Source Localization
Hyun-Soo Kim, Da-Hee Yang, Joon-Hyuk Chang
Efficient and Microphone-Fault-Tolerant 3D Sound Source Localization
Yiyuan Yang, Shitong Xu, Niki Trigoni, Andrew Markham
Joint Reference Microphone Selection and Filter Order Determination in Multi-channel Active Noise Control
De Hu, Shuyao Liu, Yanrong He
Direct-path Relative Harmonic Coefficients Detection for Multi-source Direction-of-Arrival Estimation in Reverberant Environments
Liang Tao, Maoshen Jia, Yonggang Hu
D-GAT: Dual Graph Attention Network for Global HRTF Interpolation
Junsheng Hu, Shaojie Li, Qintuya Si, De Hu
Deep learning based spatial aliasing reduction in beamforming for audio capture
Mateusz Guzik, Giulio Cengarle, Daniel Arteaga
SonarGuard2: Ultrasonic Face Liveness Detection Based on Adaptive Doppler Effect Feature Extraction
Xiaoming Zhang, Ke-Yue Zhang, Taiping Yao, Songjun Cao, Shouhong Ding, Long Ma
Grapheme-Coherent Phonemic and Prosodic Annotation of Speech by Implicit and Explicit Grapheme Conditioning
Hien Ohnaka, Yuma Shirahata, Byeongseon Park, Ryuichi Yamamoto
Non-Standard Accent TTS Support via Large Multi-Accent Frontend Pronunciation Knowledge Transfer
Noe Berger, Siqi Sun, Korin Richmond
Speech-guided Grapheme-to-Phoneme Conversion for Cantonese Text-to-Speech
Timothy Shin Heng Mak, King Yiu Suen, Albert Y.S. Lam
Transcript-Prompted Whisper with Dictionary-Enhanced Decoding for Japanese Speech Annotation
Rui Hu, Xiaolong Lin, Jiawang Liu, Shixi Huang, Zhenpeng Zhan
Enabling the replicability of speech synthesis perceptual evaluations
Sébastien Le Maguer, Gwénolé Lecorvé, Damien Lolive, Naomi Harte, Juraj Šimko
When The MOS Predictor Asks For Training Annotation In Cross Lingual/Domain Adaptation
Natacha Miniconi, Meysam Shamsi, Anthony Larcher
Assessment of the synthetic quality and controllability of laughing onset in speech-laugh synthesis
Ryo Setoguchi, Yoshiko Arimoto
Running Conventional Automatic Speech Recognition on Memristor Hardware: A Simulated Approach
Nick Rossenbach, Benedikt Hilmes, Leon Brackmann, Moritz Gunz, Ralf Schlüter
Word Level Timestamp Generation for Automatic Speech Recognition and Translation
Ke Hu, Krishna Puvvada, Elena Rastorgueva, Zhehuai Chen, He Huang, Shuoyang Ding, Kunal Dhawan, Hainan Xu, Jagadeesh Balam, Boris Ginsburg
Directional Speech Recognition with Full-Duplex Capability
Ju Lin, Yiteng Huang, Ming Sun, Frank Seide, Florian Metze
CMT-LLM: Contextual Multi-Talker ASR Utilizing Large Language Models
Jiajun He, Naoki Sawada, Koichi Miyazaki, Tomoki Toda
Selective Invocation for Multilingual ASR: A Cost-effective Approach Adapting to Speech Recognition Difficulty
Hongfei Xue, Yufeng Tang, Jun Zhang, Xuelong Geng, Lei Xie
Improving Audio Classification by Transitioning from Zero- to Few-Shot
James Taylor, Wolfgang Mack
Zero-Shot Learning for Acoustic Event Classification Using an Attribute Vector and Conditional GAN
Kohei Uehara, Ryoichi Takashima, Tetsuya Takiguchi
Leveraging Multi-Level Features of ATST with Conformer-Based Dual-Branch Network for Sound Event Detection
Lipeng Dai, Qing Wang, Jie Zhang, Shengyu Peng, Yu Guan, Wu Guo
Leveraging Unlabeled Audio for Audio-Text Contrastive Learning via Audio-Composed Text Features
Tatsuya Komatsu, Hokuto Munakata, Yuchi Ishikawa
Language-Guided Contrastive Audio-Visual Masked Autoencoder with Automatically Generated Audio-Visual-Text Triplets from Videos
Yuchi Ishikawa, Shota Nakada, Hokuto Munakata, Kazuhiro Saito, Tatsuya Komatsu, Yoshimitsu Aoki
AC/DC: LLM-based Audio Comprehension via Dialogue Continuation
Yusuke Fujita, Tomoya Mizumoto, Atsushi Kojima, Lianbo Liu, Yui Sudo
Anomalous Sound Detection Based on Feature Fusion and Dual-path Non-linear Independent Components Estimation
Yawei Wang, Qiaoling Zhang, Yi Zhang, Junyao Hu
An Effective Anomalous Sound Detection Method Based on Global and Local Attribute Mining
Nan Jiang, Yan Song, Qing Gu, Haoyu Song, Lirong Dai, Ian McLoughlin
Acoustic scattering AI for non-invasive object classifications: A case study on hair assessment
Long-Vu Hoang, Tuan Nguyen, Huy Dat Tran
Language-Agnostic Speech Tokenizer for Spoken Term Detection with Efficient Retrieval
Anup Singh, Kris Demuynck, Vipul Arora
H-QuEST: Accelerating Query-by-Example Spoken Term Detection with Hierarchical Indexing
Akanksha Singh, Yi-Ping Phoebe Chen, Vipul Arora
Vela: Scalable Embeddings with Voice Large Language Models for Multimodal Retrieval
Ruofan Hu, Yan Xia, Minjie Hong, Jieming Zhu, Bo Chen, Xiaoda Yang, Minghui Fang, Tao Jin
Adversarial Deep Metric Learning for Cross-Modal Audio-Text Alignment in Open-Vocabulary Keyword Spotting
Youngmoon Jung, Yong-Hyeok Lee, Myunghun Jung, Jaeyoung Roh, Chang Woo Han, Hoon-Young Cho
GTA: Towards Generative Text-To-Audio Retrieval via Multi-Scale Tokenizer
Minghui Fang, Shengpeng Ji, Jialong Zuo, Xize Cheng, Wenrui Liu, Xiaoda Yang, Ruofan Hu, Jieming Zhu, Zhou Zhao
Enhancing Retrieval-Augmented Audio Captioning with Generation-Assisted Multimodal Querying and Progressive Learning
Changin Choi, Sungjun Lim, Wonjong Rhee
On Retrieval of Long Audios with Complex Text Queries
Ruochu Yang, Milind Rao, Harshavardhan Sundar, Anirudh Raju, Aparna Khare, Srinath Tankasala, Di He, Venkatesh Ravichandran
SIDC-KWS: Efficient Spiking Inception-Dilated Conformer with Self-Attention for Keyword Spotting
Jin Gyo Lim, Seong Eun Kim
Multichannel Keyword Spotting for Noisy Conditions
Dzmitry Saladukha, Ivan Koriabkin, Kanstantsin Artsiom, Aliaksei Rak, Nikita Ryzhikov
LLM-Synth4KWS: Scalable Automatic Generation and Synthesis of Confusable Data for Custom Keyword Spotting
Pai Zhu, Quan Wang, Dhruuv Agarwal, Kurt Partridge
GraphemeAug: A Systematic Approach to Synthesized Hard Negative Keyword Spotting Examples
Harry Zhang, Kurt Partridge, Pai Zhu, Neng Chen, Hyun Jin Park, Dhruuv Agarwal, Quan Wang
SpokenNativQA: Multilingual Everyday Spoken Queries for LLMs
Firoj Alam, Md Arid Hasan, Shammur Absar Chowdhury
CAMER: Contribution-Aware Multimodal Emotion Recognition
Sun-Kyung Lee, Jong-Hwan Kim
GIA-MIC: Multimodal Emotion Recognition with Gated Interactive Attention and Modality-Invariant Learning Constraints
Jiajun He, Jinyi Mi, Tomoki Toda
SNIFR: Boosting Fine-Grained Child Harmful Content Detection Through Audio-Visual Alignment with Cascaded Cross-Transformer
Orchid Chetia Phukan, Mohd Mujtaba Akhtar, Girish, Swarup Ranjan Behera, Abu Osama Siddiqui, Sarthak Jain, Priyabrata Mallick, Jaya Sai Kiran Patibandla, Pailla Balakrishna Reddy, Arun Balaji Buduru, Rajesh Sharma
CNVSRC 2024: The Second Chinese Continuous Visual Speech Recognition Challenge
Zehua Liu, Xiaolou Li, Chen Chen, Lantian Li, Dong Wang
PAEFF: Precise Alignment and Enhanced Gated Feature Fusion for Face-Voice Association
Abdul Hannan, Muhammad Arslan Manzoor, Shah Nawaz, Muhammad Irzam Liaqat, Markus Schedl, Mubashir Noman
Efficient and Direct Duplex Modeling for Speech-to-Speech Language Model
Ke Hu, Ehsan Hosseini-Asl, Chen Chen, Edresson Casanova, Subhankar Ghosh, Piotr Żelasko, Zhehuai Chen, Jason Li, Jagadeesh Balam, Boris Ginsburg
U-SAM: An Audio Language Model for Unified Speech, Audio, and Music Understanding
Ziqian Wang, Xianjun Xia, Xinfa Zhu, Lei Xie
Enhanced Hybrid Transducer and Attention Encoder Decoder with Text Data
Yun Tang, Eesung Kim, Vijendra Raj Apsingekar
The role of audio-visual integration in the time course of phonetic encoding in self-supervised speech models
Yi Wang, Oli Danyi Liu, Peter Bell
Voice Reconstruction through Large-Scale TTS Models: Comparing Zero-Shot and Fine-tuning Approaches to Personalise TTS in Assistive Communication
Éva Székely, Péter Mihajlik, Máté Soma Kádár, László Tóth
Objective and Subjective Evaluation of Diffusion-Based Speech Enhancement for Dysarthric Speech
Dimme de Groot, Tanvina Patel, Devendra Kayande, Odette Scharenborg, Zhengjun Yue
Data Augmentation using Speech Synthesis for Speaker-Independent Dysarthria Severity Classification
Minseop Kim, Minsu Han, Seokyoung Hong, Myoung-wan Koo
Fairness in Dysarthric Speech Synthesis: Understanding Intrinsic Bias in Dysarthric Speech Cloning using F5-TTS
Anuprabha M, Krishna Gurugubelli, Anil Kumar Vuppala
Synthetic Dysarthric Speech: A Supplement, Not a Substitute for Authentic Data in Dysarthric Speech Recognition
Jingting Li, Keyi Feng, Xinran Zhao, Yan Wang, Su-Jing Wang
Unsupervised Rhythm and Voice Conversion to Improve ASR on Dysarthric Speech
Karl El Hajal, Enno Hermann, Sevada Hovsepyan, Mathew Magimai Doss
Audio-Based Classification and Geographic Regression of Austrian Dialects
Lorenz Gutscher, Michael Pucher
Jointly Improving Dialect Identification and ASR in Indian Languages using Multimodal Feature Fusion
Saurabh Kumar, Amartyaveer, Prasanta Kumar Ghosh
ADI-20: Arabic Dialect Identification dataset and models
Haroun Elleuch, Salima Mdhaffar, Yannick Estève, Fethi Bougares
Improving Low-Resource Dialect Classification Using Retrieval-based Voice Conversion
Lea Fischbach, Akbar Karimi, Caroline Kleen, Alfred Lameli, Lucie Flek
Effects of Prosodic Information on Dialect Classification Using Whisper Features
Phoebe Parsons, Heming Strømholt Bremnes, Knut Kvale, Torbjørn Svendsen, Giampiero Salvi
Voice Conversion Improves Cross-Domain Robustness for Spoken Arabic Dialect Identification
Badr M. Abdullah, Matthew Baas, Bernd Möbius, Dietrich Klakow
Band-Split Self-supervised Mamba for Infant-centered Audio Analysis
Xulin Fan, Jialu Li, Mark Hasegawa-Johnson, Nancy L. McElwain
Subtyping Speech Errors in Childhood Speech Sound Disorders with Acoustic-to-Articulatory Speech Inversion
Nina R Benway, Saba Tabatabaee, Benjamin Munson, Jonathan Preston, Carol Espy-Wilson
PERCEPT-US: A Multimodal American English Child Speech Corpus Specialized for Articulatory Feedback
Amanda Eads, Heather Kabakoff, Nina Benway, Elaine Hitchcock, Jonathan Preston, Tara McAllister
Children's Voice Privacy: First Steps and Emerging Challenges
Ajinkya Kulkarni, Francisco Teixeira, Enno Hermann, Thomas Rolland, Isabel Trancoso, Mathew Magimai Doss
FT-Boosted SV: Towards Noise Robust Speaker Verification for English Speaking Classroom Environments
Saba Tabatabaee, Jing Liu, Carol Espy-Wilson
Examining Test-Time Adaptation for Personalized Child Speech Recognition
Zhonghao Shi, Xuan Shi, Anfeng Xu, Tiantian Feng, Harshvardhan Srivastava, Shrikanth Narayanan, Maja Mataric
Employing self-supervised learning models for cross-linguistic child speech maturity classification
Theo Zhang, Madurya Suresh, Anne Warlaumont, Kasia Hitczenko, Alejandrina Cristia, Margaret Cychosz
On Enhancing the Performance of Children's ASR Task in Limited Data Scenario
Ankita Ankita, Shambhavi Shambhavi, Syed Shahnawazuddin
Egocentric Speaker Classification in Child-Adult Dyadic Interactions: From Sensing to Computational Modeling
Tiantian Feng, Anfeng Xu, Xuan Shi, Somer Bishop, Shrikanth Narayanan
Large Language Models based ASR Error Correction for Child Conversations
Anfeng Xu, Tiantian Feng, So Hyun Kim, Somer Bishop, Catherine Lord, Shrikanth Narayanan
Challenges in Automated Processing of Speech from Child Wearables: The Case of Voice Type Classifier
Tarek Kunze, Marianne Métais, Hadrien Titeux, Lucas Elbert, Joseph Coffey, Emmanuel Dupoux, Alejandrina Cristia, Marvin Lavechin
Improving Child Speech Recognition and Reading Mistake Detection by Using Prompts
Lingyun Gao, Cristian Tejedor-Garcia, Catia Cucchiarini, Helmer Strik
Improving Automatic Speech Recognition for Children's Reading Assessment with Disfluency-aware Language Models
Jazmín Vidal, Luciana Ferrer, Juan Esteban Kamienkowski, Pablo Riera
Oral Reading Errors by Grade 3 Children in Indian Schools: A Hindi-English Perspective
Sneha Raman, Preeti Rao
Grammatical Error Detection on Spontaneous Children's Speech Using Iterative Pseudo Labeling
Christopher Gebauer, Lars Rumberg, Lars Köhn, Hanna Ehlert, Edith Beaulac, Jörn Ostermann
Why is children's ASR so difficult? Analyzing children's phonological error patterns using SSL-based phoneme recognizers
Koharu Horii, Naohiro Tawara, Atsunori Ogawa, Shoko Araki
Automatic detection of speech sound disorders in German-speaking children: augmenting the data with typically developed speech
Darline Monika Marx, Marco Matassoni, Alessio Brutti
Continuous Learning for Children's ASR: Overcoming Catastrophic Forgetting with Elastic Weight Consolidation and Synaptic Intelligence
Edem Ahadzi, Vishwanath Pratap Singh, Tomi Kinnunen, Ville Hautamaki
Exploring Shared-Weight Mechanisms in Transformer and Conformer Architectures for Automatic Speech Recognition
Thomas Rolland, Alberto Abad
Advancing Pediatric ASR: The Role of Voice Generation in Disordered Speech
Karen Rosero, Ali N Salman, Shreeram Chandra, Berrak Sisman, Cortney Van’t Slot, Alex Kane, Rami R Hallac, Carlos Busso
CHSER: A Dataset and Case Study on Generative Speech Error Correction for Child ASR
Natarajan Balaji Shankar, Zilai Wang, Kaiyuan Zhang, Mohan Shi, Abeer Alwan
Causal Structure Discovery for Error Diagnostics of Children's ASR
Vishwanath Pratap Singh, Md Sahidullah, Tomi Kinnunen
Brain-tuned Speech Models Better Reflect Speech Processing Stages in the Brain
Omer Moussa, Mariya Toneva
Enhancing Syllabic Recognition via Speech-EEG Phase Analysis and Non-Activity State Modeling
Rini Sharon, Hema A. Murthy
Functional Connectivity and Hilbert-Based Features for Covert Speech EEG Variability Analysis and Classification
Saravanakumar Duraisamy, Maurice Rekrut, Luis A. Leiva
Neuro2Semantic: A Transfer Learning Framework for Semantic Reconstruction of Continuous Language from Human Intracranial EEG
Siavash Shams, Richard Antonello, Gavin Mischler, Stephan Bickel, Ashesh Mehta, Nima Mesgarani
Selective Auditory Attention Decoding in Naturalistic Conversations Using EEG-Based Speech Envelope Tracking in Multi-Speaker Environments
Gabriel Ivucic, Saurav Pahuja, Dashanka Da Silva, Tanja Schultz
MiSTR: Multi-Modal iEEG-to-Speech Synthesis with Transformer-Based Prosody Prediction and Neural Phase Reconstruction
Mohammed Al-Radhi, Géza Németh, Branislav Gerazov
Probing Prosodic Differences Between Two Regional Varieties of Brazilian Portuguese
Gustavo Silveira, Aviad Albert, Martine Grice
Data-driven approaches to pitch modelling in two Mexican Spanish ethnolects: K-means Clustering & GAMMs
Gilly Marchini, Jeremy Steffman
Tracking /r/ Deletion: Forced Alignment of Pronunciation Variants and Sociophonetic Insights into Post-Obstruent Final /r/ in French
Anisia Popescu, Lori Lamel, Marc Evrard, Ioana Vasilescu
Agent-based modelling, sound change, and metaphony in Southern Italian varieties of Italo-Romance
Lilian von Bressensdorf, Pia Greca, Jonathan Harrington
Modeling Vowel System Typology Using Iterated Confusion Minimization
John McGahay
Investigating Glottal Stop Coda Loss During Sound Change of Checked Syllables Based on Speech-EGG Voice Offset Alignment
Bingliang Zhao, Xiyu Wu
FlowTSE: Target Speaker Extraction with Flow Matching
Aviv Navon, Aviv Shamsian, Yael Segal-Feldman, Neta Glazer, Gil Hetz, Joseph Keshet
MTSE: Multi-Target Speaker Extraction for Conversation Scenarios
Thomas Serre, Mathieu Fontaine, Eric Benhaim, Slim Essid
Location-Aware Target Speaker Extraction for Hearing Aids
Daniel-José Alcala Padilla, Nils L. Westhausen, Swati Vivekananthan, Bernd T. Meyer
ClearerVoice-Studio: Bridging Advanced Speech Processing Research and Practical Deployment
Shengkui Zhao, Zexu Pan, Bin Ma
Online AV-CrossNet: a Causal and Efficient Audiovisual System for Speech Enhancement and Target Speaker Extraction
Cheng Yu, Vahid Ahmadi Kalkhorani, Buye Xu, DeLiang Wang
Steering Deep Non-Linear Spatially Selective Filters for Weakly Guided Extraction of Moving Speakers in Dynamic Scenarios
Jakob Kienegger, Timo Gerkmann
RA-CLAP: Relation-Augmented Emotional Speaking Style Contrastive Language-Audio Pretraining For Speech Retrieval
Haoqin Sun, Jingguang Tian, Jiaming Zhou, Hui Wang, Jiabei He, Shiwan Zhao, Xiangyu Kong, Desheng Hu, Xinkang Xu, Xinhui Hu, Yong Qin
EmotionRankCLAP: Bridging Natural Language Speaking Styles and Ordinal Speech Emotion via Rank-N-Contrast
Shreeram Suresh Chandra, Lucas Goncalves, Junchen Lu, Carlos Busso, Berrak Sisman
Modality-Agnostic Multimodal Emotion Recognition using a Contrastive Masked Autoencoder
Georgios Chochlakis, Turab Iqbal, Woo Hyun Kang, Zhaocheng Huang
Multi-Modal Multi-Task Affective States Recognition Based on Label Encoder Fusion
Maxim Markitantov, Elena Ryumina, Heysem Kaya, Alexey Karpov
Triadic Multi-party Voice Activity Projection for Turn-taking in Spoken Dialogue Systems
Mikey Elmers, Koji Inoue, Divesh Lala, Tatsuya Kawahara
Continuous prediction of backchannel timing for human-robot interaction
Michael Paierl, Martin Hagmüller, Barbara Schuppler
Impact of Background Noise on Turn-Taking Dynamics in Triadic Conversations
Valeska Slomianka, Tobias May, Torsten Dau
Multimodal Dynamics of Hand Gestures and Pauses in Multiparty Interactions
Delphine Charuau, Naomi Harte
TinyClick: Single-Turn Agent for Empowering GUI Automation
Pawel Pawlowski, Krystian Zawistowski, Wojciech Lapacz, Adam Wiacek, Marcin Skorupa, Sebastien Postansque, Jakub Hoscilowicz
Improving User Impression of Spoken Dialogue Systems by Controlling Para-linguistic Expression Based on Intimacy
Shoki Kawanishi, Akinori Ito, Yuya Chiba, Takashi Nose
Dialogue Response Prefetching Based on Semantic Similarity and Prediction Confidence of Language Model
Kiyotada Mori, Seiya Kawano, Angel García Contreras, Koichiro Yoshino
Can Multimodal Foundation Models Help Analyze Child-Inclusive Autism Diagnostic Videos?
Aditya Kommineni, Digbalay Bose, Tiantian Feng, So Hyun Kim, Helen Tager-Flusberg, Somer Bishop, Catherine Lord, Sudarsana Kadiri, Shrikanth Narayanan
A Cascaded Multimodal Framework for Automatic Social Communication Severity Assessment in Children with Autism Spectrum Disorder
Jihyun Mun, Sunhee Kim, Minhwa Chung
Accessible Real-time Eye-gaze Tracking for Neurocognitive Health Assessment: A Multimodal Web-based Approach
Daniel Tisdale, Jackson Liscombe, David Paulter, Michael Neumann, Vikram Ramanarayanan
Multimodal Biomarkers for Schizophrenia: Towards Individual Symptom Severity Estimation
Gowtham Premananth, Philip Resnik, Sonia Bansal, Deanna L. Kelly, Carol Espy-Wilson
Fact-Controlled Diagnosis of Hallucinations in Medical Text Summarization
Suhas BN, Han-Chin Shing, Lei Xu, Mitch Strong, Jon Burnsky, Jessica Ofor, Jordan R. Mason, Susan Chen, Sundararajan Srinivasan, Chaitanya Shivade, Jack Moriarty, Joseph Paul Cohen
Evaluating Automatic Speech Recognition Pipelines for Mandarin-English Bilingual Child Language Assessment in Telehealth
Hongchen Wu, Yao Du, Zirong Li, Yixin Gu, Disha Thotappala Jayaprakash, Li Sheng
Enhancing Lyrics Transcription on Music Mixtures with Consistency Loss
Jiawen Huang, Felipe Sousa, Emir Demirel, Emmanouil Benetos, Igor Gadelha
Tonality-Based Accompaniment-Guided Automatic Singing Evaluation
Pei-Chin Hsieh, Yih-Liang Shen, Ngoc-Son Tran, Tai-Shih Chi
Investigating the Reasonable Effectiveness of Speaker Pre-Trained Models and their Synergistic Power for SingMOS Prediction
Orchid Chetia Phukan, Girish, Mohd Mujtaba Akhtar, Swarup Ranjan Behera, Pailla Balakrishna Reddy, Arun Balaji Buduru, Rajesh Sharma
Focal Modulation Network: A Novel Solution for Polyphonic Music Instrument Recognition without Attention and Aggregation Strategy
Lekshmi C R, Rajeev Rajan
A Joint Network for Singing Melody Extraction from Polyphonic Music with Attention Aggregation and Self-Consistency Training
Jiabo Jing, Ying Hu, Hao Huang, Liang He, Zhijian Ou
Position also matters! Separating Same Instruments in String Quartet using Timbral and Positional Cues
Yuetonghui Xu, Yiwen Wang, Xihong Wu, Xiaobing Li, Feng Yu
WhisperMSS: A Two-Stage Framework for Mandarin Singing Transcription and Segmentation Using Pretrained Models
Ruoxuan Liang, Xiangjian Zeng, Zhen Liu, Qingqiang Wu, RuiChen Zhang, Le Ren
Low Complex IIR Adaptive Hear-Through Ambient Filtering for Overcoming Practical Constraints in Earbuds
Rishabh Gupta, MLNS Karthik, Yughendaran P
Sub-band based Adaptive IIR Algorithm with Biquad Filter Stability Constraints for Feedforward Hear-Through Equalization
Rishabh Gupta, MLNS Karthik, Chelamkuri Omsrinath
Discrete Audio Representations for Automated Audio Captioning
Jingguang Tian, Haoqin Sun, Xinhui Hu, Xinkang Xu
CLAP-ART: Automated Audio Captioning with Semantic-rich Audio Representation Tokenizer
Daiki Takeuchi, Binh Thien Nguyen, Masahiro Yasuda, Yasunori Ohishi, Daisuke Niizumi, Noboru Harada
Temp4Cap: Temporally-aligned Automated Audio Captioning
Ho-Young Choi, Jae-Heung Cho, Pil Moo Byun, Won-Gook Choi, Joon-Hyuk Chang
Optimizing CLAP Reward with LLM Feedback for Semantically Aligned and Diverse Automated Audio Captioning
Seyun Ahn, Pil Moo Byun, Won-Gook Choi, Joon-Hyuk Chang
Bridging Audio and Vision: Zero-Shot Audiovisual Segmentation by Connecting Pretrained Models
Seung-jae Lee, Paul Hongsuck Seo
DiffStereo: End-to-End Mono-to-Stereo Audio Generation with Diffusion Transformer
Suqi Zhang, Zheqi Dai, Yongyi Zang, Yin Cao, Qiuqiang Kong
RELATE: Subjective evaluation dataset for automatic evaluation of relevance between text and audio
Yusuke Kanamori, Yuki Okamoto, Taisei Takano, Shinnosuke Takamichi, Yuki Saito, Hiroshi Saruwatari
Crowdsourcing MUSHRA Tests in the Age of Generative Speech Technologies: A Comparative Analysis of Subjective and Objective Testing Methods
Laura Lechler, Chamran Moradi, Ivana Balic
SMARTMOS: Modeling Subjective Audio Quality Evaluation for Real-Time Applications
Sivakumar Balasubramanian, Jose Antonio Jimenez Amador, Kaustubh Kalgaonkar, King Wei Hor, Sriram Srinivasan
Effect of Loudspeaker Emitted Speech on ASR performance
Vikram C M, Sanjoy Pal, Nidhi Mantri, Gopal Kumar Agrawal
Contextualized Automatic Speech Recognition with Dynamic Vocabulary Prediction and Activation
Zhennan Lin, Kaixun Huang, Wei Ren, Linju Yang, Lei Xie
Character Error Rate Estimation for Semi-Supervised Training of Speech Recognition for Arabic Dialects
Chanho Park, Oscar Saz
Unified Semi-Supervised Pipeline for Automatic Speech Recognition
Nune Tadevosyan, Nikolay Karpov, Andrei Andrusenko, Vitaly Lavrukhin, Ante Jukic
Scaling Laws for Synthetic Speech for Model Training
Christoph Minixhofer, Ondřej Klejch, Peter Bell
R2S: Real-to-Synthetic Representation Learning for Training Speech Recognition Models on Synthetic Data
Minh Tran, Debjyoti Paul, Yutong Pang, Laxmi Pandey, Jinxi Guo, Ke Li, Shun Zhang, Xuedong Zhang, Xin Lei
Context is all you need? Low-resource conversational ASR profits from context, coming from the same or from the other speaker
Julian Linke, Jana Winkler, Barbara Schuppler
Automatic Speech Recognition Biases in Newcastle English: an Error Analysis
Dana Serditova, Kevin Tang, Jochen Steffens
Speech Unlearning
Jiali Cheng, Hadi Amiri
Unlearning LLM-Based Speech Recognition Models
Zhe Liu
EASY: Emotion-aware Speaker Anonymization via Factorized Distillation
Jixun Yao, Hexin Liu, Eng Siong Chng, Lei Xie
Private kNN-VC: Interpretable Anonymization of Converted Speech
Carlos Franzreb, Arnab Das, Tim Polzehl, Sebastian Möller
Legally validated evaluation framework for voice anonymization
Nathalie Vauquier, Brij Mohan Lal Srivastava, Seyed Ahmad Hosseini, Emmanuel Vincent
Analyzing Mitigation Strategies for Catastrophic Forgetting in End-to-End Training of Spoken Language Models
Chi-Yuan Hsiao, Ke-Han Lu, Kai-Wei Chang, Chih-Kai Yang, Wei-Chih Chen, Hung-yi Lee
Speechless: Speech Instruction Training Without Speech for Low Resource Languages
Alan Dao, Dinh Bach Vu, Huy Hoang Ha, Tuan Le Duc Anh, Shreyas Gopal, Yue Heng Yeo, Warren Keng Hoong Low, Eng Siong Chng, Jia Qi Yip
LiSTEN: Learning Soft Token Embeddings for Neural Audio LLMs
Pooneh Mousavi, Shubham Gupta, Cem Subakan, Mirco Ravanelli
Cryfish: On deep audio analysis with Large Language Models
Anton Mitrofanov, Sergey Novoselov, Tatiana Prisyach, Vladislav Marchevskiy, Arseniy Karelin, Nikita Khmelev, Dmitry Dutov, Stepan Malykh, Igor Agafonov, Aleksandr Nikitin, Oleg Petrov
Improving Linguistic Diversity of Large Language Models with Possibility Exploration Fine-Tuning
Long Mai, Julie Carson-Berndsen
OpusLM: A Family of Open Unified Speech Language Models
Jinchuan Tian, William Chen, Yifan Peng, Jiatong Shi, Siddhant Arora, Shikhar Bharadwaj, Takashi Maekaku, Yusuke Shinohara, Keita Goto, Xiang Yue, Huck Yang, Shinji Watanabe
CAPR: Confidence-Aware Prompt Refinement in Large Language Models
Jen-Tzung Chien, Po-Chun Huang
The Interspeech 2025 Speech Accessibility Project Challenge
Xiuwen Zheng, Bornali Phukon, Jonghwan Na, Ed Cutrell, Kyu J. Han, Mark Hasegawa-Johnson, Pan-Pan Jiang, Aadhrik Kuila, Colin Lea, Bob MacDonald, Gautam Mantena, Venkatesh Ravichandran, Leda Sari, Katrin Tomanek, Chang D. Yoo, Chris Zwilling
Towards Inclusive and Fair ASR: Insights from the SAPC Challenge for Optimizing Disordered Speech Recognition
Nada Gohider, Otman Basir
Robust fine-tuning of speech recognition models via model merging: application to disordered speech
Alexandre Ducorroy, Rachid Riad
Exploring Generative Error Correction for Dysarthric Speech Recognition
Moreno La Quatra, Alkis Koudounas, Valerio Mario Salerno, Sabato Marco Siniscalchi
Pathology-Aware Speech Encoding and Data Augmentation for Dysarthric Speech Recognition
Ilja Baumann, Dominik Wagner, Korbinian Riedhammer, Tobias Bocklet
Personalized Fine-Tuning with Controllable Synthetic Speech from LLM-Generated Transcripts for Dysarthric Speech Recognition
Dominik Wagner, Ilja Baumann, Natalie Engert, Seanie Lee, Elmar Nöth, Korbinian Riedhammer, Tobias Bocklet
A Self-Training Approach for Whisper to Enhance Long Dysarthric Speech Recognition
Shiyao Wang, Jiaming Zhou, Shiwan Zhao, Yong Qin
Fine-tuning Parakeet-TDT for Dysarthric Speech Recognition in the Speech Accessibility Project Challenge
Kaito Takahashi, Keigo Hojo, Toshimitsu Sakai, Yukoh Wakabayashi, Norihide Kitaoka
CBA-Whisper: Curriculum Learning-Based AdaLoRA Fine-Tuning on Whisper for Low-Resource Dysarthric Speech Recognition
Tianyi Tan, Xinan Chen, Xiaohuai Le, Wenzhi Fan, Xianjun Xia, Chuanzeng Huang, Jing Lu
Cross-lingual Data Selection Using Clip-level Acoustic Similarity for Enhancing Low-resource Automatic Speech Recognition
Shunsuke Mitsumori, Sara Kashiwagi, Keitaro Tanaka, Shigeo Morishima
Spot and Merge: A Hybrid Context Biasing Approach for Rare Word and Out of Vocabulary Recognition
Jatin Agrawal, Bramhendra Koilakuntla, Srikanth Konjeti
Accurate, fast, cheap: Choose three. Replacing Multi-Head-Attention with Bidirectional Recurrent Attention for Long-Form ASR
Martin Ratajczak, Jean-Philippe Robichaud, Jennifer Drexler Fox
Improving Cross-Attention based on Positional Alignment during Inference for Robust Long-form Speech Recognition
Changhan Oh, Kiyoung Park, Jeomja Kang, Woo Yong Choi, Hwa Jeon Song
Improving End-to-end Mixed-case ASR with Knowledge Distillation and Integration of Voice Activity Cues
Sashi Novitasari, Takashi Fukuda, Gakuto Kurata
Cross-modal Knowledge Transfer Learning as Graph Matching Based on Optimal Transport for ASR
Xugang Lu, Peng Shen, Yu Tsao, Hisashi Kawai
Age-related changes in multisensory integration of emotions in an audiovisual face-prosody-semantics Stroop task
Yi Lin, Shumeng Ni, Yangfan Lu
Investigating effects of sex hormones, cycle phases and age on female fundamental frequency
Melanie Weirich, Adrian Simpson
Pre-aspiration in Iceland Is Conditioned by Gender/Sex
Meike Rommel, Míša Hejná, Nicole Dehé
Transcribing Diverse Voices: Using Whisper for ICE corpora
Andreas Weilinghoff
Is it all about race?: A Cross-examination of /s/ in a Multilingual (Nigerian) Context
Oluwasegun Amoniyan
Investigating Gender Bias in Text-to-Audio Generation Models
Aarish Shah Mohsin, Mohammad Nadeem, Shahab Saquib Sohail, Tughrul Arsalan, Mandar Gogate, Nasir Saleem, Amir Hussain
Dual Orthogonality Sub-center Loss for Enhanced Anomalous Sound Detection
Dong Wang, Jiqing Han, Tieran Zheng, Guibin Zheng, Yongjun He
Adaptive Across-Subcenter Representation Learning for Imbalanced Anomalous Sound Detection
Dong Wang, Jiqing Han, Guibin Zheng, Tieran Zheng, Yongjun He
Towards Few-Shot Training-Free Anomaly Sound Detection
Ho-Hsiang Wu, Wei-Cheng Lin, Abinaya Kumar, Luca Bondi, Shabnam Ghaffarzadegan, Juan Pablo Bello
Finetune Large Pre-Trained Model Based on Frequency-Wise Multi-Query Attention Pooling for Anomalous Sound Detection
Nan Jiang, Yan Song, Qing Gu, Haoyu Song, Lirong Dai, Ian McLoughlin
Acoustic Detection of UAV Abnormality Using One Ground-Based Acoustic Vector Sensor
Dengjian Zhou, Jianghan Hai, Sijia Liao, Yue Ivan Wu, Kainam Thomas Wong, Xiujuan Zheng
StarGAN-Aug: A Cross-domain Fault Audio Generation Method for High-performance Fault Diagnosis of Power Transformers
Ben Niu, Yangjie Wei, Gang Yang, Yuqiao Wang, Shengling Yu
SuPseudo: A Pseudo-supervised Learning Method for Neural Speech Enhancement in Far-field Speech Recognition
Longjie Luo, Lin Li, Qingyang Hong
Lightweight Front-end Enhancement for Robust ASR via Frame Resampling and Sub-Band Pruning
Siyi Zhao, Wei Wang, Yanmin Qian
Calm-Whisper: Reduce Whisper Hallucination On Non-Speech By Calming Crazy Heads Down
Yingzhi Wang, Anas Alhmoud, Saad Alsahly, Muhammad Alqurishi, Mirco Ravanelli
HuBERT-VIC: Improving Noise-Robust Automatic Speech Recognition of Speech Foundation Model via Variance-Invariance-Covariance Regularization
Hyebin Ahn, Kangwook Jang, Hoirin Kim
MOVER: Combining Multiple Meeting Recognition Systems
Naoyuki Kamo, Tsubasa Ochiai, Marc Delcroix, Tomohiro Nakatani
EmbedAug: An Augmentation Scheme for End-to-End Automatic Speech Recognition
Ashish Panda, Sunil Kumar Kopparapu
Attention Models and Auditory Transduction Features for Noise Robustness
Cathal Ó Faoláin, Andrew Hines
Lightweight and Robust Multi-Channel End-to-End Speech Recognition with Spherical Harmonic Transform
Xiangzhu Kong, Hao Huang, Zhijian Ou
On the Design of a Robust Superdirective Beamformer and Topology Parameter Optimization with Frustum-Shaped Microphone Arrays Featuring Multiple Rings
Kunlong Zhao, Gongping Huang, Xudong Zhao, Jingdong Chen, Jacob Benesty, Zoran Cvetkovic
Efficient Neural and Numerical Methods for High-Quality Online Speech Spectrogram Inversion via Gradient Theorem
Andres Fernandez, Juan Azcarreta Ortiz, Çağdaş Bilen, Jesus Monge Alvarez
Fine-Tuning Text-to-Speech Diffusion Models Using Reinforcement Learning with Human Feedback
Jingyi Chen, Ju Seung Byun, Micha Elsner, Pichao Wang, Andrew Perrault
Accelerating Diffusion-based Text-to-Speech Model Training with Dual Modality Alignment
Jeongsoo Choi, Zhikang Niu, Ji-Hoon Kim, Chunhui Wang, Joon Son Chung, Xie Chen
SpeechSEC: A Unified Multi-Task Framework for Speech Synthesis, Editing, and Continuation
Liming Liang, Dongchao Yang, Xianwei Zhuang, Yuxin Xie, Luo Chen, Yuehan Jin, Yuexian Zou
VoiceNoNG: Robust High-Quality Speech Editing Model without Hallucinations
Sung-Feng Huang, Heng-Cheng Kuo, Zhehuai Chen, Xuesong Yang, Pin-Jui Ku, Ante Jukic, Huck Yang, Yu Tsao, Yu-Chiang Frank Wang, Hung-yi Lee, Szu-Wei Fu
A Watermark for Auto-Regressive Speech Generation Models
Yihan Wu, Ruibo Chen, Georgios Milis, Junfeng Guo, Heng Huang
Speech Kinematic Analysis from Acoustics: Scientific, Clinical and Practical Applications
Carol Espy-Wilson
Influence of wall coverings of 3D-printed vocal tract models on measured transfer functions
Peter Birkholz, Dominik Schäfer, Patrick Häsner, Jihyeon Yun, Iris Kruppke, Rémi Blandin
Supralaryngeal Kinematics of Implosives in Central Vietnamese: An EMA Study
Paul McGuire, Kye Shibata, Thanh Viet Cao, Feng-fan Hsieh, Yueh-chin Chang
Lateral Channel Formation in Australian English /l/: Insights from Magnetic Resonance Imaging
Tünde Szalay, Michael Proctor, Amelia Gully, Tharinda Piyadasa, Craig Jin, David Waddington, Naeim Sanaei, Sheryl Foster, Kirrie Ballard
Articulatory variations in Apical Vowels in Southwestern Mandarin
Jing Huang, Feng-fan Hsieh, Yueh-chin Chang
Rhotic Articulation in Australian English: Insights from MRI
Michael Proctor, Tünde Szalay, Tharinda Piyadasa, Craig Jin, Naeim Sanaei, Amelia Gully, David Waddington, Sheryl Foster, Kirrie Ballard
Articulatory Strategy in Vowel Production as a Basis for Speaker Discrimination
Justin J. H. Lo, Patrycja Strycharczuk, Sam Kirkham
PAST: Phonetic-Acoustic Speech Tokenizer
Nadav Har-Tuv, Or Tal, Yossi Adi
Factorized RVQ-GAN For Disentangled Speech Tokenization
Sameer Khurana, Dominik Klement, Antoine Laurent, Dominik Boboš, Juraj Novosad, Peter Gazdik, Ellen Zhang, Zili Huang, Amir Hussein, Ricard Marxer, Yoshiki Masuyama, Ryo Aihara, Chiori Hori, François G. Germain, Gordon Wichern, Jonathan Le Roux
EnCodecMAE: leveraging neural codecs for universal audio representation learning
Leonardo Pepino, Pablo Riera, Luciana Ferrer
AxLSTMs: learning self-supervised audio representations with xLSTMs
Sarthak Yadav, Sergios Theodoridis, Zheng-Hua Tan
Real-time TSE demonstration via SoundBeam with KD
Keigo Wakayama, Tomoko Kawase, Takafumi Moriya, Marc Delcroix, Hiroshi Sato, Tsubasa Ochiai, Masahiro Yasuda, Shoko Araki
Real-Time Diffusion Buffer for Speech Enhancement On A Laptop
Bunlong Lay, Rostislav Makarov, Timo Gerkmann
Co-Speech Motion for Virtual Agents in Dialogue Using LLM-Driven Primitive Action Selection
Muhammad Yeza Baihaqi, Angel García Contreras, Seiya Kawano, Koichiro Yoshino
TargetVoice: Single Channel Low-Latency Target Speaker Extraction
Arun Kumar Pallala, Nivedita Chennupati, Balaji Padmanaban, Rakesh Pogula, Uma Subhashini Ravuri, Naveen Ellanki, Harish Rajamani, Naveen Ambati
Rollback Speech: Smart Feedback Prompts for Lost Utterances in Unstable Online Calls
Yuni Amaloa Quintero Villalobos, Wafaa Wardah, Sebastian Möller, Robert P. Spang
Simultaneous Speech Translation Integrated Compact Multiple Sound Spot Synthesis System On A Laptop Carried Out With A Backpack
Takuma Okamoto, Michiyo Kono
GenECA: A General-Purpose Framework for Real-Time Adaptive Multimodal Embodied Conversational Agents
Santosh Patapati, Aashrith Tatineni, Trisanth Srinivasan
Hybrid Expert Knowledge and Self-Supervised Learning for Diagnostic Modeling of Adductor Spasmodic and Primary Myotonic Dysphonia
Zhou Du, Hang Chen, Huijun Ding, Jun Du, Zhen Chen
MVP: Multi-source Voice Pathology detection
Alkis Koudounas, Moreno La Quatra, Gabriele Ciravegna, Marco Fantini, Erika Crosetti, Giovanni Succo, Tania Cerquitelli, Sabato Marco Siniscalchi, Elena Baralis
Phonetic Posteriorgram-Based Phoneme Selection for Vocal Cord Disorder Classification in Continuous Mandarin Speech
Chih-Ning Chen, Yu-Lan Chuang, Ming-Jhang Yang, Wei-Cheng Hsu, Yung-An Tsou, Yi-Wen Liu
Articulatory clarity and variability before and after surgery for tongue cancer
Thomas Tienkamp, Fleur van Ast, Roos van der Veen, Teja Rebernik, Raoul Buurke, Nikki Hoekzema, Katharina Polsterer, Hedwig Sekeres, Rob van Son, Martijn Wieling, Max Witjes, Sebastiaan de Visscher, Defne Abur
Hybrid HMM-SVM classifier using frication-based features for detection of non-normative sibilant articulation patterns in Polish children’s speech
Zuzanna Miodonska
Fine-Tuning ASR for Stuttered Speech: Personalized vs. Generalized Approaches
Dena Mujtaba, Nihar R. Mahapatra
SiamCTC: Learning Speech Representations through Monotonic Temporal Alignment
SooHwan Eom, Mark Hasegawa-Johnson, Chang D. Yoo
Improving Generalization of End-to-End ASR through Diversity and Independence Regularization
Ye-Eun Ko, Mun-Hak Lee, Dong-Hyun Kim, Joon-Hyuk Chang
Exploring Linear Variant Transformers and k-NN Memory Inference for Long-Form ASR
Carlos Carvalho, Jinchuan Tian, William Chen, Yifan Peng, Alberto Abad, Shinji Watanabe
Attention-Free Dual-Mode ASR with Latency-Controlled Selective State Spaces
Takafumi Moriya, Masato Mimura, Kiyoaki Matsui, Hiroshi Sato, Kohei Matsuura
Thinking Fast and Slow: Robust Speech Recognition via Deep Filter-Tuning
Dianwen Ng, Kun Zhou, Bin Ma, Eng Siong Chng
Towards Efficiently Whisper Fine-tuning with Monotonic Alignments
Ziyang Zhuang, Tao Wei, Ming Fang, Ning Cheng, Shaojun Wang, Jing Xiao
Dynamic Acoustic Model Architecture Optimization in Training for ASR
Jingjing Xu, Zijian Yang, Albert Zeyer, Eugen Beck, Ralf Schlüter, Hermann Ney
Knowledge Distillation Method for Pruned RNN-T Models via Pruning Bounds Sharing and Losses Confusion
Xiaocan Zhang, Weiwei Jiang, Guibin Zheng, Chenhao Jing, Jiqing Han, Tieran Zheng
An Effective Training Framework for Light-Weight Automatic Speech Recognition Models
Abdul Hannan, Alessio Brutti, Shah Nawaz, Mubashir Noman
Better Semi-supervised Learning for Multi-domain ASR Through Incremental Retraining and Data Filtering
Andrés Carofilis, Pradeep Rangappa, Srikanth Madikeri, Shashi Kumar, Sergio Burdisso, Jeena Prakash, Esaú Villatoro-Tello, Petr Motlicek, Bidisha Sharma, Kadri Hacioglu, Shankar Venkatesan, Saurabh Vyas, Andreas Stolcke
LASPA: Language Agnostic Speaker Disentanglement with Prefix-Tuned Cross-Attention
Aditya Srinivas Menon, Raj Prakash Gohil, Kumud Tripathi, Pankaj Wasnik
SCD-Conformer: Semantic Content Disentanglement for Text-Independent Speaker Verification
Shanshan Yao, Dianlong Liu, Tian Li
Universal Semantic Disentangled Privacy-preserving Speech Representation Learning
Biel Tura-Vecino, Subhadeep Maji, Aravind Varier, Antonio Bonafonte, Ivan Valles, Michael Owen, Constantinos Papayiannis, Leif Radel, Grant Strimel, Oluwaseyi Feyisetan, Roberto Barra-Chicote, Ariya Rastrow, Volker Leutnant, Trevor Wood
Using gender, phonation and age to interpret automatically discovered speech attributes for explainable speaker recognition
Carole Millot, Clara Ponchard, Cédric Gendrot, Jean-François Bonastre, Orane Dufour
Do you read me? - flow of speech effect on speaker recognition systems
Alicja Martinek, Joanna Gajewska, Ewelina Bartuzi-Trokielewicz
VoxAging: Continuously Tracking Speaker Aging with a Large-Scale Longitudinal Dataset in English and Mandarin
Zhiqi Ai, Meixuan Bao, Zhiyong Chen, Zhi Yang, Xinnuo Li, Shugong Xu
LLM-based Generative Error Correction for Rare Words with Synthetic Data and Phonetic Context
Natsuo Yamashita, Masaaki Yamamoto, Hiroaki Kokubo, Yohei Kawaguchi
ASR Confidence Estimation using True Class Lexical Similarity Score
Nagarathna Ravi, Thishyan Raj T, Ravi Teja Chaganti, Vipul Arora
Semi-Supervised Learning for Automatic Speech Recognition with Word Error Rate Estimation and Targeted Domain Data Selection
Chanho Park, Thomas Hain
Voice Activity-based Text Segmentation for ASR Text Denormalization
Sashi Novitasari, Takashi Fukuda, Gakuto Kurata
Phonetically-Augmented Discriminative Rescoring for Voice Search Error Correction
Christophe Van Gysel, Maggie Wu, Lyan Verwimp, Caglar Tirkaz, Marco Bertola, Zhihong Lei, Youssef Oualil
From Weak Labels to Strong Results: Utilizing 5,000 Hours of Noisy Classroom Transcripts with Minimal Accurate Data
Ahmed Attia, Dorottya Demszky, Jing Liu, Carol Espy-Wilson
Boundary-Conscious Pruning: Hard Set-Aware Model Compression for Efficient Speaker Recognition
Seongkyu Mun, Jubum Han
Pushing the Frontiers of Self-Distillation Prototypes Network with Dimension Regularization and Score Normalization
Yafeng Chen, Chong Deng, Hui Wang, Yiheng Jiang, Han Yin, Qian Chen, Wen Wang
A Domain Robust Pre-Training Method with Local Prototypes for Speaker Verification
Qing Gu, Yan Song, Haoyu Song, Nan Jiang, Lirong Dai, Ian McLoughlin
Clustering-based Hard Negative Sampling for Supervised Contrastive Speaker Verification
Piotr Masztalski, Michał Romaniuk, Jakub Żak, Mateusz Matuszewski, Konrad Kowalczyk
MASV: Speaker Verification with Global and Local Context Mamba
Yang Liu, Li Wan, Yiteng Huang, Ming Sun, Xinhao Mei, Xubo Liu, Yangyang Shi, Florian Metze
ATMM-SAGA: Alternating Training for Multi-Module with Score-Aware Gated Attention SASV system
Amro Asali, Yehuda Ben-Shimol, Itshak Lapidot
Rethinking Leveraging Pre-Trained Multi-Layer Representations for Speaker Verification
Jin Sob Kim, Hyun Joon Park, Wooseok Shin, Sung Won Han
SEED: Speaker Embedding Enhancement Diffusion Model
Kihyun Nam, Jungwoo Heo, Jee-weon Jung, Gangin Park, Chaeyoung Jung, Ha-Jin Yu, Joon Son Chung
A Copula-Based Generative Score-Level Fusion Model for Speaker Verification
Sandro Cumani
Acoustic similarities, articulatory uniqueness: Speech production mechanisms in individuals with congenital lip paralysis
Anne Hermes, Ivana Didirková, Philipp Buech, Gilles Vannuscorps
Relationship between objective and subjective perceptual measures of speech in individuals with head and neck cancer
Bence Mark Halpern, Thomas Tienkamp, Teja Rebernik, Rob J.J.H. van Son, Martijn Wieling, Defne Abur, Tomoki Toda
Evaluating the Usefulness of Non-Diagnostic Speech Data for Developing Parkinson's Disease Classifiers
Terry Yi Zhong, Esther Janse, Cristian Tejedor-Garcia, Louis ten Bosch, Martha Larson
Multimodal Assessment of Speech Impairment in Amyotrophic Lateral Sclerosis Using Audio-Visual and Machine Learning Approaches
Francesco Pierotti, Andrea Bandini
Development and Validation of a Wav2Vec 2.0-Based Cross-Language Methodology for Measurement of Articulatory Precision
Tanya Talkar, Kan Kawabata, Connor Higgins, Sean Tobyne
J-j-j-just Stutter: Benchmarking Whisper's Performance Disparities on Different Stuttering Patterns
Charan Sridhar, Shaomei Wu
MM-MovieDubber: Towards Multi-Modal Learning for Multi-Modal Movie Dubbing
Junjie Zheng, Zihao Chen, Chaofan Ding, Yunming Liang, Yihan Fan, Huan Yang, Lei Xie, Xinhan Di
Learning Phonetic Context-Dependent Viseme for Enhancing Speech-Driven 3D Facial Animation
Hyung Kyu Kim, Hak Gu Kim
Face2VoiceSync: Lightweight Face-Voice Consistency for Text-Driven Talking Face Generation
Fang Kang, Yin Cao, Haoyu Chen
Revival with Voice: Multi-modal Controllable Text-to-Speech Synthesis
Minsu Kim, Pingchuan Ma, Honglie Chen, Stavros Petridis, Maja Pantic
VisualSpeech: Enhancing Prosody Modeling in TTS Using Video
Shumin Que, Anton Ragni
LightL2S: Ultra-Low Complexity Lip-to-Speech Synthesis for Multi-Speaker Scenarios
Yifan Liang, Kang Yang, Fangkun Liu, Andong Li, Xiaodong Li, Chengshi Zheng
Processing of grammatical information in cochlear implant simulated speech by German adult listeners
Atty Schouwenaars, Esther Ruigendijk
A Gradient Effect of Hand Beat Timing on Spoken Word Recognition
Chengjia Ye, James M. McQueen, Hans Rutger Bosker
The Effect of Word Predictability on Spoken Cross-Language Intelligibility
Wei Xue, Iuliia Zaitova, Bernd Möbius
Sentence-Final Particles in Mandarin Child-Directed Speech: Frequency and Impact on Speech Rate
Yizhi Liu, Luyuan Geng, Yan Gu, Mengru Han
Bilingual Speakers Exhibit Cognitive Fatigue: A Speech Disfluencies Case Study on Research Talks
Ashwin Ram, Marisol Muñoz, Zoi Gkalitsiou, Alexandros G. Dimakis
Boosting StoRM Convergence with Metric Guidance and Non-uniform State-Sampling for Optimal Dereverberation
Chandra Mohan Sharma, Arnab Kumar Roy, Anupam Mandal, Prasanta Kumar Ghosh, Prasanna Kumar Kr
Unified Variational and Physics-aware Model for Room Impulse Response Estimation
Louis Lalay, Mathieu Fontaine, Roland Badeau
MelRe: Vision-Based Mel-Spectrogram Restoration
Kaixuan Luan, Xiaoda Yang, Shile Cai, Ruofan Hu, Minghui Fang, Wenrui Liu, Jialong Zuo, Jiaqi Duan, Yuhang Ma, Junyu Lu
SpeechRefiner: Towards Perceptual Quality Refinement for Front-End Algorithms
Sirui Li, Shuai Wang, Zhijun Liu, Zhongjie Jiang, Yannan Wang, Haizhou Li
Modality-Specific Speech Enhancement and Noise-Adaptive Fusion for Acoustic and Body-Conduction Microphone Framework
Yunsik Kim, Yoonyoung Chung
Joint Rate Allocation and Sensor Selection for Speech Enhancement in Wireless Acoustic Sensor Networks
De Hu, Qilong Li
Individualized speech enhancement for hearing-impaired listeners
Chuan Wen, Sarah Verhulst
First Analyze Then Enhance: A Task-Aware System for Speech Separation, Denoising, and Dereverberation
Shaoxiang Dang, Li Li, Shogo Seki, Hiroaki Kudo
A Robust Hybrid ACC-PM Approach for Personal Sound Zones
Yaqi Zhu, Lei Zhou, Hongqing Liu, Liming Shi, Lu Gan
Distilling a speech and music encoder with task arithmetic
Fabian Ritter-Gutierrez, Yi-Cheng Lin, Jui-Chiang Wei, Jeremy H.M Wong, Eng Siong Chng, Nancy F. Chen, Hung-yi Lee
MSDA: Combining Pseudo-labeling and Self-Supervision for Unsupervised Domain Adaptation in ASR
Dimitrios Damianos, Georgios Paraskevopoulos, Alexandros Potamianos
REB-former: RWKV-enhanced E-branchformer for Speech Recognition
Jie Song, Wang Xiang, Jian Zhou, Cunhang Fan, Zhao Lv
PredTrAD – Prediction-based Transformer for Anomaly Detection in Multivariate Time Series Data
Jan Schuster, Alexander Wölfel, Fabian Brunner, Christian Bergler
FairASR: Fair Audio Contrastive Learning for Automatic Speech Recognition
Jongsuk Kim, Jaemyung Yu, Minchan Kwon, Junmo Kim
Automatic Speech Recognition of African American English: Lexical and Contextual Effects
Hamid Mojarad, Kevin Tang
Improving Synthetic Data Training for Contextual Biasing Models with a Keyword-Aware Cost Function
Chin Yuen Kwok, Jia Qi Yip, Eng Siong Chng
SOMSRED-SVC: Sequential Output Modeling with Speaker Vector Constraints for Joint Multi-Talker Overlapped ASR and Speaker Diarization
Naoki Makishima, Naotaka Kawata, Taiga Yamane, Mana Ihori, Tomohiro Tanaka, Satoshi Suzuki, Shota Orihashi, Ryo Masumura
Thinking in Directivity: Speech Large Language Model for Multi-Talker Directional Speech Recognition
Jiamin Xie, Ju Lin, Yiteng Huang, Tyler Vuong, Zhaojiang Lin, Zhaojun Yang, Peng Su, Prashant Rawat, Sangeeta Srivastava, Ming Sun, Florian Metze
A Study of Real-world Audio-Visual Corpus Design and Production: A Perspective from MISP Challenges
Hang Chen, Jun Du, Qing Wang, Juan Xie, Shi-Fu Xiong
VCapAV: A Video-Caption Based Audio-Visual Deepfake Detection Dataset
Yuxi Wang, Yikang Wang, Qishan Zhang, Hiromitsu Nishizaki, Ming Li
J-SPAW: Japanese speaker verification and spoofing attacks recorded in-the-wild dataset
Sayaka Shiota, Suzuka Horie, Kouta Kanno, Shinnosuke Takamichi
CommissionsQC: a Québec French Speech Corpus for Automatic Speech Recognition
Coralie Serrand, Amira Morsli, Gilles Boulianne
Granary: Speech Recognition and Translation Dataset in 25 European Languages
Nithin Rao Koluguri, Monica Sekoyan, George Zelenfroynd, Sasha Meister, Shuoyang Ding, Sofia Kostandian, He Huang, Nikolay Karpov, Jagadeesh Balam, Vitaly Lavrukhin, Yifan Peng, Sara Papi, Marco Gaido, Alessio Brutti, Boris Ginsburg
Collecting, Curating, and Annotating Good Quality Speech deepfake dataset for Famous Figures: Process and Challenges
Hashim Ali, Surya Subramani, Raksha Varahamurthy, Nithin Adupa, Lekha Bollinani, Hafiz Malik
Quantifying and Reducing Speaker Heterogeneity within the Common Voice Corpus for Phonetic Analysis
Miao Zhang, Aref Farhadipour, Annie Baker, Jiachen Ma, Bogdan Pricop, Eleanor Chodroff
The Speech Accessibility Project: Best Practices for Collection and Curation of Disordered Speech
Chris Zwilling, Mark Hasegawa-Johnson, Heather Hodges, Lorraine Ramig, Adina Bradshaw, Clarion Mendes, Heejin Kim, Alexandria Barkhimer, Laura Mattie, Meg Dickinson, Shawnise Carter, Marie Moore Channell
Challenges and practical guidelines for atypical speech data collection, annotation, usage and sharing: A multi-project perspective
Zhengjun Yue, Mara Barberis, Tanvina Patel, Judith Dineley, Willemijn Doedens, Lottie Stipdonk, YuanYuan Zhang, Elke de Witte, Erfan Loweimi, Hugo Van hamme, Djaina Satoer, Marina Ruiter, Laureano Moro Velazquez, Nicholas Cummins, Odette Scharenborg
Fifteen Years of Child-Centered Long-Form Recordings: Promises, Resources, and Remaining Challenges to Validity
Loann Peurey, Marvin Lavechin, Tarek Kunze, Manel Khentout, Lucas Gautheron, Emmanuel Dupoux, Alejandrina Cristia
Contextual Paralinguistic Data Creation for Multi-Modal Speech-LLM: Data Condensation and Spoken QA Generation
Qiongqiong Wang, Hardik B. Sailor, Tianchi Liu, Ai Ti Aw
Investigating Affect Mining Techniques for Annotation Sample Selection in the Creation of Finnish Affective Speech Corpus
Kalle Lahtinen, Einari Vaaras, Liisa Mustanoja, Okko Räsänen
Scalable Spontaneous Speech Dataset (SSSD): Crowdsourcing Data Collection to Promote Dialogue Research
Zaid Sheikh, Shuichiro Shimizu, Siddhant Arora, Jiatong Shi, Samuele Cornell, Xinjian Li, Shinji Watanabe
A Multimodal Chinese Dataset for Cross-lingual Sarcasm Detection
Xiyuan Gao, Bruce Xiao Wang, Meiling Zhang, Shuming Huang, Zhu Li, Shekhar Nayak, Matt Coler
Leveraging Large Language Models for Sarcastic Speech Annotation in Sarcasm Detection
Zhu Li, Yuqing Zhang, Xiyuan Gao, Shekhar Nayak, Matt Coler
Analysis of the ABC Classification Backends for NIST SRE24
Sandro Cumani, Anna Silnova, Sara Barahona, Ladislav Mošner, Oldřich Plchot, Johan Rohdin
STCON NIST SRE24 System: Composite Speaker Recognition Solution for Challenging Scenarios
Stepan Malykh, Alexander Anikin, Nikita Khmelev, Anastasia Korenevskaya, Anastasia Zorkina, Sergey Novoselov, Vladislav Marchevskiy, Vladimir Volokhov, Andrey Shulipa, Alexander Kozlov, Alexander Melnikov, Vasiliy Galyuk, Timur Pekhovskiy
Vo-Ve: An Explainable Voice-Vector for Speaker Identity Evaluation
Jaejun Lee, Kyogu Lee
Variability in performance across four generations of automatic speaker recognition systems
Lauren Harrington, Vincent Hughes, Philip Harrison, Paul Foulkes, Jessica Wormald, Finnian Kelly, David van der Vloed
On the influence of language similarity in non-target speaker verification trials
Paul M. Reuter, Michael Jessen
The Sub-3Sec Problem: From Text-Independent to Text-Dependent Corpus
Ruichen Zuo, Kong Aik Lee, Zilong Huang, Man-Wai Mak
ToxicTone: A Mandarin Audio Dataset Annotated for Toxicity and Toxic Utterance Tonality
Yu-Xiang Luo, Yi-Cheng Lin, Ming-To Chuang, Jia-Hung Chen, I-Ning Tsai, Pei Xing Kiew, Yueh-Hsuan Huang, Chien-Feng Liu, Yu-Chen Chen, Bo-Han Feng, Wenze Ren, Hung-yi Lee
ViToSA: Audio-Based Toxic Spans Detection on Vietnamese Speech Utterances
Huy Ba Do, Vy Le-Phuong Huynh, Luan Thanh Nguyen
Self-Supervised Models of Speech Processing for Haitian Creole
William N. Havard, Renauld Govain, Benjamin Lecouteux, Emmanuel Schang
AfriHuBERT: A self-supervised speech representation model for African languages
Jesujoba O. Alabi, Xuechen Liu, Dietrich Klakow, Junichi Yamagishi
The Faetar Speech Recognition Benchmark
Michael Ong, Sean Robertson, Leo Peckham, Alba Jorquera Jimenez de Aberasturi, Paula Arkhangorodsky, Robin Huo, Aman Sakhardande, Mark Hallap, Naomi Nagy, Ewan Dunbar
LHCP-ASR: An English Speech Corpus of High-Energy Particle Physics Talks for Narrow-Domain ASR Benchmarking
Jaume Santamaría-Jordà, Pablo Segovia-Martínez, Gonçal V. Garcés Díaz-Munío, Joan Albert Silvestre-Cerdà, Adrià Giménez, Rubén Gaspar Aparicio, René Fernández Sánchez, Jorge Civera, Albert Sanchis, Alfons Juan
Towards High-Quality LLM-Based Data for French Spontaneous Speech Simplification: an Exo-Refinement Approach
Lucía Ormaechea, Nikos Tsourakis, Pierrette Bouillon, Benjamin Lecouteux, Didier Schwab
BR-ASR: Efficient and Scalable Bias Retrieval Framework for Contextual Biasing ASR in Speech LLM
Xun Gong, Anqi Lv, Wangyou Zhang, Zhiming Wang, Huijia Zhu, Yanmin Qian
SPGISpeech 2.0: Transcribed multi-speaker financial audio for speaker-tagged transcription
Raymond Grossman, Taejin Park, Kunal Dhawan, Andrew Titus, Sophia Zhi, Yulia Shchadilova, Weiqing Wang, Jagadeesh Balam, Boris Ginsburg
Loquacious Set: 25,000 Hours of Transcribed and Diverse English Speech Recognition Data for Research and Commercial Use
Titouan Parcollet, Yuan Tseng, Shucong Zhang, Rogier C. van Dalen
CEREALES : a new dataset of Quebec French accented speech with applications to speech recognition
Lucas Maison, Thomas Soulas, Marie-Jean Meurs
A Semantic Information-based Hierarchical Speech Enhancement Method Using Factorized Codec and Diffusion Model
Yang Xiang, Canan Huang, Desheng Hu, Jingguang Tian, Xinhui Hu, Chao Zhang
Voice-ENHANCE: Speech Restoration using a Diffusion-based Voice Conversion Framework
Kyungguen Byun, Jason Filos, Erik Visser, Sunkuk Moon
SNR-Aligned Consistent Diffusion for Adaptive Speech Enhancement
Yonghyeon Jun, Beomjun Woo, Myeonghun Jeong, Namsoo Kim
MDDM: A Multi-view Discriminative Enhanced Diffusion-based Model for Speech Enhancement
Nan Xu, Zhaolong Huang, Xiaonan Zhi
A Neural Codec Approach for Noise-Robust Bandwidth Expansion
Xi Liu, Mu Yang, Szu-Jui Chen, John H.L. Hansen
HWB-Net: A Novel High-Performance and Efficient Hybrid Waveform Bandwidth Extension Method
Xin Liu, Shulin He, Xueliang Zhang
Frequency-Domain Enhanced Extreme Bandwidth Extension Network with ICCRN for Superior Speech Quality
Hongtao Bao, Xueliang Zhang
QUADS: Quantized Distillation Framework for Efficient Speech Language Understanding
Subrata Biswas, Mohammad Nur Hossain Khan, Bashima Islam
Spoken Language Understanding on Unseen Tasks With In-Context Learning
Neeraj Agrawal, Sriram Ganapathy
Leveraging Information Retrieval to Enhance Spoken Language Understanding Prompts in Few-Shot Learning
Pierre Lepagnol, Sahar Ghannay, Thomas Gerald, Christophe Servan, Sophie Rosset
Modeling Multi-Turn Spoken Language Understanding with Dynamic Graph Convolutional Networks
Yi Huang, Si Chen, Jingyu Yao, Junlan Feng
DRI-GAN: A Novel Dual Real Input GAN with Triplet Loss for Cross-Lingual and Noisy SLU
Ankit Kumar, Munir Georges
“KAN you hear me?” Exploring Kolmogorov-Arnold Networks for Spoken Language Understanding
Alkis Koudounas, Moreno La Quatra, Eliana Pastor, Sabato Marco Siniscalchi, Elena Baralis
Rasmalai : Resources for Adaptive Speech Modeling in IndiAn Languages with Accents and Intonations
Ashwin Sankar, Yoach Lacombe, Sherry Thomas, Praveen Srinivasa Varadhan, Sanchit Gandhi, Mitesh M. Khapra
Kinship in Speech: Leveraging Linguistic Relatedness for Zero-Shot TTS in Indian Languages
Utkarsh Pathak, Chandra Sai Krishna Gunda, Anusha Prakash, Keshav Agarwal, Hema A. Murthy
Can We Reconstruct a Dysarthric Voice with the Large Speech Model Parler TTS?
Ariadna Sanchez, Simon King
Voice Adaptation for Swiss German
Samuel Stucki, Jan Deriu, Mark Cieliebak
Gradual modeling of the Lombard effect by modifying speaker embeddings from a Text-To-Speech model
Thiago Henrique Gomes Lobato, Magnus Schäfer
When Humans Growl and Birds Speak: High-Fidelity Voice Conversion from Human to Animal and Designed Sounds
Minsu Kang, Seolhee Lee, Choonghyeon Lee, Namhyun Cho
EEG-based Voice Conversion : Hearing the Voice of Your Brain
Yizhong Geng, Wenxin Fu, Qihang Lu, Bingsong Bai, Cong Wang, Yingming Gao, Ya Li
Streaming Non-Autoregressive Model for Accent Conversion and Pronunciation Improvement
Tuan-Nam Nguyen, Ngoc-Quan Pham, Şeymanur Akti, Alexander Waibel
Zero-Shot Mono-to-Binaural Speech Synthesis
Alon Levkovitch, Julian Salazar, Soroosh Mariooryad, RJ Skerry-Ryan, Nadav Bar, Bastiaan Kleijn, Eliya Nachmani
Tone recognition in low-resource languages of North-East India: peeling the layers of SSL-based speech models
Parismita Gogoi, Sishir Kalita, Wendy Lalhminghlui, Viyazonuo Terhiija, Moakala Tzudir, Priyankoo Sarmah, S. R. M. Prasanna
Corpus-Based Insights into Mandarin Neutral Tone: Effects of Tonal Context and Structural Patterns in Spontaneous Speech
Jingyi Sun, Nicolas Audibert, Yaru Wu, Martine Adda-Decker
Tonal Variation and Word Meaning in Taiwanese
Yu-Ying Chuang, Sheng-Fu Wang
Sounding Like a Winner? Prosodic Differences in Post-Match Interviews
Sofoklis Kakouros, Haoyu Chen
Exploratory Study of Filled Pauses in Ukrainian Language: Phonetic Properties of Filled Pauses
Anna Havras, Carlos Mendes, Helena Moniz, Gueorgui Hristovsky, João Miranda
Evaluating the suitability of acoustic parameters for capturing breathy voice in non-pathological female speakers
Chloe Patman, Paul Foulkes, Kirsty McDougall
Robustness of F0 Ratio as a Diagnostic: Comparing Creaky Voice in Danish and Seoul Korean
Michaela Watkins, Rasmus Puggaard-Rode, Paul Boersma, Silke Hamann
Discovering Directions of Uncertainty in Speech Inpainting
Kfir Cohen, Lior Wolf, Bracha Laufer-Goldshtein
InfiniteAudio: Infinite-Length Audio Generation with Consistency
Chaeyoung Jung, Hojoon Ki, Ji-Hoon Kim, Junmo Kim, Joon Son Chung
FoleyMaster: High-Quality Video-to-Audio Synthesis via MLLM-Augmented Prompt Tuning and Joint Semantic-Temporal Adaptation
Liming Liang, Luo Chen, Yuehan Jin, Xianwei Zhuang, Yuxin Xie, Yongkang Yin, Yuexian Zou
Video-to-Audio Generation with Fine-grained Temporal Semantics
Yuchen Hu, Yu Gu, Chenxing Li, Rilin Chen, Dong Yu
TTMBA: Towards Text To Multiple Sources Binaural Audio Generation
Yuxuan He, Xiaoran Yang, Ningning Pan, Gongping Huang
EzAudio: Enhancing Text-to-Audio Generation with Efficient Diffusion Transformer
Jiarui Hai, Yong Xu, Hao Zhang, Chenxing Li, Helin Wang, Mounya Elhilali, Dong Yu
You Are What You Say: Exploiting Linguistic Content for VoicePrivacy Attacks
Ünal Ege Gaznepoglu, Anna Leschanowsky, Ahmad Aloradi, Prachi Singh, Daniel Tenbrinck, Emanuël A. P. Habets, Nils Peters
Recognizing Every Voice: Towards Inclusive ASR for Rural Bhojpuri Women
Sakshi Joshi, Eldho Ittan George, Tahir Javed, Kaushal Bhogale, Nikhil Narasimhan, Mitesh M. Khapra
Augment Mandarin to Cantonese Speech Databases via Retrieval-Augmented Generation and Speech Synthesis
Fan Liu, Cheng Gong, Boyu Zhu, Ruihao Jing, Chunyu Qiang, Tianrui Wang, Xiao-Lei Zhang, Xuelong Li
An Exploratory Framework for LLM-assisted Human Annotation of Speech Datasets
Alexander Johnson, Harsh Deshpande, Emmy Phung, Ahmad Emami
Automatic Labeling and Correction of Noisy Labels for Robust Self-Supervised Speaker Verification
Abderrahim Fathan, Jahangir Alam
Auto-Landmark: Acoustic Landmark Dataset and Open-Source Toolkit for Landmark Extraction
Xiangyu Zhang, Daijiao Liu, Tianyi Xiao, Cihan Xiao, Tünde Szalay, Mostafa Shahin, Beena Ahmed, Julien Epps
AusKidTalk: Using Strategic Data Collection and Out-of-Domain Tools to Semi-Automate Novel Corpora Annotation
Tünde Szalay, Mostafa Shahin, Tharmakulasingam Sirojan, Zheng Nan, Renata Huang, Kirrie Ballard, Beena Ahmed
ASR-based segmentation for the analysis of larger child-speech datasets: Performance evaluation on vowels from Australian-English speaking children aged 4 to 11 years
Rui Cai, Titia Benders
A semi-automatic pipeline for transcribing and segmenting child speech
Polychronia Christodoulidou, James Tanner, Jane Stuart-Smith, Michael McAuliffe, Mridhula Murali, Amy Smith, Lauren Taylor, Joanne Cleland, Anja Kuschmann
Hybrid Data Sampling for ASR: Integrating Acoustic Diversity and Transcription Uncertainty
Komei Hiruta, Yosuke Yamano, Hideaki Tamori
Whilter: A Whisper-based Data Filter for "In-the-Wild" Speech Corpora Using Utterance-level Multi-Task Classification
William Ravenscroft, George Close, Kit Bower-Morris, Jamie Stacey, Dmitry Sityaev, Kris Y. Hong
Adapting Whisper for low-resource Hindi-English Code-Mix speech with on-the-fly Augmentation & LLM-Synthesised Data
Astik Biswas, Oleg Shevelev, Amine Abdaoui, Vivek Tyagi, Abdelmoumene Boumadane
Optimizing ASR for Catalan-Spanish Code-Switching: A Comparative Analysis of Methodologies
Carlos Mena, Pol Serra, Jacobo Romero, Abir Messaoudi, Jose Giraldo, Carme Armentano-Oller, Rodolfo Zevallos, Ivan Meza, Javier Hernando
From Scarcity to Sufficiency: Speech Recognition Pipeline for Zero-resource Language
Nikolay Karpov, Sofia Kostandian, Nune Tadevosyan, Alexan Ayrapetyan, Andrei Andrusenko, Ara Yeroyan, Mher Yerznkanyan, Vitaly Lavrukhin
MIKU-PAL: An Automated and Standardized Multimodal Method for Speech Paralinguistic and Affect Labeling
Yifan Cheng, Ruoyi Zhang, Jiatong Shi
Multimodal Fusion with Semi-Supervised Learning Minimizes Annotation Quantity for Modeling Videoconference Conversation Experience
Andrew Chang, Chenkai Hu, Ji Qi, Zhuojian Wei, Kexin Zhang, Viswadruth Akkaraju, David Poeppel, Dustin Freeman
Clinical Annotations for Automatic Stuttering Severity Assessment
Ana Valente, Rufael Marew, Hawau Toyin, Hamdan Al-Ali, Anelise Bohnen, Inma Becerra, Elsa Soares, Gonçalo Leal, Hanan Aldarmaki
Frozen Large Language Models Can Perceive Paralinguistic Aspects of Speech
Wonjune Kang, Junteng Jia, Chunyang Wu, Wei Zhou, Egor Lakomkin, Yashesh Gaur, Leda Sari, Suyoun Kim, Ke Li, Jay Mahadeokar, Ozlem Kalinli
AA-SLLM: An Acoustically Augmented Speech Large Language Model for Speech Emotion Recognition
Jialong Mai, Xiaofen Xing, Weidong Chen, Yuanbo Fang, Xiangmin Xu
Speaker-Aware Multi-Task Learning for Speech Emotion Recognition
Xiaohan Shi, Xingfeng Li, Tomoki Toda
Multimodal Emotion Diarization: Frame-Wise Integration of Text and Audio Representations
Ziv Tamir, Thomas Thebaud, Jesus Villalba, Najim Dehak, Oren Kurland
Analysis of Phonetic Level Similarities Across Languages in Emotional Speech
Pravin Mote, Abinay Reddy Naini, Donita Robinson, Elizabeth Richerson, Carlos Busso
Label Semantic-Driven Contrastive Learning for Speech Emotion Recognition
Jiaxi Hu, Leyuan Qu, Haoxun Li, Taihao Li
Pitch Contour Model (PCM) with Transformer Cross-Attention for Speech Emotion Recognition
Minji Ryu, Ji-Hyeon Hur, Sung Heuk Kim, Gahgene Gweon
EATS-Speech: Emotion-Adaptive Transformation and Priority Synthesis for Zero-Shot Text-to-Speech
Jingyuan Xing, Zhipeng Li, Shuaiqi Chen, Xiaofen Xing, Xiangmin Xu
Voice Impression Control in Zero-Shot TTS
Kenichi Fujita, Shota Horiguchi, Yusuke Ijima
EME-TTS: Unlocking the Emphasis and Emotion Link in Speech Synthesis
Haoxun Li, Leyuan Qu, Jiaxi Hu, Taihao Li
DiEmo-TTS: Disentangled Emotion Representations via Self-Supervised Distillation for Cross-Speaker Emotion Transfer in Text-to-Speech
Deok-Hyeon Cho, Hyung-Seok Oh, Seung-Bin Kim, Seong-Whan Lee
Spotlight-TTS: Spotlighting the Style via Voiced-Aware Style Extraction and Style Direction Adjustment for Expressive Text-to-Speech
Nam-Gyu Kim, Deok-Hyeon Cho, Seung-Bin Kim, Seong-Whan Lee
Speaker-agnostic Emotion Vector for Cross-speaker Emotion Intensity Control
Masato Murata, Koichi Miyazaki, Tomoki Koriyama
SA-RAS: Speaker-Aware Style Retrieval Augmented Generation for Expressive Zero-Shot Text-to-Speech Synthesis
Xueru Li, Jingyuan Xing, Xiaofen Xing, Zhipeng Li, Xiangmin Xu
DiffEmotionVC: A Dual-Granularity Disentangled Diffusion Framework for Any-to-Any Emotional Voice Conversion
Xiaosu Su, BoWen Yang, Xiaowei Yi, Yun Cao
ZSDEVC: Zero-Shot Diffusion-based Emotional Voice Conversion with Disentangled Mechanism
Hsing-Hang Chou, Yun-Shao Lin, Ching-Chin Sung, Yu Tsao, Chi-Chun Lee
MPE-TTS: Customized Emotion Zero-Shot Text-To-Speech Using Multi-Modal Prompt
Zhichao Wu, Yueteng Kang, Songjun Cao, Long Ma, Qiulin Li, Qun Yang
MFLA: Monotonic Finite Look-ahead Attention for Streaming Speech Recognition
Yinfeng Xia, Huiyan Li, Chenyang Le, Manhong Wang, Yutao Sun, Xingyang Ma, Yanmin Qian
Delayed-KD: Delayed Knowledge Distillation based CTC for Low-Latency Streaming ASR
Longhao Li, Yangze Li, Hongfei Xue, Jie Liu, Shuai Fang, Kai Wang, Lei Xie
Parameter-efficient Fine-tuning of Conformer-based Streaming Speech Recognition into Non-streaming Models
Yunjae Nam, Jeong U Han, Kiyeon Kim, Jaemin Lim
On-device Streaming Discrete Speech Units
Kwanghee Choi, Masao Someki, Emma Strubell, Shinji Watanabe
Adapting Whisper for Streaming Speech Recognition via Two-Pass Decoding
Haoran Zhou, Xingchen Song, Brendan Fahy, Qiaochu Song, Binbin Zhang, Zhendong Peng, Anshul Wadhawan, Denglin Jiang, Apurv Verma, Vinay Ramesh, Srivas Prasad, Michele M. Franceschini
Dynamic Context-Aware Streaming Pretrained Language Model For Inverse Text Normalization
Luong Ho, Khanh Le, Vinh Pham, Bao Nguyen, Tan Tran, Duc Chau
Evaluating Progress of CALL System Users on Accentedness and Comprehensibility: An Acoustic and ASR-Based Approach
Wenwei Dong, Catia Cucchiarini, Roeland van Hout, Helmer Strik
Does English fish sound like French fiche? Perceptual similarity judgments versus acoustic similarity
Rory Turnbull, Elisa Kiefer, Sharon Peperkamp
Acoustic Features of Mandarin Tone Production in Noise: A Comparison Between Chinese Native Speakers and Korean L2 Learners
Jinxin Ji, Yiying Hu, Xiaohu Yang, Gang Peng
The Role of Contextual Variation in Learning Cantonese Tones from Naturalistic Speech
Fengyue Lisa Zhao, Jennifer Kuo
Pitch Target Realization in Putonghua Tone Production of Children from Dialect-Speaking Regions
Mengxue Cao, Tianxin Zheng, Jiewen Zheng
The Development of Speech Rhythm in Putonghua-Learning Preschool Children in South Xinjiang Uyghur Autonomous Region of China
Aijun Li, Zhiwei Wang, Jun Gao, Xin Zhou
PARROT: Synergizing Mamba and Attention-based SSL Pre-Trained Models via Parallel Branch Hadamard Optimal Transport for Speech Emotion Recognition
Orchid Chetia Phukan, Mohd Mujtaba Akhtar, Girish, Swarup Ranjan Behera, Jaya Sai Kiran Patibandla, Arun Balaji Buduru, Rajesh Sharma
Towards Machine Unlearning for Paralinguistic Speech Processing
Orchid Chetia Phukan, Girish, Mohd Mujtaba Akhtar, Shubham Singh, Swarup Ranjan Behera, Vandana Rajan, Muskaan Singh, Arun Balaji Buduru, Rajesh Sharma
Infant Cry Emotion Recognition Using Improved ECAPA-TDNN with Multi-scale Feature Fusion and Attention Enhancement
Junyu Zhou, Yanxiong Li, Haolin Yu
Speech Multi-label Emotion Recognition Using Asymmetric Class Loss Function Based on Effective Samples
Shanshan Xiang, Hankiz Yilahun, Askar Hamdulla
EmoDB 2.0: A Database of Emotional Speech in a World that is not Black or White but Grey
Felix Burkhardt, Oliver Schrüfer, Uwe Reichel, Hagen Wierstorf, Anna Derington, Florian Eyben, Björn W. Schuller
Cross-corpus open-set Speech Emotion Recognition Method Based on Spatiotemporal Features with Inverse-Entropy Regularization
Zhaohui Zhou, Hui Luo
CLEP-DG: Contrastive Learning for Speech Emotion Domain Generalization via Soft Prompt Tuning
Jiacheng Shi, Yanfu Zhang, Ye Gao
Leveraging Unlabeled Audio-Visual Data in Speech Emotion Recognition using Knowledge Distillation
Varsha Pendyala, Pedro Morgado, William Sethares
Who knows best? Effects of speech disfluencies on incentivized decision-making
Ambika Kirkland, Jens Edlund
Enhancing Transcripts of Open-Source Automatic Speech Recognition Models Through Fine-Tuning with Laughter and Speech-Laugh
Phuoc Hoang Ho, Dragoș Alexandru Bălan, Dirk K. J. Heylen, Khiet P. Truong
Investigating the Reasoning Abilities of Large Language Models for Understanding Spoken Language in Interpersonal Interactions
Pranjal Aggarwal, Ghritachi Mahajani, Pavan Kumar Malasani, Vaibhav Jamadagni, Caroline J. Wendt, Ehsanul Haque Nirjhar, Theodora Chaspari
A Naturally Elicited Multimodal Stress Database and Speech Breathing Based Stress Detection
Karumannil Mohamed Ismail Yasar Arafath, Mohammed Abeer K. C., Aurobinda Routray
From Context to Code-switching: Examining the Interplay of Language Proficiency and Multilingualism in Speech
Debasmita Bhattacharya, Aanya Tolat, Julia Hirschberg
Extending the Fongbe to French Speech Translation Corpus: resources, models and benchmark
D. Fortuné Kponou, Salima Mdhaffar, Fréjus A. A. Laleye, Eugène C. Ezin, Yannick Estève
On the Relationship between Accent Strength and Articulatory Features
Kevin Huang, Sean Foley, Jihwan Lee, Yoonjeong Lee, Dani Byrd, Shrikanth Narayanan
A Multi-Stream Framework Utilizing 3D Human Reconstruction for Cued Speech Recognition
Katerina Papadimitriou, Gerasimos Potamianos
On the cross-modal makeup of charisma: Insights from a field-data analysis
Oliver Niebuhr
Generalizable Audio Spoofing Detection using Non-Semantic Representations
Arnab Das, Yassine El Kheir, Carlos Franzreb, Tim Herzig, Tim Polzehl, Sebastian Möller
Adversarial Attacks on Text-dependent Speaker Verification System
Sreekanth Sankala, Venkatesh Parvathala, Ramesh Gundluru, K. Sri Rama Murty
Beyond Attacks: Advancing Fake Speech Detection with Attack-Agnostic Methods
Shilpa Chandra, Akansha Tyagi, Shiven Patel, Padmanabhan Rajan
ASVspoof2019 vs. ASVspoof5: Assessment and Comparison
Avishai Weizman, Yehuda Ben-Shimol, Itshak Lapidot
Evaluating Parameter Sharing for Spoofing-Aware Speaker Verification: A Case Study on the ASVspoof 5 Dataset
Aykut Büker, Oğuzhan Kurnaz, Şule Bekiryazıcı, Selim Can Demirtaş, Cemal Hanilçi
Can Quantized Audio Language Models Perform Zero-Shot Spoofing Detection?
Bikash Dutta, Rishabh Ranjan, Shyam Sathvik, Mayank Vatsa, Richa Singh
ClapFM-EVC: High-Fidelity and Flexible Emotional Voice Conversion with Dual Control from Natural Language and Speech
Yu Pan, Yanni Hu, Yuguang Yang, Jixun Yao, Jianhao Ye, Hongbin Zhou, Lei Ma, Jianjun Zhao
PromptEVC: Controllable Emotional Voice Conversion with Natural Language Prompts
Tianhua Qi, Shiyan Wang, Cheng Lu, Tengfei Song, Hao Yang, Zhanglin Wu, Wenming Zheng
StarVC: A Unified Auto-Regressive Framework for Joint Text and Speech Generation in Voice Conversion
Fengjin Li, Jie Wang, Yadong Niu, Yongqing Wang, Meng Meng, Jian Luan, Zhiyong Wu
FasterVoiceGrad: Faster One-step Diffusion-Based Voice Conversion with Adversarial Diffusion Conversion Distillation
Takuhiro Kaneko, Hirokazu Kameoka, Kou Tanaka, Yuto Kondo
Evaluating the Effectiveness of Pre-Trained Audio Embeddings for Classification of Parkinson's Disease Speech Data
Emmy Postma, Cristian Tejedor-Garcia
On-the-fly Routing for Zero-shot MoE Speaker Adaptation of Speech Foundation Models for Dysarthric Speech Recognition
Shujie Hu, Xurong Xie, Mengzhe Geng, Jiajun Deng, Huimeng Wang, Guinan Li, Chengxi Deng, Tianzi Wang, Mingyu Cui, Helen Meng, Xunying Liu
Lightweight Speech Enhancement for Mandarin Esophageal Speech
Jia-Jyu Su, Yen-Ting Lin, Wu-Hao Li, Chao-Kai Chang, Yan-Zhi Chen, Chen-Yu Chiang
VocalAgent: Large Language Models for Vocal Health Diagnostics with Safety-Aware Evaluation
Yubin Kim, Taehan Kim, Wonjune Kang, Eugene Park, Joonsik Yoon, Dongjae Lee, Xin Liu, Daniel McDuff, Hyeonhoon Lee, Cynthia Breazeal, Hae Won Park
A Cookbook for Community-driven Data Collection of Impaired Speech in Low-Resource Languages
Sumaya Ahmed Salihs, Isaac Wiafe, Jamal-Deen Abdulai, Elikem Doe Atsakpo, Gifty Ayoka, Richard Cave, Akon Obu Ekpezu, Catherine Holloway, Katrin Tomanek, Fiifi Baffoe Payin Winful
Voice Quality Dimensions as Interpretable Primitives for Speaking Style for Atypical Speech and Affect
Jaya Narain, Vasudha Kowtha, Colin Lea, Lauren Tooley, Dianna Yee, Vikramjit Mitra, Zifang Huang, Miquel Espi Marques, Jon Huang, Carlos Avendano, Shirley Ren
Towards LLM-Empowered Fine-Grained Speech Descriptors for Explainable Emotion Recognition
Youjun Chen, Xurong Xie, Haoning Xu, Mengzhe Geng, Guinan Li, Chengxi Deng, Huimeng Wang, Shujie Hu, Xunying Liu
From Pretraining to Performance: Benchmarking Self-Supervised Speech Models for Interspeech-25 SER Challenge
Drishya Uniyal, Vinayak Abrol
Developing a Top-tier Framework in Naturalistic Conditions Challenge for Categorized Emotion Prediction: From Speech Foundation Models and Learning Objective to Data Augmentation and Engineering Choices
Tiantian Feng, Thanathai Lertpetchpun, Dani Byrd, Shrikanth Narayanan
Developing a High-performance Framework for Speech Emotion Recognition in Naturalistic Conditions Challenge for Emotional Attribute Prediction
Thanathai Lertpetchpun, Tiantian Feng, Dani Byrd, Shrikanth Narayanan
EmoSphere-SER: Enhancing Speech Emotion Recognition Through Spherical Representation with Auxiliary Classification
Deok-Hyeon Cho, Hyung-Seok Oh, Seung-Bin Kim, Seong-Whan Lee
Explainable Speech Emotion Recognition Through Attentive Pooling: Insights from Attention-Based Temporal Localization
Tahitoa Leygue, Astrid Sabourin, Christian Bolzmacher, Sylvain Bouchigny, Margarita Anastassova, Quoc-Cuong Pham
ABHINAYA - A System for Speech Emotion Recognition In Naturalistic Conditions Challenge
Soumya Dutta, Smruthi Balaji, Varada R, Viveka Salinamakki, Sriram Ganapathy
The Interspeech 2025 Challenge on Speech Emotion Recognition in Naturalistic Conditions
Abinay Reddy Naini, Lucas Goncalves, Ali N. Salman, Pravin Mote, Ismail R. Ulgen, Thomas Thebaud, Laureano Moro Velazquez, Leibny Paola Garcia, Najim Dehak, Berrak Sisman, Carlos Busso
MATER: Multi-level Acoustic and Textual Emotion Representation for Interpretable Speech Emotion Recognition
Hyo Jin Jon, Longbin Jin, Hyuntaek Jung, Hyunseo Kim, Donghun Min, Eun Yi Kim
Multi-task learning for speech emotion recognition in naturalistic conditions
Bartłomiej Zgórzyński, Juliusz Wójtowicz-Kruk, Piotr Masztalski, Władysław Średniawa
Medusa: A Multimodal Deep Fusion Multi-Stage Training Framework for Speech Emotion Recognition in Naturalistic Conditions
Georgios Chatzichristodoulou, Despoina Kosmopoulou, Antonios Kritikos, Anastasia Poulopoulou, Efthymios Georgiou, Athanasios Katsamanis, Vassilis Katsouros, Alexandros Potamianos
Interactive Fusion of Multi-View Speech Embeddings via Pretrained Large-Scale Speech Models for Speech Emotional Attribute Prediction in Naturalistic Conditions
Yuyun Liu, Yujia Gu, Jiahao Luo, Wenming Zheng, Cheng Lu, Yuan Zong
Advancing Emotion Recognition via Ensemble Learning: Integrating Speech, Context, and Text Representations
Xiaohan Shi, Jinyi Mi, Xingfeng Li, Tomoki Toda
Improving Speech Emotion Recognition Through Cross Modal Attention Alignment and Balanced Stacking Model
Lucas Ueda, João Lima, Leonardo Marques, Paula Costa
EmoJudge: LLM Based Post-Hoc Refinement for Multimodal Speech Emotion Recognition
Prabhav Singh, Jesus Villalba
Lessons Learnt: Revisit Key Training Strategies for Effective Speech Emotion Recognition in the Wild
Jing-Tong Tzeng, Bo-Hao Su, Ya-Tse Wu, Hsing-Hang Chou, Chi-Chun Lee
Enhancing Speech Emotion Recognition with Multi-Task Learning and Dynamic Feature Fusion
Honghong Wang, Jing Deng, Fanqin Meng, Rong Zheng
WhiStress: Enriching Transcriptions with Sentence Stress Detection
Iddo Yosha, Dorin Shteyman, Yossi Adi
Prosodic Structure Beyond Lexical Content: A Study of Self-Supervised Learning
Sarenne Wallbridge, Christoph Minixhofer, Catherine Lai, Peter Bell
Learning Optimal Prosody Embedding Codebook based on F0 and Energy
David Porteš, Aleš Horák
Pitch Accent Detection improves Pretrained Automatic Speech Recognition
David Sasu, Natalie Schluter
Towards Accurate Phonetic Error Detection Through Phoneme Similarity Modeling
Xuanru Zhou, Jiachen Lian, Cheol Jun Cho, Tejas Prabhune, Shuhe Li, William Li, Rodrigo Ortiz, Zoe Ezzes, Jet Vonk, Brittany Morin, Rian Bogley, Lisa Wauters, Zachary Miller, Maria Gorno-Tempini, Gopala Anumanchipalli
Exploring auditory feedback mechanisms in speech recognition
Louise Coppieters de Gibson, Philip N. Garner
French schwa is not acoustically distinct from its two lexical neighbors /ø/ and /œ/
Mathilde Hutin, Mélanie Lancien, Noam Faust
Apical vs. Regular Vowel Duration: A Corpus-based Analysis of Contextual Influences in Standard Mandarin
Jingyi Sun, Bowei Shao, Martine Adda-Decker
On Apical Vowels in Eastern Zhenjiang Mandarin
Xuying Wang, Fang Hu
Equivalence and differences: Formant patterns of labialization and pharyngealization in Tashlhiyt
Philipp Buech, Anne Hermes, Rachid Ridouane
Temporal organization of prenuclear glides in Hefei Mandarin
Yifan Yang, Zhiheng Qian
Speaker-specific Patterns of Phonetic Covariation in Korean Word-medial Stops and the Role of Phonological and Morphological Contexts
Chloe D. Kwon
HiFiTTS-2: A Large-Scale High Bandwidth Speech Dataset
Ryan Langman, Xuesong Yang, Paarth Neekhara, Shehzeen Hussain, Edresson Casanova, Evelina Bakhturina, Jason Li
JIS: A Speech Corpus of Japanese Idol Speakers with Various Speaking Styles
Yuto Kondo, Hirokazu Kameoka, Kou Tanaka, Takuhiro Kaneko
FaVC: A Validated, Transcribed, Parallel Farsi Speech Dataset for Voice Conversion
Mina Serajian, Saeed Najafzadeh Rahaghi, Hadi Veisi, Saman Haratizadeh
SawtArabi: A Benchmark Corpus for Arabic TTS: Standard, Dialectal and Code-Switching
Vasista Sai Lodagala, Lamya Alkanhal, Daniel Izham, Shivam Mehta, Shammur Absar Chowdhury, Aqeelah Makki, Hamdy S. Hussein, Gustav Eje Henter, Ahmed Ali
The Text-to-speech in the Wild (TITW) Database
Jee-weon Jung, Wangyou Zhang, Soumi Maiti, Yihan Wu, Xin Wang, Ji-Hoon Kim, Yuta Matsunaga, Seyun Um, Jinchuan Tian, Hye-jin Shim, Nicholas Evans, Joon Son Chung, Shinnosuke Takamichi, Shinji Watanabe
Towards Emotionally Consistent Text-Based Speech Editing: Introducing EmoCorrector and The ECD-TSE Dataset
Rui Liu, Pu Gao, Jiatian Xi, Berrak Sisman, Carlos Busso, Haizhou Li
ArVoice: A Multi-Speaker Dataset for Arabic Speech Synthesis
Hawau Toyin, Rufael Marew, Humaid Alblooshi, Samar M. Magdy, Hanan Aldarmaki
A Dataset for Automatic Assessment of TTS Quality in Spanish
Alejandro Sosa Welford, Leonardo Pepino
Factors affecting the in-context learning abilities of LLMs for dialogue state tracking
Pradyoth Hegde, Santosh Kesiraju, Ján Švec, Šimon Sedláček, Bolaji Yusuf, Oldřich Plchot, Deepak K T, Jan Černocký
Spoken Question Answering for Visual Queries
Nimrod Shabtay, Zvi Kons, Avihu Dekel, Hagai Aronowitz, Ron Hoory, Assaf Arbelle
Towards Human-like Multimodal Conversational Agent by Generating Engaging Speech
Taesoo Kim, Yongsik Jo, Hyunmin Song, Taehwan Kim
Chain-of-Thought Training for Open E2E Spoken Dialogue Systems
Siddhant Arora, Jinchuan Tian, Hayato Futami, Jee-weon Jung, Jiatong Shi, Yosuke Kashiwagi, Emiru Tsunoo, Shinji Watanabe
SaD: A Scenario-Aware Discriminator for Speech Enhancement
Xihao Yuan, Siqi Liu, Yan Chen, Hang Zhou, Chang Liu, Hanting Chen, Jie Hu
Listen through the Sound: Generative Speech Restoration Leveraging Acoustic Context Representation
Soo-Whan Chung, Min-Seok Choi
Efficient Speech Enhancement via Embeddings from Pre-trained Generative Audioencoders
Xingwei Sun, Heinrich Dinkel, Yadong Niu, Linzhang Wang, Junbo Zhang, Jian Luan
Towards Personalised Audio Visual Speech Enhancement
Mandar Gogate, Kia Dashtipour, Amir Hussain
FlowSE: Efficient and High-Quality Speech Enhancement via Flow Matching
Ziqian Wang, Zikai Liu, Xinfa Zhu, Yike Zhu, Mingshuai Liu, Jun Chen, Longshuai Xiao, Chao Weng, Lei Xie
Speech Enhancement based on cascaded two flows
Seonggyu Lee, Sein Cheong, Sangwook Han, Kihyuk Kim, Jong Won Shin
X-ARES: A Comprehensive Framework for Assessing Audio Encoder Performance
Junbo Zhang, Heinrich Dinkel, Yadong Niu, Chenyu Liu, Si Cheng, Anbei Zhao, Jian Luan
WavShape: Information-Theoretic Speech Representation Learning for Fair and Privacy-Aware Audio Processing
Oguzhan Baser, Ahmet Ege Tanriverdi, Kaan Kale, Sandeep Chinchali, Sriram Vishwanath
FreeCodec: A Disentangled Neural Speech Codec with Fewer Tokens
Youqiang Zheng, Weiping Tu, Yueteng Kang, Jie Chen, Yike Zhang, Li Xiao, Yuhong Yang, Long Ma
DualCodec: A Low-Frame-Rate, Semantically-Enhanced Neural Audio Codec for Speech Generation
Jiaqi Li, Xiaolong Lin, Zhekai Li, Shixi Huang, Yuancheng Wang, Chaoren Wang, Zhenpeng Zhan, Zhizheng Wu
Comparative Analysis of Fast and High-Fidelity Neural Vocoders for Low-Latency Streaming Synthesis in Resource-Constrained Environments
Reo Yoneyama, Masaya Kawamura, Ryo Terashima, Ryuichi Yamamoto, Tomoki Toda
Prosody-Adaptable Audio Codecs for Zero-Shot Voice Conversion via In-Context Learning
Junchuan Zhao, Xintong Wang, Ye Wang
Vocoder-Projected Feature Discriminator
Takuhiro Kaneko, Hirokazu Kameoka, Kou Tanaka, Yuto Kondo
AF-Vocoder: Artifact-Free Neural Vocoder with Global Artifact Filter
Zhuangqi Chen, Xianjun Xia, Xiaohuai Le, Siyu Sun, Chuanzeng Huang
DS-Codec: Dual-Stage Training with Mirror-to-NonMirror Architecture Switching for Speech Codec
Peijie Chen, Wenhao Guan, Kaidi Wang, Weijie Wu, Hukai Huang, Qingyang Hong, Lin Li
PeriodCodec: A Pitch-Controllable Neural Audio Codec Using Periodic Signals for Singing Voice Synthesis
Masato Takagi, Miku Nishihara, Yukiya Hono, Kei Hashimoto, Yoshihiko Nankaku, Keiichi Tokuda
Enhancing Target-speaker Automatic Speech Recognition Using Multiple Speaker Embedding Extractors with Virtual Speaker Embedding
Ju-Seok Seong, Jeong-Hwan Choi, Ye-Rin Jeoung, Ilseok Kim, Joon-Hyuk Chang
SC-SOT: Conditioning the Decoder on Diarized Speaker Information for End-to-End Overlapped Speech Recognition
Yuta Hirano, Sakriani Sakti
Efficient Data Selection for Domain Adaptation of ASR Using Pseudo-Labels and Multi-Stage Filtering
Pradeep Rangappa, Andrés Carofilis, Jeena Prakash, Shashi Kumar, Sergio Burdisso, Srikanth Madikeri, Esaú Villatoro-Tello, Bidisha Sharma, Petr Motlicek, Kadri Hacioglu, Shankar Venkatesan, Saurabh Vyas, Andreas Stolcke
MOPSA: Mixture of Prompt-Experts Based Speaker Adaptation for Elderly Speech Recognition
Chengxi Deng, Xurong Xie, Shujie Hu, Mengzhe Geng, Yicong Jiang, Jiankun Zhao, Jiajun Deng, Guinan Li, Youjun Chen, Huimeng Wang, Haoning Xu, Mingyu Cui, Xunying Liu
Visually-Adaptive Guided Robust Speech Recognition with Parameter-Efficient Adaptation
Zhao Yang, Rui Jiang, Yue Heng Yeo, Xiao Fu, Wei Xi, Jizhong Zhao
Regularizing Learnable Feature Extraction for Automatic Speech Recognition
Peter Vieting, Maximilian Kannen, Benedikt Hilmes, Ralf Schlüter, Hermann Ney
MMLoRA: Multitask Memory Parameter-Efficient Fine-Tuning for Multimodal SER
Yuanbo Fang, Xiaofen Xing, Xueru Li, Weibin Zhang, Xiangmin Xu
Robust Unsupervised Adaptation of a Speech Recogniser Using Entropy Minimisation and Speaker Codes
Rogier C. van Dalen, Shucong Zhang, Titouan Parcollet, Sourav Bhattacharya
SCRIBAL: A Digital Transcription Tool in Higher Education
Javier Román, Pol Pastells, Mauro Vázquez, Clara Puigventós, Montserrat Nofre, Mariona Taulé, Mireia Farrús
From Static to Dynamic: Enhancing AAC with Generative Imagery and Zero-Shot TTS
Juliana Francis, Joakim Gustafsson, Éva Székely
Concurrent Speech and Auditory Tag Clouds for Non-Visual Web Interaction
Dhia Eddine Merzougui, Nilesh Tete, Fabrice Maurel, Gaël Dias, Mohammed Hasanuzzaman, Aurélien Bournonville, Edgar Madelaine, Thomas Berthelin Le Tellier, François Ledoyen, Laure Poutrain-Lejeune, François Rioult, Jérémie Pantin
Towards Domain-Specific Spoken Language Understanding for a Catalan Voice-Controlled Video Game
Alex Peiró-Lilja, Rodolfo Zevallos, Carme Armentano-Oller, Jose Giraldo, Cristina España-Bonet, Mireia Farrús
Accessible Delivery of Visual-Acoustic Biofeedback for Speech Sound Disorder
Tara McAllister, Peter Traver, Amanda Eads, William Haack, Helen Carey, Yi Shan, Wendy Liang, Tae Hong Park
End-to-End Indian Language Dubbing with Zero-Shot Speaker Preservation
Giri Raju, Sandeep Konam
Band-SCNet: A Causal, Lightweight Model for High-Performance Real-Time Music Source Separation
Junqi Yang, Yuhong Yang, Weiping Tu, Xin Zhao, Cedar Lin
CabinSep: IR-Augmented Mask-Based MVDR for Real-Time In-car Speech Separation with Distributed Heterogeneous Arrays
Runduo Han, Yanxin Hu, Yihui Fu, Zihan Zhang, Yukai Jv, Li Chen, Lei Xie
DGMO: Training-Free Audio Source Separation through Diffusion-Guided Mask Optimization
Geonyoung Lee, Geonhee Han, Paul Hongsuck Seo
Cross-Attention-Based Target Sound Extraction by Fully Leveraging Enrollment in a Shared Latent Space
Xue Yang, Guiru Shen, Yu Yang
DnR-nonverbal: Cinematic Audio Source Separation Dataset Containing Non-Verbal Sounds
Takuya Hasumi, Yusuke Fujita
Neural Speech Extraction with Human Feedback
Malek Itani, Ashton Graves, Sefik Emre Eskimez, Shyamnath Gollakota
Unlocking Temporal Flexibility: Neural Speech Codec with Variable Frame Rate
Hanglei Zhang, Yiwei Guo, Zhihan Li, Xiang Hao, Xie Chen, Kai Yu
SPCODEC: Split and Prediction for Neural Speech Codec
Liang Wen, Lizhong Wang, Yuxing Zheng, Weijing Shi, Kwang Pyo Choi
Probing the Robustness Properties of Neural Speech Codecs
Wei-Cheng Tseng, David Harwath
LSCodec: Low-Bitrate and Speaker-Decoupled Discrete Speech Codec
Yiwei Guo, Zhihan Li, Chenpeng Du, Hankun Wang, Xie Chen, Kai Yu
Bringing Interpretability to Neural Audio Codecs
Samir Sadok, Julien Hauret, Éric Bavu
NanoCodec: Towards High-Quality Ultra Fast Speech LLM Inference
Edresson Casanova, Paarth Neekhara, Ryan Langman, Shehzeen Hussain, Subhankar Ghosh, Xuesong Yang, Ante Jukic, Jason Li, Boris Ginsburg
Beat gestures made by human-like avatars affect speech perception
Matteo Maran, Renske Rötjes, Anna R. E. Schreurs, Hans Rutger Bosker
The mutual exclusivity bias of bilingual visually grounded speech models
Dan Oneata, Leanne Nortje, Yevgen Matusevych, Herman Kamper
MultiActor-Audiobook: Zero-Shot Audiobook Generation with Faces and Voices of Multiple Speakers
Kyeongman Park, Seongho Joo, Kyomin Jung
Incorporating Linguistic Constraints from External Knowledge Source for Audio-Visual Target Speech Extraction
Wenxuan Wu, Shuai Wang, Xixin Wu, Helen Meng, Haizhou Li
Multimodal Silent Recognition of Phonemes Using Radar and Optopalatographic Silent Speech Interfaces
João Menezes, Aubin Mouras, Arne-Lukas Fietkau, Dani Kazzy, Peter Birkholz
GoP2Vec: A few shot learning for pronunciation assessment with goodness of pronunciation (GoP) based representations from an i-vector framework and augmentation
Meenakshi Sirigiaju, Chiranjeevi Yarra
Enhancing GOP in CTC-Based Mispronunciation Detection with Phonological Knowledge
Aditya Kamlesh Parikh, Cristian Tejedor-Garcia, Catia Cucchiarini, Helmer Strik
Multilingual Speech Assessment Using Cross-Attention and Multitask Learning
Sehyun Oh, Minhwa Chung, Sunhee Kim
Assessment of L2 Oral Proficiency using Speech Large Language Models
Rao Ma, Mengjie Qian, Siyuan Tang, Stefano Bannò, Kate M. Knill, Mark J.F. Gales
Scaling and Prompting for Improved End-to-End Spoken Grammatical Error Correction
Mengjie Qian, Rao Ma, Stefano Bannò, Kate M. Knill, Mark J.F. Gales
Bidirectional Spoken-Written Text Conversion with Large Language Models
Muyeol Choi, HyunJung Choi, Yohan Lim, Jeonguk Bang, Minkyu Lee, Seonhui Kim, Seung Yun, Donghyun Kim, Minsoo Kim, SangHun Kim
WAKE: Watermarking Audio with Key Enrichment
Yaoxun Xu, Jianwei Yu, Hangting Chen, Zhiyong Wu, Xixin Wu, Dong Yu, Rongzhi Gu, Yi Luo
Defend for Self-Vocoding: A Novel Enhanced Decoder Network for Watermark Recovery
Yu-Sheng Lin, Ching-Yu Yang, Hsing-Hang Chou, Ya-Tse Wu, Bo-Hao Su, Chi-Chun Lee
Cross-Modal Watermarking for Authentic Audio Recovery and Tamper Localization in Synthesized Audiovisual Forgeries
Minyoung Kim, Sehwan Park, Sungmin Cha, Paul Hongsuck Seo
VoiceMark: Zero-Shot Voice Cloning-Resistant Watermarking Approach Leveraging Speaker-Specific Latents
Haiyun Li, Zhiyong Wu, Xiaofeng Xie, Jingran Xie, Yaoxun Xu, Hanyang Peng
A Comprehensive Real-World Assessment of Audio Watermarking Algorithms: Will They Survive Neural Codecs?
Yigitcan Özer, Woosung Choi, Joan Serrà, Mayank Kumar Singh, Wei-Hsiang Liao, Yuki Mitsufuji
How to Recover Long Audio Sequences Through Gradient Inversion Attack With Dynamic Segment-based Reconstruction
Xijie Zeng, Frank Rudzicz
First Steps Towards Voice Anonymization for Code-Switching Speech
Sarina Meyer, Ekaterina Kolos, Ngoc Thang Vu
Exploiting Context-dependent Duration Features for Voice Anonymization Attack Systems
Natalia Tomashenko, Emmanuel Vincent, Marc Tommasi
Mitigating Language Mismatch in SSL-Based Speaker Anonymization
Zhe Zhang, Wen-Chin Huang, Xin Wang, Xiaoxiao Miao, Junichi Yamagishi
MSFNet: A Nested Model for Multi-Sampling-Frequency Speech Enhancement
Venkatesh Parvathala, K. Sri Rama Murty
TF-SkiMNet: Speech Enhancement Based on Inplace Modeling and Skipping Memory in Time-Frequency Domain
Zixuan Li, Shulin He, Jinglin Bai, Xueliang Zhang
xLSTM-SENet: xLSTM for Single-Channel Speech Enhancement
Nikolai Lund Kühne, Jan Østergaard, Jesper Jensen, Zheng-Hua Tan
From KAN to GR-KAN: Advancing Speech Enhancement with KAN-Based Methodology
Haoyang Li, Yuchen Hu, Chen Chen, Sabato Marco Siniscalchi, Songting Liu, Eng Siong Chng
Stack Less, Repeat More: A Block Reusing Approach for Progressive Speech Enhancement
Jangyeon Kim, Ui-Hyeop Shin, Jaehyun Ko, Hyung-Min Park
Mamba-based Hybrid Model for Speech Enhancement
Se-Ha Kim, Tae-Gyeong Kim, Chang-Jae Chun
Restoring Harmonics: Enhancing Speech Quality with Deep Mask and Harmonic Restoration Network
Yu Zhao, Zengqiang Shang, Mou Wang, Xin Liu, Pengyuan Zhang
GLCLAP: A Novel Contrastive Learning Pre-trained Model for Contextual Biasing in ASR
Yuxiang Kong, Fan Cui, Liyong Guo, Heinrich Dinkel, Lichun Fan, Junbo Zhang, Jian Luan
WCTC-Biasing: Retraining-free Contextual Biasing ASR with Wildcard CTC-based Keyword Spotting and Inter-layer Biasing
Yu Nakagome, Michael Hentschel
Ranking and Selection of Bias Words for Contextual Bias Speech Recognition
Haoxiang Hou, Xun Gong, Wangyou Zhang, Wei Wang, Yanmin Qian
OWSM-Biasing: Contextualizing Open Whisper-Style Speech Models for Automatic Speech Recognition with Dynamic Vocabulary
Yui Sudo, Yusuke Fujita, Atsushi Kojima, Tomoya Mizumoto, Lianbo Liu
Label-Context-Dependent Internal Language Model Estimation for CTC
Zijian Yang, Minh-Nghia Phan, Ralf Schlüter, Hermann Ney
Assessing the Performance and Efficiency of Mamba ASR in Low-Resource Scenarios
Rodolfo Zevallos, Martí Cortada Garcia, Sarah Solito, Carlos Mena, Alex Peiró-Lilja, Javier Hernando
Adapting Whisper for Parameter-efficient Code-Switching Speech Recognition via Soft Prompt Tuning
Hongli Yang, Yizhou Peng, Hao Huang, Sheng Li
Mitigating Non-Target Speaker Bias in Guided Speaker Embedding
Shota Horiguchi, Takanori Ashihara, Marc Delcroix, Atsushi Ando, Naohiro Tawara
Speaker Diarization with Overlapping Community Detection Using Graph Attention Networks and Label Propagation Algorithm
Zhaoyang Li, Jie Wang, XiaoXiao Li, Wangjie Li, Longjie Luo, Lin Li, Qingyang Hong
Pushing the Limits of End-to-End Diarization
Samuel J. Broughton, Lahiru Samarakoon
Spatio-Spectral Diarization of Meetings by Combining TDOA-based Segmentation and Speaker Embedding-based Clustering
Tobias Cord-Landwehr, Tobias Gburrek, Marc Deegen, Reinhold Haeb-Umbach
Selective Channel Attention based Target Speaker Voice Activity Detection for Speaker Diarization under AD-HOC Microphone Array Settings
Hongyu Zhang, Ming Cheng, Jing Feng, Ming Li
Diarization-Guided Multi-Speaker Embeddings
Joonas Kalda, Clément Pagés, Tanel Alumäe, Hervé Bredin
Streaming Sortformer: Speaker Cache-Based Online Speaker Diarization with Arrival-Time Ordering
Ivan Medennikov, Taejin Park, Weiqing Wang, He Huang, Kunal Dhawan, Jinhan Wang, Jagadeesh Balam, Boris Ginsburg
A Hybrid Approach to Combining Role Diarization with ASR for Professional Conversations
Bongjun Kim, Arindam Ghosh, Mark C. Fuhs, Anurag Chowdhury, Deblin Bagchi, Monika Woszczyna
An interpretable speech foundation model for depression detection by revealing prediction-relevant acoustic features from long speech
Qingkun Deng, Saturnino Luz, Sofia de la Fuente Garcia
Speech and Text Foundation Models for Depression Detection: Cross-Task and Cross-Language Evaluation
Lucía Gómez-Zaragozá, Javier Marín-Morales, Mariano Alcañiz, Mohammad Soleymani
A Study on The Impact of Foundation Models on Automatic Depression Detection from Speech Signals
Bubai Maji, Monorama Swain, Shazia Nasreen, Debabrata Majumdar, Rajlakshmi Guha, Aurobinda Routray, Anders Søgaard
Identifying Vocal and Facial Biomarkers of Depression in Large-Scale Remote Recordings: A Multimodal Study Using Mixed-Effects Modeling
Nelson Hidalgo Julia, Robert Lewis, Craig Ferguson, Simon Goldberg, Wendy Lau, Caroline Swords, Gabriela Valdivia, Christine Wilson-Mendenhall, Raquel Tartar, Rosalind Picard, Richard Davidson
M3L: A Multi-Modal and Multi-Lingual Depression Detection Framework
Jiajun You, Shuai Wang, Xun Gong, Xiang Wan
Using and comprehending language in face-to-face conversation
Judith Holler
On the Relevance of Clinical Assessment Tasks for the Automatic Detection of Parkinson’s Disease Medication State from Speech
David Gimeno-Gómez, Rubén Solera-Ureña, Anna Pompili, Carlos-D. Martínez-Hinarejos, Rita Cardoso, Isabel Guimarães, Joaquim J. Ferreira, Alberto Abad
Speech power spectra: a window into neural oscillations in Parkinson's disease
Sevada Hovsepyan, Mathew Magimai Doss
Synchronous analysis of abnormal acoustic and linguistic production in Parkinson's speech
Daniel Escobar-Grisales, Cristian David Ríos-Urrego, Sabato Marco Siniscalchi, Adolfo M. Garcia, Yamile Bocanegra, Leonardo Moreno, Elmar Nöth, Juan Rafael Orozco-Arroyave
Automatic Detection and Sub-typing of Primary Progressive Aphasia from Speech: Integrating Task-Specific Features and Spatio-Semantic Graphs
Fritz Peters, W Richard Bevan-Jones, Grace Threlfall, Jenny M Harris, Julie S Snowden, Matthew Jones, Jennifer C Thompson, Daniel J Blackburn, Heidi Christensen
Towards Classification of Typical and Atypical Disfluencies: A Self Supervised Representation Approach
Priyanka Kommagouni, Pragya Khanna, Vamshiraghusimha Narasinga, Anirudh Bocha, Anil Kumar Vuppala
Stuttering Detection Based on Self-Attention Weights of Temporal Acoustic Vector Sequence
Genzo Miyahara, Tsuneo Kato, Akihiro Tamura
Speech-Based Automatic Chronic Kidney Disease Diagnosis via Transformer Fusion of Glottal and Spectrogram Features
Jihyun Mun, Minhwa Chung, Sunhee Kim
Influence of Room Acoustics on Objective Voice Assessment Methods in the Context of Speech and Language Therapy
Sven Franz, Tanja Grewe, Bernd T. Meyer, Jörg Bitzer
Multimodal Speech-Based Biomarkers Outperform the ALS Functional Rating Scale in Predicting Individual Disease Progression in ALS
Hardik Kothare, Michael Neumann, Vikram Ramanarayanan
Naturalness-Aware Curriculum Learning with Dynamic Temperature for Speech Deepfake Detection
Taewoo Kim, Guisik Kim, Choongsang Cho, Young Han Lee
Leveraging SSL Speech Features and Mamba for Enhanced DeepFake Detection
Hoan My Tran, Damien Lolive, David Guennec, Aghilas Sini, Arnaud Delhay, Pierre-François Marteau
A Comparative Study on Proactive and Passive Detection of Deepfake Speech
Chia-Hua Wu, Wanying Ge, Xin Wang, Junichi Yamagishi, Yu Tsao, Hsin-Min Wang
PhonemeFake: Redefining Deepfake Realism with Language-Driven Segmental Manipulation and Adaptive Bilevel Detection
Oguzhan Baser, Ahmet Ege Tanriverdi, Sriram Vishwanath, Sandeep Chinchali
From Sharpness to Better Generalization for Speech Deepfake Detection
Wen Huang, Xuechen Liu, Xin Wang, Junichi Yamagishi, Yanmin Qian
Unmasking real-world audio deepfakes: A data-centric approach
David Combei, Adriana Stan, Dan Oneata, Nicolas Müller, Horia Cucu
A Data-Driven Diffusion-based Approach for Audio Deepfake Explanations
Petr Grinberg, Ankur Kumar, Surya Koppisetti, Gaurav Bharaj
PartialEdit: Identifying Partial Deepfakes in the Era of Neural Speech Editing
You Zhang, Baotong Tian, Lin Zhang, Zhiyao Duan
Rehearsal with Auxiliary-Informed Sampling for Audio Deepfake Detection
Falih Gozi Febrinanto, Kristen Moore, Chandra Thapa, Jiangang Ma, Vidya Saikrishna, Feng Xia
The Prosodic Characteristics of Standard Chinese Rhetorical Questions in Naturalistic Settings
Shuwen Chen, Qingke Sun, Yue Huang, Yingyi Luo
ProBiEM: Acoustic and Lexical Correlates of Prosodic Prominence in English-Malayalam Bilingual Speech
Anindita Mondal, Rahul Biju, Anil Kumar Vuppala, Reni K Cherian, Chiranjeevi Yarra
Are You Being Sarcastic? Prosodic Cues to Irony Perception in German
Sophia Fünfgeld, Angelika Braun, Katharina Zahner-Ritter
Can AI Understand Mandarin Speech Prosody? A Framework and Benchmark Showcase
Zilong Wang, Xiaoxue Zhang, Xinyang Jiang, Kaitao Song, Jue Yu
Generating Consistent Prosodic Patterns from Open-Source TTS Systems
Ha Eun Shim, Olivia Yung, Paige Tuttösí, Boey Kwan, Angelica Lim, Yue Wang, H. Henny Yeung
Multimodal Prosody Modeling: A Use Case for Multilingual Sentence Mode Prediction
Bogdan Vlasenko, Mathew Magimai Doss
HASRD: Hierarchical Acoustic and Semantic Representation Disentanglement
Amir Hussein, Sameer Khurana, Gordon Wichern, François G. Germain, Jonathan Le Roux
Performance of Montreal Forced Aligner on Cantonese Spontaneous Speech
Ka Ki SO, Chenzi Xu, Grace Wenling Cao, Peggy Mok
Segmentation-Variant Codebooks for Preservation of Paralinguistic and Prosodic Information
Nicholas Sanders, Yuanchao Li, Korin Richmond, Simon King
AdaKWS: Towards Robust Keyword Spotting with Test-Time Adaptation
Yang Xiao, Tianyi Peng, Yanghao Zhou, Rohan Kumar Das
Multivariate Probabilistic Assessment of Speech Quality
Fredrik Cumlin, Xinyu Liang, Victor Ungureanu, Chandan K.A. Reddy, Christian Schüldt, Saikat Chatterjee
A Study on Speech Assessment with Visual Cues
Shafique Ahmed, Ryandhimas E. Zezario, Nasir Saleem, Amir Hussain, Hsin-Min Wang, Yu Tsao
Efficient Streaming Speech Quality Prediction with Spiking Neural Networks
Mattias Nilsson, Riccardo Miccini, Julian Rossbroich, Clément Laroche, Tobias Piechowiak, Friedemann Zenke
Unifying Listener Scoring Scales: Comparison Learning Framework for Speech Quality Assessment and Continuous Speech Emotion Recognition
Cheng Hung Hu, Yusuke Yasuda, Akifumi Yoshimoto, Tomoki Toda
EAA: Emotion-Aware Audio Large Language Models with Dual Cross-Attention and Context-Aware Instruction Tuning
Hongfei Du, Sidi Lu, Gang Zhou, Ye Gao
Chain-of-Thought Distillation with Fine-Grained Acoustic Cues for Speech Emotion Recognition
Jialong Mai, Xiaofen Xing, Yangbiao Li, Xiangmin Xu
Exploring the Limits of Conformer CTC-Encoder for Speech Emotion Recognition using Large Language Models
Edmilson Morais, Hagai Aronowitz, Aharon Satt, Ron Hoory, Avihu Dekel, Brian Kingsbury, George Saon
Token-Level Logits Matter: A Closer Look at Speech Foundation Models for Ambiguous Emotion Recognition
Jule Valendo Halim, Siyi Wang, Hong Jia, Ting Dang
Assessing the feasibility of Large Language Models for detecting micro-behaviors in team interactions during space missions
Ankush Raut, Projna Paromita, Sydney Begerowski, Suzanne Bell, Theodora Chaspari
A-SMiLE: Affective Sparse Mixture-of-Experts Adapter with Multi-Task Learning for Spoken Dialogue Models
Yi-Wen Chao, Yizhou Peng, Dianwen Ng, Yukun Ma, Chongjia Ni, Eng Siong Chng
Non-Intrusive Binaural Speech Intelligibility Prediction Using Mamba for Hearing-Impaired Listeners
Katsuhiko Yamamoto, Koichi Miyazaki
No Audiogram: Leveraging Existing Scores for Personalized Speech Intelligibility Prediction
Haoshuai Zhou, Changgeng Mo, Boxuan Cao, Linkai Li, Shan Xiang Wang
Feature Importance across Domains for Improving Non-Intrusive Speech Intelligibility Prediction in Hearing Aids
Ryandhimas E. Zezario, Sabato M. Siniscalchi, Fei Chen, Hsin-Min Wang, Yu Tsao
Intelligibility Prediction for Time-Modified Speech Signals Using Spectro-Temporal Modulation Features
Aymen Bashir, Haolan Wang, Amin Edraki, Wai-Yip Chan, Jesper Jensen
French Listening Tests for the Assessment of Intelligibility, Quality, and Identity of Body-Conducted Speech Enhancement
Thomas Joubaud, Julien Hauret, Véronique Zimpfer, Éric Bavu
Benchmarking Neural Speech Codec Intelligibility with SITool
Anna Leschanowsky, Kishor Kayyar Lakshminarayana, Anjana Rajasekhar, Lyonel Behringer, Ibrahim Kilinc, Guillaume Fuchs, Emanuël A. P. Habets
AISHELL-5: The First Open-Source In-Car Multi-Channel Multi-Speaker Speech Dataset for Automatic Speech Diarization and Recognition
Yuhang Dai, He Wang, Xingchen Li, Zihan Zhang, Shuiyuan Wang, Lei Xie, Xin Xu, Hongxiao Guo, Shaoji Zhang, Hui Bu, Wei Chen
Speaker Targeting via Self-Speaker Adaptation for Multi-talker ASR
Weiqing Wang, Taejin Park, Ivan Medennikov, Jinhan Wang, Kunal Dhawan, He Huang, Nithin Rao Koluguri, Jagadeesh Balam, Boris Ginsburg
Speaker-Distinguishable CTC: Learning Speaker Distinction Using CTC for Multi-Talker Speech Recognition
Asahi Sakuma, Hiroaki Sato, Ryuga Sugano, Tadashi Kumano, Yoshihiko Kawai, Tetsuji Ogawa
Improving Practical Aspects of End-to-End Multi-Talker Speech Recognition for Online and Offline Scenarios
Aswin Shanmugam Subramanian, Amit Das, Naoyuki Kanda, Jinyu Li, Xiaofei Wang, Yifan Gong
Efficient Streaming TTS Acoustic Model with Depthwise RVQ Decoding Strategies in a Mamba Framework
Joun Yeop Lee, Sangjun Park, Byoung Jin Choi, Ji-Hyun Lee, Min-Kyung Kim, Hoon-Young Cho
APTTS: Adversarial Post-training in Latent Flow Matching for Fast and High-fidelity Text-to-Speech
Hyungchan Yoon, Chanwoo Lee, Hoodong Lee, Stanley Jungkyu Choi
Eigenvoice Synthesis based on Model Editing for Speaker Generation
Masato Murata, Koichi Miyazaki, Tomoki Koriyama, Tomoki Toda
Score-Based Training for Energy-Based TTS Models
Wanli Sun, Anton Ragni
Accelerating Autoregressive Speech Synthesis Inference With Speech Speculative Decoding
Zijian Lin, Yang Zhang, Yougen Yuan, Yuming Yan, Jinjiang Liu, Zhiyong Wu, Pengfei Hu, Qun Yu
BitTTS: Highly Compact Text-to-Speech Using 1.58-bit Quantization and Weight Indexing
Masaya Kawamura, Takuya Hasumi, Yuma Shirahata, Ryuichi Yamamoto
GTAnet: Geometry-Guided Temporal Attention for EEG-Based Sound Source Tracking in Cocktail Party Scenarios
Saurav Pahuja, Gabriel Ivucic, Siqi Cai, Dashanka Da Silva, Haizhou Li, Tanja Schultz
Decoding Listener's Identity: Person Identification from EEG Signals Using a Lightweight Spiking Transformer
Zheyuan Lin, Siqi Cai, Haizhou Li
Recreating Neural Activity During Speech Production with Language and Speech Model Embeddings
Owais Mujtaba Khanday, Pablo Rodríguez San Esteban, Zubair Ahmad Lone, Marc Ouellet, Jose A. Gonzalez-Lopez
Towards Sentence Level Imagined Speech Generation from EEG signals
Sparsh Rastogi, Harsh Dadwal, Khushboo Modi, Jatin Bedi, Jasmeet Singh
Word-Level Error Analysis in Decoding Systems: From Speech Recognition to Brain-Computer Interfaces
Jingya Huang, Aashish N. Patel, Sowmya Manojna Narasimha, Gal Mishne, Vikash Gilja
NeuroSpex+: Dual-Task Training of Neuro-Guided Speaker Extraction with Speech Envelope and Waveform
Dashanka Da Silva, Siqi Cai, Saurav Pahuja, Tanja Schultz, Haizhou Li
DiffMV-ETS: Diffusion-based Multi-Voice Electromyography-to-Speech Conversion using Speaker-Independent Speech Training Targets
Kevin Scheck, Tom Dombeck, Zhao Ren, Peter Wu, Michael Wand, Tanja Schultz
Conformer-based Ultrasound-to-Speech Conversion
Ibrahim Ibrahimov, Csaba Zainkó, Gábor Gosztolya
Training Articulatory Inversion Models for Interspeaker Consistency
Charles McGhee, Mark J.F. Gales, Kate M. Knill
Enhancing Acoustic-to-Articulatory Inversion with Multi-Target Pretraining for Low-Resource Settings
Jesuraj Bandekar, Prasanta Kumar Ghosh
Articulatory Vowel Distinctiveness in Spanish
Kristin Teplansky, Emily Rangel, Mimi LaValley, Jinuk Kwon, Beiming Cao, Jun Wang
EEG-based Speech Decoding Based on Multi-mode Joint Modeling
Peiran Li, Fei Chen, Xixin Wu
A Silent Speech Decoding System from EEG and EMG with Heterogenous Electrode Configurations
Masakazu Inoue, Motoshige Sato, Kenichi Tomeoka, Nathania Nah, Eri Hatakeyama, Kai Arulkumaran, Ilya Horiguchi, Shuntaro Sasai
NAM-to-Speech Conversion with Multitask-Enhanced Autoregressive Models
Neil Shah, Shirish Karande, Vineet Gandhi
RESOUND: Speech Reconstruction from Silent Videos via Acoustic-Semantic Decomposed Modeling
Long-Khanh Pham, Thanh V. T. Tran, Minh-Tan Pham, Van Nguyen
Layer-Wise Decision Fusion for Fake Audio Detection Using XLS-R
Yixuan Xiao, Ngoc Thang Vu
SynHate: Detecting Hate Speech in Synthetic Deepfake Audio
Rishabh Ranjan, Kishan Pipariya, Mayank Vatsa, Richa Singh
Can Emotion Fool Anti-spoofing?
Aurosweta Mahapatra, Ismail R. Ulgen, Abinay Reddy Naini, Carlos Busso, Berrak Sisman
Pushing the Performance of Synthetic Speech Detection with Kolmogorov-Arnold Networks and Self-Supervised Learning Models
Tuan Dat Phuong, Long-Vu Hoang, Huy Dat Tran
Amplifying Artifacts with Speech Enhancement in Voice Anti-spoofing
Thanapat Trachu, Thanathai Lertpetchpun, Ekapol Chuangsuwanich
Thai Speech Spoofing Detection Dataset with Variations in Speaking Styles
Ticho Urai, Pachara Boonsarngsuk, Ekapol Chuangsuwanich
CBA: Backdoor Attack on Deep Speech Classification via Audio Compression
Yuheng Huang, Ying Ren, Wenjie Zhang, Diqun Yan
LRBA: Stealthy Backdoor Attacks on Speech Classification via Latent Rearrangement in VITS
Zexin Li, Wenhan Yao, Ye Xiao, Jinsu Yang, Fen Xiao, Weiping Wen
LitMAS: A Lightweight and Generalized Multi-Modal Anti-Spoofing Framework for Biometric Security
Nidheesh Gorthi, Kartik Thakral, Rishabh Ranjan, Richa Singh, Mayank Vatsa
Pitfalls and Limits in Automatic Dementia Assessment
Franziska Braun, Christopher Witzl, Andreas Erzigkeit, Hartmut Lehfeld, Thomas Hillemacher, Tobias Bocklet, Korbinian Riedhammer
On the Within-class Variation Issue in Alzheimer's Disease Detection
Jiawen Kang, Dongrui Han, Lingwei Meng, Jingyan Zhou, Jinchao Li, Xixin Wu, Helen Meng
Alzheimer’s Disease Detection Using Co-Attention Mechanism for Acoustic and ASR-Transcribed Text Features
Yongqi Shao, Tao Fang
Beyond Manual Transcripts: The Potential of Automated Speech Recognition Errors in Improving Alzheimer’s Disease Detection
Yin-Long Liu, Rui Feng, Jia-Xin Chen, Yi-Ming Wang, Jia-Hong Yuan, Zhen-Hua Ling
Voice-Based Dysphagia Detection: Leveraging Self-Supervised Speech Representation
Injune Hwang, Jung-Min Kim, Ju Seok Ryu, Kyogu Lee
ADCeleb: A Longitudinal Speech Dataset from Public Figures for Early Detection of Alzheimer’s Disease
Kunxiao Gao, Anna Favaro, Najim Dehak, Laureano Moro Velazquez
Anne Rowling Neurological Speech Corpus: clinically annotated longitudinal dataset for developing speech biomarkers in neurodegenerative disorders
Johnny Tam, Christine Weaver, Oliver Watts, Siddharthan Chandran, Suvankar Pal, Rowling Speech Consortium
Multitask Learning with Fused Attention for Improved ASR and Mispronunciation Detection in Children's Speech Sound Disorders
Selina S. Sung, Seunghee Ha, Tae-Jin Yoon, Jungmin So
Multimodal Speech, Language and Orofacial Analysis for Remote Assessment of Positive, Negative and Cognitive Symptoms in Schizophrenia
Michael Neumann, Hardik Kothare, Beverly Insel, Anzalee Khan, Danyah Nadim, Jean-Pierre Lindenmayer, Vikram Ramanarayanan
Aligning ASR Evaluation with Human and LLM Judgments: Intelligibility Metrics Using Phonetic, Semantic, and NLI Approaches
Bornali Phukon, Xiuwen Zheng, Mark Hasegawa-Johnson
SOVA-Bench: Benchmarking the Speech Conversation Ability for LLM-based Voice Assistant
Yixuan Hou, Heyang Liu, Yuhao Wang, Ziyang Cheng, Ronghua Wu, Qunshan Gu, Yanfeng Wang, Yu Wang
An approach to measuring the performance of Automatic Speech Recognition (ASR) models in the context of Large Language Model (LLM) powered applications
Sujith Pulikodan, Sahapthan K, Prasanta Kumar Ghosh, Visruth Sanka, Nihar Desai
DC-Spin: A Speaker-invariant Speech Tokenizer for Spoken Language Models
Heng-Jui Chang, Hongyu Gong, Changhan Wang, James Glass, Yu-An Chung
Exploring the Effect of Segmentation and Vocabulary Size on Speech Tokenization for Speech Language Models
Shunsuke Kando, Yusuke Miyao, Shinnosuke Takamichi
Hearing deficits of transformer-based ASR for anechoic and spatial signals
Dirk Hoffner, Simon Weihe, Thomas Brand, Bernd T. Meyer
TELVID: A Multilingual Multi-modal Corpus for Speaker Recognition
Karen Jones, Kevin Walker, Christopher Caruso, Elliot Singer, Trang Nguyen, Robert Dunn, Stephanie Strassel
EmoSpeechAuth: Emotion-Aware Speaker Verification
Magdalena Gołębiowska, Piotr Syga
The 2024 NIST Speaker Recognition Evaluation
Craig Greenberg, Lukas Diduch, Audrey Tong, Elliot Singer, Trang Nguyen, Robert Dunn, Lisa Mason, Beth Matys
A Simple-Yet-Effective Data Augmentation Method for Speaker Identification in Novels
Wenjie Zhong, Jason Naradowsky, Yusuke Miyao
IDIR: Identifying and Distilling Informative Relations for Speaker Verification
Chong-Xin Gan, Zhe Li, Zezhong Jin, Zilong Huang, Man-Wai Mak, Kong Aik Lee
Analysis of ABC Frontend Audio Systems for the NIST-SRE24
Sara Barahona, Anna Silnova, Ladislav Mošner, Junyi Peng, Oldřich Plchot, Johan Rohdin, Lin Zhang, Jiangyu Han, Petr Palka, Federico Landini, Lukáš Burget, Themos Stafylakis, Sandro Cumani, Dominik Boboš, Miroslav Hlavaček, Martin Kodovsky, Tomaš Pavliček
Identifying Primary Stress Across Related Languages and Dialects with Transformer-based Speech Encoder Models
Nikola Ljubešić, Ivan Porupski, Peter Rupnik
SupraDoRAL: Automatic Word Prominence Detection Using Suprasegmental Dependencies of Representations with Acoustic and Linguistic Context
Jhansi Mallela, Upendra Vishwanath Y. S., Sankara Bharadwaj Rangavajjala, Bhaskar Bhatt, Chiranjeevi Yarra
LombardTokenizer: Disentanglement and Control of Vocal Effort in a Neural Speech Codec
Maxime Jacquelin, Maëva Garnier, Laurent Girin, Rémy Vincent, Olivier Perrotin
Robust Personal Voice Activity Detection for Mitigating Domain Mismatch and False Acceptance Scenarios
Yuke Lin, Jun Chen, Wenjie Li, Longshuai Xiao, Chao Weng
Adaptive Knowledge Distillation for Device-Directed Speech Detection
Hyung-gun Chi, Florian Pesce, Wonil Chang, Oggi Rudovic, Arturo Argueta, Stefan Braun, Vineet Garg, Ahmed Hussen Abdelaziz
Flexible VAD-PVAD Transition: A Detachable PVAD Module for Dynamic Encoder RNN VAD
En-Lun Yu, Chien-Chun Wang, Jeih-Weih Hung, Shih-Chieh Huang, Berlin Chen
Speaker Conditioning of Voice Activity Detection via Implicit Separation
Matthew Maciejewski
ASDA: Audio Spectrogram Differential Attention Mechanism for Self-Supervised Representation Learning
Junyu Wang, Tianrui Wang, Meng Ge, Longbiao Wang, Jianwu Dang
DuRep: Dual-Mode Speech Representation Learning via ASR-Aware Distillation
Prabash Reddy Male, Swayambhu Nath Ray, Harish Arsikere, Akshat Jaiswal, Prakhar Swarup, Prantik Sen, Debmalya Chakrabarty, K V Vijay Girish, Nikhil Bhave, Frederick Weber, Sambuddha Bhattacharya, Sri Garimella