ISCA Archive - Interspeech 2025

Click on a column name to sort.

Searching uses the logical AND of all terms; e.g., Smith Interspeech matches every paper by Smith in any Interspeech. The order of terms is not significant.

Use double quotes for exact phrase matches, e.g. "acoustic features".

Case is ignored.

Diacritics are optional: lefevre also matches lefèvre (but not vice versa).

It can be useful to turn off spell-checking for the search box in your browser preferences.

If you prefer to scroll rather than page, increase the number in the "show entries" dropdown.
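The matching rules above (AND of terms, quoted phrases, case folding, one-directional diacritic matching) can be sketched in a few lines. This is a minimal illustration of the stated behavior, not the archive's actual implementation; the function names are hypothetical.

```python
import re
import unicodedata


def strip_diacritics(s: str) -> str:
    """Remove combining marks so 'lefèvre' folds to 'lefevre'."""
    return "".join(
        c for c in unicodedata.normalize("NFD", s)
        if not unicodedata.combining(c)
    )


def matches(query: str, record: str) -> bool:
    """AND of all terms; double quotes delimit exact phrases; case is
    ignored. A diacritic-free query term matches accented text, but an
    accented term requires an accented match (not vice versa).
    Assumes query and record use the same Unicode composition form."""
    haystack = record.lower()
    folded = strip_diacritics(haystack)
    # Pull out quoted phrases first, then split the rest into terms.
    parts = re.findall(r'"([^"]+)"|(\S+)', query.lower())
    for phrase, term in parts:
        needle = phrase or term
        if strip_diacritics(needle) == needle:
            # Query carries no diacritics: search the folded text.
            if needle not in folded:
                return False
        else:
            # Query carries diacritics: require an accented match.
            if needle not in haystack:
                return False
    return True
```

For example, `matches("Smith Interspeech", ...)` requires both terms somewhere in the record, while `matches('"acoustic features"', ...)` requires the exact phrase.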


Interspeech 2025

Rotterdam, The Netherlands
17-21 August 2025

Chairs: Odette Scharenborg, Catharine Oertel, Khiet Truong
doi: 10.21437/Interspeech.2025
ISSN: 2958-1796

Keynote 1 - Roger Moore: From Talking and Listening Devices to Intelligent Communicative Machines


From Talking and Listening Devices to Intelligent Communicative Machines
Roger Moore


Spoken Machine Translation 1


Speech transcription from South Tyrolean Dialect to Standard German with Whisper
Luca Ducceschi, Greta H. Franzini

Length Aware Speech Translation for Video Dubbing
Aswin Shanmugam Subramanian, Harveen Chadha, Vikas Joshi, Shubham Bansal, Jian Xue, Rupeshkumar Mehta, Jinyu Li

ArticulateX: End-to-End Monolingual Speech Translation in Articulator Space
Vishal Kumar, Vinayak Abrol

CMSP-ST: Cross-modal Mixup with Speech Purification for End-to-End Speech Translation
Jiale Ou, Hongying Zan

End-to-End Speech Translation Guided by Robust Translation Capability of Large Language Model
Yosuke Higuchi, Tetsuji Ogawa, Tetsunori Kobayashi

Empowering Large Language Models for End-to-End Speech Translation Leveraging Synthetic Data
Yu Pu, Xiaoqian Liu, Guangyu Zhang, Zheng Yan, Wei-Qiang Zhang, Xie Chen

Speech-to-Text Translation with Phoneme-Augmented CoT: Enhancing Cross-Lingual Transfer in Low-Resource Scenarios
Gerard I. Gállego, Oriol Pareras, Martí Cortada Garcia, Lucas Takanori, Javier Hernando

Scheduled Interleaved Speech-Text Training for Speech-to-Speech Translation with LLMs
Hayato Futami, Emiru Tsunoo, Yosuke Kashiwagi, Yuki Ito, Hassan Shahmohammadi, Siddhant Arora, Shinji Watanabe

End-to-End Speech Translation for Low-Resource Languages Using Weakly Labeled Data
Aishwarya Pothula, Bhavana Akkiraju, Srihari Bandarupalli, Charan D, Santosh Kesiraju, Anil Kumar Vuppala

Self-Improvement for Audio Large Language Model using Unlabeled Speech
Shaowen Wang, Xinyuan Chen, Yao Xu


Interpretability in Audio and Speech Technology


EnvSDD: Benchmarking Environmental Sound Deepfake Detection
Han Yin, Yang Xiao, Rohan Kumar Das, Jisheng Bai, Haohe Liu, Wenwu Wang, Mark D Plumbley

Echoes of Phonetics: Unveiling Relevant Acoustic Cues for ASR via Feature Attribution
Dennis Fucci, Marco Gaido, Matteo Negri, Mauro Cettolo, Luisa Bentivogli

Benchmarking Time-localized Explanations for Audio Classification Models
Cecilia Bolaños, Leonardo Pepino, Martin Meza, Luciana Ferrer

Spectrotemporal Modulation: Efficient and Interpretable Feature Representation for Classifying Speech, Music, and Environmental Sounds
Andrew Chang, Yike Li, Iran R. Roman, David Poeppel

Discrete Tokens Exhibit Interlanguage Speech Intelligibility Benefit: an Analytical Study Towards Accent-robust ASR Only with Native Speech Data
Kentaro Onda, Keisuke Imoto, Satoru Fukayama, Daisuke Saito, Nobuaki Minematsu

Analysis of Semantic and Acoustic Token Variability Across Speech, Music, and Audio Domains
Takanori Ashihara, Marc Delcroix, Tsubasa Ochiai, Kohei Matsuura, Shota Horiguchi

Is your model big enough? Training and interpreting large-scale monolingual speech foundation models
Yaroslav Getman, Tamás Grósz, Tommi Lehtonen, Mikko Kurimo

Semantic-Aware Interpretable Multimodal Music Auto-Tagging
Andreas Patakis, Vassilis Lyberatos, Spyridon Kantarelis, Edmund Dervakos, Giorgos Stamou

From Words to Waves: Analyzing Concept Formation in Speech and Text-Based Foundation Models
Asim Ersoy, Basel Ahmad Mousi, Shammur Absar Chowdhury, Firoj Alam, Fahim I Dalvi, Nadir Durrani

Effective Context in Neural Speech Models
Yen Meng, Sharon Goldwater, Hao Tang

Word stress in self-supervised speech models: A cross-linguistic comparison
Martijn Bentum, Louis ten Bosch, Tomas O. Lentz

What do self-supervised speech models know about Dutch? Analyzing advantages of language-specific pre-training
Marianne de Heer Kloots, Hosein Mohebbi, Charlotte Pouw, Gaofei Shen, Willem Zuidema, Martijn Bentum

Iterative Refinement, Not Training Objective, Makes HuBERT Behave Differently from wav2vec 2.0
Robin Huo, Ewan Dunbar

On the reliability of feature attribution methods for speech classification
Gaofei Shen, Hosein Mohebbi, Arianna Bisazza, Afra Alishahi, Grzegorz Chrupala

An Exploration of Interpretable Deep Learning Models for the Assessment of Mild Cognitive Impairment
Emma Cathrine Liisborg Leschly, Oliver Roesler, Michael Neumann, Jackson Liscombe, Abhishek Hosamath, Lakshmi Arbatti, Line H. Clemmensen, Melanie Ganz, Vikram Ramanarayanan


Depression Detection and Assessment 1


Speech Reference Intervals: An Assessment of Feasibility in Depression Symptom Severity Prediction
Lauren White, Ewan Carr, Judith Dineley, Catarina Botelho, Pauline Conde, Faith Matcham, Carolin Oetzmann, Amos Folarin, George Fairs, Agnes Norbury, Stefano Goria, Srinivasan Vairavan, Til Wykes, Richard Dobson, Vaibhav Naraya, Matthew Hotopf, Alberto Abad, Isabel Trancoso, Nicholas Cummins

DepressGEN: Synthetic Data Generation Framework for Depression Detection
Wenrui Liang, Rong Zhang, Xuezhen Zhang, Ying Ma, Wei-Qiang Zhang

Emotion-Guided Graph Attention Networks for Speech-Based Depression Detection under Emotion-Inducting Tasks
Yuqiu Zhou, Yongjie Zhou, Yudong Yang, Yang Liu, Jun Huang, Shuzhi Zhao, Rongfeng Su, Lan Wang, Nan Yan

Explainable Depression Detection using Masked Hard Instance Mining
Patawee Prakrankamanant, Shinji Watanabe, Ekapol Chuangsuwanich

Test-Time Training for Speech-based Depression Detection
Sri Harsha Dumpala, Chandramouli S. Sastry, Rudolf Uher, Sageev Oore

Leveraging Ordinal Information for Speech-based Depression Classification
Lishi Zuo, Man-Wai Mak

Zero-Shot Speech-Based Depression and Anxiety Assessment with LLMs
Erfan Loweimi, Sofia de la Fuente Garcia, Saturnino Luz

Towards the Objective Characterisation of Major Depressive Disorder Using Speech Data from a 12-week Observational Study with Daily Measurements
Robert Lewis, Szymon Fedor, Nelson Hidalgo Julia, Joshua Curtiss, Jiyeon Kim, Noah Jones, David Mischoulon, Thomas F Quatieri, Nicholas Cummins, Paola Pedrelli, Rosalind Picard

Can Speech Accurately Detect Depression in Patients With Comorbid Dementia? An Approach for Mitigating Confounding Effects of Depression and Dementia
Sophie Young, Fuxiang Tao, Bahman Mirheidari, Madhurananda Pahar, Markus Reuber, Heidi Christensen


Voice Conversion 1


Towards Better Disentanglement in Non-Autoregressive Zero-Shot Expressive Voice Conversion
Şeymanur Akti, Tuan-Nam Nguyen, Alexander Waibel

Voice Conversion for Likability Control via Automated Rating of Speech Synthesis Corpora
Hitoshi Suda, Shinnosuke Takamichi, Satoru Fukayama

REWIND: Speech Time Reversal for Enhancing Speaker Representations in Diffusion-based Voice Conversion
Ishan D. Biyani, Nirmesh J. Shah, Ashishkumar P. Gudmalwar, Pankaj Wasnik, Rajiv R. Shah

Training-Free Voice Conversion with Factorized Optimal Transport
Alexander Lobashev, Assel Yermekova, Maria Larchenko

E2E-BPVC: End-to-End Background-Preserving Voice Conversion via In-Context Learning
Yihan Liu, Zhengyang Chen, Leying Zhang, Yanmin Qian

Discl-VC: Disentangled Discrete Tokens and In-Context Learning for Controllable Zero-Shot Voice Conversion
Kaidi Wang, Wenhao Guan, Ziyue Jiang, Hukai Huang, Peijie Chen, Weijie Wu, Qingyang Hong, Lin Li

ReFlow-VC: Zero-shot Voice Conversion Based on Rectified Flow and Speaker Feature Optimization
Pengyu Ren, Wenhao Guan, Kaidi Wang, Peijie Chen, Qingyang Hong, Lin Li

In This Environment, As That Speaker: A Text-Driven Framework for Multi-Attribute Speech Conversion
Jiawei Jin, Zhihan Yang, Yixuan Zhou, Zhiyong Wu

LinearVC: Linear Transformations of Self-Supervised Features Through the Lens of Voice Conversion
Herman Kamper, Benjamin van Niekerk, Julian Zaïdi, Marc-André Carbonneau

Speaker Normalization and Content Restoration for Zero-Shot Voice Conversion with Attention-Enhanced Discriminator
Desheng Hu, Yang Xiang, Jian Lu, Xinhui Hu, Xinkang Xu


Source Tracing: The Origins of Synthetic or Manipulated Speech


Audio Deepfake Source Tracing using Multi-Attribute Open-Set Identification and Verification
Pierre Falez, Tony Marteau, Damien Lolive, Arnaud Delhay

Unveiling Audio Deepfake Origins: A Deep Metric learning And Conformer Network Approach With Ensemble Fusion
Ajinkya Kulkarni, Sandipana Dowerah, Tanel Alumäe, Mathew Magimai Doss

Codec-Based Deepfake Source Tracing via Neural Audio Codec Taxonomy
Xuanjun Chen, I-Ming Lin, Lin Zhang, Jiawei Du, Haibin Wu, Hung-yi Lee, Jyh-Shing Roger Jang

TADA: Training-free Attribution and Out-of-Domain Detection of Audio Deepfakes
Adriana Stan, David Combei, Dan Oneata, Horia Cucu

Source Verification for Speech Deepfakes
Viola Negroni, Davide Salvi, Paolo Bestagini, Stefano Tubaro

STOPA: A Dataset of Systematic VariaTion Of DeePfake Audio for Open-Set Source Tracing and Attribution
Anton Firc, Manasi Chhibber, Jagabandhu Mishra, Vishwanath Pratap Singh, Tomi Kinnunen, Kamil Malinka

Synthetic Speech Source Tracing using Metric Learning
Dimitrios Koutsianos, Stavros Zacharopoulos, Yannis Panagakis, Themos Stafylakis

Listen, Analyze, and Adapt to Learn New Attacks: An Exemplar-Free Class Incremental Learning Method for Audio Deepfake Source Tracing
Yang Xiao, Rohan Kumar Das

VIB-based Real Pre-emphasis Audio Deepfake Source Tracing
Thien-Phuc Doan, Kihun Hong, Souhwan Jung

Defending Unauthorized Voice Cloning with Watermark-Aware Codecs
Jiankun Zhao, Lingwei Meng, Chengxi Deng, Helen Meng, Xixin Wu

Open-Set Source Tracing of Audio Deepfake Systems
Nicholas Klein, Hemlata Tak, Elie Khoury




Characterization and Multimodal Approaches for Speaker Recognition


Parameter-Efficient Fine-tuning with Instance-Aware Prompt and Parallel Adapters for Speaker Verification
Shengyu Peng, Wu Guo, Jie Zhang, Yu Guan, Lipeng Dai, Zuoliang Li

Unified Text and Speaker Verification using SSL model for Text-Dependent Speaker Verification
Nathan Griot, Driss Matrouf, Raphael Blouet, Jean-François Bonastre, Ana Mantecon

Towards Robust Overlapping Speech Detection: A Speaker-Aware Progressive Approach Using WavLM
Zhaokai Sun, Li Zhang, Qing Wang, Pan Zhou, Lei Xie

Towards Secure User Authentication for Headphones via In-Ear or In-Earcup Microphones
N Shashaank, Xiao Quan, Andrew Kaluzny, Leonard Varghese, Marko Stamenovic, Chuan-Che Huang

Mimic Blocker: Self-Supervised Adversarial Training for Voice Conversion Defense with Pretrained Feature Extractors
Gwangyeol Yu, Junhyeok Lee, Seoryeong Kim, Jimin Lee, Jehyuk Lee

A Siamese Network-Based Framework for Voice Mimicry Proficiency Assessment Using X-Vector Embeddings
Bhasi K.C., Rajeev Rajan

Towards Source Attribution of Singing Voice Deepfake with Multimodal Foundation Models
Orchid Chetia Phukan, Girish, Mohd Mujtaba Akhtar, Swarup Ranjan Behera, Priyabrata Mallick, Pailla Balakrishna Reddy, Arun Balaji Buduru, Rajesh Sharma

Multimodal Zero-Shot Framework for Deepfake Hate Speech Detection in Low-Resource Languages
Rishabh Ranjan, Likhith Ayinala, Mayank Vatsa, Richa Singh

Joint Target-Speaker ASR and Activity Detection
Chikara Maeda, Muhammad Shakeel, Yui Sudo

DLF-EEND: Dynamic Layer Fusion for End-to-End Speaker Diarization
Wooil Kim, Bongsu Jung


Acoustic Analysis and Bioacoustics


Analysis of Avian Biphonic Vocalization Using Computational Modelling
Noumida A, Rajeev Rajan

Dog2vec: Self-Supervised Pre-Training for Canine Vocal Representation
Xingyuan Li, Kenny Zhu, Mengyue Wu

Improving Bird Classification with Primary Color Additives
Ezhini Rasendiran R, Chandresh Kumar Maurya

Exploring the Power of Empirical Mode Decomposition for Sensing the Sound of Silence: A Pilot Study on Mice Autism Detection via Ultrasonic Vocalisation
Chenhao Wu, Xiangjun Cai, Haojie Zhang, Tianrui Jia, Yilu Deng, Kun Qian, Björn W. Schuller, Yoshiharu Yamamoto, Jiang Liu

Exploring Pre-trained models on Ultrasound Modeling for Mice Autism Detection with Uniform Filter Bank and Attentive Scoring
Yuchen Song, Yucong Zhang, Ming Li

MADUV: The 1st INTERSPEECH Mice Autism Detection via Ultrasound Vocalization Challenge
Zijiang Yang, Meishu Song, Xin Jing, Haojie Zhang, Kun Qian, Bin Hu, Kota Tamada, Toru Takumi, Björn W. Schuller, Yoshiharu Yamamoto

Significance of Time-Frequency preprocessing for automatic Ultrasonic Vocalization classification in Autism Spectrum Disorder model detection
Szymon Szmajdziński, Juliusz Wójtowicz-Kruk, Ivan Ryzhankow, Łukasz Łazarski, Jakub Żak, Władysław Średniawa

Robust Vocal Intensity Prediction: Overcoming Dataset Bias with Pretrained Deep Models
Quentin Le Tellier, Marc Evrard, Albert Rilliard, Jean-Sylvain Liénard

SLASH: Self-Supervised Speech Pitch Estimation Leveraging DSP-derived Absolute Pitch
Ryo Terashima, Yuma Shirahata, Masaya Kawamura


Keynote 2 - Alexander Waibel: From Speech Science to Language Transparence


From Speech Science to Language Transparence
Alexander Waibel


Spoken Dialogue Systems 1


PruneSLU: Efficient On-device Spoken Language Understanding through Vocabulary and Structural Pruning
Truong Do, Minh-Phuong Nguyen, Le-Minh Nguyen

Leveraging LLMs for Written to Spoken Style Data Transformation to Enhance Spoken Dialog State Tracking
Haris Gulzar, Monikka Roslianna Busto, Akiko Masaki, Takeharu Eda, Ryo Masumura

Approaching Dialogue State Tracking via Aligning Speech Encoders and LLMs
Šimon Sedláček, Bolaji Yusuf, Ján Švec, Pradyoth Hegde, Santosh Kesiraju, Oldřich Plchot, Jan Černocký

What Do Humans Hear When Interacting? Experiments on Selective Listening for Evaluating ASR of Spoken Dialogue Systems
Kiyotada Mori, Seiya Kawano, Chaoran Liu, Carlos Toshinori Ishi, Angel García Contreras, Koichiro Yoshino

SpeechDialogueFactory: A Framework for Natural Speech Dialogue Generation
Minghan Wang, Ye Bai, Yuxia Wang, Thuy-Trang Vu, Ehsan Shareghi, Gholamreza Haffari

Who, When, and What: Leveraging the "Three Ws" Concept for Emotion Recognition in Conversation
Xiaohan Shi, Xingfeng Li, Tomoki Toda

"Alexa, can you forget me?" Machine Unlearning Benchmark in Spoken Language Understanding
Alkis Koudounas, Claudio Savelli, Flavio Giobergia, Elena Baralis

Evaluating Large Language Models in Data Generation for Low-Resource Scenarios: A Case Study on Question Answering
Ebru Arisoy, Merve Unlu Menevse, Yusufcan Manav, Arzucan Ozgur

I want a horror – comedy – movie: Slips-of-the-Tongue Impact Conversational Recommender System Performance
Maria Teleki, Lingfeng Shi, Chengkai Liu, James Caverlee

Towards a Japanese Full-duplex Spoken Dialogue System
Atsumoto Ohashi, Shinya Iizuka, Jingjing Jiang, Ryuichiro Higashinaka


Speech and Voice Disorders 1


Leveraging LLM for Stuttering Speech: A Unified Architecture Bridging Recognition and Event Detection
Shangkun Huang, Jing Deng, Jintao Kang, Rong Zheng

Seamless Dysfluent Speech Text Alignment for Disordered Speech Analysis
Zongli Ye, Jiachen Lian, Xuanru Zhou, Jinming Zhang, Haodong Li, Shuhe Li, Chenxu Guo, Anaisha Das, Peter Park, Zoe Ezzes, Jet Vonk, Brittany Morin, Rian Bogley, Lisa Wauters, Zachary Miller, Maria Gorno-Tempini, Gopala Anumanchipalli

Analysis and Evaluation of Synthetic Data Generation in Speech Dysfluency Detection
Jinming Zhang, Xuanru Zhou, Jiachen Lian, Shuhe Li, William Li, Zoe Ezzes, Rian Bogley, Lisa Wauters, Zachary Miller, Jet Vonk, Brittany Morin, Maria Gorno-Tempini, Gopala Anumanchipalli

Fine-tuning Strategies for Automatic Speech Recognition of Low-Resource Speech with Autism Spectrum Disorder
Yeseul Park, Bowon Lee

Identification of Pathological Pronunciation Profiles in ASR Transcription Errors
Margot Masson, Isabelle Ferrané, Julie Mauclair

A simple method for predicting Clinical Scores in Huntington’s Disease by leveraging ASR's uncertainty on spontaneous speech
Hadrien Titeux, Quang Tuan Rémy Nguyen, Andres Gil-Salcedo, Anne-Catherine Bachoud-Levi, Emmanuel Dupoux

Introducing EMOPARKNZ: the Emotional Speech Database from New Zealand English Speakers with Parkinson’s Disease
Itay Ben-Dom, Catherine I. Watson, Clare M. McCann

Revisiting WFST-based Hybrid Japanese Speech Recognition System for Individuals with Organic Speech Disorders
Naoki Hojo, Ryoichi Takashima, Chihiro Sugiyama, Nobukazu Tanaka, Kanji Nohara, Kazunori Nozaki, Tetsuya Takiguchi


Speech and Language Technology for Health Applications


A Chinese Heart Failure Status Speech Database with Universal and Personalised Classification
Yue Pan, Liwei Liu, Changxin Li, Xingyao Wang, Yili Xia, Hanyue Zhang, Ming Chu

Heart Rate as a Proxy Measure to Assess Human Confidence in Spoken Speech
Harish Battula, Gauri Deshpande, Yagna Gudipalli, Sachin Patel

Foundation Model Hidden Representations for Heart Rate Estimation from Auscultation
Jingping Nie, Tien Dung Tran, Karan Thakkar, Vasudha Kowtha, Jon Huang, Carlos Avendano, Erdrin Azemi, Vikramjit Mitra

Towards Fusion of Neural Audio Codec-based Representations with Spectral for Heart Murmur Classification via Bandit-based Cross-Attention Mechanism
Orchid Chetia Phukan, Girish, Mohd Mujtaba Akhtar, Swarup Ranjan Behera, Priyabrata Mallick, Santanu Roy, Arun Balaji Buduru, Rajesh Sharma

Perception of Emotional Speech by Individuals with High Borderline Personality Features
Yizhou Chen, Xiyu Wu

Visual features of the oral region in Polish sibilants produced by children with various sibilance patterns
Agata Sage, Zuzanna Miodońska, Michał Kręcichwost, Ewa Kwaśniok, Paweł Badura

Meta-Learning Approaches for Speaker-Dependent Voice Fatigue Models
Roseline Polle, Agnes Norbury, Alexandra Livia Georgescu, Nicholas Cummins, Stefano Goria

Decoding Alzheimer’s: Interpretable Visual and Logical Attention in Picture Description Tasks
Ning Wang, Bingyang Wen, Minghui Wu, Yang Sun, Zongru Shao, Haojie Zhou, K.P. Subbalakshmi


Responsible Speech Foundation Models + SUPERB Challenge


Defending Speech-enabled LLMs Against Adversarial Jailbreak Threats
Antonios Alexos, Raghuveer Peri, Sai Muralidhar Jayanthi, Metehan Cekic, Srikanth Vishnubhotla, Kyu J. Han, Srikanth Ronanki

Mitigating Subgroup Disparities in Multi-Label Speech Emotion Recognition: A Pseudo-Labeling and Unsupervised Learning Approach
Yi-Cheng Lin, Huang-Cheng Chou, Hung-yi Lee

Who Gets the Mic? Investigating Gender Bias in the Speaker Assignment of a Speech-LLM
Dariia Puhach, Amir H. Payberah, Éva Székely

Evaluating Speech Foundation Models for Automatic Speech Recognition in the Low-Resource Kanyen’kéha Language
Mengzhe Geng, Patrick Littell, Aidan Pine, Robbie Jimerson, Gilles Boulianne, Vishwa Gupta, Rolando Coto-Solano, Anna Kazantseva, Marc Tessier, Delaney Lothian, Akwiratékha' Martin, Eric Joanis, Samuel Larkin, Roland Kuhn

Benchmarking and Confidence Evaluation of LALMs For Temporal Reasoning
Debarpan Bhattacharya, Apoorva Kulkarni, Sriram Ganapathy

Teaching Audio-Aware Large Language Models What Does Not Hear: Mitigating Hallucinations through Synthesized Negative Samples
Chun-Yi Kuan, Hung-yi Lee

Speech-IFEval: Evaluating Instruction-Following and Quantifying Catastrophic Forgetting in Speech-Aware Language Models
Ke-Han Lu, Chun-Yi Kuan, Hung-yi Lee

Enhancing Low-Resource Language and Instruction Following Capabilities of Audio Language Models
Potsawee Manakul, Guangzhi Sun, Warit Sirichotedumrong, Kasima Tharnpipitchai, Kunat Pipatanakul

Improving Multilingual Speech Models on ML-SUPERB 2.0: Fine-tuning with Data Augmentation and LID-Aware CTC
Qingzheng Wang, Jiancheng Sun, Yifan Peng, Shinji Watanabe

The ML-SUPERB 2.0 Challenge: Towards Inclusive ASR Benchmarking for All Language Varieties
William Chen, Chutong Meng, Jiatong Shi, Martijn Bartelds, Shih-Heng Wang, Hsiu-Hsuan Wang, Rafael Mosquera, Sara Hincapie, Dan Jurafsky, Antonis Anastasopoulos, Hung-yi Lee, Karen Livescu, Shinji Watanabe

TalTech Systems for the Interspeech 2025 ML-SUPERB 2.0 Challenge
Tanel Alumäe, Artem Fedorchenko


Databases and Progress in Methodology


Dhvani: A Weakly-supervised Phonemic Error Detection and Personalized Feedback System for Hindi
Arnav Rustagi, Satvik Bajpai, Nimrat Kaur, Siddharth Siddharth

Evaluating Wav2Vec2-Bert for Computer-Assisted Pronunciation Training for isiZulu
Alexandra Fort, Francis Tyers

Towards Adaptable and Intelligible Speech Synthesis in Noisy Environments
Lubos Marcinek, Jonas Beskow, Joakim Gustafsson

Harnessing Text-to-Speech Voice Cloning Models for Improved Audiological Speech Assessment
Lidea Shahidi, Erdem Baha Topbas, Thu Ngan Dang, Tobias Goehring

75-Speaker Annot-16: A benchmark dataset for speech articulatory rt-MRI annotation with articulator contours and phonetic alignment
Xuan Shi, Yubin Zhang, Yijing Lu, Marcus Ma, Tiantian Feng, Asterios Toutios, Haley Hsu, Louis Goldstein, Shrikanth Narayanan

Representing Speech Through Autoregressive Prediction of Cochlear Tokens
Greta Tuckute, Klemen Kotar, Evelina Fedorenko, Daniel Yamins

Reasoning-Based Approach with Chain-of-Thought for Alzheimer’s Detection Using Speech and Large Language Models
Chanwoo Park, Anna Seo Gyeong Choi, Sunghye Cho, Chanwoo Kim

Finding the Human Voice in AI: Insights on the Perception of AI-Voice Clones from Naturalness and Similarity Ratings
Linda Bakkouche, Charles McGhee, Emily Lau, Stephanie Cooper, Xinbing Luo, Madeleine Rees, Kai Alter, Brechtje Post, Julia Schwarz

Prosodically Enhanced Foreign Accent Simulation by Discrete Token-based Resynthesis Only with Native Speech Corpora
Kentaro Onda, Keisuke Imoto, Satoru Fukayama, Daisuke Saito, Nobuaki Minematsu


Language Learning and Assessment


Automatic Dialectal Transcription: An Evaluation on Finnish and Norwegian
Olli Kuparinen

Can ASR generate valid measures of child reading fluency?
Wieke Harmsen, Roeland van Hout, Catia Cucchiarini, Helmer Strik

SGED-Probe: Probing E2E ASR decoder and aligner for spoken grammar error detection under three speaking practice conditions
Chowdam Venkata Thirumala Kumar, Chiranjeevi Yarra

Evaluating Logit-Based GOP Scores for Mispronunciation Detection
Aditya Kamlesh Parikh, Cristian Tejedor-Garcia, Catia Cucchiarini, Helmer Strik

Towards a Unified Benchmark for Arabic Pronunciation Assessment: Qur’anic Recitation as Case Study
Yassine El Kheir, Omnia Ibrahim, Amit Meghanani, Nada Almarwani, Hawau Toyin, Sadeen Alharbi, Modar Alfadly, Lamya Alkanhal, Ibrahim Selim, Shehab Elbatal, Salima Mdhaffar, Thomas Hain, Yasser Hifny, Mostafa Shahin, Ahmed Ali

OMPAL: Bridging Speech and Learning with an Open-Source Mandarin Pronunciation Assessment Corpus for Global Learners
Wen-Wei Hsieh, Hao-Wei Chi, Kuan-Chen Wang, Ping-Cheng Yeh, Te-hsin Liu, Chen-Yu Chiang

A Perception-Based L2 Speech Intelligibility Indicator: Leveraging a Rater’s Shadowing and Sequence-to-sequence Voice Conversion
Haopeng Geng, Daisuke Saito, Nobuaki Minematsu

Multimodal and Multitask Learning for Predicting Multiple Scores in L2 English Speech
Sehyun Oh, Sunhee Kim, Minhwa Chung

Enhancing Generalization of Speech Large Language Models with Multi-Task Behavior Imitation and Speech-Text Interleaving
Jingran Xie, Xiang Li, Hui Wang, Yue Yu, Yang Xiang, Xixin Wu, Zhiyong Wu

Mispronunciation Detection Without L2 Pronunciation Dataset in Low-Resource Setting: A Case Study in Finland Swedish
Nhan Phan, Mikko Kuronen, Maria Kautonen, Riikka Ullakonoja, Anna von Zansen, Yaroslav Getman, Ekaterina Voskoboinik, Tamás Grósz, Mikko Kurimo


Keyword Spotting and Retrieval


Language-Agnostic Speech Tokenizer for Spoken Term Detection with Efficient Retrieval
Anup Singh, Kris Demuynck, Vipul Arora

H-QuEST: Accelerating Query-by-Example Spoken Term Detection with Hierarchical Indexing
Akanksha Singh, Yi-Ping Phoebe Chen, Vipul Arora

Vela: Scalable Embeddings with Voice Large Language Models for Multimodal Retrieval
Ruofan Hu, Yan Xia, Minjie Hong, Jieming Zhu, Bo Chen, Xiaoda Yang, Minghui Fang, Tao Jin

Adversarial Deep Metric Learning for Cross-Modal Audio-Text Alignment in Open-Vocabulary Keyword Spotting
Youngmoon Jung, Yong-Hyeok Lee, Myunghun Jung, Jaeyoung Roh, Chang Woo Han, Hoon-Young Cho

GTA: Towards Generative Text-To-Audio Retrieval via Multi-Scale Tokenizer
Minghui Fang, Shengpeng Ji, Jialong Zuo, Xize Cheng, Wenrui Liu, Xiaoda Yang, Ruofan Hu, Jieming Zhu, Zhou Zhao

Enhancing Retrieval-Augmented Audio Captioning with Generation-Assisted Multimodal Querying and Progressive Learning
Changin Choi, Sungjun Lim, Wonjong Rhee

On Retrieval of Long Audios with Complex Text Queries
Ruochu Yang, Milind Rao, Harshavardhan Sundar, Anirudh Raju, Aparna Khare, Srinath Tankasala, Di He, Venkatesh Ravichandran

SIDC-KWS: Efficient Spiking Inception-Dilated Conformer with Self-Attention for Keyword Spotting
Jin Gyo Lim, Seong Eun Kim

Multichannel Keyword Spotting for Noisy Conditions
Dzmitry Saladukha, Ivan Koriabkin, Kanstantsin Artsiom, Aliaksei Rak, Nikita Ryzhikov

LLM-Synth4KWS: Scalable Automatic Generation and Synthesis of Confusable Data for Custom Keyword Spotting
Pai Zhu, Quan Wang, Dhruuv Agarwal, Kurt Partridge

GraphemeAug: A Systematic Approach to Synthesized Hard Negative Keyword Spotting Examples
Harry Zhang, Kurt Partridge, Pai Zhu, Neng Chen, Hyun Jin Park, Dhruuv Agarwal, Quan Wang

SpokenNativQA: Multilingual Everyday Spoken Queries for LLMs
Firoj Alam, Md Arid Hasan, Shammur Absar Chowdhury


Multimodal Systems


CAMER: Contribution-Aware Multimodal Emotion Recognition
Sun-Kyung Lee, Jong-Hwan Kim

GIA-MIC: Multimodal Emotion Recognition with Gated Interactive Attention and Modality-Invariant Learning Constraints
Jiajun He, Jinyi Mi, Tomoki Toda

SNIFR: Boosting Fine-Grained Child Harmful Content Detection Through Audio-Visual Alignment with Cascaded Cross-Transformer
Orchid Chetia Phukan, Mohd Mujtaba Akhtar, Girish, Swarup Ranjan Behera, Abu Osama Siddiqui, Sarthak Jain, Priyabrata Mallick, Jaya Sai Kiran Patibandla, Pailla Balakrishna Reddy, Arun Balaji Buduru, Rajesh Sharma

CNVSRC 2024: The Second Chinese Continuous Visual Speech Recognition Challenge
Zehua Liu, Xiaolou Li, Chen Chen, Lantian Li, Dong Wang

PAEFF: Precise Alignment and Enhanced Gated Feature Fusion for Face-Voice Association
Abdul Hannan, Muhammad Arslan Manzoor, Shah Nawaz, Muhammad Irzam Liaqat, Markus Schedl, Mubashir Noman

Efficient and Direct Duplex Modeling for Speech-to-Speech Language Model
Ke Hu, Ehsan Hosseini-Asl, Chen Chen, Edresson Casanova, Subhankar Ghosh, Piotr Żelasko, Zhehuai Chen, Jason Li, Jagadeesh Balam, Boris Ginsburg

U-SAM: An Audio Language Model for Unified Speech, Audio, and Music Understanding
Ziqian Wang, Xianjun Xia, Xinfa Zhu, Lei Xie

Enhanced Hybrid Transducer and Attention Encoder Decoder with Text Data
Yun Tang, Eesung Kim, Vijendra Raj Apsingekar

The role of audio-visual integration in the time course of phonetic encoding in self-supervised speech models
Yi Wang, Oli Danyi Liu, Peter Bell


Connecting Speech Science and Speech Technology for Children’s Speech


Band-Split Self-supervised Mamba for Infant-centered Audio Analysis
Xulin Fan, Jialu Li, Mark Hasegawa-Johnson, Nancy L. McElwain

Subtyping Speech Errors in Childhood Speech Sound Disorders with Acoustic-to-Articulatory Speech Inversion
Nina R Benway, Saba Tabatabaee, Benjamin Munson, Jonathan Preston, Carol Espy-Wilson

PERCEPT-US: A Multimodal American English Child Speech Corpus Specialized for Articulatory Feedback
Amanda Eads, Heather Kabakoff, Nina Benway, Elaine Hitchcock, Jonathan Preston, Tara McAllister

Children's Voice Privacy: First Steps and Emerging Challenges
Ajinkya Kulkarni, Francisco Teixeira, Enno Hermann, Thomas Rolland, Isabel Trancoso, Mathew Magimai Doss

FT-Boosted SV: Towards Noise Robust Speaker Verification for English Speaking Classroom Environments
Saba Tabatabaee, Jing Liu, Carol Espy-Wilson

Examining Test-Time Adaptation for Personalized Child Speech Recognition
Zhonghao Shi, Xuan Shi, Anfeng Xu, Tiantian Feng, Harshvardhan Srivastava, Shrikanth Narayanan, Maja Mataric

Employing self-supervised learning models for cross-linguistic child speech maturity classification
Theo Zhang, Madurya Suresh, Anne Warluamont, Kasia Hitczenko, Alejandrina Cristia, Margaret Cychosz

On Enhancing the Performance of Children's ASR Task in Limited Data Scenario
Ankita Ankita, Shambhavi Shambhavi, Syed Shahnawazuddin

Egocentric Speaker Classification in Child-Adult Dyadic Interactions: From Sensing to Computational Modeling
Tiantian Feng, Anfeng Xu, Xuan Shi, Somer Bishop, Shrikanth Narayanan

Large Language Models based ASR Error Correction for Child Conversations
Anfeng Xu, Tiantian Feng, So Hyun Kim, Somer Bishop, Catherine Lord, Shrikanth Narayanan

Challenges in Automated Processing of Speech from Child Wearables: The Case of Voice Type Classifier
Tarek Kunze, Marianne Métais, Hadrien Titeux, Lucas Elbert, Joseph Coffey, Emmanuel Dupoux, Alejandrina Cristia, Marvin Lavechin

Improving Child Speech Recognition and Reading Mistake Detection by Using Prompts
Lingyun Gao, Cristian Tejedor-Garcia, Catia Cucchiarini, Helmer Strik

Improving Automatic Speech Recognition for Children's Reading Assessment with Disfluency-aware Language Models
Jazmín Vidal, Luciana Ferrer, Juan Esteban Kamienkowski, Pablo Riera

Oral Reading Errors by Grade 3 Children in Indian Schools: A Hindi-English Perspective
Sneha Raman, Preeti Rao

Grammatical Error Detection on Spontaneous Children's Speech Using Iterative Pseudo Labeling
Christopher Gebauer, Lars Rumberg, Lars Köhn, Hanna Ehlert, Edith Beaulac, Jörn Ostermann

Why is children's ASR so difficult? Analyzing children's phonological error patterns using SSL-based phoneme recognizers
Koharu Horii, Naohiro Tawara, Atsunori Ogawa, Shoko Araki

Automatic detection of speech sound disorders in German-speaking children: augmenting the data with typically developed speech
Darline Monika Marx, Marco Matassoni, Alessio Brutti

Continuous Learning for Children's ASR: Overcoming Catastrophic Forgetting with Elastic Weight Consolidation and Synaptic Intelligence
Edem Ahadzi, Vishwanath Pratap Singh, Tomi Kinnunen, Ville Hautamaki

Exploring Shared-Weight Mechanisms in Transformer and Conformer Architectures for Automatic Speech Recognition
Thomas Rolland, Alberto Abad

Advancing Pediatric ASR: The Role of Voice Generation in Disordered Speech
Karen Rosero, Ali N Salman, Shreeram Chandra, Berrak Sisman, Cortney Van’t Slot, Alex Kane, Rami R Hallac, Carlos Busso

CHSER: A Dataset and Case Study on Generative Speech Error Correction for Child ASR
Natarajan Balaji Shankar, Zilai Wang, Kaiyuan Zhang, Mohan Shi, Abeer Alwan

Causal Structure Discovery for Error Diagnostics of Children's ASR
Vishwanath Pratap Singh, Md Sahidullah, Tomi Kinnunen


Music and Audio Analysis


Enhancing Lyrics Transcription on Music Mixtures with Consistency Loss
Jiawen Huang, Felipe Sousa, Emir Demirel, Emmanouil Benetos, Igor Gadelha

Tonality-Based Accompaniment-Guided Automatic Singing Evaluation
Pei-Chin Hsieh, Yih-Liang Shen, Ngoc-Son Tran, Tai-Shih Chi

Investigating the Reasonable Effectiveness of Speaker Pre-Trained Models and their Synergistic Power for SingMOS Prediction
Orchid Chetia Phukan, Girish, Mohd Mujtaba Akhtar, Swarup Ranjan Behera, Pailla Balakrishna Reddy, Arun Balaji Buduru, Rajesh Sharma

Focal Modulation Network: A Novel Solution for Polyphonic Music Instrument Recognition without Attention and Aggregation Strategy
Lekshmi C R, Rajeev Rajan

A Joint Network for Singing Melody Extraction from Polyphonic Music with Attention Aggregation and Self-Consistency Training
Jiabo Jing, Ying Hu, Hao Huang, Liang He, Zhijian Ou

Position also matters! Separating Same Instruments in String Quartet using Timbral and Positional Cues
Yuetonghui Xu, Yiwen Wang, Xihong Wu, Xiaobing Li, Feng Yu

WhisperMSS: A Two-Stage Framework for Mandarin Singing Transcription and Segmentation Using Pretrained Models
Ruoxuan Liang, Xiangjian Zeng, Zhen Liu, Qingqiang Wu, RuiChen Zhang, Le Ren

Low Complex IIR Adaptive Hear-Through Ambient Filtering for Overcoming Practical Constraints in Earbuds
Rishabh Gupta, MLNS Karthik, Yughendaran P

Sub-band based Adaptive IIR Algorithm with Biquad Filter Stability Constraints for Feedforward Hear-Through Equalization
Rishabh Gupta, MLNS Karthik, Chelamkuri Omsrinath


Speech Accessibility Project Challenge


The Interspeech 2025 Speech Accessibility Project Challenge
Xiuwen Zheng, Bornali Phukon, Jonghwan Na, Ed Cutrell, Kyu J. Han, Mark Hasegawa-Johnson, Pan-Pan Jiang, Aadhrik Kuila, Colin Lea, Bob MacDonald, Gautam Mantena, Venkatesh Ravichandran, Leda Sari, Katrin Tomanek, Chang D. Yoo, Chris Zwilling

Towards Inclusive and Fair ASR: Insights from the SAPC Challenge for Optimizing Disordered Speech Recognition
Nada Gohider, Otman Basir

Robust fine-tuning of speech recognition models via model merging: application to disordered speech
Alexandre Ducorroy, Rachid Riad

Exploring Generative Error Correction for Dysarthric Speech Recognition
Moreno La Quatra, Alkis Koudounas, Valerio Mario Salerno, Sabato Marco Siniscalchi

Pathology-Aware Speech Encoding and Data Augmentation for Dysarthric Speech Recognition
Ilja Baumann, Dominik Wagner, Korbinian Riedhammer, Tobias Bocklet

Personalized Fine-Tuning with Controllable Synthetic Speech from LLM-Generated Transcripts for Dysarthric Speech Recognition
Dominik Wagner, Ilja Baumann, Natalie Engert, Seanie Lee, Elmar Nöth, Korbinian Riedhammer, Tobias Bocklet

A Self-Training Approach for Whisper to Enhance Long Dysarthric Speech Recognition
Shiyao Wang, Jiaming Zhou, Shiwan Zhao, Yong Qin

Fine-tuning Parakeet-TDT for Dysarthric Speech Recognition in the Speech Accessibility Project Challenge
Kaito Takahashi, Keigo Hojo, Toshimitsu Sakai, Yukoh Wakabayashi, Norihide Kitaoka

CBA-Whisper: Curriculum Learning-Based AdaLoRA Fine-Tuning on Whisper for Low-Resource Dysarthric Speech Recognition
Tianyi Tan, Xinan Chen, Xiaohuai Le, Wenzhi Fan, Xianjun Xia, Chuanzeng Huang, Jing Lu


Keynote3 - Carol Espy-Wilson: Speech Kinematic Analysis from Acoustics: Scientific, Clinical and Practical Applications


Speech Kinematic Analysis from Acoustics: Scientific, Clinical and Practical Applications
Carol Espy-Wilson






Neural Network Training Methods 2


SiamCTC: Learning Speech Representations through Monotonic Temporal Alignment
SooHwan Eom, Mark Hasegawa-Johnson, Chang D. Yoo

Improving Generalization of End-to-End ASR through Diversity and Independence Regularization
Ye-Eun Ko, Mun-Hak Lee, Dong-Hyun Kim, Joon-Hyuk Chang

Exploring Linear Variant Transformers and k-NN Memory Inference for Long-Form ASR
Carlos Carvalho, Jinchuan Tian, William Chen, Yifan Peng, Alberto Abad, Shinji Watanabe

Attention-Free Dual-Mode ASR with Latency-Controlled Selective State Spaces
Takafumi Moriya, Masato Mimura, Kiyoaki Matsui, Hiroshi Sato, Kohei Matsuura

Thinking Fast and Slow: Robust Speech Recognition via Deep Filter-Tuning
Dianwen Ng, Kun Zhou, Bin Ma, Eng Siong Chng

Towards Efficiently Whisper Fine-tuning with Monotonic Alignments
Ziyang Zhuang, Tao Wei, Ming Fang, Ning Cheng, Shaojun Wang, Jing Xiao

Dynamic Acoustic Model Architecture Optimization in Training for ASR
Jingjing Xu, Zijian Yang, Albert Zeyer, Eugen Beck, Ralf Schlüter, Hermann Ney

Knowledge Distillation Method for Pruned RNN-T Models via Pruning Bounds Sharing and Losses Confusion
Xiaocan Zhang, Weiwei Jiang, Guibin Zheng, Chenhao Jing, Jiqing Han, Tieran Zheng

An Effective Training Framework for Light-Weight Automatic Speech Recognition Models
Abdul Hannan, Alessio Brutti, Shah Nawaz, Mubashir Noman

Better Semi-supervised Learning for Multi-domain ASR Through Incremental Retraining and Data Filtering
Andrés Carofilis, Pradeep Rangappa, Srikanth Madikeri, Shashi Kumar, Sergio Burdisso, Jeena Prakash, Esaú Villatoro-Tello, Petr Motlicek, Bidisha Sharma, Kadri Hacioglu, Shankar Venkatesan, Saurabh Vyas, Andreas Stolcke


Neural Network Training Methods and Architectures


Distilling a speech and music encoder with task arithmetic
Fabian Ritter-Gutierrez, Yi-Cheng Lin, Jui-Chiang Wei, Jeremy H.M Wong, Eng Siong Chng, Nancy F. Chen, Hung-yi Lee

MSDA: Combining Pseudo-labeling and Self-Supervision for Unsupervised Domain Adaptation in ASR
Dimitrios Damianos, Georgios Paraskevopoulos, Alexandros Potamianos

REB-former: RWKV-enhanced E-branchformer for Speech Recognition
Jie Song, Wang Xiang, Jian Zhou, Cunhang Fan, Zhao Lv

PredTrAD – Prediction-based Transformer for Anomaly Detection in Multivariate Time Series Data
Jan Schuster, Alexander Wölfel, Fabian Brunner, Christian Bergler

FairASR: Fair Audio Contrastive Learning for Automatic Speech Recognition
Jongsuk Kim, Jaemyung Yu, Minchan Kwon, Junmo Kim

Automatic Speech Recognition of African American English: Lexical and Contextual Effects
Hamid Mojarad, Kevin Tang

Improving Synthetic Data Training for Contextual Biasing Models with a Keyword-Aware Cost Function
Chin Yuen Kwok, Jia Qi Yip, Eng Siong Chng

SOMSRED-SVC: Sequential Output Modeling with Speaker Vector Constraints for Joint Multi-Talker Overlapped ASR and Speaker Diarization
Naoki Makishima, Naotaka Kawata, Taiga Yamane, Mana Ihori, Tomohiro Tanaka, Satoshi Suzuki, Shota Orihashi, Ryo Masumura

Thinking in Directivity: Speech Large Language Model for Multi-Talker Directional Speech Recognition
Jiamin Xie, Ju Lin, Yiteng Huang, Tyler Vuong, Zhaojiang Lin, Zhaojun Yang, Peng Su, Prashant Rawat, Sangeeta Srivastava, Ming Sun, Florian Metze


Challenges in Speech Data Collection, Curation and Annotation - Part 1


A Study of Real-world Audio-Visual Corpus Design and Production: A Perspective from MISP Challenges
Hang Chen, Jun Du, Qing Wang, Juan Xie, Shi-Fu Xiong

VCapAV: A Video-Caption Based Audio-Visual Deepfake Detection Dataset
Yuxi Wang, Yikang Wang, Qishan Zhang, Hiromitsu Nishizaki, Ming Li

J-SPAW: Japanese speaker verification and spoofing attacks recorded in-the-wild dataset
Sayaka Shiota, Suzuka Horie, Kouta Kanno, Shinnosuke Takamichi

CommissionsQC: a Québec French Speech Corpus for Automatic Speech Recognition
Coralie Serrand, Amira Morsli, Gilles Boulianne

Granary: Speech Recognition and Translation Dataset in 25 European Languages
Nithin Rao Koluguri, Monica Sekoyan, George Zelenfroynd, Sasha Meister, Shuoyang Ding, Sofia Kostandian, He Huang, Nikolay Karpov, Jagadeesh Balam, Vitaly Lavrukhin, Yifan Peng, Sara Papi, Marco Gaido, Alessio Brutti, Boris Ginsburg

Collecting, Curating, and Annotating Good Quality Speech deepfake dataset for Famous Figures: Process and Challenges
Hashim Ali, Surya Subramani, Raksha Varahamurthy, Nithin Adupa, Lekha Bollinani, Hafiz Malik

Quantifying and Reducing Speaker Heterogeneity within the Common Voice Corpus for Phonetic Analysis
Miao Zhang, Aref Farhadipour, Annie Baker, Jiachen Ma, Bogdan Pricop, Eleanor Chodroff

The Speech Accessibility Project: Best Practices for Collection and Curation of Disordered Speech
Chris Zwilling, Mark Hasegawa-Johnson, Heather Hodges, Lorraine Ramig, Adina Bradshaw, Clarion Mendes, Heejin Kim, Alexandria Barkhimer, Laura Mattie, Meg Dickinson, Shawnise Carter, Marie Moore Channell

Challenges and practical guidelines for atypical speech data collection, annotation, usage and sharing: A multi-project perspective
Zhengjun Yue, Mara Barberis, Tanvina Patel, Judith Dineley, Willemijn Doedens, Lottie Stipdonk, YuanYuan Zhang, Elke de Witte, Erfan Loweimi, Hugo Van hamme, Djaina Satoer, Marina Ruiter, Laureano Moro Velazquez, Nicholas Cummins, Odette Scharenborg

Fifteen Years of Child-Centered Long-Form Recordings: Promises, Resources, and Remaining Challenges to Validity
Loann Peurey, Marvin Lavechin, Tarek Kunze, Manel Khentout, Lucas Gautheron, Emmanuel Dupoux, Alejandrina Cristia

Contextual Paralinguistic Data Creation for Multi-Modal Speech-LLM: Data Condensation and Spoken QA Generation
Qiongqiong Wang, Hardik B. Sailor, Tianchi Liu, Ai Ti Aw

Investigating Affect Mining Techniques for Annotation Sample Selection in the Creation of Finnish Affective Speech Corpus
Kalle Lahtinen, Einari Vaaras, Liisa Mustanoja, Okko Räsänen

Scalable Spontaneous Speech Dataset (SSSD): Crowdsourcing Data Collection to Promote Dialogue Research
Zaid Sheikh, Shuichiro Shimizu, Siddhant Arora, Jiatong Shi, Samuele Cornell, Xinjian Li, Shinji Watanabe

A Multimodal Chinese Dataset for Cross-lingual Sarcasm Detection
Xiyuan Gao, Bruce Xiao Wang, Meiling Zhang, Shuming Huang, Zhu Li, Shekhar Nayak, Matt Coler

Leveraging Large Language Models for Sarcastic Speech Annotation in Sarcasm Detection
Zhu Li, Yuqing Zhang, Xiyuan Gao, Shekhar Nayak, Matt Coler



Language Resources


ToxicTone: A Mandarin Audio Dataset Annotated for Toxicity and Toxic Utterance Tonality
Yu-Xiang Luo, Yi-Cheng Lin, Ming-To Chuang, Jia-Hung Chen, I-Ning Tsai, Pei Xing Kiew, Yueh-Hsuan Huang, Chien-Feng Liu, Yu-Chen Chen, Bo-Han Feng, Wenze Ren, Hung-yi Lee

ViToSA: Audio-Based Toxic Spans Detection on Vietnamese Speech Utterances
Huy Ba Do, Vy Le-Phuong Huynh, Luan Thanh Nguyen

Self-Supervised Models of Speech Processing for Haitian Creole
William N. Havard, Renauld Govain, Benjamin Lecouteux, Emmanuel Schang

AfriHuBERT: A self-supervised speech representation model for African languages
Jesujoba O. Alabi, Xuechen Liu, Dietrich Klakow, Junichi Yamagishi

The Faetar Speech Recognition Benchmark
Michael Ong, Sean Robertson, Leo Peckham, Alba Jorquera Jimenez de Aberasturi, Paula Arkhangorodsky, Robin Huo, Aman Sakhardande, Mark Hallap, Naomi Nagy, Ewan Dunbar

LHCP-ASR: An English Speech Corpus of High-Energy Particle Physics Talks for Narrow-Domain ASR Benchmarking
Jaume Santamaría-Jordà, Pablo Segovia-Martínez, Gonçal V. Garcés Díaz-Munío, Joan Albert Silvestre-Cerdà, Adrià Giménez, Rubén Gaspar Aparicio, René Fernández Sánchez, Jorge Civera, Albert Sanchis, Alfons Juan

Towards High-Quality LLM-Based Data for French Spontaneous Speech Simplification: an Exo-Refinement Approach
Lucía Ormaechea, Nikos Tsourakis, Pierrette Bouillon, Benjamin Lecouteux, Didier Schwab

BR-ASR: Efficient and Scalable Bias Retrieval Framework for Contextual Biasing ASR in Speech LLM
Xun Gong, Anqi Lv, Wangyou Zhang, Zhiming Wang, Huijia Zhu, Yanmin Qian

SPGISpeech 2.0: Transcribed multi-speaker financial audio for speaker-tagged transcription
Raymond Grossman, Taejin Park, Kunal Dhawan, Andrew Titus, Sophia Zhi, Yulia Shchadilova, Weiqing Wang, Jagadeesh Balam, Boris Ginsburg

Loquacious Set: 25,000 Hours of Transcribed and Diverse English Speech Recognition Data for Research and Commercial Use
Titouan Parcollet, Yuan Tseng, Shucong Zhang, Rogier C. van Dalen

CEREALES: a new dataset of Quebec French accented speech with applications to speech recognition
Lucas Maison, Thomas Soulas, Marie-Jean Meurs


Challenges in Speech Data Collection, Curation and Annotation - Part 2


You Are What You Say: Exploiting Linguistic Content for VoicePrivacy Attacks
Ünal Ege Gaznepoglu, Anna Leschanowsky, Ahmad Aloradi, Prachi Singh, Daniel Tenbrinck, Emanuël A. P. Habets, Nils Peters

Recognizing Every Voice: Towards Inclusive ASR for Rural Bhojpuri Women
Sakshi Joshi, Eldho Ittan George, Tahir Javed, Kaushal Bhogale, Nikhil Narasimhan, Mitesh M. Khapra

Augment Mandarin to Cantonese Speech Databases via Retrieval-Augmented Generation and Speech Synthesis
Fan Liu, Cheng Gong, Boyu Zhu, Ruihao Jing, Chunyu Qiang, Tianrui Wang, Xiao-Lei Zhang, Xuelong Li

An Exploratory Framework for LLM-assisted Human Annotation of Speech Datasets
Alexander Johnson, Harsh Deshpande, Emmy Phung, Ahmad Emami

Automatic Labeling and Correction of Noisy Labels for Robust Self-Supervised Speaker Verification
Abderrahim Fathan, Jahangir Alam

Auto-Landmark: Acoustic Landmark Dataset and Open-Source Toolkit for Landmark Extraction
Xiangyu Zhang, Daijiao Liu, Tianyi Xiao, Cihan Xiao, Tünde Szalay, Mostafa Shahin, Beena Ahmed, Julien Epps

AusKidTalk: Using Strategic Data Collection and Out-of-Domain Tools to Semi-Automate Novel Corpora Annotation
Tünde Szalay, Mostafa Shahin, Tharmakulasingam Sirojan, Zheng Nan, Renata Huang, Kirrie Ballard, Beena Ahmed

ASR-based segmentation for the analysis of larger child-speech datasets: Performance evaluation on vowels from Australian-English speaking children aged 4 to 11 years
Rui Cai, Titia Benders

A semi-automatic pipeline for transcribing and segmenting child speech
Polychronia Christodoulidou, James Tanner, Jane Stuart-Smith, Michael McAuliffe, Mridhula Murali, Amy Smith, Lauren Taylor, Joanne Cleland, Anja Kuschmann

Hybrid Data Sampling for ASR: Integrating Acoustic Diversity and Transcription Uncertainty
Komei Hiruta, Yosuke Yamano, Hideaki Tamori

Whilter: A Whisper-based Data Filter for "In-the-Wild" Speech Corpora Using Utterance-level Multi-Task Classification
William Ravenscroft, George Close, Kit Bower-Morris, Jamie Stacey, Dmitry Sityaev, Kris Y. Hong

Adapting Whisper for low-resource Hindi-English Code-Mix speech with on-the-fly Augmentation & LLM-Synthesised Data
Astik Biswas, Oleg Shevelev, Amine Abdaoui, Vivek Tyagi, Abdelmoumene Boumadane

Optimizing ASR for Catalan-Spanish Code-Switching: A Comparative Analysis of Methodologies
Carlos Mena, Pol Serra, Jacobo Romero, Abir Messaoudi, Jose Giraldo, Carme Armentano-Oller, Rodolfo Zevallos, Ivan Meza, Javier Hernando

From Scarcity to Sufficiency: Speech Recognition Pipeline for Zero-resource Language
Nikolay Karpov, Sofia Kostandian, Nune Tadevosyan, Alexan Ayrapetyan, Andrei Andrusenko, Ara Yeroyan, Mher Yerznkanyan, Vitaly Lavrukhin

MIKU-PAL: An Automated and Standardized Multimodal Method for Speech Paralinguistic and Affect Labeling
Yifan Cheng, Ruoyi Zhang, Jiatong Shi

Multimodal Fusion with Semi-Supervised Learning Minimizes Annotation Quantity for Modeling Videoconference Conversation Experience
Andrew Chang, Chenkai Hu, Ji Qi, Zhuojian Wei, Kexin Zhang, Viswadruth Akkaraju, David Poeppel, Dustin Freeman

Clinical Annotations for Automatic Stuttering Severity Assessment
Ana Valente, Rufael Marew, Hawau Toyin, Hamdan Al-Ali, Anelise Bohnen, Inma Becerra, Elsa Soares, Gonçalo Leal, Hanan Aldarmaki



Emotion and Expressivity in Speech Synthesis and Voice Conversion


EATS-Speech: Emotion-Adaptive Transformation and Priority Synthesis for Zero-Shot Text-to-Speech
Jingyuan Xing, Zhipeng Li, Shuaiqi Chen, Xiaofen Xing, Xiangmin Xu

Voice Impression Control in Zero-Shot TTS
Kenichi Fujita, Shota Horiguchi, Yusuke Ijima

EME-TTS: Unlocking the Emphasis and Emotion Link in Speech Synthesis
Haoxun Li, Leyuan Qu, Jiaxi Hu, Taihao Li

DiEmo-TTS: Disentangled Emotion Representations via Self-Supervised Distillation for Cross-Speaker Emotion Transfer in Text-to-Speech
Deok-Hyeon Cho, Hyung-Seok Oh, Seung-Bin Kim, Seong-Whan Lee

Spotlight-TTS: Spotlighting the Style via Voiced-Aware Style Extraction and Style Direction Adjustment for Expressive Text-to-Speech
Nam-Gyu Kim, Deok-Hyeon Cho, Seung-Bin Kim, Seong-Whan Lee

Speaker-agnostic Emotion Vector for Cross-speaker Emotion Intensity Control
Masato Murata, Koichi Miyazaki, Tomoki Koriyama

SA-RAS: Speaker-Aware Style Retrieval Augmented Generation for Expressive Zero-Shot Text-to-Speech Synthesis
Xueru Li, Jingyuan Xing, Xiaofen Xing, Zhipeng Li, Xiangmin Xu

DiffEmotionVC: A Dual-Granularity Disentangled Diffusion Framework for Any-to-Any Emotional Voice Conversion
Xiaosu Su, BoWen Yang, Xiaowei Yi, Yun Cao

ZSDEVC: Zero-Shot Diffusion-based Emotional Voice Conversion with Disentangled Mechanism
Hsing-Hang Chou, Yun-Shao Lin, Ching-Chin Sung, Yu Tsao, Chi-Chun Lee

MPE-TTS: Customized Emotion Zero-Shot Text-To-Speech Using Multi-Modal Prompt
Zhichao Wu, Yueteng Kang, Songjun Cao, Long Ma, Qiulin Li, Qun Yang


Speech Emotion Recognition in Naturalistic Conditions Challenge


Towards LLM-Empowered Fine-Grained Speech Descriptors for Explainable Emotion Recognition
Youjun Chen, Xurong Xie, Haoning Xu, Mengzhe Geng, Guinan Li, Chengxi Deng, Huimeng Wang, Shujie Hu, Xunying Liu

From Pretraining to Performance: Benchmarking Self-Supervised Speech Models for Interspeech-25 SER Challenge
Drishya Uniyal, Vinayak Abrol

Developing a Top-tier Framework in Naturalistic Conditions Challenge for Categorized Emotion Prediction: From Speech Foundation Models and Learning Objective to Data Augmentation and Engineering Choices
Tiantian Feng, Thanathai Lertpetchpun, Dani Byrd, Shrikanth Narayanan

Developing a High-performance Framework for Speech Emotion Recognition in Naturalistic Conditions Challenge for Emotional Attribute Prediction
Thanathai Lertpetchpun, Tiantian Feng, Dani Byrd, Shrikanth Narayanan

EmoSphere-SER: Enhancing Speech Emotion Recognition Through Spherical Representation with Auxiliary Classification
Deok-Hyeon Cho, Hyung-Seok Oh, Seung-Bin Kim, Seong-Whan Lee

Explainable Speech Emotion Recognition Through Attentive Pooling: Insights from Attention-Based Temporal Localization
Tahitoa Leygue, Astrid Sabourin, Christian Bolzmacher, Sylvain Bouchigny, Margarita Anastassova, Quoc-Cuong Pham

ABHINAYA - A System for Speech Emotion Recognition In Naturalistic Conditions Challenge
Soumya Dutta, Smruthi Balaji, Varada R, Viveka Salinamakki, Sriram Ganapathy

The Interspeech 2025 Challenge on Speech Emotion Recognition in Naturalistic Conditions
Abinay Reddy Naini, Lucas Goncalves, Ali N. Salman, Pravin Mote, Ismail R. Ulgen, Thomas Thebaud, Laureano Moro Velazquez, Leibny Paola Garcia, Najim Dehak, Berrak Sisman, Carlos Busso

MATER: Multi-level Acoustic and Textual Emotion Representation for Interpretable Speech Emotion Recognition
Hyo Jin Jon, Longbin Jin, Hyuntaek Jung, Hyunseo Kim, Donghun Min, Eun Yi Kim

Multi-task learning for speech emotion recognition in naturalistic conditions
Bartłomiej Zgórzyński, Juliusz Wójtowicz-Kruk, Piotr Masztalski, Władysław Średniawa

Medusa: A Multimodal Deep Fusion Multi-Stage Training Framework for Speech Emotion Recognition in Naturalistic Conditions
Georgios Chatzichristodoulou, Despoina Kosmopoulou, Antonios Kritikos, Anastasia Poulopoulou, Efthymios Georgiou, Athanasios Katsamanis, Vassilis Katsouros, Alexandros Potamianos

Interactive Fusion of Multi-View Speech Embeddings via Pretrained Large-Scale Speech Models for Speech Emotional Attribute Prediction in Naturalistic Conditions
Yuyun Liu, Yujia Gu, Jiahao Luo, Wenming Zheng, Cheng Lu, Yuan Zong

Advancing Emotion Recognition via Ensemble Learning: Integrating Speech, Context, and Text Representations
Xiaohan Shi, Jinyi Mi, Xingfeng Li, Tomoki Toda

Improving Speech Emotion Recognition Through Cross Modal Attention Alignment and Balanced Stacking Model
Lucas Ueda, João Lima, Leonardo Marques, Paula Costa

EmoJudge: LLM Based Post-Hoc Refinement for Multimodal Speech Emotion Recognition
Prabhav Singh, Jesus Villalba

Lessons Learnt: Revisit Key Training Strategies for Effective Speech Emotion Recognition in the Wild
Jing-Tong Tzeng, Bo-Hao Su, Ya-Tse Wu, Hsing-Hang Chou, Chi-Chun Lee

Enhancing Speech Emotion Recognition with Multi-Task Learning and Dynamic Feature Fusion
Honghong Wang, Jing Deng, Fanqin Meng, Rong Zheng


Adaptation and Target-speaker ASR


Enhancing Target-speaker Automatic Speech Recognition Using Multiple Speaker Embedding Extractors with Virtual Speaker Embedding
Ju-Seok Seong, Jeong-Hwan Choi, Ye-Rin Jeoung, Ilseok Kim, Joon-Hyuk Chang

SC-SOT: Conditioning the Decoder on Diarized Speaker Information for End-to-End Overlapped Speech Recognition
Yuta Hirano, Sakriani Sakti

Efficient Data Selection for Domain Adaptation of ASR Using Pseudo-Labels and Multi-Stage Filtering
Pradeep Rangappa, Andrés Carofilis, Jeena Prakash, Shashi Kumar, Sergio Burdisso, Srikanth Madikeri, Esaú Villatoro-Tello, Bidisha Sharma, Petr Motlicek, Kadri Hacioglu, Shankar Venkatesan, Saurabh Vyas, Andreas Stolcke

MOPSA: Mixture of Prompt-Experts Based Speaker Adaptation for Elderly Speech Recognition
Chengxi Deng, Xurong Xie, Shujie Hu, Mengzhe Geng, Yicong Jiang, Jiankun Zhao, Jiajun Deng, Guinan Li, Youjun Chen, Huimeng Wang, Haoning Xu, Mingyu Cui, Xunying Liu

Visually-Adaptive Guided Robust Speech Recognition with Parameter-Efficient Adaptation
Zhao Yang, Rui Jiang, Yue Heng Yeo, Xiao Fu, Wei Xi, Jizhong Zhao

Regularizing Learnable Feature Extraction for Automatic Speech Recognition
Peter Vieting, Maximilian Kannen, Benedikt Hilmes, Ralf Schlüter, Hermann Ney

MMLoRA: Multitask Memory Parameter-Efficient Fine-Tuning for Multimodal SER
Yuanbo Fang, Xiaofen Xing, Xueru Li, Weibin Zhang, Xiangmin Xu

Robust Unsupervised Adaptation of a Speech Recogniser Using Entropy Minimisation and Speaker Codes
Rogier C. van Dalen, Shucong Zhang, Titouan Parcollet, Sourav Bhattacharya


Keynote4 - Judith Holler: Using and comprehending language in face-to-face conversation


Using and comprehending language in face-to-face conversation
Judith Holler


Pathological Speech Analysis 4


On the Relevance of Clinical Assessment Tasks for the Automatic Detection of Parkinson’s Disease Medication State from Speech
David Gimeno-Gómez, Rubén Solera-Ureña, Anna Pompili, Carlos-D. Martínez-Hinarejos, Rita Cardoso, Isabel Guimarães, Joaquim J. Ferreira, Alberto Abad

Speech power spectra: a window into neural oscillations in Parkinson's disease
Sevada Hovsepyan, Mathew Magimai Doss

Synchronous analysis of abnormal acoustic and linguistic production in Parkinson's speech
Daniel Escobar-Grisales, Cristian David Ríos-Urrego, Sabato Marco Siniscalchi, Adolfo M. Garcia, Yamile Bocanegra, Leonardo Moreno, Elmar Nöth, Juan Rafael Orozco-Arroyave

Automatic Detection and Sub-typing of Primary Progressive Aphasia from Speech: Integrating Task-Specific Features and Spatio-Semantic Graphs
Fritz Peters, W Richard Bevan-Jones, Grace Threlfall, Jenny M Harris, Julie S Snowden, Matthew Jones, Jennifer C Thompson, Daniel J Blackburn, Heidi Christensen

Towards Classification of Typical and Atypical Disfluencies: A Self Supervised Representation Approach
Priyanka Kommagouni, Pragya Khanna, Vamshiraghusimha Narasinga, Anirudh Bocha, Anil Kumar Vuppala

Stuttering Detection Based on Self-Attention Weights of Temporal Acoustic Vector Sequence
Genzo Miyahara, Tsuneo Kato, Akihiro Tamura

Speech-Based Automatic Chronic Kidney Disease Diagnosis via Transformer Fusion of Glottal and Spectrogram Features
Jihyun Mun, Minhwa Chung, Sunhee Kim

Influence of Room Acoustics on Objective Voice Assessment Methods in the Context of Speech and Language Therapy
Sven Franz, Tanja Grewe, Bernd T. Meyer, Jörg Bitzer

Multimodal Speech-Based Biomarkers Outperform the ALS Functional Rating Scale in Predicting Individual Disease Progression in ALS
Hardik Kothare, Michael Neumann, Vikram Ramanarayanan


Biosignal-enabled Spoken Communication


GTAnet: Geometry-Guided Temporal Attention for EEG-Based Sound Source Tracking in Cocktail Party Scenarios
Saurav Pahuja, Gabriel Ivucic, Siqi Cai, Dashanka Da Silva, Haizhou Li, Tanja Schultz

Decoding Listener's Identity: Person Identification from EEG Signals Using a Lightweight Spiking Transformer
Zheyuan Lin, Siqi Cai, Haizhou Li

Recreating Neural Activity During Speech Production with Language and Speech Model Embeddings
Owais Mujtaba Khanday, Pablo Rodríguez San Esteban, Zubair Ahmad Lone, Marc Ouellet, Jose A. Gonzalez-Lopez

Towards Sentence Level Imagined Speech Generation from EEG signals
Sparsh Rastogi, Harsh Dadwal, Khushboo Modi, Jatin Bedi, Jasmeet Singh

Word-Level Error Analysis in Decoding Systems: From Speech Recognition to Brain-Computer Interfaces
Jingya Huang, Aashish N. Patel, Sowmya Manojna Narasimha, Gal Mishne, Vikash Gilja

NeuroSpex+: Dual-Task Training of Neuro-Guided Speaker Extraction with Speech Envelope and Waveform
Dashanka Da Silva, Siqi Cai, Saurav Pahuja, Tanja Schultz, Haizhou Li

DiffMV-ETS: Diffusion-based Multi-Voice Electromyography-to-Speech Conversion using Speaker-Independent Speech Training Targets
Kevin Scheck, Tom Dombeck, Zhao Ren, Peter Wu, Michael Wand, Tanja Schultz

Conformer-based Ultrasound-to-Speech Conversion
Ibrahim Ibrahimov, Csaba Zainkó, Gábor Gosztolya

Training Articulatory Inversion Models for Interspeaker Consistency
Charles McGhee, Mark J.F. Gales, Kate M. Knill

Enhancing Acoustic-to-Articulatory Inversion with Multi-Target Pretraining for Low-Resource Settings
Jesuraj Bandekar, Prasanta Kumar Ghosh

Articulatory Vowel Distinctiveness in Spanish
Kristin Teplansky, Emily Rangel, Mimi LaValley, Jinuk Kwon, Beiming Cao, Jun Wang

EEG-based Speech Decoding Based on Multi-mode Joint Modeling
Peiran Li, Fei Chen, Xixin Wu

A Silent Speech Decoding System from EEG and EMG with Heterogenous Electrode Configurations
Masakazu Inoue, Motoshige Sato, Kenichi Tomeoka, Nathania Nah, Eri Hatakeyama, Kai Arulkumaran, Ilya Horiguchi, Shuntaro Sasai

NAM-to-Speech Conversion with Multitask-Enhanced Autoregressive Models
Neil Shah, Shirish Karande, Vineet Gandhi

RESOUND: Speech Reconstruction from Silent Videos via Acoustic-Semantic Decomposed Modeling
Long-Khanh Pham, Thanh V. T. Tran, Minh-Tan Pham, Van Nguyen



Pathological Speech Analysis 5


Pitfalls and Limits in Automatic Dementia Assessment
Franziska Braun, Christopher Witzl, Andreas Erzigkeit, Hartmut Lehfeld, Thomas Hillemacher, Tobias Bocklet, Korbinian Riedhammer

On the Within-class Variation Issue in Alzheimer's Disease Detection
Jiawen Kang, Dongrui Han, Lingwei Meng, Jingyan Zhou, Jinchao Li, Xixin Wu, Helen Meng

Alzheimer’s Disease Detection Using Co-Attention Mechanism for Acoustic and ASR-Transcribed Text Features
Yongqi Shao, Tao Fang

Beyond Manual Transcripts: The Potential of Automated Speech Recognition Errors in Improving Alzheimer’s Disease Detection
Yin-Long Liu, Rui Feng, Jia-Xin Chen, Yi-Ming Wang, Jia-Hong Yuan, Zhen-Hua Ling

Voice-Based Dysphagia Detection: Leveraging Self-Supervised Speech Representation
Injune Hwang, Jung-Min Kim, Ju Seok Ryu, Kyogu Lee

ADCeleb: A Longitudinal Speech Dataset from Public Figures for Early Detection of Alzheimer’s Disease
Kunxiao Gao, Anna Favaro, Najim Dehak, Laureano Moro Velazquez

Anne Rowling Neurological Speech Corpus: clinically annotated longitudinal dataset for developing speech biomarkers in neurodegenerative disorders
Johnny Tam, Christine Weaver, Oliver Watts, Siddharthan Chandran, Suvankar Pal, Rowling Speech Consortium

Multitask Learning with Fused Attention for Improved ASR and Mispronunciation Detection in Children's Speech Sound Disorders
Selina S. Sung, Seunghee Ha, Tae-Jin Yoon, Jungmin So

Multimodal Speech, Language and Orofacial Analysis for Remote Assessment of Positive, Negative and Cognitive Symptoms in Schizophrenia
Michael Neumann, Hardik Kothare, Beverly Insel, Anzalee Khan, Danyah Nadim, Jean-Pierre Lindenmayer, Vikram Ramanarayanan


Speech Analysis, Detection and Classification 2


Identifying Primary Stress Across Related Languages and Dialects with Transformer-based Speech Encoder Models
Nikola Ljubešić, Ivan Porupski, Peter Rupnik

SupraDoRAL: Automatic Word Prominence Detection Using Suprasegmental Dependencies of Representations with Acoustic and Linguistic Context
Jhansi Mallela, Upendra Vishwanath Y. S., Sankara Bharadwaj Rangavajjala, Bhaskar Bhatt, Chiranjeevi Yarra

LombardTokenizer: Disentanglement and Control of Vocal Effort in a Neural Speech Codec
Maxime Jacquelin, Maëva Garnier, Laurent Girin, Rémy Vincent, Olivier Perrotin

Robust Personal Voice Activity Detection for Mitigating Domain Mismatch and False Acceptance Scenarios
Yuke Lin, Jun Chen, Wenjie Li, Longshuai Xiao, Chao Weng

Adaptive Knowledge Distillation for Device-Directed Speech Detection
Hyung-gun Chi, Florian Pesce, Wonil Chang, Oggi Rudovic, Arturo Argueta, Stefan Braun, Vineet Garg, Ahmed Hussen Abdelaziz

Flexible VAD-PVAD Transition: A Detachable PVAD Module for Dynamic Encoder RNN VAD
En-Lun Yu, Chien-Chun Wang, Jeih-Weih Hung, Shih-Chieh Huang, Berlin Chen

Speaker Conditioning of Voice Activity Detection via Implicit Separation
Matthew Maciejewski

ASDA: Audio Spectrogram Differential Attention Mechanism for Self-Supervised Representation Learning
Junyu Wang, Tianrui Wang, Meng Ge, Longbiao Wang, Jianwu Dang

DuRep: Dual-Mode Speech Representation Learning via ASR-Aware Distillation
Prabash Reddy Male, Swayambhu Nath Ray, Harish Arsikere, Akshat Jaiswal, Prakhar Swarup, Prantik Sen, Debmalya Chakrabarty, K V Vijay Girish, Nikhil Bhave, Frederick Weber, Sambuddha Bhattacharya, Sri Garimella



Prosody, Phoneme and Stress Modeling in ASR

Segments

Datasets and Tools for Speech Synthesis

Spoken Dialogue Systems 2

Speech Enhancement and Representation Learning

Neural Codecs and Vocoders

Adaptation and Target-speaker ASR

Show and Tell 4: Education / Assistive Technology

Source Separation 2

Speech Coding

Multimodality

Speech Assessment and Language Learning

Watermarking and Anonymization

Single-channel Speech Enhancement

Contextual Biasing and Adaptation

Speaker Diarization 2

Depression Detection and Assessment 2

Keynote4 - Judith Holler: Using and comprehending language in face-to-face conversation

Pathological Speech Analysis 4

Speech Deepfakes

Prosody

Speech Analysis and Quality Assessment

Emotions and Foundational Models

Prediction and Evaluation of Speech Quality and Intelligibility

Multi-Talker ASR

Speech Synthesis Paradigms and Methods 3

Biosignal-enabled Spoken Communication

Speech Deepfakes, Antispoofing and Backdoor Attacks

Pathological Speech Analysis 5

ASR Assessment and Foundational Models

Speaker Recognition

Speech Analysis, Detection and Classification 2