
Interspeech 2024

Kos, Greece
1-5 September 2024

Chairs: Itshak Lapidot, Sharon Gannot
doi: 10.21437/Interspeech.2024
ISSN: 2958-1796


Biosignal-enabled Spoken Communication


A multimodal approach to study the nature of coordinative patterns underlying speech rhythm
Jinyu Li, Leonardo Lancia

Towards EMG-to-Speech with Necklace Form Factor
Peter Wu, Ryan Kaveh, Raghav Nautiyal, Christine Zhang, Albert Guo, Anvitha Kachinthaya, Tavish Mishra, Bohan Yu, Alan W Black, Rikky Muller, Gopala Krishna Anumanchipalli

Using articulated speech EEG signals for imagined speech decoding
Chris Bras, Tanvina Patel, Odette Scharenborg

Direct Speech Synthesis from Non-Invasive, Neuromagnetic Signals
Jinuk Kwon, David Harwath, Debadatta Dash, Paul Ferrari, Jun Wang

Optical Flow Guided Tongue Trajectory Generation for Diffusion-based Acoustic to Articulatory Inversion
Yudong Yang, Rongfeng Su, Rukiye Ruzi, Manwa Ng, Shaofeng Zhao, Nan Yan, Lan Wang

Multimodal Segmentation for Vocal Tract Modeling
Rishi Jain, Bohan Yu, Peter Wu, Tejas Prabhune, Gopala Anumanchipalli

Articulatory synthesis using representations learnt through phonetic label-aware contrastive loss
Jesuraj Bandekar, Sathvik Udupa, Prasanta Kumar Ghosh

Auditory Attention Decoding in Four-Talker Environment with EEG
Yujie Yan, Xiran Xu, Haolin Zhu, Pei Tian, Zhongshu Ge, Xihong Wu, Jing Chen

ASA: An Auditory Spatial Attention Dataset with Multiple Speaking Locations
Zijie Lin, Tianyu He, Siqi Cai, Haizhou Li

Leveraging Graphic and Convolutional Neural Networks for Auditory Attention Detection with EEG
Saurav Pahuja, Gabriel Ivucic, Pascal Himmelmann, Siqi Cai, Tanja Schultz, Haizhou Li



Contextual Biasing and Adaptation


Keyword-Guided Adaptation of Automatic Speech Recognition
Aviv Shamsian, Aviv Navon, Neta Glazer, Gill Hetz, Joseph Keshet

Improving Speech Recognition with Prompt-based Contextualized ASR and LLM-based Re-predictor
Nguyen Manh Tien Anh, Thach Ho Sy

Incorporating Class-based Language Model for Named Entity Recognition in Factorized Neural Transducer
Peng Wang, Yifan Yang, Zheng Liang, Tian Tan, Shiliang Zhang, Xie Chen

Contextual Biasing with Confidence-based Homophone Detector for Mandarin End-to-End Speech Recognition
Chengxu Yang, Lin Zheng, Sanli Tian, Gaofeng Cheng, Sujie Xiao, Ta Li

Improving Neural Biasing for Contextual Speech Recognition by Early Context Injection and Text Perturbation
Ruizhe Huang, Mahsa Yarmohammadi, Sanjeev Khudanpur, Daniel Povey

Fast Context-Biasing for CTC and Transducer ASR models with CTC-based Word Spotter
Andrei Andrusenko, Aleksandr Laptev, Vladimir Bataev, Vitaly Lavrukhin, Boris Ginsburg

Prompt Tuning for Speech Recognition on Unknown Spoken Name Entities
Xizi Wei, Stephen McGregor

Improved Factorized Neural Transducer Model For Text-only Domain Adaptation
Junzhe Liu, Jianwei Yu, Xie Chen

Modality Translation Learning for Joint Speech-Text Model
Pin-Yen Liu, Jen-Tzung Chien

SAML: Speaker Adaptive Mixture of LoRA Experts for End-to-End ASR
Qiuming Zhao, Guangzhi Sun, Chao Zhang, Mingxing Xu, Thomas Fang Zheng

Factor-Conditioned Speaking-Style Captioning
Atsushi Ando, Takafumi Moriya, Shota Horiguchi, Ryo Masumura

Dual-Pipeline with Low-Rank Adaptation for New Language Integration in Multilingual ASR
Yerbolat Khassanov, Zhipeng Chen, Tianfeng Chen, Tze Yuang Chong, Wei Li, Jun Zhang, Lu Lu, Yuxuan Wang

Speculative Speech Recognition by Audio-Prefixed Low-Rank Adaptation of Language Models
Bolaji Yusuf, Murali Karthick Baskar, Andrew Rosenberg, Bhuvana Ramabhadran

Domain-Aware Data Selection for Speech Classification via Meta-Reweighting
Junghun Kim, Ka Hyun Park, Hoyoung Yoon, U Kang





Speech Disorders 2


Whister: Using Whisper’s representations for Stuttering detection
Vrushank Changawala, Frank Rudzicz

Improving Speech-Based Dysarthria Detection using Multi-task Learning with Gradient Projection
Yan Xiong, Visar Berisha, Julie Liss, Chaitali Chakrabarti

Cascaded Transfer Learning Strategy for Cross-Domain Alzheimer's Disease Recognition through Spontaneous Speech
Guanlin Chen, Yun Jin

A Cross-Attention Layer coupled with Multimodal Fusion Methods for Recognizing Depression from Spontaneous Speech
Loukas Ilias, Dimitris Askounis

Segmental and Suprasegmental Speech Foundation Models for Classifying Cognitive Risk Factors: Evaluating Out-of-the-Box Performance
Si-Ioi Ng, Lingfeng Xu, Kimberly D. Mueller, Julie Liss, Visar Berisha

Multimodal Continuous Fingerspelling Recognition via Visual Alignment Learning
Katerina Papadimitriou, Gerasimos Potamianos

Contrastive Learning Approach for Assessment of Phonological Precision in Patients with Tongue Cancer Using MRI Data
Tomas Arias-Vergara, Paula Andrea Pérez-Toro, Xiaofeng Liu, Fangxu Xing, Maureen Stone, Jiachen Zhuo, Jerry L. Prince, Maria Schuster, Elmar Noeth, Jonghye Woo, Andreas Maier

DysArinVox: DYSphonia & DYSarthria mandARIN speech corpus
Haojie Zhang, Tao Zhang, Ganjun Liu, Dehui Fu, Xiaohui Hou, Ying Lv

YOLO-Stutter: End-to-end Region-Wise Speech Dysfluency Detection
Xuanru Zhou, Anshul Kashyap, Steve Li, Ayati Sharma, Brittany Morin, David Baquirin, Jet Vonk, Zoe Ezzes, Zachary Miller, Maria Tempini, Jiachen Lian, Gopala Anumanchipalli

Automatic Longitudinal Investigation of Multiple Sclerosis Subjects
Gábor Gosztolya, Veronika Svindt, Judit Bóna, Ildikó Hoffmann


TAUKADIAL Challenge: Speech-Based Cognitive Assessment in Chinese and English (Special Session)


Connected Speech-Based Cognitive Assessment in Chinese and English
Saturnino Luz, Sofia De La Fuente Garcia, Fasih Haider, Davida Fromm, Brian MacWhinney, Alyssa Lanzi, Ya-Ning Chang, Chia-Ju Chou, Yi-Chien Liu

Cognitive Insights Across Languages: Enhancing Multimodal Interview Analysis
David Ortiz-Perez, Jose Garcia-Rodriguez, David Tomás

Combining Acoustic Feature Sets for Detecting Mild Cognitive Impairment in the Interspeech'24 TAUKADIAL Challenge
Gábor Gosztolya, László Tóth

Pre-trained Feature Fusion and Matching for Mild Cognitive Impairment Detection
Junwen Duan, Fangyuan Wei, Hong-Dong Li, Jin Liu

The Interspeech 2024 TAUKADIAL Challenge: Multilingual Mild Cognitive Impairment Detection with Multimodal Approach
Benjamin Barrera-Altuna, Daeun Lee, Zaima Zarnaz, Jinyoung Han, Seungbae Kim

Leveraging Universal Speech Representations for Detecting and Assessing the Severity of Mild Cognitive Impairment Across Languages
Anna Favaro, Tianyu Cao, Najim Dehak, Laureano Moro-Velazquez

Translingual Language Markers for Cognitive Assessment from Spontaneous Speech
Bao Hoang, Yijiang Pang, Hiroko Dodge, Jiayu Zhou

Multilingual Speech and Language Analysis for the Assessment of Mild Cognitive Impairment: Outcomes from the Taukadial Challenge
Paula Andrea Pérez-Toro, Tomas Arias-Vergara, Philipp Klumpp, Tobias Weise, Maria Schuster, Elmar Noeth, Juan Rafael Orozco-Arroyave, Andreas Maier


Show and Tell 1


Production of phrases by mechanical models of the human vocal tract
Takayuki Arai, Ryohei Suzuki, Chandler Earp, Shinya Tsuji, Keiko Ochi

Faster Vocoder: a multi threading approach to achieve low latency during TTS Inference
Vishal Gourav, Ankit Tyagi, Phanindra Mankale

A powerful and modern AAC composition tool for impaired speakers
Aanchan Mohan, Monideep Chakraborti, Katelyn Eng, Nailia Kushaeva, Mirjana Prpa, Jordan Lewis, Tianyi Zhang, Vince Geisler, Carol Geisler

VoxFlow AI: wearable voice converter for atypical speech
Grzegorz P. Mika, Konrad Zieliński, Paweł Cyrta, Marek Grzelec

Stress transfer in speech-to-speech machine translation
Sai Akarsh, Vamshiraghusimha Narasinga, Anil Kumar Vuppala

Mobile PresenTra: NICT fast neural text-to-speech system on smartphones with incremental inference of MS-FC-HiFi-GAN for low-latency synthesis
Takuma Okamoto, Yamato Ohtani, Hisashi Kawai

Multi-speaker and multi-dialectal Catalan TTS models for video gaming
Alex Peiró-Lilja, José Giraldo, Martí Llopart-Font, Carme Armentano-Oller, Baybars Külebi, Mireia Farrús

ConnecTone: a modular AAC system prototype with contextual generative text prediction and style-adaptive conversational TTS
Juliana Francis, Éva Székely, Joakim Gustafson

Reliable dialogue system for facilitating student-counselor communication
Mahdin Rohmatillah, Bryan Gautama Ngo, Willianto Sulaiman, Po-Chuan Chen, Jen-Tzung Chien

CreakVC: a voice conversion tool for modulating creaky voice
Harm Lameris, Joakim Gustafson, Éva Székely

EZTalking: English assessment platform for teachers and students
Yu-Sheng Tsao, Yung-Chang Hsu, Jiun-Ting Li, Siang-Hong Weng, Tien-Hong Lo, Berlin Chen



General Topics in ASR


Improving Domain-Specific ASR with LLM-Generated Contextual Descriptions
Jiwon Suh, Injae Na, Woohwan Jung

A Multitask Training Approach to Enhance Whisper with Open-Vocabulary Keyword Spotting
Yuang Li, Min Zhang, Chang Su, Yinglu Li, Xiaosong Qiao, Mengxin Ren, Miaomiao Ma, Daimeng Wei, Shimin Tao, Hao Yang

CrisperWhisper: Accurate Timestamps on Verbatim Speech Transcriptions
Mario Zusag, Laurin Wagner, Bernhard Thallinger

On Disfluency and Non-lexical Sound Labeling for End-to-end Automatic Speech Recognition
Peter Mihajlik, Yan Meng, Mate S Kadar, Julian Linke, Barbara Schuppler, Katalin Mády

Inclusive ASR for Disfluent Speech: Cascaded Large-Scale Self-Supervised Learning with Targeted Fine-Tuning and Data Augmentation
Dena Mujtaba, Nihar R. Mahapatra, Megan Arney, J. Scott Yaruss, Caryn Herring, Jia Bin

DualPure: An Efficient Adversarial Purification Method for Speech Command Recognition
Hao Tan, Xiaochen Liu, Huan Zhang, Junjian Zhang, Yaguan Qian, Zhaoquan Gu

A Comparative Analysis of Bilingual and Trilingual Wav2Vec Models for Automatic Speech Recognition in Multilingual Oral History Archives
Jan Lehečka, Josef V. Psutka, Lubos Smidl, Pavel Ircing, Josef Psutka

A layer-wise analysis of Mandarin and English suprasegmentals in SSL speech models
Anton de la Fuente, Dan Jurafsky

Fine-Tuning Strategies for Dutch Dysarthric Speech Recognition: Evaluating the Impact of Healthy, Disease-Specific, and Speaker-Specific Data
Spyretta Leivaditi, Tatsunari Matsushima, Matt Coler, Shekhar Nayak, Vass Verkhodanova

Dysarthric Speech Recognition Using Curriculum Learning and Articulatory Feature Embedding
I-Ting Hsieh, Chung-Hsien Wu

Enhancing Dysarthric Speech Recognition for Unseen Speakers via Prototype-Based Adaptation
Shiyao Wang, Shiwan Zhao, Jiaming Zhou, Aobo Kong, Yong Qin

An efficient text augmentation approach for contextualized Mandarin speech recognition
Naijun Zheng, Xucheng Wan, Kai Liu, Ziqing Du, Zhou Huan

Investigating ASR Error Correction with Large Language Model and Multilingual 1-best Hypotheses
Sheng Li, Chen Chen, Chin Yuen Kwok, Chenhui Chu, Eng Siong Chng, Hisashi Kawai

Efficiently Train ASR Models that Memorize Less and Perform Better with Per-core Clipping
Lun Wang, Om Thakkar, Zhong Meng, Nicole Rafidi, Rohit Prabhavalkar, Arun Narayanan



Speech and Multimodal Resources


BESST Dataset: A Multimodal Resource for Speech-based Stress Detection and Analysis
Jan Pešán, Vojtěch Juřík, Martin Karafiát, Jan Černocký

HebDB: a Weakly Supervised Dataset for Hebrew Speech Processing
Arnon Turetzky, Or Tal, Yael Segal, Yehoshua Dissen, Ella Zeldes, Amit Roth, Eyal Cohen, Yosi Shrem, Bronya R. Chernyak, Olga Seleznova, Joseph Keshet, Yossi Adi

GLOBE: A High-quality English Corpus with Global Accents for Zero-shot Speaker Adaptive Text-to-Speech
Wenbin Wang, Yang Song, Sanjay Jha

STraDa: A Singer Traits Dataset
Yuexuan Kong, Viet-Anh Tran, Romain Hennequin

MaViLS, a Benchmark Dataset for Video-to-Slide Alignment, Assessing Baseline Accuracy with a Multimodal Alignment Algorithm Leveraging Speech, OCR, and Visual Features
Katharina Anderer, Andreas Reich, Matthias Wölfel

MultiTalk: Enhancing 3D Talking Head Generation Across Languages with Multilingual Video Dataset
Kim Sung-Bin, Lee Chae-Yeon, Gihun Son, Oh Hyun-Bin, Janghoon Ju, Suekyeong Nam, Tae-Hyun Oh

Towards measuring fairness in speech recognition: Fair-Speech dataset
Irina-Elena Veliche, Zhuangqun Huang, Vineeth Ayyat Kochaniyan, Fuchun Peng, Ozlem Kalinli, Michael L. Seltzer

Codecfake: An Initial Dataset for Detecting LLM-based Deepfake Audio
Yi Lu, Yuankun Xie, Ruibo Fu, Zhengqi Wen, Jianhua Tao, Zhiyong Wang, Xin Qi, Xuefei Liu, Yongwei Li, Yukun Liu, Xiaopeng Wang, Shuchen Shi

SER Evals: In-domain and Out-of-domain benchmarking for speech emotion recognition
Mohamed Osman, Daniel Z. Kaplan, Tamer Nadeem



Speech and Language in Health: from Remote Monitoring to Medical Conversations - 1 (Special Session)


Reference-Free Estimation of the Quality of Clinical Notes Generated from Doctor-Patient Conversations
Mojtaba Kadkhodaie Elyaderani, John Glover, Thomas Schaaf

Developing an End-to-End Framework for Predicting the Social Communication Severity Scores of Children with Autism Spectrum Disorder
Jihyun Mun, Sunhee Kim, Minhwa Chung

Multimodal Fusion for Vocal Biomarkers Using Vector Cross-Attention
Vladimir Despotovic, Abir Elbéji, Petr V. Nazarov, Guy Fagherazzi

Revealing Confounding Biases: A Novel Benchmarking Approach for Aggregate-Level Performance Metrics in Health Assessments
Stefano Goria, Roseline Polle, Salvatore Fara, Nicholas Cummins

Developing Multi-Disorder Voice Protocols: A team science approach involving clinical expertise, bioethics, standards, and DEI.
Yael Bensoussan, Satrajit Ghosh, Anais Rameau, Micah Boyer, Ruth Bahr, Stephanie Watts, Frank Rudzicz, Don Bolser, Jordan Lerner-Ellis, Shaheen Awan, Maria Powell, Jean-Christophe Belisle-Pipon, Vardit Ravitsky, Alistair Johnson, Alexandros Sigaras, Olivier Elemento, David Dorr, Philip Payne

Self-Supervised Embeddings for Detecting Individual Symptoms of Depression
Sri Harsha Dumpala, Katerina Dikaios, Abraham Nunes, Frank Rudzicz, Rudolf Uher, Sageev Oore

Comparing ambulatory voice measures during daily life with brief laboratory assessments in speakers with and without vocal hyperfunction
Daryush D. Mehta, Jarrad H. Van Stan, Hamzeh Ghasemzadeh, Robert E. Hillman

Predicting Acute Pain Levels Implicitly from Vocal Features
Jennifer Williams, Eike Schneiders, Henry Card, Tina Seabrooke, Beatrice Pakenham-Walsh, Tayyaba Azim, Lucy Valls-Reed, Ganesh Vigneswaran, John Robert Bautista, Rohan Chandra, Arya Farahi

Towards Intelligent Speech Assistants in Operating Rooms: A Multimodal Model for Surgical Workflow Analysis
Kubilay Can Demir, Belén Lojo Rodríguez, Tobias Weise, Andreas Maier, Seung Hee Yang

A Multimodal Framework for the Assessment of the Schizophrenia Spectrum
Gowtham Premananth, Yashish M. Siriwardena, Philip Resnik, Sonia Bansal, Deanna L. Kelly, Carol Espy-Wilson



Innovative Methods in Phonetics and Phonology


The Use of Phone Categories and Cross-Language Modeling for Phone Alignment of Panãra
Emily P. Ahn, Eleanor Chodroff, Myriam Lapierre, Gina-Anne Levow

Deciphering Assamese Vowel Harmony with Featural InfoWaveGAN
Sneha Ray Barman, Shakuntala Mahanta, Neeraj Kumar Sharma

Phonological Feature Detection for US English using the Phonet Library
Harsha Veena Tadavarthy, Austin Jones, Margaret E. L. Renwick

K-means and hierarchical clustering of f0 contours
Constantijn Kaland, Jeremy Steffman, Jennifer Cole

Tradition or Innovation: A Comparison of Modern ASR Methods for Forced Alignment
Rotem Rousso, Eyal Cohen, Joseph Keshet, Eleanor Chodroff

Using wav2vec 2.0 for phonetic classification tasks: methodological aspects
Lila Kim, Cédric Gendrot

The sub-band cepstrum as a tool for locating local spectral regions of phonetic sensitivity: A first attempt with multi-speaker vowel data
Michael Lambropoulos, Frantz Clermont, Shunichi Ishihara

Speaker-Independent Acoustic-to-Articulatory Inversion through Multi-Channel Attention Discriminator
Woo-Jin Chung, Hong-Goo Kang

Speaker- and Text-Independent Estimation of Articulatory Movements and Phoneme Alignments from Speech
Tobias Weise, Philipp Klumpp, Kubilay Can Demir, Paula Andrea Pérez-Toro, Maria Schuster, Elmar Noeth, Bjoern Heismann, Andreas Maier, Seung Hee Yang

Preprocessing for acoustic-to-articulatory inversion using real-time MRI movies of Japanese speech
Anna Oura, Hideaki Kikuchi, Tetsunori Kobayashi




Speaker and Language Identification and Diarization


Multi-latency look-ahead for streaming speaker segmentation
Bilal Rahou, Hervé Bredin

Once more Diarization: Improving meeting transcription systems through segment-level speaker reassignment
Christoph Boeddeker, Tobias Cord-Landwehr, Reinhold Haeb-Umbach

ASoBO: Attentive Beamformer Selection for Distant Speaker Diarization in Meetings
Théo Mariotte, Anthony Larcher, Silvio Montrésor, Jean-Hugh Thomas

Hybrid-Diarization System with Overlap Post-Processing for the DISPLACE 2024 Challenge
Gabriel Pîrlogeanu, Octavian Pascu, Alexandru-Lucian Georgescu, Horia Cucu

The Second DISPLACE Challenge: DIarization of SPeaker and LAnguage in Conversational Environments
Shareef Babu Kalluri, Prachi Singh, Pratik Roy Chowdhuri, Apoorva Kulkarni, Shikha Baghel, Pradyoth Hegde, Swapnil Sontakke, Deepak K T, S.R. Mahadeva Prasanna, Deepu Vijayasenan, Sriram Ganapathy

TalTech-IRIT-LIS Speaker and Language Diarization Systems for DISPLACE 2024
Joonas Kalda, Tanel Alumae, Martin Lebourdais, Hervé Bredin, Séverin Baroudi, Ricard Marxer

Exploring Energy-Based Models for Out-of-Distribution Detection in Dialect Identification
Yaqian Hao, Chenguang Hu, Yingying Gao, Shilei Zhang, Junlan Feng

Exploring Spoken Language Identification Strategies for Automatic Transcription of Multilingual Broadcast and Institutional Speech
Martina Valente, Fabio Brugnara, Giovanni Morrone, Enrico Zovato, Leonardo Badino

AG-LSEC: Audio Grounded Lexical Speaker Error Correction
Rohit Paturi, Xiang Li, Sundararajan Srinivasan

Speaker Change Detection with Weighted-sum Knowledge Distillation based on Self-supervised Pre-trained Models
Hang Su, Yuxiang Kong, Lichun Fan, Peng Gao, Yujun Wang, Zhiyong Wu

SOMSRED: Sequential Output Modeling for Joint Multi-talker Overlapped Speech Recognition and Speaker Diarization
Naoki Makishima, Naotaka Kawata, Mana Ihori, Tomohiro Tanaka, Shota Orihashi, Atsushi Ando, Ryo Masumura

Song Data Cleansing for End-to-End Neural Singer Diarization Using Neural Analysis and Synthesis Framework
Hokuto Munakata, Ryo Terashima, Yusuke Fujita





Speech Synthesis: Expressivity and Emotion


GTR-Voice: Articulatory Phonetics Informed Controllable Expressive Speech Synthesis
Zehua Kcriss Li, Meiying Melissa Chen, Yi Zhong, Pinxin Liu, Zhiyao Duan

TSP-TTS: Text-based Style Predictor with Residual Vector Quantization for Expressive Text-to-Speech
Donghyun Seong, Hoyoung Lee, Joon-Hyuk Chang

Spontaneous Style Text-to-Speech Synthesis with Controllable Spontaneous Behaviors Based on Language Models
Weiqin Li, Peiji Yang, Yicheng Zhong, Yixuan Zhou, Zhisheng Wang, Zhiyong Wu, Xixin Wu, Helen Meng

Text-aware and Context-aware Expressive Audiobook Speech Synthesis
Dake Guo, Xinfa Zhu, Liumeng Xue, Yongmao Zhang, Wenjie Tian, Lei Xie

Controlling Emotion in Text-to-Speech with Natural Language Prompts
Thomas Bott, Florian Lux, Ngoc Thang Vu

Retrieval Augmented Generation in Prompt-based Text-to-Speech Synthesis with Context-Aware Contrastive Language-Audio Pretraining
Jinlong Xue, Yayue Deng, Yingming Gao, Ya Li

Emotion Arithmetic: Emotional Speech Synthesis via Weight Space Interpolation
Pavan Kalyan, Preeti Rao, Preethi Jyothi, Pushpak Bhattacharyya

EmoSphere-TTS: Emotional Style and Intensity Modeling via Spherical Emotion Vector for Controllable Emotional Text-to-Speech
Deok-Hyeon Cho, Hyung-Seok Oh, Seung-Bin Kim, Sang-Hoon Lee, Seong-Whan Lee

Expressive paragraph text-to-speech synthesis with multi-step variational autoencoder
Xuyuan Li, Zengqiang Shang, Peiyang Shi, Hua Hua, Ta Li, Pengyuan Zhang

Differentiable Time-Varying Linear Prediction in the Context of End-to-End Analysis-by-Synthesis
Chin-Yun Yu, György Fazekas


Speech Synthesis: Tools and Data


SRC4VC: Smartphone-Recorded Corpus for Voice Conversion Benchmark
Yuki Saito, Takuto Igarashi, Kentaro Seki, Shinnosuke Takamichi, Ryuichi Yamamoto, Kentaro Tachibana, Hiroshi Saruwatari

Rasa: Building Expressive Speech Synthesis Systems for Indian Languages in Low-resource Settings
Praveen Srinivasa Varadhan, Ashwin Sankar, Giri Raju, Mitesh M Khapra

FLEURS-R: A Restored Multilingual Speech Corpus for Generation Tasks
Min Ma, Yuma Koizumi, Shigeki Karita, Heiga Zen, Jason Riesa, Haruko Ishikawa, Michiel Bacchiani

WenetSpeech4TTS: A 12,800-hour Mandarin TTS Corpus for Large Speech Generation Model Benchmark
Linhan Ma, Dake Guo, Kun Song, Yuepeng Jiang, Shuai Wang, Liumeng Xue, Weiming Xu, Huan Zhao, Binbin Zhang, Lei Xie

MSceneSpeech: A Multi-Scene Speech Dataset For Expressive Speech Synthesis
Qian Yang, Jialong Zuo, Zhe Su, Ziyue Jiang, Mingze Li, Zhou Zhao, Feiyang Chen, Zhefeng Wang, Baoxing Huai

LibriTTS-P: A Corpus with Speaking Style and Speaker Identity Prompts for Text-to-Speech and Style Captioning
Masaya Kawamura, Ryuichi Yamamoto, Yuma Shirahata, Takuya Hasumi, Kentaro Tachibana

1000 African Voices: Advancing inclusive multi-speaker multi-accent speech synthesis
Sewade Ogun, Abraham T. Owodunni, Tobi Olatunji, Eniola Alese, Babatunde Oladimeji, Tejumade Afonja, Kayode Olaleye, Naome A. Etori, Tosin Adewumi

SaSLaW: Dialogue Speech Corpus with Audio-visual Egocentric Information Toward Environment-adaptive Dialogue Speech Synthesis
Osamu Take, Shinnosuke Takamichi, Kentaro Seki, Yoshiaki Bando, Hiroshi Saruwatari






Speech and Language in Health: from Remote Monitoring to Medical Conversations - 2 (Special Session)


It’s Time to Take Action: Acoustic Modeling of Motor Verbs to Detect Parkinson’s Disease
Daniel Escobar-Grisales, Cristian David Ríos-Urrego, Ilja Baumann, Korbinian Riedhammer, Elmar Noeth, Tobias Bocklet, Adolfo M. Garcia, Juan Rafael Orozco-Arroyave

Towards objective and interpretable speech disorder assessment: a comparative analysis of CNN and transformer-based models
Malo Maisonneuve, Corinne Fredouille, Muriel Lalain, Alain Ghio, Virginie Woisard

Macro-descriptors for Alzheimer's disease detection using large language models
Catarina Botelho, John Mendonça, Anna Pompili, Tanja Schultz, Alberto Abad, Isabel Trancoso

Infusing Acoustic Pause Context into Text-Based Dementia Assessment
Franziska Braun, Sebastian P. Bayerl, Florian Hönig, Hartmut Lehfeld, Thomas Hillemacher, Tobias Bocklet, Korbinian Riedhammer

Towards Scalable Remote Assessment of Mild Cognitive Impairment Via Multimodal Dialog
Oliver Roesler, Jackson Liscombe, Michael Neumann, Hardik Kothare, Abhishek Hosamath, Lakshmi Arbatti, Doug Habberstad, Christiane Suendermann-Oeft, Meredith Bartlett, Cathy Zhang, Nikhil Sukhdev, Kolja Wilms, Anusha Badathala, Sandrine Istas, Steve Ruhmel, Bryan Hansen, Madeline Hannan, David Henley, Arthur Wallace, Ira Shoulson, David Suendermann-Oeft, Vikram Ramanarayanan

Automatic recognition and detection of aphasic natural speech
Mara Barberis, Pieter De Clercq, Bastiaan Tamm, Hugo Van hamme, Maaike Vandermosten

When Whisper Listens to Aphasia: Advancing Robust Post-Stroke Speech Recognition
Giulia Sanguedolce, Sophie Brook, Dragos C. Gruia, Patrick A. Naylor, Fatemeh Geranmayeh

Automatic Prediction of Amyotrophic Lateral Sclerosis Progression using Longitudinal Speech Transformer
Liming Wang, Yuan Gong, Nauman Dawalatabad, Marco Vilela, Katerina Placek, Brian Tracey, Yishu Gong, Alan Premasiri, Fernando Vieira, James Glass

How Consistent are Speech-Based Biomarkers in Remote Tracking of ALS Disease Progression Across Languages? A Case Study of English and Dutch
Hardik Kothare, Michael Neumann, Cathy Zhang, Jackson Liscombe, Jordi W J van Unnik, Lianne C M Botman, Leonard H van den Berg, Ruben P A van Eijk, Vikram Ramanarayanan

“So . . . my child . . . ” – How Child ADHD Influences the Way Parents Talk
Anika A. Spiesberger, Andreas Triantafyllopoulos, Alexander Kathan, Anastasia Semertzidou, Caterina Gawrilow, Tilman Reinelt, Wolfgang A. Rauch, Björn Schuller

Variability of speech timing features across repeated recordings: a comparison of open-source extraction techniques
Judith Dineley, Ewan Carr, Lauren L. White, Catriona Lucas, Zahia Rahman, Tian Pan, Faith Matcham, Johnny Downs, Richard J. Dobson, Thomas F. Quatieri, Nicholas Cummins

Zero-Shot End-To-End Spoken Question Answering In Medical Domain
Yanis Labrak, Adel Moumen, Richard Dufour, Mickael Rouvier

Perceiver-Prompt: Flexible Speaker Adaptation in Whisper for Chinese Disordered Speech Recognition
Yicong Jiang, Tianzi Wang, Xurong Xie, Juan Liu, Wei Sun, Nan Yan, Hui Chen, Lan Wang, Xunying Liu, Feng Tian





Speaker Recognition 1


Fine-tune Pre-Trained Models with Multi-Level Feature Fusion for Speaker Verification
Shengyu Peng, Wu Guo, Haochen Wu, Zuoliang Li, Jie Zhang

Speaker Conditional Sinc-Extractor for Personal VAD
En-Lun Yu, Kuan-Hsun Ho, Jeih-weih Hung, Shih-Chieh Huang, Berlin Chen

Enhancing ECAPA-TDNN with Feature Processing Module and Attention Mechanism for Speaker Verification
Shiu-Hsiang Liou, Po-Cheng Chan, Chia-Ping Chen, Tzu-Chieh Lin, Chung-Li Lu, Yu-Han Cheng, Hsiang-Feng Chuang, Wei-Yu Chen

MR-RawNet: Speaker verification system with multiple temporal resolutions for variable duration utterances using raw waveforms
Seung-bin Kim, Chan-yeong Lim, Jungwoo Heo, Ju-ho Kim, Hyun-seo Shin, Kyo-Won Koo, Ha-Jin Yu

Disentangled Representation Learning for Environment-agnostic Speaker Recognition
KiHyun Nam, Hee-Soo Heo, Jee-weon Jung, Joonson Chung

Multi-Channel Extension of Pre-trained Models for Speaker Verification
Ladislav Mošner, Romain Serizel, Lukáš Burget, Oldřich Plchot, Emmanuel Vincent, Junyi Peng, Jan Černocký

Efficient Integrated Features Based on Pre-trained Models for Speaker Verification
Yishuang Li, Wenhao Guan, Hukai Huang, Shiyu Miao, Qi Su, Lin Li, Qingyang Hong

SE/BN Adapter: Parametric Efficient Domain Adaptation for Speaker Recognition
Tianhao Wang, Lantian Li, Dong Wang

DB-PMAE: Dual-Branch Prototypical Masked AutoEncoder with locality for domain robust speaker verification
Wei-lin Xie, Yu-Xuan Xi, Yan Song, Jian-tao Zhang, Hao-yu Song, Ian McLoughlin

Evaluating the Santa Barbara Corpus: Challenges of the Breadth of Conversational Spoken Language
Matthew Maciejewski, Dominik Klement, Ruizhe Huang, Matthew Wiesner, Sanjeev Khudanpur

A Comprehensive Investigation on Speaker Augmentation for Speaker Recognition
Zhenyu Zhou, Shibiao Xu, Shi Yin, Lantian Li, Dong Wang






Accented Speech, Prosodic Features, Dialect, Emotion, Sound Classification


Improving Self-supervised Pre-training using Accent-Specific Codebooks
Darshan Prabhu, Abhishek Gupta, Omkar Nitsure, Preethi Jyothi, Sriram Ganapathy

Performant ASR Models for Medical Entities in Accented Speech
Tejumade Afonja, Tobi Olatunji, Sewade Ogun, Naome A. Etori, Abraham Owodunni, Moshood Yekini

LAHAJA: A Robust Multi-accent Benchmark for Evaluating Hindi ASR Systems
Tahir Javed, Janki Nawale, Sakshi Joshi, Eldho George, Kaushal Bhogale, Deovrat Mehendale, Mitesh M. Khapra

LearnerVoice: A Dataset of Non-Native English Learners’ Spontaneous Speech
Haechan Kim, Junho Myung, Seoyoung Kim, Sungpah Lee, Dongyeop Kang, Juho Kim

MinSpeech: A Corpus of Southern Min Dialect for Automatic Speech Recognition
Jiayan Lin, Shenghui Lu, Hukai Huang, Wenhao Guan, Binbin Xu, Hui Bu, Qingyang Hong, Lin Li

Cross-modal Features Interaction-and-Aggregation Network with Self-consistency Training for Speech Emotion Recognition
Ying Hu, Huamin Yang, Hao Huang, Liang He

Exploring Multilingual Unseen Speaker Emotion Recognition: Leveraging Co-Attention Cues in Multitask Learning
Arnav Goel, Medha Hira, Anubha Gupta

SELM: Enhancing Speech Emotion Recognition for Out-of-Domain Scenarios
Hazim Bukhari, Soham Deshmukh, Hira Dhamyal, Bhiksha Raj, Rita Singh

The Processing of Stress in End-to-End Automatic Speech Recognition Models
Martijn Bentum, Louis ten Bosch, Tom Lentz

LingWav2Vec2: Linguistic-augmented wav2vec 2.0 for Vietnamese Mispronunciation Detection
Tuan Nguyen, Huy Dat Tran

Learning from memory-based models
Rhiannon Mogridge, Anton Ragni

Towards End-to-End Unified Recognition for Mandarin and Cantonese
Meiling Chen, Pengjie Liu, Heng Yang, Haofeng Wang





Speech Disorders 3


Enhancing Voice Wake-Up for Dysarthria: Mandarin Dysarthria Speech Corpus Release and Customized System Design
Ming Gao, Hang Chen, Jun Du, Xin Xu, Hongxiao Guo, Hui Bu, Jianxing Yang, Ming Li, Chin-Hui Lee

Towards Improving NAM-to-Speech Synthesis Intelligibility using Self-Supervised Speech Models
Neil Shah, Shirish Karande, Vineet Gandhi

PARAN: Variational Autoencoder-based End-to-End Articulation-to-Speech System for Speech Intelligibility
Seyun Um, Doyeon Kim, Hong-Goo Kang

Acoustic changes in speech prosody produced by children with autism after robot-assisted speech training
Si Chen, Bruce Xiao Wang, Yitian Hong, Fang Zhou, Angel Chan, Po-yi Tang, Bin Li, Chunyi Wen, James Cheung, Yan Liu, Zhuoming Chen

Fine-Tuning Automatic Speech Recognition for People with Parkinson's: An Effective Strategy for Enhancing Speech Technology Accessibility
Xiuwen Zheng, Bornali Phukon, Mark Hasegawa-Johnson

Learnings from curating a trustworthy, well-annotated, and useful dataset of disordered English speech
Pan-Pan Jiang, Jimmy Tobin, Katrin Tomanek, Robert MacDonald, Katie Seaver, Richard Cave, Marilyn Ladewig, Rus Heywood, Jordan Green

Training Data Augmentation for Dysarthric Automatic Speech Recognition by Text-to-Dysarthric-Speech Synthesis
Wing-Zin Leung, Mattias Cross, Anton Ragni, Stefan Goetze

Wav2vec 2.0 Embeddings Are No Swiss Army Knife -- A Case Study for Multiple Sclerosis
Gábor Gosztolya, Mercedes Vetráb, Veronika Svindt, Judit Bóna, Ildikó Hoffmann


Speech Recognition with Large Pretrained Speech Models for Under-represented Languages (Special Session)


Interface Design for Self-Supervised Speech Models
Yi-Jen Shih, David Harwath

Comparing Discrete and Continuous Space LLMs for Speech Recognition
Yaoxun Xu, Shi-Xiong Zhang, Jianwei Yu, Zhiyong Wu, Dong Yu

Improving Whisper's Recognition Performance for Under-Represented Language Kazakh Leveraging Unpaired Speech and Text
Jinpeng Li, Yu Pu, Qi Sun, Wei-Qiang Zhang

Empowering Low-Resource Language ASR via Large-Scale Pseudo Labeling
Kaushal Santosh Bhogale, Deovrat Mehendale, Niharika Parasa, Sathish Kumar Reddy G, Tahir Javed, Pratyush Kumar, Mitesh M. Khapra

Interleaved Audio/Audiovisual Transfer Learning for AV-ASR in Low-Resourced Languages
Zhengyang Li, Patrick Blumenberg, Jing Liu, Thomas Graave, Timo Lohrenz, Siegfried Kunzmann, Tim Fingscheidt

Adapter pre-training for improved speech recognition in unseen domains using low resource adapter tuning of self-supervised models
Sathvik Udupa, Jesuraj Bandekar, Saurabh Kumar, Deekshitha G, Sandhya B, Abhayjeet S, Savitha Murthy, Priyanka Pai, Srinivasa Raghavan, Raoul Nanavati, Prasanta Kumar Ghosh

Towards Rehearsal-Free Multilingual ASR: A LoRA-based Case Study on Whisper
Tianyi Xu, Kaixun Huang, Pengcheng Guo, Yu Zhou, Longtao Huang, Hui Xue, Lei Xie

Exploring adaptation techniques of large speech foundation models for low-resource ASR: a case study on Northern Sámi
Yaroslav Getman, Tamas Grosz, Katri Hiovain-Asikainen, Mikko Kurimo

Learn and Don't Forget: Adding a New Language to ASR Foundation Models
Mengjie Qian, Siyuan Tang, Rao Ma, Kate M. Knill, Mark J.F. Gales


Training Methods, Self-Supervised Learning, Adaptation


MSRS: Training Multimodal Speech Recognition Models from Scratch with Sparse Mask Optimization
Adriana Fernandez-Lopez, Honglie Chen, Pingchuan Ma, Lu Yin, Qiao Xiao, Stavros Petridis, Shiwei Liu, Maja Pantic

Speech and Language Recognition with Low-rank Adaptation of Pretrained Models
Amrutha Prasad, Srikanth Madikeri, Driss Khalil, Petr Motlicek, Christof Schuepbach

Convolution-Augmented Parameter-Efficient Fine-Tuning for Speech Recognition
Kwangyoun Kim, Suwon Shon, Yi-Te Hsu, Prashant Sridhar, Karen Livescu, Shinji Watanabe

LASER: Learning by Aligning Self-supervised Representations of Speech for Improving Content-related Tasks
Amit Meghanani, Thomas Hain

Self-Train Before You Transcribe
Robert Flynn, Anton Ragni

Unsupervised Online Continual Learning for Automatic Speech Recognition
Steven Vander Eeckt, Hugo Van hamme

Dual-path Adaptation of Pretrained Feature Extraction Module for Robust Automatic Speech Recognition
Hao Shi, Tatsuya Kawahara

Hierarchical Multi-Task Learning with CTC and Recursive Operation
Nahomi Kusunoki, Yosuke Higuchi, Tetsuji Ogawa, Tetsunori Kobayashi

Boosting CTC-based ASR using inter-layer attention-based CTC loss
Keigo Hojo, Yukoh Wakabayashi, Kengo Ohta, Atsunori Ogawa, Norihide Kitaoka

Self-training ASR Guided by Unsupervised ASR Teacher
Hyung Yong Kim, Byeong-Yeol Kim, Yunkyu Lim, Jihwan Park, Shukjae Choi, Yooncheol Ju, Jinseok Park, Youshin Lim, Seung Woo Yu, Hanbin Lee, Shinji Watanabe

Personality-memory Gated Adaptation: An Efficient Speaker Adaptation for Personalized End-to-end Automatic Speech Recognition
Yue Gu, Zhihao Du, Shiliang Zhang, Jiqing Han, Yongjun He

Speaker Personalization for Automatic Speech Recognition using Weight-Decomposed Low-Rank Adaptation
George Joseph, Arun Baby

Online Subloop Search via Uncertainty Quantization for Efficient Test-Time Adaptation
Jae-Hong Lee, Sang-Eon Lee, Dong-Hyun Kim, DoHee Kim, Joon-Hyuk Chang

ROAR: Reinforcing Original to Augmented Data Ratio Dynamics for Wav2vec2.0 Based ASR
Vishwanath Pratap Singh, Federico Malato, Ville Hautamäki, Md. Sahidullah, Tomi Kinnunen

Online Knowledge Distillation of Decoder-Only Large Language Models for Efficient Speech Recognition
Jeehye Lee, Hyeji Seo


Speech Science, Speech Technology, and Gender (Special Session)


Challenges of German Speech Recognition: A Study on Multi-ethnolectal Speech Among Adolescents
Martha Schubert, Daniel Duran, Ingo Siegert

Just Because We Camp, Doesn't Mean We Should: The Ethics of Modelling Queer Voices.
Atli Sigurgeirsson, Eddie L. Ungless

Automatic Classification of News Subjects in Broadcast News: Application to a Gender Bias Representation Analysis
Valentin Pelloin, Léna Dodson, Émile Chapuis, Nicolas Hervé, David Doukhan

Gender Representation in TV and Radio: Automatic Information Extraction methods versus Manual Analyses
David Doukhan, Lena Dodson, Manon Conan, Valentin Pelloin, Aurélien Clamouse, Mélina Lepape, Géraldine Van Hille, Cécile Méadel, Marlène Coulomb-Gully

Acoustic Effects of Facial Feminisation Surgery on Speech and Singing: A Case Study
Cliodhna Hughes, Guy Brown, Ning Ma, Nicola Dibben

An inclusive approach to creating a palette of synthetic voices for gender diversity
Eva Szekely, Maxwell Hope

Speech After Gender: A Trans-Feminine Perspective on Next Steps for Speech Science and Technology
Robin Netzorg, Alyssa Cote, Sumi Koshin, Klo Vivienne Garoute, Gopala Krishna Anumanchipalli

Voice Quality Variation in AAE: An Additional Challenge for Addressing Bias in ASR Models?
Li-Fang Lai, Nicole Holliday

Articulatory Configurations across Genders and Periods in French Radio and TV archives
Benjamin Elie, David Doukhan, Rémi Uro, Lucas Ondel-Yang, Albert Rilliard, Simon Devauchelle

On the Encoding of Gender in Transformer-based ASR Representations
Aravind Krishnan, Badr M. Abdullah, Dietrich Klakow


Computational Models of Human Language Acquisition, Perception, and Production (Special Session)


Information-theoretic hypothesis generation of relative cue weighting for the voicing contrast
Annika Heuser, Jianjing Kuang

Neurocomputational model of speech recognition for pathological speech detection: a case study on Parkinson's disease speech detection
Sevada Hovsepyan, Mathew Magimai.-Doss

Simulating articulatory trajectories with phonological feature interpolation
Angelo Ortiz Tandazo, Thomas Schatz, Thomas Hueber, Emmanuel Dupoux

A Pilot Study of GSLM-based Simulation of Foreign Accentuation Only Using Native Speech Corpora
Kentaro Onda, Joonyong Park, Nobuaki Minematsu, Daisuke Saito

Dirichlet process mixture model based on topologically augmented signal representation for clustering infant vocalizations
Guillem Bonafos, Clara Bourot, Pierre Pudlo, Jean-Marc Freyermuth, Laurence Reboul, Samuel Tronçon, Arnaud Rey

A data-driven model of acoustic speech intelligibility for optimization-based models of speech production
Benjamin Elie, Juraj Simko, Alice Turk

The Difficulty and Importance of Estimating the Lower and Upper Bounds of Infant Speech Exposure
Joseph Coffey, Okko Räsänen, Camila Scaff, Alejandrina Cristia

Spoken-Term Discovery using Discrete Speech Units
Benjamin van Niekerk, Julian Zaïdi, Marc-André Carbonneau, Herman Kamper

Orthogonality and isotropy of speaker and phonetic information in self-supervised speech representations
Mukhtar Mohamed, Oli Danyi Liu, Hao Tang, Sharon Goldwater


Show and Tell 3


OCEAN-AI: open multimodal framework for personality traits assessment and HR-processes automatization
Elena Ryumina, Dmitry Ryumin, Alexey Karpov

VoxMed: one-step respiratory disease classifier using digital stethoscope sounds
Paridhi Mundra, Manik Sharma, Yashwardhan Chaudhuri, Orchid Chetia Phukan, Arun Balaji Buduru

AVR: synergizing foundation models for audio-visual humor detection
Sarthak Sharma, Orchid Chetia Phukan, Drishti Singh, Arun Balaji Buduru, Rajesh Sharma

ASGIR: audio spectrogram transformer guided classification and information retrieval for birds
Yashwardhan Chaudhuri, Paridhi Mundra, Arnesh Batra, Orchid Chetia Phukan, Arun Balaji Buduru

PERSONA: an application for emotion recognition, gender recognition and age estimation
Devyani Koshal, Orchid Chetia Phukan, Sarthak Jain, Arun Balaji Buduru, Rajesh Sharma

NeuRO: an application for code-switched autism detection in children
Mohd Mujtaba Akhtar, Girish, Orchid Chetia Phukan, Muskaan Singh

ComFeAT: combination of neural and spectral features for improved depression detection
Orchid Chetia Phukan, Sarthak Jain, Shubham Singh, Muskaan Singh, Arun Balaji Buduru, Rajesh Sharma

The reasonable effectiveness of speaker embeddings for violence detection
Sarthak Jain, Orchid Chetia Phukan, Arun Balaji Buduru, Rajesh Sharma

ATTEST: an analytics tool for the testing and evaluation of speech technologies
Dmitrii Obukhov, Marcel de Korte, Andrey Adaschik

PhoneViz: exploring alignments at a glance
Margot Masson, Erfan A. Shams, Iona Gessinger, Julie Carson-Berndsen

Gryannote open-source speaker diarization labeling tool
Clément Pages, Hervé Bredin

A toolkit for joint speaker diarization and identification with application to speaker-attributed ASR
Giovanni Morrone, Enrico Zovato, Fabio Brugnara, Enrico Sartori, Leonardo Badino


Cross-Lingual and Multilingual Processing


A Parameter-efficient Language Extension Framework for Multilingual ASR
Wei Liu, Jingyong Hou, Dong Yang, Muyong Cao, Tan Lee

LoRA-Whisper: Parameter-Efficient and Extensible Multilingual ASR
Zheshu Song, Jianheng Zhuo, Yifan Yang, Ziyang Ma, Shixiong Zhang, Xie Chen

mHuBERT-147: A Compact Multilingual HuBERT Model
Marcely Zanon Boito, Vivek Iyer, Nikolaos Lagos, Laurent Besacier, Ioan Calapodescu

All Ears: Building Self-Supervised Learning based ASR models for Indian Languages at scale
Vasista Sai Lodagala, Abhishek Biswas, Shoutrik Das, Jordan F, S Umesh

A Unified Approach to Multilingual Automatic Speech Recognition with Improved Language Identification for Indic Languages
Nikhil Jakhar, Sudhanshu Srivastava, Arun Baby

Integrating Speech Self-Supervised Learning Models and Large Language Models for ASR
Ling Dong, Zhengtao Yu, Wenjun Wang, Yuxin Huang, Shengxiang Gao, Guojiang Zhou

On the Effects of Heterogeneous Data Sources on Speech-to-Text Foundation Models
Jinchuan Tian, Yifan Peng, William Chen, Kwanghee Choi, Karen Livescu, Shinji Watanabe

Less is More: Accurate Speech Recognition & Translation without Web-Scale Data
Krishna C. Puvvada, Piotr Żelasko, He Huang, Oleksii Hrinchuk, Nithin Rao Koluguri, Kunal Dhawan, Somshubra Majumdar, Elena Rastorgueva, Zhehuai Chen, Vitaly Lavrukhin, Jagadeesh Balam, Boris Ginsburg

The Greek podcast corpus: Competitive speech models for low-resourced languages with weakly supervised data
Georgios Paraskevopoulos, Chara Tsoukala, Athanasios Katsamanis, Vassilis Katsouros

Speech Recognition for Greek Dialects: A Challenging Benchmark
Socrates Vakirtzian, Chara Tsoukala, Stavros Bompolas, Katerina Mouzou, Vivian Stamou, Georgios Paraskevopoulos, Antonios Dimakis, Stella Markantonatou, Angela Ralli, Antonios Anastasopoulos

LUPET: Incorporating Hierarchical Information Path into Multilingual ASR
Wei Liu, Jingyong Hou, Dong Yang, Muyong Cao, Tan Lee

Whispering in Norwegian: Navigating Orthographic and Dialectic Challenges
Per E Kummervold, Javier de la Rosa, Freddy Wetjen, Rolv-Arild Braaten, Per Erik Solberg

EFFUSE: Efficient Self-Supervised Feature Fusion for E2E ASR in Low Resource and Multilingual Scenarios
Tejes Srivastava, Jiatong Shi, William Chen, Shinji Watanabe

Enhancing Neural Transducer for Multilingual ASR with Synchronized Language Diarization
Amir Hussein, Desh Raj, Matthew Wiesner, Daniel Povey, Paola Garcia, Sanjeev Khudanpur

SC-MoE: Switch Conformer Mixture of Experts for Unified Streaming and Non-streaming Code-Switching ASR
Shuaishuai Ye, Shunfei Chen, Xinhui Hu, Xinkang Xu


Computational Resource Constrained ASR


Dynamic Data Pruning for Automatic Speech Recognition
Qiao Xiao, Pingchuan Ma, Adriana Fernandez-Lopez, Boqian Wu, Lu Yin, Stavros Petridis, Mykola Pechenizkiy, Maja Pantic, Decebal Constantin Mocanu, Shiwei Liu

Mitigating Overfitting in Structured Pruning of ASR Models with Gradient-Guided Parameter Regularization
Dong-Hyun Kim, Joon-Hyuk Chang

SparseWAV: Fast and Accurate One-Shot Unstructured Pruning for Large Speech Foundation Models
Tianteng Gu, Bei Liu, Hang Shao, Yanmin Qian

One-pass Multiple Conformer and Foundation Speech Systems Compression and Quantization Using An All-in-one Neural Model
Zhaoqing Li, Haoning Xu, Tianzi Wang, Shoukang Hu, Zengrui Jin, Shujie Hu, Jiajun Deng, Mingyu Cui, Mengzhe Geng, Xunying Liu

USM RNN-T model weights binarization
Oleg Rybakov, Dmitriy Serdyuk, Chengjian Zheng

DAISY: Data Adaptive Self-Supervised Early Exit for Speech Representation Models
Tzu-Quan Lin, Hung-yi Lee, Hao Tang

RepTor: Re-parameterizable Temporal Convolution for Keyword Spotting via Differentiable Kernel Search
Eunik Park, Daehyun Ahn, Hyungjun Kim

Global-Local Convolution with Spiking Neural Networks for Energy-efficient Keyword Spotting
Shuai Wang, Dehao Zhang, Kexin Shi, Yuchen Wang, Wenjie Wei, Jibin Wu, Malu Zhang

ED-sKWS: Early-Decision Spiking Neural Networks for Rapid, and Energy-Efficient Keyword Spotting
Zeyang Song, Qianhui Liu, Qu Yang, Yizhou Peng, Haizhou Li

A Small and Fast BERT for Chinese Medical Punctuation Restoration
Tongtao Ling, Yutao Lai, Lei Chen, Shilei Huang, Yi Liu


Responsible Speech Foundation Models (Special Session)


Speech foundation models in healthcare: Effect of layer selection on pathological speech feature prediction
Daniela A. Wiepert, Rene L. Utianski, Joseph R. Duffy, John L. Stricker, Leland R. Barnard, David T. Jones, Hugo Botha

Outlier Reduction with Gated Attention for Improved Post-training Quantization in Large Sequence-to-sequence Speech Foundation Models
Dominik Wagner, Ilja Baumann, Korbinian Riedhammer, Tobias Bocklet

Unveiling Biases while Embracing Sustainability: Assessing the Dual Challenges of Automatic Speech Recognition Systems
Ajinkya Kulkarni, Atharva Kulkarni, Miguel Couceiro, Isabel Trancoso

Emo-bias: A Large Scale Evaluation of Social Bias on Speech Emotion Recognition
Yi-Cheng Lin, Haibin Wu, Huang-Cheng Chou, Chi-Chun Lee, Hung-yi Lee

On the social bias of speech self-supervised models
Yi-Cheng Lin, Tzu-Quan Lin, Hsi-Che Lin, Andy T. Liu, Hung-yi Lee

Self-supervised Speech Representations Still Struggle with African American Vernacular English
Kalvin Chang, Yi-Hui Chou, Jiatong Shi, Hsuan-Ming Chen, Nicole Holliday, Odette Scharenborg, David R. Mortensen

Can you Remove the Downstream Model for Speaker Recognition with Self-Supervised Speech Features?
Zakaria Aldeneh, Takuya Higuchi, Jee-weon Jung, Skyler Seto, Tatiana Likhomanenko, Stephen Shum, Ahmed Hussen Abdelaziz, Shinji Watanabe, Barry-John Theobald

Empowering Whisper as a Joint Multi-Talker and Target-Talker Speech Recognition System
Lingwei Meng, Jiawen Kang, Yuejiao Wang, Zengrui Jin, Xixin Wu, Xunying Liu, Helen Meng


Acoustic Event Detection, Segmentation and Classification


FastAST: Accelerating Audio Spectrogram Transformer via Token Merging and Cross-Model Knowledge Distillation
Swarup Ranjan Behera, Abhishek Dhiman, Karthik Gowda, Aalekhya Satya Narayani

LungAdapter: Efficient Adapting Audio Spectrogram Transformer for Lung Sound Classification
Li Xiao, Lucheng Fang, Yuhong Yang, Weiping Tu

ElasticAST: An Audio Spectrogram Transformer for All Length and Resolutions
Jiu Feng, Mehmet Hamza Erol, Joon Son Chung, Arda Senocak

Robust Laughter Segmentation with Automatic Diverse Data Synthesis
Taisei Omine, Kenta Akita, Reiji Tsuruno

Explainable by-design Audio Segmentation through Non-Negative Matrix Factorization and Probing
Martin Lebourdais, Théo Mariotte, Antonio Almudévar, Marie Tahon, Alfonso Ortega

Predicting Heart Activity from Speech using Data-driven and Knowledge-based features
Gasser Elbanna, Zohreh Mostaani, Mathew Magimai.-Doss

Measuring acoustic dissimilarity of hierarchical markers in task-oriented dialogue with MFCC-based dynamic time warping
Natalia Morozova, Guanghao You, Sabine Stoll, Adrian Bangerter

Comparative Analysis of Personalized Voice Activity Detection Systems: Assessing Real-World Effectiveness
Sai Srujana Buddi, Satyam Kumar, Utkarsh Sarawgi, Vineet Garg, Shivesh Ranjan, Ognjen Rudovic, Ahmed Hussen Abdelaziz, Saurabh Adya

Generalized Fake Audio Detection via Deep Stable Learning
Zhiyong Wang, Ruibo Fu, Zhengqi Wen, Yuankun Xie, Yukun Liu, Xiaopeng Wang, Xuefei Liu, Yongwei Li, Jianhua Tao, Xin Qi, Yi Lu, Shuchen Shi

Multimodal Large Language Models with Fusion Low Rank Adaptation for Device Directed Speech Detection
Shruti Palaskar, Ognjen Rudovic, Sameer Dharur, Florian Pesce, Gautam Krishna, Aswin Sivaraman, Jack Berkowitz, Ahmed Hussen Abdelaziz, Saurabh Adya, Ahmed Tewfik

CtrSVDD: A Benchmark Dataset and Baseline Analysis for Controlled Singing Voice Deepfake Detection
Yongyi Zang, Jiatong Shi, You Zhang, Ryuichi Yamamoto, Jionghao Han, Yuxun Tang, Shengyuan Xu, Wenxiao Zhao, Jing Guo, Tomoki Toda, Zhiyao Duan

Fully Few-shot Class-incremental Audio Classification Using Expandable Dual-embedding Extractor
Yongjie Si, Yanxiong Li, Jialong Li, Jiaxin Tan, Qianhua He

Multi-label Bird Species Classification from Field Recordings using Mel_Graph-GCN Framework
Noumida A, Rajeev Rajan


Noise, Far-Field, Multi-Talker, Enhancement, Audio Classification


RIR-SF: Room Impulse Response Based Spatial Feature for Target Speech Recognition in Multi-Channel Multi-Speaker Scenarios
Yiwen Shao, Shi-Xiong Zhang, Dong Yu

Multi-Channel Multi-Speaker ASR Using Target Speaker’s Solo Segment
Yiwen Shao, Shi-Xiong Zhang, Yong Xu, Meng Yu, Dong Yu, Daniel Povey, Sanjeev Khudanpur

Transcription-Free Fine-Tuning of Speech Separation Models for Noisy and Reverberant Multi-Speaker Automatic Speech Recognition
William Ravenscroft, George Close, Stefan Goetze, Thomas Hain, Mohammad Soleymanpour, Anurag Chowdhury, Mark C. Fuhs

NOTSOFAR-1 Challenge: New Datasets, Baseline, and Tasks for Distant Meeting Transcription
Alon Vinnikov, Amir Ivry, Aviv Hurvitz, Igor Abramovski, Sharon Koubi, Ilya Gurvich, Shai Peer, Xiong Xiao, Benjamin Martinez Elizalde, Naoyuki Kanda, Xiaofei Wang, Shalev Shaer, Stav Yagev, Yossi Asher, Sunit Sivasankaran, Yifan Gong, Min Tang, Huaming Wang, Eyal Krupka

Enhanced ASR Robustness to Packet Loss with a Front-End Adaptation Network
Yehoshua Dissen, Shiry Yonash, Israel Cohen, Joseph Keshet

Hold Me Tight: Stable Encoder-Decoder Design for Speech Enhancement
Daniel Haider, Felix Perfler, Vincent Lostanlen, Martin Ehler, Peter Balazs

DGSRN: Noise-Robust Speech Recognition Method with Dual-Path Gated Spectral Refinement Network
Wenjun Wang, Shangbin Mo, Ling Dong, Zhengtao Yu, Junjun Guo, Yuxin Huang

Towards Robust Few-shot Class Incremental Learning in Audio Classification using Contrastive Representation
Riyansha Singh, Parinita Nema, Vinod K Kurmi

Bird Whisperer: Leveraging Large Pre-trained Acoustic Model for Bird Call Classification
Muhammad Umer Sheikh, Hassan Abid, Bhuiyan Sanjid Shafique, Asif Hanif, Muhammad Haris Khan

SpeakerBeam-SS: Real-time Target Speaker Extraction with Lightweight Conv-TasNet and State Space Modeling
Hiroshi Sato, Takafumi Moriya, Masato Mimura, Shota Horiguchi, Tsubasa Ochiai, Takanori Ashihara, Atsushi Ando, Kentaro Shinayama, Marc Delcroix

wTIMIT2mix: A Cocktail Party Mixtures Database to Study Target Speaker Extraction for Normal and Whispered Speech
Marvin Borsdorf, Zexu Pan, Haizhou Li, Tanja Schultz


Connecting Speech-science and Speech-technology for Children’s Speech (Special Session)


Preliminary Investigation of Psychometric Properties of a Novel Multimodal Dialog Based Affect Production Task in Children and Adolescents with Autism
Carly Demopoulos, Linnea Lampinen, Cristian Preciado, Hardik Kothare, Vikram Ramanarayanan

Training speech-breathing coordination in computer-assisted reading
Delphine Charuau, Andrea Briglia, Erika Godde, Gérard Bailly

How Does Alignment Error Affect Automated Pronunciation Scoring in Children's Speech?
Prad Kadambi, Tristan Mahr, Lucas Annear, Henry Nomeland, Julie Liss, Katherine Hustad, Visar Berisha

Examining Vocal Tract Coordination in Childhood Apraxia of Speech with Acoustic-to-Articulatory Speech Inversion Feature Sets
Nina R. Benway, Jonathan L. Preston, Carol Espy-Wilson

Children’s Speech Recognition through Discrete Token Enhancement
Vrunda N. Sukhadia, Shammur Absar Chowdhury

Bridging Child-Centered Speech Language Identification and Language Diarization via Phonetics
Yujia Wang, Hexin Liu, Leibny Paola Garcia

Reading Miscue Detection in Primary School through Automatic Speech Recognition
Lingyun Gao, Cristian Tejedor-Garcia, Helmer Strik, Catia Cucchiarini

Automatic Evaluation of a Sentence Memory Test for Preschool Children
Ilja Baumann, Nicole Unger, Dominik Wagner, Korbinian Riedhammer, Tobias Bocklet

Enhancing Child Vocalization Classification with Phonetically-Tuned Embeddings for Assisting Autism Diagnosis
Jialu Li, Mark Hasegawa-Johnson, Karrie Karahalios

Self-Supervised Models for Phoneme Recognition: Applications in Children's Speech for Reading Learning
Lucas Block Medin, Thomas Pellegrini, Lucile Gelin

Benchmarking Children's ASR with Supervised and Self-supervised Speech Foundation Models
Ruchao Fan, Natarajan Balaji Shankar, Abeer Alwan

Introduction To Partial Fine-tuning: A Comprehensive Evaluation Of End-to-end Children’s Automatic Speech Recognition Adaptation
Thomas Rolland, Alberto Abad

Improving child speech recognition with augmented child-like speech
Yuanyuan Zhang, Zhengjun Yue, Tanvina Patel, Odette Scharenborg

Mixed Children/Adult/Childrenized Fine-Tuning for Children’s ASR: How to Reduce Age Mismatch and Speaking Style Mismatch
Thomas Graave, Zhengyang Li, Timo Lohrenz, Tim Fingscheidt

Exploring Speech Foundation Models for Speaker Diarization in Child-Adult Dyadic Interactions
Anfeng Xu, Kevin Huang, Tiantian Feng, Lue Shen, Helen Tager-Flusberg, Shrikanth Narayanan




Keynote 1 ISCA Medallist

L2 Speech, Bilingualism and Code-Switching

Speaker Diarization 1

Speech and Audio Analysis and Representations

Acoustic Event Detection and Classification 2

Detection and Classification of Bioacoustic Signals

Acoustic Echo Cancellation

Speech Synthesis: Voice Conversion 1

Neural Network Architectures for ASR 2

Decoding Algorithms

Pronunciation Assessment

Spoken Language Processing

Spoken Machine Translation 2

Biosignal-enabled Spoken Communication

Individual and Social Factors in Phonetics

Paralinguistics

Speaker Recognition: Adversarial and Spoofing Attacks

Audio Event Detection and Classification 1

Source Separation 2

Noise Reduction, Dereverberation, and Echo Cancellation

Computationally-Efficient Speech Enhancement

Zero-shot TTS

Noise Robustness, Far-Field, and Multi-Talker ASR

Contextual Biasing and Adaptation

Spoken Language Understanding

Spoken Machine Translation 1

Hearing Disorders

Speech Disorders 2

TAUKADIAL Challenge: Speech-Based Cognitive Assessment in Chinese and English (Special Session)

Show and Tell 1

Keynote 2

Phonetics and Phonology of Second Language Acquisition

Corpora-based Approaches in Automatic Emotion Recognition

Analysis of Speakers States and Traits

Spoofing and Deepfake Detection

Audio Captioning, Tagging, and Audio-Text Retrieval

Generative Speech Enhancement

Speech Synthesis: Evaluation

Multilingual ASR

General Topics in ASR

Spoken Language Understanding

Speech and Multimodal Resources

Pathological Speech Analysis 1

Speech and Language in Health: from Remote Monitoring to Medical Conversations - 1 (Special Session)

Speech and Brain

Innovative Methods in Phonetics and Phonology

Voice, Tones and F0

Emotion Recognition: Resources and Benchmarks

Speaker and Language Identification and Diarization

Audio-Text Retrieval

Speech Enhancement

Speech Coding

Speech Synthesis: Expressivity and Emotion

Speech Synthesis: Tools and Data

Speech Synthesis: Singing Voice Synthesis

LLM in ASR

Vision and Speech

Spoken Document Summarization

Speech and Language in Health: from Remote Monitoring to Medical Conversations - 2 (Special Session)

Show and Tell 2

Prosody

Foundational Models for Deepfake and Spoofed Speech Detection

Speaker Recognition 1

Source Separation 1

Audio-Visual and Generative Speech Enhancement

Speech Privacy and Bandwidth Expansion

Speech Synthesis: Prosody

Accented Speech, Prosodic Features, Dialect, Emotion, Sound Classification

Neural Network Adaptation

ASR and LLMs

Pathological Speech Analysis 3

Speech Disorders 3

Speech Recognition with Large Pretrained Speech Models for Under-represented Languages (Special Session)

Speech Processing Using Discrete Speech Units (Special Session)

Keynote 3

Databases and Progress in Methodology

Articulation, Convergence and Perception

Speech Emotion Recognition

Self-Supervised Models in Speaker Recognition

Speech Quality Assessment

Privacy and Security in Speech Communication 1

Speech Synthesis: Voice Conversion 2

Speech Synthesis: Text Processing

Training Methods, Self-Supervised Learning, Adaptation

Novel Architectures for ASR

Multimodality and Foundation Models

Spoken Dialogue Systems and Conversational Analysis 1

Speech Technology

Pathological Speech Analysis 2

Speech Science, Speech Technology, and Gender (Special Session)

Speech Production and Perception

Phonetics and Phonology: Segmentals and Suprasegmentals

Topics in Paralinguistics

Emotion Recognition: Fairness, Variability, Uncertainty

Speaker Verification

Spatial Audio and Acoustics

Generative Models for Speech and Audio

Speech and Audio Modelling

Multi-Channel Speech Enhancement

Speech Synthesis: Paradigms and Methods 1

Speech Synthesis: Paradigms and Methods 2

Neural Network Architectures for ASR 1

Error Correction and Rescoring

Spoken Language Understanding

Spoken Dialogue Systems and Conversational Analysis 2

Computational Models of Human Language Acquisition, Perception, and Production (Special Session)

Show and Tell 3

Phonetics, Phonology and Prosody

Segmentals

New Avenues in Emotion Recognition

Speaker Diarization 2

Speaker Recognition 2

Speech and Audio Analysis

Speech Quality and Intelligibility: Prediction and Enhancement

Speech Synthesis: Vocoders

ASR Model Training Methods

Cross-Lingual and Multilingual Processing

Speech Assessment

Question Answering from Speech and Spoken Dialogue Systems

Spoken Dialogue Systems and Conversational Analysis 3

Dysarthric Speech Assessment

Spoken Language Models for Universal Speech Processing (Special Session)

Keynote 4

L1/L2 Acquisition and Cross-Linguistic Factors

Speaker Stance, Emotion and Language-External Factors

Experimental Phonetics and Laboratory Phonology

Speaker recognition evaluation and resources

Speech Type Classification

Target Speaker Extraction

Speech Synthesis: Voice Conversion 3

Speech Synthesis: Paradigms and Methods 3

Privacy and Security in Speech Communication 2

Streaming ASR

Computational Resource Constrained ASR

Evaluation of Speech Technology Systems

Neural Network Training for Speech Recognition

Leveraging Large Language Models and Contextual Features for Phonetic Analysis (Special Session)

Responsible Speech Foundation Models (Special Session)

Multimodal Paralinguistics

Automatic Emotion Recognition

Self and Weakly-Labelled Speaker Verification

Acoustic Event Detection, Segmentation and Classification

Speech and Audio Modelling

Fake Audio Detection

Deep Learning-Based Speech Enhancement: Approaches, Scalability, and Evaluation

Speech Synthesis: Other Topics 1

Speech Synthesis: Other Topics 2

Speech synthesis: Cross-lingual and multilingual aspects

Noise, Far-Field, Multi-Talker, Enhancement, Audio Classification

Self-Supervised Learning for ASR

Spoken Term Detection and Speech Retrieval

Speech Disorders 1

Connecting Speech-science and Speech-technology for Children’s Speech (Special Session)

Show and Tell 4