ISCA Archive Interspeech 2025

LHCP-ASR: An English Speech Corpus of High-Energy Particle Physics Talks for Narrow-Domain ASR Benchmarking

Jaume Santamaría-Jordà, Pablo Segovia-Martínez, Gonçal V. Garcés Díaz-Munío, Joan Albert Silvestre-Cerdà, Adrià Giménez, Rubén Gaspar Aparicio, René Fernández Sánchez, Jorge Civera, Albert Sanchis, Alfons Juan

We present LHCP-ASR, an English speech corpus of high-energy particle physics talks, comprising 235 hours of transcribed speech extracted from the 2020-2022 Large Hadron Collider Physics (LHCP) conferences, plus 1.5G tokens of in-domain text extracted from scientific documents. About 30 hours of conference talks were manually transcribed to build two reliable tasks for narrow-domain ASR benchmarking. The remaining conference talks (205 hours) were pseudo-labelled using a very competitive in-domain ASR system to build a dataset for training or adaptation purposes. This paper describes the creation of this dataset and provides first reference word error rate (WER) figures using OpenAI's Whisper models and our in-domain ASR system, achieving WERs of 13.6% and 15.0% on the two test sets. The corpus is publicly released under an open licence. We believe it will help meet the need in the field for new open, reliable, real-life and challenging ASR benchmarks.
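For readers less familiar with the metric, WER is conventionally the word-level edit distance (substitutions, deletions and insertions) between a reference transcript and an ASR hypothesis, divided by the number of reference words. The following is a minimal sketch of that standard definition, not the scoring pipeline used for the figures above. The function name `word_error_rate` and the whitespace tokenisation are assumptions made for illustration, and any text normalisation applied before scoring is not reproduced here.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Return WER as a percentage, using word-level Levenshtein distance.

    Assumption: both strings are already normalised and are tokenised
    by simple whitespace splitting.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # i deletions turn i reference words into an empty hypothesis
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr.append(min(substitution, deletion, insertion))
        prev = curr

    return 100.0 * prev[-1] / len(ref)


if __name__ == "__main__":
    ref = "the higgs boson decays to two photons"
    hyp = "the higs boson decays two photons"
    print(f"WER: {word_error_rate(ref, hyp):.1f}%")
```

In this toy example the hypothesis contains one substitution ("higs" for "higgs") and one deletion ("to"), giving 2 errors over 7 reference words, or about 28.6% WER. Because the denominator is the reference length, WER can exceed 100% when the hypothesis contains many insertions.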