ISCA Archive Interspeech 2025

Towards High-Quality LLM-Based Data for French Spontaneous Speech Simplification: an Exo-Refinement Approach

Lucía Ormaechea, Nikos Tsourakis, Pierrette Bouillon, Benjamin Lecouteux, Didier Schwab

This study explores the synthetic data generation capabilities of LLMs for French spontaneous speech simplification (S3), a low-resource NLP task. We introduce the exo-refinement approach, which builds on the self-reflect workflow but differs by using separate models for generation and evaluation. To address the limitations of single-model refinement, it integrates external feedback from distinct LLMs acting as judges, refining outputs along three task-specific dimensions. Comparing expert-produced simplifications of gold transcriptions with synthetic LLM outputs, results show that mistral-large outperforms all benchmarked models, including the MUSS baseline, and that mistral-small achieves competitive performance with few refinements. SARI results confirm that iterations improve simplicity gain without compromising semantic meaning, as shown by COMET scores. These findings support exo-refinement as a scalable method for synthetic data generation and future S3 model development.
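The exo-refinement loop described above (one generator model, separate judge models scoring along task-specific dimensions, with feedback driving further passes) can be sketched as follows. This is a minimal illustration with stubbed model calls; the function names, dimensions, thresholds, and toy "simplification" logic are assumptions for demonstration, not the paper's implementation.

```python
# Hypothetical sketch of an exo-refinement loop: a generator proposes a
# simplification, external judge models score it per dimension, and the
# low-scoring dimensions feed back into the next generation pass.

DIMENSIONS = ["simplicity", "meaning_preservation", "fluency"]  # illustrative

def generate(transcript, feedback=None):
    """Stub standing in for the generator LLM (e.g. mistral-small/large)."""
    text = transcript.lower().replace("euh, ", "")  # toy disfluency removal
    if feedback:
        # A real system would condition the next prompt on the judges' critique.
        text = text.rstrip(".") + "."
    return text

def judge(candidate, source):
    """Stub standing in for distinct judge LLMs: one score in [0, 1] per dimension."""
    changed = candidate != source
    return {dim: (1.0 if changed else 0.5) for dim in DIMENSIONS}

def exo_refine(transcript, max_iters=3, threshold=0.8):
    """Alternate generation and external judgment until all dimensions pass."""
    feedback = None
    for _ in range(max_iters):
        candidate = generate(transcript, feedback)
        scores = judge(candidate, transcript)
        if all(score >= threshold for score in scores.values()):
            break
        feedback = {d: s for d, s in scores.items() if s < threshold}
    return candidate, scores

simplified, scores = exo_refine("Euh, on y va.")
```

The key design point distinguishing this from single-model self-refinement is that `generate` and `judge` are separate models, so the evaluator's biases are decoupled from the generator's.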