ISCA Archive Interspeech 2023

Investigating the Utility of Synthetic Data for Doctor-Patient Conversation Summarization

Siyuan Chen, Colin A. Grambow, Mojtaba Kadkhodaie Elyaderani, Alireza Sadeghi, Federico Fancellu, Thomas Schaaf

Large-scale pre-training has been a successful strategy for training transformer models. However, maintaining a large clinical dataset for pre-training is not always possible, and access to data in this domain can be time-limited and costly. We explore using synthetic data to pre-train sequence-to-sequence (seq2seq) transformer models that generate clinical notes from doctor-patient conversations (DoPaCos). Using a generative language model fine-tuned on authentic conversations, we created a synthetic DoPaCo dataset and used it, together with a corpus of clinical notes, to pre-train a Longformer-Encoder-Decoder (LED) model. Results show that pre-training on synthetic data yields performance on the downstream summarization task comparable to pre-training on authentic data. Pre-training on synthetic conversations first, followed by clinical notes, yields higher performance across most of our evaluation metrics.
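
To make the described setup concrete, the following is a minimal sketch (not the authors' code) of one seq2seq pre-training step for an LED model on a conversation-note pair, using the Hugging Face transformers library. The checkpoint name, maximum lengths, and the toy conversation/note pair are illustrative assumptions; a real run would iterate over the full synthetic corpus with a proper optimizer loop.

import torch
from transformers import LEDForConditionalGeneration, LEDTokenizer

tokenizer = LEDTokenizer.from_pretrained("allenai/led-base-16384")
model = LEDForConditionalGeneration.from_pretrained("allenai/led-base-16384")

# Hypothetical training pair: a synthetic conversation and a clinical note.
conversation = "Doctor: What brings you in today? Patient: I've had a dry cough for a week..."
clinical_note = "Chief complaint: cough. History: one week of dry cough, no fever..."

inputs = tokenizer(conversation, max_length=4096, truncation=True, return_tensors="pt")
labels = tokenizer(clinical_note, max_length=512, truncation=True, return_tensors="pt").input_ids

# LED expects a global attention mask; a common choice is global attention
# on the first token only.
global_attention_mask = torch.zeros_like(inputs.input_ids)
global_attention_mask[:, 0] = 1

outputs = model(
    input_ids=inputs.input_ids,
    attention_mask=inputs.attention_mask,
    global_attention_mask=global_attention_mask,
    labels=labels,
)
outputs.loss.backward()  # one gradient step; a full run would use a Trainer or optimizer loop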