Automatic speech recognition (ASR) is crucial for all users, but adapting it to speakers with Alzheimer’s disease (AD) is challenging because of irregular speech patterns and privacy concerns. Federated learning (FL), a privacy-preserving training paradigm, offers a solution. However, FL-based ASR suffers from acoustic and textual heterogeneity. While advanced model-based and cluster-based FL methods aim to address this issue, they lack a direct mechanism that accounts for both the high intra-speaker heterogeneity exhibited by individuals with AD and ASR-related properties. This study presents cluster-based personalized federated learning (CPFL), a strategy that mitigates heterogeneity by clustering ASR output tokens using the proposed CharDiv, a metric based on pause and word usage distributions. Evaluation on the ADReSS challenge dataset shows a 3.6% improvement in word error rate (WER). Analysis of per-cluster WER improvements and CharDiv distributions indicates reduced heterogeneity and highlights pause usage as a potential key factor in AD-oriented ASR.
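For readers unfamiliar with the evaluation metric, the following is a minimal sketch of the standard WER definition: the word-level edit distance (substitutions, deletions, and insertions) divided by the number of reference words. It is not the evaluation code used in this study. The function name `word_error_rate` and the whitespace tokenization are assumptions made for illustration, and any special handling of pause tokens is not modeled here.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Standard WER: word-level Levenshtein distance / number of reference words.

    Illustrative sketch only; tokenization by whitespace is an assumption.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the first i-1 reference words
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution_cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,                      # deletion of a reference word
                curr[j - 1] + 1,                  # insertion of a hypothesis word
                prev[j - 1] + substitution_cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    # One deleted word out of six reference words.
    print(word_error_rate("the cat sat on the mat", "the cat sat on mat"))
```

Because insertions count as errors, WER can exceed 1.0 when the hypothesis is much longer than the reference.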