A speech disfluency, such as a filled pause, repetition, or revision, disrupts the typical flow of speech. Disfluency modeling has grown as a research area because recent work has shown that disfluencies can help in assessing health conditions; for example, for individuals with cognitive impairment, changes in disfluencies may indicate worsening symptoms. However, work on disfluency modeling has focused heavily on detection and less on categorization, and work that has addressed categorization has struggled with two classes in particular: repetitions and revisions. In this paper, we evaluate how BERT (Bidirectional Encoder Representations from Transformers) compares to other models on disfluency detection and categorization. We also propose a second fine-tuning task in which BERT learns, via a triplet loss, to distance repetitions and revisions from their repairs. We find that both BERT and BERT with triplet loss outperform previous work on disfluency detection and categorization, particularly for repetitions and revisions. Finally, we present the first analysis of how these models can be fine-tuned on widely available disfluency data and then used off the shelf on small corpora of pathological speech.
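For reference, the standard triplet loss takes the form below; the reading of repetition/revision tokens as anchors and their repairs as negatives is an illustrative assumption here, not necessarily the exact formulation used in our fine-tuning task.

\[
\mathcal{L}(a, p, n) = \max\bigl(\, \lVert f(a) - f(p) \rVert_2 \;-\; \lVert f(a) - f(n) \rVert_2 \;+\; \alpha,\; 0 \bigr)
\]

where \(f(\cdot)\) denotes the contextual embedding, \(a\) is an anchor (e.g., a repetition or revision token), \(p\) is a positive example of the same class, \(n\) is a negative example (e.g., a token from the repair), and \(\alpha\) is a margin hyperparameter.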