Research on modeling subjective quality metrics has led to no-reference speech quality models that operate directly on utterance waveforms to predict the Mean Opinion Score (MOS). These models often rely on convolutional layers for local feature extraction and on embeddings from impractically large pretrained networks to improve generalization. We propose an attention-only model that uses Swin Transformer layers to extract local context features and standard transformer layers to extract global utterance features. The self-attention operator is well suited to sequence processing, and our lightweight design improves generalization on limited MOS datasets while making real-world deployment more practical. We train the network with a sequential self-teaching strategy to improve generalization when MOS labels are affected by noise in listener ratings. Experiments on three datasets confirm the effectiveness of our design and show improvements over baseline models.
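
As a rough illustration of the local-then-global attention idea, the following minimal sketch applies self-attention inside non-overlapping windows of frames and then across the whole utterance. It is not the proposed network: the function names (`self_attention`, `windowed_attention`, `utterance_embedding`), the identity query/key/value projections, the window size, and the mean pooling are all assumptions made for illustration. A real Swin Transformer layer additionally uses learned projections, multiple heads, and shifted windows.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(frames):
    """Scaled dot-product self-attention over a list of feature vectors.

    Assumption: queries, keys, and values are the inputs themselves
    (identity projections), so there are no learned parameters here.
    """
    dim = len(frames[0])
    outputs = []
    for query in frames:
        scores = [
            sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
            for key in frames
        ]
        weights = softmax(scores)
        outputs.append(
            [sum(w * value[d] for w, value in zip(weights, frames))
             for d in range(dim)]
        )
    return outputs


def windowed_attention(frames, window):
    """Local context: attend only within non-overlapping windows."""
    outputs = []
    for start in range(0, len(frames), window):
        outputs.extend(self_attention(frames[start:start + window]))
    return outputs


def utterance_embedding(frames, window=4):
    """Local windowed attention, then global attention, then mean pooling."""
    local = windowed_attention(frames, window)
    global_features = self_attention(local)
    dim = len(frames[0])
    return [
        sum(frame[d] for frame in global_features) / len(global_features)
        for d in range(dim)
    ]
```

In this sketch, a frame's windowed output depends only on the frames in its own window, whereas the global stage lets every frame influence every other. A regression head mapping the pooled vector to a MOS estimate is omitted.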