Spoken language transcripts generated by automatic speech recognition (ASR) often contain many disfluencies and lack punctuation. Restoring punctuation and removing disfluencies from these transcripts can facilitate downstream tasks such as machine translation, information extraction, and syntactic analysis [1]. Several studies have shown that the two tasks influence each other and have therefore modeled them jointly in a multi-task learning (MTL) framework [2, 3], which learns general representations in shared layers and separate representations in task-specific layers. However, dependencies between the tasks are normally ignored in the task-specific layers. To model these dependencies, we propose an attention-based structure in the task-specific layers of an MTL framework built on pretrained BERT, a state-of-the-art NLP model [4]. Experimental results on the English IWSLT dataset and the Switchboard dataset show that the proposed architecture outperforms both separate modeling and traditional MTL methods.
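
As a rough illustration of task-specific layers that attend to each other, the following NumPy sketch lets each task's token representations query the other task's representations before classification. It is not the proposed architecture itself: the abstract does not specify the attention form, so scaled dot-product attention is assumed here. A random matrix stands in for the shared BERT encoder output. All names (`init_params`, `cross_task_attention`, `mtl_forward`), the layer sizes, and the label counts (4 punctuation classes, 2 disfluency classes) are hypothetical choices made only for this example.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def init_params(hidden, task_dim, n_punct, n_disf, seed=0):
    """Random weights for the two task-specific branches (illustrative only)."""
    rng = np.random.default_rng(seed)
    scale = 0.1
    return {
        # Task-specific projections of the shared representation.
        "W_punct": rng.normal(scale=scale, size=(hidden, task_dim)),
        "W_disf": rng.normal(scale=scale, size=(hidden, task_dim)),
        # Classifiers over [own states ; context from the other task].
        "U_punct": rng.normal(scale=scale, size=(2 * task_dim, n_punct)),
        "U_disf": rng.normal(scale=scale, size=(2 * task_dim, n_disf)),
    }


def cross_task_attention(query_states, other_states):
    """Assumed scaled dot-product attention: one task attends to the other."""
    d = query_states.shape[-1]
    scores = query_states @ other_states.T / np.sqrt(d)  # (tokens, tokens)
    weights = softmax(scores, axis=-1)
    return weights @ other_states, weights


def mtl_forward(shared, params):
    """Map shared token states to per-token label distributions for both tasks."""
    h_punct = np.tanh(shared @ params["W_punct"])
    h_disf = np.tanh(shared @ params["W_disf"])

    # Each branch gathers context from the other branch, so the task
    # dependency is modeled inside the task-specific layers.
    ctx_punct, _ = cross_task_attention(h_punct, h_disf)
    ctx_disf, _ = cross_task_attention(h_disf, h_punct)

    logits_punct = np.concatenate([h_punct, ctx_punct], axis=-1) @ params["U_punct"]
    logits_disf = np.concatenate([h_disf, ctx_disf], axis=-1) @ params["U_disf"]
    return softmax(logits_punct), softmax(logits_disf)


if __name__ == "__main__":
    tokens, hidden = 6, 16
    # Stand-in for the shared encoder output (BERT in the paper).
    shared = np.random.default_rng(1).normal(size=(tokens, hidden))
    params = init_params(hidden=hidden, task_dim=8, n_punct=4, n_disf=2)
    p_punct, p_disf = mtl_forward(shared, params)
    print(p_punct.shape, p_disf.shape)
```

In a traditional MTL setup the two branches would classify from `h_punct` and `h_disf` alone; the only change in this sketch is the concatenated context vector drawn from the other task.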