Automatic speech quality assessment aims to train a model that automatically measures the performance of speech synthesis systems. The task is challenging, especially when the evaluation data come from a different domain than the training data. In this paper, we present a multi-task and transfer learning framework for predicting the mean opinion score (MOS) of synthetic speech from different domains. Specifically, the proposed framework consists of a common encoder shared across domains and two domain-specific decoders, one for in-domain data and one for out-of-domain data. A wav2vec2 model fine-tuned for phone recognition is used to initialize the shared encoder, so that it can exploit knowledge learned from a large amount of unlabeled data and from task-related labeled data. Experiments are conducted on the VoiceMOS Challenge dataset. The results show that the proposed system outperforms the baseline solutions in both in-domain and out-of-domain MOS prediction scenarios. Further, we show that a wav2vec2 encoder fine-tuned for phone recognition can be transferred to improve MOS prediction performance.
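
The shared-encoder, two-decoder layout described above can be illustrated with a minimal PyTorch sketch. Everything specific in it is an assumption made for illustration rather than a detail taken from this work: a small GRU stands in for the pretrained wav2vec2 encoder, each decoder is a single linear layer applied to mean-pooled frames, the training objective is a sum of per-domain L1 losses, and the names `SharedEncoderMOS` and `multitask_loss` are hypothetical.

```python
import torch
import torch.nn as nn


class SharedEncoderMOS(nn.Module):
    """Shared encoder with one MOS regression head per domain (illustrative only)."""

    def __init__(self, feat_dim: int = 40, hidden_dim: int = 64):
        super().__init__()
        # Assumption: a small GRU stands in for the pretrained wav2vec2 encoder
        # so that the sketch stays self-contained.
        self.encoder = nn.GRU(feat_dim, hidden_dim, batch_first=True)
        # Assumption: each domain-specific decoder is a single linear layer.
        self.decoders = nn.ModuleDict({
            "in_domain": nn.Linear(hidden_dim, 1),
            "out_of_domain": nn.Linear(hidden_dim, 1),
        })

    def forward(self, feats: torch.Tensor, domain: str) -> torch.Tensor:
        # feats has shape (batch, frames, feat_dim).
        hidden, _ = self.encoder(feats)   # (batch, frames, hidden_dim)
        pooled = hidden.mean(dim=1)       # average over frames
        # Route the pooled representation to the decoder for this domain.
        return self.decoders[domain](pooled).squeeze(-1)  # (batch,)


def multitask_loss(model: SharedEncoderMOS, batches: dict) -> torch.Tensor:
    """Sum per-domain L1 losses; `batches` maps domain name to (feats, mos)."""
    loss_fn = nn.L1Loss()
    return sum(loss_fn(model(x, d), y) for d, (x, y) in batches.items())
```

In this sketch, gradients from both domains flow into the same encoder parameters, while each decoder is updated only by the loss for its own domain. Replacing the GRU with a pretrained wav2vec2 encoder would correspond to the transfer-learning initialization described in the abstract.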