Automatically predicting the mean opinion score (MOS) of synthesized speech without a reference signal using deep learning systems has been studied extensively in recent years and has shown strong results. However, the best previous systems are mostly based on self-supervised learning (SSL) models with up to hundreds of millions of parameters, making them unsuitable for mobile or embedded applications. In this paper, we propose MOSLight, a lightweight yet powerful non-SSL system for MOS prediction. We argue that 2D convolutions are inefficient for audio feature processing and ill-suited to tasks where training data are scarce. To build MOSLight, we employ depthwise separable dilated 1D convolutions and incorporate multi-task learning and non-strict frame-level score clipping. We conducted experiments on the Voice Conversion Challenge 2018 (VCC2018) and BVCC datasets. The results show that MOSLight remains highly effective despite being a lightweight model trained on limited data.
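To make the first building block named above concrete, the PyTorch sketch below shows one plausible form of a depthwise separable dilated 1D convolution. The class name, channel count, kernel size, dilation, and activation are illustrative assumptions for exposition, not MOSLight's actual configuration. The depthwise convolution applies one filter per channel (groups equal to channels), so its parameter count scales with C·k rather than C²·k, and a 1×1 pointwise convolution then mixes information across channels.

```python
import torch
import torch.nn as nn

class DepthwiseSeparableDilatedConv1d(nn.Module):
    """Illustrative depthwise separable dilated 1D conv block.

    Hypothetical hyperparameters; MOSLight's actual channel counts,
    dilation schedule, and normalization are defined in the paper.
    """

    def __init__(self, channels: int, kernel_size: int = 3, dilation: int = 1):
        super().__init__()
        # Padding chosen so the time dimension is preserved (stride 1, odd kernel).
        padding = (kernel_size - 1) // 2 * dilation
        # Depthwise: one filter per channel -> parameters scale with C * k.
        self.depthwise = nn.Conv1d(
            channels, channels, kernel_size,
            padding=padding, dilation=dilation, groups=channels,
        )
        # Pointwise: 1x1 conv mixes information across channels.
        self.pointwise = nn.Conv1d(channels, channels, kernel_size=1)
        self.act = nn.ReLU()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, channels, time), e.g. frame-level acoustic features.
        return self.act(self.pointwise(self.depthwise(x)))

if __name__ == "__main__":
    block = DepthwiseSeparableDilatedConv1d(channels=64, dilation=2)
    feats = torch.randn(8, 64, 400)  # 8 utterances, 64 feature dims, 400 frames
    print(block(feats).shape)        # torch.Size([8, 64, 400])
```

Frame-level score clipping, in turn, is often realized in prior MOS-prediction work as a clipped MSE in which frames whose predictions already fall within a tolerance τ of the utterance-level label incur no loss. The sketch below shows that common form only; the "non-strict" variant proposed in the paper may apply the threshold differently, and τ = 0.25 is an assumed value borrowed from prior work.

```python
import torch

def clipped_frame_mse(frame_scores: torch.Tensor, utt_mos: torch.Tensor,
                      tau: float = 0.25) -> torch.Tensor:
    """Clipped frame-level MSE (common prior-work form, not the paper's exact loss).

    frame_scores: (batch, time) per-frame MOS predictions.
    utt_mos:      (batch,) utterance-level MOS labels, broadcast to each frame.
    tau:          tolerance; frames within tau of the target contribute no loss.
    """
    target = utt_mos.unsqueeze(1).expand_as(frame_scores)
    sq_err = (frame_scores - target) ** 2
    mask = (torch.abs(frame_scores - target) > tau).float()
    return (mask * sq_err).mean()
```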