The rapid advancement of text-to-speech (TTS) and voice conversion (VC) technologies necessitates reliable evaluation of synthesized speech quality. In this paper, we propose a novel network, FUSE-MOS, which fuses learned latent representations of raw audio waveforms and their corresponding log-Mel spectrograms to estimate the posterior distribution of the Mean Opinion Score (MOS). Our method thereby learns a broader and more nuanced representation of the speech signal. At inference time, it predicts a point-estimate MOS value along with a measure of the uncertainty of that prediction. By leveraging the combined latent representation, FUSE-MOS achieves significant improvements over existing approaches on benchmark datasets. We further explore an uncertainty-based filtering strategy that discards low-confidence (high-uncertainty) samples, demonstrating that FUSE-MOS maintains strong performance even with reduced data.