In this paper, we compare and combine different approaches for instrumentally predicting the perceived quality of Text-to-Speech systems. First, a log-likelihood is computed by comparing features extracted from synthesized speech signals against models trained on features of natural speech. Second, parameters are extracted that capture quality-relevant degradations of the synthesized speech signal. Both approaches are combined and evaluated on auditorily evaluated synthetic speech databases from the Blizzard Challenges 2008 and 2009. The results show that auditory quality judgments can be predicted with sufficiently high accuracy and reliability. In particular, ranking different synthesizer systems by their quality comes within reach.
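
To illustrate the first approach, the following minimal sketch scores synthesized speech by its average log-likelihood under a model fitted to natural-speech features. It rests on assumptions not stated above: the model is a single diagonal-covariance Gaussian (the model actually used may differ), the feature frames are made-up toy values, and the function names `fit_diag_gaussian` and `avg_log_likelihood` are hypothetical.

```python
import math


def fit_diag_gaussian(frames):
    """Estimate per-dimension mean and variance from natural-speech feature frames."""
    n = len(frames)
    dim = len(frames[0])
    means = [sum(f[d] for f in frames) / n for d in range(dim)]
    variances = [
        # Floor the variance to avoid division by zero (assumed floor value).
        max(sum((f[d] - means[d]) ** 2 for f in frames) / n, 1e-6)
        for d in range(dim)
    ]
    return means, variances


def avg_log_likelihood(frames, means, variances):
    """Mean per-frame log-likelihood of the frames under the diagonal Gaussian."""
    total = 0.0
    for frame in frames:
        for x, m, v in zip(frame, means, variances):
            total += -0.5 * (math.log(2 * math.pi * v) + (x - m) ** 2 / v)
    return total / len(frames)


# Toy 2-dimensional feature frames (illustrative only, not data from the paper).
natural = [[0.0, 1.0], [0.2, 0.8], [-0.1, 1.1], [0.1, 0.9]]
close_synth = [[0.05, 0.95], [0.1, 1.0]]   # features close to natural speech
far_synth = [[1.5, -0.5], [2.0, -1.0]]     # features far from natural speech

means, variances = fit_diag_gaussian(natural)
score_close = avg_log_likelihood(close_synth, means, variances)
score_far = avg_log_likelihood(far_synth, means, variances)
```

Under these assumptions, a synthesizer whose features lie closer to the natural-speech model receives the higher score, so the scores could serve as one input when ranking systems.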