In this work, we compared two input representations for estimating autism severity from speech signals. We analyzed 127 audio recordings of young children obtained during administration of the Autism Diagnostic Observation Schedule, 2nd edition (ADOS-2). Two sets of features were extracted from each recording: 1) hand-crafted features, comprising acoustic and prosodic descriptors, and 2) log-mel spectrograms, which provide a time-frequency representation of the signal. We examined two Convolutional Neural Network (CNN) architectures for each input and compared their autism severity estimation performance. The hand-crafted features yielded a lower prediction error (normalized RMSE) than the log-mel spectrograms in most of the examined configurations. Moreover, fusing the severity scores estimated from the two feature extraction methods yielded the best results, with both architectures exhibiting similar performance (Pearson R = 0.66, normalized RMSE = 0.24).
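The snippet below is a minimal sketch of the two input pipelines and the score-level fusion summarized above. It assumes librosa for log-mel spectrogram extraction, simple averaging as the fusion rule, and RMSE normalized by the range of the true severity scores; the actual hand-crafted feature set, CNN architectures, fusion strategy, and normalization convention used in the study are not specified here, so these choices are illustrative only.

```python
import numpy as np
import librosa
from scipy.stats import pearsonr


def log_mel_spectrogram(path, sr=16000, n_mels=64):
    """Load a recording and compute its log-mel (time-frequency) representation."""
    y, sr = librosa.load(path, sr=sr)
    mel = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=n_mels)
    return librosa.power_to_db(mel, ref=np.max)


def normalized_rmse(y_true, y_pred):
    """RMSE normalized by the range of the true scores (one common convention; assumed here)."""
    rmse = np.sqrt(np.mean((y_true - y_pred) ** 2))
    return rmse / (y_true.max() - y_true.min())


def fuse_and_evaluate(scores_handcrafted, scores_spectrogram, y_true):
    """Average the per-recording severity estimates of the two branches and report metrics.

    `scores_handcrafted` and `scores_spectrogram` stand in for the outputs of the
    two trained models, which are not reproduced in this sketch.
    """
    fused = (np.asarray(scores_handcrafted) + np.asarray(scores_spectrogram)) / 2.0
    y_true = np.asarray(y_true)
    r, _ = pearsonr(y_true, fused)
    return r, normalized_rmse(y_true, fused)
```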