Laughter and fillers are common phenomena in speech and play an important role in communication. In this study, we present Deep Neural Network (DNN) and Convolutional Neural Network (CNN) based systems to distinguish non-verbal cues (laughter and fillers) from verbal speech in naturalistic audio. We propose improvements over the deep learning system presented in [1]. In particular, we propose a simple method to combine spectral features with pitch information, capturing both prosodic and spectral cues for fillers/laughter. Additionally, we propose using a wider time context for feature extraction so that the temporal evolution of the spectral and prosodic structure can also be exploited for classification. Furthermore, we propose using a CNN for classification. The new method is evaluated on conversational telephony speech (CTS, drawn from Switchboard and Fisher) data and the UT-Opinion corpus. Our results show that the new system improves the AUC (area under the curve) metric over the baseline by 8.15% and 11.9% absolute for laughter, and 4.85% and 6.01% absolute for fillers, on CTS and UT-Opinion data, respectively. Finally, we analyze the results to explain the performance difference between traditional CTS data and naturalistic audio (UT-Opinion), and identify challenges that must be addressed for systems to perform better on practical data.
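The two feature-level proposals above (appending pitch to spectral features, then widening the time context) can be sketched as a simple frame-splicing step. This is a minimal illustration, not the paper's implementation: the 40-dimensional spectral features, the single pitch value per frame, and the ±15-frame context are all hypothetical placeholder parameters.

```python
import numpy as np

def splice_with_context(features, context=15):
    """Stack each frame with its +/- `context` neighbors (edges padded by
    repeating the first/last frame), so a classifier sees the time
    evolution of the features rather than a single frame.
    features: (num_frames, feat_dim) array."""
    padded = np.pad(features, ((context, context), (0, 0)), mode="edge")
    # Each output row concatenates 2*context + 1 consecutive frames.
    return np.stack([padded[i:i + 2 * context + 1].reshape(-1)
                     for i in range(features.shape[0])])

# Hypothetical per-frame inputs: 40-dim spectral features + 1-dim pitch.
spectral = np.random.randn(100, 40)
pitch = np.random.randn(100, 1)
combined = np.hstack([spectral, pitch])          # spectral + pitch per frame
spliced = splice_with_context(combined, context=15)
print(spliced.shape)  # (100, 41 * 31) = (100, 1271)
```

The spliced rows would then serve as input vectors to a DNN, or be reshaped back into a (frames x features) patch for a CNN.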