Sarcasm is a frequently used linguistic device expressed in a multitude of ways, through both acoustic cues (pitch, intonation, intensity, etc.) and visual cues (facial expression, eye gaze, etc.). While the cues used in the expression of sarcasm are well described in the literature, there is a striking paucity of attempts to perform automatic sarcasm detection in speech. To address this gap, we develop a methodology for implementing Inductive Transfer Learning (ITL) based on pre-trained Deep Convolutional Neural Networks (DCNNs) to detect sarcasm in speech. To this end, the multimodal dataset MUStARD is used as the target dataset in this study. The two pre-trained DCNN models selected are Xception and VGGish, which were pre-trained on large-scale image and audio datasets, respectively. Results show that VGGish, which is applied as a feature extractor in the experiment, performs better than Xception, whose convolutional and pooling layers are retrained. Both models outperform the baseline Support Vector Machine (SVM) model, achieving F-scores higher by 7% and 5%, respectively, in unimodal sarcasm detection in speech.
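To make the feature-extractor variant of ITL concrete, the sketch below illustrates the general pattern: frozen embeddings from a pre-trained audio network (here simulated with random 128-dimensional vectors standing in for VGGish-style embeddings, since the actual MUStARD features are not part of this abstract) are fed to a downstream classifier, analogous to the SVM baseline. This is a minimal illustration of the approach, not the paper's actual pipeline.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

rng = np.random.default_rng(0)

# Hypothetical stand-in for frozen VGGish embeddings: one 128-d vector
# per utterance. Real embeddings would come from the pre-trained DCNN;
# here the two classes are synthetic clusters for illustration only.
n = 200
X_sarcastic = rng.normal(loc=0.5, size=(n, 128))
X_neutral = rng.normal(loc=-0.5, size=(n, 128))
X = np.vstack([X_sarcastic, X_neutral])
y = np.array([1] * n + [0] * n)  # 1 = sarcastic, 0 = non-sarcastic

# The embeddings stay frozen; only the lightweight classifier is trained,
# which is the "feature extractor" flavour of inductive transfer learning.
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)
clf = SVC(kernel="rbf").fit(X_tr, y_tr)
print(f"test accuracy: {clf.score(X_te, y_te):.2f}")
```

In the fine-tuning variant (as with Xception above), the later convolutional and pooling layers of the network would instead be unfrozen and retrained on the target data rather than used as a fixed encoder.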