Speech detection, whose goal is to distinguish speech from non-speech, is an important first step for audio analysis of media content. It remains challenging owing to the variety of sound sources present in media audio. In this work, we present a novel audio feature extraction method that reflects the acoustic characteristics of media audio in the time-frequency domain. Since the mix of harmonic and percussive components varies with the type of sound source, decomposing the signal into these two components yields audio features that better distinguish speech from non-speech. For evaluation, we use over 20 hours of drama that was manually annotated for speech detection, as well as 4 full-length movies, totaling over 8 hours, with annotations released to the research community. Experimental results with a deep neural network show superior performance of the proposed method on media audio.
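
To make the harmonic/percussive idea concrete, the following is a minimal sketch of one common decomposition technique, median filtering on a magnitude spectrogram. The abstract does not state which decomposition algorithm is used, so this choice is an assumption, and the names `median_filter_1d` and `hpss_soft_masks`, the kernel size, and the toy spectrogram are all hypothetical. Harmonic sounds appear as horizontal ridges (stable in time) and percussive sounds as vertical ridges (broadband, short in time), so a median across time enhances the former and a median across frequency enhances the latter.

```python
from statistics import median


def median_filter_1d(values, kernel=3):
    """Median-filter a sequence; the window is truncated at the edges."""
    half = kernel // 2
    n = len(values)
    return [median(values[max(0, i - half):min(n, i + half + 1)]) for i in range(n)]


def hpss_soft_masks(spec, kernel=3, eps=1e-10):
    """Return (harmonic_mask, percussive_mask) for a magnitude spectrogram.

    spec is a list of rows, one row per frequency bin, one column per frame.
    This is an illustrative sketch, not the method used in the paper.
    """
    n_freq, n_time = len(spec), len(spec[0])
    # Harmonic-enhanced spectrogram: median across time within each frequency bin.
    harm = [median_filter_1d(row, kernel) for row in spec]
    # Percussive-enhanced spectrogram: median across frequency within each frame.
    cols = [median_filter_1d([spec[f][t] for f in range(n_freq)], kernel)
            for t in range(n_time)]
    perc = [[cols[t][f] for t in range(n_time)] for f in range(n_freq)]
    h_mask = [[0.0] * n_time for _ in range(n_freq)]
    p_mask = [[0.0] * n_time for _ in range(n_freq)]
    for f in range(n_freq):
        for t in range(n_time):
            h2, p2 = harm[f][t] ** 2, perc[f][t] ** 2
            denom = h2 + p2 + eps  # eps avoids division by zero in silent bins
            h_mask[f][t] = h2 / denom
            p_mask[f][t] = p2 / denom
    return h_mask, p_mask


# Toy spectrogram (assumed data): a steady tone in bin 2 and a click at frame 3.
spec = [[1.0 if (f == 2 or t == 3) else 0.0 for t in range(7)] for f in range(5)]
h_mask, p_mask = hpss_soft_masks(spec)
```

Multiplying `spec` elementwise by each mask gives harmonic and percussive spectrograms, which could then feed a feature extractor. In this toy case the tone is assigned to the harmonic mask, the click to the percussive mask, and the bin where they cross is split evenly.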