We developed a voice-based, self-paced cursor control task to collect intracranial neural data during isolated utterances of phonemes, namely vowel, nasal and fricative sounds. Two patients implanted with intracranial depth electrodes for clinical epilepsy monitoring performed closed-loop voice-based cursor control driven by real-time processing of microphone input. In post-hoc analyses, we searched for neural features that correlated with the occurrence of speech sounds regardless of phoneme identity, or with specific phonemes. In line with previous studies, we observed onset and sustained responses to speech sounds at multiple recording sites within the superior temporal gyrus. Based on differential patterns of activation in narrow frequency bands up to 200 Hz, we tracked voice activity with 91% accuracy (chance level: 50%) and classified individual utterances into one of five phonemes with 68% accuracy (chance level: 20%). We propose that our framework could be extended to additional phonemes to better characterize the neurophysiological mechanisms underlying the production and perception of speech sounds in the absence of language context. More generally, our findings provide additional evidence to inform the development of speech brain-computer interfaces using intracranial electrodes.
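As an illustration of what "activation in narrow frequency bands up to 200 Hz" could look like as a feature vector, the following is a minimal sketch and not the analysis pipeline used in this work. The function names `band_powers` and `chance_level`, the 10 Hz band width, and the Hann-windowed periodogram are all assumptions chosen for illustration; the chance-level helper further assumes balanced classes, which is consistent with the 50% and 20% values quoted above.

```python
import numpy as np


def band_powers(signal, fs, band_width=10.0, f_max=200.0):
    """Mean spectral power in consecutive narrow bands from 0 Hz to f_max.

    Hypothetical helper: band_width and the periodogram estimate are
    illustrative assumptions, not values taken from the study.
    Requires fs > 2 * f_max and a segment long enough that every band
    contains at least one frequency bin.
    """
    signal = np.asarray(signal, dtype=float)
    # Remove the mean so a DC offset does not dominate the lowest band.
    signal = signal - signal.mean()
    # Hann window to limit spectral leakage between neighbouring bands.
    windowed = signal * np.hanning(signal.size)
    spectrum = np.abs(np.fft.rfft(windowed)) ** 2
    freqs = np.fft.rfftfreq(signal.size, d=1.0 / fs)
    # Band edges: 0, band_width, 2 * band_width, ..., f_max.
    edges = np.arange(0.0, f_max + band_width, band_width)
    powers = np.array([
        spectrum[(freqs >= lo) & (freqs < hi)].mean()
        for lo, hi in zip(edges[:-1], edges[1:])
    ])
    return edges, powers


def chance_level(n_classes):
    """Chance accuracy for a classifier, assuming balanced classes."""
    return 1.0 / n_classes


# Synthetic one-second segment containing a single 75 Hz component.
fs = 1000
t = np.arange(fs) / fs
segment = np.sin(2 * np.pi * 75 * t)
edges, powers = band_powers(segment, fs)
strongest_band = int(np.argmax(powers))
```

In such a scheme, one feature vector per electrode and time window would be passed to a classifier, either binary (voice activity versus silence) or five-way (phoneme identity); the choice of classifier is left open here.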