This paper proposes a novel paradigm of speech recognition that uses only supra-segmental features. Absolute properties of speech events, such as formants and spectra, are discarded completely; only the relative and differential properties of the events are extracted, as phonic contrasts. These phonic contrasts are treated as supra-segmental features, and they are shown mathematically not to carry non-linguistic information such as speaker identity, age, and gender. This result leads us to expect that speaker-independent speech recognition should be possible with reference models built from a single speaker's speech. Experiments on isolated vowel sequence recognition confirm this expectation and show that the new paradigm outperforms the conventional one, which uses more than four thousand speakers, even for noisy speech. Musical sounds are often heard by capturing only their contrasts and the structure those contrasts form, which suggests that the proposed paradigm hears speech as music.