Speech synthesis has reached levels of voice quality and naturalness close to those of human speech, thanks to the rapid evolution of the generative architectures deployed for neural text-to-speech (TTS). Many approaches have been proposed that leverage these models to encode speech style (i.e. prosody attributes) in order to transfer it to the generated speech. The most common acoustic features used for this purpose are spectrograms. However, is the whole frequency representation really necessary to learn speech attributes? To answer this question, in this work we propose the sparse pitch matrix (SPM), a sparse, binary representation of the pitch sub-band. Our assumption is that pitch alone is sufficient for the model to extrapolate the remaining prosody attributes. To study its impact, we performed an experiment built upon unsupervised global style tokens conditioning the Tacotron2 decoder. The tokens were fed with the encoded SPMs during training, similarly to the original approach. From the subsequent analysis we found that: 1) there are significant differences in many prosody attributes between tokens, and 2) each token, in isolation, provides acceptable levels of quality, intelligibility and naturalness, according to human evaluators.
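As a rough illustration of the kind of representation the SPM denotes (a sketch, not the paper's exact construction), the following Python snippet maps a frame-level F0 contour onto a sparse, binary time-frequency matrix covering only the pitch sub-band; the bin count, frequency bounds, and log-frequency binning are assumptions introduced here for illustration.

```python
import numpy as np

def sparse_pitch_matrix(f0_hz, n_bins=64, f_min=50.0, f_max=500.0):
    """Build a binary time-frequency matrix from a frame-level F0 contour.

    f0_hz        : 1-D array of per-frame pitch values in Hz (0 for unvoiced frames).
    n_bins       : number of bins covering the pitch sub-band (assumed value).
    f_min, f_max : assumed bounds of the pitch sub-band in Hz.
    Returns an (n_frames, n_bins) uint8 matrix with at most one non-zero per frame.
    """
    f0_hz = np.asarray(f0_hz, dtype=float)
    spm = np.zeros((len(f0_hz), n_bins), dtype=np.uint8)
    voiced = (f0_hz >= f_min) & (f0_hz <= f_max)
    # Map each voiced frame's F0 to a bin on a log-frequency axis (assumption).
    bins = np.floor(
        (np.log(f0_hz[voiced]) - np.log(f_min))
        / (np.log(f_max) - np.log(f_min)) * (n_bins - 1)
    ).astype(int)
    spm[np.nonzero(voiced)[0], bins] = 1
    return spm

# Example: a short synthetic contour with an unvoiced gap.
contour = [0.0, 120.0, 125.0, 0.0, 180.0, 190.0]
print(sparse_pitch_matrix(contour).sum(axis=1))  # -> [0 1 1 0 1 1]
```

In this reading, the matrix stays binary and highly sparse (one active bin per voiced frame), which is consistent with the abstract's description of the SPM as a sparse, binary encoding of only the pitch sub-band rather than the full spectrogram.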