We propose a methodology for aggregating information from the transformer layer outputs of a generic speech encoder (e.g., WavLM, HuBERT) for the downstream task of Speech Emotion Recognition (SER). The proposed methodology significantly reduces the dependency of model predictions on linguistic content while achieving competitive performance, without requiring costly encoder re-training. We evaluate the proposed paradigm via accuracy, Positive Pointwise Mutual Information, and visualization of the learned attention weights. Beyond single-language SER, the methodology generalizes well to a multi-language setting, suggesting shared paralinguistic cues across languages and cultures. Experimental results demonstrate this ability on unseen languages in a zero-shot fashion, indicating that the proposed method is inclusive with respect to speech and language and is therefore applicable to a wide audience of speakers.
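To make the layer-aggregation idea concrete, the following is a minimal sketch, not the paper's exact architecture: it assumes a PyTorch setup with a frozen encoder that exposes per-layer hidden states, and the class and parameter names (e.g. LayerAttentionPooling, layer_logits) are hypothetical illustrations of learning attention weights over encoder layers.

```python
import torch
import torch.nn as nn

class LayerAttentionPooling(nn.Module):
    """Aggregates per-layer hidden states of a frozen speech encoder
    (e.g. WavLM/HuBERT) with learned attention weights over layers."""

    def __init__(self, num_layers: int, hidden_dim: int, num_emotions: int):
        super().__init__()
        # One learnable scalar logit per encoder layer (softmaxed into weights).
        self.layer_logits = nn.Parameter(torch.zeros(num_layers))
        self.classifier = nn.Linear(hidden_dim, num_emotions)

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        # hidden_states: (num_layers, batch, time, hidden_dim)
        weights = torch.softmax(self.layer_logits, dim=0)            # (num_layers,)
        fused = torch.einsum("l,lbtd->btd", weights, hidden_states)  # weighted sum over layers
        pooled = fused.mean(dim=1)                                   # mean-pool over time
        return self.classifier(pooled)                               # emotion logits


# Hypothetical usage with the 13 hidden states of a base-size encoder (12 layers + input embeddings):
# states = torch.stack(encoder(wave, output_hidden_states=True).hidden_states)  # (13, B, T, 768)
# logits = LayerAttentionPooling(num_layers=13, hidden_dim=768, num_emotions=4)(states)
```

Because only the layer logits and the classifier head are trained, such a scheme keeps the encoder frozen, consistent with avoiding costly encoder re-training.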