This work focuses on audio abuse detection from an acoustic-cue perspective in a multilingual social media setting. While textual abuse detection has been widely researched, abuse detection from audio remains comparatively unexplored. Our key hypothesis is that abusive behavior manifests in distinct acoustic cues, which can be used to detect abuse directly from audio signals without the need for transcription. We first demonstrate that employing a generic large pre-trained acoustic/language model is suboptimal, which suggests that incorporating the right acoustic cues is the way forward to improve performance and achieve generalization in a large-scale setting. Our proposed method explicitly focuses on two modalities: the underlying emotions expressed and the language features of the audio. On the recently proposed ADIMA benchmark for this task, our approach achieves state-of-the-art performance of 96% on the test set, outperforming the best existing models by a large margin.
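To make the two-cue idea concrete, the sketch below shows one plausible late-fusion design: embeddings capturing emotion and language information for an audio clip are concatenated and passed to a small classifier. This is an illustrative assumption, not the published architecture; the class name, embedding dimensions, and fusion strategy are all hypothetical, and the embeddings are assumed to come from off-the-shelf pretrained encoders.

```python
# Minimal PyTorch sketch of fusing emotion and language cues for
# audio abuse detection. All names, dimensions, and the late-fusion
# design are illustrative assumptions; the abstract does not specify
# the actual architecture.
import torch
import torch.nn as nn


class TwoCueAbuseClassifier(nn.Module):
    """Fuses an emotion embedding and a language embedding of an audio
    clip (assumed to be precomputed by pretrained encoders) and
    predicts abusive vs. non-abusive."""

    def __init__(self, emo_dim: int = 256, lang_dim: int = 256, hidden: int = 128):
        super().__init__()
        self.fusion = nn.Sequential(
            nn.Linear(emo_dim + lang_dim, hidden),
            nn.ReLU(),
            nn.Dropout(0.3),
            nn.Linear(hidden, 2),  # abusive / non-abusive logits
        )

    def forward(self, emo_emb: torch.Tensor, lang_emb: torch.Tensor) -> torch.Tensor:
        # Late fusion: concatenate the two cue embeddings, then classify.
        return self.fusion(torch.cat([emo_emb, lang_emb], dim=-1))


# Toy usage with random stand-ins for the precomputed embeddings.
model = TwoCueAbuseClassifier()
emo_emb = torch.randn(4, 256)   # e.g., from a speech-emotion encoder
lang_emb = torch.randn(4, 256)  # e.g., from a spoken-language-ID encoder
logits = model(emo_emb, lang_emb)
print(logits.shape)  # torch.Size([4, 2])
```

Late fusion by concatenation is only one of several reasonable choices here; attention-based or gated fusion would be natural alternatives under the same two-cue framing.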