To infer emotions accurately from speech, fusing audio and text is essential, since words carry much of the semantic and emotional information. The attention mechanism is an essential component of multimodal fusion architectures because it dynamically pairs different regions within multimodal sequences. However, existing architectures lack an explicit structure for modeling the dynamics between fused representations. We therefore propose recurrent multi-head attention in a fusion architecture, which selects salient fused representations and learns the dynamics between them. Multiple 2-D attention layers select salient pairs among all possible pairs of audio and text representations, which are then combined by a fusion operation. Finally, the multiple fused representations are fed into a recurrent unit to learn the dynamics between them. Our method outperforms existing approaches for fusing audio and text in speech emotion recognition and achieves state-of-the-art accuracies on the benchmark IEMOCAP dataset.
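For concreteness, the sketch below illustrates one plausible reading of the described pipeline: each attention head scores all audio-frame/text-token pairs with a 2-D attention map, the attended audio and text vectors are fused, and the per-head fused representations are passed as a sequence through a recurrent unit. All module names, dimensions, the soft (rather than hard) pair selection, and the four-class output are assumptions for illustration, not the authors' exact implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class RecurrentMultiHeadAttentionFusion(nn.Module):
    """Minimal sketch: 2-D attention over audio-text pairs, fusion, then a GRU."""

    def __init__(self, audio_dim, text_dim, hidden_dim, num_heads=4, num_classes=4):
        super().__init__()
        self.num_heads = num_heads
        # Per-head projections used to score every (audio frame, text token) pair.
        self.audio_proj = nn.ModuleList([nn.Linear(audio_dim, hidden_dim) for _ in range(num_heads)])
        self.text_proj = nn.ModuleList([nn.Linear(text_dim, hidden_dim) for _ in range(num_heads)])
        # Fusion operation: concatenate the attended audio and text vectors, then project.
        self.fuse = nn.Linear(audio_dim + text_dim, hidden_dim)
        # Recurrent unit that models dynamics across the per-head fused representations.
        self.gru = nn.GRU(hidden_dim, hidden_dim, batch_first=True)
        self.classifier = nn.Linear(hidden_dim, num_classes)  # e.g. 4 IEMOCAP emotions

    def forward(self, audio, text):
        # audio: (B, Ta, audio_dim), text: (B, Tt, text_dim)
        fused_per_head = []
        for h in range(self.num_heads):
            a = self.audio_proj[h](audio)   # (B, Ta, H)
            t = self.text_proj[h](text)     # (B, Tt, H)
            # 2-D attention: one score per audio-frame / text-token pair.
            scores = torch.einsum("bik,bjk->bij", a, t)                 # (B, Ta, Tt)
            weights = F.softmax(scores.flatten(1), dim=-1).view_as(scores)
            # Expected audio and text vectors under the pairwise attention weights
            # (a soft stand-in for selecting the most salient pairs).
            att_audio = torch.einsum("bij,bik->bk", weights, audio)     # (B, audio_dim)
            att_text = torch.einsum("bij,bjk->bk", weights, text)       # (B, text_dim)
            fused_per_head.append(self.fuse(torch.cat([att_audio, att_text], dim=-1)))
        # Stack the heads into a sequence so the GRU can learn dynamics between
        # the fused representations.
        fused_seq = torch.stack(fused_per_head, dim=1)   # (B, num_heads, H)
        _, last_state = self.gru(fused_seq)
        return self.classifier(last_state.squeeze(0))    # (B, num_classes)
```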