End-to-End speech recognition is a recently proposed approach that directly transcribes input speech to text using a single model. End-to-End methods, including Connectionist Temporal Classification (CTC) and Attention-based Encoder-Decoder Networks, have been shown to achieve state-of-the-art performance on a number of tasks while significantly simplifying the modeling, training, and decoding procedures for speech recognition. In this paper, we extend our prior work on End-to-End speech recognition, focusing on the effectiveness of these models in far-field environments. Specifically, we propose introducing Auditory Attention to integrate input from multiple microphones directly within an End-to-End speech recognition model, using the attention mechanism to dynamically focus the model on the most reliable input sources. We evaluate the proposed model on the CHiME-4 task and show substantial improvement over a model optimized for single-microphone input.
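The idea of attending over microphones can be illustrated with a minimal sketch. It assumes that each microphone's signal has already been encoded into a per-frame feature vector, and that channels are scored by a dot product with a hypothetical learned `query` vector. The scoring function, the names `fuse_channels` and `query`, and the shapes are illustrative assumptions, not the architecture described in this paper. The sketch takes a softmax over the channel scores and returns the attention-weighted sum of the channel features, so channels that score higher contribute more to the fused frame.

```python
import math
from typing import List, Tuple


def softmax(scores: List[float]) -> List[float]:
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def fuse_channels(
    channel_feats: List[List[float]], query: List[float]
) -> Tuple[List[float], List[float]]:
    """Fuse per-microphone feature vectors for a single frame (illustrative sketch).

    channel_feats: C vectors of dimension D, one per microphone.
    query: hypothetical learned vector of dimension D used to score channels
           (an assumed dot-product scoring function, not the paper's).
    Returns the fused D-dimensional vector and the C attention weights.
    """
    # Score each channel by its dot product with the query vector.
    scores = [sum(f * q for f, q in zip(feat, query)) for feat in channel_feats]
    # Normalize the scores into attention weights that sum to 1.
    weights = softmax(scores)
    # Weighted sum of channel features, computed dimension by dimension.
    dim = len(channel_feats[0])
    fused = [
        sum(w * feat[d] for w, feat in zip(weights, channel_feats))
        for d in range(dim)
    ]
    return fused, weights


if __name__ == "__main__":
    # Three hypothetical microphones. The first responds most strongly
    # along the query direction, so it receives the largest weight.
    feats = [[1.0, 0.0], [0.2, 0.1], [0.0, 0.0]]
    fused, weights = fuse_channels(feats, [2.0, 0.0])
    print(weights, fused)
```

Applying `fuse_channels` separately at every frame lets the weights change over time, which is one simple way to realize "dynamic" selection of the most reliable input source. In a trained model the query, or a richer scoring network, would be learned jointly with the rest of the End-to-End system.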