In a multi-speaker scenario, a listener attends to one speech stream at a time. EEG-based auditory attention detection (AAD) aims to identify, from EEG signals, which speech stream the listener is attending to. The performance of linear modeling approaches is limited by the non-linear nature of human auditory perception. Furthermore, real-world applications call for low-latency AAD solutions that work in noisy environments. In this paper, we propose to adopt common spatial pattern (CSP) analysis to enhance the discriminative ability of EEG signals. We study the use of a convolutional neural network (CNN) as a non-linear solution. Experiments show that it is possible to decode auditory attention within 2 seconds, with a competitive accuracy of 80.2%, even in noisy acoustic environments. These results are encouraging for brain-computer interfaces, such as hearing aids, that require real-time responses and robust AAD in complex acoustic environments.
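To illustrate the CSP step mentioned above, the sketch below shows one common way such spatial filters are computed from two classes of EEG trials (e.g., attention to the left vs. right speaker) via a generalized eigendecomposition of the class covariance matrices. The function names, array shapes, and the choice of `n_pairs` are illustrative assumptions, not the implementation used in this paper.

```python
import numpy as np
from scipy.linalg import eigh

def csp_filters(trials_a, trials_b, n_pairs=3):
    """Compute CSP spatial filters from two classes of EEG trials.

    trials_a, trials_b: arrays of shape (n_trials, n_channels, n_samples),
    e.g. EEG epochs labelled by the attended speaker.
    Returns a (2 * n_pairs, n_channels) matrix of spatial filters.
    """
    def mean_cov(trials):
        # Average trace-normalized spatial covariance across trials.
        covs = []
        for t in trials:
            c = np.cov(t)
            covs.append(c / np.trace(c))
        return np.mean(covs, axis=0)

    cov_a = mean_cov(trials_a)
    cov_b = mean_cov(trials_b)
    # Generalized eigenproblem: cov_a w = lambda (cov_a + cov_b) w.
    # Large eigenvalues -> filters maximizing class-A variance;
    # small eigenvalues -> filters maximizing class-B variance.
    eigvals, eigvecs = eigh(cov_a, cov_a + cov_b)  # ascending eigenvalues
    picks = np.r_[np.arange(n_pairs), np.arange(-n_pairs, 0)]
    return eigvecs[:, picks].T

def csp_features(trial, filters):
    """Log-variance of the CSP-filtered signals, a common feature choice."""
    projected = filters @ trial            # (n_filters, n_samples)
    var = projected.var(axis=1)
    return np.log(var / var.sum())
```

In a pipeline along the lines described here, such CSP-filtered signals or their log-variance features could then serve as the input to the non-linear classifier, e.g., the CNN.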