Voice activity detection (VAD) is an essential front end in many speech applications; it determines whether speech is present or absent in an audio frame. However, traditional VAD methods often suffer from poor performance or non-causality in low signal-to-noise ratio (SNR) environments. In this work, we therefore present a real-time causal VAD model, which mainly consists of a frequency-domain feature generation module, a convolution-based encoding module, and a residual-block-based decoding module. Causality is guaranteed because feature extraction uses only the current and past frames. The effectiveness of the proposed model is verified on two datasets under various noise conditions. The results show that the proposed method achieves performance comparable to, or even better than, that of state-of-the-art non-causal models.
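To make the causality constraint concrete, the following is a minimal sketch of causal context stacking, in which the feature vector for frame t is built only from frame t and a fixed number of preceding frames. It is an illustration under stated assumptions and not the feature generation module of the proposed model: the function name `causal_context`, the parameter `num_past`, and the zero padding at the start of the signal are all hypothetical choices made for this example.

```python
from typing import List, Sequence


def causal_context(frames: Sequence[Sequence[float]], num_past: int) -> List[List[float]]:
    """Stack each frame with its `num_past` predecessors (hypothetical helper).

    Frames before the start of the signal are replaced by zeros, so the
    output for frame t never depends on any frame with index greater than t.
    """
    if num_past < 0:
        raise ValueError("num_past must be non-negative")
    if not frames:
        return []

    dim = len(frames[0])
    zero_frame = [0.0] * dim  # assumed padding for missing past frames

    stacked: List[List[float]] = []
    for t in range(len(frames)):
        context: List[float] = []
        # Walk from the oldest past frame up to and including the current one.
        for k in range(t - num_past, t + 1):
            context.extend(frames[k] if k >= 0 else zero_frame)
        stacked.append(context)
    return stacked


if __name__ == "__main__":
    # Three toy "spectral" frames with two frequency bins each.
    toy_frames = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
    for row in causal_context(toy_frames, num_past=1):
        print(row)
```

Because each output row is available as soon as its own frame arrives, a decision for frame t can be emitted without look-ahead, which is the property that distinguishes a causal model from the non-causal baselines it is compared against.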