Recent work has shown that the self-attention module in the Transformer architecture is an effective way of modeling natural language and images. In this work, we propose a novel approach to audio synthesis using a Self-Attention Network (SAN). To the best of our knowledge, there has been no successful application of the Transformer architecture or SAN to high-fidelity waveform generation tasks. The main challenge in adapting SAN to audio generation lies in the quadratic growth of its computational complexity with respect to the input sequence length, which makes it impractical for high-resolution audio tasks. To tackle this problem, we apply dilated sliding-window attention to the vanilla SAN. This technique gives our model a large receptive field, linear computational complexity, and an extremely small footprint. We experimentally show that the proposed model achieves a smaller model size while producing audio samples of comparable speech quality to the best publicly available model. In particular, our small-footprint model has only 0.57M parameters and can generate 22.05 kHz high-fidelity audio 113 times faster than real-time on an NVIDIA V100 GPU without engineered inference kernels.
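To illustrate the idea, the following is a minimal single-head sketch of dilated sliding-window self-attention in PyTorch; the function name and the window_size and dilation values are hypothetical choices for exposition, not the paper's implementation.

```python
# Minimal sketch of single-head dilated sliding-window self-attention
# (hypothetical names/values; not the paper's exact implementation).
import torch

def dilated_window_attention(q, k, v, window_size=7, dilation=2):
    """q, k, v: (batch, seq_len, dim). Each query attends only to keys at
    dilated offsets within a local window, so cost grows linearly with seq_len."""
    b, t, d = q.shape
    half = window_size // 2
    # Relative offsets of attended positions, e.g. [-6, -4, -2, 0, 2, 4, 6] for window_size=7, dilation=2.
    offsets = torch.arange(-half, half + 1, device=q.device) * dilation
    # Absolute key indices for every query position: (t, window_size).
    idx = torch.arange(t, device=q.device).unsqueeze(1) + offsets.unsqueeze(0)
    valid = (idx >= 0) & (idx < t)   # mask window positions that fall outside the sequence
    idx = idx.clamp(0, t - 1)
    # Gather windowed keys/values via advanced indexing: (b, t, window_size, d).
    k_win = k[:, idx]
    v_win = v[:, idx]
    # Scaled dot-product scores over each local window: (b, t, window_size).
    scores = torch.einsum('btd,btwd->btw', q, k_win) / d ** 0.5
    scores = scores.masked_fill(~valid.unsqueeze(0), float('-inf'))
    attn = torch.softmax(scores, dim=-1)
    return torch.einsum('btw,btwd->btd', attn, v_win)

# Usage example with random features standing in for an audio frame sequence.
x = torch.randn(1, 1024, 64)
out = dilated_window_attention(x, x, x)
print(out.shape)  # torch.Size([1, 1024, 64])
```

Because each position attends to a fixed number of keys regardless of sequence length, memory and compute scale linearly with the input, and stacking such layers with increasing dilation enlarges the receptive field without quadratic cost.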