Most end-to-end neural text-to-speech (TTS) systems generate acoustic features autoregressively from left to right, an approach that still suffers from two problems: 1) low efficiency during inference and 2) exposure bias. To overcome these shortcomings, this paper proposes a non-autoregressive speech synthesis model based on the transformer structure. During training, the ground-truth acoustic features are masked according to a schedule, and the decoder must predict the complete acoustic feature sequence from the text and the masked ground truth. During inference, only the text is required as input, and the network predicts the acoustic features in a single step. Additionally, we decompose the decoding process into two stages so that the model can exploit contextual information. Given an input text embedding, we first generate coarse acoustic features, which capture the overall meaning of the sentence. Then, we fill in the missing details of the acoustic features by taking into account both the text information and the coarse acoustic features. Experiments on a Chinese female corpus show that our approach achieves speech naturalness competitive with the autoregressive model. Most importantly, our model speeds up acoustic feature generation by 296× compared with the transformer-based autoregressive model.
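
The following is a minimal sketch, in PyTorch, of how scheduled masking during training and the two-stage (coarse-then-refine) decoding described above could be wired together. All module names, dimensions, the fixed mask ratio, and the assumption that the text embedding has already been upsampled to frame resolution are illustrative placeholders, not the authors' implementation.

```python
import torch
import torch.nn as nn


class TwoStageNonARDecoder(nn.Module):
    """Illustrative non-autoregressive decoder: coarse pass, then refinement pass."""

    def __init__(self, d_model=256, n_mels=80, n_layers=4, n_heads=4):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.mel_proj = nn.Linear(n_mels, d_model)
        self.coarse = nn.TransformerEncoder(layer, n_layers)  # stage 1: coarse features
        self.refine = nn.TransformerEncoder(layer, n_layers)  # stage 2: fill in details
        self.out = nn.Linear(d_model, n_mels)

    def forward(self, text_emb, mel_gt=None, mask_ratio=0.5):
        # text_emb: (batch, frames, d_model), assumed already expanded to frame rate.
        if mel_gt is not None:
            # Training: zero out a scheduled fraction of ground-truth frames, then
            # predict the full feature sequence in parallel (non-autoregressively).
            keep = torch.rand(mel_gt.shape[:2], device=mel_gt.device) > mask_ratio
            mel_in = self.mel_proj(mel_gt * keep.unsqueeze(-1))
        else:
            # Inference: no acoustic input is available; condition on the text alone.
            mel_in = torch.zeros_like(text_emb)
        # Stage 1: coarse acoustic features conditioned on text (and masked ground truth).
        coarse_mel = self.out(self.coarse(text_emb + mel_in))
        # Stage 2: refine by attending to the text together with the coarse prediction.
        refined_mel = self.out(self.refine(text_emb + self.mel_proj(coarse_mel)))
        return coarse_mel, refined_mel


# Toy usage with random tensors standing in for real data.
text_emb = torch.randn(2, 120, 256)          # upsampled text embedding
mel_gt = torch.randn(2, 120, 80)             # ground-truth mel-spectrogram
coarse, refined = TwoStageNonARDecoder()(text_emb, mel_gt)        # training-time call
coarse_inf, refined_inf = TwoStageNonARDecoder()(text_emb)        # one-step inference
```

In this sketch both stages see the entire sequence at once, which is what removes the left-to-right dependency and allows the single-step inference claimed in the abstract; the mask schedule (here a fixed ratio) would in practice vary over training.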