Many speech synthesis systems consider only the information within each sentence and ignore contextual semantic and acoustic features, which makes them inadequate for generating expressive paragraph-level speech. In this paper, a context-aware speech synthesis system named MaskedSpeech is proposed, which exploits both contextual semantic and acoustic features. Inspired by the masking strategy in speech editing research, the acoustic features of the current sentence are masked out, concatenated with those of the contextual speech, and used as additional model input. Furthermore, cross-utterance coarse-grained and fine-grained semantic features are employed to improve prosody generation. The model is trained to reconstruct the masked acoustic features with the augmentation of both the contextual semantic and acoustic features. Experimental results demonstrate that MaskedSpeech significantly outperforms the baseline systems in terms of naturalness and expressiveness.
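The masking-and-concatenation step described above can be sketched as follows. This is a minimal illustration under assumed shapes and names (mel-spectrogram frames of shape `(frames, mel_bins)`, zero-valued masking); it is not the authors' implementation.

```python
import numpy as np

def build_masked_input(prev_mel, cur_mel, next_mel):
    """Mask the current sentence's acoustic features and concatenate
    them with the contextual speech along the time axis.

    prev_mel / next_mel: mel frames of neighboring sentences (kept intact).
    cur_mel: mel frames of the current sentence (replaced by a zero mask,
    which the model is trained to reconstruct).
    """
    masked_cur = np.zeros_like(cur_mel)  # mask out the current sentence
    return np.concatenate([prev_mel, masked_cur, next_mel], axis=0)

# Toy mel-spectrograms: (frames, mel_bins) -- shapes are arbitrary here.
prev_mel = np.random.rand(50, 80)
cur_mel = np.random.rand(40, 80)
next_mel = np.random.rand(60, 80)

x = build_masked_input(prev_mel, cur_mel, next_mel)
print(x.shape)  # (150, 80): context frames surround the masked region
```

The concatenated sequence lets the acoustic model attend to the surrounding speech when reconstructing the masked region, which is how the contextual acoustic information enters the model.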