Text-based speech editing systems enable users to modify speech through its transcript. Existing state-of-the-art neural editing systems, without exception, perform partial inference: they generate only the new words to be replaced or inserted. This often leaves the prosody of the edited segment inconsistent with the surrounding speech and fails to handle changes in intonation. To address these problems, we propose a cross-utterance conditioned coherent speech editing system, the first to perform inference over the entire utterance rather than only the edited region. Our proposed system generates speech conditioned on speaker information, cross-utterance context, acoustic features, and the mel-spectrogram of the original audio. Experiments with both subjective and objective metrics demonstrate that our approach outperforms the baseline across various editing operations in naturalness and prosody consistency.
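For concreteness, the sketch below illustrates one way such whole-utterance conditioning could be wired up. It is a minimal illustration, not the authors' implementation: the module names, dimensions, fusion-by-addition strategy, and the simplifying assumption that the original mel-spectrogram and acoustic features are already aligned to the edited token sequence are all assumptions for the example.

```python
# Minimal sketch of whole-utterance conditioned editing (illustrative only).
# Unlike partial-inference editors, the model regenerates the mel-spectrogram
# for the ENTIRE utterance, conditioned on the original audio and context.
import torch
import torch.nn as nn

class CoherentSpeechEditor(nn.Module):
    def __init__(self, d_model=256, n_mels=80, vocab_size=100, n_speakers=10):
        super().__init__()
        self.text_emb = nn.Embedding(vocab_size, d_model)     # edited transcript tokens
        self.speaker_emb = nn.Embedding(n_speakers, d_model)  # speaker information
        self.context_proj = nn.Linear(d_model, d_model)       # cross-utterance context
        self.mel_proj = nn.Linear(n_mels, d_model)            # original mel-spectrogram
        self.acoustic_proj = nn.Linear(3, d_model)            # e.g. pitch/energy/duration
        enc = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(enc, num_layers=4)
        self.mel_head = nn.Linear(d_model, n_mels)            # predict the full mel sequence

    def forward(self, tokens, speaker_id, context_vec, orig_mel, acoustic):
        # tokens: (B, T); orig_mel: (B, T, n_mels); acoustic: (B, T, 3)
        # Assumed pre-aligned to the token sequence for this sketch.
        h = self.text_emb(tokens)
        h = h + self.speaker_emb(speaker_id).unsqueeze(1)     # broadcast over time
        h = h + self.context_proj(context_vec).unsqueeze(1)
        h = h + self.mel_proj(orig_mel)
        h = h + self.acoustic_proj(acoustic)
        return self.mel_head(self.encoder(h))                 # mel for the whole utterance

# Usage: the whole utterance is resynthesized, not just the edited span,
# so prosody can stay consistent across the edit boundary.
model = CoherentSpeechEditor()
tokens = torch.randint(0, 100, (1, 120))
mel = model(tokens, torch.tensor([3]), torch.randn(1, 256),
            torch.randn(1, 120, 80), torch.randn(1, 120, 3))
print(mel.shape)  # torch.Size([1, 120, 80])
```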