Large language models (LLMs) have been widely used in cross-lingual and emotional speech synthesis, but they require extensive training data and inherit the drawbacks of earlier autoregressive (AR) speech models, such as slow inference and limited robustness and interpretability. In this paper, we propose a cross-lingual emotional speech generation model, X-E-Speech, which disentangles speaker style from cross-lingual content features by jointly training non-autoregressive (NAR) voice conversion (VC) and text-to-speech (TTS) models. For TTS, we freeze the style-related model components and fine-tune the content-related structures to enable cross-lingual emotional speech synthesis. For VC, we improve the emotion similarity between the generated speech and the reference speech by introducing a similarity loss between the content features used for VC and the text features used for TTS.
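For illustration only, the similarity term can be sketched as a simple distance between the two feature streams. The function names, tensor shapes, and the choice of L1 distance below are assumptions for the sketch, not details taken from the paper.

```python
import torch
import torch.nn.functional as F

def content_text_similarity_loss(vc_content: torch.Tensor,
                                 tts_text_feat: torch.Tensor) -> torch.Tensor:
    """Distance between VC content features and frame-level TTS text features.

    vc_content:    (batch, frames, dim) features from the speech content encoder (VC path).
    tts_text_feat: (batch, frames, dim) text-encoder features expanded to frame level (TTS path).

    Assumes both streams are already time-aligned; L1 is an illustrative choice.
    """
    return F.l1_loss(vc_content, tts_text_feat)

# Hypothetical joint objective: the similarity term is added to the usual
# reconstruction losses with a weighting hyperparameter lambda_sim.
# loss = recon_loss + lambda_sim * content_text_similarity_loss(c_vc, c_tts)
```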