To achieve fast, high-fidelity neural text-to-speech on edge smartphone devices without a network connection, we at NICT prototyped Mobile PresenTra. It combines a non-autoregressive acoustic model, consisting of a Transformer encoder and a ConvNeXt decoder, with an MS-FC-HiFi-GAN neural vocoder. Additionally, incremental inference is applied only to the neural vocoder, enabling low-latency synthesis without performance degradation. Compared with a previous NICT system consisting of a Transformer encoder, a Transformer decoder and an MS-HiFi-GAN neural vocoder, the proposed Mobile PresenTra achieves high-fidelity, fast synthesis on a mid-range smartphone, with a real-time factor of about 0.3 for batch inference and a latency of less than 0.5 s for incremental inference. In the Show & Tell session, attendees can freely try Mobile PresenTra running on actual smartphones for English, Japanese and Chinese with arbitrary text input.
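The following is a minimal sketch of why incremental inference applied only to a convolutional vocoder need not degrade quality. It is not the authors' implementation. `toy_vocoder` is a hypothetical stand-in for a HiFi-GAN-style vocoder: each output frame depends only on input mel frames within a fixed receptive field (`radius`). The hop size, chunk length and context width are illustrative assumptions. The acoustic model is assumed to have already produced the full mel sequence in one batch pass. The vocoder then synthesizes it chunk by chunk, with extra context frames on each side, so the first chunk can be played early.

```python
import numpy as np

HOP = 4  # hypothetical number of waveform samples per mel frame


def toy_vocoder(mel: np.ndarray, radius: int = 2) -> np.ndarray:
    """Hypothetical convolutional vocoder: each output frame depends only on
    mel frames within +/- radius (zero-padded at the segment edges)."""
    kernel = np.ones(2 * radius + 1) / (2 * radius + 1)
    padded = np.pad(mel, radius)  # zero padding keeps output length == len(mel)
    smoothed = np.convolve(padded, kernel, mode="valid")
    return np.repeat(smoothed, HOP)  # upsample frames to samples


def incremental_vocoder(mel: np.ndarray, chunk: int = 8, context: int = 2,
                        radius: int = 2):
    """Yield waveform chunks. Each chunk is synthesized with `context` extra
    frames on both sides; if context >= radius, the output matches full synthesis."""
    assert context >= radius, "context must cover the receptive field"
    total = len(mel)
    for start in range(0, total, chunk):
        end = min(start + chunk, total)
        lo, hi = max(0, start - context), min(total, end + context)
        wav = toy_vocoder(mel[lo:hi], radius)
        # Keep only the samples belonging to frames [start, end)
        yield wav[(start - lo) * HOP:(end - lo) * HOP]


mel = np.random.default_rng(0).standard_normal(50)  # 50 hypothetical mel frames
full = toy_vocoder(mel)
streamed = np.concatenate(list(incremental_vocoder(mel)))
print(np.allclose(full, streamed))  # True
```

In this sketch, latency is governed by the acoustic model plus the vocoder cost of the first chunk rather than the whole utterance. The context frames are the price paid for matching full-utterance output exactly.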