Singing voice synthesis (SVS) corpora are more costly to collect than text-to-speech (TTS) corpora. SVS using only a TTS corpus is challenging because the ranges of fundamental frequency (fo) and phoneme duration in singing are wider than those in speech. Although a melody-unsupervised method has prototyped SVS using only a TTS corpus, problems remain in duration and fo controllability. To address them, this paper proposes a unified TTS and SVS framework. It combines a FastSpeech 2-based, duration-expansion-robust TTS acoustic model with a phoneme embedding skip connection (PESC) and the FIRNet source-filter neural vocoder driven by source-filter acoustic features. In SVS inference, the input text, fo, and phoneme duration are obtained from the lyrics and notes of a musical score. Additionally, an input fo shift is proposed. Experiments using the JSUT corpus confirm that the PESC-based acoustic model with the input fo shift and FIRNet improves SVS quality compared with the same model using HiFi-GAN.
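For intuition only, the sketch below shows one plausible way score notes could be turned into an input fo contour and then shifted by a fixed number of semitones before synthesis. The function names, frame counts, and shift amount are assumptions made for illustration; the abstract does not specify how the paper's input fo shift is computed.

```python
import numpy as np

def midi_to_fo(midi_notes: np.ndarray) -> np.ndarray:
    """Convert MIDI note numbers to fundamental frequency (Hz),
    assuming equal temperament with A4 (note 69) = 440 Hz."""
    return 440.0 * 2.0 ** ((midi_notes - 69.0) / 12.0)

def shift_fo(fo: np.ndarray, semitones: float) -> np.ndarray:
    """Shift an fo contour by a given number of semitones.
    A semitone-based multiplicative shift is an assumption here,
    not necessarily the paper's input fo shift."""
    return fo * 2.0 ** (semitones / 12.0)

# Hypothetical score fragment: notes C4, E4, G4 with per-note frame counts.
notes = np.array([60, 64, 67])
durations = np.array([20, 20, 40])  # frames per note (illustrative values)
fo_contour = np.repeat(midi_to_fo(notes), durations)

# Shift the score-derived fo down by an octave, e.g. toward the pitch
# range covered by the TTS corpus (the -12 value is an assumption).
fo_input = shift_fo(fo_contour, -12.0)
```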