Previous models for synthesizing speech from articulatory movements recorded with real-time MRI (rtMRI) predicted only vocal tract shape parameters and required additional pitch information to generate a speech waveform. This study proposes a two-stage deep learning model consisting of a CNN-BiLSTM network that predicts a mel-spectrogram from an rtMRI video and a HiFi-GAN vocoder that synthesizes a speech waveform from the predicted mel-spectrogram. We evaluated our model on two databases: the ATR 503 sentences rtMRI database and the USC-TIMIT database. On the ATR 503 sentences rtMRI database, the model achieved a PESQ score of 1.64 and an F0 RMSE of 26.7 Hz, demonstrating that all acoustic parameters, including the fundamental frequency, can be estimated from rtMRI videos alone. On the USC-TIMIT database, we also obtained good PESQ and F0 RMSE scores; however, the synthesized speech was unclear, indicating that dataset quality affects the intelligibility of the synthesized speech.
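To make the two-stage architecture concrete, the following is a minimal sketch of the first stage, a CNN-BiLSTM that maps a sequence of rtMRI frames to a mel-spectrogram. The frame size (64x64), number of mel bins (80), and layer widths are illustrative assumptions, not the authors' exact configuration; the second stage would pass the predicted mel-spectrogram to a pretrained HiFi-GAN vocoder, which is only referenced in a comment here.

```python
# Sketch of the first stage (assumed shapes: 64x64 grayscale frames, 80 mel bins).
import torch
import torch.nn as nn


class CNNBiLSTM(nn.Module):
    """Maps a sequence of rtMRI frames to a mel-spectrogram (illustrative, not the authors' exact model)."""

    def __init__(self, n_mels: int = 80, hidden: int = 256):
        super().__init__()
        # Per-frame CNN encoder: (B*T, 1, 64, 64) -> 128-dim feature per frame
        self.cnn = nn.Sequential(
            nn.Conv2d(1, 32, kernel_size=3, stride=2, padding=1), nn.ReLU(),   # 64 -> 32
            nn.Conv2d(32, 64, kernel_size=3, stride=2, padding=1), nn.ReLU(),  # 32 -> 16
            nn.Conv2d(64, 128, kernel_size=3, stride=2, padding=1), nn.ReLU(), # 16 -> 8
            nn.AdaptiveAvgPool2d(1),  # -> (B*T, 128, 1, 1)
            nn.Flatten(),             # -> (B*T, 128)
        )
        # BiLSTM models temporal context across the frame sequence
        self.bilstm = nn.LSTM(128, hidden, num_layers=2,
                              batch_first=True, bidirectional=True)
        self.proj = nn.Linear(2 * hidden, n_mels)

    def forward(self, video: torch.Tensor) -> torch.Tensor:
        # video: (B, T, 1, H, W) grayscale rtMRI frames
        b, t = video.shape[:2]
        feats = self.cnn(video.reshape(b * t, *video.shape[2:])).reshape(b, t, -1)
        out, _ = self.bilstm(feats)
        return self.proj(out)  # (B, T, n_mels) predicted mel-spectrogram


if __name__ == "__main__":
    model = CNNBiLSTM()
    dummy_video = torch.randn(2, 50, 1, 64, 64)  # batch of 2 clips, 50 frames each
    mel = model(dummy_video)
    print(mel.shape)  # torch.Size([2, 50, 80])
    # Stage 2 (not shown): a pretrained HiFi-GAN vocoder converts the predicted
    # mel-spectrogram into a speech waveform.
```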