A text-to-speech synthesizer based on the concatenation of Polish diphones is described. Text-to-phoneme conversion is based on a neural network. Diphones are extracted and stored pitch-synchronously using a variable-rate linear predictive coder with mixed excitation. Pitch period modification is based on time-domain interpolation of the excitation signal. Duration is controlled by inserting pitch periods and interpolating the excitation signal. Preliminary results are reported.
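
As an illustration of the pitch and duration modification steps, the following is a minimal sketch and not the implementation described here. It assumes that the excitation is stored as one list of samples per pitch period, that the interpolation is linear, and that period insertion is done by repeating an existing period. The function names `resample_period` and `change_duration` are hypothetical.

```python
def resample_period(excitation, new_length):
    """Stretch or compress one pitch period of excitation to new_length samples.

    Assumption: linear time-domain interpolation between neighbouring samples.
    """
    if new_length < 1 or not excitation:
        raise ValueError("need a non-empty period and a positive target length")
    old_length = len(excitation)
    if old_length == 1 or new_length == 1:
        return [float(excitation[0])] * new_length
    # Map output index n onto a fractional position in the original period.
    step = (old_length - 1) / (new_length - 1)
    out = []
    for n in range(new_length):
        pos = n * step
        i = min(int(pos), old_length - 2)  # left neighbour, kept in range
        frac = pos - i
        out.append((1.0 - frac) * excitation[i] + frac * excitation[i + 1])
    return out


def change_duration(periods, repeat_every):
    """Lengthen a segment by inserting a copy of every repeat_every-th period.

    Assumption: an inserted period is a plain repetition of the preceding one.
    """
    if repeat_every < 1:
        raise ValueError("repeat_every must be positive")
    out = []
    for k, period in enumerate(periods, start=1):
        out.append(list(period))
        if k % repeat_every == 0:
            out.append(list(period))  # inserted pitch period
    return out
```

Lengthening a period lowers the pitch and shortening it raises the pitch, while repeating periods extends the segment without changing the pitch. In an LPC-based synthesizer the modified excitation would then be passed through the synthesis filter, which this sketch omits.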