In recent years, the quality of text-to-speech models has increased significantly, yet most text-to-speech solutions are trained on read-speech datasets and do not cover conversational speaking styles due to the lack of suitable training data. This paper explores options for building multi-style speech synthesis from speech recognition datasets, which contain samples of spontaneous speech and dialogue but may also include background noise and too few samples per speaker. We develop an Estonian multi-speaker TTS system that increases prosodic variability for conversational inputs while retaining the ability to synthesize read speech. We show that the proposed approach can be used to train controllable models that produce conversational speech with little compromise in audio quality. We also highlight a potential multilingual use case: cross-lingual speaker and style transfer to low-resource languages that lack stylistically diverse speech corpora.