We present PronScribe, a novel method for phonemic transcription from speech and text input, based on careful fine-tuning and adaptation of a massive multilingual, multimodal speech-text pretrained model. We show that our model can phonemically transcribe the pronunciations of full utterances with accurate word boundaries in a variety of languages covering diverse phonological phenomena, achieving phoneme error rates of 1-2%, comparable to those of human transcribers. PronScribe learns this task effectively from relatively little training data, making it attractive even in low-resource settings. It learns jointly and coherently from text and speech, and it outperforms previous models that use speech, text, or both. Moreover, its strong transfer learning behavior in multilingual settings can effectively boost performance for lower-resourced languages.