The growing need to overcome language barriers has made speech-to-text translation (S2TT) critically important. Several multilingual datasets have recently been introduced to expand the language coverage of S2TT systems. However, most of this work focuses only on increasing the number of languages covered, and many of these languages come with only a few hours of training data, resulting in poor translation performance. This paper proposes a unified speech-text representation learning framework to overcome the shortage of parallel speech-text data for S2TT. Although the approach can be applied to any language pair, we focus on the Japanese-English S2TT task and evaluate it on the publicly available CoVoST 2 dataset. In addition, we evaluate the S2TT system on our new Japanese-English dataset of ambiguous sentences, in which the same spoken utterance can have different translations depending on its prosodic features. We achieve results competitive with other state-of-the-art models on the CoVoST 2 dataset and significant improvements in the more challenging case of our new dataset.