The rise of online multimedia, particularly on YouTube, has transformed information dissemination. This study examines the multimodal nature of online speech, focusing on two prevalent speech genres in Taiwan Mandarin YouTube content: entertaining and informative clips. For each genre, we collected 100-minute video clips from sixteen influential YouTubers and segmented them into inter-pause units (IPUs) for subsequent analysis. For each IPU, acoustic features describing durational, rhythmic, and pitch patterns were derived from the speech signal, while bag-of-words lexical features were derived from the textual content. Our objectives were twofold: first, to explore genre-specific prosodic patterns using the proposed prosodic feature set, and second, to evaluate how much these prosodic features improve speech genre classification accuracy when combined with textual features. Results show that the ensemble model, with an accuracy of 84.6%, outperforms the prosody-only and text-only monomodal models, suggesting that prosodic features play a complementary role in speech genre classification. Furthermore, our findings underscore the impact of semantic topics on textual features, which may lead monomodal models to misclassify topic-neutral IPUs. This study highlights the importance of considering both prosodic and textual features when determining speech genres in multimodal discourse.
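As a rough illustration of the pipeline summarized above, the following minimal Python sketch groups time-aligned tokens into IPUs, derives one durational feature and bag-of-words counts per IPU, and combines two classifier scores. It is not the study's implementation: the `Token` type, the 0.2-second pause threshold, the tokens-per-second rate measure, and the weighted-average `late_fusion` rule are all assumptions introduced here, since the abstract does not specify the pause criterion, the exact features, or how the ensemble combines the two modalities.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class Token:
    """A time-aligned token (hypothetical input format)."""
    text: str
    start: float  # onset in seconds
    end: float    # offset in seconds


def segment_ipus(tokens, min_pause=0.2):
    """Group tokens into inter-pause units (IPUs).

    A new IPU starts whenever the silence between two consecutive
    tokens is at least `min_pause` seconds. The 0.2 s default is an
    assumed value, not one taken from the study.
    """
    ipus, current = [], []
    for tok in tokens:
        if current and tok.start - current[-1].end >= min_pause:
            ipus.append(current)
            current = []
        current.append(tok)
    if current:
        ipus.append(current)
    return ipus


def ipu_features(ipu):
    """Return one durational feature set and bag-of-words counts for an IPU."""
    duration = ipu[-1].end - ipu[0].start
    return {
        "duration": duration,
        # Tokens per second, a simple stand-in for a speech-rate measure.
        "rate": len(ipu) / duration if duration > 0 else 0.0,
        "bow": Counter(tok.text for tok in ipu),
    }


def late_fusion(p_prosody, p_text, weight=0.5):
    """Weighted average of two genre probabilities (assumed fusion rule)."""
    return weight * p_prosody + (1 - weight) * p_text
```

In this sketch, `p_prosody` and `p_text` stand for the probabilities that a prosody-only and a text-only classifier assign to the same genre for one IPU; any classifier producing such scores could be substituted.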