Sarcasm is conveyed through subtle cues such as pitch, speech rate, and facial expressions, and these patterns vary across languages: English speakers, for example, lower their pitch to mark sarcasm, while Cantonese speakers raise it. Although humans readily interpret these signals, computational models struggle, posing challenges for human-machine interaction. Most multimodal sarcasm recognition research focuses on English, and the lack of high-quality datasets in other languages hinders cross-lingual and cross-cultural studies. We introduce the Multimodal Chinese Sarcasm Dataset (MCSD), comprising 10.57 hours of video. We propose a standardized annotation framework that captures annotator certainty to reflect the subjectivity of sarcasm, achieving a Fleiss’ kappa of 0.74 (unweighted) and 0.79 (certainty-weighted). Validating the dataset with an SVM baseline yields an F1-score of 76.64% on sarcasm detection. MCSD lays the foundation for robust cross-lingual sarcasm detection, contributing to advanced, human-centric systems.
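The agreement figures above are Fleiss’ kappa scores. As a minimal illustrative sketch (not the paper’s implementation), the following computes standard Fleiss’ kappa from a per-item category-count matrix, with an optional per-item weighting that could, for instance, use mean annotator certainty; the exact certainty-weighting scheme used by the authors is not specified here and this variant is an assumption:

```python
def fleiss_kappa(counts, weights=None):
    """Fleiss' kappa for inter-annotator agreement.

    counts: list of rows, one per item; counts[i][j] is the number of
            annotators who assigned item i to category j. Each item is
            assumed to have the same number of raters.
    weights: optional per-item weights (e.g., mean annotator certainty);
             this weighted variant is an illustrative assumption, not
             necessarily the paper's exact formulation.
    """
    n_items = len(counts)
    n_raters = sum(counts[0])
    # Per-item observed agreement: probability two raters agree on item i.
    p_i = [
        sum(c * (c - 1) for c in row) / (n_raters * (n_raters - 1))
        for row in counts
    ]
    if weights is None:
        weights = [1.0] * n_items
    total_w = sum(weights)
    # Observed agreement, averaged (optionally weighted) over items.
    p_bar = sum(w * p for w, p in zip(weights, p_i)) / total_w
    # Expected chance agreement from marginal category proportions.
    n_cats = len(counts[0])
    total = n_items * n_raters
    p_j = [sum(row[j] for row in counts) / total for j in range(n_cats)]
    p_e = sum(p * p for p in p_j)
    return (p_bar - p_e) / (1 - p_e)
```

With three annotators and two categories (sarcastic / non-sarcastic), perfect agreement on every item yields a kappa of 1.0, while systematic disagreement drives the score toward zero or below.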