We investigated inter-observer agreement and the reliability of self-reported emotion ratings (i.e., self-raters judging their own emotions) in spontaneous multimodal emotion data. During a multiplayer video game, the players' vocal and facial expressions were recorded, together with the game content itself, and the players subsequently annotated their own expressions on arousal and valence scales. In a perception experiment, observers rated a small subset of the data, presented in four conditions: audio only, visual only, audiovisual, and audiovisual plus context. Inter-observer agreement ranged from 0.32 to 0.52 when the ratings were scaled. Providing multimodal information generally increased agreement. Finally, we found that the average agreement between the self-rater and the observers was somewhat lower than the inter-observer agreement.
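The abstract does not name the agreement statistic used, so the following is only an illustrative sketch: Krippendorff's alpha with an interval metric, a common choice for continuous arousal/valence ratings, computed here on per-observer z-scored ratings as one possible reading of "scaled". The function names, the toy rating matrix, and that interpretation of scaling are assumptions for illustration, not details taken from the study.

```python
import numpy as np

def scale_per_rater(ratings: np.ndarray) -> np.ndarray:
    """Z-score each rater's ratings (one possible reading of 'scaled' ratings)."""
    mean = ratings.mean(axis=1, keepdims=True)
    std = ratings.std(axis=1, keepdims=True)
    return (ratings - mean) / std

def krippendorff_alpha_interval(ratings: np.ndarray) -> float:
    """Krippendorff's alpha with an interval metric.

    ratings: shape (n_raters, n_items), no missing values.
    """
    values = ratings.astype(float)
    m, n = values.shape
    n_total = m * n

    # Observed disagreement: squared differences between all ordered pairs of
    # ratings within each item, each item weighted by 1 / (m - 1).
    d_o = 0.0
    for u in range(n):
        col = values[:, u]
        d_o += ((col[:, None] - col[None, :]) ** 2).sum() / (m - 1)
    d_o /= n_total

    # Expected disagreement: squared differences over all ordered pairs of
    # ratings, regardless of item.
    all_vals = values.ravel()
    d_e = ((all_vals[:, None] - all_vals[None, :]) ** 2).sum() / (n_total * (n_total - 1))

    return 1.0 - d_o / d_e

# Hypothetical example: 4 observers rating 6 clips on a 9-point arousal scale.
obs = np.array([
    [3, 5, 7, 2, 6, 4],
    [4, 5, 6, 3, 7, 4],
    [2, 6, 8, 2, 5, 5],
    [5, 4, 7, 1, 6, 3],
], dtype=float)
print(f"alpha (raw):    {krippendorff_alpha_interval(obs):.2f}")
print(f"alpha (scaled): {krippendorff_alpha_interval(scale_per_rater(obs)):.2f}")
```

Scaling each rater's ratings before computing agreement removes individual differences in how observers use the rating scale (offset and spread), which is one way values in the reported 0.32 to 0.52 range could be obtained from raw ordinal judgments.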