We propose a system for multimodal emotion recognition. Our research focuses on the impact of individual signals on the emotion recognition process. Beyond reporting the average results of our models, we aim to encourage hands-on engagement from conference participants and to explore how a unique emotional scene, recorded on the spot, is interpreted by the models, both for individual modalities and for their combinations. Our models work for English, German, and Korean, and we compare emotion recognition accuracy across these three languages, including the influence of each modality. Our second experiment explores emotion recognition for people wearing face masks. We show that wearing a face mask affects not only the video signal but also the audio and text signals. To our knowledge, no other study examines the effects of wearing a mask across all three modalities. Unlike studies in which masks are added artificially, we use real recordings of actors wearing masks.
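To illustrate how per-modality predictions might be combined, the sketch below assumes a simple weighted late-fusion scheme over per-modality class probabilities; the fusion strategy, emotion label set, and all function and variable names are hypothetical and are not a description of the system's actual fusion method.

```python
import numpy as np

# Hypothetical label set, for illustration only.
EMOTIONS = ["anger", "happiness", "sadness", "neutral"]

def fuse_predictions(per_modality_probs: dict[str, np.ndarray],
                     weights: dict[str, float] | None = None) -> str:
    """Combine per-modality emotion probabilities into a single label
    via a weighted average (late fusion, assumed here for illustration)."""
    if weights is None:
        weights = {m: 1.0 for m in per_modality_probs}  # uniform weights
    total = sum(weights[m] for m in per_modality_probs)
    fused = sum(weights[m] * p for m, p in per_modality_probs.items()) / total
    return EMOTIONS[int(np.argmax(fused))]

# Example: the text model is confident about "happiness",
# while audio and video are less certain.
probs = {
    "text":  np.array([0.05, 0.80, 0.05, 0.10]),
    "audio": np.array([0.20, 0.40, 0.10, 0.30]),
    "video": np.array([0.10, 0.50, 0.10, 0.30]),
}
print(fuse_predictions(probs))  # -> "happiness"
```

Per-modality weights could be dropped or re-tuned to probe how much each signal contributes to the combined prediction, which is the kind of modality-level analysis described above.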