Social robotics and human-robot partnership are increasingly relevant topics that pose many challenges for state-of-the-art speech technology. This paper presents the first evaluation of speech emotion recognition (SER) technology on non-acted speech data recorded in a real indoor human-robot interaction (HRI) scenario. The challenge is characterized by distant speech processing, reverberation, and additive external and robot engine noise. We train and evaluate a machine learning-based SER model using simulated acoustic modelling that includes room impulse responses (RIRs), external noise, and the beamforming response. We observe improved prediction of arousal, valence, and dominance with the proposed training procedure combined with delay-and-sum and minimum variance distortionless response (MVDR) beamforming, with gains as high as 180% compared with the model trained on the original data recorded in controlled environments. Moreover, the degradation relative to the original matched training/testing condition is only 39%.
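To make the described simulation-based training procedure concrete, the following Python sketch illustrates one common way to generate such training data: convolving clean, controlled-condition speech with an RIR and mixing in noise at a target SNR. This is a minimal sketch under stated assumptions, not the paper's implementation; the file names, the 5 dB SNR, the `simulate_hri_conditions` helper are illustrative, audio is assumed mono, and the beamforming stage (delay-and-sum/MVDR) is omitted since it operates on multichannel signals.

```python
# Minimal sketch (illustrative, not the paper's pipeline): simulate
# distant, reverberant, noisy HRI conditions from clean recordings.
import numpy as np
from scipy.signal import fftconvolve
import soundfile as sf  # assumed available for audio I/O


def simulate_hri_conditions(clean, rir, noise, snr_db):
    """Convolve clean speech with an RIR and add noise at snr_db (mono)."""
    # Reverberant speech: linear convolution with the room impulse response.
    reverb = fftconvolve(clean, rir, mode="full")[: len(clean)]

    # Loop/trim the noise so it matches the speech length.
    reps = int(np.ceil(len(reverb) / len(noise)))
    noise = np.tile(noise, reps)[: len(reverb)]

    # Scale the noise to reach the requested signal-to-noise ratio.
    speech_power = np.mean(reverb ** 2)
    noise_power = np.mean(noise ** 2) + 1e-12
    scale = np.sqrt(speech_power / (noise_power * 10 ** (snr_db / 10)))
    noisy = reverb + scale * noise

    # Normalize to avoid clipping when written back to disk.
    return noisy / (np.max(np.abs(noisy)) + 1e-12)


if __name__ == "__main__":
    clean, sr = sf.read("clean_utterance.wav")       # hypothetical input
    rir, _ = sf.read("room_impulse_response.wav")    # hypothetical RIR
    noise, _ = sf.read("robot_engine_noise.wav")     # hypothetical noise
    augmented = simulate_hri_conditions(clean, rir, noise, snr_db=5.0)
    sf.write("augmented_utterance.wav", augmented, sr)
```

Training an SER model on such simulated data is what allows it to better match the acoustics of the real HRI test condition than a model trained only on the original controlled-environment recordings.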