One of the keys to the success of supervised learning techniques is access to vast amounts of labelled training data. The process of data collection, however, is expensive, time-consuming, and application dependent. In the current digital era, data can be collected continuously. This continuity turns data annotation into an endless task, which, in problems such as emotion recognition, may require annotators from different cultural backgrounds. Herein, we study the impact of utilising data from different cultures in a semi-supervised learning approach to label training material for the automatic recognition of arousal and valence. Specifically, we compare the performance of culture-specific affect recognition models trained with manual or cross-cultural automatic annotations. The experiments performed in this work use the dataset released for the Cross-cultural Emotion Sub-challenge of the Audio/Visual Emotion Challenge (AVEC) 2019. The results obtained show that the cultures used for training impact the system performance. Furthermore, in most of the scenarios assessed, affect recognition models trained with hybrid solutions, combining manual and automatic annotations, surpass the baseline model, which was exclusively trained with manual annotations.
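The cross-cultural pseudo-labelling idea described above can be sketched as follows. This is a minimal illustration, not the paper's actual pipeline: the feature matrices, the use of ridge regression, and all variable names are assumptions made for the example; the data here is synthetic.

```python
# Hypothetical sketch: a model trained on manually annotated data from one
# culture produces automatic annotations for another culture, and a hybrid
# model is then trained on the combined annotations. All data is synthetic.
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)

# Synthetic "acoustic features" and manual arousal annotations for culture A
X_a = rng.normal(size=(200, 10))
y_a = X_a @ rng.normal(size=10) + 0.1 * rng.normal(size=200)

# Unlabelled data from culture B
X_b = rng.normal(size=(200, 10))

# Step 1: train a regressor on the manual annotations from culture A
model_a = Ridge(alpha=1.0).fit(X_a, y_a)

# Step 2: generate automatic (pseudo) annotations for culture B
y_b_auto = model_a.predict(X_b)

# Step 3: hybrid training on manual + automatic annotations combined
X_hybrid = np.vstack([X_a, X_b])
y_hybrid = np.concatenate([y_a, y_b_auto])
model_hybrid = Ridge(alpha=1.0).fit(X_hybrid, y_hybrid)
```

In this sketch the hybrid model sees twice as much training material as the baseline (`model_a`), at the cost of the noise introduced by the automatic annotations.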