The automatic recognition of facial behaviours is usually achieved through the detection of particular FACS Action Units (AUs), which then make it possible to analyse the affective behaviours expressed in the face. Although advanced techniques have been proposed to extract relevant facial descriptors, processing real-life data, i.e., data recorded in unconstrained environments, makes the automatic detection of FACS AUs much more challenging than with constrained recordings, such as posed faces, and even impossible when the corresponding parts of the face are masked or captured under low or no illumination. In this paper, we present the very first attempt at using acoustic cues for the automatic detection of FACS AUs, as an alternative way to obtain information about the face when such visual data are not available. Results show that features extracted from the voice can be effectively used to predict different types of FACS AUs, and that the best performance is obtained for the prediction of the apex, compared to the prediction of onset, offset and occurrence.
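To make the task concrete, the sketch below illustrates one possible framing of the prediction problem described above: frame-level acoustic features mapped to AU temporal states (none, onset, apex, offset). The feature dimensionality, the random-forest classifier, and the synthetic data are illustrative assumptions and are not taken from the paper.

```python
# Minimal sketch (not the paper's pipeline): predict a FACS AU temporal state
# from frame-level acoustic features. Feature type and classifier are assumed.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report

rng = np.random.default_rng(0)

# Stand-in acoustic features: 1000 frames x 39 coefficients
# (e.g., 13 MFCCs plus deltas and delta-deltas, a common parameterisation).
X = rng.normal(size=(1000, 39))
# Stand-in per-frame AU labels: 0 = none, 1 = onset, 2 = apex, 3 = offset.
y = rng.integers(0, 4, size=1000)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y)

clf = RandomForestClassifier(n_estimators=200, random_state=0)
clf.fit(X_train, y_train)

# Per-class scores make it possible to compare onset/apex/offset performance,
# mirroring the kind of comparison reported in the abstract.
print(classification_report(
    y_test, clf.predict(X_test),
    target_names=["none", "onset", "apex", "offset"]))
```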