Personalised speech enhancement (PSE) and audio-visual (AV) speech enhancement (SE) have emerged as promising approaches to improving speech quality and intelligibility in challenging acoustic environments. PSE leverages individual-specific vocal characteristics to address the label permutation problem, while AV SE incorporates visual cues, particularly lip movements, to complement auditory signals in noisy conditions where speech is degraded by competing noise sources. This paper presents a novel framework that unifies these two paradigms, advancing towards personalised AV SE. By integrating raw enrolment audio for adaptive target-speaker representation with AV inputs, the proposed system aims to achieve robust SE in real-world environments. Experimental results on the COG-MHEAR Audio-Visual Speech Enhancement Challenge dataset demonstrate significant improvements in speech intelligibility and noise suppression, outperforming state-of-the-art PSE and AV SE models.