Recently, with the advent of deep learning, there has been significant progress in the processing of speech mixtures. In particular, neural networks have enabled target speech extraction, which extracts the speech signal of a target speaker from a speech mixture by using an auxiliary clue that represents the characteristics of the target speaker. For example, audio clues derived from an auxiliary utterance spoken by the target speaker have been used to characterize the target speaker. Audio clues should capture the fine-grained characteristics of the target speaker’s voice (e.g., pitch). Alternatively, visual clues derived from a video of the target speaker’s face while speaking in the mixture have also been investigated. Visual clues should mainly capture the phonetic information derived from lip movements. In this paper, we propose a novel target speech extraction scheme that combines audio and visual clues about the target speaker to take advantage of the information provided by both modalities. We introduce an attention mechanism that emphasizes the most informative speaker clue at every time frame. Experiments on mixtures of two speakers demonstrated that the proposed method using audio-visual speaker clues significantly improved extraction performance compared with conventional methods that use either audio or visual speaker clues alone.
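To make the idea of emphasizing the most informative speaker clue at every time frame concrete, the following is a minimal sketch and not the implementation used in this work. It assumes that each clue is an embedding vector, that a per-frame query vector is derived from the mixture, and that relevance is scored by a plain dot product followed by a softmax; the names `fuse_clues`, `audio_clue`, `visual_clues`, and `queries` are hypothetical and chosen only for illustration.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a short list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def fuse_clues(audio_clue, visual_clues, queries):
    """Fuse an audio clue and per-frame visual clues with per-frame attention.

    audio_clue   : one embedding vector, assumed constant over time
    visual_clues : one embedding vector per time frame
    queries      : one query vector per time frame (assumed to be derived
                   from the mixture; dot-product scoring is an assumption)
    Returns the fused clue per frame and the attention weights per frame.
    """
    fused, weights = [], []
    for query, visual in zip(queries, visual_clues):
        # Score each modality against the current frame, then normalize.
        w_audio, w_visual = softmax([dot(query, audio_clue), dot(query, visual)])
        # The fused clue is the attention-weighted sum of the two clues.
        fused.append([w_audio * a + w_visual * v for a, v in zip(audio_clue, visual)])
        weights.append([w_audio, w_visual])
    return fused, weights


# Toy example: frame 0 matches the audio clue, frame 1 the visual clue.
audio_clue = [1.0, 0.0]
visual_clues = [[0.0, 1.0], [0.0, 1.0]]
queries = [[2.0, 0.0], [0.0, 2.0]]
fused, weights = fuse_clues(audio_clue, visual_clues, queries)
for t, w in enumerate(weights):
    print(f"frame {t}: audio weight {w[0]:.2f}, visual weight {w[1]:.2f}")
```

In this toy setting the weights shift toward the audio clue in the first frame and toward the visual clue in the second, which is the kind of frame-dependent emphasis the attention mechanism is intended to provide, for instance when one modality is temporarily unreliable.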