Target speaker extraction (TSE) aims to isolate the target speaker's speech from interfering speakers. Typically, an auxiliary reference, such as pre-recorded speech or lip movements, is essential for directing attention to the target speaker. Existing methods use one of these cues or fuse both via attention mechanisms, yielding a single shared representation of the target speaker. Although both cues describe the same speaker, they carry distinct attributes: the audio cue captures the speaker's timbre, whereas lip movements convey audio-visual synchrony. To combine the strengths of the two cues, we propose a unified TSE network, termed Uni-Net, that adopts a divide-and-conquer strategy, feeding the audio and lip cues into separate networks so that each cue's unique information is fully exploited. The speech extracted under each cue serves as prior information and is further refined by a post-processing network. Experiments on the public VoxCeleb2 corpus show that Uni-Net achieves state-of-the-art performance compared with the baselines.
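
A minimal sketch of the divide-and-conquer idea described above: two cue-conditioned extraction branches produce speech estimates that a post-processing network refines. All module names, layer choices, and feature dimensions here are illustrative assumptions, not the actual Uni-Net implementation.

```python
# Hypothetical sketch of the divide-and-conquer strategy; shapes and layers are assumptions.
import torch
import torch.nn as nn


class CueConditionedExtractor(nn.Module):
    """One extraction branch conditioned on a single cue (audio or lip)."""

    def __init__(self, feat_dim: int, cue_dim: int):
        super().__init__()
        self.cue_proj = nn.Linear(cue_dim, feat_dim)
        self.backbone = nn.Sequential(
            nn.Conv1d(feat_dim, feat_dim, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.Conv1d(feat_dim, feat_dim, kernel_size=3, padding=1),
        )
        self.mask = nn.Conv1d(feat_dim, feat_dim, kernel_size=1)

    def forward(self, mixture_feat, cue):
        # mixture_feat: (B, F, T); cue: (B, cue_dim) or (B, T, cue_dim)
        if cue.dim() == 2:  # static cue, e.g. a speaker embedding from pre-recorded speech
            cue = cue.unsqueeze(1).expand(-1, mixture_feat.size(-1), -1)
        cond = self.cue_proj(cue).transpose(1, 2)          # (B, F, T)
        h = self.backbone(mixture_feat + cond)
        return torch.sigmoid(self.mask(h)) * mixture_feat  # masked speech estimate


class PostProcessor(nn.Module):
    """Refines the two cue-specific estimates, used as priors, into the final output."""

    def __init__(self, feat_dim: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv1d(3 * feat_dim, feat_dim, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.Conv1d(feat_dim, feat_dim, kernel_size=1),
        )

    def forward(self, mixture_feat, est_audio_cue, est_lip_cue):
        prior = torch.cat([mixture_feat, est_audio_cue, est_lip_cue], dim=1)
        return self.net(prior)


# Usage with random tensors: batch of 2, 128-dim mixture features, 100 frames.
mix = torch.randn(2, 128, 100)
spk_emb = torch.randn(2, 256)        # audio cue: speaker embedding (assumed 256-dim)
lip_feat = torch.randn(2, 100, 512)  # lip cue: frame-level visual features (assumed 512-dim)

audio_branch = CueConditionedExtractor(128, 256)
lip_branch = CueConditionedExtractor(128, 512)
post = PostProcessor(128)

refined = post(mix, audio_branch(mix, spk_emb), lip_branch(mix, lip_feat))
print(refined.shape)  # torch.Size([2, 128, 100])
```

The design point illustrated is that each branch sees only its own cue, so timbre information and audio-visual synchrony are exploited separately before the post-processing stage fuses the resulting estimates.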