Recently, prompt-conditioned Automatic Speech Recognition (ASR) systems have shown remarkable versatility. For contextual biasing with these systems, a pivotal factor is obtaining well-matched prompts. To address this issue, we exploit Contrastive Language-Audio Pre-training to retrieve matched entities from a user-specified list. Rather than confining contrastive learning to the sentence level, we propose the Global-Local Contrastive Language-Audio Pre-trained model (GLCLAP). At the global scale, semantic information is extracted from audio and text, enabling a holistic understanding of the input. At the local scale, the model focuses on detailed word-level information within individual segments. This multi-scale information leads to a remarkable improvement in bias-word retrieval accuracy. Using the GLCLAP bias-word retrieval system as the prompt-generation component, the accuracy of the final ASR decoding result is significantly improved without fine-tuning.
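The abstract does not spell out the training objective, but a common way to combine sentence-level and word-level contrastive learning is to sum a symmetric InfoNCE loss computed at each scale. The sketch below is a hypothetical illustration of that idea: the function names (`info_nce`, `glclap_loss`), the temperature `tau`, and the weighting factor `alpha` are assumptions for exposition, not the paper's actual formulation.

```python
import numpy as np

def info_nce(a, b, tau=0.07):
    """Symmetric InfoNCE loss over two sets of L2-normalized
    embeddings a, b of shape (N, D), where row i of a matches
    row i of b (e.g. audio vs. text of the same utterance)."""
    logits = a @ b.T / tau  # (N, N) similarity matrix

    def xent(l):
        # cross-entropy with the diagonal as the positive class
        l = l - l.max(axis=1, keepdims=True)  # numerical stability
        logp = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -np.mean(np.diag(logp))

    # average of audio-to-text and text-to-audio directions
    return 0.5 * (xent(logits) + xent(logits.T))

def glclap_loss(audio_g, text_g, audio_l, text_l, alpha=0.5):
    """Hypothetical multi-scale objective: a global (sentence-level)
    term over (N, D) embeddings plus a local (word/segment-level)
    term over (M, D) embeddings, weighted by alpha."""
    return info_nce(audio_g, text_g) + alpha * info_nce(audio_l, text_l)
```

Under such an objective, matched audio-text pairs at both scales pull their embeddings together while mismatched pairs push apart, which is what would let the retrieval component rank the correct bias words from a user-specified list.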