Automated audio captioning (AAC) is the task of generating a text description of an audio recording. Audio captions describe sound events, the acoustic scene, and the relationships between events. Most current AAC systems are based on an encoder-decoder architecture, in which the decoder predicts the caption solely from the features extracted from the audio clip. Consequently, learning more effective audio features allows the decoder to generate more appropriate descriptions. This paper proposes an approach that guides caption generation with multi-level information extracted from the audio clip. Specifically, we use two modules to obtain acoustic information for semantic expression: (1) a module that combines channel attention and spatial attention to emphasize important features, and (2) a trained keyword prediction module that provides word-level guidance information. We apply these modules to a CNN-Transformer architecture and evaluate them on the Clotho dataset. The results show that the proposed approach significantly improves the scores of various evaluation metrics and achieves state-of-the-art performance in the cross-entropy training stage.
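
For illustration, the following is a minimal PyTorch-style sketch of a combined channel and spatial attention block of the kind referred to above (a CBAM-style design); the class name, reduction ratio, and kernel size are assumptions, and the exact module used in the paper may differ.

```python
import torch
import torch.nn as nn

class ChannelSpatialAttention(nn.Module):
    """Sketch of a combined channel + spatial attention block (CBAM-style)."""

    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        # Channel attention: pool over time-frequency, weight each feature channel.
        self.channel_mlp = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
        )
        # Spatial attention: convolution over pooled channel maps.
        self.spatial_conv = nn.Conv2d(2, 1, kernel_size=7, padding=3)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, channels, time, freq) feature map from the CNN encoder.
        b, c, _, _ = x.shape
        # Channel attention from average- and max-pooled descriptors.
        avg = self.channel_mlp(x.mean(dim=(2, 3)))           # (b, c)
        mx = self.channel_mlp(x.amax(dim=(2, 3)))            # (b, c)
        ch_att = torch.sigmoid(avg + mx).view(b, c, 1, 1)
        x = x * ch_att
        # Spatial attention over channel-pooled maps.
        sp = torch.cat([x.mean(dim=1, keepdim=True),
                        x.amax(dim=1, keepdim=True)], dim=1)  # (b, 2, T, F)
        sp_att = torch.sigmoid(self.spatial_conv(sp))          # (b, 1, T, F)
        return x * sp_att
```

In such a design, the re-weighted feature map would be passed to the Transformer decoder in place of the raw CNN output, so that caption generation attends to the emphasized time-frequency regions.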