Leveraging contextual information is an intuitive way to improve the performance of conversational automatic speech recognition (ASR). Previous work usually adopts the recognized hypotheses of historical utterances as preceding context, which may bias the current hypothesis due to inevitable historical recognition errors. To avoid this problem, we propose an audio-textual cross-modal representation extractor that learns contextual representations directly from the preceding speech. Specifically, it consists of two modal-related encoders, which extract high-level latent features from speech and text, respectively, and a cross-modal encoder, which learns the correlation between speech and text. For each modal-related encoder, we randomly mask some input tokens or the entire input sequence, and then perform token-missing or modal-missing prediction together with a modal-level CTC loss on the cross-modal encoder. Thus, the model captures not only the bidirectional context dependencies within a single modality but also the relationships between modalities. The extractor is then frozen and used to extract textual representations of the preceding speech, which are integrated into the conversational ASR system through an attention mechanism during training. The effectiveness of the proposed approach is validated on several Mandarin conversation corpora, with a character error rate (CER) reduction of up to 16% achieved on the MagicData dataset.
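
The following is a minimal PyTorch sketch of the cross-modal representation extractor described above, not the paper's exact implementation. All module names, layer sizes, and hyperparameters (e.g. `CrossModalExtractor`, `d_model`, `mask_prob`, the 80-dim filterbank input, the vocabulary size) are illustrative assumptions; it only shows the overall flow of modal-related encoding, random masking, token-missing prediction, and a modal-level CTC loss on the cross-modal encoder output.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class CrossModalExtractor(nn.Module):
    """Illustrative sketch: speech/text encoders + cross-modal encoder."""

    def __init__(self, vocab_size=5000, d_model=256, n_heads=4, n_layers=3):
        super().__init__()
        # Modal-related encoders: one for acoustic features, one for text tokens.
        self.audio_proj = nn.Linear(80, d_model)           # assumes 80-dim fbank frames
        self.text_embed = nn.Embedding(vocab_size, d_model)
        self.audio_encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True), n_layers)
        self.text_encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True), n_layers)
        # Cross-modal encoder over the concatenated speech/text sequences.
        self.cross_encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True), n_layers)
        self.token_head = nn.Linear(d_model, vocab_size)   # token-missing prediction
        self.ctc_head = nn.Linear(d_model, vocab_size)     # modal-level CTC loss
        self.mask_embed = nn.Parameter(torch.zeros(d_model))

    def forward(self, fbank, tokens, token_lens, mask_prob=0.15):
        # High-level latent features from each modality.
        a = self.audio_encoder(self.audio_proj(fbank))      # (B, T_a, D)
        t = self.text_encoder(self.text_embed(tokens))      # (B, T_t, D)
        # Randomly mask some text positions (token-missing); masking the whole
        # sequence would correspond to the modal-missing case.
        mask = torch.rand(t.shape[:2], device=t.device) < mask_prob
        t = torch.where(mask.unsqueeze(-1), self.mask_embed.expand_as(t), t)
        # Cross-modal encoder learns the speech-text correlation over the joint sequence.
        joint = self.cross_encoder(torch.cat([a, t], dim=1))
        a_out, t_out = joint[:, :a.size(1)], joint[:, a.size(1):]
        # Token-missing prediction on the masked text positions.
        logits = self.token_head(t_out)
        pred_loss = (F.cross_entropy(logits[mask], tokens[mask])
                     if mask.any() else logits.sum() * 0)
        # Modal-level CTC loss on the speech side of the cross-modal output.
        log_probs = self.ctc_head(a_out).log_softmax(-1).transpose(0, 1)  # (T_a, B, V)
        input_lens = torch.full((fbank.size(0),), a_out.size(1), dtype=torch.long)
        ctc_loss = F.ctc_loss(log_probs, tokens, input_lens, token_lens, blank=0)
        return pred_loss + ctc_loss


# Toy usage: a batch of 2 utterances with 100 fbank frames and 12 text tokens each.
model = CrossModalExtractor()
fbank = torch.randn(2, 100, 80)
tokens = torch.randint(1, 5000, (2, 12))
loss = model(fbank, tokens, token_lens=torch.tensor([12, 12]))
loss.backward()
```

In the full system, an extractor trained in this way would be frozen and its outputs for the preceding speech attended to by the ASR decoder; that integration step is omitted here.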