Retrieving photos from a huge collection using high-level personal queries (e.g., "uncle Bill's house") is very attractive to users, but technically very challenging. Previous works proposed a set of approaches toward this goal, assuming that only 30% of the photos are annotated with sparse spoken descriptions when they are taken. In this paper, to promote the interaction between different types of features, we use continuous-space word representations to train a paragraph vector model for the speech annotations, and then fuse the paragraph vectors with the visual features produced by a deep Convolutional Neural Network (CNN) using a Deep AutoEncoder (DAE). The retrieval framework therefore combines the word vectors and paragraph vectors of the speech annotations, the CNN-based visual features, and the DAE-based fused visual/speech features in a three-stage process that includes a two-layer random walk. Preliminary experiments show that the retrieval performance is significantly improved.
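A minimal sketch of the DAE-based fusion step is given below, assuming PyTorch and purely illustrative dimensionalities (300-dimensional paragraph vectors and 4096-dimensional CNN features); the concatenated speech/visual features are compressed into a shared bottleneck code and trained with a reconstruction loss. All layer sizes, names, and hyperparameters here are assumptions for illustration, not the exact configuration used in the paper.

```python
# Hypothetical sketch of DAE-based visual/speech feature fusion.
# Layer sizes and variable names are illustrative assumptions, not the paper's setup.
import torch
import torch.nn as nn

class FusionDAE(nn.Module):
    def __init__(self, speech_dim=300, visual_dim=4096, hidden_dim=1024, fused_dim=256):
        super().__init__()
        in_dim = speech_dim + visual_dim
        # Encoder maps the concatenated speech/visual features to a fused code.
        self.encoder = nn.Sequential(
            nn.Linear(in_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, fused_dim), nn.ReLU(),
        )
        # Decoder reconstructs the original concatenated features from the code.
        self.decoder = nn.Sequential(
            nn.Linear(fused_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, in_dim),
        )

    def forward(self, speech_vec, visual_vec):
        x = torch.cat([speech_vec, visual_vec], dim=-1)
        fused = self.encoder(x)      # fused visual/speech representation
        recon = self.decoder(fused)  # reconstruction used for training
        return fused, recon, x

# Training sketch: minimize reconstruction error on photos that carry annotations.
model = FusionDAE()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
criterion = nn.MSELoss()

speech = torch.randn(32, 300)    # paragraph vectors of spoken annotations (dummy data)
visual = torch.randn(32, 4096)   # CNN visual features (dummy data)

for _ in range(10):
    fused, recon, target = model(speech, visual)
    loss = criterion(recon, target)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```

Under this sketch, the bottleneck codes produced by the encoder would serve as the DAE-based fused features that, together with the word vectors, paragraph vectors, and CNN features, feed the later retrieval stages such as the two-layer random walk.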