This paper proposes a novel spiking neural network (SNN) architecture that integrates with the generalised Hough transform (GHT) framework for the task of detecting specific speech patterns such as command words. The idea is that the GHT models the geometrical distribution of speech information over the wider temporal context, while the SNN is used to learn the discriminative prior weighting in the GHT and to provide a spike output indicating a detection decision. The SNN therefore enhances the projection of the GHT from the input acoustic information into the sparse Hough accumulator space for detecting specific sound patterns. Compared with conventional neural network architectures for this task, the GHT-SNN system has the advantage that it does not require a voice activity detection module or an explicit noise model to reject non-target frames. Instead, the output of the SNN is a voltage that is trained to exceed a threshold for positive instances of the sound pattern while remaining below this threshold otherwise. Experiments are carried out on the challenging ChaLearn gesture recognition task, where spoken commands must be detected against variable background noise while rejecting a range of out-of-vocabulary words.
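The voting-and-threshold idea described above can be illustrated with a minimal sketch. It is not the system proposed in the paper: the function names, the codebook of local features, the temporal offsets and the vote weights are all hypothetical, the weights are fixed by hand rather than learned by an SNN, and the spiking dynamics are reduced to a plain threshold comparison on the accumulated value.

```python
def hough_accumulate(events, weights, offsets, n_frames):
    """Accumulate weighted votes for the reference time of a sound pattern.

    events  : list of (frame_index, codebook_id) pairs (hypothetical local features)
    weights : vote weight per codebook entry (stand-in for learned weights)
    offsets : frames between the pattern reference point and each codebook entry
    """
    acc = [0.0] * n_frames
    for frame, entry in events:
        ref = frame - offsets[entry]      # frame this feature votes for
        if 0 <= ref < n_frames:           # ignore votes that fall outside the signal
            acc[ref] += weights[entry]
    return acc


def detect(acc, threshold):
    """Return the frames whose accumulated vote exceeds the threshold."""
    return [t for t, v in enumerate(acc) if v > threshold]


# Toy data (assumed): three features agree on frame 4, one stray feature votes for frame 9.
offsets = [1, 3, 5]
weights = [0.5, 0.4, 0.3]
events = [(5, 0), (7, 1), (9, 2), (12, 1)]

acc = hough_accumulate(events, weights, offsets, n_frames=20)
detections = detect(acc, threshold=1.0)
print(detections)
```

In this toy case the consistent votes push one accumulator cell above the threshold, while the isolated vote stays below it, so no separate noise model is needed to reject it.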