Learning discriminative Acoustic Word Embeddings (AWEs) that summarise variable-length spoken word segments brings efficiency to speech retrieval tasks, notably Query-by-Example (QbE) speech search and Spoken Term Detection (STD). In this paper, we build on RNN-based approaches for generating acoustic word embeddings. The model is trained in an encoder-decoder fashion on pairs of similar word segments by optimising a pairwise self-supervised loss whose targets are generated offline via clustering. The pairs may be generated using word boundaries (weak supervision) or via augmentation of unlabelled word segments (no supervision). Experiments on the word discrimination task on TIMIT and LibriSpeech show that the proposed approach achieves state-of-the-art performance, outperforming popular RNN-based AWE approaches in both weakly supervised and unsupervised settings. The AWEs generated by our model also generalise well to out-of-vocabulary (OOV) words. On STD tasks on TIMIT, the proposed approach additionally provides speed advantages.
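To make the pairwise encoder-decoder training concrete, the sketch below shows one possible training step in PyTorch. The class `PairAWE`, the GRU encoder/decoder, the feature and embedding dimensions, and the cross-entropy over offline cluster IDs are illustrative assumptions, not the paper's exact architecture: one segment of a pair is encoded into a fixed-dimensional AWE, which is then decoded to predict the cluster targets of its paired segment.

```python
# Minimal sketch (assumed setup, not the authors' exact model): encode segment
# x_a of a pair into a fixed-dim AWE, then decode the offline cluster targets
# of its paired segment x_b with a pairwise self-supervised loss.
import torch
import torch.nn as nn

class PairAWE(nn.Module):
    def __init__(self, feat_dim=39, emb_dim=128, n_clusters=100):
        super().__init__()
        self.encoder = nn.GRU(feat_dim, emb_dim, batch_first=True)
        self.decoder = nn.GRU(emb_dim, emb_dim, batch_first=True)
        self.out = nn.Linear(emb_dim, n_clusters)

    def forward(self, x_a, len_b):
        _, h = self.encoder(x_a)               # h: (1, B, emb_dim)
        emb = h[-1]                            # fixed-dim acoustic word embedding
        # Feed the embedding as input at every decoder step of the paired segment.
        dec_in = emb.unsqueeze(1).expand(-1, len_b, -1)
        dec_out, _ = self.decoder(dec_in)
        return emb, self.out(dec_out)          # logits over offline cluster IDs

model = PairAWE()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)

x_a = torch.randn(8, 50, 39)                  # 8 segments, 50 frames, 39-dim features
targets_b = torch.randint(0, 100, (8, 60))    # cluster IDs of paired segments (offline)

opt.zero_grad()
emb, logits = model(x_a, len_b=60)            # logits: (8, 60, 100)
loss = nn.functional.cross_entropy(logits.transpose(1, 2), targets_b)
loss.backward()
opt.step()
```

In this reading, the "weakly supervised" and "unsupervised" settings differ only in how the pairs (`x_a`, `x_b`) are obtained, i.e. from word boundaries versus augmented unlabelled segments; the training step itself is unchanged.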