The typical RNN (Recurrent Neural Network) pipeline in SLU (Spoken
Language Understanding), and specifically in the slot-filling task,
consists of three stages: word embedding, context window representation,
and label prediction. Because label prediction is a classification task
trained by back-propagation, it is the stage that shapes the context window
representation during learning. However, due to natural variations
in the data, differences between two same-labeled samples can lead to dissimilar
representations, whereas similarities between two differently-labeled samples
can give them close representations. In computer vision applications,
specifically in face recognition and person re-identification, this
problem has recently been successfully tackled by introducing data
triplets and a triplet loss function.
In SLU, each word can be mapped to one or multiple labels depending on small
variations of its context. We exploit this fact to construct data triplets, each
consisting of the same word in different contexts: one pair of datapoints
with matching target labels and another pair with non-matching labels.
Using these triplets and an additional loss function, we update
the context window representation so that dissimilar samples move
farther apart and similar samples move closer together, leading to better
classification results and an improved rate of convergence.
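
The triplet idea described above can be illustrated with a minimal sketch. It is not the authors' implementation: the Euclidean distance, the hinge form with a margin of 1.0, the helper names (`euclidean`, `triplet_loss`, `build_triplets`), the example word, and the slot labels are all assumptions made for illustration, following the form of triplet loss commonly used in face recognition. Each sample is a hypothetical (word, label, context window representation) tuple.

```python
import math


def euclidean(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def triplet_loss(anchor, positive, negative, margin=1.0):
    """Hinge-style triplet loss (assumed form, not taken from the source).

    The loss is zero once the negative is at least `margin` farther
    from the anchor than the positive is.
    """
    return max(0.0, euclidean(anchor, positive)
               - euclidean(anchor, negative) + margin)


def build_triplets(samples):
    """Build (anchor, positive, negative) representation triplets.

    `samples` is a list of (word, label, representation) tuples.
    All three members of a triplet share the same word; anchor and
    positive share a label, anchor and negative do not.
    """
    triplets = []
    for i, (w_a, y_a, r_a) in enumerate(samples):
        for j, (w_p, y_p, r_p) in enumerate(samples):
            if i == j or w_p != w_a or y_p != y_a:
                continue
            for w_n, y_n, r_n in samples:
                if w_n == w_a and y_n != y_a:
                    triplets.append((r_a, r_p, r_n))
    return triplets


# Hypothetical data: one word seen in three contexts, two slot labels.
samples = [
    ("boston", "B-fromloc", [0.0, 0.0]),
    ("boston", "B-fromloc", [3.0, 4.0]),
    ("boston", "B-toloc", [0.0, 1.0]),
]

triplets = build_triplets(samples)
losses = [triplet_loss(a, p, n) for a, p, n in triplets]
```

In this toy data the two same-labeled samples lie farther apart than the differently-labeled one, so both triplets yield a positive loss; minimizing it would pull the matching pair together and push the non-matching pair apart. In training, such a term would be added to the classification loss, with a weighting that the text above does not specify.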