We consider the problem of spoken language understanding (SLU): extracting
natural language intents and associated slot arguments or named entities
from speech that is primarily directed at voice assistants. Such a
system subsumes both automatic speech recognition (ASR) and
natural language understanding (NLU). An end-to-end joint SLU model
can be built to a required specification, opening up the opportunity
to deploy in hardware-constrained scenarios such as on-device use, enabling
voice assistants to work offline, in a privacy-preserving manner, whilst
also reducing server costs.
We first present models
that extract utterance intent directly from speech without intermediate
text output. We then present a compositional model, which generates
the transcript using the Listen Attend Spell ASR system and then extracts
the interpretation using a neural NLU model. Finally, we contrast these
methods with a jointly trained end-to-end SLU model, consisting
of ASR and NLU subsystems connected by a neural-network-based
interface instead of text, that produces transcripts as well as NLU
interpretation. We show that the jointly trained model improves
ASR by incorporating semantic information from NLU, and also improves
NLU by exposing it to ASR confusion encoded in the hidden layer.
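
To make the idea of a neural interface concrete, the following is a minimal sketch in plain Python, not the architecture evaluated here. It assumes a toy single-layer encoder that emits one token per frame (in place of an attention-based decoder), a mean-pooled intent classifier, and an unweighted sum of the two cross-entropy losses; the names `JointSLU`, `joint_loss`, and `nlu_weight` are hypothetical. The point it illustrates is that the NLU head reads the ASR hidden states rather than a decoded transcript, so both losses shape the shared representation.

```python
import math
import random


def make_matrix(rows, cols, rng):
    # Small random weights; a real system would learn these by gradient descent.
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]


def matvec(matrix, vector):
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def log_softmax(logits):
    # Numerically stable log-softmax.
    peak = max(logits)
    log_z = peak + math.log(sum(math.exp(v - peak) for v in logits))
    return [v - log_z for v in logits]


class JointSLU:
    """Toy joint model (assumed layout): shared hidden states feed both heads."""

    def __init__(self, feat_dim, hidden_dim, vocab_size, num_intents, seed=0):
        rng = random.Random(seed)
        self.encoder = make_matrix(hidden_dim, feat_dim, rng)
        self.token_head = make_matrix(vocab_size, hidden_dim, rng)
        self.intent_head = make_matrix(num_intents, hidden_dim, rng)

    def forward(self, frames):
        # ASR subsystem: acoustic frames -> hidden states -> token logits.
        hidden = [[math.tanh(a) for a in matvec(self.encoder, f)] for f in frames]
        token_logits = [matvec(self.token_head, h) for h in hidden]
        # Neural interface: NLU consumes the hidden states, not decoded text.
        pooled = [sum(column) / len(hidden) for column in zip(*hidden)]
        intent_logits = matvec(self.intent_head, pooled)
        return token_logits, intent_logits


def joint_loss(token_logits, tokens, intent_logits, intent, nlu_weight=1.0):
    # Sum of ASR and NLU cross-entropies; the weighting is an assumption.
    asr_loss = -sum(
        log_softmax(logits)[t] for logits, t in zip(token_logits, tokens)
    ) / len(tokens)
    nlu_loss = -log_softmax(intent_logits)[intent]
    return asr_loss + nlu_weight * nlu_loss


# Usage with made-up inputs: two 4-dimensional frames, two tokens, one intent.
model = JointSLU(feat_dim=4, hidden_dim=8, vocab_size=5, num_intents=3)
frames = [[0.1, 0.2, 0.3, 0.4], [0.0, -0.1, 0.2, 0.1]]
token_logits, intent_logits = model.forward(frames)
loss = joint_loss(token_logits, tokens=[1, 3], intent_logits=intent_logits, intent=2)
```

Because the intent loss is computed from the same hidden states that produce the transcript, minimising the summed loss would push semantic information into the ASR representation while exposing the NLU head to ASR uncertainty, which is the interaction the abstract describes.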