Improving end-to-end spoken language understanding with prosodic attention and distillation

Abstract

End-to-end spoken language understanding (SLU) systems typically leverage pretrained ASR or language-model features but overlook prosodic information. We propose two novel techniques: prosody-attention, which generates attention maps over utterance time frames from prosodic features, and prosody-distillation, which explicitly teaches the acoustic encoder prosodic patterns rather than leaving them to be learned implicitly from concatenated features. Prosody-distillation improves intent classification accuracy by 8% on SLURP and 2% on STOP over the prosody baseline.
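The abstract describes the two techniques only at a high level. The sketch below is a minimal, assumed illustration of the general ideas, not the paper's implementation: prosody-attention is rendered as a softmax over per-frame prosodic scores (e.g., pitch or energy) used to pool acoustic frame features, and prosody-distillation as a mean-squared-error loss pulling encoder outputs toward prosodic targets. All function names, the softmax pooling form, and the MSE loss choice are assumptions.

```python
import math

def prosody_attention(acoustic_feats, prosodic_scores):
    """Pool per-frame acoustic features with attention weights
    derived from per-frame prosodic scores (assumed formulation)."""
    # Numerically stable softmax over the prosodic scores.
    m = max(prosodic_scores)
    exps = [math.exp(s - m) for s in prosodic_scores]
    z = sum(exps)
    weights = [e / z for e in exps]
    # Attention-weighted sum over time frames for each feature dim.
    dim = len(acoustic_feats[0])
    pooled = [
        sum(w * frame[d] for w, frame in zip(weights, acoustic_feats))
        for d in range(dim)
    ]
    return weights, pooled

def prosody_distillation_loss(encoder_out, prosody_target):
    """MSE between encoder outputs and prosodic targets: a simple
    stand-in for a distillation objective (assumed loss choice)."""
    n = len(encoder_out)
    return sum((p - t) ** 2 for p, t in zip(encoder_out, prosody_target)) / n

# Example: two frames with equal prosodic salience get uniform weights.
feats = [[1.0, 0.0], [0.0, 1.0]]
scores = [0.0, 0.0]
w, pooled = prosody_attention(feats, scores)
loss = prosody_distillation_loss([0.5, 0.5], pooled)
```

With equal prosodic scores the weights are uniform and the pooled vector is the frame average; a frame with a higher prosodic score (e.g., a pitch peak) would dominate the pooled representation instead.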

Publication
Interspeech 2023