Improving End-to-End SLU Performance with Prosodic Attention and Distillation

Abstract

Most end-to-end spoken language understanding (SLU) systems rely on features from pretrained automatic speech recognition (ASR) models or language models (LMs) and ignore prosody, even though how something is said often matters as much as what is said. We propose two methods to address this gap: prosody-attention, which builds attention maps over time from prosodic features, and prosody-distillation, which trains the acoustic encoder to learn prosodic patterns directly rather than simply concatenating prosodic features with the acoustic features.

Prosody-distillation improves intent classification accuracy over the prosody baseline by 8% on SLURP and 2% on STOP.
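
To make the two ideas concrete, the following is a minimal sketch and not the implementation evaluated here. It assumes that prosody-attention scores each time frame from its prosodic features alone and uses the resulting softmax weights to pool acoustic frames, and that prosody-distillation adds an auxiliary regression loss that asks the encoder states to predict the prosodic features. The function names, the scoring vector `w`, the projection `proj`, the weight `lam`, and the choice of mean squared error are all illustrative assumptions.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def prosody_attention(acoustic_frames, prosody_frames, w):
    """Pool acoustic frames using attention weights derived from prosody.

    acoustic_frames: T x D list of frame embeddings.
    prosody_frames:  T x P list of prosodic features (e.g. pitch, energy).
    w:               length-P scoring vector (hypothetical; learned in practice).
    """
    # One scalar score per time frame, computed from prosody only.
    scores = [sum(wi * pi for wi, pi in zip(w, frame)) for frame in prosody_frames]
    weights = softmax(scores)
    dim = len(acoustic_frames[0])
    # Weighted sum of acoustic frames gives an utterance-level vector.
    pooled = [
        sum(a * frame[d] for a, frame in zip(weights, acoustic_frames))
        for d in range(dim)
    ]
    return weights, pooled


def distillation_loss(encoder_frames, prosody_frames, proj):
    """Mean squared error between projected encoder states and prosodic targets.

    proj: P x D matrix acting as a hypothetical linear prediction head.
    """
    total, count = 0.0, 0
    for h, target in zip(encoder_frames, prosody_frames):
        pred = [sum(pw * hd for pw, hd in zip(row, h)) for row in proj]
        for p, t in zip(pred, target):
            total += (p - t) ** 2
            count += 1
    return total / count


def total_loss(intent_loss, distill_loss, lam=0.5):
    """Intent loss plus a weighted auxiliary prosody term (lam is assumed)."""
    return intent_loss + lam * distill_loss


# Toy example: three frames, 2-dim acoustic and 2-dim prosodic features.
acoustic = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
prosody = [[0.1, 0.2], [0.9, 0.8], [0.1, 0.2]]
weights, pooled = prosody_attention(acoustic, prosody, w=[1.0, 1.0])
aux = distillation_loss(acoustic, prosody, proj=[[1.0, 0.0], [0.0, 1.0]])
print(weights, pooled, total_loss(intent_loss=1.0, distill_loss=aux))
```

In this toy setup the middle frame has the largest prosodic score, so it receives the largest attention weight. The distillation term falls to zero only when the projected encoder states reproduce the prosodic targets exactly, which is the sense in which the encoder is pushed to carry prosodic information itself.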