SpeechLLM: Multi-Modal LLM for Speech Understanding

Abstract

SpeechLLM is a multi-modal language model that combines a speech encoder (HuBERT or WavLM) with a large language model backbone (TinyLlama) to analyze speaker turns in conversations and extract metadata from them. Given 16 kHz audio, the model simultaneously performs six tasks: Speech Activity Detection, Automatic Speech Recognition, Gender Classification, Age Estimation (young/middle-aged/senior), Accent Recognition (seven global regions), and Emotion Detection (happy, sad, angry, neutral, frustrated). Two model variants are available: speechllm-2B and speechllm-1.5B.
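Since the model answers all six tasks for a single turn at once, downstream code typically receives one structured response per turn. The sketch below shows what consuming such a response could look like, assuming the model emits its metadata as a JSON object; the key names and example values here are illustrative assumptions, not the model's documented output schema.

```python
import json

# Hypothetical raw answer for one speaker turn; the exact key names and
# value spellings are assumptions chosen to mirror the six tasks above.
raw_output = json.dumps({
    "SpeechActivity": "True",
    "Transcript": "I can't believe the flight got cancelled again.",
    "Gender": "Female",
    "Age": "Middle-Aged",
    "Accent": "North America",
    "Emotion": "Frustrated",
})

def parse_turn_metadata(model_output: str) -> dict:
    """Parse the model's JSON answer into a plain dict, one entry per task."""
    return json.loads(model_output)

meta = parse_turn_metadata(raw_output)
print(meta["Emotion"])  # → Frustrated
print(len(meta))        # one entry per task → 6
```

A structured per-turn record like this makes it straightforward to aggregate predictions across a whole conversation, e.g. tracking emotion per speaker over time.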

Publication
GitHub / HuggingFace Model Release