Speech

DualTurn: Learning Turn-Taking from Dual-Channel Generative Speech Pretraining

DualTurn learns natural conversational turn-taking via generative pretraining on dual-channel audio, outperforming larger models on turn prediction tasks.

SpeechLLM: Multi-Modal LLM for Speech Understanding

SpeechLLM: A multimodal LLM combining speech encoders with TinyLlama for joint ASR and prediction of speaker gender, age, accent, and emotion from audio.

Improving spoken language identification with MAP-Mix

MAP-Mix: Training-dynamics-guided data augmentation for spoken language identification, achieving a ~2% F1 improvement over random mixup (ICASSP 2023).

Improving end-to-end spoken language understanding with prosodic attention and distillation

Prosodic attention and distillation techniques to improve end-to-end SLU, achieving up to an 8% gain in intent classification accuracy on the SLURP dataset.

Skit-S2I: An Indian Accented Speech to Intent Dataset

Skit-S2I: The first publicly available Indian-accented SLU dataset in the banking domain for end-to-end speech-to-intent research.

Learning speaker representation with semi-supervised learning approach for speaker profiling

A semi-supervised framework for speaker profiling (age and height estimation) that leverages external unlabelled speech data via consistency training.