DualTurn: Generative pretraining on dual-channel audio to learn natural conversational turn-taking, outperforming larger models on turn-prediction tasks.
SpeechLLM: A multimodal LLM combining speech encoders with TinyLlama for joint ASR, gender, age, accent, and emotion prediction from audio.
MAP-Mix: Training-dynamics-guided data augmentation for spoken language identification, achieving a ~2% F1 improvement over random mixup (ICASSP 2023).
Prosodic attention and distillation techniques to improve end-to-end spoken language understanding (SLU), achieving up to an 8% gain in intent classification accuracy on the SLURP dataset.
Skit-S2I: The first publicly available Indian-accented SLU dataset in the banking domain for end-to-end speech-to-intent research.
Semi-supervised framework for speaker profiling (age and height estimation) that leverages external unlabelled speech data via consistency training.
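
For readers unfamiliar with the "random mixup" baseline mentioned in the MAP-Mix entry, a minimal sketch of standard mixup follows. It shows only the baseline idea (blend two randomly paired examples and their labels with a Beta-distributed weight), not MAP-Mix itself, whose training-dynamics-guided selection is not described here. The function names, the alpha value, and the list-based feature representation are assumptions made for illustration.

```python
import random


def mixup(x_a, y_a, x_b, y_b, alpha=0.2, rng=random):
    """Blend two examples. Labels are one-hot (or soft) vectors."""
    # Mixing weight drawn from Beta(alpha, alpha); alpha is an assumed value.
    lam = rng.betavariate(alpha, alpha)
    x = [lam * a + (1.0 - lam) * b for a, b in zip(x_a, x_b)]
    y = [lam * a + (1.0 - lam) * b for a, b in zip(y_a, y_b)]
    return x, y, lam


def random_mixup_batch(features, labels, alpha=0.2, seed=0):
    """Pair every example with a randomly chosen partner and mix them."""
    rng = random.Random(seed)
    partners = list(range(len(features)))
    rng.shuffle(partners)  # random pairing: the "random" in random mixup
    mixed = []
    for i, j in enumerate(partners):
        x, y, _ = mixup(features[i], labels[i], features[j], labels[j], alpha, rng)
        mixed.append((x, y))
    return mixed


if __name__ == "__main__":
    # Toy features and one-hot language labels, purely illustrative.
    feats = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
    labs = [[1, 0], [0, 1], [1, 0]]
    for x, y in random_mixup_batch(feats, labs):
        print(x, y)
```

Because both labels are blended with the same weight as the features, each mixed label still sums to one and can be used directly as a soft target.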