DualTurn: Learning Turn-Taking from Dual-Channel Generative Speech Pretraining

Abstract

We address the gap in natural turn-taking between end-to-end speech-to-speech models and cascaded ASR-LLM-TTS pipelines in conversational AI. DualTurn learns turn-taking through generative pretraining on dual-channel conversational audio: the model autoregressively generates both speakers' future audio, acquiring conversational dynamics without labeled data. After fine-tuning, it predicts turn-taking signals that map to agent actions. At 0.5B parameters, DualTurn outperforms larger alternatives, achieving a weighted F1 of 0.633 on agent action prediction versus 0.389 for VAP, and an AUC of 0.930 on word-level turn prediction versus 0.880 for a 3.1B-parameter model, while anticipating turn boundaries with fewer interruptions.

Publication
Submitted to Interspeech 2026