Learning speaker representation with semi-supervised learning approach for speaker profiling

Abstract

We address speaker profiling, the estimation of speaker characteristics such as age and height, by proposing a semi-supervised framework that leverages external corpora to improve performance when labeled training data is limited. The approach combines three components: supervised learning, unsupervised speaker representation learning, and consistency training for robustness. Evaluated on the TIMIT and NISP datasets with LibriSpeech as the external corpus, the method achieves competitive results. For age estimation, it reaches an RMSE of 6.8 and 7.4 years for male and female speakers respectively, and an MAE of 4.8 years.
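For readers unfamiliar with the two error metrics quoted in the abstract, the following is a minimal sketch of how RMSE and MAE are conventionally computed for age estimation. It is not taken from the paper: the function names `rmse` and `mae` and the sample ages are hypothetical and serve only to illustrate the definitions.

```python
import math


def rmse(y_true, y_pred):
    """Root mean square error: square root of the mean squared difference."""
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    squared = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared) / len(squared))


def mae(y_true, y_pred):
    """Mean absolute error: mean of the absolute differences."""
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    absolute = [abs(t - p) for t, p in zip(y_true, y_pred)]
    return sum(absolute) / len(absolute)


# Hypothetical ages in years (illustrative only, not data from the paper).
true_ages = [25, 40, 33, 58]
pred_ages = [28, 36, 33, 53]

print(f"RMSE: {rmse(true_ages, pred_ages):.2f} years")
print(f"MAE:  {mae(true_ages, pred_ages):.2f} years")
```

Because RMSE squares each error before averaging, it penalizes large misses more heavily than MAE, so RMSE is never smaller than MAE on the same predictions. Reporting the metrics separately for male and female speakers, as the abstract does, amounts to calling these functions once per group.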

Publication
arXiv preprint arXiv:2110.13653