Self-Supervised Learning with Cluster-Aware-DINO for High-Performance Robust Speaker Verification - 2023



Research Area:  Machine Learning

Abstract:

The automatic speaker verification task has achieved great success using deep learning approaches with large-scale, manually annotated datasets. However, collecting a significant amount of well-labeled data for system building is difficult and expensive. Recently, self-supervised speaker verification has attracted considerable interest because it does not depend on labeled data. In this article, we propose a novel and advanced self-supervised learning framework, based on our prior work, which can construct a high-performance speaker verification system without using any labeled data. To avoid the impact of false negative pairs, we adopt the self-distillation with no labels (DINO) framework as the initial model, which can be trained without exploiting negative pairs. We then introduce a cluster-aware training strategy for DINO to improve the diversity of the data. In the iterative learning stage, because unsupervised clustering produces many unreliable labels, the quality of the pseudo labels is critical for system performance. This motivates us to propose dynamic loss-gate and label correction (DLG-LC) methods to alleviate the performance degradation caused by unreliable labels. Furthermore, we extend DLG-LC from single-modality to multi-modality on an audio-visual dataset to further improve performance. Experiments were conducted on the widely used VoxCeleb dataset. Compared to the best-known self-supervised speaker verification system, our proposed method achieves relative EER improvements of 22.17%, 27.94%, and 25.56% on the Vox-O, Vox-E, and Vox-H test sets, even with fewer iterations, smaller models, and simpler clustering methods. Importantly, the newly proposed self-supervised learning system achieves results comparable to a fully supervised system, without using any human-labeled data.

Keywords:  

Author(s) Name:  Bing Han, Zhengyang Chen, Yanmin Qian

Journal name:  IEEE/ACM Transactions on Audio, Speech, and Language Processing

Conference name:  

Publisher name:  ACM Digital Library

DOI:  10.1109/TASLP.2023.3331949

Volume Information:  Volume 32, Pages 529-541, (2023)