Self-supervised learning with diffusion-based multichannel speech enhancement for speaker verification under noisy conditions
Document Type
Conference Proceeding
Publication Title
Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH
Abstract
The paper introduces Diff-Filter, a multichannel speech enhancement approach based on a diffusion probabilistic model, for improving speaker verification performance under noisy and reverberant conditions. It also presents a new two-step training procedure that takes advantage of self-supervised learning. In the first stage, Diff-Filter is trained to perform time-domain speech filtering using a score-based diffusion model. In the second stage, Diff-Filter is jointly optimized with a pre-trained ECAPA-TDNN speaker verification model within a self-supervised learning framework, using a novel loss based on the equal error rate. This loss enables self-supervised learning on a dataset without speaker labels. The proposed approach is evaluated on MultiSV, a multichannel speaker verification dataset, and shows significant performance improvements under noisy multichannel conditions.
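The record does not reproduce the authors' EER-based loss. As a rough illustration only, a differentiable surrogate for the equal error rate could look like the PyTorch sketch below; the function name, the smoothing temperature tau, and the threshold grid are all hypothetical choices, not the paper's formulation.

```python
import torch

def soft_eer_loss(pos_scores, neg_scores, tau=0.05, n_thresholds=64):
    """Differentiable proxy for the equal error rate (illustrative sketch).

    pos_scores: similarity scores of same-speaker trial pairs
    neg_scores: similarity scores of different-speaker trial pairs
    Sigmoid-smoothed step functions replace the hard FAR/FRR counts,
    so the loss can back-propagate into an enhancement front-end.
    """
    all_scores = torch.cat([pos_scores, neg_scores])
    lo, hi = all_scores.min().item(), all_scores.max().item()
    thresholds = torch.linspace(lo, hi, n_thresholds, device=pos_scores.device)
    # Soft false-rejection rate: positives falling below each threshold.
    frr = torch.sigmoid((thresholds[:, None] - pos_scores[None, :]) / tau).mean(dim=1)
    # Soft false-acceptance rate: negatives rising above each threshold.
    far = torch.sigmoid((neg_scores[None, :] - thresholds[:, None]) / tau).mean(dim=1)
    # The EER lies where FAR and FRR cross; take the closest sampled threshold.
    idx = torch.argmin((far - frr).abs())
    return 0.5 * (far[idx] + frr[idx])
```

In a self-supervised setting such as the one the abstract describes, the trial pairs feeding this loss would have to come from the unlabeled data itself (e.g., segments of the same recording as positives), which is an assumption here rather than a detail given in the record.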
First Page
3849
Last Page
3853
DOI
10.21437/Interspeech.2023-1890
Publication Date
1-1-2023
Keywords
diffusion probabilistic models, multichannel speech enhancement, self-supervised learning, speaker verification
Recommended Citation
S. Dowerah et al., "Self-supervised learning with diffusion-based multichannel speech enhancement for speaker verification under noisy conditions," Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH, vol. 2023-August, pp. 3849-3853, Aug 2023.
The definitive version is available at https://doi.org/10.21437/Interspeech.2023-1890