RCT-Net: TDNN based Speaker Verification with 2D Res2Nets on Frame Level Feature Extractor
Razieh Khamsehashari; Fengying Miao; Tim Polzehl; Sebastian Möller
In: The Eighth International Conference on Advances in Signal, Image and Video Processing - SIGNAL 2023. International Conference on Advances in Signal, Image and Video Processing (SIGNAL-2023), March 13-17, Barcelona, Spain, ISBN 978-1-68558-057-5, IARIA, 2023.
In speaker verification, Time Delay Neural Networks (TDNNs) and Residual Networks (ResNets) currently achieve cutting-edge results. These architectures have very different structural characteristics, and the development of hybrid networks appears to be a promising path forward. In this study, inspired by the combination of Convolutional Neural Network (CNN) blocks and multi-scale architectures, we present a Residual-based CNN TDNN (RCT) system and evaluate the effect of integrating different residual blocks into a TDNN-based structure. We extend the state-of-the-art speaker embedding model for speaker recognition, namely Emphasized Channel Attention, Propagation, and Aggregation based CNN-TDNN (ECAPA CNN-TDNN), by gradually incorporating the proposed 2D convolutional stem with various bottleneck residual blocks. We evaluate our models on the standard VoxCeleb1-O test set to investigate how residual blocks and TDNNs perform in the speaker verification domain. The proposed models significantly outperform the state of the art, by up to 14.6% in terms of Equal Error Rate (EER).
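For readers unfamiliar with the metric, the EER is the operating point at which the false acceptance rate (impostor trials accepted) equals the false rejection rate (genuine trials rejected). The following is a minimal sketch of how it can be approximated from a list of trial scores. It is not the authors' evaluation code: the function name `equal_error_rate`, the threshold sweep, and the toy scores are illustrative assumptions.

```python
def equal_error_rate(scores, labels):
    """Approximate the EER from verification trial scores.

    scores: similarity scores, higher means "more likely the same speaker".
    labels: 1 for a target (same-speaker) trial, 0 for a non-target trial.
    Hypothetical helper for illustration; not taken from the paper.
    """
    targets = [s for s, l in zip(scores, labels) if l == 1]
    nontargets = [s for s, l in zip(scores, labels) if l == 0]
    if not targets or not nontargets:
        raise ValueError("need at least one target and one non-target trial")

    # Candidate thresholds: every observed score, plus one above the maximum
    # so that the "reject everything" operating point is also considered.
    candidates = sorted(set(scores))
    candidates.append(candidates[-1] + 1.0)

    best_gap, eer = float("inf"), None
    for threshold in candidates:
        # False acceptance rate: non-target trials scoring at or above threshold.
        far = sum(s >= threshold for s in nontargets) / len(nontargets)
        # False rejection rate: target trials scoring below threshold.
        frr = sum(s < threshold for s in targets) / len(targets)
        gap = abs(far - frr)
        if gap < best_gap:
            # Take the midpoint of FAR and FRR where they are closest.
            best_gap, eer = gap, (far + frr) / 2
    return eer


# Toy trial list (made-up scores, for illustration only).
toy_scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.2, 0.1]
toy_labels = [1, 1, 0, 1, 1, 0, 0, 0]
toy_eer = equal_error_rate(toy_scores, toy_labels)
```

On real trial lists, evaluation toolkits typically interpolate along the ROC curve rather than sweeping observed scores, so values from this sketch may differ slightly from reported figures.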