
ECAPA-TDNN : Emphasized Channel Attention, Propagation and Aggregation in TDNN based speaker verification

Brecht Desplanques (UGent), Jenthe Thienpondt (UGent), and Kris Demuynck (UGent)
(2020) Proc. Interspeech 2020, pp. 3830–3834
Abstract
Current speaker verification techniques rely on a neural network to extract speaker representations. The successful x-vector architecture is a Time Delay Neural Network (TDNN) that applies statistics pooling to project variable-length utterances into fixed-length speaker characterizing embeddings. In this paper, we propose multiple enhancements to this architecture based on recent trends in the related fields of face verification and computer vision. Firstly, the initial frame layers can be restructured into 1-dimensional Res2Net modules with impactful skip connections. Similarly to SE-ResNet, we introduce Squeeze-and-Excitation blocks in these modules to explicitly model channel interdependencies. The SE block expands the temporal context of the frame layer by rescaling the channels according to global properties of the recording. Secondly, neural networks are known to learn hierarchical features, with each layer operating on a different level of complexity. To leverage this complementary information, we aggregate and propagate features of different hierarchical levels. Finally, we improve the statistics pooling module with channel-dependent frame attention. This enables the network to focus on different subsets of frames during each of the channel’s statistics estimation. The proposed ECAPA-TDNN architecture significantly outperforms state-of-the-art TDNN based systems on the VoxCeleb test sets and the 2019 VoxCeleb Speaker Recognition Challenge.
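Two of the abstract's components translate directly into code: the Squeeze-and-Excitation (SE) block that rescales channels from a global (time-averaged) summary of the recording, and the channel-dependent attentive statistics pooling that lets each channel weight frames differently before computing its mean and standard deviation. The following is a minimal NumPy sketch of both ideas, not the authors' implementation: the function names, weight shapes, and bottleneck dimension `D` are illustrative assumptions, and the real model applies these inside 1-D convolutional Res2Net layers.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def se_block(h, w1, b1, w2, b2):
    """Squeeze-and-Excitation over a (C, T) feature map.

    Squeeze: average each channel over all T frames (global context).
    Excite:  a small bottleneck MLP produces one sigmoid gate per channel.
    Scale:   every frame of a channel is rescaled by that channel's gate.
    """
    z = h.mean(axis=1)                                        # squeeze: (C,)
    s = 1.0 / (1.0 + np.exp(-(w2 @ np.maximum(w1 @ z + b1, 0.0) + b2)))
    return s[:, None] * h                                     # rescaled (C, T)

def channel_attentive_stats_pool(h, w1, b1, w2, b2):
    """Channel-dependent attentive statistics pooling.

    h  : (C, T) frame-level features of one variable-length utterance
    w1 : (D, C), b1 : (D,)  -- shared bottleneck projection
    w2 : (C, D), b2 : (C,)  -- per-channel attention scores
    Returns a fixed-length (2C,) embedding: attention-weighted
    means and standard deviations, one weighting per channel.
    """
    e = w2 @ np.tanh(w1 @ h + b1[:, None]) + b2[:, None]      # scores (C, T)
    a = softmax(e, axis=1)                                    # weights per channel
    mu = (a * h).sum(axis=1)                                  # weighted mean (C,)
    var = (a * h ** 2).sum(axis=1) - mu ** 2
    sigma = np.sqrt(np.clip(var, 1e-8, None))                 # weighted std (C,)
    return np.concatenate([mu, sigma])
```

Because the attention weights are normalized over the time axis, utterances of any length map to the same 2C-dimensional statistics vector, which is what allows the network to project variable-length input onto a fixed-length speaker embedding.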

Downloads

  • DS370.pdf: full text (Published version), open access, PDF, 275.20 KB

Citation


MLA
Desplanques, Brecht, et al. “ECAPA-TDNN : Emphasized Channel Attention, Propagation and Aggregation in TDNN Based Speaker Verification.” Proc. Interspeech 2020, International Speech Communication Association (ISCA), 2020, pp. 3830–34, doi:10.21437/Interspeech.2020-2650.
APA
Desplanques, B., Thienpondt, J., & Demuynck, K. (2020). ECAPA-TDNN : Emphasized Channel Attention, Propagation and Aggregation in TDNN based speaker verification. In Proc. Interspeech 2020 (pp. 3830–3834). Online: International Speech Communication Association (ISCA). https://doi.org/10.21437/Interspeech.2020-2650
Chicago author-date
Desplanques, Brecht, Jenthe Thienpondt, and Kris Demuynck. 2020. “ECAPA-TDNN : Emphasized Channel Attention, Propagation and Aggregation in TDNN Based Speaker Verification.” In Proc. Interspeech 2020, 3830–34. International Speech Communication Association (ISCA). https://doi.org/10.21437/Interspeech.2020-2650.
Chicago author-date (all authors)
Desplanques, Brecht, Jenthe Thienpondt, and Kris Demuynck. 2020. “ECAPA-TDNN : Emphasized Channel Attention, Propagation and Aggregation in TDNN Based Speaker Verification.” In Proc. Interspeech 2020, 3830–3834. International Speech Communication Association (ISCA). doi:10.21437/Interspeech.2020-2650.
Vancouver
1. Desplanques B, Thienpondt J, Demuynck K. ECAPA-TDNN : Emphasized Channel Attention, Propagation and Aggregation in TDNN based speaker verification. In: Proc Interspeech 2020. International Speech Communication Association (ISCA); 2020. p. 3830–4.
IEEE
[1] B. Desplanques, J. Thienpondt, and K. Demuynck, “ECAPA-TDNN : Emphasized Channel Attention, Propagation and Aggregation in TDNN based speaker verification,” in Proc. Interspeech 2020, Online, 2020, pp. 3830–3834.
@inproceedings{8680078,
  abstract     = {{Current speaker verification techniques rely on a neural network to extract speaker representations. The successful x-vector architecture is a Time Delay Neural Network (TDNN) that applies statistics pooling to project variable-length utterances into fixed-length speaker characterizing embeddings. In this paper, we propose multiple enhancements to this architecture based on recent trends in the related fields of face verification and computer vision. Firstly, the initial frame layers can be restructured into 1-dimensional Res2Net modules with impactful skip connections. Similarly to SE-ResNet, we introduce Squeeze-and-Excitation blocks in these modules to explicitly model channel interdependencies. The SE block expands the temporal context of the frame layer by rescaling the channels according to global properties of the recording. Secondly, neural networks are known to learn hierarchical features, with each layer operating on a different level of complexity. To leverage this complementary information, we aggregate and propagate features of different hierarchical levels. Finally, we improve the statistics pooling module with channel-dependent frame attention. This enables the network to focus on different subsets of frames during each of the channel’s statistics estimation. The proposed ECAPA-TDNN architecture significantly outperforms state-of-the-art TDNN based systems on the VoxCeleb test sets and the 2019 VoxCeleb Speaker Recognition Challenge.}},
  author       = {{Desplanques, Brecht and Thienpondt, Jenthe and Demuynck, Kris}},
  booktitle    = {{Proc. Interspeech 2020}},
  issn         = {{1990-9772}},
  language     = {{eng}},
  location     = {{Online}},
  pages        = {{3830--3834}},
  publisher    = {{International Speech Communication Association (ISCA)}},
  title        = {{ECAPA-TDNN : Emphasized Channel Attention, Propagation and Aggregation in TDNN based speaker verification}},
  url          = {{http://dx.doi.org/10.21437/Interspeech.2020-2650}},
  year         = {{2020}},
}
