
Fine-tuning self-supervised models for language identification using orthonormal constraint
- Authors
- Amrutha Prasad, Andrés Carofilis, Geoffroy Vanderreydt (UGent), Driss Khalil, Srikanth Madikeri, Petr Motlicek and Christof Schuepbach
- Abstract
- Self-supervised models trained with high linguistic diversity, such as the XLS-R model, can be effectively fine-tuned for the language recognition task. Typically, a back-end classifier followed by a statistics pooling layer is added during training. Commonly used back-end classifiers require a large number of trainable parameters, which is not ideal in limited-data conditions. In this work, we explore back-ends with fewer parameters based on the factorized Time Delay Neural Network (TDNN-F). The TDNN-F architecture is also integrated into Emphasized Channel Attention, Propagation and Aggregation TDNN (ECAPA-TDNN) models, termed ECAPA-TDNN-F, reducing the number of parameters by 30 to 50% absolute, with competitive accuracies and no change in minimum cost. The results show that ECAPA-TDNN-F can be extended to tasks where ECAPA-TDNN is suitable. We also test the effectiveness of a linear classifier and a variant, the orthonormal linear classifier, previously used in x-vector-type systems. The models are trained with NIST LRE17 data and evaluated on the NIST LRE17, LRE22 and ATCO2 LID datasets. Both linear classifiers outperform conventional back-ends, with improvements in accuracy between 0.9% and 9.1%. (A minimal sketch of these components appears after the keywords below.)
- Keywords
- Language Identification, Transformers, Wav2Vec2, fine-tuning, low-resource, out-of-domain
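
For illustration only, the following minimal PyTorch sketch (not the authors' released implementation) combines three ideas the abstract names: statistics pooling over frame-level self-supervised features, a TDNN-F-style factorized layer, and a linear classifier whose weight matrix is regularized toward orthonormality via the Frobenius penalty ||W Wᵀ − I||²_F that underlies semi-orthogonal TDNN-F factors. All dimensions, the class count, and the penalty weight are assumptions made for the example.

```python
# Illustrative sketch only -- not the authors' code. Dimensions, the
# number of languages, and the penalty weight are placeholder assumptions.
import torch
import torch.nn as nn


def orthonormal_penalty(weight: torch.Tensor) -> torch.Tensor:
    """Frobenius penalty ||W W^T - I||_F^2 pushing the rows of W toward an
    orthonormal set (semi-orthogonal when W has fewer rows than columns).
    Kaldi-style TDNN-F imposes the same objective via periodic weight
    updates rather than a loss term."""
    gram = weight @ weight.t()
    eye = torch.eye(weight.size(0), device=weight.device, dtype=weight.dtype)
    return ((gram - eye) ** 2).sum()


class StatsPooling(nn.Module):
    """Mean + standard deviation over time: (batch, T, D) -> (batch, 2D)."""
    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return torch.cat([x.mean(dim=1), x.std(dim=1)], dim=1)


class FactorizedLinear(nn.Module):
    """TDNN-F-style factorization: one large transform is replaced by two
    smaller factors through a linear bottleneck, with the first factor
    kept semi-orthogonal. This is what shrinks the back-end size."""
    def __init__(self, in_dim: int, bottleneck: int, out_dim: int):
        super().__init__()
        self.factor1 = nn.Linear(in_dim, bottleneck, bias=False)  # constrained
        self.factor2 = nn.Linear(bottleneck, out_dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.factor2(self.factor1(x))

    def penalty(self) -> torch.Tensor:
        return orthonormal_penalty(self.factor1.weight)


class OrthonormalLinearLID(nn.Module):
    """Statistics pooling over frame-level SSL features followed by a
    single linear classifier whose weight carries the orthonormal
    penalty during training."""
    def __init__(self, feat_dim: int, num_langs: int):
        super().__init__()
        self.pool = StatsPooling()
        self.classifier = nn.Linear(2 * feat_dim, num_langs)

    def forward(self, frames: torch.Tensor) -> torch.Tensor:
        return self.classifier(self.pool(frames))

    def penalty(self) -> torch.Tensor:
        return orthonormal_penalty(self.classifier.weight)


# Usage sketch. feat_dim=1024 assumes the XLS-R (300M) hidden size; the
# language count 14 and penalty weight 0.01 are illustrative, not taken
# from the paper.
model = OrthonormalLinearLID(feat_dim=1024, num_langs=14)
frames = torch.randn(8, 200, 1024)                 # (batch, frames, features)
labels = torch.randint(0, 14, (8,))
loss = nn.functional.cross_entropy(model(frames), labels)
loss = loss + 0.01 * model.penalty()               # orthonormal constraint
loss.backward()
```

Note the trade-off the factorization makes: a single in_dim × out_dim weight becomes in_dim × b + b × out_dim parameters, so a small bottleneck b cuts the back-end parameter count substantially, consistent with the kind of reduction the abstract reports.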
Downloads
- (...).pdf: full text (Published version) | UGent only | 963.32 KB
Citation
Please use this URL to cite or link to this publication: http://hdl.handle.net/1854/LU-01JJ1GPXKPYHKYG5927Q37ESWT
- MLA
- Prasad, Amrutha, et al. “Fine-Tuning Self-Supervised Models for Language Identification Using Orthonormal Constraint.” 2024 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP 2024), IEEE, 2024, pp. 11921–25, doi:10.1109/icassp48485.2024.10446751.
- APA
- Prasad, A., Carofilis, A., Vanderreydt, G., Khalil, D., Madikeri, S., Motlicek, P., & Schuepbach, C. (2024). Fine-tuning self-supervised models for language identification using orthonormal constraint. 2024 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP 2024), 11921–11925. https://doi.org/10.1109/icassp48485.2024.10446751
- Chicago author-date
- Prasad, Amrutha, Andrés Carofilis, Geoffroy Vanderreydt, Driss Khalil, Srikanth Madikeri, Petr Motlicek, and Christof Schuepbach. 2024. “Fine-Tuning Self-Supervised Models for Language Identification Using Orthonormal Constraint.” In 2024 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP 2024), 11921–25. IEEE. https://doi.org/10.1109/icassp48485.2024.10446751.
- Chicago author-date (all authors)
- Prasad, Amrutha, Andrés Carofilis, Geoffroy Vanderreydt, Driss Khalil, Srikanth Madikeri, Petr Motlicek, and Christof Schuepbach. 2024. “Fine-Tuning Self-Supervised Models for Language Identification Using Orthonormal Constraint.” In 2024 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP 2024), 11921–11925. IEEE. doi:10.1109/icassp48485.2024.10446751.
- Vancouver
- 1. Prasad A, Carofilis A, Vanderreydt G, Khalil D, Madikeri S, Motlicek P, et al. Fine-tuning self-supervised models for language identification using orthonormal constraint. In: 2024 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP 2024). IEEE; 2024. p. 11921–5.
- IEEE
- [1] A. Prasad et al., “Fine-tuning self-supervised models for language identification using orthonormal constraint,” in 2024 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP 2024), Seoul, Republic of Korea, 2024, pp. 11921–11925.
@inproceedings{01JJ1GPXKPYHKYG5927Q37ESWT,
  abstract  = {{Self-supervised models trained with high linguistic diversity, such as the XLS-R model, can be effectively fine-tuned for the language recognition task. Typically, a back-end classifier followed by a statistics pooling layer is added during training. Commonly used back-end classifiers require a large number of trainable parameters, which is not ideal in limited-data conditions. In this work, we explore back-ends with fewer parameters based on the factorized Time Delay Neural Network (TDNN-F). The TDNN-F architecture is also integrated into Emphasized Channel Attention, Propagation and Aggregation TDNN (ECAPA-TDNN) models, termed ECAPA-TDNN-F, reducing the number of parameters by 30 to 50% absolute, with competitive accuracies and no change in minimum cost. The results show that ECAPA-TDNN-F can be extended to tasks where ECAPA-TDNN is suitable. We also test the effectiveness of a linear classifier and a variant, the orthonormal linear classifier, previously used in x-vector-type systems. The models are trained with NIST LRE17 data and evaluated on the NIST LRE17, LRE22 and ATCO2 LID datasets. Both linear classifiers outperform conventional back-ends, with improvements in accuracy between 0.9% and 9.1%.}},
  author    = {{Prasad, Amrutha and Carofilis, Andrés and Vanderreydt, Geoffroy and Khalil, Driss and Madikeri, Srikanth and Motlicek, Petr and Schuepbach, Christof}},
  booktitle = {{2024 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP 2024)}},
  isbn      = {{9798350344868}},
  issn      = {{1520-6149}},
  keywords  = {{Language Identification, Transformers, Wav2Vec2, fine-tuning, low-resource, out-of-domain}},
  language  = {{eng}},
  location  = {{Seoul, Republic of Korea}},
  pages     = {{11921--11925}},
  publisher = {{IEEE}},
  title     = {{Fine-tuning self-supervised models for language identification using orthonormal constraint}},
  url       = {{http://doi.org/10.1109/icassp48485.2024.10446751}},
  year      = {{2024}},
}