
SoftVAD in iVector-based acoustic scene classification for robustness to foreground speech
- Author
- Siyuan Song (UGent) , Brecht Desplanques, Kris Demuynck (UGent) and Nilesh Madhu (UGent)
- Abstract
- To increase the robustness of Acoustic Scene Classification (ASC) in the presence of foreground speech, we recently proposed a noise-floor based iVector framework that exploits a statistical estimate of the background signal spectrum. This greatly improved ASC accuracy when foreground speech was predominant, at the cost of poorer performance in scenarios with low foreground speech levels. A soft Voice Activity Detector (softVAD) is introduced here to improve this trade-off. Three possibilities are investigated: (a) a segment-wise, weighted score fusion system, yielding a softVAD-based weighted average of the output scores of the (classical) iVector framework and those of the noise-floor based iVector framework; (b) the introduction of weighted Baum-Welch statistics in the iVector extraction stage, with weights that emphasize the background-dominant frames and disregard speech-dominant frames in the test sequence. Based on the performance of these alternatives, a third approach, (c), is proposed that performs segment-level score fusion of the frame-wise weighted statistics system of approach (b) and the noise-floor system. Experiments conclusively demonstrate that all proposals significantly improve the classification accuracy. The last approach, in particular, outperforms all other methods in a wide range of experimental conditions.
- Keywords
- foreground speech robustness, noise-floor estimation, softVAD, iVector, Acoustic scene classification
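The two building blocks named in the abstract, score fusion (a) and weighted Baum-Welch statistics (b), can be sketched roughly as follows. This is an illustrative sketch only, not the authors' implementation: the function names, the use of a single segment-level speech weight in [0, 1], and the frame weight `1 - speech_prob` are all assumptions made here for the example. The abstract does not specify how the softVAD output is mapped to weights.

```python
from typing import List, Sequence, Tuple


def fuse_scores(classical: Sequence[float],
                noise_floor: Sequence[float],
                speech_weight: float) -> List[float]:
    """Approach (a), sketched: softVAD-weighted average of two score vectors.

    speech_weight is a hypothetical segment-level speech presence estimate
    in [0, 1]; more speech shifts trust towards the noise-floor system.
    """
    if not 0.0 <= speech_weight <= 1.0:
        raise ValueError("speech_weight must lie in [0, 1]")
    if len(classical) != len(noise_floor):
        raise ValueError("score vectors must have equal length")
    return [(1.0 - speech_weight) * c + speech_weight * n
            for c, n in zip(classical, noise_floor)]


def weighted_baum_welch_stats(
        frames: Sequence[Sequence[float]],
        posteriors: Sequence[Sequence[float]],
        speech_probs: Sequence[float],
) -> Tuple[List[float], List[List[float]]]:
    """Approach (b), sketched: zeroth/first-order statistics with frame weights.

    frames:       T x D feature vectors
    posteriors:   T x C mixture component posteriors per frame
    speech_probs: T softVAD outputs in [0, 1]
    The weight 1 - speech_prob is an assumed mapping that emphasizes
    background-dominant frames and suppresses speech-dominant ones.
    """
    n_comp, dim = len(posteriors[0]), len(frames[0])
    zeroth = [0.0] * n_comp
    first = [[0.0] * dim for _ in range(n_comp)]
    for x, gamma, p in zip(frames, posteriors, speech_probs):
        w = 1.0 - p  # assumed frame weight
        for c in range(n_comp):
            g = w * gamma[c]
            zeroth[c] += g
            for d in range(dim):
                first[c][d] += g * x[d]
    return zeroth, first
```

Under these assumptions, approach (c) would correspond to calling `fuse_scores` with the scores of a system built on `weighted_baum_welch_stats` in place of the classical iVector scores.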
Downloads
(...).pdf
- full text (Published version)
- UGent only
- 237.12 KB
DS554 acc.pdf
- full text (Accepted manuscript)
- open access
- 229.70 KB
Citation
Please use this url to cite or link to this publication: http://hdl.handle.net/1854/LU-8767412
- MLA
- Song, Siyuan, et al. “SoftVAD in iVector-Based Acoustic Scene Classification for Robustness to Foreground Speech.” 2022 30th European Signal Processing Conference (EUSIPCO), IEEE, 2022, pp. 404–08.
- APA
- Song, S., Desplanques, B., Demuynck, K., & Madhu, N. (2022). SoftVAD in iVector-based acoustic scene classification for robustness to foreground speech. 2022 30th European Signal Processing Conference (EUSIPCO), 404–408. IEEE.
- Chicago author-date
- Song, Siyuan, Brecht Desplanques, Kris Demuynck, and Nilesh Madhu. 2022. “SoftVAD in iVector-Based Acoustic Scene Classification for Robustness to Foreground Speech.” In 2022 30th European Signal Processing Conference (EUSIPCO), 404–8. IEEE.
- Chicago author-date (all authors)
- Song, Siyuan, Brecht Desplanques, Kris Demuynck, and Nilesh Madhu. 2022. “SoftVAD in iVector-Based Acoustic Scene Classification for Robustness to Foreground Speech.” In 2022 30th European Signal Processing Conference (EUSIPCO), 404–408. IEEE.
- Vancouver
- 1. Song S, Desplanques B, Demuynck K, Madhu N. SoftVAD in iVector-based acoustic scene classification for robustness to foreground speech. In: 2022 30th European Signal Processing Conference (EUSIPCO). IEEE; 2022. p. 404–8.
- IEEE
- [1] S. Song, B. Desplanques, K. Demuynck, and N. Madhu, “SoftVAD in iVector-based acoustic scene classification for robustness to foreground speech,” in 2022 30th European Signal Processing Conference (EUSIPCO), Belgrade, Serbia, 2022, pp. 404–408.
@inproceedings{8767412,
  abstract     = {{To increase the robustness of Acoustic Scene Classification (ASC) in the presence of foreground speech, we recently proposed a noise-floor based iVector framework that exploits a statistical estimate of the background signal spectrum. This greatly improved ASC accuracy when foreground speech was predominant, at the cost of poorer performance in scenarios with low foreground speech levels. A soft Voice Activity Detector (softVAD) is introduced here to improve this trade-off. Three possibilities are investigated: (a) a segment-wise, weighted score fusion system, yielding a softVAD-based weighted average of the output scores of the (classical) iVector framework and those of the noise-floor based iVector framework; (b) the introduction of weighted Baum-Welch statistics in the iVector extraction stage, with weights that emphasize the background-dominant frames and disregard speech-dominant frames in the test sequence. Based on the performance of these alternatives, a third approach, (c), is proposed that performs segment-level score fusion of the frame-wise weighted statistics system of approach (b) and the noise-floor system. Experiments conclusively demonstrate that all proposals significantly improve the classification accuracy. The last approach, in particular, outperforms all other methods in a wide range of experimental conditions.}},
  author       = {{Song, Siyuan and Desplanques, Brecht and Demuynck, Kris and Madhu, Nilesh}},
  booktitle    = {{2022 30th European Signal Processing Conference (EUSIPCO)}},
  isbn         = {{9781665467988}},
  issn         = {{2076-1465}},
  keywords     = {{foreground speech robustness,noise-floor estimation,softVAD,iVector,Acoustic scene classification}},
  language     = {{eng}},
  location     = {{Belgrade, Serbia}},
  pages        = {{404--408}},
  publisher    = {{IEEE}},
  title        = {{SoftVAD in iVector-based acoustic scene classification for robustness to foreground speech}},
  year         = {{2022}},
}