Rule-embedded network for audio-visual voice activity detection in live musical video streams
- Author
- Yuanbo Hou (UGent), Yi Deng, Bilei Zhu, Zejun Ma and Dick Botteldooren (UGent)
- Abstract
- Detecting the anchor’s voice in live musical streams is an important preprocessing step for music and speech signal processing. Existing approaches to voice activity detection (VAD) rely primarily on audio; however, audio-based VAD struggles to focus on the target voice in noisy environments. This paper proposes a rule-embedded network that fuses audio-visual (A-V) inputs for better detection of the target voice. The core role of the rule in the model is to coordinate the relation between the bi-modal information and to use visual representations as a mask that filters out information from non-target sounds. Experiments show that: 1) with the help of cross-modal fusion using the proposed rule, the detection results of the A-V branch outperform those of the audio branch in the same model framework; 2) the bimodal A-V model far outperforms audio-only models, indicating that incorporating both audio and visual signals is highly beneficial for VAD. To attract more attention to cross-modal music and audio signal processing, a new live musical video corpus with frame-level labels is introduced.
- Keywords
- Audio-visual voice detection, rule embedding, cross-modal learning, multi-modal fusion
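The abstract describes using visual representations as a mask to filter out non-target sound. The sketch below illustrates only that general idea on frame-level scores; it is not the paper's network or its rule. The function name `mask_audio_with_visual`, the multiplicative gating, and the 0.5 threshold are all assumptions made for illustration.

```python
def mask_audio_with_visual(audio_prob, visual_prob, threshold=0.5):
    """Gate per-frame audio voice scores with per-frame visual scores.

    Hypothetical helper, not taken from the paper:
    audio_prob  -- per-frame probability that some voice is audible
    visual_prob -- per-frame probability that the anchor is visibly vocalizing
    threshold   -- assumed decision threshold on the fused score
    """
    if len(audio_prob) != len(visual_prob):
        raise ValueError("audio and visual streams must have the same number of frames")
    # Element-wise product: a frame stays high only if both modalities agree.
    fused = [a * v for a, v in zip(audio_prob, visual_prob)]
    # Frame-level decision on the fused score.
    active = [p >= threshold for p in fused]
    return fused, active


# Frame 2 has a strong audio score but a weak visual score, as when a
# background voice is audible while the anchor is silent.
audio = [0.9, 0.8, 0.7, 0.2]
visual = [1.0, 0.1, 0.9, 0.9]
fused, active = mask_audio_with_visual(audio, visual)
print(active)
```

In this toy input the second frame is suppressed by the visual mask even though its audio score is high, which is the filtering behavior the abstract attributes to the A-V branch.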
Downloads
- ACOUST 580a.pdf: full text (Accepted manuscript), open access, 884.84 KB
- (...).pdf: full text (Published version), UGent only, 3.53 MB
Citation
Please use this url to cite or link to this publication: http://hdl.handle.net/1854/LU-8708156
- MLA
- Hou, Yuanbo, et al. “Rule-Embedded Network for Audio-Visual Voice Activity Detection in Live Musical Video Streams.” ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), IEEE, 2021, pp. 4165–69, doi:10.1109/icassp39728.2021.9413418.
- APA
- Hou, Y., Deng, Y., Zhu, B., Ma, Z., & Botteldooren, D. (2021). Rule-embedded network for audio-visual voice activity detection in live musical video streams. ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 4165–4169. https://doi.org/10.1109/icassp39728.2021.9413418
- Chicago author-date
- Hou, Yuanbo, Yi Deng, Bilei Zhu, Zejun Ma, and Dick Botteldooren. 2021. “Rule-Embedded Network for Audio-Visual Voice Activity Detection in Live Musical Video Streams.” In ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 4165–69. IEEE. https://doi.org/10.1109/icassp39728.2021.9413418.
- Chicago author-date (all authors)
- Hou, Yuanbo, Yi Deng, Bilei Zhu, Zejun Ma, and Dick Botteldooren. 2021. “Rule-Embedded Network for Audio-Visual Voice Activity Detection in Live Musical Video Streams.” In ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 4165–4169. IEEE. doi:10.1109/icassp39728.2021.9413418.
- Vancouver
- 1. Hou Y, Deng Y, Zhu B, Ma Z, Botteldooren D. Rule-embedded network for audio-visual voice activity detection in live musical video streams. In: ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE; 2021. p. 4165–9.
- IEEE
- [1] Y. Hou, Y. Deng, B. Zhu, Z. Ma, and D. Botteldooren, “Rule-embedded network for audio-visual voice activity detection in live musical video streams,” in ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Toronto, ON, Canada, 2021, pp. 4165–4169.
@inproceedings{8708156,
  abstract     = {{Detecting the anchor’s voice in live musical streams is an important preprocessing step for music and speech signal processing. Existing approaches to voice activity detection (VAD) rely primarily on audio; however, audio-based VAD struggles to focus on the target voice in noisy environments. This paper proposes a rule-embedded network that fuses audio-visual (A-V) inputs for better detection of the target voice. The core role of the rule in the model is to coordinate the relation between the bi-modal information and to use visual representations as a mask that filters out information from non-target sounds. Experiments show that: 1) with the help of cross-modal fusion using the proposed rule, the detection results of the A-V branch outperform those of the audio branch in the same model framework; 2) the bimodal A-V model far outperforms audio-only models, indicating that incorporating both audio and visual signals is highly beneficial for VAD. To attract more attention to cross-modal music and audio signal processing, a new live musical video corpus with frame-level labels is introduced.}},
  author       = {{Hou, Yuanbo and Deng, Yi and Zhu, Bilei and Ma, Zejun and Botteldooren, Dick}},
  booktitle    = {{ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)}},
  isbn         = {{9781728176055}},
  issn         = {{2379-190X}},
  keywords     = {{Audio-visual voice detection,rule embedding,cross-modal learning,multi-modal fusion}},
  language     = {{eng}},
  location     = {{Toronto, ON, Canada}},
  pages        = {{4165--4169}},
  publisher    = {{IEEE}},
  title        = {{Rule-embedded network for audio-visual voice activity detection in live musical video streams}},
  url          = {{http://doi.org/10.1109/icassp39728.2021.9413418}},
  year         = {{2021}},
}