CT-SAT: contextual transformer for sequential audio tagging

Yuanbo Hou (UGent), Zhaoyi Liu, Bo Kang (UGent), Yun Wang and Dick Botteldooren (UGent)
(2022) pp. 4147–4151
Abstract
Sequential audio event tagging provides not only the types of the audio events in a clip, but also the order in which they occur and the number of events. Most previous work on audio event sequence analysis relies on connectionist temporal classification (CTC). However, CTC's conditional independence assumption prevents it from effectively learning correlations between diverse audio events. This paper first introduces the Transformer into sequential audio tagging, since Transformers perform well in sequence-related tasks. To better utilize the contextual information of audio event sequences, we draw on the idea of bidirectional recurrent neural networks and propose a contextual Transformer (cTransformer) with a bidirectional decoder that can exploit both the forward and backward information of event sequences. Experiments on a real-life polyphonic audio dataset show that, compared to CTC-based methods, the cTransformer can effectively combine fine-grained acoustic representations from the encoder with coarse-grained audio event cues, exploiting contextual information to successfully recognize and predict audio event sequences in polyphonic audio clips.
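The abstract specifies the architecture only at a high level: a Transformer encoder over acoustic frames, plus a decoder that reads the event sequence in both directions. As a rough illustration, here is a minimal PyTorch sketch of that idea; it is reconstructed from the abstract alone, not from the authors' code, and the class name, layer sizes, and the use of two separate decoders are assumptions.

import torch
import torch.nn as nn

class ContextualTransformerSketch(nn.Module):
    # Encoder over audio frames; forward and backward decoders over the
    # event sequence, following the bidirectional-decoder idea above.
    def __init__(self, n_events, d_model=256, n_heads=4, n_layers=3):
        super().__init__()
        self.embed = nn.Embedding(n_events, d_model)
        self.encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True), n_layers)
        self.fwd_decoder = nn.TransformerDecoder(
            nn.TransformerDecoderLayer(d_model, n_heads, batch_first=True), n_layers)
        self.bwd_decoder = nn.TransformerDecoder(
            nn.TransformerDecoderLayer(d_model, n_heads, batch_first=True), n_layers)
        self.out = nn.Linear(d_model, n_events)

    def forward(self, audio_feats, events):
        # audio_feats: (batch, frames, d_model) acoustic features (hypothetical
        # front end, e.g. log-mel frames already projected to d_model).
        # events: (batch, seq) event-token indices for teacher forcing.
        memory = self.encoder(audio_feats)  # fine-grained acoustic representations
        mask = nn.Transformer.generate_square_subsequent_mask(
            events.size(1)).to(events.device)
        # Left-to-right decoding over the event sequence...
        fwd = self.fwd_decoder(self.embed(events), memory, tgt_mask=mask)
        # ...and right-to-left decoding over the flipped sequence.
        bwd = self.bwd_decoder(self.embed(events.flip(1)), memory, tgt_mask=mask)
        # Flip the backward logits so both heads align position-wise; a
        # consistency loss between them could tie the two directions together.
        return self.out(fwd), self.out(bwd).flip(1)

model = ContextualTransformerSketch(n_events=10)
feats = torch.randn(2, 500, 256)               # two clips, 500 frames each
events = torch.randint(0, 10, (2, 7))          # event sequences of length 7
fwd_logits, bwd_logits = model(feats, events)  # each: (2, 7, 10)

A real system would also need positional encodings, start/end tokens, and a front end mapping spectrograms to d_model-dimensional frames; those details are omitted here.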
Keywords
Audio tagging, sequential audio tagging, connectionist temporal classification, contextual Transformer, neural networks

Downloads

  • ACOUST 618.pdf: full text (Published version) | open access | PDF | 1.66 MB

Citation

Please use this URL to cite or link to this publication:

MLA
Hou, Yuanbo, et al. CT-SAT: Contextual Transformer for Sequential Audio Tagging. International Speech Communication Association (ISCA), 2022, pp. 4147–51, doi:10.21437/Interspeech.2022-196.
APA
Hou, Y., Liu, Z., Kang, B., Wang, Y., & Botteldooren, D. (2022). CT-SAT: Contextual transformer for sequential audio tagging. Proceedings of Interspeech 2022, 4147–4151. https://doi.org/10.21437/Interspeech.2022-196
Chicago author-date
Hou, Yuanbo, Zhaoyi Liu, Bo Kang, Yun Wang, and Dick Botteldooren. 2022. “CT-SAT: Contextual Transformer for Sequential Audio Tagging.” In Proceedings of Interspeech 2022, 4147–51. International Speech Communication Association (ISCA). https://doi.org/10.21437/Interspeech.2022-196.
Vancouver
1. Hou Y, Liu Z, Kang B, Wang Y, Botteldooren D. CT-SAT: contextual transformer for sequential audio tagging. In: Proceedings of Interspeech 2022. International Speech Communication Association (ISCA); 2022. p. 4147–51.
IEEE
[1] Y. Hou, Z. Liu, B. Kang, Y. Wang, and D. Botteldooren, “CT-SAT: contextual transformer for sequential audio tagging,” presented at the 23rd Annual Conference of the International Speech Communication Association (INTERSPEECH 2022), Incheon, Korea, 2022, pp. 4147–4151.
@inproceedings{01GMBA3JPV0KX0YXEGDZVFSSMN,
  abstract     = {{Sequential audio event tagging provides not only the types of the audio events in a clip, but also the order in which they occur and the number of events. Most previous work on audio event sequence analysis relies on connectionist temporal classification (CTC). However, CTC's conditional independence assumption prevents it from effectively learning correlations between diverse audio events. This paper first introduces the Transformer into sequential audio tagging, since Transformers perform well in sequence-related tasks. To better utilize the contextual information of audio event sequences, we draw on the idea of bidirectional recurrent neural networks and propose a contextual Transformer (cTransformer) with a bidirectional decoder that can exploit both the forward and backward information of event sequences. Experiments on a real-life polyphonic audio dataset show that, compared to CTC-based methods, the cTransformer can effectively combine fine-grained acoustic representations from the encoder with coarse-grained audio event cues, exploiting contextual information to successfully recognize and predict audio event sequences in polyphonic audio clips.}},
  author       = {{Hou, Yuanbo and Liu, Zhaoyi and Kang, Bo and Wang, Yun and Botteldooren, Dick}},
  booktitle    = {{Proceedings of Interspeech 2022}},
  issn         = {{2308-457X}},
  keywords     = {{Audio tagging,sequential audio tagging,connectionist temporal classification,contextual Transformer,neural networks}},
  language     = {{eng}},
  location     = {{Incheon, Korea}},
  pages        = {{4147--4151}},
  publisher    = {{International Speech Communication Association (ISCA)}},
  title        = {{CT-SAT: contextual transformer for sequential audio tagging}},
  url          = {{https://doi.org/10.21437/Interspeech.2022-196}},
  year         = {{2022}},
}
