
Cross-modality attention and multimodal fusion transformer for pedestrian detection

Wei-Yu Lee (UGent) , Ljubomir Jovanov (UGent) and Wilfried Philips (UGent)
Abstract
Pedestrian detection is an important challenge in computer vision due to its various applications. To achieve more accurate results, thermal images have been widely exploited as complementary information to assist conventional RGB-based detection. Although existing methods have developed numerous fusion strategies to utilize the complementary features, research that focuses on exploring features exclusive to each modality is limited. On this account, the features specific to one modality cannot be fully utilized and the fusion results could be easily dominated by the other modality, which limits the upper bound of discrimination ability. Hence, we propose the Cross-modality Attention Transformer (CAT) to explore the potential of modality-specific features. Further, we introduce the Multimodal Fusion Transformer (MFT) to identify the correlations between the modality data and perform feature fusion. In addition, a content-aware objective function is proposed to learn better feature representations. The experiments show that our method can achieve state-of-the-art detection performance on public datasets. The ablation studies also show the effectiveness of the proposed components.
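The abstract describes queries from one modality attending over features of the other (the core of cross-modality attention) before fusion. The paper's actual CAT/MFT architectures are not reproduced here; the following is only a minimal NumPy sketch of that general idea, with illustrative token counts and feature dimensions chosen arbitrarily, not taken from the paper.

```python
import numpy as np

def softmax(x, axis=-1):
    # numerically stable softmax over the given axis
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_modal_attention(queries, keys_values, d_k):
    # queries from one modality attend over the tokens of the other;
    # scaled dot-product attention without learned projections, for brevity
    scores = queries @ keys_values.T / np.sqrt(d_k)
    return softmax(scores, axis=-1) @ keys_values

rng = np.random.default_rng(0)
d = 16
rgb = rng.standard_normal((8, d))      # 8 RGB feature tokens (illustrative)
thermal = rng.standard_normal((8, d))  # 8 thermal feature tokens (illustrative)

# each modality is enhanced by attending to the other, with a residual add
rgb_enh = rgb + cross_modal_attention(rgb, thermal, d)
thermal_enh = thermal + cross_modal_attention(thermal, rgb, d)

# a simple stand-in for the fusion step: concatenate the enhanced features
fused = np.concatenate([rgb_enh, thermal_enh], axis=-1)
print(fused.shape)  # (8, 32)
```

In the paper, the attention and fusion stages are learned transformer blocks (CAT and MFT) rather than the raw dot-product shown here; the sketch only illustrates how one modality's features can query the other's so that neither modality simply dominates the fused representation.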
Keywords
Cross-Modality fusion, Multimodal pedestrian detection, Transformer

Downloads

  • 006.pdf — full text (Accepted manuscript) | open access | PDF | 1.31 MB
  • 006-supp.pdf — supplementary material | open access | PDF | 193.77 KB

Citation

Please use this URL to cite or link to this publication:

MLA
Lee, Wei-Yu, et al. “Cross-Modality Attention and Multimodal Fusion Transformer for Pedestrian Detection.” Computer Vision : ECCV 2022 Workshops, Proceedings, Part V, edited by Leonid Karlinsky et al., vol. 13805, Springer, 2023, pp. 608–23, doi:10.1007/978-3-031-25072-9_41.
APA
Lee, W.-Y., Jovanov, L., & Philips, W. (2023). Cross-modality attention and multimodal fusion transformer for pedestrian detection. In L. Karlinsky, T. Michaeli, & K. Nishino (Eds.), Computer Vision : ECCV 2022 Workshops, Proceedings, Part V (Vol. 13805, pp. 608–623). https://doi.org/10.1007/978-3-031-25072-9_41
Chicago author-date
Lee, Wei-Yu, Ljubomir Jovanov, and Wilfried Philips. 2023. “Cross-Modality Attention and Multimodal Fusion Transformer for Pedestrian Detection.” In Computer Vision : ECCV 2022 Workshops, Proceedings, Part V, edited by Leonid Karlinsky, Tomer Michaeli, and Ko Nishino, 13805:608–23. Springer. https://doi.org/10.1007/978-3-031-25072-9_41.
Chicago author-date (all authors)
Lee, Wei-Yu, Ljubomir Jovanov, and Wilfried Philips. 2023. “Cross-Modality Attention and Multimodal Fusion Transformer for Pedestrian Detection.” In Computer Vision : ECCV 2022 Workshops, Proceedings, Part V, edited by Leonid Karlinsky, Tomer Michaeli, and Ko Nishino, 13805:608–623. Springer. https://doi.org/10.1007/978-3-031-25072-9_41.
Vancouver
1. Lee W-Y, Jovanov L, Philips W. Cross-modality attention and multimodal fusion transformer for pedestrian detection. In: Karlinsky L, Michaeli T, Nishino K, editors. Computer Vision : ECCV 2022 Workshops, Proceedings, Part V. Springer; 2023. p. 608–23.
IEEE
[1] W.-Y. Lee, L. Jovanov, and W. Philips, “Cross-modality attention and multimodal fusion transformer for pedestrian detection,” in Computer Vision : ECCV 2022 Workshops, Proceedings, Part V, Tel Aviv, Israel, 2023, vol. 13805, pp. 608–623.
@inproceedings{8766409,
  abstract     = {{Pedestrian detection is an important challenge in computer vision due to its various applications. To achieve more accurate results, thermal images have been widely exploited as complementary information to assist conventional RGB-based detection. Although existing methods have developed numerous fusion strategies to utilize the complementary features, research that focuses on exploring features exclusive to each modality is limited. On this account, the features specific to one modality cannot be fully utilized and the fusion results could be easily dominated by the other modality, which limits the upper bound of discrimination ability. Hence, we propose the Cross-modality Attention Transformer (CAT) to explore the potential of modality-specific features. Further, we introduce the Multimodal Fusion Transformer (MFT) to identify the correlations between the modality data and perform feature fusion. In addition, a content-aware objective function is proposed to learn better feature representations. The experiments show that our method can achieve state-of-the-art detection performance on public datasets. The ablation studies also show the effectiveness of the proposed components.}},
  author       = {{Lee, Wei-Yu and Jovanov, Ljubomir and Philips, Wilfried}},
  booktitle    = {{Computer Vision : ECCV 2022 Workshops, Proceedings, Part V}},
  editor       = {{Karlinsky, Leonid and Michaeli, Tomer and Nishino, Ko}},
  isbn         = {{9783031250712}},
  issn         = {{0302-9743}},
  keywords     = {{Cross-Modality fusion,Multimodal pedestrian detection,Transformer}},
  language     = {{eng}},
  location     = {{Tel Aviv, Israel}},
  pages        = {{608--623}},
  publisher    = {{Springer}},
  title        = {{Cross-modality attention and multimodal fusion transformer for pedestrian detection}},
  url          = {{https://doi.org/10.1007/978-3-031-25072-9_41}},
  volume       = {{13805}},
  year         = {{2023}},
}
