
Cross-modality attention and multimodal fusion transformer for pedestrian detection (2023)
Computer Vision : ECCV 2022 Workshops, Proceedings, Part V.
In Lecture Notes in Computer Science, vol. 13805, pp. 608–623.
- Author
- Wei-Yu Lee (UGent) , Ljubomir Jovanov (UGent) and Wilfried Philips (UGent)
- Abstract
- Pedestrian detection is an important challenge in computer vision due to its various applications. To achieve more accurate results, thermal images have been widely exploited as complementary information to assist conventional RGB-based detection. Although existing methods have developed numerous fusion strategies to utilize the complementary features, research that focuses on exploring features exclusive to each modality is limited. On this account, the features specific to one modality cannot be fully utilized and the fusion results could be easily dominated by the other modality, which limits the upper bound of discrimination ability. Hence, we propose the Cross-modality Attention Transformer (CAT) to explore the potential of modality-specific features. Further, we introduce the Multimodal Fusion Transformer (MFT) to identify the correlations between the modality data and perform feature fusion. In addition, a content-aware objective function is proposed to learn better feature representations. The experiments show that our method can achieve state-of-the-art detection performance on public datasets. The ablation studies also show the effectiveness of the proposed components.
- Keywords
- Cross-Modality fusion, Multimodal pedestrian detection, Transformer
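The abstract describes cross-modality attention in which RGB and thermal features attend to one another before fusion. As a generic illustration of that idea only (not the paper's actual CAT/MFT architecture; all names, dimensions, and the concatenation-based fusion here are illustrative assumptions), a single-head cross-attention step in NumPy could look like:

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(query_feats, context_feats):
    """Scaled dot-product attention: one modality queries the other."""
    d = query_feats.shape[-1]
    scores = query_feats @ context_feats.T / np.sqrt(d)
    weights = softmax(scores, axis=-1)          # rows sum to 1
    return weights @ context_feats

rng = np.random.default_rng(0)
rgb = rng.standard_normal((16, 64))      # 16 RGB patch tokens, 64-dim
thermal = rng.standard_normal((16, 64))  # 16 thermal patch tokens

# Each modality attends to the other; the enriched features are then
# fused (here simply concatenated as a stand-in for a fusion module).
rgb_enriched = cross_attention(rgb, thermal)
thermal_enriched = cross_attention(thermal, rgb)
fused = np.concatenate([rgb_enriched, thermal_enriched], axis=-1)
print(fused.shape)  # (16, 128)
```

The key point this sketches is symmetry: neither modality is treated as the default, so features specific to one stream are not drowned out by the other, which is the imbalance the abstract argues against.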
Downloads
- 006.pdf, full text (Accepted manuscript) | open access | 1.31 MB
- 006-supp.pdf, supplementary material | open access | 193.77 KB
Citation
Please use this url to cite or link to this publication: http://hdl.handle.net/1854/LU-8766409
- MLA
- Lee, Wei-Yu, et al. “Cross-Modality Attention and Multimodal Fusion Transformer for Pedestrian Detection.” Computer Vision : ECCV 2022 Workshops, Proceedings, Part V, edited by Leonid Karlinsky et al., vol. 13805, Springer, 2023, pp. 608–23, doi:10.1007/978-3-031-25072-9_41.
- APA
- Lee, W.-Y., Jovanov, L., & Philips, W. (2023). Cross-modality attention and multimodal fusion transformer for pedestrian detection. In L. Karlinsky, T. Michaeli, & K. Nishino (Eds.), Computer Vision : ECCV 2022 Workshops, Proceedings, Part V (Vol. 13805, pp. 608–623). https://doi.org/10.1007/978-3-031-25072-9_41
- Chicago author-date
- Lee, Wei-Yu, Ljubomir Jovanov, and Wilfried Philips. 2023. “Cross-Modality Attention and Multimodal Fusion Transformer for Pedestrian Detection.” In Computer Vision : ECCV 2022 Workshops, Proceedings, Part V, edited by Leonid Karlinsky, Tomer Michaeli, and Ko Nishino, 13805:608–23. Springer. https://doi.org/10.1007/978-3-031-25072-9_41.
- Chicago author-date (all authors)
- Lee, Wei-Yu, Ljubomir Jovanov, and Wilfried Philips. 2023. “Cross-Modality Attention and Multimodal Fusion Transformer for Pedestrian Detection.” In Computer Vision : ECCV 2022 Workshops, Proceedings, Part V, edited by Leonid Karlinsky, Tomer Michaeli, and Ko Nishino, 13805:608–623. Springer. doi:10.1007/978-3-031-25072-9_41.
- Vancouver
- 1. Lee W-Y, Jovanov L, Philips W. Cross-modality attention and multimodal fusion transformer for pedestrian detection. In: Karlinsky L, Michaeli T, Nishino K, editors. Computer Vision : ECCV 2022 Workshops, Proceedings, Part V. Springer; 2023. p. 608–23.
- IEEE
- [1] W.-Y. Lee, L. Jovanov, and W. Philips, “Cross-modality attention and multimodal fusion transformer for pedestrian detection,” in Computer Vision : ECCV 2022 Workshops, Proceedings, Part V, Tel Aviv, Israel, 2023, vol. 13805, pp. 608–623.
@inproceedings{8766409,
  abstract  = {{Pedestrian detection is an important challenge in computer vision due to its various applications. To achieve more accurate results, thermal images have been widely exploited as complementary information to assist conventional RGB-based detection. Although existing methods have developed numerous fusion strategies to utilize the complementary features, research that focuses on exploring features exclusive to each modality is limited. On this account, the features specific to one modality cannot be fully utilized and the fusion results could be easily dominated by the other modality, which limits the upper bound of discrimination ability. Hence, we propose the Cross-modality Attention Transformer (CAT) to explore the potential of modality-specific features. Further, we introduce the Multimodal Fusion Transformer (MFT) to identify the correlations between the modality data and perform feature fusion. In addition, a content-aware objective function is proposed to learn better feature representations. The experiments show that our method can achieve state-of-the-art detection performance on public datasets. The ablation studies also show the effectiveness of the proposed components.}},
  author    = {{Lee, Wei-Yu and Jovanov, Ljubomir and Philips, Wilfried}},
  booktitle = {{Computer Vision : ECCV 2022 Workshops, Proceedings, Part V}},
  editor    = {{Karlinsky, Leonid and Michaeli, Tomer and Nishino, Ko}},
  isbn      = {{9783031250712}},
  issn      = {{0302-9743}},
  keywords  = {{Cross-Modality fusion, Multimodal pedestrian detection, Transformer}},
  language  = {{eng}},
  location  = {{Tel Aviv, Israel}},
  pages     = {{608--623}},
  publisher = {{Springer}},
  title     = {{Cross-modality attention and multimodal fusion transformer for pedestrian detection}},
  url       = {{http://doi.org/10.1007/978-3-031-25072-9_41}},
  volume    = {{13805}},
  year      = {{2023}},
}