Abstract
Automatic sign language recognition lies at the intersection of natural language processing (NLP) and computer vision. The highly successful transformer architectures, based on multi-head attention, originate from the field of NLP. The Video Transformer Network (VTN) is an adaptation of this concept for tasks that require video understanding, e.g., action recognition. However, due to the limited amount of labeled data that is commonly available for training automatic sign (language) recognition models, the VTN cannot reach its full potential in this domain. In this work, we reduce the impact of this data limitation by automatically pre-extracting useful information from the sign language videos. In our approach, different types of information are offered to a VTN in a multi-modal setup: per-frame human pose keypoints (extracted with OpenPose) to capture body movement, and hand crops to capture the evolution of hand shapes. We evaluate our method on the recently released AUTSL dataset for isolated sign recognition and obtain 92.92% accuracy on the test set using only RGB data. For comparison, the VTN architecture without hand crops and pose flow achieves 82% accuracy. A qualitative inspection of our model hints at further potential of multi-modal multi-head attention in a sign language recognition context.
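
To make the described setup concrete, below is a minimal PyTorch sketch of a VTN-style multi-modal classifier. It is not the authors' implementation: the class name, the dimensions (226 AUTSL classes, 54 pose keypoints, 64x64 hand crops), the small CNN standing in for the hand-crop feature extractor, and the mean-pooling over frames are illustrative assumptions, and pose flow is approximated as the frame-to-frame keypoint displacement.

import torch
import torch.nn as nn

class MultiModalVTNSketch(nn.Module):
    """Illustrative sketch: pose (plus pose flow) and hand-crop streams are
    embedded per frame, fused, and modeled over time with self-attention."""

    def __init__(self, num_classes=226, num_keypoints=54, d_model=256,
                 nhead=8, num_layers=4):
        super().__init__()
        # Pose stream: keypoints and their frame-to-frame displacement
        # ("pose flow"), flattened per frame and embedded.
        self.pose_embed = nn.Sequential(
            nn.Linear(num_keypoints * 4, d_model), nn.ReLU())
        # Hand stream: a small CNN is a placeholder for whatever feature
        # extractor processes the RGB hand crops.
        self.hand_cnn = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(64, d_model))
        # Fuse both per-frame embeddings, then apply multi-head
        # self-attention across the frame axis.
        self.fuse = nn.Linear(2 * d_model, d_model)
        layer = nn.TransformerEncoderLayer(d_model, nhead, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers)
        self.classifier = nn.Linear(d_model, num_classes)

    def forward(self, poses, hands):
        # poses: (B, T, num_keypoints, 2); hands: (B, T, 3, H, W)
        B, T = poses.shape[:2]
        # Crude pose flow: displacement w.r.t. the previous frame
        # (zero for the first frame).
        flow = torch.diff(poses, dim=1, prepend=poses[:, :1])
        p = self.pose_embed(torch.cat([poses, flow], dim=-1).flatten(2))
        h = self.hand_cnn(hands.flatten(0, 1)).view(B, T, -1)
        x = self.fuse(torch.cat([p, h], dim=-1))   # (B, T, d_model)
        x = self.encoder(x).mean(dim=1)            # pool over frames
        return self.classifier(x)

# Smoke test on random tensors: 2 clips of 16 frames, 64x64 hand crops.
model = MultiModalVTNSketch()
logits = model(torch.randn(2, 16, 54, 2), torch.randn(2, 16, 3, 64, 64))
print(logits.shape)  # torch.Size([2, 226])

The sketch only fixes the data flow (per-frame multi-modal fusion followed by self-attention over frames); the paper's actual feature extractors and training recipe are not reproduced here.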
Keywords
LANGUAGE

Downloads

  • (...).pdf: full text (Published version) | UGent only | PDF | 5.17 MB
  • DS441 acc.pdf: full text (Accepted manuscript) | open access | PDF | 1.88 MB

Citation

MLA
De Coster, Mathieu, et al. “Isolated Sign Recognition from RGB Video Using Pose Flow and Self-Attention.” 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), IEEE, 2021, pp. 3441–50, doi:10.1109/cvprw53098.2021.00383.
APA
De Coster, M., Van Herreweghe, M., & Dambre, J. (2021). Isolated sign recognition from RGB video using pose flow and self-attention. 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), 3441–3450. https://doi.org/10.1109/cvprw53098.2021.00383
Chicago author-date
De Coster, Mathieu, Mieke Van Herreweghe, and Joni Dambre. 2021. “Isolated Sign Recognition from RGB Video Using Pose Flow and Self-Attention.” In 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), 3441–50. IEEE. https://doi.org/10.1109/cvprw53098.2021.00383.
Chicago author-date (all authors)
De Coster, Mathieu, Mieke Van Herreweghe, and Joni Dambre. 2021. “Isolated Sign Recognition from RGB Video Using Pose Flow and Self-Attention.” In 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), 3441–3450. IEEE. doi:10.1109/cvprw53098.2021.00383.
Vancouver
1. De Coster M, Van Herreweghe M, Dambre J. Isolated sign recognition from RGB video using pose flow and self-attention. In: 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW). IEEE; 2021. p. 3441–50.
IEEE
[1] M. De Coster, M. Van Herreweghe, and J. Dambre, “Isolated sign recognition from RGB video using pose flow and self-attention,” in 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), Nashville, TN, USA (online), 2021, pp. 3441–3450.
@inproceedings{8719265,
  abstract     = {{Automatic sign language recognition lies at the intersection of natural language processing (NLP) and computer vision. The highly successful transformer architectures, based on multi-head attention, originate from the field of NLP. The Video Transformer Network (VTN) is an adaptation of this concept for tasks that require video understanding, e.g., action recognition. However, due to the limited amount of labeled data that is commonly available for training automatic sign (language) recognition models, the VTN cannot reach its full potential in this domain. In this work, we reduce the impact of this data limitation by automatically pre-extracting useful information from the sign language videos. In our approach, different types of information are offered to a VTN in a multi-modal setup: per-frame human pose keypoints (extracted with OpenPose) to capture body movement, and hand crops to capture the evolution of hand shapes. We evaluate our method on the recently released AUTSL dataset for isolated sign recognition and obtain 92.92% accuracy on the test set using only RGB data. For comparison, the VTN architecture without hand crops and pose flow achieves 82% accuracy. A qualitative inspection of our model hints at further potential of multi-modal multi-head attention in a sign language recognition context.}},
  author       = {{De Coster, Mathieu and Van Herreweghe, Mieke and Dambre, Joni}},
  booktitle    = {{2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW)}},
  isbn         = {{9781665448994}},
  issn         = {{2160-7508}},
  keywords     = {{LANGUAGE}},
  language     = {{eng}},
  location     = {{Nashville, TN, USA (online)}},
  pages        = {{3441--3450}},
  publisher    = {{IEEE}},
  title        = {{Isolated sign recognition from RGB video using pose flow and self-attention}},
  url          = {{https://doi.org/10.1109/cvprw53098.2021.00383}},
  year         = {{2021}},
}
