Challenges with sign language datasets
- Author
- Vincent Vandeghinste, Mirella De Sisto, Santiago Egea Gómez and Mathieu De Coster (UGent)
- Abstract
- Sign languages are the primary means of communication for more than half a million people in Europe alone. However, the development of sign language recognition and translation tools is slowed down by two obstacles: resource scarcity and, where data is available, a lack of standardisation in that data. The former challenge relates to the volume and quality of data available for machine learning as well as the time required to collect and process new data. The latter obstacle is linked to the variety of the data, i.e., annotation formats are not unified and vary amongst different resources. The available data formats are often not suitable for machine learning, obstructing the provision of automatic tools based on neural models. This chapter provides an overview of such challenges by comparing various sign language corpora and sign language machine learning datasets. Furthermore, it proposes a framework to address the lack of standardisation at format level, unify the available resources and facilitate sign language research for different languages. The framework takes ELAN files as inputs and returns textual and visual data ready to train sign language recognition and translation models. We present a proof of concept, training neural translation models on the data produced by the proposed framework.
Downloads
- (...).pdf
- full text (Published version)
- UGent only
- 2.01 MB
- (...).pdf
- full text (Accepted manuscript)
- UGent only (changes to open access on 2026-07-27)
- 1.82 MB
Citation
Please use this url to cite or link to this publication: http://hdl.handle.net/1854/LU-01JCN7H5F4HXXH5SA8ZV0TQ0C6
- MLA
- Vandeghinste, Vincent, et al. “Challenges with Sign Language Datasets.” Sign Language Machine Translation, edited by Andy Way et al., vol. 5, Springer, 2024, pp. 117–39, doi:10.1007/978-3-031-47362-3_5.
- APA
- Vandeghinste, V., De Sisto, M., Gómez, S. E., & De Coster, M. (2024). Challenges with sign language datasets. In A. Way, L. Leeson, & D. Shterionov (Eds.), Sign language machine translation (Vol. 5, pp. 117–139). https://doi.org/10.1007/978-3-031-47362-3_5
- Chicago author-date
- Vandeghinste, Vincent, Mirella De Sisto, Santiago Egea Gómez, and Mathieu De Coster. 2024. “Challenges with Sign Language Datasets.” In Sign Language Machine Translation, edited by Andy Way, Lorraine Leeson, and Dimitar Shterionov, 5:117–39. Cham: Springer. https://doi.org/10.1007/978-3-031-47362-3_5.
- Chicago author-date (all authors)
- Vandeghinste, Vincent, Mirella De Sisto, Santiago Egea Gómez, and Mathieu De Coster. 2024. “Challenges with Sign Language Datasets.” In Sign Language Machine Translation, ed. by Andy Way, Lorraine Leeson, and Dimitar Shterionov, 5:117–139. Cham: Springer. doi:10.1007/978-3-031-47362-3_5.
- Vancouver
- 1. Vandeghinste V, De Sisto M, Gómez SE, De Coster M. Challenges with sign language datasets. In: Way A, Leeson L, Shterionov D, editors. Sign language machine translation. Cham: Springer; 2024. p. 117–39.
- IEEE
- [1] V. Vandeghinste, M. De Sisto, S. E. Gómez, and M. De Coster, “Challenges with sign language datasets,” in Sign language machine translation, vol. 5, A. Way, L. Leeson, and D. Shterionov, Eds. Cham: Springer, 2024, pp. 117–139.
@incollection{01JCN7H5F4HXXH5SA8ZV0TQ0C6,
abstract = {{Sign languages are the primary means of communication for more than half a million people in Europe alone. However, the development of sign language recognition and translation tools is slowed down by two obstacles: resource scarcity and, where data is available, a lack of standardisation in that data.
The former challenge relates to the volume and quality of data available for machine learning as well as the time required to collect and process new data. The latter obstacle is linked to the variety of the data, i.e., annotation formats are not unified and vary amongst different resources. The available data formats are often not suitable for machine learning, obstructing the provision of automatic tools based on neural models.
This chapter provides an overview of such challenges by comparing various sign language corpora and sign language machine learning datasets. Furthermore, it proposes a framework to address the lack of standardisation at format level, unify the available resources and facilitate sign language research for different languages. The framework takes ELAN files as inputs and returns textual and visual data ready to train sign language recognition and translation models. We present a proof of concept, training neural translation models on the data produced by the proposed framework.}},
author = {{Vandeghinste, Vincent and De Sisto, Mirella and Gómez, Santiago Egea and De Coster, Mathieu}},
booktitle = {{Sign language machine translation}},
editor = {{Way, Andy and Leeson, Lorraine and Shterionov, Dimitar}},
isbn = {{9783031473616}},
issn = {{2522-8021}},
language = {{eng}},
pages = {{117--139}},
publisher = {{Springer}},
series = {{Machine translation : technologies and applications}},
title = {{Challenges with sign language datasets}},
url = {{http://doi.org/10.1007/978-3-031-47362-3_5}},
volume = {{5}},
year = {{2024}},
}