In no uncertain terms : a dataset for monolingual and multilingual automatic term extraction from comparable corpora
- Author
- Ayla Rigouts Terryn, Veronique Hoste (UGent) and Els Lefever (UGent)
- Abstract
- Automatic term extraction is a productive field of research within natural language processing, but it still faces significant obstacles regarding datasets and evaluation, which require manual term annotation. This is an arduous task, made even more difficult by the lack of a clear distinction between terms and general language, which results in low inter-annotator agreement. There is a large need for well-documented, manually validated datasets, especially in the rising field of multilingual term extraction from comparable corpora, which presents a unique new set of challenges. In this paper, a new approach is presented for both monolingual and multilingual term annotation in comparable corpora. The detailed guidelines with different term labels, the domain- and language-independent methodology and the large volumes annotated in three different languages and four different domains make this a rich resource. The resulting datasets are not just suited for evaluation purposes but can also serve as a general source of information about terms and even as training data for supervised methods. Moreover, the gold standard for multilingual term extraction from comparable corpora contains information about term variants and translation equivalents, which allows an in-depth, nuanced evaluation.
- Keywords
- LT3, automatic term extraction, terminology, ATR, Comparable Corpora, term annotation, TERMINOLOGY EXTRACTION, RECOGNITION, ENGLISH, DOMAIN
Downloads
- published.pdf
- full text (Published version)
- open access
- 467.92 KB
Citation
Please use this url to cite or link to this publication: http://hdl.handle.net/1854/LU-8640705
- MLA
- Rigouts Terryn, Ayla, et al. “In No Uncertain Terms : A Dataset for Monolingual and Multilingual Automatic Term Extraction from Comparable Corpora.” LANGUAGE RESOURCES AND EVALUATION, vol. 54, no. 2, 2020, pp. 385–418, doi:10.1007/s10579-019-09453-9.
- APA
- Rigouts Terryn, A., Hoste, V., & Lefever, E. (2020). In no uncertain terms : a dataset for monolingual and multilingual automatic term extraction from comparable corpora. LANGUAGE RESOURCES AND EVALUATION, 54(2), 385–418. https://doi.org/10.1007/s10579-019-09453-9
- Chicago author-date
- Rigouts Terryn, Ayla, Veronique Hoste, and Els Lefever. 2020. “In No Uncertain Terms : A Dataset for Monolingual and Multilingual Automatic Term Extraction from Comparable Corpora.” LANGUAGE RESOURCES AND EVALUATION 54 (2): 385–418. https://doi.org/10.1007/s10579-019-09453-9.
- Chicago author-date (all authors)
- Rigouts Terryn, Ayla, Veronique Hoste, and Els Lefever. 2020. “In No Uncertain Terms : A Dataset for Monolingual and Multilingual Automatic Term Extraction from Comparable Corpora.” LANGUAGE RESOURCES AND EVALUATION 54 (2): 385–418. doi:10.1007/s10579-019-09453-9.
- Vancouver
- 1. Rigouts Terryn A, Hoste V, Lefever E. In no uncertain terms : a dataset for monolingual and multilingual automatic term extraction from comparable corpora. LANGUAGE RESOURCES AND EVALUATION. 2020;54(2):385–418.
- IEEE
- [1] A. Rigouts Terryn, V. Hoste, and E. Lefever, “In no uncertain terms : a dataset for monolingual and multilingual automatic term extraction from comparable corpora,” LANGUAGE RESOURCES AND EVALUATION, vol. 54, no. 2, pp. 385–418, 2020.
@article{8640705,
  abstract     = {{Automatic term extraction is a productive field of research within natural language processing, but it still faces significant obstacles regarding datasets and evaluation, which require manual term annotation. This is an arduous task, made even more difficult by the lack of a clear distinction between terms and general language, which results in low inter-annotator agreement. There is a large need for well-documented, manually validated datasets, especially in the rising field of multilingual term extraction from comparable corpora, which presents a unique new set of challenges. In this paper, a new approach is presented for both monolingual and multilingual term annotation in comparable corpora. The detailed guidelines with different term labels, the domain- and language-independent methodology and the large volumes annotated in three different languages and four different domains make this a rich resource. The resulting datasets are not just suited for evaluation purposes but can also serve as a general source of information about terms and even as training data for supervised methods. Moreover, the gold standard for multilingual term extraction from comparable corpora contains information about term variants and translation equivalents, which allows an in-depth, nuanced evaluation.}},
  author       = {{Rigouts Terryn, Ayla and Hoste, Veronique and Lefever, Els}},
  issn         = {{1574-020X}},
  journal      = {{LANGUAGE RESOURCES AND EVALUATION}},
  keywords     = {{LT3,automatic term extraction,terminology,ATR,Comparable Corpora,term annotation,TERMINOLOGY EXTRACTION,RECOGNITION,ENGLISH,DOMAIN}},
  language     = {{eng}},
  number       = {{2}},
  pages        = {{385--418}},
  title        = {{In no uncertain terms : a dataset for monolingual and multilingual automatic term extraction from comparable corpora}},
  url          = {{http://doi.org/10.1007/s10579-019-09453-9}},
  volume       = {{54}},
  year         = {{2020}},
}