HAMLET : Hybrid Adaptable Machine Learning approach to Extract Terminology
- Author
- Ayla Rigouts Terryn, Veronique Hoste (UGent) and Els Lefever (UGent)
- Abstract
- Automatic term extraction (ATE) is an important task within natural language processing, both separately, and as a preprocessing step for other tasks. In recent years, research has moved far beyond the traditional hybrid approach where candidate terms are extracted based on part-of-speech patterns and filtered and sorted with statistical termhood and unithood measures. While there has been an explosion of different types of features and algorithms, including machine learning methodologies, some of the fundamental problems remain unsolved, such as the ambiguous nature of the concept “term”. This has been a hurdle in the creation of data for ATE, meaning that datasets for both training and testing are scarce, and system evaluations are often limited and rarely cover multiple languages and domains. The Annotated Corpora for Term Extraction Research (ACTER) contain manual term annotations in four domains and three languages and have been used to investigate a supervised machine learning approach for ATE, using a binary random forest classifier with multiple types of features. The resulting system, HAMLET (Hybrid Adaptable Machine Learning approach to Extract Terminology), provides detailed insights into its strengths and weaknesses. It highlights a certain unpredictability as an important drawback of machine learning methodologies, but also shows how the system appears to have learnt a robust definition of terms, producing results that are state-of-the-art, and contain few errors that are not (part of) terms in any way. Both the amount and the relevance of the training data have a substantial effect on results, and by varying the training data, it appears to be possible to adapt the system to various desired outputs, e.g., different types of terms. While certain issues remain difficult – such as the extraction of rare terms and multiword terms – this study shows how supervised machine learning is a promising methodology for ATE.
- Keywords
- Library and Information Sciences, Communication, Language and Linguistics, lt3
Downloads
- published article.pdf: full text (Published version), open access, 476.93 KB
Citation
Please use this url to cite or link to this publication: http://hdl.handle.net/1854/LU-8718062
- MLA
- Rigouts Terryn, Ayla, et al. “HAMLET : Hybrid Adaptable Machine Learning Approach to Extract Terminology.” TERMINOLOGY, vol. 27, no. 2, 2021, pp. 290–329, doi:10.1075/term.20017.rig.
- APA
- Rigouts Terryn, A., Hoste, V., & Lefever, E. (2021). HAMLET : Hybrid Adaptable Machine Learning approach to Extract Terminology. TERMINOLOGY, 27(2), 290–329. https://doi.org/10.1075/term.20017.rig
- Chicago author-date
- Rigouts Terryn, Ayla, Veronique Hoste, and Els Lefever. 2021. “HAMLET : Hybrid Adaptable Machine Learning Approach to Extract Terminology.” TERMINOLOGY 27 (2): 290–329. https://doi.org/10.1075/term.20017.rig.
- Chicago author-date (all authors)
- Rigouts Terryn, Ayla, Veronique Hoste, and Els Lefever. 2021. “HAMLET : Hybrid Adaptable Machine Learning Approach to Extract Terminology.” TERMINOLOGY 27 (2): 290–329. doi:10.1075/term.20017.rig.
- Vancouver
- 1. Rigouts Terryn A, Hoste V, Lefever E. HAMLET : Hybrid Adaptable Machine Learning approach to Extract Terminology. TERMINOLOGY. 2021;27(2):290–329.
- IEEE
- [1] A. Rigouts Terryn, V. Hoste, and E. Lefever, “HAMLET : Hybrid Adaptable Machine Learning approach to Extract Terminology,” TERMINOLOGY, vol. 27, no. 2, pp. 290–329, 2021.
@article{8718062,
  abstract  = {{Automatic term extraction (ATE) is an important task within natural language processing, both separately, and as a preprocessing step for other tasks. In recent years, research has moved far beyond the traditional hybrid approach where candidate terms are extracted based on part-of-speech patterns and filtered and sorted with statistical termhood and unithood measures. While there has been an explosion of different types of features and algorithms, including machine learning methodologies, some of the fundamental problems remain unsolved, such as the ambiguous nature of the concept “term”. This has been a hurdle in the creation of data for ATE, meaning that datasets for both training and testing are scarce, and system evaluations are often limited and rarely cover multiple languages and domains. The Annotated Corpora for Term Extraction Research (ACTER) contain manual term annotations in four domains and three languages and have been used to investigate a supervised machine learning approach for ATE, using a binary random forest classifier with multiple types of features. The resulting system, HAMLET (Hybrid Adaptable Machine Learning approach to Extract Terminology), provides detailed insights into its strengths and weaknesses. It highlights a certain unpredictability as an important drawback of machine learning methodologies, but also shows how the system appears to have learnt a robust definition of terms, producing results that are state-of-the-art, and contain few errors that are not (part of) terms in any way. Both the amount and the relevance of the training data have a substantial effect on results, and by varying the training data, it appears to be possible to adapt the system to various desired outputs, e.g., different types of terms. While certain issues remain difficult – such as the extraction of rare terms and multiword terms – this study shows how supervised machine learning is a promising methodology for ATE.}},
  author    = {{Rigouts Terryn, Ayla and Hoste, Veronique and Lefever, Els}},
  issn      = {{0929-9971}},
  journal   = {{TERMINOLOGY}},
  keywords  = {{Library and Information Sciences,Communication,Language and Linguistics,lt3}},
  language  = {{eng}},
  number    = {{2}},
  pages     = {{290--329}},
  title     = {{HAMLET : Hybrid Adaptable Machine Learning approach to Extract Terminology}},
  url       = {{http://doi.org/10.1075/term.20017.rig}},
  volume    = {{27}},
  year      = {{2021}},
}