
HAMLET : Hybrid Adaptable Machine Learning approach to Extract Terminology

Ayla Rigouts Terryn (UGent) , Veronique Hoste (UGent) and Els Lefever (UGent)
(2021) TERMINOLOGY. 27(2). p.290-329
Abstract
Automatic term extraction (ATE) is an important task within natural language processing, both separately, and as a preprocessing step for other tasks. In recent years, research has moved far beyond the traditional hybrid approach where candidate terms are extracted based on part-of-speech patterns and filtered and sorted with statistical termhood and unithood measures. While there has been an explosion of different types of features and algorithms, including machine learning methodologies, some of the fundamental problems remain unsolved, such as the ambiguous nature of the concept “term”. This has been a hurdle in the creation of data for ATE, meaning that datasets for both training and testing are scarce, and system evaluations are often limited and rarely cover multiple languages and domains. The ACTER Annotated Corpora for Term Extraction Research contain manual term annotations in four domains and three languages and have been used to investigate a supervised machine learning approach for ATE, using a binary random forest classifier with multiple types of features. The resulting system (HAMLET: Hybrid Adaptable Machine Learning approach to Extract Terminology) provides detailed insights into its strengths and weaknesses. It highlights a certain unpredictability as an important drawback of machine learning methodologies, but also shows how the system appears to have learnt a robust definition of terms, producing results that are state-of-the-art and contain few errors that are not (part of) terms in any way. Both the amount and the relevance of the training data have a substantial effect on results, and by varying the training data, it appears to be possible to adapt the system to various desired outputs, e.g., different types of terms. While certain issues remain difficult – such as the extraction of rare terms and multiword terms – this study shows how supervised machine learning is a promising methodology for ATE.
Keywords
Library and Information Sciences, Communication, Language and Linguistics, lt3

Downloads

  • published article.pdf: full text (Published version), open access, PDF, 476.93 KB

Citation

MLA
Rigouts Terryn, Ayla, et al. “HAMLET : Hybrid Adaptable Machine Learning Approach to Extract Terminology.” TERMINOLOGY, vol. 27, no. 2, 2021, pp. 290–329, doi:10.1075/term.20017.rig.
APA
Rigouts Terryn, A., Hoste, V., & Lefever, E. (2021). HAMLET : Hybrid Adaptable Machine Learning approach to Extract Terminology. TERMINOLOGY, 27(2), 290–329. https://doi.org/10.1075/term.20017.rig
Chicago author-date
Rigouts Terryn, Ayla, Veronique Hoste, and Els Lefever. 2021. “HAMLET : Hybrid Adaptable Machine Learning Approach to Extract Terminology.” TERMINOLOGY 27 (2): 290–329. https://doi.org/10.1075/term.20017.rig.
Chicago author-date (all authors)
Rigouts Terryn, Ayla, Veronique Hoste, and Els Lefever. 2021. “HAMLET : Hybrid Adaptable Machine Learning Approach to Extract Terminology.” TERMINOLOGY 27 (2): 290–329. doi:10.1075/term.20017.rig.
Vancouver
1. Rigouts Terryn A, Hoste V, Lefever E. HAMLET : Hybrid Adaptable Machine Learning approach to Extract Terminology. TERMINOLOGY. 2021;27(2):290–329.
IEEE
[1] A. Rigouts Terryn, V. Hoste, and E. Lefever, “HAMLET : Hybrid Adaptable Machine Learning approach to Extract Terminology,” TERMINOLOGY, vol. 27, no. 2, pp. 290–329, 2021.
BibTeX
@article{8718062,
  abstract     = {{Automatic term extraction (ATE) is an important task within natural language processing, both separately, and as a preprocessing step for other tasks. In recent years, research has moved far beyond the traditional hybrid approach where candidate terms are extracted based on part-of-speech patterns and filtered and sorted with statistical termhood and unithood measures. While there has been an explosion of different types of features and algorithms, including machine learning methodologies, some of the fundamental problems remain unsolved, such as the ambiguous nature of the concept “term”. This has been a hurdle in the creation of data for ATE, meaning that datasets for both training and testing are scarce, and system evaluations are often limited and rarely cover multiple languages and domains. The ACTER Annotated Corpora for Term Extraction Research contain manual term annotations in four domains and three languages and have been used to investigate a supervised machine learning approach for ATE, using a binary random forest classifier with multiple types of features. The resulting system (HAMLET Hybrid Adaptable Machine Learning approach to Extract Terminology) provides detailed insights into its strengths and weaknesses. It highlights a certain unpredictability as an important drawback of machine learning methodologies, but also shows how the system appears to have learnt a robust definition of terms, producing results that are state-of-the-art, and contain few errors that are not (part of) terms in any way. Both the amount and the relevance of the training data have a substantial effect on results, and by varying the training data, it appears to be possible to adapt the system to various desired outputs, e.g., different types of terms. While certain issues remain difficult – such as the extraction of rare terms and multiword terms – this study shows how supervised machine learning is a promising methodology for ATE.}},
  author       = {{Rigouts Terryn, Ayla and Hoste, Veronique and Lefever, Els}},
  issn         = {{0929-9971}},
  journal      = {{TERMINOLOGY}},
  keywords     = {{Library and Information Sciences,Communication,Language and Linguistics,lt3}},
  language     = {{eng}},
  number       = {{2}},
  pages        = {{290--329}},
  title        = {{HAMLET : Hybrid Adaptable Machine Learning approach to Extract Terminology}},
  url          = {{http://dx.doi.org/10.1075/term.20017.rig}},
  volume       = {{27}},
  year         = {{2021}},
}
