The trade-off between quantity and quality : comparing a large web corpus and a small focused corpus for medical terminology extraction
- Author
- Veronique Hoste (UGent) , Klaar Vanopstal (UGent) , Ayla Rigouts Terryn and Els Lefever (UGent)
- Abstract
- We investigate the cost-effectiveness of special-purpose crawled corpora versus more focused corpora for automatic terminology extraction (ATE). Our focus is on medical terminology on heart failure for two languages, viz. English, for which we have more web and specialized resources at our disposal, and the less-resourced Dutch. We show that, although term density is higher in the dedicated corpora for both languages, the potential for term extraction is higher in the crawled corpora than in the dedicated corpora. Furthermore, in a set of experiments in which we evaluate both types of corpora while keeping size constant, we observe that more Gold Standard (GS) terms are covered by the "noisy" crawled corpus than by a dedicated corpus of the same size.
- Keywords
- terminology, automatic terminology extraction, corpora, medical terminology
Citation
Please use this url to cite or link to this publication: http://hdl.handle.net/1854/LU-8531641
- MLA
- Hoste, Veronique, et al. “The Trade-off between Quantity and Quality : Comparing a Large Web Corpus and a Small Focused Corpus for Medical Terminology Extraction.” ACROSS LANGUAGES AND CULTURES, vol. 20, no. 2, 2019, pp. 197–211, doi:10.1556/084.2019.20.2.3.
- APA
- Hoste, V., Vanopstal, K., Rigouts Terryn, A., & Lefever, E. (2019). The trade-off between quantity and quality : comparing a large web corpus and a small focused corpus for medical terminology extraction. ACROSS LANGUAGES AND CULTURES, 20(2), 197–211. https://doi.org/10.1556/084.2019.20.2.3
- Chicago author-date
- Hoste, Veronique, Klaar Vanopstal, Ayla Rigouts Terryn, and Els Lefever. 2019. “The Trade-off between Quantity and Quality : Comparing a Large Web Corpus and a Small Focused Corpus for Medical Terminology Extraction.” ACROSS LANGUAGES AND CULTURES 20 (2): 197–211. https://doi.org/10.1556/084.2019.20.2.3.
- Chicago author-date (all authors)
- Hoste, Veronique, Klaar Vanopstal, Ayla Rigouts Terryn, and Els Lefever. 2019. “The Trade-off between Quantity and Quality : Comparing a Large Web Corpus and a Small Focused Corpus for Medical Terminology Extraction.” ACROSS LANGUAGES AND CULTURES 20 (2): 197–211. doi:10.1556/084.2019.20.2.3.
- Vancouver
- 1. Hoste V, Vanopstal K, Rigouts Terryn A, Lefever E. The trade-off between quantity and quality : comparing a large web corpus and a small focused corpus for medical terminology extraction. ACROSS LANGUAGES AND CULTURES. 2019;20(2):197–211.
- IEEE
- [1] V. Hoste, K. Vanopstal, A. Rigouts Terryn, and E. Lefever, “The trade-off between quantity and quality : comparing a large web corpus and a small focused corpus for medical terminology extraction,” ACROSS LANGUAGES AND CULTURES, vol. 20, no. 2, pp. 197–211, 2019.
@article{8531641,
  abstract = {{We investigate the cost-effectiveness of special-purpose crawled corpora versus more focused corpora for automatic terminology extraction (ATE). Our focus is on medical terminology on heart failure for two languages, viz. English for which we have more web and specialized resources at our disposal and the less resourced Dutch. We show that, although term density in the dedicated corpora is larger for both languages, the potential for term extraction is higher in the crawled corpora than in the dedicated corpora. Furthermore, in a set of experiments in which we evaluate both types of corpora, while keeping size constant, we observe that more Gold Standard (GS) terms are covered by the "noisy" crawled corpus than with a dedicated corpus of the same size.}},
  author = {{Hoste, Veronique and Vanopstal, Klaar and Rigouts Terryn, Ayla and Lefever, Els}},
  issn = {{1585-1923}},
  journal = {{ACROSS LANGUAGES AND CULTURES}},
  keywords = {{terminology,automatic terminology extraction,corpora,medical terminology}},
  language = {{eng}},
  number = {{2}},
  pages = {{197--211}},
  title = {{The trade-off between quantity and quality : comparing a large web corpus and a small focused corpus for medical terminology extraction}},
  url = {{http://doi.org/10.1556/084.2019.20.2.3}},
  volume = {{20}},
  year = {{2019}},
}