Annotated Corpora for Term Extraction Research (ACTER)

(2020)
Author
Ayla Rigouts Terryn, Veronique Hoste, Els Lefever
Organization
Abstract
The Annotated Corpora for Term Extraction Research (ACTER) contain texts in four domains (corruption, dressage (horse riding), heart failure, and wind energy) and three languages (English, French, Dutch). For each corpus (a combination of domain and language), around 50k tokens have been manually annotated to identify terminology and named entities (almost 600k annotated tokens in total). The results are presented as lists of annotations per corpus, with one unique, lowercased, unlemmatised annotation per line, separated from its label by a tab. In total, there are 19k unique annotations. The annotation process is transparent and well documented, with freely available guidelines (http://hdl.handle.net/1854/LU-8503113) and several published papers validating the dataset. The dataset was also used for the TermEval 2020 shared task on automatic term extraction, organised at the CompuTerm workshop at LREC 2020.
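The annotation-list format described above (one annotation per line, followed by a tab and its label) could be read with a few lines of Python. This is a minimal sketch, not part of the dataset's own tooling: the function name `parse_annotations`, the sample lines, and the label names used in them are assumptions made for illustration, so check the dataset's documentation for the actual file names and label set.

```python
from collections import Counter
from typing import Dict, Iterable


def parse_annotations(lines: Iterable[str]) -> Dict[str, str]:
    """Map each annotation to its label, given 'annotation<TAB>label' lines."""
    annotations: Dict[str, str] = {}
    for raw in lines:
        line = raw.rstrip("\r\n")
        if not line.strip():
            continue  # skip blank lines
        term, sep, label = line.partition("\t")
        if not sep:
            raise ValueError(f"expected a tab-separated line, got: {line!r}")
        annotations[term.strip()] = label.strip()
    return annotations


# Hypothetical sample lines; the label names are assumed for illustration.
sample = [
    "heart failure\tSpecific_Term\n",
    "patient\tCommon_Term\n",
    "\n",
    "new york heart association\tNamed_Entity\n",
]

annotations = parse_annotations(sample)
label_counts = Counter(annotations.values())
```

To read a real file, the same function can be given an open file handle, for example `parse_annotations(open(path, encoding="utf-8"))`, where `path` points at one of the per-corpus annotation lists.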
Keywords
LT3, automatic term extraction, terminology, comparable corpora, ATE, terms, annotation, corpora
License
CC-BY-NC-SA-4.0
Access
open access

Citation

Please use this url to cite or link to this publication:

@misc{8649280,
  abstract     = {{The Annotated Corpora for Term Extraction Research (ACTER) contain texts in four domains (corruption, dressage (horse riding), heart failure, and wind energy) and three languages (English, French, Dutch). For each corpus (a combination of domain and language), around 50k tokens have been manually annotated to identify terminology and named entities (almost 600k annotated tokens in total). The results are presented as lists of annotations per corpus, with one unique, lowercased, unlemmatised annotation per line, separated from its label by a tab. In total, there are 19k unique annotations. The annotation process is transparent and well documented, with freely available guidelines (http://hdl.handle.net/1854/LU-8503113) and several published papers validating the dataset. The dataset was also used for the TermEval 2020 shared task on automatic term extraction, organised at the CompuTerm workshop at LREC 2020.}},
  author       = {{Rigouts Terryn, Ayla and Hoste, Veronique and Lefever, Els}},
  keywords     = {{LT3,automatic term extraction,terminology,comparable corpora,ATE,terms,annotation,corpora}},
  language     = {{eng,fre,dut}},
  publisher    = {{Zenodo}},
  title        = {{Annotated Corpora for Term Extraction Research (ACTER)}},
  url          = {{http://doi.org/10.1007/s10579-019-09453-9}},
  year         = {{2020}},
}
