LeTs Preprocess: the Multilingual LT3 Linguistic Preprocessing Toolkit

(2013)
Project
LT3
Abstract
This paper presents the LeTs Preprocess Toolkit, a suite of robust high-performance preprocessing modules that include part-of-speech taggers, lemmatizers and Named Entity Recognizers. The currently supported languages are Dutch, English, French and German. We give a detailed description of the architecture of the LeTs Preprocess pipeline and describe the data and methods used to train these components. Ten-fold cross-validation results are also presented. To assess the performance of each module on different domains, we collected real-world textual data from companies covering various domains (e.g. automotive, dredging and human resources) for all four supported languages. For this multi-domain corpus, a manually verified gold standard was created for each of the three preprocessing steps. We present the accuracy of our preprocessing tools on this corpus and compare it to the performance of other existing tools.

Citation

Chicago
Van de Kauter, Marjan, Geert Coorman, Els Lefever, Bart Desmet, Sofie Niemegeers, Lubbert-Jan Gringhuis, Lieve Macken, and Veronique Hoste. 2013. “LeTs Preprocess: The Multilingual LT3 Linguistic Preprocessing Toolkit.”
APA
Van de Kauter, M., Coorman, G., Lefever, E., Desmet, B., Niemegeers, S., Gringhuis, L.-J., Macken, L., et al. (2013). LeTs Preprocess: the Multilingual LT3 Linguistic Preprocessing Toolkit. Presented at the CLIN 2013.
Vancouver
1. Van de Kauter M, Coorman G, Lefever E, Desmet B, Niemegeers S, Gringhuis L-J, et al. LeTs Preprocess: the Multilingual LT3 Linguistic Preprocessing Toolkit. 2013.
MLA
Van de Kauter, Marjan, Geert Coorman, Els Lefever, et al. “LeTs Preprocess: The Multilingual LT3 Linguistic Preprocessing Toolkit.” 2013. Print.
@inproceedings{7199027,
  abstract     = {This paper presents the LeTs Preprocess Toolkit, a suite of robust high-performance preprocessing modules that include part-of-speech taggers, lemmatizers and Named Entity Recognizers. The currently supported languages are Dutch, English, French and German.

We give a detailed description of the architecture of the LeTs Preprocess pipeline and describe the data and methods used to train these components. Ten-fold cross-validation results are also presented. To assess the performance of each module on different domains, we collected real-world textual data from companies covering various domains (e.g. automotive, dredging and human resources) for all four supported languages. For this multi-domain corpus, a manually verified gold standard was created for each of the three preprocessing steps. We present the accuracy of our preprocessing tools on this corpus and compare it to the performance of other existing tools.},
  author       = {Van de Kauter, Marjan and Coorman, Geert and Lefever, Els and Desmet, Bart and Niemegeers, Sofie and Gringhuis, Lubbert-Jan and Macken, Lieve and Hoste, Veronique},
  language     = {eng},
  location     = {Enschede, The Netherlands},
  title        = {LeTs Preprocess: the Multilingual LT3 Linguistic Preprocessing Toolkit},
  url          = {http://hmi.ewi.utwente.nl/clin2013/Posters},
  year         = {2013},
}