Multi-modular text normalization of Dutch user-generated content

Author
Sarah Schulz, Guy De Pauw, Orphée De Clercq, Bart Desmet, Veronique Hoste, Walter Daelemans and Lieve Macken
Abstract
As social media constitute a valuable source for data analysis for a wide range of applications, the need for handling such data arises. However, the non-standard language used on social media poses problems for Natural Language Processing (NLP) tools as these are typically trained on standard language material. We propose a text normalization approach to tackle this problem. More specifically, we investigate the usefulness of a multi-modular approach to account for the diversity of normalization issues encountered in user-generated content. We consider three different types of user-generated content written in Dutch (SNS, SMS and tweets) and provide a detailed analysis of the performance of the different modules and the overall system. We also apply an extrinsic evaluation by evaluating the performance of a part-of-speech tagger, lemmatizer and named-entity recognizer before and after normalization.
Keywords
user-generated content, LANGUAGE, text normalization, machine translation, social media

Citation

MLA
Schulz, Sarah, et al. “Multi-Modular Text Normalization of Dutch User-Generated Content.” ACM TRANSACTIONS ON INTELLIGENT SYSTEMS AND TECHNOLOGY, vol. 7, no. 4, 2016, doi:10.1145/2850422.
APA
Schulz, S., De Pauw, G., De Clercq, O., Desmet, B., Hoste, V., Daelemans, W., & Macken, L. (2016). Multi-modular text normalization of Dutch user-generated content. ACM TRANSACTIONS ON INTELLIGENT SYSTEMS AND TECHNOLOGY, 7(4). https://doi.org/10.1145/2850422
Chicago author-date
Schulz, Sarah, Guy De Pauw, Orphée De Clercq, Bart Desmet, Veronique Hoste, Walter Daelemans, and Lieve Macken. 2016. “Multi-Modular Text Normalization of Dutch User-Generated Content.” ACM TRANSACTIONS ON INTELLIGENT SYSTEMS AND TECHNOLOGY 7 (4). https://doi.org/10.1145/2850422.
Vancouver
1. Schulz S, De Pauw G, De Clercq O, Desmet B, Hoste V, Daelemans W, et al. Multi-modular text normalization of Dutch user-generated content. ACM TRANSACTIONS ON INTELLIGENT SYSTEMS AND TECHNOLOGY. 2016;7(4).
IEEE
[1] S. Schulz et al., “Multi-modular text normalization of Dutch user-generated content,” ACM TRANSACTIONS ON INTELLIGENT SYSTEMS AND TECHNOLOGY, vol. 7, no. 4, 2016.
BibTeX
@article{7010719,
  abstract     = {{As social media constitute a valuable source for data analysis for a wide range of applications, the need for handling such data arises. However, the non-standard language used on social media poses problems for Natural Language Processing (NLP) tools as these are typically trained on standard language material. We propose a text normalization approach to tackle this problem. More specifically, we investigate the usefulness of a multi-modular approach to account for the diversity of normalization issues encountered in user-generated content. We consider three different types of user-generated content written in Dutch (SNS, SMS and tweets) and provide a detailed analysis of the performance of the different modules and the overall system. We also apply an extrinsic evaluation by evaluating the performance of a part-of-speech tagger, lemmatizer and named-entity recognizer before and after normalization.}},
  articleno    = {{61}},
  author       = {{Schulz, Sarah and De Pauw, Guy and De Clercq, Orphée and Desmet, Bart and Hoste, Veronique and Daelemans, Walter and Macken, Lieve}},
  issn         = {{2157-6904}},
  journal      = {{ACM TRANSACTIONS ON INTELLIGENT SYSTEMS AND TECHNOLOGY}},
  keywords     = {{user-generated content,LANGUAGE,text normalization,machine translation,social media}},
  language     = {{eng}},
  number       = {{4}},
  pages        = {{24}},
  title        = {{Multi-modular text normalization of Dutch user-generated content}},
  url          = {{http://doi.org/10.1145/2850422}},
  volume       = {{7}},
  year         = {{2016}},
}
