Multi-modular text normalization of Dutch user-generated content

Author
  • Sarah Schulz, Guy De Pauw, Orphée De Clercq, Bart Desmet, Veronique Hoste, Walter Daelemans and Lieve Macken
Project
  • LT3
Abstract
As social media constitute a valuable source for data analysis for a wide range of applications, the need for handling such data arises. However, the non-standard language used on social media poses problems for Natural Language Processing (NLP) tools as these are typically trained on standard language material. We propose a text normalization approach to tackle this problem. More specifically, we investigate the usefulness of a multi-modular approach to account for the diversity of normalization issues encountered in user-generated content. We consider three different types of user-generated content written in Dutch (SNS, SMS and tweets) and provide a detailed analysis of the performance of the different modules and the overall system. We also apply an extrinsic evaluation by evaluating the performance of a part-of-speech tagger, lemmatizer and named-entity recognizer before and after normalization.
Keywords
user-generated content, LANGUAGE, text normalization, machine translation, social media

Downloads

  • (...).pdf (full text | UGent only | PDF | 229.55 KB)

Citation

MLA
Schulz, Sarah et al. “Multi-modular Text Normalization of Dutch User-generated Content.” ACM Transactions on Intelligent Systems and Technology 7.4 (2016): n. pag. Print.
APA
Schulz, S., De Pauw, G., De Clercq, O., Desmet, B., Hoste, V., Daelemans, W., & Macken, L. (2016). Multi-modular text normalization of Dutch user-generated content. ACM Transactions on Intelligent Systems and Technology, 7(4).
Chicago author-date
Schulz, Sarah, Guy De Pauw, Orphée De Clercq, Bart Desmet, Veronique Hoste, Walter Daelemans, and Lieve Macken. 2016. “Multi-modular Text Normalization of Dutch User-generated Content.” ACM Transactions on Intelligent Systems and Technology 7 (4).
Vancouver
1. Schulz S, De Pauw G, De Clercq O, Desmet B, Hoste V, Daelemans W, et al. Multi-modular text normalization of Dutch user-generated content. ACM Transactions on Intelligent Systems and Technology. 2016;7(4).
IEEE
[1] S. Schulz et al., “Multi-modular text normalization of Dutch user-generated content,” ACM Transactions on Intelligent Systems and Technology, vol. 7, no. 4, 2016.
BibTeX
@article{7010719,
  abstract     = {As social media constitute a valuable source for data analysis for a wide range of applications, the need for handling such data arises. However, the non-standard language used on social media poses problems for Natural Language Processing (NLP) tools as these are typically trained on standard language material. We propose a text normalization approach to tackle this problem. More specifically, we investigate the usefulness of a multi-modular approach to account for the diversity of normalization issues encountered in user-generated content. We consider three different types of user-generated content written in Dutch (SNS, SMS and tweets) and provide a detailed analysis of the performance of the different modules and the overall system. We also apply an extrinsic evaluation by evaluating the performance of a part-of-speech tagger, lemmatizer and named-entity recognizer before and after normalization.},
  articleno    = {61},
  author       = {Schulz, Sarah and De Pauw, Guy and De Clercq, Orphée and Desmet, Bart and Hoste, Veronique and Daelemans, Walter and Macken, Lieve},
  issn         = {2157-6904},
  journal      = {ACM TRANSACTIONS ON INTELLIGENT SYSTEMS AND TECHNOLOGY},
  keywords     = {user-generated content,LANGUAGE,text normalization,machine translation,social media},
  language     = {eng},
  number       = {4},
  pages        = {24},
  title        = {Multi-modular text normalization of Dutch user-generated content},
  url          = {http://dx.doi.org/10.1145/2850422},
  volume       = {7},
  year         = {2016},
}
