Multi-modular text normalization of Dutch user-generated content
- Author
- Sarah Schulz (UGent), Guy De Pauw, Orphée De Clercq (UGent), Bart Desmet (UGent), Veronique Hoste (UGent), Walter Daelemans and Lieve Macken (UGent)
- Abstract
- As social media constitute a valuable source for data analysis for a wide range of applications, the need for handling such data arises. However, the non-standard language used on social media poses problems for Natural Language Processing (NLP) tools as these are typically trained on standard language material. We propose a text normalization approach to tackle this problem. More specifically, we investigate the usefulness of a multi-modular approach to account for the diversity of normalization issues encountered in user-generated content. We consider three different types of user-generated content written in Dutch (SNS, SMS and tweets) and provide a detailed analysis of the performance of the different modules and the overall system. We also apply an extrinsic evaluation by evaluating the performance of a part-of-speech tagger, lemmatizer and named-entity recognizer before and after normalization.
- Keywords
- user-generated content, LANGUAGE, text normalization, machine translation, social media
Citation
Please use this url to cite or link to this publication: http://hdl.handle.net/1854/LU-7010719
- MLA
- Schulz, Sarah, et al. “Multi-Modular Text Normalization of Dutch User-Generated Content.” ACM TRANSACTIONS ON INTELLIGENT SYSTEMS AND TECHNOLOGY, vol. 7, no. 4, 2016, doi:10.1145/2850422.
- APA
- Schulz, S., De Pauw, G., De Clercq, O., Desmet, B., Hoste, V., Daelemans, W., & Macken, L. (2016). Multi-modular text normalization of Dutch user-generated content. ACM TRANSACTIONS ON INTELLIGENT SYSTEMS AND TECHNOLOGY, 7(4). https://doi.org/10.1145/2850422
- Chicago author-date
- Schulz, Sarah, Guy De Pauw, Orphée De Clercq, Bart Desmet, Veronique Hoste, Walter Daelemans, and Lieve Macken. 2016. “Multi-Modular Text Normalization of Dutch User-Generated Content.” ACM TRANSACTIONS ON INTELLIGENT SYSTEMS AND TECHNOLOGY 7 (4). https://doi.org/10.1145/2850422.
- Chicago author-date (all authors)
- Schulz, Sarah, Guy De Pauw, Orphée De Clercq, Bart Desmet, Veronique Hoste, Walter Daelemans, and Lieve Macken. 2016. “Multi-Modular Text Normalization of Dutch User-Generated Content.” ACM TRANSACTIONS ON INTELLIGENT SYSTEMS AND TECHNOLOGY 7 (4). doi:10.1145/2850422.
- Vancouver
- 1. Schulz S, De Pauw G, De Clercq O, Desmet B, Hoste V, Daelemans W, et al. Multi-modular text normalization of Dutch user-generated content. ACM TRANSACTIONS ON INTELLIGENT SYSTEMS AND TECHNOLOGY. 2016;7(4).
- IEEE
- [1] S. Schulz et al., “Multi-modular text normalization of Dutch user-generated content,” ACM TRANSACTIONS ON INTELLIGENT SYSTEMS AND TECHNOLOGY, vol. 7, no. 4, 2016.
@article{7010719,
  abstract  = {{As social media constitute a valuable source for data analysis for a wide range of applications, the need for handling such data arises. However, the non-standard language used on social media poses problems for Natural Language Processing (NLP) tools as these are typically trained on standard language material. We propose a text normalization approach to tackle this problem. More specifically, we investigate the usefulness of a multi-modular approach to account for the diversity of normalization issues encountered in user-generated content. We consider three different types of user-generated content written in Dutch (SNS, SMS and tweets) and provide a detailed analysis of the performance of the different modules and the overall system. We also apply an extrinsic evaluation by evaluating the performance of a part-of-speech tagger, lemmatizer and named-entity recognizer before and after normalization.}},
  articleno = {{61}},
  author    = {{Schulz, Sarah and De Pauw, Guy and De Clercq, Orphée and Desmet, Bart and Hoste, Veronique and Daelemans, Walter and Macken, Lieve}},
  issn      = {{2157-6904}},
  journal   = {{ACM TRANSACTIONS ON INTELLIGENT SYSTEMS AND TECHNOLOGY}},
  keywords  = {{user-generated content,LANGUAGE,text normalization,machine translation,social media}},
  language  = {{eng}},
  number    = {{4}},
  pages     = {{24}},
  title     = {{Multi-modular text normalization of Dutch user-generated content}},
  url       = {{http://doi.org/10.1145/2850422}},
  volume    = {{7}},
  year      = {{2016}},
}