Ghent University Academic Bibliography

Collecting a corpus of Dutch SMS

Maaske Treurniet, Orphée De Clercq UGent, Henk van den Heuvel and Nelleke Oostdijk (2012) LREC 2012 : eight international conference on language resources and evaluation. p.2268-2273
abstract
In this paper we present the first freely available corpus of Dutch text messages containing data originating from the Netherlands and Flanders. This corpus has been collected in the framework of the SoNaR project and constitutes a viable part of this 500-million-word corpus. About 53,000 text messages were collected on a large scale, based on voluntary donations. These messages will be distributed as such. In this paper we focus on the data collection processes involved and after studying the effect of media coverage we show that especially free publicity in newspapers and on social media networks results in more contributions. All SMS are provided with metadata information. Looking at the composition of the corpus, it becomes visible that a small number of people have contributed a large amount of data, in total 272 people have contributed to the corpus during three months. The number of women contributing to the corpus is larger than the number of men, but male contributors submitted larger amounts of data. This corpus will be of paramount importance for sociolinguistic research and normalisation studies.
Please use this url to cite or link to this publication: http://hdl.handle.net/1854/LU-2129174
alternative title
Collection of a corpus of Dutch SMS
year
2012
type
conference
publication status
published
keyword
text messaging, corpus collection, SoNaR
in
LREC 2012 : eight international conference on language resources and evaluation
editor
Nicoletta Calzolari, Khalid Choukri, Thierry Declerck, Mehmet Uğur Doğan, Bente Maegaard, Joseph Mariani, Jan Odijk and Stelios Piperidis
pages
2268 - 2273
publisher
European Language Resources Association (ELRA)
place of publication
Paris, France
conference name
8th International conference on Language Resources and Evaluation Conference (LREC 2012)
conference location
Istanbul, Turkey
conference start
2012-05-23
conference end
2012-05-25
Web of Science type
Proceedings Paper
Web of Science id
000323927702056
ISBN
9782951740877
language
English
UGent publication?
yes
classification
P1
copyright statement
I have transferred the copyright for this publication to the publisher
id
2129174
handle
http://hdl.handle.net/1854/LU-2129174
date created
2012-06-01 14:59:11
date last changed
2015-06-17 10:04:14
@inproceedings{2129174,
  abstract     = {In this paper we present the first freely available corpus of Dutch text messages containing data originating from the Netherlands and Flanders. This corpus has been collected in the framework of the SoNaR project and constitutes a viable part of this 500-million-word corpus. About 53,000 text messages were collected on a large scale, based on voluntary donations. These messages will be distributed as such. In this paper we focus on the data collection processes involved and after studying the effect of media coverage we show that especially free publicity in newspapers and on social media networks results in more contributions. All SMS are provided with metadata information. Looking at the composition of the corpus, it becomes visible that a small number of people have contributed a large amount of data, in total 272 people have contributed to the corpus during three months. The number of women contributing to the corpus is larger than the number of men, but male contributors submitted larger amounts of data. This corpus will be of paramount importance for sociolinguistic research and normalisation studies.},
  author       = {Treurniet, Maaske and De Clercq, Orph{\'e}e and van den Heuvel, Henk and Oostdijk, Nelleke},
  booktitle    = {LREC 2012 : eight international conference on language resources and evaluation},
  editor       = {Calzolari, Nicoletta and Choukri, Khalid and Declerck, Thierry and U\u{g}ur Do\u{g}an, Mehmet and Maegaard, Bente and Mariani, Joseph and Odijk, Jan and Piperidis, Stelios},
  isbn         = {9782951740877},
  keyword      = {text messaging,corpus collection,SoNaR},
  language     = {eng},
  location     = {Istanbul, Turkey},
  pages        = {2268--2273},
  publisher    = {European Language Resources Association (ELRA)},
  title        = {Collecting a corpus of Dutch SMS},
  year         = {2012},
}

Chicago
Treurniet, Maaske, Orphée De Clercq, Henk van den Heuvel, and Nelleke Oostdijk. 2012. “Collecting a Corpus of Dutch SMS.” In LREC 2012 : Eight International Conference on Language Resources and Evaluation, ed. Nicoletta Calzolari, Khalid Choukri, Thierry Declerck, Mehmet Uğur Doğan, Bente Maegaard, Joseph Mariani, Jan Odijk, and Stelios Piperidis, 2268–2273. Paris, France: European Language Resources Association (ELRA).
APA
Treurniet, M., De Clercq, O., van den Heuvel, H., & Oostdijk, N. (2012). Collecting a corpus of Dutch SMS. In N. Calzolari, K. Choukri, T. Declerck, M. Uğur Doğan, B. Maegaard, J. Mariani, J. Odijk, et al. (Eds.), LREC 2012 : eight international conference on language resources and evaluation (pp. 2268–2273). Presented at the 8th International conference on Language Resources and Evaluation Conference (LREC 2012), Paris, France: European Language Resources Association (ELRA).
Vancouver
1. Treurniet M, De Clercq O, van den Heuvel H, Oostdijk N. Collecting a corpus of Dutch SMS. In: Calzolari N, Choukri K, Declerck T, Uğur Doğan M, Maegaard B, Mariani J, et al., editors. LREC 2012 : eight international conference on language resources and evaluation. Paris, France: European Language Resources Association (ELRA); 2012. p. 2268–73.
MLA
Treurniet, Maaske, Orphée De Clercq, Henk van den Heuvel, et al. “Collecting a Corpus of Dutch SMS.” LREC 2012 : Eight International Conference on Language Resources and Evaluation. Ed. Nicoletta Calzolari et al. Paris, France: European Language Resources Association (ELRA), 2012. 2268–2273. Print.