
Annotating affective dimensions in user-generated content: comparing the reliability of best-worst scaling, pairwise comparison and rating scales for annotating valence, arousal and dominance

Luna De Bruyne (UGent) , Orphée De Clercq (UGent) and Veronique Hoste (UGent)
(2021) LANGUAGE RESOURCES AND EVALUATION. 55(4). pp. 1017–1045
Abstract
In an era where user-generated content is becoming ever more prevalent, reliable methods are needed to judge the emotional properties of these complex texts, for example when developing corpora for machine learning. In this study, we focus on Dutch Twitter messages, a genre that is high in emotional content and frequently investigated in computational linguistics. We compare three methods for annotating the emotional dimensions valence, arousal and dominance in 300 tweets: rating scales, pairwise comparison and best–worst scaling. We evaluate the annotation methods on the criterion of inter-annotator agreement, based on the judgments of 18 annotators in total. On this dataset, best–worst scaling achieves the highest inter-annotator agreement. The difference in agreement is largest for dominance and smallest for valence, suggesting that the benefit of best–worst scaling becomes more pronounced as the annotation task becomes more difficult. However, best–worst scaling is also considerably more time-consuming than rating scale and pairwise comparison annotation. This leads us to conclude that, particularly when developing data for computational models, annotation quality must be weighed against its cost.
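
As background to the abstract: in best–worst scaling, annotators see small tuples of items (typically four) and mark only the most and least extreme item on the dimension in question; real-valued per-item scores are then derived with the standard counting procedure used in the BWS literature (e.g. by Kiritchenko and Mohammad): the proportion of tuples in which an item was chosen as best, minus the proportion in which it was chosen as worst. The Python sketch below illustrates only that generic counting step, with hypothetical function names and toy data; it is not the authors' annotation pipeline.

    from collections import Counter

    def bws_scores(judgments):
        """Turn best-worst scaling judgments into real-valued item scores.

        judgments: iterable of (items, best, worst), where items is the
        tuple shown to an annotator and best/worst are the chosen items.
        Returns {item: score} with score in [-1, 1], computed as
        (fraction of appearances chosen best) minus
        (fraction of appearances chosen worst).
        """
        best, worst, shown = Counter(), Counter(), Counter()
        for items, b, w in judgments:
            shown.update(items)   # count every appearance of every item
            best[b] += 1
            worst[w] += 1
        return {item: (best[item] - worst[item]) / shown[item] for item in shown}

    # Toy data: two annotators judge the same 4-tuple of tweets for valence.
    judgments = [
        (("t1", "t2", "t3", "t4"), "t1", "t4"),  # annotator 1: t1 best, t4 worst
        (("t1", "t2", "t3", "t4"), "t2", "t4"),  # annotator 2: t2 best, t4 worst
    ]
    print(bws_scores(judgments))
    # {'t1': 0.5, 't2': 0.5, 't3': 0.0, 't4': -1.0}

Agreement on such real-valued scores is typically quantified with correlation-based measures (e.g. split-half reliability) rather than the kappa-style statistics used for categorical labels.
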
Keywords
LT3, User-generated content, Emotion annotation, Best-worst scaling

Downloads

  • (...).pdf | full text (Published version) | UGent only | PDF | 1.42 MB

Citation

Please use this URL to cite or link to this publication:

MLA
De Bruyne, Luna, et al. “Annotating Affective Dimensions in User-Generated Content: Comparing the Reliability of Best-Worst Scaling, Pairwise Comparison and Rating Scales for Annotating Valence, Arousal and Dominance.” LANGUAGE RESOURCES AND EVALUATION, vol. 55, no. 4, 2021, pp. 1017–45, doi:10.1007/s10579-020-09524-2.
APA
De Bruyne, L., De Clercq, O., & Hoste, V. (2021). Annotating affective dimensions in user-generated content: Comparing the reliability of best-worst scaling, pairwise comparison and rating scales for annotating valence, arousal and dominance. LANGUAGE RESOURCES AND EVALUATION, 55(4), 1017–1045. https://doi.org/10.1007/s10579-020-09524-2
Chicago author-date
De Bruyne, Luna, Orphée De Clercq, and Veronique Hoste. 2021. “Annotating Affective Dimensions in User-Generated Content: Comparing the Reliability of Best-Worst Scaling, Pairwise Comparison and Rating Scales for Annotating Valence, Arousal and Dominance.” LANGUAGE RESOURCES AND EVALUATION 55 (4): 1017–45. https://doi.org/10.1007/s10579-020-09524-2.
Chicago author-date (all authors)
De Bruyne, Luna, Orphée De Clercq, and Veronique Hoste. 2021. “Annotating Affective Dimensions in User-Generated Content: Comparing the Reliability of Best-Worst Scaling, Pairwise Comparison and Rating Scales for Annotating Valence, Arousal and Dominance.” LANGUAGE RESOURCES AND EVALUATION 55 (4): 1017–1045. doi:10.1007/s10579-020-09524-2.
Vancouver
1. De Bruyne L, De Clercq O, Hoste V. Annotating affective dimensions in user-generated content: comparing the reliability of best-worst scaling, pairwise comparison and rating scales for annotating valence, arousal and dominance. LANGUAGE RESOURCES AND EVALUATION. 2021;55(4):1017–45.
IEEE
[1] L. De Bruyne, O. De Clercq, and V. Hoste, “Annotating affective dimensions in user-generated content: comparing the reliability of best-worst scaling, pairwise comparison and rating scales for annotating valence, arousal and dominance,” LANGUAGE RESOURCES AND EVALUATION, vol. 55, no. 4, pp. 1017–1045, 2021.
@article{8689714,
  abstract     = {{In an era where user-generated content is becoming ever more prevalent, reliable methods are needed to judge the emotional properties of these complex texts, for example when developing corpora for machine learning. In this study, we focus on Dutch Twitter messages, a genre that is high in emotional content and frequently investigated in computational linguistics. We compare three methods for annotating the emotional dimensions valence, arousal and dominance in 300 tweets: rating scales, pairwise comparison and best–worst scaling. We evaluate the annotation methods on the criterion of inter-annotator agreement, based on the judgments of 18 annotators in total. On this dataset, best–worst scaling achieves the highest inter-annotator agreement. The difference in agreement is largest for dominance and smallest for valence, suggesting that the benefit of best–worst scaling becomes more pronounced as the annotation task becomes more difficult. However, best–worst scaling is also considerably more time-consuming than rating scale and pairwise comparison annotation. This leads us to conclude that, particularly when developing data for computational models, annotation quality must be weighed against its cost.}},
  author       = {{De Bruyne, Luna and De Clercq, Orphée and Hoste, Veronique}},
  issn         = {{1574-020X}},
  journal      = {{LANGUAGE RESOURCES AND EVALUATION}},
  keywords     = {{LT3,User-generated content,Emotion annotation,Best-worst scaling}},
  language     = {{eng}},
  number       = {{4}},
  pages        = {{1017--1045}},
  title        = {{Annotating affective dimensions in user-generated content: comparing the reliability of best-worst scaling, pairwise comparison and rating scales for annotating valence, arousal and dominance}},
  url          = {{https://doi.org/10.1007/s10579-020-09524-2}},
  volume       = {{55}},
  year         = {{2021}},
}
