LT3: Sentiment Classification in User-Generated Content Using a Rich Feature Set



Introduction
Over the past few years, Web 2.0 applications such as microblogging services, social networking sites, and short messaging services have considerably increased the amount of user-generated content (UGC) produced online. Millions of people rely on these services to send messages, share their views or gather information about others. At the same time, companies, marketeers and politicians are eager to detect sentiment in UGC, since these messages might contain valuable information about public opinion. This explains why sentiment analysis has been a research area of great interest in the last few years (Pang and Lee, 2008; Mohammad and Yang, 2011). Although the first studies focussed mainly on product or movie reviews, analyzing sentiment in UGC is becoming increasingly popular. The main difference between these two sources of information is that reviews tend to be rather long and written in fairly formal language, whereas UGC is generally very brief and noisy and thus poses different challenges (Maynard et al., 2012).
In this paper, we describe our contribution to the SemEval-2014 Task 9: Sentiment Analysis in Twitter (Rosenthal et al., 2014), which was a rerun of SemEval-2013 Task 2 (Nakov et al., 2013) and consisted of two subtasks:
• Subtask A - Contextual Polarity Disambiguation: Given a message containing a marked instance of a word or phrase, determine whether that instance is positive, negative or neutral in that context.
• Subtask B - Message Polarity Classification: Given a message, classify whether the message is of positive, negative, or neutral sentiment. For messages conveying both a positive and negative sentiment, whichever is the stronger sentiment should be chosen.
This work is licensed under a Creative Commons Attribution 4.0 International Licence. Page numbers and proceedings footer are added by the organisers. Licence details: http://creativecommons.org/licenses/by/4.0/
The datasets for training, development and testing were provided by the task organizers. The training datasets consisted of Twitter messages on a variety of topics. The test sets contained regular tweets (Twitter2013, Twitter2014), tweets labeled as sarcastic (TwitterSarcasm), SMS messages (SMS2013), and blog posts (LiveJournal2014). For both subtasks, the possible polarity labels were positive, negative, neutral, and objective. The datasets for subtask B contained an additional label, i.e. objective-OR-neutral. Table 1 presents an overview of all provided datasets. For each task and test dataset, two runs could be submitted: a constrained run using the provided training data only, and an unconstrained one using additional training data. For both tasks, we created a constrained model based on supervised learning, relying on additional lexicons and using the test datasets of SemEval-2013 as development data. Evaluation was based on averaged F-measure, considering averaged F-positive and F-negative.
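The evaluation measure can be illustrated with a minimal sketch. It assumes that the averaged F-measure is the unweighted mean of the F1 scores of the positive and negative classes, so that the neutral class only influences the score through misclassifications. The function names are ours and are not part of any official scorer.

```python
def f1_for_label(gold, pred, label):
    """F1 score of a single class, computed from parallel label lists."""
    tp = sum(1 for g, p in zip(gold, pred) if g == label and p == label)
    fp = sum(1 for g, p in zip(gold, pred) if g != label and p == label)
    fn = sum(1 for g, p in zip(gold, pred) if g == label and p != label)
    if tp == 0:
        return 0.0  # no true positives: precision or recall is zero
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def averaged_f(gold, pred):
    """Mean of F-positive and F-negative (assumed definition, see lead-in)."""
    f_pos = f1_for_label(gold, pred, "positive")
    f_neg = f1_for_label(gold, pred, "negative")
    return (f_pos + f_neg) / 2
```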

System Description
Our main goal was to develop, for each polarity classification task, a classifier to label a message or an instance of that message as either positive, negative, or neutral. We ran several experiments to identify the most discriminative classifier features. This section gives an overview of the pipeline we developed and which features were implemented.

Linguistic Preprocessing
First, we manually cleaned the datasets to replace non-UTF-8 characters, and we tokenized all messages using the Carnegie Mellon University (CMU) Twitter Part-of-Speech Tagger (Gimpel et al., 2011). Subsequently, we Part-of-Speech tagged all instances with the same tagger and performed dependency parsing using a caseless parsing model of the Stanford parser (de Marneffe et al., 2006). We also tagged all named entities using the Twitter NLP tools (Ritter et al., 2011) for Named Entity Recognition. As a final preprocessing step, we combined the labels neutral, objective and objective-OR-neutral, thus recasting the task as a three-way classification task.

Feature Extraction
We implemented a number of lexical and syntactic features that represent every phrase (subtask A) or message (subtask B) within a feature vector:

N-gram features
• Word token n-gram features: a binary value for every token unigram, bigram, and trigram found in the training data.
• Character n-gram features: a binary value for every character trigram and fourgram (within word tokens) found in the training data.
• Normalized n-gram features: n-grams that consisted of URLs and mentions or @replies were replaced by http://someurl and by @someuser, respectively. We also normalized commonly used abbreviations to their full written form (e.g. h8 → hate).
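The three n-gram feature types above can be sketched as follows. This is an illustration under assumptions, not the system's actual code: the regular expressions, the two-entry abbreviation table and all function names are hypothetical, and the input is assumed to be already tokenized.

```python
import re

URL_RE = re.compile(r"https?://\S+")
MENTION_RE = re.compile(r"@\w+")
# Hypothetical two-entry sample; the real abbreviation list is not given here.
ABBREVIATIONS = {"h8": "hate", "gr8": "great"}


def normalize(tokens):
    """Replace URLs and mentions by placeholders and expand abbreviations."""
    out = []
    for tok in tokens:
        if URL_RE.fullmatch(tok):
            out.append("http://someurl")
        elif MENTION_RE.fullmatch(tok):
            out.append("@someuser")
        else:
            out.append(ABBREVIATIONS.get(tok.lower(), tok))
    return out


def token_ngrams(tokens, n_values=(1, 2, 3)):
    """Set of token unigrams, bigrams and trigrams in one instance."""
    return {" ".join(tokens[i:i + n])
            for n in n_values
            for i in range(len(tokens) - n + 1)}


def char_ngrams(tokens, n_values=(3, 4)):
    """Set of character trigrams and fourgrams, never crossing token borders."""
    return {tok[i:i + n]
            for tok in tokens
            for n in n_values
            for i in range(len(tok) - n + 1)}


def binary_vector(instance_ngrams, vocabulary):
    """One binary value per n-gram in the (training data) vocabulary."""
    return [1 if gram in instance_ngrams else 0 for gram in vocabulary]
```

The vocabulary passed to `binary_vector` would be an ordered list of all n-grams collected from the training data, so that every instance is mapped to a vector of the same length.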

Word shape features
• Character flooding: the number of word tokens with a character repeated more than two times (e.g. sooooooo join).
• Punctuation of the last token: a binary value indicating whether the last word token of a message contains a question/exclamation mark (e.g. Going to Helsinki tomorrow or on the day after tomorrow, yay!).
• The number of capitalized words (e.g. SO EXCITED).
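A minimal sketch of the three word shape features follows. It assumes that flooding means the same character occurring three or more times in a row, and that "capitalized" means fully upper-cased tokens of more than one letter (as the example SO EXCITED suggests). Both readings, and the function name, are assumptions.

```python
import re

# Same character three or more times in a row ("more than two times").
FLOOD_RE = re.compile(r"(.)\1{2,}")


def word_shape_features(tokens):
    """Count flooded tokens and upper-cased tokens; check the last token."""
    flooded = sum(1 for tok in tokens if FLOOD_RE.search(tok))
    last = tokens[-1] if tokens else ""
    last_token_punct = int("?" in last or "!" in last)
    # Assumption: single letters such as "I" do not count as capitalized words.
    capitalized = sum(1 for tok in tokens
                      if len(tok) > 1 and tok.isalpha() and tok.isupper())
    return {
        "flooding": flooded,
        "last_token_punct": last_token_punct,
        "capitalized": capitalized,
    }
```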
Lexicon features: As sentiment lexicons we consulted existing resources: AFINN (Nielsen, 2011), General Inquirer (Stone et al., 1966), MPQA, NRC Emotion (Mohammad and Turney, 2010; Mohammad and Yang, 2011), Bing Liu (Hu and Liu, 2004), and Bounce (Kökciyan et al., 2013); the latter three are Twitter-specific. Additionally, we created a list of emoticons extracted from the SemEval-2014 training data. Based on these resources, the following features were extracted:
• The number of positive, negative, and neutral lexicon words averaged over text length
• The overall polarity, which is the sum of the values of identified sentiment words
These features were extracted by 1) looking at all tokens in the instance, and 2) looking at hashtag tokens only (e.g. win from #win). We also considered negation cues by flipping the polarity sign of a sentiment word if it occurred in a negation relation (e.g. @ 2Shades maybe 3rd team bro, he's not better than trey Burke from Michigan). Negation relations were identified using the output of the dependency parser. In the example above, the positive polarity of the sentiment word better is flipped into negative since it occurs in a relation with not.
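The lexicon features and the negation flip can be sketched as below. The sketch assumes a lexicon that maps lower-cased words to signed numeric values (AFINN-style) and assumes that the indices of negated tokens have already been derived from the dependency parse. The lexicon entries and all names are hypothetical.

```python
def lexicon_features(tokens, lexicon, negated_indices=frozenset()):
    """Length-averaged counts of positive/negative/neutral lexicon words
    plus the overall polarity (sum of the, possibly flipped, word values)."""
    pos = neg = neu = 0
    overall = 0.0
    for i, tok in enumerate(tokens):
        score = lexicon.get(tok.lower())
        if score is None:
            continue  # token is not in the lexicon
        if i in negated_indices:
            score = -score  # flip polarity inside a negation relation
        if score > 0:
            pos += 1
        elif score < 0:
            neg += 1
        else:
            neu += 1
        overall += score
    n = len(tokens) or 1  # avoid division by zero for empty input
    return {
        "positive": pos / n,
        "negative": neg / n,
        "neutral": neu / n,
        "overall": overall,
    }


def hashtag_tokens(tokens):
    """Hashtag words without the leading '#', e.g. 'win' from '#win'."""
    return [tok[1:] for tok in tokens if tok.startswith("#") and len(tok) > 1]
```

Calling `lexicon_features(hashtag_tokens(tokens), lexicon)` would give the hashtag-only variant of the same features.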
• Dependency relations: four binary values for every dependency relation found in the training data. The first value indicates the presence of the lexicalized dependency relation in the test data. Additionally, as proposed by Joshi and Penstein-Rosé (2009), the dependency relation features are generalized in three ways: by backing off the head word to its PoS-tag, by backing off the modifier word to its PoS-tag, and by backing off both the head and modifier word.
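The four variants of a dependency relation feature can be written out as follows. The string format relation(head,modifier) and the function name are assumptions made for illustration only.

```python
def dependency_features(relation, head, head_pos, modifier, modifier_pos):
    """Return the lexicalized relation and its three back-off variants."""
    return [
        f"{relation}({head},{modifier})",          # fully lexicalized
        f"{relation}({head_pos},{modifier})",      # head backed off to PoS-tag
        f"{relation}({head},{modifier_pos})",      # modifier backed off to PoS-tag
        f"{relation}({head_pos},{modifier_pos})",  # both backed off
    ]
```

Each returned string would correspond to one binary feature, set to 1 when that string occurs in the instance.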
Named entity features: This feature group consists of four features: binary (tweet contains NEs or not), absolute (number of NEs), absolute tokens (number of tokens that are part of an NE), and frequency tokens (frequency of NE tokens).
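The four named entity features can be sketched as below, assuming that the entity recognizer returns token spans and that "frequency tokens" is the share of tokens belonging to a named entity. That reading, the span format and the function name are assumptions.

```python
def named_entity_features(tokens, entity_spans):
    """entity_spans: list of (start, end) token indices, end exclusive."""
    ne_tokens = sum(end - start for start, end in entity_spans)
    return {
        "binary": int(bool(entity_spans)),        # contains NEs or not
        "absolute": len(entity_spans),            # number of NEs
        "absolute_tokens": ne_tokens,             # tokens that are part of an NE
        "frequency_tokens": ne_tokens / len(tokens) if tokens else 0.0,
    }
```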
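PMI features: each word w receives a sentiment score based on its pointwise mutual information (PMI) with the positive and with the negative class:

$$\mathrm{PMI}(w) = \mathrm{PMI}(w, \mathit{positive}) - \mathrm{PMI}(w, \mathit{negative})$$

As the equation shows, the association score of a word with negative sentiment is subtracted from the word's association score with positive sentiment.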
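The score described here can be illustrated with a small sketch. It assumes that the association score is pointwise mutual information estimated from co-occurrence counts in a corpus of positive and negative messages. The counts, the log base and the function names are our assumptions, and zero counts would need smoothing in practice.

```python
import math


def pmi(joint_count, word_count, class_count, total):
    """PMI(w, c) = log2( p(w, c) / (p(w) * p(c)) ), estimated from counts."""
    return math.log2((joint_count * total) / (word_count * class_count))


def sentiment_score(word_pos, word_neg, n_pos, n_neg):
    """Association with positive minus association with negative sentiment.

    word_pos / word_neg: occurrences of the word in positive / negative text
    n_pos / n_neg: total number of tokens in positive / negative text
    Assumes all counts are non-zero.
    """
    total = n_pos + n_neg
    word_total = word_pos + word_neg
    return (pmi(word_pos, word_total, n_pos, total)
            - pmi(word_neg, word_total, n_neg, total))
```

A positive score indicates a stronger association with positive sentiment, a negative score a stronger association with negative sentiment.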

Optimizing the Classification Results
The core of our approach consisted of evaluating the aforementioned features and selecting those feature groups that contributed most to the classification results. To this end, we trained an SVM classifier using the LIBSVM package (Chang and Lin, 2001) and created models for various feature combinations. After cross-validation on the training data, a linear kernel and a cost value of 1 were chosen as parameter settings for all further experiments. Our experimental setup consisted of three steps: 1) training an SVM on the original training data provided by the task organizers (no development data was used), 2) generating a model, and 3) applying and evaluating the model on the development data (Twitter and SMS test data of SemEval-2013). We started our experiments with sentiment lexicon and n-gram features only, and gradually added other feature groups to identify the most contributive features. Tables 2 and 3 report the F-scores obtained at each step. The results were already relatively high for the combined lexicon and n-gram features (on average 0.8559 for subtask A and 0.6241 for subtask B), which we therefore consider our baseline setup. Considering the results for both subtasks and data genres, we conclude that n-grams, sentiment lexicons, and PoS-tags were the most contributive feature groups, whereas named entity and dependency features did not improve the overall classification performance. However, using all feature groups (n-grams, lexicons, normalized n-grams, Part-of-Speech features, negation features, word shape features, named entity features, dependency features, and PMI features) improved the classification results (reaching an averaged F = 0.8632 for subtask A and F = 0.6525 for subtask B) compared to classification based on lexicon features only (averaged F = 0.6629 for subtask A and F = 0.5231 for subtask B) or n-gram features only (averaged F = 0.8356 for subtask A and F = 0.5762 for subtask B).
Based on these results, we conclude that using the full feature set for the classification of unseen data appears to be a promising approach, considering that it achieves good performance and that it would not tune the training model to a particular data genre.
For further optimization of the classification results, we performed feature selection within the feature groups using a genetic algorithm approach, which can explore different areas of the search space in parallel. To do so, we made use of the Gallop (Genetic Algorithms for Linguistic Learner Optimization) Python package (Desmet et al., 2013). This enabled us to select the most contributive features from every feature group: n-gram features at token and character level, lexicon features from General Inquirer, Liu, AFINN, and Bounce, character flooding and token capitalization features, Part-of-Speech features (binary, ternary, and absolute), named entity features (binary, absolute tokens, and frequency tokens), and PMI features based on the NRC lexicon. None of the dependency relation features were selected.
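The general idea of genetic feature selection can be sketched as follows. This is a generic, simplified genetic algorithm written for illustration; it is not the Gallop API, and the operators (truncation selection, one-point crossover, bit-flip mutation, elitism), the parameter values and the toy fitness function are all assumptions. In a real setting, the fitness of a feature mask would be the F-score of a classifier trained with only the selected features.

```python
import random


def genetic_feature_selection(n_features, fitness, population_size=20,
                              generations=30, mutation_rate=0.05, seed=0):
    """Search for a boolean feature mask (n_features >= 2) maximizing fitness."""
    rng = random.Random(seed)  # seeded for reproducible runs
    population = [[rng.random() < 0.5 for _ in range(n_features)]
                  for _ in range(population_size)]
    best = max(population, key=fitness)
    for _ in range(generations):
        ranked = sorted(population, key=fitness, reverse=True)
        parents = ranked[:population_size // 2]  # truncation selection
        children = [ranked[0][:]]                # elitism: keep the best mask
        while len(children) < population_size:
            mother, father = rng.sample(parents, 2)
            cut = rng.randrange(1, n_features)   # one-point crossover
            child = mother[:cut] + father[cut:]
            # Bit-flip mutation: toggle each feature with a small probability.
            child = [(not bit) if rng.random() < mutation_rate else bit
                     for bit in child]
            children.append(child)
        population = children
        best = max(population + [best], key=fitness)
    return best


# Toy stand-in for "train a classifier and return its F-score":
# features 0, 2 and 5 help, every other selected feature hurts.
USEFUL = {0, 2, 5}


def toy_fitness(mask):
    return sum(1 if i in USEFUL else -1 for i, on in enumerate(mask) if on)


selected = genetic_feature_selection(8, toy_fitness)
```

Because every individual in the population is evaluated independently, different regions of the search space are explored at the same time, which is the property the paragraph above refers to.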

Results
We submitted sentiment labels for the Contextual Polarity Disambiguation (subtask A) and for the Message Polarity Classification (subtask B). Our competition results are reported in Table 4. Rankings for each dataset are added between brackets. The results reveal that our systems achieved good performance in the polarity classification of unseen data across the various genres and tasks. Overall, we achieved our best classification performance on the Twitter2013 test set, obtaining an F-score of 86.28, while the best performance for this data genre is an F-score of 90.14. We saw a drop in performance on the Twitter2014 Sarcasm test set. This is consistent with most other teams as sarcastic language is hard to handle in sentiment analysis. Considering the rankings, we conclude that we performed particularly well on the SMS test dataset of SemEval-2013 for both subtasks, ranking seventh for this genre. Our systems ranked ninth among 27 submissions and sixteenth among 50 submissions for subtasks A and B respectively.

Conclusions and Future Work
Using a rich feature set proves to be beneficial for automatic sentiment analysis on user-generated content. Feature selection experiments revealed that features based on n-grams, sentiment lexicons, and PoS-tags were most contributive for both classification tasks, while dependency features did not contribute to overall classification performance. As future work it will be interesting to study the impact of normalization of the data on the classification performance.
Based on a shallow error analysis, we believe that including additional classification features may also be promising: modifiers other than negation cues (diminishers, increasers, modal verbs, etc.) that affect the polarity intensity, emoticon flooding, and prefixes and suffixes that indicate emotion (un-, dis-, -less, etc.). Additionally, lemmatization and hashtag segmentation of the training data could improve classification results.