Finding the online cry for help : automatic text classification for suicide prevention

Bart Desmet (UGent)
(2014)
Author
Promoter
(UGent) and (UGent)
Organization
Project
SubTLe
Project
HPC-UGent: the central High Performance Computing infrastructure of Ghent University
Project
LT3
Abstract
Successful prevention of suicide, a serious public health concern worldwide, hinges on the adequate detection of suicide risk. While online platforms are increasingly used for expressing suicidal thoughts, manually monitoring for such signals of distress is practically infeasible, given the information overload suicide prevention workers are confronted with. In this thesis, the automatic detection of suicide-related messages is studied. It presents the first classification-based approach to online suicidality detection, and focuses on Dutch user-generated content.
In order to evaluate the viability of such a machine learning approach, we developed a gold standard corpus, consisting of message board and blog posts. These were manually labeled according to a newly developed annotation scheme, grounded in suicide prevention practice. The scheme provides for the annotation of a post's relevance to suicide, and the subject and severity of a suicide threat, if any. This allowed us to derive two tasks: the detection of suicide-related posts, and of severe, high-risk content. In a series of experiments, we sought to determine how well these tasks can be carried out automatically, and which information sources and techniques contribute to classification performance.
The experimental results show that both types of messages can be detected with high precision. Therefore, the amount of noise generated by the system is minimal, even on very large datasets, making it usable in a real-world prevention setting. Recall is high for the relevance task, but at around 60%, it is considerably lower for severity. This is mainly attributable to implicit references to suicide, which often go undetected.
We found a variety of information sources to be informative for both tasks, including token and character n-gram bags-of-words, features based on LSA topic models, polarity lexicons and named entity recognition, and suicide-related terms extracted from a background corpus.
To improve classification performance, the models were optimized using feature selection, hyperparameter optimization, or a combination of both. A distributed genetic algorithm approach proved successful in finding good solutions for this complex search problem, and resulted in more robust models. Experiments with cascaded classification of the severity task did not reveal performance benefits over direct classification (in terms of F1-score), but its structure allows the use of slower, memory-based learning algorithms that considerably improved recall.
At the end of this thesis, we address a problem typical of user-generated content: noise in the form of misspellings, phonetic transcriptions and other deviations from the linguistic norm. We developed an automatic text normalization system, using a cascaded statistical machine translation approach, and applied it to normalize the data for the suicidality detection tasks. Subsequent experiments revealed that, compared to the original data, normalized data resulted in fewer and more informative features, and improved classification performance. This extrinsic evaluation demonstrates the utility of automatic normalization for suicidality detection, and more generally, text classification on user-generated content.
Keywords
text classification, optimization, suicide prevention, normalization

Downloads

  • Desmet PhD print.pdf
    • full text | open access | PDF | 3.88 MB

Citation

Please use this url to cite or link to this publication:

MLA
Desmet, Bart. “Finding the Online Cry for Help : Automatic Text Classification for Suicide Prevention.” 2014 : n. pag. Print.
APA
Desmet, Bart. (2014). Finding the online cry for help : automatic text classification for suicide prevention. Ghent University. Faculty of Arts and Philosophy, Ghent, Belgium.
Chicago author-date
Desmet, Bart. 2014. “Finding the Online Cry for Help : Automatic Text Classification for Suicide Prevention”. Ghent, Belgium: Ghent University. Faculty of Arts and Philosophy.
Chicago author-date (all authors)
Desmet, Bart. 2014. “Finding the Online Cry for Help : Automatic Text Classification for Suicide Prevention”. Ghent, Belgium: Ghent University. Faculty of Arts and Philosophy.
Vancouver
1.
Desmet B. Finding the online cry for help : automatic text classification for suicide prevention. [Ghent, Belgium]: Ghent University. Faculty of Arts and Philosophy; 2014.
IEEE
[1]
B. Desmet, “Finding the online cry for help : automatic text classification for suicide prevention,” Ghent University. Faculty of Arts and Philosophy, Ghent, Belgium, 2014.
BibTeX
@phdthesis{5768292,
  abstract     = {Successful prevention of suicide, a serious public health concern worldwide, hinges on the adequate detection of suicide risk. While online platforms are increasingly used for expressing suicidal thoughts, manually monitoring for such signals of distress is practically infeasible, given the information overload suicide prevention workers are confronted with. In this thesis, the automatic detection of suicide-related messages is studied. It presents the first classification-based approach to online suicidality detection, and focuses on Dutch user-generated content.
In order to evaluate the viability of such a machine learning approach, we developed a gold standard corpus, consisting of message board and blog posts. These were manually labeled according to a newly developed annotation scheme, grounded in suicide prevention practice. The scheme provides for the annotation of a post's relevance to suicide, and the subject and severity of a suicide threat, if any. This allowed us to derive two tasks: the detection of suicide-related posts, and of severe, high-risk content. In a series of experiments, we sought to determine how well these tasks can be carried out automatically, and which information sources and techniques contribute to classification performance.
The experimental results show that both types of messages can be detected with high precision. Therefore, the amount of noise generated by the system is minimal, even on very large datasets, making it usable in a real-world prevention setting. Recall is high for the relevance task, but at around 60%, it is considerably lower for severity. This is mainly attributable to implicit references to suicide, which often go undetected.
We found a variety of information sources to be informative for both tasks, including token and character n-gram bags-of-words, features based on LSA topic models, polarity lexicons and named entity recognition, and suicide-related terms extracted from a background corpus.
To improve classification performance, the models were optimized using feature selection, hyperparameter optimization, or a combination of both. A distributed genetic algorithm approach proved successful in finding good solutions for this complex search problem, and resulted in more robust models. Experiments with cascaded classification of the severity task did not reveal performance benefits over direct classification (in terms of F1-score), but its structure allows the use of slower, memory-based learning algorithms that considerably improved recall.
At the end of this thesis, we address a problem typical of user-generated content: noise in the form of misspellings, phonetic transcriptions and other deviations from the linguistic norm. We developed an automatic text normalization system, using a cascaded statistical machine translation approach, and applied it to normalize the data for the suicidality detection tasks. Subsequent experiments revealed that, compared to the original data, normalized data resulted in fewer and more informative features, and improved classification performance. This extrinsic evaluation demonstrates the utility of automatic normalization for suicidality detection, and more generally, text classification on user-generated content.},
  author       = {Desmet, Bart},
  keywords     = {text classification,optimization,suicide prevention,normalization},
  language     = {eng},
  pages        = {XII, 205},
  publisher    = {Ghent University. Faculty of Arts and Philosophy},
  school       = {Ghent University},
  title        = {Finding the online cry for help : automatic text classification for suicide prevention},
  year         = {2014},
}