Abstract
This paper presents an impartial and extensive benchmark for text classification involving five different text classification tasks, 20 datasets, 11 different model architectures, and 42,800 algorithm runs. The five text classification tasks are fake news classification, topic detection, emotion detection, polarity detection, and sarcasm detection. While in practice, especially in Natural Language Processing (NLP), research tends to focus on the most sophisticated models, we hypothesize that this is not always necessary. Therefore, our main objective is to investigate whether the largest state-of-the-art (SOTA) models are always preferred, or in what cases simple methods can compete with complex models, i.e., for which dataset specifications and classification tasks. We assess the performance of different methods with varying complexity, ranging from simple statistical and machine learning methods to pretrained transformers like the robustly optimized BERT (Bidirectional Encoder Representations from Transformers) pretraining approach (RoBERTa). This comprehensive benchmark is lacking in existing literature, with research mainly comparing similar types of methods. Furthermore, with increasing awareness of the ecological impacts of extensive computational resource usage, this comparison is both critical and timely. We find that overall, bidirectional long short-term memory (LSTM) networks are ranked as the best-performing method, albeit not statistically significantly better than logistic regression and RoBERTa. Overall, we cannot conclude that simple methods perform worse, although this depends mainly on the classification task. Concretely, we find that for fake news classification and topic detection, simple techniques are the best-ranked models and consequently, it is not necessary to train complicated neural network architectures for these classification tasks.
Moreover, we also find a negative correlation between F1 performance and complexity for the smallest datasets (with dataset size less than 10,000). Finally, the different models' results are analyzed in depth to explain the model decisions, which is an increasing requirement in the field of text classification.
Keywords
Benchmark, Text classification, RoBERTa, Bidirectional LSTM, Natural language processing, Machine learning, Convolutional neural network

Citation

MLA
Reusens, Manon, et al. “Evaluating Text Classification: A Benchmark Study.” Expert Systems with Applications, vol. 254, 2024, doi:10.1016/j.eswa.2024.124302.
APA
Reusens, M., Stevens, A., Tonglet, J., De Smedt, J., Verbeke, W., vanden Broucke, S., & Baesens, B. (2024). Evaluating text classification: a benchmark study. Expert Systems with Applications, 254. https://doi.org/10.1016/j.eswa.2024.124302
Chicago author-date
Reusens, Manon, Alexander Stevens, Jonathan Tonglet, Johannes De Smedt, Wouter Verbeke, Seppe vanden Broucke, and Bart Baesens. 2024. “Evaluating Text Classification: A Benchmark Study.” Expert Systems with Applications 254. https://doi.org/10.1016/j.eswa.2024.124302.
Chicago author-date (all authors)
Reusens, Manon, Alexander Stevens, Jonathan Tonglet, Johannes De Smedt, Wouter Verbeke, Seppe vanden Broucke, and Bart Baesens. 2024. “Evaluating Text Classification: A Benchmark Study.” Expert Systems with Applications 254. doi:10.1016/j.eswa.2024.124302.
Vancouver
1. Reusens M, Stevens A, Tonglet J, De Smedt J, Verbeke W, vanden Broucke S, et al. Evaluating text classification: a benchmark study. Expert Systems with Applications. 2024;254.
IEEE
[1] M. Reusens et al., “Evaluating text classification: a benchmark study,” Expert Systems with Applications, vol. 254, 2024.
@article{01J0N7D56YYACM74AX8CA81GHB,
  abstract     = {{This paper presents an impartial and extensive benchmark for text classification involving five different text classification tasks, 20 datasets, 11 different model architectures, and 42,800 algorithm runs. The five text classification tasks are fake news classification, topic detection, emotion detection, polarity detection, and sarcasm detection. While in practice, especially in Natural Language Processing (NLP), research tends to focus on the most sophisticated models, we hypothesize that this is not always necessary. Therefore, our main objective is to investigate whether the largest state-of-the-art (SOTA) models are always preferred, or in what cases simple methods can compete with complex models, i.e., for which dataset specifications and classification tasks. We assess the performance of different methods with varying complexity, ranging from simple statistical and machine learning methods to pretrained transformers like the robustly optimized BERT (Bidirectional Encoder Representations from Transformers) pretraining approach (RoBERTa). This comprehensive benchmark is lacking in existing literature, with research mainly comparing similar types of methods. Furthermore, with increasing awareness of the ecological impacts of extensive computational resource usage, this comparison is both critical and timely. We find that overall, bidirectional long short-term memory (LSTM) networks are ranked as the best-performing method, albeit not statistically significantly better than logistic regression and RoBERTa. Overall, we cannot conclude that simple methods perform worse, although this depends mainly on the classification task. Concretely, we find that for fake news classification and topic detection, simple techniques are the best-ranked models and consequently, it is not necessary to train complicated neural network architectures for these classification tasks.
Moreover, we also find a negative correlation between F1 performance and complexity for the smallest datasets (with dataset size less than 10,000). Finally, the different models' results are analyzed in depth to explain the model decisions, which is an increasing requirement in the field of text classification.}},
  articleno    = {{124302}},
  author       = {{Reusens, Manon and Stevens, Alexander and Tonglet, Jonathan and De Smedt, Johannes and Verbeke, Wouter and vanden Broucke, Seppe and Baesens, Bart}},
  issn         = {{0957-4174}},
  journal      = {{Expert Systems with Applications}},
  keywords     = {{Benchmark,Text classification,RoBERTa,Bidirectional LSTM,Natural language processing,Machine learning,Convolutional neural network}},
  language     = {{eng}},
  pages        = {{25}},
  title        = {{Evaluating text classification: a benchmark study}},
  url          = {{http://doi.org/10.1016/j.eswa.2024.124302}},
  volume       = {{254}},
  year         = {{2024}},
}