
News topic classification as a first step towards diverse news recommendation

Orphée De Clercq (UGent) , Luna De Bruyne (UGent) and Veronique Hoste (UGent)
Abstract
When developing an algorithm that uses news diversity as a key driver for personalized news recommendation it is crucial to focus on means to cluster news articles in a fine-grained manner, ideally by leveraging the content of the text. In this paper we investigate semantic classification of news articles in an unfiltered news stream. We first present an analysis of the EventDNA corpus: a collection of Dutch-language news articles annotated with event data according to a predefined typology. We found that the types assigned as features of events do not allow for such a semantic classification and investigate the IPTC News Media Topics standard as an alternative. By mapping event types with manually-assigned IPTC topics, we observe that a more diversified picture emerges, which leads us to conclude that the IPTC classification is a useful proxy. Based on a historical data sample of Dutch news articles covering the year 2018, we then perform a series of machine learning experiments in order to automatically predict the top two levels of the IPTC taxonomy. Various multi-label classification models are built with BERTje using a bottom-up and top-down approach. The results reveal that the top-down approach yields the best results, with an overall macro F-1 score of 86.4% and a Jaccard accuracy of 89.2% for the level-one topics and one of 83.7% and 87.5% for the level-two predictions.
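The two metrics reported in the abstract can be illustrated with a minimal sketch. It assumes that the macro F-1 score is the unweighted mean of per-label F1 scores and that Jaccard accuracy is the per-article intersection-over-union of gold and predicted label sets, averaged over articles; the abstract does not state the exact averaging used, so both definitions are assumptions. The function names, the label names, and the toy data are hypothetical and are not taken from the paper.

```python
from typing import List, Set


def macro_f1(gold: List[Set[str]], pred: List[Set[str]], labels: List[str]) -> float:
    """Unweighted mean of per-label F1 scores (assumed definition)."""
    scores = []
    for label in labels:
        tp = sum(1 for g, p in zip(gold, pred) if label in g and label in p)
        fp = sum(1 for g, p in zip(gold, pred) if label not in g and label in p)
        fn = sum(1 for g, p in zip(gold, pred) if label in g and label not in p)
        denom = 2 * tp + fp + fn
        # A label that never occurs in gold or pred contributes 0.0 here.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


def jaccard_accuracy(gold: List[Set[str]], pred: List[Set[str]]) -> float:
    """Mean over articles of |intersection| / |union| (assumed definition)."""
    scores = []
    for g, p in zip(gold, pred):
        union = g | p
        # Two empty label sets are treated as a perfect match.
        scores.append(len(g & p) / len(union) if union else 1.0)
    return sum(scores) / len(scores)


# Toy multi-label data: one set of topic labels per article (made-up labels).
labels = ["economy", "politics", "sport"]
gold = [{"sport"}, {"politics", "economy"}, {"economy"}]
pred = [{"sport"}, {"politics"}, {"economy", "sport"}]

print(f"macro F1: {macro_f1(gold, pred, labels):.3f}")
print(f"Jaccard:  {jaccard_accuracy(gold, pred):.3f}")
```

Macro averaging gives every topic the same weight regardless of how often it occurs, whereas the Jaccard score is computed per article, so the two numbers can diverge when rare topics are predicted poorly.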
Keywords
LT3

Downloads

  • FullText.pdf
    • full text (Published version) | open access | PDF | 453.00 KB

Citation

Please use this url to cite or link to this publication:

MLA
De Clercq, Orphée, et al. “News Topic Classification as a First Step towards Diverse News Recommendation.” COMPUTATIONAL LINGUISTICS IN THE NETHERLANDS JOURNAL, vol. 10, 2020, pp. 37–55.
APA
De Clercq, O., De Bruyne, L., & Hoste, V. (2020). News topic classification as a first step towards diverse news recommendation. COMPUTATIONAL LINGUISTICS IN THE NETHERLANDS JOURNAL, 10, 37–55.
Chicago author-date
De Clercq, Orphée, Luna De Bruyne, and Veronique Hoste. 2020. “News Topic Classification as a First Step towards Diverse News Recommendation.” COMPUTATIONAL LINGUISTICS IN THE NETHERLANDS JOURNAL 10: 37–55.
Vancouver
1. De Clercq O, De Bruyne L, Hoste V. News topic classification as a first step towards diverse news recommendation. COMPUTATIONAL LINGUISTICS IN THE NETHERLANDS JOURNAL. 2020;10:37–55.
IEEE
[1] O. De Clercq, L. De Bruyne, and V. Hoste, “News topic classification as a first step towards diverse news recommendation,” COMPUTATIONAL LINGUISTICS IN THE NETHERLANDS JOURNAL, vol. 10, pp. 37–55, 2020.
BibTeX
@article{8684328,
  abstract     = {{When developing an algorithm that uses news diversity as a key driver for personalized news recommendation it is crucial to focus on means to cluster news articles in a fine-grained manner, ideally by leveraging the content of the text. In this paper we investigate semantic classification of news articles in an unfiltered news stream. We first present an analysis of the EventDNA corpus: a collection of Dutch-language news articles annotated with event data according to a
predefined typology. We found that the types assigned as features of events do not allow for such a semantic classification and investigate the IPTC News Media Topics standard as an alternative. By mapping event types with manually-assigned IPTC topics, we observe that a more diversified picture emerges, which leads us to conclude that the IPTC classification is a useful proxy. Based on a historical data sample of Dutch news articles covering the year 2018, we then perform a series of machine learning experiments in order to automatically predict the top two levels of the IPTC taxonomy. Various multi-label classification models are built with BERTje using a bottom-up and top-down approach. The results reveal that the top-down approach yields the best results, with an overall macro F-1 score of 86.4% and a Jaccard accuracy of 89.2% for the level-one topics and one of 83.7% and 87.5% for the level-two predictions.}},
  author       = {{De Clercq, Orphée and De Bruyne, Luna and Hoste, Veronique}},
  issn         = {{2211-4009}},
  journal      = {{COMPUTATIONAL LINGUISTICS IN THE NETHERLANDS JOURNAL}},
  keywords     = {{LT3}},
  language     = {{eng}},
  pages        = {{37--55}},
  title        = {{News topic classification as a first step towards diverse news recommendation}},
  url          = {{https://www.clinjournal.org/clinj/article/view/103/92}},
  volume       = {{10}},
  year         = {{2020}},
}