
News topic classification as a first step towards diverse news recommendation
- Author
- Orphée De Clercq (UGent), Luna De Bruyne (UGent) and Veronique Hoste (UGent)
- Abstract
- When developing an algorithm that uses news diversity as a key driver for personalized news recommendation it is crucial to focus on means to cluster news articles in a fine-grained manner, ideally by leveraging the content of the text. In this paper we investigate semantic classification of news articles in an unfiltered news stream. We first present an analysis of the EventDNA corpus: a collection of Dutch-language news articles annotated with event data according to a predefined typology. We found that the types assigned as features of events do not allow for such a semantic classification and investigate the IPTC News Media Topics standard as an alternative. By mapping event types with manually-assigned IPTC topics, we observe that a more diversified picture emerges, which leads us to conclude that the IPTC classification is a useful proxy. Based on a historical data sample of Dutch news articles covering the year 2018, we then perform a series of machine learning experiments in order to automatically predict the top two levels of the IPTC taxonomy. Various multi-label classification models are built with BERTje using a bottom-up and top-down approach. The results reveal that the top-down approach yields the best results, with an overall macro F-1 score of 86.4% and a Jaccard accuracy of 89.2% for the level-one topics and one of 83.7% and 87.5% for the level-two predictions.
- Keywords
- LT3
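
The two evaluation measures named in the abstract, macro F1 and Jaccard accuracy, can be illustrated on toy multi-label data. The sketch below is not the authors' evaluation code: it assumes Jaccard accuracy means the per-article Jaccard index averaged over all articles, and the label names and predictions are hypothetical, chosen only to show how the two scores can diverge.

```python
from typing import List, Sequence, Set


def jaccard_accuracy(y_true: Sequence[Set[str]], y_pred: Sequence[Set[str]]) -> float:
    """Mean over articles of |true AND pred| / |true OR pred| (assumed definition)."""
    scores = []
    for true_labels, pred_labels in zip(y_true, y_pred):
        union = true_labels | pred_labels
        # Two empty label sets are treated as a perfect match.
        scores.append(len(true_labels & pred_labels) / len(union) if union else 1.0)
    return sum(scores) / len(scores)


def macro_f1(y_true: Sequence[Set[str]], y_pred: Sequence[Set[str]], labels: List[str]) -> float:
    """Unweighted mean of the per-label F1 scores."""
    f1_scores = []
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if label in t and label in p)
        fp = sum(1 for t, p in pairs if label not in t and label in p)
        fn = sum(1 for t, p in pairs if label in t and label not in p)
        denom = 2 * tp + fp + fn
        # A label that never occurs and is never predicted scores 0 here.
        f1_scores.append(2 * tp / denom if denom else 0.0)
    return sum(f1_scores) / len(f1_scores)


# Hypothetical topic labels and predictions for four articles.
labels = ["politics", "sport", "economy"]
y_true = [{"politics"}, {"sport"}, {"politics", "economy"}, {"economy"}]
y_pred = [{"politics"}, {"sport", "economy"}, {"politics"}, {"economy"}]

print(f"Jaccard accuracy: {jaccard_accuracy(y_true, y_pred):.3f}")
print(f"Macro F1:         {macro_f1(y_true, y_pred, labels):.3f}")
```

Macro F1 weights every topic equally regardless of how often it occurs, whereas the Jaccard score is computed per article, so one spurious or missing label on an article with few labels lowers it noticeably.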
Downloads
- FullText.pdf: full text (Published version), open access, 453.00 KB
Citation
Please use this url to cite or link to this publication: http://hdl.handle.net/1854/LU-8684328
- MLA
- De Clercq, Orphée, et al. “News Topic Classification as a First Step towards Diverse News Recommendation.” COMPUTATIONAL LINGUISTICS IN THE NETHERLANDS JOURNAL, vol. 10, 2020, pp. 37–55.
- APA
- De Clercq, O., De Bruyne, L., & Hoste, V. (2020). News topic classification as a first step towards diverse news recommendation. COMPUTATIONAL LINGUISTICS IN THE NETHERLANDS JOURNAL, 10, 37–55.
- Chicago author-date
- De Clercq, Orphée, Luna De Bruyne, and Veronique Hoste. 2020. “News Topic Classification as a First Step towards Diverse News Recommendation.” COMPUTATIONAL LINGUISTICS IN THE NETHERLANDS JOURNAL 10: 37–55.
- Vancouver
- 1. De Clercq O, De Bruyne L, Hoste V. News topic classification as a first step towards diverse news recommendation. COMPUTATIONAL LINGUISTICS IN THE NETHERLANDS JOURNAL. 2020;10:37–55.
- IEEE
- [1] O. De Clercq, L. De Bruyne, and V. Hoste, “News topic classification as a first step towards diverse news recommendation,” COMPUTATIONAL LINGUISTICS IN THE NETHERLANDS JOURNAL, vol. 10, pp. 37–55, 2020.
@article{8684328,
  abstract     = {{When developing an algorithm that uses news diversity as a key driver for personalized news recommendation it is crucial to focus on means to cluster news articles in a fine-grained manner, ideally by leveraging the content of the text. In this paper we investigate semantic classification of news articles in an unfiltered news stream. We first present an analysis of the EventDNA corpus: a collection of Dutch-language news articles annotated with event data according to a predefined typology. We found that the types assigned as features of events do not allow for such a semantic classification and investigate the IPTC News Media Topics standard as an alternative. By mapping event types with manually-assigned IPTC topics, we observe that a more diversified picture emerges, which leads us to conclude that the IPTC classification is a useful proxy. Based on a historical data sample of Dutch news articles covering the year 2018, we then perform a series of machine learning experiments in order to automatically predict the top two levels of the IPTC taxonomy. Various multi-label classification models are built with BERTje using a bottom-up and top-down approach. The results reveal that the top-down approach yields the best results, with an overall macro F-1 score of 86.4% and a Jaccard accuracy of 89.2% for the level-one topics and one of 83.7% and 87.5% for the level-two predictions.}},
  author       = {{De Clercq, Orphée and De Bruyne, Luna and Hoste, Veronique}},
  issn         = {{2211-4009}},
  journal      = {{COMPUTATIONAL LINGUISTICS IN THE NETHERLANDS JOURNAL}},
  keywords     = {{LT3}},
  language     = {{eng}},
  pages        = {{37--55}},
  title        = {{News topic classification as a first step towards diverse news recommendation}},
  url          = {{https://www.clinjournal.org/clinj/article/view/103/92}},
  volume       = {{10}},
  year         = {{2020}},
}