
Computer vision and machine learning approaches for metadata enrichment to improve searchability of historical newspaper collections

Dilawar Ali (UGent), Kenzo Milleville (UGent), Steven Verstockt (UGent), Nico Van de Weghe (UGent), Sally Chambers (UGent) and Julie M. Birkholz (UGent)
Abstract
Purpose: Historical newspaper collections provide a wealth of information about the past. Although the digitization of these collections significantly improves their accessibility, a large portion of digitized historical newspaper collections, such as those of KBR, the Royal Library of Belgium, are not yet searchable at article-level. However, recent developments in AI-based research methods, such as document layout analysis, have the potential for further enriching the metadata to improve the searchability of these historical newspaper collections. This paper aims to discuss the aforementioned issue.
Design/methodology/approach: In this paper, the authors explore how existing computer vision and machine learning approaches can be used to improve access to digitized historical newspapers. To do this, the authors propose a workflow, using computer vision and machine learning approaches to (1) provide article-level access to digitized historical newspaper collections using document layout analysis, (2) extract specific types of articles (e.g. feuilletons - literary supplements from Le Peuple from 1938), (3) conduct image similarity analysis using (un)supervised classification methods and (4) perform named entity recognition (NER) to link the extracted information to open data.
Findings: The results show that the proposed workflow improves the accessibility and searchability of digitized historical newspapers, and also contributes to the building of corpora for digital humanities research. The AI-based methods enable automatic extraction of feuilletons, clustering of similar images and dynamic linking of related articles.
Originality/value: The proposed workflow enables automatic extraction of articles, including detection of a specific type of article, such as a feuilleton or literary supplement. This is particularly valuable for humanities researchers as it improves the searchability of these collections and enables corpora to be built around specific themes. Article-level access to, and improved searchability of, KBR's digitized newspapers are demonstrated through the online tool (https://tw06v072.ugent.be/kbr/).
Keywords
Digital libraries and archives, Information extraction, Document layout analysis, Article segmentation, Named entity recognition, Digitized historical newspapers, Feuilleton extraction, DOCUMENT STRUCTURE, RECOGNITION, FEUILLETON, CHALLENGES, ALGORITHMS

Downloads

  • (...).pdf: full text (Published version) | UGent only | PDF | 11.76 MB
  • DS525 acc.pdf: full text (Accepted manuscript) | open access | PDF | 3.00 MB

Citation

MLA
Ali, Dilawar, et al. “Computer Vision and Machine Learning Approaches for Metadata Enrichment to Improve Searchability of Historical Newspaper Collections.” JOURNAL OF DOCUMENTATION, 2024, doi:10.1108/jd-01-2022-0029.
APA
Ali, D., Milleville, K., Verstockt, S., Van de Weghe, N., Chambers, S., & Birkholz, J. M. (2024). Computer vision and machine learning approaches for metadata enrichment to improve searchability of historical newspaper collections. JOURNAL OF DOCUMENTATION. https://doi.org/10.1108/jd-01-2022-0029
Chicago author-date
Ali, Dilawar, Kenzo Milleville, Steven Verstockt, Nico Van de Weghe, Sally Chambers, and Julie M. Birkholz. 2024. “Computer Vision and Machine Learning Approaches for Metadata Enrichment to Improve Searchability of Historical Newspaper Collections.” JOURNAL OF DOCUMENTATION. https://doi.org/10.1108/jd-01-2022-0029.
Chicago author-date (all authors)
Ali, Dilawar, Kenzo Milleville, Steven Verstockt, Nico Van de Weghe, Sally Chambers, and Julie M. Birkholz. 2024. “Computer Vision and Machine Learning Approaches for Metadata Enrichment to Improve Searchability of Historical Newspaper Collections.” JOURNAL OF DOCUMENTATION. doi:10.1108/jd-01-2022-0029.
Vancouver
1. Ali D, Milleville K, Verstockt S, Van de Weghe N, Chambers S, Birkholz JM. Computer vision and machine learning approaches for metadata enrichment to improve searchability of historical newspaper collections. JOURNAL OF DOCUMENTATION. 2024;
IEEE
[1] D. Ali, K. Milleville, S. Verstockt, N. Van de Weghe, S. Chambers, and J. M. Birkholz, “Computer vision and machine learning approaches for metadata enrichment to improve searchability of historical newspaper collections,” JOURNAL OF DOCUMENTATION, 2024.
@article{01GVDDDEC7RTRJRZ8FR6H0JPAD,
  abstract     = {{Purpose: Historical newspaper collections provide a wealth of information about the past. Although the digitization of these collections significantly improves their accessibility, a large portion of digitized historical newspaper collections, such as those of KBR, the Royal Library of Belgium, are not yet searchable at article-level. However, recent developments in AI-based research methods, such as document layout analysis, have the potential for further enriching the metadata to improve the searchability of these historical newspaper collections. This paper aims to discuss the aforementioned issue.
Design/methodology/approach: In this paper, the authors explore how existing computer vision and machine learning approaches can be used to improve access to digitized historical newspapers. To do this, the authors propose a workflow, using computer vision and machine learning approaches to (1) provide article-level access to digitized historical newspaper collections using document layout analysis, (2) extract specific types of articles (e.g. feuilletons - literary supplements from Le Peuple from 1938), (3) conduct image similarity analysis using (un)supervised classification methods and (4) perform named entity recognition (NER) to link the extracted information to open data.
Findings: The results show that the proposed workflow improves the accessibility and searchability of digitized historical newspapers, and also contributes to the building of corpora for digital humanities research. The AI-based methods enable automatic extraction of feuilletons, clustering of similar images and dynamic linking of related articles.
Originality/value: The proposed workflow enables automatic extraction of articles, including detection of a specific type of article, such as a feuilleton or literary supplement. This is particularly valuable for humanities researchers as it improves the searchability of these collections and enables corpora to be built around specific themes. Article-level access to, and improved searchability of, KBR's digitized newspapers are demonstrated through the online tool (https://tw06v072.ugent.be/kbr/).}},
  author       = {{Ali, Dilawar and Milleville, Kenzo and Verstockt, Steven and Van de Weghe, Nico and Chambers, Sally and Birkholz, Julie M.}},
  issn         = {{0022-0418}},
  journal      = {{JOURNAL OF DOCUMENTATION}},
  keywords     = {{Digital libraries and archives,Information extraction,Document layout analysis,Article segmentation,Named entity recognition,Digitized historical newspapers,Feuilleton extraction,DOCUMENT STRUCTURE,RECOGNITION,FEUILLETON,CHALLENGES,ALGORITHMS}},
  language     = {{eng}},
  title        = {{Computer vision and machine learning approaches for metadata enrichment to improve searchability of historical newspaper collections}},
  url          = {{https://doi.org/10.1108/jd-01-2022-0029}},
  year         = {{2024}},
}
