Evaluating the multilingual capabilities of the OCCAM workflow : a case of digitised historical newspapers

Abstract
Increased digitization of historical newspapers by cultural heritage institutions has allowed humanities scholars to expand their corpora in terms of volume and diversity. This has been accompanied by an increase in the use of computational tools. At the same time, it has given rise to a new set of questions around the accuracy and value of this data and of related studies. In this presentation we focus on the multilingual challenge of studying historical newspapers through the lens of Belgium: how can a humanities researcher who understands only one language, Dutch or French, accurately carry out a study of a national press without knowledge of the country's other most widely spoken language? This question is core to the workflows being developed in the digital humanities case of the OCCAM (OCR, ClassificAtion & Machine Translation) project. OCCAM implements a workflow that integrates image classification, Translation Memories (TMs), Optical Character Recognition (OCR), and Machine Translation (MT) to support the automated translation of scanned documents. We explain this through the case of the Belgian press, using a set of multilingual historical newspapers from BelgicaPress, the historical newspaper collection of KBR, the Royal Library of Belgium. Through examples from a set of Dutch- and French-language newspapers from the early 1900s, we explain how images of textual sources in multiple languages can be OCRed efficiently using the machine-learning-based PERO model. In a subsequent workflow step, the OCR results are fed through a machine translation module. This module was originally developed for the machine translation of contemporary documents; we will report on its results for our digitized historical newspaper test case, to support research on these historical documents.

Downloads

  • DHBenelux2021 paper 16.pdf
    • full text (Accepted manuscript) | open access | PDF | 214.78 KB

Citation

Please use this url to cite or link to this publication:

MLA
Birkholz, Julie M., et al. “Evaluating the Multilingual Capabilities of the OCCAM Workflow : A Case of Digitised Historical Newspapers.” DH Benelux 2021, Abstracts, 2021.
APA
Birkholz, J. M., Chambers, S., Hradis, M., & Smrz, P. (2021). Evaluating the multilingual capabilities of the OCCAM workflow : a case of digitised historical newspapers. In DH Benelux 2021, Abstracts. Leiden, The Netherlands.
Chicago author-date
Birkholz, Julie M., Sally Chambers, Michal Hradis, and Pavel Smrz. 2021. “Evaluating the Multilingual Capabilities of the OCCAM Workflow : A Case of Digitised Historical Newspapers.” In DH Benelux 2021, Abstracts.
Chicago author-date (all authors)
Birkholz, Julie M., Sally Chambers, Michal Hradis, and Pavel Smrz. 2021. “Evaluating the Multilingual Capabilities of the OCCAM Workflow : A Case of Digitised Historical Newspapers.” In DH Benelux 2021, Abstracts.
Vancouver
1. Birkholz JM, Chambers S, Hradis M, Smrz P. Evaluating the multilingual capabilities of the OCCAM workflow : a case of digitised historical newspapers. In: DH Benelux 2021, Abstracts. 2021.
IEEE
[1] J. M. Birkholz, S. Chambers, M. Hradis, and P. Smrz, “Evaluating the multilingual capabilities of the OCCAM workflow : a case of digitised historical newspapers,” in DH Benelux 2021, Abstracts, Leiden, The Netherlands, 2021.
@inproceedings{8712402,
  abstract     = {{Increased digitization of historical newspapers by cultural heritage institutions has allowed humanities scholars to expand their corpora in terms of volume and diversity. This has been accompanied by an increase in the use of computational tools. At the same time, it has given rise to a new set of questions around the accuracy and value of this data and of related studies. In this presentation we focus on the multilingual challenge of studying historical newspapers through the lens of Belgium: how can a humanities researcher who understands only one language, Dutch or French, accurately carry out a study of a national press without knowledge of the country's other most widely spoken language? This question is core to the workflows being developed in the digital humanities case of the OCCAM (OCR, ClassificAtion & Machine Translation) project. OCCAM implements a workflow that integrates image classification, Translation Memories (TMs), Optical Character Recognition (OCR), and Machine Translation (MT) to support the automated translation of scanned documents.

We explain this through the case of the Belgian press, using a set of multilingual historical newspapers from BelgicaPress, the historical newspaper collection of KBR, the Royal Library of Belgium. Through examples from a set of Dutch- and French-language newspapers from the early 1900s, we explain how images of textual sources in multiple languages can be OCRed efficiently using the machine-learning-based PERO model. In a subsequent workflow step, the OCR results are fed through a machine translation module. This module was originally developed for the machine translation of contemporary documents; we will report on its results for our digitized historical newspaper test case, to support research on these historical documents.}},
  articleno    = {{2}},
  author       = {{Birkholz, Julie M. and Chambers, Sally and Hradis, Michal and Smrz, Pavel}},
  booktitle    = {{DH Benelux 2021, Abstracts}},
  language     = {{eng}},
  location     = {{Leiden, The Netherlands}},
  title        = {{Evaluating the multilingual capabilities of the OCCAM workflow : a case of digitised historical newspapers}},
  url          = {{https://2021.dhbenelux.org/home/abstracts/}},
  year         = {{2021}},
}