Evaluating transformers for OCR post-correction in early modern Dutch theatre

Florian Debaene (UGent), Aaron Maladry (UGent), Els Lefever (UGent) and Veronique Hoste (UGent)
Abstract
This paper explores the effectiveness of two types of transformer models — large generative models and sequence-to-sequence models — for automatically post-correcting Optical Character Recognition (OCR) output in early modern Dutch plays. To address the need for optimally aligned data, we create a parallel dataset based on the OCRed and ground truth versions from the EmDComF corpus using state-of-the-art alignment techniques. By combining character-based and semantic methods, we design and release a qualitative OCR-to-gold parallel dataset, selecting the alignment with the lowest Character Error Rate (CER) for all alignment pairs. We then fine-tune and evaluate five generative models and four sequence-to-sequence models on the OCR post-correction dataset. Results show that sequence-to-sequence models generally outperform generative models in this task, correcting more OCR errors and overgenerating and undergenerating less, with mBART as the best performing system.
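The selection step described in the abstract, keeping the alignment with the lowest Character Error Rate, can be illustrated with a minimal sketch. It assumes CER is the Levenshtein edit distance between the OCR string and the gold string, divided by the gold length, which is the usual definition but is not spelled out in this record. The function names and the example strings are hypothetical and are not taken from the paper or the EmDComF corpus.

```python
from typing import Iterable, Tuple


def levenshtein(a: str, b: str) -> int:
    """Edit distance between a and b (insertions, deletions, substitutions)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def cer(ocr: str, gold: str) -> float:
    """Character Error Rate: edit distance normalised by gold length (assumed definition)."""
    if not gold:
        return 0.0 if not ocr else float("inf")
    return levenshtein(ocr, gold) / len(gold)


def best_alignment(candidates: Iterable[Tuple[str, str]]) -> Tuple[str, str]:
    """Return the (ocr, gold) candidate pair with the lowest CER."""
    return min(candidates, key=lambda pair: cer(pair[0], pair[1]))


# Invented candidate alignments for one OCR line, e.g. produced by
# two different alignment methods (hypothetical data).
candidates = [
    ("Wat fal ik doen", "Wat zal hy doen"),
    ("Wat fal ik doen", "Wat zal ik doen"),
]
chosen = best_alignment(candidates)
```

Here `chosen` is the second pair, since it differs from its gold line in a single character while the first differs in three.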

Citation

MLA
Debaene, Florian, et al. “Evaluating Transformers for OCR Post-Correction in Early Modern Dutch Theatre.” Proceedings of the 31st International Conference on Computational Linguistics, COLING 2025, edited by Owen Rambow et al., Association for Computational Linguistics (ACL), 2025, pp. 10367–74.
APA
Debaene, F., Maladry, A., Lefever, E., & Hoste, V. (2025). Evaluating transformers for OCR post-correction in early modern Dutch theatre. In O. Rambow, L. Wanner, M. Apidianaki, H. Al-Khalifa, B. Di Eugenio, & S. Schockaert (Eds.), Proceedings of the 31st International Conference on Computational Linguistics, COLING 2025 (pp. 10367–10374). Association for Computational Linguistics (ACL).
Chicago author-date
Debaene, Florian, Aaron Maladry, Els Lefever, and Veronique Hoste. 2025. “Evaluating Transformers for OCR Post-Correction in Early Modern Dutch Theatre.” In Proceedings of the 31st International Conference on Computational Linguistics, COLING 2025, edited by Owen Rambow, Leo Wanner, Marianna Apidianaki, Hend Al-Khalifa, Barbara Di Eugenio, and Steven Schockaert, 10367–74. Association for Computational Linguistics (ACL).
Chicago author-date (all authors)
Debaene, Florian, Aaron Maladry, Els Lefever, and Veronique Hoste. 2025. “Evaluating Transformers for OCR Post-Correction in Early Modern Dutch Theatre.” In Proceedings of the 31st International Conference on Computational Linguistics, COLING 2025, edited by Owen Rambow, Leo Wanner, Marianna Apidianaki, Hend Al-Khalifa, Barbara Di Eugenio, and Steven Schockaert, 10367–10374. Association for Computational Linguistics (ACL).
Vancouver
1. Debaene F, Maladry A, Lefever E, Hoste V. Evaluating transformers for OCR post-correction in early modern Dutch theatre. In: Rambow O, Wanner L, Apidianaki M, Al-Khalifa H, Di Eugenio B, Schockaert S, editors. Proceedings of the 31st International Conference on Computational Linguistics, COLING 2025. Association for Computational Linguistics (ACL); 2025. p. 10367–74.
IEEE
[1] F. Debaene, A. Maladry, E. Lefever, and V. Hoste, “Evaluating transformers for OCR post-correction in early modern Dutch theatre,” in Proceedings of the 31st International Conference on Computational Linguistics, COLING 2025, Abu Dhabi, United Arab Emirates, 2025, pp. 10367–10374.
BibTeX
@inproceedings{01JJ4MZRSPC0EZQGK1G52BRJMQ,
  abstract     = {{This paper explores the effectiveness of two types of transformer models — large generative models and sequence-to-sequence models — for automatically post-correcting Optical Character Recognition (OCR) output in early modern Dutch plays. To address the need for optimally aligned data, we create a parallel dataset based on the OCRed and ground truth versions from the EmDComF corpus using state-of-the-art alignment techniques. By combining character-based and semantic methods, we design and release a qualitative OCR-to-gold parallel dataset, selecting the alignment with the lowest Character Error Rate (CER) for all alignment pairs. We then fine-tune and evaluate five generative models and four sequence-to-sequence models on the OCR post-correction dataset. Results show that sequence-to-sequence models generally outperform generative models in this task, correcting more OCR errors and overgenerating and undergenerating less, with mBART as the best performing system.}},
  author       = {{Debaene, Florian and Maladry, Aaron and Lefever, Els and Hoste, Veronique}},
  booktitle    = {{Proceedings of the 31st International Conference on Computational Linguistics, COLING 2025}},
  editor       = {{Rambow, Owen and Wanner, Leo and Apidianaki, Marianna and Al-Khalifa, Hend and Di Eugenio, Barbara and Schockaert, Steven}},
  isbn         = {{9798891761964}},
  language     = {{eng}},
  location     = {{Abu Dhabi, United Arab Emirates}},
  pages        = {{10367--10374}},
  publisher    = {{Association for Computational Linguistics (ACL)}},
  title        = {{Evaluating transformers for OCR post-correction in early modern Dutch theatre}},
  url          = {{https://aclanthology.org/2025.coling-main.690/}},
  year         = {{2025}},
}