
Predicting machine translation performance on low-resource languages : the role of domain similarity
- Author
- Eric Khiu, Hasti Toossi, Jinyu Liu, Jiaxu Li, David Anugraha, Juan Flores, Leandro Roman, A. Seza Doğruöz (UGent) and En-Shiun Lee
- Abstract
- Fine-tuning and testing a multilingual large language model is a challenge for low-resource languages (LRLs) since it is an expensive process. While previous studies have predicted the performance of natural language processing (NLP) tasks using machine learning methods, they primarily focus on high-resource languages, overlooking LRLs and shifts across domains. Focusing on LRLs, we investigate three factors (the size of the fine-tuning corpus, domain similarity between fine-tuning and testing corpora, and language similarity between source and target languages), which can potentially impact the model performance by using classical regression models. Our results indicate that domain similarity has the most important impact on predicting the performance of Machine Translation models.
Downloads
- 2024.findings-eacl.100v2.pdf
- full text (Published version)
- open access
- 585.88 KB
Citation
Please use this url to cite or link to this publication: http://hdl.handle.net/1854/LU-01HT7AS5NTMR82R901RCXVR71H
- MLA
- Khiu, Eric, et al. “Predicting Machine Translation Performance on Low-Resource Languages : The Role of Domain Similarity.” FINDINGS OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS : EACL 2024, edited by Yvette Graham and Matthew Purver, Association for Computational Linguistics (ACL), 2024, pp. 1474–86.
- APA
- Khiu, E., Toossi, H., Liu, J., Li, J., Anugraha, D., Flores, J., … Lee, E.-S. (2024). Predicting machine translation performance on low-resource languages : the role of domain similarity. In Y. Graham & M. Purver (Eds.), FINDINGS OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS : EACL 2024 (pp. 1474–1486). Association for Computational Linguistics (ACL).
- Chicago author-date
- Khiu, Eric, Hasti Toossi, Jinyu Liu, Jiaxu Li, David Anugraha, Juan Flores, Leandro Roman, A. Seza Doğruöz, and En-Shiun Lee. 2024. “Predicting Machine Translation Performance on Low-Resource Languages : The Role of Domain Similarity.” In FINDINGS OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS : EACL 2024, edited by Yvette Graham and Matthew Purver, 1474–86. Association for Computational Linguistics (ACL).
- Chicago author-date (all authors)
- Khiu, Eric, Hasti Toossi, Jinyu Liu, Jiaxu Li, David Anugraha, Juan Flores, Leandro Roman, A. Seza Doğruöz, and En-Shiun Lee. 2024. “Predicting Machine Translation Performance on Low-Resource Languages : The Role of Domain Similarity.” In FINDINGS OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS : EACL 2024, edited by Yvette Graham and Matthew Purver, 1474–1486. Association for Computational Linguistics (ACL).
- Vancouver
- 1. Khiu E, Toossi H, Liu J, Li J, Anugraha D, Flores J, et al. Predicting machine translation performance on low-resource languages : the role of domain similarity. In: Graham Y, Purver M, editors. FINDINGS OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS : EACL 2024. Association for Computational Linguistics (ACL); 2024. p. 1474–86.
- IEEE
- [1] E. Khiu et al., “Predicting machine translation performance on low-resource languages : the role of domain similarity,” in FINDINGS OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS : EACL 2024, St. Julian’s, Malta, 2024, pp. 1474–1486.
@inproceedings{01HT7AS5NTMR82R901RCXVR71H,
  abstract = {{Fine-tuning and testing a multilingual large language model is a challenge for low-resource languages (LRLs) since it is an expensive process. While previous studies have predicted the performance of natural language processing (NLP) tasks using machine learning methods, they primarily focus on high-resource languages, overlooking LRLs and shifts across domains. Focusing on LRLs, we investigate three factors (the size of the fine-tuning corpus, domain similarity between fine-tuning and testing corpora, and language similarity between source and target languages), which can potentially impact the model performance by using classical regression models. Our results indicate that domain similarity has the most important impact on predicting the performance of Machine Translation models.}},
  author = {{Khiu, Eric and Toossi, Hasti and Liu, Jinyu and Li, Jiaxu and Anugraha, David and Flores, Juan and Roman, Leandro and Doğruöz, A. Seza and Lee, En-Shiun}},
  booktitle = {{FINDINGS OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS : EACL 2024}},
  editor = {{Graham, Yvette and Purver, Matthew}},
  isbn = {{9798891760936}},
  language = {{eng}},
  location = {{St. Julian’s, Malta}},
  pages = {{1474--1486}},
  publisher = {{Association for Computational Linguistics (ACL)}},
  title = {{Predicting machine translation performance on low-resource languages : the role of domain similarity}},
  url = {{https://aclanthology.org/2024.findings-eacl.100}},
  year = {{2024}},
}