Dutch compound splitting for bilingual terminology extraction

Lieve Macken (UGent) and Arda Tezcan (UGent)
Project
  • LT3
Abstract
Compounds pose a problem for applications that rely on precise word alignments such as bilingual terminology extraction. We therefore developed a state-of-the-art hybrid compound splitter for Dutch that makes use of corpus frequency information and linguistic knowledge. Domain-adaptation techniques are used to combine large out-of-domain and dynamically compiled in-domain frequency lists. We perform an extensive intrinsic evaluation on a Gold Standard set of 50,000 Dutch compounds and a set of 5,000 Dutch compounds belonging to the automotive domain. We also propose a novel methodology for word alignment that makes use of the compound splitter. As compounds are not always translated compositionally, we train the word alignment models twice: a first time on the original data set and a second time on the data set in which the compounds are split into their component parts. The obtained word alignment points are then combined.
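The abstract names three ingredients: frequency-based splitting, a combination of out-of-domain and in-domain frequency lists, and a merge of the alignment points from the two training runs. The sketch below illustrates one plausible reading of each; it is not the authors' implementation. The toy frequency lists, the linear interpolation weight, the geometric-mean score, the short list of linking elements, and the union-based merge are all assumptions made for illustration, and the function names are hypothetical.

```python
from math import prod

# Hypothetical toy frequency lists; real lists would come from large corpora.
OUT_OF_DOMAIN = {"auto": 900, "deur": 400, "motor": 700, "olie": 300, "station": 200}
IN_DOMAIN = {"motor": 50, "olie": 40, "motorolie": 5, "deur": 10}
OUT_TOTAL = sum(OUT_OF_DOMAIN.values())
IN_TOTAL = sum(IN_DOMAIN.values())

LINKERS = ("", "s", "en")  # simplified, assumed set of Dutch linking elements
MIN_PART_LEN = 3           # assumed minimum length of a compound part


def combined_freq(word, weight=0.5):
    """Interpolate relative frequencies from both lists (assumed scheme)."""
    out_rel = OUT_OF_DOMAIN.get(word, 0) / OUT_TOTAL
    in_rel = IN_DOMAIN.get(word, 0) / IN_TOTAL
    return (1 - weight) * out_rel + weight * in_rel


def candidate_splits(word):
    """Yield the unsplit word, then every split whose non-final parts are known."""
    yield [word]
    for i in range(MIN_PART_LEN, len(word) - MIN_PART_LEN + 1):
        head = word[:i]
        for linker in LINKERS:
            if not head.endswith(linker):
                continue
            stem = head[:len(head) - len(linker)]
            if len(stem) < MIN_PART_LEN or combined_freq(stem) == 0:
                continue
            for rest in candidate_splits(word[i:]):
                yield [stem] + rest


def score(parts):
    """Geometric mean of part frequencies; zero if any part is unknown."""
    return prod(combined_freq(p) for p in parts) ** (1 / len(parts))


def split_compound(word):
    """Return the best-scoring candidate; ties keep the unsplit word."""
    return max(candidate_splits(word), key=score)


def merge_alignments(original_points, split_points, split_to_original):
    """Union the alignment points of both runs, in original token indices.

    Points are (dutch_index, other_index) pairs. split_to_original[i] gives,
    for token i of the split Dutch sentence, its index in the unsplit one.
    """
    projected = {(split_to_original[s], t) for s, t in split_points}
    return set(original_points) | projected
```

With these toy lists, `split_compound("motorolie")` prefers the two-part analysis because both parts are far more frequent than the whole word, and `split_compound("stationsdeur")` strips the linking "s". For the sentence pair "de motorolie" / "the engine oil", calling `merge_alignments({(0, 0), (1, 2)}, {(0, 0), (1, 1), (2, 2)}, [0, 1, 1])` links the unsplit compound to both "engine" and "oil".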
Keywords
word alignment, translation, Compound splitting, Dutch, multi-word units, bilingual terminology extraction, LT3

Downloads

  • Dutch Compound Splitting for Bilingual Terminology ExtractionJuly.pdf (full text, open access, PDF, 310.67 KB)

Citation

MLA
Macken, Lieve, and Arda Tezcan. “Dutch Compound Splitting for Bilingual Terminology Extraction.” Multiword Units in Machine Translation and Translation Technology, edited by Ruslan Mitkov et al., vol. 341, John Benjamins, 2018, pp. 148–62.
APA
Macken, L., & Tezcan, A. (2018). Dutch compound splitting for bilingual terminology extraction. In R. Mitkov, J. Monti, G. Corpas Pastor, & V. Seretan (Eds.), Multiword units in machine translation and translation technology (Vol. 341, pp. 148–162). John Benjamins.
Chicago author-date
Macken, Lieve, and Arda Tezcan. 2018. “Dutch Compound Splitting for Bilingual Terminology Extraction.” In Multiword Units in Machine Translation and Translation Technology, edited by Ruslan Mitkov, Johanna Monti, Gloria Corpas Pastor, and Violeta Seretan, 341:148–62. John Benjamins.
Chicago author-date (all authors)
Macken, Lieve, and Arda Tezcan. 2018. “Dutch Compound Splitting for Bilingual Terminology Extraction.” In Multiword Units in Machine Translation and Translation Technology, edited by Ruslan Mitkov, Johanna Monti, Gloria Corpas Pastor, and Violeta Seretan, 341:148–162. John Benjamins.
Vancouver
1. Macken L, Tezcan A. Dutch compound splitting for bilingual terminology extraction. In: Mitkov R, Monti J, Corpas Pastor G, Seretan V, editors. Multiword units in machine translation and translation technology. John Benjamins; 2018. p. 148–62.
IEEE
[1] L. Macken and A. Tezcan, “Dutch compound splitting for bilingual terminology extraction,” in Multiword units in machine translation and translation technology, vol. 341, R. Mitkov, J. Monti, G. Corpas Pastor, and V. Seretan, Eds. John Benjamins, 2018, pp. 148–162.
BibTeX
@incollection{7126122,
  abstract     = {Compounds pose a problem for applications that rely on precise word alignments such as bilingual terminology extraction. We therefore developed a state-of-the-art hybrid compound splitter for Dutch that makes use of corpus frequency information and linguistic knowledge. Domain-adaptation techniques are used to combine large out-of-domain and dynamically compiled in-domain frequency lists. We perform an extensive intrinsic evaluation on a Gold Standard set of 50,000 Dutch compounds and a set of 5,000 Dutch compounds belonging to the automotive domain. We also propose a novel methodology for word alignment that makes use of the compound splitter. As compounds are not always translated compositionally, we train the word alignment models twice: a first time on the original data set and a second time on the data set in which the compounds are split into their component parts. The obtained word alignment points are then combined.},
  author       = {Macken, Lieve and Tezcan, Arda},
  booktitle    = {Multiword units in machine translation and translation technology},
  editor       = {Mitkov, Ruslan and Monti, Johanna and Corpas Pastor, Gloria and Seretan, Violeta},
  isbn         = {9789027200600},
  keywords     = {word alignment,translation,Compound splitting,Dutch,multi-word units,bilingual terminology extraction,LT3},
  language     = {eng},
  pages        = {148--162},
  publisher    = {John Benjamins},
  series       = {Current Issues in Linguistic Theory},
  title        = {Dutch compound splitting for bilingual terminology extraction},
  url          = {https://benjamins.com/catalog/cilt.341},
  volume       = {341},
  year         = {2018},
}