Dutch Compound Splitting for Bilingual Terminology Extraction

Publication type
Publication status: In press
Macken, L., & Tezcan, A.
Ruslan Mitkov, Johanna Monti, Gloria Corpas Pastor and Violeta Seretan
Multi-word Units in Machine Translation and Translation Technology
John Benjamins


Compounds pose a problem for applications that rely on precise word alignments, such as bilingual terminology extraction. We therefore developed a state-of-the-art hybrid compound splitter for Dutch that makes use of corpus frequency information and linguistic knowledge. Domain-adaptation techniques are used to combine large out-of-domain and dynamically compiled in-domain frequency lists. We perform an extensive intrinsic evaluation on a Gold Standard set of 50,000 Dutch compounds and a set of 5,000 Dutch compounds belonging to the automotive domain. We also propose a novel methodology for word alignment that makes use of the compound splitter. As compounds are not always translated compositionally, we train the word alignment models twice: first on the original data set, and then on the data set in which the compounds are split into their component parts. The obtained word alignment points are then combined.
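
To make the two ideas in the abstract concrete, the following is a minimal sketch and not the authors' implementation. It assumes a simple frequency-based heuristic (the geometric mean of part frequencies, a common baseline for compound splitting), a weighted sum as a stand-in for the domain-adaptation step, and a plain union as the way alignment points are combined. The frequency counts, the linking-element list, the `in_weight` parameter, and all function names are hypothetical and chosen for illustration only.

```python
from math import sqrt

# Toy frequency lists (hypothetical counts, for illustration only).
OUT_OF_DOMAIN = {"auto": 900, "deur": 400, "motor": 700, "olie": 300,
                 "rem": 200, "bedrijf": 500}
IN_DOMAIN = {"motor": 60, "olie": 40, "rem": 80, "schijf": 30, "motorolie": 1}

# Simplified set of Dutch linking elements (assumption); "" means no linker.
LINKERS = ("", "s", "en", "e")


def freq(word, in_weight=10.0):
    """Combine both lists by up-weighting in-domain counts (assumed scheme)."""
    return OUT_OF_DOMAIN.get(word, 0) + in_weight * IN_DOMAIN.get(word, 0)


def split_compound(word, min_len=3):
    """Return the best binary split of `word`, or [word] if no split wins.

    A split is scored by the geometric mean of its parts' frequencies and
    must beat the frequency of the unsplit word.
    """
    best_parts, best_score = [word], freq(word)
    for i in range(min_len, len(word) - min_len + 1):
        head = word[i:]
        for linker in LINKERS:
            modifier = word[:i]
            if linker:
                if not modifier.endswith(linker):
                    continue
                modifier = modifier[:-len(linker)]  # strip the linking element
            if len(modifier) < min_len:
                continue
            score = sqrt(freq(modifier) * freq(head))
            if score > best_score:
                best_parts, best_score = [modifier, head], score
    return best_parts


def combine_alignments(original_points, split_points, part_to_word):
    """Union of both alignment runs (assumed combination strategy).

    `part_to_word[k]` gives the index of the original source token that
    split-token k came from, so split-side points can be mapped back.
    """
    mapped = {(part_to_word[s], t) for s, t in split_points}
    return set(original_points) | mapped


if __name__ == "__main__":
    for w in ("motorolie", "remschijf", "bedrijfsauto", "motor"):
        print(w, "->", split_compound(w))
```

In this sketch an unknown word, or a word whose own frequency exceeds every candidate split, is returned unsplit, which keeps lexicalised compounds intact. The union in `combine_alignments` favours recall; an intersection or a heuristic in between would be an equally plausible choice under the same assumptions.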