Dutch Compound Splitting for Bilingual Terminology Extraction

Publication type
B2
Publication status
In press
Authors
Macken, L., & Tezcan, A.
Editors
Ruslan Mitkov, Johanna Monti, Gloria Corpas Pastor and Violeta Seretan
Series
Multi-word Units in Machine Translation and Translation Technology
Volume
341
Publisher
John Benjamins

Abstract

Compounds pose a problem for applications that rely on precise word alignments, such as bilingual terminology extraction. We therefore developed a state-of-the-art hybrid compound splitter for Dutch that makes use of corpus frequency information and linguistic knowledge. Domain-adaptation techniques are used to combine large out-of-domain and dynamically compiled in-domain frequency lists. We perform an extensive intrinsic evaluation on a Gold Standard set of 50,000 Dutch compounds and a set of 5,000 Dutch compounds belonging to the automotive domain. We also propose a novel methodology for word alignment that makes use of the compound splitter. As compounds are not always translated compositionally, we train the word alignment models twice: first on the original data set, and then on the data set in which the compounds are split into their component parts. The word alignment points obtained from both runs are then combined.
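
To make the two ideas in the abstract concrete, the following is a minimal illustrative sketch and not the authors' implementation. It assumes a generic frequency-based splitter that scores each candidate segmentation by the geometric mean of its parts' corpus frequencies, with a single hand-written rule for the Dutch linking "s". It also assumes that alignment points from the two training runs are combined by a simple union, which the abstract does not specify. The frequency counts and the names FREQ, candidate_splits, split_compound and combine_alignments are all hypothetical.

```python
from math import prod

# Toy frequency list (hypothetical counts, for illustration only).
FREQ = {"auto": 500, "deur": 300, "autodeur": 2,
        "stad": 400, "bus": 350, "stadsbus": 1}


def candidate_splits(word, freq, min_len=3):
    """Yield every segmentation of `word` into parts listed in `freq`."""
    if word in freq:
        yield [word]  # the unsplit word is itself a candidate
    for i in range(min_len, len(word) - min_len + 1):
        head, tail = word[:i], word[i:]
        heads = [head]
        # Assumed linguistic rule: a head may carry a linking "s".
        if head.endswith("s") and len(head) > min_len:
            heads.append(head[:-1])
        for h in heads:
            if h in freq:
                for rest in candidate_splits(tail, freq, min_len):
                    yield [h] + rest


def split_compound(word, freq):
    """Return the segmentation with the highest geometric-mean frequency."""
    candidates = list(candidate_splits(word, freq))
    if not candidates:
        return [word]  # unknown word: leave it unsplit
    return max(candidates,
               key=lambda parts: prod(freq[p] for p in parts) ** (1 / len(parts)))


def combine_alignments(orig_points, split_points, split_to_orig):
    """Union of both runs (an assumption).

    `split_to_orig[i]` gives the original token index of split token i,
    so points from the split run are mapped back before merging.
    """
    mapped = {(split_to_orig[s], t) for s, t in split_points}
    return set(orig_points) | mapped
```

With this toy list, split_compound("stadsbus", FREQ) picks the parts "stad" and "bus" because their geometric-mean frequency exceeds the count of the unsplit form. Keeping the unsplit word as a candidate lets frequent lexicalised compounds stay whole.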