Sub-sentential alignment of translational correspondences

Macken, L.
University Press Antwerp (Antwerp)


The focus of this thesis is sub-sentential alignment, i.e. the automatic alignment of translational correspondences below sentence level. The system that we developed takes as its input sentence-aligned parallel texts and aligns translational correspondences at the sub-sentential level, which can be words, word groups or chunks. The research described in this thesis aims to be of value to the developers of computer-assisted translation tools and to human translators in general.

Two important aspects of this research are its focus on different text types and its focus on precision. In order to cover a wide range of syntactic and stylistic phenomena that emerge from different writing and translation styles, we used parallel texts of different text types. As the intended users are ultimately human translators, our explicit aim was to develop a model that aligns segments with a very high precision.

This thesis consists of three major parts. The first part is introductory and focuses on the manual annotation, the resources used and the evaluation methodology. The second part forms the main contribution of this thesis and describes the sub-sentential alignment system that was developed. In the third part, two different applications are discussed.

Although the global architecture of our sub-sentential alignment module is language-independent, the main focus is on the English-Dutch language pair. At the beginning of the research project, a Gold Standard was created. This manual reference corpus contains three different types of links: regular links for straightforward correspondences, fuzzy links for translation-specific shifts of various kinds, and null links for words for which no correspondence could be indicated. The different writing and translation styles of the different text types were reflected in the number of regular, fuzzy and null links.

The sub-sentential alignment system is conceived as a cascaded model consisting of two phases. In the first phase, anchor chunks are linked on the basis of lexical correspondences and syntactic similarity. In the second phase, we use a bootstrapping approach to extract language-pair specific translation patterns. The alignment system is chunk-driven and requires only shallow linguistic processing tools for the source and the target languages, i.e. part-of-speech taggers and chunkers.

To generate the lexical correspondences, we experimented with two different types of bilingual dictionaries: a handcrafted bilingual dictionary and probabilistic bilingual dictionaries. In the bootstrapping experiments, we started from the precise GIZA++ intersected word alignments. The proposed system improves the recall of the intersected GIZA++ word alignments without sacrificing precision, which makes the resulting alignments more useful for incorporation in CAT tools or bilingual terminology extraction tools. Moreover, the system’s ability to align discontiguous chunks makes it useful for languages containing split verbal constructions and phrasal verbs.
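As a rough illustration of why intersected word alignments are precise but incomplete, the following minimal Python sketch intersects two directional alignments and scores the result against a reference. It is not the system described in this thesis and does not call GIZA++: the link sets, the function names `intersect` and `precision_recall`, and the toy data are all assumptions made for the example, and the scoring is plain set-based precision and recall, which may differ from the evaluation metrics used in the thesis.

```python
from typing import Set, Tuple

# A link pairs a source token index with a target token index.
Link = Tuple[int, int]


def intersect(src2tgt: Set[Link], tgt2src: Set[Link]) -> Set[Link]:
    """Keep only the links proposed in both alignment directions.

    src2tgt holds (source, target) pairs; tgt2src holds (target, source)
    pairs, so it is flipped before the set intersection.
    """
    flipped = {(s, t) for (t, s) in tgt2src}
    return src2tgt & flipped


def precision_recall(predicted: Set[Link], reference: Set[Link]) -> Tuple[float, float]:
    """Plain set-based precision and recall of predicted links."""
    correct = len(predicted & reference)
    precision = correct / len(predicted) if predicted else 0.0
    recall = correct / len(reference) if reference else 0.0
    return precision, recall


# Toy data, invented for illustration only.
src2tgt = {(0, 0), (1, 1), (2, 3), (3, 2)}   # (source, target)
tgt2src = {(0, 0), (1, 1), (2, 3), (3, 3)}   # (target, source)
reference = {(0, 0), (1, 1), (2, 3), (3, 2)}  # (source, target)

intersected = intersect(src2tgt, tgt2src)
precision, recall = precision_recall(intersected, reference)
print(sorted(intersected), precision, recall)
```

In this toy case every link that survives the intersection is correct, but one reference link is lost. Missing links of this kind are the ones a recall-oriented second step would try to recover.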

In the last part of this thesis, we demonstrate the usefulness of the sub-sentential alignment module in two different applications. First, we use the sub-sentential alignment module to guide bilingual terminology extraction on three different language pairs, viz. French-English, French-Italian and French-Dutch. Second, we compare the performance of our alignment system with that of a commercial sub-sentential translation memory system.