Smart Computer-Aided Translation Environment

Start date: March 1, 2014
End date: February 28, 2018
CCL (KU Leuven)
ESAT/PSI (KU Leuven)
LIIR (KU Leuven)
Language and Communication (KU Leuven)
EDM (Hasselt University)

The SCATE project primarily aims at improving translators' efficiency. Current commercial translation tools do not meet the productivity requirements imposed by the globalisation of business activities and the increasing information flow: they incorporate only limited linguistic knowledge, are difficult to adapt to new domains, provide restricted support for speech recognition, and have interfaces that are technology-driven rather than user-driven.
The SCATE project will develop innovative translation technology components and integrate them into a proof-of-concept system, a Smart Computer-Aided Translation Environment (SCATE), that offers genuine support to the professional translator.
SCATE will focus on:
- the integration of more linguistic knowledge in the translation modules,
- the use of more sophisticated techniques for terminology extraction,
- the addition of speech recognition devices, and
- the optimisation of human-machine interaction.
As a partner of the SCATE consortium, LT3 is responsible for two work packages:
1. Evaluation of Computer-Aided Translation
Goal: The goal of this WP is three-fold: (1) develop a taxonomy of typical MT errors and create a data set in which MT and TM errors are manually annotated; (2) study the current post-editing effort by observing and analysing how human translators actually post-edit; and (3) develop confidence metrics that can estimate a priori the post-editing effort on the basis of the manually created data set.
Motivation: It is still largely unclear how translation technologies can best help translators produce high-quality translations faster. To obtain translations of high, publishable quality, humans still intervene in the translation process by post-editing MT output. However, post-editing poor MT can require more effort than translating from scratch. The taxonomy will answer the question of what needs to be fixed. The process studies of human-machine interaction in post-editing will answer the questions of how MT errors are actually fixed and how demanding a fix is. The temporal and technical effort of the post-editing process will be linked to the classified errors and used to develop confidence estimation metrics.
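The confidence metrics in task (3) can be framed as supervised prediction of post-editing effort from the manually annotated data set. The Python sketch below illustrates the idea with a simple nearest-neighbour estimator; the features, the toy annotated pairs, and the effort scale are invented for illustration and are not the project's actual data or method.

```python
# Illustrative sketch only (invented features and toy data, not the
# project's actual metrics): estimate post-editing effort for new MT
# output from a small manually annotated data set.

def features(source, mt_output):
    """A few cheap surface features commonly used in quality estimation."""
    src = source.split()
    mt = mt_output.split()
    return (
        float(len(src)),                              # source length
        len(mt) / max(len(src), 1),                   # target/source length ratio
        sum(len(t) for t in src) / max(len(src), 1),  # mean source token length
    )

def predict_effort(annotated, source, mt_output, k=2):
    """Average the effort scores of the k most feature-similar annotated pairs."""
    x = features(source, mt_output)
    def dist(f):
        return sum((a - b) ** 2 for a, b in zip(f, x)) ** 0.5
    nearest = sorted(annotated, key=lambda item: dist(item[0]))[:k]
    return sum(effort for _, effort in nearest) / len(nearest)

# Toy "manually annotated" pairs: (features, post-editing effort in [0, 1]).
annotated = [
    (features("the cat sat on the mat",
              "de kat zat op de mat"), 0.1),
    (features("terminology extraction is hard",
              "terminologie extractie is moeilijk"), 0.3),
    (features("long technical sentences with rare jargon need heavy edits",
              "lange technische zinnen met zeldzaam jargon"), 0.8),
]

print(predict_effort(annotated, "the dog sat on the rug",
                     "de hond zat op het tapijt"))
```

In a realistic setting the toy pairs would be replaced by the WP's annotated corpus and the surface features by richer indicators of the classified error types.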
2. Terminology Extraction from Comparable Corpora
Goal: The goal of this WP is to study the process of terminology extraction by humans and to automate the extraction of terminology from comparable corpora. The WP therefore consists of three tasks: investigating translators' methods of acquiring domain terminology, investigating methods to determine which texts contain comparable information (in different languages), and the actual extraction of terminology from these comparable corpora.
Motivation: Many translators report that industry-specific jargon and terminology are among the biggest barriers to producing a quality translation. Even experienced translators struggle with terminology, which is often adopted by individual clients for product-specific documentation. In addition, product developers regularly coin new terms to denote new features of their products, forcing the translator to develop neologisms or to resort to loanwords or phono-semantic matching. An additional challenge of terminology as a field of study lies in the contrast between domains with fast-evolving terminology (e.g. IT) and conservative fields that cling to archaic expressions (e.g. legal). Translator specialisation is key, and the search for trustworthy reference materials on the Internet has grown into an important part of a translator's work. Translators draw on many online translation resources, from online glossaries to parallel or comparable corpora. However, the gigantic data vaults at the translator's disposal are of limited value when they are not used correctly, or simply become too large to handle. It is crucial to keep in mind that many of the current resources were not produced as reference material for human translators, but rather to feed SMT engines with bilingual data.
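A classical starting point for the extraction task, sketched here purely for illustration (it is not necessarily the SCATE approach), is contrastive term-candidate ranking: words that are proportionally much more frequent in a domain corpus than in a general reference corpus are likely term candidates, a heuristic often called "weirdness" scoring. The corpora below are toy examples.

```python
# Illustrative sketch (toy corpora, a deliberately simple "weirdness"
# score): rank term candidates by how much more frequent they are in a
# domain corpus than in a general reference corpus.
from collections import Counter

def weirdness(domain_tokens, reference_tokens, smoothing=1.0):
    """Relative frequency in the domain corpus divided by the (smoothed)
    relative frequency in the reference corpus; higher is more term-like."""
    dom = Counter(domain_tokens)
    ref = Counter(reference_tokens)
    n_dom, n_ref = len(domain_tokens), len(reference_tokens)
    scores = {}
    for word, freq in dom.items():
        p_dom = freq / n_dom
        p_ref = (ref[word] + smoothing) / (n_ref + smoothing * len(ref))
        scores[word] = p_dom / p_ref
    return sorted(scores, key=scores.get, reverse=True)

domain = ("the decoder rescores hypotheses with a terminology aware "
          "language model while the decoder prunes hypotheses").split()
general = "the cat sat on the mat and the dog slept near the door".split()

# Domain-specific words such as "decoder" outrank common words such as "the".
print(weirdness(domain, general)[:3])
```

Monolingual candidate ranking of this kind would only be a first step; aligning the ranked candidates across the two languages of a comparable corpus is the harder problem the WP addresses.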