SCATE Taxonomy and Corpus of Machine Translation Errors

Publication type
B2
Publication status
In press
Authors
Tezcan, A., Hoste, V., & Macken, L.
Editor
Gloria Corpas Pastor and Isabel Durán Muñoz
Series
Trends in e-tools and resources for translators and interpreters
Publisher
Brill

Abstract

Quality estimation (QE) and error analysis of machine translation (MT) output remain active areas of Natural Language Processing (NLP) research. Many recent efforts focus on machine learning (ML) systems that estimate MT quality, translation errors, post-editing speed or post-editing effort. Because the accuracy of such ML systems depends on the availability of corpora, there is a need both for large corpora of machine translations annotated with translation errors and for error annotation guidelines that yield consistent annotations. Building on previous work on translation error taxonomies, we present the SCATE MT error taxonomy, which is hierarchical and builds upon the well-known notions of accuracy and fluency. In the SCATE annotation framework, we annotate fluency errors at the monolingual level, in the target text alone, and accuracy errors at the bilingual level, linking the corresponding source and target text fragments to each other. We also propose a novel method for alignment-based inter-annotator agreement (IAA) analysis and show that it can be used effectively on large annotation sets. Using the SCATE taxonomy and guidelines, we build the first corpus of MT errors for the English-Dutch language pair, consisting of errors made by a statistical machine translation (SMT) system and a rule-based machine translation (RBMT) system. This corpus is a valuable resource not only for NLP tasks in this field but also for future studies of the relationship between MT errors and post-editing effort. Finally, we analyse the error profiles of the SMT and RBMT systems used in this study and compare the quality of these two MT architectures based on their error types.