Extraction of Parallel Elements from Comparable Texts

Start date: Sept. 15, 2013
End date: Sept. 14, 2016

About ExPECT

Translators and interpreters are expected to be knowledgeable about a very broad range of topics, across different domains and languages. Each of those topics has its own specific terminology, which translators and interpreters are also expected to master. Efficient methods for terminology acquisition are therefore increasingly important.

Even with the massive availability of text in numerous languages on the Internet, it is difficult to find documents suitable for automatic term extraction (ATE). Parallel texts are scarce, especially for very specific domains, and even more so for Dutch. This lack of parallel corpora, combined with the growing availability of text on the web, has inspired researchers to explore the usability of comparable texts for ATE (Daille & Morin, 2005; Déjean et al., 2005; Delpech et al., 2012; Fung & Yee, 1998).

In the ExPECT project, we will study the extent to which the output of ATE from comparable texts, both raw and manually corrected, can be useful for translators and interpreters. Previously developed methods will be used to build a system for corpus compilation and term extraction for Dutch, in combination with English, French and German. Once the corpus has been compiled for the four focus languages, monolingual terms will be extracted using the TExSIS tool (Macken, Lefever & Hoste, 2013). Subsequently, these terms will be linked to candidate translations by comparing the contexts in which they occur, following the distributional hypothesis that words with similar meanings appear in similar contexts. In addition to the traditional evaluation using precision and recall, the output of the tool will be assessed in terms of the gain in time and in quality for translation and interpreting assignments carried out with and without terminological support.

Daille, B., & Morin, E. (2005). French-English terminology extraction from comparable corpora. In Proceedings of the Second International Joint Conference on Natural Language Processing. Jeju Island, Korea: Springer-Verlag.
Déjean, H., Gaussier, E., Renders, J.-M., & Sadat, F. (2005). Automatic processing of multilingual medical terminology: applications to thesaurus enrichment and cross-language information retrieval. Artificial Intelligence in Medicine, 33(2), 111-124.
Delpech, E., Daille, B., Morin, E., & Lemaire, C. (2012). Extraction of domain-specific bilingual lexicon from comparable corpora: compositional translation and ranking. In Proceedings of the 24th International Conference on Computational Linguistics (pp. 745-762). Mumbai.
Fung, P., & Yee, L.Y. (1998). An IR approach for translating new words from nonparallel, comparable texts. In Proceedings of the 36th Annual Meeting of the Association for Computational Linguistics and 17th International Conference on Computational Linguistics - Volume 1. Montreal, Quebec, Canada: Association for Computational Linguistics.
Macken, L., Lefever, E., & Hoste, V. (2013). TExSIS: Bilingual Terminology Extraction from Parallel Corpora Using Chunk-based Alignment. Terminology, 19(1), 1-30.
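
The context-comparison step mentioned in the project description (linking extracted terms to candidate translations via the distributional hypothesis) can be illustrated with a minimal sketch. It is not the ExPECT or TExSIS implementation: it assumes single-word terms, pre-tokenised sentences, raw co-occurrence counts, and a small hypothetical seed dictionary (`seed_dict`) used to map Dutch context words into English. All function names and the toy data are invented for illustration.

```python
import math
from collections import Counter


def context_vector(term, sentences, window=3):
    """Count the words that occur within `window` tokens of `term`."""
    counts = Counter()
    for tokens in sentences:
        for i, tok in enumerate(tokens):
            if tok == term:
                lo, hi = max(0, i - window), i + window + 1
                counts.update(
                    t for j, t in enumerate(tokens[lo:hi], lo) if j != i
                )
    return counts


def translate_vector(vec, seed_dict):
    """Map source-language context words to the target language.

    Words missing from the (hypothetical) seed dictionary are dropped.
    """
    out = Counter()
    for word, n in vec.items():
        if word in seed_dict:
            out[seed_dict[word]] += n
    return out


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def rank_candidates(src_term, src_sents, tgt_terms, tgt_sents, seed_dict, window=3):
    """Rank target-language terms by context similarity to `src_term`."""
    src_vec = translate_vector(
        context_vector(src_term, src_sents, window), seed_dict
    )
    scored = [
        (cosine(src_vec, context_vector(t, tgt_sents, window)), t)
        for t in tgt_terms
    ]
    return sorted(scored, reverse=True)


# Toy comparable (non-parallel) corpora, already tokenised.
nl_sents = [
    ["de", "chirurg", "vervangt", "de", "hartklep", "van", "de", "patiënt"],
    ["een", "lekkende", "hartklep", "belast", "het", "hart"],
]
en_sents = [
    ["the", "surgeon", "replaces", "the", "valve", "of", "the", "patient"],
    ["a", "leaking", "valve", "strains", "the", "heart"],
    ["the", "nurse", "cleans", "the", "scalpel", "after", "surgery"],
]
# Hypothetical seed dictionary of general-language words.
seed_dict = {
    "chirurg": "surgeon",
    "patiënt": "patient",
    "hart": "heart",
    "vervangt": "replaces",
}

ranked = rank_candidates("hartklep", nl_sents, ["valve", "scalpel"], en_sents, seed_dict)
print(ranked[0][1])  # valve
```

In this toy setting "valve" ranks first because its English contexts share translated words (surgeon, replaces, patient, heart) with the Dutch contexts of "hartklep", whereas "scalpel" shares none. A realistic system would typically replace raw counts with an association measure and handle multi-word terms.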