Extracting Terminology Automatically from Comparable Texts

Start date
Oct. 1, 2017
End date
Sept. 30, 2021


Specialised, domain-specific vocabulary, i.e. terminology, has always been difficult and time-consuming to understand and translate well. Nonetheless, terms often carry essential information, and understanding them correctly is imperative in many contexts (e.g. medical texts, technical manuals and legal documents). This problem affects not only human translators but also machine translation (MT). MT is typically based on enormous volumes of text (mostly human translations), but terms are inherently much rarer than general vocabulary and highly domain-specific. Therefore, a different approach must be adopted to handle terminology: automatic term extraction (ATE).

Monolingual ATE was developed to recognise and extract terms from running text. Next, a translation component was added using parallel corpora: terms are extracted from aligned human translations and then linked to their potential equivalents in the target language. However, sufficient domain-specific parallel corpora cannot always be found, especially for smaller domains and languages. The latest innovation is multilingual ATE from comparable corpora (ATECC). Comparable corpora are collections of similar texts on the same subject in different languages that are not translations of each other. They are used, first, to perform monolingual ATE in each language and, then, to find translation equivalents in the resulting lists of extracted candidate terms. Comparable corpora address the data acquisition bottleneck because they are much easier to collect. However, finding translation equivalents becomes much harder: because the texts are not aligned, the position and even the presence of a translation are unknown.
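The two steps described above can be illustrated with a deliberately simple sketch. It is not the methodology of this project: termhood is approximated here by a smoothed relative-frequency ratio against a general corpus, and candidates are linked by surface string similarity, which only works for cognates. The function names, thresholds and toy sentences are all assumptions made for illustration.

```python
from collections import Counter
from difflib import SequenceMatcher


def extract_candidates(domain_tokens, general_tokens, min_ratio=1.4):
    """Step 1 (monolingual ATE): keep tokens that are relatively more
    frequent in the domain corpus than in a general reference corpus."""
    domain = Counter(domain_tokens)
    general = Counter(general_tokens)
    n_domain, n_general = len(domain_tokens), len(general_tokens)
    scores = {}
    for token, count in domain.items():
        rel_domain = count / n_domain
        # Add-one smoothing so tokens unseen in the general corpus
        # do not cause a division by zero.
        rel_general = (general[token] + 1) / (n_general + 1)
        ratio = rel_domain / rel_general
        if ratio >= min_ratio:
            scores[token] = ratio
    return sorted(scores, key=scores.get, reverse=True)


def link_candidates(source_terms, target_terms, threshold=0.7):
    """Step 2 (bilingual linking): pair each source candidate with its most
    similar target candidate, if the similarity reaches the threshold.
    No alignment is available, so every target candidate is considered."""
    links = {}
    for src in source_terms:
        best, best_score = None, 0.0
        for tgt in target_terms:
            score = SequenceMatcher(None, src, tgt).ratio()
            if score > best_score:
                best, best_score = tgt, score
        if best_score >= threshold:
            links[src] = best
    return links


# Toy comparable "corpora": same subject, not translations of each other.
en_domain = "the patient has cardiomyopathy and arrhythmia the patient".split()
en_general = "the cat and the dog has the ball and the patient".split()
nl_domain = "de patiënt heeft cardiomyopathie en aritmie de patiënt".split()
nl_general = "de kat en de hond heeft de bal en de patiënt".split()

en_terms = extract_candidates(en_domain, en_general)
nl_terms = extract_candidates(nl_domain, nl_general)
links = link_candidates(en_terms, nl_terms)
print(links)
```

On real data, a term with no sufficiently similar counterpart is simply left unlinked, which reflects the point made above that the presence of a translation in a comparable corpus is not guaranteed.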

This PhD project aims to investigate a single, holistic methodology for ATECC. In line with the most successful trends in natural language processing, special attention will be paid to machine learning and deep learning approaches. In preparation, several comparable corpora have been collected and annotated to provide data. Every aspect of ATECC will be researched, from data collection (what is the impact of the corpus?), to monolingual ATE (can a bottom-up approach based on human term annotations shed light on the ambiguous characteristics of terms?), to bilingual term linking (how can the most informative strategies for bilingual alignment be combined?), to evaluation (how can an informative gold standard be constructed for ATECC?). Finally, covering the whole pipeline allows an exploration of potentially beneficial interactions between the different components. In short, the goal of this project is to use a holistic, bottom-up approach to investigate the best strategies for automatic term extraction from comparable corpora.