Building a new-generation corpus for empirical translation studies: the Dutch Parallel Corpus 2.0

Reynaert, R., Macken, L., Tezcan, A., & De Sutter, G.
Vincent Wang, Lily Lim and Defeng Li
New perspectives on corpus translation studies
Springer (Singapore)


This chapter introduces an updated version of the Dutch Parallel Corpus, a bidirectional parallel corpus of expert translations for the Dutch><English and Dutch><French language pairs. This revised version, which we dub Dutch Parallel Corpus 2.0, is dynamic in nature and contains 2.75 million words at the time of writing. The corpus is sentence-aligned, lemmatized and POS-tagged using the state-of-the-art natural language processing toolkit Stanza. Compared to its predecessor, the Dutch Parallel Corpus 2.0 contains more metadata about the translators (e.g. gender, education, experience) and the translation projects (e.g. L1/L2 translation, software used, degree and type of revision), in addition to the traditional metadata about the texts themselves (e.g. source and target language, intended audience, intended goal, register). This extensive set of metadata, together with a more principled and flexible register classification, is considered the main asset of the corpus. Both should stimulate corpus-based translation scholars to address more refined research questions about the linguistic and contextual factors that shape translated texts, and ultimately foster ideas and theories about the social and cognitive processes involved in translation performance. The corpus is freely available for research purposes via