The STEVIN project SoNaR aims to build a 500-million-word balanced reference corpus of contemporary (1954-present) written Dutch. Besides comprising no fewer than 38 text types, the corpus will also be balanced according to the number of speakers in the Dutch-speaking regions, with one-third of the texts coming from Flanders and two-thirds from the Netherlands. Texts will be gathered not only from conventional text types such as newspapers and reports, but also from new media such as chat, SMS, internet fora and email. A very important aspect of the SoNaR project is that Intellectual Property Rights (IPR) are settled for all text material included, so as to guarantee widespread availability.
We are always looking for more people interested in donating text material to the SoNaR corpus. This can be done using WINKLE or by contacting Orphée De Clercq. Many people and organizations have gone before you and donated text material, for which we are very grateful (you can find an overview here). Some Frequently Asked Questions are answered in this document (in Dutch).
- Currently we are also looking for people willing to donate text messages. More instructions on how you can help SoNaR can be found at www.sonarproject.be (in Dutch).
Semantic Annotation of 1MW
Within the SoNaR corpus, a core corpus of 1 million words is manually annotated with four semantic layers:
- Named Entities
- Coreference Relations
- Spatio-Temporal Relations
- Semantic Roles