Multi-modular text normalization of Dutch user-generated content

Publication type: A1
Publication status: Published
Authors: Schulz, S., De Pauw, G., De Clercq, O., Desmet, B., Hoste, V., Daelemans, W., & Macken, L.
Journal: ACM Transactions on Intelligent Systems and Technology
Volume: 7
Issue: 4

Abstract

Social media are a valuable source of data for a wide range of applications, which creates a need for tools that can process such data. However, the non-standard language used on social media poses problems for Natural Language Processing (NLP) tools, as these are typically trained on standard language material. We propose a text normalization approach to tackle this problem. More specifically, we investigate the usefulness of a multi-modular approach to account for the diversity of normalization issues encountered in user-generated content. We consider three different types of user-generated content written in Dutch (SNS, SMS and tweets) and provide a detailed analysis of the performance of the different modules and the overall system. We also perform an extrinsic evaluation by measuring the performance of a part-of-speech tagger, lemmatizer and named-entity recognizer before and after normalization.