Optimization Issues in Machine Learning of Coreference Resolution

Hoste, V.


This thesis presents a machine learning approach to the resolution of coreferential relations between nominal constituents in Dutch. It is the first automatic resolution approach proposed for this language. The corpus-based strategy was enabled by the annotation of a substantial corpus (ca. 12,500 noun phrases) of Dutch news magazine text with coreferential links for pronominal, proper noun and common noun coreferences. Based on the hypothesis that different types of information sources contribute to a correct resolution of different types of coreferential links, we propose a modular approach in which a separate module is trained per NP type. Lacking comparative results for Dutch, we also perform all experiments for the English MUC-6 and MUC-7 data sets, which are widely used for evaluation. Taking coreference resolution as our test case, we focus on the methodological issues which arise when performing a machine learning of language experiment. In order to determine the effect of algorithm ‘bias’ on learning coreference resolution, we evaluate the performance of two learning approaches which lie at the extremes of the eagerness dimension, namely TiMBL as an instance of lazy learning and RIPPER as an instance of eager learning. We show that apart from the algorithm bias, many other factors potentially play a role in the outcome of a comparative machine learning experiment. In this thesis, we study the effect of the selection of information sources, of parameter optimization and of sampling to cope with the skewed class distribution in the data. In addition, we investigate the interaction of these factors. In a set of feature selection experiments using backward elimination and bidirectional hillclimbing, we show the large effect feature selection can have on classifier performance. We also observe that the feature selection considered to be optimal for one learner cannot be generalized to the other learner.
Furthermore, in the parameter optimization or model selection experiments, we observe that the performance differences within one learning method are much larger than the performance differences between the methods. A similar observation is made in the experiments exploring the interaction between feature selection and parameter optimization, using a genetic algorithm as a computationally feasible way to achieve this type of costly optimization. These experiments also show that the parameter settings and information sources which are selected after optimization cannot be generalized. In the experiments varying the class distribution of the training data, we show that the two learning approaches behave quite differently when the classes are skewed and that they also react differently to a change in class distribution. A change of class distribution is primarily beneficial for RIPPER. However, we observe that once again no particular class distribution is optimal for all data sets, which makes this resampling also subject to optimization. In all optimization experiments, we show that changing any of the architectural variables can have great effects on the performance of a machine learning method. This calls into question conclusions in the literature that are based on the exploration of only a few points in the space of possible experiments for the algorithms to be compared. We show that there is a high risk that other areas in the experimental search space lead to radically different results and conclusions. At the end of the thesis, we move away from the instance level and concentrate on the coreferential chains, reporting results on the Dutch and English data sets. In order to gain an insight into the errors committed in the resolution, we perform a qualitative error analysis on a selection of English and Dutch texts.
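
As an illustration of what varying the class distribution of the training data can look like in practice, the following is a minimal Python sketch of downsampling the majority (negative) class to a chosen ratio. It is not the procedure used in the thesis: the function name `downsample_negatives`, the `neg_per_pos` parameter, the instance representation and the toy data are all assumptions made for this example.

```python
import random
from typing import List, Tuple

# Hypothetical instance format: (feature vector, label),
# where label 1 = coreferential pair and 0 = non-coreferential pair.
Instance = Tuple[list, int]


def downsample_negatives(
    instances: List[Instance], neg_per_pos: float, seed: int = 0
) -> List[Instance]:
    """Keep every positive instance and at most `neg_per_pos` negatives per positive."""
    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    positives = [inst for inst in instances if inst[1] == 1]
    negatives = [inst for inst in instances if inst[1] == 0]
    # Never ask for more negatives than are available.
    keep = min(len(negatives), int(round(neg_per_pos * len(positives))))
    resampled = positives + rng.sample(negatives, keep)
    rng.shuffle(resampled)
    return resampled


# Toy skewed data set (invented): 5 positive and 95 negative instances.
toy = [([i], 1) for i in range(5)] + [([i], 0) for i in range(95)]
balanced = downsample_negatives(toy, neg_per_pos=2.0)
```

Because no single class distribution is reported to be optimal for all data sets, a ratio such as `neg_per_pos` would itself be one more variable to tune per learner and per data set, for instance on held-out data.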