Learning-based detection of scientific terms in patient information

Hoste, V., Lefever, E., Vanopstal, K., & Delaere, I.
LREC 2008 : sixth international conference on language resources and evaluation
European Language Resources Association (ELRA) (Paris, France)
6th International conference on Language Resources and Evaluation (LREC 2008) (Marrakech, Morocco)


In this paper, we investigate a machine-learning-based approach to the specific problem of scientific term detection in patient information. Lacking lexical databases that differentiate between the scientific and popular nature of medical terms, we used local context, morphosyntactic, morphological and statistical information to design a learner that accurately detects scientific medical terms. This study is the first step towards the automatic replacement of a scientific term by its popular counterpart, which should have a beneficial effect on readability. We report an F-score of 84% for the prediction of scientific terms in an English and Dutch EPAR corpus. Since recasting the term extraction problem as a classification problem leads to a highly skewed data set, we rebalanced the data set by applying some simple TF-IDF-based and log-likelihood-based filters. We show that filtering indeed has a beneficial effect on the learner’s performance. However, the results of the filtering approach combined with the learning-based approach remain below those of the learning-based approach.
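
The abstract does not spell out how the TF-IDF-based filter works, so the following is only a minimal sketch of the general idea: score each token by TF-IDF and discard low-scoring candidates before classification, which removes many obvious negative instances and reduces the skew. The function names, the toy corpus and the threshold value are all hypothetical and are not taken from the paper.

```python
import math
from collections import Counter


def tfidf_scores(docs):
    """Return, per document, a dict mapping each token to its TF-IDF score."""
    n_docs = len(docs)
    # Document frequency: number of documents that contain each token.
    df = Counter(tok for doc in docs for tok in set(doc))
    scores = []
    for doc in docs:
        tf = Counter(doc)
        scores.append({
            tok: (count / len(doc)) * math.log(n_docs / df[tok])
            for tok, count in tf.items()
        })
    return scores


def filter_candidates(docs, threshold):
    """Keep only tokens whose TF-IDF score exceeds the threshold.

    The threshold is an illustrative assumption, not a value from the paper.
    """
    return [
        {tok for tok, score in doc_scores.items() if score > threshold}
        for doc_scores in tfidf_scores(docs)
    ]


# Toy, pre-tokenised corpus (hypothetical; not drawn from the EPAR corpus).
docs = [
    ["the", "patient", "may", "develop", "thrombocytopenia"],
    ["the", "patient", "should", "take", "the", "tablet"],
    ["the", "doctor", "may", "adjust", "the", "dose"],
]
kept = filter_candidates(docs, threshold=0.1)
```

In this toy setting, tokens that occur in every document, such as "the", receive a score of zero and are dropped, while rarer tokens survive as candidates for the classifier. A log-likelihood-based filter would follow the same pattern with a different scoring function.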