How relevant is part-of-speech information to compute similarity between Greek verses in a graph database?

Publication type
C1
Publication status
Published
Authors
Swaelens, C., Deforche, M., De Tré, G., De Vos, I., & Lefever, E.
Editor
Colin Swaelens, Maxime Deforche, Ilse De Vos and Els Lefever
Series
Proceedings of the first workshop on Data-driven Approaches to Ancient Languages (DAAL 2024)
Pagination
33-43
Publisher
Language & Translation Technology Team (LT3) (Ghent)
Conference
First Workshop on Data-driven Approaches to Ancient Languages (DAAL 2024) (Ghent, Belgium)

Abstract

This paper presents the automatic linguistic analysis of the Database of Byzantine Book Epigrams (DBBE) on the one hand, and its representation and integration in a graph database on the other. First, we give a comprehensive description of the DBBE data, which we aim to enrich with a complete morphological analysis. The presented methodology explores the possibilities of fine-tuning DBBErt, a transformer-based language model trained on pre-Modern and Modern Greek. Second, the automatically annotated epigrams are integrated in a graph database, a new way to represent the relatedness of this entangled corpus. With the graph database, we can compute similarity between words, verses and epigrams. Within the scope of this paper, we computed three similarity measures between the verses: a complete orthographic similarity, a similarity based on the automatically assigned part-of-speech information, and a final measure that combines both orthography and part-of-speech information. The results of these similarity measures provide scholars with new visual representations of relations between (parts of) texts, which is beneficial for new critical editions and commentaries.
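
The abstract does not state how the three verse-level measures are calculated, so the following is only a minimal sketch of one way such measures could be set up, not the method used in the paper. It assumes a character-level ratio from Python's `difflib.SequenceMatcher` for orthography, the same ratio over part-of-speech tag sequences, and a weighted average to combine them. The function names, the `weight` parameter, the variant spelling, and the hand-assigned tags are all hypothetical and chosen for illustration.

```python
from difflib import SequenceMatcher


def orthographic_similarity(verse_a: str, verse_b: str) -> float:
    """Character-level similarity in [0, 1] (assumed measure, not the paper's)."""
    return SequenceMatcher(None, verse_a, verse_b).ratio()


def pos_similarity(tags_a: list[str], tags_b: list[str]) -> float:
    """Similarity in [0, 1] between two part-of-speech tag sequences."""
    return SequenceMatcher(None, tags_a, tags_b).ratio()


def combined_similarity(
    verse_a: str,
    tags_a: list[str],
    verse_b: str,
    tags_b: list[str],
    weight: float = 0.5,  # hypothetical weighting between the two measures
) -> float:
    """Weighted average of the orthographic and part-of-speech similarities."""
    ortho = orthographic_similarity(verse_a, verse_b)
    pos = pos_similarity(tags_a, tags_b)
    return weight * ortho + (1.0 - weight) * pos


# Illustrative input: a verse and a made-up variant spelling of it.
# The tags are assigned by hand for this example, not produced by DBBErt.
verse_1 = "ὥσπερ ξένοι χαίρουσιν ἰδεῖν πατρίδα"
verse_2 = "ὥσπερ ξένοι χαίρουσι ἰδῆν πατρίδα"
tags_1 = ["ADV", "NOUN", "VERB", "VERB", "NOUN"]
tags_2 = ["ADV", "NOUN", "VERB", "VERB", "NOUN"]

print(orthographic_similarity(verse_1, verse_2))
print(pos_similarity(tags_1, tags_2))
print(combined_similarity(verse_1, tags_1, verse_2, tags_2))
```

In a sketch like this, two verses that differ only in spelling keep identical tag sequences, so the part-of-speech measure can still link them when the orthographic score drops. In a graph database, scores of this kind could be stored as properties on edges between verse nodes, though the abstract does not describe the actual schema.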