In this study we will combine insights and resources from the domain of second language acquisition (SLA) with the power of generative large language models (LLMs), which have proven highly capable across a range of language tasks, from text summarisation to machine translation. Specifically, this study will evaluate the use of retrieval-augmented generation (RAG), which grounds an LLM's output in information retrieved from a self-provided knowledge base, to automatically generate feedback for SLA purposes.
The acquisition of a foreign/second language encompasses the acquisition of a complex and multifaceted communication system, of which grammar is a main component. Grammatical concepts (e.g., conditional clauses) are commonly practised through closed-ended exercises such as fill-in-the-blanks (e.g., "If they _____ how it would turn out, they would have reacted differently." [to know]). One of the factors determining the learning success of such exercises is receiving adequate feedback, which, in turn, adds substantially to teachers' already considerable workload. To overcome this "feedback bottleneck", computer-assisted methods for automated feedback generation (AFG) have been proposed. The simplest, rule-based AFG method is to check whether the learner's response occurs in a predefined list of correct answers. This, however, precludes targeted and individualised feedback on why a given answer was wrong.
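To make this limitation concrete, the following minimal sketch (in Python) illustrates such rule-based answer checking against a predefined answer list; the exercise, accepted answers, and learner responses are illustrative and not taken from the study's dataset.

```python
# Minimal sketch of rule-based answer checking: the learner's response is only
# compared against a predefined list of accepted answers, so no explanation of
# *why* a wrong answer is wrong can be produced.

def check_answer(response: str, accepted_answers: list[str]) -> bool:
    """Return True if the normalised response matches any accepted answer."""
    normalised = response.strip().lower()
    return any(normalised == a.strip().lower() for a in accepted_answers)

exercise = {
    "prompt": "If they _____ how it would turn out, they would have reacted differently.",
    "target_verb": "to know",
    "accepted": ["had known"],
}

for learner_response in ["had known", "knew", "would know"]:
    verdict = "correct" if check_answer(learner_response, exercise["accepted"]) else "incorrect"
    print(f"{learner_response!r}: {verdict}")
```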
LLMs offer a means to address this limitation: using the exercise description and the learner's response as input, LLMs can be prompted to output personalised written feedback. Nevertheless, these LLMs lack educational specialisation and grounding in specific course materials, meaning that the accuracy and pedagogical value of their output cannot be taken for granted. Research to analyse and evaluate LLM performance in this regard is thus needed before they can be safely implemented as "feedback assistants" (e.g., to be used by teachers or to be integrated into autonomous Intelligent Language Tutoring Systems).
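As a rough illustration of this prompting setup, the sketch below feeds the exercise description and the learner's response to a locally hosted, instruction-tuned open-source model via the Hugging Face transformers pipeline; the model name, prompt wording, and generation parameters are illustrative assumptions rather than the study's actual configuration.

```python
# Sketch of LLM-based feedback generation from the exercise and the learner's
# response alone (no grounding in course materials). Model choice and prompt
# wording are assumptions for illustration only.
from transformers import pipeline

generator = pipeline(
    "text-generation",
    model="mistralai/Mistral-7B-Instruct-v0.2",  # any locally available instruction-tuned model
)

exercise = ("If they _____ how it would turn out, they would have reacted "
            "differently. [to know]")
learner_response = "knew"

prompt = (
    "You are an English grammar tutor.\n"
    f"Exercise: {exercise}\n"
    f"Learner's answer: {learner_response}\n"
    "Briefly state whether the answer is correct and, if not, explain why, "
    "referring to the relevant grammar rule."
)

result = generator(prompt, max_new_tokens=200, do_sample=False)
print(result[0]["generated_text"])
```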
The current study aims to contribute to this new area of research by (1) presenting a dataset containing a total of 675 authentic student responses to three types of closed-ended grammar exercises for three different languages (Dutch, English, and Spanish); (2) using this dataset to develop a RAG system that automatically generates feedback grounded in course materials serving as the knowledge base; and (3) conducting a human evaluation experiment in which teachers assess the automatically generated feedback for accuracy and pedagogical validity.
Regarding the RAG system, we will use open-source LLMs (e.g., Llama or Mistral) and compare two architectures: a baseline "dummy" system in which the course materials are included directly in the prompt, versus an advanced system that retrieves relevant passages from a vector database built on the knowledge base. In summary, this research will contribute to understanding how retrieval-augmented generation can leverage educational content to improve automated feedback quality in language learning contexts.
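To clarify the architectural contrast, the sketch below shows the core difference between the two systems: the baseline places all course materials in the prompt, whereas the retrieval-based system embeds the materials and selects only the passages most similar to the exercise and the learner's response. The passage texts, embedding model, and number of retrieved passages are illustrative assumptions, not the study's actual configuration.

```python
# Sketch of the two prompt-construction strategies under comparison, using
# sentence-transformers for embedding-based retrieval. Passages, model name,
# and top_k are placeholder assumptions, not the study's actual setup.
from sentence_transformers import SentenceTransformer, util

course_passages = [
    "The third conditional uses 'if' + past perfect in the if-clause.",
    "The present perfect describes past events with present relevance.",
    "Modal verbs such as 'would' appear in the main clause of conditional sentences.",
]

embedder = SentenceTransformer("all-MiniLM-L6-v2")
passage_embeddings = embedder.encode(course_passages, convert_to_tensor=True)

def retrieve(query: str, top_k: int = 2) -> list[str]:
    """Return the top_k course passages most similar to the query."""
    query_embedding = embedder.encode(query, convert_to_tensor=True)
    scores = util.cos_sim(query_embedding, passage_embeddings)[0]
    ranked = scores.argsort(descending=True)[:top_k]
    return [course_passages[int(i)] for i in ranked]

query = ("Exercise: If they _____ how it would turn out, they would have "
         "reacted differently. [to know] -- learner answered 'knew'")

baseline_context = "\n".join(course_passages)   # baseline: all materials in the prompt
rag_context = "\n".join(retrieve(query))        # RAG: only the retrieved passages

print("Retrieved context:\n" + rag_context)
```

In a full pipeline, either baseline_context or rag_context would then be prepended to a feedback prompt such as the one sketched earlier, which is the comparison the human evaluation experiment is designed to assess.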