The MultiplEYE Text Corpus: Towards a Diverse and Ever-Expanding Multilingual Text Corpus
Abstract
We present the MultiplEYE Text Corpus, a large-scale, document-level, multi-parallel resource designed to advance cross-linguistic research on reading and language processing. The corpus provides paragraph-level alignment for texts in thirty-nine languages spanning seven language families and seven scripts. Unlike many existing multilingual corpora, a substantial number of documents were originally written in languages other than English, reducing English-centric bias and supporting more typologically diverse investigations. The texts are carefully selected to balance linguistic richness with experimental feasibility, particularly for eye-tracking-while-reading studies. Developed within a multi-lab initiative, the MultiplEYE Text Corpus follows unified translation, alignment, and experimental design guidelines to ensure cross-linguistic comparability. Its inclusion of texts varying in type and difficulty enables research on discourse-level processing, genre effects, and individual differences across a wide range of languages. The text corpus and accompanying metadata provide a robust foundation for multilingual psycholinguistic and computational modeling research. Data and materials are publicly available.
One. Introduction
Document-level, multi-parallel text corpora play an important role in advancing cross-lingual research in theoretical linguistics, natural language processing, and psycholinguistics. In contrast to sentence-based corpora, they enable the investigation of discourse-level phenomena and the development of context-aware language technologies. In this paper, we present the MultiplEYE Text Corpus, a multi-parallel document-level resource that provides paragraph-aligned texts in thirty-nine languages across seven language families and seven scripts. The text corpus is specifically designed for reading experiments using eye-tracking, but it can also support other behavioral methods, such as self-paced reading, and neurophysiological methods, including electroencephalography in co-registration with eye-tracking. The texts are compact enough for use in experimental settings, yet sufficiently long and diverse to enable the study of discourse-level processing and comparisons across text types and languages. The corpus can also be used to investigate research questions outside the scope of behavioral and neurophysiological research. For example, the multi-parallel nature of the corpus makes it possible to conduct cross-linguistic research that goes beyond the comparison of individual language pairs. Moreover, the corpus's coverage of both typologically distinct and closely related languages, together with varied scripts, supports research on language contact, cross-linguistic universals, and comparative analysis across different scripts. Unlike in many existing datasets, a substantial share of the texts in the MultiplEYE Text Corpus was originally written in languages other than English, which helps reduce the English-centric bias that is common in linguistic resources.
In addition, the corpus includes texts of varying type and difficulty; this diversity supports the investigation of a broad range of research questions that are often difficult to address because of resource scarcity. For example, recent research has demonstrated that text genre affects readers' eye movement patterns and interacts with established psycholinguistic phenomena, such as predictability effects. The MultiplEYE Text Corpus enables detailed investigation of such questions across a wide range of languages.
The corpus was created within the multi-lab MultiplEYE European Cooperation in Science and Technology Action, which aims to build a large multilingual eye-tracking-while-reading dataset that supports cross-linguistic research in psycholinguistics and multilingual modeling. The MultiplEYE project provides translation and alignment guidelines and establishes the infrastructure required for cross-linguistic comparability through a unified experimental design (covering stimulus selection and layout, procedure, and pre-processing) and shared FAIR-compliant resources for software and metadata management, storage, and sharing. The main outcome of the initiative will be a large, publicly available dataset of eye-tracking data collected across multiple European and non-European languages, with a special focus on the inclusion of low- and very low-resource languages.
In this paper, we present the multilingual text stimuli used in the MultiplEYE eye-tracking-while-reading experiment. We also document the text selection and translation procedures, describe cross-linguistic differences between the texts, and provide linguistic annotations to enable the use of the corpus beyond the scope of the MultiplEYE initiative.