Ruse of reuse: Detecting text-similarity with AI in historical sources

Europe/Vienna
Seminary rooms 7 and 8, 5th Floor (Georg-Coch-Platz 2, Austrian Academy of Sciences)

Description

The volume of text available in digital corpora has grown immensely thanks to numerous ongoing digitization projects, and the continued success of handwritten text recognition (HTR) will only expand these collections further. Consequently, the need to take full advantage of these vast resources is pressing. The ability to identify text-reuse and measure text-similarity is thus more important than ever, offering the potential to reveal connections that have never been seen before. AI is the key to making this possible.

This two-day workshop is a continuation of our 2023 event, Finding Connections: Using AI and DNA Sequencing to Find Similarities and Parallels in Medieval Texts, now with an extended focus on general historical sources.

March 5th will be devoted to research presentations across two sessions. In the afternoon, William Mattingly will deliver an associated KIMAFO Lecture: From Physical Object to Structured Data: Building AI Pipelines in Cultural Heritage.

March 6th will take a more informal, hands-on workshop approach. Two morning sessions will be dedicated to work-in-progress reports, plans for implementing AI in various projects, experimental approaches, and open discussions. The afternoon will conclude with a closed practical programming workshop focusing on the application of various AI methods, led by Martin Roček and Gleb Schmidt.

Organizers

  • Jan Odstrčilík, Institute for Medieval Research, Austrian Academy of Sciences
  • Martin Roček, Faculty of Arts, Charles University and Institute for Medieval Research, Austrian Academy of Sciences
  • Gleb Schmidt, ‘The Social Life of Early Medieval Normative Texts’ SOLEMNE (canones.org) (ERC Cons: 101087979), Radboud University

Organizing institutions

  • Institute for Medieval Research, Austrian Academy of Sciences
  • ‘The Social Life of Early Medieval Normative Texts’ SOLEMNE (canones.org) (ERC Cons: 101087979), Radboud University

Cooperation

  • Austrian Center for Digital Humanities, Austrian Academy of Sciences
  • Machine Learning Topical Platform, Austrian Academy of Sciences

Registration

Ruse of Reuse - Registration

  • Thursday 5 March
    • 1
      Registration
    • Welcome
      • 2
        Welcome and introduction
    • Session 1
      • 3
        Visualising semantic similarity
        Speakers: Martin Roček (Faculty of Arts, Charles University; IMAFO, ÖAW), Jan Odstrčilík (IMAFO, ÖAW)
      • 4
        Tackling the Granularity Matching Problem in Hierarchical Texts with Multi-Resolution Embeddings

        Scholastic corpora present distinctive challenges for recommender systems because of their deeply nested textual hierarchies. In such contexts, identifying related texts is insufficient; recommendations must also operate at an appropriate level of granularity, matching the user’s current focus without being overly narrow or excessively broad. This talk addresses the granularity matching problem by expanding the notion of a “document” and generating multiple types of embeddings across hierarchical levels, enabling similarity to be modeled at different conceptual scales. It further suggests that effective recommendation in these corpora is not only a retrieval problem but also a user-interface problem and demonstrates how interface design can support contextualized and transparent recommendations.

        Speaker: Jeffrey Witt (Loyola University Maryland)
      • 5
        What is Semantic Search Good for in Scholastic Corpora?

        Given the highly intertextual character of scholastic literature, locating the exact source of auctoritates – that is, the authoritative statements commonly evoked in medieval quaestiones – is both a basic, though notoriously demanding, editorial task and an indispensable precondition for more elaborate research. Most available computational solutions aiding this task map direct lexical signals, which, while sufficient to track many cases of reuse, leave out relevant instances of paraphrased or otherwise distorted references. This paper reports on experiments with an alternative approach that relies on similarity search using contextual word embeddings. I will discuss the position of this approach compared to alternative methods (especially the so-called fuzzy search and Retrieval Augmented Generation), highlighting the differences in infrastructural requirements and data model. Focusing on this last aspect, I will discuss the details of the implementation that I tested on a corpus of Stephen Langton’s Quaestiones Theologiae and a selection of its known sources (Parisian literary production c. 1200). I will argue that semantic search seems to offer a viable solution for middle-sized corpora (~10M words), while being less likely to replace fuzzy search as a primary method of tracking large-scale text reuse in sizeable corpora.

        Speaker: Jan Maliszewski (Wydział Filozofii, Uniwersytet Warszawski)
    • 12:00
      Lunch Break (lunch not provided)
    • Session 2
      • 6
        Automatic cataloguing of the Books of Hours. Textual unit identification and structural annotation

        The talk will deal with rule-based approaches, which allow us to establish a baseline that will be further used to assess the benefits of the LLMs (both in detection quality and computational cost).

        Speaker: Svetlana Yatsyk (IRHT, CNRS)
      • 7
        Does AI break the chains of Prometheus? Or how AI can advance the analytical possibilities of the Latin Text Archive (LTA)

        Artificial Intelligence has long played a role in Digital Humanities. Since the 2000s, methods using Machine Learning and Deep Learning have been tested and applied successfully, especially for tasks such as detecting text reuse and comparing texts. The Latin Text Archive (LTA), a platform designed for historical semantic analysis of medieval Latin texts, once offered such features. However, due to challenges with maintenance and security, these capabilities had to be discontinued. Today, new advancements in Agentic Coding and powerful AI technologies present opportunities to recreate previous analytical tools and develop entirely new ones for genuine semantic analysis. This article discusses the current problems faced by Digital Humanities projects and examines possible solutions.

        Speaker: Tim Geelhaar (Goethe Universität)
      • 8
        How to Align Medieval Prose Texts and Other Impossibilities

        The lecture takes its starting point from a concrete practical example, namely the planned digital edition of “Der Heiligen Leben, Redaktion”, a late medieval collection of legends about saints. This collection exists in two different text versions, created in quick succession, whose differences are relevant for the cultural-historical context of the legends. Since the texts are in prose, it is difficult to create a synopsis, especially since the changes go beyond minor differences on the surface of the text and in some cases are merely semantically comparable paraphrases. The application of classical collation methods (such as Levenshtein distance) is not sufficient here; instead, embedding-based approaches are recommended. The presentation explores the applicability of different models and ultimately determines the extent to which a completely LLM-based approach can be used to address the alignment problem even more effectively.

        Speaker: Gabriel Viehhauser (Universität Wien)
    • 14:30
      Coffee break
    • 9
      KIMAFO Lecture - From Physical Object to Structured Data: Building AI Pipelines in Cultural Heritage
      Speaker: William Mattingly (Yale University)
  • Friday 6 March
    • Session 3
      • 10
        Operationalising classical antiquity as a culture of reference(s)
        Speaker: Marin Le Bris (Radboud University)
      • 11
        A multimodal LLM for Ancient Greek: initial results and future perspectives

        In the fall of 2025, the Decoding Antiquity project was launched by Anna Dolganov in collaboration with Mistral AI and Reply AI with the aim of building advanced AI systems for ancient languages. We have started with a multimodal LLM for Ancient Greek, and our next goal will be to train a Vision Language model capable of transcribing hundreds of thousands of undeciphered Ancient Greek papyri. This presentation is a preliminary report of our progress and the broader vision of the project.

        Speaker: Anna Dolganov (ÖAI, ÖAW)
      • 12
        Data Augmentation for Capturing Variance in Manuscript Traditions
        Speaker: William Mattingly (Yale University)
    • 10:30
      Coffee break
    • Session 4
      • 14
        Text Reuse and the Social Life of Early Medieval Canon Law
        Speakers: Sven Meeder (The Social Life of Early Medieval Normative Texts (SOLEMNE) project, Radboud University), Gleb Schmidt (The Social Life of Early Medieval Normative Texts (SOLEMNE) project, Radboud University)
      • 15
        Tracing the Tradition of the Roman Conquest of Jerusalem in Latin Texts (c.400-c.1300): A Database with c.2500 Entries

        My current project investigates how the Roman conquest of Jerusalem (70 CE) was deployed in medieval Latin texts between c.400 and c.1300. This includes building up a database that catalogues each occurrence that I can find, counting at the moment c.2200 entries, and eventually probably c.2500 entries. Deploying the Roman conquest was both an intense textual engagement with biblical references such as Lk. 19.41 and a process of recycling (but also adapting) what earlier authors such as Gregory the Great had already written about the event. This talk shall outline some of the project's strategies for determining the use of preexistent textual materials in the pertinent texts, and how this dimension is presented in the database for the use of future scholarship.

        Speakers: Alexander Marx (IMAFO, ÖAW), Peter Andorfer (ACDH, ÖAW)
    • Closed workshop