DARIAH Annual Event 2026
Rome, Italy. May 26–29, 2026
Conference Agenda
Overview and details of the sessions of this conference. Please select a date or location to show only the sessions on that day or at that location. Please select a single session for a detailed view (with abstracts and downloads, if available).
Please note that all times are shown in the time zone of the conference (CEST).
Agenda Overview
Session
Topic: Applied AI and Reproducible Workflows: Sustainable Infrastructures for Public Knowledge
Presentations
11:30am - 11:45am
Iterative Layout Detection Training for Digitising Swedish Medical Periodicals (1781–2011): Fine-Tuning Layout Detection Models in the SweMPer Pipeline
Uppsala University, Sweden

Historical printed periodicals pose particular challenges for digitisation: varying typography, inconsistent page layouts, degraded printing quality, and diverse content types such as articles, tables, figures, and advertisements. Within the framework of the SweMPer project, our goal is to build a national-scale digital archive of Swedish medical periodicals spanning over two centuries. To guarantee high-quality digitisation and to make the material accessible and reusable for both researchers and the public, a reliable layout detection and segmentation infrastructure is essential.

The work presented here is part of a larger, machine-learning-based pipeline for digitising historical printed materials. We describe a human-in-the-loop workflow for fine-tuning layout detection models tailored to the idiosyncrasies of Swedish medical periodicals. The process began with conceptual discussions with domain experts, followed by the careful selection and manual annotation of representative pages drawn from across the multi-century corpus. These annotations captured key structural elements (texts, titles, images, advertisements, tables, and more), forming a domain-specific ground-truth dataset that reflects the diversity and complexity of historical medical publications. Using this dataset, we performed an initial fine-tuning of a Mask R-CNN-based model via the LayoutParser toolkit, establishing a baseline capable of segmenting the main layout components of the scanned pages. To further strengthen model performance, we adopted an iterative refinement strategy in which low-confidence predictions and underrepresented classes were systematically identified.
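The selection logic behind such an iterative refinement round can be sketched in a few lines: flag predictions that are low-confidence or belong to an underrepresented class, then queue the affected pages for re-annotation. All page ids, class names, and thresholds below are illustrative assumptions, not values from the SweMPer pipeline.

```python
from collections import Counter

# Hypothetical prediction records (page id, predicted class, confidence);
# values are illustrative, not taken from the SweMPer corpus.
predictions = [
    ("p1", "text", 0.97), ("p2", "text", 0.95), ("p3", "text", 0.92),
    ("p1", "table", 0.41), ("p2", "advert", 0.55), ("p3", "figure", 0.88),
]

def select_for_reannotation(preds, conf_threshold=0.60, rare_max=1):
    """Flag (page, class) pairs whose prediction is low-confidence or
    whose class is underrepresented in this round of predictions."""
    counts = Counter(cls for _, cls, _ in preds)
    rare = {cls for cls, n in counts.items() if n <= rare_max}
    return sorted({(page, cls) for page, cls, conf in preds
                   if conf < conf_threshold or cls in rare})

print(select_for_reannotation(predictions))
# → [('p1', 'table'), ('p2', 'advert'), ('p3', 'figure')]
```

Each round of re-annotation then feeds these flagged pages back into the training set before the model is re-trained.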
Targeted re-annotation and the addition of new examples enabled successive rounds of re-training that substantially improved robustness and evaluation metrics. Building on this foundation, we then applied a transfer-learning approach using a transformer-based Co-DETR model with a Vision Transformer backbone. After pre-training on the large PubLayNet dataset and subsequent fine-tuning on the SweMPer data, this model achieved markedly higher detection accuracy. We report quantitative improvements over the baseline across metrics such as mean average precision and recall. More broadly, we show how this workflow, combining domain-expert discussions, domain-specific annotation, human-in-the-loop sampling, and state-of-the-art model adaptation, can be integrated into sustainable digitisation pipelines and adapted for other historical document collections. Our results demonstrate that strategic investment in annotation and iterative model tuning enables scalable, high-quality digitisation of culturally and historically significant corpora, supporting future digital humanities research, long-term archival preservation, and public access.

11:45am - 12:00pm
Actionable Workflows: Infrastructures Enabling Collaborative, Reproducible and Sustainable Research
1DARIAH-EU; 2University of Alicante; 3CNR, Istituto Opera del Vocabolario Italiano; 4Galaxy Europe

Creating reproducible research workflows does not come naturally to arts and humanities researchers. Thanks to the Social Sciences and Humanities Open Marketplace (SSHOMP), narrative workflows that describe and contextualise research tools and services are becoming more commonplace (Barbot et al., 2024; Candela et al., 2023). However, enabling researchers to repeat, reproduce or reuse existing workflows presupposes that they are actionable. This requires appropriate infrastructure and consideration of different modes of reproduction, taking into account research questions, datasets and methods of analysis or tools (Schöch, 2023). Structuring workflows for others to understand and adapt is crucial for reproducibility and reuse (Rule et al., 2019).

From the perspective of the DARIAH Annual Event 2026 theme, actionable workflows transform infrastructures from static collections of services into collaborative research platforms. They encourage transparent sharing of methods and increase the social value of publicly funded research by enabling scholars to build on each other's work. Despite growing awareness of workflows' role in implementing the FAIR principles, their integration into scholarly practice remains challenging. Experiences from the SSHOMP and the ATRIUM project show that guidelines alone are insufficient, especially in interdisciplinary contexts (Oberbichler et al., 2021). Writing workflows is iterative and time-consuming, requires editorial oversight and lacks clear academic incentives. In practice, workflows are very seldom reused, and researchers often struggle to assess how easily they can be adapted beyond the dataset for which they were originally developed.
To address these challenges, the OSCARS project, which promotes the uptake of Open Science in Europe, explores 'composability': the ability to combine modular components into more complex and reconfigurable systems. Workflow engines play a key role by facilitating the implementation of actionable workflows. For example, Galaxy, an open-source data analysis platform originally developed in bioinformatics (Galaxy Community, 2024), now supports Digital Humanities (Schneider & Leendertse, 2025). Galaxy enables the creation and execution of reproducible workflows without programming, supports the use of High-Performance Computing (HPC), and tracks all analytical steps. In OSCARS, Galaxy is used to reproduce an OCR-based workflow for analysing historical newspapers.

Another example is AEON (dAriah sErvice Oriented iNfrastructure), developed by DARIAH-IT within the H2IOSC project, which supports the transition to Open Science in Italy by federating the national nodes of four European Research Infrastructures: CLARIN, DARIAH, E-RIHS, and OPERAS. AEON is a Low Code / No Code platform providing advanced service-orchestration capabilities, allowing researchers to create and execute reproducible, secure, and scalable research pipelines. Within H2IOSC, AEON automates the FAIRification and semantic transformation of heterogeneous digital collections, working towards the creation of interoperable knowledge bases from diverse cultural heritage resources and ensuring their long-term accessibility and reuse.

This paper focuses on two examples, Galaxy and AEON, within a growing infrastructure ecosystem of services and platforms, including EOSC, ECCCH, and HPC-based initiatives. It argues that DARIAH can bridge the 'last mile' (Koureas, 2016) between infrastructures and research communities by testing and guiding the adoption of actionable workflows.
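The idea of composability described above can be illustrated with a minimal sketch: each step is a small self-contained function, and a workflow is simply an ordered composition of such steps. The step names and the toy OCR cleanup are hypothetical illustrations, not functionality taken from Galaxy or AEON.

```python
from functools import reduce

# Each step is a small, self-contained transformation; a workflow is
# an ordered composition of steps. All steps here are toy examples.
def ocr_cleanup(page):
    return page.replace("~", "e")      # stand-in for a real OCR correction step

def normalise(text):
    return " ".join(text.split())      # collapse whitespace

def tokenise(text):
    return text.lower().split()        # final analysis-ready tokens

def compose(*steps):
    """Build a reusable, reconfigurable workflow from modular steps."""
    return lambda data: reduce(lambda acc, step: step(acc), steps, data)

workflow = compose(ocr_cleanup, normalise, tokenise)
print(workflow("  Th~  Daily   H~rald "))  # → ['the', 'daily', 'herald']
```

Because each step is independent, the same components can be recombined into a different pipeline (e.g. omitting tokenisation) without rewriting any of them, which is the reconfigurability that workflow engines provide at scale.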
Working across the spectrum of workflow types has significant potential to foster open, reproducible and collaborative research in the digital arts and humanities.

12:00pm - 12:15pm
Open, On-Premise AI for Public Knowledge in German Administrations
Hasso Plattner Institute, Germany

Recent advances in language technologies have renewed interest in how public knowledge can be made more accessible, navigable, and understandable. Yet in public administrations, the deployment of AI systems raises specific infrastructural, ethical, and political questions that resemble those encountered in academic contexts. This practice-based, reflective infrastructure paper presents three AI-based knowledge infrastructures developed with German public administrations and reflects on their implications for public engagement, digital sovereignty, and the long-term stewardship of public knowledge. The contribution is authored institutionally by the AI Service Centre Berlin-Brandenburg and is positioned from a knowledge-infrastructure perspective.

The paper discusses three concrete deployments that are architecturally distinct. First, a retrieval-augmented generation (RAG) system for the German Bundestag, enabling complex semantic queries over the archive of the Bundestag's Scientific Service and supporting in-depth exploration of scientific studies on politically relevant research questions. The system is based on a modular RAG backbone that can be instantiated for different authoritative document collections and extended with heterogeneous sources. Second, a translation system, used by municipal administrations in Brandenburg, that renders hard-to-process administrative texts into Leichte Sprache (easy language). Third, a protocolling system that transcribes committee meetings and drafts minutes from audio recordings. This system, based on automatic speech recognition and a fine-tuned large language model, is deployed in the Landtag Brandenburg and in Brandenburg municipalities. Across these heterogeneous applications, a set of shared infrastructural design principles emerges.
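The retrieval step of such a RAG backbone can be sketched minimally as ranking a document collection against a query and keeping document identifiers so that every generated answer remains traceable to its sources. The corpus, the naive word-overlap scoring, and the document ids below are illustrative assumptions, not the deployed Bundestag system.

```python
# Minimal sketch of RAG retrieval with source traceability: documents
# are ranked against the query and returned WITH their ids, so that
# generated answers can always cite where their content came from.
# Corpus contents and scoring are toy assumptions.
corpus = {
    "doc-17": "study on renewable energy subsidies in federal states",
    "doc-42": "analysis of data protection rules for municipal registers",
    "doc-88": "report on parliamentary committee transcription practice",
}

def retrieve(query, k=2):
    """Rank documents by naive word overlap with the query and return
    the top-k (doc_id, text) pairs for downstream prompt construction."""
    q = set(query.lower().split())
    scored = sorted(corpus.items(),
                    key=lambda kv: len(q & set(kv[1].split())),
                    reverse=True)
    return scored[:k]

hits = retrieve("data protection in municipal registers", k=1)
print(hits[0][0])  # → doc-42
```

A production system would replace the word-overlap score with dense embeddings, but the traceability property, carrying document ids from retrieval through to the generated answer, is the same design principle the paper emphasises.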
All systems are fully open source, optimized for local, on-premise deployment on low-power GPUs, and deliberately avoid dependencies on external cloud APIs. They are designed to comply with European privacy and data protection regulations and to process sensitive political and administrative data. Local deployment is motivated not primarily by performance, but by digital sovereignty, cost and accessibility, operational robustness within administrations, and the ability to treat AI systems as long-term public infrastructure rather than transient services.

Conceptually, the paper frames these systems as knowledge infrastructures rather than decision-making systems. They structure access, retrieval, translation, and summarization of public knowledge without automating political judgment or administrative discretion. This distinction is central to democratic accountability and trust: the systems aim to improve access, comprehensibility, and navigability of politically and socially relevant information, thereby supporting transparency and participation, while leaving normative decisions firmly with human actors.

The paper explicitly addresses safeguards that are treated as preconditions for trust in public-sector AI, including mechanisms to mitigate hallucination, systematic traceability of sources in RAG outputs, and human-in-the-loop verification workflows. It reflects on the opportunities and limitations of local language models, the infrastructural requirements of privacy-compliant AI, and the societal implications of embedding such systems in everyday administrative practice. Finally, it highlights the reusability of the three systems for external researchers, both as a basis for domain-specific knowledge infrastructures and as a research instrument for exploratory, qualitative, and mixed-methods work on large document collections.
In doing so, the paper aligns with DARIAH's mission to build ethical, resilient infrastructures of engagement between academia, public institutions, and society.

12:15pm - 12:30pm
Publication models and infrastructures as a leverage for sustainable community-building
1Max Weber Foundation; 2DARIAH; 3Belgrade Center for Digital Humanities; 4FIZ Karlsruhe – Leibniz-Institut für Informationsinfrastruktur; 5OPERAS/IBL PAN

If you are looking for a community-driven scholarly publication model, Diamond Open Access is a reliable option with many advantages. Namely, the model charges no fees to either authors or readers, treating scientific knowledge as a public good rather than a commercial commodity. What is more, the Diamond model draws on the work of Institutional Publishing Service Providers (IPSPs), which provide institutional support for journals (Armengou et al. 2023). From the perspective of global engagement, this approach makes multilingual scientific knowledge openly available, accessible and reusable for scholars, practitioners and non-affiliated scholars alike, easing its adoption in lesser-resourced countries (Raju & Nkrumah Kuagbedzi 2025).

These practical and ethical considerations were seminal in the choice of a publication venue for DARIAH's journal Transformations. The newly published overlay journal is hosted by the French publishing platform Episciences and includes contributions based on a variety of digital approaches in the arts and humanities. Concretely, the publication process is backed by public Open Access repositories (Pinfield 2009; Rousi & Laakso 2022). The journal has developed a demanding editorial process (Transformations Editorial Board 2025), giving way to high-quality publications (Baillot 2025). The dialogue between authors, reviewers and editors is a key component (Gouzi 2025), and Transformations benefits from the community-building work conducted by DARIAH over the past decades. Technology alone is insufficient to sustain the Diamond-OA ecosystem across Europe, and research infrastructures can empower actors to address the related challenges collaboratively.
In 2025, OPERAS launched the European Diamond Capacity Hub (EDCH) programme to provide support and bridge the competence gap between editors and IPSPs. The EDCH empowers the community with a dedicated Training Platform, a Registry & Forum, and specialised guidelines (Maryl et al. 2024).

National and disciplinary specificities play a key role in the adoption of community-driven publication practices. In a country like Germany, where a rigid publishing culture has long been dominant, the digital transformation is now facilitating the adoption of more open practices. Open Science has become part of national funding strategies (DFG 2025a; DFG 2025b). The nationally funded service centre for Diamond-OA (SeDOA, Stäcker et al. 2025) has now established a digital research infrastructure to support Diamond-OA across all disciplines. Through the integration of the review database infrastructure zbMATH Open (Deb et al. 2025), the example of a field marked by a long history of preprint-based open-access publishing paves the way for other disciplines. The mathematical community strives to create a public, non-proprietary, global digital mathematical library. The role of open publication structures is key here: for both the mathematics community and the arts and humanities, Episciences is a major provider of Diamond-OA technical infrastructure, used across borders, consortia, and communities.

Interdisciplinary and international approaches foster innovative dialogues and open the horizons of long-enclosed publication practices. In this paper, we will show how infrastructures play a key role in providing the backbone to sustain this development, both at a technical level and in terms of community-building.
