P4C: Large-scale repositories
Sustaining a Large-Scale Repository Architecture: Behind the Scenes of the Stanford Digital Repository
Stanford University, United States of America
In 2006, Stanford Libraries built the Stanford Digital Repository (SDR). The system has served us well—thirteen years later, SDR contains over 1.8 million objects (~500 terabytes of content). We built SDR using open-source software (including Samvera, Fedora, and Blacklight) and an additional ~300,000 lines of custom code. We believe it is among the largest and most complex repository systems in research libraries, and yet the challenges we face are common.
We have grown SDR to a point where it is extremely difficult to sustain. Some of our foundational technologies are not only aging but are beyond end-of-life. Meanwhile, we are challenged to continue offering a valuable, performant, highly available repository service to our stakeholders. Over the past two years, we have analyzed the factors complicating sustainability; that work has led to operational changes that improve the current state, and to a plan for sustaining repository development that combines open-source and custom software.
Our presentation highlights the reasons SDR became unsustainable, the areas where we have made improvements, and where we go next. We believe the lessons we have learned are widely applicable to institutions that develop their own repository solutions.
The Dos and Don’ts of Setting Up a Very Large Full-Text Repository
State Library Berlin, Germany
In mid-2017, CrossAsia – a service of the Berlin State Library for Asian Studies – began to set up an Integrated Text Repository (ITR) to store a large number of full-text documents, mainly from China. The licensed texts include books, newspapers, and journals, which are broken down into their smallest logical units – pages, articles, etc. – so that scholars can cite, annotate, or transcribe individual entities. The presentation will show the challenges of dealing with more than 50 million objects in different languages and encodings, ingesting them into a Fedora repository without losing their logical structure, and making them searchable. We will also talk about the search functions available via the website or an SRU interface, and discuss upcoming possibilities for working with the data.
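The SRU interface mentioned above follows the standard SRU/CQL pattern: a searchRetrieve operation issued as a plain HTTP GET with a CQL query string. A minimal sketch of building such a request is shown below; the base URL and the query are hypothetical placeholders, since the abstract does not name the actual endpoint.

```python
from urllib.parse import urlencode

# Hypothetical endpoint; the real CrossAsia SRU base URL is not given in the abstract.
SRU_BASE = "https://example.org/sru"

def build_sru_search(query: str, max_records: int = 10, start: int = 1) -> str:
    """Build an SRU 1.2 searchRetrieve GET URL for a CQL query."""
    params = {
        "operation": "searchRetrieve",   # standard SRU operation name
        "version": "1.2",                # SRU protocol version
        "query": query,                  # CQL query, e.g. a full-text term
        "startRecord": start,            # 1-based offset into the result set
        "maximumRecords": max_records,   # page size
    }
    return f"{SRU_BASE}?{urlencode(params)}"

# Example: search the default index for a term (non-ASCII terms are percent-encoded).
url = build_sru_search('cql.serverChoice all "儒学"', max_records=5)
print(url)
```

The response to such a request is an XML `searchRetrieveResponse` containing the matching records, which a client would then parse.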
Migrating The Language Archive to a new repository solution
Max Planck Institute for Psycholinguistics
The Language Archive (TLA) at the Max Planck Institute for Psycholinguistics is an archive of digital resources on a large number of languages spoken around the world. An important part of its collections concerns endangered languages; some of these collections were added to the UNESCO Memory of the World registry in 2015. The archive was established in the late 1990s on a repository system built in-house, which remained in use until January 2018, when the archive was migrated to a new solution largely based on Islandora. To complement Islandora, a custom ingest back-end and front-end were developed that allow for a more controlled ingest workflow and enable researchers themselves to deposit their materials. Additional modules were developed to support the CMDI metadata framework, to visualise specific data types, and to add certain missing functionality. This paper describes the whole migration trajectory, from selecting a suitable open-source repository foundation to developing the additional components, and finally migrating over 1 million objects comprising about 100 TB of data.