TF01: Research Data & Digital Preservation
Tuesday, 05/Jun/2018:
1:30pm - 3:00pm

Session Chair: Hannah Frost, Stanford University Libraries
Session Chair: Pedro Principe, University of Minho
Location: Ballroom C
The sustainable use of digital repository software for the preservation of research data and cultural heritage.

Session Abstract

24x7 presentations that focus on the long-term sustainability of digital preservation for the collections of research and cultural heritage communities. Digital preservation of, and access to, research data require sustainable policies and flexible technology.

Relational Databases as Repository Objects

Alexander Garnett

Simon Fraser University, Canada

Faculty and other researchers wanting to deposit relational databases can pose interesting challenges for research repositories. In many such cases, an application layer which was custom-made to run on top of the database is no longer sustainable for the research team who had been maintaining it, and it is not always clear whether a less bespoke solution will be an adequate replacement.

Generally, making a relational database fit the model of research repositories, which usually assume more-or-less flat files, can be tricky. MySQL or Postgres databases can be exported down to a single file, but this usually requires access to admin tools and/or curator support; ingesting these export files is easy, but serving them back as-is doesn’t provide much value to end users who may lack the technical expertise to load them into a live database.

It is technically possible for a repository platform to interact programmatically with virtualization software like Docker to run an isolated database instance for every such uploaded object, to provide a relatively simple SQL interface in the browser on the object’s landing page, and to save a few database-specific queries for novice users to run automatically. I will discuss the implications of doing so.
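The per-object database instance described above can be sketched roughly as follows. This is an illustrative sketch only, not the speaker's implementation: it assumes a Postgres dump, the official `postgres` Docker image (which loads `.sql` files placed in `/docker-entrypoint-initdb.d` on first start), and hypothetical names (`docker_run_args`, `SAVED_QUERIES`, the `repo-db-` container prefix, the memory limit). The function only builds the command; it does not start a container.

```python
from pathlib import Path

# Hypothetical catalogue of curator-approved queries offered to novice users.
SAVED_QUERIES = {
    "list_tables": (
        "SELECT table_name FROM information_schema.tables "
        "WHERE table_schema = 'public' ORDER BY table_name"
    ),
}


def docker_run_args(object_id: str, dump_path: Path, host_port: int, password: str) -> list[str]:
    """Build (but do not execute) a `docker run` command for one deposited dump.

    All naming conventions and limits here are assumptions for illustration.
    """
    container_name = f"repo-db-{object_id}"
    return [
        "docker", "run",
        "--detach",                       # run in the background
        "--rm",                           # discard the container when it stops
        "--name", container_name,
        "--memory", "512m",               # assumed per-object resource cap
        "--env", f"POSTGRES_PASSWORD={password}",
        # Publish only on localhost so the repository can proxy queries.
        "--publish", f"127.0.0.1:{host_port}:5432",
        # Mount the deposited dump read-only; the image loads it at first start.
        "--volume", f"{dump_path.as_posix()}:/docker-entrypoint-initdb.d/dump.sql:ro",
        "postgres:16",
    ]
```

A repository would pass the resulting list to a process runner and then expose only the entries in `SAVED_QUERIES`, or a restricted SQL console, on the object's landing page.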

File loss: hits and near misses

Lars Holm Nielsen, Alexandros Ioannidis, Krzysztof Nowak, Jose Benito Gonzalez Lopez

CERN, Switzerland

Repositories increasingly depend on external cloud storage or other complex distributed systems to satisfy ever-growing needs for storing larger data volumes. These cloud systems help repository managers store terabytes and petabytes of data, and often simplify file management in the underlying repository software. We trust these systems to store our files, yet we often lack an understanding of their operation and internals, and of how they can fail. This talk will present two file loss incidents on Zenodo, uncovering some of the ways these distributed systems can fail. One incident was caused by a coincidence of two software bugs in independent systems (the hit), and the second was caused by a human operational mistake in the cloud storage system (the near miss).

Digital Preservation through EPrints-Archivematica Integration

Tomasz Neugebauer (1), Justin Simpson (2), Justin Bradley (3)

(1) Concordia University, Canada; (2) Artefactual Systems Inc.; (3) University of Southampton

This presentation addresses the digital preservation challenges of EPrints repository content by integrating EPrints with Archivematica, a system designed specifically for digital preservation. A workflow and folder structure using BagIt for exporting EPrints content into Archivematica is described. A sample item export with multiple files and formats is used to demonstrate the integration plan.
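For readers unfamiliar with BagIt, the general shape of a bag can be sketched as below. This is a minimal illustration using only the Python standard library, not the EPrints-Archivematica export itself: the function name `make_bag` is hypothetical, the version declaration and the choice of SHA-256 are assumptions, and the folder structure used in the presented workflow may differ.

```python
import hashlib
import shutil
from pathlib import Path


def make_bag(source_files: list[Path], bag_dir: Path) -> Path:
    """Copy files into bag_dir/data and write bagit.txt plus a SHA-256 manifest.

    Files are stored flat under data/, so two inputs with the same file name
    would collide; a real export would need to preserve directory structure.
    """
    data_dir = bag_dir / "data"
    data_dir.mkdir(parents=True, exist_ok=True)

    manifest_lines = []
    for src in sorted(source_files):
        dest = data_dir / src.name
        shutil.copy2(src, dest)
        digest = hashlib.sha256(dest.read_bytes()).hexdigest()
        # Manifest line: checksum, whitespace, path relative to the bag root.
        manifest_lines.append(f"{digest}  data/{src.name}")

    # Bag declaration (version number assumed for illustration).
    (bag_dir / "bagit.txt").write_text(
        "BagIt-Version: 1.0\nTag-File-Character-Encoding: UTF-8\n",
        encoding="utf-8",
    )
    (bag_dir / "manifest-sha256.txt").write_text(
        "\n".join(manifest_lines) + "\n", encoding="utf-8"
    )
    return bag_dir
```

The manifest lets a receiving system verify that every payload file arrived unchanged by recomputing each checksum.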

Little ideas for Big Data

Estelle Pope, Bethany Seeger

Amherst College, United States of America

Are there alternatives to strictly digital preservation for repository content? This session reaches into the absurd to explore the concerns of the problem space of digital preservation, and arrives at some humorous brainstorming ideas that might have a kernel of truth to them. Thinking about the idea of ‘what do we want to make sure people in the future know about us, and how can we make sure it survives?’ we are inspired by Afrofuturism, Charles Darwin, Carl Sagan, and others who are looking at the past and future in creative ways.

Preparing for certification as a trusted data repository

Mikaela A Lawrence, Janet K Applegate

CSIRO, Australia

We present the challenges of preparing the Data Access Portal (DAP) for certification as a trusted data repository and for publishing externally owned datasets, along with the benefits of working with an interdisciplinary team. The DAP is an institutional repository that has been archiving and publishing the data assets of the Commonwealth Scientific and Industrial Research Organisation (CSIRO, Australia’s national science agency) since 2012. The DAP is developed within CSIRO by software engineers, storage infrastructure specialists and repository staff. It has a self-service interface and includes a workflow for peer review of datasets by a scientific group leader. The change in scope from institutional repository to publisher of externally owned datasets challenges current DAP processes and procedures.

Certifying the DAP as a trusted data repository is part of the strategy to attract externally owned, nationally significant datasets that align with CSIRO’s functions. To prepare for certification we identified, updated and developed documentation, policies and procedures to comply with the accreditation requirements as well as to meet CSIRO’s business needs. The project team liaised with an interdisciplinary team of staff and representatives of externally owned data to overcome the challenges of the project.

But Why, Though? : An Evaluation of Functional Requirements, Tools, and Workflows in Digital Content Management

Andrea Green

State Library of North Carolina, United States of America

The State Library of North Carolina's Government & Heritage Library (GHL) preserves and facilitates public access to NC State Agency Publications and North Carolina heritage materials, regardless of format. Over the last decade, GHL has acquired, processed, preserved, and made accessible a variety of digital content (digitized and born digital). This presentation will provide an overview of a recently formed working group’s progress in evaluating our functional requirements, tools, and workflows for digital content management. Since so much can change in 10 years (staff, technology, priorities, etc.), the group has stepped back from our current practices and assumptions to ask the broader who, what, why, and how questions that help frame our current goals and priorities. Work thus far includes creating an updated list of functional requirements, reviewing all tools currently used in workflows, and implementing the changes needed to ensure that digital content can be identified, processed efficiently, and accessed over the long term.

Evaluating Repository Systems for Research Data Management: Don’t forget the look!

Kai Wörner

Universität Hamburg, Germany

The Center for Sustainable Research Data Management at the Universität Hamburg evaluated repository software solutions for use as the main research data repository for the whole institution. After some candidates were eliminated for failing to meet certain “hard criteria” (such as supported metadata standards, file storage specification, etc.), the final decision was largely based on the accessibility of the user interfaces.

No Contribution, no Data: Building Resources Along the Lines of a “Take & Share” Approach

Hagen Peukert

Universität Hamburg, Germany

The idea behind Open Data is one of the most valuable tenets of research carried out in the age of digitalization. It is both an opportunity and a key to advancing modern scientific thinking broadly, backed by a larger research community. A central argument brought forward against open data by rather small research communities, which usually do not receive large funding, is that high-quality data is expensive to collect but easy for others to exploit, even before one’s own research is finished. This often leads to an undue delay in the publication of data and undermines the advantages of Open Data. In this talk I would like to suggest a possible solution for small-scale projects in which the resources are especially costly. By constraining the general principle of accessibility in due proportion to the user’s ability and demand, an incentive is set to publish data that would otherwise not be publicly available. It is important to note that the idea of “Open” Data is not questioned. The plausibility of this approach is argued by reference to the literature in Social Psychology, together with a short presentation of the adjusted repository software.

