Conference Agenda


Session Overview
Wed 1b: Techniques (1): Interacting with the web
Wednesday, 28/Sep/2016:
1:00pm - 2:30pm

Session Chair: Fabio Ciotti, Università Roma Tor Vergata
Location: Sitzungssaal (Board room)
Dr. Ignaz Seipel-Platz 2, 1010 Vienna, 1st floor


Applying Standard Formats and Tools

S. Dumont, S. Haaf, T. Kraft, A. Czmiel, C. Thomas, M. Boenig

Berlin-Brandenburg Academy of Sciences and Humanities, Germany

The BBAW project “Travelling Humboldt – Science on the Move” (here: AvH) is preparing a digital edition of the writings of Alexander von Humboldt that emerged from his journeys to America and Russia. The project is supported by TELOTA, which develops tools for the creation of digital scholarly editions, and by the DTA, which provides a platform for large TEI corpora of historical texts.

The source material is very heterogeneous: Humboldt rearranged and supplemented his journals after his journeys. Nevertheless, it was possible to base the main part of the annotation guidelines on the existing DTABf-M; only a small number of project-specific additions were necessary. The digital edition is created in ediarum, which has been adapted for the DTABf-M and the project-specific annotations. The project’s encoding demands are examined for their generalizability. If they may serve other projects as well, the respective adaptations are implemented in the DTABf-M and ediarum.BASIS. In this way, standards are optimized directly on the basis of their usage within a project.

The resulting standards-based digital resources may in turn be reused in diverse contexts. For instance, the DTABf-conformant texts may be added to the DTA corpora, combined there in particular with the apostilles of Humboldt’s Cosmos lectures, and from there integrated into the CLARIN infrastructure. With the web service correspSearch, the edited letters can be linked to other letters already published in other editions.

The current use case is an example of the consistent reuse of existing TEI resources, workflows and tools across multiple projects. In this way, efforts are concentrated not on new developments but on the improvement of existing standard tools and formats and on the handling of project specifics. The creation and use of interoperable TEI resources is an important prerequisite in this context.

Capturing the crowd-sourcing process: storing different stages of crowd-sourced transcriptions in TEI

R. Bleier¹, R. Hadden²

¹University of Graz; ²Maynooth University

The Letters of 1916 is a project to create a collection of correspondence from around the time of the Easter Rising, written in Ireland or with an Irish context. The project uses a crowdsourcing approach to transcription, inviting members of the public to contribute by transcribing letters and correcting those that have already been transcribed. Transcribers use a transcription desk with features borrowed from the Transcribe Bentham project. The back-end, based on MediaWiki, stores each saved revision separately, along with relevant metadata.

During our editing workflow, all data is extracted from MediaWiki’s database and injected into TEI documents for long-term storage and web presentation. The final crowdsourced transcription is checked by a member of the editorial team prior to inclusion in our online archive. In addition to storing the final marked-up version of the text, each revision stage is injected and logged in the TEI file. This affords researchers an invaluable resource to study the progress of crowd-encoding, its efficacy, and accuracy over time.

The storage of the different versions of transcriptions in TEI documents is a challenge as, being crowdsourced, they are seldom well-formed. As an intermediate measure, to enable storage and limited access to the crowd-sourced versions, the <revisionDesc> element is employed to record the ID of the transcriber/editor and the time of the revision. The revision itself is “dumped” into the <revisionDesc> element inside comment tags to sidestep issues of well-formedness.
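The interim approach described above can be sketched roughly as follows. This is a minimal illustration in Python, not the project's actual code: the helper names `comment_safe` and `add_revision`, the use of a `<change>` element with `who` and `when` attributes to hold each revision, and the sample user ID, timestamp and letter text are all assumptions made for the example. Note that XML comments may not contain `--` or end in `-`, so the raw text has to be altered slightly before it is stored.

```python
import re
import xml.etree.ElementTree as ET

TEI_NS = "http://www.tei-c.org/ns/1.0"
ET.register_namespace("", TEI_NS)


def comment_safe(raw: str) -> str:
    """Make arbitrary text legal inside an XML comment (no "--", no trailing "-")."""
    # Put a space after every hyphen that is directly followed by another hyphen.
    safe = re.sub(r"-(?=-)", "- ", raw)
    return safe + " " if safe.endswith("-") else safe


def add_revision(revision_desc: ET.Element, who: str, when: str, raw: str) -> ET.Element:
    """Append a <change> with transcriber ID, timestamp and the commented-out revision."""
    change = ET.SubElement(
        revision_desc, f"{{{TEI_NS}}}change", {"who": who, "when": when}
    )
    # The revision is stored as a comment, so it need not be well-formed XML.
    change.append(ET.Comment(comment_safe(raw)))
    return change


revision_desc = ET.Element(f"{{{TEI_NS}}}revisionDesc")
# Hypothetical crowd-sourced revision: neither <p> nor <hi> is ever closed.
add_revision(
    revision_desc,
    "#user42",
    "2016-03-01T10:15:00",
    "<p>Dear Mary -- I write <hi>in haste",
)
serialized = ET.tostring(revision_desc, encoding="unicode")
print(serialized)
```

The enclosing document stays well-formed, but the sketch also shows why this is only an intermediate measure: the revision is opaque to XML tools, and the hyphen handling makes the stored text differ slightly from what the transcriber saved.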

This paper will explore more robust solutions for storing and marking-up these XML-like fragments within a TEI document; it will examine possibilities and issues for storing crowd-sourced transcription versions, and how they might be mined for insight into transcription habits.

Wiki2TEI: Wikipedias as sources for language research

K. Moerth, D. Schopper

Austrian Academy of Sciences, Austria

Wikipedia has become a synonym for encyclopaedic knowledge to a broad public; Wikipedias are increasingly being used for a wide range of applications, among other things as source material for research projects. For many languages of the world, the respective Wikipedia is the only freely available digital language resource.

Wikipedias are authored in Wiki markup, a so-called lightweight markup language whose simple syntax is meant to ease the editing of web content and can be translated directly into HTML. Unfortunately, Wiki markup has a serious drawback: it is applied inconsistently, mainly because it is written manually without the help of programs that check the digital text’s structural integrity (well-formedness) and/or logical consistency (validity). Whereas in the past the processing of digital texts often proceeded from plain text, XML technologies have become pervasive in many applications. Well-formed XML can be processed in many ways and ensures a higher degree of interoperability.

We have seen many projects aiming to convert Wikipedias into other formats. Probably the best-known is DBpedia, the machine-readable extract of Wikipedias’ structured portions, which is a cornerstone of knowledge modelling in the Semantic Web. The goal of our project was to put together a workflow for transforming the texts contained in Wikipedias into a format that our corpus creation and processing tools can handle: tokeniser, tagger, indexer and digital reading environment. As all workflow steps in our environment are geared towards TEI, we worked on routines to convert Wikipedias into this more expressive, more reusable format.

Our paper will discuss existing tools for this task and our own approach to converting Wiki markup into TEI, and it will give examples of how the resulting data can be used for research.
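To give a rough idea of what such a conversion involves, the following Python sketch maps a very small subset of Wiki markup (section headings, bold, italics, blank-line paragraphs) to TEI-style elements and then parses the result to confirm that it is well-formed. It is not the authors' tool: the function name `wiki_to_tei`, the choice of `<head>`, `<p>` and `<hi rend="…">` as targets, and the sample text are assumptions made for illustration, and real Wiki markup (templates, links, tables, lists) needs a proper parser.

```python
import re
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape


def wiki_to_tei(wikitext: str) -> str:
    """Convert a tiny subset of Wiki markup to a TEI-style <div> (illustrative only)."""
    blocks = []
    # Blocks are separated by blank lines.
    for block in re.split(r"\n\s*\n", wikitext.strip()):
        text = escape(block.strip())  # escape &, < and > before adding any tags
        heading = re.fullmatch(r"(={2,})\s*(.+?)\s*\1", text)
        if heading:
            blocks.append(f"<head>{heading.group(2)}</head>")
            continue
        # Bold ('''...''') must be handled before italics (''...'').
        text = re.sub(r"'''(.+?)'''", r'<hi rend="bold">\1</hi>', text)
        text = re.sub(r"''(.+?)''", r'<hi rend="italic">\1</hi>', text)
        blocks.append(f"<p>{text}</p>")
    return '<div xmlns="http://www.tei-c.org/ns/1.0">' + "".join(blocks) + "</div>"


sample = "== History ==\n\n'''Vienna''' is the ''capital'' of Austria & more."
tei = wiki_to_tei(sample)
ET.fromstring(tei)  # raises xml.etree.ElementTree.ParseError if not well-formed
print(tei)
```

The final parse step is the kind of automatic structural check that, as noted in the abstract, hand-written Wiki markup lacks.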

Conference: TEI Conference and Members' Meeting 2016