ELTeC: A Comparable Corpus of Novels in Many European Literatures

Christian-Emil Ore1, Berenike Herrmann2, Carolin Odebrecht3, Diana Santos1

1University of Oslo, Norway; 2Basel University, Switzerland; 3Humboldt-Universität zu Berlin, Germany

Digital methods allow for new, additional and complementary ways of analysing texts. Tapping into recent advances in Digital Humanities, we present our collaborative work on the design and creation of a European corpus of novels published between 1840 and 1920 for evaluating distant reading methods in our network COST Action Distant Reading (CA16204).

Since the end of the 1990s, the massive and increasing digitization of texts of various genres has produced huge collections of digital texts. A well-known example is Google Books. National libraries across Europe have launched similar initiatives, e.g. BNF’s Gallica project, with 2.4 million texts, and the more modest Bookshelf project of the Norwegian National Library, with at least 500,000 digitized books. There are many such digitization projects, although their size and quality vary from country to country.

Despite these efforts, it has not been possible to obtain ‘ready made’ texts for our project. In many cases it has been necessary to digitize texts found only in printed editions in order to comply with the selection criteria.

The aforementioned digitization projects tend to collect texts without previously defined sampling criteria, using an ‘opportunistic’ corpus design. Despite some problems, this approach has for the first time made available a huge variety of unknown and forgotten texts which have not been studied because they are not part of the canon. In our project we apply Distant Reading (Moretti, 2013), an approach that opens up non-canonical literature to large-scale analysis in a straightforward manner, and we apply clear criteria for corpus construction. In our view, Distant Reading is necessarily complemented by Close Reading: the two are situated at the ends of a methodological continuum that can be applied to digital text corpora in different research contexts. These methods allow for purpose-tailored access to corpus data through analysis and visualization methods in literary studies.

We combine a rigorous approach to corpus design with the application of a diversity of methods. We are a multidisciplinary group of European scholars who work together in a COST Action called Distant Reading for European Literary History (Distant-Reading). The COST Action (CA16204) was initiated in 2017 and will last for four years. It has three main objectives (cf. the MoU):

1. build a multilingual European Literary Text Collection (ELTeC), ultimately containing around 2,500 full-text novels in at least 10 different languages primarily from the period 1850 to 1920, making it possible to test methods and compare results across national traditions;

2. establish and share best practices and develop innovative computational methods of text analysis adapted to Europe’s multilingual literary traditions;

3. consider the consequences of such resources and methods for rethinking fundamental concepts in literary theory and history.

In the following section of this paper, we will focus on the first objective and discuss the collaborative effort of building an open access multilingual corpus of European novels (the European Literary Text Collection - ELTeC). We present the work done so far within Working Group 1 ‘Scholarly Resources’ of the COST Action Distant Reading. Specifically, we address the link between the practical and technical aspects of corpus design on the one hand and the theoretical discussion on computational modeling of literature across languages and cultures on the other. This means paying attention to differences and similarities across different literary theoretical paradigms when setting up the corpus as a resource for Distant Reading, addressing the active role of corpus design for periodization and canonization in European literary history.

Working Group 1 is responsible for the development of the corpus design, the encoding schema and the workflow for data creation, maintenance and publication. The working group consists of European researchers from 23 countries and reflects an extraordinary range of research disciplines such as corpus linguistics, computational linguistics, literary studies, social sciences, library science and philological studies. This international European team is able to build a corpus that allows a European perspective on the novel.

The multilingual European Literary Text Collection (ELTeC, Odebrecht, Burnard, Navarro Colorado, Eder & Schöch, 2019) is an open access (CC-BY 4.0) collection of European novels from the period 1840 to 1920. This is a period of large cultural and linguistic changes and a formative phase for many European languages. For example, in this period the Norwegian written languages underwent very large changes: from pure Danish to an adapted Norwegian written standard on the one hand, and the introduction of a parallel dialect-based Norwegian written standard on the other. The period is also interesting from a lexical/lexicographic point of view and raises large challenges for the design of language analysis tools such as lemmatizers and morphological analyzers. ELTeC will be used as a benchmark corpus to evaluate distant reading methods, and to discuss and even challenge established literary systems and the publication history of the novel. We organize the corpus in language collections covering Romance, Slavic, Germanic and Finno-Ugric language families. Currently, we include Czech, English, French, German, Greek, Hungarian, Italian, Norwegian, Polish, Portuguese, Romanian, Serbian, Slovenian and Spanish novels.

In order to allow comparability among the different European novels and foster the interoperability of the corpus, we use the TEI to encode the digitized texts. The aim is to make neither a collection of scholarly text editions nor one of plain texts: “we aim to facilitate a richer and better-informed distant reading than a transcription of lexical content alone would permit” (Burnard, Schöch & Odebrecht, 2019). We are therefore not aiming to represent the various structural or graphical features of each text but to enable a uniform and consistent encoding across the language collections. We are using ODD chaining to handle three encoding levels (cf. Rahtz & Burnard, 2013): 1) a basic level covering the body of the texts including e.g. paragraphs, headers, and highlighted items, 2) a richer level including gaps, notes and quotes and 3) a third level with lexical information. In addition, we provide basic text displays for each novel.
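As an illustration of why a uniform basic encoding helps distant reading, the following minimal Python sketch counts structural elements in a TEI-style fragment. The fragment is invented for this example and is not taken from ELTeC, and the element selection (div, head, p, hi) is only an assumption about what a basic level covering paragraphs, headers and highlighted items might contain; the actual ELTeC schemas should be consulted for the real constraints.

```python
import xml.etree.ElementTree as ET

# The TEI namespace; elements in TEI documents are qualified with it.
TEI_NS = "http://www.tei-c.org/ns/1.0"

# Invented sample text, NOT an ELTeC file: two chapters with headers,
# paragraphs and one highlighted word.
SAMPLE = """<text xmlns="http://www.tei-c.org/ns/1.0">
  <body>
    <div type="chapter">
      <head>Chapter 1</head>
      <p>It was a <hi>dark</hi> night.</p>
      <p>Nobody spoke.</p>
    </div>
    <div type="chapter">
      <head>Chapter 2</head>
      <p>Morning came.</p>
    </div>
  </body>
</text>"""


def count_elements(xml_text, names=("div", "head", "p", "hi")):
    """Count how often each named TEI element occurs in the fragment."""
    root = ET.fromstring(xml_text)
    return {
        name: len(root.findall(f".//{{{TEI_NS}}}{name}"))
        for name in names
    }


counts = count_elements(SAMPLE)
print(counts)
```

Because every text uses the same small element set, the same function could in principle be run unchanged over novels in any of the language collections, which is the kind of comparability a consistent encoding is meant to support.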

Building ELTeC as a unique resource requires reassessing and even discarding some established ways of defining literary systems and publication histories (see the MoU). In the corpus design, we thus maintain a metadata-based approach that allows for representing the diversity of novels published between 1840 and 1920 across the multilingual, transnational, and pluri-cultural topologies of Europe. At the most general level, this approach addresses common textual and contextual features instead of solely relying on canonical definitions of novels in literary history. We defined sampling and balancing criteria that use metadata such as publication date and place, text length, reprint counts and authors’ gender.
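A metadata-based balancing check of this kind can be sketched in a few lines of Python. Everything specific in the sketch is an assumption made for illustration: the records, the field names, the twenty-year time slices, the word-count thresholds and the reprint threshold are hypothetical and do not reproduce the criteria actually adopted for ELTeC.

```python
from collections import Counter

# Hypothetical metadata records; titles and values are invented.
NOVELS = [
    {"title": "A", "year": 1845, "author_gender": "F", "words": 42000, "reprints": 0},
    {"title": "B", "year": 1872, "author_gender": "M", "words": 120000, "reprints": 3},
    {"title": "C", "year": 1899, "author_gender": "M", "words": 230000, "reprints": 1},
    {"title": "D", "year": 1915, "author_gender": "F", "words": 75000, "reprints": 5},
]


def period(year, start=1840, width=20):
    """Map a publication year to a time slice (assumed 20-year slices)."""
    lower = start + ((year - start) // width) * width
    return f"{lower}-{lower + width - 1}"


def length_class(words):
    """Classify text length using assumed word-count thresholds."""
    if words < 50000:
        return "short"
    if words < 100000:
        return "medium"
    return "long"


def reprint_class(reprints, threshold=2):
    """Split novels by reprint count (assumed threshold)."""
    return "frequent" if reprints >= threshold else "rare"


def balance_report(novels):
    """Count novels per category for each balancing criterion."""
    return {
        "period": Counter(period(n["year"]) for n in novels),
        "author_gender": Counter(n["author_gender"] for n in novels),
        "length": Counter(length_class(n["words"]) for n in novels),
        "reprints": Counter(reprint_class(n["reprints"]) for n in novels),
    }


report = balance_report(NOVELS)
for criterion, distribution in report.items():
    print(criterion, dict(distribution))
```

A report like this makes gaps visible at a glance, for example a time slice with no novels by women or a collection dominated by long, frequently reprinted texts.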

We especially include novels not previously incorporated in the literary canons of the European countries. In our approach, ‘canonization’ cuts across ‘popularization,’ operationalized in terms of reprints. We assume that the different types of canons (defined nationally or by language, and embodied in educational syllabus policies and reading lists at schools and universities) correlate with a relatively high number of book reprints, documented in library records.

Using the TEI for encoding data fosters interoperability. By using the ODD mechanism, we also focus on clear schema definitions and documentation. The data and metadata of ELTeC are created collaboratively via GitHub. Our working group provides extensive open access documentation for the (meta)data schema, decisions and workflows. We archive versions of ELTeC via Zenodo. Thus, our (meta)data are re-usable, interoperable, accessible and findable (cf. FAIR Guiding Principles, Wilkinson et al., 2016).

In the creation of ELTeC we do not aim at deductively defining what a ‘novel’ is, but at allowing different approaches in literary theory and history to be inductively explored and tested. This is likely to entail a re-evaluation and redefinition of key concepts for literary history, including genre, style or authorship, as well as a debate about the advantages and limitations of Distant Reading methodologies and approaches to the study of European literary history.


The research described in this paper was conducted in the context of the COST Action "Distant Reading for European Literary History" (CA16204 - "Distant-Reading"). COST is funded by the Horizon 2020 Framework Programme of the EU.


Subject Indexing: The Challenge of LGBTQI Literature

Jenny Bergenmar1, Koraljka Golub2

1University of Gothenburg, Sweden; 2Linnaeus University, Sweden

Despite a growing number of digital LGBTQI (lesbian, gay, bisexual, transsexual, queer, intersex) history archives, and research-driven digital LGBTQI initiatives, queer perspectives have not been prominent in the digital humanities. Furthermore, the investigation of LGBTQI themes in literary history is hampered by the fact that, to date, there are no broad scholarly inventories of such literature. Research on subject indexing has also revealed that the controlled vocabularies in use are too general to describe LGBTQI themes, motifs, and characters in a relevant manner. The purpose of this paper is to discuss how LGBTQI literature can be made more searchable and more visible through the development of a quality-controlled, subject-specific database (QUEERLIT database) in which specialized subject indexing is applied. Methodological challenges pertaining to the indexing of queer literary texts are discussed, as well as theoretical considerations raised when assigning certain contemporary subjects to historical texts.

Emotional Imprints: Letter-Spacing in N.F.S. Grundtvig's Writings

Katrine F. Baunvig1, Oliver S. Jarvis2, Kristoffer L. Nielbo2

1Aarhus University, The Grundtvig Study Centre, Denmark; 2Aarhus University, Centre for Humanities Computing, Denmark

Undertaking a distant reading of letter-spacings in the digitized and annotated N.F.S. Grundtvig data, this paper targets a trait of an overall romanticist emotionalizing trend in a corpus of 19th century literature: it proposes to analyze the letter-spacings as a deposition of heightened attention to subjective emotional experience in printed matter and typesetting in the writings of the Danish poet, priest and politician N.F.S. Grundtvig (1783-1872), who is widely regarded as the central figure in the 19th century Danish religious development and nation building process. As such, this paper sketches the temporal and semantic contexts of the letter-spacings.

Inheriting Digital Projects: How to Keep Ibsen Alive Online

Nina Marie Evensen

University of Oslo, Norway

This paper addresses the challenge of managing digital projects on a long-term scale. In most digital projects there is no strategic plan for the afterlife and maintenance of the project results, leaving them to an uncertain fate. This can be illustrated by the inherited digital resources hosted by the Centre for Ibsen Studies at the University of Oslo, and the challenges they represent when it comes to functionality and maintenance. Due to the rapidly increasing number of digital projects, many institutions will be asking the same questions as we do: How do we keep digital resources alive and up to date in a continuously changing digital reality?

Óravíddir: Interactive Exhibition about the Icelandic Language

Trausti Dagsson, Jón Hilmar Jónsson, Eva María Jónsdóttir

The Árni Magnússon Institute for Icelandic Studies, Iceland

This paper describes an interactive exhibition about the vocabulary of the Icelandic language. The exhibition is called Óravíddir - Orðaforðinn í nýju ljósi (in English: Vastness - The Vocabulary in a New Light) and was opened in May 2019 at the Culture House in Reykjavík, a part of The National Museum of Iceland. The exhibition uses data from the word database Íslenskt orðanet (The Icelandic Word Web) and illustrates semantic relations between words in a three-dimensional visualization. The paper introduces Íslenskt orðanet, followed by a description of how the data was used to create the network graph visualization. We then discuss the setup of the exhibition and conclude by reflecting on future possibilities and further development.

Gulag Literature: Looking through the Glass of Digital Humanities

Kseniia Alexandrovna Tereshchenko

ITMO University, Russia

There are several Digital Humanities projects dedicated to Stalin’s terror, such as «Это прямо здесь» (“This is right here”), «Открытый список» (“The open list”), «Бессмертный барак» (“Immortal Gulag”), Gulag online, etc. These projects mostly focus on Russian history, providing information about Russians, Russian cities and so on. They also bring to light the phenomenon of Stalin’s terror by means of history and historical facts. In my research I suggest taking a look at the same topic from a different angle.

First of all, in my project I emphasize that not only Russians suffered from Stalin’s terror, but also citizens of other Union republics. Secondly, I researched this topic by applying a mainly cultural rather than historical approach, as there are not only historical data available but also cultural artefacts (e.g. books) covering the theme of Stalin’s purges. In my project, I focused on literature. No similar projects were discovered during the research.

Considering all this, the main goals of the ongoing project are to propose a new approach to presenting the results of literary studies research, and to attract more attention to Gulag literature studies overall and Estonian Gulag literature in particular.

These goals are to be achieved by:

collecting information about the Estonian camp prose authors;

collecting their books that are dedicated to this theme;

providing a brief literary analysis of the books;

preparing illustrative materials;

connecting the means of storytelling, illustration and literary studies;

presenting all of the above digitally (as an interactive website).

As a holder of a bachelor’s degree in Finno-Ugric philology who has dedicated a thesis to Estonian camp prose, I continue researching this genre. My thesis was one of the first papers dedicated to Estonian camp prose, as this topic remains not very well researched. This, among other things, makes the current project relevant.

In the process of working on the thesis I compiled an overview of camp prose, including the most noted authors, books, genre features etc. The list of Estonian camp literature consists of books by such authors as J. Kross, A. Viirlaid, A. Kask, A. Helm, A. Uustulnd, R. Kaugver and some others.

When it comes to the definition of camp prose, this genre is often defined as one of the literary movements in the history of Russian literature. In my paper, though, camp prose is considered to be a literary genre that includes all books dedicated to life in Soviet camps, no matter what language they were written in or to which national literature they belong. This approach makes it possible to analyze the genre most effectively, as it avoids excluding books written by authors of non-Russian descent.

During the research, camp prose was analysed on the basis of the novel “Forty Candles” (“Nelikümmend küünalt”, 1966) and the short story collection “Letters From the Camp” (“Kirjad laagrist”, 1989), both written by Raimond Kaugver, one of the most well-known Estonian authors who have dedicated several books to the Gulag topic. Analysis of these books has led me to believe that Estonian camp prose is characterized by the following features: autobiographism; simplicity of language; fragmented composition; retrospection. It was also noted that not all of the books are wholly dedicated to life in a camp; some of the texts were modified due to censorship; characters’ emotions are often neglected. Camp prose portrays events that actually took place, but elements of fiction are also used for various reasons. Some of these statements can be confirmed and/or illustrated by means of digital technologies. Therefore, I propose the use of digital technologies to continue researching this genre.

In the process of research I decided not to concentrate on computational methods at this stage of the project due to the small number of available digital versions of the chosen books. Instead, the focus of the project has switched to visual aspects and storytelling as tools for presenting manual literary analysis.

As already stated, one of the key features of R. Kaugver’s style is fragmented composition: short chapters, unexpected endings, lack of connection between the ending of a chapter and the beginning of a new one. Therefore, as a tool to visualize such texts I chose to create a number of collages that illustrate episodes of the book, as this form of visual art is also defined by creating something whole out of separate fragments.

The storytelling part of the project is presented in the form of annotations of said collages. Certain parts of the illustrations are connected with a piece of related text (an excerpt of literary analysis).

As a result, the project’s goal of presenting a literary analysis in a modern way is achieved through illustrations (collages) and storytelling (literary analysis presented as the collages’ annotations). At this point, a prototype of the described website is available.

I believe that the use of digital technologies, as well as illustrative material and storytelling, can be useful as tools to refresh the way we perceive literary analysis and can make it possible to present research results not only to fellow researchers, but to a wider audience. Therefore, this project can become an example of a modern approach to literary studies, for the form in which the literary analysis is presented is not conventional. To make it suitable for a general audience and not only for scholars, the literary analysis was divided into short paragraphs that give some clues for interpreting the text. Such a form does not provide in-depth literary analysis, but attracts attention to the topic and encourages people to continue analyzing the text themselves. Artworks are also helpful as a means to illustrate events of the books, because most texts chosen for this project have not been translated into English, which makes them unavailable to foreign readers. Hopefully it will also attract more attention from Estonian researchers, for at this point few papers covering Estonian Gulag literature are available.