Conference Agenda

Poster Slam
Thursday, 19/Mar/2020:
3:50pm - 5:20pm

Session Chair: Annika Rockenberger
Location: Ziedonis Hall
Ground floor


Towards Better Structured Online Data With the Project “News, Opinions or Something Else? Modeling Text Varieties in the Multilingual Internet”

Veronika Laippala1, Saara Hellström1, Sampo Pyysalo1, Liina Repo1, Samuel Rönnqvist1, Anna Salmela1, Valtteri Skantsi1,2

1University of Turku, Finland; 2University of Oulu, Finland

1. Introduction and objectives

The Internet has opened up revolutionary potential for many fields that benefit from textual data. The masses of text encompass new ways of writing and present unprecedented possibilities to explore, e.g., language, communication and culture (Berber-Sardinha, 2018; Biber and Egbert, 2018). Furthermore, thanks to the billions of words of data available online, the quality of many NLP systems, such as machine translation, can be improved tremendously (Tiedemann et al., 2016; Srivastava et al., 2017). Importantly, almost anyone can write on the Internet. Therefore, the web provides access to languages, language users, and communication settings that otherwise could not be studied.

Despite this potential, the use of web data is currently very restricted. Above all, the diversity of texts on the web poses serious challenges. Currently, all texts have a similar status in web language resources, and there is no information on the origins of the texts or, specifically, on their situational and communicative characteristics, that is, on their register (Biber 1988; Biber and Conrad 2009). A lack of understanding of register may lead to wrong conclusions about a text, as we do not know how to interpret it (see, e.g., Koplenig, 2017). For instance, we read news, discussion forum messages and encyclopedia articles very differently. Furthermore, register is one of the most important predictors of linguistic variation (Biber 2012) and critically impacts NLP: for example, methods developed to process legal texts perform poorly on texts from social media (Webber, 2009). Register would thus offer important information for turning web data from masses of raw text into organized collections that can serve specific purposes and research questions.

This poster presents the newly started project News, opinions or something else? Modeling text varieties in the multilingual Internet, running at the University of Turku, Finland. The objective of the project is to analyze and characterize the full range of registers found on the Internet, and to develop a system that can automatically detect them in online language resources. As a practical outcome, the project applies the developed system to detect registers in Universal Parsebanks (UP), a collection of web corpora that our research group compiled in previous research. The project focuses on the French, English, and Swedish UP collections, as well as on the Finnish Internet Parsebank, which can be referred to as UP Finnish.

2. Raw data

The raw online data analyzed in this project comes from Universal Parsebanks (UP), a collection of billion-word, automatically collected web corpora developed in previous projects of our research group and widely used by linguists, lexicographers, and NLP researchers, among others (see Zeman et al. 2017). UP includes 64 languages and almost 100 billion words. The most frequently used of the language-specific collections is the Finnish Internet Parsebank (Luotolahti et al. 2015), which was originally collected in a project funded by the Kone Foundation. UPs are freely usable online. As a result of the current project, register information will be added to the UP collections in French, English, Finnish and Swedish.

3. How to know which registers to include?

As the project objective is to model the full range of registers found on the Internet, a key question is what these registers are. To this end, we draw on the online register taxonomy developed by Biber et al. (2015) for the Corpus of Online Registers of English (CORE). CORE is based on a near-random sample of the English-speaking web and was manually annotated for registers by four coders. The register taxonomy was created in a data-driven manner to cover the full range of register variation found on the web.

The CORE taxonomy is hierarchical and consists of eight main registers and ~30 sub-registers in total. These are listed below in Table 1.


Narrative

News reports/News blogs, Sports reports, Personal blog, Historical article, Short story / Fiction, Travel blog, Community blog, Online article

Informational Description

Description of a thing, Encyclopedia articles, Research articles, Description of a person, Information blogs, FAQs, Course materials, Legal terms / conditions, Report

Opinion

Reviews, Personal opinion blogs, Religious blogs/sermons, Advice

Interactive discussion

Discussion forums, Question-Answer forums

How-to/Instructional

How-to/instructions, Recipes

Informational persuasion

Description with intent to sell, News+Opinion blogs/Editorials

Lyrical

Songs, Poems

Spoken

Interviews, Formal speeches, TV transcripts

Table 1: Register taxonomy applied to model the linguistic variation on the Internet

We use the CORE taxonomy to create similar, manually register-annotated corpora for the other project languages, Finnish, Swedish and French. These will be described in the next section.

4. Current status: creating multilingual online register corpora to allow register detection

We are currently manually annotating registers in samples of the Finnish, Swedish and French UP collections. The objective is to develop corpora similar to CORE for these languages.

So far, we have made the most progress with FinCORE, the Finnish register collection. The first version of FinCORE consisting of 2,200 documents was already used in a study on cross-/multilingual register detection (Laippala et al. 2019). Currently, FinCORE covers 7,500 documents, of which 1,100 have been annotated as machine translations or machine-generated texts.

For French and Swedish, we have started the register annotation, with some hundreds of texts annotated so far. As a positive outcome, we can already say that the CORE taxonomy fits the registers found in these languages relatively well. However, the distribution of registers seems to vary, which may complicate their cross-lingual modeling.

Once the annotations are ready, we can start to analyze the registers and develop classifiers to identify them in the raw UP datasets. We have already run experiments on the English CORE using BERT (Devlin et al. 2018). The task is relatively difficult because the corpus consists of a near-random sample of the web, and the documents were not selected to represent a certain category. However, the preliminary results are encouraging. In the poster, we will present the performance of the classifier.


Biber, D. 2012. Register as a predictor of linguistic variation. Corpus Linguistics and Linguistic Theory 8, 9–37.

Biber, D. and Egbert, J. 2018. Register variation online. Cambridge: Cambridge University Press.

Biber, D., and Conrad, S. 2009. Register, genre, and style. Cambridge: Cambridge University Press.

Koplenig, A. 2017. The impact of lacking metadata for the measurement of cultural and linguistic change using the Google Ngram data sets—Reconstructing the composition of the German corpus in times of WWII. Digital Scholarship in the Humanities, 32(1), Pages 169–188.

Laippala, V., Kyllönen, R., Egbert, J., Biber, D. and Pyysalo, S. 2019. Toward Multilingual Identification of Online Registers. Proceedings of Nordic Conference on Computational Linguistics (NoDaLiDa), September 2019.

Luotolahti, J., Kanerva, J., Laippala, V., Pyysalo, S. and Ginter, F. 2015. Towards Universal Web Parsebanks. Proceedings of the Third International Conference on Dependency Linguistics, Depling 2015. Uppsala: Uppsala University.

Srivastava, A., Rehm, G. and Sasaki, F. 2017. Improving Machine Translation through Linked Data. The Prague Bulletin of Mathematical Linguistics, 108(1), 355–366.

Tiedemann, J., Cap, F., Kanerva, J., Ginter, F., Stymne, S., Östling, R. and Weller-Di Marco, M. 2016. Phrase-based SMT for Finnish with more data, better models and alternative alignment and translation tools. Proceedings of the First Conference on Machine Translation, Volume 2: Shared Task Papers, 391–398. Berlin, Germany. Association for Computational Linguistics.

Titak, A., & Roberson, A. 2013. Dimensions of web registers: an exploratory multi-dimensional comparison. Corpora, 8(2), 235–260.

Webber, B. 2009. Genre distinctions for discourse in the Penn treebank. Proceedings of the joint conference of the 47th annual meeting of the ACL and the 4th international joint conference on natural language processing of the AFNLP. Association for Computational Linguistics, 674–682.

Zeman, D. et al. 2017. CoNLL 2017 Shared Task: Multilingual Parsing from Raw Text to Universal Dependencies. Proceedings of the CoNLL 2017 Shared Task: Multilingual Parsing from Raw Text to Universal Dependencies.


Evaluating, Monitoring and Regulating the Identification of Offensive Content

Thomas Mandl1, Prasenjit Majumder2, Sandip Modha2, Mohana Dave3

1University of Hildesheim, Germany; 2DA-IICT, Gandhinagar, India; 3LDRP-ITR, Gandhinagar, India

The amount of Hate Speech online has become a threat to communication and a concern for politicians. The use of algorithms for automatically detecting Hate Speech and other offensive content and handling it appropriately has become common practice. Many algorithms are being developed and tested for such tasks, and data challenges have been created for comparing their performance. However, the transfer from testbeds to realistic scenarios is still a challenge. The systems should ensure good prediction quality, run free of bias, provide some level of transparency and adapt to a changing environment. These challenges are also of great relevance for other areas in the digital social sciences. We present some ideas and preliminary results on how to approach these issues using current data collections. Based on the data and results of the HASOC track (Hate Speech and Offensive Content Identification in Indo-European Languages), we show how fusion systems and an analysis of detailed performance metrics can be used to monitor algorithms better.


In order to contribute to research on the identification of offensive content, the HASOC (Hate Speech and Offensive Content Identification in Indo-European Languages) initiative created a testbed from data on Twitter and Facebook. Datasets were generated for German, English, and Hindi. The dataset provided for training contains 17,000 tweets altogether. The entire dataset was annotated and checked by the organizers of the track. HASOC consists of three tasks: a coarse-grained binary classification task and two fine-grained multi-class classification tasks. The main task was the classification into Hate Speech and offensive content (HOF) versus non-offensive content (NOT). Examples of tweets are shown in Table 1.

The use of supervised learning with the annotated dataset is a key strategy for advancing such systems. There has been significant work in several languages, in particular for English. The objectives of HASOC are to stimulate research for further languages and to compare performance in one language to that in English. Other data collections for Hate Speech include GermEval [1] and SemEval [2].

Table 1. Examples of Posts from the HASOC Dataset

NOT In case you missed this! You're uniting the country

alright Ireland

HOF America is the only country in history to fight a war to end slavery. Democrats won’t fight the predominantly black, Muslim run countries that continue slavery today. Nor do they care about kids made sex slaves by the drug cartels. Democrats are dog

The identification of Hate Speech within a collection or a stream of tweets is a challenge because systems cannot rely solely on the content. Hate text can be about many issues and often has no clear signal words, so word lists of the kind used in sentiment analysis are expected to work less well.

The best systems for the identification of Hate Speech in English, Hindi, and German achieved Macro-F1 scores of 0.78, 0.81 and 0.61, respectively. Most approaches model language as a sequential signal. That means they encode the sequence within which a word occurs as input.

The approaches used most often were bi-directional Long Short-Term Memory (Bi-LSTM) networks processing word embedding input. The best system used the recently developed BERT (Bidirectional Encoder Representations from Transformers). Further details and results can be found in the overview article [3].
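As a reminder of what the Macro-F1 figures above measure, the following minimal sketch computes the unweighted mean of the per-class F1 scores. The function name and the six toy labels are illustrative assumptions, not HASOC data or the official evaluation script.

```python
def macro_f1(gold, pred):
    """Unweighted mean of per-class F1 scores over all labels seen."""
    labels = sorted(set(gold) | set(pred))
    scores = []
    for label in labels:
        tp = sum(1 for g, p in zip(gold, pred) if g == label and p == label)
        fp = sum(1 for g, p in zip(gold, pred) if g != label and p == label)
        fn = sum(1 for g, p in zip(gold, pred) if g == label and p != label)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        if precision + recall:
            scores.append(2 * precision * recall / (precision + recall))
        else:
            scores.append(0.0)
    return sum(scores) / len(scores)


# Hypothetical toy labels, invented for illustration only.
gold = ["HOF", "HOF", "NOT", "NOT", "NOT", "NOT"]
pred = ["HOF", "NOT", "NOT", "NOT", "NOT", "HOF"]
print(macro_f1(gold, pred))
```

Because every class counts equally, a system that does well on the frequent NOT class but poorly on the rarer HOF class is penalized more than it would be under accuracy.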

3 Performance Analysis and Monitoring

HASOC provides an opportunity to analyze the behavior of algorithms. One core issue is that many experiments deliver very similar performance. We need to explore how much they differ and whether it is possible to combine the output of several systems into one fusion system. Such meta-experiments can also help to monitor and regulate systems. Having several algorithms available is an opportunity for comparative analysis.

However, the predictions for a single document are highly diverse. We have shown that the systems rarely agree on a document. We further developed majority vote runs from the experiments submitted at HASOC. To increase the diversity of approaches, we selected the best experiment from each of the five top-performing teams (MajVoteTop). For comparison, we also created majority vote runs over all systems, including the less successful ones (MajVoteAll). By changing the number of votes required for a document, we created different artificial experiments and compared their performance to the best runs.

With respect to the combined F1 metric, the top runs are not outperformed by the fusion systems presented in Table 2. However, for the important value of recall in the class Offensive and Hate, a fusion of all systems delivers the best performance (85%). In a realistic scenario, the preference for metrics needs to be clearly defined. Regulation can demand a better trade-off between recall and precision, one that meets the need to find enough hate posts (recall) while not flagging too many posts that could be acceptable (precision). The analysis of many algorithms can help in finding this trade-off.
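The vote-based fusion and the recall/precision trade-off described above can be sketched as follows. This is a minimal illustration under stated assumptions: the function names, the three toy runs and the gold labels are invented here, and the actual MajVoteTop and MajVoteAll runs may have been built differently.

```python
def vote_fusion(runs, min_votes, positive="HOF", negative="NOT"):
    """Label a post as positive if at least min_votes runs predicted it."""
    fused = []
    for predictions in zip(*runs):
        votes = sum(1 for label in predictions if label == positive)
        fused.append(positive if votes >= min_votes else negative)
    return fused


def recall_precision(gold, pred, positive="HOF"):
    """Recall and precision for the positive class only."""
    tp = sum(1 for g, p in zip(gold, pred) if g == positive and p == positive)
    fp = sum(1 for g, p in zip(gold, pred) if g != positive and p == positive)
    fn = sum(1 for g, p in zip(gold, pred) if g == positive and p != positive)
    recall = tp / (tp + fn) if tp + fn else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    return recall, precision


# Hypothetical predictions of three systems for four posts (toy data).
runs = [
    ["HOF", "NOT", "HOF", "NOT"],
    ["HOF", "NOT", "NOT", "NOT"],
    ["HOF", "HOF", "HOF", "NOT"],
]
gold = ["HOF", "HOF", "NOT", "NOT"]

# Raising the number of required votes trades recall for precision.
for min_votes in range(1, len(runs) + 1):
    fused = vote_fusion(runs, min_votes)
    print(min_votes, recall_precision(gold, fused))
```

In this toy setting, requiring a single vote maximizes recall for the HOF class, while requiring all votes maximizes precision. That is the kind of trade-off a regulator would have to fix in advance.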

4 Transparency

Hate Speech detection algorithms might unduly limit the expressions of citizens. They potentially affect the right to free speech and should be carefully monitored and evaluated in order to investigate whether they are free of hidden bias. Within the taxonomy of Lipton [4], the most adequate way to achieve transparency seems to be “Explanation by Example”. Current research also investigates “Post-hoc Interpretability”.

An experiment found that textual description and explanation can improve acceptance of a system [e.g., 5]. However, the relation between explanation and system functions is unclear. A future challenge will be the delivery of similar examples that are acceptable to a user. This is difficult due to the opacity of deep learning systems. For example, similarities based on word embeddings are hard for humans to interpret. Because word embeddings are derived from the sequences in which words occur, even words with very different meanings (good, bad) that nevertheless appear in similar contexts do not receive a very low similarity score.
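The point about good and bad can be made concrete with cosine similarity, the measure commonly used to compare embedding vectors. The three-dimensional vectors below are invented for illustration and do not come from any trained model; they only mimic the situation in which two antonyms share most of their context dimensions.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Toy vectors (assumption): "good" and "bad" occur in similar contexts,
# "table" does not.
toy_embeddings = {
    "good": [0.9, 0.8, 0.1],
    "bad": [0.8, 0.9, -0.1],
    "table": [-0.2, 0.1, 0.9],
}

print(cosine(toy_embeddings["good"], toy_embeddings["bad"]))
print(cosine(toy_embeddings["good"], toy_embeddings["table"]))
```

An explanation that presents the nearest neighbours of a post in such a space may therefore surface examples that a human reader would not consider similar in meaning.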

5 Conclusion

Of course, freedom of speech needs to be guaranteed in democratic societies for future development. Nevertheless, offensive text that hurts the sentiments of others needs to be restricted. As abusive language is increasing on many internet platforms, technological support for the recognition of such posts is necessary. HASOC contributes to research in classification. We showed how a benchmark can be used to explore further issues relating to the realistic application of content moderation systems. Such challenges also appear in many other areas of the digital social sciences.


The Grammar of Politics. Modelling Technocratic Speech and Argumentation in Parliamentary Debate 1918–2017

Ruben Ros

Utrecht University, The Netherlands

The rise of populism, the decline of ideology and the challenges of transnational governance have brought the concept of technocracy back to the centre of academic attention (Esmark, 2016). The concomitant narrative of “technocratization” involves the erosion of democratic debate, following from either the (institutional) displacement of decision-making powers or the diffusion of technocratic ideas. Technocracy, however, is hard to trace. It operates not so much on the level of what is said, but how it is said. This project therefore studies technocracy as a specific mode of thinking by computationally modelling argumentation in Dutch and British parliamentary proceedings between 1918 and 2017. The computational identification of different types of argumentation and the examination of connections between these different types, political actors and specific concepts allows a systematic analysis of the rise of technocratic modes of thinking in this period.


This project employs the digitized parliamentary proceedings of the Dutch First and Second Chambers (Handelingen der Staten-Generaal, 1814-2017) and the transcripts of the debates held in the House of Commons and the House of Lords (Hansard Corpus, 1803-2005 & HanDeSet, 1997-2017). Both datasets consist of full-text proceedings accompanied by metadata on speakers, topics and the structure of debates (Alexander and Davies, 2015; Marx et al., 2012). The data has been semantically and syntactically enriched, which allows the division of the data based on topics and linguistic features (Nanni et al., 2019). All datasets are fully accessible through APIs.

Parliamentary debate has long been regarded as a mere veil obscuring power interests. In the wake of the linguistic and cultural turns of the 1990s, parliamentary sources have been rediscovered as objects in the study of political culture (Mergel, 2002). The 2000s saw a greater focus on parliamentary language, involving such approaches as discourse analysis and cognitive linguistics (Ihalainen et al., 2016; Bayley, 2004).

This project considers Dutch and British parliamentary proceedings from the period between 1918 and 2017 because in this era, modern democracies formed. Universal suffrage was established in the late 1910s in both nations. During the twentieth century, parliamentary culture developed in different ways in the Netherlands and the United Kingdom. Nevertheless, they shared important political events and developments: from the postwar period of reconstruction and the consensual nature of parliamentary debate to the conservative turn in the 1980s and the resurgence of populism in the 2000s.


To connect arguments and argumentative types to the topic of this research, I identify three elements of technocracy/technocratization that relate to argumentation. First, technocratic thinking can be understood as a way of managing contingency and denying the existence of political alternatives (Sánchez-Cuenca, 2017). Second, technocratic thought is marked by frequent references to external actors and factors (McKenna & Graham, 2000). Third, this generalizing dynamic leaves the politician with a smaller “space” in which to operate (Palonen, 2003). This tripartite definition of technocracy, combined with the insights derived from the literature, leads to a hypothesis with the following components:

Parliamentary debate shows a gradual increase in technocratic argumentation starting in the 1930s and accelerating in the 1950s.

Technocratic argumentation is originally found in debates on macro-economic policies and welfare state provisions.

Technocratic argumentation is not restricted to specific political parties.

In the 1970s and 1980s, technocratic argumentation “spills over” into other policy areas, leading to convergence in the form of technocratic argumentation and thus a technocratization of political debate.


The project studies the topic of technocracy and technocratization from the perspective of political grammar. It does so by focusing on a specific unit of analysis: the argument. Arguments can be defined as linguistic expressions “aimed at convincing a reasonable critic of the acceptability of a standpoint by putting forward a constellation of propositions justifying or refuting the proposition expressed in the standpoint” (Van Eemeren and Grootendorst, 2004). Philosophers distinguish between various argumentative forms and have composed many so-called “argument schemes” (Lumber, 2016). The proposed research uses a recently proposed scheme: the Periodic Table of Arguments (PTA). This classification has been formulated as a synthesis of existing approaches and is particularly suitable for computational analysis (Wagemans, 2019). Compared to other frequently used schemes, such as the Toulmin scheme or Walton's scheme, the PTA focuses on the linguistic aspects of arguments, which results in a scheme that is more easily translated into models and text classification algorithms.

Recently, argument schemes have been fruitfully applied to “argument mining”: the computational extraction and classification of arguments from textual data. Argument mining has been highly successful in detecting and classifying arguments in, for example, online discussions and scientific papers (Janier and Saint-Dizier, 2019). The present project breaks down the process of argument mining into a specific series of “subtasks” (Stede & Schneider, 2018). First, argumentative sentences have to be identified and segmented into so-called “Argumentative Discourse Units” (ADUs). This entails the identification of a conclusion (“the statement that is doubted”) and its premise(s) (“the statement that is supposed to take away that doubt”) (Wagemans, 2019). Subsequently, the nature of the relationship between these ADUs needs to be determined. Premises must be tied to conclusions and different arguments must be separated. Based on the classification of ADUs, the relations between them and the specific configuration of linguistic entities such as subjects and predicates, an argument type will be assigned (Figure 1).
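One possible way to represent the output of these subtasks is sketched below. The class names, the field names and the example sentences are hypothetical and serve only to show how segmented ADUs, premise-conclusion relations and a later type assignment could fit together; the project's actual data model is not specified in this abstract.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class ADU:
    """An Argumentative Discourse Unit with its role in the argument."""
    text: str
    role: str  # "conclusion" or "premise"


@dataclass
class Argument:
    conclusion: ADU
    premises: List[ADU] = field(default_factory=list)
    argument_type: Optional[str] = None  # filled in by a later classification step


def build_arguments(adus: List[ADU], relations: List[Tuple[int, int]]) -> List[Argument]:
    """Group premises under conclusions from (premise_index, conclusion_index) pairs."""
    arguments = {
        index: Argument(conclusion=adu)
        for index, adu in enumerate(adus)
        if adu.role == "conclusion"
    }
    for premise_index, conclusion_index in relations:
        arguments[conclusion_index].premises.append(adus[premise_index])
    return list(arguments.values())


# Invented example sentences, not taken from the parliamentary corpora.
adus = [
    ADU("The bill should be postponed", "conclusion"),
    ADU("the figures are not yet available", "premise"),
    ADU("The subsidy must be maintained", "conclusion"),
    ADU("the advisory council recommends it", "premise"),
]
relations = [(1, 0), (3, 2)]

for argument in build_arguments(adus, relations):
    print(argument.conclusion.text, "<-", [p.text for p in argument.premises])
```

Keeping the argument type as an optional field reflects the order of the subtasks: segmentation and relation identification come first, and the type is assigned only once conclusion and premises are linked.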

Performing the subtasks computationally requires the generation of a data-driven model for argument classification, which is dependent on manually annotated training data. For this project, manual annotation is considered an added value: it facilitates a first iteration of close reading and hypothesis testing. Annotation guidelines will be developed based on the existing documentation on the PTA and other argument mining projects (Stab and Gurevych, 2014; Visser et al., 2018). After establishing a sufficient level of Inter-Annotator Agreement, the annotations will be exported in the standard Argument Interchange Format (AIF) (Lawrence et al., 2016).
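The abstract does not say which agreement measure will be used. As an illustration, the sketch below computes Cohen's kappa, a common chance-corrected measure for two annotators; the choice of measure, the function name and the toy labels are assumptions.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Chance-corrected agreement between two annotators on the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("both annotators must label the same non-empty item list")
    n = len(labels_a)
    observed = sum(1 for a, b in zip(labels_a, labels_b) if a == b) / n
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    expected = sum(
        counts_a[label] * counts_b[label] for label in set(labels_a) | set(labels_b)
    ) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used one identical label throughout
    return (observed - expected) / (1 - expected)


# Hypothetical annotations of four text segments.
annotator_1 = ["premise", "premise", "conclusion", "conclusion"]
annotator_2 = ["premise", "premise", "conclusion", "premise"]
print(cohen_kappa(annotator_1, annotator_2))
```

Here the annotators agree on three of four segments (0.75), but half of that agreement is expected by chance given their label distributions, which is why the chance-corrected value is lower than the raw agreement.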

The data structure that consists of argumentative texts, conclusions, premises and metadata will subsequently be used to investigate three components of technocratization. First, I investigate the general change in parliamentary argumentation. This entails mapping the changing prominence of argument types, looking at concepts in specific argument types and investigating the order of arguments in debates. Second, I examine whether arguments and types related to technocratic reasoning originate in specific policy areas. Lastly, I relate my findings to parties and politicians and ask to what extent technocratic argumentation runs through the traditional ideological divides or correlates with specific political actors.

This project looks at political language from a radically new perspective. By applying argument mining to historical data, new light is shed on political language and the changing ways in which politicians have argued. As such, this project contributes to the integration of the rapidly innovating field of argument mining and political history.


Alexander, W. and Davies, M. (2015). The Hansard Corpus 1803-2005.

Budzynska, K., & Reed, C. (2019). Advances in Argument Mining. Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics. 39-42.

Esmark, A. (2016). Maybe it is time to rediscover technocracy? An old framework for a new analysis of administrative reforms in the governance era. Journal of Public Administration Research and Theory, 27(3), 501-516.

Ihalainen, P., Ilie, C., & Palonen, K. (eds.). (2016). Parliament and Parliamentarism: a comparative history of a European concept. London/Frankfurt: Berghahn Books.

Janier, M., & Saint-Dizier, P. (2019). Argument Mining: Linguistic Foundations. London: John Wiley & Sons.

Lange, M. van, & Futselaar, R. (2019). Debating Evil. Using Word Embeddings to Analyse Parliamentary Debates on War Criminals in the Netherlands. Contributions to Contemporary History, 59(1).

Lawrence, J., Duthie, R., Budzynska, K., & Reed, C. (2016). Argument Analytics. Proceedings of the Sixth International Conference on Computational Models of Argument 2016. Amsterdam: IOS Press.

Victor, J. N., Montgomery, A. H., & Lubell, M. (Eds.). (2017). The Oxford Handbook of Political Networks. Oxford: Oxford University Press.

Marx, M., Doornik, J. van, Nusselder, A., & Buitinck, L. (2012). Dutch Parliamentary Proceedings 1814-2012, nonsemanticized. Distributed by DANS EASY.

McKenna, B. J., & Graham, P. (2000). Technocratic discourse: A primer. Journal of Technical Writing and Communication, 30(3), 223-251.

Mergel, T. (2002). Überlegungen zu einer Kulturgeschichte der Politik. Geschichte und Gesellschaft, 28(4), 574-606.

Nanni, F., Menini, S., Tonelli, S., & Ponzetto, S. P. (2019). Semantifying the UK Hansard (1918-2018). Proceedings of the ACM/IEEE-CS Joint Conference on Digital Libraries (JCDL’19). New York.

Stede, M., & Schneider, J. (2018). Argumentation Mining. Synthesis Lectures on Human Language Technologies, 11(2), 1-191.

Visser, J., Lawrence, J., Wagemans, J. H., & Reed, C. (2018). Revisiting Computational Models of Argument Schemes: Classification, Annotation, Comparison. Proceedings of the Sixth International Conference on Computational Models of Argument 2018. Amsterdam: IOS Press.

Wagemans, J. H. (2019). Four basic argument forms. Research in Language, 17(1), 57-69.


Live Sentiment Annotation of Movies via Arduino and a Slider

Thomas Schmidt, David Halbhuber

Media Informatics Group, University of Regensburg, Germany

In this late-breaking poster, we present the first version of a novel approach and prototype for performing live sentiment annotation of movies while watching them. Our prototype consists of an Arduino microcontroller and a potentiometer, which is paired with a slider. We motivate the need for this approach by arguing that presenting the multimedia content of movies and performing the annotation live during viewing is beneficial for the annotation process and more intuitive for the viewer/annotator. After outlining the motivation and the technical setup of our system, we will report on the studies we plan in order to validate the benefits of our system.
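How slider positions might become time-stamped sentiment values can be sketched as follows. This is not the authors' implementation: the 0 to 1023 reading range corresponds to the 10-bit analog input of common Arduino boards, while the target scale from -1 to 1, the function names and the sample values are assumptions made for illustration.

```python
def to_sentiment(raw, raw_max=1023):
    """Map a raw potentiometer reading to a sentiment value in [-1.0, 1.0]."""
    if not 0 <= raw <= raw_max:
        raise ValueError("reading outside the expected range")
    return 2 * raw / raw_max - 1


def annotate(samples):
    """Convert (seconds_into_movie, raw_reading) pairs to sentiment annotations."""
    return [(seconds, round(to_sentiment(raw), 3)) for seconds, raw in samples]


# Hypothetical readings: slider at the bottom, near the middle, and at the top.
samples = [(0.0, 0), (1.5, 512), (3.0, 1023)]
print(annotate(samples))
```

Storing the playback time alongside each value is what allows the annotations to be aligned with scenes afterwards.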


Object Recognition in Illustrated Children Books: Challenges of Applying Computer Vision Systems

Thomas Mandl1, Im Chanjong1, Helm Wiebke2, Schmideler Sebastian2

1University of Hildesheim, Germany; 2University of Leipzig, Germany

Research on children’s and youth literature in the 19th century can be supported by advanced algorithms. Modern deep learning image processing can help to quantify knowledge about the visual components in books. In particular, object recognition systems can identify the portfolio of objects in book illustrations. In a study with several hundred books, we applied such systems to find illustrations and classify them. First results are shown and discussed. They show that persons appear in illustrations in fiction books with a higher frequency than in non-fiction books. Some of the challenges for analyzing the historical appearance of objects are elaborated.

Digital Humanities is having a considerable impact on humanities research related to text. Many text mining tools have been developed and are currently being applied to genuine research questions in the humanities. This trend is contributing to a larger variety of methods being used. There has been no comparable paradigm shift in research related to visual material. Digital historical corpora allow automatic access to images and their analysis in great numbers. This can lead to innovative research questions and quantitative results. In particular, the analysis of digitized historical books with rich visual materials can be of great value.

Research on historical children’s and youth books has not yet often been the subject of digital humanities (DH) studies. This research requires processing of both text and images. Children’s books typically contain more images than adult books. As a consequence, they are of special interest for an analysis of images. In addition, they form a closed category that nevertheless contains sufficient variety [1,2].

Illustrated books have played a significant role in knowledge dissemination. The declining production costs for printed images have led to a growing exposure of more and more people to rich visual resources. Research in this area can identify trends in the objects depicted. The algorithmic analysis seems promising [3].

State of the Art

Few researchers have processed large amounts of book images to address issues of style or objects. The HBA data challenge for old books intends to improve algorithms for separating illustrations from text automatically [4].

One experiment in the art domain by Saleh & Elgammal is dedicated to classifying the painter of an artistic work. Such work is highly dependent on the type of paintings in the collection [5]. An approach to identifying objects within artwork has also been presented. Similar to our approach, it needs to deal with the domain shift and apply current technology to historic print [6].

A recent project focuses on graphic novels. Current state-of-the-art CNNs are applied to tasks like author identification with very good success [7]. In addition, the processing is aimed at measuring the drawing style of a graphic novel in order to find similar books. A study of modern children’s books based on information available in catalogues has analyzed market structures and book formats [8].

One of the first goals for the research of images in historical children books lies within the production technologies. As a classification problem with few classes, it seems like a challenge which could be solved with current technology. However, detailed analysis of production technology in the 19th century is still a hard task [9].

Our research exploits two collections of mainly German children’s books that are partly digitized. The first collection is the Wegehaupt corpus maintained by the Staatsbibliothek in Berlin [10].

The second data collection is based on the Hobrecker collection. This collection of books is archived in the library of the Technical University of Braunschweig. A subset has been digitized and is available online [11].

Both collections are of great interest for cultural research. They contain a rich variety of different types of children’s books, mainly from the 19th century: e.g. alphabetization books, biographies, natural history descriptions as well as adventure and travel stories.

4 Results and Analysis

Convolution Neural Networks (CNN), the recent state of the art technology is known to be very effective in automated feature detection and subsequent classification in many domains [e.g. 12]. In the approach presented in this paper, CNNs have been used as the processing model for classification. First, we classify whether a page contains an image and locate it on the page. Second, we apply the object recognition technology Yolo to the images and record the recognized object types.

The first analysis is done on 678 books of the Wegehaupt collection. Training data was generated by students. The classification is very reliable. This allows a look at the development of illustrations within the 19th century. In the second half, the improvement of technology for printing led to more images per page on average (see figure 2).
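
The images-per-page comparison described above can be sketched as follows. This is a minimal illustration, not the project's code: the page records, the function name and the split at 1850 are assumptions made for the example, and the numbers are invented.

```python
from collections import defaultdict

# Hypothetical per-page records: (publication year, number of images detected on the page).
pages = [
    (1812, 0), (1812, 1), (1835, 0), (1835, 1),
    (1861, 1), (1861, 2), (1888, 2), (1888, 1),
]

def images_per_page_by_half(pages, split_year=1850):
    """Average number of detected images per page before and from split_year on."""
    totals = defaultdict(lambda: [0, 0])  # period -> [image count, page count]
    for year, n_images in pages:
        period = "first half" if year < split_year else "second half"
        totals[period][0] += n_images
        totals[period][1] += 1
    return {period: images / n_pages for period, (images, n_pages) in totals.items()}

averages = images_per_page_by_half(pages)
```

With real detector output, the same aggregation could be run per decade instead of per half-century by changing how `period` is derived from the year.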

The results of Yolo [13] have been recorded for a subset of 321 books from the Hobrecker collection. The classification is not very reliable; a detailed evaluation is ongoing. However, for a statistical analysis, it seems sufficient. We manually classified the books as fiction or non-fiction. The analysis shows that there is no difference in the overall number of illustrations for the two classes. However, the fiction books contain more humans and horses. Non-fiction books, which display many different objects and animals, seem to cause that difference.
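
A comparison of this kind can be sketched as below, assuming the detector output is stored as one list of labels per book together with the manual genre tag. The records, label names and function name are hypothetical and do not reproduce the project's data.

```python
from collections import Counter

# Hypothetical detector output: recognized object labels per book, plus a manual genre tag.
books = [
    {"genre": "fiction", "labels": ["person", "person", "horse"]},
    {"genre": "fiction", "labels": ["person", "horse", "dog"]},
    {"genre": "non-fiction", "labels": ["bird", "person", "boat"]},
    {"genre": "non-fiction", "labels": ["cat", "bird", "clock"]},
]

def mean_label_counts(books, genre):
    """Mean number of detections per book for each label within one genre."""
    selected = [book for book in books if book["genre"] == genre]
    counts = Counter(label for book in selected for label in book["labels"])
    return {label: n / len(selected) for label, n in counts.items()}

fiction = mean_label_counts(books, "fiction")
non_fiction = mean_label_counts(books, "non-fiction")
```

In this toy data both genres have the same number of detections per book, while the distribution over labels differs, which is the pattern the paragraph describes.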

Most research in image processing is currently carried out on photographs. Such collections differ greatly from the non-realistic drawings and illustrations found in children's books, and it is unclear how well model transfer actually works. Unrealistic objects such as anthropomorphic animals or fairy tale figures occur often. These obviously cannot be recognized by systems trained on contemporary photographs.

In addition, the type of material also leads to many other challenges. The distribution of the class frequencies is highly skewed. The most frequent classes are humans and a few animals. This does not allow a quantitative tracing of many different motifs through the century.

5 Outlook

For further analysis, other object recognition systems will be applied. Transfer learning will also be used: a small number of objects will be labelled and the algorithm re-trained.

Future work also needs to address stylistic and artistic aspects of the children's books. We intend to analyze page patterns and their development over time. A deeper analysis of the content on a page, and in particular of frequent classes (primarily pictures of humans), offers great potential for advanced analysis tools for digital humanists.


Studying Vernacularization and the Development of the Public Sphere through Modal Expressions in Written Finnish, 1820–1917

Antti Kanner, Hege Roivainen, Tuuli Tahko, Jani Marjanen

University of Helsinki, Finland


The early modern and modern periods in Europe entailed a transformation in linguistic geography, where diglossic systems (Ferguson 1959; Fishman 1967; Hudson 2002) that had prevailed for centuries started to erode and new national languages based on local vernaculars were adopted and promoted as languages of administration, law, politics, learning and the arts. This process is sometimes labelled as vernacularization. In a three-year project at the University of Helsinki, we combine theories of nationalism and the sociology of language to build our understanding of the complex interplay of state, language and public discourse in the dynamic linguistic environment of vernacularization. We do this by using newly available digitized historical corpora that can be accessed with computational text mining. The project brings together perspectives from 1) studying bibliographic information to look at changes in the publishing landscape, 2) historical analysis of changing past perceptions of different languages, and 3) linguistic analysis of how the features of a language change when it becomes more readily used in a written, more elevated form in public discourse.

This poster focuses on the third perspective by analyzing changes in the Finnish language in a diachronic corpus consisting of digitized newspapers from 1820–1917. In the Finnish context, vernacularization happened in two steps, first with Swedish being adopted more and more as a language of administration and science in the eighteenth century, and second, in a more powerful and rapid transformation, when Finnish developed into a national language with state- and nation-bearing functions during the course of the nineteenth century. From having been an underdeveloped and underprivileged written language, Finnish had, due to active promotion and development of written standards, become a state-bearing language by the early twentieth century (Engman 2016; Huumo, Laitinen & Paloposki 2004). Our hypothesis regarding the development of modern standard Finnish is that once Finnish became more readily used in public debate, more nuanced and complicated structures emerged to accommodate the newly arisen rhetorical needs. The increase in printed material required a more elaborate notion of the public, and the potential audience and readership of the texts – an audience that was largely unknown to the author.

In the study, we are especially interested in the linguistic structures expressing epistemic modality and other forms of evidentiality. We hypothesize that these specific linguistic resources, used to align authors’ views with those of their perceived audiences, are robust markers of the larger linguistic change that took place when the language developed from a mostly agrarian spoken language into a literary and administrative language. Hence we study the emergence, frequency, and distribution of epistemic modal expressions. We examine this hypothesis by comparing it against two separate baselines. The first one of these establishes the general confidence interval of temporal variance and is made out of a random sample belonging to comparable grammatical categories. The second one seeks to establish the general temporal pattern of emerging public discourse, by matching the aforementioned grammatical constructions with key terms related to the overt discourse on the social shifts of the public sphere (especially the rise of terminology like julkisuus ‘public sphere’ or julkiso ‘the public’) and variables connected to the development of the concrete material limits of public discourse (i.e. growth in book printing and newspaper publishing (Tolonen et al. 2018; Marjanen et al. 2019)), respectively.

Materials and methods

Our data set consists of newspapers published in Finland between the years 1771 and 1917 (with the first Finnish-language publications from 1820 onwards). These have been digitized and made available for data and text mining by the National Library of Finland. The bulk of the newspapers are in Finnish and Swedish. They consist of 5.2B (in Finnish) and 3.4B (in Swedish) word tokens and provide a nearly complete record of newspapers and periodicals published in the country in this period. The newspaper corpus as such cannot be seen as representative of the Finnish language in general, but it is the best historical corpus available, as newspapers covered a wide range of topics and recorded new features in language by publishing everything from poetry to reports on political events and reflections on academic texts.

As linguistic markers, a catalogue of 92 central modal expressions has been selected based on established descriptive grammars of Finnish. Not all of them have epistemic or evidential uses in modern Finnish, but as we are building a scalable approach it is advantageous to look at a wider range of expressions. Furthermore, a possible implication of our working hypothesis – that epistemic and evidential expressions developed their fine-lined ecologies as an outcome of the vernacularization process – is that it is quite plausible that the division of labour between expressions devoted to deontic, dynamic and epistemic modalities was more fuzzy and dynamic before the onset of that process. Our catalogue of modal expressions includes morphological moods, modal verbs and verb constructions (voida ‘can’, on tehtävä ‘must be done’), modal adjectives (e.g. todennäköinen ‘probable’, ilmeinen ‘obvious’) and modal adverbs (e.g. oletettavasti ‘presumably’, tuskin ‘hardly’). As is often the case with vernacularization, the sources of newly developed expressions fall roughly under three categories. The first are expressions that already existed in the language, either in earlier written forms or in spoken dialects only but which undergo either a semantic or syntactic change (or both). The second group are translations and calques from languages that are further ahead in the vernacularization process (in the case of Finnish, mostly from Swedish), these languages providing models for what kind of expressions are presumably required to fulfill literary, administrative and public functions. Finally, the third group are productive formations based on the language’s own resources which have not been in use earlier nor have any obvious outside models.

In analysing the development of modal expressions, the key objects of interest are 1) the overall frequencies of the modal expressions, relative to the amount of text in general and relative to each other, and 2) the scope of use of each expression. The hypothesis of the study dictates that there should be considerable changes in the ecology of epistemic modal expressions, these changes mainly taking the shape of the specialization of functions for a number of expressions, and that shifts in these patterns should happen in concordance with other variables describing the emergence of the public sphere. The most robust signals for these changes are perhaps the expressions’ relative frequencies, mapping which is a relatively trivial task (given the common technical reservations relating to unreliable OCR results and the temporally uneven distribution of the text mass). A more detailed picture is achieved by looking at the use of modal expressions as a whole, which is a much more demanding undertaking: the expressions’ grammatical, contextual and semantic features have to be tracked simultaneously and in correspondence to each other. The overall approach is akin to behaviour profile analysis, where occurrences of the studied linguistic items are examined across a wide range of variables and then subjected to scrutiny by univariate, bivariate and multivariate statistical tests (e.g. Divjak & Gries 2006; Arppe 2008).
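
Because the amount of text grows unevenly over time, raw hit counts need to be normalised by corpus size before they can be compared across years. A minimal sketch of that step is given below; the function name and all counts are hypothetical and are not results from the corpus.

```python
def per_million(hits_per_year, tokens_per_year):
    """Relative frequency per million tokens for each year in the corpus."""
    return {
        year: 1_000_000 * hits_per_year.get(year, 0) / tokens
        for year, tokens in tokens_per_year.items()
    }

# Invented numbers for one modal expression, for illustration only.
tokens_per_year = {1850: 2_000_000, 1880: 10_000_000, 1910: 50_000_000}
hits_per_year = {1850: 40, 1880: 500, 1910: 4_000}

relative = per_million(hits_per_year, tokens_per_year)
```

In this toy example the raw counts grow a hundredfold while the relative frequency only grows fourfold, which is why the comparison is made on normalised values.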

Concluding discussion

Our study traces historical changes in language features and relates them to the development of the public sphere in Finland. Detailed analysis, based on large-scale historical corpora, of where and when modal expressions have come to be used in written Finnish is a major contribution to the study of the Finnish language, even if the central aim of our study resides in understanding the historical process of vernacularization. We have already found that the relative frequencies of a selection of the modal expressions increase in conjunction with key stages in the development of the Finnish press, which supports our original hypothesis. However, a full analysis, including multivariate statistical tests, is still needed for making this argument in full. We further believe that testing our hypothesis on the Finnish case, which is rather straightforward with a relatively rapid vernacularization process, may lead to testing the hypothesis for other languages as well. If the growth of modal expressions is a feature that reflects the writers’ increasing need to take into account an abstract notion of the general public, we should see similar developments elsewhere as well.


Adapting a Topic Modelling Tool to the Task of Finding Recurring Themes in Folk Legends

Maria Skeppstedt, Rickard Domeij, Fredrik Skott

The Institute for Language and Folklore, Sweden

A topic modelling tool, which was originally developed for performing text analysis on very short texts written in English, was adapted to the text genre of Swedish folk legends. The topic modelling tool was configured to use a word space model trained on a Swedish corpus, as well as a Swedish stop word list. The stop word list consisted of standard Swedish stop words, as well as 380 additional stop words that were tailored to the content of the corpus and therefore also included older spelling versions and grammatical forms of Swedish words. The adapted version of the tool was applied to a corpus consisting of around 10,000 Swedish folk legends, which resulted in the automatic extraction of 20 topics. Future versions of the tool will be extended with text summarisation functionality, in order to retain the text overview provided by the tool also when it is applied to longer folk legends.
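
The stop word configuration can be illustrated with a small sketch. It is not the tool's actual code or configuration format: the word lists are tiny hypothetical samples, the sentence is invented, and the function name is an assumption.

```python
# Small sample of standard Swedish stop words (the real list is much longer).
standard_stop_words = {"och", "att", "det", "som", "en", "var"}

# Hypothetical corpus-specific additions, including older spellings and verb forms.
extra_stop_words = {"hvad", "af", "voro"}

stop_words = standard_stop_words | extra_stop_words

def content_words(text, stop_words):
    """Lower-case the text, strip punctuation and drop stop words."""
    words = (token.strip(".,;:!?\"") for token in text.lower().split())
    return [word for word in words if word and word not in stop_words]

sample = "Det var en gubbe som såg hvad trollen voro."
remaining = content_words(sample, stop_words)
```

Without the additional entries, older forms such as "hvad" and "voro" would pass the standard filter and could end up among the topic terms.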


The New Possibilities for Philological Research in the Digital Archive: the Case of “The Voices of Spring” by Maironis

Magdalena Slavinska

Vilnius University, Lithuania

It is widely understood that digital media may allow users of a digital archive to thoroughly examine its digitized objects and their data. Moreover, connecting separate objects (witnesses) into a hyperlink network may provoke the reader to reflect upon these connections as well. The latter might be visualized to the reader in different ways: by hypertextual links or by side-by-side comparison on the screen. Connecting visible elements may be even more important in some types of archives: for example, the genetic archive of a literary work presents different versions of the same text so that the user perceives them as a continuous process.

The aim of this paper is to present the first digital genetic archive in Lithuania – a digital scholarly edition of “The Voices of Spring” by the Lithuanian poet Maironis (1862-1932). The 1st and the 5th authorial editions of “The Voices of Spring” are separated by thousands of textual variants, and the number of poems in the collection increased from 45 to 131. The development of the poems took four decades (from the first published verses in 1885 until 1927), and it demonstrates the effort to modernize the language of the verses in line with the processes of linguistic modernization, which coincided with the period of Maironis’ creative activity.

One may ask how such numerous differences between the versions of the poems might be clarified in an archive. The striking number of changes between the 1st and the 5th edition calls for a step-by-step demonstration of continuous writing and editing processes. There is also a need for a conceptualized presentation that would allow the genetic process to be summarized. This scholarly edition aims to carry out a genetic reconstruction of Maironis’ poetry by presenting the facsimiles, the XML‐encoded versions and commentaries.

The user of the genetic archive encounters two ways of visualizing connections between the objects: the spatial and the hypertextual. Two objects connected spatially might be compared and examined visually on the screen. Hypertextual links attached to an isolated object signal to the reader that there is another object with which it can be compared, but the comparison between equally visible objects will not be possible until the user makes another move to examine the signalled object. The archive offers two types of comparisons to reflect upon. The user might examine the identities and significant distinctions between different digital representations of the same witness, e.g. between the facsimile (image) and the XML file (text), and deepen his understanding of one particular witness. The archive could also visualize in different ways the genetic connections between the versions (various witnesses).

By collecting all the witnesses of the text creation, a genetic archive seeks to visualize the process of writing for the reader, that is, offers him a convenient platform for viewing and analyzing the processes involved in the creation of the text. Therefore, it is equally important to allow the user, on the one hand, to deepen the knowledge about each version (separately taken from the linear or more complicated genetic sequence), and, on the other, to more thoroughly reflect on the transitions between the witnesses (authorial revisions of the poems).

In a genetic dossier one looks for the genetic relations between the variants. The identified genetic relations are further investigated and interpreted, and attempts are made to represent their structures in the archive. Finding the best visualization might be viewed as a two-way process. One attempts to comprehend the concept of genetic relation in the case of a particular archive. Furthermore, one must consider several possible graphical forms that would be capable of conveying the process of changes in the text.

Highlighting the elements that differ in each version might serve as a point of departure for the genetic reconstruction. The sequence of those elements might form a certain logic, e.g. it might resemble the more general historical-linguistic process. However, the sequence of linguistic changes viewed separately from the texts may be treated as a process involved in the creation, but not a complete representation of the genesis. Linguistic changes might not cover all the elements in the text that provoked the revisions. Therefore, it is useful to literally keep in sight both the text and those elements involved in the genesis that have been stated so far. In the digital scholarly edition of “The Voices of Spring” these functionalities are achieved through a slight modification of EVT 2 (developed by the Edition Visualization Technology Project team, led by Roberto Rosselli Del Turco). The user reads and interprets the genetic archive partly by means of computational methods. Digital representations of the textual witnesses and the results of computational processing are displayed to be grasped together. The computer-generated results of such functionalities as word concordances and visual comparison of different versions may augment the perception of the textual variation.
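
How variants between two witnesses can be extracted computationally may be sketched with Python's standard `difflib` module. This is only an illustration of the general idea, not the mechanism used by EVT 2; the two lines are invented placeholders rather than Maironis' text, and the function name is an assumption.

```python
import difflib

def word_variants(version_a, version_b):
    """Return the word-level differences between two versions of a line."""
    a, b = version_a.split(), version_b.split()
    matcher = difflib.SequenceMatcher(a=a, b=b, autojunk=False)
    return [
        (tag, " ".join(a[i1:i2]), " ".join(b[j1:j2]))
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
        if tag != "equal"  # keep only deletions, insertions and replacements
    ]

# Placeholder lines standing in for the same verse in two editions.
first_edition = "the old river runs through the quiet valley"
fifth_edition = "the river runs through the silent green valley"

variants = word_variants(first_edition, fifth_edition)
```

Each tuple names the kind of change and the affected words in both versions, which is the information a side-by-side display needs in order to highlight variants.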

The archive aims to provide a platform for scholarly research that can be developed further as new sets of data and commentaries are added. Visualization of the linguistic tendencies involved in the genetic process of “The Voices of Spring” can serve as material for further investigation. For example, syntactical changes made by Maironis may not only indicate improvements to the poetry, but also signify the transition away from foreign syntactic structures – a process fully documented through his five authorial editions.

Finally, a genetic archive, whose concern is to collect all the indices of the author’s thought, may take flexible forms unfamiliar to previous readers of “The Voices of Spring”. The archive also incorporates significant versions of the poems from periodicals and books which provide more details on the chronology of changes. Between two subsequent editions several poems were published together as a part of another book, and some of them were further changed before publication in the new edition of “The Voices of Spring”. Therefore, a distinct genetic trajectory might be demonstrated through each of the poems in the collection, though they all have been published in the final authorial edition in 1927.


Bleier, Roman, et al., editors. Digital Scholarly Editions as Interfaces, Norderstedt: Books on Demand, 2018.

Drucker, Johanna. “Performative Materiality and Theoretical Approaches to Interface.” Digital Humanities Quarterly, Vol. 7, Issue 1, 2013. 13 Jan. 2020.

McGann, Jerome. A New Republic of Letters: Memory and Scholarship in the Age of Digital Reproduction. Cambridge, MA: Harvard University Press, 2014.

Rosselli Del Turco, Roberto. “Designing an advanced software tool for Digital Scholarly Editions: The inception and development of EVT (Edition Visualization Technology).” Textual Cultures, Vol. 12, No. 2, 2019. 15 Jan. 2020.


Digital Maps for Linguistic Diversity

Coppélie Cocq

University of Helsinki, Finland

Language maps have a central role in educational books, atlases, etc., illustrating pedagogical efforts toward a presentation of linguistic data. The characteristics of languages are, however, problematic to render on a map: flows and movements as well as the lack of clear borders, for instance, demand contextualization and clarifications that can hardly be rendered. Language maps have therefore been criticized for being “generalized snapshots in time of a variable that is in constant change” (Luebbering et al 2013a: 383; 2013b).

This poster takes its point of departure in a research project investigating linguistic landscapes, that is, landscapes constructed by the combination of “road signs, advertising billboards, street names, place names, commercial shop signs, and public signs on government buildings” in a given “territory, region, or urban agglomeration” (Landry & Bourhis, 1997:25). By studying which languages materialize in our surroundings, we can reach an understanding of which languages are used and represented in a society, and thereby of which languages are allowed to be seen and which are not, providing information about language discourses, policy and power relations.

Digital mapping for visualization can offer solutions for meeting the challenge of successfully representing linguistic diversity – and this is what this poster proposes to discuss. One of the digital humanities’ most valuable contributions is within the area of visualization. It is not only a mode to convey scientific results in graspable packages – it is also a way to raise new questions and make visible new patterns and causal relations between variables. Digital maps, more specifically deep maps (Least Heat-Moon 1991; Bodenham & Corrigan 2015), can combine complex layers based on various data linked to objects and enable the user to interactively compare these different layers. Geographic Information Systems (GIS) allow for greater flexibility in the use of data in terms of accessibility, analysis and display, which is one of the reasons why digital cartography is today a growing area in visualization studies (Foka, Buckland, Cocq, Gelfgren, forthcoming; Luebbering et al 2013a; 2013b).

Language maps have been criticized for being oversimplified (Mackey 1988), for failing to represent today’s diverse linguistic environment and for embedding issues of power and perception, for instance in cartographic decisions (Luebbering, Kolivras and Prisley 2013b), with implications for the representation of various groups of language speakers. Here, we will seek to discuss digital forms of visualization that are non-authoritative and allow the flows and dynamism of languages and language use to be rendered.

Such forms of visualization should serve not only as methodological tools, but also as a means of communicating knowledge about the presence of languages and their speakers. With this proposal we thereby wish to contribute to an increased awareness of linguistic diversity and multilingualism - a first step toward developing means for inclusion, understanding place-making processes and applying this knowledge to the creation of public spaces that are inclusive, foster better understanding and prevent segregation.


Bodenham, D. Corrigan J. & Harris T. (2015). Deep Maps and Spatial Narratives. Indiana University Press.

Foka, Anna, Cocq Coppélie, Buckland Phillip I. & Gelfgren Stefan Mapping Socio-ecological Landscapes: Geovisualization as Method. In: Routledge Research Methods Handbook: Digital Humanities, eds. Stuart Dunn and Kristen Schuster (Routledge, forthcoming).

Landry, R. & Bourhis, R.Y. (1997). Linguistic landscape and ethnolinguistic vitality: An empirical study. Journal of language and Social Psychology, 16, 23-49.

Least Heat-Moon, William PrairyErth: A Deep Map. (1991) Boston: Houghton Mifflin Company.

Luebbering, Candice R., Korine N. Kolivras & Stephen P. Prisley (2013a) The lay of the language: surveying the cartographic characteristics of language maps, Cartography and Geographic Information Science.

Luebbering, Candice R., Korine N. Kolivras & Stephen P. Prisley (2013b) Visualizing Linguistic Diversity Through Cartography and GIS, The Professional Geographer, 65:4, 580-593.

Mackey, W. F. 1988. Geolinguistics: Its scope and principles. In Language in geographic context, ed. C. H. Williams, 20–46. Philadelphia: Multilingual Matters.


Birth Certificate Enslavement – A Conspiracy from the Archives to the Internet

Rikard Lars Friberg von Sydow

Södertörn University, Sweden

Someone who surfs the Internet today is likely to encounter many conspiracy theories online. If the surfer understands English and visits sites where this is the main language of communication, there is a high likelihood that many of the conspiracy theories s/he meets originate in the USA and in a North American political context. This should make us eager to investigate these conspiracies, their possible interoperability with other political and administrative contexts, and the context in which they are created.

The “Birth Certificate Enslavement” conspiracy is one such conspiracy, mostly set in a North American context. To archival scientists such as myself, it is interesting because it involves both what is usually called a vital record – the birth certificate – and the Internet, which is the main arena where the conspiracy is spread ( 2019). It thus connects two different spheres of information: administrative records and the new electronic media. Connecting these two spheres is interesting from a couple of positions. One position is that of the administrative agencies and their staff. How are they viewed by the proponents of the conspiracy theory? As enemies or as useful idiots to an evil mastermind? This is of interest because it might give us insight into possible threats towards the agency and its staff. Conspiracy theories have caused violence before: the gunman who attacked the Comet Ping Pong Restaurant because it had been singled out as a participant in the alleged Pizzagate child sex ring is one example (Haag et al 2017). Another interesting position from which to view conspiracies is the globality of the internet and the often very specific contextual nature of administrative procedures. Administrative procedures and the documents they create, a birth certificate for instance, are often deeply connected to national law and may vary a lot between different states and regions. What does this component add to the conspiracy theories? Are they directed at a smaller group of people because of the different administrative circumstances these people live under, or are these differences ignored by the proponents of conspiracy theories? Are they even aware of these administrative differences?

The conspiracy of birth certificate enslavement is connected to, among others, a small but rather violent group: the Sovereign Citizen movement (SPLC 2019), and has through this connection been observed by organizations that monitor violent extremists, like the Southern Poverty Law Center (SPLC). SPLC has described the conspiracy regarding birth certificates as follows. According to the believers, the creation of a birth certificate starts the life of an evil administrative doppelgänger of the newborn: a “corporate shell”. This corporate shell is then sold by the Federal Reserve to foreign investors as a form of financial security. One of the administrative proofs of this, according to the believers of the conspiracy theory, is that the birth certificate is written in capital letters and that bond paper and watermarks are used. This part of the conspiracy, the “proof in the design of the administrative document”, is connected to an interpretation of Admiralty law that the sympathisers believe is valid regarding birth certificates (SPLC 2010).

The research will be accomplished in the following way. I will analyze four Youtube videos that aim to explain the Birth Certificate Enslavement belief from a sympathiser’s perspective. The videos will be chosen by popularity: the four most viewed videos on the subject will be analyzed. The video clips will be analysed with the help of three research questions (Research questions 1-3). I believe that four videos is an appropriate number and that it makes it possible to find differences between various proponents of the conspiracy, at least as introductory research adequate for a poster presentation. Choosing the most popular videos is a way of reflecting what would be seen by someone casually surfing the web, trying to find explanations of how the world works according to different world views.

The research questions are formulated as a heading with different subquestions.

R1) References made in the video clip to public administration and archives

How are public administration and archives viewed in the message of the video clips? How are the employees of the agencies responsible for creating, storing and administering birth certificates viewed? Are they in any way seen as recipients of the message in the video clips? Are there any appeals to confront the agencies or their employees?

R2) References to other conspiracy theories

Are there any references to other conspiracy theories? Do they differ from video to video depending on the creator? One of the differences between various conspiracy theorists is that they put different “masterminds” behind the execution of the conspiracy: Aliens, Freemasons, Jews, Reptiles, the Catholic church, the Red shoe men, the Bilderberg group, et cetera.

R3) References to other political and administrative contexts than those connected to the United States of America.

Are there any references to other political and administrative contexts than those of the United States of America? Do the proponents of the conspiracy theory believe that the theory interoperates with other political and administrative contexts, or is it unique to a United States context?

This research may give us better insight into how the proponents of conspiracy theories argue for their world view in connection with actually occurring administrative procedures, and into how they connect this particular theory of birth certificate enslavement to other conspiracy theories, arguing for specific masterminds that are working behind the scenes. Hopefully it will also show how they contextualise this theory given that it is spread on the global internet. Do they stick to explaining it from an American perspective, or is the theory de-contextualised to interoperate with other political and administrative contexts? The answers to all these questions would be valuable to our understanding of fake news and conspiracies on the internet in general.


Haag, Matthew and Salam, Maya (2017) “Gunman in ‘Pizzagate’ Shooting Is Sentenced to 4 Years in Prison” New York Times. Viewed 2019-09-05

SPLC (2010) “The Sovereigns: A dictionary of the peculiar”, viewed 2019-09-05

SPLC (2019) “Sovereign Citizens Movement”, viewed 2019-09-05. - “Replace your vital records”, viewed 2019-09-01

Short Paper (10+5min)

Collecting and Storing the Historical Statistics Data on Baltic Countries in 1897–1939

Giedrius Žvaliauskas

Vilnius University, Lithuania

The third wave of democratization and the collapse of communism in Eastern Europe in 1989-1991 opened the “windows of opportunity” to restore independent states for all three Baltic countries after 50 years of foreign Soviet and Nazi occupation. Last year all three Baltic nations celebrated the 100th anniversary of the proclamation of state independence in 1918. This is a proper occasion to take stock of their turbulent history in a long-run perspective.

Since 2018 a team of Vilnius University researchers led by Prof. Zenonas Norkus has been implementing the research project "Historical Sociology of Modern Restorations: A Cross-Time Comparative Study of Post-Communist Transformation in the Baltic States". This research is funded by the European Social Fund according to the activity "Improvement of researchers’ qualification by implementing world-class R&D projects" of Measure No. 09.3.3-LMT-K-712 "Improvement of scientists', other researchers' and students' qualification by implementing practical scientific activities".

The ultimate aim of the project is to construct and empirically apply a new sociological theory whose subject matter is social restorations in modernising and modern societies. The Baltic countries are interesting as the most conspicuous cases of modern social restoration, with Latvia still living under the constitution adopted in 1922. The fruitfulness of the new theory is explored through a comparative study of the economic, demographic, social, and political development of the Baltic countries during the post-communist and interwar periods. The main obstacle to such a comparison is the scarcity of cross-nationally and cross-time comparable data on the pre-WWI and interwar periods. Such data remain dispersed across the national statistical publications of the Baltic States from the interwar period and, for the pre-WWI period, the statistical publications and archives of the Russian Empire.

For nearly 30 years, no digital collections of historical statistical data on the Baltic countries during the first independence period were published (with a partial exception for Lithuania). The interwar development of the Baltic States therefore remains invisible to the broad international research community, which is accustomed to working mainly with data available in electronic databases.

Therefore, one of the activities of the researchers implementing the project is the systematic collection of historical statistics on the comparative development of the Baltic States in 1897-1939. The most interesting historical statistical data are published in the Lithuanian Data Archive for Humanities and Social Sciences, LiDA.

LiDA is a national social science and humanities research infrastructure that was developed in 2006-2008 by the Policy and Public Administration Institute at Kaunas University of Technology, in partnership with Vilnius University, the Institute for Social Research and the Republic of Lithuania Ministry of Education and Sciences, as a project funded from EU Structural Funds. A second phase of LiDA development took place between 2009 and 2012 and was also funded from EU Structural Funds. Currently, the operation and maintenance of LiDA depend solely on the financial resources of Kaunas University of Technology.

LiDA provides a virtual digital infrastructure for the acquisition, preservation and dissemination of digital social sciences and humanities data in Lithuania. LiDA is targeted at the social sciences and humanities community, both institutional and individual researchers as well as students, and seeks to support their scientific and educational needs. It is also a useful collection of data for the general public and governmental institutions.

LiDA has three data catalogues:

• The social survey data catalogue (QUANT) contained 314 data sets as of October 2019;

• The catalogue of data about the Lithuanian political system (POLSYS) contained 217 data sets as of October 2019;

• The historical statistics catalogue (HISTAT) contained 118 data sets as of October 2019.

As a result of the project "Historical Sociology of Modern Restorations: A Cross-Time Comparative Study of Post-Communist Transformation in the Baltic States", the first data sets were published in June 2018, and 200 data sets of historical statistics on the three Baltic States will be published by the end of the project in late 2021. The circa 50 data sets published by October 2019 are divided into four thematic sections:

• Thematic collection "Economy: agriculture, forestry and fishing" contains data about land-tenure, farming manufacture, crop, harvest and fertility, number of birds and efficiency, wood area and wood industry, fishery and fishing, melioration, etc.

• Thematic collection "Population" contains data about population size and density; population by place of residence (city/village), gender, confession, ethnicity/nationality and social estate; historical demographic data (birth rates, mortality, marriages, divorces, etc.); data about population migration; census data, etc.

• Thematic collection "Finance" contains data about central and local government revenue and expenditure, fiscal policy, financial institutions, money turnover, banks, deposits, credits, etc.

• Thematic collection "Prices" contains data about prices of goods and cost of living indexes, etc.

The first data sets for a new thematic section on education are in preparation.

Archiving of historical statistics data in LiDA is based on the NESSTAR system and the FEDORA repository. Data sets in the LiDA catalogues are documented according to the Data Documentation Initiative (DDI) metadata standard, version 1.2. Data descriptions are bilingual, in English and Lithuanian.

The NESSTAR data catalogue (containing all the archived data) can be accessed from the front page of the portal. The LiDA portal has clear terms and instructions for data use, available in both English and Lithuanian. All the information on how to search, use or analyse data online in the Nesstar platform is available in English, too.

The whole IT infrastructure of LiDA was built with the basic aim of being highly interoperable. Therefore, each data set has a unique PID, constructed to reflect the main attributes of the data set. All the files of a data set can be accessed by registered LiDA users by following standardized rules. For example, data in Excel format can be accessed by adding '/EXCEL.01.001' to the PID, and the DDI file by adding '/DDI'. External users or other infrastructures can thus easily access data and metadata stored in the LiDA catalogues.
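The suffix rule described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the PID below is a made-up placeholder (real LiDA PIDs follow LiDA's own scheme), and the helper name `resource_url` is hypothetical; only the two suffixes '/EXCEL.01.001' and '/DDI' come from the text.

```python
def resource_url(pid: str, suffix: str = "") -> str:
    """Append an access suffix such as '/DDI' to a data set PID."""
    base = pid.rstrip("/")
    # Accept suffixes written with or without the leading slash.
    if suffix and not suffix.startswith("/"):
        suffix = "/" + suffix
    return base + suffix


# Hypothetical PID, for illustration only.
EXAMPLE_PID = "https://example.org/pid/HISTAT_0001"

excel_url = resource_url(EXAMPLE_PID, "/EXCEL.01.001")  # Excel data file
ddi_url = resource_url(EXAMPLE_PID, "/DDI")             # DDI metadata file
```

Because the rule is purely a string convention, an external infrastructure would need nothing beyond the PID itself to derive the addresses of the data and metadata files.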

Historical statistics data stored in LiDA are available without restrictions to registered users for non-profit purposes (such as research, self-education and training), except during the embargo period of the data sets published as a result of the project "Historical Sociology of Modern Restorations: A Cross-Time Comparative Study of Post-Communist Transformation in the Baltic States". The embargo will, however, be lifted by the very end of the project.

Regardless of access restrictions to data files all the metadata are freely available without restrictions. LiDA IT infrastructure allows all the users around the world to access the data and metadata stored in the LiDA catalogues. The data are also documented in English, which makes the data sets potentially interesting for the international community.


Computer-Based Identification of Metric Verse Structures in Literary Prose of Portuguese Language

Joao Queiroz1, Ricardo Carvalho2, Angelo Loula3

1Federal University of Juiz de Fora, Minas Gerais, Brazil; 2State University of Feira de Santana, Bahia, Brazil; 3State University of Feira de Santana, Bahia, Brazil

Metric verse structures in Portuguese prose are still a phenomenon unexplored by the philosophy, theory, and history of literature, and the automatic mining of such structures has not been attempted in Computational Linguistics and Digital Humanities. The MIVES (Mining Verse Structure) system was developed for computational scansion of metric verse structures in Portuguese-language prose (Carvalho, Loula and Queiroz, 2019). Unlike many computational systems already developed for the scansion of metric poems, MIVES was designed to scan metrical structures in prose, an operation that Augusto de Campos (2010, p.14) called "verse-spectral reading."

MIVES extracts and processes sentences from the text, identifies and classifies metric structures that are searched by the user, and provides a view of the results obtained. The greatest challenge lies in the process of identifying metric structures, since a single, unambiguous, context-independent result does not arise from scansion. And considering that we are mining prose, there is no clear demarcation of the beginning and the end of structures, such as those easily found in the verses of a metric poem. In prose, the metric structure can be formed by a complete sentence, or by a sentence segment.

The processing begins with the extraction of sentences from a text file. Each sentence is then segmented into words for syllabic separation and identification of tonic syllables.

Although protocols for separating poetic and grammatical syllables do not necessarily produce coincident results, grammatical separation of syllables is an initial step towards poetic separation. At this stage, a sentence such as “Hipóteses sobre a sua gênese.” can be scanned as “Hi/p#ó/te/ses/ s#o/bre a s#u/a g#ê/ne/se”, where / indicates a syllable separator and # a tonic syllable marker. But the scansion does not end with the syllabic separation of the words. The sentence is then subjected to a scansion that takes into account normatively accepted variations of syllabic separation arising from intervocabular and intravocabular phenomena, which can fuse or separate syllables. Because the scansion process does not produce unambiguous results, and the intervocabular and intravocabular phenomena may or may not be applied, multiple scansion possibilities are generated. It is then determined whether the sentence, or an excerpt from it, has a metric pattern, and possible alternatives with different syllable counts are indicated. The text, whose metric sentences and variations have been identified, is sent to an interface for visualization, navigation and analysis of results.
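The idea of generating multiple scansion possibilities can be illustrated with a minimal Python sketch. It is not the MIVES implementation: the hand-made syllable list, the choice of which boundaries are optional fusion points, and the rule of counting up to the last tonic syllable (the usual convention in Portuguese versification) are all assumptions made for the example. The `#` tonic marker follows the notation used in the text.

```python
from itertools import product


def scansion_variants(syllables, optional_joins):
    """Yield every syllable list obtained by applying or skipping each fusion.

    syllables: list of strings; '#' marks a tonic syllable.
    optional_joins: indices i where syllable i may fuse with syllable i + 1.
    """
    joins = sorted(optional_joins)
    for choice in product([False, True], repeat=len(joins)):
        active = {i for i, on in zip(joins, choice) if on}
        merged = []
        fuse_next = False
        for i, syl in enumerate(syllables):
            if fuse_next:
                merged[-1] = merged[-1] + "+" + syl  # fused into the previous syllable
            else:
                merged.append(syl)
            fuse_next = i in active
        yield merged


def metric_length(syllables):
    """Count syllables up to and including the last tonic one (assumed convention)."""
    tonic = [i for i, s in enumerate(syllables) if "#" in s]
    return tonic[-1] + 1 if tonic else len(syllables)


# Hand-segmented example sentence; the two optional fusion points
# ("bre"+"a" and "s#u"+"a") are chosen by hand for illustration.
SYLLABLES = ["Hi", "p#ó", "te", "ses", "s#o", "bre", "a", "s#u", "a", "g#ê", "ne", "se"]
lengths = sorted({metric_length(v) for v in scansion_variants(SYLLABLES, [5, 7])})
```

With two optional fusion points there are four candidate scansions, and the set of resulting metric lengths is what would then be compared against the patterns requested by the user.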

The search for metric structures is not restricted to complete sentences. Initial or final sentence segments can be evaluated, according to the user's decision. To mark the beginning or end of a sentence, punctuation marks such as the semicolon, colon, ellipsis, and exclamation mark are used as delimiters. To identify excerpts, the sentences are scanned until a metric structure is found that matches the patterns designated by the user.

The system is able to identify, classify and compare the frequency, density, and dispersion of heterometric verse structures, distributed at different scales of observation, from a single work or author to aesthetic movements. Here we present preliminary results from the analysis of three works by Euclides da Cunha (Os Sertões, À Margem da História, Contrastes e Confrontos). They were selected because they constitute the main corpus of one of the most important Latin American writers of the 20th century, and because Os Sertões was the object of a manual "verse-spectral reading" by the Brazilian poet and translator Augusto de Campos (2010, p.14). This operation by Augusto de Campos revealed “more than 500 decasyllables in the book”, among sapphic and heroic structures, and more than two hundred dodecasyllables. MIVES, on the other hand, processed 8564 sentences and found 652 (7,6%) full sentences, 1746 (20,4%) initial segments of sentences and 1728 (20,2%) final segments of sentences with a structure of between 10 and 12 metric syllables, a surprising rate, with a much higher density than in the results reported by Augusto de Campos.

A similarly high density is also present in the other works by Euclides da Cunha, as reported by MIVES. A total of 1066 sentences were processed from À Margem da História, and the system found 76 (7,1%) full sentences, 227 (21,3%) initial segments of sentences and 219 (20,3%) final segments of sentences with metric structure. From 1598 sentences processed from Contrastes e Confrontos, MIVES found 82 (5,1%) full sentences, 282 (17,6%) initial segments of sentences and 25 (14,7%) final segments of sentences with metric structure.

To evaluate the distribution of metric structures along the books, the distance between sentences with metric structures was measured from results obtained by MIVES. The average distance between such full sentences with metric structures in Os Sertões was 12,63±12,88 sentences, i.e. there were on average 12,63 (with standard deviation 12,88) non-metric sentences between each occurrence of metric structures of full sentences. In the book À Margem da História, the average distance between such full sentences with metric structures was 14,62±16,89, and in the book Contrastes e Confrontos, such distance was 18,38±19,61. Overall, these distances along with graphs of local density values along the books reveal that the distribution of metric structures throughout the books has a great variance, with regions of greater concentration and regions of scarce presence of such structures.
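The distance measure described above can be reproduced with a short Python sketch. The sentence positions below are toy values, not MIVES output, and the use of the population standard deviation is an assumption, since the text does not say which estimator was used.

```python
from statistics import mean, pstdev


def gaps_between(metric_indices):
    """Number of non-metric sentences between consecutive metric sentences."""
    ordered = sorted(metric_indices)
    return [b - a - 1 for a, b in zip(ordered, ordered[1:])]


def gap_summary(metric_indices):
    """Mean and (population) standard deviation of the gaps."""
    gaps = gaps_between(metric_indices)
    return mean(gaps), pstdev(gaps)


# Toy positions of sentences with a full-sentence metric structure.
toy_positions = [0, 3, 4, 10]
toy_gaps = gaps_between(toy_positions)          # [2, 0, 5]
toy_mean, toy_sd = gap_summary(toy_positions)
```

A standard deviation of roughly the same size as the mean, as in the figures reported for the three books, is what one would expect when metric sentences cluster in some regions and are scarce in others.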

Obviously, MIVES can perform scansions in much larger quantities than any human agent. But even more interesting, as a tool, is the ability that MIVES introduces to identify, quantify, and display patterns of distribution of versification structures throughout the text, both numerically, with descriptive statistics and distance attributes, and visually, through dispersion and frequency throughout the works. It is not an exaggeration to say that the system is capable of opening a new direction in the investigation of literary prose in the Portuguese language.


Augusto de Campos (2010) Transertões. In: CAMPOS, A.; ALMEIDA, G. Poética de Os Sertões. São Paulo: AnnaBlume.

R. S. Carvalho, A. Loula, J. Queiroz (2019). Identificação computacional de estruturas métricas de versificação na prosa literária de Euclides da Cunha. Revista de Estudos da Linguagem, aop14918.2019.

> MIVES is available at (project) and (source code)


Starting Points in French Discourse Analysis’ Lexicometry to Study Political Tweets

Marge Käsper, Liina Maurer

University of Tartu, Estonia

In the Nordic countries, French Studies is probably not the first field one would look at when thinking about digital humanities. This, however, might be a mistake. From its very beginning in the 1960s until today, a part of what is called the French School in Discourse Analysis has been using various machine-based methods to measure the social impact of words in discourse. Since Michel Pêcheux’s (1969) theories about an imaginary automatic tool to detect ideology and the first works in political lexicometry at St. Cloud (Maldidier 1969, Marcellesi 1971), the methods have been discussed, developed and diversified (for these discussions, see Guilhaumou 2002), to create various “textometric” (Salem 1987), “logometric” (Mayaffre 2004) or “ideometric” (Longhi et al. 2017) analyses. We will examine some examples of these using Lexico 5, a key tool in the field today, to construct our analysis of a corpus of tweets by the French president Emmanuel Macron (#EmmanuelMacron).

Today some of the most significant work in lexicometry is produced by Damon Mayaffre (2004; 2007, etc.), who has analyzed comparative recurrences and vocabulary patterns in the speeches of all French presidents from de Gaulle to Emmanuel Macron. Mayaffre (2004) points out, for example, the most frequent words of the presidents (“problème” for Giscard d’Estaing, “civilisation” for Pompidou, “naturellement” for Chirac, etc.). He complements the quantitative analysis with a discussion of the significance of these vocabulary “over-uses” on the qualitative level. Mayaffre has written extensively about the nature of the corpora and the methods for exploring them (for example Mayaffre 2007) but the core of his analysis is always based on the speeches of French presidents.

We claim, however, that an important part of today’s political communication is performed through social media like Twitter, Facebook, etc. Julien Longhi (2013), indeed, considers the tweet a sub-genre of political discourse, like an interview or a speech, and does not see it just as the “reduction of thought to 140 characters”. Thus, during the French presidential elections in 2017 Longhi et al. launched a platform called #Idéo2017 to analyze what the candidates said on social networks, using lexicometric tools to identify the lexical fields and dominant themes of the different candidates and thus to establish “the linguistic profile” of each candidate and what differentiates them. We will use this platform to test Macron’s discursive profile in comparison to other candidates in 2017, and also in comparison to his actual tweets. In this textometric approach it is possible, for instance, to follow, by means of a “section map” option, the topography of lexemes selected in corpora. In our tweet corpus, we can thus follow the continuity of the lexicon used by Macron in his tweets.

In general, however, the aim of our analysis is not a definitive analysis of Emmanuel Macron’s ideological positions but a better understanding of the functioning of political communication. Thus, in a further analysis, we plan to compare the tweet corpus also to a media corpus of comments to analyze (by the detection of repeated segments) the extent to which the tweets form the core of the press commentaries.


Guilhaumou, Jacques 2002. Le corpus en analyse de discours : perspective historique. Corpus, 1. [En ligne] URL :

Longhi J., Marinica C, Hassine N., Alkhouli A., Borzic B. 2017. The #Idéo2017 platform, 5th conference CMC and Social Media Corpora for the Humanities, Bolzano, Italy, 3rd and 4th October 2017 – Conference proceedings, pp. 46–51. halshs-01619236.

Longhi, Julien 2013. Essai de caractérisation du tweet politique, L’Information grammaticale, 136, pp. 25–32. halshs-00940202.

Longhi Julien 2017. Humanités, numérique : des corpus au sens, du sens aux corpus, Questions de communication, 2017/1, 31, pp. 7–17. URL:

Longhi Julien 2018. Tweets politiques : corrélation entre forme linguistique et information véhiculée, in Mercier A. et Pignard-Cheynel N. (dirs.), #Info. Partager et commenter l’info sur Twitter et Facebook, Paris : Editions de la Fondation MSH, pp. 295–314.

Lorriaux, Aude 2017. Le « je » d'Emmanuel Macron. Interview avec Damon Mayaffre. Sciences Humaines. Août-septembre 2017.

Maldidier, Denise 1969. Analyse linguistique du vocabulaire politique de la guerre d’Algérie d’après six quotidiens parisiens. Thèse de doctorat. Disponible sur :


Marcellesi Jean-Baptiste 1971. Éléments pour une analyse contrastive du discours politique. Langages, 23 « Le discours politique », pp. 25–56.

Mayaffre, Damon 2004. Paroles de président. Jacques Chirac (1995-2003) et le discours présidentiel sous la Vème République. Paris : Champion.

Mayaffre, Damon 2007. L’analyse de données textuelles aujourd’hui : du corpus comme une urne au corpus comme un plan : Retour sur les travaux actuels de topographie/topologie textuelle. Lexicometrica, André Salem, Serge Fleury, 2007, pp. 1–12. hal-00551468.

Salem, André 1987. Pratique des segments répétés. Essai de statistique textuelle. Paris : Klincksieck.


"Sampo" Model and Semantic Portals for Digital Humanities on the Semantic Web

Eero Hyvönen

University of Helsinki (HELDIG) and Aalto University, Finland

This paper presents the vision and longstanding work in Finland on creating a national Cultural Heritage ontology infrastructure and semantic portals based on Linked Data on the Semantic Web. In particular, the “Sampo” series of semantic portals is considered, including CultureSampo (2009), TravelSampo (2011), BookSampo (2011), WarSampo (2015), BiographySampo (2018), NameSampo (2019), WarVictimSampo (2019), FindSampo (2019), and LawSampo (2020). They all share the “Sampo model” for publishing Cultural Heritage content on the Semantic Web, which involves three components: 1) A model for harmonizing, aggregating, and publishing heterogeneous, distributed contents based on a shared ontology infrastructure. 2) An approach to interface design, where the data can be accessed independently from multiple application perspectives, while the data resides in a single SPARQL endpoint. 3) A two-step model for accessing and analyzing the data, where the focus of interest is first filtered out using faceted semantic search, and then visualized or analyzed by ready-to-use Digital Humanities tools of the portal. This model has proven useful in practice: Sampo portals have attracted large numbers of users, from tens of thousands to millions depending on the Sampo. It is argued that the next step ahead could be portals for serendipitous knowledge discovery, where the tools, based on Artificial Intelligence techniques, are able to automatically find serendipitous, “interesting” phenomena and research questions in the data, and even solve problems with explanations.
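As a rough illustration of the two-step access model, the Python sketch below first turns a set of facet selections into a SPARQL query (the filtering step) and then aggregates some result rows (the analysis step). It is a sketch under stated assumptions, not how the Sampo portals are implemented: the IRIs, the function names and the toy result rows are all hypothetical, and no real endpoint is contacted.

```python
from collections import Counter


def facet_query(selections, limit=100):
    """Step 1: build a SPARQL SELECT query from {predicate IRI: value IRI} facets."""
    patterns = [
        f"  ?item <{pred}> <{value}> ."
        for pred, value in sorted(selections.items())
    ]
    body = "\n".join(patterns)
    return f"SELECT DISTINCT ?item WHERE {{\n{body}\n}} LIMIT {int(limit)}"


def count_by(rows, key):
    """Step 2: a stand-in for an analysis tool, counting results per category."""
    return Counter(row[key] for row in rows)


# Hypothetical facet selection and hypothetical result rows.
query = facet_query({
    "http://example.org/schema/occupation": "http://example.org/id/writer",
    "http://example.org/schema/birthPlace": "http://example.org/id/turku",
})
toy_rows = [{"decade": "1880s"}, {"decade": "1880s"}, {"decade": "1890s"}]
per_decade = count_by(toy_rows, "decade")
```

Each selected facet simply adds one triple pattern, which is why several application perspectives can share a single endpoint: they differ only in the queries they generate and in what they do with the filtered results.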


A Workflow for Integrating Close Reading and Automated Text Annotation

Maciej Janicki1, Eetu Mäkelä1, Anu Koivunen2, Antti Kanner1, Auli Harju2, Julius Hokkanen2, Olli Seuri3

1University of Helsinki, Department of Digital Humanities, Finland; 2University of Tampere, Faculty of Social Sciences, Finland; 3University of Tampere, Faculty of Information Technology and Communication Sciences, Finland

Digital Humanities projects often involve application of language technology or machine learning methods in order to identify phenomena of interest in large collections of text. However, in order to maintain credibility for humanities and social sciences, the results gained this way need to be interpretable and investigable and cannot be detached from the more traditional methodologies, which rely on close reading and real text comprehension by domain experts. The bridging of those two approaches with suitable tools and data formats, in a way that allows a flow of information in both directions, often presents a practical challenge.

In this poster, we present an approach to digital humanities research that allows combining computational analysis with the knowledge of domain experts in all steps of the process, from the development of computational indicators to final analysis. Our approach rests on three pillars. The first of these is an interface for close reading, but crucially one which is able to highlight to the user all results from automated computational annotation. Beyond pure close reading, through this interface, the user is thus also able to evaluate the quality of computational analysis. Further, the interface supports manual annotation of the material, facilitating correction and teaching of machine-learned approaches.

The second of our pillars is an interface for statistical analysis, where the phenomena of interest can be analyzed en masse. Crucially, however, this interface is also linked to the close-reading one, to let the users delve further into interesting outliers. Through this, they are not only able to derive hypotheses and explanations of the phenomena, but can also identify cases where outliers are due rather to errors and omissions in our computational pipeline.

Finally, our third pillar is an agile pipeline to move data between these interfaces and our computational environment. In application, this third pillar is crucial, as it allows us to iteratively experiment with different computational indicators to capture the objects of our interest, with the results quickly making their way to experts for evaluation and explorative analysis. Through this analysis and evaluation, we then equally quickly get back information on not just the technical accuracy of our approach, but also if it captures the question of interest. Further, beside direct training data, we also get suggestions on new phenomena of interest to try to capture.

By maintaining from the start interfaces that allow both computer scientists and social scientists to not only view, but highlight to each other all aspects of the data, we also further a shared understanding between the participants. For example, social scientists are easily able to highlight to the computer scientists new phenomena of interest in the data derived from their close reading, while the computer scientists can easily show what they are currently automatically able to bring forth from the data. Through this, everyone is kept on the same page, misunderstandings are avoided, and the most fruitful avenues for development can be negotiated in a shared space where everyone contributes equally.

Combined with the capability for agile development and experimentation, this provides a versatile template for an iterative and discursive approach to digital humanities research, which moves toward questions of interest both quickly and with a high capability to truly capture the phenomena from all viewpoints of interest.

In this poster, we present insights into the interaction between close reading and computational methods gained from the work in our current project: "Flows of Power: media as site and agent of politics". The project is a collaboration between journalism scholars, linguists and computer scientists aimed at the analysis of the political reporting in Finnish news media over the last two decades. We study both the linguistic means that media use to achieve certain goals (like appearing objective and credible, or appealing to the reader’s emotions), as well as the structure of the public debate reflected there (what actors get a chance to speak and how they are presented).

As many research questions in our project concern linguistic phenomena, a Natural Language Processing pipeline is highly useful. We employ the Turku neural parser pipeline [2], which provides dependency parsing, along with lower levels of annotation (tokenization, sentence splitting, lemmatization and tagging). Further, we apply the rule-based FINER tool [3] for named entity recognition.

Our primary toolbox for statistical analysis is R. This motivates using the ‘tidy data’ CSV format [4] as our main data format. In order to keep the number and order of columns constant and predictable, only the results of the dependency parsing pipeline are stored together with the text, in a one-token-per-line format very similar to CONLL-U. All additional annotation layers, beginning with named entity recognition, are relegated to separate CSV files, where tuples like (documentId, sentenceId, spanStartId, spanEndId, value) are stored. Such tabular data are easy to manipulate within R.
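To make the separation between the token table and the annotation layers concrete, here is a small sketch of joining the two. It is written in Python rather than R purely for illustration; the token columns `tokenId` and `word`, the toy rows, and the assumption that span start and end are inclusive token ids are ours, while the span tuple layout follows the one given above.

```python
import csv
import io

# Toy token table (one token per line) and toy span annotations.
TOKENS_CSV = """documentId,sentenceId,tokenId,word
d1,1,1,The
d1,1,2,ball
d1,1,3,is
d1,1,4,in
d1,1,5,play
"""

SPANS_CSV = """documentId,sentenceId,spanStartId,spanEndId,value
d1,1,2,5,metaphoric
"""


def read_rows(text):
    """Parse CSV text into a list of dicts keyed by column name."""
    return list(csv.DictReader(io.StringIO(text)))


def annotate(tokens, spans):
    """Attach to each token the values of all spans that cover it."""
    out = []
    for tok in tokens:
        labels = [
            s["value"]
            for s in spans
            if s["documentId"] == tok["documentId"]
            and s["sentenceId"] == tok["sentenceId"]
            and int(s["spanStartId"]) <= int(tok["tokenId"]) <= int(s["spanEndId"])
        ]
        out.append({**tok, "labels": labels})
    return out


annotated = annotate(read_rows(TOKENS_CSV), read_rows(SPANS_CSV))
marked_words = [t["word"] for t in annotated if t["labels"]]
```

Keeping every layer in its own file means that a new layer can be added or regenerated without touching the parsed text, at the cost of a join like this one whenever layers need to be viewed together.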

For visualization, close reading and manual annotation, we decided to employ WebAnno [1]. While this tool was originally intended for the creation of datasets for language technology tasks, its functionality is designed to be very general, which has enabled its use in a wide variety of projects involving text annotation. In addition to the usual linguistic layers of annotation, like lemma or head, it allows the creation of custom layers and feature sets. WebAnno has a simple but powerful visualization facility: annotations are shown as highlighted text spans, feature values as colorful bubbles over the text, and the various annotation layers can be shown or hidden on demand. This kind of visualization does not disturb close reading. It allows the reader to concentrate on the features that are currently of interest, while retaining the possibility to look into the whole range of available annotations.

An important advantage is WebAnno’s low barrier to entry. It is a Web application, meant to be deployed on a server and used through a Web browser. This kind of usage requires neither technical skills nor any installation on the users’ machines. It provides user account and project management. The application can also be run locally, in the form of a JAR file, which is useful for trial and demonstration purposes.

WebAnno supports several data formats for import and export. All of them assume one document per file. Among others, different variants of the CONLL format are supported. WebAnno-TSV is the tool's own tab-separated text format, which, as opposed to CONLL, includes the custom annotation layers. Because it is a text format and is well documented, we are able to implement a fully automatic bidirectional conversion between our corpus-wide, per-annotation CSV files and per-document WebAnno-TSV files.
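The regrouping at the heart of such a conversion, from corpus-wide rows to per-document collections and back, can be sketched as follows. This is only the regrouping step under our own assumptions: the function names are hypothetical, the rows are toy data, and the actual serialization to and from WebAnno-TSV is deliberately not shown.

```python
from collections import defaultdict


def split_by_document(rows):
    """Corpus-wide rows -> {documentId: rows of that document, in original order}."""
    per_doc = defaultdict(list)
    for row in rows:
        per_doc[row["documentId"]].append(row)
    return dict(per_doc)


def merge_documents(per_doc):
    """Per-document rows -> one corpus-wide list, ordered by documentId."""
    return [row for doc_id in sorted(per_doc) for row in per_doc[doc_id]]


# Toy annotation rows in the tuple layout used for the annotation layers.
toy_rows = [
    {"documentId": "d1", "sentenceId": "1", "spanStartId": "2", "spanEndId": "5", "value": "metaphoric"},
    {"documentId": "d1", "sentenceId": "4", "spanStartId": "1", "spanEndId": "1", "value": "affective"},
    {"documentId": "d2", "sentenceId": "2", "spanStartId": "3", "spanEndId": "4", "value": "affective"},
]
per_document = split_by_document(toy_rows)
round_trip = merge_documents(per_document)
```

A round trip that returns the original rows is a convenient sanity check for any bidirectional converter built on top of this grouping.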

Thus, using WebAnno as an interface to interact with the domain experts who perform close reading and manual annotation, we are able to exchange our results quickly and with a high degree of automatization.

We applied the methodology outlined above in a recently conducted case study. The subject of the study was the use of affective and metaphorical language in a media debate about a controversial labour market policy reform, called the ‘competitiveness pact’, which was debated in Finland in 2015-16.

The linguistic phenomenon in question is complex and not readily defined. It is also highly subject-dependent: ‘the ball is in play’ is metaphoric when referring to politics, but not when referring to sports. There is no straightforward method or tool for automatic recognition of such phrases. Therefore, we started the study with a close reading phase, in which the media scholars identified and marked the phrases they recognized as affective or metaphorical in the material covering the competitiveness pact. The marked passages were subsequently manually post-processed to extract single words with a ‘metaphoric’ or ‘affective’ charge. The list of words obtained this way was further expanded with their synonyms, obtained via word embeddings. Using this list, we were able to mark the potential metaphoric expressions in the unread text as well.
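The expansion and marking steps can be sketched in Python as follows. The two-dimensional vectors, the English toy vocabulary, the cosine-similarity criterion and the threshold are all assumptions made for illustration; the text only states that synonyms were obtained via word embeddings, not how neighbours were selected.

```python
import math


def cosine(u, v):
    """Cosine similarity between two vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def expand(seeds, vectors, threshold=0.9):
    """Return the seeds plus every vocabulary word close to at least one seed."""
    known_seeds = [s for s in seeds if s in vectors]
    expanded = set(seeds)
    for word, vec in vectors.items():
        if any(cosine(vec, vectors[s]) >= threshold for s in known_seeds):
            expanded.add(word)
    return expanded


def mark(tokens, lexicon):
    """Return the positions of tokens that belong to the expanded word list."""
    return [i for i, tok in enumerate(tokens) if tok.lower() in lexicon]


# Invented toy embeddings; real embeddings have hundreds of dimensions.
TOY_VECTORS = {
    "battle": (1.0, 0.1),
    "fight": (0.9, 0.2),
    "budget": (0.1, 1.0),
}
lexicon = expand({"battle"}, TOY_VECTORS)
hits = mark(["The", "fight", "over", "the", "budget"], lexicon)
```

Marking by word list will over-generate, since the same word can be literal in one context and metaphoric in another, which is why the hits are best treated as candidates to be confirmed through close reading.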

The final step, which is still in progress, is to validate the automatic annotation via close reading of another set of articles. WebAnno’s functionality of highlighting and manual correction of annotations greatly facilitates such work.


[1] Richard Eckart de Castilho, Éva Mújdricza-Maydt, Seid Muhie Yimam, Silvana Hartmann, Iryna Gurevych, Anette Frank, and Chris Biemann. A Web-based Tool for the Integrated Annotation of Semantic and Syntactic Structures. In Proceedings of the Workshop on Language Technology Resources and Tools for Digital Humanities (LT4DH), pages 76–84, Osaka, Japan, 2016.

[2] Jenna Kanerva, Filip Ginter, Niko Miekka, Akseli Leino, and Tapio Salakoski. Turku neural parser pipeline: An end-to-end system for the CoNLL 2018 shared task. In Proceedings of the CoNLL 2018 Shared Task: Multilingual Parsing from Raw Text to Universal Dependencies, pages 133–142, Brussels, Belgium, October 2018. Association for Computational Linguistics.

[3] Teemu Ruokolainen, Pekka Kauppinen, Miikka Silfverberg, and Krister Lindén. A Finnish news corpus for named entity recognition. Lang Resources & Evaluation, August 2019.

[4] Hadley Wickham. Tidy data. Journal of Statistical Software, 59(10), August 2014.


A Particularly Synergistic Triplet: Speech Technology, Digital Humanities and Accessibility

Jens Edlund1, Johanna Berg2, André Costa3, Rickard Domeij4, Martin Duneld5, Nikolaj Lindberg6, Jan Pedersen7, Heidi Rosén8, Christina Tånnander1,9

1Speech, Music and Hearing, KTH, Stockholm, Sweden; 2Swedish National Museums of World Culture, Stockholm, Sweden; 3Wikimedia Sweden, Stockholm, Sweden; 4Institute for Language and Folklore, ISOF, Stockholm, Sweden; 5Department of Computer and System Science, Stockholm University, Stockholm, Sweden; 6Speech Technology Services, STTS, Stockholm, Sweden; 7Institute of Interpreting and Translation Studies, Stockholm University, Stockholm, Sweden; 8National Library of Sweden, Stockholm, Sweden; 9Swedish Agency for Accessible Media, MTM, Malmö, Sweden

In an attempt to raise interest and spark discussion, we propose that the fields of speech technology, digital humanities and accessibility have much to gain from collaboration, with a concerted effort at resource sharing. Projects working in the pairwise intersections of these fields often rely on very similar resources – e.g. the same types of annotated language data, speech processing tools, dictionaries – and moreover their outcome is often useful in itself in more ways than one. Our grandest hope when shining a light in this direction is that these communities find a forum to discuss issues of mutual benefit.


A Proposed Workflow for Future Monograph Digitization Projects

Niklas Kristian Alén

Suomalaisen Kirjallisuuden Seura / Finnish Literature Society, Finland

As the cost of digitization has come down and the open access movement has gained worldwide momentum, many learned societies have started looking into digitizing their own publications. In this poster session we'll be taking a closer look at the Finnish Literature Society's (Suomalaisen Kirjallisuuden Seura, SKS) and the Finnish Historical Society's (Suomen Historiallinen Seura, SHS) joint project to digitize all of the SHS's publications from the period 1866–2000. The poster will first present the realized workflow, and then propose an improved one for future projects.

During the last decade there has been a substantial amount of scholarship on the topic of digitization, at varying degrees of granularity, ranging from practical guides to white papers and scholarly research. The Finnish Literature Society anchored its workflow in current best practices. Based on our realized workflow, we found that the most time consuming part of the process was securing authors’ consent to publish their work online. We would therefore strongly suggest that this phase be prioritised in any future digitization projects.

The Finnish Literature Society is a learned society founded in 1831. Today it consists of a library, an archive, a publishing house, an expert organization for the export of Finnish literature (FILI) and a research department.

In 2016 the Finnish Literature Society received a grant from the Finnish Association for Scholarly Publishing for the digitization and publication of the Finnish Historical Society's published works ranging from 1866 to 2000. The main objective of the project was to digitize and publish in open access form all monographs and edited volumes published in 8 different SHS scholarly series. The digitization project was carried out by the publishing house and the research department.

The digitized material constitutes a major historical and cultural corpus that incorporates the main research output of historical research in Finnish. As an established open access publisher, the SKS also knew that offering this valuable resource online, free of charge and under a Creative Commons license, would greatly facilitate Finnish historical research and also improve its dissemination, visibility and potential impact. Successful open access publishing is, however, far more complicated than simply uploading digitized material to the internet. To be successful the material has to be indexed and distributed through the optimal channels. Search engines must be able to index the whole texts (search engine optimization, SEO), the books need to have adequate metadata and they need to have persistent identifiers of one kind or another. Next we'll describe the project's workflow, and after this we will offer a revision of it.

The 1st phase in the digitization project was to chart the number of books published and the corresponding number of books available in libraries and archives. The necessary information was collated from library catalogues, SHS online book lists and old publication databases. All of this data was then merged into a relational database that gave us a general overview of the situation. Most of the books could be found in the SHS publication archive. There were, however, gaps in the records, and these were filled with book loans from different libraries.

The 2nd phase was to decide on the optimal file format. Here it was crucial to settle on an adequately futureproof approach. The chosen file format had to comply with national long-term digital preservation standards (KDK-PAS), and the image quality and resolution had to be adequate to ensure that, at least in the short term, there would be no need to rescan the material. Besides these archival requirements, the file format also had to be user friendly. It had to be a well-established and widely used file format that scholars were used to. It was also important that the file format could aid the discoverability and usability of the book by integrating an OCR text layer with the image layer. For these reasons the PDF file format was chosen. It was also decided that two PDF versions would be produced at the same time: the first would be a KDK-PAS compliant high resolution colour scan PDF with an OCR text layer (PDF/A-1b), and the second would be a web optimized PDF also equipped with an OCR text layer. The web optimized PDF file had to be a good compromise between file size and resolution. Here we settled on an MRC (mixed raster content) PDF file. This file format segments each image into different layers, and then applies an optimal compression to each layer. The addition of an OCR layer also made it possible to produce a corpus-like XML file. If required, this file could be used as research data in linguistic research.

The 3rd phase was to choose a digital repository that would support persistent identifiers and high quality metadata and offer good SEO. After careful consideration SKS chose to use a DSpace instance provided by the Finnish National Library. This instance allows for the use of URN-PIDs administered and maintained by the Finnish National Library. It also supports the Dublin Core vocabulary, which met our metadata requirements. The DSpace instance also allows its material to be indexed in Finna, the national Finnish portal for libraries, archives and museums, and in BASE, which dramatically improves the dissemination and visibility of the monographs.

The 4th phase was to produce the required metadata for the books. As most of the books were catalogued in the National Bibliography of Finland (Fennica), which has an open API, it made sense to harvest these records. Here the SKS's library could lend its expertise in describing which MARC fields were critical and to which Dublin Core elements they could be mapped. After this a programme was written that harvested the records in MARCXML and mapped them to a Dublin Core dialect that DSpace understands. After examining the records it was, however, decided that supplemental cataloguing was required. To facilitate this, an online metadata editor application was developed. This editor was used by a third party to check and supplement the metadata. The application allowed multiple specialists to edit the metadata simultaneously. After editing, the application allowed the export of all records in a DSpace compliant format. The supplemented metadata was then uploaded alongside the web optimized PDF file to SKS's DSpace instance.
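The mapping step of this phase can be illustrated with a minimal sketch. It is not the programme written for the project: the choice of MARC fields, the Dublin Core element names and the sample record below are all assumptions made for illustration, and the punctuation clean-up is deliberately crude.

```python
import xml.etree.ElementTree as ET

MARC_NS = {"marc": "http://www.loc.gov/MARC21/slim"}

# Hypothetical crosswalk: (MARC tag, subfield code) -> Dublin Core element.
# The real project mapping is not given in the abstract.
MARC_TO_DC = {
    ("245", "a"): "dc.title",
    ("100", "a"): "dc.contributor.author",
    ("260", "c"): "dc.date.issued",
    ("041", "a"): "dc.language.iso",
}

# Invented sample record, not taken from Fennica.
SAMPLE_RECORD = """<record xmlns="http://www.loc.gov/MARC21/slim">
  <datafield tag="100" ind1="1" ind2=" ">
    <subfield code="a">Esimerkki, Erkki</subfield>
  </datafield>
  <datafield tag="245" ind1="1" ind2="0">
    <subfield code="a">Example title /</subfield>
  </datafield>
  <datafield tag="260" ind1=" " ind2=" ">
    <subfield code="c">1950.</subfield>
  </datafield>
</record>"""


def marc_to_dc(marcxml: str) -> dict:
    """Map one MARCXML record to a dict of Dublin Core element -> values."""
    root = ET.fromstring(marcxml)
    dc = {}
    for field in root.findall("marc:datafield", MARC_NS):
        tag = field.get("tag")
        for sub in field.findall("marc:subfield", MARC_NS):
            element = MARC_TO_DC.get((tag, sub.get("code")))
            if element and sub.text:
                # Crude removal of trailing ISBD punctuation; a real
                # converter would need to protect abbreviations.
                value = sub.text.strip().rstrip(" /:;,.")
                dc.setdefault(element, []).append(value)
    return dc


dc_record = marc_to_dc(SAMPLE_RECORD)
```

Fields that are absent from the crosswalk or from the record are simply skipped, which is one reason a later manual cataloguing pass, as described above, remains useful.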

The 5th phase, which ran in parallel with phases 3 and 4, consisted of acquiring authors' consent to license their works under a Creative Commons license as well as the right to upload their monographs to the internet. This phase of the project proved to be much more time consuming and complex than we had anticipated. After quickly making contact with well-known and well-established authors and identifying all authors whose copyright had expired, the project began to face mounting difficulties. Many authors had passed away and contacting all of their heirs proved to be too time consuming; others had retired after authoring only a couple of articles during their career; others simply could not be found (their names are too common, they've moved or changed names etc.); and last but not least there were a sizable number of foreign authors.

Our experience shows clearly, and we would strongly suggest, that any future digitization efforts should focus on phase 5 first. This is the most time consuming phase as technology can't really facilitate it in any way. A thorough investigation in this phase may even dictate whether or not the digitization effort is feasible. From our experience all other issues can be solved by innovative use of technology.


Limits of Authenticity of Digitized Objects

Alžbeta Zavřelová, Petr Žabička

Moravian Library in Brno, Czech Republic

The Moravian Library in Brno is the second largest library in the Czech Republic - a legal deposit library that holds over 4 million volumes including valuable historical collections of old maps, incunabula, medieval manuscripts and old prints. It is also a research organisation whose main purpose is to carry out basic, applied and experimental research as well as software development, and to disseminate the results by means of education, publications and transfer of technologies.[1]

Our poster discusses the limits of authenticity of digital objects and the impact of currently available possibilities to modify or forge digital historical documents using methods of machine learning. The research results of the Moravian Library cooperation projects provide us with tools that make us reconsider the importance of digital interpretation and processing of the digitized objects.

In the last decades, mass digitization of library collections has become a necessity in memory institutions all around the world. During the digitisation process, many objects must be modified for higher accessibility or legibility. Such interventions include simple image enhancements (brightness, contrast or colour corrections, image cropping or stitching, etc.) or edits made using innovative digital tools. In our research practice we find multiple cases when we need to edit an image for better usability or legibility by advanced methods employing machine learning technologies. In special historical collections, basic tools are used to flatten curved pages, straighten text lines, or scan text in narrow book bindings. Many documents have been copied onto microfilm and only these secondary copies could be digitized, even though the quality of the microfilm copy was not very good. Likewise, digitisation of old audio recordings (e.g. shellac discs) extracts specific information, thereby suppressing the authenticity of the original.

Noise reduction, quality enhancement and content reconstruction improve comprehensibility and may also influence the quality of automatic conversion of the digitized object to text. Tools for automatic classification or full text indexing of the OCR often work with language models that may strongly affect the content of the resulting text. Although such technologies do not misguide users intentionally, they may significantly influence the relevance ranking or findability of a given document. Since the 90s, the scientific community has drawn attention to the possibilities of misuse of open digital images[2], which have expanded significantly with technology development.


While working with digital libraries we always have to think about the relationship between a digital object and its (analogue) source. Many people incorrectly attribute the same value to the original analogue object and its digital copy. We should bear in mind that each collection is a subjective interpretation reflecting the collection creator’s point of view or ideological attitude. Moreover, until any given collection has been digitized in full, there are layers to its online presentation: Has the catalogue been converted into a searchable metadata database? Are all the catalogue records of the same or similar quality? What was the selection process when parts of the collection were digitized? What is the quality of the conversion to text? The user of any collection should also bear in mind that any human-created metadata might reflect the bias of the cataloguer who created the record.

Intentional document forgery has always been around. Nowadays, as in the past, the main motivation for forging documents is financial profit or a gain in privilege, power or influence. With the ready availability of online content, the ties between the physical document and its digital surrogate are weakening. In some societies, outright censorship can readily be applied. However, without tangible assets readily accessible, there is a growing opportunity to influence what has long been perceived as reliable information sources. Even a small modification in the text, e.g. a change in names or in linguistic and typographic phenomena, may produce the desired results for a certain group. Digitized historical collections are believed to be fully credible by their nature. Today, the general public learns how to critically assess the resources of social space, yet the level of critical approach to digitized historical documents is still limited.


The digitized collection of the Moravian Library in Brno contains, among other documents, digitized microfilms of several newspaper titles. The microfilms were created in-house in the 1990s. At that time a lot of effort was put into finding all missing issues, which were borrowed from a number of other institutions. The microfilms themselves are therefore unique in their relative completeness. The quality of the pictures on the film, on the other hand, varies. When the films were scanned, it was impossible to have a usable OCR from the scans; and in some cases, the texts on the scans were very difficult to read even for humans. In some cases, the pages of the original newspapers have been damaged and parts of the text were missing [3].

To solve this problem, the library partnered with the Brno University of Technology. The “PERO - Advanced content extraction and recognition for printed and handwritten documents for better accessibility and usability” project aims to create technology and tools to improve accessibility of digitized historical documents based on state of the art methods of machine learning (convolutional neural networks), computer vision and language modelling.[4] The results of the project are available at GitHub[5] and will be integrated into Lindat/Clariah-CZ - Digital Research Infrastructure for Language Technologies, Arts and Humanities [6].

The main result of the project will be an automatic, machine-learning-based OCR tool for printed documents[7] and a semi-automatic handwriting recognition tool for current manuscripts. These tools will be complemented by a system for improving the quality, or rather the readability, of images of scanned text. The tool is based on Generative Adversarial Networks (GANs) and its aim is to automatically propose replacements for missing or illegible text strings by methods of text reconstruction and language modelling, and then to fix the image using the reconstructed text. At this moment there is a need for manual supervision but the benefits of this method for document legibility are obvious. Also obvious are its dangers: the operator can easily change the text in any way just by editing it and the software will then reconstruct the relevant part of the image so that the change will be indistinguishable from the surrounding unmodified text. Of course, the quality of the patch will depend on the tool having enough scans to learn from, which is not a problem when dealing with scanned periodicals. The moral implications are clear.

To sum up, the main purpose of the poster is to contribute to the discussion on the limits of authenticity of digitized objects and current possibilities of text manipulation. Our research demonstrates that various forms of intervention in digitized documents are possible, including an extreme case of text manipulation using the tools developed to improve the accessibility of our digital library data. To combat this, some kind of visual cues or other functionality should be developed. To avoid an Orwellian future, users must have a way to check the authenticity of a digitized document, and perhaps also be allowed to see the unmodified images and to have the modified parts of the image highlighted by a visual cue.
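One way such a visual cue could be derived is sketched below. This is an illustrative assumption, not a feature of the PERO tools: it presumes that the unmodified scan is retained alongside the processed one, represents both as small grayscale pixel grids, and computes a mask and bounding box of the pixels that differ. All function names and sample values are hypothetical.

```python
def changed_pixel_mask(original, processed, threshold=0):
    """Return a grid of booleans marking pixels that differ by more than threshold."""
    if len(original) != len(processed) or any(
        len(a) != len(b) for a, b in zip(original, processed)
    ):
        raise ValueError("images must have identical dimensions")
    return [
        [abs(a - b) > threshold for a, b in zip(row_o, row_p)]
        for row_o, row_p in zip(original, processed)
    ]


def bounding_box(mask):
    """Return (top, left, bottom, right) of the changed region, or None if unchanged."""
    coords = [
        (y, x)
        for y, row in enumerate(mask)
        for x, flag in enumerate(row)
        if flag
    ]
    if not coords:
        return None
    ys = [y for y, _ in coords]
    xs = [x for _, x in coords]
    return (min(ys), min(xs), max(ys), max(xs))


# Invented 3x4 grayscale "scans" (0 = black, 255 = white).
original = [
    [255, 255, 255, 255],
    [255, 0, 0, 255],
    [255, 255, 255, 255],
]
processed = [
    [255, 255, 255, 255],
    [255, 0, 255, 255],
    [255, 255, 0, 255],
]

mask = changed_pixel_mask(original, processed)
box = bounding_box(mask)
```

A viewer could draw the resulting box as an overlay on the processed image, so that readers see at a glance which region was reconstructed.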

[1] Further information on the website of the Moravian Library in Brno: Research and Development lab <> or <>

[2] This paper discusses the current possibilities of breaking the authenticity of documents, it does not open up any specific ethical issues related to technology development.

[3] Newspapers were printed on acidic paper which led to faster degradation and fragility of the physical objects.

[4] Further information about the project “PERO - Advanced content extraction and recognition for printed and handwritten documents for better accessibility and usability” on <>.

[5] You can easily download applications and tools from the PERO project at GitHub: <>.

[6] The aim of LINDAT/CLARIAH-CZ, large research infrastructure which acts as a distributed national node of the pan-European DARIAH-EU network, is to enhance the accessibility and usability of open digital data sets, resources and tools for Digital Humanities. <>

[7] An automatic technology for content extraction and an excellent OCR for early printed books (-1800/1860).


Legacy Data in a Digital Age

Ellert Thor Johannsson, Simonetta Battista, Tarrin Wills

University of Copenhagen, Denmark

In this poster presentation, we study the data used for the making of a historical dictionary and its development during three different periods. The focus is on A Dictionary of Old Norse Prose, which covers the language of Iceland and Norway in the Middle Ages. This dictionary has evolved from a rather straightforward collection of citations through a multi-volume, but incomplete, print publication to its current state as a dynamic online lexicographic tool, providing detailed information about the vocabulary of Old Norse and its textual foundation in medieval manuscripts and documents. Even though the dictionary is not finished, its wide scope is evident by the fact that its archive of around 800,000 example citations represents an estimated 7% of the entire 10 million word corpus of Old Norse. The long history of the project gives a unique opportunity to study the development of the data and how it has been used throughout the decades while the project has been in existence.


Work on this dictionary began in 1939 long before computers and databases became available. The nature of the material and the editorial principles set out by the founders of the dictionary demanded a wide variety of data be collected and organized (cf. Widding 1964). This primarily involved excerpting the source material for examples of word use, which were then copied by hand onto slips and filed alphabetically in a physical archive. In addition to this, various other data were gathered about the medieval texts, such as the dating of manuscripts, bibliographic references to scholarly editions and secondary literature, information about foreign sources in case of translations, and various other supplementary information. As with the example citations, all this information was also registered on paper through various filing systems.

The advent of computers opened up new ways to keep track of all this information. It was clear that the nature and scope of the material lent itself well to organization in a database. The challenge was to convert all the existing information into digital form and organize it in a database structure suitable for lexicographic work. This process gradually led to the development of an elaborate data structure and a tailor-made dictionary editorial system based on the information from the database.

The data

The core of the dictionary is the collection of 800,000 example citations, each of which is provided with a sentence illustrating a specific form and/or meaning of the headword and a detailed reference showing the work of origin as well as page and line number. Even in the days of early computing, it was difficult to take advantage of the new technology because of the nature of the material, especially the widespread use of non-standardized characters and symbols.

Besides the dictionary citation archive, it was important to keep track of various information relating to the source material. Structuring this data involved creating an index of all the different medieval works, which had been excerpted for dictionary citations. The citations included a reference to scholarly editions as well as the manuscripts these were based on, so all this information had to be registered as well. This work was also done by hand.

The database

In the 1980s the dictionary staff realized the potential advantages of working with the data in a database structure. An evaluation report of the project from 1993 gives an insight into the thought process and considerations behind the design of the database (ONP 1993). The database needed to keep track of all the dictionary citations as well as the data related to the source material. This involved creating many different tables where all the bits of information are stored in separate fields, such as a wordlist table, a headword table, a definition table and a citation table. Moreover, there are additional tables that hold references to the literature and other glossaries. The tables were then linked together through the headword field common to all tables. This allowed for additional information to be added relating to both the source material and each citation. The most important of these additions were the geographical provenance of the manuscripts and the grouping of the source material by literary genre.
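The linking of tables through a shared headword field can be illustrated with a small sketch. It does not reproduce the ONP database: the table layout, column names and sample rows are assumptions for illustration, with placeholder work titles rather than real citations.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE headword (
        headword   TEXT PRIMARY KEY,
        word_class TEXT
    );
    CREATE TABLE definition (
        id       INTEGER PRIMARY KEY,
        headword TEXT REFERENCES headword(headword),
        gloss    TEXT
    );
    CREATE TABLE citation (
        id       INTEGER PRIMARY KEY,
        headword TEXT REFERENCES headword(headword),
        work     TEXT,
        page     INTEGER,
        line     INTEGER,
        ms_date  INTEGER,   -- approximate manuscript date (year)
        genre    TEXT
    );
    """
)

# Hypothetical sample rows; works, pages and dates are placeholders.
conn.execute("INSERT INTO headword VALUES ('hestr', 'noun')")
conn.execute("INSERT INTO definition (headword, gloss) VALUES ('hestr', 'horse')")
conn.executemany(
    "INSERT INTO citation (headword, work, page, line, ms_date, genre) "
    "VALUES (?, ?, ?, ?, ?, ?)",
    [
        ("hestr", "Placeholder saga A", 12, 3, 1300, "saga"),
        ("hestr", "Placeholder law text B", 40, 17, 1250, "law"),
    ],
)


def citations_for(headword, genre=None):
    """Return (gloss, work, page, line) rows for a headword, oldest manuscript first."""
    sql = (
        "SELECT d.gloss, c.work, c.page, c.line "
        "FROM citation c JOIN definition d ON d.headword = c.headword "
        "WHERE c.headword = ?"
    )
    params = [headword]
    if genre is not None:
        sql += " AND c.genre = ?"
        params.append(genre)
    sql += " ORDER BY c.ms_date"
    return conn.execute(sql, params).fetchall()


all_rows = citations_for("hestr")
saga_rows = citations_for("hestr", genre="saga")
```

Because every table carries the headword, attributes such as genre or manuscript date can serve as filters without changing the entry structure itself.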

The benefit of organizing the information in the database was immediate. Once the content of the hand-written index registry had been entered into the database, the information was made available as a printed volume published in 1989 (ONP Registre). Even though this work is primarily designed to facilitate the use of the paper dictionary, it stands alone as an independent reference work on Old Norse prose texts and their manuscript origins.

Online dictionary

The database work facilitated the preparation and eventual publication of dictionary entries. Once the citations had been keyed into the database the editors could proceed with the structuring of dictionary entries, supplying extra grammatical information as well as information about collocations and syntactic relations.

After the publication of three printed volumes of the dictionary from 1995 to 2004, containing entries that cover the alphabet from a-em, the project underwent another restructuring process, which resulted in an online version made available in 2010. An important step in the conversion of the paper dictionary to a fully digital online dictionary was the scanning of ca. 500,000 non-typed paper slips, which were integrated into the database and linked to the same fields as the typed citations (cf. Johannsson 2019).

After this restructuring, the database became an essential part of the dictionary, as online users could query the database directly and search the data in different ways, no longer limited by the printed alphabetical list of dictionary entries. Besides headword search, the database structure makes it possible to search the data by several criteria, such as the dating of the original manuscript, country of provenance, work, literary genre, and so on.

Enhancing the data

Since 2010 the online version has gradually expanded with new edited articles and it has been redesigned and improved with additional search options. The dictionary database is no longer the only source of information that is available to the user of the dictionary. The data have been enhanced through various ways of linking them internally and to other digital resources. There are now links to other dictionaries as well as digital editions of Old Norse texts (cf. Wills et al. 2018). We demonstrate how the data can be used in different ways and how they are displayed in the current online version of the ONP dictionary, e.g. by the reader feature which provides glossaries to scholarly text editions (cf. Wills & Johannsson 2019). In this way, ONP remains an important research tool for scholars in medieval Scandinavian language, literature, and culture.

The current study demonstrates how legacy data, originally only organized in a paper filing system, have been structured in a database and improved in various ways through three main periods in the project’s history. We show that even though the original data still provide the basis of the dictionary they have been built upon and enhanced by innovative use of the information from a specialized database as well as by external digital sources.


Johannsson, E. (2019). Integrating analog citations into an online dictionary. In C. Navarretta, M. Agirrezabal, B. Maegaard (eds.) Proceedings of the Digital Humanities in the Nordic Countries 4th Conference, pp. 250-258.

Johannsson, Ellert Thor & Simonetta Battista (2014). “A Dictionary of Old Norse Prose and its Users – Paper vs. Web-based Edition”, in Andrea Abel & al. (eds.): Proceedings of the XVI EURALEX International Congress: The User in Focus, 15-19 July 2014, Bolzano/Bozen, 169-179.

Johannsson, Ellert Thor & Simonetta Battista (2016). “Editing and Presenting Complex Source Material in an Online Dictionary: The Case of ONP”, in Tinatin Margalitadze & Georg Meladze. (eds.): Proceedings of the XVII EURALEX International Congress: Lexicography and Linguistic Diversity, 6-10 September 2016, Tbilisi, 117-128.

ONP = Degnbol, H., Jacobsen, B.C., Knirk, J.E., Rode, E., Sanders, C. & Helgadóttir, Þ. (eds.). Ordbog over det norrøne prosasprog / A Dictionary of Old Norse Prose. ONP Registre (1989). ONP 1: a-bam (1994). ONP 2: ban-da (2000). ONP 3: de-em (2004). Copenhagen: Den Arnamagnæanske Kommission.

ONP 1993 = Evaluation of the Production Plan for the Dictionary of Old Norse Prose. Copenhagen: Ministry of Education and Research.

Wills, T., Jóhannsson, E., & Battista, S. (2018). Linking Corpus Data to an Excerpt-based Historical Dictionary. In J. Čibej, V. Gorjanc, I. Kosem, S. Krek (eds.) Proceedings of the XVIII EURALEX International Congress: Lexicography in Global Contexts. Ljubljana: Ljubljana University Press, Faculty of Arts, pp. 979-987.

Widding, Ole (1964): Den Arnamagnæanske Kommissions Ordbog, 1939–1964: Rapport og plan, Copenhagen: G.E.C.GADS Forlag.


Library Loan Data as a New Resource for Studying Literary Culture

Kati Johanna Launis, Erkki Sevänen

University of Eastern Finland, Finland

(Presentation, not a publication-ready long paper)

Our presentation is based on the interim results produced by the consortium LibDat: Towards a More Advanced Loaning and Reading Culture and its Information Service (2017–2021, Academy of Finland). The researchers in this consortium come from the University of Eastern Finland, Åbo Akademi University, the Technical Research Centre of Finland and Vantaa City Library in the Helsinki metropolitan area. The project not only works with the born-digital library loan data that Vantaa City Library has collected since 2016 (see also Neovius et al. 2018), but also utilizes the digital data collected by the joint Helsinki Metropolitan Libraries (HelMet libraries, 17 million loans yearly).

The project consists of three tasks. First, we ask what kind of picture of current Finnish reading culture this data conveys. Second, we examine how the information services of Finnish public libraries function and how they can be developed further. Third, we develop concrete methods by which scholars can analyse and interpret the huge and mainly quantitative data concerning libraries’ loans. We wish to show how large digital material, computational methods and literature-sociological research questions can be united in the study of literary culture. The presentation at hand focuses on the first task. Consequently, we attempt to show how large, digitally preserved data material can change and deepen our understanding of current literary culture.

Our data indicates that a clear change has occurred in Finnish reading culture since the 1970s and 1980s. In her well-known studies, Suomalaiset kirjanlukijoina (The Finns as Book Readers, 1979) and Lukijoiden kirjallisuus Sinuhesta Sonja O:hon (Readers’ Literature from Sinuhe to Sonja O, 1990), Katarina Eskola concluded that, in the 1970s and 1980s, Finnish readership was characterized by the uniformity of literary taste and the popularity of the “national classics”. By “national classics” she meant mainly male authors such as Eino Leino, Mika Waltari, Väinö Linna and Kalle Päätalo, who were central figures in the Finnish literary culture of the 19th and 20th centuries. In Eskola’s studies, a clear majority of Finnish readers named these authors as their favorite authors. Thus, both male and female readers were fond of them.

As we have shown earlier on the basis of the same loan data (Launis et al. 2018), Finnish reading culture is currently fragmented. The commonly known, widely read “classics”, that is, books belonging to the literary canon of Finnish literature, no longer attract library users, and the literary taste of library users is more heterogeneous. A striking feature in current Finnish readership is the dominance of women. It is middle-aged female readers who today maintain literary culture in Finland: between July 2016 and October 2017 there were about 1.5 million loans of fiction in Vantaa City Library and 76% of these fiction loans were made by women. During this period, the most popular fiction book among women was an entertaining domestic historical novel, Ruokarouva (2016, “The Housekeeper”), written by the popular female author Kirsti Manninen under the pen name Enni Mustonen. Borrowers also favor novels published in series. In contrast, young borrowers aged 15–19 (mainly girls, 75% of all book loans) favor newly translated Anglo-American Young Adult fiction (John Green, Estelle Maskame), also published in series and adapted for film or television. They also read authors who are active on social media (YouTube, Twitter) (Launis & Mäkikalli, forthcoming).

Thus, there is a clear difference between the literary taste of Finnish readership in the 1970s and today, as well as a generational difference between adult and young readers today: middle-aged women prefer entertaining domestic historical fiction, whereas young female readers are fond of translated Anglo-American YA fiction. Even though the uniformity of the reading culture has given way to diversity, some features in Finnish literary taste seem to be quite permanent. Depictions of Finnish history, narrated in a realistic manner and portraying hard work and the countryside (such as Ruokarouva mentioned above), still seem to attract readers. In this respect, our results are in line both with Kimmo Jokinen’s (1997) and Katarina Eskola’s (1979; 1990) studies. Those studies, in particular Jokinen’s Suomalaisen lukemisen maisemaihanteet (The Ideal Landscapes of Finnish Reading, 1997), emphasize that Finnish readers are fond of books that describe our common history and social world in a realistic manner that avoids formal experiments and artistic inventions. Likewise, library users clearly favor brand new, first-rate domestic Finnish fiction, for example, the winners of the annual literary prize (Finland-prize).

In earlier studies of Finnish reading culture, methods such as interviews and questionnaires have been widely used. Since then, attempts to introduce quantitative methods into the study of literary culture have been hampered by the lack of suitable data. The situation has changed radically with the rise of the digital humanities: nowadays big data – e.g. the library loan data used in the LibDat project – constitutes a different, significant resource for understanding literary culture from a new and wider perspective, for reading it distantly (cf. Moretti 2000). At the moment, we are applying social network analysis as well as cluster analysis to the library loan data. Analysis based on the integration of large “born-digital” material, new computational methods and a literary-sociological approach opens up the possibility of posing new questions in the humanities. By uniting quantitative analysis and qualitative interpretation, this sort of research is able to reveal new features in literary culture.
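As a rough illustration of what network and cluster analysis of loan data can look like, the sketch below builds a weighted co-loan network (two titles are linked when the same borrower has borrowed both) and groups titles by connected components. It is not the LibDat method, which the abstract does not specify; the borrower identifiers, titles and function names are invented, and connected components are only the crudest form of clustering.

```python
from collections import defaultdict
from itertools import combinations

# Invented (anonymised borrower id, title) pairs; not real loan data.
LOANS = [
    ("b1", "Title A"), ("b1", "Title B"),
    ("b2", "Title A"), ("b2", "Title B"),
    ("b3", "Title C"), ("b3", "Title D"),
    ("b4", "Title D"),
]


def co_loan_edges(loans):
    """Count, for each pair of titles, how many borrowers borrowed both."""
    titles_by_borrower = defaultdict(set)
    for borrower, title in loans:
        titles_by_borrower[borrower].add(title)
    weights = defaultdict(int)
    for titles in titles_by_borrower.values():
        for a, b in combinations(sorted(titles), 2):
            weights[(a, b)] += 1
    return dict(weights)


def clusters(edges):
    """Group titles into connected components of the co-loan network."""
    neighbours = defaultdict(set)
    for a, b in edges:
        neighbours[a].add(b)
        neighbours[b].add(a)
    seen, result = set(), []
    for start in sorted(neighbours):
        if start in seen:
            continue
        stack, component = [start], set()
        while stack:
            node = stack.pop()
            if node in component:
                continue
            component.add(node)
            stack.extend(neighbours[node] - component)
        seen |= component
        result.append(sorted(component))
    return result


edges = co_loan_edges(LOANS)
groups = clusters(edges)
```

On real data the edge weights would typically be thresholded or normalised before clustering, since popular titles otherwise connect almost everything.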

Our project started in 2017. Currently, we are attempting to integrate a truly comparative element into it. To this end we have, for example, got in contact with French, American and Australian scholars who also work with comparable digital data material. It will be interesting to see in what respects other countries’ reading cultures have changed during the past decades and how much they resemble current Finnish reading culture.


Eskola, Katarina (1979). Suomalaiset kirjanlukijoina. Helsinki: Tammi.

Eskola, Katarina (1990). Lukijoiden kirjallisuus Sinuhesta Sonja O:hon. Helsinki: Tammi.

Jokinen, Kimmo (1997). Suomalaisen lukemisen maisemaihanteet. Jyväskylä: Jyväskylän yliopisto.

Launis, Kati, Eugene Cherny, Mats Neovius, Olli Nurmi & Mikko Vainio (2018). Mitä naiset lukevat? Kirjallisuudentutkimuksen aikakauslehti Avain 4/2018, 4–21.

Launis, Kati & Aino Mäkikalli (2019, forthcoming). Mitä tehdä, kun Shakespeare ei vloggaa eikä Waltari twiittaa? Koulu, kirjasto ja nuorten uudistuvat lukemiskulttuurit.

Moretti, Franco (2000/2013). Distant Reading. London & New York: Verso.

Neovius, Mats, Kati Launis & Olli Nurmi (2018). Exploring Library Loan Data for Modelling the Reading Culture: Project LibDat. Proc. of Digital Humanities in the Nordic Countries 3rd Conference, Helsinki, Finland, March 7-9, 2018, online:


Open a GLAM Lab

Aisha Al Abdulla3, Sarah Ames12, Paula Bray10, Gustavo Candela5, Sally Chambers11, Caleb Derven4, Milena Dobreva9, Katrine Gasser1, Stefan Karner13, Kristy Kokegei6, Ditte Laursen1, Mahendra Mahey8, Abigail Potter2, Armin Straube9, Sophie-Carolin Wagner13, Lotte Wilms7

1Royal Danish Library, Denmark; 2Library of Congress Digital Innovation Lab, US; 3Qatar University Library, Qatar; 4University of Limerick, Ireland; 5Biblioteca Virtual Miguel de Cervantes, University of Alicante, Spain; 6History Trust of South Australia, Australia; 7KB Research Lab, The Netherlands; 8British Library Labs, UK; 9UCL Qatar, Qatar; 10State Library of NSW, Australia; 11Ghent Centre for Digital Humanities, Ghent University, Belgium; 12National Library of Scotland, UK; 13ONB Labs, Austrian National Library, Austria

In the age of digital production and transformation, Labs are one of the most significant and disruptive influences on organisations such as Galleries, Libraries, Archives and Museums (GLAMs). All over the world, cultural heritage institutions are witnessing the value and dynamism Labs bring to their collections, making them more accessible, used, shared and enjoyed by their users, embracing innovation, development, experimentation, new ideas through disruptive thinking, and generating opportunities. Labs are living, progressive and transformational. They push boundaries, open up new perspectives, create content and encourage engagement with communities.

This poster will present a new book on GLAM Labs. The book is a collective outcome with contributions from 16 people from 14 cultural heritage organisations and universities around the world. The themes reflected in this book, such as being open to experimentation, risk-taking, iteration and innovation, also capture the methodology of the book, which was written collectively over five days.

The book describes what an Innovation Lab is in the GLAM context, what an Innovation Lab is for, and how to make one happen. The book addresses characteristics, aims and objectives, processes and prospects, tools and services, as well as legal, financial and operational issues. Significantly, the book addresses how libraries, archives, museums, heritage institutions and users can operate and benefit from Innovation Labs.

More specifically, the following themes are covered in the book:

Introducing GLAM Labs

A Galleries, Libraries, Archives and Museums (GLAM) Lab is a place for experimenting with digital collections and data. It is where researchers, artists, entrepreneurs, educators and the interested public can collaborate with an engaged group of partners to create new collections, tools, and services that will help transform the future ways in which knowledge and culture are disseminated. The exchange and experimentation in a Lab are open, iterative and shared widely.

Building a GLAM Lab

Building a GLAM Lab involves defining its core values to guide future work and fostering a culture that is open, transparent, generous, collaborative, creative, inclusive, bold, ethical and accessible, and that encourages a mindset of exploration. The Lab should be grounded in user-centred and participatory design processes, and its staff should be able to clearly communicate what the Lab is about. It's important to think big but start small and establish quick wins to get up and running. This chapter describes why and how to open a GLAM Lab and encourages participation in a movement that can transform organisations and the communities they partner with.

GLAM Lab teams

There are recommendations for the qualities and skills to look for in Labs teams, how to go about finding allies within and outside the institution, and ideas on how to create a nurturing environment for teams to thrive in. Labs teams have no optimal size or composition, and their members can come from all walks of life. Teams, which might be augmented intermittently by fellows, interns or researchers-in-residence, need a healthy culture to ensure a well-functioning Lab. For a Lab to have lasting impact it must be integrated into the parent organisation and have the support of staff at all levels.

User communities

GLAM Labs will need to engage and connect with potential users and partners. This means rethinking these relationships to help establish clear and targeted messages for specific communities. In turn, this enables Labs to adjust their tools, services and collections to establish deeper partnerships based on co-creation, and open and equal dialogue.

Rethinking collections and data

This chapter discusses the digital collections which are an integral part of Labs. It provides insights on how to share the collections as data, and how to identify, assess, describe, access, and reuse the collections. In addition, there is information about messy and curated data, digitisation, metadata, rights and preservation.

Transformation


Experimentation is the core of the Lab's process. This chapter offers insights into how to transform tools into operational services and shows that experimentation can prepare the organisational culture and services for transformation. It also examines funding, weighing the advantages and disadvantages of various models through discussion of the different mechanisms and options that an organisation can apply to Lab set-ups.

Funding and Sustainability

This chapter provides insights on how to plan for a Lab's sustainability as well as a step-by-step guide for when an organisation is retiring or decommissioning a Lab.

Curious? Come and see our poster or get involved!