Workshop: Parallel Corpora as Digital Resources and Their Applications
Organized by Natalia Perkova (Stockholm University / Uppsala University), Dmitri Sitchinava (Institute of the Russian language / Higher School of Economics)
Tutorial: An Introduction to Computer Vision for Working with Maps
Organized by Daniel van Strien (Living with Machines, British Library), Katie McDonough (Living with Machines, The Alan Turing Institute), Kaspar Beelen (Living with Machines, The Alan Turing Institute), Kasra Hosseini (Living with Machines, The Alan Turing Institute), Amy Krause (Living with Machines, The University of Edinburgh)
Tutorial: Hands-On Research Data Management for Digital Humanities
Organized by Annika Rockenberger (University of Oslo Library) and Philipp Conzett (UiT The Arctic University)
Workshop: TwinTalks 2. Understanding and Facilitating Collaboration in DH
Organized by Steven Krauwer (CLARIN ERIC / Utrecht University), Darja Fišer (CLARIN ERIC / University of Ljubljana)
Workshop and discussion: Digital Approaches to Endangered Language Communities: A Nordic Perspective
Organized by the Livonian Institute of the University of Latvia
Workshop: Higher Education Programs in Digital Humanities and Social Sciences: Challenges and Perspectives
Organized by Koraljka Golub (Linnaeus University), Isto Huvila (Uppsala University), Olle Sköld (Uppsala University), Nuno Otero (Linnaeus University), Marianne Ping Huang (Aarhus University), Mikko Tolonen (University of Helsinki)
Workshop: Intellectual History and the Digital Humanities: Prospects and Challenges
Organized by Benjamin Martin (Uppsala University), Mark Hill (Helsinki University)
 Opening
Dagnija Baltiņa, Director of Special Collections Department of the National Library of Latvia
Mikko Tolonen, Chair of the DHN Board
Ilga Šuplinska, Minister of Education and Science of the Republic of Latvia
Rolands Lappuķe, External Smart Technology Adviser to the President of Latvia
Stefan Eriksson, Director of the Nordic Council of Ministers Office in Latvia
Not only are there large expectations of AI's potential to help solve many current problems and to support the well-being of all, but concerns are also growing about the impact of AI on society and human well-being. Currently, many principles and guidelines have been proposed for trustworthy, ethical or responsible AI. In this talk, I argue that ensuring responsible AI is more than designing systems whose behavior is aligned with ethical principles and societal values. It is, above all, about the way we design them, why we design them, and who is involved in designing them. This requires novel theories and methods to put in place the social and technical constructs that ensure that the development and use of AI systems is done responsibly and that their behavior can be trusted.
10:50am - 11:20am Coffee Break
Lobby (Level -1)
11:20am - 1:10pm Keynote 2, Long Papers
Session Chair: Olle Sköld
Ziedonis Hall
Keynote speaker (50min)
A Vaccine Against Fake News
Jon Roozenbeek
Cambridge Social Decision-Making Lab, University of Cambridge
Jon Roozenbeek will be talking about online misinformation and what to do about it. The problem has been highly pervasive, and governments, social media companies, think tanks and civil society have found it difficult to find sustainable, scalable solutions. Jon will discuss what he and his colleagues have been doing to combat online misinformation by combining insights from social psychology with gamification.
Long Paper (20+10min)
Wrangling with Non-Standard Data
Eetu Mäkelä1, Krista Lagus1, Leo Lahti2, Tanja Säily1, Mikko Tolonen1, Mika Hämäläinen1, Samuli Kaislaniemi3, Terttu Nevalainen1
1University of Helsinki, Finland; 2University of Turku, Finland; 3University of Eastern Finland, Finland
Research in the digital humanities and computational social sciences requires overcoming complexity in research data, methodology, and research questions. In this article, we show through case studies of three different digital humanities and computational social science projects that these problems are prevalent, multiform, and laborious to counter. Yet, without facilities for acknowledging, detecting, handling and correcting for such bias, any results based on the material will be faulty. Therefore, we argue for the need for a wider recognition and acknowledgement of the problematic nature of many DH/CSS datasets, and correspondingly of the amount of work required to render such data usable for research. These arguments have implications both for evaluating feasibility and allocating funding with respect to project proposals, and for assigning academic value and credit to the labour of cleaning up and documenting datasets of interest.
Long Paper (20+10min)
Modal Grammar and Metaphoricity as Vehicles of Affectivity in Political Newspaper Journalism
Antti Kanner1, Anu Koivunen2, Eetu Mäkelä1
1University of Helsinki, Finland; 2Tampere University, Finland
This paper introduces an approach to derive computationally tangible markers and higher-level modes of social behaviour through the identification of pragmatic functions of language. The study presented here offers an example of such an approach. The context of the study is the project Flows of Power: Media as Site and Agent of Politics (Academy of Finland 2019–2022), which investigates the agency of journalistic media in the flows of information, public opinion and power. Through a large-scale empirical analysis of Finnish journalism between 1998 and 2018, it seeks to explore the strategies of journalistic news media in staging and managing political processes. In that context, the role of the present study is that of a pilot case, exemplifying political and economic polarization in contemporary Europe and, as such, a challenge for media attempting to resist being polarized. The focus is on the media coverage of a spectacular political conflict in 2015–2016 between the Finnish right-wing conservative government and the trade unions. The conflict resulted in wage freezes, reduced pay for public sector employees, extensions of annual working time and increased social security contributions from employees. As a theoretical starting point, the study takes the notion of affective economy (Ahmed 2004) from cultural theory, used to analyse affectivity not as properties of subjects or objects but as qualities in movement and circulation themselves. Our approach is based on the idea, proposed in discourse-oriented linguistics and pragmatics (e.g. White 2003), that the dialogic nature of language compels writers to adapt pre-emptively to the perceived affective atmosphere of their readers. In newspaper reporting, this manifests through what Gaye Tuchman (1972) has termed the “strategic ritual of objectivity”, the desire to appear to stand outside or rise above the subject at hand: to be dispassionate, disembodied and impartial.
Traditionally, journalistic genres have carefully distinguished between news and opinion, relegating judgement and emotionality to columns, editorials and other opinion pieces. However, as Karin Wahl-Jørgensen (2019) argues, there is also a strategic ritual of emotionality operating alongside that of objectivity, entailing conventions and codes for incorporating affects into storytelling - and for hiding and displacing emotion. Coming from this background, we start from the assumption that affective flows and intensities are circulated in news media not just by overt expressions of sentiment, but through specialized conventions which necessitate a form of affective linguistic labour on the part of writers. This paper thus seeks to develop a methodological approach and a corresponding workflow through which it becomes possible to recognize traces of such affective linguistic labour by grouping linguistic structures that correlate with structures that are known to perform affective functions. The full dataset of the FLOPO project consists of the whole published material of key Finnish news agencies (STT), newspapers (Helsingin Sanomat), public service broadcasting (YLE) and daily tabloids (Iltalehti). The subset used in the pilot project (of which this paper forms a part) covers news reporting on the Competitiveness Pact from early 2015 until the end of 2016. The dataset used here is thus relatively stable with respect to the themes discussed in the texts. The texts were assigned metadata categories relating to genre (news report, commentary, analysis and editorial), temporal position relative to the political events reported, and outlet. Each text was also annotated with information about intratextual segments, making it possible to observe whether a given word occurs inside a quotation or at the beginning of a text.
We assumed that journalistic conventions dictate how linguistic expressions mediating affective intensities may be distributed across these variables. Presuming that adjectives with evaluative and emotive meanings are used in affective functions provided the analysis with a baseline against which other features could be compared. This presumption is not only corroborated by the wide use of emotive and evaluative adjectives in sentiment analysis but has also been explicitly observed in corpus-based studies of newspaper journalism (Huang 2018). Evaluative adjectives especially are a good entry point because their evaluative meaning is often their primary semantic component and is not based on interpretations of their use in context. Samples of texts were manually close read by experts to extract other passages which had heightened levels of affective intensity. From these passages, the affective lexical core was extracted manually. A considerable number of these expressions were used metaphorically in their context, and we hypothesized that the affectivity of these passages was in some way based on or tied to that metaphoricity. This hypothesis was tested by expanding the set of words that were used metaphorically with other words belonging to the same conceptual domains (often sports, war or physical pain) using pre-trained word embeddings, and analysing whether this list of words exhibited distributional patterns similar to those of the adjectives. According to the dialogic view of language, a considerable share of the linguistic structures selected in texts is not directly derived from their propositional content but has more to do with how that content is framed and how the writers align their position and the position of their perceived audiences in relation to that content.
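The expansion step described above can be illustrated with a minimal sketch. It is not the authors' implementation: the toy three-dimensional vectors, the English example words, the helper name `expand_seed_words` and the similarity threshold are all assumptions made for illustration, and real pre-trained embeddings would have hundreds of dimensions.

```python
import math

# Toy "embeddings" (hypothetical values, invented for illustration only).
EMBEDDINGS = {
    "battle": [0.9, 0.1, 0.0],
    "attack": [0.8, 0.2, 0.1],
    "defeat": [0.85, 0.15, 0.05],
    "budget": [0.1, 0.9, 0.2],
    "salary": [0.0, 0.95, 0.1],
}


def cosine(u, v):
    """Cosine similarity between two vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def expand_seed_words(seeds, embeddings, threshold=0.95):
    """Add every vocabulary word whose vector is close to at least one seed word."""
    expanded = set(seeds)
    for word, vector in embeddings.items():
        if word in expanded:
            continue
        if any(cosine(vector, embeddings[s]) >= threshold
               for s in seeds if s in embeddings):
            expanded.add(word)
    return expanded


# A manually identified metaphor from the "war" domain serves as the seed.
war_domain = expand_seed_words({"battle"}, EMBEDDINGS)
print(sorted(war_domain))
```

The expanded word list could then be counted per genre or text segment and compared with the distribution of the evaluative adjectives; the threshold is a tuning choice rather than a value taken from the study.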
All this gives reason to assume that, alongside perhaps more obvious emotionally loaded vocabulary, grammatical structures also play an integral part in how writers at the same time adapt to and reproduce affective intensities. This motivated another hypothesis, according to which two related grammatical categories, evidential structures and epistemic modals, contribute to the conventions of affective mediation in newspaper journalism alongside overt emotive and evaluative vocabulary. The use of these structures, especially in academic writing, has been analyzed with the concept of hedging, a politeness strategy through which writers pre-emptively make their claims less threatening by reducing their level of certainty or assurance. As things like certainty and trustworthiness are also highly affective in journalistic practice, it seemed reasonable to assume that these structures would be relevant in the conventions through which journalistic writing mediates affectivity. A wide range of linguistic markers identified by established scholarship on Finnish as functioning (among other things) as evidentials were tested: modal verbs and verb constructions, modal adjectives and adverbs, connectives and so on (e.g. ISK 2004, Kangasniemi 1990, Laitinen 1989), amounting to around 90 distinct linguistic markers. These structures were also compared with the affective baseline provided by the evaluative and emotive adjectives. The results seemed to confirm both hypotheses and point towards the interpretation that both metaphors and hedging strategies could be used as markers for identifying heightened affective intensity and concentrations of affective linguistic labour. Instead of cataloguing each affective expression in the data, the idea here was to chart the functional resources used in mediating affective intensities, as it is likely that observations about them retain their validity outside the studied case.
While the metaphoricity of a given word depends on whether the news piece is about partisan politics, war or ice hockey, it is likely to remain true that metaphors in general carry affective value in each of them. The upside of this approach, we argue, is not that it would produce readily usable resources applicable to other cases, but that it offers a framework through which content-sensitive expertise can intrude into the computational operation. Thus our paper contributes to journalism studies by developing a theoretical and methodological approach to affectivity in news and actualities. Integrating discourse-oriented linguistics and pragmatics into affect studies entails re-introducing a linguistic model to a post-linguistic theory frame. This, we suggest, is necessary to understand affectivity as meaning-making, an important feature of news journalism beyond explicitly emotive storytelling. From the perspective of Digital Humanities, the study introduces two insights into how humanities expertise can be operationalized in the context of large-scale computational analysis of complex discursive phenomena, first by showing how focusing on presents opportunities not available in content-agnostic settings and, second, by showing how functional and abstract linguistic structures can become accessible by taking known example categories as a starting point.
References
Ahmed, Sara (2004) The Cultural Politics of Emotion. Edinburgh: Edinburgh University Press.
ISK 2004 = Alho, I. (2004). Iso suomen kielioppi. Helsinki: Suomalaisen Kirjallisuuden Seura.
Kangasniemi, Heikki (1992). Modal expressions in Finnish. Helsinki: Suomalaisen kirjallisuuden seura.
Laitinen, Lea (1989). Välttämättömyys ja persoona: Suomen murteiden nesessiivisten rakenteiden morfosyntaksia ja semantiikkaa. [Helsinki]: Helsingin yliopisto.
Tuchman, Gaye (1972) “Objectivity as Strategic Ritual: an examination of newsmen’s notions of objectivity”, American Journal of Sociology 77.
Wahl-Jørgensen, Karin (2019) Emotions, Media, and Politics. Cambridge: Polity.
White, P. R. R.
Libraries and Digital Resources

Short Paper (10+5min)
Online Participatory Memory Work: Understanding the Potential Roles of Online Mnemonic Communities in Building the Collections of Public Memory Institutions
Ina-Maria Jansson, Olle Sköld
Department of ALM, Uppsala University, Sweden
This task has been taken on by digital humanists as well, often with sights set on better understanding the impact of information technology and networked communication on present-day human affairs (e.g., Kirschenbaum, 2008). This paper shares this ethos, and argues for the importance of understanding digital information infrastructures and platforms when seeking to include the collective memories of diverse groups in the collections of public memory institutions (e.g., archives, libraries, museums). The paper presumes that the participatory memory work of online communities is not an isolated process but an element of digital ecologies. Purposeful documentation and preservation of those ecologies is essential for contextualizing and understanding the outcome of community memory work. Technical changes in how people communicate create ripple effects that extend through a wide range of human endeavors and processes, including the shaping, communication, and re-formulation of collective memory (Hoskins, 2009). Collective memories are thus dependent on, and carried by, technological and structural frameworks (for example common entities like standards) that reconstruct shared values and concepts (Bowker, 2005). Just as a tool shapes the object of its creation, so do technologies make their imprint on the information that is communicated through them. The increasing use and complexity of online digital platforms for communication within and between communities is here defined as such a technical innovation that shapes memory practices. Research has shown that networked forums constitute important arenas for communities that are minorities in, for example, the social, cultural, gender, medical or socio-economic sense (Af Segerstad & Kasperowski, 2014; Boyd, 2010; Marwick & Ellison, 2012; Wagner, 2018).
In these online spaces, communities engage in many different memory-making practices for a variety of purposes and intents (e.g., Sköld, 2015, 2017). Such community memory work also plays an important role as social support for its members. It also increases a sense of identification with, and inclusion in, the community itself (Assmann & Czaplicka, 1995), as shared memories of a community can be used to socialize new members into a group (often termed ‘mnemonic socialization’) in order for them to identify with the group’s past (Misztal, 2003). It is clear that digital community platforms constitute an essential part of digital existence. Here, an insight emerges with regard to participatory memory work in the memory institution sector. The ability to support an inclusive and diverse public memory rests on key ongoing developments in the digital present being competently grasped, collected, and integrated into the collections of public memory institutions. Such an ambition can only be realized if it also includes the massively productive memory work communally conducted by online communities in the different spaces and services of the Internet. The memory work of online communities is, however, an understudied topic, and the opportunities and pitfalls present in the important endeavor to include the infrastructural complexity of shared memories of communities in the collections of public memory institutions are poorly understood (e.g., McDonough et al., 2010; Sköld, 2018a; Winget, 2011).
2. Aims, materials, and methods
The aim of this study is to explore the information infrastructural premises for the memory work of online communities and how public memory institutions can better succeed in their efforts to create diverse and inclusive collections by recognizing and supporting those premises. The study is based on two case studies of memory work in online MMORPG videogame communities.
Videogame communities offer an interesting case in relation to the aim of the study for several reasons. Firstly, videogames and videogaming are landmark features of digital culture today. Videogames and videogaming have impacted many arenas of contemporary life. Examples include technology development and adoption (Swalwell, 2007), management and organizational thought (Deterding et al., 2011), and the everyday interactions of many people by becoming sites of meaningful play and social interplay (Pearce, 2009), storytelling (Albrechtslund, 2010), learning (Barr, 2014), and knowledge production (Sköld, 2017). Secondly, and owing mainly to the ubiquity of the videogame phenomenon, videogame communities showcase many of the key issues and considerations that confront memory institutions aiming to build bridges between online-community memory work and institutional practices. Examples include ethical issues, legal, economic and ownership issues, and patterns of power relations and memory-making practices that commonly occur in the broader online space. The aim of the study is met in two steps (RQ-1 and RQ-2), and is guided by a basic tenet of preservational work: successful curation rests to a significant extent on sufficient knowledge of the material in focus (Mortensen, 1999; Kirschenbaum, 2008).
─ RQ-1. How are online communities conducting memory work, and what are the characteristics of the materials they produce as a result of this work?
─ RQ-2. What are the potential results, pitfalls, and opportunities of efforts seeking to integrate the memory work of online communities into the collections of public memory institutions?
The materials of the first case study consist of 40 World of Warcraft (WoW) blogs collected in 2011 (Sköld, 2011). The second case is a study of 140 discussion threads (containing texts, images, videos, and audio) posted on a City of Heroes (CoH) discussion forum between 2012 and 2013 (Sköld, 2015).
RQ-1 is answered by reporting on the WoW and CoH communities’ practices of memory work, and by a typological analysis of the materials they produce. RQ-2 is met with guidance from the theory of information infrastructures, the concept of institutionalization, and the results of previous research on videogame preservation (see e.g., Sköld, 2018b; Winget, 2011 for overviews).
3. Theoretical framing and discussion
The concept of information infrastructures makes visible the otherwise often transparent foundations for information and communication (Star & Ruhleder, 1996). It is employed in this study to distinguish between different information spheres and to understand the conditions and the challenges that have to be overcome when including material produced by online communities in the collections of memory institutions. It is used to highlight the differences in settings and practices between the community sphere and the institutional sphere. Furthermore, this bridging of information spheres is discussed in terms of institutionalization, which denotes the process by which material created within the online community becomes part of the institutional collections of archives, libraries, or museums. Among the many challenges and concerns for the institutionalization of online-community memory work, this paper argues that the organizational paradigms usually found in public memory institutions are among the most critical. For example, the multi-medial characteristics of online-community communicative memory may create difficulties for memory institutions whose collection management and mediation practices are centred on mono-medial materials. The benefit that the institutionalization of (the often very diverse) online-community memory work offers to memory institutions seeking to support inclusive memory politics nevertheless makes it worthwhile to strive to overcome such hindrances.
Short Paper (10+5min)
Detecting Social Structures Using Library Loan Data
Olli Nurmi1, Kati Launis2, Erkki Sevänen3
1VTT, Finland; 2University of Turku, Finland; 3University of Eastern Finland, Finland
Finland is a country with high PISA rankings and a well-functioning, publicly funded, free-of-charge library system. About 80% of the Finnish population use public libraries regularly, and during the last two decades 35-50% of Finnish people have borrowed something (books, journals, films, CDs) from the public libraries at least once a year. Against this background, Finland can be said to have an active reading culture. However, radical changes in time use, digitization, and attitudes towards reading have influenced our reading habits substantially. In this article, we study the current Finnish reading culture by analysing the loan data collected by Vantaa City Library in Finland’s metropolitan area. In earlier studies of the Finnish readership, methods such as interviews and queries have been widely used (see, for example, Eskola 1979). Since then, attempts to introduce quantitative methods into the study of literary culture have been hampered by the lack of suitable data. The situation has changed radically with the rise of the digital humanities: nowadays big data, such as the library loan data used in this article, constitutes a significant resource for understanding literary culture from a new and wider perspective. Integration of large “born-digital” material, new computational methods and literary-sociological research questions opens a possibility to find new knowledge within the qualitative approach in humanities. Our method is to apply social network analysis to the data concerning public libraries’ loan activities in the Helsinki metropolitan area. Firstly, we draw a co-occurrence network based on the paired presence of books within a specified loan cart. We then apply the modularity maximization method to detect book clusters. Visual representations of book clusters are drawn to reveal associated cultural and literary phenomena. This paper shows that current Finnish reading culture is heterogeneous and consists of several sub clusters.
It also shows that the library users favour the newest literature and typically borrow multiple books of the same series and the same writer. Our methodological contribution is to demonstrate how social network analysis and clustering techniques can be applied to library loan data to characterize reading culture.
Introduction
Starting from previous work in the field of digital studies of cultural trends through quantitative analysis of digitized texts (Michel et al., 2010), we use social network analysis as a method for detecting changes in book reading culture and identifying reading subcultures. In literary research, social network analysis and community detection have been popular methods for visualizing certain structural features of a text or a corpus. A common usage is the visualization of relationships between texts based on the similarities of their textual contents, and of relationships between textual entities such as words (Jänicke, Franzini, Cheema, & Scheuermann, 2015). In this article, we use the visualization to disclose relationships between the books in the library collection. Firstly, we draw a co-occurrence network based on the paired presence of books within a specified loan cart. We then apply the modularity maximization method to detect book clusters. Visual representations of book clusters are drawn to reveal associated cultural and literary phenomena. This paper shows that current reading culture is a heterogeneous cultural phenomenon consisting of several different sub clusters. The position of national classics (such as Väinö Linna), popular among the Finnish readership some decades ago, has radically weakened.
Data source
The largest public library network in Finland is the Helsinki Metropolitan Area Library network (Helmet), consisting of the city libraries of Helsinki, Espoo, Kauniainen, and Vantaa. In this work, we had access to anonymized Vantaa City Library loan data.
The Helmet collection, consisting of 3.4 million items, is available to Vantaa City Library users through this network. Our data sample includes all the loan interactions of Vantaa City Library users between 20th July 2016 and 22nd October 2017, containing about 1.5 million records. We build our understanding on the library loan data because it gives an accurate, actual and much wider picture than interviewing a limited number of book readers. This work provides a reliable evidence base for decision-making and the development of effective policies in libraries.
Results
The analysis shows that the library users typically borrow multiple books of the same series and the same writer: four of the six largest clusters are formed around contemporary female authors writing entertaining fiction in series and under a pseudonym. This can be explained by the increased use of branding, where a set of marketing and communication methods is applied to distinguish the author from competitors, aiming to create a lasting impression in the minds of the readers. An author brand is, in essence, a promise to its readers that includes emotional benefits. When readers are familiar with an author’s brand, they tend to favour it over competing ones. The type of analysis used in this article can also facilitate new ways to create book recommendations or to place books in libraries. The books can, for example, be placed in libraries in clusters, which may then be sorted alphabetically. This makes it easier for library users to shift smoothly from one cluster to another when searching for and selecting new books. In addition, book series should be marked to enable readers to locate them easily. The analysis may also help obtain ‘market intelligence’ for a better understanding of the performance and evolution of different book genres and subcultures. Several algorithms can be used to calculate the importance of any given node in a network.
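The pipeline described in this abstract (co-occurrence network from loan carts, modularity as the clustering objective, node importance) can be sketched as follows. This is an illustration under stated assumptions, not the authors' code: the loan carts and book identifiers are invented, the function names are hypothetical, weighted degree stands in for whichever importance measure the authors used, and the sketch only evaluates the modularity of candidate partitions rather than searching for the best one. In practice a library optimizer such as NetworkX's `greedy_modularity_communities` would do that search.

```python
from collections import Counter, defaultdict
from itertools import combinations

# Invented loan carts: each inner list is the set of books borrowed together.
loan_carts = [
    ["series_a_1", "series_a_2", "series_a_3"],
    ["series_a_1", "series_a_2"],
    ["series_a_2", "series_a_3"],
    ["series_b_1", "series_b_2"],
    ["series_b_1", "series_b_2", "series_b_3"],
    ["series_a_3", "series_b_1"],  # one cart bridging the two series
]


def cooccurrence_weights(carts):
    """Count how often each unordered pair of books shares a loan cart."""
    weights = Counter()
    for cart in carts:
        for a, b in combinations(sorted(set(cart)), 2):
            weights[(a, b)] += 1
    return weights


def weighted_degrees(weights):
    """Sum of edge weights per book: a simple node-importance score."""
    degree = defaultdict(int)
    for (a, b), w in weights.items():
        degree[a] += w
        degree[b] += w
    return dict(degree)


def modularity(weights, partition):
    """Modularity Q = sum over clusters of L_c/m - (d_c/2m)^2."""
    degree = weighted_degrees(weights)
    total = sum(weights.values())  # m: total edge weight
    cluster_of = {n: i for i, group in enumerate(partition) for n in group}
    q = 0.0
    for i, group in enumerate(partition):
        internal = sum(w for (a, b), w in weights.items()
                       if cluster_of[a] == i and cluster_of[b] == i)
        degree_sum = sum(degree[n] for n in group)
        q += internal / total - (degree_sum / (2 * total)) ** 2
    return q


weights = cooccurrence_weights(loan_carts)
degrees = weighted_degrees(weights)
by_series = [
    {"series_a_1", "series_a_2", "series_a_3"},
    {"series_b_1", "series_b_2", "series_b_3"},
]
one_cluster = [set(degrees)]
q_series = modularity(weights, by_series)
q_single = modularity(weights, one_cluster)
print(f"by series: {q_series:.3f}, single cluster: {q_single:.3f}")
```

Modularity maximization prefers the partition with the higher score, so here the split by series wins over lumping all books together, which mirrors the reported finding that clusters form around series and authors.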
In the case of libraries, we can use these algorithms to identify books with influence over the whole network. By promoting these influential books, librarians could increase their effect on the reading culture. The library collection consists of tens of thousands of books, and no one is able to read through them all to get the "whole picture" of the literature available to borrowers and of the reading culture based on it. The distant reading (Moretti 2000) of contemporary reading culture, based on the big, digitized, daily loan data collected over 1.5 years, is the method that makes this kind of definition possible. Using data analytics methods and social network analysis, we can focus on a manageable piece of information and enable literary scholars to make surprising discoveries, generate new hypotheses or suggest further research.
Short Paper (10+5min)
Automatic Morphological Annotation of Ego-Documents: Evaluating Automatically Disambiguated Annotation of Estonian Semper-Barbarus Correspondence Corpus
Olga Gerassimenko1, Kadri Vider1, Neeme Kahusk1, Marin Laak2, Kaarel Veskis2
1University of Tartu, Center of Estonian Language Resources, Estonia; 2Estonian Literary Museum, Estonia
The digitization of cultural heritage is massive in Estonia: the national programme of mass digitisation started in 2018, and the creation of digital heritage resources has been made a priority for Estonian memory institutions (Viires, Laak 2018). Quantitative methods, even ones as simple as word frequency analysis, are not possible on unannotated texts. Morphological annotation and disambiguation is an undisputed necessity for the digitized data, especially considering the rich morphology of Estonian and the large number of homoforms. Large amounts of data need to be parsed and disambiguated automatically, which implies some error rate but still makes corpus search, data analysis and data mining much more efficient. There are many challenges for the morphological tagging of older cultural heritage sources (especially non-edited ego-documents such as letters and diaries). ESTMORF, the automatic morphological parser and disambiguator of Estonian, was created for contemporary Estonian and trained on texts from the second half of the 20th century, proving to be 99% efficient on contemporary published texts (Kaalep, Vaino 2001). The efficiency of the parser has been tested on less normative text types such as chatroom texts (Kaalep, Muischnek 2011), but ego-documents offer some specific complications: sentences are lengthy and syntactically complicated, and yet letters and diary entries may include ad hoc abbreviations, unmarked switching to other languages, and idiosyncratic orthography and punctuation. Is the automatic morphological annotation of such texts reliable enough for decent corpus research and for comparison of the target sources with other corpora? We are exploring this on the data of the Correspondence Corpus of the Estonian avant-garde poets and writers Johannes Semper and Johannes Barbarus from 1911-1940 (Laak et al. 2019).
This is a unique and multidimensional collection of private letters, the hand-written originals of which are held at the Estonian Cultural History Archives of the Estonian Literary Museum in Tartu. The correspondence consists of 670 letters with more than 1,100 pages and more than 310 890 tokens (249 970 words). The range of subjects touched upon in the letters is extremely wide: Semper and Barbarus as friends and colleagues discuss all events in Estonian cultural life, organize the publication of their books and discuss the problems of their contemporary literary and political life and even economics in Estonia and in other countries. The letters had already been converted to typewritten and then to electronic format; to transform them into a machine-readable format, morphological categories had to be automatically annotated and disambiguated and metadata had to be described manually. The corpus is openly accessible through the KORP query system and is currently being used by literary scholars for textual search. In order to evaluate the quality of the morphological analysis of the Semper-Barbarus Correspondence Corpus, we are manually checking certain excerpts of the output and computing the error rate in general and the error rate for Estonian text only (there are lengthy foreign-language excerpts in the Semper-Barbarus correspondence). We are going to compare this error rate with the error rate for texts of the same time period (1920-1930) from the Estonian Literary Criticism Corpus, which contains published publicistic texts, and with the previous work of Liba and Veskis (2008) on Estonian automatic tagger evaluation. The results of the study are going to be used to propose systematic modifications of the morphological parser by manually adding words to the parser lexicon.
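The two error rates mentioned above (overall, and for Estonian text only) amount to simple proportions over manually checked tokens. The following sketch shows one way to compute them; the `CheckedToken` record, its field names, the language codes and the sample tokens are assumptions made for illustration and do not reflect the actual annotation format of the corpus.

```python
from dataclasses import dataclass


@dataclass
class CheckedToken:
    form: str       # the token as it appears in the letter
    language: str   # "et" for Estonian, another code for foreign-language text
    correct: bool   # True if the manual check accepted the automatic analysis


def error_rate(tokens):
    """Share of tokens whose automatic analysis was judged incorrect."""
    tokens = list(tokens)
    if not tokens:
        raise ValueError("cannot compute an error rate over zero tokens")
    errors = sum(1 for token in tokens if not token.correct)
    return errors / len(tokens)


# Invented sample of manually checked tokens.
sample = [
    CheckedToken("kiri", "et", True),
    CheckedToken("on", "et", True),
    CheckedToken("saadetud", "et", False),
    CheckedToken("cher", "fr", False),
    CheckedToken("ami", "fr", False),
]

overall = error_rate(sample)
estonian_only = error_rate(t for t in sample if t.language == "et")
print(f"overall: {overall:.2f}, Estonian only: {estonian_only:.2f}")
```

Reporting both figures separates errors caused by foreign-language passages, which a parser built for Estonian cannot be expected to analyse, from errors on the Estonian text itself.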
A reliably annotated corpus can be used for quantitative research on phenomena mentioned in the texts: for instance, we can evaluate the relative frequency of words related to politics in the various decades of the correspondence. Given reliable automatic morphological annotation, we can annotate the texts syntactically and semantically and, in the longer term, apply sentiment analysis to see whether the affective polarity of the texts changes over time.
References
Barbarus-Semper Correspondence Corpus, https://doi.org/10.15155/9-00-0000-0000-0000-00190L, last accessed 2019-09-14.
Estonian Literary Criticism Corpus, https://doi.org/10.15155/9-00-0000-0000-0000-00193L, last accessed 2019-09-14.
Kaalep, H.-J.; Vaino, T. (2001). Complete morphological analysis in the linguist’s toolbox. In Proceedings of Congressus Nonus Internationalis Fenno-Ugristarum, Pars V, 9–16.
Kaalep, H.-J.; Muischnek, K. (2011). Morphological analysis of a non-standard language variety. Proceedings of the 18th Nordic Conference of Computational Linguistics: NODALIDA 18, Riga, Latvia, 11-13 May 2011. Ed. Bolette Sandford Pedersen, Gunta Nešpore, Inguna Skadina. Riga, Latvia, 130−137. (NEALT Proceedings Series; 11).
Laak, M.; Veskis, K.; Gerassimenko, O.; Kahusk, N.; Vider, K. (2019). Literary Studies Meet Corpus Linguistics: Estonian Pilot Project of Private Letters in KORP. DHN 2019 Digital Humanities in the Nordic Countries, 2364: Proceedings of the Digital Humanities in the Nordic Countries 4th Conference, Copenhagen, Denmark, March 5-8, 2019. Ed. Costanza Navarretta, Manex Agirrezabal, Bente Maegaard. Copenhagen, Denmark: University of Copenhagen, Faculty of Humanities, 283−294.
Viires, P.; Laak, M. (2018). Digital humanities meet literary studies: Challenges facing Estonian scholarship. In: Mäkelä, E., Tolonen, M., Tuominen, J. (eds.) DHN Helsinki 2018. Book of Abstracts, https://www.helsinki.fi/sites/default/files/atoms/files/dhn2018-book-of-abstracts.pdf, last accessed 2019-09-14.
Veskis, K.; Liba, E. (2008). Automatic tagger evaluation. NLP course assignment report, https://entu.keeleressursid.ee/public-document/entity-7052, last accessed 2019-09-14.
Acknowledgements
Research supported by the institutional research grant "Formal and Informal Networks of Literature, Based on Sources of Cultural History" (IRG22-2, Estonian Ministry of Education and Research), related to the Centre of Excellence in Estonian Studies (CEES), and by the programme ASTRA (2014-2020.4.01.16-0026) via the European Regional Development Fund (TK145). Development of KORP and the addition of corpora in Estonian is supported by the ERDF project "Federated Content Search for the Center of Estonian Language Resources" (2014-2020.4.01.16-0134) under the activity "Support for Research Infrastructures of National Importance, Roadmap".
Long Paper (20+10min)
Hearth Tax Digital: New Narratives on Restoration England
Andrew Wareham1, Jakob Sonnberger2, Theresa Dellinger2, Georg Vogeler2
1Roehampton University, UK; 2Graz University, Austria
The Restoration hearth tax was the first Parliamentary tax to impose a direct levy upon householders in Britain and Ireland which did not unleash major political unrest and/or a regime change (unlike, e.g., the Poll Taxes of the late 1370s). The first part of the paper will discuss why it is useful to have hearth tax records in a digital format, and the second part will present some preliminary research findings from Hearth Tax Digital. This will not only use GIS to assess distributions of population and wealth in diachronic and national contexts, but also draw upon extraneous data on occupations, rank and gender. Since 2000 there have also been important developments in digital transcription and archiving. On ScotlandsPlaces (a National Records of Scotland (NRS) website), all the hearth tax returns arising from the 1691 collection can be searched; and on British History Online (BHO) there is an Access database of the 1666 Lady Day return for London and Middlesex. Between 2010 and 2015 Hearth Tax Online (HTO) made PDFs, reprinted from hard-copy hearth tax editions, available electronically. Each of these methods has distinct advantages and disadvantages, dependent upon the aims of BHO, HTO and NRS. BHO maximizes users’ ability to manipulate the data, but does not enable users to read the 1666 return in its original order. ScotlandsPlaces is at the opposite end of the spectrum, taking careful note of manuscript marks, but with limited facility to manipulate the data; and HTO was best used in tandem with the hard-copy editions from which the printed transcripts were taken. Hearth Tax Digital, arising from a partnership between the British Academy Hearth Tax Project/Centre for Hearth Tax Research at the University of Roehampton and the Centre for Information Modelling at the University of Graz, uses the methods of an assertive digital edition to achieve five aims:
1. digital archiving and long-term preservation of hearth tax records
2. access to the digital transcripts in the original order in which they were written
3. manipulation of the statistical data synchronically/county based and diachronically/nationally
4. depiction and research enquiries on population/wealth distribution in GIS
5. searching based upon extraneous data on social conditions/rank/occupations etc. with standard data
The new website is hosted by the FEDORA-based, OAIS-compliant humanities digital archive infrastructure of Graz University (GAMS), a repository both for long-term archiving and publication of digital humanities resources. Hearth Tax Digital is essentially built upon two types of digital sources. Firstly, for some regions we have been granted access to transcripts of the original records, which were produced for the print editions published by the British Academy Hearth Tax Project and the British Record Society. These transcripts are further encoded in XML, following the guidelines of the Text Encoding Initiative (TEI). Additionally, taking the ‘assertive edition’ approach, distinct semantic units are labeled using the ana attribute. During the ingest process, a ‘toRDF’ stylesheet makes use of those labels, creating a graph database from the transcripts. Secondly, for other regions only lists of taxpayers are available, lacking any contextual information or the initial order given in the original documents (‘Returns in database format’). In this case, the data, usually given in database files or spreadsheets, are directly transformed to RDF/XML and joined with the graph data arising from the transcripts in our triple store, forming a single semantic database. Notably, all these processes, once set up for the project, apply automatically to all further data ingested into the repository following our schema, providing HTML and spreadsheet representations for both the transcripts and the ‘Returns in database format’, as well as adding the extracted semantic information to the database.
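The idea of turning ana-labelled TEI elements into graph statements can be illustrated with a small sketch. The project itself uses a ‘toRDF’ stylesheet during ingest; the Python below is only an analogy under stated assumptions. The TEI fragment, the element choice, the label values (`ex:taxpayer`, `ex:hearths`), the names and the function `entries_to_triples` are all invented for this example and are not the project's schema.

```python
import xml.etree.ElementTree as ET

TEI_NS = "http://www.tei-c.org/ns/1.0"
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"

# Hypothetical TEI fragment: elements and ana values are invented here.
SAMPLE = f"""
<list xmlns="{TEI_NS}">
  <item xml:id="entry1">
    <persName ana="ex:taxpayer">John Smith</persName>
    <num ana="ex:hearths" value="3">3</num>
  </item>
  <item xml:id="entry2">
    <persName ana="ex:taxpayer">Mary Brown</persName>
    <num ana="ex:hearths" value="1">1</num>
  </item>
</list>
"""


def entries_to_triples(tei_xml):
    """Turn every ana-labelled element inside an item into a triple.

    Each triple is (item id, ana label, value), where the value is the
    element's value attribute if present, otherwise its text content.
    """
    root = ET.fromstring(tei_xml)
    triples = []
    for item in root.iter(f"{{{TEI_NS}}}item"):
        subject = item.get(XML_ID)
        for element in item.iter():
            label = element.get("ana")
            if label is None:
                continue
            value = element.get("value") or (element.text or "").strip()
            triples.append((subject, label, value))
    return triples


for triple in entries_to_triples(SAMPLE):
    print(triple)
```

Because the labels live in the transcript itself, the same transformation can be rerun unchanged whenever new transcripts following the schema are ingested.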
With regard to the aims of the project, it can be said that:
1. The GAMS repository, certified according to the criteria of the ‘Data Seal of Approval’ as a trusted repository, guarantees long-term preservation and archiving of all records in scope. Additionally, users may easily access and download the source data (TEI/XML, RDF) of all documents.
2. The visual representation of the digital transcripts is kept as close to the original transcripts as possible, maintaining the initial order and spelling, retaining all conveyed information and trying to reconstruct the original layout (e.g. columns) of the documents. But, as the aim of a digital edition goes beyond the mere digital reproduction of the print edition, all additional information such as regularizations, editorial notes, geographical hierarchies etc. has been marked up and visualized by optical highlighting and tooltips.
3. We are also able to deliver any kind of statistical information on our data just by formulating suitable database queries.
4. By adding the geographical information on county/parish boundaries (GML, Shapefiles) provided for the print editions to our database, we can visualize almost every statistic projected on various background maps (e.g. OpenStreetMap). The ranges and parameters for these visualizations can be manipulated by the users, offering a vast playground for research beyond the standard parameters.
5. The database provides both a full-text search for any terms occurring anywhere in the transcripts and a structured search based on categories like number of hearths, personal names etc.
Currently (August 2019), Hearth Tax Digital holds more than 142,000 taxation entries, with a further 46,000 in publication. Hearth Tax Digital means that for the first time it is possible to study the hearth tax in a national context, moving across county boundaries and returns between the mid 1660s and early 1670s.
This paper will set out both the methods which have been used in developing this digital resource, and some preliminary findings on social and economic conditions in the Restoration age.
Short Paper (10+5min)
In Quest of Transition Books
Denis Kotkov1, Kati Launis2, Mats Neovius1
1Åbo Akademi, Finland; 2University of Turku, Finland
Literature read by a person not only reflects, but also affects that person. In fact, certain books (transition books) might trigger the process of becoming interested in literature for grown-ups and thereby of mentally becoming a grown-up. In this paper, we detect books that are likely to be transition books or transition book candidates, based on a loan dataset provided to us by Vantaa City Library. By applying four methods to this dataset, we show which books are likely to be candidates and why. We found the following candidate books: Tähtiin kirjoitettu virhe by John Green, Punainen kuin veri by Salla Simukka and Luukaupunki by Cassandra Clare.
Parliamentary Corpora

Short Paper (10+5min)
Discourse on Safety / Security in the Parliamentary Corpus of Latvian Saeima
Ilva Skulte, Normunds Kozlovs
Riga Stradins University, Latvia
The discourse on (public) safety and (social) security in political communication has an impact on the construction of national identity and community feeling through the ideas of risk and emergency. Methodologically, the analysis combines critical discourse analysis (CDA) and corpus analysis and is based on the Corpus of Debates in the Latvian Saeima, 1993-2017 (http://saeima.korpuss.lv/). By means of corpus analysis tools, the categories and frames of representation of safety and security in the speeches of Latvian MPs are selected and described, and a qualitative analysis is carried out to understand and interpret the differences and similarities in how MPs understand and treat different aspects of safety and security in parliamentary discourse in Latvia, and how this has changed in the period since the regaining of independence.
Short Paper (10+5min)
Analyzing Candidate Speaking Time in Estonian Parliament Election Debates
Siim Talts, Tanel Alumäe
Tallinn University of Technology, Institute of Software Science, Estonia
In this paper, we analyze the amount of speaking time of each candidate and political party during the election debates that aired in broadcast media during the Estonian 2019 parliament election campaign, using automatic speaker identification and weakly supervised neural network training techniques. Usually, speaker identification systems are trained on manually segmented and labelled training data: for each person that needs to be covered by the system, several speech segments containing speech from this person are needed. This makes training data preparation costly and time-consuming, especially if a large number of speakers needs to be identifiable. In this work, on the other hand, we trained speaker models using a recently proposed weakly supervised training method which only needs recording-level speaker labels: for each person, several recordings are needed in which this person is one of the speakers, while segment-level labeling is not required. This makes training data creation less costly. Furthermore, such training data can often be constructed automatically, using the metadata that accompanies speech recordings. The method relies on automatic speaker diarization of the training data, i-vector based speaker embeddings and a special cost function that encourages a deep neural network to assign only one of the discovered speaker vectors to a particular speaker label. The Estonian 2019 parliament elections had 1084 enlisted candidates. We used YouTube and the Estonian Public Broadcasting (ERR) media archive to retrieve audio and video files that were likely to contain speech by each of the candidates. In the case of YouTube, we retrieved videos whose title or description contained the person’s full name. For ERR, we relied on the metadata of each media clip, which listed the names of the persons speaking in the recording. Using this technique, potential training data was found for 810 candidates.
However, only 317 candidates occurred in 10 or more recordings, as required by our training method. We manually examined a small subset of the resulting dataset and determined that 12% of the clips were false positives, meaning that they did not actually contain the person for whom they were retrieved. After training speaker identification models on the automatically constructed training data, we validated the accuracy of the system using a set of four manually segmented and labelled election debates. The validation dataset contained speech by 26 unique candidates, 21 of which (78%) were covered by our system. The system correctly identified 24 of the candidates, resulting in a recall rate of 73% over all candidates. No false positives were returned, resulting in a precision of 100%. The full speaking time analysis was performed over a set of 55 election debates from six different radio and TV stations, amounting to a total of 55 hours. 19% of the debates were in Russian; the rest were in Estonian. For each debate, the set of candidates who appeared in it was constructed manually, with the help of the metadata that came with the recording. A total of 123 unique candidates appeared in the debates, of which 69 (56%) were covered by our system. The analysis of speaking time over individual candidates brought no real surprises: the leaders of the eight political parties that participated in the elections with a so-called full list (i.e., at least 101 candidates) occupied the first seven places in terms of total speaking time. By aggregating the speaking time of the individual candidates of each political party, we calculated the total speaking time of the different parties. At first, the results seemed to indicate a large bias: large and established parties received up to two times more speaking time than newer parties (even when limiting the analysis to “full list” parties).
However, we acknowledged that this was partly due to a weakness of our training method: newer parties have more candidates who are new to politics and thus have less exposure on YouTube and in the public broadcasting archive, increasing the risk that they are not covered by our model. Thus, we adjusted the results using the following method: all candidates who were present in a debate but not identified by our system were assigned an estimated speaking time, calculated as the average speaking time of the persons in that debate who were identified successfully. The adjusted results show relatively little difference between political parties: all full list parties were assigned between 220 and 270 minutes of speaking time. We did not attempt to analyze the causality between candidate speaking time and election results, since there are several factors, such as prior popularity, speaking skills and experience in political debates, that affect both exposure in debates and the number of votes received. The experiments showed that it is possible to use methods of weak supervision to create a targeted speaker identification system with high precision by using several potentially noisy data sources. However, it was also observed that for a large proportion of the candidates no training data could be automatically retrieved from public data sources, and thus no speaker identification models could be trained for them. The analysis showed that the election debates were not biased from the speaking time point of view: all major political parties received around 245 (± 10%) minutes of speaking time across the debates.
Long Paper (20+10min)
Digging Deeper into the Finnish Parliamentary Protocols – Using a Lexical Semantic Tagger for Studying Meaning Change of Everyman's Rights (Allemansrätten)
Kimmo Kettunen1, Matti La Mela2
1University of Helsinki, National Library of Finland, Finland; 2Aalto University, Semantic Computing Research Group, Finland
This paper analyses the protocols of the Finnish parliament 1907–2000.
Long Paper (20+10min)
Keeping it Simple: Word Trend Analysis for the Intellectual History of International Relations
Benjamin G. Martin1,2
1Uppsala University, Department of History of Science and Ideas; 2affiliated researcher, Umeå University, Humlab
In my current research on the intellectual history of international relations, I aim to use digital methods of text analysis to explore conceptual content and change in diplomatic texts. Specifically, I am interested in the sub-set of bilateral treaties explicitly related to cross-border cultural exchange -- cultural treaties -- some 2000 of which were signed in the twentieth century. What methods and workflows seem most appropriate for this task? Our answer thus far has been to keep it simple. Inspired by recent work by Franco Moretti, Sarah Allison and others, we apply a straightforward form of quantitative word trend analysis, integrated with analysis of metadata about the corpus and tested (and expanded) through full-text searching. By formulating this approach in a specific relationship to the nature of the corpus and the historical questions I want to ask of it, we are able to get quite a lot out of this simple method.
Social Media and Discourse

Long Paper (20+10min)
"Memes" as Activism in the Context of the US and Mexico Border
Martin Camps
University of the Pacific, US
Memes function as "digital graffiti" in the streets of social media, a cultural electronic product that satirizes current popular events and can be used to criticize those in power. I believe these “political haikus” work as an escape valve for the tensions generated in the culture wars that consume American politics. The border is an “open wound” (Mexican writer Carlos Fuentes dixit) that was opened after the War of 1847 and resulted in Mexico losing half of its territory. Currently, the wall functions as a political membrane barring the “expelled citizens” of the Global South from the economic benefits of the North. Memes help to expunge the gravity of a two-thousand-mile concrete wall in a region that shares cultural traits, languages, and natural environment, a region that cannot be domesticated with symbolic monuments to hatred. Memes are rhetorical devices that convey the absurdity of a situation, as in a recent popular meme that shows a colorful piñata on the edge of the border, a meme that infantilizes the State-funded project of a fence. The meme’s iconoclastography sets in motion a discussion of the real issues at hand: global economic disparities and the human planetary right to migrate. The term meme was coined by Richard Dawkins, a British evolutionary biologist, in his 1976 book The Selfish Gene as a unit of cultural transmission. He wrote: “We need a name for the new replicator, a noun which conveys the idea of a unit of cultural transmission, or a unit of imitation. ‘Mimeme’ comes from a suitable Greek root, but I want a monosyllable that sounds a bit like ‘gene’. I hope my classicist friends forgive me if I abbreviate mimeme to meme.” (The Selfish Gene 192). There are many popular memes that relate to different cultural trends, such as “Leave Britney Alone,” “Gangnam Style,” “Situation Room,” “Advice Dogs,” “LolCats,” and “Success Kid”, but in this presentation I will concentrate on the genre of “border memes”.
Short Paper (10+5min)
Assembling the Unrepresentable: Allegories of Violence on Digital Platforms in Latvia
Kārlis Lakševics
University of Latvia, Latvia
While 'the Internet' is often criticized for providing access to graphic materials representing spectacular forms of violence, the designed, managed and loosely regulated spaces of social media, news platforms and e-learning interfaces engage with violence in a rather singular digital aesthetic (Galloway, 2012). In the paper I discuss three recent studies of violence within various digital platforms: (1) qualitative research on youth perspectives on cyberbullying on social media; (2) qualitative research on e-learning production and user experience of an e-learning course for kindergarten safety; (3) media analysis of the representation of violence on Latvian news platforms. By comparing these cases, I argue that the affordances of platforms proliferate singular images of violence, thus demeaning it and individualizing harm as a personal reaction ‘behind the screen’ that replicates neoliberal ideologies and other forms of victim-blaming. “It’s just the internet!”, a phrase often used by students, becomes both a digitally literate reading of an encounter and a demeaning of violence, harm and othering. Following the framework of Galloway, I argue that engagements with violence within the digital aesthetic of the studied interfaces work in the realm of the allegorical and unrepresentable. On the one hand, there are harmful variations of messaging, commenting and unsolicited publishing of content that can be recognized as having violent effects and that can proliferate on loosely regulated media. At the same time, these often consist of messages that relate to online othering and normative shaming and are in this way legitimized by the people contributing this content. On the other hand, there are designed and curated efforts to combat various forms of violence through media publications and e-learning courses that use the digital aesthetic for condemning violence and providing strategies for non-violent practices.
If mean comments often feel the same and, in proliferating cases, become memes and objects of ridicule themselves, articles and e-learning courses use singular affective fusions of data. What is common to both sides is the singularity of the image and the ambiguity of violence, which at the same time provokes and resists the control of the interface and moderating bodies.
Short Paper (10+5min)
Impact of Technologies on Political Behaviour: What Does it Mean to be "Good Digital Citizen"
Ieva Strode
University of Latvia, Latvia
Although the opinion that the political activity of society has declined is relatively common (and also confirmed by studies), there are objections that activity may not have diminished but that its forms have changed, with traditional forms of participation being replaced by others. Some changes in political behaviour mean just the “movement” of existing norms and traditional behaviour patterns to the digital environment: interest in politics, expression of political views and participation in some political activities may essentially preserve their traditional content, while the activities take place using new tools and platforms (e.g. through social networks, new political communities etc.). However, the new digital environment also requires and creates new norms related to the duties of “good digital citizens” (including behaviour (e.g. following etiquette, obeying laws and rules that are specific to the digital world) and obtaining specific education and knowledge (e.g. digital literacy etc.)) and to their rights (e.g. access to digital resources, equality within society regarding access to these resources etc.). It is also important to assess the activity of digital citizens in the political events of the non-digital world (e.g. elections if e-voting is not available).
Short Paper (10+5min)
Human-Centered Humanities: Using Stimulus Material for Requirements Elicitation in the Design Process of a Digital Archive
Tamás Fergencs, Dominika Illés, Olga Pilawka, Florian Meier
Aalborg University Copenhagen, Denmark
This study proposes the use of so-called stimulus material during interviews for requirements elicitation as part of the design process of a digital archive.
Short Paper (10+5min)
3D and AI technologies for the development of automated monitoring of urban cultural heritage
Tadas Ziziunas, Darius Amilevicius
Vilnius University, Lithuania
Preservation of urban heritage is one of the main challenges for contemporary society. It is closely connected with several dimensions: global-local rhetoric, cultural tourism, armed conflicts, immigration, cultural changes, investment flows, infrastructure development, etc. Nowadays, organizations responsible for heritage management very often have to deal with a lack of the resources that are crucial for proper heritage preservation, maintenance and protection. This is particularly problematic for countries with a low GDP or an unstable political situation. A possible solution to these problems could be an automated heritage monitoring software system based on 3D data and AI technologies, which would increase monitoring efficiency (in terms of cost, time and data objectivity). A system prototype was developed and tested by Vilnius University and Terra Modus Ltd. within the project “Creation of automated urban heritage monitoring software prototype” (2014). The next step is the creation of full-capability software, which is under development by Vilnius University within the project “Automated urban heritage monitoring implementing 3D and AI technologies”, financed by the Research Council of Lithuania (project period 2018-2022). In this paper, only the general pipeline of the first stage of the project is presented. The proposed digital monitoring technique is based on effective reality capture and comparison of data over time. 3D laser scanning and digital photogrammetry are the most capable and sufficiently accurate data collection methods. Information collected from measurements at different times can serve as data for artificial intelligence analysis, which can automatically identify the relevant valuable elements and their changes during a particular time period. Such monitoring can potentially be performed in a remote, non-destructive and cost-effective way. Accordingly, the main principles of the suggested solution are listed below. Digital monitoring is based on seven conditions.
First: all objects in the monitoring process are tangible. Second: physical valuables can be expressed as simple geometrical forms or mathematical expressions. Third: monitored objects can be fully scanned or photogrammetrically processed. Fourth: data from Lidar devices and data derived from photogrammetry are of the same quality (density, coverage, etc.). Fifth: detection of cultural heritage can be analysed by static and machine learning algorithms. Sixth: it should be possible to check the digitally processed results. Seventh: digital monitoring is based on non-destructive and non-invasive 3D view technologies and analytical technologies. With regard to the digital data, there are two possible ways to perform detection and comparison of selected valuables. The first scenario is one in which comparable data on the earlier status quo are lacking, meaning that there are no earlier 3D data for the selected cultural heritage. Newly collected data are then compared with mathematical rules, which can be written in coded form. This set of rules describes the geometrical parameters of selected valuables of the cultural heritage. In the second scenario there are two data sets from different time periods, and these data are compared with each other. In both cases the comparison needs interpretation. The first level of interpretation consists in demonstrating the facts of geometrical change. The second level depends on the particular legal status and the local legislation for managing cultural heritage (i.e. the meaning of detected changes depends on legislation). The first level of interpretation can be evaluated by logical operators; for example, an alteration is described as “status quo unchanged”, “reduction in volume by 65%”, etc. The second level of interpretation can be a legal analysis of the first-level results, for example, “reduction in volume = fact of illegal demolition works”.
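The first level of interpretation for the second scenario (two data sets from different times) could be sketched as a simple rule over two measured volumes. This is an illustrative sketch only: the function name, the idea of reducing each data set to a single volume figure, and the `tolerance` threshold for measurement noise are assumptions, not details given by the authors. The second, legal level is deliberately left out, since it depends on local legislation.

```python
def interpret_volume_change(old_volume, new_volume, tolerance=0.01):
    """First-level interpretation of a volume change between two epochs.

    Volumes are in the same unit (e.g. cubic metres). `tolerance` is the
    relative change treated as measurement noise; its value is assumed.
    """
    if old_volume <= 0:
        raise ValueError("old_volume must be positive")
    relative_change = (new_volume - old_volume) / old_volume
    if abs(relative_change) <= tolerance:
        return "status quo unchanged"
    percent = round(abs(relative_change) * 100)
    if relative_change < 0:
        return f"reduction in volume by {percent}%"
    return f"increase in volume by {percent}%"


# Invented measurements, for illustration only.
print(interpret_volume_change(100.0, 35.0))
print(interpret_volume_change(100.0, 100.5))
```

Keeping this step purely geometrical means its output can be checked independently before any legal meaning is attached to it.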
Based on the most frequent alterations to the valuables of buildings in the Vilnius Old Town, the following list can be drawn up: a) elements of the roof; b) shapes of the roof; c) cornices; d) doors; e) gates; f) the primary height and width of buildings; g) the primary housing intensity of the site; h) windows; i) chimneys. These are the main valuables that can be traced in terms of geometrical changes. In order to perform the detection of valuables, we first need to train the AI algorithms to identify the desired valuables in the data (2D pictures or 3D point clouds). Google’s TensorFlow with DeepLab v3+ with default settings was used. These are semantic segmentation procedures in which already annotated and trained data can be used. However, there is very little good-quality open data on this topic. Hence, a new database was established for performing the digital monitoring processes. For future use of the software in other old towns of Europe, only a database with additional 2D pictures of elements or 3D scans is needed. The newly established database consists of pictures collected on the main streets of the Vilnius Old Town. Labelbox is used for data annotation. Currently there are 420 high-resolution photos (12 megapixels) in which the first two classes (valuables) have been created: windows and doors. All doors and windows are manually annotated in the 420 photos. Annotations were performed so that an algorithm can identify which pixels denote windows and which pixels denote doors. For the training task, the currently most powerful open-source algorithms of Google’s TensorFlow were used. In this case, the result of annotation is an XML file in which the annotated information is described according to the Pascal VOC standard. This standard is one of the most popular and widely used. To sum up, two types of files are exported from Labelbox: XML and JPG.
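To give an idea of what such an annotation file contains, the sketch below reads a Pascal VOC style XML annotation and counts the annotated objects per class. It assumes the common Pascal VOC detection layout (`annotation`, `object`, `name`, `bndbox`); the actual Labelbox export used in the project may differ, for instance by carrying pixel-level outlines rather than boxes. The file name, coordinates and the helper `read_objects` are invented for illustration.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Hypothetical annotation in the usual Pascal VOC detection layout.
SAMPLE_ANNOTATION = """
<annotation>
  <filename>street_001.jpg</filename>
  <size><width>4000</width><height>3000</height><depth>3</depth></size>
  <object>
    <name>window</name>
    <bndbox><xmin>120</xmin><ymin>340</ymin><xmax>410</xmax><ymax>900</ymax></bndbox>
  </object>
  <object>
    <name>window</name>
    <bndbox><xmin>600</xmin><ymin>350</ymin><xmax>880</xmax><ymax>910</ymax></bndbox>
  </object>
  <object>
    <name>door</name>
    <bndbox><xmin>1500</xmin><ymin>1200</ymin><xmax>2100</xmax><ymax>2900</ymax></bndbox>
  </object>
</annotation>
"""


def read_objects(voc_xml):
    """Return a list of (class name, (xmin, ymin, xmax, ymax)) tuples."""
    root = ET.fromstring(voc_xml)
    objects = []
    for obj in root.findall("object"):
        name = obj.findtext("name")
        box = obj.find("bndbox")
        coords = tuple(
            int(box.findtext(tag)) for tag in ("xmin", "ymin", "xmax", "ymax")
        )
        objects.append((name, coords))
    return objects


objects = read_objects(SAMPLE_ANNOTATION)
counts = Counter(name for name, _ in objects)
print(dict(counts))
```

A per-class count like this is a quick sanity check that every photo contributes the expected window and door annotations before the files are converted for training.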
The further process can be described as follows:
1. JPG and XML files are converted into RGB. The results are PNG files with segmentation masks (SegmentationClass).
2. Additionally, raw PNG files with the semantic segmentation object contours are exported (SegmentationClassRaw).
3. JPG files, PNG files (SegmentationClass) and PNG files (SegmentationClassRaw) are manually separated into two parts: “Train” (for training) and “Val” (for validation). The Train part is also automatically separated into training and test parts in order to identify how accurate the training results are compared with manual human annotation. Hence, additional Train, Val and Train/Val indexes are generated.
4. According to the index of JPEG, PNG and PNG (Raw) files, files are generated in the special format required for TensorFlow training, TFRecord (Train, Val and TrainVal).
5. The system is trained using the TFRecord files.
In order to get the most accurate results, many hyperparameters should be optimized. This process is analysed in detail by J. Bergstra and Y. Bengio. One of the biggest problems with hyperparameter optimization is overfitting. In the context of heritage monitoring, overfitting would mean that newly presented valuables (windows, for example) could not be identified properly. In order to avoid overfitting, various techniques can be applied, e.g. early stopping: once training progress shows that the error has stopped decreasing, training is stopped. The measure of prediction quality is called the loss function. There are various ways to calculate the loss function, but in this experiment the default “cross entropy” is used. The experiment results demonstrated that training was performed properly, because the loss function decreased gradually and the data were not overfitted. However, powerful computing resources are needed to complete the whole experiment with all groups of valuables.
Historical Studies, AI, Linked Data

Long Paper (20+10min)
Classification of Medieval Documents: Determining the Issuer, Place of Issue, and Decade for Old Swedish Charters
Mats Dahllöf
Uppsala University, Sweden
The present study is a comparative exploration of different classification tasks for Swedish medieval charters (transcriptions from the SDHK collection) and different classifier setups. The experiments used features based on lowercased words and character 3- and 4-grams. We evaluated the performance of two learning algorithms: linear discriminant analysis and decision trees. For evaluation, five-fold cross-validation was performed. We report accuracy and macro-averaged F1 score. The validation made use of six labeled subsets of SDHK combining the three tasks with Old Swedish and Latin. Issuer identification for the Latin dataset (595 charters from 12 issuers) reached the highest scores, above 0.9, for the decision tree classifier using word features. The best corresponding accuracy for Old Swedish was 0.81. Place and decade identification produced lower performance scores for both languages. Which classifier design is the best one seems to depend on peculiarities of the dataset and the classification task. The present study does however support the idea that text classification is useful also for medieval documents characterized by extreme spelling variation.
Short Paper (10+5min)
Linked Open Data Vocabularies and Identifiers for Medieval Studies
Toby Burrows1, Antoine Brix2, Doug Emery3, Mitch Fraas3, Eero Hyvönen4, Esko Ikkala4, Mikko Koho4, David Lewis1, Synnøve Myking2, Kevin Page1, Lynn Ransom3, Emma Thomson3, Jouni Tuominen4, Hanno Wijsman2, Pip Willcox5
1University of Oxford, UK; 2Institut de recherche et d'histoire des textes, France; 3University of Pennsylvania, US; 4Aalto University, Finland; 5The National Archives, UK
This paper examines the use of Linked Open Data in the research field of medieval studies.
Short Paper (10+5min)
An Artificial Intelligence Approach to Segmenting Medieval Manuscripts with Complex Layouts
Lisandra S. Costiner1, Lizeth Gonzalez Carabarin2
1Merton College, University of Oxford, UK; 2Eindhoven University of Technology, The Netherlands
Digitization initiatives undertaken by libraries, museums and collections around the globe are rapidly increasing the number of manuscript images online. Given the large volume of such data, it is important to devise new ways to automatically process and extract relevant information from these images, saving valuable human time invested in manual transcription and image extraction. Digitized documents pose a number of challenges for the extraction of relevant information, the key ones being the location of areas of text and illustration. Medieval manuscripts are especially challenging for automatic segmentation. Each surviving book was hand produced, so its page layout, script and illustrations vary widely. Furthermore, medieval decorations do not typically conform to uniform rectangular registers: they can be unframed, be placed throughout the text at irregular intervals and extend into page margins. Given this, such documents pose particular difficulties for traditional methods of segmentation designed for printed text, requiring instead the development of customized algorithms. Although many techniques have been developed for image segmentation (Eskenazi et al, 2017), there is a need for a generic tool that is flexible in dealing with a range of documents, low on processing power, and white-box, allowing every step to be queried. This paper proposes such a technique for the automatic identification and extraction of images (illuminations or miniatures), and of lines of text from Western medieval manuscripts. Algorithms for the extraction of images and texts in layout analysis (segmentation) can be generally divided into three classes.
Most of the approaches employed in document segmentation are adapted to specific types of records (Shafait et al, 2008), so there is a need for a global or generic approach that will be able to adapt to different types of documents. Older approaches rely on rule-based algorithms which have reduced versatility, generality, robustness and accuracy when segmenting hand-written documents (Shafait et al, 2008). Recent developments have tended to focus on the use of neural networks (Eskenazi et al, 2017; Gao et al, 2017; Ares Oliviera et al, 2018). Although effective, neural networks (NNs) require manually annotated data for training, expending large amounts of human time; they are computationally heavy, and are black boxes, meaning that their inner workings are not understood. New approaches with increased versatility, stability, generality, ability to perform multi-scale analysis, and ability to handle color remain desiderata (Eskenazi et al, 2017). The current approach proposes to address these needs. It is based on the k-means algorithm with a very limited number of features. Although k-means has been applied to document segmentation previously, the number of features used in these approaches was large, increasing the computational cost. The current methodology relies on only three features. Although a number of annotated datasets have been created for the segmentation of historical documents with challenging layouts (Gruning et al, 2018; Simistira et al, 2016), no such dataset exists for illuminated medieval manuscripts. For the current study a dataset was created containing manuscripts with a range of layouts and decorations, containing a variety of texts (devotional and medical), and produced in different regions in different time periods. The images, freely available through Digital Bodleian, derive from the following manuscripts in Oxford’s Bodleian Library: MS Canon. Misc 476, MS Add. A. 185, MS Ashmole 1462, MS Auct. 2.2, MS Buchanan e 7.
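To make the clustering idea concrete, the following is a minimal sketch of k-means over a small set of standardized features, written with the Python standard library only. The abstract does not name the three features, so the per-pixel values below (smoothed intensity, local contrast, saturation) are purely hypothetical, as are the function names; the toy example uses two clusters rather than the five used in the study, and the initialisation from the first k points is a simplification.

```python
import math


def standardize(rows):
    """Scale every feature column to zero mean and unit variance."""
    columns = list(zip(*rows))
    means = [sum(col) / len(col) for col in columns]
    stds = [math.sqrt(sum((v - m) ** 2 for v in col) / len(col)) or 1.0
            for col, m in zip(columns, means)]
    return [tuple((v - m) / s for v, m, s in zip(row, means, stds)) for row in rows]


def kmeans(points, k, iterations=50):
    """Plain Lloyd's algorithm; the first k points serve as initial centroids."""
    centroids = list(points[:k])
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid.
        labels = [min(range(k), key=lambda j: math.dist(p, centroids[j])) for p in points]
        # Update step: each centroid moves to the mean of its members.
        new_centroids = []
        for j in range(k):
            members = [p for p, label in zip(points, labels) if label == j]
            if members:
                new_centroids.append(tuple(sum(c) / len(members) for c in zip(*members)))
            else:
                new_centroids.append(centroids[j])
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return labels, centroids


# Hypothetical feature vectors: (smoothed intensity, local contrast, saturation).
PIXELS = [
    (0.95, 0.02, 0.05),  # parchment-like background
    (0.20, 0.60, 0.10),  # ink-like text
    (0.93, 0.03, 0.04),
    (0.22, 0.55, 0.12),
    (0.97, 0.01, 0.06),
    (0.18, 0.65, 0.09),
]

labels, centroids = kmeans(standardize(PIXELS), k=2)
print(labels)
```

Because every step is an explicit distance or mean computation, each cluster assignment can be inspected directly, which is the kind of transparency the abstract contrasts with neural networks.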
As a pre-processing step, the image is converted to grayscale, and a uniform filter with a kernel size of 13 is then applied in order to smooth it. After pre-processing, three features are proposed for clustering. Once all features are computed and standardized, the k-means algorithm is run with 5 clusters. Additionally, after computing k-means for 120 images belonging to 4 different manuscripts, the centroids of each cluster are calculated and plotted. This approach uses clustering and filtering techniques for segmenting challenging illuminated medieval manuscripts. Traditional approaches to text segmentation assume that text regions are enclosed in rectangular shapes, which is not true for many illuminated medieval books. Although k-means and filtering have been previously used for this task, the uniqueness of this approach is its reliance on only three features. The strength of the method further lies in its transparency at every step of the process, low memory use, potential to produce highly refined results, and versatility. This stands as an alternative to programs such as neural networks, which are black boxes, do not allow for querying of their decision-making process, are computationally intensive, and demand manually annotated training sets. This approach, therefore, provides not only a solution for the segmentation of challenging images with mixed textual and visual content, but more importantly leads towards algorithms with improved robustness, stability and versatility.
Short Paper (10+5min)
Linked Open Data Service about Historical Finnish Academic People in 1640–1899
Petri Leskinen1, Eero Hyvönen1,2
1Aalto University, Finland; 2University of Helsinki, Finland
The Finnish registries "Ylioppilasmatrikkeli" 1640–1852 and 1853–1899 contain detailed biographical data about virtually every academic person in Finland during the respective time periods.
Short Paper (10+5min)
Personal Names as Mirrors of the Past in Medieval Northwestern Russia
Jaakko Raunamaa, Antti Kanner
University of Helsinki, Finland
Names are a linguistic universal, occurring in all known languages of the world. For example, many Finnic and Sami (Finno-Ugric language groups) place names occurring in Northern Russia prove that Finno-Ugric tribes inhabited these areas earlier. In other words, names preserve information about their users and can give researchers clues about what has happened in the past (Ainiala et al. 2012: 13‒29). This paper introduces the personal name system used at the end of the 15th century in Northwestern Russia. More precisely, the study focuses on the personal names attested in the census books of Novgorod (AD 1499‒1563). These contain over 10,000 personal names and cover large areas in Northwestern Russia. The aim is to examine what kinds of personal names were used in the area and what kinds of regional differences can be found in name usage. The study concentrates in particular on the northern areas of the Novgorod Republic that supposedly had a Finnic population. The goal is to learn whether personal names used in Finnic areas differ from those used elsewhere. Finally, the results are compared to archaeological, genetic and linguistic research, and a broader overview of the settlement history of medieval Northwestern Russia is presented. Since Northwestern Russia, and especially its northern part, was remote and sparsely populated before the modern era, there is only a limited number of historically important sources, such as archaeological finds or written documents. Thus, the history of Northwestern Russia is full of questions and uncertainties. Researchers interested in history have long used linguistics and onomastics to create a more comprehensive picture of the past (e.g. Rjabinin 1997 and Sedov 1982). However, the use of names as source material is, in many cases, small-scale and limited. The studies are often either regionally restricted or analyze only a limited number of names.
In addition, many history-oriented studies rely only on contemporary name data. To some extent, the above-mentioned problems can be explained by the methods and materials that have been used in the past. More precisely, analogue materials, such as written documents or hand-drawn maps, have not allowed researchers to create comprehensive studies based on names. The situation is now different, since digital methods can be used to overcome the problems that earlier studies had. Many tasks that were previously considered too time-consuming, like collecting thousands of names from documents, can now be done on a computer. This study relies on methods that the development of digital humanities has made possible. First, the research material is compiled from the editions of Novgorod census books by scanning the pages and using OCR to create editable copies of the texts. The census books from the area of the Novgorod Republic were produced on the order of the Grand Prince of Moscow. The Grand Duchy of Moscow had subjugated the city-state and its possessions before the end of the 15th century. The ruler wanted to know how much income the Grand Duchy of Moscow should acquire from the newly conquered area, and thus the Muscovites ordered the tax documentation after the conquest had been finalized (Nevolin 1853: i‒xii). The documents are written in (old) Russian. The sources chosen for this study are edited versions of 15th and 16th century census books (NPK III, IV; POKV; PKOP). These transcriptions were mainly made at the turn of the 20th century. The study area is presented on a map below (pdf-file). The material contains approximately three thousand pages, covering around 10,000 villages and over 20,000 homesteads. Tax payers are grouped into homesteads (in Russian dvor). One homestead usually has one owner, but sometimes other people are named as well, such as the brother(s), adult son(s), nephews and other relatives of the owner.
All the census books are divided into parishes (in Russian pogost), which are typically named after the location of the main church or after the monasteries or local nobles who had the right to collect the taxes. The structured pattern of the census books simplifies the process of collecting taxpayers’ personal names. For example, in census book POKV, which covers the areas of the Karelian Isthmus and the western shores of Lake Ladoga, the pattern is almost always the following: “Деревня Дуброва, (д) Фомка Ивашковъ, (д) Онтушко Ивашковъ;” (‘Village Dubrova, (d)vor Fomka Ivaškovŭ, (d) Ontuško Ivaškovŭ;’). A Python script was written to exploit the systematic formalities of this record to harvest the personal names mentioned. The output is a data matrix that contains frequencies of personal names for each parish, including main names (e.g. Ontuško) and bynames, such as patronyms (Ivaškovŭ) or descriptive ones (Volkŭ ‘wolf’). This allows for systematic statistical measurement of similarity across the parishes. Classification of names makes it possible to evaluate how the measured similarities are caused by names belonging to, for example, different parishes. Comparing naming practices for similarity is not a straightforward task, since there is no straightforward definition of a naming practice. However, simply applying different distance measurements highlights different aspects of the use of personal names: cosine similarity highlights the widest overall trends, while the Jaccard index highlights the selection of names. A hierarchical clustering algorithm makes it possible to cross-reference the similarity of naming practices with geographical data to see whether area-based clusters emerge. Together these approaches contribute to forming a holistic interpretation of how names expressed linguistic and ethnic identities in the northern areas of the Novgorod Republic. One of the main aims of this study is to focus on the northern areas of the Novgorod Republic that supposedly had a Finnic population.
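The harvesting and comparison steps described above could be sketched roughly as follows. This is not the authors' script: the regular expression is an assumption based only on the single record pattern quoted in the abstract, the parish labels and the second record are invented for illustration, and the two similarity functions are the textbook definitions of cosine similarity and the Jaccard index.

```python
import math
import re
from collections import Counter

# Assumed layout, after the quoted record: "(д) <main name> <optional byname>".
HOMESTEAD = re.compile(r"\(д\)\s+([^\s,;]+)(?:\s+([^\s,;(]+))?")


def harvest_names(record):
    """Return (main name, byname) pairs for each homestead in one record."""
    return HOMESTEAD.findall(record)


def name_counts(records):
    """Count main names over all records of one parish."""
    return Counter(main for record in records for main, _ in harvest_names(record))


def cosine_similarity(a, b):
    """Compare name frequency profiles; frequent names dominate the score."""
    dot = sum(a[name] * b[name] for name in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def jaccard_index(a, b):
    """Compare name inventories only; frequencies are ignored."""
    union = a.keys() | b.keys()
    return len(a.keys() & b.keys()) / len(union) if union else 0.0


# First record is quoted in the abstract; the second is a made-up example.
parish_a = name_counts(["Деревня Дуброва, (д) Фомка Ивашковъ, (д) Онтушко Ивашковъ;"])
parish_b = name_counts(["Деревня Горка, (д) Онтушко Волкъ, (д) Ивашко Фомкинъ;"])

print(harvest_names("Деревня Дуброва, (д) Фомка Ивашковъ, (д) Онтушко Ивашковъ;"))
print(cosine_similarity(parish_a, parish_b), jaccard_index(parish_a, parish_b))
```

The two measures answer different questions: cosine similarity is driven by how often shared names occur, whereas the Jaccard index only asks which names occur at all, which matches the distinction drawn in the text.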
This area was bordered in the northwest by the Diocese of Åbo, which was the eastern part of the Realm of Sweden. Mostly Finnic-speaking tribes, such as Ingrians, Karelians and Savonians, inhabited the border area. The emergence of these groups is a continuously discussed question among scholars, but it is known that they share many similarities in archaeological finds dated to the Late Iron Age (AD 1000‒1200) (Uino 2003: 300‒400) and in linguistics as well (Frog & Saarikivi 2012). Thus, it is worthwhile to compare the personal names attested in the Novgorod census books to those that are attested in the Swedish taxation documents concerning the border area. The reference material, altogether approximately 2000 names, consists of personal names used in 1561 in the parish of Juva in the Savo region and of names used in 1545 in the parish of Kivennapa on the Karelian Isthmus. Finnic names, such as main names or clan names, are particularly interesting because they were used on both sides of the border: e.g. in Kivennapa Kaupi Nousia and in Kir’jažskij pogost (in Finnish Kurkijoki parish) Kiridko Novzejevъ. Measuring and evaluating the census book data and comparing it to material collected from Swedish documents creates many valuable new perspectives on the history of Northwestern Russia. The results demonstrate how different personal names were distributed and used in the study area. This outcome is compared to the latest archaeological, linguistic and genetic research, which allows us to create a comprehensive picture of the directions of cultural impacts and settlement movements in medieval Northwestern Russia. In addition, the results reveal those areas that were inhabited by people using Finnic names or Finnic forms of Christian names at the end of the 15th century.
References:
Ainiala, Terhi, Minna Saarelma & Paula Sjöblom 2012: Names in Focus. An Introduction to Finnish Onomastics. Studia Fennica. Linguistica 17. Helsinki: Finnish Literature Society.
CHR = Maureen Perrie (ed.) 2006: The Cambridge History of Russia: Volume 1, From Early Rus' to 1689. Cambridge: Cambridge University Press.
Nevolin 1853 = Неволин, К. А.: О пятинах и погостах новгородских в XVI веке, с приложением карты. Из Записок Императорского русского географического общества, Кн. VIII. Санкт-Петербург: Тип. Имп. Акад. наук.
NPK III = Новгородские писцовые книги. Переписная окладная книга Водской пятины 1500(7008) года. Часть 1. Санкт-Петербург: Археографицеская Коммиссия. 1868.
NPK IV = Новгородские писцовые книги. Переписная оброчная книга Шелонской пятины. 1498, 1539, 1552-1553. Санкт-Петербург: Археографицеская Коммиссия. 1886.
PKOP = Писцовые книги Обонежской пятины : 1496 и 1563 гг. Ленинград: Академия наук Союза Советских Социалистических Республик.
Newspapers

Short Paper (10+5min)
Can Umlauts Ruin your Research in Digitized Newspaper Collections? A NewsEye Case Study on 'the Dark Sides of War' (1914–1918)
Barbara Klaus
University of Innsbruck, University of Vienna, Austria
Digitized newspaper collections facilitate the access to historical newspapers. Even though they offer several useful possibilities regarding the research in historical newspapers and magazines, the (automatic) research in these collections is (still) full of limitations and pitfalls. Based on the research conducted on the platform AustriaN Newspapers Online (ANNO) for the NewsEye case study ‘the dark sides of war’, the main challenges of working with digitized newspaper collections will be discussed in this paper. Especially two aspects – the fire catastrophe at the munitions factory Wöllersdorf (1918/09/18) in Lower Austria and the Austrian press coverage about war widows during the First World War – will be used as specific examples. The discussed limitations include the Optical Character Recognition (OCR) quality, provided search options and metadata, as well as others.
Short Paper (10+5min)
The Life and Death of Newspapers: Using Metadata to Assess the Outlook and Trajectories of Newspapers in Finland, 1771–1917
Zafar Hussain, Eetu Mäkelä, Jani Marjanen, Mikko Tolonen
University of Helsinki, Finland
Newspapers recorded most societal events and thus are a rich source for historical findings, but they are also often identified as factors in major transformations in history such as the emergence of a bourgeois public sphere (Habermas 1962), the establishment of nation states (Anderson 2006) and the breakthrough of representative democracy (Keane 2009). Since an increasing number of newspapers have been digitized in the past twenty years, a huge number of studies target detailed questions in the newspapers, and the expectations for what can be analyzed through computational methods are sometimes rather unrealistic (Da 2019). For the analysis of large societal processes, like the transformation of the public sphere, there are, however, great obstacles with regard to data quality, coverage and uniformity. For a computational analysis of the public sphere, it is crucial that we are able to align the research question with the sub-corpora that can be created from the available datasets. This balance between research questions and the corpus is particularly important because there is always bias in the historical record. This does not mean that the record cannot be used in a meaningful way (as in any historical research). But if the research questions are framed in a manner that is too broad, we risk making the existing bias part of our analysis. Our core assumption is that the public sphere cannot be accessed as one whole with the idea that the newspapers as such represent it in a meaningful way (Marjanen et al. 2019). Instead, we aim to study particular manifestations of different types of public spheres in different locations and times, focusing on changes at different scales. Thus, for example, instead of looking at Finland as a unified entity it makes sense to divide the analysis into different public spheres that are realised at different paces (Swedish vs. Finnish, town vs. country, inland vs. coast).
We use existing metadata records to examine the shapes and boundaries of public discourse. In order to uncover and understand the complexity of the phenomenon, as reflected by our data, one part of our work has been to develop purely data-driven means to delineate and model the different types of newspapers in our dataset. In this study, we do so not by looking at the content of the newspapers, but by mapping the complexity of their material development. Here, we have extracted, from the scanned ALTO/METS data for each of the X newspapers, information on their page size, number of columns, information density and frequency of publication. In earlier work, we have used this materiality information to chart general trends in how Swedish and Finnish newspapers developed in Finland during the 19th century (Marjanen et al. 2019, Mäkelä et al. 2019a, Mäkelä et al. 2019b). However, as part of that work, we discovered that the general trends belie a much more complex reality. For example, while the general trend in the page size of newspapers follows a linear increase (Fig. 1), if one looks more closely at the data, one can see that this is caused by an interplay of multiple different phenomena. First, a lot of the increase is caused by a clear increasing trend in page size for newly established newspapers, even though individual variance is also large (Fig. 2). When one looks at how already established newspapers switch sizes, a much more complex picture appears. Here, in Figure 3, we can see existing newspapers steadily move both to larger as well as smaller page sizes. The overall increase here only occurs because the proportion of newspapers increasing in size is consistently larger than the proportion decreasing.
Figure 1. Mean newspaper area by year (drop between 1890–1910 is an artifact in the data)
Figure 2. Page sizes of newly established newspapers
Figure 3. Proportion of newspapers each year increasing (green) and decreasing (red) in size, as well as their difference (yellow)
Based on these realizations, in this study we decided to see if we could categorise newspapers not on their absolute characteristics at a particular time, but instead by their behaviour during their lifespans. Thus, instead of absolute values for the page size, number of columns, information density and publication frequency, we took as features whether these values increased, decreased or stayed the same each year. While robust analyses of this data are for the most part still underway, we do already have some preliminary and provisional results. First, to analyse the data, we did a hierarchical clustering to identify general trajectory categories. To be able to compare newspapers with different lifespans, we aggregated the data into proportional features: for page size, for example, we calculated the percentage of years in which it increased, stayed the same or decreased. To maintain diachronic information, we also added a volatility measure, which denoted how often in its lifetime a paper switched between categories, e.g. from increasing to decreasing. We limited our analysis to the 301 newspapers in our dataset that were published for at least three years. Based on a preliminary analysis of the clustering results, we can identify five distinct major developmental categories, with different representations. First, 13 newspapers out of the total 301 clustered together into a category we describe as completely stable. For their lifetime, they do not change their material dimensions. The second category of 64 papers can be described as relatively stable, with only occasional forays either way in paper size. The third and fourth categories identified were mostly decreasing (32 papers) and mostly increasing (14 papers) respectively.
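The feature construction described above (proportions of years with an increase, no change or a decrease, plus a volatility measure) might be sketched as follows. The function name, the toy series and in particular the volatility definition (the share of consecutive year-to-year steps whose category differs) are assumptions made for illustration; the abstract does not specify the exact formula.

```python
def trajectory_features(yearly_values):
    """Summarise one newspaper's yearly series as change proportions plus volatility."""
    if len(yearly_values) < 2:
        raise ValueError("need at least two yearly observations")

    # Classify every year-to-year step as increase, same or decrease.
    steps = []
    for previous, current in zip(yearly_values, yearly_values[1:]):
        if current > previous:
            steps.append("increase")
        elif current < previous:
            steps.append("decrease")
        else:
            steps.append("same")

    # Proportional features make newspapers with different lifespans comparable.
    total = len(steps)
    features = {kind: steps.count(kind) / total for kind in ("increase", "same", "decrease")}

    # Assumed volatility: how often consecutive steps switch category.
    switches = sum(1 for a, b in zip(steps, steps[1:]) if a != b)
    features["volatility"] = switches / (total - 1) if total > 1 else 0.0
    return features


# Hypothetical page sizes for five consecutive years of one newspaper.
print(trajectory_features([100, 100, 120, 110, 110]))
```

Feature vectors of this kind, computed per newspaper and per measured property, are what a hierarchical clustering step would then group into trajectory categories.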
Interestingly here, more papers were identified to be relatively constantly decreasing in page size as opposed to increasing in it. Finally, the largest category which contained a total of 177 newspapers was formed around papers with a high volatility, i.e. those which frequently changed between larger and smaller formats. When we delved deeper into this category, it split into two almost equal parts, one where volatility is extremely high, but the general trend is still increasing, and the other defined only by its volatility. Interestingly, none of the categories differed much in the lifespans of newspapers allotted to them. Beyond categorising the material trajectories, we’ve also been interested in their dynamics and interrelationships. To test for this, we are running experiments where we test if a machine learning classifier can predict certain categories of developments based on others. As an example of this, our preliminary results seem to indicate that a failure to increase throughput, whether by increasing publication frequency, paper size or information density, will increase the probability of a newspaper going out of business.
Bibliography
Anderson, B. (2006). Imagined communities: Reflections on the origin and spread of nationalism (Rev. ed). Verso.
Da, N. Z. (2019). The Computational Case against Computational Literary Studies. Critical Inquiry, 45(3), 601–639. https://doi.org/10.1086/702594
Habermas, J. (1962). Strukturwandel der Öffentlichkeit: Untersuchungen zu einer Kategorie der bürgerlichen Gesellschaft. Herman Luchterhand Verlag.
Keane, J. (2009). The life and death of democracy. London: Simon & Schuster.
Marjanen, J., Vaara, V., Kanner, A., Roivainen, H., Mäkelä, E., Lahti, L., & Tolonen, M. (2019). A National Public Sphere? Analyzing the Language, Location, and Form of Newspapers in Finland, 1771–1917. Journal of European Periodical Studies, 4(1), 54–77. https://doi.org/10.21825/jeps.v4i1.10483
Mäkelä, E., Tolonen, M., Marjanen, J., Kanner, A., Vaara, V., & Lahti, L. (2019). Interdisciplinary Collaboration in Studying Newspaper Materiality. Proceedings of the Digital Humanities in the Nordic Countries 2019. CEUR-WS. http://ceur-ws.org/Vol-2365/07-TwinTalks-DHN2019_paper_7.pdf
Mäkelä, E., Tolonen, M. & Kanner, A. (2019). Charting the Material Development of Newspapers.
Short Paper (10+5min)
Foreignizing the Other: National Identity and the Concept of Aristocrat in Dutch Historical Newspapers
Leon Wessels
Utrecht University, The Netherlands
The Netherlands are commonly associated with a bourgeois culture. Several studies have shown, however, that the elites ruling the Dutch Republic went through a process of ‘aristocratization’. They evolved into a closed oligarchy, especially in the eighteenth century [6], and adopted an aristocratic lifestyle exemplified by their luxurious mansions in the countryside [7]. How did language reflect the social and cultural presence of elites? In this paper, I will present some of the results of my ongoing PhD research into the broader conceptual history of the term ‘elite’ in the Netherlands. I will seek to understand how the word ‘aristocrat’ was conceptualized in Dutch newspapers between 1840 and 1994, examining in particular its spatial (in this case national) connotations. The corpus consists of articles (advertisements have been excluded) from over 30 different national and regional newspapers and contains almost 15 billion words. Newspapers are particularly interesting to study the history of concepts, because their serial nature allows one to study change over time and because newspapers both produce and reflect public discourse [8]. Following the principles of Natural Language Processing suggested by Jurafsky and Martin [9], I have created a number of Python scripts to query the newspaper corpus. I started out by making a simple concordancer, similar to various openly available concordance tools [10]. Next, I wrote a script to generate frequency lists (per year) of words that occur close to the keyword “aristocrat”. This keyword was written as a Python list containing regular expressions that capture the Dutch words ‘aristocraat’, ‘aristocratie’, ‘aristocratisch(e)’ and compound words, in historical spelling variations. I applied this script to make frequency lists of words that occur within a window of three words of the keyword. 
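A window-based frequency count of this kind might look like the sketch below. It is not the author's script: the single regular expression is a simplified stand-in for the list of patterns mentioned above (it only covers the c/k spelling variation and compounds containing the stem), the tokenisation is deliberately naive, and the function name is hypothetical.

```python
import re
from collections import Counter

# Assumed keyword pattern: matches 'aristocraat', 'aristocratie', 'aristocratisch(e)',
# their 'k' spellings, and compounds containing the stem.
KEYWORD = re.compile(r"aristo[ck]ra", re.IGNORECASE)


def window_counts(sentence, window=3):
    """Count words occurring within `window` tokens of any keyword match."""
    tokens = re.findall(r"\w+", sentence.lower())
    counts = Counter()
    for position, token in enumerate(tokens):
        if KEYWORD.search(token):
            left = tokens[max(0, position - window):position]
            right = tokens[position + 1:position + 1 + window]
            counts.update(left + right)
    return counts


sentence = "De dwingelandij van de aristokratie van Spanje is alom bekend."
print(window_counts(sentence))
# 'van' is counted twice; 'de', 'dwingelandij', 'spanje' and 'is' once each
```

Summing such counters per publication year would give the yearly frequency lists; words further than three tokens from the keyword, such as 'alom' and 'bekend' here, are left out.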
For example, the sentence ‘De dwingelandij van de aristokratie van Spanje is alom bekend.’ (The tyranny of the aristocracy of Spain is widely known.) would add the following words to the frequency list: ‘van’ (2), ‘de’ (1), ‘dwingelandij’ (1), ‘is’ (1), ‘spanje’ (1). The next step was to build a historical gazetteer suitable for extracting spatial information from the word frequency lists. A gazetteer is a geographical dictionary containing references to countries, regions, place names, et cetera. To avoid so-called ‘temporal dissonance’ I did not use an existing modern Dutch gazetteer, but created a historical Dutch gazetteer following the principles of McDonough et al. [11]. This gazetteer includes historical spelling variations and references to states that no longer exist. Using this gazetteer, I extracted references to nations from the word frequency list and saved the results as a tabularized set of data. The resulting data were used to analyze how frequently references to various nations co-occurred with keywords related to the concept of aristocrat. Among other things, the analysis shows a clear tendency in Dutch newspapers to associate the concept of the aristocrat with foreign countries, in particular Great Britain. References to a domestic aristocracy on the other hand are only marginally present. My research thus shows that the concept of the aristocrat – as the counterpart of the burgher – was effectively foreignized. This conclusion is in keeping with the generally held image of the Dutch as thoroughly bourgeois, in spite of the actual existence of an indigenous aristocracy. In preparation for the DHN 2020 conference, two more steps will be taken to improve the methodology. So far, the research was based on absolute frequencies of co-occurrences. The first step will be to use so-called ‘significant collocation’ to identify which words co-occur more often than would be expected based on statistics alone [12].
Secondly, in order to capture the relations with semantically similar words, such as synonymy and hyponymy, I will use synsets. Synsets are sets of cognitive synonyms that are interlinked based on semantic and lexical relations. This approach has been successfully applied also by other researchers to study historical and geographical concepts [13]. Using synsets the term ‘aristocrat’ can thus be analyzed at a more conceptual level [14]. References 1. Johan Huizinga, "Nederland's geestesmerk", in: Geschiedwetenschap / hedendaagsche cultuur. Verzameld werk VII (Tjeenk Willink I\& Zoon N.V., Haarlem 1950) pp. 279-312. Originally published in 1935. 2. Henk te Velde, "How High did the Dutch Fly? Remarks on Stereotypes of Burger Mentality", in: Annemieke Galema, Barbara Henkes and Henk te Velde eds., Images of the Nation. Different Meanings of Dutchness, 1870-1940 (Rodopi, Amsterdam/Atlanta 1993) pp. 59-80. 3. Remieg Aerts, “De erenaam van burger: geschiedenis van een teloorgang”, in: Joost Kloek and Karin Tilmans eds., Burger. Een geschiedenis van het begrip ‘burger’ in de Nederlanden van de Middeleeuwen tot de 21ste eeuw (Amsterdam University Press, Amsterdam 2002) pp. 313-345. 4. Conrad Gietman, "Adel tijdens Opstand en Republiek. Oude en nieuwe perspectieven", Virtus. Journal of Nobility Studies 19 (2012) pp. 49-62. 5. Willem Frijhoff, "Verfransing? Franse taal en Nedderlandse cultuur tot in de revolutietijd", BMGN - Low Countries Historical Review 104.4 (1989) pp. 592-609. 6. H. van Dijk and D.J. Roorda, "Sociale mobiliteit onder regenten van de Republiek", Tijdschrift voor Geschiedenis 84 (1971) pp. 306-328; Yme Kuiper, "Adel in de achttiende eeuw: smaak en distinctie. Een verkenning van het veld", Virtus. Journal of Nobility Studies 16 (2009) pp. 9-18. 7. Paul Brusse and Wijnand W. Mijnhardt, Towards a New Template for Dutch History. 
De-urbanization and the Balance Between City and Countryside (Waanders/Utrecht University, [Zwolle/Utrecht 2011]); Yme Kuiper and Rob van der Laarse eds., Beelden van de buitenplaats. Elitevorming en notabelencultuur in Nederland in de negentiende eeuw (Verloren, second revised edition, Hilversum 2014); Yme Kuiper and Ben Olde Meierink eds., Buitenplaatsen in de Gouden Eeuw. De rijkdom van het buitenleven in de Republiek (Verloren, Hilversum 2015). 8. Michael Schudson, The Power of News (Harvard University Press, Cambridge 1982) pp. 17-18; Dan Berkowitz ed., Social Meanings of News. A Text-Reader (Sage, Thousands Oaks/London/New Delhi 1997) pp. xi-xiv; Martin Conboy, The Language of the News (Routledge, London/New York 2007) pp 149-150. 9. Daniel Jurafsky and James H. Martin, Speech and Language Processing. An Introduction to Natural Language Processing, Computational Linguistics, and Speech Recognition (Third Edition draft, 2018). 10. Geoffrey Rockwell and Stéfan Sinclair, Hermeneutica. Computer-Assisted Interpretation in the Humanities (MIT Press, Cambridge/London 2016) pp. 49-65. 11. Katherine McDonough, Ludovic Moncla and Matje van de Camp, "Named entity recognition goes to old regime France: geographic text analysis for early modern French corpora", International Journal of Geographical Information Science 33.12 (2019) pp. 2498-2522. 12. John Sinclair, Susan Jones and Robert Daley, English Collocation Studies: The OSTI Report (Continuum, London 2004) p. 10. 13. For example: Roberta Cimino, Tim Geelhaar, Silke Schwandt, "Digital Approaches to Historical Semantics: New Research Directions at Frankfurt University", Storicamente 11 (2015) pp. 1-16; Francesca Frontini, Riccardo Del Gratta and Monica Monachini, "GeoDomainWordNet: Linking the Geonames Ontology to WordNet", in: Zygmunt Vetulani, Hans Uszkoreit and Marek Kubis eds., Human Language Technology. Challenges for Computer Science and Linguistics. 
6th Language and Technology Conference, LTC 2013, Poznań, Poland, December 7-9, 2013. Revised Selected Papers (Springer International Publishing, 2016) pp. 229-242. 14. For discussions on the relation between words and concepts, see (among others): Otto Brunner, Werner Conze and Reinhart Koselleck, Geschichtliche Grundbegriffe. Historisches Lexikon zur politisch-sozialen Sprache in Deutschland. Volume I (Klett-Cotta, Stuttgart 1972) pp. xxii-xxiv; Peter de Bolla, The Architecture of Concepts.
Long Paper (20+10min)
Disappearing Discourses: Avoiding Anachronisms and Teleology with Data-Driven Methods in Studying Digital Newspaper Collections
Elaine Zosa, Simon Hengchen, Jani Marjanen, Lidia Pivovarova, Mikko Tolonen
University of Helsinki, Finland
Newspapers have been a rich source of information for historians for the past hundred years or so. Digitized newspapers are particularly discussed with respect to the development of public discourse, but the idea of entering the realm of past discourse in toto through the digitized newspapers may in the end be harmful. In reality, historians are interested in the different layers of newspaper publicity, thus location and temporality always play a crucial role in any historical analysis of public discourse in newspapers. With these aspects in mind, this paper takes advantage of digitized newspapers and data-driven approaches in identifying disappearing discourses in newspapers. In doing this, we want to revisit one of the key tensions in historiography, that is, the interplay between being relevant for the present and at the same time writing history in a way that is true to the experiences of past actors. History’s presentism is sometimes discussed critically from the perspective of anachronism or teleology in history (Koselleck 2010; Skinner 2002), or more appraisingly in terms of genealogies of the present or letting all be the history of the contemporary (Armitage forthcoming). Regardless of the historian’s desire for contemporary relevance or for historical antiquarianism, approaching history without predefined questions from the present has not been possible. The advent of digitized sources that can be approached in a data-driven way opens up the possibility of approaching history in a much more open-ended way. Hence, we propose to test the possibility of studying a historical case with as few presupposed categories as possible. To do this we study digitized newspaper collections (specifically, 19th-century Finnish newspapers in Finnish and Swedish) through the perspective of discourses that fell out of fashion and disappeared from long-term diachronic newspaper data sets. 
We believe there is more potential in the use of digitized newspapers when we are not pinpointing the words and concepts in our approach a priori. This may lead us to completely new avenues of research, challenge our take on history as some sort of progression and, hopefully, show the value of the data-driven approach for the humanities. To understand the boundaries and the development of the public sphere it is useful to identify those discourses that were important in a particular time and place, but have since disappeared while words and concepts of another discourse have replaced them and started to dominate the ecosystem of print publicity. It is a commonplace to note that religious discourse has lost much of its prominence or that technological advancements have brought with them new topics that have replaced old ones. Still, by turning the question around and asking which discourses disappeared, we get a broader picture. We then turn to the data again and zoom in on localities and languages in order to avoid a totalizing view and move on to looking at where and when discourse changed. Thus, while we produce an analysis of public discourse in Finland, we approach the topic by noting that this is not a unified whole, but composed of different entangled realms of public discourse (Tolonen et al 2019; Marjanen et al 2019a). Using newspapers and periodicals data in Finnish and Swedish encompassing respectively 5.2B and 3.4B tokens (National Library of Finland 2011a, 2011b), we utilise two different methods: relative word frequencies as proxies for particular discourses enhanced with distributional semantics derived from diachronic word embeddings (Kim et al 2014, Dubossarsky et al 2019), and dynamic topic modeling that captures more general themes. The former method, i.e. 
the combination of frequency analysis and vector space similarity, allows us to focus on specific themes and track their dynamics along a timeline to detect crucial events related to those themes. This has been carried out successfully in recent work on similar data (Martinez-Ortiz et al 2016; Hengchen et al 2019; Marjanen et al 2019b; van Eijnatten and Ros 2019). Training diachronic word embeddings on different time granularities (e.g. months, years, or decades) allows for different views on the evolution of semantic clusters; these themes are then given weight through frequency counts. The latter method allows us to paint a larger picture of the different dynamics taking place in the data, by harnessing the power of topic models designed to capture trends in time-series data such as Dynamic Topic Models (DTM, Blei and Lafferty 2006). In DTM, the data is divided into discrete time slices and the method infers topics across these time slices to capture topics evolving over time. This method models how a topic changes from one time step to the next. Unlike vanilla LDA topic modelling, which does not take into account the evolution of a topic, DTM is more robust to topics whose vocabulary changes over time while they address the same issue. In LDA, such topics would likely be split into separate topics because the words associated with them have changed, whereas in DTM they would be treated as one topic that is developing over time. To address the additional training complexity of this model we subsample the data such that we have the same amount of data for each time slice of our corpus. This also ensures that the topics inferred are representative of all the time slices in the corpora rather than favoring the later years, which have more articles and newspapers associated with them. 
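The balanced subsampling and the relative word frequencies described above can be illustrated with a minimal sketch. It is not the authors' pipeline: the function names `subsample_per_slice` and `relative_frequency`, the toy corpus, and the choice to equalise slices by document count (rather than by token count) are all assumptions made for illustration.

```python
import random
from collections import Counter, defaultdict


def subsample_per_slice(docs, seed=0):
    """Keep the same number of documents in every time slice.

    docs: iterable of (time_slice, tokens) pairs (hypothetical input format).
    Every slice is cut down to the size of the smallest slice (an assumption;
    the abstract only says "the same amount of data" per slice).
    """
    by_slice = defaultdict(list)
    for time_slice, tokens in docs:
        by_slice[time_slice].append(tokens)
    if not by_slice:
        return {}
    n = min(len(slice_docs) for slice_docs in by_slice.values())
    rng = random.Random(seed)  # fixed seed keeps the sample reproducible
    return {s: rng.sample(slice_docs, n) for s, slice_docs in sorted(by_slice.items())}


def relative_frequency(sampled, word):
    """Share of all tokens in each time slice that are `word`."""
    result = {}
    for time_slice, slice_docs in sampled.items():
        counts = Counter(tok for doc in slice_docs for tok in doc)
        total = sum(counts.values())
        result[time_slice] = counts[word] / total if total else 0.0
    return result


# Toy corpus: three documents in 1850, two in 1900 (invented data).
toy = [
    (1850, ["kirkko", "pappi", "kirkko"]),
    (1850, ["pappi", "seurakunta"]),
    (1850, ["kirkko", "kaupunki"]),
    (1900, ["rautatie", "kaupunki"]),
    (1900, ["rautatie", "sähkö", "kirkko"]),
]
sampled = subsample_per_slice(toy)
freqs = relative_frequency(sampled, "kirkko")
```

Here `sampled` holds two documents per slice, so later, larger slices cannot dominate whatever model is trained next, and `freqs` gives one relative frequency per slice that can be plotted along a timeline.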
With thematically-labelled temporal representations of newspaper data, it becomes possible to quantify and qualify the evolution of certain themes that have been automatically inferred from the data, thus removing some bias in topic selection. We further use metadata to zoom in on changes in topics by town, region or type of newspaper, in order to manually assess the driving locations of change and to produce a typology of disappearing discourses. Acknowledgements This work has been supported by the European Union’s Horizon 2020 research and innovation programme under grant 770299 (NewsEye). References 1. Armitage, D. (In Press). In Defense of Presentism. In D. M. McMahon (Ed.), History and Human Flourishing. Oxford: Oxford University Press. 2. Blei, D.M. and Lafferty, J.D. (2006). Dynamic topic models. In Proceedings of the 23rd international conference on Machine learning. ACM, pages 113–120. 3. Dubossarsky, H., Hengchen, S., Tahmasebi, N. and Schlechtweg, D. (2019). Time Out: Temporal Referencing for Robust Modeling of Lexical Semantic Change. Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics. 4. van Eijnatten, J. and Ros, R. (2019). The Eurocentric Fallacy. A Digital Approach to the Rise of Modernity, Civilization and Europe. International Journal for History, Culture and Modernity, 7. 5. Hengchen, S., Ros, R., and Marjanen, J. (2019). A data-driven approach to the changing vocabulary of the ‘nation’ in English, Dutch, Swedish and Finnish newspapers, 1750-1950. In Proceedings of the Digital Humanities (DH) conference 2019, Utrecht, The Netherlands. 6. Kim, Y., Chiu, Y.I., Hanaki, K., Hegde, D. and Petrov, S. (2014). Temporal Analysis of Language through Neural Language Models. ACL 2014, p.61. 7. Koselleck, R. (2010). Vom Sinn und Unsinn der Geschichte: Aufsätze und Vorträge aus vier Jahrzehnten (C. Dutt, Ed.). Berlin: Suhrkamp. 8. 
Marjanen, J., Vaara, V., Kanner, A., Roivainen, H., Mäkelä, E., Lahti, L., & Tolonen, M. (2019a). A National Public Sphere? Analyzing the Language, Location, and Form of Newspapers in Finland, 1771–1917. Journal of European Periodical Studies, 4(1), 54-77. https://doi.org/10.21825/jeps.v4i1.10483 9. Marjanen, J., Pivovarova, L., Zosa, E. & Kurunmäki, J. (2019b). Clustering Ideological Terms in Historical Newspaper Data with Diachronic Word Embeddings. In Proceedings of the 5th International Workshop on Computational History. HistoInformatics2019 - the 5th International Workshop on Computational History, 12/09/2019. 10. Martinez-Ortiz, C., Kenter, T., Wevers, M., Huijnen, P., Verheul, J. and Van Eijnatten, J. (2016). Design and implementation of ShiCo: Visualising shifting concepts over time. In HistoInformatics 2016 (Vol. 1632, pp. 11-19). 11. National Library of Finland (2011a). The Finnish Sub-corpus of the Newspaper and Periodical Corpus of the National Library of Finland, Kielipankki Version [text corpus]. Kielipankki. Retrieved from http://urn.fi/urn:nbn:fi:lb-2016050302. 12. National Library of Finland (2011b).
Short Paper (10+5min)
Creating an Annotated Corpus for Aspect-Based Sentiment Analysis in Swedish
Jacobo Rouces, Lars Borin, Nina Tahmasebi
University of Gothenburg, Sweden
Aspect-Based Sentiment Analysis constitutes a more fine-grained alternative to traditional sentiment analysis at sentence level. In addition to a sentiment value denoting how positive or negative a particular opinion or sentiment expression is, it identifies additional aspects or 'slots' that characterize the opinion. Some typical aspects are target and source, i.e. which entity or aspect the opinion is about and who holds the opinion. We present a large Swedish corpus annotated for Aspect-Based Sentiment Analysis. Each sentiment expression is annotated as a tuple that contains one of five possible sentiment values, the target, the source, and the existence of irony. In addition, the linguistic element that conveys the sentiment is also identified. Sentiment for a particular topic is also annotated at title, paragraph and document level. The documents are articles obtained from two Swedish media outlets (Svenska Dagbladet and Aftonbladet) and one online forum (Flashback), totalling around 4000 documents.
Museums, Education and Technology

Long Paper (20+10min)
Digital History of Virtual Museums: The
Through the designing of such a concept, the creation and development of museums' information resources, websites and various digital initiatives have become the keys to the success of museums in the digital environment today. This article considers the concept of a virtual museum, traces the transition of virtual museums from analog and interim multimedia formats to the online environment. The author surveys the crucial moments in the history of virtual museums and the stages of their development from the digital turn to their appearance on the Internet and subsequent transformation after this transition. In this article examples of museum information resources from North America and Europe, Japan and Australia are traced back to the first virtual museums online in the 1990s. Based on the analysis of materials from web archives, strategies for creating the first virtual museum resources on the WWW are identified.
Short Paper (10+5min)
Museums, Technology and Social Interaction in “Anyone Can Innovate!”
Gabriella Di Feola, Erik Einebrant, Fredrik Trella
Research Institutes of Sweden, Sweden
The purpose of this paper is to describe insights gained from a collaboration project between RISE, an experimental research institute, and Borås Museum, a local cultural heritage institution, around the topic of how technology can be used in museums to encourage social interaction between visitors and between visitors and the museum staff. This is investigated through a case study of the project “Anyone Can Innovate!”, which was a multi-participatory VR-installation, using a perspective of participatory design. The study was conducted through observations by the developers, formal user testing with externally recruited testers, and by an interview with the responsible project leader and curator from Borås Museum. 
The VR-installation was tested in two iterations with different levels of embedded guidance, and included different roles for the participants, as an attempt to boost collaboration and interaction. One conclusion of the study is that the use of technology in a museum doesn’t per se mean that it will be participatory, and that it does not necessarily exclude the role of a human guide. In the discussion part, examples are given of how technology can be used as a tool for participatory design.
Short Paper (10+5min)
No Longer Obsolete: Mapping Digital Literacy Skills for Museum Professionals in Sweden and Lithuania
Nadzeya Charapan
Uppsala University, Sweden; Vilnius University, Lithuania
Contemporary museums as open systems are constantly transforming in response to economic, technological, social and cultural trends. The past decade has witnessed an increasing demand for information about the digitization of, access to, and preservation of museum collections to produce digital cultural heritage and new affordances for visitor-museum encounters. The post-digital turn normalizes the application of ICTs as a basic attribute of museum practice for preservation, collection, display and communication functions (Parry, 2010; 2013). Thus, practitioners must be equipped with transferable competencies to be able to perform their duties and facilitate the successful digital transformation of cultural institutions (Borowiecki & Navarrete, 2017). Previous research into digital competencies demonstrates a paucity in the understanding and conceptualization of digital literacy (Marty, 2006; Tallon, 2017). For example, Jisc (2014) defines it as “capabilities which fit an individual for living, learning and working in a digital society. Digital literacy looks beyond functional IT skills to describe a richer set of digital behaviors, practices, and identities”. 
This definition provides a general view of the concept and requires further elaboration and adaptation to the specificity of the museum sector. Moreover, due to constant and speedy change, the creative industries sector experiences a permanent gap in transferable skills (Creative and Cultural Skills, 2011; Howard, 2013). Against the background of these trends, there is a need for further investigation into digital literacy and approaches to assessment and evaluation. The existing European (eCult Skills 2013-2015, Mu.SA project 2016-2019) and British national research projects (One by One: Building Digital Literacies 2017-2020) serve as important facilitators in addressing the existing research and practice gaps in digital literacies and the advancement of the museum sector; however, their empirically-driven conclusions are only partly applicable to the Baltic and Nordic context. The goal of this paper is to provide a nuanced understanding of how digital skills and literacies are understood, operationalized, and supplied in the Swedish and Lithuanian museological contexts. A conceptual model of the museum digital skills ecosystem, suggested by Parry, Eikhof, Barnes and Kispeter (2018), is adopted as a theoretical framework to scrutinize the landscape of digital literacy skills in two case studies. The paper addresses the following interrelated blocks of research questions: 1. How do national cultural policies and legislation regulate the digitalization of museums and the provision of digital literacy skills in Lithuania and Sweden? 2. How do museum practitioners understand and deploy digital literacy skills in their daily professional practices? 3. What measures are required to bridge the gap (if any) and reach the balance in demand and supply of the skills? 
To depict the national peculiarities, the study will use data from a) a desk study of the evidence on the national museum regulations and digitization in Lithuania and Sweden, and b) qualitative research methods, based on in-depth interviews with museum practitioners to gain a nuanced understanding of how digital skills are developed and deployed in different structural units. The comparative thematic analysis of Kulturarvspolitik (2017) and Museilag (2017) in Sweden, and the New National Museum Decree (2018) in Lithuania, will create the legislative framework for the analysis of the existing regulations and infrastructures. Furthermore, the empirical data will be obtained from the museum professionals of two national art museums: the Nationalmuseum (Stockholm), incorporating Digital Laboratory; and Lithuanian Art Museum (LAM), incorporating Lithuanian Museums’ Centre for Information, Digitisation. The choice of the museums is determined by the following factors: similarity of the institutional context (both are art museums); their status (both museums are national cultural institutions); and their role as national digital hubs, incorporating the Digital Laboratory (Nationalmuseum) and the Lithuanian Museums’ Centre for Information, Digitisation (Lithuanian Art Museum). The empirical data will benchmark the national peculiarities of the digital skills ecosystems and digitization processes in Lithuania and Sweden. The Baltic-Nordic comparative perspective will generate a consolidated view on the digitization of the museum sectors, discussing the existing threats and opportunities for digitalization, as well as supply and demand of the digital competencies. As an outcome, a set of recommendations for the prospective collaboration and knowledge transfer will be developed. These guidelines will provide a glimpse into the nationally-tailored and regional specificity of digital skills ecosystems that will address the existing gap. References: Borowiecki, K. J., & T. 
Navarrete (2017). Digitization of Heritage Collections as Indicator of Innovation. Economics of Innovation and New Technology, 26, 3, 227-246. Creative and Cultural Skills (2011). Sector Skills Assessment for the Creative Industries of the UK. London: Creative and Cultural Skills. Available from: https://creativeskillset.org/assets/0000/6023/Sector_Skills_Assessment_for_the_Creative_Industries _-_Skillset_and_CCSkills_2011.pdf eCult Skills [Desk and Field Research: Guidelines and Templates] V.1.0. Available from: http://files.groupspaces.com/eCult/files/1152507/RQMMdZeHqGSV1EEiHKk5/R2a+%26+R3a+Methodology+for+identification+of+K%2C+S%2C+C+needed+in+the+e-cult+sector+%26+Trainings+availalbe+in+the+EU.pdf Jisc (2014). Developing Digital Literacies (online guide). Bristol: Jisc. Available from: https://www.jisc.ac.uk/guides/developing-digital-literacies Howard, K. (2013). GLAM (Re-)Convergence and the Education of Information Professionals. Paper presented at A GLAMorous Future? Reflecting on Integrative Practice Between Galleries, Libraries, Archives, and Museums. Victoria University, Wellington, New Zealand. Lithuanian Art Museum. Available from: https://www.ldm.lt/en/ Lithuanian Museums’ Centre for Information, Digitisation. Available from: https://www.limis.lt/en/projektas Marty, P. F. (2006). Finding the skills for tomorrow: Information literacy and museum information professionals. Museum Management and Curatorship, 21, 4, 317-335. Mu.SA: Museum Sector Alliance (2019). Available from: http://www.project-musa.eu/about/ Nationalmuseum. Available from: http://collection.nationalmuseum.se/ Parry, R. (ed.) (2010). Museums in a Digital Age. Abingdon and New York: Routledge. Parry, R. (2013). The End of the Beginning: Normativity in the postdigital museum. Museum Worlds, 1, 24-39. Pedro, A. R. (2010). Portuguese Museums and Web 2.0. [Os museus portugueses e a Web 2.0]. Ciencia da Informacao, 39, 2, 92-100. Parry, R., Eikhof, D. R., Barnes, S. A., & Kispeter, E. (2018). 
Mapping the Museum Digital Skills Ecosystem-Phase One Report. Tallon, L. (2017). Digital is More Than a Department, it is a Collective Responsibility. The Met. Published 24 October 2017. Available from: https://www.metmuseum.org/blogs/now-at-themet/2017/digital-future-at-the-met
Short Paper (10+5min)
Beginning Latvian and Lithuanian as University Level Distance Learning Courses – Experiences and Reflections from the Past Two Years of Teaching
Lilita Zalkalns
Stockholm University, Sweden
The Baltic Section of the Department of Slavic and Baltic Studies, Finnish, Dutch and German at Stockholm University has offered beginning courses in Latvian and Lithuanian ever since the fall term of 2017. While it may seem unusual to teach a language over the internet with no physical contact at all, this teaching method has been shown to be especially well suited for the so-called "smaller" or "exotic" languages, which often lack sufficient student applicants for campus courses. As a case in point, both the Latvian and the Lithuanian courses have had an average of 20 registered students per term, a number which must be regarded as unusually high for these languages. Approximately 90% carry through to the end and take the final exam. About 10% of the students decline to participate in face-to-face contacts via Skype, Zoom or Adobe Connect, which could indicate any number of things, among them the possibility that the student is cheating, i.e. someone else is doing the work in the course modules, and that s/he does not want to reveal their lack of language knowledge, or it could simply be that the student is shy. These and other types of student/study observations and statistics will be presented and analyzed. Over the past two years, the courses have successively changed based on student feedback, technological challenges and developments, and changes in the teacher's (my) attitudes. 
Among the issues I will discuss are technological problems, which for some students are a huge barrier to successful studies, and administrative issues, which can take up a major part of the teacher's allocated teaching time. Concerning course design, the teacher must be prepared to create or find new content, as links to external study materials suddenly disappear or the materials themselves are changed. Also, the increasing student use of smartphones as their main learning platform means that study materials must be continually redesigned with the small screen in mind. These and other observations and reflections will be presented. It can be concluded that, at least at Stockholm University, Latvian and Lithuanian will continue to be taught as distance learning courses, and that most likely their scope and number will increase. In order to retain and augment student interest, the lessons learned and the experiences gained from the first two years of internet teaching should be gathered, systematized and implemented in future language courses. References: ECAR Study of Faculty and Information Technology 2017 (https://www.educause.edu/ecar/research-publications/ecar-study-of-faculty-and-information-technology/2017/introduction-and-key-findings) Darby, Flower: How to Be a Better Online Teacher in The Chronicle of Higher Education, April 17, 2019 (https://www.chronicle.com/interactives/advice-online-teaching)
7:00pm - 10:00pm Reception hosted by the Nordic Council of Ministers' Office in Latvia
Aristida Briāna iela 9, Rīga, LV1011 K. K. fon Stricka villa