Short Paper (10+5min)
Building a Linked Open Data Portal of War Victims in Finland 1914–1922
1Aalto University, Semantic Computing Research Group (SeCo), Finland; 2University of Helsinki, Helsinki Centre for Digital Humanities (HELDIG), Finland; 3The National Archives of Finland, Finland
This paper presents first results from a project that aims to publish data about the war victims in Finland in 1914–22 as a Linked Open Data service and to create a portal of tools called WarVictim-Sampo 1914–22 to explore and analyze the data. At the same time, the data is extended with new information and corrected when mistakes are found. The project is based on the database War Victims of Finland 1914–22 ("Sotasurmat 1914–22") of the National Archives of Finland and related data compiled during the project. The core of the data includes information about roughly 40,000 war victims. Most of these deaths are due to the Finnish Civil War, but some are related to the First World War and the Kindred Nations Wars.
Short Paper (10+5min)
Linked Data for Digital Humanities Scholars and Researchers: “Rainis and Aspazija” Collection
National Library of Latvia, Latvia
This talk will focus on the Linked Digital Collection "Rainis and Aspazija" that showcases the use of Linked Data in Digital Humanities. This collection offers interlinked digital objects and data from several memory institutions and private repositories related to two Latvian poets of the period of National Awakening. We will also talk about the semantic annotation tool developed for cultural heritage needs that was used to create this collection and how this tool could be adapted to other use cases.
In 2016, the National Library of Latvia (NLL) together with the National Archives of Latvia, the Institute of Literature, Folklore and Art of the University of Latvia, the Association of Memorial Museums, and the Literature and Music Museum published RunA, the first digital cross-sectoral cultural heritage pilot collection in Linked Data form in Latvia. RunA highlights the NLL's efforts in developing a new knowledge base for memory institutions and researchers. During 2018–2019, the NLL developed a special semantic annotation tool and a separate entity datastore to enhance the analysis of RunA textual documents. Although tools for annotating entities and linking them to external sources already exist, they do not fully meet the specific needs of research on historical cultural heritage documents, such as correspondence from the late 19th century or archival documents.
The RunA annotation tool supports three core types of annotations: simple annotations that may link to named entities; structural annotations that mark up portions of the document that have a special meaning within the context of the document (e.g., a direct citation of another published work); and composite annotations for more complex use cases (e.g., representing an event described in a document with mentions of place, time and participants, all marked and identified in their own annotations).
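The three annotation types can be pictured as simple data structures. The following is only an illustrative sketch: the class names, field names, entity identifiers and the sample sentence are all invented for this example and do not reflect the actual data model of the RunA tool.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class SimpleAnnotation:
    """A text span that may link to a named entity."""
    start: int
    end: int
    entity_id: Optional[str] = None  # hypothetical identifier of an entity page


@dataclass
class StructuralAnnotation:
    """A portion of the document with a special meaning, e.g. a citation."""
    start: int
    end: int
    role: str  # hypothetical label such as "citation"


@dataclass
class CompositeAnnotation:
    """Groups other annotations, e.g. an event with place, time, participants."""
    kind: str
    parts: Dict[str, List[SimpleAnnotation]] = field(default_factory=dict)


# Invented sample sentence, not taken from the collection.
text = "Anna wrote to Peter from Riga in May."


def mention(phrase, entity_id=None):
    """Build a simple annotation for the first occurrence of phrase in text."""
    start = text.index(phrase)
    return SimpleAnnotation(start, start + len(phrase), entity_id)


# A composite "event" annotation whose parts are annotations in their own right.
event = CompositeAnnotation(
    kind="event",
    parts={
        "participants": [mention("Anna", "person/1"), mention("Peter", "person/2")],
        "place": [mention("Riga", "place/1")],
        "time": [mention("May")],  # marked, but not yet linked to an entity
    },
)
```

In this sketch the composite annotation only references other annotations, which mirrors the description above that place, time and participants are each marked and identified separately.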
The tool allows users to import text documents, create manual annotations and entity pages. The tool also includes a semi-automated named entity recognition technique where entity mentions in unstructured content are identified and linked to the existing entity pages in the annotation tool.
The process of semantic annotation of cultural heritage documents using the NLL's tool includes all the classical stages of annotation: text analysis to identify concepts such as people, things, places and events; concept extraction and classification of identified entities; manual extraction of relationships between known and newly recognized entities; linking entities to internal and external controlled vocabularies; and storing entity information with links in the datastore.
Information about the entities referenced from annotations is maintained in a dedicated entity datastore that supports links between entities and can point to additional information about these entities (e.g., to Linked Open Data resources such as VIAF or Wikipedia). The datastore supports storing, sharing, and reusing data, both data extracted from individual annotations and data added by researchers. This allows experts to build a knowledge base about the entities referenced from annotations while annotating documents. The entity information can evolve as the annotation task progresses, and the data on an entity can be completed later. For example, an expert may create an entry for an entity that needs further research (with comments about what is and is not known about the entity), which can be extended with additional information (for example, identifiers for the entity in other authoritative data sources) when it becomes available.
Machine-readable information about all entities is published according to Linked Data principles (in Turtle RDF and RDF/XML format).
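As a rough illustration of what a Turtle description of an entity can look like, the sketch below builds one by hand. The vocabulary terms (rdfs:label, owl:sameAs), the example.org URIs and the function name are assumptions chosen for the example; the abstract does not state which vocabularies or URI patterns RunA actually uses.

```python
PREFIXES = (
    "@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .\n"
    "@prefix owl: <http://www.w3.org/2002/07/owl#> .\n"
)


def escape_literal(value):
    """Escape backslashes and double quotes for a single-line Turtle string."""
    return value.replace("\\", "\\\\").replace('"', '\\"')


def entity_to_turtle(uri, label, same_as=()):
    """Serialize one entity with a label and optional links to external records."""
    pairs = [f'rdfs:label "{escape_literal(label)}"']
    pairs += [f"owl:sameAs <{link}>" for link in same_as]
    body = " ;\n    ".join(pairs)
    return f"{PREFIXES}\n<{uri}>\n    {body} .\n"


# Placeholder URIs only; no real identifiers are implied.
print(entity_to_turtle(
    "http://example.org/entity/1",
    "Example Person",
    same_as=["http://example.org/external/authority-record"],
))
```

Labels containing line breaks are not handled here; a production system would more likely rely on an RDF library than on string formatting.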
Expanded annotated materials could become a research subject for students, who, whilst doing research, could also test RunA's annotation tool. After providing some guidance, the NLL plans to involve students in annotating the correspondence of Rainis and Aspazija. Students, educators and early-career researchers will have a chance to learn about possible methods of using digitized material from cultural heritage collections. Through the use of RunA, the knowledge base generated by the NLL could be integrated into study courses at the Faculty of Humanities of the University of Latvia.
This presentation will give concrete examples and recommendations for exploiting RunA as a resource and for using the textual document annotation tool in Digital Humanities research and teaching.
1. Bojārs, U., Rašmane, A., Žogla, A., Bāliņa, S., Salna, E. Semantic Annotation Tool for Cultural Heritage Content. Baltic Journal of Modern Computing, Vol. 6, No. 4 (2018).
2. Goldberga, Kreislere, Rašmane, Stūrmane, Salna. (2018). Identification of entities in the Linked Data collection "Rainis and Aspazija" (RunA). Italian Journal of Library, Archives and Information Science (JLIS.it), Vol. 9, No. 1. https://www.jlis.it/article/view/83-106, DOI: http://dx.doi.org/10.4403/jlis.it-12444
3. Bojars, U., Rasmane, A., Zogla, A. The Requirements for Semantic Annotation of Cultural Heritage Content. Proceedings of the Second Workshop on Humanities in the Semantic Web (WHiSe II) co-located with 16th International Semantic Web Conference (ISWC 2017), Vienna, Austria, October 22, 2017. CEUR Workshop Proceedings, vol. 2014. URL: http://ceur-ws.org/Vol-2014/
4. Goldberga, A., Rašmane, A. (2017) The Linked Data collection “Rainis and Aspazija” (RunA) and the potential of IFLA FRBR LRM key entities for annotating textual documents. IFLA Library. http://library.ifla.org/1762/
5. Bojars, U. Case Study: Towards a Linked Digital Collection of Latvian Cultural Heritage. Proceedings of the 1st Workshop on Humanities in the Semantic Web (WHiSe 2016),
Long Paper (20+10min)
Estonian Language Community ca. 1900: Learning from Linked Metadata
1Tallinn University, Estonia; 2University of Tartu, Estonia; 3National Library of Estonia, Estonia
The expansion of digital resources has provided new avenues for historical research on language in a number of ways. Based on digital text collections, corpus linguistics has become one of the core disciplines for language researchers over the last few decades. Enriched texts allow one to extract facts, people, or geographic locations from texts and allow us to better understand what people were writing about.
An intriguing option that has come to be explored more recently is the study of collection metadata themselves (e.g. see Lahti et al. 2019). That is, by studying collections and registries themselves, it may be possible to say something substantial about historical events too. This naturally entails a more careful consideration of how the data points end up in the archives and whether they can be considered representative of an era or may be biased in some way (e.g. by focussing on authors that were later canonized by critics, cf. Algee-Hewitt et al. 2016).
In this presentation, I explore historical bibliographic data, namely the Estonian National Bibliography, which aims to give a complete and comprehensive overview of publications in Estonian or related to Estonia. It is an aggregate of various bibliographies collected by book scholars over generations and has now been made available in a digital structured format. I explore it from two angles: 1) printed books in the context of community demographics; 2) individuals involved in writing and publishing books and their backgrounds.
This study can be seen as a contribution to the discipline of historical sociolinguistics, which aims to study the relations between society and language use in past communities. This research area suffers from the problem of 'bad data': linguistic interactions are preserved very poorly, and even for written language, a significant part of the sources has not been preserved. As a result, researchers turn to 'informational maximalism' (Janda & Joseph 2003), looking to integrate whatever data is available for research. Turning to demographic and bibliographic datasets thus offers a natural complement to the data in the field, one that has not been explored much yet. While the conclusions also depend on the quality of the data, population-level characteristics offer an alternative point of view on past linguistic communities. A systematic use of such data may in fact extend the set of characteristics typically observed in historical sociolinguistics and allow new concepts to be developed for interpreting data in terms of the linguistic community.
The Estonian written language community provides a particularly interesting case for sociolinguistic study. While Estonian had been written regularly from the mid-16th century, this was done mostly by Baltic German clerics and for religious purposes. Active language use in the native Estonian community was largely oral until the mid-19th century, and within a short time, by the start of the 20th century, written communication had established a strong position within the community. At the same time, the area saw heavy urbanization, technological improvements like the railroad and improved printing, and growing attention paid to language. Migration within the language community and the influx of a large number of new writers could influence both the shape of the language as it is used and the position in which it is seen. Here I present explorations of the growth dynamics of the written language community at the time. I focus particularly on spoken and written language contacts as they can be reconstructed from demographic and bibliographic data, and on some administrative events in language policy and education and whether their influence can be seen in the data.
#### Data and methods
The study relies on the Estonian National Bibliography (ENB), which is publicly available on data.digar.ee. The data was harmonized with some heuristics and custom dictionaries. Demographic data for the period was aggregated from various published primary and secondary sources. The individuals involved in the language community were retrieved from publication data in the ENB by taking all names associated with the publications (excluding original authors of translated works). Finally, the data was enriched via the ENB's links to VIAF, which added biographical information on the authors from Wikidata and DNB collections, supplemented by a few more sources (ISIK, VEPER). The result is a bibliographic dataset combined with demographic data, together with an enriched dataset of individuals actively involved with print publications.
The urbanization component of migration was estimated based on residual migration in cities, comparing the natural growth rates estimated by demographers (Ainsaar 1997) with the growth recorded across the censuses. The migration pathways were estimated on the basis of the 1922 census, which recorded people's birthplaces at county level. The native dialects of geographic areas or people were based on a published dialect map (Kyröläinen & Uiboaed 2013). The estimated population data was used to calculate per capita publishing rates over time. Additionally, aggregated data from school history (Andresen 2003) were utilized to investigate the reasons behind the different rates at which authors emerged across the counties.
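The two estimates described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the study's actual procedure: it assumes compound annual natural growth between two censuses, expresses publishing rates per 1,000 inhabitants, and uses invented figures; the function names are hypothetical.

```python
def residual_migration(pop_start, pop_end, natural_rate, years):
    """Recorded growth minus the growth expected from natural increase alone.

    Assumes (for illustration) compound annual natural growth between censuses.
    """
    expected = pop_start * (1 + natural_rate) ** years
    return pop_end - expected


def titles_per_thousand(titles, population):
    """Publishing rate as titles per 1,000 inhabitants."""
    return 1000 * titles / population


# Invented figures, for illustration only.
city_1881, city_1897 = 20_000, 32_000
net_migrants = residual_migration(city_1881, city_1897, natural_rate=0.01, years=16)
share_of_growth = net_migrants / (city_1897 - city_1881)

rate = titles_per_thousand(titles=150, population=900_000)
```

A positive residual is read as net in-migration; the share of recorded growth it accounts for gives a rough measure of how much of a city's growth came from migration rather than natural increase.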
#### Case studies
The demographic data show that, due to internal migration within the Estonian population, around 50% of the inhabitants of most cities were immigrants around the turn of the century, a situation that has been described as one of heavy dialect contact. However, only a minority of them were born in a different dialect area, so the practical influence of dialect contact on spoken language can be expected to be marginal.
The publication record shows exponential growth in the number of printed works, in printed works per capita, and in authors per capita. This provides a foundation for a steady rise in the written language community, influenced to some extent by political events. The publication record also shows rather abrupt changes in the relative roles of the competing Estonian, German, and Russian languages as a result of administrative policies.
The birthplaces of the associated individuals show a dominance of Livland in the late 19th century, to the extent that cumulatively, Livland comes to dominate the intellectual population over Estland. This trend can be understood in terms of administrative policies that resulted in Livonian communities gaining affluence, as well as good coverage of public schools, a decade or two earlier. However, as regards the north-south split of Estonian dialects, northern dialects are also dominant in large parts of Livland, so speakers of a native northern dialect still dominate the written language community.
The study of collection metadata provides an intriguing avenue for humanities research. It also opens up new discussions, particularly on the potential representativeness of the collections and the different ways in which data points could be harmonized or generalized. While these discussions may take a while to unfold, opening up collections as datasets, and making them structured and machine-readable, is a clear step towards exploring these possibilities. In the case studies presented here, the encyclopedic metadata was used to study the shape of an emerging language community more than 100 years ago. The same datasets, and others like them, could be used to study many different questions relevant to humanities scholars in different fields. The more datasets become interlinked with each other, the more their investigative as well as critical potential can be taken advantage of. In this paper, we have gathered historical demographic, bibliographic, dialectal, and administrative data to characterize a past language community.
- Ainsaar, M. 1997. Eesti rahvastik Taani hindamisraamatust tänapäevani = Estonian population from Liber Census Daniae up to nowadays. Tartu : Tartu Ülikooli Kirjastus
- Algee-Hewitt, M., Allison, S., Gemma, M., Heuser, R., Moretti, F. and Walser, H., 2016. Canon/archive: large-scale dynamics in the literary field. Literary Lab Pamphlet 11
- Andresen, L. 2002. Eesti rahvakooli ja pedagoogika ajalugu. / III, Koolireformid ja venestamine (1803-1918). Tallinn: Avita.
- Janda, R.D. & Joseph, B.D. 2003. On language, change, and language change – or, of history, linguistics, and historical linguistics. In: B.D. Joseph and R.D. Janda (Eds.), The Handbook of Historical Linguistics. Blackwell, Oxford. 3–180.
- Lahti, L., Marjanen, J., Roivainen, H. and Tolonen, M., 2019. Bibliographic Data Science and the History of the Book (c. 1500–1800). Cataloging & Classification Quarterly, 57(1), pp.5-23.
- Uiboaed, K. & Kyröläinen, A. Keeleteaduslike andmete ruumilisi visualiseerimisvõimalusi. Eesti Rakenduslingvistika Ühingu Aastaraamat 11, pp. 281–295
Short Paper (10+5min)
Curatr: A Platform for Exploring and Curating Historical Text Corpora
University College Dublin, Ireland
The selection, curation, and interpretation of text is fundamental to knowledge generation in the humanities. The increasing availability of digital collections of historical texts presents a wealth of possibilities for new research in the humanities. However, the scale and heterogeneity of such collections raises significant challenges when researchers attempt to find and extract relevant content. This work describes Curatr, an online platform that incorporates domain expertise and methods from machine learning to support the exploration and curation of large historical corpora. We discuss the use of this platform in making the British Library Digital Corpus of 18th and 19th century books more accessible to humanities researchers.
Short Paper (10+5min)
The 10M Balanced Corpus of Modern Latvian (LVK2018)
The Institute of Mathematics and Computer Science, University of Latvia, Latvia
Nowadays research in many scientific disciplines would not be possible without the use of corpora, especially a general corpus that “aims to represent the universe of contemporary language” (Aston & Burnard 1998, 5).
Corpora are used in linguistics to conduct language research and to create dictionaries and grammars; in sociology to analyze mass opinion and behaviour; and in computer science to develop natural language processing components, such as machine translation, speech recognition and various text taggers.
This abstract presents The Balanced Corpus of Modern Latvian (LVK2018), a new 10-million-word representative corpus of contemporary Latvian. It describes the design, composition and text selection criteria of LVK2018. LVK2018 is available at www.korpuss.lv
2. Development of the LVK corpus
The balanced corpus of modern Latvian has been developed in multiple rounds. The history of the LVK series goes back to 2007, when the first 1-million-word corpus was created. The LVK design, compilation and text selection criteria were based on the Latvian Language Corpus Conception (Levāne-Petrova 2012). Experience from the design of other general corpora was taken into account as well. The corpora reviewed include the British National Corpus (Burnard 2007; Aston, Burnard 1998), the Czech National Corpus (Čermák 2002; Hnátková et al. 2014; Křen et al. 2016), the Corpus of Contemporary Lithuanian Language (Kovalevskaitė 2006; Rimkutė et al. 2010), and others. The same corpus design criteria were also used for the subsequent corpora in the LVK series. The previous corpus in this series (LVK2013) was released in 2013 with 4.5 million words (Levāne-Petrova 2012). All corpora are morphologically annotated (Paikens 2007; Paikens et al., 2013; Paikens 2016), and their texts are also annotated with metadata. LVK2018 is an extended version of the LVK series, so it contains all the data from previous corpus releases (Levāne-Petrova 2019).
3. Design of LVK2018
LVK2018 is designed as a general-language, representative and balanced corpus that aims to cover the variety of existing texts in estimated proportions. Therefore, the corpus contains five different sections:
parliamentary transcripts (2%).
The corpus proportions are slightly modified from the previous edition of the Corpus: the Miscellaneous section has been incorporated into the Journalism section, because almost all samples previously included in it (for instance, web articles on different topics) could equally be included in the Journalism section.
To cover different magazines and newspapers, the Journalism section has been divided into the following subsections: nationwide media (41%), regional media (22%), leisure media (24%), and popular science media (13%). This section has also been updated compared with the previous LVK2013 corpus. For instance, there used to be a separate category for news articles published online, but it is now incorporated into the nationwide media subsection, because nowadays there is almost no distinction between the printed press and the online press. Consequently, the sources are categorized by genre, not by publishing medium as in the previous versions of the LVK.
4. Text selection criteria
To ensure quality and diversity and other desired aspects of the developed corpus, multiple text selection criteria were set.
time – texts published from 1991 onwards, as this corpus aims to reflect contemporary Latvian. Although texts from 1991 onwards may be chosen for the Corpus, it is also crucial to cover the events of recent years in the mass media and other fields. Besides, sources from recent years are available in digital format, so no additional effort is required to digitize them.
availability – data samples in digital format are chosen in preference to printed media. For instance, for the Journalism section we chose articles from the period 2013–2017, which cover the events of recent years in the mass media and were also available in digital format. Novels in digital format are preferred for the Fiction section over novels in print.
originality – LVK2018 contains only texts originally written in Latvian; therefore, obvious translations into Latvian are not included. Some foreign news in the mass media is clearly translated from other languages, but no specific criteria were set to exclude such news or other sources that might be translations or localizations.
diversity – texts should cover as wide a range of topics as possible. This is a very important criterion for the Journalism and Science sections but applies to the Legal and Parliamentary transcripts sections as well. The sources should therefore be selected from very different topics, such as foreign news, sports news and leisure. The sources for the Science section should cover different branches of science, such as biology, mathematics and linguistics.
text completeness – the selected documents should be included in full length if possible, with one exception. So that the vocabulary of a single source does not dominate the Corpus, a source may not exceed 5% of its section. Thus, novels or PhD theses that exceed this limit were not included in full; the Corpus contains only samples of these sources.
uniqueness – in the previous Corpus development stages it was very important not to include news about one and the same event from different sources in the Journalism section, for example, articles by various journalists about the newly elected President of Latvia in various publications. This remains crucial at the current stage, but it was even more important not to include any news items or parts of articles that duplicate one another, as most portals republish the same news, sometimes at different times.
quality – texts should contain only clean text. Methods for validating text quality need to be developed, because the difficulty lies not in selecting texts for the corpus but in converting them from the source digital format to corpus text. Only sources suitable for the Corpus should be selected; for instance, in a source with many tables, formulas and other non-text parts, these parts are not usable for search queries and are therefore removed from the source.
The text selection criteria may also be changed or updated across Corpus development stages. For instance, until recently only sources with hard copies were included in the LVK, but now that almost all sources, especially journalism, are in digital format, this criterion is no longer relevant.
Although all of the criteria were taken into account in the development of the corpus, they were applied differently in each section (journalism, fiction, scientific, legal, parliamentary transcripts). Moreover, the data was selected automatically at this corpus development stage, which makes the task even more difficult.
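The 5% limit from the text completeness criterion can be illustrated with a short sketch. It assumes that sizes are counted in tokens and that an oversized source is sampled by keeping its opening portion; the abstract does not say how the samples were actually drawn, and the function name and figures are hypothetical.

```python
def cap_source(tokens, section_size, max_percent=5):
    """Return at most max_percent of the section size from one source.

    Keeping the beginning of the source is an assumption made for this
    example; a real pipeline might sample chapters or random passages.
    """
    limit = section_size * max_percent // 100
    return tokens[:limit]


# Invented sizes: a 2,000,000-token section and a 150,000-token novel.
section_size = 2_000_000
novel = ["token"] * 150_000
sample = cap_source(novel, section_size)  # limited to 100,000 tokens

short_article = ["token"] * 800
kept = cap_source(short_article, section_size)  # under the limit, kept in full
```

Integer arithmetic is used for the limit so that the cap does not depend on floating-point rounding.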
5. Annotation of LVK2018
The LVK2018 metadata schema has been fully revised and updated. Multiple metadata fields were standardized and normalized during the revision. LVK2018 has three publicly visible metadata fields: unique identifier (id), section and reference. A different reference template was designed for each of the five sections to incorporate all the metadata fields relevant to that section. For instance, the fiction section uses the metadata fields author, title/chapter, publishing place, publisher and year.
LVK2018 contains morphosyntactic annotation produced by the IMCS morphological tagger (Paikens, 2007; Paikens et al., 2013; Paikens, 2016). Morphosyntactic annotations contain the PoS tag, lemma and other Latvian-specific morphological and syntactic information.
A balanced subcorpus of LVK2018 (10,000 sentences), containing samples of texts from the different styles, domains and subdomains present in the corpus, has also been manually annotated for syntax (Rituma et al., 2019), using the hybrid dependency-constituency grammar formalism developed in the previous Latvian Treebank pilot project (Pretkalnina et al., 2011). The hybrid annotation is then automatically converted to Universal Dependencies to achieve cross-lingual compatibility and to provide training data for efficient and robust parsers (Gruzitis et al., 2018).
LVK2018 has been released in the framework of Latvian National Corpus. LVK2018 is freely available via the corpus query interface NoSketch Engine (Rychlý 2007) at http://nosketch.korpuss.lv/.
7. How to quote the LVK?
The corpus material is to be quoted in the bibliography in the following way:
The Balanced Corpus of Modern Latvian – LVK2018 (beta). The Institute of Mathematics and Computer Science, University of Latvia. Riga, 2018. Available at: www.korpuss.lv
This work has received financial support from the Latvian Language Agency through the grant agreement No. 4.6/2019-029.