Conference Agenda

Historical Studies, AI, Linked Data
Wednesday, 18/Mar/2020:
4:10pm - 5:40pm

Session Chair: Eetu Mäkelä
Location: Hall A

Long Paper (20+10min)

Classification of Medieval Documents: Determining the Issuer, Place of Issue, and Decade for Old Swedish Charters

Mats Dahllöf

Uppsala University, Sweden

The present study is a comparative exploration of different classification tasks for Swedish medieval charters (transcriptions from the SDHK collection) and different classifier setups. In particular, we explore the identification of the issuer, place of issue, and decade of production. The experiments used features based on lowercased words and character 3- and 4-grams. We evaluated the performance of two learning algorithms: linear discriminant analysis and decision trees. For evaluation, five-fold cross-validation was performed. We report accuracy and macro-averaged F1 score. The validation made use of six labeled subsets of SDHK combining the three tasks with Old Swedish and Latin. Issuer identification for the Latin dataset (595 charters from 12 issuers) reached the highest scores, above 0.9, for the decision tree classifier using word features. The best corresponding accuracy for Old Swedish was 0.81. Place and decade identification produced lower performance scores for both languages. Which classifier design is the best one seems to depend on peculiarities of the dataset and the classification task. The present study does however support the idea that text classification is useful also for medieval documents characterized by extreme spelling variation.
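As an illustration of two ingredients named above, the following Python sketch extracts character 3- and 4-gram counts from a lowercased text and computes a macro-averaged F1 score. It is not the study's code: the function names `char_ngrams` and `macro_f1` and the toy labels are assumptions made for this example, and the classifiers themselves are left out.

```python
from collections import Counter


def char_ngrams(text, sizes=(3, 4)):
    """Count character n-grams of the given sizes in the lowercased text."""
    text = text.lower()
    counts = Counter()
    for n in sizes:
        for i in range(len(text) - n + 1):
            counts[text[i:i + n]] += 1
    return counts


def macro_f1(gold, predicted):
    """Unweighted mean of the per-class F1 scores over the classes in gold."""
    scores = []
    for label in sorted(set(gold)):
        pairs = list(zip(gold, predicted))
        tp = sum(g == label and p == label for g, p in pairs)
        fp = sum(g != label and p == label for g, p in pairs)
        fn = sum(g == label and p != label for g, p in pairs)
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Hypothetical toy labels: three charters by issuer "a", one by issuer "b",
# and a classifier that always answers "a".
gold = ["a", "a", "a", "b"]
predicted = ["a", "a", "a", "a"]
accuracy = sum(g == p for g, p in zip(gold, predicted)) / len(gold)
score = macro_f1(gold, predicted)
```

On these toy labels accuracy is 0.75 while macro-averaged F1 is about 0.43, because the missed minority class weighs as much as the majority class. This is one reason to report both measures.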

Short Paper (10+5min)

Linked Open Data Vocabularies and Identifiers for Medieval Studies

Toby Burrows1, Antoine Brix2, Doug Emery3, Mitch Fraas3, Eero Hyvönen4, Esko Ikkala4, Mikko Koho4, David Lewis1, Synnøve Myking2, Kevin Page1, Lynn Ransom3, Emma Thomson3, Jouni Tuominen4, Hanno Wijsman2, Pip Willcox5

1University of Oxford, UK; 2Institut de recherche et d'histoire des textes, France; 3University of Pennsylvania, US; 4Aalto University, Finland; 5The National Archives, UK

This paper examines the use of Linked Open Data in the research field of medieval studies. We report on a survey of common identifiers and vocabularies used across digitized medieval resources, with a focus on four internationally significant collections in the field. This survey has been undertaken within the “Mapping Manuscript Migrations” (MMM) project since 2017, aimed at aggregating and linking disparate datasets relating to the history of medieval manuscripts. This has included reconciliation and matching of data for five main classes of entities: Persons, Places, Organizations, Works, and Manuscripts. For each of these classes, we review the identifiers used in MMM’s source datasets, and note the way in which they tend to rely on generic vocabularies rather than specialist medieval ones. As well as discussing some of the major issues and difficulties involved in conceptualizing each of these types of entity in a medieval context, we suggest some possible directions for building a more specialized Linked Open Data environment for medieval studies in the future.

Short Paper (10+5min)

An Artificial Intelligence Approach to Segmenting Medieval Manuscripts with Complex Layouts

Lisandra S. Costiner1, Lizeth Gonzalez Carabarin2

1Merton College, University of Oxford, UK; 2Eindhoven University of Technology, The Netherlands

Digitization initiatives undertaken by libraries, museums and collections around the globe are rapidly increasing the number of manuscript images online. Given the large volume of such data, it is important to devise new ways to automatically process and extract relevant information from these images, saving valuable human time invested in manual transcription and image extraction.

Digitized documents pose a number of challenges for the extraction of relevant information, the key ones being the location of areas of text and illustration. Medieval manuscripts are especially challenging for automatic segmentation. Each surviving book was produced by hand, so its page layout, script, and illustrations vary widely. Furthermore, medieval decorations do not typically conform to uniform rectangular registers: they can be unframed, placed throughout the text at irregular intervals, and extend into page margins. Such documents therefore pose particular difficulties for traditional methods of segmentation designed for printed text, and require instead the development of customized algorithms.

Although many techniques have been developed for image segmentation (Eskenazi et al, 2017), there is a need for a generic tool that is flexible in dealing with a range of documents, low on processing power, and white-box, allowing every step to be queried. This paper proposes such a technique for the automatic identification and extraction of images (illuminations or miniatures), and of lines of text from Western medieval manuscripts.

Algorithms for the extraction of images and texts in layout analysis (segmentation) can be generally divided into three classes. Most of the approaches employed in document segmentation are adapted to specific types of records (Shafait et al, 2008), so there is a need for a global or generic approach that will be able to adapt to different types of documents. Older approaches rely on rule-based algorithms which have reduced versatility, generality, robustness and accuracy when segmenting hand-written documents (Shafait et al, 2008). Recent developments have tended to focus on the use of neural networks (Eskenazi et al, 2017; Gao et al, 2017; Ares Oliveira et al, 2018). Although effective, neural networks (NNs) require manually-annotated data for training, which consumes large amounts of human time; they are computationally heavy; and they are black boxes, meaning that their inner workings are not understood. New approaches with increased versatility, stability, generality, and the ability to perform multi-scale analysis and to handle color remain desiderata (Eskenazi et al, 2017).

The current approach proposes to address these needs. It is based on the k-means algorithm with a very limited number of features. Although k-means has been applied to document segmentation previously, the number of features used in these approaches was large, increasing the computational cost. The current methodology relies on only three features.

Although a number of annotated datasets have been created for the segmentation of historical documents with challenging layouts (Gruning et al, 2018; Simistira et al, 2016), no such dataset exists for illuminated medieval manuscripts. For the current study, a dataset was created from manuscripts with a range of layouts and decorations, containing a variety of texts (devotional and medical), and produced in different regions and time periods. The images, freely available through Digital Bodleian, derive from the following manuscripts in Oxford’s Bodleian Library: MS Canon. Misc 476, MS Add. A. 185, MS Ashmole 1462, MS Auct. 2.2, MS Buchanan e 7.

As a pre-processing step, the image is converted to grayscale, and a uniform filter with a kernel size of 13 is then applied to smooth it. After pre-processing, three features are proposed for clustering.

Once all features are computed and standardized, the k-means algorithm is run with 5 clusters. Additionally, after computing k-means for 120 images belonging to 4 different manuscripts, the centroids of each cluster are calculated and plotted.
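The generic steps described above (uniform filtering, standardization of features, and k-means clustering) can be sketched in plain Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the abstract does not name the three features, so the sketch takes feature vectors as given, and the function names, the border handling of the filter, and the seeded random initialization are all choices made for this example.

```python
import random
from statistics import mean, pstdev


def uniform_filter(gray, size=13):
    """Replace each pixel by the mean of the size x size window around it.

    Windows are clipped at the image border (an assumption of this sketch).
    """
    h, w, r = len(gray), len(gray[0]), size // 2
    out = []
    for y in range(h):
        row = []
        for x in range(w):
            window = [gray[j][i]
                      for j in range(max(0, y - r), min(h, y + r + 1))
                      for i in range(max(0, x - r), min(w, x + r + 1))]
            row.append(sum(window) / len(window))
        out.append(row)
    return out


def standardize(rows):
    """Z-score every feature column; constant columns are mapped to 0."""
    cols = list(zip(*rows))
    mus = [mean(c) for c in cols]
    sds = [pstdev(c) or 1.0 for c in cols]
    return [[(v - m) / s for v, m, s in zip(row, mus, sds)] for row in rows]


def kmeans(points, k=5, iters=50, seed=0):
    """Plain Lloyd's algorithm; returns (labels, centroids)."""
    rng = random.Random(seed)
    centroids = [list(p) for p in rng.sample(points, k)]
    labels = [0] * len(points)
    for _ in range(iters):
        # Assignment step: nearest centroid by squared Euclidean distance.
        labels = [min(range(k),
                      key=lambda c: sum((a - b) ** 2
                                        for a, b in zip(p, centroids[c])))
                  for p in points]
        # Update step: mean of the members; empty clusters keep their centroid.
        updated = []
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            updated.append([mean(col) for col in zip(*members)]
                           if members else centroids[c])
        if updated == centroids:
            break
        centroids = updated
    return labels, centroids
```

In such a setup each pixel (or image block) would contribute one standardized feature vector, and the resulting cluster labels would then be interpreted as regions such as text, illustration, or background. That mapping is not specified in the abstract and is left open here.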

This approach uses clustering and filtering techniques for segmenting challenging illuminated medieval manuscripts. Traditional approaches to text segmentation assume that text regions are enclosed in rectangular shapes, which is not true for many illuminated medieval books. Although k-means and filtering have been previously used for this task, the uniqueness of this approach is its reliance on only three features. The strength of the method further lies in its transparency at every step of the process, low memory use, potential to produce highly refined results, and versatility. This stands as an alternative to programs such as neural networks, which are black boxes, do not allow for querying of their decision-making process, are computationally intensive, and demand manually-annotated training sets. This approach, therefore, provides not only a solution for the segmentation of challenging images with mixed textual and visual content, but more importantly leads towards algorithms with improved robustness, stability and versatility.

Short Paper (10+5min)

Linked Open Data Service about Historical Finnish Academic People in 1640–1899

Petri Leskinen1, Eero Hyvönen1,2

1Aalto University, Finland; 2University of Helsinki, Finland

The Finnish registries "Ylioppilasmatrikkeli" 1640–1852 and 1853–1899 contain detailed biographical data about virtually every academic person in Finland during the respective time periods.

This paper presents first results on transforming these registries into a Linked Open Data service using the FAIR principles.

The data is based on the student registries of the University of Helsinki, formerly the Royal Academy of Turku, that have been digitized, transliterated, and enriched with additional data about the people from various other registries.

Our goal is to transform this largely textual data into Linked Open Data using named entity recognition and linking techniques, and to enrich the data further based on links to internal and external data sources and by reasoning new associations in the data. The data will be published as a Linked Open Data service, on top of which a semantic portal "AcademySampo" is provided, with tools for searching, browsing, and analyzing the data in biographical and prosopographical research.

Short Paper (10+5min)

Personal Names as Mirrors of the Past in Medieval Northwestern Russia

Jaakko Raunamaa, Antti Kanner

University of Helsinki, Finland

Names are a linguistic universal: they occur in all known languages of the world. Names are used to identify individual people, places, and other referents. Furthermore, names are connected to the culture surrounding them. For example, Bedouins living in North Africa have different ways of naming places and people than the Finns living in Northern Europe. Similarly, many Finnic and Sami (Finno-Ugric language groups) place names occurring in Northern Russia prove that Finno-Ugric tribes inhabited these areas earlier. In other words, names preserve information about their users and can give researchers clues about what has happened in the past (Ainiala et al. 2012: 13‒29).

This paper introduces the personal name system used at the end of the 15th century in Northwestern Russia. More precisely, the study focuses on the personal names attested in the census books of Novgorod (AD 1499‒1563). These contain over 10 000 personal names and cover large areas in Northwestern Russia. The aim is to examine what kinds of personal names were used in the area and what kinds of regional differences can be found in name usage. The study concentrates in particular on the northern areas of the Novgorod Republic that supposedly had a Finnic population. The goal is to learn whether the personal names used in Finnic areas differ from those used elsewhere. Finally, the results are compared to archaeological, genetic and linguistic research, and a broader overview of the settlement history of medieval Northwestern Russia is presented.

Since Northwestern Russia, and especially its northern part, was remote and sparsely populated before the modern era, there is only a limited number of historically important sources, such as archaeological finds or written documents. Thus, the history of Northwestern Russia is full of questions and uncertainties. Researchers interested in history have long used linguistics and onomastics in order to create a more comprehensive picture of the past (e.g. Rjabinin 1997 and Sedov 1982). However, the use of names as source material is, in many cases, small-scale and limited. The studies are often either regionally restricted or analyze only a limited number of names. In addition, many history-oriented studies rely only on contemporary name data.

To some extent, the above-mentioned problems can be explained by the methods and materials that have been used in the past. More precisely, analogue materials, such as written documents or hand-drawn maps, have not allowed researchers to create comprehensive studies based on names. The situation is now different, since digital methods can be used to overcome the problems that earlier studies had. Many tasks that were previously considered too time-consuming, like collecting thousands of names from documents, can now be done on a computer.

This study relies on methods that the development of digital humanities has made possible. First, the research material is compiled from the editions of the Novgorod census books by scanning the pages and using OCR to create editable copies of the texts. The census books from the area of the Novgorod Republic were produced on the orders of the Grand Prince of Moscow. The Grand Duchy of Moscow had subjugated the city-state and its possessions before the end of the 15th century. The ruler wanted to know how much income the Grand Duchy of Moscow should receive from the newly conquered area, and thus the Muscovites ordered the tax documentation after the conquest had been finalized (Nevolin 1851: i‒xii). The documents are written in (Old) Russian. The sources chosen for this study are edited versions of 15th and 16th century census books (NPK III, IV; POKV; PKOP). These transcriptions were mainly done at the turn of the 20th century. The study area is presented on a map below (pdf-file). The material contains approximately three thousand pages, covering around 10 000 villages and over 20 000 homesteads. Taxpayers are grouped into homesteads (in Russian dvor). One homestead usually has one owner, but sometimes other people are named as well, such as the brother(s), adult son(s), nephews and other relatives of the owner. All the census books are divided into parishes (in Russian pogost), which are typically named after the location of the main church or after the monasteries or local nobles who had the right to collect the taxes.

The structured pattern of the census books simplifies the process of collecting taxpayers’ personal names. For example, in census book POKV, which covers the areas of the Karelian Isthmus and the western shores of Lake Ladoga, the pattern is almost always the following: “Деревня Дуброва, (д) Фомка Ивашковъ, (д) Онтушко Ивашковъ;” (‘Village Dubrova, (d)vor Fomka Ivaškovŭ, (d) Ontuško Ivaškovŭ;’). A Python script was written to exploit the regular format of these records and harvest the personal names mentioned. The output is a data matrix that contains the frequencies of personal names for each parish, including main names (e.g. Ontuško) and bynames, such as patronyms (Ivaškovŭ) or descriptive ones (Volkŭ ‘wolf’).
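The kind of pattern-based harvesting described above might look roughly like the following Python sketch. It is a hedged illustration only, not the script used in the study: the regular expressions, the function names `parse_entry` and `name_frequencies`, and the rule that the first word is the main name and the rest the byname are assumptions based solely on the single example entry quoted above.

```python
import re
from collections import Counter

# Hypothetical patterns derived from the one quoted entry.
VILLAGE = re.compile(r"Деревня\s+([^,;]+)")
HOMESTEAD = re.compile(r"\(д\)\s*([^,;]+)")


def parse_entry(entry):
    """Return (village, [(main_name, byname_or_None), ...]) for one entry."""
    match = VILLAGE.search(entry)
    village = match.group(1).strip() if match else None
    people = []
    for owner in HOMESTEAD.findall(entry):
        parts = owner.split()
        if not parts:
            continue  # skip an empty homestead field
        # Assumption: first word is the main name, anything after it a byname.
        people.append((parts[0], " ".join(parts[1:]) or None))
    return village, people


def name_frequencies(entries):
    """Count main names over all entries belonging to one parish."""
    counts = Counter()
    for entry in entries:
        _, people = parse_entry(entry)
        counts.update(main for main, _ in people)
    return counts


entry = "Деревня Дуброва, (д) Фомка Ивашковъ, (д) Онтушко Ивашковъ;"
village, people = parse_entry(entry)
```

Running `name_frequencies` once per parish would yield one row of a parish-by-name frequency matrix. Real entries are likely to contain exceptions (additional relatives, abbreviations, OCR errors) that a production script would have to handle.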

This allows for systematic statistical measurement of similarity across the parishes. Classification of names makes it possible to evaluate how the measured similarities are caused by names belonging to, for example, different parishes. Comparing naming practices for similarity is not a straightforward task, since there is no straightforward definition of a naming practice. However, simply applying different distance measures highlights different aspects of the use of personal names: cosine similarity highlights the widest overall trends, and the Jaccard index the selection of names. A hierarchical clustering algorithm makes it possible to cross-reference the similarity of naming practices with geographical data to see whether area-based clusters emerge. Together these approaches contribute to forming a holistic interpretation of how names expressed linguistic and ethnic identities in the northern areas of the Novgorod Republic.
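To make the contrast between the two measures concrete, here is a small Python sketch of cosine similarity and the Jaccard index applied to name-frequency tables. The parish tables and their counts are invented for illustration and do not come from the census books. The function names are likewise assumptions of this example.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two name-frequency tables (dicts)."""
    dot = sum(count * b.get(name, 0) for name, count in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def jaccard_index(a, b):
    """Share of distinct names that the two tables have in common."""
    names_a, names_b = set(a), set(b)
    union = names_a | names_b
    return len(names_a & names_b) / len(union) if union else 0.0


# Hypothetical frequency tables for two parishes (invented counts).
parish_x = {"Ивашко": 8, "Фомка": 2}
parish_y = {"Ивашко": 1, "Фомка": 1, "Онтушко": 1}

cos_xy = cosine_similarity(parish_x, parish_y)
jac_xy = jaccard_index(parish_x, parish_y)
```

Cosine similarity is sensitive to how often each name is used, so it is dominated by the most frequent names, whereas the Jaccard index looks only at which names occur at all. A matrix of such pairwise values could then serve as input to a hierarchical clustering step.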

One of the main aims of this study is to focus on the northern areas of the Novgorod Republic that supposedly had a Finnic population. This area was bordered in the northwest by the Diocese of Åbo, which was the eastern part of the Realm of Sweden. Mostly Finnic-speaking tribes, such as Ingrians, Karelians and Savonians, inhabited the border area. The emergence of these groups is a continuously discussed question among scholars, but it is known that they share many similarities in archaeological finds dated to the Late Iron Age (AD 1000‒1200) (Uino 2003: 300‒400) and in linguistics as well (Frog & Saarikivi 2012). Thus, it is worthwhile to compare the personal names attested in the Novgorod census books to those attested in the Swedish taxation documents concerning the border area. The reference material, altogether approximately 2000 names, consists of personal names used in 1561 in the parish of Juva in the Savo region and of names used in 1545 in the parish of Kivennapa on the Karelian Isthmus. Finnic names, such as main names or clan names, are particularly interesting because they were used on both sides of the border: e.g. in Kivennapa Kaupi Nousia and in Kir’jažskij pogost (in Finnish Kurkijoki parish) Kiridko Novzejevъ.

Measuring and evaluating the census book data and comparing it to material collected from Swedish documents creates many valuable new perspectives on the history of Northwestern Russia. The results demonstrate how different personal names were distributed and used in the study area. This outcome is compared to the latest archaeological, linguistic and genetic research, which allows us to create a comprehensive picture of the directions of cultural impacts and settlement movements in medieval Northwestern Russia. In addition, the results reveal those areas that were inhabited by people using Finnic names or Finnic forms of Christian names at the end of the 15th century.


Ainiala, Terhi, Minna Saarelma & Paula Sjöblom 2012: Names in Focus. An Introduction to Finnish Onomastics. Studia Fennica. Linguistica 17. Helsinki: Finnish Literature Society.

CHR = Maureen Perrie (ed.) 2006: The Cambridge History of Russia: Volume 1, From Early Rus' to 1689. Cambridge: Cambridge University Press.

Nevolin 1853 = Неволин, К. А.: О пятинах и погостах новгородских в XVI веке, с приложением карты. Из Записок Императорского русского географического общества, Кн. VIII. Санкт-Петербург: Тип. Имп. Акад. наук.

NPK III = Новгородские писцовые книги. Переписная окладная книга Водской пятины 1500(7008) года. Часть 1. Санкт-Петербург: Археографическая Коммиссия. 1868.

NPK IV = Новгородские писцовые книги. Переписная оброчная книга Шелонской пятины. 1498, 1539, 1552-1553. Санкт-Петербург: Археографическая Коммиссия. 1886.

PKOP = Писцовые книги Обонежской пятины : 1496 и 1563 гг. Ленинград: Академия наук Союза Советских Социалистических Республик. Археографическая комиссия. 1930.

POKV = Переписная окладная книга по Новугороду Вотьской пятины : 7008 года. Москва: Временник Московского общества истории и древностей. 1851.

Rjabinin 1997 = Рябинин, Е. А.: Финно-угорские племена в составе Древней Руси : к истории славяно-финских этнокультурных связей. Историко-археологические очерки. Санкт-Петербург: Изд-во С.-Петербург. ун-та.

Ronimus, J.V. 1906: Novgorodin vatjalaisen viidenneksen verokirja v. 1500 ja Karjalan silloinen asutus. Helsinki: Suomen historiallinen seura.

Sedov 1982: Седов, В.В.: Восточные славяне в VI-XIII вв. Москва: Российская академия наука.