Texts and Lexical Diversity
Friday, 20/Mar/2020:
1:50pm - 3:20pm

Session Chair: Peeter Tinits
Location: Hall A

Long Paper (20+10min)

Studying Semantic Domains in Akkadian Texts

Heidi Jauhiainen, Krister Lindén, Saana Svärd, Tero Alstola, Aleksi Sahala

University of Helsinki, Finland

In the Semantic Domains in Akkadian Texts project, we study semantic fields in texts written in the Akkadian language. Akkadian is an East Semitic language that was spoken and written in ancient Mesopotamia, roughly in the area of modern-day Iraq, c. 3000 – c. 100 BCE. The texts were written in cuneiform script, and the ones we analyze come from the Open Richly Annotated Cuneiform Corpus (Oracc). Oracc is an international cooperative undertaking providing free online editions of texts written mostly in the Akkadian and Sumerian languages. Its text corpora have been created by various projects, and it is one of the largest electronic resources of cuneiform texts.

The snapshot of the corpus we have downloaded from Oracc contains 16,487 texts and almost 2 million words. About half of these texts have been tagged as having been written in the Akkadian language. The basic lexical unit in Oracc is the transliteration of the word, that is, the representation of the cuneiform signs in Latin script. More than half of the words have been annotated with, for example, dictionary forms, word senses, and part-of-speech tags. No syntactic annotation of these texts has been published yet. Since Akkadian is an inflecting language, we have opted to use the dictionary forms of the words when analyzing semantic contexts.

Annotation of a text is always an interpretation by a scholar and, as Oracc is composed of a number of subprojects, the same word can be annotated in different ways in different subprojects, despite the available Oracc guidelines. Therefore, we had to preprocess the texts, for example by normalizing the ways deities and places are referred to. A word may also have several meanings, but homonyms are distinguished by their translation glosses in the annotation. As the texts have been annotated by hand in many different projects, the translation glosses are not always consistent either, so we needed to unify synonymous translation glosses of the words that we wanted to study.

We use Pointwise Mutual Information (PMI) and fastText to find semantic contexts and relations of words in the Akkadian texts. PMI is a popular statistical association measure used in automatic collocation extraction. PMI measures the ratio of observed word co-occurrence probabilities compared with their hypothetical co-occurrence if the word order of the corpus is randomized and all syntactic and semantic information is lost. PMI excels in discovering syntagmatic semantic relationships. FastText is a method that uses an artificial neural network model to generate word vectors which have been shown to model paradigmatic semantic regularities between words. FastText is a variation of a method called word2vec but, in addition to words, fastText takes subword information into account by representing a word as shorter sequences of characters. We, furthermore, use network analysis to study relations between clusters of words in context windows of approximately ten words. Usually, we use at least two of these three methods together to study certain kinds of words and their semantic fields.
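
As a rough illustration of the PMI measure described above, the following minimal sketch scores word pairs that co-occur within a fixed window. It is not the project's implementation: the function name `pmi_scores`, the window size, the way pair probabilities are normalised, and the toy input are all assumptions made for this example.

```python
import math
from collections import Counter


def pmi_scores(tokens, window=5, min_count=1):
    """Return {(w1, w2): PMI} for unordered word pairs co-occurring within
    `window` tokens of each other. Window size and normalisation are
    illustrative assumptions, not the settings used in the project."""
    word_counts = Counter(tokens)
    pair_counts = Counter()
    for i, word in enumerate(tokens):
        # Look only forward so that each co-occurrence is counted once.
        for other in tokens[i + 1 : i + 1 + window]:
            if other != word:
                pair_counts[tuple(sorted((word, other)))] += 1

    n_words = len(tokens)
    n_pairs = sum(pair_counts.values())
    scores = {}
    for (w1, w2), count in pair_counts.items():
        if count < min_count:
            continue
        p_pair = count / n_pairs           # observed co-occurrence probability
        p_w1 = word_counts[w1] / n_words   # probability of each word on its own
        p_w2 = word_counts[w2] / n_words
        # PMI: log ratio of observed co-occurrence to that expected under independence.
        scores[(w1, w2)] = math.log2(p_pair / (p_w1 * p_w2))
    return scores


# Toy input: placeholder tokens standing in for dictionary forms.
toy = ["a", "b", "a", "b", "c", "d"]
for pair, score in sorted(pmi_scores(toy, window=1).items()):
    print(pair, round(score, 2))
```

A positive score means that the pair co-occurs more often than independent word frequencies would predict. On a small corpus, rare pairs tend to receive inflated scores, which is one reason a frequency threshold such as `min_count` is often applied.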

After preprocessing the data, we build a file with each document of the dataset on one line. We then extract collocations with PMI and build word vectors for each word with the collocation PMI values indicating a position in a multidimensional semantic space. We may also build such word vectors with the similarity values produced by word2vec or fastText. We can look directly at the N most informative syntagmatic collocates using PMI or paradigmatic collocates using word2vec and fastText, or we can extract the words in the semantic space that are closest to each of the words we are interested in. The results we get with computational methods are lists of words that are supposedly semantically similar to the words of interest, as they occur in similar contexts. Our dataset is smaller than those generally processed with such tools, which increases the noise in the results, so it is imperative to manually evaluate the automatically proposed results. We have created a workflow for this evaluation.
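
The idea of treating collocation scores as coordinates in a semantic space can be sketched as follows. This is a minimal example under stated assumptions: the helper names (`build_vectors`, `cosine`, `nearest`), the use of cosine similarity as the closeness measure, and the English placeholder words with their scores are all hypothetical and are not taken from the project.

```python
import math
from collections import defaultdict


def build_vectors(pair_scores):
    """Turn {(w1, w2): score} into sparse vectors {word: {collocate: score}}."""
    vectors = defaultdict(dict)
    for (w1, w2), score in pair_scores.items():
        vectors[w1][w2] = score
        vectors[w2][w1] = score
    return dict(vectors)


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(a[key] * b[key] for key in a.keys() & b.keys())
    norm_a = math.sqrt(sum(x * x for x in a.values()))
    norm_b = math.sqrt(sum(x * x for x in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def nearest(word, vectors, n=10):
    """Return the n words whose collocation vectors are closest to `word`."""
    sims = [
        (other, cosine(vectors[word], vec))
        for other, vec in vectors.items()
        if other != word
    ]
    # Highest similarity first; ties are broken alphabetically.
    sims.sort(key=lambda item: (-item[1], item[0]))
    return sims[:n]


# Hypothetical association scores between placeholder words.
toy_scores = {
    ("king", "palace"): 2.0,
    ("ruler", "palace"): 2.0,
    ("king", "throne"): 1.0,
    ("ruler", "throne"): 1.0,
    ("sheep", "field"): 3.0,
}
print(nearest("king", build_vectors(toy_scores), n=3))
```

In this toy data "king" and "ruler" never co-occur directly, yet they end up close together because they share the same collocates, which is the sense in which words "occur in similar contexts".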

We start by visualizing the words and their clusters of similar words as networks with Gephi, an open source visualization tool. We typically build graphs for both the 10 and the 50 semantically closest words. The larger graphs of 50 are used for getting an idea of the wider contextual domain of the words of interest. In the smaller graphs of the 10 closest words, it is usually easier to spot the important links between words and, in some cases, the common contexts as well. The analysis of the networks moves in a hermeneutic circle. After examining the graphs, a better understanding of the word contexts emerges. Different possibilities can be explored by going back to the context of individual words. For this, we use the corpus search tool Korp to further analyze the contexts in which certain words appear together.
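
One way to prepare such neighbour graphs for a tool like Gephi is to write an edge list with `Source`, `Target` and `Weight` columns, which Gephi's spreadsheet import can read. The sketch below assumes that neighbour lists with similarity scores are already available; the function names and the toy data are hypothetical, and the abstract does not state how the project's graphs were actually exported.

```python
import csv
import io


def knn_edges(neighbours, k=10):
    """Build an undirected edge list from ranked neighbour lists.

    `neighbours` maps each word to a list of (other_word, similarity)
    pairs sorted by descending similarity; only the top k are kept.
    """
    edges = {}
    for word, ranked in neighbours.items():
        for other, sim in ranked[:k]:
            key = tuple(sorted((word, other)))
            # Keep one edge per word pair, with the highest similarity seen.
            edges[key] = max(sim, edges.get(key, sim))
    return [(a, b, weight) for (a, b), weight in sorted(edges.items())]


def write_edge_csv(edges, handle):
    """Write the edges as CSV with Source, Target and Weight columns."""
    writer = csv.writer(handle)
    writer.writerow(["Source", "Target", "Weight"])
    writer.writerows(edges)


# Hypothetical neighbour lists for two words of interest.
toy_neighbours = {
    "a": [("b", 0.9), ("c", 0.5)],
    "b": [("a", 0.9), ("c", 0.2)],
}
buffer = io.StringIO()
write_edge_csv(knn_edges(toy_neighbours, k=2), buffer)
print(buffer.getvalue())
```

Calling `knn_edges` once with `k=10` and once with `k=50` would give the smaller and the larger graph from the same neighbour lists.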

Korp is an online service in the Language Bank of Finland provided by FIN-CLARIN. In Korp, the results of a query are presented as concordances. After examining the contexts in Korp, certain possibilities appear more likely than others. We then return to the graphs to distill the idea further and use Korp at the same time to question and examine our analyses of contexts in graphs. The final results of this analytical process are then reviewed in light of previous lexical research on the Akkadian words under scrutiny, as documented, for example, in current lexical resources.

In the presentation, we outline our methodology and some preliminary results that we have reached as members of the Centre of Excellence in Ancient Near Eastern Empires (ANEE) at the University of Helsinki. One of our aims within ANEE is to describe the identity of the ruling elite in ancient Mesopotamia. We showcase three recently concluded case studies. The first is a study of divine names, which demonstrates how the conquering Assyrian elite promoted their identity as rulers through the worship of their main deity Assur, and how he was integrated into the pantheon of established deities at the time. The other two case studies are a study of emotion words and a study of verbs of seeing, which identify special usage contexts for synonyms that have not previously been documented in the available lexical resources for Akkadian. We present the main research results and describe how we arrive at them by analysing semantically similar words in Akkadian texts. Furthermore, we describe how we combine these methods with a hermeneutic evaluation of the results with the help of the corpus search tool Korp.

Long Paper (20+10min)

Using Word Statistics in Studying Variation of Folksongs

Mari Sarv

Estonian Literary Museum, Estonia

Within the field of digital humanities, various methods and tools based on word statistics, such as stylometry, topic modeling and sentiment analysis, have become popular for answering different research questions on the basis of literary texts, written documents or everyday writings. With digital text corpora created on the basis of voluminous folklore collections, methods based on word statistics have the potential to help us gain a better understanding of the essence of folkloric variation.

Variation is an inherent feature of folklore, emerging as a result of the folkloric transmission process, in which the transmitted knowledge is constantly re-created and adapted. At the same time, the process is hard to capture and survey in real-life situations. Statistical analysis of large text corpora enables us to gain insight into the essence and details of variation, tradition flows and regional peculiarities.

My paper explores the possibilities of using word statistics to study the variation of Finnic runosongs on the basis of the material in the Estonian and Finnish runosong databases. In studying the variation of textual folklore, we have to keep in mind that linguistic variation always underlies folkloric variation. In addition to underlying dialectal variation, runosongs use a specific poetic register with archaisms and specific word forms instead of colloquial language. Due to the extreme linguistic variability of runosongs, the word forms cannot be automatically lemmatized or grammatically analyzed. Nevertheless, computational methods based on word statistics, such as stylometry and topic modeling, give us a valuable overview of the regional and topical variation of runosongs. Thanks to the existence of large corpora, we can for the first time draw data-driven outlines of the content and regional division of the tradition; however, because we cannot separate the linguistic and content-related layers in the analysis, the results always reflect both aspects.

Short Paper (10+5min)

Short Texts in the Corpus of Early Written Latvian

Everita Andronova

The Institute of Mathematics and Computer Science, University of Latvia, Latvia

Early written Latvian texts are important sources not only for the humanities but also for cultural and social studies. Unfortunately, being scattered across libraries and archives in different countries, they have not been much investigated; they are largely treated in isolation and in many cases are used for quite narrow purposes. There was a serious lack of general overviews introducing the sources and the studies on them and, more importantly, even now few interdisciplinary studies have been carried out. Fortunately, the last two decades have seen a growth in the popularization and dissemination of the early written sources. The 21st c. brought new chances for lesser-used and lesser-studied languages: the era of digitalization has resulted in the development of various general and special corpora.

The diachronic Corpus of early written Latvian was launched in 2003 and is intended to cover the history of written Latvian of the 16th–18th cc. (Andronova 2007). The aim of the corpus is to facilitate studies of early Latvian in general and to serve as the basis for the Historical dictionary of the Latvian language (this is a good example of successful co-operation between linguists and software engineers in creating a new kind of dictionary in Latvian lexicography; 1200 pilot entries are now available on the web).

The development of the corpus has gone through several phases. Early written Latvian texts have been acquired thanks to close co-operation with Latvian and Lithuanian libraries, as well as with researchers across Europe interested in the history of early Latvian texts. Undergraduate students at the University of Latvia and St. Petersburg State University (Russia) have also been involved in transliterating some texts during the compilation of the corpus. This has served to raise interest in the history of the Latvian language, and subsequently some bachelor's theses have been defended on the basis of these texts.

The first digitized text copies were handed over to the National Library of Latvia in 2002. Some new sources have been discovered since then: for example, a unique copy of Agenda Parva (1622), previously reported as unknown, has recently been published on the website of the Warmia-Mazury Digital Library. We are presently processing the Latvian fragments in this Agenda, which will be added to the corpus.

One of the challenges in this work is the need to compare different editions of the same source and to analyse how different parts of religious texts circulated from one source to another.

One of the advantages of this corpus is that it provides the exact location of a word-form (usually the abbreviation of the source, the page and line number of the text, or the Bible book, chapter and verse). This makes it easy to cite the corpus data accurately. It is also possible to look at facsimiles of the sources, which adds further value to this resource.

All sources in the corpus are included in toto; no samples are chosen. Quite a wide range of short texts has either been added to the corpus recently or is presently in the process of being included; these texts can be divided into three groups:

1) individual short texts, e.g., occasional poetry, oath texts;

2) Latvian texts found in sources written in other languages, e.g., the prayer Pater Noster published in the 16th c.; sentences in Latvian in several editions of ‘Stratagema oeconomicum oder Akker-Student’ written in German by S. Gubert in the 17th c. or Latvian text in Agenda Parva (1622 and later editions);

3) shorter texts in Latvian appended to some individual Latvian sources.

The description of these three groups and the methodology of their inclusion in the corpus is the topic of the present study.

1. Individual short texts

These include both poetry and certain legal texts (different oaths, military court laws). The bulk of the sources in this group is occasional poetry written in the 17th and 18th c.

The beginnings of Latvian occasional poetry have recently been the object of in-depth studies. A broad survey of 16th and 17th c. poetry in its cultural context has been carried out by Māra Grudule (2017). The book shows the long evolution of this type of text: the poems were profoundly influenced by German culture but later gradually turned into Latvian poetry. Three early dedication poems were already added to the corpus in 2016. In 2019 around 70 poems from 15 sources were collected in different libraries and are now in the process of being included in the corpus. One of these new poems is a unicum kept at the Russian National Library: ‘Mūsu visu upurs tai priecas dienā’ (1791). These new poems span a wide thematic range, covering different occasions: birthday congratulations, wedding songs, popular New Year’s wishes that can be printed on cards or written in letters, funeral songs and others.

These songs may be of interest not only for literary and linguistic studies, but also for examining the culture, history and ethnography of Livonia at that time. One can examine the New Year’s dedication poems in ‘Jaunā Gada vēlēšanas pēc ikkatra gribēšanas’ (1781) and ‘Jaunā Gada vēlēšanas’ (1793) not only for literary analysis, but also to understand the soul, psychology and manners of people. Thus, we would like to encourage not only linguistic studies but also other kinds of research by means of the corpus. These texts will be included in the Corpus as individual sources.

2. Latvian inscriptions in texts written in other languages

This group covers single words, phrases, sentences and longer passages in Latvian in books printed in other languages. Latvian proper names (personal names and place names) have been found in several sources dated to the 15th century (e.g. chronicles). The lists of craftsmen's guilds from the 16th c. should be examined and excerpted for the purposes of the corpus. The history of written Latvian begins with the period of the Reformation and Martin Luther's call to use the native language. There are already a number of Pater Noster prayers from the 16th c. in the corpus; before they were included, a linguistic analysis was performed in order to determine which prayers to include (see Vanags 2014).

At the moment two new sources are being processed for inclusion in the corpus:

(1) Agenda Parva (1622) with its texts written in Polish, German, Estonian and Latvian. For the needs of the corpus only the Latvian sentences are excerpted and processed, and a Latvian word-list will be created on the basis of this material.

(2) The popular 17th c. book by S. Gubert, ‘Stratagema oeconomicum oder Akker-Student’ (1st ed. 1645 and later editions in the 17th c.), is a good example of so-called Hausväterliteratur and is a valuable source for ethnographical studies, among others (e.g. the description of tools and agricultural crops known in Livonia at that time; the ‘Bauer=Prognosticon’ for weather forecasting is often mentioned and was later included in the volumes of Latvian beliefs compiled by P. Šmits (1940-1941)). In this book we can find Latvian phrases and, at the end, hymns in both German and Latvian (most probably the songs were translated by S. Gubert himself; the last edition, printed in 1757, lacks the songs). Single words and phrases are encountered within the German sentences, commonly introduced by the phrase ‘die Bauern nennen’, e.g. names of insects (circiņš ‘cricket’), names of plants (vavieriņi ‘marsh tea’), and phrases like dvēsel laiks, lit. ‘time of souls’, meaning ‘time span between Michael’s Day (29th of September) and Martin’s Day (10th of November)’. In this case the whole sentence will be copied and marked as German, but only the Latvian phrase will be included in the word list. All the songs will be included in the corpus in order to facilitate the analysis of the source text in German and its translation into Latvian.

3. Texts in Latvian added (later) to some individual Latvian sources

At the moment we have only one such source: a letter written by the peasant Anšs to the priest Loder, dated June 1771 and added to the transcript of the ‘Lettisches und Teutsches Wörterbuch’ by Ch. Fürecker. This letter has already been included in the Corpus as a separate item.

The development of the Corpus of early written Latvian texts ‘SENIE’ is an ongoing activity carried out within other research projects; in 2018–2020 it is funded by the State Research program ‘The Latvian Language’ (No. VPP-IZM-2018/2-0002).


Andronova, E. The Corpus of Early Written Latvian: current state and future tasks. In: Proceedings of Corpus Linguistics 2007, Birmingham, UK (2007).

Grudule, M. Latviešu dzejas sākotne 16. un 17. gadsimtā kultūrvēsturiskos kontekstos. Rīga (2017).

Vanags, P. Latviešu valodas vēsturiskās vārdnīcas projekts. In: Valodas prakse: Vērojumi un ieteikumi. Rīga (2014), pp. 97–109.

Short Paper (10+5min)

Comparing Word Frequencies and Lexical Diversity with the ZipfExplorer Tool

Steven Coats

University of Oulu, Finland

The ZipfExplorer is a tool for the interactive comparison and visualization of shared word type frequencies for two texts or corpora. The tool can be used to give insight into similarities and differences in textual and discourse content in terms of individual keywords or groups of keywords, and also calculates several measures of lexical diversity for the shared types of the selected texts. A selection of texts and corpora can be analyzed, and users can upload their own files for interactive comparison.
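
The kind of comparison described above can be illustrated with a short sketch that computes relative frequencies for the word types two texts share, together with one simple lexical diversity measure. This is not the ZipfExplorer's code: the tokenizer, the function names and the choice of the type-token ratio are assumptions made for this example, and the abstract does not say which diversity measures the tool reports.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into word tokens (a deliberately simple tokenizer)."""
    return re.findall(r"\w+", text.lower())


def shared_type_frequencies(text_a, text_b):
    """Return {type: (relative frequency in A, relative frequency in B)}
    for the word types that occur in both texts."""
    freq_a = Counter(tokenize(text_a))
    freq_b = Counter(tokenize(text_b))
    total_a = sum(freq_a.values())
    total_b = sum(freq_b.values())
    shared = freq_a.keys() & freq_b.keys()
    return {
        word: (freq_a[word] / total_a, freq_b[word] / total_b)
        for word in sorted(shared)
    }


def type_token_ratio(tokens):
    """Number of distinct types divided by number of tokens (0.0 for empty input)."""
    return len(set(tokens)) / len(tokens) if tokens else 0.0


text_a = "the cat sat on the mat"
text_b = "the dog sat"
print(shared_type_frequencies(text_a, text_b))
print(type_token_ratio(tokenize(text_a)))
```

The type-token ratio is sensitive to text length, so it is most meaningful when the compared texts are of similar size.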

