Conference Agenda

Topic Modeling, Stylometry, Poetry
Friday, 20/Mar/2020:
11:20am - 12:50pm

Session Chair: Katrine F. Baunvig
Location: Hall D
Level -1

Long Paper (20+10min)

Verse Form & Meaning: Using Topic Modeling to Trace Semantic Patterns Within Poetic Meters

Artjoms Šeļa1, Boris Orekhov2, Roman Leibov1

1University of Tartu, Estonia; 2Higher School of Economics, Russia


Our paper addresses an established theory in versification studies known as the "semantic halo of a meter". Popularized mainly in the works of scholars of Russian literature (Kirill Taranovskii and Mikhail Gasparov), in its general form it states that the distribution of meanings across different metrical forms (and their variations) is non-random. For example, iambic trimeter will retain certain semantic features over the course of its history, and the configuration of these features will never completely overlap with that of other metrical forms. The existence and accumulation of these metrical differences could, as has been suggested, form a "semantic valency" of a meter or its "expectations horizon" for a reader.

Despite being well established, the theory of the "semantic halo" has been one of the less rigorous avenues of work in quantitative versification studies. Tracing high-level semantic patterns across a whole tradition was a meticulous task that was hard to formalize. Scholars were able to explore more distinctive, less populated metrical forms (notably, Russian trochaic pentameter), but struggled to describe the mechanisms behind the appearance of the halo and the overall structure of the relationships between meters in semantic space.

In this paper we propose an operationalization of the "semantic halo" using topic modeling on a corpus of Russian poetry from the 1800s to the 1950s. We model each meter as the aggregated topic probabilities of individual poems (composed in the corresponding meter) and then use an entropy metric to compare the probability distributions. We show that, based on these vectors of semantic features, meters cluster together in a non-random fashion: this strongly suggests that the "semantic halo" theory holds true for a large-scale corpus when accounting for every metrical variation in it. We then suggest that metrics derived from topic models could be used to answer fundamental questions about the nature of the "semantic halo" and even to compare "halo effects" across languages and national traditions.

Corpus & preprocessing

For the study we use a dump of a corpus of Russian poetry of the 19th to mid-20th century (a subcorpus of the Russian National Corpus). To avoid overrepresented topics from large texts, we filter out poems that are too short or too long by number of lines (4 ≤ lines ≤ 100); the final set includes 58,000 texts. The Ruscorpora data provide metrical annotation, which we further clean to synchronize and simplify the notation. For this experiment only "classical" verse forms were used (with some exceptions): accentual-syllabic monometric verse or verse with regularly alternating line lengths (e.g. Iamb-4/Iamb-3).
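The length filter described above can be sketched in a few lines of Python (a hypothetical illustration, not the authors' code; the treatment of blank stanza-break lines is an assumption of this sketch):

```python
def keep_poem(text, min_lines=4, max_lines=100):
    """Keep a poem only if its line count lies within [min_lines, max_lines].

    Blank lines (stanza breaks) are not counted as verse lines -- an
    assumption made for this sketch.
    """
    n_lines = sum(1 for line in text.splitlines() if line.strip())
    return min_lines <= n_lines <= max_lines

# A two-line fragment is filtered out; a quatrain is kept.
fragment = "line one\nline two"
quatrain = "\n".join("line %d" % i for i in range(1, 5))
```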

Corpus preprocessing included 3 steps:

1. Texts were lemmatized ("mystem" morphological analyzer, v3.0) and only lemmas were used for the analysis;

2. Stop words were removed (conjunctions, prepositions, pronouns);

3. We reduced the lexical variance in the corpus. Topic models generally work better on mid-sized corpora when the number of word types can be decreased (e.g. using only nouns in the model makes extremely sparse matrices a little less sparse). For this reason, we used only the 5,000 most frequent words across the corpus for model building. Words outside the top 5,000 were checked for their top-10 closest semantic neighbors in a vector space representation (a word2vec model was built with the "gensim" Python package on the same corpus; similarity was measured by cosine similarity). If a word had a closest neighbor in the top-5,000 list, the word was replaced by its neighbor; if not, the word was simply discarded. This dramatic decrease in poetic nuance was made deliberately to mimic the high-level semantic abstraction to which poetic themes were reduced by previous scholars (e.g. Night, Death, Love, etc.)
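The replacement logic of step 3 can be sketched as follows. In the actual pipeline the neighbor lists would come from a gensim word2vec model (e.g. `model.wv.most_similar`); here a hand-made `NEIGHBORS` mapping and a tiny top-word set stand in for it, so all names and entries are illustrative:

```python
# Stand-in for the top-5000 most frequent lemmas.
TOP_WORDS = {"noch", "son", "liubov"}

# Stand-in for word2vec nearest neighbors, closest first
# (in the real pipeline: model.wv.most_similar(word, topn=10)).
NEIGHBORS = {
    "polnoch": ["tma", "noch"],   # midnight -> darkness, night
    "dremota": ["son"],           # drowsiness -> sleep/dream
    "xyzzy": ["abraxas"],         # no neighbor inside the top list
}

def reduce_token(token):
    """Map a token into the restricted vocabulary, or return None to drop it."""
    if token in TOP_WORDS:
        return token
    for candidate in NEIGHBORS.get(token, []):
        if candidate in TOP_WORDS:
            return candidate      # replace rare word by its closest in-list neighbor
    return None                   # no in-list neighbor: discard the word

def reduce_text(tokens):
    return [t for t in (reduce_token(t) for t in tokens) if t is not None]
```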


One LDA topic model with Gibbs sampling was trained on the whole corpus, with each document being a single poem (the "topicmodels" package for R). We settled on 120 topics after several coherence tests and after controlling for their even distribution across the corpus. Hyperparameters were set to alpha=0.1 and delta=0.1 (i.e. we do not assume that one text is generated by only one topic, or that one topic contains one very probable word). Because it has been argued that the "semantic halo of a meter" follows a somewhat distributional logic (it is formed not by distinct non-overlapping themes, but rather by a unique configuration of common poetic topics), the design of LDA suited our purposes well. It allowed us to model each text as a probability distribution over all 120 topics, with several topics potentially accounting for the "generation" of a single text with higher probability than others.

After the model was complete, we used the metadata in the document labels to aggregate the topic probabilities of single documents ("gamma") by their meter. Thus each classical accentual-syllabic meter is represented as a distribution over the averaged probabilities of 120 topics. We can then measure the entropy-based similarity of these distributions to assess the semantic relations between different metrical forms (represented by equal samples of poems for each sufficiently frequent meter) and build a hierarchical clustering.
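The aggregation and comparison steps can be sketched as follows. The abstract does not name the exact entropy metric, so Jensen-Shannon divergence, one common entropy-based distance between probability distributions, is assumed here purely for illustration:

```python
import math
from collections import defaultdict

def aggregate_by_meter(gammas, meters):
    """Average per-poem topic distributions ("gamma") within each meter."""
    groups = defaultdict(list)
    for gamma, meter in zip(gammas, meters):
        groups[meter].append(gamma)
    return {
        meter: [sum(col) / len(rows) for col in zip(*rows)]
        for meter, rows in groups.items()
    }

def js_divergence(p, q):
    """Jensen-Shannon divergence (base 2): 0 for identical distributions, 1 at most."""
    def kl(a, b):
        return sum(x * math.log2(x / y) for x, y in zip(a, b) if x > 0)
    m = [(x + y) / 2 for x, y in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)
```

The resulting meter-to-meter divergences could then be fed into any hierarchical clustering routine (the authors use Ward's method).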


We won't discuss the resulting topics of our model at length: human assessment could mark them as "satisfactory" and issue coherent labels (Night landscape, Power, Silence/Sound, etc.), but we are not looking for topical words here. We doubt that LDA can reveal something new about poetic language when the latter is taken at such a large scale, with so many semantic approximations made. On the contrary, we find LDA's strength in providing a coherent (and, possibly, simplistic) representation of poetic language in one semantic space. We would like to use this representation to our advantage.

The hierarchical clustering (Ward's method) of metrical forms shows that semantic information alone is enough to find similarities between and within metrical "lineages": iambs tend to be similar to iambs and trochees to trochees, with other distinct clusters of ternary meters (dactyl/anapest/amphibrach). The clusters also remain consistent when the random sampling of poems is repeated numerous times, as shown by a majority-rule consensus tree. As we do not really have a "ground truth" to strictly establish semantic relations among meters, our results at this stage should be treated as a general sanity check, while nevertheless strongly suggesting evidence for the "semantic halo" theory on a large scale.


Our results could be used to advance the discussion on the nature and origins of the "halo effect". We argue that meter in this case could be understood as a mnemonically strong form that limits the transmission of meaning, bounding it to more or less distinct "lineages". If there were no limitations on semantic possibilities imposed by metrical form, then all meters would eventually converge to similarly shaped distributions of topics. This is clearly not the outcome of our experiment, suggesting a primary role of meter in carrying semantics. This also means that when a poem in a 'rare meter' emerges and is copied, new copies are, in general, more likely to be semantically similar to the initial founder(s) than to the existing large population of poems. This case of reduced variation in small separated populations is also widely known as the "founder effect".

Short Paper (10+5min)

What is Russian Elegy? Computational Study of a Nineteenth-Century Poetic Genre

Antonina Martynenko

University of Tartu, Estonia; Institute of Russian Literature, Russian Academy of Sciences, Russia

This presentation is dedicated to computational approaches to the study of a poetic genre, namely the Russian elegy. I will try to show that quantitative and computational approaches to poetic corpora can give significant results in studying the development of literary genres.

At the beginning of the 19th century, elegies were widely cultivated in Russian poetry as a result of the influence of European literatures. While the first examples were translations of English, French and Latin elegies [1], from the 1810s onwards poets produced a large number of original elegies in the Russian language. Thus, the period between the late 1810s and early 1830s is considered to be the most important stage in the development of the Russian elegy. However, despite the importance of the genre as a whole, most literary scholars have analyzed only canonical elegiac poems (e.g. Pushkin's elegies) and paid little attention to the large population of elegies published by minor authors or anonymously [2].

In order to examine the history of the elegy at the macro level, a corpus of Russian poems titled 'an elegy' and published between 1815 and 1835 was compiled for this study [3]. The corpus includes 509 poetic texts and retains punctuation, line division and rhymes according to the historical sources. The texts are provided with metadata such as year of publication, bibliographical references, and verse characteristics (meter and number of feet, rhyme scheme). 390 of the 509 elegiac poems were gathered from periodicals, so they can be dated more precisely than those from poetry collections; the analysis below is based only on well-dated texts. In addition, a part of the poems in the collection is digitized and introduced as a research object for the first time. As a result, besides the canonical elegiac poems mentioned above, elegies by minor writers are well represented in the corpus (for example, poems written by D.P. Glebov, P.A. Pletnev, V.I. Tumanskij, A.S. Norov, I.P. Borozdna, V.N. Grigorjev, and many others). Hence, the elegiac poems gathered in the corpus aim to represent the historical meaning of the genre title more properly than preceding collections of canonical elegies [4].

The corpus metadata overview leads to important conclusions about the authors of the elegies. At the beginning of the 1820s, the most remarkable young poets were engaged in writing elegies (particularly Alexander Pushkin and Yevgeny Baratynsky). This implies that the elegy was initially a promising genre elaborated by renowned Russian poets. However, by the early 1830s the genre seemed to be popular mostly among novice non-professional poets who began to present themselves as romantic elegists. It is assumed that the latter had a great influence both on the decline in the prestige of elegies and on the content of the poems themselves.

The corpus provides the opportunity to study the content of the elegies and their formal features in order to test the hypothesis that elegies changed significantly during the 1820s. The analysis of lexical frequencies shows that the key notions of the elegy genre are love and melancholy. The lexical features of the corpus of elegies were then analyzed in comparison with the general Russian poetic language of the period (the poetic subcorpus of the Russian National Corpus was used as a contrast corpus). The most distinctive words for the elegies were detected using the log odds ratio: these are nouns that express emotions and abstract notions, mostly connected with the theme of love (such as "love", "tear", "heart", "dear", etc.) and loss ("sorrow", "sad", "wither", "vainly"), as well as words and collocations for parting ("the last time", "everything disappears", "tears of heart").
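The log odds ratio used here for keyword detection can be sketched as follows (a minimal smoothed version; the smoothing constant and the exact formula variant used by the author are assumptions of this sketch):

```python
import math

def log_odds_ratio(count_target, total_target, count_ref, total_ref, alpha=0.5):
    """Smoothed log odds ratio of a word in the target corpus (elegies)
    versus a reference corpus; positive values mark words typical of elegies."""
    odds_target = (count_target + alpha) / (total_target - count_target + alpha)
    odds_ref = (count_ref + alpha) / (total_ref - count_ref + alpha)
    return math.log(odds_target / odds_ref)
```

Ranking all vocabulary items by this score puts the most distinctively "elegiac" words at the top of the list.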

However, the themes detected by the lexical analysis are represented differently in elegies published in different periods. As mentioned above, the changes may be connected with novice authors' turn to the elegy in the late 1820s. To test the hypothesis of thematic change in elegies between the 1810s and 1830s, a topic model was created (LDA, R package 'topicmodels') [5]. The model shows that different themes are distributed unequally across the period under consideration. In elegies published between 1815 and 1825, thematic diversity is higher than in the late 1820s. For instance, at the end of the 1810s elegies were likely to describe historical events; the importance of the historical theme in elegies is explained by the influence of Napoleonic Wars poetry. Also, in the elegies published in the early 1820s, pastoral scenes appear more often, as do scenes of mourning someone's death. Both of these themes are connected with exemplary Latin elegies, which, according to the corpus, had lost their influence by the mid-1820s. Based on the distribution of topics in the model, the period between 1825 and 1835 should be described as the emergence of the theme of romantic love. The variety of themes in the corpus decreases significantly at the end of the period under consideration, and ultimately the elegy became a short love poem close to a madrigal.

The latter conclusion is supported by quantitative analysis of the elegies' formal features. The study of the metrical repertoire shows that the decrease in thematic diversity happens simultaneously with a reduction of metrical variation in the corpus. In the elegies published before 1825, a number of different meters were used: above all, free iambic verse, iambic hexameter, and iambic verse with regular alternation of hexameter and pentameter lines ('iamb-65'). However, by the end of the 1820s more than half of the poems in the corpus were written in iambic tetrameter alone.

Another formal feature worth considering is the length of the elegies in lines. Statistical analysis shows that the length of a poem is strongly correlated [6] with the year of publication: both the mean and median lengths of poems, aggregated by year, demonstrate a significant decrease in elegies' length, roughly from 60 to 30 lines, over the period from 1815 to 1835 [7].
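The reported correlation can be reproduced in outline with a plain Pearson coefficient; the yearly mean lengths below are invented solely to illustrate the shape of the trend (roughly 60 down to 30 lines), not the author's actual data:

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

years = [1815, 1820, 1825, 1830, 1835]    # illustrative sample of years
mean_lengths = [60, 52, 45, 38, 30]       # invented means, falling 60 -> 30
r = pearson(years, mean_lengths)          # strongly negative correlation
```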

Thus, the computational study of the elegy genre leads to the following conclusions: between 1815 and 1835, elegies became more thematically homogeneous and shifted from a variety of meters to iambic tetrameter; at the same time, the size of a poem decreased significantly. These findings, supported by quantifiable results, make visible the processes specific to the elegy in the 1820s, the last period of the genre's large-scale cultivation. Moreover, the prepared corpus could be used as training data for further genre classification, and the results of such a classification will be presented in conclusion. Based on the elegiac features derived from the corpus, poems close to elegies will be extracted from the poetic subcorpus of the Russian National Corpus and then compared to existing compilations of elegiac texts.

[1] See: Frantsuzskaia Elegiia XVIII-XIX vekov v perevodakh poėtov pushkinskoj pory [Russian Elegy of the 18th and 19th Centuries Translated by the Poets of Pushkin's Time], edited by Vadim Vatsuro and Vera Mil'china. Moscow, 1989.

[2] See, for example, important studies on the development of Russian elegy by Irina Semenko (Semenko, Irina. Poėty Pushkinskoj Pory [Poets of Pushkin's time]. Moscow, 1970) and Vadim Vatsuro (Vatsuro, Vadim. Lirika Pushkinskoj Pory ["Lyrics of Pushkin's time"]. Saint-Petersburg, 1994) both focused on the poetry of Pushkin's closest associates.

[3] The corpus is available in a GitHub repository (text ids do not correspond to the actual number of texts in the corpus).

[4] Cf.: Russkaia elegiia XVIII - nachala XX veka [Russian elegy between the 18th and the beginning of the 20th century], edited by Leonid Frizman. Leningrad, 1991.

[5] The problems regarding the application of LDA topic modeling to poetical corpora were discussed in: Navarro-Colorado, Borja. “On Poetic Topic Modeling: Extracting Themes and Motifs From a Corpus of Spanish Poetry.” Frontiers in Digital Humanities, 5:15, 2018.

[6] r = -0.6 and -0.8 for correlation between year and mean / median lengths respectively.

[7] See the similar conclusion about the decline in poem length based on the study of an all-genre corpus of Russian poetry: Shelya, Artjom, and Oleg Sobchuk. "The shortest species: how length of Russian poetry changed (1750—1921)". Studia Metrica et Poetica, 4.1, 2017, 66–84. The reduction in elegies' lengths therefore shows that the corpus of elegies adequately represents processes that happened in the poetry of this period.

Short Paper (10+5min)

Text Mining Themes of the Urban Night in Historical Literary Corpora

Hanne Emilia Juntunen

Tampere University, Finland

In this presentation, I will go over how I have used the text mining method of topic modelling, together with historical text corpora, to discover salient themes associated with the literary urban night.

The study falls under the umbrella of digital literary studies. It focuses on large-scale historical thematic trends, which are difficult to study with traditional literary methods. As is usual in literary studies, the object of this study is a theme, rather than an era, certain authors, or places. Specifically, the interest lies with the subthemes, the themes that cluster around a larger theme: the urban night. Topic modelling, supported by corpus linguistic methods, was used to discover the most salient themes, or topics, associated with the urban night in the data.

The topic modelling approach was chosen for the study of the literary urban night because it has so far been studied with qualitative methods, applied to a relatively small number of texts, to produce generalizable statements about its subthemes; a quantitative approach is therefore both relevant and lacking. The study has a large timeframe, looking at literary texts from the 1500s to the 1920s. This trajectory represents the historical period in which the urban night evolved into the phenomenon we now recognize: before the sixteenth century, walking outside at night was illegal in most major European cities! The timeframe also foregrounds the historical thematic trends under consideration: idiosyncratic and short-lived trends get lost in the large mass of texts. Similar timeframes are, moreover, employed with qualitative methods in studies of the literary (urban) night more generally. As such, the questions the study set out to answer were whether the method would produce such consistent historical thematic trends, and whether the results would challenge or support the established understanding of the literary urban night.

The data used in this study comprises several full-text corpora of literary texts: Early English Books Online, Eighteenth Century Collections Online, the Corpus of Late Modern English Texts, the Corpus of English Novels, and the Tampere Corpus of English Novels. Together, they span the years 1500-1923. All of these digital resources, except for the last one, are freely available for research use. Some of them contain a mixture of genres and texts from different centuries; these were automatically sorted into centennial, literary-only and mixed-genre subcorpora (parts of a larger corpus), with random manual checks. American literature was also removed from the corpora by hand. The corpus obtained by these operations contains 7,797 full literary texts.

In order to apply the text mining method of topic modelling to the data, the literary urban night first had to be operationalised. That is, the qualitative feature of storytelling, the theme of the night of the city, had to be transformed into a quantifiable and measurable variable. It was decided to focus on explicit mentions of the night in the context of the town, as this was easiest to detect automatically. Furthermore, a comparison of the occurrences of words that signal a nocturnal setting in literature showed that these words (such as 'lamp', 'candle' and 'moon') occur in similar patterns or with notably lower frequency; taking other words into consideration was thus deemed unnecessary. A corpus tool was used to extract texts that contain the nodewords ('night', 'nocturnal' and 'nyght') in the desired context, i.e. with the words 'town', 'city' or 'urban' within a 40-word window. These texts were then compiled into a corpus of their own, forming the urbannight subcorpus of 1,686 full texts.
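The nodeword-in-context extraction can be sketched as below (a simplified stand-in for the corpus tool actually used; the tokenization and punctuation handling are assumptions of this sketch):

```python
NODEWORDS = {"night", "nocturnal", "nyght"}
CONTEXT_WORDS = {"town", "city", "urban"}

def mentions_urban_night(text, window=40):
    """True if a nodeword occurs within `window` tokens of a context word."""
    tokens = [t.strip(".,;:!?\"'()").lower() for t in text.split()]
    node_pos = [i for i, t in enumerate(tokens) if t in NODEWORDS]
    ctx_pos = [i for i, t in enumerate(tokens) if t in CONTEXT_WORDS]
    return any(abs(i - j) <= window for i in node_pos for j in ctx_pos)
```

Texts for which this predicate holds would be copied into the urbannight subcorpus.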

The new urbannight subcorpus was then used to extract the themes. The text mining method used was Latent Dirichlet Allocation (LDA). Topics extracted from the full texts produced a set that was limited both in its internal consistency and in its variety (3-5 topics per centennial subcorpus). These topics were very generic and most likely reflect the global themes of the texts, i.e. themes that the texts as a whole thematise in many different contexts. However, the focus of the study was on subthemes, themes local to the urban night. To get at this local level, several chunking options were tested. The best solution was judged to be chunking around the nodewords, including 500 words to the left and 500 to the right. This enabled the modelling to focus on only the most relevant parts of the full texts. Lastly, these extracts were lemmatized, that is, different word forms ('write/writes') were collapsed into the base form using the topic modelling software's own WordNet lemmatizer, and analysed separately using the bag-of-words method with Euclidean distance. Each century produced a set of topics that varied slightly in number (10-14), with some variation in parameters (lower threshold at 0.05-0.10 and upper threshold at 0.15-0.25). The topics obtained in this way were more internally consistent and represented a wider variety of phenomena than both the full-text and the sequentially chunked topics, which were rather similar to each other and lacked some areas (e.g. entertainment) that were prominent in the nodeword-chunked data.
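The nodeword-based chunking can be sketched as follows (window size from the text; keeping overlapping windows around nearby nodewords as separate chunks is an assumption of this sketch):

```python
NODEWORDS = {"night", "nocturnal", "nyght"}

def nodeword_chunks(tokens, span=500):
    """Extract a window of `span` tokens on each side of every nodeword."""
    chunks = []
    for i, token in enumerate(tokens):
        if token.lower() in NODEWORDS:
            start = max(0, i - span)           # clamp at the start of the text
            chunks.append(tokens[start:i + span + 1])
    return chunks
```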

A justifiable criticism aimed at topic modelling is that it is quite subjective. Indeed, since the labelling of the topics is decided by the researcher, the results of the method can be subjective, even spurious. To mitigate this, a common practice from qualitative thematic analysis was adapted for the purposes of this study: double-coding. This simply means that two different people assign themes (or 'code' themes) to the same data independently of each other. The result of this double-coding is then checked for intercoder agreement (ICA), that is, how often the same segment is considered by both coders to fall under the same theme. This is used specifically to counteract the bias that results from the subjective theme assignment of a single researcher. Double-coding is usually done by two humans, but in this case the words composing the topics obtained from LDA were subjected to semantic tagging (assigning tags that indicate the meaning of a word, using the USAS Semantic Tagger). The entire corpus could not be tagged in this way due to two factors: one a technical limitation, the other that the tagger functions best with contemporary language. Only the resulting themes were therefore semantically tagged, and the final labels for the topics were based on the results of both the intuitive labelling and these tags. Despite the relative context-lessness of the tagged words, the tagging turned out to be a valuable and fruitful addition to human intuition, challenging and combating subjective bias in labelling.

The list of historical thematic trends was not yet complete, however. The themes obtained needed to make sense from a literary analytical perspective. Therefore, topics indicating e.g. reported speech and interaction between characters were dropped from the final listing – it is hardly a discovery that novels have characters who talk with each other. The final list of thematic trends does contain some discoveries. The final trends were body & experiences, entertainment, family, rulers, journey, military, and religion. These themes could be found in either four out of five or all five centennial subcorpora, forming the six main lines of historical thematic trends in data.

To minimize the effect of context knowledge, the relevant research literature (i.e. cultural and literary analysis) on the urban night was consulted for comparison only after these final results had been obtained. As noted, prior to this study the urban night had been analysed only with traditional qualitative methods; the themes covered in those studies include entertainment and religion quite prominently; rulers, journey, and family less so but still to a certain degree; and military not at all. The by far most prominent theme in the qualitative studies, lighting, was absent from the results of this study.

The pre-existing research literature relies on a different scale of data and analysis than the present study: at best, the data consists of some hundred texts (per study), whereas the results of this study were obtained from nearly 1,700 texts with a wider variety of authors, cultural status and plot significance of the urban night than qualitative research could encompass. As such, it can be preliminarily concluded that quantitative methods like topic modelling can make a significant contribution to the existing research on the themes of the urban night, as they can produce meaningful thematic trends, and both confirm some aspects of pre-existing analysis and challenge others.

Lastly, the software used for the topic modelling was Orange. It has a graphical user interface and requires minimal coding skills. While this certainly does not remove the need to learn the differences between qualitative and quantitative approaches, it is important to recognize, especially in digital literary studies, that if we want the field to grow and attract contributions from researchers with a classical literary studies training, the tools we use need to be truly available: not only low-cost or completely free of charge, but, most importantly, not requiring extensive pre-existing skill sets. Orange fulfils these requirements.

Long Paper (20+10min)

Exploring the Potential of Bootstrap Consensus Networks for Large-Scale Authorship Attribution in Luxdorph's Freedom of the Press Writings

Florian Meier, Birger Larsen, Frederik Stjernfelt

Aalborg University Copenhagen, Denmark

Authorship attribution (AA) is concerned with the task of determining the true authorship of a disputed text based on a set of documents of known authorship. In this paper, we investigate the potential of Bootstrap Consensus Networks (BCN), a novel approach for generating visualizations in stylometry by mapping similarities of authorial style between texts into the form of a network, for large-scale authorship attribution tasks. We apply this method to the freedom of the press writings (Trykkefrihedsskrifter), a corpus of pamphlets published and collected in Denmark at the end of the 18th century. Across multiple experiments, we find that the size of the constructed networks depends heavily on the type of variables and the distance measures used. Furthermore, we find that, although a small set of unknown-authorship problems can be solved, in general the precision of the BCN method is too low to apply it in a large-scale scenario.