NER, Annotation, Clustering
Short Paper (10+5min)
Evolving Political Keywords 1945–1989: Clustering Word Distributions in 3 100 Swedish Governmental Official Reports
Umeå University, Humlab, Sweden
The Swedish welfare state is usually associated with a number of political keywords: equality, liberalism, rationalism, internationalism and folkhemmet [the people's home]. Such keywords are by no means static. On the contrary, they tend to change in frequency, and sometimes new terms replace old ones. For instance, some researchers argue that the 1960s and 1970s were a period of radical change that also affected the vocabulary used to describe the shifting trends and modes in society (e.g. Bjereld & Demker, 2018). Similarly, some media historians have argued that people during this period saw a need for a new language that could describe the shifting media and information landscape of the time (Hyvönen et al., 2017). Such historical research tends to present changes in language use through qualitative or even anecdotal methods – whether it concerns specific ruptures or slowly evolving temporal developments in society. Today, computational methods offer an opportunity to empirically examine the rise and fall of central keywords – as well as to detect new or forgotten ones – on a larger scale (e.g. Guldi, 2018). In our presentation we use this approach to examine the research assumptions mentioned above, and to study changes in the political vocabulary in general, and changes related to media and communication in particular.
In our presentation, which is part of a work-in-progress case study, we thus focus on changes in the frequency of words within the political sphere during the Swedish Post-War era. The case study calculates and clusters word distributions over time in a political corpus of government reports, with a specific focus on words with a higher degree of fluctuation over time. Our paper aims to answer the following questions: Are there particular time periods when keywords appear and disappear in the political vocabulary in Sweden between 1945 and 1989? If so, when? And more importantly, what previous knowledge about the political landscape in general, and about media and communication issues in particular, can be confirmed or discarded by studying aggregated changes in word distributions over time?
Empirically, we focus on the entire corpus of the Swedish Governmental Report series (Statens offentliga utredningar, SOU) from 1945 to 1989, in total about 3 100 reports. These reports, based on a system of commission inquiries, are used to provide the government with knowledge and alternatives before it submits a proposal for new legislation. The vast diversity of topics scrutinized after 1945 makes the series a valuable historical source for broader studies of the Swedish government's view on various political matters – and of which keywords were important for navigating the political landscape – as well as for narrower studies, for example of the political vocabulary concerning issues of media and communication (Norén & Snickars, 2017).
Methodologically, this paper builds on David McClure’s work, in which he traces aggregated word distributions across narrative time (from the first page to the last page) in 27 000 American novels (McClure, 2017). But instead of narrative time we focus on word distributions across historical time in 3 100 SOU reports – from 1945 to 1989.
As a first methodological step, we focus on (non-lemmatized) words that appear at least 10 000 times in our corpus (cf. McClure, 2017). Given the bureaucratic genre of the corpus, many of these words – "report", "scrutiny", "government", "evaluate" and so on – can be expected to exhibit a relatively uniform distribution over time. However, since we are interested in changes in the vocabulary over time, and in particular in whether there were specific periods when such changes occurred, we need to focus on words whose frequency fluctuates more strongly across this time period. As our null hypothesis, we assume that a word's distribution over time results in the same normalized frequency value each year, and thus generates a completely flat curve with no fluctuation in frequency across time. We then use the chi-squared test to measure the difference between the expected word frequency (i.e. the null hypothesis, with a flat frequency curve over time) and the observed word frequency (i.e. the actual normalized frequency, with a fluctuating trend over time). From this result we limit the subsequent study to a list of words with a high variance score (i.e. high fluctuation of normalized word frequency across time).
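The fluctuation scoring described above can be sketched as follows. This is a minimal illustration of the chi-squared comparison against a flat null hypothesis, not the project's actual implementation; the yearly counts are invented for the example.

```python
# Sketch of the chi-squared fluctuation score: compare observed yearly
# frequencies of a word against a flat (uniform) expected distribution.
# All counts below are hypothetical, not figures from the SOU corpus.

def chi_squared_score(yearly_freqs):
    """Higher score = stronger fluctuation away from a flat curve."""
    total = sum(yearly_freqs)
    n_years = len(yearly_freqs)
    expected = total / n_years  # null hypothesis: same value each year
    return sum((obs - expected) ** 2 / expected for obs in yearly_freqs)

# A bureaucratic staple word: roughly flat over time -> low score.
flat = [100, 98, 102, 101, 99]
# A hypothetical trend word: rises sharply in later years -> high score.
trend = [10, 20, 60, 180, 230]

print(chi_squared_score(flat))   # low
print(chi_squared_score(trend))  # high
```

Ranking words by this score and keeping the top of the list yields the high-variance vocabulary used in the clustering step.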
Then, as a second step to investigate whether some time periods are more sensitive to change in the political vocabulary, we cluster our chosen list of words based on how similar their relative frequency distributions are across time (and not on their similarity in meaning). We use basic hierarchical cluster modeling on the words with the highest degree of normalized variation. An advantage of hierarchical clustering is that it does not rest on many subjective decisions – only on the metrics used to compute the distance between the distributions. Here, we will apply different distance metrics to test the stability of the hierarchical clustering. We will start with a lower threshold and increase it until the hierarchical model delivers a relatively limited number of word distribution clusters. This will give us a better overview of vocabulary changes during the political Post-War period – from the perspective of the Swedish governmental official report series. The result of the hierarchical cluster model will thus constitute the base for our analytical work: changes in political language across the Post-War era, and what new knowledge can be generated about the political landscape in general, and about political issues concerning media and communication in particular.
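The clustering step can be sketched with SciPy's hierarchical clustering. The words and frequency curves below are invented for illustration; the choice of linkage method, distance metric, and cut threshold are the tunable decisions mentioned above.

```python
# Minimal sketch of hierarchical clustering of word frequency curves.
# Rows = words, columns = hypothetical normalized frequency per year
# (toy data, not values from the SOU corpus).
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

words = ["rationalisering", "jamlikhet", "television", "folkhemmet"]
curves = np.array([
    [0.9, 0.8, 0.4, 0.2, 0.1],  # declining trend
    [0.1, 0.2, 0.5, 0.8, 0.9],  # rising trend
    [0.0, 0.1, 0.5, 0.9, 1.0],  # rising trend
    [0.8, 0.9, 0.5, 0.2, 0.1],  # declining trend
])

# Average-linkage clustering on pairwise distances between the curves;
# swapping the metric (e.g. "cosine", "correlation") tests stability.
Z = linkage(curves, method="average", metric="euclidean")

# Cut the dendrogram at a distance threshold; raising the threshold
# yields fewer, coarser clusters of word distributions.
labels = fcluster(Z, t=0.8, criterion="distance")
for word, label in zip(words, labels):
    print(word, label)
```

With this toy data the two rising curves end up in one cluster and the two declining curves in another, which is the kind of grouping the analysis would then read historically.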
This presentation is part of the research project “Welfare State Analytics. Text Mining and Modeling Swedish Politics, Media & Culture, 1945–1989” (WeStAc), that both digitizes and curates three massive textual datasets – in all almost four billion tokens – from the domains of newspapers, literary culture, and Swedish politics during the second half of the 20th century.
Bjereld, U. & Demker, M., 1968: När allt började, 2018.
Hyvönen, M., Snickars, P. & Vesterlund, P., "The Formation of Swedish Media Studies, 1960–1980", Media History, published online 10 Feb 2017.
Guldi, J., "Critical Search: A Procedure for Guided Reading in Large-Scale Textual Corpora", Cultural Analytics, published online Dec 2018.
McClure, D. “Distributions of words across narrative time in 27,266 novels” (https://litlab.stanford.edu/distributions-of-words-27k-novels/) and “A hierarchical cluster of words across narrative time” (https://litlab.stanford.edu/hierarchical-cluster-across-narrative-time/), 2017.
Norén, F. & Snickars, P., "Distant reading the history of Swedish film politics in 4500 governmental SOU reports", Journal of Scandinavian Cinema, no. 2, 2017.
Long Paper (20+10min)
Name the Name – Named Entity Recognition in OCRed 19th and Early 20th Century Finnish Newspaper and Journal Collection Data
University of Helsinki, National Library of Finland, Finland
Named Entity Recognition (NER) – the search, classification, and tagging of names and name-like frequent informational elements in texts – has become a standard information extraction procedure for textual data. NER has been applied to many types of texts and many types of entities: newspapers, fiction, historical records, persons, locations, chemical compounds, protein families, animals, etc. The performance of a NER system is usually heavily genre- and domain-dependent. The entity categories used in NER may also vary, but the most common set is some version of the tripartite categorization into locations, persons, and organizations.
In this paper we report evaluation results for data extracted from Digi, a digitized Finnish historical newspaper collection, using two statistical NER systems: the Stanford Named Entity Recognizer and an LSTM-CRF NER model. The OCRed newspaper collection contains many OCR errors; its estimated word-level correctness is about 70–75%. Our NER evaluation collection and training data are based on ca. 500 000 words that have been manually corrected from the OCR output of ABBYY FineReader 11. We also have evaluation data consisting of new, uncorrected OCR output from Tesseract 3.04.01.
Our Stanford NER results are mostly satisfactory. With our ground truth data we achieve an F-score of 0.89 for locations and 0.84 for persons. For organizations the result is 0.60. With the re-OCRed Tesseract output the results are 0.79, 0.72, and 0.42, respectively. The results of the LSTM-CRF model are similar.
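For readers unfamiliar with the metric, the reported F-scores combine precision and recall as sketched below. The token counts in the example are invented for illustration, not taken from the Digi evaluation data.

```python
# F1-score: harmonic mean of precision and recall, the standard
# evaluation measure for NER output against a gold standard.

def f_score(true_positives, false_positives, false_negatives):
    """F1 over entity decisions; all three counts are per category."""
    precision = true_positives / (true_positives + false_positives)
    recall = true_positives / (true_positives + false_negatives)
    return 2 * precision * recall / (precision + recall)

# e.g. 890 correctly tagged locations, 110 spurious, 110 missed
print(round(f_score(890, 110, 110), 2))  # 0.89
```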
Short Paper (10+5min)
Challenges in Annotation: Annotator Experiences from a Crowdsourced Emotion Annotation Task
University of Helsinki, Finland
With the prevalence of machine learning in NLP and other fields, an increasing number of crowdsourced data sets are created and published. However, very little has been written about the annotation process from the point of view of the annotators. This pilot study aims to help fill that gap and to provide insights into how to maximize the quality of crowdsourced annotation output, with a focus on fine-grained sentence-level sentiment and emotion annotation from the annotators' point of view.
Short Paper (10+5min)
Names in History and Characters in Novels: Using Named Entity Recognition to Tell the Difference
National Library of Norway, Norway
Literary history is not only a long-established discipline, but also a genre with its own set of literary and stylistic conventions. As such, literary history takes many forms and includes works on the literature of nations, periods, traditions, schools, regions, social classes, political movements, ethnic groups, etc. More or less common to all these forms is that a literary history, whether written in the narrative mode or in the form of a chronicle or an encyclopedia, typically contains a myriad of proper names. These names generally fall into five categories according to their referents: (1) real persons, (2) fictional persons, (3) real places, (4) fictional places, and (5) works.
In its encyclopedic form, proper names of the first category – real historical persons – usually appear as the basic structuring principle of the account, starting with writers whose surname begins with the letter A and ending on Z – or, as in the Norwegian case, on the letter Å. Proper names appear as basic structuring principles and key elements also in literary histories written in the narrative mode, but in endlessly more complex ways. Here, proper names may appear as part of a complex web of meanings with multiple referents: a name can refer to a real person, a fictional person, and the title of a literary work at the same time. In this paper, I will present a project of examining a corpus of book-length accounts of Norway's literary history from the digital collection of the National Library of Norway. Drawing on the library's digital research infrastructure, I will use Named Entity Recognition to extract proper names from the texts and various forms of network analysis to identify patterns and deep-level connections between the proper names across the corpus.
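The planned pipeline – extracting names and linking them into a network – might be sketched as follows. Since the abstract does not specify an NER model, a fixed name list stands in for NER output, and the sentences are invented; only the co-occurrence counting logic is illustrated.

```python
# Toy sketch: count sentence-level co-occurrences of proper names to
# produce the weighted edges of a name network. The sentences and the
# name list are hypothetical stand-ins for real NER output.
from itertools import combinations
from collections import Counter

sentences = [
    "Ibsen and Bjornson dominated the period.",
    "Bjornson wrote about Ibsen in his letters.",
    "Hamsun broke with the realism of Ibsen.",
]
known_names = {"Ibsen", "Bjornson", "Hamsun"}  # stand-in for NER output

edges = Counter()
for sentence in sentences:
    # Names found in this sentence, in stable (sorted) order.
    found = sorted(name for name in known_names if name in sentence)
    # Every pair of names in the same sentence adds one edge weight.
    for a, b in combinations(found, 2):
        edges[(a, b)] += 1

print(dict(edges))
```

Edge weights like these can then be loaded into a graph library for centrality or community analysis across the corpus.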
On a theoretical level, I am driven by an interest in the use of narrative devices in history writing. In his book Metahistory: The Historical Imagination in Nineteenth-Century Europe, Hayden White distinguishes between different levels of conceptualization in the historical work: (1) chronicle, (2) story, (3) mode of emplotment, (4) mode of argument, and (5) mode of ideological implication. According to this model, the first stage in creating a literary history is to make a chronicle, that is, to list in chronological order the works and events that fall within the relevant time span. In the second phase, the literary historian shapes a story within the chronicle. This involves, among other things, choosing protagonists and picking out starting and ending points. In the third phase, the author must emplot his story; that is, he must identify it with some archetype already familiar to the reader so that the reader will recognize it as a story of a particular kind. White draws his examples from literary genres and identifies four different modes of emplotment: Romance, Tragedy, Comedy, and Satire.
Alongside ideas put forward by Arthur C. Danto in Narration and Knowledge (1985) and Paul Ricoeur in Time and Narrative (1984–1988), Hayden White has had a wide impact on theoretical considerations in literary historiography since the 1970s, also in Norway. In 1975, following the launch of the six-volume Norges litteraturhistorie [The Literary History of Norway], the professor of rhetoric at the University of Bergen, Georg Johannesen, vehemently attacked the editor Edvard Beyer and his co-authors for lack of theoretical rigour and methodological consistency. In a pamphlet, Johannesen called into question the basic assumption that fiction and nonfiction are to be clearly separated from each other. The use of literary devices is not restricted to fictional writing, he claimed; it applies to academic writing as well, only academics are mostly unaware of it. With reference to Georg Johannesen's work, a younger colleague of his at the University of Bergen, Arild Linneberg, later characterized the four-volume Norsk litteraturhistorie [History of the Norwegian Literature], edited by Bull, Paasche, Winsnes and Houm, as «vitskapsfiksjon» [a mix of scholarly and fictional writing] and «professordikting» [professorial or academic fiction]. «In writing about fictional literature, they themselves were unknowingly creating fiction», Linneberg maintained.
Since the advent of digital humanities, the range of computational techniques and digital research methods has not only enabled today's scholars to raise new questions and develop new hypotheses; it also makes it possible to revisit arguments and claims made before the digital turn. In this paper, I will re-examine the argument of Georg Johannesen and Arild Linneberg. I argue that their premise is correct, but their conclusion is wrong. There can be no doubt that history writing and fictional writing both make use of narrative devices. However, the use of narrative devices does not turn nonfiction into fiction.