Conference Agenda

Session Overview
Session
Newspapers
Time:
Wednesday, 18/Mar/2020:
4:10pm - 5:40pm

Session Chair: Anda Baklāne
Location: Hall B

Presentations
Short Paper (10+5min)

Can Umlauts Ruin your Research in Digitized Newspaper Collections? A NewsEye Case Study on ‘the Dark Sides of War’ (1914–1918)

Barbara Klaus

University of Innsbruck, University of Vienna, Austria

Digitized newspaper collections facilitate the access to historical newspapers. Even though they offer several useful possibilities regarding the research in historical newspapers and magazines, the (automatic) research in these collections is (still) full of limitations and pitfalls. Based on the research conducted on the platform AustriaN Newspapers Online (ANNO) for the NewsEye case study ‘the dark sides of war’, the main challenges of working with digitized newspaper collections will be discussed in this paper. Especially two aspects – the fire catastrophe at the munitions factory Wöllersdorf (1918/09/18) in Lower Austria and the Austrian press coverage about war widows during the First World War – will be used as specific examples. The discussed limitations include the Optical Character Recognition (OCR) quality, provided search options and metadata, as well as others. Furthermore, possible improvements regarding these challenges, e.g. Optical Layout Recognition (OLR), Named-entity Recognition (NER) and Named-entity Linking (NEL), will be presented in this paper.



Short Paper (10+5min)

The Life and Death of Newspapers: Using Metadata to Assess the Outlook and Trajectories of Newspapers in Finland, 1771–1917

Zafar Hussain, Eetu Mäkelä, Jani Marjanen, Mikko Tolonen

University of Helsinki, Finland

(Please see PDF, because the abstract contains figures)

The reason why so many historians are currently excited about the increasing availability of digitized newspapers is that newspapers were possibly the most important channel of public communication in nineteenth-century Europe. They recorded most societal events and thus are a rich source for historical findings, but they are also often identified as factors in major transformations in history such as the emergence of a bourgeois public sphere (Habermas 1962), the establishment of nation states (Anderson 2006) and the breakthrough of representative democracy (Keane 2009). Since an increasing number of newspapers have been digitized in the past twenty years, a large number of studies target detailed questions in the newspapers, and the expectations for what can be analyzed through computational methods are sometimes rather unrealistic (Da 2019). For the analysis of large societal processes, like the transformation of the public sphere, there are, however, great obstacles with regard to data quality, coverage and uniformity.

For a computational analysis of the public sphere, it is crucial that we are able to match the research question to the sub-corpora that can be created from the available datasets. This balance between research questions and the corpus is particularly important because there is always bias in the historical record. This does not mean that the record cannot be used in a meaningful way (as in any historical research). But if the research questions are framed too broadly, we risk making the existing bias part of our analysis.

Our core assumption is that the public sphere cannot be accessed as one whole with the idea that the newspapers as such represent it in a meaningful way (Marjanen et al 2019). Instead, we aim to study particular manifestations of different types of public spheres in different locations and times, focusing on changes at different scales. Thus, for example, instead of looking at Finland as a unified entity it makes sense to divide the analysis into different public spheres that are realised at different paces (Swedish vs. Finnish, town vs. country, inland vs. coast). We use existing metadata records to examine the shapes and boundaries of public discourse.

In order to uncover and understand the complexity of the phenomenon, as reflected by our data, one part of our work has been to develop purely data-driven means to delineate and model the different types of newspapers in our dataset. In this study, we do so not by looking at the content of the newspapers, but by mapping the complexity of their material development. Here, we have extracted from the scanned ALTO/METS data for each X newspapers, information on their page size, number of columns, information density and frequency of publication. In earlier work, we have used this materiality information to chart general trends in how Swedish and Finnish newspapers developed in Finland during the 19th century (Marjanen et al. 2019, Mäkelä et al. 2019a, Mäkelä et al. 2019b). However, as part of that work, we discovered that the general trends belie a much more complex reality.
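As a rough illustration of the extraction step described above, the following sketch reads the page width and height from a single ALTO file. It relies only on the `WIDTH` and `HEIGHT` attributes of the ALTO `Page` element and the `MeasurementUnit` element; the sample document, its values and the function names are invented for illustration and are not taken from the authors' pipeline.

```python
import xml.etree.ElementTree as ET

# Invented sample document; real ALTO files carry far more structure.
SAMPLE = """<alto xmlns="http://www.loc.gov/standards/alto/ns-v2#">
  <Description><MeasurementUnit>mm10</MeasurementUnit></Description>
  <Layout><Page ID="P1" WIDTH="2900" HEIGHT="4200"/></Layout>
</alto>"""


def local_name(tag):
    """Strip the XML namespace so the code works across ALTO versions."""
    return tag.rsplit("}", 1)[-1]


def page_dimensions(alto_xml):
    """Return (width, height, unit) of the first Page element, or None."""
    root = ET.fromstring(alto_xml)
    unit = None
    for element in root.iter():
        name = local_name(element.tag)
        if name == "MeasurementUnit":
            unit = (element.text or "").strip()
        elif name == "Page":
            width, height = element.get("WIDTH"), element.get("HEIGHT")
            if width is None or height is None:
                return None
            return float(width), float(height), unit
    return None


print(page_dimensions(SAMPLE))
```

The unit matters when comparing newspapers, since ALTO files may record dimensions in pixels or in physical units; how the authors normalised this is not stated in the abstract.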

For example, while the general trend in the page size of newspapers follows a linear increase (Fig. 1), a closer look at the data shows that this is caused by an interplay of multiple different phenomena. First, a lot of the increase is caused by a clear increasing trend in page size for newly established newspapers, even though individual variance is also large (Fig. 2). When one looks at how already established newspapers switch sizes, a much more complex picture appears. Here, in Figure 3, we can see existing newspapers steadily move both to larger and to smaller page sizes. The overall increase here only occurs because the proportion of newspapers increasing in size consistently exceeds the proportion decreasing.

Figure 1. Mean newspaper area by year (drop between 1890–1910 is an artifact in the data)

Figure 2. Page sizes of newly established newspapers

Figure 3. Proportion of newspapers each year increasing (green) and decreasing (red) in size, as well as their difference (yellow)

Based on these realizations, in this study we decided to see if we could categorise newspapers not on their absolute characteristics at a particular time, but instead by their behaviour during their lifespans. Thus, instead of absolute values for the page size, number of columns, information density and publication frequency, we took as features whether these values increased, decreased or stayed the same each year.

While for the most part robust analyses of this data are still underway, we already have some preliminary and provisional results.

First, to analyse the data, we performed a hierarchical clustering to identify general trajectory categories. To be able to compare newspapers with different lifespans, we aggregated the data into proportional features: for page size, for example, we calculated the percentage of years in which it increased, stayed the same or decreased. To maintain diachronic information, we also added a volatility measure, which denoted how often in its lifetime a paper switched between categories, e.g. from increasing to decreasing. We limited our analysis to the 301 newspapers in our dataset that were published for at least three years.
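The proportional features and the volatility measure can be sketched as follows. This is a minimal illustration under stated assumptions: the function names are hypothetical, the yearly page sizes are invented, and volatility is normalised here as the share of year-to-year transitions in which the direction changed, which is only one possible reading of "how often a paper switched between categories".

```python
def directions(values):
    """Label each year-to-year step as 'up', 'down' or 'same'."""
    steps = []
    for previous, current in zip(values, values[1:]):
        if current > previous:
            steps.append("up")
        elif current < previous:
            steps.append("down")
        else:
            steps.append("same")
    return steps


def trajectory_features(values):
    """Proportion of steps per direction, plus a volatility score."""
    steps = directions(values)
    n = len(steps)
    if n == 0:
        return {"up": 0.0, "same": 0.0, "down": 0.0, "volatility": 0.0}
    features = {k: steps.count(k) / n for k in ("up", "same", "down")}
    # Assumption: volatility = fraction of consecutive steps whose label differs.
    switches = sum(1 for a, b in zip(steps, steps[1:]) if a != b)
    features["volatility"] = switches / (n - 1) if n > 1 else 0.0
    return features


# Invented yearly page sizes for one hypothetical newspaper.
page_sizes = [100, 100, 120, 110, 110, 130]
print(trajectory_features(page_sizes))
```

Feature vectors of this kind, one per newspaper and per material property, could then be passed to a hierarchical clustering routine such as `scipy.cluster.hierarchy.linkage`; the abstract does not state which distance or linkage was used.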

Based on a preliminary analysis of the clustering results, we can identify five distinct major developmental categories of differing sizes. First, 13 newspapers out of the total 301 clustered together into a category we describe as completely stable. For their lifetime, they do not change their material dimensions. The second category of 64 papers can be described as relatively stable, with only occasional forays either way in paper size. The third and fourth categories identified were mostly decreasing (32 papers) and mostly increasing (14 papers) respectively. Interestingly, more papers were found to be decreasing fairly consistently in page size than increasing. Finally, the largest category, which contained a total of 177 newspapers, was formed around papers with a high volatility, i.e. those which frequently changed between larger and smaller formats. When we delved deeper into this category, it split into two almost equal parts, one where volatility is extremely high but the general trend is still increasing, and the other defined only by its volatility. Interestingly, none of the categories differed much in the lifespans of newspapers allotted to them.

Beyond categorising the material trajectories, we have also been interested in their dynamics and interrelationships. To investigate these, we are running experiments that test whether a machine learning classifier can predict certain categories of developments based on others. As an example of this, our preliminary results seem to indicate that a failure to increase throughput, whether by increasing publication frequency, paper size or information density, will increase the probability of a newspaper going out of business.

Bibliography

Anderson, B. (2006). Imagined communities: Reflections on the origin and spread of nationalism (Rev. ed). Verso.

Da, N. Z. (2019). The Computational Case against Computational Literary Studies. Critical Inquiry, 45(3), 601–639. https://doi.org/10.1086/702594

Habermas, J. (1962). Strukturwandel der Öffentlichkeit: Untersuchungen zu einer Kategorie der bürgerlichen Gesellschaft. Hermann Luchterhand Verlag.

Keane, J. (2009). The life and death of democracy. London: Simon & Schuster.

Marjanen, J., Vaara, V., Kanner, A., Roivainen, H., Mäkelä, E., Lahti, L., & Tolonen, M. (2019). A National Public Sphere? Analyzing the Language, Location, and Form of Newspapers in Finland, 1771–1917. Journal of European Periodical Studies, 4(1), 54–77. https://doi.org/10.21825/jeps.v4i1.10483

Mäkelä, E., Tolonen, M., Marjanen, J., Kanner, A., Vaara, V., & Lahti, L. (2019). Interdisciplinary Collaboration in Studying Newspaper Materiality. Proceedings of the Digital Humanities in the Nordic Countries 2019. CEUR-WS. http://ceur-ws.org/Vol-2365/07-TwinTalks-DHN2019_paper_7.pdf

Mäkelä, E., Tolonen, M. & Kanner, A. (2019). Charting the Material Development of Newspapers. DH2019 long paper abstracts. https://dev.clariah.nl/files/dh2019/boa/0726.html



Short Paper (10+5min)

Foreignizing the Other: National Identity and the Concept of Aristocrat in Dutch Historical Newspapers

Leon Wessels

Utrecht University, The Netherlands

The Netherlands are commonly associated with a bourgeois culture. In a classic essay, the renowned Dutch historian Johan Huizinga emphasized that the single most important characteristic of the Dutch nation is its thoroughly bourgeois spirit [1]. Huizinga was not presenting a completely new view of Dutch culture. He was rather summarizing a common opinion of his time [2]. From the eighteenth century until at least the 1960s, the values, spirit and attitude of the middle class were widely seen as the pars pro toto of the Dutch nation [3]. As a result, historians have largely ignored the aristocratic elements in Dutch culture [4]. If they were noticed at all, they were disqualified as ‘un-Dutch’ [5]. Several studies have shown, however, that the elites ruling the Dutch Republic went through a process of ‘aristocratization’. They evolved into a closed oligarchy, especially in the eighteenth century [6], and adopted an aristocratic lifestyle exemplified by their luxurious mansions in the countryside [7].

How did language reflect the social and cultural presence of elites? In this paper, I will present some of the results of my ongoing PhD research into the broader conceptual history of the term ‘elite’ in the Netherlands. I will seek to understand how the word ‘aristocrat’ was conceptualized in Dutch newspapers between 1840 and 1994, examining in particular its spatial (in this case national) connotations. The corpus consists of articles (advertisements have been excluded) from over 30 different national and regional newspapers and contains almost 15 billion words. Newspapers are particularly interesting for studying the history of concepts, because their serial nature allows one to study change over time and because newspapers both produce and reflect public discourse [8].

Following the principles of Natural Language Processing suggested by Jurafsky and Martin [9], I have created a number of Python scripts to query the newspaper corpus. I started out by making a simple concordancer, similar to various openly available concordance tools [10]. Next, I wrote a script to generate frequency lists (per year) of words that occur close to the keyword “aristocrat”. This keyword was written as a Python list containing regular expressions that capture the Dutch words ‘aristocraat’, ‘aristocratie’, ‘aristocratisch(e)’ and compound words, in historical spelling variations. I applied this script to make frequency lists of words that occur within a window of three words of the keyword. For example, the sentence ‘De dwingelandij van de aristokratie van Spanje is alom bekend.’ (The tyranny of the aristocracy of Spain is widely known.) would add the following words to the frequency list: ‘van’ (2), ‘de’ (1), ‘dwingelandij’ (1), ‘is’ (1), ‘spanje’ (1).
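The windowed frequency count described above can be sketched as follows. The regular expression is a simplified stand-in, not the author's actual pattern list, and the tokenizer and function names are assumptions made for illustration. Applied to the example sentence, the sketch reproduces the counts given in the text.

```python
import re
from collections import Counter

# Simplified, hypothetical keyword pattern: matches aristocraat, aristokratie,
# aristocratisch(e) and compounds containing them, with c/k spelling variation.
KEYWORD_PATTERNS = [re.compile(r"aristo[ck]raa?t")]


def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"\w+", text.lower())


def is_keyword(token):
    return any(pattern.search(token) for pattern in KEYWORD_PATTERNS)


def window_counts(text, window=3):
    """Count tokens within `window` words on either side of each keyword hit."""
    tokens = tokenize(text)
    counts = Counter()
    for i, token in enumerate(tokens):
        if not is_keyword(token):
            continue
        left = tokens[max(0, i - window):i]
        right = tokens[i + 1:i + 1 + window]
        counts.update(left + right)
    return counts


sentence = "De dwingelandij van de aristokratie van Spanje is alom bekend."
print(window_counts(sentence))
```

Running the function per article and summing the counters by publication year would give the per-year frequency lists mentioned in the text.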

The next step was to build a historical gazetteer suitable for extracting spatial information from the word frequency lists. A gazetteer is a geographical dictionary containing references to countries, regions, place names, et cetera. To avoid so-called ‘temporal dissonance’ I did not use an existing modern Dutch gazetteer, but created a historical Dutch gazetteer following the principles of McDonough et al. [11]. This gazetteer includes historical spelling variations and references to states that no longer exist. Using this gazetteer, I extracted references to nations from the word frequency list and saved the results as a tabularized set of data.
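A gazetteer lookup of this kind might look like the sketch below. The entries, the year and the counts are illustrative placeholders, not taken from the author's gazetteer or results; the point is only that several historical spellings map to one canonical nation and that the output is a flat table of rows.

```python
from collections import Counter

# Hypothetical gazetteer: spelling variant -> canonical nation label.
GAZETTEER = {
    "spanje": "Spain",
    "engeland": "Great Britain",
    "frankrijk": "France",
    "pruisen": "Prussia",
    "pruissen": "Prussia",  # assumed older spelling variant
}


def nation_counts(word_freq):
    """Aggregate word frequencies into counts per canonical nation."""
    totals = Counter()
    for word, freq in word_freq.items():
        nation = GAZETTEER.get(word.lower())
        if nation is not None:
            totals[nation] += freq
    return totals


def to_rows(year, word_freq):
    """Return (year, nation, count) rows, sorted by nation name."""
    return [(year, nation, n) for nation, n in sorted(nation_counts(word_freq).items())]


# Invented frequency list for a single invented year.
freq = Counter({"van": 2, "de": 1, "spanje": 1, "engeland": 3, "pruissen": 1})
rows = to_rows(1850, freq)
print(rows)
```

Rows in this shape can be written out with the standard `csv` module to obtain the tabular data set mentioned above.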

The resulting data were used to analyze how frequently references to various nations co-occurred with keywords related to the concept of aristocrat. Among other things, the analysis shows a clear tendency in Dutch newspapers to associate the concept of the aristocrat with foreign countries, in particular Great Britain. References to a domestic aristocracy on the other hand are only marginally present. My research thus shows that the concept of the aristocrat – as the counterpart of the burgher – was effectively foreignized. This conclusion is in keeping with the generally held image of the Dutch as thoroughly bourgeois, in spite of the actual existence of an indigenous aristocracy.

In preparation for the DHN 2020 conference, two more steps will be taken to improve the methodology. So far, the research was based on absolute frequencies of co-occurrences. The first step will be to use so-called ‘significant collocation’ to identify which words co-occur more often than would be expected by chance, given their overall frequencies [12]. Secondly, in order to capture the relations with semantically similar words, such as synonymy and hyponymy, I will use synsets. Synsets are sets of cognitive synonyms that are interlinked based on semantic and lexical relations. This approach has also been successfully applied by other researchers to study historical and geographical concepts [13]. Using synsets the term ‘aristocrat’ can thus be analyzed at a more conceptual level [14].
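The abstract does not say which statistic will be used for significant collocation. As one possible illustration, the sketch below compares the observed number of co-occurrences with the number expected if the collocate were spread evenly over the corpus, and reports the base-2 logarithm of their ratio (a pointwise-mutual-information style score). The function names, the default span of six positions (three words either side) and all numbers are assumptions.

```python
import math


def expected_cooccurrence(node_freq, collocate_freq, corpus_size, span):
    """Expected hits if the collocate were distributed evenly over the corpus."""
    return node_freq * span * collocate_freq / corpus_size


def association_score(observed, node_freq, collocate_freq, corpus_size, span=6):
    """log2(observed / expected); positive means more frequent than chance."""
    if observed == 0:
        return float("-inf")
    expected = expected_cooccurrence(node_freq, collocate_freq, corpus_size, span)
    return math.log2(observed / expected)


# Invented counts: keyword seen 100 times, collocate 1,000 times,
# corpus of 1,000,000 tokens, 12 observed co-occurrences.
score = association_score(12, 100, 1000, 1_000_000)
print(round(score, 2))
```

A score well above zero suggests the pair co-occurs more often than the raw frequencies predict; a threshold for calling a collocation significant would still need to be chosen.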

References

1. Johan Huizinga, "Nederland's geestesmerk", in: Geschiedwetenschap / hedendaagsche cultuur. Verzameld werk VII (Tjeenk Willink & Zoon N.V., Haarlem 1950) pp. 279-312. Originally published in 1935.

2. Henk te Velde, "How High did the Dutch Fly? Remarks on Stereotypes of Burger Mentality", in: Annemieke Galema, Barbara Henkes and Henk te Velde eds., Images of the Nation. Different Meanings of Dutchness, 1870-1940 (Rodopi, Amsterdam/Atlanta 1993) pp. 59-80.

3. Remieg Aerts, “De erenaam van burger: geschiedenis van een teloorgang”, in: Joost Kloek and Karin Tilmans eds., Burger. Een geschiedenis van het begrip ‘burger’ in de Nederlanden van de Middeleeuwen tot de 21ste eeuw (Amsterdam University Press, Amsterdam 2002) pp. 313-345.

4. Conrad Gietman, "Adel tijdens Opstand en Republiek. Oude en nieuwe perspectieven", Virtus. Journal of Nobility Studies 19 (2012) pp. 49-62.

5. Willem Frijhoff, "Verfransing? Franse taal en Nederlandse cultuur tot in de revolutietijd", BMGN - Low Countries Historical Review 104.4 (1989) pp. 592-609.

6. H. van Dijk and D.J. Roorda, "Sociale mobiliteit onder regenten van de Republiek", Tijdschrift voor Geschiedenis 84 (1971) pp. 306-328; Yme Kuiper, "Adel in de achttiende eeuw: smaak en distinctie. Een verkenning van het veld", Virtus. Journal of Nobility Studies 16 (2009) pp. 9-18.

7. Paul Brusse and Wijnand W. Mijnhardt, Towards a New Template for Dutch History. De-urbanization and the Balance Between City and Countryside (Waanders/Utrecht University, [Zwolle/Utrecht 2011]); Yme Kuiper and Rob van der Laarse eds., Beelden van de buitenplaats. Elitevorming en notabelencultuur in Nederland in de negentiende eeuw (Verloren, second revised edition, Hilversum 2014); Yme Kuiper and Ben Olde Meierink eds., Buitenplaatsen in de Gouden Eeuw. De rijkdom van het buitenleven in de Republiek (Verloren, Hilversum 2015).

8. Michael Schudson, The Power of News (Harvard University Press, Cambridge 1982) pp. 17-18; Dan Berkowitz ed., Social Meanings of News. A Text-Reader (Sage, Thousand Oaks/London/New Delhi 1997) pp. xi-xiv; Martin Conboy, The Language of the News (Routledge, London/New York 2007) pp. 149-150.

9. Daniel Jurafsky and James H. Martin, Speech and Language Processing. An Introduction to Natural Language Processing, Computational Linguistics, and Speech Recognition (Third Edition draft, 2018).

10. Geoffrey Rockwell and Stéfan Sinclair, Hermeneutica. Computer-Assisted Interpretation in the Humanities (MIT Press, Cambridge/London 2016) pp. 49-65.

11. Katherine McDonough, Ludovic Moncla and Matje van de Camp, "Named entity recognition goes to old regime France: geographic text analysis for early modern French corpora", International Journal of Geographical Information Science 33.12 (2019) pp. 2498-2522.

12. John Sinclair, Susan Jones and Robert Daley, English Collocation Studies: The OSTI Report (Continuum, London 2004) p. 10.

13. For example: Roberta Cimino, Tim Geelhaar, Silke Schwandt, "Digital Approaches to Historical Semantics: New Research Directions at Frankfurt University", Storicamente 11 (2015) pp. 1-16; Francesca Frontini, Riccardo Del Gratta and Monica Monachini, "GeoDomainWordNet: Linking the Geonames Ontology to WordNet", in: Zygmunt Vetulani, Hans Uszkoreit and Marek Kubis eds., Human Language Technology. Challenges for Computer Science and Linguistics. 6th Language and Technology Conference, LTC 2013, Poznań, Poland, December 7-9, 2013. Revised Selected Papers (Springer International Publishing, 2016) pp. 229-242.

14. For discussions on the relation between words and concepts, see (among others): Otto Brunner, Werner Conze and Reinhart Koselleck, Geschichtliche Grundbegriffe. Historisches Lexikon zur politisch-sozialen Sprache in Deutschland. Volume I (Klett-Cotta, Stuttgart 1972) pp. xxii-xxiv; Peter de Bolla, The Architecture of Concepts. The Historical Formation of Human Rights (Fordham University Press, New York 2013) pp. 19-26.



Long Paper (20+10min)

Disappearing Discourses: Avoiding Anachronisms and Teleology with Data-Driven Methods in Studying Digital Newspaper Collections

Elaine Zosa, Simon Hengchen, Jani Marjanen, Lidia Pivovarova, Mikko Tolonen

University of Helsinki, Finland

Newspapers have been a rich source of information for historians for the past hundred years or so. In the past twenty years, digitization of newspapers has made it possible to do simple tasks such as keyword searches or more elaborate text mining analyses. Advancements like this create unprecedented possibilities for the analysis of historical sources. While there is some truth to the promises of the future, in reality research on digitized newspapers remains underdeveloped with regard to reference corpora and reproducibility. Digitized newspapers are discussed particularly with respect to the development of public discourse, but the idea of entering the realm of past discourse in toto through digitized newspapers may in the end be harmful. In reality, historians are interested in the different layers of newspaper publicity, so location and temporality always play a crucial role in any historical analysis of public discourse in newspapers.

With these aspects in mind, this paper takes advantage of digitized newspapers and data-driven approaches in identifying disappearing discourses in newspapers. In doing this, we want to revisit one of the key tensions in historiography, that is, the interplay between being relevant for the present and at the same time writing history in a way that is true to the experiences of past actors. History’s presentism is sometimes discussed critically from the perspective of anachronism or teleology in history (Koselleck 2010; Skinner 2002), or more appraisingly in terms of genealogies of the present or letting all be the history of the contemporary (Armitage forthcoming). Regardless of the historian’s desire for contemporary relevance or for historical antiquarianism, approaching history without predefined questions from the present has not been possible. The advent of digitized sources that can be approached in a data-driven way opens up the possibility of approaching history in a much more open-ended way. Hence, we propose to test the possibility of studying a historical case with as few presupposed categories as possible. To do this we study digitized newspaper collections (specifically, 19th century Finnish newspapers in Finnish and Swedish) through the perspective of discourses that fell out of fashion and disappeared from long-term diachronic newspaper data sets.

We believe there is more potential in the use of digitized newspapers when we are not pinpointing the words and concepts in our approach a priori. This may lead us to completely new avenues of research, challenge our take on history as some sort of progression and, hopefully, show the value of the data-driven approach for the humanities. To understand the boundaries and the development of the public sphere it is useful to identify those discourses that were important in a particular time and place, but have since disappeared while words and concepts of another discourse have replaced them and started to dominate the ecosystem of print publicity. It is a commonplace to note that religious discourse has lost much of its prominence or that technological advancements have brought with them new topics that have replaced old ones. Still, by turning the question around and asking which discourses disappeared, we get a broader picture. We then turn to the data again and zoom in on localities and languages in order to avoid a totalizing view and move on to looking at where and when discourse changed. Thus, while we produce an analysis of public discourse in Finland, we approach the topic by noting that this is not a unified whole, but composed of different entangled realms of public discourse (Tolonen et al 2019; Marjanen et al 2019a).

Using newspapers and periodicals data in Finnish and Swedish encompassing respectively 5.2B and 3.4B tokens (National Library of Finland 2011a, 2011b), we utilise two different methods: relative word frequencies as proxies for particular discourses enhanced with distributional semantics derived from diachronic word embeddings (Kim et al 2014, Dubossarsky et al 2019), and dynamic topic modeling that captures more general themes. The former method, i.e. the combination of frequency analysis and vector space similarity, allows us to focus on specific themes and track their dynamics along a timeline to detect crucial events related to those themes. This has successfully been carried out in recent work on similar data (Martinez-Ortiz et al 2016; Hengchen et al 2019; Marjanen et al 2019b; van Eijnatten and Ros 2019). Training diachronic word embeddings on different time granularities (e.g. months, years, or decades) allows for different views on the evolution of semantic clusters; these themes are then given weight through frequency counts. The latter method allows us to paint a larger picture of the different dynamics taking place in the data, by harnessing the power of topic models designed to capture trends in time-series data such as Dynamic Topic Models (DTM, Blei and Lafferty 2006). In DTM, the data is divided into discrete time slices and the method infers topics across these time slices to capture topics evolving over time. This method models how a topic changes from one time step to the next. Unlike vanilla LDA topic modelling, which does not take into account the evolution of a topic, DTM is more robust to topics that change vocabulary over time to talk about the same issue. In LDA, topics like these would likely be split into separate topics since the words associated with them have changed, but in DTM they would be treated as one topic that is developing over time.
To address the additional training complexity of this model we subsample the data such that we have the same amount of data for each time slice of our corpus. This would also ensure that the topics inferred are representative of all the time slices in the corpus rather than favoring the later years, which have more articles and newspapers associated with them.
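The balanced subsampling step can be sketched as below. The abstract does not say whether "amount of data" is measured in documents or tokens, nor how the target size is chosen; this sketch assumes documents and uses the size of the smallest time slice. Function and variable names are hypothetical.

```python
import random


def subsample_balanced(docs_by_slice, seed=0):
    """Draw the same number of documents from every time slice.

    Assumption: the target size is the size of the smallest slice.
    """
    if not docs_by_slice:
        return {}
    rng = random.Random(seed)  # fixed seed keeps the sample reproducible
    target = min(len(docs) for docs in docs_by_slice.values())
    return {
        label: rng.sample(docs, target)
        for label, docs in docs_by_slice.items()
    }


# Invented toy corpus: later decades hold more articles than earlier ones.
corpus = {
    "1850s": ["a1", "a2"],
    "1880s": ["b1", "b2", "b3", "b4"],
    "1910s": ["c1", "c2", "c3", "c4", "c5", "c6"],
}
balanced = subsample_balanced(corpus)
print({label: len(docs) for label, docs in balanced.items()})
```

Balancing by the smallest slice discards a lot of late material; a fixed cap per slice would be an alternative design when early slices are very small.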

With thematically-labelled temporal representations of newspaper data, it becomes possible to quantify and qualify the evolution of certain themes that have been automatically inferred from the data, thus removing some bias in topic selection. We further use metadata to zoom in on changes in topics and see which towns, regions or types of newspapers they involve, so that we can manually assess the driving locations of change and produce a typology of disappearing discourses.
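As a small illustration of quantifying a theme over time with relative word frequencies, the sketch below computes, per time slice, the share of tokens that belong to a given term set. The tokens, years and term set are invented toy data, and the function name is hypothetical; in the approach described above, the term set would come from the embedding-based clusters rather than being chosen by hand.

```python
from collections import Counter


def relative_frequencies(tokens_by_slice, terms):
    """Share of tokens in each time slice that belong to `terms`."""
    shares = {}
    for label, tokens in tokens_by_slice.items():
        counts = Counter(tokens)
        total = len(tokens)
        hits = sum(counts[term] for term in terms)
        shares[label] = hits / total if total else 0.0
    return shares


# Invented toy data: a religious vocabulary that thins out over time.
tokens_by_slice = {
    1850: ["kyrka", "gud", "stad", "gud"],
    1900: ["stad", "järnväg", "gud", "telegraf", "stad"],
}
shares = relative_frequencies(tokens_by_slice, {"gud", "kyrka"})
print(shares)
```

A sustained fall in such a share across slices is one simple signal of a disappearing discourse; slicing the same computation by town or newspaper type is where the metadata comes in.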

Acknowledgements

This work has been supported by the European Union’s Horizon 2020 research and innovation programme under grant 770299 (NewsEye).

References

1. Armitage, D. (In Press). In Defense of Presentism. In D. M. McMahon (Ed.), History and Human Flourishing. Oxford: Oxford University Press.

2. Blei, D.M. and Lafferty, J.D. (2006). Dynamic topic models. In Proceedings of the 23rd international conference on Machine learning. ACM, pages 113–120.

3. Dubossarsky, H., Hengchen, S., Tahmasebi, N. and Schlechtweg, D. (2019). Time Out: Temporal Referencing for Robust Modeling of Lexical Semantic Change. Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics.

4. van Eijnatten, J. and Ros, R. (2019). The Eurocentric Fallacy. A Digital Approach to the Rise of Modernity, Civilization and Europe. International Journal for History, Culture and Modernity, 7.

5. Hengchen, S., Ros, R., and Marjanen, J. (2019). A data-driven approach to the changing vocabulary of the ‘nation’ in English, Dutch, Swedish and Finnish newspapers, 1750-1950. In Proceedings of the Digital Humanities (DH) conference 2019, Utrecht, The Netherlands.

6. Kim, Y., Chiu, Y.I., Hanaki, K., Hegde, D. and Petrov, S. (2014). Temporal Analysis of Language through Neural Language Models. ACL 2014, p.61.

7. Koselleck, R. (2010). Vom Sinn und Unsinn der Geschichte: Aufsätze und Vorträge aus vier Jahrzehnten (C. Dutt, Ed.). Berlin: Suhrkamp.

8. Marjanen, J., Vaara, V., Kanner, A., Roivainen, H., Mäkelä, E., Lahti, L., & Tolonen, M. (2019a). A National Public Sphere? Analyzing the Language, Location, and Form of Newspapers in Finland, 1771–1917. Journal of European Periodical Studies, 4(1), 54-77. https://doi.org/10.21825/jeps.v4i1.10483

9. Marjanen, J., Pivovarova, L., Zosa, E. & Kurunmäki, J. (2019b). Clustering Ideological Terms in Historical Newspaper Data with Diachronic Word Embeddings. In Proceedings of the 5th International Workshop on Computational History (HistoInformatics 2019), 12/09/2019.

10. Martinez-Ortiz, C., Kenter, T., Wevers, M., Huijnen, P., Verheul, J. and Van Eijnatten, J. (2016). Design and implementation of ShiCo: Visualising shifting concepts over time. In HistoInformatics 2016 (Vol. 1632, pp. 11-19).

11. National Library of Finland (2011a). The Finnish Sub-corpus of the Newspaper and Periodical Corpus of the National Library of Finland, Kielipankki Version [text corpus]. Kielipankki. Retrieved from http://urn.fi/urn:nbn:fi:lb-2016050302.

12. National Library of Finland (2011b). The Swedish Sub-corpus of the Newspaper and Periodical Corpus of the National Library of Finland, Kielipankki Version [text corpus]. Kielipankki. Retrieved from http://urn.fi/urn:nbn:fi:lb-2016050301.

13. Skinner, Q. (2002). Visions of politics. Vol. 1, Regarding method. Cambridge University Press.

14. Tolonen, M., Lahti, L., Roivainen, H., & Marjanen, J. (2019). A Quantitative Approach to Book-Printing in Sweden and Finland, 1640–1828. Historical Methods: A Journal of Quantitative and Interdisciplinary History, 52(1), 57–78. https://doi.org/10.1080/01615440.2018.1526657



Short Paper (10+5min)

Creating an Annotated Corpus for Aspect-Based Sentiment Analysis in Swedish

Jacobo Rouces, Lars Borin, Nina Tahmasebi

University of Gothenburg, Sweden

Aspect-Based Sentiment Analysis constitutes a more fine-grained alternative to traditional sentiment analysis at sentence level. In addition to a sentiment value denoting how positive or negative a particular opinion or sentiment expression is, it identifies additional aspects or ‘slots’ that characterize the opinion. Some typical aspects are target and source, i.e. which entity or aspect the opinion is about and who holds it. We present a large Swedish corpus annotated for Aspect-Based Sentiment Analysis. Each sentiment expression is annotated as a tuple that contains one of five possible sentiment values, the target, the source, and the existence of irony. In addition, the linguistic element that conveys the sentiment is also identified. Sentiment for a particular topic is also annotated at title, paragraph and document level. The documents are articles obtained from two Swedish media outlets (Svenska Dagbladet and Aftonbladet) and one online forum (Flashback), totalling around 4000 documents. The corpus is freely available and we plan to use it for training and testing an Aspect-Based Sentiment Analysis system.
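The annotation tuple described above could be represented along the following lines. This is a sketch only: the abstract says there are five sentiment values but does not name them, so the labels and the numeric scale below are assumptions, as are the field names and the example sentence.

```python
from dataclasses import dataclass
from enum import IntEnum
from typing import Optional


class Sentiment(IntEnum):
    """Hypothetical five-point scale; the corpus's actual labels may differ."""
    VERY_NEGATIVE = -2
    NEGATIVE = -1
    NEUTRAL = 0
    POSITIVE = 1
    VERY_POSITIVE = 2


@dataclass(frozen=True)
class SentimentAnnotation:
    value: Sentiment        # one of five sentiment values
    target: str             # entity or aspect the opinion is about
    source: Optional[str]   # who holds the opinion, if stated
    ironic: bool            # whether the expression is marked as ironic
    expression: str         # linguistic element that conveys the sentiment


# Invented example for the sentence "Jag älskar den här boken."
example = SentimentAnnotation(
    value=Sentiment.VERY_POSITIVE,
    target="boken",
    source="Jag",
    ironic=False,
    expression="älskar",
)
print(example)
```

Title-, paragraph- and document-level sentiment for a topic would need a separate record type, since those annotations are not tied to a single expression.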