Conference Agenda

Parliamentary Corpora
Wednesday, 18/Mar/2020:
2:10pm - 3:40pm

Session Chair: Ilze Auziņa
Location: Hall B

Short Paper (10+5min)

Discourse on Safety / Security in the Parliamentary Corpus of Latvian Saeima

Ilva Skulte, Normunds Kozlovs

Riga Stradins University, Latvia

The discourse on (public) safety and (social) security in political communication has an impact on the construction of national identity and community feelings through the ideas of risk and emergency. Indeed, the many aspects of insecurity / unsafety make this a rather manifold and complex concept as elaborated in speeches. How is this conceptual nexus used and perceived in the speeches of MPs of the Latvian parliament, and what impact may it have had on the (re)formed national identity? These are the main issues in the proposed paper. Methodologically, the analysis combines critical discourse analysis (CDA) and corpus analysis and is based on the Corpus of Debates in the Latvian Saeima (1993–2017). By means of corpus analysis tools, the categories and frames of representation of safety and security in the speeches of Latvian MPs are selected and described, and a qualitative analysis is carried out to understand and interpret differences and similarities in how MPs understand and treat different aspects of safety and security in the parliamentary discourse in Latvia, and how this has changed during the period after regaining independence.

Short Paper (10+5min)

Analyzing Candidate Speaking Time in Estonian Parliament Election Debates

Siim Talts, Tanel Alumäe

Tallinn University of Technology, Institute of Software Science, Estonia

In this paper, we analyze the amount of speaking time by each candidate and political party during the election debates that aired in broadcast media during the Estonian 2019 parliament election campaign, using automatic speaker identification and weakly supervised neural network training techniques.

The work has two goals: analyze the effectiveness of a rapid weakly supervised speaker identification model training method under real-world conditions, and examine the potential bias in broadcast media towards political parties, in terms of speaking time allotted to the corresponding individual candidates during election debates.

Usually, speaker identification systems are trained on manually segmented and labelled training data: for each person that needs to be covered by the system, several speech segments containing speech from this person are needed. This makes training data preparation costly and time-consuming, especially if a large number of speakers needs to be identifiable. In this work, by contrast, we trained speaker models using a recently proposed weakly supervised training method which only needs recording-level speaker labels: for each person, several recordings are needed where this person is one of the speakers, while segment-level labeling is not required. This makes training data creation less costly. Furthermore, such training data can often be constructed automatically, using metadata accompanying the speech recordings. The method relies on automatic speaker diarization of the training data, i-vector based speaker embeddings, and a special cost function that encourages a deep neural network to assign only one of the discovered speaker vectors to a particular speaker label.
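The abstract does not give the exact cost function, but the core idea of scoring a recording-level label against diarized clusters can be sketched in a multiple-instance style: a recording labelled with a speaker is scored by its best-matching diarized cluster embedding, so only one of the discovered clusters needs to correspond to the labelled person. The function names and plain-list embeddings below are illustrative, not from the paper:

```python
import math

def cosine(a, b):
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def recording_level_score(cluster_embeddings, speaker_prototype):
    """Score a recording against a speaker: take the best-matching
    diarized cluster, since only one cluster in the recording is
    expected to belong to the labelled speaker."""
    return max(cosine(e, speaker_prototype) for e in cluster_embeddings)
```

In training, a loss built on such a score rewards the network when exactly one diarized cluster per recording aligns with the recording-level label, which is what makes segment-level annotation unnecessary.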

The Estonian 2019 parliament elections had 1084 registered candidates. We used YouTube and the Estonian Public Broadcasting (ERR) media archive to retrieve audio and video files that likely contained speech by each of the candidates. In the case of YouTube, we retrieved videos whose title or description contained the person’s full name. For ERR, we relied on the metadata of each media clip, which listed the names of the persons speaking in the recording. Using this technique, potential training data was found for 810 candidates. However, only 317 candidates occurred in 10 or more recordings, as required by our training method.
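The retrieval step amounts to matching candidate names against recording metadata and keeping only candidates with enough matches for training. A minimal sketch, with hypothetical data structures (the paper does not specify its implementation):

```python
from collections import defaultdict

def collect_training_data(candidates, recordings, min_recordings=10):
    """Match each candidate's full name against recording metadata text
    (video title/description for YouTube, speaker list for ERR) and keep
    only candidates with at least `min_recordings` matches."""
    matches = defaultdict(list)
    for rec_id, metadata_text in recordings:
        for name in candidates:
            if name in metadata_text:
                matches[name].append(rec_id)
    return {name: recs for name, recs in matches.items()
            if len(recs) >= min_recordings}
```

Note that substring matching against free-text metadata is exactly what produces the false positives discussed below: a name in a title does not guarantee that the person speaks in the clip.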

We manually examined a small subset of the resulting dataset and determined that 12% of the clips were false positives, meaning that they did not actually contain speech by the person for whom they were retrieved.

After training speaker identification models on the automatically constructed training data, we validated the accuracy of the system using a set of four manually segmented and labelled election debates. The validation dataset contained speech by 26 unique candidates, 21 of which (78%) were covered by our system. The system correctly identified 24 of the candidates, resulting in a recall rate of 73% over all candidates. No false positives were returned, resulting in 100% precision.
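The evaluation measure worth making explicit here is that recall is computed over all candidates in the validation debates, including those for whom no model could be trained, while precision is computed over the system's outputs. A small sketch with illustrative numbers (not the paper's figures):

```python
def precision_recall(n_correct, n_returned, n_total_candidates):
    """Precision over identifications the system returned; recall over
    ALL candidates in the debates, including uncovered ones, so missing
    speaker models directly lower recall but not precision."""
    precision = n_correct / n_returned if n_returned else 1.0
    recall = n_correct / n_total_candidates
    return precision, recall
```

This is why the system can reach 100% precision while recall stays well below the coverage rate would suggest: every uncovered candidate is an unavoidable miss.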

The full speaking time analysis was performed over a set of 55 election debates from six different radio and TV stations, totalling 55 hours of audio. 19% of the debates were in Russian; the rest were in Estonian. For each debate, the set of candidates who appeared was constructed manually, with the help of the metadata that came with the recording. A total of 123 unique candidates appeared in the debates, of whom 69 (56%) were covered by our system.

The analysis of speaking time over individual candidates brought no real surprises: the leaders of the eight political parties that participated in the elections with a so-called full list (i.e., at least 101 candidates) occupied the first seven places in terms of total speaking time.

By aggregating the speaking time of the individual candidates of each political party, we calculated the total speaking time of the different parties. At first, the results seemed to indicate a large bias: large and established parties received up to two times more speaking time than newer parties (even when limiting the analysis to “full list” parties). However, we recognized that this was partly due to a weakness of our training method: newer parties have more candidates who are new to politics and thus have less exposure on YouTube and in the public broadcasting archive, increasing the risk that they are not covered by our model. We therefore adjusted the results as follows: every candidate who was present in a debate but not identified by our system was assigned an estimated speaking time, calculated as the average speaking time of the successfully identified persons in that debate. The adjusted results show relatively little difference between political parties: all full list parties were assigned between 220 and 270 minutes of speaking time.
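The per-debate adjustment can be sketched in a few lines; the function name and data structures are illustrative, not from the paper:

```python
def adjusted_speaking_times(identified_times, present_but_unidentified):
    """identified_times: {candidate: seconds} for candidates the system
    identified in one debate. Candidates known from metadata to be
    present but not identified are assigned the mean identified time
    as an estimate, so missing speaker models do not zero them out."""
    avg = sum(identified_times.values()) / len(identified_times)
    adjusted = dict(identified_times)
    for cand in present_but_unidentified:
        adjusted[cand] = avg
    return adjusted
```

Using the per-debate average (rather than a global one) keeps the estimate sensitive to how much floor time each debate actually distributed among its participants.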

We did not attempt to analyze the causality between candidate speaking time and election results, since there are several factors, such as prior popularity, speaking skills and experience in political debates, that affect both exposure in debates as well as the number of votes received.

The experiments showed that it is possible to use weak supervision to create a targeted speaker identification system with high precision from several potentially noisy data sources. However, we also observed that for a large share of the candidates no training data could be retrieved automatically from public data sources, so no speaker identification models could be trained for them.

The analysis showed that the election debates were not biased from the speaking time point of view: all major political parties received around 245 (± 10%) minutes of speaking time across the debates.

Long Paper (20+10min)

Digging Deeper into the Finnish Parliamentary Protocols – Using a Lexical Semantic Tagger for Studying Meaning Change of Everyman’s Rights (Allemansrätten)

Kimmo Kettunen1, Matti La Mela2

1University of Helsinki, National Library of Finland, Finland; 2Aalto University, Semantic Computing Research Group, Finland

This paper analyses the protocols of the Finnish parliament 1907–2000, which were digitised and published as open data by the Finnish Parliament in 2018. In the analysis we use a novel tool, a semantic tagger for Finnish, FiST [1]. We describe the tagger generally and show results of semantic analysis both on the whole parliamentary corpus and on a small subset of data where everyman’s rights (a widely used right of public access to nature) have been the main topic of parliamentary discussions. Our analysis contributes to the understanding of the development of this “tradition” of public access rights, and is also the first study utilizing the Finnish semantic tagger as a tool for content analysis in digital humanities research. Keyword search shows, first, that the discussion of everyman’s rights has had three peak periods in the Finnish parliament: 1946, 1973, and 1992. Secondly, the contents of the discussions differ in nature between the periods, which could be clearly detected with FiST and keyness analysis.
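Keyness analysis of the kind mentioned above is commonly computed with Dunning's log-likelihood statistic, comparing a word's frequency in a study subcorpus (e.g. one peak period) against a reference corpus. The abstract does not name the exact statistic used, so this is a sketch of the standard formulation:

```python
import math

def log_likelihood_keyness(a, b, c, d):
    """Dunning log-likelihood keyness for a word occurring `a` times in a
    study corpus of `c` tokens and `b` times in a reference corpus of
    `d` tokens. Higher values mean the word is more distinctive of one
    corpus relative to the other; 0 means identical relative frequency."""
    e1 = c * (a + b) / (c + d)  # expected count in the study corpus
    e2 = d * (a + b) / (c + d)  # expected count in the reference corpus
    ll = 0.0
    if a:
        ll += a * math.log(a / e1)
    if b:
        ll += b * math.log(b / e2)
    return 2 * ll
```

Ranking the vocabulary of each peak period by this score against the rest of the corpus would surface the terms that give each period its distinct character.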

Long Paper (20+10min)

Keeping it Simple: Word Trend Analysis for the Intellectual History of International Relations

Benjamin G. Martin1,2

1Uppsala University, Department of History of Science and Ideas; 2affiliated researcher, Umeå University, Humlab

In my current research on the intellectual history of international relations, I aim to use digital methods of text analysis to explore conceptual content and change in diplomatic texts. Specifically, I am interested in the sub-set of bilateral treaties explicitly related to cross-border cultural exchange -- cultural treaties -- some 2000 of which were signed in the twentieth century. What methods and workflows seem most appropriate for this task? Our answer thus far has been to keep it simple. Inspired by recent work by Franco Moretti, Sarah Allison and others, we apply a straightforward form of quantitative word trend analysis, integrated with analysis of metadata about the corpus and tested (and expanded) through full-text searching. By formulating this approach in a specific relationship to the nature of the corpus and the historical questions I want to ask of it, we are able to get quite a lot out of this simple method. In this paper, I describe this approach, share some provisional findings, and offer some methodological reflections.
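The "keep it simple" word trend method described above reduces, at its core, to tracking a term's relative frequency across the corpus over time. A minimal sketch, with hypothetical data structures (the author's actual workflow and tooling are not specified in the abstract):

```python
from collections import Counter

def word_trend(docs_by_year, target):
    """Relative frequency of `target` per year: occurrences of the
    (lowercased, whitespace-tokenized) word divided by the total number
    of tokens in that year's documents."""
    trend = {}
    for year, docs in sorted(docs_by_year.items()):
        counts = Counter(tok for doc in docs for tok in doc.lower().split())
        total = sum(counts.values())
        trend[year] = counts[target] / total if total else 0.0
    return trend
```

Normalizing by yearly token counts matters here because the number of treaties signed per year varies widely; raw counts would mostly track corpus size rather than conceptual change.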