In this paper, we analyze the amount of speaking time of each candidate and political party during the election debates aired in broadcast media during the 2019 Estonian parliamentary election campaign, using automatic speaker identification and weakly supervised neural network training techniques.
The work has two goals: to analyze the effectiveness of a rapid weakly supervised speaker identification model training method under real-world conditions, and to examine potential bias in broadcast media towards political parties, in terms of the speaking time allotted to their individual candidates during election debates.
Speaker identification systems are usually trained on manually segmented and labelled training data: for each person that the system needs to cover, several speech segments containing that person's speech are required. This makes training data preparation costly and time-consuming, especially if a large number of speakers needs to be identifiable. In this work, by contrast, we trained speaker models using a recently proposed weakly supervised training method that only needs recording-level speaker labels: for each person, several recordings are required in which this person is one of the speakers, while segment-level labelling is not needed. This makes training data creation considerably cheaper. Furthermore, such training data can often be constructed automatically, using metadata accompanying the speech recordings. The method relies on automatic speaker diarization of the training data, i-vector based speaker embeddings, and a special cost function that encourages a deep neural network to assign only one of the discovered speaker vectors to a particular speaker label.
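The idea behind the cost function can be illustrated with a small sketch (a simplified reading of the approach, not the authors' implementation): diarization yields several speaker embeddings per training recording, and the loss only penalizes the embedding that best matches the recording-level label, leaving the other diarized speakers unassigned. The linear classifier and toy dimensions below are assumptions for illustration.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def weak_recording_loss(embeddings, label, W, b):
    """Weakly supervised loss for one training recording.

    `embeddings` are the speaker vectors produced by diarization
    (i-vectors in the paper; plain arrays here), `label` is the
    recording-level speaker label.  Taking the minimum of the
    per-embedding cross-entropies encourages the model to assign
    the label to exactly one of the discovered speaker vectors.
    """
    losses = [-np.log(softmax(W @ x + b)[label] + 1e-12) for x in embeddings]
    return min(losses)
```

During training, the gradient then flows only through the best-matching embedding, so the remaining diarized speakers of the recording do not pull the model towards the label.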
The 2019 Estonian parliamentary elections had 1084 registered candidates. We used YouTube and the Estonian Public Broadcasting (ERR) media archive to retrieve audio and video files that likely contained speech by each of the candidates. In the case of YouTube, we retrieved videos whose title or description contained the person's full name. For ERR, we relied on the metadata of each media clip, which listed the names of the persons speaking in the recording. Using this technique, potential training data was found for 810 candidates. However, only 317 candidates occurred in 10 or more recordings, as required by our training method.
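The retrieval step amounts to a simple metadata filter, sketched below; the field names (`title`, `description`, `persons`) are hypothetical placeholders for the YouTube and ERR metadata, not the actual schema used.

```python
from collections import defaultdict

def collect_training_clips(candidates, clips, min_recordings=10):
    """Match candidates to clips via metadata, as described above.

    A clip counts for a candidate if the full name appears in its
    title/description (YouTube) or in its list of speaking persons
    (ERR archive).  Candidates with fewer than `min_recordings`
    matches are dropped, as required by the training method.
    """
    found = defaultdict(list)
    for clip in clips:
        text = (clip.get("title", "") + " " + clip.get("description", "")).lower()
        persons = {p.lower() for p in clip.get("persons", [])}
        for name in candidates:
            if name.lower() in text or name.lower() in persons:
                found[name].append(clip)
    return {name: cs for name, cs in found.items() if len(cs) >= min_recordings}
```

Name-based matching of this kind inevitably produces some false positives (e.g., clips that merely mention a candidate), which is quantified on the manually examined subset below.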
We manually examined a small subset of the resulting dataset and determined that 12% of the clips were false positives, meaning that they did not actually contain speech by the person for whom they were retrieved.
After training speaker identification models on the automatically constructed training data, we validated the accuracy of the system using a set of four manually segmented and labelled election debates. The validation dataset contained speech by 26 unique candidates, 21 of whom (81%) were covered by our system. The system correctly identified 19 of the covered candidates, resulting in a recall of 73% over all candidates. No false positives were returned, resulting in a precision of 100%.
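For concreteness, the validation metrics can be computed set-wise as in the following sketch (function and variable names are made up; recall is measured over all candidates who appeared, not only those covered by a trained model):

```python
def identification_metrics(identified, appeared):
    """Precision over the system's output, recall over all true speakers.

    `identified` is the set of candidates the system reported,
    `appeared` is the ground-truth set of speakers in the debates.
    """
    correct = identified & appeared
    precision = len(correct) / len(identified) if identified else 1.0
    recall = len(correct) / len(appeared)
    return precision, recall
```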
The full speaking time analysis was performed over a set of 55 election debates from six different radio and TV stations, totalling 55 hours of audio. 19% of the debates were in Russian, the rest in Estonian. For each debate, the set of candidates who appeared in it was compiled manually, with the help of the metadata that came with the recording. A total of 123 unique candidates appeared in the debates, of whom 69 (56%) were covered by our system.
The analysis of speaking time over individual candidates brought no real surprises: the leaders of the eight political parties that participated in the elections with a so-called full list (i.e., at least 101 candidates) dominated the ranking, occupying the first seven places in terms of total speaking time.
By aggregating the speaking time of individual candidates, we calculated the total speaking time of the different parties. At first, the results seemed to indicate a large bias: large and established parties received up to twice as much speaking time as newer parties (even when limiting the analysis to "full list" parties). However, we observed that this was partly due to a weakness of our training method: newer parties have more candidates who are new to politics and thus have less exposure on YouTube and in the public broadcasting archive, increasing the risk that they are not covered by our model. We therefore adjusted the results as follows: each candidate who was present in a debate but not identified by our system was assigned an estimated speaking time, calculated as the average speaking time of the successfully identified persons in that debate. The adjusted results show relatively little difference between political parties: all full-list parties were assigned between 220 and 270 minutes of speaking time.
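The adjustment described above can be expressed compactly for a single debate (a sketch with made-up names; each debate is processed independently before aggregating per party):

```python
def adjusted_speaking_times(measured, participants):
    """Adjusted per-candidate speaking times for one debate.

    `measured` maps each identified candidate to the speaking time
    (in seconds) found by the system; `participants` is the manually
    compiled list of everyone who appeared in the debate.  Candidates
    the system could not identify get the average time of the
    identified ones.
    """
    average = sum(measured.values()) / len(measured)
    return {p: measured.get(p, average) for p in participants}
```

This keeps the measured times of identified candidates untouched and only imputes the missing ones, so parties are not penalized merely for having candidates without trained speaker models.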
We did not attempt to analyze the causality between candidate speaking time and election results, since there are several factors, such as prior popularity, speaking skills and experience in political debates, that affect both exposure in debates as well as the number of votes received.
The experiments showed that it is possible to use weak supervision to create a targeted speaker identification system with high precision from several potentially noisy data sources. However, we also observed that for a large proportion of the candidates, no training data could be retrieved automatically from public data sources, and thus no speaker identification models could be trained for them.
The analysis showed that the election debates were not biased in terms of speaking time: all major political parties received around 245 (± 10%) minutes of speaking time across the debates.