Conference Agenda

Session: OCR, Lexicography
Time: Thursday, 19 March 2020, 9:00am - 10:30am

Session Chair: Fredrik Norén
Location: Hall B

Presentations
Long Paper (20+10min)

Supervised OCR Post-Correction of Historical Swedish Texts: What Role does the OCR System Play?

Dana Dannélls1, Simon Persson2

1University of Gothenburg, Sweden; 2Chalmers University of Technology, Sweden

Current approaches to post-correction of OCR errors offer solutions that are tailored to a specific OCR system. This can be problematic if the post-correction method was trained on one OCR system but has to be applied to the output of another. Whereas OCR post-correction of historical text has received much attention lately, the question of what role the OCR system plays for the post-correction method has not been addressed. In this study we explore a dataset of 400 documents of historical Swedish text which have been OCR-processed by three state-of-the-art OCR systems: ABBYY FineReader, Tesseract and OCRopus. We examine the OCR results of each system and present a supervised machine learning post-correction method that tries to address the challenges exhibited by each system. We study the performance of our method using three evaluation tools: PRImA, the Språkbanken evaluation tool and the Frontiers Toolkit. Based on the evaluation analysis we discuss the impact each of the OCR systems has on the results of the post-correction method. We report quantitative and qualitative results showing varying degrees of OCR post-processing complexity that are important to consider when developing an OCR post-correction method.
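By way of illustration only, the kind of system-by-system comparison described above can be approximated with a simple character error rate computed against a ground-truth transcription. This is a minimal sketch, not one of the three evaluation tools named in the abstract, and the sample strings are invented:

```python
# Minimal sketch: approximate character error rate (CER) for comparing OCR
# output from different systems against a ground-truth transcription.
# Illustrative only; the study itself uses PRImA, the Språkbanken evaluation
# tool and the Frontiers Toolkit rather than this ad-hoc metric.

def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance between two strings (dynamic programming)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

def cer(ocr_text: str, ground_truth: str) -> float:
    """Character error rate: edit operations per ground-truth character."""
    return edit_distance(ocr_text, ground_truth) / max(len(ground_truth), 1)

# Hypothetical usage: the same line as recognised by three systems.
truth = "Konungens befallningshafvande i länet"
outputs = {
    "ABBYY FineReader": "Konungens befallningshafvande i lánet",
    "Tesseract": "Konungens befallningshafvande i 1änet",
    "OCRopus": "Konungens befallningshafvande i lanet",
}
for system, text in outputs.items():
    print(f"{system}: CER = {cer(text, truth):.3f}")
```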



Short Paper (10+5min)

Targeted, Neural Re-OCR of Norwegian Fraktur

Andre Kåsen, Lars G. Johnsen

National Library of Norway, Norway

This paper presents the process of making a gold standard data set for training optical character recognition (OCR) engines for Norwegian fraktur prints. We also train models with the OCR software Tesseract using a transfer learning approach. The training set and the model will be made freely available.

Finally, we start the process of re-OCRing a corpus of publicly available books from the digital repository at the National Library of Norway.
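As a hedged illustration of what batch re-OCR with a fine-tuned Tesseract model might look like (the model name nor_frak_finetuned and the directory layout are hypothetical; the abstract does not specify the authors' tooling beyond Tesseract itself):

```python
# Minimal sketch of re-OCR with a fine-tuned Tesseract model via pytesseract.
# Assumes a custom traineddata file, here called "nor_frak_finetuned"
# (hypothetical name), produced by transfer learning from an existing Fraktur
# model and installed in Tesseract's tessdata directory.
from pathlib import Path

import pytesseract
from PIL import Image

def reocr_directory(image_dir: str, out_dir: str,
                    lang: str = "nor_frak_finetuned") -> None:
    """Run the model over every page image and save plain-text output."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    for page in sorted(Path(image_dir).glob("*.png")):
        text = pytesseract.image_to_string(Image.open(page), lang=lang)
        (out / page.with_suffix(".txt").name).write_text(text, encoding="utf-8")

# reocr_directory("book_0001/pages", "book_0001/reocr")
```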



Short Paper (10+5min)

Handwritten Text Recognition and Linguistic Research

Erik M. Petzell

Institute for Language and Folklore, Department of Dialectology, Onomastics and Folklore Research in Gothenburg, Sweden

In this talk, I describe my ongoing work with automatic transcription of handwritten Swedish dialect texts from the 19th century, and relate it to my linguistic research on enclitic pronouns in North Germanic. Enclisis is a linguistic phenomenon that balances on the border, as it were, between syntax and morphology. For instance, enclitic pronouns fill syntactic slots just like free pronouns and larger noun phrases. However, clitics are prosodically dependent on another word, in effect being unable to bear stress. In that respect, they are more like inflectional endings than independent phrases.

Enclisis of any kind is hard to investigate in texts, since orthography, both in the past and the present, normally does not mark it. Audio recordings of dialect speakers may contain relevant data for historical linguists, but this type of material is very time consuming to work with. However, there is a third type of archival language data, which constitutes an intriguing source of linguistic structure of old: dialect texts, handwritten in the 19th century using a traditional phonetic alphabet. Dialect texts of this sort exist in archives all over Scandinavia, and through them, we are granted access to the phonetic subtleties of an era that is too distant to have been caught on audio tape.

In my talk, I will address two such texts (both written in the 1890s) from the south-west of Sweden: the first one is a compilation of dialectal expressions, collected in the parish of Fagered (in the province of Halland); the second one is a collection of narratives from the island of Orust (in the province of Bohuslän). I refer to the alphabet used in these texts as LMA, a label based on the name of the Swedish dialect alphabet (viz. LandsMålsAlfabetet, ‘the alphabet for rural dialects’). Nowadays, the LMA is used very marginally (and almost never outside of traditional onomastics). As a rule, linguists of today instead use the International Phonetic Alphabet, IPA (https://www.internationalphoneticassociation.org), when there is need for phonetic detail in written form. However, as soon as corpus-based linguistic research targets non-phonological issues, the fine phonetic details are superfluous. In fact, such detail only makes word- and phrase-based searches more complicated. Consequently, in order to make the old dialect texts useful for different sorts of linguistic research, it does not suffice to simply transform the text of the images into a digital correlate. In addition, there is need for several conversions of the original text into different more or less simplified formats, which, in turn, can be useful also for non-linguists (both other researchers and members of the general public).

The tool I use to analyse and transcribe the dialect texts is Transkribus (https://transkribus.eu/Transkribus). The first step was to decide how to write the LMA with a standard keyboard. As mentioned, the LMA is hardly used anymore, and only very few of the LMA symbols have Unicode status. Although all IPA symbols indeed do, they are difficult to produce with a standard keyboard. In order to reach an acceptable transcription speed, I have instead created a SAMPA-based transcription key. SAMPA stands for Speech Assessment Methods Phonetic Alphabet (https://www.phon.ucl.ac.uk/home/sampa) and it resorts only to the 128 characters that a standard (i.e. English) keyboard can produce. These characters, either in isolation or combined with others, are then given a specific phonetic value. Although the underlying principles for creating phonetic symbols are the same, my dialect SAMPA is a digital version of the LMA and is therefore quite different from standard Swedish SAMPA, which is IPA-based.
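To make the mechanism of such a transcription key concrete, here is a minimal sketch: ASCII characters and sequences standing in for phonetic symbols. The individual mappings below are invented placeholders, not Petzell's actual LMA-to-SAMPA key:

```python
# Minimal sketch of a keyboard-friendly transcription key. The mappings are
# illustrative placeholders only; the real key covers the full LMA inventory.
LMA_TO_SAMPA = {
    "ŋ": "N",    # velar nasal
    "ɔ": "O",    # open-mid back rounded vowel
    "ʉ": "}",    # close central rounded vowel
    "ɧ": "x\\",  # Swedish sj-sound
}

def to_sampa(text: str) -> str:
    """Replace each phonetic symbol with its ASCII counterpart."""
    return "".join(LMA_TO_SAMPA.get(ch, ch) for ch in text)

print(to_sampa("lɔŋ"))   # -> "lON"
```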

To begin with, I made a SAMPA transcript of roughly 100 pages of the Fagered collection. This amount of manual transcription is what is needed to train a so-called HTR engine (where HTR stands for handwritten text recognition). Once the HTR engine is integrated in the Transkribus platform, it is capable of automatically generating transcriptions of more text in the same hand. How well the engine works of course depends on an array of factors. One factor that often (according to the Transkribus crew) turns out to be complicating is super- and subscripted diacritics of the sort that occur abundantly in the dialect texts. Still, the HTR engine managed to handle the rest of the Fagered collection almost flawlessly; only a handful of minor manual corrections (concerning individual segments or diacritics) per page (16 lines) were required to perfect the transcription.

Transcription accuracy naturally decreases dramatically when the HTR engine is run on other LMA texts, written by other field linguists. When the Fagered engine handles text from Orust, only about a third of the LMA words are represented correctly in the SAMPA format. However, by adding some 50 pages of manual transcription of Orust text to the training sample of the existing HTR engine, the resulting SAMPA output becomes as satisfactory as with the Fagered collection.

Apart from dealing with the actual transference process (i.e. LMA image → SAMPA transcript), I have also experimented with conversions from SAMPA to other more or less simplified formats, in order to make the texts accessible to a wider circle of users. Only quite recently have I become aware of the models for dialect transliteration developed by the Text Laboratory in Oslo. These models transform dialectal forms into standard language, which opens the way for automatic lemmatization and annotation, in turn enhancing searchability radically. My ambition is to learn from the Norwegian project and to add transliteration into standard Swedish to the list of formats that the SAMPA transcripts can be converted to.
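A hedged sketch of one such conversion, stripping phonetic and prosodic detail to produce a simplified, search-friendly form (the stress and length marks used here are illustrative conventions, not necessarily the project's own; only the "_" dependency marker is taken from the description below):

```python
# Minimal sketch: simplify a detailed SAMPA transcript for word- and
# phrase-based searching. Marker conventions are assumptions for illustration.
import re

def simplify(sampa_line: str) -> str:
    """Drop stress and length marks and detach elements marked with "_"."""
    no_prosody = re.sub(r'["%:]', "", sampa_line)   # " and % = stress, : = length
    return re.sub(r"\s+", " ", no_prosody.replace("_", " ")).strip()

print(simplify('"ha:n sO:g_en'))   # -> "han sOg en" (invented example line)
```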

Finally, I will show how my linguistic research into clitics has been facilitated by the digitization of dialect texts. Since the SAMPA output contains both phonetic and prosodic details, it is fairly easy to extract those instances of prosodic dependencies (marked _ in the SAMPA format) in a text that represent potential enclitic pronouns. A somewhat prosaic effect hereof is simply that I am now able to sort and quantify relevant data in a way that I could not do before. A more intriguing consequence is that I have actually discovered linguistic variation that has previously gone under the radar. For instance, descriptions of the traditional Bohuslän dialect mention only one masculine and one feminine object clitic: (e)n and (n)a respectively. However, my Orust text reveals a hitherto unnoticed gender asymmetry: the feminine form (n)a in fact competes with a reduced form of the full pronoun hener (viz. ner), whereas (e)n remains the only masculine option, reduced forms of the full pronoun ham being unattested.
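Because the "_" marker is explicit in the SAMPA output, harvesting and counting candidate host-clitic pairs can be done with a simple script. The following is a minimal sketch, and the example lines are invented dialect material rather than real excerpts from the Orust text:

```python
# Minimal sketch: collect and count potential enclitic pronouns, i.e. elements
# attached to a host with "_" (the prosodic-dependency marker described above).
import re
from collections import Counter

CLITIC = re.compile(r"(\S+)_(\S+)")   # host word, "_", dependent element

def clitic_counts(lines):
    """Count (host, clitic) pairs across a list of transcribed lines."""
    counts = Counter()
    for line in lines:
        for host, clitic in CLITIC.findall(line):
            counts[(host, clitic)] += 1
    return counts

sample = ["sO:g_en i gO:r", "tO:g_na mE sai"]   # invented transcriptions
print(clitic_counts(sample).most_common())
```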



Long Paper (20+10min)

Integrating TEI/XML Text With Semantic Lexicographic Data

Tarrin Wills, Ellert Thór Jóhannsson, Simonetta Battista

University of Copenhagen, Denmark

The Dictionary of Old Norse Prose (ONP — onp.ku.dk) is an extensive digital resource which links the semantic analysis of the lexicon of Old Norse with its material record (manuscripts and charters). Its citation index of around 800,000 words represents an estimated 7% of the entire corpus of Old Norse. Only a small proportion of the corpus, which is around 10 million words, has been prepared as digital texts. The dictionary is not complete but nevertheless at this stage contains a semantic and/or grammatical analysis of around one in every twenty words of the Old Norse corpus. This analysis is spread fairly evenly across the corpus.

The methods and data structures for the dictionary, which began in 1939, were developed before digital corpus linguistics was possible. The dictionary’s methods are based on manual excerption of words and surrounding text and are not in themselves compatible with corpus-based approaches. Other projects, however, have been developing manuscript-based Old Norse digital texts that belong to the same corpus that ONP covers, using a compatible manuscript-based approach. The most extensive of these is the Menota project (menota.org), which includes a catalogue of TEI/XML manuscript texts encoded according to a specified subset of TEI. A previous paper by the authors (Wills, Jóhannsson and Battista 2018) describes a fast and user-friendly workflow whereby Menota texts can be linked at the lexical level to dictionary headwords in ONP using a combination of automated and manual stages. This workflow is designed to achieve very high levels of accuracy (close to 99.9%) for the automated stages. The workflow demonstrates an interoperable method whereby TEI/XML-encoded texts can be integrated and linked into relational data models such as dictionaries. These methods are designed to maintain a link between the two external data sources at the level of the word so that they can be edited and maintained separately. (Documentation can be found at goo.gl/ncdWAC)
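The lexical-level starting point of such a workflow can be sketched as follows. The element and attribute names (<w> elements with lemma attributes and xml:id) are assumptions about Menota-style encoding, and the file name is hypothetical; the authors' actual pipeline combines automated and manual stages:

```python
# Minimal sketch: pull word tokens and lemma attributes out of a Menota-style
# TEI/XML text so they can be matched against ONP headwords. The markup
# details are assumptions for illustration.
from lxml import etree

TEI_NS = {"tei": "http://www.tei-c.org/ns/1.0"}
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"

def extract_words(tei_path: str):
    """Yield (word id, lemma, surface form) triples from <w> elements."""
    tree = etree.parse(tei_path)
    for w in tree.iterfind(".//tei:w", namespaces=TEI_NS):
        yield w.get(XML_ID), w.get("lemma"), "".join(w.itertext()).strip()

# for wid, lemma, form in extract_words("menota_text.xml"):   # hypothetical file
#     print(wid, lemma, form)
```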

Linking lemmas at the lexical level means that users can access the dictionary directly by interacting digitally with the words in the text: clicking on a word, for example, can bring up a full dictionary entry regardless of homographs, and regardless of the normalisation or lemmatisation used in the particular text edition. It also means that full concordances for a particular lemma can be generated automatically.

The current research builds on these processes to link the words of the corpus deeper into the dictionary’s semantic structure. A dictionary aims for not just a lexical but also a semantic overview of the corpus. In traditional dictionaries such as ONP this is done by excerpting relevant words from the corpus and analysing every citation excerpted, building a semantic tree of how the headword is used in the texts. Every node in that tree contains a sense and a definition of that sense, forming the structure of the dictionary entry. With such dense excerption of examples in a dictionary such as ONP, it is technically possible to link a high proportion of words in a given text to a particular semantic analysis as assigned by dictionary editors. That is, a high proportion of words can potentially be linked directly to the individual senses and definitions of the structured dictionary entry. For the user this would mean that they can find a particular sense or usage of a word in the specific context they are reading.
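The entry structure just described can be pictured, in a rough and purely illustrative sketch, as a headword whose senses form a tree, each node carrying a definition and the citations excerpted for it; the field names and miniature example below are not ONP's actual schema:

```python
# Minimal sketch of a dictionary entry as a tree of senses. Field names and
# the miniature example entry are illustrative, not ONP's data model.
from dataclasses import dataclass, field

@dataclass
class Sense:
    definition: str
    citations: list[str] = field(default_factory=list)     # edition page/line refs
    subsenses: list["Sense"] = field(default_factory=list)

@dataclass
class Entry:
    headword: str
    senses: list[Sense] = field(default_factory=list)

# Hypothetical miniature entry:
entry = Entry("dœmi", [
    Sense("example, instance", citations=["Str 17/24"],
          subsenses=[Sense("exemplary tale", citations=["Barl 42/3"])]),
])
```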

This process requires the digital linking of individual words in a text not only to the dictionary headword but to the particular citation in the semantic tree of the dictionary entry. This is a challenging task. The references in the dictionary for citations are in almost all cases to the physical page and line of the published edition. For Menota-style TEI texts the words can normally be identified by the page and line of the manuscript version of the text. The two sets of references are in the same order but are not otherwise compatible. Not all words are excerpted by the dictionary, leaving no simple way of aligning and linking the two types of reference.

The first stage of the methodology employed here is to identify (by database queries) lemmas that appear only once in the TEI/XML text and also only once among the dictionary's citations from the same text. Because in such cases there is little ambiguity about whether the word corresponds to the citation in the dictionary, the two can be linked automatically with fair reliability. Accuracy is around 90%, so these links still require manual checking that the citation refers to the same word in the same context as the word in the text. The initial links between the words and the citation index provide a framework by which the same method can be applied to the smaller sections of text between the linked words. This again involves identifying lemmas unique to each section of text in both the manuscript-based edition and the print-based edition used by the dictionary (using page and line references in each case to define the extent of the section searched). Links are inserted in both data structures: in the dictionary to the word in the text, and in the text to the citation in the dictionary. This method is repeated, with decreasing gaps between the identified words, until no further automatic linking is possible.
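A minimal sketch of this interval-based alignment follows. It assumes each side is simply a list of (reference, lemma) pairs in text order and that the two orderings agree, as stated above; the real workflow also relies on database queries and includes a manual checking stage for every proposed link:

```python
# Minimal sketch of the recursive unique-lemma alignment described above.
# "words" are (manuscript page/line reference, lemma) pairs from the TEI text;
# "citations" are (edition page/line reference, lemma) pairs from ONP.
from collections import Counter

def unique_positions(items, lo, hi):
    """Map lemma -> position for lemmas occurring exactly once in items[lo:hi]."""
    counts = Counter(lemma for _, lemma in items[lo:hi])
    return {items[i][1]: i for i in range(lo, hi) if counts[items[i][1]] == 1}

def align(words, citations, w_lo=0, w_hi=None, c_lo=0, c_hi=None, links=None):
    """Recursively link citations to words via lemmas unique to both sections."""
    if links is None:
        links, w_hi, c_hi = [], len(words), len(citations)
    uw = unique_positions(words, w_lo, w_hi)
    uc = unique_positions(citations, c_lo, c_hi)
    anchors = sorted((uw[lem], uc[lem]) for lem in uw.keys() & uc.keys())
    if not anchors:
        return links                      # no further automatic linking possible
    for wi, ci in anchors:
        links.append((words[wi][0], citations[ci][0]))   # (word ref, citation ref)
    # Recurse on the gaps between consecutive anchors (and the two ends).
    starts = [(w_lo, c_lo)] + [(wi + 1, ci + 1) for wi, ci in anchors]
    ends = anchors + [(w_hi, c_hi)]
    for (ws, cs), (we, ce) in zip(starts, ends):
        if we > ws and ce > cs:
            align(words, citations, ws, we, cs, ce, links)
    return links
```

The gaps shrink at every level of recursion, mirroring the repetition of the method with decreasing gaps until no further automatic linking is possible.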

The result is that a very high proportion of the citations from a given Menota text can be quickly and accurately linked to and from the word in the text and dictionary. These links (as URIs and/or database keys) represent the minimal information needed to connect the words in each resource and are maintained even as the texts and the dictionary continue to be edited and developed separately.

For the user this linking means that when they access the text, the individual words are not only linked to the dictionary entry, but in a good proportion of instances they are linked to the individual definition and/or phrasal-grammatical context of the word as defined in the dictionary. A user — for example a student or researcher — can pull up a section of text and click on any word to get the dictionary entry, if available. Words linked at the citation level can be highlighted to indicate that further information is linked and when clicked will show the individual definition for the word, if available, and other information that the dictionary may record about that particular citation, such as the citation slip and edition information. Users of the dictionary can find specific examples for usages and can access the full text where that usage occurs, rather than the minimal surrounding text normally provided for each citation. (For an example see https://onp.ku.dk/c475521 and click on the Menota button. The red coloured words are linked to other citations in the dictionary, many of which have been defined.)

At this stage one text has been extensively linked to ONP using this method: Strengleikar (the Old Norse version of the Lais of Marie de France) in Uppsala manuscript DG 4-7 (onp.ku.dk/r10468). The method described above automatically and accurately linked 3168 of the 4065 citations in ONP to the Menota edition, representing 8% of the whole text. A further 1700 citations have been linked in other Menota texts, most extensively the Saga of Barlaam and Josaphat in manuscript Holm perg 6 fol (onp.ku.dk/r252). We expect to present more results at the conference and a further investigation of the issues regarding the citations that could not be linked by this method.

The advantages of the linking method described here include the automated generation of integrated glossaries for the text edition. Such glossaries can be used as a reading aid for those less familiar with the language and as a language-learning tool. Glosses assist language learning by improving text comprehension and aiding vocabulary acquisition (Lomicka 1998). This applies also to digital glosses and ‘authentic’ texts (Abraham 2007), such as those used in this project.

The semantic analysis of significant portions of the text can be developed further if the dictionary, as is hoped, later integrates a digital thesaurus. The thesaurus, when linked to particular senses in the dictionary, can be integrated into the text itself, potentially creating a semantic map of the text as a whole and helping users to find semantically similar material in the corpus. Lastly, the majority of words, those not analysed by the dictionary project, can be analysed semantically using statistical or other digital methods, so that the meanings of words lacking manual analysis can potentially be predicted from words in similar contexts.