Early Experiments on Automatic Annotation of Portuguese Medieval Texts

This paper presents the challenges encountered, and the solutions adopted, in the lemmatization and part-of-speech (PoS) tagging of a corpus of Old Portuguese texts (up to 1525), paving the way for the automatic annotation of these medieval texts. A highly granular tagset, previously devised for Modern Portuguese, was adapted to this end. A large text (∼155 thousand words) was manually annotated for PoS and lemmata and used to train an initial PoS-tagger model. When applied to two other texts, the resulting model attained 91.2% precision on a textual variant of the same text and 67.4% on a new, unseen text. A second model was then trained on the data from these three texts and applied to two other unseen texts, achieving a precision of 77.3% and 82.4%, respectively.


Introduction
For a long time, researchers in historical linguistics have handpicked the traces of the phenomena they choose to study. This is laborious and slow work, and the pressure of time and deadlines has usually meant that the scope of the investigation had to be restricted, whether in terms of the phenomena studied or of the quantity of data perused. In addition, though the availability of old texts on the web is greater than ever before [1,3,5,11,12,22,19], most of the time they are provided only as facsimiles or, even if transcribed and edited, they are not linguistically annotated, at least not for the words' parts-of-speech (PoS), i.e. morphosyntactic categories (noun, verb, adjective, etc.) and their inflection, or for the words' lemmata [8,13]. This presents a challenge to researchers focused on studying the history of a language. When the written word is the only evidence of how the language was at a particular time, collecting the data manually can be both a valuable asset and a serious limitation. Having the words of a text annotated for their lemmas and PoS, in turn, allows for further linguistic processing, namely automatic syntactic analysis (parsing), the modelling of former stages of the language by way of treebanks [16], and the collation of texts [2]. For this reason, Natural Language Processing (NLP) tools and techniques [10] can be very useful to Historical Linguistics [5,6,18]. Not only do they allow new and different kinds of research questions, but they also introduce new research tools and methods for collecting data and speed up its analysis. This paper is part of a larger project that aims to apply NLP methods to the investigation of Old Portuguese, particularly to the texts that make up the Corpus de Textos Antigos 'Old Texts Corpus' (CTA), a project started in 2015 by the Center of Linguistics of the University of Lisbon (CLUL).
As a repository of transcribed and edited texts in Old Portuguese, dated up to 1525, this corpus can be a helpful resource to researchers interested in the older stages of the language. Nevertheless, the CTA's texts are not yet annotated for either their PoS or their lemmas. Manual annotation of the entire corpus is a very time-consuming, highly skilled and costly task, hence a machine-learning approach would better suit these goals. In fact, even if the automatic annotation produced by these language models is not completely accurate, it goes a long way in preparing the textual material for manual revision and correction, speeding up the human annotation effort. This paper, then, presents some early results of the automatic annotation of a subset of the corpus whose data was prepared for machine-learning PoS-tagging and lemmatization tasks.

Corpus of Ancient Texts (CTA)
The Corpus de Textos Antigos is a project developed by CLUL's Philology group, which aims to publish all hagiographic, spiritual and didactic texts written in or translated into Portuguese up to 1525 (this is a flexible date, deliberately chosen to allow the inclusion of incunabula and also of texts that, despite dating from the first quarter of the 16th century, transmit older manuscripts). The main purpose of this project is to offer editions that reproduce the texts with high fidelity to the manuscript (ms.) or incunable. Following this principle, there is little or no editorial intervention when it comes to the correction of errors, the restitution of lacunae or orthographic variation. As a result, a simple string search would almost never capture all the instances of a word occurring within the corpus, so that lemmatization is an essential preliminary step towards efficient lexical queries. The corpus uses the web-based framework TEITOK [9,20], an online tool that combines textually annotated texts with linguistic annotations. Thanks to its modular design and the granular customization it allows, TEITOK can be used with very different corpora.
As of April 2022, the corpus consists of 31 editions of 26 different texts. There are three texts with more than one edition: Horto do Esposo has two edited manuscripts from the late 14th century; Vida de Santa Maria Egipcia has two mss. from the 15th century; and Vida e Milagres de Santa Senhorinha de Bastos, written in the second half of the 13th century, has four edited witnesses that date from the early 17th century to the 19th century. The texts differ in length, from a couple of hundred to more than 150,000 words. As shown, texts also vary both in the date of redaction and in the date of production. The oldest text (and manuscript) is the ms. A of Horto do Esposo, which dates between 1390 and 1437. The most recent text (not necessarily the most recent manuscript or edition) is Memorial da Infanta Santa Joana, which dates between 1513 and 1525. The most recent manuscript, the ms. P of Vida e Milagres de Santa Senhorinha de Bastos, comes from the end of the 18th century.

Text selection, preparation and annotation
In this section, the text selection, preparation and annotation process are described. For the manual annotation, the ms. A of Horto do Esposo (henceforward, HdE-A), for which both the manuscript and the text date from about the same time (c. 1390-1437), was chosen. For testing, the ms. G1 of Vida e Milagre de Santa Senhorinha de Basto (henceforward, VMSSB-G1) was chosen. This is a text dated between 1248 and 1284, whose manuscript has been dated to 1620-1645. As the corpus has three other, albeit fragmented, witnesses of Horto do Esposo (henceforward, HdE-DCE), the testing was also done on this witness. The HdE-DCE ms. was chosen for testing the PoS-tagger because of its natural likeness to HdE-A, while the choice of VMSSB-G1 is due to the fact that, since both HdE-A and VMSSB-G1 are hagiographic in genre, a greater similarity between their respective lexicons is expected. For another experiment, a second model was trained on these three texts, and two other texts were selected from the corpus for testing: the História do mui nobre Vespasiano (henceforward, Vespasiano) and the Memorial da Infanta Santa Joana (henceforward, MISJ), which is thought to have been first written between 1513 and 1525. Table 1 presents the contents of the texts selected from the CTA corpus for the experiments in this paper. The selected texts are indicated by a conventional code with their respective date (see details below). Information on the number of tokens, words, different word forms (case sensitive) and punctuation signs is provided.
The manual annotation task consisted in attributing to each token the corresponding lemma and part-of-speech (PoS) tag. A set of guidelines for this task was produced to define the criteria for attributing the lemmata, to describe the tagset, and to explicitly guide the PoS-tag attribution, especially in more complex cases. For the lemmatization of the word forms, the modern lemma was adopted whenever possible, in order to ensure an efficient way to query the corpus.
The traditional criterion for lemma attribution was generally adopted: the impersonal infinitive for verbs, the masculine-singular form for adjectives, the singular form for nouns (masculine or feminine, depending on their gender), and so on. Each PoS-tag consists of a morphosyntactic category (e.g. adjective, adverb, conjunction, determiner, interjection, noun, preposition, pronoun or verb) and, if applicable, an inflection code indicating the morphological categories relevant to that category (i.e., tense-mood and person-number for verbs; gender and number for nouns; etc.). We adopt a highly granular tagset, adapting one already developed for Modern Portuguese and presented in [4,14,15]. The formalism used here is broadly the same as the one originally developed by [7]. Three annotators participated in the task, all linguists familiar with Old Portuguese texts and their grammar. At the end of the process, a set of procedures was put in place to verify and correct any inconsistencies.
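The benefit of attributing a modern lemma to every word form can be illustrated with a minimal sketch, which is not part of the annotation pipeline itself. It assumes tokens are available as (word form, lemma) pairs; the spelling variants of the adverb are those reported for HdE-A in the discussion of the results, while the remaining token, the function names and the data layout are hypothetical.

```python
from collections import defaultdict

# Hypothetical annotated tokens as (word form, lemma) pairs. The spelling
# variants of the adverb are those reported for HdE-A; the last token is
# invented for illustration.
tokens = [
    ("nõ", "não"),
    ("non", "não"),
    ("nom", "não"),
    ("nã", "não"),
    ("nome", "nome"),
]


def build_lemma_index(annotated):
    """Map each lemma to the positions of all tokens annotated with it."""
    index = defaultdict(list)
    for position, (_form, lemma) in enumerate(annotated):
        index[lemma].append(position)
    return index


def string_search(annotated, query):
    """Return the positions whose word form matches the query exactly."""
    return [i for i, (form, _lemma) in enumerate(annotated) if form == query]


lemma_index = build_lemma_index(tokens)
by_form = string_search(tokens, "non")   # finds a single spelling only
by_lemma = lemma_index["não"]            # finds every variant of the adverb
```

In this toy example, the string search retrieves one occurrence, whereas the lemma query retrieves all four spellings of the adverb.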

Experiments and Results
Once all the words in Horto do Esposo (HdE-A) had been initially annotated with lemmas and PoS-tags, a thorough revision was made, regarding not only the correctness of the lemma attribution but also the formal consistency of the annotation. Errors and inconsistencies due to manual annotation were detected and corrected. A PoS-tagging model was then trained with the TreeTagger [17] and applied to both HdE-DCE and VMSSB-G1. Then, after correcting the annotations produced for these two texts, a new model was trained and applied to both MISJ and Vespasiano. Table 2 shows the results of the different experiments in automatically PoS-tagging the corpus' texts. A preliminary, manual inspection of the results and the corresponding error analysis were then carried out. Entirely correct matches (lemma, PoS and morphosyntactic tag) are marked as true-positives (TP), and precision (P) is provided. Then, lemma attribution was considered: the lemma was either correctly attributed (L-), incorrectly attributed (T-), or not given at all (unknown, U-). Within each of these lemma attributions, the correctness of the PoS and of the morphosyntactic tag was also distinguished: -p indicates that both PoS and morphosyntactic tag were correctly given; -t indicates that the PoS was correctly marked, but not the morphosyntactic tag; -z indicates that neither PoS nor tag was correct. Punctuation marks (punct) were often marked by the system as unknown lemmas instead of receiving the conventional notation adopted. Often, these were also incorrectly given a PoS and a morphosyntactic tag.
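The classification scheme just described can be sketched as follows. This is an illustration under stated assumptions, not the script used to produce Table 2: each token is represented as a (lemma, PoS, inflection) triple, the string `<unknown>` is assumed to mark a lemma the tagger could not supply, a wrong PoS is counted as -z whatever the inflection code, precision is taken to be the share of tokens that are entirely correct, and the tag labels in the example are placeholders rather than the tagset adopted here.

```python
from collections import Counter

UNKNOWN = "<unknown>"  # assumed marker for a lemma the tagger could not supply


def classify(gold, pred):
    """Label one token by comparing (lemma, PoS, inflection) triples."""
    gold_lemma, gold_pos, gold_infl = gold
    pred_lemma, pred_pos, pred_infl = pred

    # Lemma status: unknown (U), correct (L) or incorrect (T).
    if pred_lemma == UNKNOWN:
        lemma_code = "U"
    elif pred_lemma == gold_lemma:
        lemma_code = "L"
    else:
        lemma_code = "T"

    # Tag status: wrong PoS (z), PoS and inflection right (p), PoS only (t).
    if pred_pos != gold_pos:
        tag_code = "z"
    elif pred_infl == gold_infl:
        tag_code = "p"
    else:
        tag_code = "t"

    label = f"{lemma_code}-{tag_code}"
    return "TP" if label == "L-p" else label


def evaluate(gold_tokens, pred_tokens):
    """Count the labels and compute precision as TP over all tokens."""
    counts = Counter(classify(g, p) for g, p in zip(gold_tokens, pred_tokens))
    precision = counts["TP"] / len(gold_tokens) if gold_tokens else 0.0
    return counts, precision


# Placeholder annotations; the tag labels are hypothetical.
gold = [("não", "ADV", ""), ("dizer", "V", "Pi3s"),
        ("santo", "ADJ", "ms"), ("vida", "N", "fs")]
pred = [("não", "ADV", ""), ("dizer", "V", "Pi1s"),
        (UNKNOWN, "ADJ", "ms"), ("ver", "V", "Pi3s")]

counts, precision = evaluate(gold, pred)
```

With these four placeholder tokens, the labels are TP, L-t, U-p and T-z, one each.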
Concerning the first experiment, the better performance of the model on HdE-DCE (precision: 91.24%) could be explained by the fact that it is another witness of Horto do Esposo. Both manuscripts (HdE-A and HdE-DCE) thus have the same lexicon and the same syntactical structures. The proximity in the dates of the manuscripts may also have played a role in these results. Several aspects may explain the worse performance of the model on VMSSB-G1 (precision: 67.4%). The ms. VMSSB-G1 dates from the 17th century, which is much later than the date of HdE-A (c. 1390-1437). Though some older traces of the language are preserved, VMSSB-G1 shows some linguistic changes that happened between the two periods. For example, the program did not recognize the form nao (adverb 'no'), as HdE-A only presents the forms nõ, non, nom, and nã. This new graphic form nao signals the changes in the nasal word endings, converging into the diphthong <ão>. Moreover, many lemmas could not be ascribed due to graphic differences found in this manuscript, even when the same word appears in both texts. Also, VMSSB-G1 makes use of the comma 1,426 times, whereas HdE-A only uses the full stop, which explains the punctuation errors signaled in the Table.
The model built upon the data of HdE-A, HdE-DCE and VMSSB-G1 was then applied to MISJ and Vespasiano. Results show that the new model performed better on Vespasiano (precision: 82.38%) than on MISJ (precision: 77.31%), though it still fails to recognize a large number of lemmas, especially in the latter. Again, many words show a spelling different from the one found in the data used to train the model.
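One simple way to gauge how much spelling variation separates a new text from the training data is to measure the share of its tokens whose exact spelling never occurs in the training text. The sketch below illustrates the idea on toy data built from the forms discussed above; it is not necessarily how the figures reported in this paper were obtained, and the function name is hypothetical. The comparison is case sensitive.

```python
def unseen_rate(train_forms, test_forms):
    """Share of test tokens whose exact spelling never occurs in training."""
    known = set(train_forms)
    if not test_forms:
        return 0.0
    unseen = sum(1 for form in test_forms if form not in known)
    return unseen / len(test_forms)


# Toy data: the training side holds the spellings reported for HdE-A,
# the test side adds the form "nao" found in VMSSB-G1.
train = ["nõ", "non", "nom", "nã"]
test = ["nao", "non", "nao", "nom"]

rate = unseen_rate(train, test)  # two of the four test tokens are unseen
```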
As for the incorrect annotations (false-positives), there are two types of errors, based on whether the model attributes a lemma to a word (column T-z) or not (column U-z). A large number of words with unknown lemmas (U-) are still adequately tagged, either for both their PoS and their morphosyntactic values (U-p) or for their PoS alone (U-t). In these cases, the PoS-tagger is able to correctly guess those values from the surrounding words. Many cases in T-z correspond to the typical situation of PoS ambiguity. For example, the word nos may correspond to different inflections of the personal pronoun (nós/nos, 'we, us') but also to the contraction of a preposition and a definite article (em_os 'in_the-masc.pl.'). As for the cases with unknown lemmas where the system also fails to assign the PoS and morphosyntactic tags (U-z), they may hint at the natural limitations of the machine-learning approach adopted here.
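Word forms prone to this kind of ambiguity can be listed directly from annotated data by collecting every form attested with more than one analysis. The following is a minimal sketch, assuming tokens are available as (word form, lemma, PoS) triples; the lemma and PoS labels in the sample, including the lemma given to the contraction, are hypothetical placeholders.

```python
from collections import defaultdict


def ambiguous_forms(annotated):
    """Return word forms attested with more than one (lemma, PoS) analysis."""
    analyses = defaultdict(set)
    for form, lemma, pos in annotated:
        analyses[form].add((lemma, pos))
    return {form: sorted(found)
            for form, found in analyses.items() if len(found) > 1}


# Hypothetical sample; labels are placeholders, not the adopted tagset.
sample = [
    ("nos", "nós", "PRON"),
    ("nos", "em_o", "PREP+DET"),
    ("nos", "nós", "PRON"),
    ("vida", "vida", "N"),
]

result = ambiguous_forms(sample)
```

Here only nos is returned, with its two competing analyses, which is the set of forms a reviser might inspect first.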

Conclusion and future work
This paper presents the preliminary steps taken towards the automatic annotation of the Portuguese Corpus de Textos Antigos. This annotation consists in attributing lemmas and PoS-tags, comprising both the morphosyntactic categories and their inflection values, to the word forms. An initial annotation task was manually carried out on HdE-A, containing almost 150 thousand tokens. This data was then used to train a machine-learning model, which was in turn used to automatically annotate two other, smaller documents: another ms. (HdE-DCE) of the same text used for training, and another text, different but of a similar genre (VMSSB-G1). As expected, the model achieved a very high precision (91.24%) on the second ms. of the HdE text. With the unrelated text of VMSSB-G1, the model only produced a modest precision (67.4%), mostly because many word forms (8.84%) had not been previously seen by the model, so that their lemmata were labelled as unknown. Still, the model was able to correctly assign the PoS and the inflection values to most of them.
The preliminary results of the automatic annotation suggest that the model improves as it receives more data. The performance of the second model on Vespasiano is better than the outcome of the first experiment on VMSSB-G1. Whereas the latter was annotated by a model trained on the data from only one text, the former was annotated by a model trained on three different texts. As for the errors, whenever the tagger inaccurately attributes a lemma, it is often due to the ambiguous nature of the word.
The use of NLP methods on the corpus will allow new questions to be asked and new approaches to be taken to this linguistic data. Based on the lexically annotated corpus, it will now be possible to analyse the irregularity of the forms and linguistic changes. The use of an annotated corpus could also be helpful in determining the affiliation between different witnesses of the same text [2], using automatic collation tools such as CollateX [21].