A Semi-Supervised Learning Approach for Acoustic-Prosodic Personality Perception in Under-Resourced Domains

Automatic personality analysis has gained attention in recent years as a fundamental dimension in human-to-human and human-to-machine interaction. However, it still suffers from the limited number and size of speech corpora for specific domains, such as the assessment of children’s personality. This paper investigates a semi-supervised training approach to tackle this scenario. We devise an experimental setup with age and language mismatch and two training sets: a small labeled training set from the Interspeech 2012 Personality Sub-challenge, containing French adult speech labeled with the OCEAN personality traits, and a large unlabeled training set of Portuguese children’s speech. As a test set, a corpus of Portuguese children’s speech labeled with OCEAN traits is used. Based on this setting, we investigate a weak supervision approach that iteratively refines an initial model trained with the labeled data set using the unlabeled data set. We also investigate knowledge-based features, which leverage expert knowledge in acoustic-prosodic cues and thus need no extra data. Results show that, despite the large mismatch imposed by language and age differences, it is possible to attain improvements with these techniques, pointing to the benefits of both weak supervision and expert-based acoustic-prosodic features across age and language.


Introduction
The analysis of personality traits has a plethora of applications, such as discriminating natural from disordered behaviors or automatically assessing personality traits, either in human-human communication or in human-computer interaction. Much of the literature on automatic processing of personality traits still focuses on assessing and detecting the traits based on several sets of distinct features. Artificial intelligence applications are, however, taking steps towards endowing robots and virtual agents with certain traits to better interact with humans, making the communication more idiosyncratic and tuned to the paralinguistic fingerprints of an interlocutor. The Big-Five (OCEAN) model is a widely used psychological model that describes human personality in terms of five broad dimensions: Openness (artistic, imaginative, original), Conscientiousness (organized, efficient, thorough), Extroversion (energetic, outgoing, talkative), Agreeableness (kind, generous, sympathetic), and Neuroticism (anxious, self-pitying, worrying).
The automatic perception/classification of personality traits is still a very challenging task, owing either to the individual spectrum of a speaker or to the spectrum of the trait itself: when the richness of a person is described by the five dimensions of the Big-Five model, the description may not cover all the subspecifications of, or the boundaries between, such classes, as psychological studies have pointed out [1]. It is clear from the literature that some traits are more easily recognized by automatic procedures than others, although this may vary with the data and the methodologies applied (see [2] for a survey). Moreover, it has been tentatively suggested that different personality traits are revealed in spontaneous speech by different sets of representative acoustic/prosodic features [2,3,4,5,6], but exhaustive categorizations of such features and studies across ages, cultures, etc. are still very scarce.
In addition, computational perception of personality is a recent field of research, and speech datasets annotated in terms of personality traits are still scarce and small, which hinders the development of robust and accurate personality models. Psychological studies show a strong debate between change and continuity of personality traits from childhood to adulthood, or even to old age in longitudinal studies [7,8,9,10]. The studies in [7] show that children's personality traits are linked to those displayed in adulthood. Along the same lines, we hypothesized the existence of a consistent set of acoustic/prosodic features for Extroversion and Agreeableness in both adult and children's speech, pointing to reasonable performance rates for the perception of personality traits across different languages and ages [11]. This opens the door to the use of heterogeneous data sets in personality perception tasks as a way to circumvent the scarcity of labeled data in under-resourced domains.
Building upon our previous work, we devise here a more solid experimental setup by adding new personality annotations of the test set, and by incorporating a large, unlabeled database of Portuguese children's speech that is used in a semi-supervised learning approach to learn more robust personality models. For this purpose we apply the concept of transfer learning, following other successful applications of this approach to emotion recognition in speech [12,13].
The paper is organized as follows: Section 2 presents the speech databases employed in this work. Section 3 describes the acoustic/prosodic features used as the basis for the automatic personality perception models described in Section 4. Experimental results are presented in Section 5, and the paper ends with conclusions in Section 6.

Cross-age and cross-language datasets
This work aims at the automatic perception of children's personality by taking advantage of heterogeneous corpora in a cross-language and cross-age setup. The well-known Speaker Personality Corpus (SPC) [14,15,16] has been used here to train statistical models (binary classifiers) for each personality trait in the Big-Five model (OCEAN). The larger, unlabeled CNG Corpus of European Portuguese Children's Speech (CNG) [17] was then used in a self-learning (semi-supervised) approach to iteratively refine the initial models. Finally, the Game-of-Nines (GoN) corpus [18] has been used as a test set to study how personality models built from French adults' speech can be used to assess the Big-Five dimensions of personality of Portuguese children, and to evaluate the performance of the semi-supervised learning procedure adopted in this work.

Speaker Personality Corpus
The Speaker Personality Corpus consists of 640 speech files from 322 different Swiss-French speaking adult individuals. Each file contains 10 seconds of speech from just one speaker (around 1 hour and 40 minutes in total). All the files were independently assessed by 11 judges using the BFI-10 personality questionnaire [19]. For each file, a high or a low level is assigned for every personality trait (denoted as O/NO, C/NC, E/NE, A/NA, N/NN, respectively) using a majority vote procedure. We refer to [15] for a detailed description of the division of the SPC corpus into train, development and test subsets.

Game-of-Nines Corpus
The Game-of-Nines corpus was originally designed to study how conflict unfolds in social interactions by looking at behavioral cues (e.g., gaze) in a mixed-motive social interaction (i.e., a scenario with competitive and cooperative incentives) with children. It comprises synchronized video and audio recordings of 11 dyadic sessions with 22 Portuguese children aged 10 to 12 years, playing a bargaining card game (a modified version of the Game of Nines [20]). The duration of the recordings varies between 9 and 18.6 minutes, with an average duration of 12.8 minutes and a total of 2 hours and 20 minutes.
The original GoN database was pre-processed in order to adapt it for our purposes [11]. As a result, three different speech subsets were generated, which allow for a comparison of the effect of long/medium/short acoustic cues on personality perception systems:
1. GoN-complete: all the speech segments for a given child during the game session were concatenated together in one single speech file. As a result, the GoN-complete subset consists of 22 files ranging from 49 seconds to 8.1 minutes of speech (average duration of 4.2 minutes).
2. GoN-20seconds: for each child, 4 different files with around 20 seconds of speech were generated by concatenating their longer speech segments in the session. Very short segments (below 2 seconds) were discarded in order to avoid an excessive variability in the speech characteristics, resulting in just 2 files for one of the participants. As a result, the GoN-20seconds subset consists of 86 files with an approximate duration of 20 seconds.
3. GoN-10seconds: this subset was constructed by splitting each file in the GoN-20seconds subset into approximately 2 halves, resulting in a subset of 172 files with an approximate duration of 10 seconds each.
The original video recordings in the GoN database have been independently annotated in terms of the Big-Five personality dimensions by three experienced assessors (1 psychologist and 2 professional speech practitioners) using the BFI-10 personality questionnaire. These annotations have been used as the ground-truth labels in this work. The inter-annotator agreement values in terms of the Fleiss' Kappa coefficient are 0.673 for Openness, 0.151 for Conscientiousness, 0.292 for Extroversion, 0.07 for Agreeableness, and 0.209 for Neuroticism (mean value of 0.279). Although it is not straightforward to make comparisons across different experimental setups, these values are in line with those reported in the literature, e.g., in [21]. We refer to [11] for a more detailed description of the GoN subsets, including the number of examples in each class.
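As an illustration of the agreement measure quoted above, the following is a minimal sketch of the standard Fleiss' Kappa computation. The input layout (one row per recording, one column per category, each cell holding the number of assessors who chose that category) and the toy counts are assumptions made for the example; how the BFI-10 scores were mapped to categories before computing the reported values is not specified here.

```python
def fleiss_kappa(counts):
    """Fleiss' Kappa for a table counts[item][category] of rater counts.

    Every item must have been rated by the same number of raters (>= 2).
    Undefined (division by zero) if all ratings fall in a single category.
    """
    n_items = len(counts)
    n_raters = sum(counts[0])
    n_categories = len(counts[0])

    # Observed agreement for each item, then averaged over items.
    p_items = [
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in counts
    ]
    p_observed = sum(p_items) / n_items

    # Agreement expected by chance from the overall category proportions.
    proportions = [
        sum(row[j] for row in counts) / (n_items * n_raters)
        for j in range(n_categories)
    ]
    p_expected = sum(p * p for p in proportions)

    return (p_observed - p_expected) / (1.0 - p_expected)


# Toy example (made-up counts): 3 assessors, 2 categories (low, high).
perfect = [[3, 0], [0, 3]]   # all assessors agree on every recording
split = [[2, 1], [1, 2]]     # one dissenting assessor per recording

# Mean of the five per-trait values reported in the text.
reported = [0.673, 0.151, 0.292, 0.07, 0.209]
mean_reported = sum(reported) / len(reported)
```

With the per-trait values given in the text, `mean_reported` evaluates to 0.279, which matches the reported mean.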

CNG Corpus of EP Children's Speech
The CNG Corpus of European Portuguese (EP) Children's Speech [17] comprises around 20 hours of speech from 484 speakers. The corpus contains four different types of utterances spoken by children aged 3 to 10 years: phonetically rich sentences, musical notes, isolated cardinal numbers, and sequences of cardinal numbers. Data is organized in two different subsets with children aged 3 to 6 years and children aged 7 to 10 years, respectively. Depending on their age and reading skills, the children either read the prompts or repeated them after a supervisor.
In this study, we have used only the subset of phonetically rich sentences uttered by children aged 7 to 10 years (6 hours of speech). The reasons are: i) it is the subset most similar to the target domain (Portuguese children aged 10 to 12 years), and ii) phonetically rich sentences are more likely to display personality traits than the other types of utterances in the CNG database.

Feature extraction
The experiments performed in this work use two sets of features extracted with openSMILE [22], and a set of knowledge-based features known in the literature to have impact on the classification of personality-related tasks, henceforth referred to as KB-features.

Baseline features
The first set (IS2012) consists of 6125 features and was created in the scope of the Interspeech 2012 Speaker Trait Challenge, Personality Sub-challenge. We have also used the eGeMAPS feature set [23], an extended version of GeMAPS (Geneva Minimalistic set of Acoustic Parameters for Voice Research and Affective Computing), which consists of 88 features well known for their usefulness in a wide range of paralinguistic tasks.

Knowledge-based features
Our knowledge-based features (KB-features) are based on phone tokenizations of the speech files using the neural network-based acoustic models of the AUDIMUS speech recognizer [24]. The phonetic tokenizations provide phone alignments for each speech file, which can be used to extract duration-related features and to generate more advanced features. In this way, for instance, it is possible to extract the silence ratio, speech duration ratio, and speech rate features in terms of phones per second. The phone tokenizations also make it possible to characterize each speech segment using n-grams of phones. Based on these tokenizations, we then derive Inter Pausal Units (IPUs), which consist of sequences of phones delimited by silences. Our experiments use both French and Portuguese phone models.
The experiments presented in this work use a set of 41 knowledge-based features, including duration of speech with and without internal silences, and tempo measurements such as speech and articulation rates (number of phones or syllables divided by the duration of speech with and without internal silences, respectively) and phonation ratio (duration of speech without internal silences divided by the duration of speech including internal silences). Other features involve pitch (f0), energy, jitter and shimmer, including pitch and energy average, median, standard deviation, dynamics, range, and slopes, both within and between IPUs [25]. Pitch-related features were calculated in semitones rather than in frequency. On top of such features, we extracted elaborated prosodic features for the whole sentence involving the sequences of derived IPUs, which were expressed in terms of standard deviation, slope and concavity. The Snack Sound Toolkit (see footnote 1) was used to extract the pitch and energy from the speech signal. Jitter and shimmer were extracted from openSMILE low-level descriptors. For the time being, the KB-features are not yet exhaustive and have been used in combination with the 88 eGeMAPS features in order to achieve improved performance, amounting to a total of 129 features.
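The tempo measurements defined above can be sketched as follows. This is only an illustration under stated assumptions: the alignment format (a list of `(label, start, end)` tuples in seconds), the silence label `"sil"`, the choice to measure speech duration from the first to the last phone, and the 100 Hz semitone reference are all hypothetical and are not taken from AUDIMUS or from the feature extraction actually used in this work.

```python
import math

SIL = "sil"  # hypothetical label for silence in the phone alignment


def split_ipus(alignment):
    """Group consecutive phones into Inter Pausal Units delimited by silences."""
    ipus, current = [], []
    for label, start, end in alignment:
        if label == SIL:
            if current:
                ipus.append(current)
                current = []
        else:
            current.append((label, start, end))
    if current:
        ipus.append(current)
    return ipus


def tempo_features(alignment):
    """Speech rate, articulation rate and phonation ratio from a phone alignment."""
    phones = [(l, s, e) for l, s, e in alignment if l != SIL]
    # Duration including internal silences: first phone onset to last phone offset.
    with_silences = phones[-1][2] - phones[0][1]
    # Duration excluding internal silences: sum of the phone durations.
    without_silences = sum(e - s for _, s, e in phones)
    return {
        "speech_rate": len(phones) / with_silences,
        "articulation_rate": len(phones) / without_silences,
        "phonation_ratio": without_silences / with_silences,
    }


def hz_to_semitones(f0_hz, ref_hz=100.0):
    """Convert a pitch value in Hz to semitones relative to ref_hz (assumed reference)."""
    return 12.0 * math.log2(f0_hz / ref_hz)


# Toy alignment: two IPUs separated by one internal silence of 0.5 s.
toy = [
    (SIL, 0.0, 0.5), ("a", 0.5, 0.7), ("b", 0.7, 1.0),
    (SIL, 1.0, 1.5), ("c", 1.5, 2.0), (SIL, 2.0, 2.5),
]
features = tempo_features(toy)
ipus = split_ipus(toy)
```

For the toy alignment, three phones span 1.5 s including the internal silence and 1.0 s excluding it, so the speech rate is 2 phones per second, the articulation rate is 3 phones per second, and the phonation ratio is about 0.67.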

System for automatic personality perception
In this work, the same experimental setup as that employed in the Interspeech 2012 Speaker Trait Challenge, Personality Sub-challenge [15,16] has been adopted. We use support vector machines (SVM) with logistic functions fitted to the SVM soft outputs as statistical models (binary classifiers) for automatic personality perception. Special attention has been paid to feature normalization, given the heterogeneous characteristics of the different corpora used in this work. Two different normalization techniques have been used: scaling to the [0, 1] range (denoted as NORM.) and zero-mean, unit-variance standardization (denoted as STAND.).
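The two normalization schemes can be sketched in a few lines. The sketch below assumes that the scaling statistics are estimated on the training features and then applied unchanged to other data; whether the statistics in this work were estimated per corpus or on the training set only is not stated, so this is an assumption of the example. Function names are hypothetical.

```python
from statistics import mean, pstdev


def fit_norm(rows):
    """Per-feature (min, max) for [0, 1] range scaling (NORM.)."""
    return [(min(col), max(col)) for col in zip(*rows)]


def apply_norm(rows, params):
    """Scale each feature with the fitted (min, max); constant features map to 0."""
    return [
        [(v - lo) / (hi - lo) if hi > lo else 0.0 for v, (lo, hi) in zip(row, params)]
        for row in rows
    ]


def fit_stand(rows):
    """Per-feature (mean, standard deviation) for standardization (STAND.)."""
    return [(mean(col), pstdev(col)) for col in zip(*rows)]


def apply_stand(rows, params):
    """Shift to zero mean and scale to unit variance; constant features map to 0."""
    return [
        [(v - m) / s if s > 0 else 0.0 for v, (m, s) in zip(row, params)]
        for row in rows
    ]


# Toy feature matrix: 3 training files, 2 features (made-up values).
train = [[0.0, 10.0], [5.0, 20.0], [10.0, 30.0]]
test = [[20.0, 10.0]]  # a file from a mismatched corpus

train_norm = apply_norm(train, fit_norm(train))
test_norm = apply_norm(test, fit_norm(train))
train_stand = apply_stand(train, fit_stand(train))
```

In the toy example the test file maps to 2.0 on the first feature, outside the [0, 1] range, which illustrates why normalization deserves care when the training and test corpora differ.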

Supervised learning
Five different models were trained in this work, corresponding to the Big-Five dimensions of personality (OCEAN). Each model is trained to assign a high/low level on that trait (denoted as O/NO, C/NC, E/NE, A/NA, N/NN) to every speech file. A grid-search approach using the train and development subsets of the SPC corpus was applied to find the optimal value for the complexity parameter C of the SVMs. The value of C providing the highest unweighted average recall (UAR) on the development subset was selected. Then, the training and development subsets were merged and the definitive SVM models were trained on this data set, using the selected values of C. Finally, the UAR on four different test sets (SPC test set, GoN-complete, GoN-20seconds and GoN-10seconds) was calculated to evaluate the models in both same- and cross-language conditions.
1 http://www.speech.kth.se/snack/
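The selection criterion can be made concrete with a short sketch of the UAR and of the choice of C by development-set UAR. The candidate values of C and the development scores below are made up for illustration, and `dev_uar` stands in for a hypothetical routine that would train an SVM with a given C on the train subset and score it on the development subset.

```python
def uar(y_true, y_pred):
    """Unweighted average recall: the mean of the per-class recalls."""
    recalls = []
    for label in sorted(set(y_true)):
        idx = [i for i, t in enumerate(y_true) if t == label]
        recalls.append(sum(y_pred[i] == label for i in idx) / len(idx))
    return sum(recalls) / len(recalls)


def accuracy(y_true, y_pred):
    """Fraction of files whose predicted label matches the reference label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def select_c(candidates, dev_uar):
    """Return the candidate C with the highest development-set UAR."""
    return max(candidates, key=dev_uar)


# Unbalanced toy labels: 8 files "high" (1) and 2 files "low" (0).
y_true = [1] * 8 + [0] * 2
y_pred = [1] * 10  # a degenerate model that always answers "high"

# Hypothetical development scores for three candidate values of C.
toy_scores = {0.01: 0.55, 0.1: 0.61, 1.0: 0.58}
best_c = select_c(toy_scores.keys(), lambda c: toy_scores[c])
```

The degenerate model reaches an accuracy of 0.8 but a UAR of only 0.5, which is why UAR is the more informative measure on unbalanced data.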

Semi-supervised learning
An iterative self-learning (semi-supervised) approach has been applied, starting from the initial models described above. At each iteration, the current model is used to classify the remaining samples in the unlabeled CNG dataset, and the 100 samples with the maximum output probabilities for each class (high/low on a given trait) are then extracted from the CNG dataset and added to the current training set, forming a new training set to be used in the next iteration. The labels assigned by the current model are used as ground-truth labels to train the model in the next iteration. This process is iterated 8 times.
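The self-learning loop described above can be sketched as follows. The loop follows the description in the text under one interpretation, namely that the 100 most confident samples are taken separately for each of the two classes; with 8 iterations this adds at most 1600 pseudo-labeled samples. The classifier is a deliberately simple nearest-centroid stand-in, not the SVM with logistic outputs used in this work, and all function names are hypothetical.

```python
import math


def fit_centroids(X, y):
    """Stand-in classifier (NOT the paper's SVM): one centroid per class.

    Returns a function giving a pseudo-probability of class 1 ("high").
    """
    centroids = {}
    for label in (0, 1):
        rows = [x for x, t in zip(X, y) if t == label]
        centroids[label] = [sum(col) / len(rows) for col in zip(*rows)]

    def prob_high(x):
        d0 = math.dist(x, centroids[0])
        d1 = math.dist(x, centroids[1])
        return 1.0 / (1.0 + math.exp(d1 - d0))  # closer to centroid 1 -> near 1

    return prob_high


def self_train(X_lab, y_lab, X_unlab, fit, per_class=100, iterations=8):
    """Iteratively add the most confident pseudo-labeled samples and refit."""
    X_train, y_train = list(X_lab), list(y_lab)
    pool = list(X_unlab)
    model = fit(X_train, y_train)
    for _ in range(iterations):
        k = min(per_class, len(pool) // 2)
        if k == 0:
            break
        probs = [model(x) for x in pool]
        order = sorted(range(len(pool)), key=lambda i: probs[i])
        # Lowest P(high) -> most confident "low"; highest -> most confident "high".
        picked = {i: 0 for i in order[:k]}
        picked.update({i: 1 for i in order[-k:]})
        for i in sorted(picked):
            X_train.append(pool[i])
            y_train.append(picked[i])  # model output used as the label
        pool = [x for i, x in enumerate(pool) if i not in picked]
        model = fit(X_train, y_train)
    return model, X_train, y_train, pool


# Toy run: 2 labeled and 6 unlabeled one-dimensional samples.
X_lab, y_lab = [[0.0], [1.0]], [0, 1]
X_unlab = [[-0.2], [0.1], [0.4], [0.6], [0.9], [1.2]]
model, X_new, y_new, rest = self_train(
    X_lab, y_lab, X_unlab, fit_centroids, per_class=1, iterations=2
)
```

In the toy run, each of the two iterations moves one sample per class from the pool into the training set, leaving the two least confident samples unlabeled.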

Experimental results
The results are presented in terms of unweighted average recall (UAR) and accuracy (Acc). When the data is substantially unbalanced, which is the case for the data sets employed here, UAR should be used as the most relevant measure. Tables 1, 2, 3 and 4 present the results for the initial models learnt by means of the completely supervised approach. Table 1 reveals that Conscientiousness and Extroversion can be easily perceived in the SPC corpus. Overall, openSMILE features achieve the best performance, except for Neuroticism. When using [0, 1] range normalization, the eGeMAPS features achieve considerably lower values for Openness and for Conscientiousness, but the difference is much smaller for Openness when using zero-mean and unit-variance normalization. The combination of KB-features and eGeMAPS is almost consistently better than the eGeMAPS features alone.
Tables 2, 3 and 4 present the results achieved for the GoN-complete, GoN-20seconds and GoN-10seconds subsets, respectively. In comparison with our previous results in [11], we observe that the Openness trait can now be reasonably perceived in the GoN subsets. This may be related to the recent additional personality annotations of this dataset, which may have led to more consistent labels. Overall, the results show reasonable performance rates for the perception of the Openness, Extroversion and Agreeableness traits across languages and ages. Figures 1 and 2 show the results (UAR) achieved by the models trained at each iteration of the semi-supervised approach for Openness and Extroversion, respectively. The zero-mean and unit-variance normalization and the eGeMAPS feature set were used in these experiments. Although still at a preliminary stage, these results point to the potential of self-learning approaches as a method to overcome the lack of labeled data in severely under-resourced domains, such as the task of children's personality perception addressed in this work.

Conclusions
This paper investigates a semi-supervised training approach to tackle personality perception tasks in severely under-resourced domains, using an experimental setup with age and language mismatch between the training and test data. Based on this setting, we investigate a weak supervision approach that iteratively refines our initial models by using the unlabeled data set. We also investigate knowledge-based features, which leverage expert knowledge in acoustic-prosodic cues and thus need no extra data. Results show that, despite the large mismatch imposed by language and age differences in the training and test data sets, it is possible to attain improvements with these techniques, pointing to the benefits of both a weak supervision approach and expert-based acoustic-prosodic features across age and language.

Table 1: Results achieved on the SPC data set (initial models).

Table 2: Results achieved on the GoN-complete data set (initial models).