EXPRESSION OF EMOTIONS: CUES FOR EMOTION RECOGNITION.
The purpose of this work is the classification of emotions from certain attributes of spoken communication. It is therefore worthwhile to give a short overview of how emotions are expressed in human communication and, in particular, how they are embedded in oral expression.
There is a large literature on the signs that indicate emotion, both within the psychological tradition and beyond it. The vocal cue is one of the fundamental expressions of emotion, on a par with facial expression. Primates, dolphins, dogs and other mammals have emotions and can convey them through vocal cues. Humans can express their feelings by crying, laughing, shouting and also through more subtle characteristics of speech. Consequently, in expressing and understanding emotions, different types of sources need to be considered. Emotional manifestation also has a wide range of somatic correlates, including heart rate, skin resistivity, temperature, pupillary diameter and muscle activity. These have been widely used to identify emotion-related states, for instance in lie detection. However, key studies convey the idea that signs of emotion are most clearly identified through facial expression [Rus94] and voice. Most work in automatic recognition of emotions uses only a single modality, either speech or video, but some studies try to combine both sources to improve recognition performance (s. [Che98, Cow01, Iwa95]). Most of the literature in this field is essentially based on the suggestion, stemming from Darwin, that signs of emotion are rooted in biology and are therefore universal [Dar65].
Different kinds of information within the oral communication.
In general, speech carries two broad types of information. It carries linguistic information insofar as it identifies qualitative targets that the speaker has attained in a configuration that conforms to the rules of the language. On the other hand, paralinguistic information is carried by allowed variations in the way that qualitative linguistic targets are realised. This paralinguistic signal is non-linguistic and non-verbal but tells the interlocutor about the speaker's current affective, attitudinal or emotional state. It also includes the speaker's regional dialect and characteristic sociolect. Paralinguistic aspects also help control turn-taking in the conversation. The paralinguistic markers, however, are to some extent culturally determined and their interpretation must be learned [Mar97]. This non-verbal information is primarily extracted from variations in pitch and intensity having no linguistic function, and from voice quality, related to spectral properties that are not relevant to word identity.
Figure 3.1. Different channels in the communication of human emotions.
In terms of oral manifestation, emotion can be expressed by at least two verbal means: linguistically (lexical, syntactic and semantic features) and acoustically. The semantic information of a word, concerning its meaning, can be an important cue in emotion recognition, since some words tend to be used only in certain situations, e.g. swear words or some affect bursts (s. [Scö00, Sch94]). In the listening tests carried out by Schröder [Scö00], where a database of affect bursts was presented to 20 subjects, the overall recognition rate of the uttered emotion reached 81%. Since the recordings were presented audio-only and without context, affect bursts were found to be an effective means of expressing emotions. Nevertheless, these linguistic indications alone are not enough to ascertain the emotional intention of the speaker because, in some languages, a swear word can also be used with a friendly meaning, and its connotation can then only be detected from non-semantic information.
As a result, the meaning of a spoken message depends not only on what is said, but also, strongly, on how it is said. In [Cow95] two channels are distinguished within human interaction: one transmits explicit messages and is more related to semantic issues; the other transmits implicit messages about the speakers themselves or the situation and is associated with acoustic properties of the speech (see figure 3.1). These two channels of human communication interact: the implicit channel tells people "how to take" what is transmitted through the explicit channel.
Though not all studies of the voice-emotion correlation agree on which attributes of the speech signal are most useful for detecting the speaker's emotional state, all of them share the idea that emotion strongly influences the acoustic characteristics of speech. Furthermore, Branka Zei, a psychologist at the University Hospital of Geneva, distinguishes three clearly differentiated levels of influence of emotion on speech.
Although segmental features and the content of the utterance themselves carry emotion, suprasegmental features play an important role in conveying emotion [Ban96]. Murray and Arnott conducted a literature review on human vocal emotion (table 3.1) and concluded that, in general, the correlation between the acoustic characteristics, both prosody and voice quality, and the speaker's emotional state is consistent across different studies, with only minor differences being apparent [Mur93].
Vocal correlates of emotion and attitude.
A major question in work on acoustic correlates of emotion is to what extent recognition of the speaker's emotional state is due to purely acoustic cues and to what extent it is due to surrounding factors such as the implicit message contained in the utterance, the context, or social convention.
Table 3.1. Emotions and speech parameters (from Murray and Arnott, 1996).

With the aim of examining how well emotions can be recognised from acoustic properties of the speech signal, the following listening experiment was carried out by L. Yang and N. Campbell in [Yan01]: 13 phonetically untrained native speakers of Chinese (5 female and 8 male) and 5 Americans (1 female and 4 male), with no knowledge of Chinese, were asked to identify 21 examples of a variety of emotional speech samples uttered in Chinese. The results show that, despite some apparent discrepancies, there is a basic correspondence between sound and meaning. For example, strong disgust and dismissive emphasis, as well as the degree of intensity, were clearly differentiated by both the Chinese and the American respondents, and each had a single unambiguous interpretation. The fact that the American respondents had no knowledge of Chinese and were unaware of the semantic content, yet were still able to differentiate the speaker's state purely from the sound pattern, further supports the idea of a sound-meaning link.
Another revealing study on the recognition of emotions based only on acoustic cues was performed by A. Tickle in [Tick00]. There, the comparison of cross-cultural emotional decoding accuracy of vocalisations encoded by native English and native Japanese speakers yielded some language-independent, and therefore message-independent, results: Japanese subjects decoded English vocalisations with greater accuracy than they decoded Japanese vocalisations. This may suggest that, despite possibly being exposed to less emotional expression, Japanese subjects are capable of decoding emotions whose acoustic correlates are due to psycho-biological response mechanisms.
It is amply demonstrated that certain emotional information is carried within oral communication by the acoustic aspects of the speech signal alone. A large number of studies have been carried out to determine which acoustic attributes of oral expression are the best candidates for unambiguously detecting the speaker's emotional state. An important study is that of Murray and Arnott [Mur93], whose results, yielding the best acoustic attributes for detecting basic emotions, are presented in table 3.1. This study, despite being one of the most cited works in the field of emotion recognition through the speech signal, is not the only one. In a review of research on the acoustic correlates of speaker states, Pittam and Scherer [Pit93] summarised the state of the literature to date as follows:
Anger/Frustration: Anger generally seems to be characterised by an increase in mean F0, F0 variability, and mean energy. Further anger effects include increases in high frequency energy and downward directed F0 contours. The rate of articulation usually increases.
Anxiety/Nervousness: High arousal levels would be expected with this emotion (see chapter 2), and this is supported by evidence showing increases in mean F0, in F0 range, and high frequency energy. Rate of articulation is reported to be higher. An increase in mean F0 has also been found for milder forms of the emotion such as worry or anxiety.
Resignation/Sadness: A decrease in mean F0, F0 range, and mean energy is usually found, as are downward directed F0 contours. There is evidence that high frequency energy and rate of articulation decrease.
Contentment/Happiness: Findings converge on increases in mean F0, F0 range, F0 variability and mean energy. There is some evidence for an increase in high frequency energy and rate of articulation.
In summary, acoustic parameters, such as F0, intensity and duration, which have seemingly clear relationships with the most prominent perceptual characteristics of speech (i.e. pitch, loudness and speech rate) have received the most attention. These parameters also tend to be the easiest to analyse.
In comparison, the modulation of spectral components of the speech signal, as revealed by the formant structure and the overall shape of the average spectrum, has been less studied. Such aspects of the speech signal, more related to voice quality attributes, do not have such obvious perceptual correlates and are more difficult and time-consuming to analyse. However, some evidence indicates that adding voice quality features could provide resolving power when distinguishing between emotions with otherwise similar prosodic profiles.
In conclusion, both prosodic and voice quality features must be considered as part of the emotional content of speech. The next sections, 3.2 and 3.3, describe how these two acoustic profiles contribute to the expression and understanding of emotions in discourse.
Emotion is an integral component of human speech, and prosody is the principal conveyor of the speaker's state; it is therefore significant in recovering information that is fundamental to communication. Several studies have tried to demonstrate how subtly differentiated meanings are introduced into speech communication by prosodic variations, and how differences in contour shape communicate the degree of certainty or uncertainty with respect to the speaker's knowledge state, specific emotional states, the intensity of emotion, and the effects of other co-occurring emotions (s. [Yan01, Pae00]).
Classification of prosody.
Prosody is a general term for those aspects of speech that span groups of syllables or words [Par86]. These properties are not typical attributes of the individual speech sounds, but are characteristic of longer stretches of speech. Prosody conveys information between a speaker and a listener on several layers. In [Kie96] two main categories of prosodic feature levels are distinguished: acoustic and linguistic prosodic features.
The acoustic prosodic features are signal-based attributes that usually span over speech units that are larger than phonemes (syllables, words, turns, etc). Within this group two types can be further distinguished:
Basic prosodic features are extracted from the pure signal without any explicit segmentation into prosodic units. Here, fundamental frequency (F0) and energy frame-based extraction, as detailed in section 7.1, are included. These features are not normally used directly for prosodic classification; instead, they are the basis to calculate more complex prosodic features.
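As an illustration of such frame-based extraction, the following sketch computes short-time energy and a crude autocorrelation-based F0 estimate per frame. All function names, frame sizes and thresholds here are illustrative assumptions, not the actual procedure of section 7.1:

```python
import numpy as np

def frame_energy(signal, frame_len=400, hop=160):
    """Short-time log-energy (dB) per frame; frame_len and hop in samples."""
    frames = [signal[i:i + frame_len]
              for i in range(0, len(signal) - frame_len + 1, hop)]
    return np.array([10 * np.log10(np.sum(f ** 2) + 1e-12) for f in frames])

def frame_f0(signal, fs=16000, frame_len=400, hop=160, fmin=60, fmax=400):
    """Simplistic autocorrelation F0 estimate per frame; 0.0 marks unvoiced frames."""
    lo, hi = int(fs / fmax), int(fs / fmin)   # lag search range
    f0 = []
    for i in range(0, len(signal) - frame_len + 1, hop):
        frame = signal[i:i + frame_len] - np.mean(signal[i:i + frame_len])
        ac = np.correlate(frame, frame, mode='full')[frame_len - 1:]
        if ac[0] <= 0:                        # silent frame
            f0.append(0.0)
            continue
        lag = lo + int(np.argmax(ac[lo:hi]))
        # weakly periodic frames are treated as unvoiced
        f0.append(fs / lag if ac[lag] > 0.3 * ac[0] else 0.0)
    return np.array(f0)
```

Real systems use considerably more robust pitch trackers, but the principle, one F0 and one energy value per fixed-length frame, is the same.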
Structured prosodic features can be seen as variations of basic prosodic attributes over time. Consequently, they are computed over a larger speech unit. Structured prosodic features can be derived from the basic prosodic features or can be based on segmental information provided, e.g., by the output of a word recognizer. The structured or compound prosodic features used in this Thesis are detailed in section 7.2.
The second main category [Kie96] comprises linguistic prosodic features, which can be extracted from other knowledge sources, such as the lexicon, syntax or semantics, and usually have an intensifying or inhibitory effect on the acoustic prosodic features. These linguistic properties are not taken into consideration in the framework of the present Thesis. In the following, the term "prosodic features" refers only to the acoustic prosodic features.
The principal prosodic variables of interest, as indicated in section 3.1.3, are energy, pitch and durational measurements such as speaking rate. Every prosodic feature has an acoustic correlate. However, this correspondence is not one-to-one, since prosodic attributes refer to perceived phenomena.
Energy as a prosodic cue for emotion detection.
Energy is the acoustic correlate of loudness; the relation between them is not linear, because it strongly depends on the sensitivity of the human auditory system to different frequencies. Hence, the influence of energy on emotional perception combines several surrounding factors; nevertheless, valuable conclusions can be extracted from global statistics derived directly from the energy contour.
Figure 3.2. Energy mean of the sentences uttered by (a) a male speaker and (b) a female speaker.

In terms of global statistics, energy proves to be higher in emotions whose activation is also high. Conversely, low levels of energy are found in emotional states with a low activation value (see table 3.1). Figure 3.2 represents the mean of the energy means over a whole sentence for all the utterances, separated by emotion, of (a) a male speaker and (b) a female speaker. Likewise, figure 3.3 represents the energy maximum under the same conditions.
Figure 3.3. Energy maximum of the sentences uttered by (a) a male speaker and (b) a female speaker.

Both statistics are in agreement with the relational theory proposed in [Mur93] and summarised in table 3.1. High energy levels are found in angry and happy utterances, while sadness and boredom yield lower intensity values. Energy is one of the most intuitive indicators in the relation between voice and emotion since, as said before, it is unequivocally related to acoustic loudness. Even without expertise in this matter, we can more easily imagine someone angry shouting than gently whispering.
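The per-emotion averages plotted in figures 3.2 and 3.3 amount to grouping the utterance-level energy statistics by emotion label and averaging. A minimal sketch, assuming a hypothetical data layout of (label, energy contour) pairs:

```python
from collections import defaultdict
import numpy as np

def mean_energy_by_emotion(utterances):
    """utterances: iterable of (emotion_label, energy_contour) pairs.
    Returns, per emotion, the mean of the per-utterance energy means,
    i.e. the quantity plotted in figure 3.2 (data layout is assumed)."""
    per_emotion = defaultdict(list)
    for emotion, contour in utterances:
        per_emotion[emotion].append(float(np.mean(contour)))
    return {e: float(np.mean(v)) for e, v in per_emotion.items()}
```

Replacing `np.mean(contour)` with `np.max(contour)` in the inner loop would yield the energy-maximum statistic of figure 3.3 instead.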
Pitch as a prosodic cue for emotional detection.
The acoustic correlate of pitch is the fundamental frequency, or F0. Voice pitch is certainly a key parameter in the detection of emotion. Pitch variation over a sentence, also called intonation, is used in most languages to give shape to a sentence and indicate its structure, although the way in which this is done varies widely between languages. In some languages pitch is also used to help indicate the meaning of words. For instance, Chinese does not use a stress system like that of English. Moreover, Chinese words lack inflection, so their formation relies on prosody. Words formed through prosodic features are the basic units of rhythmic construction in Chinese.
Pitch typically ranges between 80 and 160 Hz for male speakers and between 160 and 400 Hz for female speakers. Pitch can be characterised by its level or its contour. In English, pitch is quantised into four levels (low, mid, high and extra high), and three terminal contours are distinguished: fading (decreasing pitch and amplitude), rising (increasing pitch, amplitude nearly constant until the end), and sustained (pitch and amplitude approximately constant). Fundamental frequency is considered one of the most important attributes in emotion expression and detection (s. [Mon02, Abe01]).
Figure 3.4. Pitch mean of the sentences uttered by (a) a male speaker and (b) a female speaker.

Statistics extracted from the basic prosodic features contain relevant information about the uttered emotion. From the pitch contour of one utterance we extract the mean, maximum, minimum, variance and standard deviation, among other features listed in chapter 7. Figure 3.4 represents the mean of the pitch means over a whole sentence for all the utterances, separated by emotion, of (a) a male speaker and (b) a female speaker. In accordance with the theory [Mur93], we obtain clearly differentiated statistics for the pitch mean depending on the uttered emotion (see figure 3.4). In both cases, regardless of the scale difference due to the speaker's sex, happiness and anger present a higher pitch average, while the boredom and sadness means are slightly lower with reference to the neutral emotion (absence of emotion).
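The utterance-level statistics named above (mean, maximum, minimum, variance, standard deviation) can be sketched as follows, under the common assumption that unvoiced frames are marked with 0 in the F0 contour (the full feature set is given in chapter 7):

```python
import numpy as np

def pitch_statistics(f0_contour):
    """Global statistics over the voiced frames of one utterance's F0 contour.
    Unvoiced frames are assumed to be coded as 0 and are excluded."""
    f0 = np.asarray(f0_contour, dtype=float)
    voiced = f0[f0 > 0]
    return {
        'mean':  float(voiced.mean()),
        'max':   float(voiced.max()),
        'min':   float(voiced.min()),
        'range': float(voiced.max() - voiced.min()),
        'var':   float(voiced.var()),
        'std':   float(voiced.std()),
    }
```

Excluding the unvoiced frames matters: averaging the zeros in would bias every statistic downward.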
Pitch range, calculated over the same two speakers, also depends strongly on the emotion, as can be seen in figure 3.5, in accordance with the conclusions reached by Murray and Arnott [Mur93] and summarised in table 3.1. Further statistics for other prosodic features can be found in section 7.2.
Apart from global statistics extracted from the fundamental frequency, pitch shape characteristics such as pitch slope, concavity and convexity are very important features for differentiating intonational meaning. For instance, the concavity or convexity of the slope is critically related to the perceived degree of severity or softness of the utterance, and these shape characteristics reflect the underlying expressive states that often arise from the discourse process itself. Therefore, not only global statistics of the pitch, but also attributes related to the pitch contour, should be taken into consideration. The pitch contour can be estimated following different models:
Mathematical approximation, i.e. linear regression coefficients.
Symbolic representation, i.e. stylisation into concatenation of straight lines according to rules (local maximum and minimum).
Intonation models, i.e. a two-component model superposing fast pitch movements on a slowly declining line (baseline).
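The first of these models, mathematical approximation by regression coefficients, can be sketched as a least-squares fit over the voiced frames; the sign of a quadratic coefficient then serves as a rough convexity/concavity indicator. This is an illustrative sketch, not the exact parameterisation used elsewhere in this work:

```python
import numpy as np

def contour_shape(f0_contour, hop_s=0.010):
    """Fits a line and a parabola to the voiced frames of an F0 contour.
    Returns (slope in Hz/s, quadratic coefficient); a positive quadratic
    coefficient indicates a convex contour, a negative one a concave contour."""
    f0 = np.asarray(f0_contour, dtype=float)
    t = np.arange(len(f0)) * hop_s          # frame times in seconds
    v = f0 > 0                              # voiced frames only
    slope, _ = np.polyfit(t[v], f0[v], 1)
    curvature = np.polyfit(t[v], f0[v], 2)[0]
    return float(slope), float(curvature)
```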
Despite the fact that some sources claim that none of the models preserves the information necessary for emotional speech, various studies have stressed the importance of intonation as a medium for expressing emotions (s. [Wil72, Cos83, Cah90]). Accordingly, Sylvie J.L. Mozziconacci carried out a study on the expression of emotions when pitch variation is represented in the theoretical framework of an intonation model [Moz00]. There, the power of existing intonation models for modelling expressiveness was examined. She argued that studying speech variability (pitch level and range) through such crude measures as pitch mean and standard deviation can obscure a substantial part of the variation present in the speech signal and provides no information about relevant deviations. Finally, a cluster analysis of different emotional pitch contours was performed, and it was concluded that it is predominantly the final configuration of the contour that plays a role in conveying emotion.
On the other hand, most of the world's languages employ a structural parameter called stress. It is characteristic of these languages that certain parts are felt to be more prominent than others, whether in isolated polysyllabic words or in larger stretches of continuous speech [Slu95]. Such prominent parts stand out from their environment due to (among other things) increased vocal effort (intensity), more accurate articulation, longer duration and pitch changes; stress tends to raise pitch, and stress-produced pitch irregularities are superimposed on the pitch contour. Despite being introduced in this section, stress could also be considered an acoustic correlate of pitch, energy and duration, given the influence of each of these attributes on the perception of stress. A related study by Paeschke, Kienast and Sendlmeier showed that the intensity of stressed syllables differentiated between the groups of excited and non-excited emotions.
Figure 3.5. Pitch range of the sentences uttered by (a) a male speaker and (b) a female speaker.
Durational cues for emotional detection.
Prosody also involves duration-related measurements. One of the most important durational measurements for discriminating among speakers' emotional states is the speaking rate. An acoustic correlate of the speaking rate can be defined as the inverse of the average voiced-region length within a certain time interval. Pauses also contribute to prosody and can be divided into two classes: unfilled and filled pauses. An unfilled pause is simply silence, or it may contain breathing or background noise, whereas a filled pause is a relatively long speech segment of rather uniform spectral characteristics. During a filled pause a hesitation form is usually uttered, indicating that one is about to start or to continue speaking: e.g. English [əː], [əm], etc.
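Following the definition above, the speaking-rate correlate can be sketched directly from a per-frame voicing decision: find the voiced runs, average their lengths, and invert. The frame shift value is an illustrative assumption:

```python
from itertools import groupby

def speaking_rate(voiced_flags, hop_s=0.010):
    """Inverse of the average voiced-region length (in seconds), per the
    definition above. voiced_flags: per-frame True/False voicing decisions;
    hop_s: frame shift in seconds. Returns 0.0 if no voiced frames exist."""
    runs = [sum(1 for _ in g) for voiced, g in groupby(voiced_flags) if voiced]
    if not runs:
        return 0.0
    avg_len_s = hop_s * sum(runs) / len(runs)
    return 1.0 / avg_len_s
```

The same run-length bookkeeping over the unvoiced frames, combined with a spectral-uniformity check, would separate unfilled from filled pauses.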
In conclusion, the significance of prosodic meaning in communicating judgements, attitudes and the cognitive state of the speaker makes it essential for speech-understanding tasks such as emotion and intention tracking, and for the development of natural-sounding spoken language systems.
3.3 Voice Quality
Most studies of the past decades related to the expression of emotions through the voice dealt with the investigation of prosodic parameters, mainly fundamental frequency, intensity and duration characteristics. In addition, parameters describing the influence of laryngeal processes on voice quality have recently been taken into account (s. [Kla00, Ban96]).
Voice quality perception.
Voice quality is defined by Trask [Tra96] as the characteristic auditory colouring of an individual’s voice, derived from a variety of laryngeal and supralaryngeal features and running continuously through the individual’s speech. A wide range of phonetic variables contributes to the subjective impression of voice quality.
The human voice possesses a wide range of variation, and the natural and distinctive tone of the speech sounds produced by a particular person yields a particular voice, which is rarely constant over the course of a conversation. The reasons for these changes are diverse: voice quality is influenced by the social context and situation, and phonetic habits are also part of our personal style. It is possible to change one's voice by modifying the phonetic settings, but several components of voice quality are innate and outside the speaker's control. Voice quality is usually changed to strengthen the impression of emotion. As stated in section 3.1, prosody undergoes changes depending on the speaker's emotional state, but voice quality is an additional valuable phonetic cue for the listener.
Definitions of voice quality or phonation types used in normal speech mainly come from studies of pathological speech (laryngeal settings), and it is hard to describe voice quality, especially variations within a normal voice. The term voice quality subsumes attributes that concern the overall, phone-independent spectral structure, for example shimmer, formant frequency variation or some relative spectral energy measurements.