TC-STAR Project Deliverable No. D8. Title: TTS Baselines & Specifications




Specifications of Evaluation of Speech Synthesis

  1. Introduction


The development of speech technology in TC-STAR is evaluation driven. Assessment of speech synthesis is needed to determine how well a system or technique compares to others, or how it compares with a previous version of the same system. In order to make a useful diagnosis of the system, in TC-STAR we will not only test the whole component but also run specific tests for each module of the speech synthesis system. In this way we can better assess the progress of specific modules. Furthermore, it allows identifying the best techniques for the different processes involved in speech synthesis. To allow the comparison of different modules, we have defined a common specification of the modules and specific tests.

In order to increase the critical mass in speech synthesis and the impact of the evaluation, the TC-STAR project opens the evaluation of speech synthesis to external partners. The language resources produced in TC-STAR are shared with partners willing to participate in the evaluation. With this goal, the TC-STAR partners involved in Work Package 3 have founded ECESS, the European Centre of Excellence on Speech Synthesis [ECESS]. This centre is open to any partner willing to participate in the evaluation and to share the technology for research purposes. This document includes the results of discussions with ECESS partners.

Although some objective metrics have been proposed to evaluate certain modules of the speech synthesis system, in most cases the evaluation relies on human judges. The evaluation can be carried out using a web interface. This makes it possible for the subjects to perform the test from their home computer, which has to be equipped with a high-speed internet connection, a standard sound card and closed headphones. For each language, between 15 and 20 subjects participate in the evaluation. At every evaluation campaign, the sampling rate of the synthetic speech will be defined (recommended: 24 kHz).

The outline of this document is as follows. Section 3.2 describes the three modules of the speech synthesis systems, their functionalities and an overview of the interfaces. Section 3.3 defines the evaluation tests for the modules. Section 3.4 focuses on the evaluation of specific research activities: voice conversion and expressive speech. Section 3.5 defines the evaluation of the speech synthesis component (whole system).


    1. Definition of speech synthesis modules


Text-to-speech systems perform a range of processes: text normalization, pronunciation, several aspects of symbolic and acoustic prosody, etc. Ultimately we are interested in the quality of the overall system. However, the evaluation of the whole system (black box evaluation) does not allow pinpointing which part of the system causes the most relevant problems. Furthermore, this method does not allow small teams of researchers, whose specialty lies in one specific topic, to participate in the evaluation. In TC-STAR we will certainly evaluate whole systems (see section 3.5), but we also want to evaluate the different tasks separately in order to draw more valid conclusions about the results of different algorithms. Defining modules with well-defined input and output allows keeping all modules constant except one and comparing the results produced by the algorithms involved in that module (glass box evaluation).

There are many processes involved in speech synthesis. Researchers working on a particular one would prefer specific tests to evaluate that process. For instance, tests have been proposed to evaluate each aspect of prosody: intonation, pausing, accentuation, etc. However, from a pragmatic point of view, when designing a general evaluation framework the number of modules needs to be limited. The evaluation of speech synthesis in many cases involves human evaluation, and it is necessary to limit the number of tests in each campaign. Also, in order to compare different systems, only generic modules can be defined, because not all systems are built up of the same processes. Furthermore, although we assume that speech synthesis is built up of independent modules, in fact this is not absolutely true. For instance, a promising area of research is modelling the correlation between the different features related to prosody (f0, duration, etc.). Keeping these processes together allows modelling this interaction.

Therefore, a compromise must be reached on the number of modules. In TC-STAR we define three broad modules: symbolic preprocessing, prosody generation and acoustic synthesis. The modules have been defined through their interfaces, i.e., the formal description of the input and output [ecess_iface].
Symbolic preprocessing.

The first module has to perform three tasks:



  • Word normalization: the input text is transformed into words that would be found in common lexica with complete coverage. For instance, dates, postal addresses, numbers and abbreviations are transformed into such words. One of the tasks of this module is tokenization. This includes, for languages like Mandarin, word segmentation. It also includes correct tokenization of punctuation and detection of sentence boundaries.

  • Pronunciation: the pronunciation of each word is derived. The representation depends on the language and also on the technology used. For English, Spanish and most Western languages, the pronunciation is represented using the phonetic transcription, including lexical stress and syllable boundaries. For Mandarin, the pronunciation is represented using syllables (therefore, syllable boundaries are not needed), including the lexical tones.

  • POS tagging: each word is tagged with the disambiguated POS (Part-of-Speech).

The text input to the TTS system will be formatted into an SSML [SSML] conforming document. SSML (Speech Synthesis Markup Language) is a W3C Recommendation, an XML-based markup language specifically designed for assisting the generation of synthetic speech. SSML defines tags to control the text structure (paragraphs and sentences) and to give information about the desired prosody and style (voice gender, speaker age, specific processor voice name, emphasis, pitch contour, duration, etc.). Since all the tags defined in SSML are optional, plain text can easily be transformed into an SSML document by enclosing it between the <speak> and </speak> definition tags. In fact, during evaluation, no control marks will be included in the text.

In the second and third evaluation campaigns, the mark-up language will be extended to fit the research results of the project. It is expected that prosody and segmental information will be derived from the source voice and included in the input text using appropriate mark-up. This information will be used during speech synthesis.

Basically, the output of the module is a sequence of words. For each word, information about POS and pronunciation is included. The words present in the output will be those originally present in the input text and those resulting from the normalization of dates, abbreviations, acronyms, etc.

As a general rule, the POS is coded using the tagset defined in LC-STAR (only the main tag, not the attributes). However, for each language a different tagset can be defined depending on the available resources. In particular, it has been agreed that in the first evaluation campaign the English tagset will be the one defined in the Penn Treebank [Marcus93].

For English and Spanish, the phonetic transcription will be coded using the SAMPA phonetic alphabet [SAMPA]. The phonetic transcription includes syllable boundaries and lexical stress. The phonetic transcription corresponds to words uttered in isolation. Determining the pitch accents of the sentences or applying coarticulation rules is postponed until module 2. For Mandarin, the pronunciation is represented using syllables, including the tones.
Prosody generation

This module generates the acoustic prosody representation for the sentence. In the first phase of the project, the acoustic prosody is specified by means of the F0 contour, the segmental durations and the intensity (energy) contour. Optionally, the interface supports other parameters related to voice quality or symbolic prosody. These optional parameters will not be used in the evaluation of module 2 during the first phase of the project.

The prosody generation module associates to each word a list of corresponding phonemes. This may not necessarily be equal to the phonetic transcription itself, since assimilation of vowels, creation of diphthongs and similar phenomena may be considered. For Mandarin, state-of-the-art systems are syllable-based, so each word is represented by a list of syllables.

For each phoneme (syllables for Mandarin) the following information is mandatory:



  • The duration of the phonemes (syllables in Mandarin) is expressed in milliseconds

  • The fundamental frequency of the phonemes (syllables for Mandarin) is expressed as pairs of time_position (in milliseconds) and f0_value (in Hz). This is a flexible method allowing for several approaches for frequency specification to be used in the system. The simplest case is one single value for the whole phoneme, allowing, for instance, the common approach of indicating the frequency value at the middle of the phoneme. More detailed fundamental frequency curves can be implemented by sampling this curve with different resolutions according to the prosody model. For unvoiced phonemes this value is irrelevant.

  • The power contour or intensity is specified in a similar way, using pairs of time_position and power_value. The energy is expressed as the mean power in dB. For a given pair (time_position, power_value), the value measures the power from the previous time_position to the current time_position. For instance, to give the mean power for each phoneme, the ending position of the phoneme should be used as time_position.

The description of prosody using only duration, f0 and intensity is not complete. For instance, it is known that the prominence of a syllable is correlated with the spectral distribution of the energy. Therefore, many systems use symbolic information to select units that are more appropriate. For this reason, the interface definition allows specifying some phonological information, such as stress and the presence of intonation breaks. However, in this version of the evaluation specifications, for the sake of simplicity, these parameters are ignored.

As stated above, the prosody information is expressed by adding information to the phoneme. The information associated with the syllable is added to the first phoneme of the syllable. We have preferred not to impose a hierarchical representation (word composed of syllables and syllables composed of phonemes) in order to allow the representation of syllables that span from the end of one word to the beginning of the next. Adding syllable information to the first phoneme of the syllable preserves the information about syllable boundaries.

The accent level will be labelled with positive integers indicating the importance of the accent (1 indicates normal, 2 indicates emphatic). If the syllable is the last of the word, information about the break index tier will be added. The break level is specified using the categories defined in the standard markup language SSML [SSML]: none, x-weak, weak, medium, strong and x-strong.

As a summary, for each word, the phonemes (syllables for Mandarin) of the word are listed. Then, for each unit, the following information is added (an illustrative data-structure sketch follows the list):



  • Information about segmental duration, f0 contour and intensity contour

  • For each syllable, symbolic information is included. For English/Spanish, information about the pitch accent (normal or emphatic) is specified; it is added to the first phoneme of the syllable. For Mandarin, the Pinyin transcription already includes tone information.

  • For the last syllable of the word, information about the break index tier is included. If the units are phonemes (English/Spanish), this information is included in the first phoneme of that syllable.
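To make this interface concrete, the following minimal Python sketch shows one possible in-memory representation of the per-unit information listed above. The type and field names (PhonemeProsody, duration_ms, etc.) are illustrative assumptions of this document and are not identifiers taken from the TC-STAR DTD.

from dataclasses import dataclass, field
from typing import List, Optional, Tuple

@dataclass
class PhonemeProsody:
    # Illustrative container for the prosody attached to one phoneme
    # (one syllable for Mandarin); field names are hypothetical.
    symbol: str                            # SAMPA symbol (or Pinyin syllable)
    duration_ms: float                     # segmental duration in milliseconds
    f0: List[Tuple[float, float]] = field(default_factory=list)     # (time_ms, f0_Hz) pairs
    power: List[Tuple[float, float]] = field(default_factory=list)  # (time_ms, mean power in dB) pairs
    syllable_start: bool = False           # syllable-level information is attached here
    accent_level: Optional[int] = None     # 1 = normal, 2 = emphatic (first phoneme of syllable)
    break_level: Optional[str] = None      # SSML break category, set on the word-final syllable

# Example: a phoneme "a", 80 ms long, with one f0 sample at its mid-point and
# one mean-power value covering the whole phoneme.
example = PhonemeProsody(symbol="a", duration_ms=80.0,
                         f0=[(40.0, 120.0)], power=[(80.0, -22.5)],
                         syllable_start=True, accent_level=1)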


Acoustic synthesis

The last module produces synthetic speech based on the prosody representation. For evaluation, an MS-WAV file will be produced. The baseline systems are based on concatenative synthesis using unit selection from a speech database, but the defined interface allows other synthesis methods.


    1. Evaluation of the speech synthesis modules

      1. Module 1: Text analysis


The goal of text analysis is to transform the orthographic input string into a representation of the sounds. It involves text normalization, which transforms ambiguous text such as numbers, dots and abbreviations into non-ambiguous words (known as "standard words"). In the case of Mandarin, this module segments the character stream into words. This module also copes with grapheme-to-phoneme conversion and with the assignment of lexical stress and syllable boundaries. Furthermore, this module tags the words with the POS (part-of-speech label), which is needed for prosody assignment.

Therefore, the input of the module is orthographic text and the output consists of three layers: standard words, their phonetic representation and the POS tags. In TC-STAR we adopt the criteria defined in LC-STAR [LC-STAR D2]. Given the orthographic text, the output of the module is not unique, because sometimes the same text can be read using different words (normalization ambiguity) and also because some words accept several pronunciations. The module has to produce one particular output, and the evaluation metric has to accept it if it is one of the correct ones.

The correct pronunciation of words depends on prosody. Specifically, depending on the pausing and on the speaking rate, assimilation with adjacent words can occur. However, the evaluation of grapheme-to-phoneme conversion is usually based on words in isolation. One of the reasons is that the rules for word pronunciation in context are believed to be straightforward [EAGLES, pp. 514]. For the sake of simplicity, in TC-STAR the pronunciation algorithm will be evaluated on words in isolation.

The input and output formats of the following tests are the ones defined for the text analysis module. Therefore, in any test, the output includes the orthographic transliteration, POS tagging and phonetic transcription. The evaluation agency should select the information needed for each specific test.


Test M1.1: Text Normalization.

  1. The evaluation agency selects N1 running words presented in paragraphs or sentences. The words are selected from:

    • 50% from the domain C3.2 (frequent phrases), in Section 2.

    • 25% from text transcriptions from the parliamentary domain.

    • 25% from text transcriptions from the parliamentary domain, formatted according to the rules of the translation engine (WP1)

  2. Each system processes the text and produces a tokenized version of the text using only “standard words”.

  3. The evaluation agency produces the reference transcription following the TC-STAR conventions. If more than one transcription is acceptable, they have to be coded in a format compatible with the evaluation tool (either lists of transcriptions or graphs coding the acceptable transcriptions). In some cases, normalization requires deep knowledge of the domain, as in the case of expanding technical abbreviations. In such cases, one possible expansion is the one that would be used by a native speaker without specific knowledge of the domain.

  4. For each system, the output of module 1 is evaluated, taking into account insertions, deletions and substitutions. The figure used to assess the systems is the word error rate, as defined in the tool provided by NIST to assess continuous speech recognition.
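For illustration only, the word error rate can be computed with a standard word-level Levenshtein alignment, as in the minimal Python sketch below; the function name and the example are our own, and the official scoring is done with the NIST tool.

def word_error_rate(reference, hypothesis):
    # WER = (substitutions + deletions + insertions) / number of reference words,
    # computed with a standard Levenshtein alignment over words.
    n, m = len(reference), len(hypothesis)
    d = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        d[i][0] = i
    for j in range(m + 1):
        d[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = 0 if reference[i - 1] == hypothesis[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution or match
    return d[n][m] / max(n, 1)

# Hypothetical example: one substitution in five reference words gives WER = 0.2
print(word_error_rate("on the twentieth of july".split(),
                      "on the twenty of july".split()))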


Test M1.2: Word Segmentation (Mandarin)

  1. The evaluation agency selects N1 running words presented in paragraphs or sentences. The words are selected from the domain defined in LSP for Mandarin.

  2. Each system processes the text and produces a tokenized version of the text including word boundaries.

  3. The evaluation agency checks the word segmentation. The figures used are precision and recall.
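As an illustration, precision and recall over the predicted word boundaries can be computed as in the short Python sketch below; representing boundaries as character offsets is an assumption made here for clarity and is not part of the interface definition.

def segmentation_scores(reference_boundaries, hypothesis_boundaries):
    # Precision and recall of predicted word-boundary positions
    # (character offsets) against the reference segmentation.
    correct = len(set(reference_boundaries) & set(hypothesis_boundaries))
    precision = correct / len(hypothesis_boundaries) if hypothesis_boundaries else 0.0
    recall = correct / len(reference_boundaries) if reference_boundaries else 0.0
    return precision, recall

# Hypothetical example: reference breaks after characters 2 and 4,
# the system also inserts a spurious break after character 3.
p, r = segmentation_scores({2, 4}, {2, 3, 4})
print(p, r)  # precision = 2/3, recall = 1.0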



Test M1.3: Evaluation of POS-tagger.

  1. The evaluation agency selects >N2 running words. Approximately 50% is from the parliament domain and 50% from other general domains (for instance, news). The evaluation corpus is split into paragraphs. The text should be selected so that no problems are expected in the tokenization.

  2. Each system tags the text.

  3. The evaluation agency creates the reference tagged text using the POS tags defined in LC-STAR (only the main tag, not the morphological attributes). For UK English, as the LC-STAR lexicon is not available, the tagset defined in the Penn Treebank project will be used.

  4. For each system, the output of module 1 is evaluated by comparing the output with the reference. The metric is the percentage of errors. In case the number of tags differs from the reference, the evaluation agency will report this and will supervise the alignment (for instance, deleting the problematic sentences) so that the errors can be counted.


Test M1.4: Evaluation of grapheme-to-phoneme

  1. The evaluation agency selects >N3 words. Half of the words are derived from the parliamentary domain and 50% from other domains. The words are classified into:

  • Common words

  • Geographic places such as towns, countries, etc.

  • Names of persons (family and given names) and organizations.

  2. Each system produces the pronunciation for the selected words, the lexical stress, and the syllable boundaries.

  3. The evaluation agency produces the reference pronunciation, including pronunciation alternatives if needed.

  4. For each system, the evaluation agency computes the pronunciation word error rate: if there is an error in one or more phonemes of a word, the pronunciation is not correct. Analogously, the evaluation agency computes the error rate, at the word level, for lexical stress and syllable boundaries. The evaluation should be given separately for the different domains and report the percentage of unknown words (with respect to the lexicon used in TC-STAR).

  5. Foreign names will not be taken into account when computing the error.
      1. Module 2: Prosody.


The output of the second module is acoustic prosody. Many systems predict symbolic prosody as a first step towards producing the acoustic parameters. However, while many research laboratories use the same values as acoustic prosody (pauses, f0 contour, segmental duration, energy contour), the coding of symbolic prosody in some cases depends on the theory behind the models. Furthermore, some experiments reveal that evaluation at the symbolic level cannot substitute for the acoustic tests [EAGLES, pp. 518].

Usually, objective acoustic measures are used to evaluate models and to estimate their parameters. For instance, for a first evaluation of the segmental duration model the MSE (mean square error) is typically used. This metric compares, for each phoneme, the prediction of the model with the duration measured in a human sentence (reference speech). However, the correlation between these objective measures and perceptual judgment is not very high (for instance, in melody modelling, a straight line can give acceptable results in terms of correlation and MSE, but the synthetic speech is monotonous). Therefore, in order to evaluate prosody we rely on judgment tests.
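For completeness, the short Python sketch below shows how such objective development measures (MSE and correlation between predicted and reference phoneme durations) might be computed; it is illustrative only and is not part of the evaluation protocol.

import math

def duration_objective_scores(predicted, reference):
    # Mean square error (in ms^2) and Pearson correlation between the
    # predicted durations and the durations measured in the reference speech.
    n = len(reference)
    mse = sum((p - r) ** 2 for p, r in zip(predicted, reference)) / n
    mean_p = sum(predicted) / n
    mean_r = sum(reference) / n
    cov = sum((p - mean_p) * (r - mean_r) for p, r in zip(predicted, reference))
    var_p = sum((p - mean_p) ** 2 for p in predicted)
    var_r = sum((r - mean_r) ** 2 for r in reference)
    return mse, cov / math.sqrt(var_p * var_r)

# Hypothetical durations in milliseconds for three phonemes
mse, corr = duration_objective_scores([80.0, 95.0, 60.0], [75.0, 100.0, 70.0])
print(mse, corr)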

To assess this module, all the systems under comparison will share both the input and the backend. The input will consist of normalized words and the correct pronunciation, stress, POS and syllable boundaries, as detailed in the interface definition. For the output, the same backend will be applied. The backend will consist of re-synthesis of natural sentences. This avoids distortions that can occur in synthetic speech. Therefore, the assessment of the naturalness and quality of the intonation is easier.

For the speech re-synthesis, a toolkit such as Praat [Praat] will be used to change the prosody (f0, duration and energy) according to the output of the module under evaluation. This requires that the natural utterances are segmented into phonemes. The toolkit will be agreed upon by the partners before each campaign. All the partners will be able to use it during development.

The natural sentences should be uttered by the baseline speaker. If this is not possible (speakers are not available), another speaker should be selected, taking into account that the mean pitch should be the same as that of the speaker producing the baseline voices. The selection should also consider voice quality after F0 manipulation using pitch-synchronously labelled speech units. The speaker has to be instructed to speak at the same mean speaking rate as the one in the baseline voice.

The evaluation of prosody will be based on paragraphs. The evaluation of sentences in isolation usually gives better but unrealistic results with respect to their use in continuous speech, as is the case for broadcast news and parliamentary speeches (the applications in TC-STAR).

To evaluate the prosody we will focus on the naturalness of the prosody (intonation, rhythm, etc.). The subjects are instructed not to take into account noises or acoustic distortions. The systems are evaluated using an absolute scale going from 1 (very unnatural) to 5 (completely natural).

One major problem in the evaluation of prosody is the influence of the segmental component on speech perception. In TC-STAR we will adopt the PURR method proposed by Sonntag and Portele [Sonntag98]. This method generates delexicalized utterances in which the lexical information is lost and only the melody and temporal structure are presented. The basic idea is to generate, for voiced sounds, a harmonic signal using only the first and second harmonics. The amplitude of the second harmonic is one fourth of that of the first one. The signal is generated taking into account the prosodic description (f0, duration and energy). The unvoiced sounds are rendered as pauses. Based on these delexicalized utterances, two tests are proposed. A judgment test is used to rate whether the prosody is appropriate for a given text. Furthermore, a functional test is defined in which the subjects have to choose the text sentence most appropriate for a given delexicalized utterance.
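A rough sketch of such a delexicalization backend is given below. It assumes, for simplicity, one constant f0 and amplitude value per phoneme (a real implementation would follow the sampled f0 and energy contours), and it is an illustration of the idea rather than the exact procedure of [Sonntag98].

import numpy as np

def delexicalize(segments, sr=16000):
    # segments: list of (duration_ms, f0_hz_or_None, amplitude) tuples, one per phoneme.
    # Voiced phonemes (f0 given) become a two-harmonic tone, with the second
    # harmonic at one fourth of the amplitude of the first; unvoiced phonemes
    # are rendered as silence (pauses).
    out = []
    phase1 = phase2 = 0.0
    for duration_ms, f0, amp in segments:
        n = int(sr * duration_ms / 1000.0)
        if f0 is None:                       # unvoiced: silence of the same duration
            out.append(np.zeros(n))
            continue
        t = np.arange(n) / sr
        tone = amp * (np.sin(2 * np.pi * f0 * t + phase1)
                      + 0.25 * np.sin(2 * np.pi * 2 * f0 * t + phase2))
        phase1 += 2 * np.pi * f0 * n / sr    # keep the harmonics continuous across phonemes
        phase2 += 2 * np.pi * 2 * f0 * n / sr
        out.append(tone)
    return np.concatenate(out)

# Hypothetical input: a voiced 120 ms stretch at 110 Hz, a 60 ms pause,
# then a voiced 100 ms stretch at 130 Hz.
signal = delexicalize([(120, 110.0, 0.5), (60, None, 0.0), (100, 130.0, 0.5)])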



For judgment tests, the number of subjects and items to be rated depends on the number of systems. The general recommendations are:

  • All the systems produce synthetic speech based on the same items (paragraphs).

  • At least S=20 subjects participate in the evaluation. Each subject evaluates all the systems unless the number of systems to be evaluated is too large (>20).

  • Each subject listens to each paragraph only once.

  • Each system should receive at least 40 ratings.

  • For each subject, the presentation order is randomized both in systems and in items to eliminate any dependency on the order.

Table 3.1 specifies the number of subjects and number of items as a function of the number of systems. The last column shows the number of ratings to be done for each subject.


Y = #Systems    #Subjects    #items    #ratings/subject    #ratings/system
1               20           6         5                   100
2-6             20           Y*3       Y*3                 60
7-10            20           Y*2       Y*2                 40
>10             2*Y          20        20                  40

Table 3.1: Number of subjects and items as a function of the number of
systems to be evaluated

Test M2.1: Evaluation of prosody (using segmental information).

  1. The evaluation agency selects N4 items (paragraphs) from the parliamentary domain, distributed over different melodic domains taking into account the real distribution: declaratives, questions, lists, etc. The number of items depends on the number of systems (see Table 3.1).

  2. For each item, the evaluation agency produces the input to module 2 (words, POS, pronunciation, syllable boundaries, lexical stress).

  3. Each system produces the prosody description for these items. In the first campaign, one additional baseline system will be produced as reference.

  4. The evaluation agency generates synthetic speech based on the prosody description using the generation toolkit.

  5. A judgment test is performed by naive subjects. Each subject has to rate the naturalness of the voice on a 5-point scale. The subjects are instructed to pay attention to prosody (not speech quality or noise). To avoid learning effects, each subject judges each item produced by only one system. The number of subjects depends on the number of systems to be evaluated (see Table 3.1).

Test M2.2: Judgment test using delexicalized utterances.

  1. The evaluation agency produces the input to module 2 in the same way as defined in test M2.1 (cf. Test M2.1, points 1 and 2).

  2. Each system produces the prosody description for these items.

  3. The evaluation agency generates delexicalized utterances based on the prosody description: for voiced sounds, the signal is generated using two harmonic sinusoidal components; unvoiced sounds are rendered as silence. The f0 and energy (for voiced sounds) and the duration (for voiced and unvoiced sounds) are consistent with the prosody description.

  4. A judgment test is performed by naive subjects. Each subject reads the original text sentences and judges whether the prosody is appropriate for each sentence, using a 5-point scale.


Test M2.3: Functional test using delexicalized utterances.

  1. The procedure for generating the stimuli (delexicalized utterances) is the same as in test M2.2.

  2. For each utterance, the subjects have to choose which sentence, from a set of 5, is most appropriate for that prosody. The sentences should differ in phrase modality, boundaries, phrase accent or number of syllables. One of the sentences is the correct one, i.e., the original sentence presented to each system.



      1. Module 3: Speech generation.


The third module produces speech from the phonetic and (acoustic) prosodic description. Segmental quality or segmental identification is one of the main factors in achieving good overall quality, in particular for words which cannot easily be predicted from the context, as is the case for proper names, figures, etc.

Some decades ago, several tests were designed to evaluate segmental quality. For instance, the Diagnostic Rhyme Test (DRT) and the Modified Rhyme Test (MRT) evaluate segmental intelligibility by identification of the initial or final consonants of CVC words from a closed set of options (two or six). However, these tests are not well suited to evaluating state-of-the-art methods. Corpus-based systems look for speech segments that are as long as possible under prosodic restrictions. To synthesize short words, prosody is not so relevant, and therefore the system could be tuned to a very different working mode: the system could simply retrieve the CVC words found in the database, so the quality would be nearly the same as that of natural speech. The SAM Standard Segmental Test is another method that evaluates segmental quality using meaningless words. As stated before, unit selection systems are designed for reading sentences, not words, and the segmental quality depends on the task. Therefore, we propose to evaluate segmental quality based on sentences.

One important effect of segmental quality is intelligibility. Furthermore, other aspects, such as naturalness, are also affected by segmental quality. Intelligibility can be evaluated by functional tests (subjects transcribe what they hear), but naturalness requires judgment tests similar to the ones needed to evaluate the overall quality of the system. Both aspects, intelligibility and naturalness, need to be evaluated because sometimes there is a trade-off between them. It is possible to produce speech that is perfectly intelligible but very unnatural (for instance, by choosing a very slow speaking rate). So, both tests (functional and judgment) will be used to evaluate segmental quality.
Intelligibility: functional test.

To evaluate intelligibility we will use the Semantically Unpredictable Sentences (SUS) test, based on the one proposed by SAM. This test consists of a set of syntactic structures or templates. The lexical slots are filled with words from the parliament domain. The words should be chosen so that they are known to most people (for instance, avoiding overly specific technical terms). The sentences are semantically unpredictable but syntactically correct. For each sentence, the input to the speech generation module is prepared. Based on that input, the systems produce synthetic speech, which is presented to the subjects. They transcribe what they hear. The measure is the word error rate.


Judgment

Segmental quality will also be evaluated based on judgment. Some sentences will be selected from the parliamentary domain. For each sentence, the input to the speech generation module will be produced. Based on this input, all the systems produce synthetic speech. The synthetic sentences are presented to the subjects, who rate the sentences in terms of naturalness and intelligibility.



Note that in both tests, the correct prosody has to be provided. The most problematic part is the prosody, which should be as natural as possible. To achieve that, the prosodic features (acoustic prosody) will be based on readings of the sentences by a professional speaker. However, the acoustic modules are tuned to one particular speaker. This is especially true for concatenative systems based on unit selection. In order to provide the modules with a prosody description matched to their voices, if possible the sentences will be read by the baseline speakers, i.e., the same speakers that recorded the baseline corpus. If this is not possible, a speaker with the same tone will be used. In this case, the evaluation agency will provide an adaptation corpus.
Test M3.1: Evaluation of speech generation module: functional test

  1. The evaluation agency selects N sentences to be used as templates. These sentences should be syntactically correct. The length of the templates is approx. 10-15 words.

  2. For each template, several sentences are produced by replacing the lexical words (nouns, adjectives, verbs, etc.) with other words with the same morphosyntactic features, chosen from the parliament domain.

  3. For each sentence, the evaluation agency produces the input to module 3: words, phonetic transcription and prosody description. This is based on the reading of the sentences by one professional speaker, who should be the same as the baseline voice speaker. If this is not possible (the speaker is not available), then a speaker with the same tone and speech rate will be selected and an adaptation corpus will be provided.

  4. Each system generates the synthetic speech based on the input. The systems have to produce the words in the input, but they are not forced to respect the input features (phonetic or prosodic) exactly. Some systems use the prosody description only as a rough guide, and the final prosody depends on the selected segments.

  5. The synthetic utterances are degraded with additive or multiplicative noise so that the intelligibility of the synthetic speech drops. The noise characteristics will be included in the evaluation report.

  6. The evaluation agency presents the synthetic speech to the subjects. The subjects transcribe in words the sentence they hear. Each sentence can be listened to twice. The evaluation metric is the word error rate as used in speech recognition (NIST tool).


Test M3.2: Evaluation of speech generation module: judgment test

  1. The evaluation agency selects N items (sentences) from the parliamentary domain. The length of the sentences is approx. 10 words.

  2. For each sentence, the evaluation agency produces the input to module 3: words, phonetic transcription and prosody description. This is based on the reading of the sentences by one professional speaker, who should be the same as the baseline voice speaker. If this is not possible (the speaker is not available), then a speaker with the same tone and speech rate will be selected and an adaptation corpus will be provided.

  3. Each system generates the synthetic speech based on the input. The systems have to produce the words in the input, but they are not forced to respect the input features (phonetic or prosodic) exactly. Some systems use the prosody description only as a rough guide, and the final prosody depends on the selected segments.

  4. The evaluation agency presents the stimuli to the subjects. The stimuli are the synthetic sentences. Furthermore, if the baseline speaker was available, the recordings are added as a top-line reference.

  5. The subjects rate the naturalness and the intelligibility of the sentence on a scale from 1 to 5.



    1. Evaluation of specific research topics


The evaluations described previously refer to conventional text-to-speech systems. Significant improvement is needed to achieve natural speech. Furthermore, in TC-STAR it is planned to investigate two specific research areas: voice conversion and expressive speech. These two topics require specific evaluation tests.
      1. Voice conversion (VC)


Voice conversion is the adaptation of the characteristics of a source speaker’s voice to those of a target speaker. When evaluating voice conversion technology, generally, we have two questions in mind:

  • Does the technique change the speaker identity in the intended way?

  • How is the overall sound quality of the converted speech?

The answers can be found by applying subjective and objective error criteria. The former is based on listening tests. The latter expresses the distance between the converted speech and corresponding reference speech of the target speaker. However, our experience shows that the objective evaluation of voice conversion technology has severe shortcomings. Consequently, in this document we develop a plan limited to subjective measures.

In TC-STAR, both conventional intralingual and cross-language voice conversion are to be investigated. The considered languages are English, Spanish and Mandarin; the combinations for cross-language voice conversion are English-Spanish and English-Mandarin.


The Training Corpus

As stated in the specifications on LR, the voice conversion corpus consists of four bilingual speakers (two female and two male). Each speaker produces about one hour of speech in both covered languages. The read contents are based on parallel texts taken from parliamentary speeches.


The Evaluation Corpora

For subjective evaluation, we found that none of the conventional procedures provides the information required to completely answer the first question above. Therefore, we suggest an evaluation method to be used in TC-STAR that, in some respects, is based on a proposal of Kain and Macon [Kain01]. Looking at state-of-the-art voice conversion technology, we note that most systems only transform the vocal tract and excitation, whereas some approaches aim at transforming the speaker-dependent prosody as well. To be applicable to both kinds of systems, we propose to create two separate evaluation corpora that exclude or include prosody conversion, respectively.


The Evaluation Corpus Excluding Prosody. In order to achieve a similar prosody for all involved speakers, we apply an extension of the 'mimic' approach presented in [Kain01]. In the first evaluation, voice conversion is applied only to the speakers specified in this deliverable (D8a). In the following evaluation campaigns, new speakers will be recorded solely for the evaluation.

The mimic voice conversion corpus contains 200 sentences. These sentences are split into two sets, development corpus (150 sentences) and evaluation corpus (50 sentences).

In the first evaluation campaign 4 voice transformations are defined:


  • One transformation from a female voice into a female voice.

  • One transformation from a female voice into a male voice.

  • One transformation from a male voice into a female voice.

  • One transformation from a male voice into a male voice.

For the evaluation of English intralingual conversion, we choose the four speakers that have the most native-like pronunciation. For cross-language conversion to English, we take those speakers that have the source language as their mother tongue.

In the second and third evaluation campaigns, new speakers will be recorded for the evaluation. The development corpus has to be provided to estimate the transformation, and the evaluation corpus will be used for testing the performance (evaluation tests).


The Evaluation Corpus Including Prosody. Here, we expect the corpus speakers to use their individual prosody, i.e., no template speaker is required.
Subjective Evaluation

In order to prevent the subjects from interpreting their decisions, they should not be familiar with the background of the test. In particular, they must not know the contents of this evaluation plan. That is, ideal evaluation subjects are persons that do not have any specific knowledge about speech processing.

The evaluation web page contains a clear instruction of what the subjects are to do, e.g.:

We are analyzing differences of voices. For this reason, you are asked to identify if two samples come from the same person or not. Please, do not pay attention to the recording conditions or quality of each sample, only to the identity of the person.

So, for each pair of voices, do you think they are

(1) definitely different,

(2) probably different,

(3) not sure,

(4) probably identical,

(5) definitely identical?"
Voice Identity Conversion

To keep the evaluation task as convenient and clear as possible, two speech samples are presented at a time. Each speech sample consists of 10 sentences that are randomly chosen from the evaluation corpus of 50 sentences. The subjects are not forced to listen to the complete sample but can stop the playback whenever they want. The samples of two compared voices are based on identical sentences, whereas, for each comparison, the randomization is executed anew to prevent the subjects from becoming bored. Each subject evaluates the same test, i.e., the randomizations are executed beforehand. The evaluated voice conversion system has to convert the selected 10 sentences using the four defined transformations. During the evaluation, the subjects listen to 4 voice pairs consisting of the conversion results and the respective reference (target) speech. Besides, they have to rate the similarity of the unconverted voices, i.e., there are 4 more pairs consisting of the source speech and the reference. These 8 voice pairs are randomized, so the subjects do not know whether they are comparing the converted voice with the source or with the target.


Preparing the Test. During the recording of the evaluation corpus excluding prosody, we adjust the pitch of the template speaker by adding a pitch offset such that the respective corpus speaker feels comfortable. To make the prosody as speaker-independent as possible, this offset is to be deducted in the test. This is done by providing the evaluated voice conversion system with the pitch offset values of the source and target speaker for each considered pair of utterances in the test. As each voice conversion system should include a pitch modification facility, this pitch offset is to be taken into account when synthesizing the converted speech. When comparing unconverted source and target utterances, the mean pitch of the source speech is adapted to that of the target speech by means of a PSOLA technique. A deterioration of the speech quality can be accepted, as the subjects are asked to ignore it when evaluating the voice identity conversion.
Voice Conversion Score. In order to compare the performance of different voice conversion systems or to track a system's progress from one evaluation to the next, we define a voice conversion score with properties similar to those of the mean opinion score used for quality assessment. Since the performance of the conversion highly depends on the difference between the involved voices (source and target), this score should take into account both the distance between the converted and the target voice and that between the source and the target voice. Here we define a score that measures the subjective distance between the target speaker and the transformed speaker, taking values between 0.0 and 1.0.

  • Let s(converted, target) and s(source, target) be the subject's ratings when comparing the converted voice with the target voice and the source voice with the target voice, respectively. The value of s(·) follows the listening-test scale above, from 1 (definitely different) to 5 (definitely identical), so that a high s(converted, target) indicates voice conversion success.

  • For each subject, and for each transformation, we define:

Ds = [5-s(converted,target)]/[5-s(source,target)]

Note that



  • If s(converted,target) < s(source,target), Ds is set to 1.0 by definition.

  • If s(converted,target) = s(source,target)=5, the sample should not be counted.

  • In the other cases, this expression equals 1.0 if s(converted,target) = s(source,target), i.e., if the conversion showed no progress.

The final voice conversion score is the mean over all considered samples of all involved subjects.
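Using the listening-test scale above (1 = definitely different, 5 = definitely identical), the score and its special cases can be summarized in the following illustrative Python sketch; function and variable names are our own.

def conversion_distance(s_converted_target, s_source_target):
    # Normalized subjective distance Ds between converted and target voice.
    # Ratings follow the listening-test scale (1 = definitely different,
    # 5 = definitely identical). Returns None when the sample is not counted.
    if s_source_target == 5:        # source already judged identical to the target
        return None                 # the denominator would be zero
    ds = (5 - s_converted_target) / (5 - s_source_target)
    return min(ds, 1.0)             # capped at 1.0 (no progress or worse)

# The final voice conversion score is the mean over all counted samples
# of all subjects; the ratings below are hypothetical.
ratings = [(4, 2), (5, 1), (3, 3)]  # (s(converted, target), s(source, target)) pairs
values = [d for d in (conversion_distance(c, s) for c, s in ratings) if d is not None]
score = sum(values) / len(values)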
Overall Speech Quality

Since it is widely used in telecommunications, for measuring the quality of the converted speech, we apply a mean opinion score test [ITU.P800]. The listeners are asked to assess certain sentences according to the following scale: (1) bad; (2) poor; (3) fair; (4) good; (5) excellent. The mean opinion score is the arithmetic mean of all subjects’ individual scores.

Test Definition. To determine the best achievable conversion quality, the eight voices contained in the training and evaluation corpora are also considered. For the test, they are mixed with the 16 conversion outputs.
Test VC.1: Evaluation of research on voice conversion excluding prosody


  1. Voice conversion will be evaluated in several languages (English, Mandarin and Spanish). For each language, 4 voice transformations are defined: female-to-female, male-to-male, male-to-female and female-to-male. In the first evaluation campaign, the speakers are those included in the TC-STAR voice conversion corpus. For the following campaigns, two new speakers are added. (This can be either intra-lingual or cross-lingual, according to the evaluation schedule.)

  2. The new speakers will read 200 sentences selected from the voice conversion corpus. The sentences will be read using the mimicking style, as defined in the LR specifications (D8a).

  3. From these sentences, 150 will be used as training data and will be sent to the partners. The other 50 will be used for evaluation purposes.

  4. Each voice conversion system transforms the 50 evaluation sentences using the four defined transformations.

  5. The subjects rate whether a given voice pair comes from the same person or not. There are 4 pairs comparing the target and the converted voices and 4 pairs comparing the target and the source voices. Obviously, the pitch of the transformed voice should be similar to the pitch of the target voice. Furthermore, the pitch of the original source voice is shifted to be similar to the pitch of the target voice.

For each pair, the subjects listen to two files containing 10 evaluation sentences each (they are not required to listen to all the sentences) and rate the identity of both voices from 1 (definitely different) to 5 (definitely identical).

The subjects are also asked to assess certain transformed sentences using a mean opinion score test.


Test VC.2: Evaluation of research on voice conversion including prosody

This test is very similar to test VC.1. The only difference is that all the original sentences (source and target) are uttered using the natural prosody of the speaker. Furthermore, the source voice is presented without pitch adjustment.



      1. Evaluation of research on expressive speech (ES)


Most of the evaluation procedures in expressive speech are functional tests related to emotion: synthetic speech is produced using one emotion from a given predefined set. The subjects are asked to identify the emotion in the speech (closed-set answer). The aim of TC-STAR is not to produce emotional speech but expressive speech. One characteristic of expressive speech is that it can signal para-linguistic information using prosody (in the broad sense).

Producing expressive speech from general text requires extensive world knowledge and high cognitive capabilities. However, in TC-STAR we want to explore how some para-linguistic information can be derived from the source speech and used to produce the synthetic voice.

It is difficult to establish a general functional test for expressiveness, even for restricted spoken styles like the ones found in the parliament. In this first design we rely on a judgment test related to expressiveness. The subjects will be asked about the degree of expressiveness. Furthermore, they will judge whether the speech is appropriate.

To evaluate the expressive speech component in the TC-STAR framework, we require that the items include broad linguistic context. If a sentence or paragraph is presented without context, it may be difficult to infer the attitude of the speaker and other affective aspects. This general statement holds both for generating the synthetic speech (systems) and for evaluating the synthetic speech (judges).



Furthermore, one research direction requires the source speech in order to extract information from it. This speech is transcribed and translated using the same conventions as in TC-STAR WP1 and WP2. For each word in the transcription of the source speech, the starting and ending times are provided. In this first stage of the research we propose to use recorded speech (not real speech from the parliament): a professional speaker reads a paragraph taken from the parliament. This allows recording a training/adaptation corpus and obtaining recordings of the highest quality.
Test ES: Evaluation of research on expressive speech.

  1. The evaluation agency prepares the inputs using the following procedure:

  • Selects N=8 documents (transcriptions of complete interventions from a parliament) in the source language (English), together with the original speech and the translation into Spanish.

  • For each document in the source language, select one paragraph.

  • Record the paragraph with a speaker who is able to imitate the original speech from the parliament. The reading should reflect what appears in the transcription, avoiding major disfluencies (repetitions, reformulations, etc.). The speaker can be either one of the voice conversion speakers or another professional speaker. In the latter case, an adaptation corpus has to be provided.

  • Label the source speech using the WP1 orthographic conventions. Furthermore, state the starting and ending times for each word.

  • The input to the speech synthesis system is: a) the text of the whole document in the target language (linguistic context); b) the text of the selected paragraph in the source and target languages; c) the recording of the selected paragraph in the source voice, together with its labelling.

  2. Each system produces synthetic speech for the selected paragraph in the target language. As a bottom line, one of the systems is the baseline system (without introducing features for expressive speech).

  3. M subjects (M=20) evaluate the synthetic speech from all the paragraph-system pairs. The subjects are presented with a) the document in the target language and b) the synthetic speech from all the systems. For each signal they have to answer the following questionnaire:

Q1: A given voice is expressive if it transmits not only the content but also the feelings of the speaker, the position of the speaker with respect to what is being said or to the listener, which part is more relevant, etc. Please listen to the following speech samples and judge the expressiveness of the voice:

    1. The voice is not especially expressive

    2. The voice is slightly expressive but not appropriate in this context

    3. The voice is very expressive but not appropriate in this context

    4. The voice is slightly expressive and appropriate in this context

    5. The voice is very expressive and appropriate in this context

Q2: Rate from 1 to 5 the following statement: The prosody (intonation, speed, etc.) is natural and appropriate throughout the paragraph. (1: absolutely disagree; 5: completely agree).
    1. Evaluation of the speech synthesis component


In order to evaluate the system as a whole, we will use a black box test. Subjects are asked to indicate their subjective impression of global quality aspects of the synthetic output by means of rating scales. The evaluation protocol will be based on the ITU-T P.85 recommendation [ITU.P85]. In particular, we will follow the recommendations of a recent review of ITU-P80 [CS]. As in the other cases, the best system of each evaluation campaign will be used as the baseline for the next campaigns.

In the first evaluation campaign, this test will be used to define the baseline systems. The test will not evaluate voice conversion or expressive speech. Therefore, no information about the source speech or speaker will be provided. In the second campaign, this test will be redefined taking into account the experience gained in the first campaign with tests VC (voice conversion) and ES (expressive speech).


Test S1: Evaluation of speech synthesis component

  1. The evaluation agency selects N4 items (paragraphs) from the parliamentary domain. The number of items depends on the number of systems (see Table 3.1). Half of the items will be represented using the normal orthographic conventions found in the parliamentary transcriptions. The rest will be represented following the conventions of the output of the spoken translation module (WP1).

  2. Each system produces synthetic voice.

  3. A judgment test is performed by naive subjects. Each subject has to rate several aspects of the voice on a 5-point scale. To avoid learning effects, each subject judges each item produced by only one system. The number of subjects depends on the number of systems to be evaluated (see Table 3.1).


    1. Bibliography


[EAGLES] Handbook of Standards and Resources for Spoken Language Systems. Edited by Dafydd Gibbon, Roger Moore and Richard Winski; Walter de Gruyter Publishers, Berlin & New York, 1997

[ECESS] European Centre of Excellence on Speech Synthesis, www.ecess.org.

[ITU.P85] ITU-T Recommendation P.85, “A method for subjective performance assessment of the quality of speech output devices”, International Telecommunications Union publication 1994.

[ITU.P800] "Methods for Subjective Determination of Transmission Quality", ITU-T Recommendation P.800, ITU, Geneva, Switzerland, 1996.

[Kain01] A. Kain and M. W. Macon, “Design and Evaluation of a Voice Conversion Algorithm Based on Spectral Envelope Mapping and Residual Prediction”, in Proc: of ICASSP'01, Salt Lake City, USA, 2001.

[Marcus93] M. P. Marcus, B. Santorini, M. A. Marcinkiewicz: Building a Large Annotated Corpus of English: The Penn Treebank, in Computational Linguistics, Volume 19, Number 2 (June 1993), pp. 313--330 (Special Issue on Using Large Corpora).

[SAMPA] SAMPA computer readable phonetic alphabet, http://www.phon.ucl.ac.uk/home/sampa/home.htm

[Sonntag98] G. P. Sonntag, T. Portele, “PURR - a method for prosody evaluation and investigation”, Journal of Computer Speech and Language, Vol.12, No.4, October 1998 Special Issue on Evaluation in Language and Speech Technology, 437-451

[SSML] D. C. Burnett, M. R. Walker, A. Hunt, "Speech Synthesis Markup Language (SSML) Version 1.0", W3C Recommendation, Sept. 2004. http://www.w3.org/TR/speech-synthesis/






  1. XML Interface Specification

    1. Introduction


One of the objectives of TC-STAR is to design a modular synthesis system consisting of three main modules: symbolic pre-processing, prosody generation and acoustic synthesis. The modules will be implemented by different partners, and existing components will require some adaptation to fit into this modular system. Thus, a common definition of the interfaces for inter-module communications is required.

In the following sections we provide a description of the system input requirements (Section 4.2), the interfaces between the text processing and the prosody generation modules (Section 4.3), and the prosody generation and acoustic synthesis modules (Section 4.4). All the information will be formally coded in XML using the DTD included in Section 4.6. Only one DTD will be used in our modular paradigm, each module filling the corresponding part of the XML document.

For the sake of simplicity and efficiency, we will take advantage of the formal definitions achieved in the LC-STAR project [Mal 04]. We will use the DTD defined there to implement the POS tagging formal definitions. This DTD is included in Section 4.7, and it is also available at the homepage of the project [Mal 04].

This document includes several examples to clarify the different parts of the problem and to illustrate some practical considerations. Whenever possible, examples have been inserted in the text of the corresponding section (e.g. the SSML examples in Section 4.2). However, long examples showing documents written in XML using the TC-STAR DTD have been included in the appendices at the end of this report (Section 4.8). In particular, Section 4.8.1 shows an input SSML document using different available SSML tags, the corresponding TC-STAR XML document that will be fed into the prosody module appears in Section 4.8.2, and the corresponding input to the synthesis system is shown in Section 4.8.3.




    1. System input


The text input to the TTS system will be formatted into an SSML [Bur 04] conforming document. SSML (Speech Synthesis Markup Language) is a W3C Recommendation, an XML-based markup language specifically designed for assisting the generation of synthetic speech.

SSML defines tags to control the text structure (paragraphs and sentences) and to give information about the desired prosody and style (voice gender, speaker age, specific processor voice name, emphasis, pitch contour, duration, etc.). Since all the tags defined in SSML are optional, plain text can easily be transformed into an SSML document by enclosing it between the <speak> and </speak> definition tags.

Here we present two illustrative examples showing some of the capabilities of the markup language, as they appear in [Bur 04].

      1. SSML example 1



"http://www.w3.org/TR/speech-synthesis/synthesis.dtd">



xmlns="http://www.w3.org/2001/10/synthesis"

xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"

xsi:schemaLocation="http://www.w3.org/2001/10/synthesis

http://www.w3.org/TR/speech-synthesis/synthesis.xsd"

xml:lang="en-US">


You have 4 new messages.

The first is from Stephanie Williams and arrived at 3:45pm.



The subject is


ski trip





      1. SSML example 2


"http://www.w3.org/TR/speech-synthesis/synthesis.dtd">



xmlns="http://www.w3.org/2001/10/synthesis"

xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"

xsi:schemaLocation="http://www.w3.org/2001/10/synthesis

http://www.w3.org/TR/speech-synthesis/synthesis.xsd"

xml:lang="en-US">





Today we preview the latest romantic music from Example.
Hear what the Software Reviews said about Example's newest hit.




He sings about issues that touch us all.






Here's a sample.

Would you like to buy it?






    1. Interface: Text processing – Prosody generation


The Text Processing Module is in charge of the tokenization, POS tagging and phonetic transcription of the input text. A token is an individually distinguishable element of the input text. Each token is divided into words, each of them having an associated transcription and one POS tag. The words present in a document will be those originally present in the input text, those resulting from the normalization of dates, abbreviations, acronyms, etc., and those included by the different modules (text normalization or prosody generation) as a result of different necessities (e.g. fillers in the case of expressive speech).

Phonetic transcription information will be coded for each word as the way the word is spoken in isolation. We will use the SAMPA phonetic alphabet with a syllable boundary marker (-), a stress marker (") and tone markers for tonal languages. The codification of the POS is the one used in the LC-STAR project, and all the details can be found in [Mal 04]. For each POS, different attributes can be marked (e.g. number, person, case, mood, ...) and the not_specified default attribute is always implied.

In Section 4.5 we provide a hierarchical structure illustrating the XML tags (TOKEN, WORD, PHON, etc.) and their relation to each other. Refer to Section 4.8 to see a working XML example using the TC-STAR DTD (note that the example is provided for illustration of the different tags only; it does not necessarily need to be correct).

    4.4. Interface: Prosody generation – Acoustic synthesis


The prosody generation module will fill in the information regarding the syllabic structure of the words, the corresponding phonemes, and their associated pauses, accents, fundamental frequency and voice-quality attributes. As before, Section 4.5 clarifies the hierarchical structure of the XML tags (PHON, SYL, etc.) and their relation to each other. Refer to Section 4.8.3 to see a working XML example using the TC-STAR DTD (note that the example is provided for illustration of the different tags only; it does not necessarily need to be correct).

      4.4.1. Phonemic and syllabic information


The prosody generation module will associate with each word a list of corresponding phonemes. This list may not be identical to the phonetic transcription itself, since assimilation of vowels, creation of diphthongs and similar phenomena will be taken into account.

We will attach all the information related to a syllable to the first phoneme of that syllable. This methodology decouples words and syllables, so that phonemes of different words can easily be associated with the same syllable (particularly useful for cross-word association phenomena, for instance).

We will mark the beginning of each syllable (the first phoneme of each syllable will carry an extra XML element representing the syllable) and whether it is the last syllable of a word (in which case a flag will be present in the syllable structure to indicate it; absence of this flag means a non-final syllable).

In order to label the break index tier, we will follow the guidelines set by SSML [Bur 04], where six categories are defined: none, x-weak, weak, medium, strong and x-strong.

The accent level will be labelled with positive integers indicating the importance of the accent (1 indicates primary accent, 2 indicates secondary, and so on).

      4.4.2. Intensity, duration and frequency


Each phoneme will have a reference duration, as estimated by the prosody module, expressed in milliseconds and applying to the whole phoneme. The synthesis module is required to match this duration as closely as possible.

The fundamental frequency of the phonemes will be expressed as pairs of time_position (in milliseconds) and f0_value (in Hz). This is a flexible method that allows several approaches to frequency specification to be used in the system. The simplest case is a single value for the whole phoneme, which supports, for instance, the common approach of indicating the frequency value at the middle of the phoneme. More detailed fundamental frequency curves can be implemented by sampling the curve at different resolutions according to the needs of each system. Each sampled point will be specified by the time value at which it occurs and the corresponding frequency value.
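As a non-normative illustration of this representation, the sketch below (helper names are assumptions, not part of the interface) expands a list of (time_position, f0_value) pairs into a frame-level contour by linear interpolation; a single pair simply yields a constant value over the phoneme.

# Minimal sketch (not normative): expand a sampled f0 curve given as
# (time_position_ms, f0_hz) pairs within one phoneme into a frame-level
# contour by linear interpolation.
def f0_contour(samples, phoneme_dur_ms, frame_ms=5.0):
    samples = sorted(samples)
    contour = []
    t = 0.0
    while t < phoneme_dur_ms:
        if t <= samples[0][0]:
            contour.append(samples[0][1])          # before first sample: hold value
        elif t >= samples[-1][0]:
            contour.append(samples[-1][1])         # after last sample: hold value
        else:
            for (t0, f0), (t1, f1) in zip(samples, samples[1:]):
                if t0 <= t <= t1:                  # interpolate between neighbours
                    contour.append(f0 + (f1 - f0) * (t - t0) / (t1 - t0))
                    break
        t += frame_ms
    return contour

# One pair = constant 120 Hz; three pairs = rising then falling contour.
print(f0_contour([(40.0, 120.0)], 80.0))
print(f0_contour([(10.0, 110.0), (40.0, 140.0), (70.0, 120.0)], 80.0))
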



The energy or intensity of the phoneme will be specified in a similar way. As a measure of intensity we will use the mean power in dB, defined for a segment of the phoneme as 10·log10((1/N)·sum(s[n]^2)), where s[n] is the digitized speech segment of length N. Each segment will be defined by a time position indicating where it ends. The start of the segment does not have to be explicitly specified, since it can be obtained from the end position of the previous segment. When specifying the energy of the whole phoneme, or for the first segment, the start time position is taken as zero.
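For clarity, here is a minimal, non-normative computation of the mean power in dB defined above; the sampling rate and signal used in the usage example are arbitrary.

import math

# Mean power in dB of a digitized speech segment s[n] of length N,
# following the definition above: 10*log10((1/N) * sum(s[n]^2)).
def mean_power_db(segment):
    n = len(segment)
    return 10.0 * math.log10(sum(x * x for x in segment) / n)

# Example: 20 ms of a 1 kHz sine with amplitude 0.5 at a 16 kHz sampling rate.
fs = 16000
samples = [0.5 * math.sin(2 * math.pi * 1000 * i / fs) for i in range(int(0.02 * fs))]
print(round(mean_power_db(samples), 2))   # about -9.03 dB, i.e. 10*log10(0.5**2 / 2)
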
      4.4.3. Voice Quality


As there is no widely accepted definition of voice quality, we will consider different options to incorporate this knowledge into the synthesis system. The approaches proposed in subsections 4.4.3.1 and 4.4.3.2 closely follow the studies of Laver [Lav 80] as presented by E. Keller in [Kel --]. Subsection 4.4.3.3 is based on the voice-source parameterization studies by C. Gobl [Gob 03]. Voice quality information should be considered optional, since not all synthesis procedures require this knowledge.
        4.4.3.1. Laryngeal parameters


In order to select different voice properties using articulatory correlates of the larynx and associated muscles, the following voice-forms are available:

  • modal,

  • falsetto,

  • whisper (can be combined with falsetto or modal),

  • creak (can be combined with falsetto or modal),

  • harshness (can be combined with other voice-forms),

  • breathiness (can be combined with other voice-forms).

In order to create new voice-forms, more than one element can be used, indicating the relevance of each one as a percentage of the total (e.g. modal 50%, creak 15% and breathiness 35%).
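As a small, non-normative consistency check of such a mixture (the exclusion of a modal/falsetto combination is inferred from the list above rather than stated explicitly):

# Minimal sketch (not normative): validate a laryngeal voice-form mixture such
# as "modal 50%, creak 15%, breathiness 35%". It assumes, from the list above,
# that modal and falsetto cannot be mixed with each other and that the given
# percentages must add up to 100.
VOICE_FORMS = {"modal", "falsetto", "whisper", "creak", "harshness", "breathiness"}

def check_mixture(mix):
    assert set(mix) <= VOICE_FORMS, "unknown voice-form"
    assert abs(sum(mix.values()) - 100.0) < 1e-6, "percentages must total 100"
    assert not {"modal", "falsetto"} <= set(mix), "modal and falsetto cannot be combined"
    return True

print(check_mixture({"modal": 50, "creak": 15, "breathiness": 35}))   # True
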
        4.4.3.2. Tension settings


Different qualities can also be indicated by using the following classification of voices based on a general tensing or laxing of the entire vocal tract musculature:


  • tense,

  • sharp,

  • shrill,

  • metallic,

  • strident,

  • lax,

  • soft,

  • dull,

  • guttural,

  • mellow.


Only one voice type can be selected at a time; they cannot be combined as in Section 4.4.3.1.


        4.4.3.3. Glottal source related parameters


In order to determine the voice quality of the synthetic output, we will use the following features related to voice-source [Gob 03]:

  • EE excitation energy (overall strength of source excitation) [dB],

  • OQ open quotient (proportion of the glottal period during which the glottis remains open) [%],

  • AS aspiration noise [dB],

  • RA return time (sharpness of glottal closure) [%],

  • RG glottal frequency (degree of boosting in the areas of the first and the second harmonic) [%],

  • RK glottal asymmetry (relation between opening and closing phase of the glottal period) [%].

EE and AS are power measures and will be expressed in dB (see 4.4.2). OQ, RA, RG and RK are expressed as fractions of the pitch period (in %).
    4.5. Interface structure


Below follows a structured description of the data involved in the TTS process and of its internal organization in the document. A formal DTD description is included in Section 4.6. Optional elements are explicitly marked below (no mark indicates that the element is mandatory).

Please note that the OPTIONAL mark on voice quality indicates that voice-quality information is completely optional and need not be used at all in the TTS system. The OPTIONAL mark on PHON, on the other hand, merely indicates that either the word results in no associated phonemes or the text has not yet passed through the prosody module; PHON is the basic information unit of the synthesis module, necessary to produce any speech output. A minimal data-structure sketch of this hierarchy is given after the outline below.




  1. TOKEN

    1. WORD (“word”)

      • POS (see [1])

      • PHONETIC (“transcription”)

      • PHON (“phoneme”) [OPTIONAL: FILLED BY PROSODY GENERATION MODULE]

        1. Duration (milliseconds)

        2. Frequency (sampled curve)

          • Time pos (milliseconds), value (Hz)



        3. energy

          • time pos (milliseconds), value (dB)



        4. voice-quality [OPTIONAL]

          • laryngeal [OPTIONAL]

            • modal [%]

            • falsetto [%]

            • whisper [%]

            • creak [%]

            • harshness [%]

            • breathiness [%]

          • tension [OPTIONAL]

            • tense, sharp, shrill, metallic, strident, lax, soft, dull, guttural or mellow

          • source [OPTIONAL]

            • EE [dB]

            • OQ [%]

            • AS [dB]

            • RA [%]

            • RG [%]

            • RK [%]

        5. SYL [ONLY IN FIRST PHONEME OF SYLLABLE]

          • Last-syllable (boolean flag)

          • Accent level (1, 2, etc.)

          • Break level: duration (milliseconds) and strength (none, x-weak, weak, medium, strong or x-strong)

      • PHON (“phoneme”) [OPTIONAL: FILLED BY PROSODY GENERATION MODULE]



      • PHON (“phoneme”) [OPTIONAL: FILLED BY PROSODY GENERATION MODULE]

    2. WORD



    3. WORD

  2. TOKEN



  3. TOKEN
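
To make the hierarchy above easier to follow, here is a minimal, non-normative sketch of the same structure as nested Python data classes; all field names and defaults are illustrative assumptions only, and the normative definition remains the TC-STAR DTD of Section 4.6.

from dataclasses import dataclass, field
from typing import List, Optional, Tuple

# Non-normative sketch of the TOKEN > WORD > PHON/SYL hierarchy outlined above.

@dataclass
class Syllable:                       # carried by the first phoneme of the syllable
    last_syllable: bool = False       # flag: last syllable of the word
    accent_level: int = 0             # 1 = primary accent, 2 = secondary, ...
    break_strength: str = "none"      # none, x-weak, weak, medium, strong, x-strong
    break_duration_ms: float = 0.0

@dataclass
class Phoneme:                        # PHON: filled in by the prosody module
    phoneme: str                      # SAMPA symbol
    duration_ms: float = 0.0
    f0: List[Tuple[float, float]] = field(default_factory=list)      # (time ms, Hz)
    energy: List[Tuple[float, float]] = field(default_factory=list)  # (time ms, dB)
    voice_quality: Optional[dict] = None                             # OPTIONAL
    syllable: Optional[Syllable] = None                              # only on first phoneme

@dataclass
class Word:
    text: str
    pos: str                          # LC-STAR POS tag (see Section 4.7)
    phonetic: str                     # SAMPA transcription of the word in isolation
    phons: List[Phoneme] = field(default_factory=list)

@dataclass
class Token:
    words: List[Word] = field(default_factory=list)
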





    4.6. TC-STAR DTD



<!--
TC-STAR TTS DTD

Specification of the interfaces in the TTS modular system.
Based on the SSML DTD as provided by W3C. The original copyright
is reproduced below.
-->

<!--
SSML DTD (20031204)

Copyright 1998-2003 W3C (MIT, ERCIM, Keio), All Rights Reserved.
Permission to use, copy, modify and distribute the SSML DTD and
its accompanying documentation for any purpose and without fee is
hereby granted in perpetuity, provided that the above copyright
notice and this paragraph appear in all copies.

The copyright holders make no representation about the suitability
of the DTD for any purpose. It is provided "as is" without expressed
or implied warranty.
-->
%lc-star;
lax | soft | dull | guttural | mellow | %ns;">

xml:lang NMTOKEN #REQUIRED

>
xml:lang NMTOKEN #IMPLIED

>
xml:lang NMTOKEN #IMPLIED

>
xml:lang NMTOKEN #IMPLIED

gender (male | female | neutral) #IMPLIED

age %integer; #IMPLIED

variant %integer; #IMPLIED

name CDATA #IMPLIED

>

pitch CDATA #IMPLIED



contour CDATA #IMPLIED

range CDATA #IMPLIED

rate CDATA #IMPLIED

duration %duration; #IMPLIED

volume CDATA #IMPLIED

>

src %uri; #REQUIRED



>
xml:lang NMTOKEN #IMPLIED

>
level (strong | moderate | none | reduced) "moderate"

>
interpret-as NMTOKEN #REQUIRED

format NMTOKEN #IMPLIED

detail NMTOKEN #IMPLIED

>

alias CDATA #REQUIRED



>

ph CDATA #REQUIRED

alphabet CDATA #IMPLIED

>

time CDATA #IMPLIED



strength (none | x-weak | weak | medium | strong | x-strong) "medium"

>

name CDATA #REQUIRED



>
uri %uri; #REQUIRED

type CDATA #IMPLIED

>

name NMTOKEN #IMPLIED



content CDATA #REQUIRED

http-equiv NMTOKEN #IMPLIED

>

duration CDATA "not_specified"



>

falsetto CDATA "not_specified"

whisper CDATA "not_specified"

creak CDATA "not_specified"

harshness CDATA "not_specified"

breathiness CDATA "not_specified"

>

OQ CDATA "not_specified"



AS CDATA "not_specified"

RA CDATA "not_specified"

RG CDATA "not_specified"

RK CDATA "not_specified"

>

value CDATA #REQUIRED

>
accent (%accent;) "not_specified"

>

    4.7. LC-STAR DTD


This DTD (copied from [Mal 04]) is the basis of the POS tagging and the phonetic transcription in the TC-STAR definition of the inter-module interfaces. It was created during the LC-STAR project in order to provide a formal definition of the lexica needed for the different languages supported there, which at the time of writing are:

• Catalan,

• Finnish,

• German,

• Greek,

• Hebrew,

• Italian,

• Mandarin Chinese,

• Russian,

• Slovenian,

• Spanish,

• Standard Arabic,

• Turkish,

• US-English.

See  [Mal 04] for a more detailed explanation.

ART | ADV | CON | ADP | INT | PAR | PRE |

ONO | MEW | AUW | IDI | PUN | ABB | LET">

1.1.2. | 1.1.3. | 1.1.4. | 1.2.1. | 1.2.2. | 1.3. |

1.4. | 1.5.1. | 1.5.2. | 1.6. | 2.1.1. | 2.1.2. |

2.1.3. | 2.1.4. | 2.2.1. | 2.2.2. | 2.2.3. |

3.1.1. | 3.1.2. | 3.1.3. | 3.1.4. | 3.1.5. | 4.1.1. |

4.1.2. | 4.1.3. | 4.1.4. | 4.1.5. | 4.1.6. | 5.1.1. |

5.1.2. | 5.1.3. | 5.1.4. | 5.1.5. | 5.2. |

6.1.1. | 6.1.2. | 6.1.3. | 6.1.4. | 6.1.5. |

6.2.1. | 6.2.2. | 6.2.3. | 6.2.4. | 6.2.5. |

6.2.6.">


CIT | STR | COM | BRA | TOU | HLD">

dual | %ns;">

genitive_partitional | essive | translative |

inessive | elative | illative | adessive |

ablative | allative | abessive | instructive |

comitative | accusative | vocative | dative |

locative | instrumentative | equative |

prepositional | indeclinable | invariant | %ns;">

construct_case | agent | ness | zero |

past_participle | future_participle |

infinitive | feel_like | not_state |

not_able_state | act_of | diminutive | %ns;">

pejorative | %ns;">

PL3 | %ns;">

fit_for | in_between | agent | past_participle |

future_participle | present_participle | construct_case |

feel_like | related | just_like | zero | %ns;">

aorist | future | narrative_past | future_past |

future_narrative | past_past | narrative_narrative |

imperative | aorist_passive | %ns;">

imperative | infinitive | infinitiveI |

infinitiveII | infinitiveIII | infinitiveIV |

necessitative | desirative | participle |

adverbial_participle |participleI | participleII |

gerund | potential | progressive | progressiveII |

participle_present | participle_perfect | finite |

progressive | %ns;">

xml:lang NMTOKEN #IMPLIED >

type (%subdomain;) #REQUIRED

entries CDATA #REQUIRED>


class (%class_noun;) #REQUIRED

number (%number;) "not_specified"

gender (%gender;) "not_specified"

case (%case;) "not_specified"

type (%type_noun;) "not_specified"

appreciative (%appreciative;) "not_specified"

poss_agreem (%possessive_agreement;) "not_specified">

number (%number_adj;) "not_specified"

gender (%gender_adj;) "not_specified"

case (%case;) "not_specified"

degree (%degree;) "not_specified"

form (%form;) "not_specified"

type (%type_adj;) "not_specified"

appreciative (%appreciative;) "not_specified"

poss_agreem (%possessive_agreement;) "not_specified">

number (%number;) "not_specified"

gender (%gender_adj;) "not_specified"

person (%person;) "not_specified"

case (%case;) "not_specified"

form (%form;) "not_specified"

type (possessive | demonstrative | indefinite | interrogative |

exclamative | relative | pronominal | definite | negative |

definite_article | attributive | %ns;) "not_specified"

degree (%degree;) "not_specified">

number (%number;) "not_specified"

gender (%gender;) "not_specified"

case (%case;) "not_specified"

type (ordinal | cardinal | multiplicative | collective |

percentage | real | range | ratio | distributive | relative |

time | construct_case | indefinite | %ns;) "not_specified">

number (%number;) "not_specified"

gender (%gender;) "not_specified"

person (%person_ver;) "not_specified"

case (%case;) "not_specified"

mood (%mood;) "not_specified"

tense (%tense;) "not_specified"

voice (active | passive | reflexive | pronominal | %ns;)

"not_specified"

polarity (positive | negative | %ns;) "not_specified"

aspect (perfect | imperfect | progressiveI |

progressiveII | aorist | %ns;) "not_specified"

form (%form;) "not_specified"

copula (%flag;) "not_specified"

type (causative | reflexive | passive | reciprocal_collective |

become | acquire | able | repeat | hastily | ever_since |

almost | stay | start | continue | zero | %ns;) "not_specified">

number (%number;) "not_specified"

gender (%gender;) "not_specified"

person (%person_ver;) "not_specified"

case (%case;) "not_specified"

tense (%tense;) "not_specified"

mood (%mood;) "not_specified"

voice (active | passive | reflexive | %ns;) "not_specified"

polarity (positive | negative | %ns;) "not_specified"

aspect (perfect | imperfect | %ns;) "not_specified"

form (%form;) "not_specified"

type (finite | modal | %ns;) "not_specified">

number (%number;) "not_specified"

poss_agreem (%possessive_agreement;) "not_specified"

gender (%gender; | indeterminate) "not_specified"

person (%person;) "not_specified"

case (%case; | oblique) "not_specified"

type (personal | demonstrative | reflexive | indefinite |

interrogative | reciprocal | relative | possessive | definite |

exclamative | quantifying | negative | %ns;) "not_specified"

politeness (%flag;) "not_specified">

number (%number;) "not_specified"

gender (%gender;) "not_specified"

case (%case;) "not_specified"

type (definite | indefinite | partitive | %ns;) "not_specified">

degree (%degree;) "not_specified"

type (time | place | after_doing_so | since | when | by_doing_so |

while | as_if | without_having_done_so | ly | adamantly |

without_being_able_to_have_done_so | as_long_as |

since_doing_so | manner | %ns;) "not_specified">

type (coordinating | subordinating | %ns;) "not_specified">

number (%number;) "not_specified"

gender (%gender;) "not_specified"

person (%person;) "not_specified"

type (simple | articulated | possessive | %ns;) "not_specified">

number (%number;) "not_specified"

person (%person;) "not_specified"

tense (present | past | narrative | %ns;) "not_specified"

mood (conditional | %ns;) "not_specified"

copula (yes | %ns;) "not_specified">


    4.8. TC-STAR XML Examples


Below you will find a short example of an input text in SSML format, together with the corresponding TC-STAR XML documents at the input of the prosody and synthesis modules. As can be seen, the SSML text contains a mark specifying a prosodic change in the pronunciation of one word. Although the text processing module does not know anything about how the prosody is specified, it maintains the mark so that the prosody module can use the information (resulting in longer durations of the affected phonemes).

Please note that the examples below are merely informative and are intended to illustrate the usage and position of the different tags available to the developers. In particular, voice quality tags have been randomly chosen to show several valid combinations.


      4.8.1. SSML input


"http://www.w3.org/TR/speech-synthesis/synthesis.dtd">



xmlns="http://www.w3.org/2001/10/synthesis"

xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"

xsi:schemaLocation="http://www.w3.org/2001/10/synthesis

http://www.w3.org/TR/speech-synthesis/synthesis.xsd"

xml:lang="es">


Cuerpo
gaseiforme
que sin embargo ofrece resistencia.



      4.8.2. Prosody module input


"TC-STAR.dtd">












kw'eR - po








Ga-sej-f'or-me











...


...









...


...





...


...









...


...








...


...





...


...








      4.8.3. Synthesis module input


"TC-STAR.dtd">












kw'eR - po
































































































Ga-sej-f'or-me
























































































































































































...


...

...


...









...


...

...


...





...


...

...


...









...


...

...


...








...


...

...


...





...


...









    4.9. References

[Mal 04] G. Maltese and C. Montecchio, “General and language-specific specification of contents of lexica in 13 languages,” LC-STAR Deliverable, May 2004. [Online]. Available: http://www.lc-star.com/WP2_deliverable_D2_v2.1.doc

[Bur 04] D. C. Burnett, M. R. Walker, and A. Hunt, “Speech Synthesis Markup Language (SSML) version 1.0,” W3C Recommendation, Sept. 2004. [Online]. Available: http://www.w3.org/TR/speech-synthesis/

[Lav 80] J. Laver, The Phonetic Description of Voice Quality. Cambridge University Press, 1980.

[Kel --] E. Keller, “The Analysis of Voice Quality in Speech Processing,” in Lecture Notes in Computer Science, Springer Verlag, to be published.

[Gob 03] C. Gobl, “The voice source in speech communication,” Ph.D. dissertation, Department of Speech, Music and Hearing, KTH, Stockholm, 2003.



[Spr 99] Richard Sproat, Alan Black, Stanley Chen, Shankar Kumar, Mari Ostendorf, and Christopher Richards. “Normalization of non-standard words: WS'99 final report”. Technical Report, September 1999.

1 www.tc-star.org

2 www.ecess.org

3 BLARK is an initiative of the HLT community to make the language resources needed for each language available.

4 This experience has been made during the process to specify LR for speech recognition (see specifications developed for the SpeechDat family: www.speechdat.org).

5 Within the DARPA projects, communicative situations are defined to support research in ASR. For this purpose, LRs for different communicative situations such as ‘read speech’ and ‘conversational speech’, and for different domains such as ‘broadcast news’, ‘call home’, etc., are defined and the related LRs are provided.

6 www.LC-STAR.com

7 Within TC-STAR a great part of the corpus ‘transcribed speech’ is derived from ‘parliamentary speech’. This domain is used for the TC-STAR speech-to-speech translation demonstrator for the language pairs UK - ES and UK - Mandarin.

8 If the concatenated speech segments are derived from speech sections with different speaking modes, the synthetic speech does not sound ‘consistent’. Personal communication from Nick Campbell (ECESS, Maribor, June 2004).

9 e.g. For Mandarin we will have only one single baseline voice; either male or female.

10 This voice can be produced by one of the baseline speakers. Beforehand however it has to be controlled if the bilingual speakers are able to mimic the baseline speaker’s prosody. Otherwise another suited template speaker has to be found.

11 For this voice the bilingual male speakers can be used.

12 For this voice the bilingual female speakers can be used.

13 Typical speech defects even among trained speakers might be related to s-, th-, r- or other sounds, missing distinction between voiced/unvoiced, incorrect nasalization or aspiration, etc.

14 For TC-STAR, C1.1_T should consist of transcribed ‘parliamentary speeches’ translated from English to Spanish and Mandarin.

15 The specific domains selected are defined in the LSP.

16 All ‘root expressions’ building numbers should be included together with their prosodic variations (e.g. prosody with respect to the position (end, middle, beginning) of a composed number).

17 E.g. for Mandarin

18 Lexical stress does not exist in languages like Mandarin; this issue will be addressed in the LSP.

19 cf. http://www.lc-star.com/WP2_deliverable_D2_v2.1.doc; for Mandarin and Spanish already validated LC-STAR lexica exist.

20 A lower limit of RT60 > 0.1 s is recommended in order to achieve a natural-sounding voice. Recordings from anechoic chambers can be made natural in a post-processing step by applying reverberation algorithms.

21 The quality of the signal of the laryngograph has to be defined.

22 Personal communication by H. Tillmann and H. Pfitzinger (Phonetic Institute of Munich): from their investigations it can be concluded that these small varying delays are not relevant for marking pitch pulses for concatenative synthesis based on PSOLA principle.

23 This proposal follows the specifications of the LILA project (cf. http://www.lilaproject.org/)

24 A Matlab program calculating dBA by approximating the filter of Fig B1 is available at Siemens. This program has been made available by the acoustic group (Prof. Hugo Fastl) of the Institute of Man-Machine Communication at the Technical University of Munich (TUM).

25 http://www.sweetwater.com/shop/studio/acoustic-treatment/glossary.php#55

26 In this document, parliamentary domain means transcriptions from the European Parliament. In case that this is not applicable for a given language (e.g. Mandarin), a similar domain will be defined.

27 The pronunciation of foreign proper names is out of the scope of the project and will not be considered in the first phase of TC-STAR. It is expected that frequent foreign names will appear in the lexicon (LC-STAR proper names or baselines voices). However, the main goal of the test M1.3 is not to evaluate lexicons but methods to cope with out-of-vocabulary words.

