TC-STAR Project Deliverable no. D8 Title: TTS Baselines & Specifications








Technology and Corpora for Speech to Speech Translation
http://www.tc-star.org
















Project no.: FP6-506738
Project Acronym: TC-STAR
Project Title: Technology and Corpora for Speech to Speech Translation
Instrument: Integrated Project
Thematic Priority: IST

Deliverable no.: D8
Title: TTS Baselines and Specifications



Due date of the deliverable: 30th of September 2004
Actual submission date: 31st of March 2005
Start date of the project: 1st of April 2004
Duration: 36 months
Lead contractor for this deliverable: UPC

Authors: Antonio Bonafonte (UPC), Harald Höge (Siemens AG), Herbert S. Tropf (Siemens AG), Asuncion Moreno (UPC), Henk van der Heuvel (SPEX), David Sündermann (UPC), Ute Ziegenhain (Siemens AG), Javier Pérez (UPC), Imre Kiss (Nokia)


Revision: [final]

Project co-funded by the European Commission within the Sixth Framework Programme (2002-2006)

Dissemination Level

[X] PU: Public
[ ] PP: Restricted to other programme participants (including the Commission Services)
[ ] RE: Restricted to a group specified by the consortium (including the Commission Services)
[ ] CO: Confidential, only for members of the consortium (including the Commission Services)



1 Introduction
2 Specifications of LR for Speech Synthesis
2.1 The Rationale of the Specifications
2.1.1 Focus of Section 2
2.1.2 Notation of Corpora
2.1.3 Design Principles of the Text Corpora
2.1.4 Size of the Text Corpora
2.1.5 Building Voices and Related Recorded Corpora
2.1.6 Speaking Mode
2.1.7 Selection of the Speakers and Related Corpora
2.1.8 Studio for Recording, Speech Quality and Pitch Marking
2.1.9 Annotation
2.1.10 Database interchange format
2.1.11 Validation Criteria
2.2 Languages
2.3 Speakers and Speaking Modes
2.3.1 Number of Speakers
2.3.2 Speaker Profile
2.3.3 Speaking Modes
2.3.4 Casting of speakers
2.4 Specification of Corpora
2.4.1 Amount of Corpora
2.4.2 Kind and Size of Sub-corpora of Corpus C_T
2.4.3 Coverage Issues of the Text Corpus C_T
2.4.4 Prompt Texts C_PT
2.4.5 Corpus for the Pre-Selection of the Baseline Voices
2.4.6 Corpus for the Final Selection of the Baseline Voices
2.4.7 Corpus for the Selection of the Conversion Voices and Expressive speech voices (C_5MR)
2.4.8 Baseline Corpus
2.4.9 Cross-language Voice Conversion Corpus
2.4.10 Intra-Lingual Voice Conversion Corpus
2.4.11 Corpus for expressive speech
2.5 TTS Lexicon
2.5.1 Common Word Lexicon
2.5.2 Proper Name Lexicon
2.6 Recording Environment and Recording Platforms
2.6.1 Quality of Speech Signal
2.6.2 Precision of Marking Epochs
2.6.3 Recording platform
2.6.4 Recording Devices
2.6.5 Recording procedure
2.7 Segmentation and annotation
2.7.1 Transcription of the Recorded Speech
2.7.2 Segmentation
2.7.3 Pitch Marking
2.8 Database interchange format
2.8.1 Storage Media and Character set
2.8.2 File Types
2.8.3 Directory structure
2.8.4 Speech and label file system hierarchy
2.8.5 Documentation directories
2.8.6 File name conventions
2.8.7 Speech file format
2.8.8 SAM Labels
2.8.9 SAM Label Files
2.8.10 Other label files
2.8.11 Table files
2.8.12 Lexicon files
2.8.13 Documentation files
2.8.14 Recommendations
2.9 References
Appendices A and B
A1 Algorithms to Achieve High Triphone and Phoneme Coverage
A1.1 Algorithm to Achieve High Triphone Coverage
A2 Mimic Sentences Adaptation and Diphone Sentences (C_10SR)
A2.1 Mimic Sentences: Calibration of the Template Speech
A2.2 Generation of the Diphone Sentences (C_10SR) from the corpus C_200SR
B1 Noise, Frequency Range, Reverberation and Recording
B1.1 Frequency Range
B1.2 Noise
B2 Reverberation RT-60
B3 Recording
In the following, proposals for recording hardware and software are given; however, each partner is free to use whatever fits best and is in accordance with the specifications.
B3.1 Proposals for recording software
B3.2 Proposals for recording hardware
B3.3 Proposals for large membrane condenser microphone
B3.4 Proposals for the laryngograph
B3.5 Proposals for the close-talk microphone
3 Specifications of Evaluation of Speech Synthesis
3.1 Introduction
3.2 Definition of speech synthesis modules
3.3 Evaluation of the speech synthesis modules
3.3.1 Module 1: Text analysis
3.3.2 Module 2: Prosody
3.3.3 Module 3: Speech generation
3.4 Evaluation of specific research topics
3.4.1 Voice conversion (VC)
3.4.2 Evaluation of research on expressive speech (ES)
3.5 Evaluation of the speech synthesis component
3.6 Bibliography
4 XML Interface Specification
4.1 Introduction
4.2 System input
4.2.1 SSML example 1
4.2.2 SSML example 2
4.3 Interface: Text processing – Prosody generation
4.4 Interface: Prosody generation – Acoustic synthesis
4.4.1 Phonemic and syllabic information
4.4.2 Intensity, duration and frequency
4.4.3 Voice Quality
4.5 Interface structure
4.6 TC-STAR DTD
4.7 LC-STAR DTD
4.8 TC-STAR XML Examples
4.8.1 SSML input
4.8.2 Prosody module input
4.8.3 Synthesis module input
4.9 References

