V2 Voice Clone Tutorial

Text to Speech Fine-grained Control

Advanced control over speech generation

Getting Started

Disabling normalization may reduce the stability of reading numbers, dates, and URLs. You'll need to handle these cases manually for best results.
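As a rough illustration of handling one such case manually, the sketch below spells out runs of digits before the text is sent for synthesis. The helper name `spell_out_digits` is hypothetical and not part of any SDK, and the digit-by-digit English reading is only one possible convention; dates and URLs would need their own rules.

```python
import re

# English words for the digits 0-9, indexed by digit value.
DIGIT_WORDS = ["zero", "one", "two", "three", "four",
               "five", "six", "seven", "eight", "nine"]


def spell_out_digits(text: str) -> str:
    """Replace every run of ASCII digits with its digits read one by one.

    Hypothetical helper for illustration only: it does not handle
    cardinal numbers ("forty-two"), dates, or URLs.
    """
    def replace(match):
        return " ".join(DIGIT_WORDS[int(d)] for d in match.group())

    return re.sub(r"[0-9]+", replace, text)


print(spell_out_digits("Call 911 now."))
```

Reading digits individually suits phone numbers and codes; quantities such as prices usually read better when converted to full number words instead.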

Phoneme Control

Phoneme control allows you to specify exact pronunciations for words or characters. Currently, we support:

  • CMU Arpabet (for English)
  • Pinyin (for Chinese)

To use phoneme control, wrap the desired pronunciation in <|phoneme_start|> and <|phoneme_end|> tags. Each pair of tags should contain the pronunciation of a single word or character.

Examples

Standard: I am an engineer.

With control: I am an <|phoneme_start|>EH N JH AH N IH R<|phoneme_end|>.

Standard: 我是一个工程师。

With control: 我是一个<|phoneme_start|>gong1<|phoneme_end|><|phoneme_start|>cheng2<|phoneme_end|><|phoneme_start|>shi1<|phoneme_end|>。
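The tagged strings in the examples above can also be built programmatically. The following is a minimal sketch: the helper names `with_phonemes` and `pinyin_phonemes` are hypothetical, and only the tag strings themselves come from this page.

```python
PHONEME_START = "<|phoneme_start|>"
PHONEME_END = "<|phoneme_end|>"


def with_phonemes(pronunciation: str) -> str:
    """Wrap one word's (or one character's) pronunciation in phoneme tags."""
    return f"{PHONEME_START}{pronunciation}{PHONEME_END}"


def pinyin_phonemes(syllables) -> str:
    """Wrap each Pinyin syllable in its own tag pair, one per character."""
    return "".join(with_phonemes(s) for s in syllables)


# English: one tag pair around the Arpabet phones of a single word.
english = f"I am an {with_phonemes('EH N JH AH N IH R')}."

# Chinese: one tag pair per character.
chinese = f"我是一个{pinyin_phonemes(['gong1', 'cheng2', 'shi1'])}。"

print(english)
print(chinese)
```

Keeping the tag strings in constants avoids typos in the delimiters, which would otherwise be read aloud as ordinary text.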

Paralanguage

Paralanguage controls allow you to add natural speech elements and pauses to make the generated speech sound more human-like. There are two main types of controls:

Pause Words

You can use common pause words like "um", "uh", "嗯", "啊" to control the rhythm of the speech.

Special Effects

The following special effects can be added using parentheses:

Effect           Description          First Available   Stage
(break)          Short pause          V2                Experimental
(long-break)     Extended pause       V2                Experimental
(breath)         Breathing sound      V2                Experimental
(laugh)          Laughter sound       V2                Experimental
(cough)          Coughing sound       V2                Experimental
(lip-smacking)   Lip smacking sound   V2                Experimental
(sigh)           Sighing sound        V2                Experimental

The effects (laugh), (cough), (lip-smacking), and (sigh) are still under development. You may need to repeat them several times for better results.

English Example:

Standard: I am an engineer.

With paralanguage: I am, um, an (break) engineer.

Chinese Example:

Standard: 我是一名工程师。

With paralanguage: 我,嗯,是一名(break)工程师。
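When assembling text in code, a small helper can guard against misspelled effect names and make repetition explicit. This is a sketch only: the function `effect` is hypothetical, the set of names is copied from the table above, and separating repeated effects with a space is an assumption rather than documented behavior.

```python
# Effect names listed in the table above.
EFFECTS = {"break", "long-break", "breath", "laugh",
           "cough", "lip-smacking", "sigh"}


def effect(name: str, repeat: int = 1) -> str:
    """Return the parenthesized effect tag, repeated `repeat` times.

    Raises ValueError for a name that is not in the table or for a
    repeat count below 1.
    """
    if name not in EFFECTS:
        raise ValueError(f"unknown effect: {name!r}")
    if repeat < 1:
        raise ValueError("repeat must be at least 1")
    # Assumption: repeated tags are separated by a single space.
    return " ".join(f"({name})" for _ in range(repeat))


sentence = f"I am, um, an {effect('break')} engineer."
print(sentence)
print(effect("laugh", repeat=2))
```

Because the less mature effects may need repeating, the `repeat` parameter keeps that tuning in one place instead of scattering duplicated tags through the text.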