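Gemini 3.1 TTS Prompting Guide - Developer Community

Gemini 3.1 Flash Text-to-Speech (TTS) is a new model that you can direct to get a precise audio performance. In this post I share some tips on steering the model with prompts and show off some of its strengths.

Out of the box, gemini-3.1-flash-tts-preview interprets text naturally and decides how your words should be delivered. Plain text with no prompt attached already sounds natural. But 3.1 Flash TTS also gives you tools to steer it.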
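As a rough illustration of what "plain text, no prompt" looks like in practice, here is a minimal sketch that builds a request body for the model. The field names (`responseModalities`, `speechConfig`, `voiceConfig`, `prebuiltVoiceConfig`, `voiceName`) follow the REST shape documented for earlier Gemini TTS preview models; that they apply unchanged to gemini-3.1-flash-tts-preview is an assumption on my part. `build_tts_request` is a hypothetical helper, not part of any SDK, and nothing is sent over the network.

```python
import json


def build_tts_request(text: str, voice_name: str = "Algenib") -> dict:
    """Build a JSON-serializable request body (hypothetical helper).

    Assumption: the body shape matches the one documented for earlier
    Gemini TTS preview models. Check the current API reference before use.
    """
    return {
        # The text to be spoken; plain text works without any prompt.
        "contents": [{"parts": [{"text": text}]}],
        "generationConfig": {
            # Ask for audio output instead of text.
            "responseModalities": ["AUDIO"],
            "speechConfig": {
                "voiceConfig": {
                    "prebuiltVoiceConfig": {"voiceName": voice_name}
                }
            },
        },
    }


if __name__ == "__main__":
    body = build_tts_request("Hey there, how can I help you today?")
    print(json.dumps(body, indent=2))
```

Everything discussed in the rest of this post (tags, audio profile, scene, director's notes) goes into the `text` field; the rest of the request stays the same.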

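You can give the model rich context, such as an audio profile: who is speaking, how they speak, what their voice sounds like, and so on. You can also describe the scene (where they are, what they are doing, what the environment is like) and add any extra "director's notes" to guide the performance. The model uses this information to generate speech that fits the context.

You can now also use tags to control how specific parts of the text are delivered. Tags are inline modifiers, such as [whispers] or [laughs], that give you fine-grained control over the delivery. Use them to change the tone, pacing, and emotional feel of a sentence or a section of text. You can also use them to add interjections and other non-verbal sounds to the performance, such as [cough], [sighs], or [gasp].

There is no fixed list of tags. You can be as creative as you like inside the [] brackets, and the model will always do its best to understand and interpret them.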
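Because tags are plain bracketed text, it is easy to inspect them on the client side. The sketch below is one possible way to list the tags in a script and to recover the untagged text (for subtitles, say). `find_tags` and `strip_tags` are hypothetical helper names, and the regular expression assumes tags are never nested.

```python
import re

# A tag is any non-empty run of characters between [ and ], with no nested brackets.
TAG_PATTERN = re.compile(r"\[([^\[\]]+)\]")


def find_tags(text: str) -> list[str]:
    """Return the tag contents in order of appearance, without brackets."""
    return [tag.strip() for tag in TAG_PATTERN.findall(text)]


def strip_tags(text: str) -> str:
    """Remove all tags and collapse the whitespace they leave behind."""
    without_tags = TAG_PATTERN.sub(" ", text)
    return re.sub(r"\s+", " ", without_tags).strip()


if __name__ == "__main__":
    line = "[whispers] I have a secret. [laughs] Just kidding!"
    print(find_tags(line))
    print(strip_tags(line))
```

Since the model accepts free-form tags, this deliberately does not validate tag names against a list.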

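1. Plain text and creative tags

To show how much variety you can get from tags alone, here is a set of examples that all say the same thing in the same voice, with the delivery changing according to the tags I use. I chose the Algenib voice, a male, slightly gravelly voice.

Here is how it sounds without tags: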

Hey there, I’m a new text to speech model, and I can say things in many different ways. How can I help you today?

Let's start by changing the tone. Our speaker might be excited, bored, or reluctant, and we can hear it:

[excitedly] Hey there, I’m a new text to speech model…

[bored] Hey there, I’m a new text to speech model…

[reluctantly] Hey there, I’m a new text to speech model…

We can also use tags to change the speed of the delivery, and combine them with tone:

[very fast] Hey there, I’m a new text to speech model…

[very slowly] Hey there, I’m a new text to speech model…

[sarcastically, one painfully slow word at a time] Hey there, I’m a new text to speech model…

Tags also give precise control over individual sections, so we can whisper softly, then shout loudly, or use any combination you want:

[asmr] Hey there, I’m a new text to speech model, [deep and loud shouting] and I can say things in many different ways. [asmr] How can I help you today?
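If you assemble scripts programmatically, per-section tagging can be expressed as a list of (tag, text) pairs. The following is a minimal sketch of that idea; `tag_segments` is a hypothetical helper, and a tag of `None` leaves a segment untagged.

```python
from typing import Optional


def tag_segments(segments: list[tuple[Optional[str], str]]) -> str:
    """Join (tag, text) pairs into a single tagged line.

    Each tag applies to the text that follows it, mirroring the inline
    examples in this post. A tag of None leaves the segment untagged.
    """
    parts = []
    for tag, text in segments:
        text = text.strip()
        parts.append(f"[{tag}] {text}" if tag else text)
    return " ".join(parts)


if __name__ == "__main__":
    print(tag_segments([
        ("asmr", "Hey there, I'm a new text to speech model,"),
        ("deep and loud shouting", "and I can say things in many different ways."),
        ("asmr", "How can I help you today?"),
    ]))
```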

You really can try all sorts of things:

[like a dog] Hey there, I’m a new text to speech model…

[like dracula] Hey there, I’m a new text to speech model…

[singing] Hey there, I’m a new text to speech model…

More tags you can try:

[amazed]
[crying]
[curious]
[gasp]
[giggles]
[mischievously]
[panicked]
[sarcastic]
[serious]
[sighs]
[snorts]
[tired]
[trembling]
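
Tags let us control the delivery of the text quickly and easily. We can also combine them with contextual prompts to set the overall tone and mood of the performance.

2. Context and performance

By providing detailed instructions, such as a precise regional accent or specific traits like breathiness or pacing, you can use the model's context awareness to generate dynamic, natural, and expressive audio performances. This avoids having to tag every small adjustment.

Results are best when the text and the prompt are consistent, so that "who is speaking" matches "what is said" and "how it is said".

3. Prompt structure

A good prompt contains a few key elements before the text:

Audio profile
Scene
Director's notes

All of these sections are optional, but they help the model understand the context and performance you want. You can think of them as a kind of system instruction for producing consistent output across different texts.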
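One way to keep this structure consistent across many texts is to render it from a small data class. The sketch below is an illustration only: `TTSPrompt` and its field names are hypothetical, and the heading layout is copied from the sample prompts in this post rather than from any required format. Sections left empty are omitted, since all of them are optional.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class TTSPrompt:
    """Hypothetical container for the optional prompt sections."""
    transcript: str
    profile_name: Optional[str] = None
    profile_title: Optional[str] = None
    scene_title: Optional[str] = None
    scene: Optional[str] = None
    directors_notes: dict[str, str] = field(default_factory=dict)

    def render(self) -> str:
        blocks = []
        if self.profile_name:
            header = f"# AUDIO PROFILE: {self.profile_name}"
            if self.profile_title:
                header += f'\n## "{self.profile_title}"'
            blocks.append(header)
        if self.scene:
            title = f": {self.scene_title}" if self.scene_title else ""
            blocks.append(f"## THE SCENE{title}\n{self.scene}")
        if self.directors_notes:
            notes = "\n".join(f"{k}: {v}" for k, v in self.directors_notes.items())
            blocks.append(f"### DIRECTOR'S NOTES\n{notes}")
        if blocks:
            # Context is present, so mark clearly where the spoken text begins.
            blocks.append(f"#### TRANSCRIPT\n{self.transcript}")
        else:
            # No context at all: send the plain (possibly tagged) text.
            blocks.append(self.transcript)
        return "\n\n".join(blocks)


if __name__ == "__main__":
    prompt = TTSPrompt(
        transcript="[excitedly] Yes, massive vibes in the studio!",
        profile_name="Jaz R.",
        scene_title="The London Studio",
        scene="It is 10:00 PM in a glass-walled studio.",
        directors_notes={"Accent": "Jaz is from Brixton, London"},
    )
    print(prompt.render())
```

Reusing the same profile, scene, and notes while swapping only `transcript` is what gives consistent output across different texts.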

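3.1 Audio profile

This is your voice persona. You can define the character's identity, archetype, and any other traits such as age or background.

Giving your character a name helps ground the model and ties the performance together. You can refer to the character by name when setting the scene and context. Defining their role also helps, for example whether they are a radio DJ, a podcast host, or a news reporter.

3.2 Scene

The scene sets the stage. Location, mood, and environmental details define the tone and atmosphere. Describe what is happening around the character and how it affects them. The scene provides the environmental context for the whole interaction and guides the performance in subtle, organic ways. Examples include a conversation in a busy morning coffee shop, a DJ in a professional recording studio, or an announcement in a busy airport.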

## THE SCENE: The London Studio
It is 10:00 PM in a glass-walled studio overlooking the moonlit London skyline, but inside, it is blindingly bright. The red "ON AIR" tally light is blazing. Jaz is standing up, not sitting, bouncing on the balls of their heels to the rhythm of a thumping backing track. Their hands fly across the faders on a massive mixing desk. It is a chaotic, caffeine-fueled cockpit designed to wake up an entire nation.

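3.3 Director's notes

Director's notes are performance guidance for the model. The most common notes cover style, pacing, and accent, but the model is not limited to these. Feel free to include custom instructions covering any additional detail that matters to the performance, as detailed or as brief as you need.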

### DIRECTOR'S NOTES
Style: Enthusiastic and Sassy GenZ beauty YouTuber
Accent: Southern California valley girl from Laguna Beach
Pacing: Speaks at an energetic pace, keeping up with the extremely fast, rapid delivery influencers use in short form videos.

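3.4 Style

Style sets the tone of the generated speech. Include things like upbeat, energetic, relaxed, or bored to guide the performance. Be descriptive and provide the necessary detail. Saying "Infectious enthusiasm. The listener should feel like they are part of a massive, exciting community event" works much better than simply saying "energetic and enthusiastic".

You can even try terms that are common in the voice-over industry, such as "vocal smile". You can layer as many style characteristics as you like.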

Style: Sassy GenZ beauty YouTuber, who mostly creates content for YouTube Shorts.

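3.5 Accent

Describe the desired accent. The more specific you are, the better the result. For example, use "British English accent as heard in Croydon, England" rather than just "British accent".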

Accent: Jaz is a DJ from Brixton, London

3.6 Pacing

You can also specify the overall pacing and how the pacing varies throughout the piece.

Pacing: The "Drift": The tempo is incredibly slow and liquid. Words bleed into each other. There is zero urgency.

3.7 Full prompt example

Here is an example of what a complete prompt might look like:

# AUDIO PROFILE: Jaz R.
## "The Morning Hype"

## THE SCENE: The London Studio
It is 10:00 PM in a glass-walled studio overlooking the moonlit London skyline, but inside, it is blindingly bright. The red "ON AIR" tally light is blazing. Jaz is standing up, not sitting, bouncing on the balls of their heels to the rhythm of a thumping backing track. Their hands fly across the faders on a massive mixing desk. It is a chaotic, caffeine-fueled cockpit designed to wake up an entire nation.

### DIRECTOR'S NOTES
Style:
* The "Vocal Smile": You must hear the grin in the audio. The soft palate is always raised to keep the tone bright, sunny, and explicitly inviting.
* Dynamics: High projection without shouting. Punchy consonants and elongated vowels on excitement words (e.g., "Beauuutiful morning").

Accent: Jaz is from Brixton, London

Pace: Speaks at an energetic pace, keeping up with the fast music. Speaks with a "bouncing" cadence. High-speed delivery with fluid transitions; no dead air, no gaps.

### SAMPLE CONTEXT
Jaz is the industry standard for Top 40 radio, high-octane event promos, or any script that requires a charismatic Estuary accent and 11/10 infectious energy.

#### TRANSCRIPT
[excitedly] Yes, massive vibes in the studio! You are locked in and it is absolutely popping off in London right now. If you're stuck on the tube, or just sat there pretending to work... stop it. Seriously, I see you. [shouting] Turn this up! We've got the project roadmap landing in three, two... let's go!

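4. Let Gemini help you

If you are struggling to find the right words, Gemini works well as a co-director. Here is a good system instruction for generating the context from a simple prompt: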

You are a scriptwriter and audio director. I have a simple context but NO TRANSCRIPT.

TASK:
1. Write a creative, engaging script based on the given context.
2. Format the entire output as a structured TTS prompt. Follow the strict output format exactly.

You may include emotion and interjection tags in brackets within the script to direct the TTS model's performance. For example, you can write: "[amused] Oh, really?" or "[sigh] I suppose so". You can be creative with the tags you use, and the model will always do its best to understand and interpret them.

STRICT OUTPUT FORMAT:
# AUDIO PROFILE: [Invent a Name]
## "[Invent a Title]"

## THE SCENE: [Invent a Scene Title]
[Vivid description of the scene]

### DIRECTOR'S NOTES
Style: [Style instructions]
Pace: [Pace instructions]
Accent: [Accent instructions]

### SAMPLE CONTEXT
[Role/Persona description]

#### TRANSCRIPT
[Script]

----------------
INPUT CONTEXT: ...

CRITICAL RULE: Ensure the divider "#### TRANSCRIPT" is used exactly as written before the spoken text.
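The system instruction insists on an exact "#### TRANSCRIPT" divider, which makes the generated prompt easy to check before you pass it to the TTS model. Here is a minimal sketch of such a check; `split_generated_prompt` is a hypothetical helper, and treating a missing or repeated divider as an error is my own choice rather than something the model requires.

```python
DIVIDER = "#### TRANSCRIPT"


def split_generated_prompt(output: str) -> tuple[str, str]:
    """Split a generated prompt into (context, transcript).

    Raises ValueError if the divider is missing, appears more than once,
    or is followed by no spoken text.
    """
    if output.count(DIVIDER) != 1:
        raise ValueError(f"expected exactly one {DIVIDER!r} divider")
    context, transcript = output.split(DIVIDER)
    transcript = transcript.strip()
    if not transcript:
        raise ValueError("transcript section is empty")
    return context.strip(), transcript


if __name__ == "__main__":
    generated = "# AUDIO PROFILE: Ana\n\n#### TRANSCRIPT\n[amused] Oh, really?"
    context, transcript = split_generated_prompt(generated)
    print(context)
    print(transcript)
```

Having the transcript separated out also lets you review or edit the script while keeping the generated context untouched.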