Chatterbox TTS
Chatterbox transforms text into rich, natural speech with adjustable emotional expressiveness for diverse applications.
API
If you're calling Chatterbox TTS programmatically, the Python example below shows a complete request; the same request structure applies in whichever language you prefer.
```python
import requests
import base64

# Use this function to convert an image file from the filesystem to base64
def image_file_to_base64(image_path):
    with open(image_path, 'rb') as f:
        image_data = f.read()
    return base64.b64encode(image_data).decode('utf-8')

# Use this function to fetch an image from a URL and convert it to base64
def image_url_to_base64(image_url):
    response = requests.get(image_url)
    image_data = response.content
    return base64.b64encode(image_data).decode('utf-8')

# Use this function to convert a list of image URLs to base64
def image_urls_to_base64(image_urls):
    return [image_url_to_base64(url) for url in image_urls]

api_key = "YOUR_API_KEY"
url = "https://api.segmind.com/v1/chatterbox-tts"

# Request payload
data = {
    "text": "Welcome to Chatterbox TTS, where your text turns into captivating audio effortlessly.",
    "reference_audio": "https://segmind-resources.s3.amazonaws.com/input/ef2a2b5c-3e3a-4051-a437-20a72bf175de-sample_audio.mp3",
    "exaggeration": 0.5,
    "temperature": 0.8,
    "seed": 42,
    "cfg_weight": 0.5,
    "min_p": 0.05,
    "top_p": 1,
    "repetition_penalty": 1.2
}

headers = {'x-api-key': api_key}
response = requests.post(url, json=data, headers=headers)
print(response.content)  # The response body is the generated audio
```
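Rather than printing the raw bytes, you will usually want to save them. A minimal sketch, assuming the response body is the raw audio file; the `.mp3` extension is an assumption, so check the `Content-Type` response header for the actual format:

```python
# Save the binary response body as an audio file.
# The container format (mp3/wav) is assumed; Content-Type tells you for sure.
if response.ok:
    with open("chatterbox_output.mp3", "wb") as f:
        f.write(response.content)
    print("Saved audio, format:", response.headers.get("Content-Type"))
else:
    print("Request failed:", response.status_code, response.text)
```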
Attributes
| Parameter | Range | Description |
|---|---|---|
| `text` | – | The input text to synthesize into speech. Use longer text for detailed narration, shorter for concise messages. |
| `reference_audio` | – | A sample audio clip (URL) for voice style matching. |
| `exaggeration` | 0–2 | Adjusts speech expressiveness. Lower values for neutrality, higher for dramatic effect. |
| `temperature` | 0–2 | Controls speech variation. Lower for a consistent tone, higher for diverse expression. |
| `seed` | – | Ensures consistent output for the same input. Change it for varied generations. |
| `cfg_weight` | 0–2 | Balances creativity and adherence to the text. Lower for strict interpretation, higher for flexibility. |
| `min_p` | 0–1 | Sets a minimum probability for token inclusion; useful for pruning unlikely phrases. |
| `top_p` | 0–1 | Determines output randomness. Lower for focused output, higher for creative diversity. |
| `repetition_penalty` | 1–2 | Penalizes repeated words in speech. Higher values reduce redundancy. |
To keep track of your credit usage, inspect the response headers of each API call. The `x-remaining-credits` header indicates the number of credits remaining in your account; monitor this value to avoid disruptions in your API usage.
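For example, you can read the balance straight off the `response` object from the example above (a minimal sketch):

```python
# requests exposes response headers as a case-insensitive mapping.
remaining = response.headers.get("x-remaining-credits")
if remaining is not None:
    print(f"Credits remaining: {remaining}")
```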
Resources to get you started
Everything you need to know to get the most out of Chatterbox TTS
# Guide to Effective Use of Chatterbox TTS
Chatterbox is a high-fidelity, open-source text-to-speech model that turns plain text into expressive, natural audio. With a 0.5B-parameter Llama backbone and emotion exaggeration control, you can craft anything from flat announcements to theatrical character voices. Follow this guide to get the best results in your project.
## 1. Basic Workflow
1. **Provide `text`**
– Short alerts (5–20 words) for notifications
– Medium passages (50–200 words) for promos or clips
– Long-form scripts (200+ words) for narration and audiobooks
2. **Optional `reference_audio`**
– Supply an MP3/URL sample to clone or match tone and timbre
3. **Adjust parameters** (defaults shown below)
– `exaggeration`: 0.7
– `temperature`: 0.9
– `cfg_weight`: 0.6
– `top_p`: 0.95, `min_p`: 0.1
– `repetition_penalty`: 1.3
4. **Generate & iterate**
– Tweak sliders and the seed for consistent or varied outputs (see the sketch below)
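The helper below wraps this workflow into a single function. It is a minimal sketch, not an official client: the endpoint, header, and parameter names come from the API example above, the defaults are the ones listed in step 3, and the function name, output path, and the assumption that the response body is raw audio bytes are mine:

```python
import requests

API_URL = "https://api.segmind.com/v1/chatterbox-tts"

def synthesize(text, api_key, reference_audio=None, out_path="output.mp3", **overrides):
    """Generate speech for `text` and write the returned audio bytes to `out_path`.

    Keyword overrides (e.g. exaggeration=1.2) replace the guide's defaults.
    """
    payload = {
        "text": text,
        "exaggeration": 0.7,
        "temperature": 0.9,
        "cfg_weight": 0.6,
        "top_p": 0.95,
        "min_p": 0.1,
        "repetition_penalty": 1.3,
    }
    if reference_audio:
        payload["reference_audio"] = reference_audio
    payload.update(overrides)

    response = requests.post(API_URL, json=payload, headers={"x-api-key": api_key})
    response.raise_for_status()
    with open(out_path, "wb") as f:
        f.write(response.content)  # assumed: body is the raw audio file
    return out_path

# Example: a more dramatic read of a short line.
# synthesize("Welcome back, adventurer!", api_key="YOUR_API_KEY", exaggeration=1.5)
```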
## 2. Core Parameters
- **exaggeration (0–2)**: Controls emotional intensity
• 0–0.5: neutral, robotic
• 0.7: natural, balanced (default)
• 1.5–2.0: dramatic, theatrical
- **temperature (0–2)**: Governs randomness
• 0.2–0.5: consistent, predictable
• 0.8–1.2: moderate variation
• 1.5–2.0: highly varied
- **cfg_weight (0–2)**: Text vs. creativity balance
• Low (0.2–0.5): strict text adherence
• Mid (0.6–1.0): balanced
• High (1.2–2.0): more interpretive
## 3. Advanced Controls
- **top_p / min_p (0–1)**: Nucleus sampling for diversity
- **repetition_penalty (1–2)**: Discourage repeated phrases
- **seed**: Lock randomness for reproducibility (see the sketch below)
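To see seed locking in action, render the same line twice with the same seed, then once with a different one (a sketch reusing the hypothetical `synthesize` helper from section 1; identical seeds should reproduce the same output, per the `seed` parameter's documented behavior):

```python
# Same text + same seed -> reproducible output.
synthesize("Take two.", api_key="YOUR_API_KEY", seed=42, out_path="take_a.mp3")
synthesize("Take two.", api_key="YOUR_API_KEY", seed=42, out_path="take_b.mp3")

# Different seed -> a varied rendition of the same line.
synthesize("Take two.", api_key="YOUR_API_KEY", seed=7, out_path="take_c.mp3")
```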
## 4. Parameter Presets by Use Case
| Use Case | Exag. | Temp. | CFG | top_p | Rep. Penalty | Notes |
|-------------------------------|-------|-------|------|-------|--------------|---------------------------|
| Interactive AI Agent | 0.7 | 0.5 | 0.6 | 0.8 | 1.2 | Friendly, clear responses |
| Game Dialogue & Cinematics | 1.2 | 1.0 | 0.8 | 0.9 | 1.3 | Character-driven |
| Video Narration & Explainers | 0.5 | 0.3 | 0.5 | 0.7 | 1.1 | Steady, professional tone |
| Memes & Social Clips | 1.8 | 1.2 | 1.0 | 1.0 | 1.0 | High energy, playful |
| Podcasts & Audiobooks | 0.6 | 0.4 | 0.5 | 0.8 | 1.3 | Consistent pacing |
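The table translates directly into code. Here is a sketch that stores each row as a set of overrides for the hypothetical `synthesize` helper from section 1 (the preset keys are my own naming):

```python
# Parameter presets from the table above, keyed by use case.
PRESETS = {
    "agent":     dict(exaggeration=0.7, temperature=0.5, cfg_weight=0.6, top_p=0.8, repetition_penalty=1.2),
    "game":      dict(exaggeration=1.2, temperature=1.0, cfg_weight=0.8, top_p=0.9, repetition_penalty=1.3),
    "narration": dict(exaggeration=0.5, temperature=0.3, cfg_weight=0.5, top_p=0.7, repetition_penalty=1.1),
    "meme":      dict(exaggeration=1.8, temperature=1.2, cfg_weight=1.0, top_p=1.0, repetition_penalty=1.0),
    "audiobook": dict(exaggeration=0.6, temperature=0.4, cfg_weight=0.5, top_p=0.8, repetition_penalty=1.3),
}

synthesize("Chapter one. It was a quiet morning.",
           api_key="YOUR_API_KEY", out_path="chapter1.mp3", **PRESETS["audiobook"])
```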
## 5. Prompt Tips & Best Practices
- **Chunk long scripts** into paragraphs to avoid timing hiccups (see the sketch after this list).
- **Use punctuation** (commas, dashes) to guide natural pauses.
- **Reference audio**: match gender, pace, accent for best cloning.
- **Watermark**: Every file includes an inaudible trace to promote ethical use.
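One way to apply the chunking tip, again built on the hypothetical `synthesize` helper; splitting on blank lines is an assumption about how your script is formatted:

```python
def synthesize_long(script, api_key, **params):
    """Split a script on blank lines and synthesize one clip per paragraph."""
    paragraphs = [p.strip() for p in script.split("\n\n") if p.strip()]
    clips = []
    for i, paragraph in enumerate(paragraphs):
        clips.append(synthesize(paragraph, api_key=api_key,
                                out_path=f"clip_{i:03d}.mp3", **params))
    return clips  # stitch the clips together with your audio tool of choice
```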
Experiment with these settings to find the sweet spot for your project. Happy synthesizing!