Chatterbox TTS

Chatterbox transforms text into rich, natural speech with adjustable emotional expressiveness for diverse applications.


API

The API can be called from the programming language of your choice.

POST
    import requests
    import base64

    # Use this function to convert an image file from the filesystem to base64
    def image_file_to_base64(image_path):
        with open(image_path, 'rb') as f:
            image_data = f.read()
        return base64.b64encode(image_data).decode('utf-8')

    # Use this function to fetch an image from a URL and convert it to base64
    def image_url_to_base64(image_url):
        response = requests.get(image_url)
        image_data = response.content
        return base64.b64encode(image_data).decode('utf-8')

    # Use this function to convert a list of image URLs to base64
    def image_urls_to_base64(image_urls):
        return [image_url_to_base64(url) for url in image_urls]

    api_key = "YOUR_API_KEY"
    url = "https://api.segmind.com/v1/chatterbox-tts"

    # Request payload
    data = {
        "text": "Welcome to Chatterbox TTS, where your text turns into captivating audio effortlessly.",
        "reference_audio": "https://segmind-resources.s3.amazonaws.com/input/ef2a2b5c-3e3a-4051-a437-20a72bf175de-sample_audio.mp3",
        "exaggeration": 0.5,
        "temperature": 0.8,
        "seed": 42,
        "cfg_weight": 0.5,
        "min_p": 0.05,
        "top_p": 1,
        "repetition_penalty": 1.2
    }

    headers = {'x-api-key': api_key}

    response = requests.post(url, json=data, headers=headers)
    print(response.content)  # The response body is the generated audio (MP3 bytes)
RESPONSE
audio/mp3
HTTP Response Codes
200 - OK: Audio generated
401 - Unauthorized: User authentication failed
404 - Not Found: The requested URL does not exist
405 - Method Not Allowed: The requested HTTP method is not allowed
406 - Not Acceptable: Not enough credits
500 - Server Error: Server had some issue with processing
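
The status codes above can be turned into a small client-side check. This is a minimal sketch, not part of the Segmind API: the helper names `describe_status` and `save_audio_if_ok` are hypothetical, and the messages are copied from the list above.

```python
# Status meanings copied from the response-code list above.
STATUS_MESSAGES = {
    200: "OK: audio generated",
    401: "Unauthorized: user authentication failed",
    404: "Not Found: the requested URL does not exist",
    405: "Method Not Allowed: the requested HTTP method is not allowed",
    406: "Not Acceptable: not enough credits",
    500: "Server Error: server had some issue with processing",
}


def describe_status(status_code: int) -> str:
    """Return a readable message for a status code (hypothetical helper)."""
    return STATUS_MESSAGES.get(status_code, f"Unexpected status {status_code}")


def save_audio_if_ok(status_code: int, body: bytes, path: str) -> bool:
    """Write the MP3 bytes to `path` on 200; otherwise print the reason and return False."""
    if status_code != 200:
        print(describe_status(status_code))
        return False
    with open(path, "wb") as f:
        f.write(body)
    return True
```

With a `requests` response this could be called as `save_audio_if_ok(response.status_code, response.content, "output.mp3")`, where the output filename is arbitrary.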

Attributes


text (str, required)

The input text to synthesize into speech. Use longer text for detailed narration, shorter text for concise messages.


reference_audio (str, default: https://segmind-resources.s3.amazonaws.com/input/ef2a2b5c-3e3a-4051-a437-20a72bf175de-sample_audio.mp3)

URL of a sample audio clip used for voice style matching.


exaggeration (float, default: 0.5)

Adjusts speech expressiveness. Use lower values for neutrality, higher for dramatic effect.

min: 0, max: 2


temperature (float, default: 0.8)

Controls speech variation. Use lower values for a consistent tone, higher for diverse expressions.

min: 0, max: 2


seed (int, default: 42)

Random seed. The same seed with the same input gives consistent output; change it to get a different generation.


cfg_weight (float, default: 0.5)

Balances creativity and adherence to text. Use lower values for strict interpretation, higher for flexibility.

min: 0, max: 2


min_p (float, default: 0.05)

Ensures minimum probability for content inclusion. Useful for removing unlikely phrases.

min: 0, max: 1


top_p (float, default: 1)

Determines output randomness. Use lower values for focused content, higher for creative diversity.

min: 0, max: 1


repetition_penalty (float, default: 1.2)

Penalizes repeated words in speech. Higher values reduce redundancy.

min: 1, max: 2
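
The ranges listed for each attribute can be checked on the client before a request is sent. The sketch below is illustrative only: `validate_payload` is a hypothetical helper, and it assumes the documented bounds are inclusive and that `text` must be non-empty, neither of which the page states explicitly.

```python
# Ranges copied from the attribute list above: name -> (min, max).
PARAM_RANGES = {
    "exaggeration": (0.0, 2.0),
    "temperature": (0.0, 2.0),
    "cfg_weight": (0.0, 2.0),
    "min_p": (0.0, 1.0),
    "top_p": (0.0, 1.0),
    "repetition_penalty": (1.0, 2.0),
}


def validate_payload(payload: dict) -> list:
    """Return a list of problems; an empty list means no problem was found."""
    problems = []
    text = payload.get("text")
    if not isinstance(text, str) or not text.strip():
        problems.append("text is required and must be a non-empty string")
    for name, (low, high) in PARAM_RANGES.items():
        if name not in payload:
            continue  # omitted parameters fall back to the API defaults
        value = payload[name]
        if not isinstance(value, (int, float)):
            problems.append(f"{name} must be a number")
        elif not (low <= value <= high):  # assumption: bounds are inclusive
            problems.append(f"{name}={value} is outside [{low}, {high}]")
    if "seed" in payload and not isinstance(payload["seed"], int):
        problems.append("seed must be an integer")
    return problems
```

Calling `validate_payload(data)` before `requests.post` catches out-of-range values locally instead of spending a request on them.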

To keep track of your credit usage, inspect the response headers of each API call. The x-remaining-credits header indicates the number of credits remaining in your account. Monitor this value to avoid disruptions in your API usage.
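
A minimal sketch of reading that header from a response. The helper name `remaining_credits` is hypothetical, and the code assumes the header value is numeric, which this page does not confirm.

```python
from typing import Mapping, Optional


def remaining_credits(headers: Mapping[str, str]) -> Optional[float]:
    """Read x-remaining-credits from response headers; None if absent or not numeric."""
    # requests' response.headers is case-insensitive, but a plain dict is not,
    # so normalize the keys before looking the header up.
    lowered = {key.lower(): value for key, value in headers.items()}
    raw = lowered.get("x-remaining-credits")
    if raw is None:
        return None
    try:
        return float(raw)  # assumption: the value is a plain number
    except ValueError:
        return None
```

With `requests`, this would be called as `remaining_credits(response.headers)` after each request, for example to log a warning once the value drops below a threshold of your choosing.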

Resources to get you started

Everything you need to know to get the most out of Chatterbox TTS

# Guide to Effective Use of Chatterbox TTS

Chatterbox is a high-fidelity, open-source text-to-speech model that turns plain text into expressive, natural audio. Built on a 0.5B-parameter Llama backbone with emotion exaggeration control, it lets you craft anything from flat announcements to theatrical character voices. Follow this guide to get the best results in your project.

## 1. Basic Workflow  
1. **Provide `text`**  
   – Short alerts (5–20 words) for notifications  
   – Medium passages (50–200 words) for promos or clips  
   – Long-form scripts (200+ words) for narration and audiobooks  
2. **Optional `reference_audio`**  
   – Supply the URL of an MP3 sample to clone or match tone and timbre  
3. **Adjust parameters** (API defaults shown below)  
   – `exaggeration`: 0.5  
   – `temperature`: 0.8  
   – `cfg_weight`: 0.5  
   – `top_p`: 1, `min_p`: 0.05  
   – `repetition_penalty`: 1.2  
4. **Generate & iterate**  
   – Adjust the parameters; keep `seed` fixed for consistent output or change it for variation

## 2. Core Parameters  
- **exaggeration (0–2)**: Controls emotional intensity (API default: 0.5)  
  • 0–0.5: neutral, robotic  
  • 0.7: natural, balanced  
  • 1.5–2.0: dramatic, theatrical  
- **temperature (0–2)**: Governs randomness  
  • 0.2–0.5: consistent, predictable  
  • 0.8–1.2: moderate variation  
  • 1.5–2.0: highly varied  
- **cfg_weight (0–2)**: Text vs. creativity balance  
  • Low (0.2–0.5): strict text adherence  
  • Mid (0.6–1.0): balanced  
  • High (1.2–2.0): more interpretive  

## 3. Advanced Controls  
- **top_p / min_p (0–1)**: Sampling controls for diversity; `top_p` sets the nucleus sampling threshold and `min_p` sets a minimum-probability cutoff  
- **repetition_penalty (1–2)**: Discourage repeated phrases  
- **seed**: Lock randomness for reproducibility  
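
Because `seed` locks randomness, one way to audition several takes of the same line is to send the same payload with different seeds and then reuse the seed you liked. The helper below is a hypothetical sketch that only builds the payloads; it does not call the API.

```python
import copy


def seed_variants(base_payload: dict, seeds: list) -> list:
    """Return one payload copy per seed, leaving the base payload untouched."""
    variants = []
    for seed in seeds:
        payload = copy.deepcopy(base_payload)
        payload["seed"] = seed
        variants.append(payload)
    return variants
```

Each returned payload can be posted to the endpoint in turn; every other parameter stays identical, so the seed is the only thing that changes between takes.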

## 4. Parameter Presets by Use Case  
| Use Case                     | Exag. | Temp. | CFG  | top_p | Rep. Penalty | Notes                     |
|-------------------------------|-------|-------|------|-------|--------------|---------------------------|
| Interactive AI Agent          | 0.7   | 0.5   | 0.6  | 0.8   | 1.2          | Friendly, clear responses |
| Game Dialogue & Cinematics    | 1.2   | 1.0   | 0.8  | 0.9   | 1.3          | Character-driven          |
| Video Narration & Explainers  | 0.5   | 0.3   | 0.5  | 0.7   | 1.1          | Steady, professional tone |
| Memes & Social Clips          | 1.8   | 1.2   | 1.0  | 1.0   | 1.0          | High energy, playful      |
| Podcasts & Audiobooks         | 0.6   | 0.4   | 0.5  | 0.8   | 1.3          | Consistent pacing         |
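
The presets in the table above can be kept as data and merged into a request payload. This is a sketch under stated assumptions: the preset keys and the `payload_from_preset` helper are illustrative names, while the numeric values are copied from the table.

```python
# Values copied from the preset table above; the keys are illustrative names.
PRESETS = {
    "interactive_agent": {"exaggeration": 0.7, "temperature": 0.5, "cfg_weight": 0.6,
                          "top_p": 0.8, "repetition_penalty": 1.2},
    "game_dialogue": {"exaggeration": 1.2, "temperature": 1.0, "cfg_weight": 0.8,
                      "top_p": 0.9, "repetition_penalty": 1.3},
    "video_narration": {"exaggeration": 0.5, "temperature": 0.3, "cfg_weight": 0.5,
                        "top_p": 0.7, "repetition_penalty": 1.1},
    "memes_social": {"exaggeration": 1.8, "temperature": 1.2, "cfg_weight": 1.0,
                     "top_p": 1.0, "repetition_penalty": 1.0},
    "podcast_audiobook": {"exaggeration": 0.6, "temperature": 0.4, "cfg_weight": 0.5,
                          "top_p": 0.8, "repetition_penalty": 1.3},
}


def payload_from_preset(text: str, preset: str, **overrides) -> dict:
    """Build a request payload from a named preset; keyword overrides win."""
    if preset not in PRESETS:
        raise KeyError(f"Unknown preset {preset!r}; choose from {sorted(PRESETS)}")
    return {"text": text, **PRESETS[preset], **overrides}
```

For example, `payload_from_preset("Welcome back!", "interactive_agent", seed=42)` yields a dictionary that can be passed as `json=` to `requests.post`.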

## 5. Prompt Tips & Best Practices  
- **Chunk long scripts** into paragraphs to avoid timing hiccups.  
- **Use punctuation** (commas, dashes) to guide natural pauses.  
- **Reference audio**: match gender, pace, accent for best cloning.  
- **Watermark**: Every file includes an inaudible trace to promote ethical use.  
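
The chunking tip above can be sketched as a small helper that splits a script on blank lines and packs whole paragraphs into chunks. `chunk_script` is a hypothetical name, and the 200-word budget is an arbitrary choice, not a documented limit of the API.

```python
def chunk_script(script: str, max_words: int = 200) -> list:
    """Split on blank lines, then pack whole paragraphs into chunks of roughly max_words."""
    paragraphs = [p.strip() for p in script.split("\n\n") if p.strip()]
    chunks, current, count = [], [], 0
    for paragraph in paragraphs:
        words = len(paragraph.split())
        # Start a new chunk when adding this paragraph would exceed the budget.
        if current and count + words > max_words:
            chunks.append("\n\n".join(current))
            current, count = [], 0
        current.append(paragraph)
        count += words
    if current:
        chunks.append("\n\n".join(current))
    return chunks
```

Paragraphs are never split, so a single paragraph longer than `max_words` becomes its own oversized chunk. Each chunk can then be sent as the `text` of a separate request, with the same `seed` and `reference_audio` to keep the voice consistent.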

Experiment with these settings to find the sweet spot for your project. Happy synthesizing!
