Chatterbox TTS

Chatterbox transforms text into rich, natural speech with adjustable emotional expressiveness for diverse applications.


API

The API can be called from any programming language; the example below uses Python.

POST

```python
import requests

api_key = "YOUR_API_KEY"
url = "https://api.segmind.com/v1/chatterbox-tts"

# Request payload
data = {
    "text": "Welcome to Chatterbox TTS, where your text turns into captivating audio effortlessly.",
    "reference_audio": "https://segmind-resources.s3.amazonaws.com/input/ef2a2b5c-3e3a-4051-a437-20a72bf175de-sample_audio.mp3",
    "exaggeration": 0.5,
    "temperature": 0.8,
    "seed": 42,
    "cfg_weight": 0.5,
    "min_p": 0.05,
    "top_p": 1,
    "repetition_penalty": 1.2
}

headers = {'x-api-key': api_key}

response = requests.post(url, json=data, headers=headers)

# The response body is the generated audio; save it to a file
with open("output.wav", "wb") as f:
    f.write(response.content)
```
RESPONSE
audio/wav
HTTP Response Codes
200 - OK: Audio generated
401 - Unauthorized: User authentication failed
404 - Not Found: The requested URL does not exist
405 - Method Not Allowed: The requested HTTP method is not allowed
406 - Not Acceptable: Not enough credits
500 - Server Error: The server had an issue processing the request
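The codes above can be checked before touching the response body. Here is a minimal sketch (not part of any official client; `SegmindAPIError` and `check_status` are hypothetical helpers):

```python
class SegmindAPIError(Exception):
    """Raised for non-200 responses (hypothetical helper)."""

# Human-readable messages for the documented status codes
STATUS_MESSAGES = {
    401: "Unauthorized: user authentication failed",
    404: "Not Found: the requested URL does not exist",
    405: "Method Not Allowed: the requested HTTP method is not allowed",
    406: "Not Acceptable: not enough credits",
    500: "Server Error: the server had an issue processing the request",
}

def check_status(status_code):
    """Return True on success; raise with a readable message otherwise."""
    if status_code == 200:
        return True
    message = STATUS_MESSAGES.get(status_code, f"Unexpected status {status_code}")
    raise SegmindAPIError(message)
```

Calling `check_status(response.status_code)` right after the POST gives a clear error instead of silently writing an error payload to an audio file.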

Attributes


text (str, required)

The input text is synthesized into speech. Use longer text for detailed narration, shorter for concise messages.


reference_audio (str, default: https://segmind-resources.s3.amazonaws.com/input/ef2a2b5c-3e3a-4051-a437-20a72bf175de-sample_audio.mp3)

Provides a sample audio clip for voice style matching.


exaggeration (float, default: 0.5, min: 0, max: 2)

Adjusts speech expressiveness. Use lower values for neutrality, higher for dramatic effect.


temperature (float, default: 0.8, min: 0, max: 2)

Controls speech variation. Use lower for consistent tone, higher for diverse expressions.


seed (int, default: 42)

Ensures consistent output with the same input. Adjust for diverse generations.


cfg_weight (float, default: 0.5, min: 0, max: 2)

Balances creativity and adherence to text. Use lower for strict interpretation, higher for flexibility.


min_p (float, default: 0.05, min: 0, max: 1)

Ensures a minimum probability for token inclusion. Useful for removing unlikely phrases.


top_p (float, default: 1, min: 0, max: 1)

Determines output randomness. Lower for focused content, higher for creative diversity.


repetition_penalty (float, default: 1.2, min: 1, max: 2)

Penalizes repeated words in speech. Higher values reduce redundancy.
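The numeric ranges above can be enforced client-side before sending a request, so out-of-range values never reach the API. A minimal sketch (`PARAM_BOUNDS` and `clamp_params` are hypothetical helpers, with bounds taken from the attribute list above):

```python
# (min, max) bounds for each numeric parameter, per the attribute list
PARAM_BOUNDS = {
    "exaggeration": (0.0, 2.0),
    "temperature": (0.0, 2.0),
    "cfg_weight": (0.0, 2.0),
    "min_p": (0.0, 1.0),
    "top_p": (0.0, 1.0),
    "repetition_penalty": (1.0, 2.0),
}

def clamp_params(params):
    """Return a copy of params with each known numeric value clamped to its range."""
    out = dict(params)
    for name, (lo, hi) in PARAM_BOUNDS.items():
        if name in out:
            out[name] = min(max(out[name], lo), hi)
    return out
```

Non-numeric fields such as `text` and `reference_audio` pass through unchanged.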

To keep track of your credit usage, you can inspect the response headers of each API call. The x-remaining-credits property will indicate the number of remaining credits in your account. Ensure you monitor this value to avoid any disruptions in your API usage.
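Reading that header from the response might look like the following sketch (the `remaining_credits` helper is hypothetical; it assumes the header value is a plain number):

```python
def remaining_credits(headers):
    """Parse the x-remaining-credits response header, or return None if absent."""
    value = headers.get("x-remaining-credits")
    return float(value) if value is not None else None

# Usage with requests:
#   response = requests.post(url, json=data, headers={'x-api-key': api_key})
#   credits = remaining_credits(response.headers)
```

Note that `requests` exposes response headers case-insensitively, so the lookup works regardless of how the server capitalizes the header name.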

Chatterbox – Text-to-Speech Model

What is Chatterbox?

Chatterbox is an open-source, high-fidelity text-to-speech (TTS) model developed by Resemble AI. Built on a 0.5 billion-parameter Llama backbone, it transforms plain text into natural, expressive speech. Trained on 0.5 million hours of cleaned audio, Chatterbox leverages alignment-informed synthesis to maintain precise lip-sync and timing. Unique to Chatterbox is its emotion exaggeration control, enabling developers to dial up or tone down expressiveness for dramatic narration, character voices, and dynamic AI agents. Outputs include a subtle watermark to promote ethical usage and traceability.

Key Features

  • 0.5 Billion Parameter Llama Backbone: Balances model size with ultra-natural speech quality.
  • Emotion Exaggeration Control: User-adjustable “exaggeration” slider (0–2) for varied expressive styles.
  • Alignment-Informed Synthesis: Stable, consistent timing between text and audio.
  • Watermarked Outputs: Embedded inaudible watermark for responsible AI deployment.
  • Voice Conversion Support: Match or clone voices using a reference audio clip.
  • Ultra-Stable Generation: Outperforms leading commercial TTS like ElevenLabs in stability and nuance.
  • Advanced Sampling Controls: Temperature, CFG weight, top_p, min_p, and repetition penalty for fine-tuning.

Best Use Cases

  • Interactive AI Agents & Chatbots: Lifelike responses with adjustable emotion.
  • Game Dialogue & Cinematics: Character voices with dynamic intensity control.
  • Video Narration & Explainers: Professional voiceover with rich expressiveness.
  • Memes & Social Clips: Create humorous or dramatic one-liners instantly.
  • Podcasts & Audiobooks: Long-form narration with consistent tone and pacing.

Prompt Tips and Output Quality

  • Input Text Length: Use longer passages for storytelling; shorter prompts for concise alerts.
  • Reference Audio: Supply a sample clip (e.g., MP3 URL) to match tone and timbre.
  • Exaggeration (0–2):
    • 0–0.5 for neutral/flat delivery
    • 0.5 (the default) for mild expressiveness
    • 1.5–2.0 for theatrical or character voices
  • Temperature (0–2): Lower values (0.2–0.5) yield consistent, predictable speech; higher (1.0–1.5) adds variation.
  • CFG Weight (0–2): Balances strict adherence to text (lower) vs. creative interpretation (higher).
  • Top_p & Min_p: Tailor randomness—reduce top_p (0.7–0.9) for focused output; raise for more diversity.
  • Repetition Penalty (1–2): Increase to avoid word repetition in verbose content.
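The tuning guidance above can be bundled into reusable presets. A sketch under stated assumptions (the preset names, values, and `build_payload` helper are illustrative choices, not part of the API):

```python
# Hypothetical presets reflecting the tuning tips: a flat narration style
# and a theatrical character-voice style.
PRESETS = {
    "neutral_narration": {"exaggeration": 0.3, "temperature": 0.4, "cfg_weight": 0.3, "top_p": 0.9},
    "character_voice": {"exaggeration": 1.8, "temperature": 1.2, "cfg_weight": 1.0, "top_p": 1.0},
}

def build_payload(text, preset="neutral_narration", **overrides):
    """Build a request payload from API defaults, a preset, and per-call overrides."""
    payload = {"text": text, "seed": 42, "min_p": 0.05, "repetition_penalty": 1.2}
    payload.update(PRESETS[preset])
    payload.update(overrides)
    return payload
```

For example, `build_payload("Once upon a time...", preset="character_voice", seed=7)` keeps the theatrical settings but pins a different seed.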

FAQs

Q: How do I control emotion intensity?
Use the exaggeration parameter: values below 0.7 tone down expression, values above 1.0 heighten drama.

Q: Can I match a custom voice?
Yes. Provide a reference_audio URL to steer Chatterbox toward the same style and pitch.

Q: Is Chatterbox multilingual?
Chatterbox is optimized for English. Community contributions are welcome to extend language support.

Q: How does the watermark work?
An inaudible digital watermark is embedded in each output to ensure traceability and discourage misuse.

Q: Is Chatterbox open source?
Absolutely. Chatterbox’s code and model checkpoints are available under an open-source license on Resemble AI’s GitHub.
