Veena TTS

Veena transforms text into high-fidelity, expressive speech in Hindi and English for real-time applications.

API

The endpoint accepts a POST request with a JSON body and an API key in the x-api-key header. The example below uses Python.

POST https://api.segmind.com/v1/veena-tts

import requests

api_key = "YOUR_API_KEY"
url = "https://api.segmind.com/v1/veena-tts"

# Request payload
data = {
  "text": "Kya tumne kabhi socha hai... ki hum sab sirf waqt ke musafir hain?",
  "speaker": "kavya",
  "temperature": 0.4,
  "top_p": 0.9,
  "repetition_penalty": 1.05
}

headers = {'x-api-key': api_key}

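response = requests.post(url, json=data, headers=headers)
response.raise_for_status()  # Raise an exception on 4xx/5xx responses

# The response body is the generated audio as raw bytes.
# The file name is a placeholder; check response.headers.get("Content-Type") for the actual format.
with open("veena_output.bin", "wb") as f:
    f.write(response.content)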

RESPONSE

audio (binary data; check the Content-Type response header for the exact format)

HTTP Response Codes

200 - OK: Audio generated

401 - Unauthorized: User authentication failed

404 - Not Found: The requested URL does not exist

405 - Method Not Allowed: The requested HTTP method is not allowed

406 - Not Acceptable: Not enough credits

500 - Server Error: Server had some issue with processing
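
The response codes above can be turned into readable messages on the client side. The following is a minimal sketch; the names `STATUS_MEANINGS` and `describe_status` and the fallback message are illustrative and not part of the API.

```python
# Meanings follow the HTTP response code list above.
STATUS_MEANINGS = {
    200: "OK: audio generated",
    401: "Unauthorized: user authentication failed",
    404: "Not Found: the requested URL does not exist",
    405: "Method Not Allowed: the requested HTTP method is not allowed",
    406: "Not Acceptable: not enough credits",
    500: "Server Error: server had some issue with processing",
}


def describe_status(status_code: int) -> str:
    """Return a readable description for a status code from the list above."""
    return STATUS_MEANINGS.get(status_code, f"Unexpected status code {status_code}")
```

With the request example above, this could be called as describe_status(response.status_code) before deciding whether to save the response body.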

Attributes

text (str, required)

Provide input text for speech synthesis. Use simple phrases for clarity, complex for detailed expressions.

speaker (str, default: kavya)

Choose speaker for voice style. Kavya for warmth, Agastya for depth.

temperature (float, default: 0.4)

Set speech variation. Use 0.2 for monotone, 0.7 for lively expression.

min: 0, max: 2

top_p (float, default: 0.9)

Control output randomness. Set 0.5 for focused, 0.95 for diverse speech.

min: 0, max: 1

repetition_penalty (float, default: 1.05)

Minimize word repetition. Use 1.2 for minimal repeats.

min: 1, max: 2
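
The defaults and ranges above can be checked before a request is sent, so that out-of-range values fail locally instead of costing a round trip. This is a sketch under two assumptions: the min and max bounds are treated as inclusive, and the helper name `build_payload` is hypothetical.

```python
# Defaults and ranges copied from the attribute list above.
DEFAULTS = {
    "speaker": "kavya",
    "temperature": 0.4,
    "top_p": 0.9,
    "repetition_penalty": 1.05,
}
# Assumption: both ends of each range are allowed.
RANGES = {
    "temperature": (0.0, 2.0),
    "top_p": (0.0, 1.0),
    "repetition_penalty": (1.0, 2.0),
}


def build_payload(text, **overrides):
    """Merge overrides into the defaults and validate the numeric ranges."""
    if not isinstance(text, str) or not text.strip():
        raise ValueError("text is required and must be a non-empty string")
    unknown = set(overrides) - set(DEFAULTS)
    if unknown:
        raise ValueError(f"unknown attributes: {sorted(unknown)}")
    payload = {"text": text, **DEFAULTS, **overrides}
    for name, (low, high) in RANGES.items():
        value = payload[name]
        if not low <= value <= high:
            raise ValueError(f"{name}={value} is outside [{low}, {high}]")
    return payload
```

For example, build_payload("Namaste, how are you?", temperature=0.7) returns a dictionary that can be passed as the json argument of requests.post.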

To track your credit usage, inspect the response headers of each API call. The x-remaining-credits header reports the number of credits remaining in your account. Monitor this value to avoid disruptions in your API usage.
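
A small helper can read that header from any mapping of response headers. This is a sketch: the function name `remaining_credits` is hypothetical, and it assumes the header value is numeric, which is not stated above.

```python
def remaining_credits(headers):
    """Return the x-remaining-credits value as a float, or None if missing or not numeric."""
    for name, value in headers.items():
        # HTTP header names are case-insensitive, so compare in lower case.
        if name.lower() == "x-remaining-credits":
            try:
                return float(value)
            except (TypeError, ValueError):
                return None
    return None
```

With the request example above, remaining_credits(response.headers) could be logged after each call, or compared against a threshold to raise a warning before credits run out.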

Veena – Text-to-Speech Model

What is Veena?

Veena, developed by Maya Research, is a state-of-the-art text-to-speech (TTS) model built on a 3 billion-parameter Llama-based autoregressive transformer. It delivers natural, expressive speech in Hindi and English—handling mixed-language inputs seamlessly. Leveraging the SNAC neural codec at 24 kHz, Veena generates studio-quality audio with four distinct speaker personas (Kavya, Agastya, Maitri, Vinaya). Optimized for ultra-low latency (sub-80 ms on high-end GPUs) and production-ready deployment via 4-bit quantization, Veena is engineered for real-time applications in accessibility, customer service, content creation, and voice-enabled devices.

Key Features

High-Fidelity Audio: 24 kHz sampling rate with SNAC neural codec for crystal-clear voice output
Multilingual & Code-Switching: Fluent in Hindi and English; natural transitions in mixed-language text
Four Unique Voices:
- Kavya (warm, friendly)
- Agastya (deep, authoritative)
- Maitri (clear, neutral)
- Vinaya (bright, youthful)
Low Latency: Sub-80 ms response time on top-tier GPUs—ideal for live interactions
Efficient Quantization: 4-bit precision reduces memory footprint without compromising quality
Transformer-Based: 3 billion parameters capture complex intonation, stress, and pacing patterns
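
As a rough illustration of the quantization point above, weight storage scales with the number of bits per parameter. The sketch below compares 4-bit weights with a 16-bit baseline. The 16-bit baseline is an assumption made for comparison (the unquantized precision is not stated here), and the figures cover weights only, ignoring activations, the codec, and quantization metadata.

```python
def weight_memory_gb(num_params, bits_per_param):
    """Approximate weight storage in gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9


PARAMS = 3_000_000_000  # 3 billion parameters, as stated above

print(weight_memory_gb(PARAMS, 16))  # 6.0 (assumed 16-bit baseline)
print(weight_memory_gb(PARAMS, 4))   # 1.5 (4-bit quantization)
```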

Best Use Cases

Accessibility Tools: Screen readers, assistive communication devices
Customer Service: Interactive voice response (IVR), chatbots, automated agents
Content Creation: Podcasts, e-learning narrations, audiobooks
Voice-Enabled Devices: Smart speakers, wearables, IoT interfaces
Multilingual Platforms: Apps requiring seamless Hindi-English dialogue

Prompt Tips and Output Quality

Input Text: Use simple, declarative sentences for clarity; use longer, more complex phrasing when you want emotional nuance.
Speaker Selection (speaker):
- Default “kavya” for a warm, conversational tone
- Switch to “agastya” for a more commanding presence
Advanced Controls:
- temperature (0–2): 0.2 for monotone, 0.7 for lively expressiveness
- top_p (0–1): 0.5 for focused delivery, 0.95 for varied intonation
- repetition_penalty (1–2): 1.05 default; increase to 1.2 to minimize repeats
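Audio Quality: Output is synthesized at 24 kHz with the SNAC codec. The request attributes listed above do not include sampling-rate or codec settings, so resample or re-encode the returned audio on your side if bandwidth or storage is constrained.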
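
The suggested values above can be grouped into named presets so that callers pick a delivery style rather than individual numbers. This is only a sketch: the preset names, the pairing of a temperature with a top_p value, and the helper `styled_payload` are illustrative choices, not part of the API.

```python
# Values come from the tips above; the names and pairings are illustrative.
STYLE_PRESETS = {
    "monotone": {"temperature": 0.2, "top_p": 0.5},
    "default": {"temperature": 0.4, "top_p": 0.9},
    "lively": {"temperature": 0.7, "top_p": 0.95},
}


def styled_payload(text, style="default", speaker="kavya", repetition_penalty=1.05):
    """Build a request payload using one of the named style presets."""
    if style not in STYLE_PRESETS:
        raise ValueError(f"unknown style {style!r}; choose from {sorted(STYLE_PRESETS)}")
    return {
        "text": text,
        "speaker": speaker,
        "repetition_penalty": repetition_penalty,
        **STYLE_PRESETS[style],
    }
```

For instance, styled_payload("Welcome back!", style="lively", speaker="agastya") yields a payload with temperature 0.7 and top_p 0.95.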

FAQs

Can Veena handle Hindi-English code-switching?
Yes. Veena’s transformer backbone is trained on mixed-language corpora for seamless transitions.

What latency should I expect in production?
On high-end GPUs, Veena delivers sub-80 ms end-to-end latency—perfect for real-time use.
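
The figure above describes the model running on a high-end GPU; a hosted API call also includes network and queueing time, so it is worth measuring from your own environment. The sketch below times any callable several times; the helper name `measure_latency_ms` is hypothetical.

```python
import statistics
import time


def measure_latency_ms(call, runs=5):
    """Call `call()` `runs` times and return (median_ms, worst_ms)."""
    timings = []
    for _ in range(runs):
        start = time.perf_counter()
        call()
        timings.append((time.perf_counter() - start) * 1000.0)
    return statistics.median(timings), max(timings)
```

With the request example above, this could be used as measure_latency_ms(lambda: requests.post(url, json=data, headers=headers)). Each run is a real API call and consumes credits, so keep the number of runs small.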

How do I pick the best speaker voice?
Choose based on your brand or application tone: Kavya for warmth, Agastya for depth, Maitri for neutrality, Vinaya for energy.
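
One way to encode this choice is a lookup from desired tone to speaker. This is a sketch with assumptions: the tone labels are invented for illustration, and the lowercase identifiers "maitri" and "vinaya" are assumed by analogy with "kavya" and "agastya", which are the only values shown in lowercase on this page.

```python
# Tone labels are illustrative. "maitri" and "vinaya" are assumed identifiers.
SPEAKER_BY_TONE = {
    "warm": "kavya",
    "deep": "agastya",
    "neutral": "maitri",
    "energetic": "vinaya",
}


def pick_speaker(tone, default="kavya"):
    """Return the speaker for a tone label, falling back to the default voice."""
    return SPEAKER_BY_TONE.get(tone.strip().lower(), default)
```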

Is a quantized version available?
Absolutely. Veena supports 4-bit quantization for reduced memory usage and faster inference.

What sample rate does Veena output?
Audio is synthesized at 24 kHz using the SNAC neural codec for smooth, high-quality playback.
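
To confirm the sample rate of a downloaded clip, the audio header can be inspected with Python's standard wave module. This sketch assumes the audio is PCM WAV, which is not stated on this page; check the Content-Type response header first, since wave.open raises wave.Error for other formats. The helper name `wav_info` is hypothetical.

```python
import io
import wave


def wav_info(audio_bytes):
    """Return (sample_rate_hz, channels, duration_seconds) for PCM WAV bytes."""
    with wave.open(io.BytesIO(audio_bytes), "rb") as wav:
        rate = wav.getframerate()
        return rate, wav.getnchannels(), wav.getnframes() / rate
```

If the response is WAV, wav_info(response.content) should report a sample rate of 24000 for audio synthesized at 24 kHz.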
