API
If you're looking to use the API, you can call it from the programming language of your choice; the example below uses Python with the requests library.
import requests
import base64

# Helper utilities from the standard request template (not required for this
# text-to-speech request, which takes plain text input):

# Use this function to convert an image file from the filesystem to base64
def image_file_to_base64(image_path):
    with open(image_path, 'rb') as f:
        image_data = f.read()
    return base64.b64encode(image_data).decode('utf-8')

# Use this function to fetch an image from a URL and convert it to base64
def image_url_to_base64(image_url):
    response = requests.get(image_url)
    image_data = response.content
    return base64.b64encode(image_data).decode('utf-8')

# Use this function to convert a list of image URLs to base64
def image_urls_to_base64(image_urls):
    return [image_url_to_base64(url) for url in image_urls]

api_key = "YOUR_API_KEY"
url = "https://api.segmind.com/v1/veena-tts"

# Request payload
data = {
    "text": "Kya tumne kabhi socha hai... ki hum sab sirf waqt ke musafir hain?",
    "speaker": "kavya",
    "temperature": 0.4,
    "top_p": 0.9,
    "repetition_penalty": 1.05
}

headers = {'x-api-key': api_key}

response = requests.post(url, json=data, headers=headers)
print(response.content)  # The response body is the generated audio
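Because the response body is binary audio, in practice you will want to write it to a file rather than print it. A minimal sketch, assuming the endpoint returns WAV-encoded bytes (check the Content-Type header for the actual format):

# Minimal sketch: persist the synthesized speech to disk.
# Assumption: the response body is WAV audio; adjust the file extension
# if the Content-Type header indicates another format (e.g. mp3).
if response.status_code == 200:
    with open("veena_output.wav", "wb") as f:
        f.write(response.content)
    print("Saved audio; Content-Type:", response.headers.get("Content-Type"))
else:
    print("Request failed:", response.status_code, response.text)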
Attributes
text
Provide input text for speech synthesis. Use simple phrases for clarity and complex phrases for detailed expression.
speaker
Choose a speaker for voice style: kavya for warmth, agastya for depth.
temperature (min: 0, max: 2)
Set speech variation. Use 0.2 for monotone, 0.7 for lively expression.
top_p (min: 0, max: 1)
Control output randomness. Set 0.5 for focused, 0.95 for diverse speech.
repetition_penalty (min: 1, max: 2)
Minimize word repetition. Use 1.2 for minimal repeats.
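As a quick reference, the documented ranges can be enforced with simple client-side checks before sending a request. The validation helper below is illustrative, not part of the API:

# Illustrative client-side range checks mirroring the documented limits above.
RANGES = {
    "temperature": (0.0, 2.0),
    "top_p": (0.0, 1.0),
    "repetition_penalty": (1.0, 2.0),
}

def validate_payload(data):
    # Raise if a tuning parameter falls outside its documented range.
    for key, (low, high) in RANGES.items():
        if key in data and not (low <= data[key] <= high):
            raise ValueError(f"{key}={data[key]} is outside the allowed range [{low}, {high}]")
    return data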
To keep track of your credit usage, you can inspect the response headers of each API call. The x-remaining-credits header indicates the number of credits remaining in your account. Monitor this value to avoid disruptions in your API usage.
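Continuing from the request above, a minimal sketch for reading that header (assuming the value is returned as a plain numeric string):

# Read the remaining-credit balance from the response headers.
remaining = response.headers.get("x-remaining-credits")
if remaining is not None:
    print("Credits remaining:", remaining)
else:
    print("x-remaining-credits header not present on this response")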
Veena – Text-to-Speech Model
What is Veena?
Veena, developed by Maya Research, is a state-of-the-art text-to-speech (TTS) model built on a 3 billion-parameter Llama-based autoregressive transformer. It delivers natural, expressive speech in Hindi and English—handling mixed-language inputs seamlessly. Leveraging the SNAC neural codec at 24 kHz, Veena generates studio-quality audio with four distinct speaker personas (Kavya, Agastya, Maitri, Vinaya). Optimized for ultra-low latency (sub-80 ms on high-end GPUs) and production-ready deployment via 4-bit quantization, Veena is engineered for real-time applications in accessibility, customer service, content creation, and voice-enabled devices.
Key Features
- High-Fidelity Audio: 24 kHz sampling rate with SNAC neural codec for crystal-clear voice output
- Multilingual & Code-Switching: Fluent in Hindi and English; natural transitions in mixed-language text
- Four Unique Voices:
- Kavya (warm, friendly)
- Agastya (deep, authoritative)
- Maitri (clear, neutral)
- Vinaya (bright, youthful)
- Low Latency: Sub-80 ms response time on top-tier GPUs—ideal for live interactions
- Efficient Quantization: 4-bit precision reduces memory footprint without compromising quality
- Transformer-Based: 3 billion parameters capture complex intonation, stress, and pacing patterns
Best Use Cases
- Accessibility Tools: Screen readers, assistive communication devices
- Customer Service: Interactive voice response (IVR), chatbots, automated agents
- Content Creation: Podcasts, e-learning narrations, audiobooks
- Voice-Enabled Devices: Smart speakers, wearables, IoT interfaces
- Multilingual Platforms: Apps requiring seamless Hindi-English dialogue
Prompt Tips and Output Quality
- Input Text: For clarity, use simple, declarative sentences; combine complex phrases for emotional nuance.
- Speaker Selection (speaker):
  - Default “kavya” for a warm, conversational tone
  - Switch to “agastya” for a more commanding presence
- Advanced Controls (see the sketch after this list):
  - temperature (0–2): 0.2 for monotone, 0.7 for lively expressiveness
  - top_p (0–1): 0.5 for focused delivery, 0.95 for varied intonation
  - repetition_penalty (1–2): 1.05 default; increase to 1.2 to minimize repeats
- Audio Quality: Adjust sampling rate and codec settings for bandwidth or storage constraints without losing clarity
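A minimal sketch applying these tuning tips against the /v1/veena-tts endpoint shown in the API example above; the preset names and the synthesize helper are illustrative, not part of the API:

import requests

API_KEY = "YOUR_API_KEY"
URL = "https://api.segmind.com/v1/veena-tts"

# Hypothetical presets built from the tips above.
PRESETS = {
    "neutral":    {"temperature": 0.2, "top_p": 0.5, "repetition_penalty": 1.05},
    "expressive": {"temperature": 0.7, "top_p": 0.95, "repetition_penalty": 1.2},
}

def synthesize(text, speaker="kavya", preset="neutral"):
    # Merge a preset into the documented payload fields and send the request.
    payload = {"text": text, "speaker": speaker, **PRESETS[preset]}
    return requests.post(URL, json=payload, headers={"x-api-key": API_KEY})

response = synthesize("Namaste! Aaj ka schedule ready hai.", speaker="agastya", preset="expressive")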
FAQs
Can Veena handle Hindi-English code-switching?
Yes. Veena’s transformer backbone is trained on mixed-language corpora for seamless transitions.
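For example, a code-switched request uses exactly the same payload shape as the API example above; only the text changes (this sketch reuses url and headers from that example):

# Hinglish input: no special flags are needed, the text field simply mixes languages.
data = {
    "text": "Meeting kal subah 10 baje hai, please join on time.",
    "speaker": "maitri",
    "temperature": 0.4,
    "top_p": 0.9,
    "repetition_penalty": 1.05
}
response = requests.post(url, json=data, headers=headers)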
What latency should I expect in production?
On high-end GPUs, Veena delivers sub-80 ms end-to-end latency—perfect for real-time use.
How do I pick the best speaker voice?
Choose based on your brand or application tone: Kavya for warmth, Agastya for depth, Maitri for neutrality, Vinaya for energy.
Is a quantized version available?
Absolutely. Veena supports 4-bit quantization for reduced memory usage and faster inference.
What sample rate does Veena output?
Audio is synthesized at 24 kHz using the SNAC neural codec for smooth, high-quality playback.