Veena TTS

Veena transforms text into high-fidelity, expressive speech in Hindi and English for real-time applications.


Pricing

Serverless Pricing

Buy credits that can be used anywhere on Segmind

$0.0015 per GPU second

Dedicated Cloud Pricing

For enterprise costs and dedicated endpoints

$0.0007 to $0.0031 per GPU second
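
As a rough illustration of how per-GPU-second billing adds up, the sketch below multiplies the listed rates by an amount of GPU time. The function names are hypothetical, and the GPU seconds a request actually consumes depend on text length and hardware, which this page does not specify.

```python
from decimal import Decimal

# Rates taken from the pricing listed on this page (USD per GPU second).
SERVERLESS_RATE = Decimal("0.0015")
DEDICATED_RATE_MIN = Decimal("0.0007")
DEDICATED_RATE_MAX = Decimal("0.0031")


def serverless_cost(gpu_seconds):
    """Estimated serverless cost in USD for the given GPU seconds."""
    return SERVERLESS_RATE * Decimal(str(gpu_seconds))


def dedicated_cost_range(gpu_seconds):
    """Estimated (low, high) dedicated-cloud cost in USD for the given GPU seconds."""
    seconds = Decimal(str(gpu_seconds))
    return DEDICATED_RATE_MIN * seconds, DEDICATED_RATE_MAX * seconds


if __name__ == "__main__":
    # Hypothetical workload: 1,000 requests at an assumed 2 GPU seconds each.
    total_seconds = 1000 * 2
    print("serverless:", serverless_cost(total_seconds))
    print("dedicated: ", dedicated_cost_range(total_seconds))
```

Decimal arithmetic is used here so that small per-second rates do not pick up floating-point rounding noise.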

Veena – Text-to-Speech Model

What is Veena?

Veena, developed by Maya Research, is a text-to-speech (TTS) model built on a 3-billion-parameter Llama-based autoregressive transformer. It produces natural, expressive speech in Hindi and English and handles mixed-language (code-switched) input. Using the SNAC neural codec at 24 kHz, Veena generates high-fidelity audio in four distinct speaker personas (Kavya, Agastya, Maitri, Vinaya). With latency under 80 ms on high-end GPUs and 4-bit quantization for production deployment, Veena is designed for real-time applications in accessibility, customer service, content creation, and voice-enabled devices.

Key Features

  • High-Fidelity Audio: 24 kHz sampling rate with SNAC neural codec for crystal-clear voice output
  • Multilingual & Code-Switching: Fluent in Hindi and English; natural transitions in mixed-language text
  • Four Unique Voices:
    • Kavya (warm, friendly)
    • Agastya (deep, authoritative)
    • Maitri (clear, neutral)
    • Vinaya (bright, youthful)
  • Low Latency: Sub-80 ms response time on top-tier GPUs, suited to live interactions
  • Efficient Quantization: 4-bit precision reduces memory footprint without compromising quality
  • Transformer-Based: 3 billion parameters capture complex intonation, stress, and pacing patterns

Best Use Cases

  • Accessibility Tools: Screen readers, assistive communication devices
  • Customer Service: Interactive voice response (IVR), chatbots, automated agents
  • Content Creation: Podcasts, e-learning narrations, audiobooks
  • Voice-Enabled Devices: Smart speakers, wearables, IoT interfaces
  • Multilingual Platforms: Apps requiring seamless Hindi-English dialogue

Prompt Tips and Output Quality

  • Input Text: Use simple, declarative sentences for clarity; use longer, more complex phrasing when you want more emotional nuance.
  • Speaker Selection (speaker):
    • Default “kavya” for a warm, conversational tone
    • Switch to “agastya” for a more commanding presence
  • Advanced Controls:
    • temperature (0–2): 0.2 for monotone, 0.7 for lively expressiveness
    • top_p (0–1): 0.5 for focused delivery, 0.95 for varied intonation
    • repetition_penalty (1–2): 1.05 default; increase to 1.2 to minimize repeats
  • Audio Quality: Adjust sampling rate and codec settings for bandwidth or storage constraints without losing clarity
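
The controls above can be collected into a single request payload. The following is a minimal sketch rather than the official client: the parameter names speaker, temperature, top_p, and repetition_penalty and their ranges come from the tips above, while the `text` field name, the `VeenaRequest` class, the lowercase spellings of "maitri" and "vinaya", and the temperature and top_p defaults are assumptions made for illustration.

```python
from dataclasses import dataclass, asdict

# Speaker identifiers; "maitri" and "vinaya" are assumed to follow the same
# lowercase convention as the documented "kavya" and "agastya".
SPEAKERS = ("kavya", "agastya", "maitri", "vinaya")


@dataclass(frozen=True)
class VeenaRequest:
    """Hypothetical container for Veena request parameters."""

    text: str                         # assumed field name for the input text
    speaker: str = "kavya"            # documented default voice
    temperature: float = 0.7          # illustrative choice, not a documented default
    top_p: float = 0.95               # illustrative choice, not a documented default
    repetition_penalty: float = 1.05  # documented default

    def __post_init__(self):
        # Validate against the ranges listed in the tips above.
        if not self.text.strip():
            raise ValueError("text must not be empty")
        if self.speaker not in SPEAKERS:
            raise ValueError(f"speaker must be one of {SPEAKERS}")
        if not 0 <= self.temperature <= 2:
            raise ValueError("temperature must be between 0 and 2")
        if not 0 <= self.top_p <= 1:
            raise ValueError("top_p must be between 0 and 1")
        if not 1 <= self.repetition_penalty <= 2:
            raise ValueError("repetition_penalty must be between 1 and 2")

    def to_payload(self) -> dict:
        """Return the parameters as a plain dict, ready to serialize as JSON."""
        return asdict(self)


if __name__ == "__main__":
    # A focused, commanding delivery using the "agastya" voice.
    request = VeenaRequest(
        text="Namaste! Your order has been shipped.",
        speaker="agastya",
        temperature=0.2,
        top_p=0.5,
    )
    print(request.to_payload())
```

Validating ranges on the client side catches out-of-range settings before a request is sent; consult the API reference for the actual endpoint and field names.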

FAQs

Can Veena handle Hindi-English code-switching?
Yes. Veena’s transformer backbone is trained on mixed-language corpora for seamless transitions.

What latency should I expect in production?
On high-end GPUs, Veena delivers sub-80 ms end-to-end latency, which suits real-time use.

How do I pick the best speaker voice?
Choose based on your brand or application tone: Kavya for warmth, Agastya for depth, Maitri for neutrality, Vinaya for energy.

Is a quantized version available?
Absolutely. Veena supports 4-bit quantization for reduced memory usage and faster inference.

What sample rate does Veena output?
Audio is synthesized at 24 kHz using the SNAC neural codec for smooth, high-quality playback.
