Veena TTS

Veena transforms text into high-fidelity, expressive speech in Hindi and English for real-time applications.

Veena – Text-to-Speech Model

What is Veena?

Veena, developed by Maya Research, is a text-to-speech (TTS) model built on a 3-billion-parameter, Llama-based autoregressive transformer. It produces natural, expressive speech in Hindi and English and handles mixed-language (code-switched) input. Audio is generated at 24 kHz with the SNAC neural codec, in four distinct speaker personas (Kavya, Agastya, Maitri, Vinaya). With sub-80 ms latency on high-end GPUs and 4-bit quantization for production deployment, Veena is engineered for real-time applications in accessibility, customer service, content creation, and voice-enabled devices.

Key Features

  • High-Fidelity Audio: 24 kHz sampling rate with SNAC neural codec for crystal-clear voice output
  • Multilingual & Code-Switching: Fluent in Hindi and English; natural transitions in mixed-language text
  • Four Unique Voices:
    • Kavya (warm, friendly)
    • Agastya (deep, authoritative)
    • Maitri (clear, neutral)
    • Vinaya (bright, youthful)
  • Low Latency: Sub-80 ms response time on top-tier GPUs, suited to live interactions
  • Efficient Quantization: 4-bit precision reduces memory footprint without compromising quality
  • Transformer-Based: 3 billion parameters capture complex intonation, stress, and pacing patterns
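
To put the quantization point in perspective, here is a rough back-of-the-envelope sketch of how much memory the weights alone would occupy at different precisions. It assumes exactly 3 billion parameters and ignores quantization metadata (scales, zero points), activations, and the audio codec, so real usage will be somewhat higher. The function name is illustrative only.

```python
def weight_memory_gb(num_params: int, bits_per_param: int) -> float:
    """Approximate size of the model weights alone, in GB (10**9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9


PARAMS = 3_000_000_000  # assumption: "3 billion parameters" taken literally

for bits in (16, 8, 4):
    print(f"{bits:>2}-bit: {weight_memory_gb(PARAMS, bits):.1f} GB")
# 16-bit: 6.0 GB
#  8-bit: 3.0 GB
#  4-bit: 1.5 GB
```

Under these assumptions, 4-bit weights take about a quarter of the space of 16-bit weights.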

Best Use Cases

  • Accessibility Tools: Screen readers, assistive communication devices
  • Customer Service: Interactive voice response (IVR), chatbots, automated agents
  • Content Creation: Podcasts, e-learning narrations, audiobooks
  • Voice-Enabled Devices: Smart speakers, wearables, IoT interfaces
  • Multilingual Platforms: Apps requiring seamless Hindi-English dialogue

Prompt Tips and Output Quality

  • Input Text: Use simple, declarative sentences for the clearest output; use longer, more complex sentences when you want more emotional nuance.
  • Speaker Selection (speaker):
    • Default “kavya” for a warm, conversational tone
    • Switch to “agastya” for a more commanding presence
  • Advanced Controls:
    • temperature (0–2): 0.2 for monotone, 0.7 for lively expressiveness
    • top_p (0–1): 0.5 for focused delivery, 0.95 for varied intonation
    • repetition_penalty (1–2): 1.05 default; increase to 1.2 to minimize repeats
  • Audio Quality: Where bandwidth or storage is constrained, adjust sampling rate and codec settings, and check that the result is still clear
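
The ranges above can be checked before a request is sent. The following is a minimal sketch, not an official client: the helper `build_payload`, the `text` field name, the lowercase spellings of "maitri" and "vinaya", and the default values for `temperature` and `top_p` are assumptions made for illustration. Only `speaker`, `temperature`, `top_p`, `repetition_penalty`, their ranges, and the 1.05 default come from the list above.

```python
# Hypothetical helper: validates settings against the ranges listed above.
SPEAKERS = {"kavya", "agastya", "maitri", "vinaya"}  # lowercase ids assumed

RANGES = {
    "temperature": (0.0, 2.0),
    "top_p": (0.0, 1.0),
    "repetition_penalty": (1.0, 2.0),
}


def build_payload(text, speaker="kavya", temperature=0.7,
                  top_p=0.95, repetition_penalty=1.05):
    """Return a dict of synthesis settings, or raise ValueError if invalid."""
    if not text.strip():
        raise ValueError("text must not be empty")
    if speaker not in SPEAKERS:
        raise ValueError(f"unknown speaker: {speaker!r}")
    params = {
        "temperature": temperature,
        "top_p": top_p,
        "repetition_penalty": repetition_penalty,
    }
    for name, value in params.items():
        low, high = RANGES[name]
        if not low <= value <= high:
            raise ValueError(f"{name}={value} is outside [{low}, {high}]")
    # "text" is an assumed field name, not a documented one
    return {"text": text, "speaker": speaker, **params}


# A flatter, more focused delivery in the deeper voice
payload = build_payload("नमस्ते, welcome back!", speaker="agastya",
                        temperature=0.2, top_p=0.5)
print(payload)
```

Validating locally gives a clear error message for an out-of-range value instead of relying on however the service reports it.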

FAQs

Can Veena handle Hindi-English code-switching?
Yes. Veena’s transformer backbone is trained on mixed-language corpora for seamless transitions.

What latency should I expect in production?
On high-end GPUs, Veena delivers sub-80 ms end-to-end latency, which is low enough for real-time use.

How do I pick the best speaker voice?
Choose based on your brand or application tone: Kavya for warmth, Agastya for depth, Maitri for neutrality, Vinaya for energy.

Is a quantized version available?
Absolutely. Veena supports 4-bit quantization for reduced memory usage and faster inference.

What sample rate does Veena output?
Audio is synthesized at 24 kHz using the SNAC neural codec for smooth, high-quality playback.
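
If you receive raw samples and need a playable file, the sample rate has to be written into the container correctly. The sketch below uses only the Python standard library and assumes the audio arrives as floating-point samples in the range -1 to 1, stored as 16-bit mono PCM. The sample format, the channel count, and the `write_wav` helper are assumptions, and a generated tone stands in for model output.

```python
import io
import math
import struct
import wave

SAMPLE_RATE = 24_000  # Hz, the output rate stated above


def write_wav(samples, dest, sample_rate=SAMPLE_RATE):
    """Write float samples in [-1, 1] to dest as a 16-bit mono PCM WAV."""
    frames = b"".join(
        struct.pack("<h", int(max(-1.0, min(1.0, s)) * 32767))
        for s in samples
    )
    with wave.open(dest, "wb") as wav:
        wav.setnchannels(1)        # mono (assumed)
        wav.setsampwidth(2)        # 2 bytes = 16-bit PCM (assumed)
        wav.setframerate(sample_rate)
        wav.writeframes(frames)


# Stand-in for synthesized speech: half a second of a 440 Hz tone
tone = [0.3 * math.sin(2 * math.pi * 440 * n / SAMPLE_RATE)
        for n in range(SAMPLE_RATE // 2)]

buf = io.BytesIO()  # pass a filename instead to write to disk
write_wav(tone, buf)
```

At 24 kHz and 16 bits per sample, uncompressed mono audio takes 48,000 bytes per second, which is a useful baseline when sizing storage or bandwidth.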
