Dia (Text to Speech)

Dia by Nari Labs is an open-weights TTS model that turns scripts into natural-sounding speech with emotion and nonverbal cues. You can control tone, voice, and delivery. It is an open alternative to ElevenLabs.

~91.26s
~$0.101

Inputs

Input text for speech generation. Use [S1] and [S2] to mark speakers, and parentheses for nonverbal actions such as (laughs) or (whispers). These tags are recognized but may produce unexpected output.

Audio file (.wav, .mp3, or .flac) for voice cloning. The model will clone the voice style of this sample.
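
As an illustration of the input text format described above, here is a minimal Python sketch that assembles a tagged script from a list of turns. The helper name build_script is hypothetical and not part of Dia; it only produces the string you would paste into the text input.

```python
from typing import Iterable, Tuple


def build_script(turns: Iterable[Tuple[int, str]]) -> str:
    """Join (speaker_number, line) pairs into one tagged input string.

    Hypothetical helper for illustration; it is not part of Dia itself.
    """
    parts = []
    for speaker, line in turns:
        if speaker < 1:
            raise ValueError("speaker numbers start at 1")
        # Prefix each turn with its speaker tag, e.g. [S1] or [S2].
        parts.append(f"[S{speaker}] {line.strip()}")
    return " ".join(parts)


script = build_script([
    (1, "Have you tried Dia yet?"),
    (2, "Not yet. (laughs) Is it any good?"),
])
print(script)
# [S1] Have you tried Dia yet? [S2] Not yet. (laughs) Is it any good?
```

Nonverbal actions stay inline in the spoken line, so the helper passes parenthesized text through unchanged.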

Dia by Nari Labs: Next-gen Text-to-Speech with Emotion and Realism

Dia is a 1.6 billion parameter text-to-speech model developed by Nari Labs ("Nari" (나리) is a native Korean word for lily). It is designed to produce realistic, podcast-style dialogue directly from text. Where traditional TTS systems often sound robotic or lack expressive nuance, Dia generates multi-speaker conversations with adjustable emotional tone and nonverbal cues such as pauses, laughter, and coughing. This expressiveness and control lets creators produce audio for podcasts, audiobooks, video game characters, and conversational interfaces without relying on proprietary services. It also suits conversational AI, storytelling, dubbing, and interactive voice applications.

Dia is optimized specifically for natural dialogue synthesis, which distinguishes it from general-purpose TTS models. The architecture supports audio conditioning: users can guide the tone, emotion, or delivery style of the generated speech with a short audio sample. It also allows script-level control through embedded tags for nonverbal sounds, which adds realism to the output. The model was trained using Google's TPU Cloud. It can run on modern GPUs with enough memory: the full version requires around 10GB of VRAM, and a lighter, quantized release is planned. Nari Labs has released both the model weights and the inference code openly, which supports community-driven development and transparency.
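
To put the parameter count and the VRAM figure in context, the rough arithmetic below estimates the memory taken by the weights alone at a few common numeric precisions. This is a back-of-the-envelope sketch: which precisions Dia actually ships in is an assumption here, and the roughly 10GB quoted above presumably also covers activations and other runtime overhead.

```python
PARAMS = 1.6e9  # parameter count stated in the description

# Bytes per parameter for some common precisions (illustrative choices,
# not a statement of what Dia actually uses).
BYTES_PER_PARAM = {"float32": 4, "float16": 2, "int8": 1}


def weight_gb(params: float, bytes_per_param: int) -> float:
    """Size of the weights alone, in gigabytes (1 GB = 1e9 bytes)."""
    return params * bytes_per_param / 1e9


for name, size in BYTES_PER_PARAM.items():
    print(f"{name}: {weight_gb(PARAMS, size):.1f} GB")
# float32: 6.4 GB
# float16: 3.2 GB
# int8: 1.6 GB
```

The numbers show why a quantized release would lower the hardware bar: halving the bytes per parameter halves the weight footprint.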

Key Features

  • Multi-Speaker Dialogue Tags
Generate dynamic conversations using [S1], [S2] speaker tags.
  • Nonverbal Vocal Cues
Dia recognizes expressive cues: (laughs), (clears throat), (sighs), (gasps), (coughs), (singing), (sings), (mumbles), (beep), (groans), (sniffs), (claps), (screams), (inhales), (exhales), (applause), (burps), (humming), (sneezes), (chuckle), (whistles).
  • Zero-Shot Voice Variety
The model is not fine-tuned to a single voice, so it will produce a new synthetic voice with each run. This allows for variety but requires conditioning for consistency.
  • Voice Consistency Options
    - Audio Prompting: Upload a voice sample to guide tone and speaker identity.
    - Seed Fixing: Use the same seed for consistent voice generation across runs.
  • Voice Cloning
Clone any voice by uploading a sample and a matching transcript. The model will adapt and use that voice for the rest of your script.
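
Since cues outside the list above may behave unpredictably, it can help to check a script before submitting it. The following Python sketch flags parenthesized cues that are not in the list; the helper name unlisted_cues is hypothetical, and the check is a plain text scan that knows nothing about the model.

```python
import re
from typing import List

# Cues copied from the "Nonverbal Vocal Cues" list above.
KNOWN_CUES = {
    "laughs", "clears throat", "sighs", "gasps", "coughs", "singing",
    "sings", "mumbles", "beep", "groans", "sniffs", "claps", "screams",
    "inhales", "exhales", "applause", "burps", "humming", "sneezes",
    "chuckle", "whistles",
}


def unlisted_cues(script: str) -> List[str]:
    """Return parenthesized cues in the script that are not in KNOWN_CUES."""
    found = re.findall(r"\(([^()]+)\)", script)
    return [cue for cue in found if cue.strip().lower() not in KNOWN_CUES]


text = "[S1] That was close. (gasps) [S2] (giggles) Told you so."
print(unlisted_cues(text))
# ['giggles']
```

Any ordinary parenthetical aside in the script is flagged as well, so treat the result as a prompt for review and not as an error.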

Usage Tips

  • Speaker Identity Management: Use [S1], [S2] for clarity in conversations.
  • Conditioning for Emotional Delivery: Include nonverbal tags or an audio sample to control emotion and style.