Seed Audio 1.0 — Text-to-Audio Generation Model
What is Seed Audio 1.0?
Seed Audio 1.0 is ByteDance Seed's all-in-one text-to-audio model. Instead of stopping at speech, it turns a written prompt into a complete sound scene — spoken dialogue, background music, ambience, and foley-style sound effects — in a single generation. That makes it less like a traditional text-to-speech engine and more like an audio director you drive with text.
You can steer output in three ways: a plain text prompt, up to three short reference audio clips (each under 30 seconds) for voice cloning, or a single reference image that informs mood and delivery. A reference image cannot be combined with reference audio. The model supports multi-role dialogue, native-sounding accents, and emotional tone, and can produce up to two minutes of audio per request while keeping voices consistent across the take. It is served to international developers via BytePlus and is available on Segmind's synchronous API.
Key Features
- Full-scene audio from one prompt: voice, music, ambience, and sound effects together.
- Zero-shot voice cloning from up to three reference clips, cited in the prompt as @Audio1, @Audio2, @Audio3.
- Preset voices spanning English, Chinese, Spanish, Japanese, Indonesian, and Portuguese.
- Image-guided generation as an alternative to audio references.
- Fine-grained delivery control over speech rate, loudness, and pitch.
- Flexible output: wav, mp3, pcm, or ogg_opus at sample rates from 8 kHz to 48 kHz.
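The output options in the list above can be checked on the client before a request is sent. The following is a minimal sketch, not the real API schema: the class name `OutputOptions` and the field names `audio_format` and `sample_rate_hz` are assumptions made for illustration. Only the four format names and the 8 kHz to 48 kHz range come from this page.

```python
from dataclasses import dataclass

# Formats and the 8-48 kHz range come from the feature list above.
# The class and field names are hypothetical, not the real API schema.
ALLOWED_FORMATS = frozenset({"wav", "mp3", "pcm", "ogg_opus"})
MIN_SAMPLE_RATE_HZ = 8_000
MAX_SAMPLE_RATE_HZ = 48_000


@dataclass(frozen=True)
class OutputOptions:
    audio_format: str
    sample_rate_hz: int

    def __post_init__(self) -> None:
        if self.audio_format not in ALLOWED_FORMATS:
            raise ValueError(
                f"unsupported format {self.audio_format!r}; "
                f"expected one of {sorted(ALLOWED_FORMATS)}"
            )
        # Range check only: this page does not list which discrete rates
        # inside the range the service actually accepts.
        if not MIN_SAMPLE_RATE_HZ <= self.sample_rate_hz <= MAX_SAMPLE_RATE_HZ:
            raise ValueError(
                f"sample rate {self.sample_rate_hz} Hz is outside "
                f"{MIN_SAMPLE_RATE_HZ}-{MAX_SAMPLE_RATE_HZ} Hz"
            )


options = OutputOptions(audio_format="mp3", sample_rate_hz=48_000)
```

Failing early on an unsupported format or rate avoids spending a generation request on a call that would be rejected anyway.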
Best Use Cases
Seed Audio 1.0 shines when a project needs more than narration. Use it for audio dramas, scripted podcast segments, game cutscene dialogue, ads and trailers, and cinematic voiceover where dialogue, ambience, and score should feel like one finished mix. Its voice-cloning path keeps a character recognizable across multiple clips, and the text-to-audio mode pairs naturally with silent AI video for prompt-to-finished pipelines.
Prompt Tips and Output Quality
The model rewards clear audio direction. Write prompts that read like mini scripts: name the environment, each speaker and their voice traits, the emotional delivery, the dialogue itself, and the sound design you want. Add descriptors such as "young boy", "breathless", "panicky", or "fantasy film style" to sharpen the performance. For consistent multi-character scenes, structure the prompt by role and cite reference clips in order (@Audio1, @Audio2, @Audio3).
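One way to keep prompts in this mini-script shape is to assemble them from structured parts. The sketch below is illustrative only: the `Role` class, the `build_prompt` function, and the line layout it produces are assumptions, since this page does not define a required prompt syntax. Only the @Audio1-style tags and the example descriptors come from the text above.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass
class Role:
    name: str
    traits: Sequence[str]                # e.g. "young boy", "breathless"
    dialogue: str
    reference_tag: Optional[str] = None  # e.g. "@Audio1" for a cloned voice


def build_prompt(
    environment: str,
    roles: Sequence[Role],
    sound_design: str,
    style: Optional[str] = None,
) -> str:
    """Assemble a script-like prompt (hypothetical layout, one part per line)."""
    lines = [f"Environment: {environment}"]
    if style:
        lines.append(f"Style: {style}")
    for role in roles:
        descriptors = list(role.traits)
        if role.reference_tag:
            descriptors.append(f"voice {role.reference_tag}")
        label = f"{role.name} ({', '.join(descriptors)})" if descriptors else role.name
        lines.append(f'{label}: "{role.dialogue}"')
    lines.append(f"Sound design: {sound_design}")
    return "\n".join(lines)


prompt = build_prompt(
    environment="a stormy forest at night",
    roles=[
        Role("Boy", ["young boy", "breathless", "panicky"], "We have to run, now!", "@Audio1"),
        Role("Guide", ["calm", "low voice"], "Stay close to me.", "@Audio2"),
    ],
    sound_design="heavy rain, distant thunder, tense strings",
    style="fantasy film style",
)
print(prompt)
```

Keeping each role on its own line makes it easy to see which reference clip belongs to which speaker when a scene is regenerated with small changes.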
FAQs
Is Seed Audio 1.0 just text-to-speech? No. Plain TTS reads a script aloud; Seed Audio 1.0 produces the whole soundscape — voice plus music, ambience, and effects — in one pass.
Can it clone a voice? Yes. Provide up to three reference clips (each under 30 seconds) and reference them in the prompt as @Audio1, @Audio2, @Audio3.
What languages are supported? Preset voices cover English, Chinese, Spanish, Japanese, Indonesian, and Portuguese, with cross-lingual synthesis.
How long can the audio be? Up to about two minutes per generation, with voice identity held consistent across the clip.
What output formats does it return? wav, mp3, pcm, or ogg_opus at sample rates from 8 kHz to 48 kHz.
Can I use an image instead of reference audio? Yes, one reference image can guide mood and delivery, but it cannot be combined with reference audio.
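The reference-input rules from these answers (at most three clips, each under 30 seconds, and no mixing of an image with audio) can be enforced before a request is built. This is a minimal sketch under stated assumptions: the function name `reference_tags` and its parameters are hypothetical, and clip durations are assumed to be known in seconds. It does not call any real API.

```python
from typing import List, Sequence

MAX_REFERENCE_CLIPS = 3      # "up to three reference clips"
MAX_CLIP_SECONDS = 30.0      # "each under 30 seconds"


def reference_tags(clip_durations_s: Sequence[float], has_image: bool = False) -> List[str]:
    """Validate reference inputs and return the tags to cite in the prompt.

    Hypothetical helper: it only encodes the limits stated in the FAQ above.
    """
    if has_image and clip_durations_s:
        raise ValueError("a reference image cannot be combined with reference audio")
    if len(clip_durations_s) > MAX_REFERENCE_CLIPS:
        raise ValueError(f"at most {MAX_REFERENCE_CLIPS} reference clips are allowed")
    for index, duration in enumerate(clip_durations_s, start=1):
        if not 0 < duration < MAX_CLIP_SECONDS:
            raise ValueError(
                f"clip {index} is {duration} s; each clip must be under "
                f"{MAX_CLIP_SECONDS:g} seconds"
            )
    # Clips are cited by position: first clip is @Audio1, second is @Audio2, ...
    return [f"@Audio{i}" for i in range(1, len(clip_durations_s) + 1)]


tags = reference_tags([12.5, 8.0])
print(tags)  # ['@Audio1', '@Audio2']
```

An image-only request passes `has_image=True` with an empty clip list and gets back no tags, which matches the rule that the two reference types are alternatives.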