ElevenLabs Transcript

Transcribe audio to accurate text in 99 languages with speaker diarization and word-level timestamps.

~7.73s · ~$0.003

Inputs

  • Input Audio URL: the audio file to transcribe.
  • Language Code: an ISO-639-1 or ISO-639-3 language_code corresponding to the language of the audio file. Can improve transcription performance if known beforehand. Defaults to null, in which case the language is predicted automatically.
  • Model: the model identifier.
  • Additional Formats: a list of additional formats to export the transcript to.

Examples


ElevenLabs Transcript — AI Speech-to-Text API

What is ElevenLabs Transcript?

ElevenLabs Transcript (powered by Scribe) is a state-of-the-art automatic speech recognition (ASR) model built for developers who need accurate, structured transcriptions at scale. Trained to handle real-world audio — accented speech, background noise, overlapping voices, and domain-specific vocabulary — Scribe consistently outperforms competing models including OpenAI Whisper v3 and Google Gemini Flash on word error rate benchmarks across 99 languages.

Available in two versions, scribe_v1 and scribe_v2, the model converts audio files into rich, structured transcripts complete with speaker labels, word-level timestamps, and inline audio event annotations — all through a single API call.
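A minimal sketch of that single API call, using Python's `requests`. The endpoint path, `xi-api-key` header, and form-field names (`model_id`, `language_code`, `diarize`) follow this page's descriptions and common ElevenLabs conventions, but treat them as assumptions and check the official API reference before relying on them.

```python
API_URL = "https://api.elevenlabs.io/v1/speech-to-text"  # assumed endpoint path


def build_transcription_fields(model_id="scribe_v1", language_code=None, diarize=False):
    """Assemble the form fields for a transcription request.

    Omitting language_code leaves it null so the language is
    predicted automatically, as described above.
    """
    fields = {"model_id": model_id, "diarize": str(diarize).lower()}
    if language_code is not None:
        fields["language_code"] = language_code
    return fields


def transcribe(audio_path, api_key, **kwargs):
    """POST an audio file for transcription and return the parsed JSON."""
    import requests  # imported here so the helpers above stay dependency-free

    with open(audio_path, "rb") as f:
        resp = requests.post(
            API_URL,
            headers={"xi-api-key": api_key},
            files={"file": f},
            data=build_transcription_fields(**kwargs),
        )
    resp.raise_for_status()
    return resp.json()
```

Switching between scribe_v1 and scribe_v2 is then just a matter of changing `model_id`.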

Key Features

  • Industry-Leading Accuracy — Achieves a 96.7% accuracy rate for English and record-low word error rates across 99 languages, outperforming Google Gemini 2.0 Flash, OpenAI Whisper v3, and Deepgram Nova-3 in third-party benchmarks.
  • Speaker Diarization — Identifies and labels up to 32 distinct speakers in a single recording, with a configurable diarization threshold for fine-tuned separation control.
  • Word-Level Timestamps — Captures the exact start and end time of every word or character, enabling frame-accurate subtitle generation and interactive audio players.
  • Audio Event Tagging — Detects and annotates non-speech sounds such as (laughter), (applause), or (footsteps) inline within the transcript for richer contextual output.
  • 99-Language Support — Supports transcription across 99 languages with automatic language detection, including underserved languages like Serbian, Cantonese, and Malayalam where other models often exceed 40% error rates.
  • Multi-Channel Audio — Processes up to 5 independent audio channels separately, ideal for call center recordings with distinct agent and customer tracks.
  • Keyterm Prompting (scribe_v2) — Bias transcription towards up to 100 custom terms such as product names, medical terminology, or brand-specific vocabulary.
  • Entity Detection (scribe_v2) — Automatically identifies and timestamps sensitive entities like names, credit card numbers, or medical conditions in the transcript.

Best Use Cases

Media and Entertainment — Generate frame-accurate subtitles and closed captions for films, YouTube videos, and streaming content. Word-level timestamps feed directly into SRT and VTT subtitle workflows.
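As an illustration of that SRT workflow, the helper below groups word-level timestamps into numbered SRT cues. The word-entry field names (`text`, `start`, `end`, in seconds) are assumptions based on the word-level timestamp feature described above; adjust them to match the actual response shape.

```python
def words_to_srt(words, max_words=7):
    """Convert word-level timestamps into SRT subtitle cues.

    `words` is a list of dicts with "text", "start", and "end" keys
    (start/end in seconds); field names are assumed, not confirmed.
    """
    def fmt(t):
        # SRT timestamps look like HH:MM:SS,mmm
        h, rem = divmod(t, 3600)
        m, s = divmod(rem, 60)
        ms = round((s - int(s)) * 1000)
        return f"{int(h):02d}:{int(m):02d}:{int(s):02d},{ms:03d}"

    cues = []
    for i in range(0, len(words), max_words):
        chunk = words[i:i + max_words]
        cues.append(
            f"{len(cues) + 1}\n"
            f"{fmt(chunk[0]['start'])} --> {fmt(chunk[-1]['end'])}\n"
            + " ".join(w["text"] for w in chunk)
        )
    return "\n\n".join(cues)
```

Because each cue's start and end come straight from the first and last word in the chunk, the subtitles stay frame-accurate without any manual alignment.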

Podcast and Interview Transcription — Diarize multi-speaker recordings automatically. Audio event tagging adds expressive context — marking moments of laughter or silence — that pure transcription misses.

Business Meeting Intelligence — Convert recorded calls, standups, and customer interviews into searchable, speaker-attributed transcripts. Pair with an LLM to extract action items or summaries.

Call Center Analytics — Use multi-channel mode to process separate agent and customer audio tracks simultaneously. Speaker diarization ensures attribution is always correct.

Medical and Legal Documentation — High accuracy on domain-specific vocabulary, combined with entity detection in scribe_v2, makes this suitable for clinical dictation and legal deposition transcription where precision is non-negotiable.

Accessibility — Generate captions for live events or archived content across 99 languages, helping creators meet accessibility requirements globally.

Prompt Tips and Output Quality

Start with scribe_v1 for general-purpose transcription — it handles most audio reliably with minimal configuration. Switch to scribe_v2 when your audio contains specialized terminology: use the keyterm prompting feature to pass up to 100 domain-specific words to the model before transcription begins.

For multi-speaker recordings, set diarize: true and provide a rough num_speakers count when known. If the model over-splits a single speaker into two, raise diarization_threshold above 0.22; if it merges two speakers into one, lower it toward 0.10.
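The diarization settings above can be bundled into request fields like so. Parameter names (`diarize`, `num_speakers`, `diarization_threshold`) are taken from this page's descriptions and may differ from the live API reference.

```python
def diarization_fields(num_speakers=None, threshold=None):
    """Form fields for multi-speaker transcription (names per this page).

    Raise `threshold` above 0.22 if one speaker is being split in two;
    lower it toward 0.10 if two speakers are being merged into one.
    """
    fields = {"diarize": "true"}
    if num_speakers is not None:
        fields["num_speakers"] = str(num_speakers)
    if threshold is not None:
        fields["diarization_threshold"] = str(threshold)
    return fields
```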

Always set timestamp_granularity: word when building subtitle pipelines or interactive transcripts — character-level granularity is useful for specialized downstream processing but rarely needed in production.

For call center or interview audio recorded with separate microphone channels, enable use_multi_channel to ensure each speaker is processed on their own track, dramatically improving diarization accuracy.

Keep temperature at 0 for deterministic, production-grade output. Use a non-zero value only when experimenting with noisy or low-quality audio.

FAQs

What audio formats does ElevenLabs Transcript support? The model accepts most common audio formats including MP3, WAV, M4A, FLAC, OGG, and WebM. Files up to 3GB are supported, covering recordings up to approximately 10 hours.

How accurate is ElevenLabs Transcript compared to OpenAI Whisper? In independent benchmarks, Scribe outperforms OpenAI Whisper v3 on word error rate across the majority of supported languages, including English (96.7% accuracy), French, German, Spanish, and Japanese. Third-party analysis from Artificial Analysis confirmed Scribe also outperforms OpenAI's GPT-4o and GPT-4o-mini transcription models on WER benchmarks.

What is speaker diarization and how do I enable it? Speaker diarization identifies which person is speaking at any given moment. Set diarize: true in your API request. Optionally provide num_speakers if you know the exact count, or adjust diarization_threshold to control how aggressively speakers are separated.
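Once diarization is enabled, a response can be folded into per-speaker text with a small helper like the one below. It assumes each word entry carries `speaker_id` and `text` fields, which is implied by the diarization feature but not confirmed here.

```python
from collections import defaultdict


def text_by_speaker(words):
    """Collect transcript text per speaker from diarized word entries.

    Assumes each entry has "speaker_id" and "text" keys (field
    names are assumptions based on the feature description).
    """
    spoken = defaultdict(list)
    for w in words:
        spoken[w["speaker_id"]].append(w["text"])
    return {speaker: " ".join(parts) for speaker, parts in spoken.items()}
```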

What is the difference between scribe_v1 and scribe_v2? scribe_v1 is ElevenLabs' baseline production model, optimized for speed and accuracy on general audio. scribe_v2 adds keyterm prompting (biasing towards domain-specific vocabulary), entity detection (auto-labeling names and sensitive data), and improved multilingual performance. Use scribe_v2 for specialized or complex audio.

Can I transcribe audio in multiple languages in the same file? Yes. Leave language_code unset so the language is detected automatically, and the model will handle language switches within the same recording. For best results on multilingual audio, use scribe_v2.

Is the API suitable for real-time transcription? The standard Scribe API is optimized for batch transcription of recorded audio. For real-time applications requiring low-latency output, ElevenLabs offers a separate Scribe v2 Realtime endpoint with approximately 150ms latency via WebSocket.