ElevenLabs Transcript
Experience unmatched accuracy with ElevenLabs Transcript, the leading model for AI speech-to-text.
API
If you're looking for an API, you can call the model from your preferred programming language; the example below uses Python.
import requests
import base64

# Use this function to convert an image file from the filesystem to base64
def image_file_to_base64(image_path):
    with open(image_path, 'rb') as f:
        image_data = f.read()
    return base64.b64encode(image_data).decode('utf-8')

# Use this function to fetch an image from a URL and convert it to base64
def image_url_to_base64(image_url):
    response = requests.get(image_url)
    image_data = response.content
    return base64.b64encode(image_data).decode('utf-8')

# Use this function to convert a list of image URLs to base64
def image_urls_to_base64(image_urls):
    return [image_url_to_base64(url) for url in image_urls]

api_key = "YOUR_API_KEY"
url = "https://api.segmind.com/v1/eleven-labs-transcript"

# Request payload
data = {
    "audio_url": "https://segmind-sd-models.s3.amazonaws.com/display_images/sad_talker/sad_talker_audio_input.mp3",
    "language_code": "en",
    "tag_audio_events": False,
    "timestamp_granularity": "none",
    "diarize": False
}

headers = {'x-api-key': api_key}
response = requests.post(url, json=data, headers=headers)
print(response.content)  # The response contains the transcription
Attributes
audio_url - Input audio URL.
language_code - An ISO-639-1 or ISO-639-3 language code corresponding to the language of the audio file. Can sometimes improve transcription performance if known beforehand. Defaults to null, in which case the language is predicted automatically.
model_id - Model identifier.
tag_audio_events - Whether to tag audio events like (laughter), (footsteps), etc. in the transcription.
num_speakers - Number of speakers in the audio (for diarization). min: 1, max: 32.
timestamp_granularity - Timestamp level for the transcription.
diarize - Whether to annotate which speaker is currently talking in the uploaded file.
diarization_threshold - Diarization threshold to apply during speaker diarization. A higher value means a lower chance of one speaker being diarized as two different speakers, but a higher chance of two different speakers being diarized as one (fewer total speakers predicted). A lower value means a higher chance of one speaker being diarized as two different speakers, but a lower chance of two different speakers being diarized as one (more total speakers predicted). Can only be set when diarize=True and num_speakers=None. Defaults to None, in which case a threshold is chosen based on the model_id (usually 0.22). min: 0.1, max: 0.4.
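As a rough sketch of how these attributes fit together, the request below enables audio-event tagging and diarization and overrides the default threshold. The chosen values are assumptions for illustration, and timestamp_granularity="word" in particular should be checked against the model's allowed values.

import requests

# Sketch: the same endpoint as above, with diarization enabled and an explicit
# threshold. The specific values ("word", 0.3) are illustrative assumptions,
# not recommended settings.
url = "https://api.segmind.com/v1/eleven-labs-transcript"
headers = {"x-api-key": "YOUR_API_KEY"}

data = {
    "audio_url": "https://segmind-sd-models.s3.amazonaws.com/display_images/sad_talker/sad_talker_audio_input.mp3",
    "language_code": "en",
    "tag_audio_events": True,           # tag (laughter), (footsteps), etc.
    "timestamp_granularity": "word",    # assumed allowed value
    "diarize": True,                    # must be True for diarization_threshold to apply
    "diarization_threshold": 0.3        # only valid when num_speakers is left unset
}

response = requests.post(url, json=data, headers=headers)
print(response.json())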
To keep track of your credit usage, you can inspect the response headers of each API call. The x-remaining-credits property will indicate the number of remaining credits in your account. Ensure you monitor this value to avoid any disruptions in your API usage.
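Continuing from the Python example above, a minimal way to read that header (assuming it is present on the response) might be:

# Sketch: read the remaining-credit count from the response headers.
remaining = response.headers.get("x-remaining-credits")
if remaining is not None:
    print(f"Credits remaining: {remaining}")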
Resources to get you started
Everything you need to know to get the most out of ElevenLabs Transcript
ElevenLabs Transcript
ElevenLabs Transcript is the premier AI transcription model for professionals who need flawless audio-to-text conversion. With industry-leading accuracy, ElevenLabs Transcript is perfect for films, podcasts, meetings, and medical dictations. Experience unmatched precision and seamless integration with this advanced ASR (automatic speech recognition) technology.
Key Features
- Industry-Leading Accuracy - Achieve the lowest word error rate for perfectly accurate English transcription, outperforming Google Gemini and OpenAI Whisper in testing.
- Smart Speaker Diarization - Intuitively distinguishes and labels every speaker in any conversation for clear, organized transcripts.
- Precise Word-Level Timestamps - Capture the exact moment each word is spoken, enabling seamless subtitle syncing and interactive audio experiences (see the sketch after this list).
- Dynamic Audio Tagging - Enriches your English transcripts with the full context of your audio by tagging every sound event, from laughter to footsteps.
- Global Language Support - Break language barriers with support for English and 98 other languages.
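As a rough sketch of how the diarization and word-level timestamp features might be consumed together, the snippet below requests both and prints a speaker-labelled transcript. The response schema is not documented here, so the field names used when parsing ("words", "speaker_id", "start", "text") are assumptions, not the confirmed output format.

import requests

# Sketch: request word-level timestamps plus diarization, then print a
# speaker-labelled transcript. The parsed field names below are assumed.
url = "https://api.segmind.com/v1/eleven-labs-transcript"
headers = {"x-api-key": "YOUR_API_KEY"}
data = {
    "audio_url": "https://segmind-sd-models.s3.amazonaws.com/display_images/sad_talker/sad_talker_audio_input.mp3",
    "timestamp_granularity": "word",  # assumed allowed value
    "diarize": True
}

result = requests.post(url, json=data, headers=headers).json()

for word in result.get("words", []):             # hypothetical field name
    speaker = word.get("speaker_id", "speaker")  # hypothetical field name
    start = word.get("start", 0.0)               # hypothetical: seconds from audio start
    print(f"[{start:7.2f}s] {speaker}: {word.get('text', '')}")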
Use Cases
- Media & Entertainment - Generate accurate subtitles and closed captions for films and videos with precise timestamps (a rough SRT sketch follows this list).
- Business Meetings - Get clear, organized transcripts of meetings with speaker diarization, perfect for record-keeping and follow-up actions.
- Medical Dictations - Transcribe medical dictations with industry-leading accuracy, ensuring precision in healthcare documentation.
- Podcast Production - Transform audio content into text for show notes, scripts, and enhanced accessibility.
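For the subtitle use case, word-level timestamps could be grouped into caption cues. The helper below is a hypothetical sketch: it assumes the same "words" list shape as in the earlier sketch and simply chunks words into fixed-size cues, which is only one of many possible segmentation strategies.

# Sketch: turn an assumed list of word timestamps into SRT cues.
def seconds_to_srt(t):
    """Format seconds as an SRT timestamp (HH:MM:SS,mmm)."""
    ms = int(round(t * 1000))
    h, ms = divmod(ms, 3_600_000)
    m, ms = divmod(ms, 60_000)
    s, ms = divmod(ms, 1_000)
    return f"{h:02}:{m:02}:{s:02},{ms:03}"

def words_to_srt(words, words_per_cue=8):
    """Group assumed word dicts ({'text', 'start', 'end'}) into SRT cues."""
    cues = []
    for i in range(0, len(words), words_per_cue):
        chunk = words[i:i + words_per_cue]
        start = seconds_to_srt(chunk[0]["start"])
        end = seconds_to_srt(chunk[-1]["end"])
        text = " ".join(w["text"] for w in chunk)
        cues.append(f"{i // words_per_cue + 1}\n{start} --> {end}\n{text}\n")
    return "\n".join(cues)

Writing the returned string to a .srt file alongside the video is usually enough for most players to pick it up.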