Higgsfield Speech 2 Video
Transform images and audio into dynamic, lip-synced videos for engaging digital content.
API
If you're looking for an API, you can choose a code sample in your preferred programming language. The example below uses Python.
import requests
import base64

# Use this function to convert an image file from the filesystem to base64
def image_file_to_base64(image_path):
    with open(image_path, 'rb') as f:
        image_data = f.read()
    return base64.b64encode(image_data).decode('utf-8')

# Use this function to fetch an image from a URL and convert it to base64
def image_url_to_base64(image_url):
    response = requests.get(image_url)
    image_data = response.content
    return base64.b64encode(image_data).decode('utf-8')

# Use this function to convert a list of image URLs to base64
def image_urls_to_base64(image_urls):
    return [image_url_to_base64(url) for url in image_urls]

api_key = "YOUR_API_KEY"
url = "https://api.segmind.com/v1/higgsfield-speech2video"

# Request payload
data = {
    "input_image": "https://segmind-resources.s3.amazonaws.com/input/03cea2dd-87e9-41d7-9932-fbe45d4b2dd5-434b7481-1ddb-43da-a2df-10928effc900.png",
    "input_audio": "https://segmind-resources.s3.amazonaws.com/input/a846542c-c555-43ae-bdb0-8795ef78e0bb-8fe7c335-9e7f-4729-8230-b3eabc2af49c.wav",
    "prompt": "Generate an educational video with clear articulation, gentle hand gestures, and warm facial expressions appropriate for teaching content. All transitions need to be super realistic and smooth.",
    "quality": "high",
    "enhance_prompt": False,
    "seed": 42,
    "duration": 10
}

headers = {'x-api-key': api_key}

response = requests.post(url, json=data, headers=headers)
print(response.content)  # The response is the generated video
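
The response body is the generated video itself. Below is a minimal sketch for saving it to disk, assuming a successful request returns the raw video bytes; the .mp4 extension is an assumption, so check the Content-Type response header to confirm the container format.

# Minimal sketch: persist the generated video.
# Assumes the request above succeeded and returned raw video bytes;
# the .mp4 extension is an assumption -- confirm via the Content-Type header.
if response.status_code == 200:
    with open("speech2video_output.mp4", "wb") as f:
        f.write(response.content)
    print("Saved speech2video_output.mp4")
else:
    print(f"Request failed ({response.status_code}): {response.text}")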
Attributes
input_image
Provide a URL of the image to drive animation. Use a clear, high-quality image for best results.

input_audio
URL for the audio guiding avatar speech. Use articulate speech for clear lip-sync results.

prompt
Describe the video output scenario. Create an engaging, emotional prompt for vibrant expressions.

quality
Choose video quality preference. 'high' is best for detailed videos, while 'mid' helps with speed.
Allowed values: high, mid

enhance_prompt
Automatically refine your prompt. Enable to achieve a balanced expression across the video.

seed
Set a seed number for consistent outputs. Use different seeds for variation; 42 is common.
min: 1, max: 1000000

duration
Decide video length in seconds. Choose longer durations for in-depth content.
Allowed values: 5, 10, 15
To keep track of your credit usage, you can inspect the response headers of each API call. The x-remaining-credits header indicates the number of remaining credits in your account. Monitor this value to avoid disruptions in your API usage.
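
For example, here is a minimal sketch reading that header from the response object above (requests treats header names case-insensitively, and .get() returns None if the header is absent):

# Minimal sketch: check remaining credits after a call.
remaining = response.headers.get('x-remaining-credits')
if remaining is not None:
    print(f"Remaining credits: {remaining}")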
Resources to get you started
Everything you need to know to get the most out of Higgsfield Speech 2 Video
Speak v2: How to Use Effectively
Speak v2 is a state-of-the-art speech-to-video generation model that turns your static image and audio into a lifelike, lip-synced avatar video. Follow this guide to master the key parameters, tailor the settings to different scenarios, and produce polished results every time.
1. Preparing Your Inputs
- input_image (URL): Use a clear, high-resolution, front-facing portrait. Well-lit and centered faces yield the most natural animations.
- input_audio (URL, MP3): Choose clean, well-articulated recordings. Avoid background noise or abrupt volume changes to maintain precise lip-sync.
2. Core Parameters
Parameter | Description | Recommendations
---|---|---
prompt | Describe your scene and desired emotional tone. | "Deliver a warm corporate greeting"
quality | Video resolution/detail. | high (client-facing); mid (rapid tests)
duration | Video length in seconds: 5, 10, or 15. | 5s (social snippets); 15s (training)
enhance_prompt | Auto-refine text prompt for balanced expressions (true/false). | true (complex expressions)
seed | Numerical seed for reproducibility (1 to 1,000,000). | Use 42 for baseline; vary for new look
3. Use Case Presets
- Corporate Spokesperson
  - prompt: "Introduce our new product with a confident tone."
  - quality: high, duration: 10s, enhance_prompt: true, seed: 1234
- E-Learning Instructor
  - prompt: "Explain the concept of photosynthesis with enthusiasm."
  - quality: high, duration: 15s, enhance_prompt: true, seed: 5678
- Social Media Influencer
  - prompt: "Share a quick style tip in a friendly voice."
  - quality: mid, duration: 5s, enhance_prompt: false, seed: 42
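
A minimal sketch expressing these presets as reusable payload fragments for the endpoint shown in the API example; the preset names and the build_payload helper are illustrative, not part of the API:

# Illustrative presets mirroring the list above; merge one into the
# base payload before posting. The keys of PRESETS are hypothetical names.
PRESETS = {
    "corporate_spokesperson": {
        "prompt": "Introduce our new product with a confident tone.",
        "quality": "high", "duration": 10, "enhance_prompt": True, "seed": 1234,
    },
    "elearning_instructor": {
        "prompt": "Explain the concept of photosynthesis with enthusiasm.",
        "quality": "high", "duration": 15, "enhance_prompt": True, "seed": 5678,
    },
    "social_media_influencer": {
        "prompt": "Share a quick style tip in a friendly voice.",
        "quality": "mid", "duration": 5, "enhance_prompt": False, "seed": 42,
    },
}

def build_payload(input_image, input_audio, preset_name):
    # Merge the media URLs with a named preset into a full request body.
    return {"input_image": input_image, "input_audio": input_audio,
            **PRESETS[preset_name]}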
4. Optimization Tips
- Prompt Detail: Include emotion and pacing cues (e.g., "calmly", "energetic").
- Image Selection: Avoid glasses reflections or extreme head tilts.
- Audio Quality: Record in a quiet room with a pop filter to ensure clear consonants.
- Batch Testing: Run brief 5s clips (mid quality) to preview different seeds before finalizing; see the sketch after this list.
- Consistency: Lock the seed parameter when creating multi-segment videos to keep style uniform.
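
A minimal sketch of that batch test, reusing url, headers, and data from the API example above; the seed values and output filenames are illustrative:

# Illustrative batch test: short mid-quality clips across several seeds.
# Assumes `url`, `headers`, and `data` from the API example above.
for seed in (42, 1234, 5678):  # arbitrary example seeds
    trial = {**data, "quality": "mid", "duration": 5, "seed": seed}
    resp = requests.post(url, json=trial, headers=headers)
    if resp.status_code == 200:
        with open(f"preview_seed_{seed}.mp4", "wb") as f:  # extension assumed
            f.write(resp.content)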
5. Best Practices
- Start with default settings (quality=high, duration=5, seed=42).
- Gradually tweak one parameter at a time to understand its impact.
- Review outputs frame-by-frame to catch subtle lip-sync mismatches.
- Store successful parameter sets as templates for rapid reuse; a sketch follows this list.
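
A minimal sketch of such a template store backed by a local JSON file; the filename and helper names are illustrative, not part of the API:

import json

TEMPLATE_FILE = "speech2video_templates.json"  # illustrative filename

def save_template(name, params):
    # Persist a successful parameter set under a reusable name.
    try:
        with open(TEMPLATE_FILE) as f:
            templates = json.load(f)
    except FileNotFoundError:
        templates = {}
    templates[name] = params
    with open(TEMPLATE_FILE, "w") as f:
        json.dump(templates, f, indent=2)

def load_template(name):
    # Retrieve a stored parameter set by name.
    with open(TEMPLATE_FILE) as f:
        return json.load(f)[name]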
By following this guide, you'll harness Speak v2's full potential, creating professional, expressive avatar videos that captivate your audience.
Other Popular Models
Discover other models you might be interested in.
storydiffusion
Story Diffusion turns your written narratives into stunning image sequences.

idm-vton
Best-in-class clothing virtual try on in the wild

faceswap-v2
Take a picture/gif and replace the face in it with a face of your choice. You only need one image of the desired face; no dataset, no training.

sdxl1.0-txt2img
The SDXL model is the official upgrade to the v1.5 model. The model is released as open-source software.
