Bytedance HuMo: Human-Centric Video Generation
HuMo generates high-quality, human-centric videos from text, images, and audio, with controls for guidance strength, resolution, frame count, and step count.
API
HuMo is available through an HTTP API that can be called from the programming language of your choice. The example here uses Python.
import requests
import base64


# Use this function to convert an image file from the filesystem to base64
def image_file_to_base64(image_path):
    with open(image_path, 'rb') as f:
        image_data = f.read()
    return base64.b64encode(image_data).decode('utf-8')


# Use this function to fetch an image from a URL and convert it to base64
def image_url_to_base64(image_url):
    response = requests.get(image_url)
    response.raise_for_status()  # Fail early instead of encoding an error page
    image_data = response.content
    return base64.b64encode(image_data).decode('utf-8')


# Use this function to convert a list of image URLs to base64
def image_urls_to_base64(image_urls):
    return [image_url_to_base64(url) for url in image_urls]


api_key = "YOUR_API_KEY"
url = "https://api.segmind.com/v1/bytedance-humo"

# Request payload
data = {
    "frames": 30,
    "scale_a": 5,
    "scale_t": 5,
    "mode": "TA",
    "height": 720,
    "width": 1280,
    "steps": 30
}

headers = {'x-api-key': api_key}
response = requests.post(url, json=data, headers=headers)
print(response.content)  # The response is the generated video
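Printing raw bytes is rarely useful for a generated video. The following is a minimal sketch for writing the response body to disk instead. It assumes the endpoint returns the encoded video directly in the body on success and signals errors through the HTTP status code; the source confirms neither. `save_response_body` is a hypothetical helper, and the `.mp4` file name in the usage line is also an assumption.

```python
from pathlib import Path


def save_response_body(response, out_path):
    """Write response.content to out_path if the request succeeded.

    `response` is any object exposing status_code, content, and text,
    such as a requests.Response. Assumes the body is the encoded video.
    """
    if response.status_code != 200:
        # Include a short excerpt of the body to help diagnose the failure.
        raise RuntimeError(
            f"request failed with status {response.status_code}: {response.text[:200]}"
        )
    path = Path(out_path)
    path.write_bytes(response.content)
    return path
```

With the request from the example above, usage could look like `save_response_body(response, "humo_output.mp4")`.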
Attributes
frames: Number of frames for the generated video.
min: 10, max: 100
scale_a: Strength of audio guidance. Higher = better audio-motion sync.
min: 1, max: 10
scale_t: Strength of text guidance. Higher = better adherence to text prompts.
min: 1, max: 10
mode: Input mode: TA for text+audio; TIA for text+image+audio.
Allowed values: TA, TIA
height: Video height (e.g., 720 or 480).
width: Video width (e.g., 1280 or 832).
steps
min: 1, max: 100
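The ranges above can be checked on the client before a request is sent. The sketch below pairs each range with a key from the example payload, in the order the attributes are listed. That pairing is an assumption, particularly for `steps`, whose description is missing. `validate_payload` is a hypothetical helper, not part of the API.

```python
# Ranges transcribed from the attribute list; the key-to-range pairing
# follows the order of the example payload and is an assumption.
NUMERIC_RANGES = {
    "frames": (10, 100),
    "scale_a": (1, 10),
    "scale_t": (1, 10),
    "steps": (1, 100),
}
ALLOWED_MODES = {"TA", "TIA"}


def validate_payload(payload):
    """Return a list of human-readable problems; an empty list means OK."""
    problems = []
    for name, (low, high) in NUMERIC_RANGES.items():
        if name in payload and not (low <= payload[name] <= high):
            problems.append(f"{name}={payload[name]} is outside {low}..{high}")
    if "mode" in payload and payload["mode"] not in ALLOWED_MODES:
        problems.append(
            f"mode={payload['mode']!r} is not one of {sorted(ALLOWED_MODES)}"
        )
    return problems
```

Keys that are absent from the payload are skipped rather than reported, since the source does not say which attributes are required.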
To keep track of your credit usage, inspect the response headers of each API call. The x-remaining-credits header indicates the number of credits remaining in your account. Monitor this value to avoid disruptions in your API usage.
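A minimal sketch of reading that header follows. It assumes the header value is a plain number, which the source does not specify. `remaining_credits` is a hypothetical helper.

```python
def remaining_credits(headers):
    """Read x-remaining-credits from a mapping of response headers.

    Returns a float, or None when the header is absent or not numeric.
    """
    # Normalise key case so a plain dict works as well as requests' headers.
    lowered = {str(key).lower(): value for key, value in headers.items()}
    raw = lowered.get("x-remaining-credits")
    if raw is None:
        return None
    try:
        return float(raw)
    except (TypeError, ValueError):
        return None
```

With the `requests` library this can be called as `remaining_credits(response.headers)` after each request, for example to log a warning when the value drops below a threshold of your choosing.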
Resources to get you started
Everything you need to know to get the most out of Bytedance HuMo: Human-Centric Video Generation
HuMo: Multimodal Human Video Generation Model
HuMo is a state-of-the-art video synthesis engine that transforms text, images, and audio into high-quality, human-centric videos. Whether you're producing character animations, music-driven performances, or instructional clips, this guide will help you unlock HuMo's full potential.
Getting Started
- Choose your generation mode:
  - Text-to-image: Define appearance and scene.
  - Text-to-audio: Drive motion with sound.
  - Full multimodal: Combine text, reference images, and audio.
- Install dependencies and load the HuMo model via your preferred SDK.
- Prepare input assets: clear text prompts, high-quality reference images, and well-edited audio tracks.
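For the hosted API, the mode choice reduces to the two `mode` values documented in the attribute list: TA (text+audio) and TIA (text+image+audio). A small sketch of picking between them from the assets you have prepared follows. `choose_mode` and its argument names are hypothetical, and the source does not show how the prompt, image, or audio are passed in the request.

```python
def choose_mode(prompt, audio, reference_image=None):
    """Pick the API mode value from the available inputs.

    Both documented modes include text and audio, so those are required;
    a reference image switches the mode from TA to TIA.
    """
    if not prompt or not audio:
        raise ValueError("both TA and TIA need a text prompt and an audio input")
    return "TIA" if reference_image else "TA"
```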
Core Parameters
- video_resolution (string, required)
  - 1080p for final delivery (slightly longer render times)
  - 720p for rapid iteration
- frame_rate (integer, required)
  - 30 fps for standard motion
  - 60 fps for fluid, dynamic scenes
- duration (integer, optional)
  - 10-60 seconds (short clips vs. extended sequences)
- background_music (string, optional)
  - "upbeat.mp3" for energetic scenes
  - "nature.wav" for calm, ambient moods
- human_pose (string, required)
  - "walking", "standing", "running" or custom labels
- scene (string, optional)
  - "city", "beach", "studio" etc.
- emotion (string, optional)
  - "happy", "neutral", "sad" to influence facial/body language
Parameter Recommendations by Use Case
- Digital Content Creation
  - Resolution: 1080p
  - Frame Rate: 30
  - Duration: 15-30s
  - Pose: "standing" or "walking"
  - Scene: specify context (e.g., "modern loft")
  - Emotion: "neutral" for product showcases
- Music-Synced Performances
  - Resolution: 720p (faster presets)
  - Frame Rate: 60
  - Duration: match track length (30-60s)
  - Background Music: "upbeat.mp3"
  - Pose: "dancing" or "running"
  - Emotion: "happy"
- Educational Demonstrations
  - Resolution: 1080p
  - Frame Rate: 30
  - Duration: 20-60s
  - Pose: "standing", "pointing"
  - Scene: "classroom" or "laboratory"
  - Emotion: "neutral" or "focused"
- Marketing & Ads
  - Resolution: 1080p
  - Frame Rate: 30
  - Duration: 10-20s
  - Pose: "walking" toward camera
  - Scene: branded environment
  - Emotion: "happy"
- Virtual Pre-viz
  - Resolution: 720p
  - Frame Rate: 30
  - Duration: 10-30s
  - Pose: "running" or "fighting"
  - Scene: "forest" or "urban street"
  - Emotion: "intense"
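The recommendations above can be kept as a lookup table so that a project starts from a consistent baseline. The sketch below only restates the values from this guide. The dictionary keys and the `preset_for` helper are hypothetical names, not fields of the HTTP API shown earlier, which uses `frames`, `height`, `width`, and similar.

```python
# Values copied from the use-case recommendations; key names are illustrative.
USE_CASE_PRESETS = {
    "digital_content": {
        "resolution": "1080p", "frame_rate": 30, "duration_s": (15, 30),
        "poses": ("standing", "walking"), "emotions": ("neutral",),
    },
    "music_sync": {
        "resolution": "720p", "frame_rate": 60, "duration_s": (30, 60),
        "poses": ("dancing", "running"), "emotions": ("happy",),
        "background_music": "upbeat.mp3",
    },
    "education": {
        "resolution": "1080p", "frame_rate": 30, "duration_s": (20, 60),
        "poses": ("standing", "pointing"), "emotions": ("neutral", "focused"),
    },
    "marketing": {
        "resolution": "1080p", "frame_rate": 30, "duration_s": (10, 20),
        "poses": ("walking",), "emotions": ("happy",),
    },
    "previz": {
        "resolution": "720p", "frame_rate": 30, "duration_s": (10, 30),
        "poses": ("running", "fighting"), "emotions": ("intense",),
    },
}


def preset_for(use_case):
    """Return a copy of the recommended settings for a use case."""
    if use_case not in USE_CASE_PRESETS:
        raise KeyError(
            f"unknown use case {use_case!r}; expected one of {sorted(USE_CASE_PRESETS)}"
        )
    # Copy so callers can tweak values without changing the shared table.
    return dict(USE_CASE_PRESETS[use_case])
```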
Prompting Tips
- Start with a vivid scene descriptor: "sunlit seaside boardwalk."
- Embed human_pose and emotion early: "(human_pose: dancing, emotion: joyful)."
- Reference image quality matters: crisp, well-lit, front-facing shots.
- Sync audio beats: align key frames with musical peaks.
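The first two tips and the beat-sync tip can be sketched as two small helpers. `build_prompt` places the scene descriptor first and the pose/emotion tag directly after it, using the tag format quoted in the tip. `beat_frames` converts beat timestamps to the nearest frame index. Both function names are hypothetical, and how the model interprets the tag is not documented in the source.

```python
def build_prompt(scene, human_pose, emotion, details=""):
    """Scene descriptor first, then the pose/emotion tag, then optional details."""
    parts = [scene.strip(), f"(human_pose: {human_pose}, emotion: {emotion})"]
    if details.strip():
        parts.append(details.strip())
    return " ".join(parts)


def beat_frames(beat_times_s, fps):
    """Map beat timestamps in seconds to the nearest frame index."""
    return [round(t * fps) for t in beat_times_s]


print(build_prompt("sunlit seaside boardwalk", "dancing", "joyful"))
# sunlit seaside boardwalk (human_pose: dancing, emotion: joyful)
print(beat_frames([0.5, 1.0, 2.0], fps=30))
# [15, 30, 60]
```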
By following these guidelines and fine-tuning HuMo's parameters, you can generate polished, expressive human videos suited to any project. Enjoy creating!